
Add CI for checking for broken links manually, weekly and in PRs#1633

Open
markcmiller86 wants to merge 83 commits into main from mcm86-check-urls-weekly

Conversation

@markcmiller86
Member

@markcmiller86 markcmiller86 commented Apr 18, 2023

Resolves: #1431

Links will be checked automatically every week, at 5:17 AM on Sundays. They can also be checked on demand by running the workflow manually.

We still need to adjust the filters used in the checker, and the checker has already found a lot of broken links (see #1632).

  • Set list of Reviewers (at least one).
  • Add to Project BSSw Internal.
  • [ ] View the modified *.md files as rendered in GitHub.
  • [ ] If changes are to the GitHub pages site under the docs/ directory, consider viewing locally with Jekyll.
  • Watch for PR check failures.
  • Make any final changes to the PR based on feedback, and review the GitHub (and Jekyll) rendered files.
  • Ensure at least one reviewer signs off on the changes.
  • Once a reviewer has approved and the PR checks pass, merge the PR.

@markcmiller86 markcmiller86 changed the title Create check-published-links-weekly.yml Add CI for checking for broken links weekly (and manually) Apr 18, 2023
@bartlettroscoe
Member

@markcmiller86, please @mention me and let me know when this is ready to review

@markcmiller86 markcmiller86 marked this pull request as draft April 18, 2023 21:50
@markcmiller86
Member Author

Apologies... I have converted this to a draft for the time being.

@markcmiller86
Member Author

Have adjusted it to ignore...

  • docs -- those links are often not really public. EB should fix them, but we can do that as they are encountered.
  • Events -- these are short lived, and their links are likely to go stale over time.

@markcmiller86
Member Author

Ok, this addresses all the cases in #1632.

I am taking it out of draft mode. It's ready for review.

@markcmiller86 markcmiller86 marked this pull request as ready for review April 19, 2023 00:00
@markcmiller86
Member Author

@rinkug, @bartlettroscoe and @bernhold this is now ready for review.

The weekly CI check should produce a small list of either bona fide broken links or false positives (i.e., links reported as broken when they actually work). We can then check each one and either edit files to fix it, add it to the ignore patterns, or just ignore it.

@vsoch
Contributor

vsoch commented Jun 29, 2024

I'm going to unsubscribe from notifications here - good luck @markcmiller86 !

@bartlettroscoe
Member

I think we should change this to a manual trigger and merge this.

Member

@bartlettroscoe bartlettroscoe left a comment


Just a few comments ...

Wow, this GitHub job artifacts feature looks very useful.

Seems like you should upload and download a JSON file with the list of bad links over, say, the last 2 weeks, and update the counts for the days/times when the check failed. (A Python script that creates and updates that JSON file should not be too hard.) Then, if a URL has failed every check over the last 2 weeks (or whatever time window), report that as a GHA failure for this workflow and print out the bad URLs and their failure history.
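
A minimal sketch of the kind of script described above. The file names (bad_links.txt with one failing URL per line, link_history.json carried between runs via download/upload-artifact), the JSON layout, and the 14-day window are all illustrative assumptions, not the PR's actual code:

#!/usr/bin/env python3
# Sketch: track link-check failures across workflow runs.
import json
import sys
from datetime import datetime, timedelta, timezone

HISTORY_FILE = "link_history.json"  # {url: [ISO timestamps of failing runs]}
WINDOW = timedelta(days=14)         # report URLs failing this long without a pass

now = datetime.now(timezone.utc)
try:
    with open(HISTORY_FILE) as f:
        history = json.load(f)
except FileNotFoundError:
    history = {}  # first run: no artifact from a previous run yet

with open("bad_links.txt") as f:
    failing_now = {line.strip() for line in f if line.strip()}

# Record this run's failures and drop any URL that passed this time, so a
# surviving entry means "failed every check since its first timestamp".
for url in failing_now:
    history.setdefault(url, []).append(now.isoformat())
history = {url: ts for url, ts in history.items() if url in failing_now}

with open(HISTORY_FILE, "w") as f:
    json.dump(history, f, indent=2)

# Fail the job only for URLs that have been failing for the whole window.
persistent = {url: ts for url, ts in history.items()
              if now - datetime.fromisoformat(ts[0]) >= WINDOW}
for url, ts in sorted(persistent.items()):
    print(f"broken for {len(ts)} consecutive checks since {ts[0]}: {url}")
if persistent:
    sys.exit(1)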

Comment on lines +77 to +92
for f in ${{ steps.file_list.outputs.files }}; do
  # Only check markdown files.
  if [ "${f##*.}" != "md" ]; then
    continue
  fi
  # Skip files matching an ignore pattern, either exactly or by
  # top-level directory.
  for ef in ${{ steps.file_list.outputs.ignore_file_patterns }}; do
    if [ "$ef" = "$f" ]; then
      continue 2 # ignore this file
    fi
    fd=$(echo $f | cut -d'/' -f1)
    if [ "$ef" = "$fd" ]; then
      continue 2 # ignore this dir
    fi
  done
  # Overwrite (not append) the per-file output so earlier results are not
  # duplicated when accumulated into linkchecker-all.out below.
  linkchecker -f utils/LinkChecker/.linkcheckerrc file://$(pwd)/$f > linkchecker.out || true
  cat linkchecker.out >> linkchecker-all.out
done
Member


Seems like it might be good to break this out into a Python script that takes arguments, so it can be developed and tested locally. Also, the Python implementation of some of these operations would be a bit cleaner than the bash commands.
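
A rough sketch of what such a script might look like, mirroring the bash loop quoted above. The command-line interface, script name, and the subprocess invocation of linkchecker are assumptions:

#!/usr/bin/env python3
# Sketch of the bash loop above as a standalone, locally testable script, e.g.
#   python check_md_links.py --ignore docs Events -- $(git ls-files '*.md')
import argparse
import pathlib
import subprocess

def is_ignored(path, patterns):
    # Match either the whole path or its top-level directory, as the
    # bash loop above does.
    top_dir = path.split("/", 1)[0]
    return any(p == path or p == top_dir for p in patterns)

def main():
    parser = argparse.ArgumentParser(description="Run linkchecker on *.md files")
    parser.add_argument("--ignore", nargs="*", default=[])
    parser.add_argument("--config", default="utils/LinkChecker/.linkcheckerrc")
    parser.add_argument("files", nargs="+")
    args = parser.parse_args()

    with open("linkchecker-all.out", "w") as out:
        for f in args.files:
            if not f.endswith(".md") or is_ignored(f, args.ignore):
                continue
            url = pathlib.Path(f).resolve().as_uri()
            # A nonzero exit just means bad links were found; keep going.
            result = subprocess.run(["linkchecker", "-f", args.config, url],
                                    capture_output=True, text=True)
            out.write(result.stdout)

if __name__ == "__main__":
    main()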

Comment on lines +99 to +103
if: ${{ github.event_name == 'pull_request' }}
uses: actions/upload-artifact@v4
with:
  name: bad-links
  path: bad_links.txt
Member


Note that I think you might only want to upload the list of bad links for files on the 'main' branch, not a topic branch. Also, don't you want the full list of *.md files being processed when you generate the list of bad links?

Member


Where is the download-artifact call to get back a copy of this file? I guess that still needs to be added.

Comment on lines +112 to +113
# Keep the recurring failures and definitely bad lists in repo on
# branch manage-broken-links
Member


This comment is not correct, is it? GitHub Actions' job artifacts system is being used instead, right?

@bernhold
Member

A related suggestion: Along with the reporting, this action could generate a script that would automatically replace each of the bad URLs with text like "(this link is no longer available)".

The idea would not be to apply this blindly. Rather, each bad link would be investigated first; where an appropriate alternative can be found (e.g., relocated or equivalent content), the entry would be changed manually and the corresponding line of the script commented out. The generated script thus becomes the final cleanup for the remaining bad links.
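
A minimal sketch of such a generator, assuming an input format of one "<markdown file><TAB><bad URL>" per line (the actual report format from the workflow may differ). It writes a bash script with one sed line per bad link, ready to be pruned by hand:

#!/usr/bin/env python3
# Sketch: turn a bad-links report (file<TAB>url per line on stdin) into an
# editable cleanup script on stdout, e.g.
#   python make_cleanup.py < bad_links.tsv > cleanup.sh
import shlex
import sys

REPLACEMENT = "(this link is no longer available)"

print("#!/bin/bash")
print("# Comment out any line whose link you have already fixed manually.")
for line in sys.stdin:
    if not line.strip():
        continue
    md_file, url = line.rstrip("\n").split("\t", 1)
    expr = f"s|{url}|{REPLACEMENT}|g"  # '|' delimiter since URLs contain '/'
    print(f"sed -i {shlex.quote(expr)} {shlex.quote(md_file)}")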

The most likely links to go bad are registration and survey links that logically expire or close. We're likely to get a handful of those each month. And of course we will not be able to find alternatives for everything else that rots away.

@rinkug
Member

rinkug commented Jun 26, 2025

@bernhold , @bartlettroscoe and @markcmiller86 : Status of this PR?

