Add CI for checking for broken links manually, weekly and in PRs #1633
markcmiller86 wants to merge 83 commits into main from mcm86-check-urls-weekly
Conversation
@markcmiller86, please @mention me and let me know when this is ready to review.
Apologies... I converted this to DRAFT for the time being.
Have adjusted it to ignore...
OK, this addresses all the cases in #1632. I am pulling it out of DRAFT mode. It's ready for review.
@rinkug, @bartlettroscoe and @bernhold this is now ready for review. The weekly CI check should produce a small list of either bona fide broken links or false positives (i.e., it says a link is broken when it's not). For each entry, we can check it and then either edit files to fix it, add it to the ignore patterns, or just ignore it.
I'm going to unsubscribe from notifications here - good luck @markcmiller86!
I think we should change this to a manual trigger and merge this. |
bartlettroscoe left a comment:
Just a few comments ...
Wow, this GitHub job artifacts feature looks very useful.
Seems like you should upload and download a JSON file with the list of bad links over, say, the last 2 weeks, and update the counts for the days/times when the check failed. (A Python script that creates and updates that JSON file should not be too bad.) Then, if a URL link check has failed on every run over the last 2 weeks (or whatever time length), report that as a GHA failure for this workflow and print out the bad URL links and their failure history.
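A minimal sketch of that JSON failure-history idea (the file name `link_failure_history.json` and the two-week window are assumptions, not anything in this PR):

```python
import json
import os
import time

HISTORY_FILE = "link_failure_history.json"  # hypothetical artifact name
WINDOW_SECONDS = 14 * 24 * 3600             # report links bad for ~2 weeks

def update_history(bad_links, now=None, history_file=HISTORY_FILE):
    """Record this run's bad links; return links bad for the whole window."""
    now = time.time() if now is None else now
    history = {}
    if os.path.exists(history_file):
        with open(history_file) as f:
            history = json.load(f)
    # Record a failure timestamp for every link seen bad this run.
    for url in bad_links:
        history.setdefault(url, {"first_seen": now, "failures": []})
        history[url]["failures"].append(now)
    # Drop links that did not fail this run (they recovered).
    history = {u: h for u, h in history.items() if u in bad_links}
    with open(history_file, "w") as f:
        json.dump(history, f, indent=2)
    # "Persistent" = first failure was at least a full window ago.
    return [u for u, h in history.items()
            if now - h["first_seen"] >= WINDOW_SECONDS]
```

The workflow would download the previous run's JSON artifact, call this, re-upload the updated file, and fail the job only if the returned list is non-empty.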
```shell
for f in ${{ steps.file_list.outputs.files }}; do
  if [ "${f##*.}" != "md" ]; then
    continue
  fi
  for ef in ${{ steps.file_list.outputs.ignore_file_patterns }}; do
    if [ "$ef" = "$f" ]; then
      continue 2 # ignore this file
    fi
    fd=$(echo $f | cut -d'/' -f1)
    if [ "$ef" = "$fd" ]; then
      continue 2 # ignore this dir
    fi
  done
  linkchecker -f utils/LinkChecker/.linkcheckerrc file://$(pwd)/$f >> linkchecker.out || true
  cat linkchecker.out >> linkchecker-all.out
done
```
Seems like it might be good to break this out into a Python script taking arguments so it can be developed and tested locally. Also, the Python implementation for some of these operations is a bit cleaner than bash commands.
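A Python equivalent of the bash loop above might look something like this (a sketch only; the function names are hypothetical, and the ignore-pattern semantics mirror the bash version):

```python
import subprocess
from pathlib import Path

def should_check(path, ignore_patterns):
    """Mirror the bash logic: skip non-*.md files, ignored files, ignored top-level dirs."""
    if not path.endswith(".md"):
        return False
    top_dir = path.split("/", 1)[0]
    return path not in ignore_patterns and top_dir not in ignore_patterns

def check_links(files, ignore_patterns, rcfile="utils/LinkChecker/.linkcheckerrc"):
    """Run linkchecker on each selected file, collecting its output (sketch)."""
    output = []
    for f in files:
        if not should_check(f, ignore_patterns):
            continue
        result = subprocess.run(
            ["linkchecker", "-f", rcfile, f"file://{Path(f).resolve()}"],
            capture_output=True, text=True)
        output.append(result.stdout)  # ignore the exit code, as the bash loop does
    return "".join(output)
```

Taking the file list and ignore patterns as command-line arguments would let this be developed and tested locally, per the comment above.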
```yaml
if: ${{ github.event_name == 'pull_request' }}
uses: actions/upload-artifact@v4
with:
  name: bad-links
  path: bad_links.txt
```
Note that I think you might only want to upload the list of bad links for files on the `main` branch, not a topic branch. Also, don't you want the full list of `*.md` files that were processed when you generate the list of bad links?
Where is the `download-artifact` call to get back a copy of this file? I guess that still needs to be added.
```yaml
# Keep the recurring failures and definitely bad lists in repo on
# branch manage-broken-links
```
This comment is not correct, is it? GitHub Actions' job artifacts system is being used instead, right?
A related suggestion: along with the reporting, this action could generate a script that would automatically replace each of the bad URLs with text like "(this link is no longer available)". The idea would not be to apply this blindly; rather, each bad link would be investigated, and where appropriate alternatives can be found (e.g., relocated or equivalent content), those entries would be changed manually and the corresponding lines of the script commented out. The generated script thus becomes the final cleanup for the remaining bad links. The links most likely to go bad are registration and survey links that logically expire or close; we're likely to get a handful of those each month. And of course we will not be able to find alternatives for everything else that rots away.
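One way to sketch that generator (hypothetical; the one-URL-per-line input format, the `sed`-based replacement, and the function name are all assumptions):

```python
def generate_fixup_script(bad_urls,
                          placeholder="(this link is no longer available)"):
    """Emit a shell script with one sed one-liner per bad URL (a sketch).

    Per the suggestion above: investigate each URL first; where a
    replacement is found, fix the file by hand and comment out that
    line; then run the script to clean up the remaining dead links.
    """
    lines = ["#!/bin/sh"]
    for url in sorted(bad_urls):
        # Escape characters that are special in a sed s/// expression.
        escaped = url.replace("/", r"\/").replace("&", r"\&")
        lines.append(
            f"grep -rl '{url}' --include='*.md' . | "
            f"xargs sed -i 's/{escaped}/{placeholder}/g'")
    return "\n".join(lines) + "\n"
```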
@bernhold, @bartlettroscoe and @markcmiller86: Status of this PR?
Resolves: #1431
Links will get checked automatically weekly at 5:17 AM on Sundays. They can also be checked manually by just running the workflow.
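The trigger configuration for such a workflow would look roughly like this (a sketch; `17 5 * * 0` means 5:17 AM UTC on Sundays, and `workflow_dispatch` provides the manual trigger):

```yaml
on:
  schedule:
    - cron: '17 5 * * 0'  # 5:17 AM every Sunday
  workflow_dispatch:      # allow manual runs from the Actions tab
  pull_request:           # also run on PRs
```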
We need to adjust the filters used in the checker, as there are a lot of broken links found (see #1632).
- [ ] View the modified `*.md` files as rendered in GitHub.
- [ ] If changes are to the GitHub pages site under the `docs/` directory, consider viewing locally with Jekyll.