new script: media_url_finder.py

thiswillbeyourgithub · thiswillbeyourgithub · commit beaf8fa98cee · 2025-04-27T20:03:14.000+02:00
Signed-off-by: thiswillbeyourgithub &lt;26625900+thiswillbeyourgithub@users.noreply.github.com&gt;
diff --git a/README.md b/README.md
@@ -316,6 +316,7 @@ Refer to [examples.md](https://github.com/thiswillbeyourgithub/wdoc/blob/main/wd
 * [TheFiche](scripts/TheFiche): create summaries for specific notions directly as a [logseq](https://github.com/logseq/logseq) page.
 * [FilteredDeckCreator](scripts/FilteredDeckCreator): directly create an [anki](https://ankitects.github.io/) filtered deck from the cards found by `wdoc`.
 * [Official Open-WebUI Tool](https://openwebui.com/t/qqqqqqqqqqqqqqqqqqqq/wdoctool), hosted [here](https://github.com/thiswillbeyourgithub/openwebui_custom_pipes_filters/blob/main/tools/wdoc_tools.py).
+* [MediaURLFinder](scripts/MediaURLFinder) simply leverages the `find_online_media` loader helper to use `playwright` and `yt-dlp` to find all the URLs of medias (videos, audio etc). This is especially useful if `yt-dlp` alone is not able to find the URL of a ressource.
 
 ## FAQ
 
diff --git a/scripts/MediaURLFinder/README.md b/scripts/MediaURLFinder/README.md
@@ -0,0 +1,35 @@
+# Media URL Finder
+
+This script helps find direct URLs for video and audio media embedded within a webpage.
+
+## Purpose
+
+Sometimes, tools like `yt-dlp` might not directly find the media URLs on a complex webpage. This script uses the `find_online_media` function from [wdoc](https://github.com/thiswillbeyourgithub/wdoc) as a fallback. `find_online_media` uses `playwright` to load the page in a headless browser and intercepts network requests to identify potential media URLs based on regex patterns.
+
+After finding potential URLs, the script uses `yt-dlp` to fetch metadata for each found link and returns the results as a JSON object.
+
+## Usage
+
+```bash
+python scripts/MediaURLFinder/media_url_finder.py --url="<URL_OF_THE_WEBPAGE>" [OPTIONS]
+```
+
+Replace `<URL_OF_THE_WEBPAGE>` with the actual URL you want to scan.
+
+**Optional arguments:**
+
+You can pass additional arguments accepted by `wdoc.utils.loaders.find_online_media`, such as:
+
+*   `--online_media_url_regex`: Custom regex to match media URLs.
+*   `--online_media_resourcetype_regex`: Custom regex to match resource types (e.g., 'media', 'video').
+*   `--headless=False`: Run the browser in non-headless mode (useful for debugging).
+
+**Example:**
+
+```bash
+python scripts/MediaURLFinder/media_url_finder.py --url="https://example.com/page_with_embedded_video"
+```
+
+This will output a JSON string containing the media URLs found and their metadata fetched by `yt-dlp`. If no media links are found, it will print a message and exit.
+
+*(This README was generated with the help of [aider.chat](https://github.com/Aider-AI/aider/issues))*
diff --git a/scripts/MediaURLFinder/media_url_finder.py b/scripts/MediaURLFinder/media_url_finder.py
@@ -0,0 +1,42 @@
+"""
+This script utilizes the `find_online_media` function from `wdoc.utils.loaders`
+to automatically discover URLs for video and audio content embedded within a
+given webpage URL. It serves as a fallback mechanism when `yt-dlp` is unable
+to directly identify the media links on the page.
+"""
+
+import json
+import sys
+import fire
+from wdoc.utils.loaders import find_online_media
+import yt_dlp as youtube_dl
+
+ydl_opts = {"dump_single_json": True, "simulate": True}
+
+
+def main(url: str, **kwargs) -> str:
+    out = find_online_media(url=url, **kwargs)
+    if not any(v for v in out.values()):
+        print("No media links found")
+        sys.exit(1)
+
+    d = {k: [] for k in out.keys()}
+
+    for k, v in out.items():
+        d[k] = {}
+        for link in v:
+            with youtube_dl.YoutubeDL(ydl_opts) as ydl:
+                j = ydl.download([url])
+                d[k][url] = j
+
+    # Remove unused
+    # keys = d.keys()
+    # for k in keys:
+    #     if not d[k]:
+    #         del d[k]
+
+    return json.dumps(d)
+
+
+if __name__ == "__main__":
+    fire.Fire(main)