Commit beaf8fa

new script: media_url_finder.py
Signed-off-by: thiswillbeyourgithub <[email protected]>
1 parent c5828d3 commit beaf8fa

3 files changed: +78 -0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -316,6 +316,7 @@ Refer to [examples.md](https://github.com/thiswillbeyourgithub/wdoc/blob/main/wd
 * [TheFiche](scripts/TheFiche): create summaries for specific notions directly as a [logseq](https://github.com/logseq/logseq) page.
 * [FilteredDeckCreator](scripts/FilteredDeckCreator): directly create an [anki](https://ankitects.github.io/) filtered deck from the cards found by `wdoc`.
 * [Official Open-WebUI Tool](https://openwebui.com/t/qqqqqqqqqqqqqqqqqqqq/wdoctool), hosted [here](https://github.com/thiswillbeyourgithub/openwebui_custom_pipes_filters/blob/main/tools/wdoc_tools.py).
+* [MediaURLFinder](scripts/MediaURLFinder): uses the `find_online_media` loader helper together with `playwright` and `yt-dlp` to find the URLs of media (video, audio, etc.) on a page. This is especially useful when `yt-dlp` alone cannot find the URL of a resource.

 ## FAQ

scripts/MediaURLFinder/README.md

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# Media URL Finder

This script helps find direct URLs for video and audio media embedded within a webpage.

## Purpose

Sometimes, tools like `yt-dlp` might not directly find the media URLs on a complex webpage. This script uses the `find_online_media` function from [wdoc](https://github.com/thiswillbeyourgithub/wdoc) as a fallback. `find_online_media` uses `playwright` to load the page in a headless browser and intercepts network requests to identify potential media URLs based on regex patterns.
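
The regex-based filtering step described above can be illustrated with a minimal sketch. It does not launch a browser: it only shows how a list of intercepted request URLs might be narrowed down to media candidates. The pattern `MEDIA_URL_PATTERN`, the helper `filter_media_urls`, and the sample URLs are hypothetical and are not taken from wdoc, whose actual defaults may differ.

```python
import re
from typing import Iterable, List

# Hypothetical pattern for illustration only; wdoc's real default regex may differ.
MEDIA_URL_PATTERN = re.compile(r"\.(mp4|m3u8|mp3|webm|ogg)(\?|$)", re.IGNORECASE)


def filter_media_urls(
    request_urls: Iterable[str],
    pattern: re.Pattern = MEDIA_URL_PATTERN,
) -> List[str]:
    """Keep the request URLs that match the pattern, in order, without duplicates."""
    seen = set()
    matches = []
    for request_url in request_urls:
        if pattern.search(request_url) and request_url not in seen:
            seen.add(request_url)
            matches.append(request_url)
    return matches


# URLs a headless browser might request while loading a page (made-up examples).
captured = [
    "https://example.com/index.html",
    "https://cdn.example.com/stream/master.m3u8?token=abc",
    "https://cdn.example.com/audio/episode.MP3",
    "https://example.com/style.css",
    "https://cdn.example.com/stream/master.m3u8?token=abc",
]
print(filter_media_urls(captured))
```

A stricter or looser pattern can be tried the same way before passing it to the script as a custom regex.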

After finding potential URLs, the script uses `yt-dlp` to fetch metadata for each link it found and returns the results as a JSON string.
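
As a rough sketch of the metadata step, the snippet below maps each link to the metadata returned by `yt-dlp` and serializes the result. It assumes `yt-dlp`'s Python API (`YoutubeDL.extract_info` with `download=False`, then `sanitize_info` to make the result JSON-serializable). The helper names `ytdlp_extract`, `collect_metadata`, and `fake_extract` are illustrative and are not part of the script.

```python
import json
from typing import Any, Callable, Dict, Iterable


def ytdlp_extract(link: str) -> Dict[str, Any]:
    """Fetch metadata for one link with yt-dlp, without downloading the media."""
    import yt_dlp  # imported lazily so the rest of the sketch runs without it

    with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
        info = ydl.extract_info(link, download=False)
        # sanitize_info strips values that cannot be serialized to JSON.
        return ydl.sanitize_info(info)


def collect_metadata(
    links: Iterable[str],
    extract: Callable[[str], Dict[str, Any]] = ytdlp_extract,
) -> str:
    """Map each link to its metadata and serialize the mapping as JSON."""
    return json.dumps({link: extract(link) for link in links})


def fake_extract(link: str) -> Dict[str, Any]:
    """Stand-in extractor (made-up data) so the sketch runs offline."""
    return {"title": "demo", "webpage_url": link}


print(collect_metadata(["https://cdn.example.com/video.mp4"], extract=fake_extract))
```

Passing the extractor as a parameter is only a convenience for trying the flow without network access; leaving it at its default calls `yt-dlp` for real.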

## Usage

```bash
python scripts/MediaURLFinder/media_url_finder.py --url="<URL_OF_THE_WEBPAGE>" [OPTIONS]
```

Replace `<URL_OF_THE_WEBPAGE>` with the actual URL you want to scan.

**Optional arguments:**

You can pass additional arguments accepted by `wdoc.utils.loaders.find_online_media`, such as:

* `--online_media_url_regex`: Custom regex to match media URLs.
* `--online_media_resourcetype_regex`: Custom regex to match resource types (e.g., 'media', 'video').
* `--headless=False`: Run the browser in non-headless mode (useful for debugging).

**Example:**

```bash
python scripts/MediaURLFinder/media_url_finder.py --url="https://example.com/page_with_embedded_video"
```

This outputs a JSON string containing the media URLs found and their metadata fetched by `yt-dlp`. If no media links are found, it prints `No media links found` and exits with status 1.
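
To show how the output might be consumed, here is a small sketch that parses such a JSON string and lists one row per media entry. The layout it assumes (group name, then media URL, then a metadata dictionary) is inferred from the script's nested dictionary. The group names, the `title` field, and the sample data are made up for illustration.

```python
import json
from typing import List, Tuple

# Made-up output for illustration; group names and metadata fields are assumptions.
raw_output = json.dumps(
    {
        "video": {"https://cdn.example.com/clip.mp4": {"title": "Clip", "duration": 12}},
        "audio": {},
    }
)


def list_media(raw: str) -> List[Tuple[str, str, str]]:
    """Return (group, url, title) for every media entry in the JSON string."""
    rows = []
    for group, entries in json.loads(raw).items():
        for media_url, metadata in entries.items():
            # Fall back to a placeholder when no usable metadata is present.
            if isinstance(metadata, dict):
                title = metadata.get("title", "<unknown>")
            else:
                title = "<unknown>"
            rows.append((group, media_url, title))
    return rows


for row in list_media(raw_output):
    print(row)
```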

*(This README was generated with the help of [aider.chat](https://github.com/Aider-AI/aider/issues))*

scripts/MediaURLFinder/media_url_finder.py

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
"""
This script utilizes the `find_online_media` function from `wdoc.utils.loaders`
to automatically discover URLs for video and audio content embedded within a
given webpage URL. It serves as a fallback mechanism when `yt-dlp` is unable
to directly identify the media links on the page.
"""

import json
import sys
import fire
from wdoc.utils.loaders import find_online_media
import yt_dlp as youtube_dl

ydl_opts = {"dump_single_json": True, "simulate": True}


def main(url: str, **kwargs) -> str:
    out = find_online_media(url=url, **kwargs)
    if not any(v for v in out.values()):
        print("No media links found")
        sys.exit(1)

    d = {k: {} for k in out.keys()}

    for k, v in out.items():
        # Fetch metadata for each discovered link, not for the page URL itself.
        for link in v:
            with youtube_dl.YoutubeDL(ydl_opts) as ydl:
                info = ydl.extract_info(link, download=False)
                d[k][link] = ydl.sanitize_info(info)

    # Remove unused
    # keys = d.keys()
    # for k in keys:
    #     if not d[k]:
    #         del d[k]

    return json.dumps(d)


if __name__ == "__main__":
    fire.Fire(main)
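
The commented-out "Remove unused" block would raise `RuntimeError` if enabled as written, because it deletes keys while iterating over the live `d.keys()` view. A minimal alternative, assuming the goal is simply to drop groups with no entries, builds a new dictionary instead. The helper name `drop_empty_groups` and the sample data are hypothetical.

```python
from typing import Any, Dict


def drop_empty_groups(results: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict without the groups that have no entries."""
    return {group: entries for group, entries in results.items() if entries}


# Made-up data for illustration.
results = {
    "video": {"https://cdn.example.com/clip.mp4": {"title": "Clip"}},
    "audio": {},
}
print(drop_empty_groups(results))
```

Building a new dictionary leaves the input untouched, which avoids mutating a mapping during iteration.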
