
Commit ca31ef4

committed
fixed pypi, enabled cli usage w/ 0.0.4, changed package name, using poetry
1 parent 4ab7c84 commit ca31ef4

12 files changed: 93 additions & 93 deletions

README.md

Lines changed: 15 additions & 15 deletions
@@ -4,12 +4,12 @@
 | | | ( | ( | ( | \ \ \ / | __/ |
 _| _| _| \__._| \___| _| \__._| \_/\_/ _| \___| _|

-----------------------------
-md_crawler.py by @paulpierre
-----------------------------
+---------------------------------
+markdown_crawler - by @paulpierre
+---------------------------------
 A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file for each page
 https://github.com/paulpierre
-https://x.com/paulpierre
+https://x.com/paulpierre
 ```
 <br><br>

@@ -39,23 +39,23 @@ If you wish to simply use it in the CLI, you can run the following command:

 Install the package
 ```
-pip install md_crawler
+pip install markdown-crawler
 ```

 Execute the CLI
 ```
-md_crawler -t 5 -d 3 -b ./markdown https://en.wikipedia.org/wiki/Morty_Smith
+markdown-crawler -t 5 -d 3 -b ./markdown https://en.wikipedia.org/wiki/Morty_Smith
 ```

 To run from the github repo, once you have it checked out:
 ```
-pip install -r requirements.txt
-python3 md_crawler.py -t 5 -d 3 -b ./markdown https://en.wikipedia.org/wiki/Morty_Smith
+pip install .
+markdown-crawler -t 5 -d 3 -b ./markdown https://en.wikipedia.org/wiki/Morty_Smith
 ```

 Or use the library in your own code:
 ```
-from md_crawler import md_crawl
+from markdown_crawler import md_crawl
 url = 'https://en.wikipedia.org/wiki/Morty_Smith'
 md_crawl(url, max_depth=3, num_threads=5, base_path='markdown')
 ```
@@ -73,7 +73,7 @@ md_crawl(url, max_depth=3, num_threads=5, base_path='markdown')

 The following arguments are supported
 ```
-usage: md_crawler [-h] [--max-depth MAX_DEPTH] [--num-threads NUM_THREADS] [--base-path BASE_PATH] [--debug DEBUG]
+usage: markdown-crawler [-h] [--max-depth MAX_DEPTH] [--num-threads NUM_THREADS] [--base-path BASE_PATH] [--debug DEBUG]
 [--target-content TARGET_CONTENT] [--target-links TARGET_LINKS] [--valid-paths VALID_PATHS]
 [--domain-match DOMAIN_MATCH] [--base-path-match BASE_PATH_MATCH]
 base-url
@@ -82,7 +82,7 @@ usage: md_crawler [-h] [--max-depth MAX_DEPTH] [--num-threads NUM_THREADS] [--ba
 <br><br>

 # 📝 Example
-Take a look at [example.py](https://github.com/paulpierre/md_crawler/blob/main/example.py) for an example
+Take a look at [example.py](https://github.com/paulpierre/markdown-crawler/blob/main/example.py) for an example
 implementation of the library. In this configuration we set:
 - `max_depth` to 3. We will crawl the base URL and 3 levels of children
 - `num_threads` to 5. We will use 5 parallel(ish) threads to crawl the website
@@ -95,13 +95,13 @@ implementation of the library. In this configuration we set:

 And when we run it we can view the progress
 <br>
-> ![cli](https://github.com/paulpierre/md_crawler/blob/main/img/ss_crawler.png?raw=true)
+> ![cli](https://github.com/paulpierre/markdown-crawler/blob/main/img/ss_crawler.png?raw=true)

 We can see the progress of our files in the `markdown` directory locally
-> ![md](https://github.com/paulpierre/md_crawler/blob/main/img/ss_dir.png?raw=true)
+> ![md](https://github.com/paulpierre/markdown-crawler/blob/main/img/ss_dir.png?raw=true)

 And we can see the contents of the HTML converted to markdown
-> ![md](https://github.com/paulpierre/md_crawler/blob/main/img/ss_markdown.png?raw=true)
+> ![md](https://github.com/paulpierre/markdown-crawler/blob/main/img/ss_markdown.png?raw=true)

 <br><br>
 # ❤️ Thanks
@@ -134,4 +134,4 @@ SOFTWARE.
 <br><br>

 ### html2text credits
-`md_crawler` makes use of html2text by the late and legendary [Aaron Swartz]([email protected]). The original source code can be found [here](http://www.aaronsw.com/2002/html2text). A modification was implemented to make it compatible with Python 3.x. It is licensed under GNU General Public License (GPL).
+`markdown_crawler` makes use of html2text by the late and legendary [Aaron Swartz]([email protected]). The original source code can be found [here](http://www.aaronsw.com/2002/html2text). A modification was implemented to make it compatible with Python 3.x. It is licensed under GNU General Public License (GPL).

dist/markdown-crawler-0.0.4.tar.gz

18.1 KB (binary file not shown)

17.4 KB (binary file not shown)

example.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-from md_crawler import md_crawl
+from markdown_crawler import md_crawl
 url = 'https://rickandmorty.fandom.com/wiki/Evil_Morty'
 print(f'🕸️ Starting crawl of {url}')
 md_crawl(
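The hunk ends at the opening of the md_crawl(...) call, so the arguments example.py actually passes are not shown in this diff. As a point of reference only, a minimal call in the same spirit, using the max_depth=3 / num_threads=5 settings the README's example section describes and the base_dir keyword that cli.py passes to md_crawl (the README one-liner spells this keyword base_path), might look like:

```python
# Hypothetical sketch, not the remainder of example.py as committed.
from markdown_crawler import md_crawl

url = 'https://rickandmorty.fandom.com/wiki/Evil_Morty'
md_crawl(
    url,
    max_depth=3,          # crawl the base URL plus 3 levels of child links
    num_threads=5,        # 5 parallel(ish) crawler threads
    base_dir='markdown'   # directory the .md files are written to
)
```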

md_crawler/__init__.py renamed to markdown_crawler/__init__.py

Lines changed: 3 additions & 44 deletions
@@ -1,9 +1,8 @@
 from bs4 import BeautifulSoup
 import urllib.parse
 import threading
-from md_crawler import html2text
+from markdown_crawler import html2text
 import requests
-import argparse
 import logging
 import queue
 import time
@@ -291,7 +290,7 @@ def md_crawl(
 logging.basicConfig(level=logging.DEBUG)
 logger.debug('🐞 Debugging enabled')

-logger.info(f'{BANNER}\n\n🕸️ Crawling {base_url} at ⏬ depth {max_depth} with 🧵 {num_threads} threads')
+logger.info(f'🕸️ Crawling {base_url} at ⏬ depth {max_depth} with 🧵 {num_threads} threads')

 # Validate the base URL
 if not is_valid_url(base_url):
@@ -336,44 +335,4 @@ def md_crawl(
     for t in threads:
         t.join()

-    logger.info('🏁 All threads have finished')
-
-
-def main():
-    arg_parser = argparse.ArgumentParser()
-    arg_parser.description = 'A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file for each page'
-    arg_parser.add_argument('--max-depth', '-d', required=False, default=3, type=int)
-    arg_parser.add_argument('--num-threads', '-t', required=False, default=5, type=int)
-    arg_parser.add_argument('--base-dir', '-b', required=False, default='markdown', type=str)
-    arg_parser.add_argument('--debug', '-e', required=False, type=bool, default=False)
-    arg_parser.add_argument('--target-content', '-c', required=False, type=str, default=None)
-    arg_parser.add_argument('--target-links', '-l', required=False, type=str, default=DEFAULT_TARGET_LINKS)
-    arg_parser.add_argument('--valid-paths', '-v', required=False, type=str, default=None)
-    arg_parser.add_argument('--domain-match', '-m', required=False, type=bool, default=True)
-    arg_parser.add_argument('--base-path-match', '-p', required=False, type=bool, default=True)
-    arg_parser.add_argument('base_url', type=str)
-
-    # ----------------
-    # Parse target arg
-    # ----------------
-    args = arg_parser.parse_args()
-
-    md_crawl(
-        args.base_url,
-        max_depth=args.max_depth,
-        num_threads=args.num_threads,
-        base_dir=args.base_path,
-        target_content=args.target_content.split(',') if args.target_content and ',' in args.target_content else None,
-        target_links=args.target_links.split(',') if args.target_links and ',' in args.target_links else [args.target_links],
-        valid_paths=args.valid_paths.split(',') if args.valid_paths and ',' in args.valid_paths else None,
-        is_domain_match=args.domain_match,
-        is_base_path_match=args.base_match,
-        is_debug=args.debug
-    )
-
-
-# --------------
-# CLI entrypoint
-# --------------
-if __name__ == '__main__':
-    main()
+    logger.info('🏁 All threads have finished')
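The bulk of this diff moves the CLI entry point out of __init__.py (it reappears below as markdown_crawler/cli.py, which is why the argparse import and the BANNER prefix on the startup log go away here); the remaining change is the renamed html2text import. That vendored module, credited to Aaron Swartz in the README, is what turns fetched HTML into Markdown. Assuming the Python 3 port keeps the original script's module-level html2text(html) function (an assumption; the module itself is not shown in this diff), it can be exercised on its own:

```python
# Assumption: the vendored copy keeps Aaron Swartz's classic html2text(html) helper.
from markdown_crawler import html2text

sample = '<h1>Evil Morty</h1><p>One <b>true</b> Morty.</p>'
print(html2text.html2text(sample))  # prints a Markdown rendering of the snippet
```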

markdown_crawler/cli.py

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+import argparse
+from markdown_crawler import (
+    md_crawl,
+    DEFAULT_TARGET_LINKS,
+    BANNER
+)
+
+def main():
+    print(BANNER)
+    arg_parser = argparse.ArgumentParser()
+    arg_parser.add_argument('--max-depth', '-d', required=False, default=3, type=int, help='Max depth of child links to crawl')
+    arg_parser.add_argument('--num-threads', '-t', required=False, default=5, type=int, help='Number of threads to use for crawling')
+    arg_parser.add_argument('--base-dir', '-b', required=False, default='markdown', type=str, help='Base directory to save markdown files in')
+    arg_parser.add_argument('--debug', '-e', required=False, type=bool, default=False, help='Enable debug mode')
+    arg_parser.add_argument('--target-content', '-c', required=False, type=str, default=None, help='CSS target path of the content to extract from each page')
+    arg_parser.add_argument('--target-links', '-l', required=False, type=str, default=DEFAULT_TARGET_LINKS, help='CSS target path containing the links to crawl')
+    arg_parser.add_argument('--valid-paths', '-v', required=False, type=str, default=None, help='Comma separated list of valid relative paths to crawl, (ex. /wiki,/categories,/help')
+    arg_parser.add_argument('--domain-match', '-m', required=False, type=bool, default=True, help='Crawl only links that match the base domain')
+    arg_parser.add_argument('--base-path-match', '-p', required=False, type=bool, default=True, help='Crawl only links that match the base path of the base_url specified in CLI')
+    arg_parser.add_argument('base_url', type=str, help='Base URL to crawl (ex. 🐍🎷 https://rickandmorty.fandom.com/wiki/Evil_Morty')
+    if len(arg_parser.parse_args().__dict__.keys()) == 0:
+        arg_parser.print_help()
+        return
+    # ----------------
+    # Parse target arg
+    # ----------------
+    args = arg_parser.parse_args()
+
+    md_crawl(
+        args.base_url,
+        max_depth=args.max_depth,
+        num_threads=args.num_threads,
+        base_dir=args.base_path,
+        target_content=args.target_content.split(',') if args.target_content and ',' in args.target_content else None,
+        target_links=args.target_links.split(',') if args.target_links and ',' in args.target_links else [args.target_links],
+        valid_paths=args.valid_paths.split(',') if args.valid_paths and ',' in args.valid_paths else None,
+        is_domain_match=args.domain_match,
+        is_base_path_match=args.base_match,
+        is_debug=args.debug
+    )
+
+
+# --------------
+# CLI entrypoint
+# --------------
+if __name__ == '__main__':
+    main()
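main() normalizes the string flags before handing them to md_crawl: a comma-separated value is split into a list, a --target-links value without a comma is wrapped in a one-element list, and a --valid-paths value without a comma falls back to None. A standalone illustration of those expressions (the variable names here are made up; only the conditionals mirror the commit):

```python
# Illustration only: the comma-splitting used in main() above.
target_links = 'main'               # e.g. --target-links main
valid_paths = '/wiki,/categories'   # e.g. --valid-paths /wiki,/categories

links = target_links.split(',') if target_links and ',' in target_links else [target_links]
paths = valid_paths.split(',') if valid_paths and ',' in valid_paths else None

print(links)  # ['main']
print(paths)  # ['/wiki', '/categories']
```

Two things stand out when reading the hunk: a single --valid-paths entry with no comma evaluates to None rather than a one-element list, and main() as committed reads args.base_path and args.base_match, whereas argparse stores those options under args.base_dir and args.base_path_match.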
File renamed without changes.
-7.76 KB (binary file not shown)
-21.8 KB (binary file not shown)

pyproject.toml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+
+
+[project]
+name = "markdown-crawler"
+version = "0.0.4"
+authors = [
+    { name="Paul Pierre", email="[email protected]" },
+]
+description = "A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file for each page"
+readme = "README.md"
+requires-python = ">=3.4"
+classifiers = [
+    "Programming Language :: Python :: 3",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: OS Independent",
+]
+
+[project.urls]
+"Homepage" = "https://github.com/paulpierre/markdown-crawler"
+"Bug Tracker" = "https://github.com/paulpierre/markdown-crawler/issues"
+"Twitter" = "https://twitter.com/paulpierre"
+
+[project.scripts]
+markdown-crawler = "markdown_crawler.cli:main"
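The [project.scripts] table is what makes markdown-crawler available as a shell command: on install, setuptools generates a small launcher on PATH that imports main from markdown_crawler.cli and calls it. Roughly, as a sketch of that wiring rather than the literal generated file:

```python
# Approximate behavior of the generated markdown-crawler console script.
import sys
from markdown_crawler.cli import main

if __name__ == '__main__':
    sys.exit(main())
```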

setup.cfg

Lines changed: 0 additions & 2 deletions
This file was deleted.

setup.py

Lines changed: 0 additions & 31 deletions
This file was deleted.
