Commit 8053a7c

Nick: updates on pypi
1 parent c164370 commit 8053a7c

2 files changed: +68 −148 lines changed

2 files changed

+68
-148
lines changed

apps/python-sdk/README.md

Lines changed: 67 additions & 147 deletions
@@ -6,7 +6,7 @@ The Firecrawl Python SDK is a library that allows you to easily scrape and crawl

To install the Firecrawl Python SDK, you can use pip:

-```bash
+```bash
pip install firecrawl-py
```

@@ -17,26 +17,23 @@ pip install firecrawl-py

Here's an example of how to use the SDK:

-```python
-from firecrawl.firecrawl import FirecrawlApp
+```python
+from firecrawl import FirecrawlApp, ScrapeOptions

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
-scrape_status = app.scrape_url(
+data = app.scrape_url(
  'https://firecrawl.dev',
-  params={'formats': ['markdown', 'html']}
+  formats=['markdown', 'html']
)
-print(scrape_status)
+print(data)

# Crawl a website:
crawl_status = app.crawl_url(
  'https://firecrawl.dev',
-  params={
-    'limit': 100,
-    'scrapeOptions': {'formats': ['markdown', 'html']}
-  },
-  poll_interval=30
+  limit=100,
+  scrape_options=ScrapeOptions(formats=['markdown', 'html'])
)
print(crawl_status)
```
@@ -45,143 +42,80 @@ print(crawl_status)

To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.

-```python
-url = 'https://example.com'
-scraped_data = app.scrape_url(url)
-```
-
-### Extracting structured data from a URL
-
-With LLM extraction, you can easily extract structured data from any URL. We support pydantic schemas to make it easier for you too. Here is how you to use it:
-
-```python
-class ArticleSchema(BaseModel):
-    title: str
-    points: int
-    by: str
-    commentsURL: str
-
-class TopArticlesSchema(BaseModel):
-    top: List[ArticleSchema] = Field(..., max_items=5, description="Top 5 stories")
-
-data = app.scrape_url('https://news.ycombinator.com', {
-    'extractorOptions': {
-        'extractionSchema': TopArticlesSchema.model_json_schema(),
-        'mode': 'llm-extraction'
-    },
-    'pageOptions':{
-        'onlyMainContent': True
-    }
-})
-print(data["llm_extraction"])
+```python
+# Scrape a website:
+scrape_result = app.scrape_url('firecrawl.dev', formats=['markdown', 'html'])
+print(scrape_result)
```

### Crawling a Website

To crawl a website, use the `crawl_url` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.

-```python
-idempotency_key = str(uuid.uuid4()) # optional idempotency key
-crawl_result = app.crawl_url('firecrawl.dev', {'excludePaths': ['blog/*']}, 2, idempotency_key)
-print(crawl_result)
+```python
+crawl_status = app.crawl_url(
+  'https://firecrawl.dev',
+  limit=100,
+  scrape_options=ScrapeOptions(formats=['markdown', 'html']),
+  poll_interval=30
+)
+print(crawl_status)
```

-### Asynchronous Crawl a Website
+### Asynchronous Crawling

-To crawl a website asynchronously, use the `async_crawl_url` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.
+<Tip>Looking for async operations? Check out the [Async Class](#async-class) section below.</Tip>

-```python
-crawl_result = app.async_crawl_url('firecrawl.dev', {'excludePaths': ['blog/*']}, "")
-print(crawl_result)
+To crawl a website asynchronously, use the `crawl_url_async` method. It returns the crawl `ID` which you can use to check the status of the crawl job. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.
+
+```python
+crawl_status = app.async_crawl_url(
+  'https://firecrawl.dev',
+  limit=100,
+  scrape_options=ScrapeOptions(formats=['markdown', 'html']),
+)
+print(crawl_status)
```

### Checking Crawl Status

To check the status of a crawl job, use the `check_crawl_status` method. It takes the job ID as a parameter and returns the current status of the crawl job.

-```python
-id = crawl_result['id']
-status = app.check_crawl_status(id)
-```
-
-### Map a Website
-
-Use `map_url` to generate a list of URLs from a website. The `params` argument let you customize the mapping process, including options to exclude subdomains or to utilize the sitemap.
-
-```python
-# Map a website:
-map_result = app.map_url('https://example.com')
-print(map_result)
-```
-
-### Crawl a website with WebSockets
-
-To crawl a website with WebSockets, use the `crawl_url_and_watch` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.
-
-```python
-# inside an async function...
-nest_asyncio.apply()
-
-# Define event handlers
-def on_document(detail):
-    print("DOC", detail)
-
-def on_error(detail):
-    print("ERR", detail['error'])
-
-def on_done(detail):
-    print("DONE", detail['status'])
-
-# Function to start the crawl and watch process
-async def start_crawl_and_watch():
-    # Initiate the crawl job and get the watcher
-    watcher = app.crawl_url_and_watch('firecrawl.dev', { 'excludePaths': ['blog/*'], 'limit': 5 })
-
-    # Add event listeners
-    watcher.add_event_listener("document", on_document)
-    watcher.add_event_listener("error", on_error)
-    watcher.add_event_listener("done", on_done)
-
-    # Start the watcher
-    await watcher.connect()
-
-# Run the event loop
-await start_crawl_and_watch()
+```python
+crawl_status = app.check_crawl_status("<crawl_id>")
+print(crawl_status)
```

-### Scraping multiple URLs in batch
+### Cancelling a Crawl

-To batch scrape multiple URLs, use the `batch_scrape_urls` method. It takes the URLs and optional parameters as arguments. The `params` argument allows you to specify additional options for the scraper such as the output formats.
+To cancel an asynchronous crawl job, use the `cancel_crawl` method. It takes the job ID of the asynchronous crawl as a parameter and returns the cancellation status.

-```python
-idempotency_key = str(uuid.uuid4()) # optional idempotency key
-batch_scrape_result = app.batch_scrape_urls(['firecrawl.dev', 'mendable.ai'], {'formats': ['markdown', 'html']}, 2, idempotency_key)
-print(batch_scrape_result)
+```python
+cancel_crawl = app.cancel_crawl(id)
+print(cancel_crawl)
```

-### Asynchronous batch scrape
+### Map a Website

-To run a batch scrape asynchronously, use the `async_batch_scrape_urls` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the scraper, such as the output formats.
+Use `map_url` to generate a list of URLs from a website. The `params` argument let you customize the mapping process, including options to exclude subdomains or to utilize the sitemap.

-```python
-batch_scrape_result = app.async_batch_scrape_urls(['firecrawl.dev', 'mendable.ai'], {'formats': ['markdown', 'html']})
-print(batch_scrape_result)
+```python
+# Map a website:
+map_result = app.map_url('https://firecrawl.dev')
+print(map_result)
```

-### Checking batch scrape status
+{/* ### Extracting Structured Data from Websites

-To check the status of an asynchronous batch scrape job, use the `check_batch_scrape_status` method. It takes the job ID as a parameter and returns the current status of the batch scrape job.
+To extract structured data from websites, use the `extract` method. It takes the URLs to extract data from, a prompt, and a schema as arguments. The schema is a Pydantic model that defines the structure of the extracted data.

-```python
-id = batch_scrape_result['id']
-status = app.check_batch_scrape_status(id)
-```
+<ExtractPythonShort /> */}

-### Batch scrape with WebSockets
+### Crawling a Website with WebSockets

-To use batch scrape with WebSockets, use the `batch_scrape_urls_and_watch` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the scraper, such as the output formats.
+To crawl a website with WebSockets, use the `crawl_url_and_watch` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.

-```python
+```python
# inside an async function...
nest_asyncio.apply()

@@ -195,10 +129,10 @@ def on_error(detail):
def on_done(detail):
    print("DONE", detail['status'])

-# Function to start the crawl and watch process
+# Function to start the crawl and watch process
async def start_crawl_and_watch():
    # Initiate the crawl job and get the watcher
-    watcher = app.batch_scrape_urls_and_watch(['firecrawl.dev', 'mendable.ai'], {'formats': ['markdown', 'html']})
+    watcher = app.crawl_url_and_watch('firecrawl.dev', exclude_paths=['blog/*'], limit=5)

    # Add event listeners
    watcher.add_event_listener("document", on_document)
@@ -216,36 +150,22 @@ await start_crawl_and_watch()

The SDK handles errors returned by the Firecrawl API and raises appropriate exceptions. If an error occurs during a request, an exception will be raised with a descriptive error message.

-## Running the Tests with Pytest
-
-To ensure the functionality of the Firecrawl Python SDK, we have included end-to-end tests using `pytest`. These tests cover various aspects of the SDK, including URL scraping, web searching, and website crawling.
-
-### Running the Tests
-
-To run the tests, execute the following commands:
-
-Install pytest:
-
-```bash
-pip install pytest
-```
-
-Run:
-
-```bash
-pytest firecrawl/__tests__/e2e_withAuth/test.py
-```
-
-## Contributing
-
-Contributions to the Firecrawl Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
+## Async Class

-## License
+For async operations, you can use the `AsyncFirecrawlApp` class. Its methods are the same as the `FirecrawlApp` class, but they don't block the main thread.

-The Firecrawl Python SDK is licensed under the MIT License. This means you are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the SDK, subject to the following conditions:
+```python
+from firecrawl import AsyncFirecrawlApp

-- The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+app = AsyncFirecrawlApp(api_key="YOUR_API_KEY")

-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+# Async Scrape
+async def example_scrape():
+    scrape_result = await app.scrape_url(url="https://example.com")
+    print(scrape_result)

-Please note that while this SDK is MIT licensed, it is part of a larger project which may be under different licensing terms. Always refer to the license information in the root directory of the main project for overall licensing details.
+# Async Crawl
+async def example_crawl():
+    crawl_result = await app.crawl_url(url="https://example.com")
+    print(crawl_result)
+```
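The Async Class example added above defines coroutines but never runs them; a minimal sketch of executing one with the standard library, reusing only the calls shown in the diff (the `asyncio.run` entry point is standard Python, not part of this commit):

```python
import asyncio

from firecrawl import AsyncFirecrawlApp

app = AsyncFirecrawlApp(api_key="YOUR_API_KEY")

async def example_scrape():
    # Same coroutine as in the Async Class example above.
    scrape_result = await app.scrape_url(url="https://example.com")
    print(scrape_result)

# Coroutines only execute when scheduled on an event loop.
asyncio.run(example_scrape())
```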

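The error-handling note kept as context in this hunk ("raises appropriate exceptions") has no accompanying snippet in the commit; a minimal sketch, assuming only that SDK calls raise ordinary Python exceptions (the concrete exception classes are not shown in this diff):

```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

try:
    data = app.scrape_url('https://firecrawl.dev', formats=['markdown'])
    print(data)
except Exception as err:
    # The API error surfaces as an exception with a descriptive message.
    print(f"Scrape failed: {err}")
```
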
apps/python-sdk/firecrawl/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@

from .firecrawl import FirecrawlApp, JsonConfig, ScrapeOptions # noqa

-__version__ = "2.4.0"
+__version__ = "2.4.3"

# Define the logger for the Firecrawl project
logger: logging.Logger = logging.getLogger("firecrawl")
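Since the commit bumps `__version__` inside the package's `__init__.py`, the installed release can be sanity-checked from Python after `pip install --upgrade firecrawl-py`; this relies only on the attribute shown above:

```python
import firecrawl

# __version__ is defined in firecrawl/__init__.py (bumped to 2.4.3 by this commit).
print(firecrawl.__version__)
```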
