Implementation proposal #1

whalebot-helmsman · 2020-09-09T13:46:36Z

You are doing nice work in this repo. I have the same desire: Scrapy should support different message queues.

Older implementations of this idea, and the one you have here, share a common disadvantage: every type of queue needs its own scheduler. Besides the amount of work required, such implementations can't benefit from improvements made to scheduling. I am mostly talking about scrapy/scrapy#3520. The reason for going distributed (at least for me) is having a lot of domains in a single crawl. Not using DownloaderAwarePriorityQueue makes crawling slower (around 10 times slower) according to the benchmarks in that PR.
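For context, enabling that priority queue in stock Scrapy is a settings change. A minimal sketch for a project's `settings.py`, assuming a Scrapy version that ships `scrapy.pqueues.DownloaderAwarePriorityQueue`:

```python
# settings.py (sketch): use the downloader-aware priority queue,
# which is aimed at crawls that span many domains at once.
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.DownloaderAwarePriorityQueue"
```

A scheduler written separately for each queue backend would have to reimplement this logic itself.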

To overcome this, I developed scrapy/scrapy#3884 (now merged), which separates the scheduler logic from the external message queue.

It would be great for your project and the Scrapy community if you changed from a scheduler-based to a queue-based approach.

More details and discussion can be found in scrapy/scrapy#4326. An example of such an implementation for Redis is at https://github.com/whalebot-helmsman/scrapy/blob/redis/scrapy/squeues.py#L101-L173 .
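To illustrate the queue-based idea, here is a rough sketch. It is not the code from the linked branch. It assumes the scheduler only calls `push`, `pop`, `__len__` and `close` on a queue. `FakeBroker` and `ExternalFifoQueue` are hypothetical names, and plain dicts stand in for serialized Scrapy requests.

```python
import pickle
from collections import deque


class FakeBroker:
    """Hypothetical in-memory stand-in for an external message queue."""

    def __init__(self):
        self._lists = {}

    def rpush(self, key, value):
        # Append to the right end of the list stored under `key`.
        self._lists.setdefault(key, deque()).append(value)

    def lpop(self, key):
        # Pop from the left end; return None when the list is empty.
        items = self._lists.get(key)
        return items.popleft() if items else None

    def llen(self, key):
        return len(self._lists.get(key, ()))


class ExternalFifoQueue:
    """Hypothetical queue wrapper; scheduling logic stays outside of it."""

    def __init__(self, broker, key):
        self.broker = broker
        self.key = key

    def push(self, request):
        # External queues store bytes, so serialize before sending.
        self.broker.rpush(self.key, pickle.dumps(request))

    def pop(self):
        raw = self.broker.lpop(self.key)
        return pickle.loads(raw) if raw is not None else None

    def __len__(self):
        return self.broker.llen(self.key)

    def close(self):
        # Nothing to release for the in-memory stand-in.
        pass


broker = FakeBroker()
queue = ExternalFifoQueue(broker, key="example.com")
queue.push({"url": "https://example.com/a"})
queue.push({"url": "https://example.com/b"})
first = queue.pop()
```

With this shape, supporting another backend means writing another small queue class, while the priority and per-domain logic stays in the shared scheduler code.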

There is also a PR for an external queue protocol: scrapy/scrapy#4783.


Insutanto · 2020-09-09T16:06:00Z

Thank you @whalebot-helmsman

I agree with you. It's great that we can support different message queues without implementing a different scheduler for each one. I am tired of those DRY problems. 😫
I have read the issues and PRs that you mentioned; they are very valuable. I will try to use DownloaderAwarePriorityQueue and a queue-based implementation. That will also help me implement other modules in the future. 😸
Finally, thank you for your contributions to the Scrapy project. 😸

Insutanto · 2020-09-09T16:08:45Z

Thanks for your proposal!

Insutanto added the enhancement New feature or request label Sep 9, 2020
