Selenium Twitter Webcrawler – English Version

Important Notes

Snap Packages:
Selenium does not seem to work with browsers installed as Snap packages. Ubuntu ships Firefox as a Snap by default, so avoid Ubuntu for this project; Mint and Debian have proven to work well.
Firefox via Flatpak:
Using Firefox via Flatpak has not been tested with this project.
Google Chrome:
Chrome generally works with Selenium, but its integration may be less stable.
Twitter Login:
To use Twitter lists, you must be logged in. Lists are the only views that are reliably chronological. The best approach is to log in to Twitter with a new Firefox profile and then copy that profile to the target server (this also works on a Raspberry Pi and similar devices). Adjust the profile path in the script accordingly. You can find your Firefox profile by opening about:profiles. The bot also works without a login, but then the target page must be publicly accessible without authentication.
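
As a rough aid for finding the profile path on a Linux server, the sketch below reads Firefox's profiles.ini and lists each profile's name and directory. It assumes the usual location ~/.mozilla/firefox; the function name list_firefox_profiles is hypothetical and not part of this project.

```python
import configparser
from pathlib import Path


def list_firefox_profiles(firefox_dir):
    """Return {profile name: profile directory} read from profiles.ini.

    firefox_dir is the directory that contains profiles.ini
    (assumed to be ~/.mozilla/firefox on a typical Linux install).
    """
    firefox_dir = Path(firefox_dir)
    parser = configparser.ConfigParser(interpolation=None)
    # A missing file is silently skipped, which yields an empty result.
    parser.read(firefox_dir / "profiles.ini", encoding="utf-8")

    profiles = {}
    for section in parser.sections():
        # Profile entries live in sections named Profile0, Profile1, ...
        if not section.startswith("Profile"):
            continue
        path = parser.get(section, "Path", fallback=None)
        if path is None:
            continue
        # IsRelative=1 means Path is relative to the directory of profiles.ini.
        relative = parser.get(section, "IsRelative", fallback="1") == "1"
        name = parser.get(section, "Name", fallback=section)
        profiles[name] = firefox_dir / path if relative else Path(path)
    return profiles


# Example call (prints nothing if no profiles.ini exists at that location):
for name, directory in list_firefox_profiles(Path.home() / ".mozilla" / "firefox").items():
    print(f"{name}: {directory}")
```

The directory printed for the profile you logged in with is the one to copy to the server and to enter as the profile path in the script.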

Overview

This project enables Twitter data crawling without using the official Twitter API. All retrieved tweets are automatically forwarded through two modules:

Telegram Bot:
With optional filtering (e.g., by specific keywords, lines, or locations).
Mastodon Bot:
Simple forwarding of tweets to Mastodon.

Additionally, there is a Control Bot for Telegram, allowing you to manage chat IDs, filter terms, and control the bot.
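
To illustrate the kind of keyword filtering described for the Telegram module, here is a minimal sketch. It assumes that tweets are plain strings and that a tweet is forwarded when it contains at least one filter term (case-insensitive); the actual data structure and matching rules in telegram_bot.py may differ, and the function name filter_tweets is hypothetical.

```python
def filter_tweets(tweets, filter_terms):
    """Keep tweets that mention at least one filter term (case-insensitive).

    An empty list of filter terms disables filtering and keeps every tweet.
    """
    terms = [term.casefold() for term in filter_terms if term.strip()]
    if not terms:
        return list(tweets)
    return [
        tweet for tweet in tweets
        if any(term in tweet.casefold() for term in terms)
    ]


tweets = [
    "Line U2 is delayed between Alexanderplatz and Zoo",
    "New timetable published for the weekend",
    "Signal failure on line S1",
]
print(filter_tweets(tweets, ["u2", "S1"]))
```

With these sample strings, the first and third tweets pass the filter and the second is dropped.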

Installation & Configuration

Requirements

Python (including pip)
The Python modules listed in requirements.txt (installable via pip)

Step-by-Step Guide

1. Install Python and Required Modules

Ensure Python and pip are installed. Then, install the required modules:

```
pip install -r requirements.txt
```

2. Adjustments for Publicly Accessible Twitter Data (No Login)

If you want to crawl Twitter data without logging in, make the following changes in the twitter_bot.py file:

Comment out:

```
# firefox_profile = webdriver.FirefoxProfile(firefox_profile_path)
# firefox_options.profile = firefox_profile
```

In the def main() function:

Uncomment:

```
driver = webdriver.Firefox(options=firefox_options)
```

Comment out:

```
# driver = webdriver.Firefox(options=firefox_options, firefox_profile=firefox_profile_path)
```

Additionally:
Comment out the delete_temp_files() function, as it is probably not needed in this mode.
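
Instead of commenting lines in and out, the login and no-login modes could also be selected with a single setting. The following is only a sketch of that idea, not the code in twitter_bot.py: it hands the profile directory to Firefox through the -profile command-line argument, and the names USE_LOGIN, PROFILE_PATH, firefox_arguments, and make_driver are hypothetical. The headless option is likewise an assumption for servers without a display. Selenium is imported inside make_driver so the argument logic can be checked without a browser.

```python
def firefox_arguments(profile_path=None, headless=False):
    """Build the command-line arguments that are handed to Firefox."""
    arguments = []
    if headless:
        arguments.append("-headless")
    if profile_path:
        # Logged-in mode: start Firefox with an existing profile directory.
        arguments.extend(["-profile", str(profile_path)])
    return arguments


def make_driver(profile_path=None, headless=False):
    """Start Firefox through Selenium, with or without a stored profile."""
    # Imported here so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    for argument in firefox_arguments(profile_path, headless):
        options.add_argument(argument)
    return webdriver.Firefox(options=options)


USE_LOGIN = False  # hypothetical switch; set to True to use the profile
PROFILE_PATH = "/home/YOURUSER/.mozilla/firefox/YOURPROFILE"  # placeholder

# Only the argument list is printed here; make_driver() would start Firefox.
print(firefox_arguments(PROFILE_PATH if USE_LOGIN else None))
```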

3. Access to Non-Public Twitter Pages (e.g., Chronologically Sorted Lists)

Adjust the firefox_profile_path value in twitter_bot.py to access protected or personalized pages.
You can find your profile name under about:profiles in Firefox.

4. Target Pages and Module Selection

Add Twitter Pages:
Enter the Twitter page you want to capture tweets from in twitter_bot.py.
Disable Unnecessary Modules:
Comment out the calls to the Telegram or Mastodon bots in def main() if you do not need them:
```
# await telegram_bot.main(new_tweets)
# mastodon_bot.main(new_tweets)
```
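
As an alternative to commenting out the calls, the modules could be switched on and off through a small table. This is a sketch under assumptions: it only relies on what the snippet above shows (telegram_bot.main is awaited, mastodon_bot.main is called directly), uses stand-in functions instead of the real modules, and the names forward_tweets and MODULES are hypothetical.

```python
import asyncio
import inspect


async def forward_tweets(new_tweets, modules):
    """Call every enabled module with the new tweets.

    modules maps a name to an (enabled, handler) pair. Handlers may be
    plain functions or coroutine functions.
    """
    called = []
    for name, (enabled, handler) in modules.items():
        if not enabled:
            continue
        result = handler(new_tweets)
        # Coroutine functions return an awaitable that still has to run.
        if inspect.isawaitable(result):
            await result
        called.append(name)
    return called


# Stand-ins for telegram_bot.main (async) and mastodon_bot.main (sync).
async def fake_telegram_main(new_tweets):
    print(f"Telegram: {len(new_tweets)} tweet(s)")


def fake_mastodon_main(new_tweets):
    print(f"Mastodon: {len(new_tweets)} tweet(s)")


MODULES = {
    "telegram": (True, fake_telegram_main),
    "mastodon": (False, fake_mastodon_main),  # disabled
}

print(asyncio.run(forward_tweets(["example tweet"], MODULES)))
```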

5. Set Up API Keys

Telegram:
Get your API keys via BotFather and enter them into the respective files.
Mastodon:
The API key can be found in your instance's settings (under Development). Make sure the required permissions are granted; if you change them, the API key must be regenerated. Also, specify your instance in the script. The Gemini API is used to generate alt texts for images free of charge.
Gemini API (For Testing Purposes):
Add your Gemini API key to your ~/.bashrc. Open the file with:
```
nano ~/.bashrc
```
and add the line:
```
export GOOGLE_API_KEY="YOURAPIKEY"
```
You can get a free Gemini API key here: Gemini API Key.
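
This guide exports the key as GOOGLE_API_KEY in ~/.bashrc, while the service file in step 8 sets GEMINI_API_KEY, and it does not state which name the scripts read. The sketch below therefore checks both names and fails with a clear message if neither is set. The function read_api_key is hypothetical and not part of this project. Keep in mind that variables exported in ~/.bashrc are visible to interactive shells only, not to systemd services.

```python
import os


def read_api_key(names=("GOOGLE_API_KEY", "GEMINI_API_KEY"), environ=None):
    """Return the first non-empty API key found under the given names."""
    environ = os.environ if environ is None else environ
    for name in names:
        value = environ.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("No API key found; set one of: " + ", ".join(names))


# Example with an explicit mapping instead of the real environment:
print(read_api_key(environ={"GEMINI_API_KEY": "YOURAPIKEY"}))
```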

6. Test Run of the Bot

From the project directory, run the bot once to test it:

```
python twitter_bot.py
```

Note:
Selenium usually tries to install the correct Geckodriver for Firefox automatically. If this does not work, download the Geckodriver manually:
- x64 & ARM: Geckodriver Releases
Extract Geckodriver and copy it to the system directory:
```
sudo cp geckodriver /usr/local/bin/geckodriver
```
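
To confirm that a manually installed Geckodriver can be found, a quick check like the following may help. It is a small sketch using only the Python standard library; the function name find_geckodriver is hypothetical.

```python
import shutil


def find_geckodriver(search_path=None):
    """Return the full path of geckodriver, or None if it is not found.

    search_path overrides the PATH environment variable (useful for testing).
    """
    return shutil.which("geckodriver", path=search_path)


location = find_geckodriver()
if location is None:
    print("geckodriver not found on PATH")
else:
    print(f"geckodriver found at {location}")
```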

7. Configure the Telegram Control Bot

If you use the Telegram bot, add your API key to telegram_control_bot.py.
It is recommended to replace DATA_FILE = 'data.json' with an absolute path; remember to apply the same change in telegram_bot.py.
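
A relative path such as 'data.json' is resolved against the current working directory, which can differ between a manual run and a service. One way to obtain an absolute path without hard-coding the user directory is sketched below; the helper name data_file_next_to is hypothetical, and the sketch assumes data.json is stored in the same directory as the scripts.

```python
from pathlib import Path


def data_file_next_to(script_path, filename="data.json"):
    """Return the absolute path of filename in the script's directory."""
    return Path(script_path).resolve().parent / filename


# In telegram_control_bot.py and telegram_bot.py this could be used as:
# DATA_FILE = data_file_next_to(__file__)
print(data_file_next_to("/home/YOURUSER/bots/telegram_bot.py"))
```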

8. Set Up Bots as a Service

To run the bot continuously in the background, set it up as a system service:

Create a service file:

```
sudo nano /etc/systemd/system/twitter_bot.service
```

Add the following content, adjusting YOURUSER and YOURAPIKEY:

```
[Unit]
Description=twitter_bot
After=network.target

[Service]
Environment="GEMINI_API_KEY=YOURAPIKEY"
WorkingDirectory=/home/YOURUSER/bots
ExecStart=/home/YOURUSER/bots/venv/bin/python3 /home/YOURUSER/bots/twitter_bot.py
Restart=always
RestartSec=10
User=YOURUSER
Group=YOURUSER

[Install]
WantedBy=multi-user.target
```

Reload system services:
```
sudo systemctl daemon-reload
```

Start and enable the service:

```
sudo systemctl start twitter_bot.service
sudo systemctl enable twitter_bot.service
```

Set up telegram_control_bot similarly.
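
If you prefer to generate the second unit file rather than copy it by hand, the following sketch fills a template with the script name and user. It assumes the control bot lives in the same /home/YOURUSER/bots directory and uses the same virtual environment as twitter_bot.py. The Environment line is left out because this guide does not say whether the control bot needs the Gemini key. The names UNIT_TEMPLATE and render_unit are hypothetical.

```python
from string import Template

# Same layout as the twitter_bot.service file, with placeholders.
UNIT_TEMPLATE = Template("""\
[Unit]
Description=$name
After=network.target

[Service]
WorkingDirectory=/home/$user/bots
ExecStart=/home/$user/bots/venv/bin/python3 /home/$user/bots/$name.py
Restart=always
RestartSec=10
User=$user
Group=$user

[Install]
WantedBy=multi-user.target
""")


def render_unit(name, user):
    """Fill the unit-file template for one bot script."""
    return UNIT_TEMPLATE.substitute(name=name, user=user)


print(render_unit("telegram_control_bot", "YOURUSER"))
```

The printed text can be saved as /etc/systemd/system/telegram_control_bot.service, followed by the same daemon-reload, start, and enable commands as above.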

9. Completion

Congratulations – the bot should now be running successfully!

Acknowledgment

Special thanks to shaikhsajid1111, whose project helped me understand how to use CSS selectors to extract tweets. That project is particularly useful for beginners who want to crawl profiles, even though chronological sorting is often no longer available there. My approach using Twitter lists offers more flexibility.

Best of luck using the Selenium Twitter Webcrawler!
