
Commit ec73798

Reddit Scraper added (HarshCasper#1063)
* Add files via upload
* Update README.md
1 parent 84a318b commit ec73798

4 files changed: 145 additions and 1 deletion


Python/README.md

Lines changed: 2 additions & 1 deletion
@@ -132,6 +132,7 @@
  - [ROT13_Decryption](/Python/ROT13_Decryption)
  - [RSA_Key_Generation](/Python/RSA_Key_Generation)
  - [Random_Password_Generator](/Python/Random_Password_Generator)
+ - [Reddit Scraper](/Python/Reddit_Scraper)
  - [Reddit_Wallpaper](/Python/Reddit_Wallpaper)
  - [S3_Bucket_Creator](/Python/S3_Bucket_Creator)
  - [S3_Bucket_Downloader](/Python/S3_Bucket_Downloader)
@@ -199,4 +200,4 @@
  - [Youtube_Bot](/Python/Youtube_Bot)
  - [Youtube_Video_Download](/Python/Youtube_Video_Download)
  - [Zipper](/Python/Zipper)
- - [Zoom_Automation](/Python/Zipper)
+ - [Zoom_Automation](/Python/Zipper)

Python/Reddit_Scraper/README.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
# Reddit Scraper

- Using BeautifulSoup, a Python library for web scraping, this script scrapes a chosen subreddit and collects the relevant data about its posts.

- The script takes the subreddit name and the maximum number of posts to scrape as user input, and stores the results in a `.csv` file named by the user (see the sketch after this list).
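For orientation, here is a minimal sketch of the scraping step, assuming the same old.reddit.com markup the full script relies on (each post is a `<div class="thing">` element); the subreddit name below is only an example, since the real script asks the user for it:

```python
import requests
from bs4 import BeautifulSoup

# Example subreddit; the actual script takes this as user input.
url = 'https://old.reddit.com/r/python'
headers = {'User-Agent': 'Mozilla/5.0'}  # mimic a browser

resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'html.parser')

# Each post on old.reddit.com is rendered as a <div class="thing"> element.
for post in soup.find_all('div', attrs={'class': 'thing'}):
    title_tag = post.find('a', class_='title')
    if title_tag is not None:
        print(title_tag.text)
```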
## Setup instructions

- The requirements can be installed as follows:

```shell
$ pip install -r requirements.txt
```

## Working screenshots

### Terminal I/O

![Image](https://i.imgur.com/T3CnaKY.png)

### Generated `.csv` file

![Image](https://i.imgur.com/HzvDFUp.png)

## Author

[Rohini Rao](https://github.com/RohiniRG)
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
import requests
import csv
import time
from bs4 import BeautifulSoup


def write_to_csv(row_array):
    """
    Stores the scraped info in a .csv file.
    :param row_array: list of rows, one per scraped post
    :return: None
    """
    # Headings for the first row of the file
    header_list = ['Title', 'Author', 'Date and Time', 'Upvotes',
                   'Comments', 'Url']
    file_name = input('\nEnter the name of file to store the info: ')

    # Adding info into the rows of the file
    with open(file_name + '.csv', 'a', encoding='utf-8') as csv_f:
        csv_pointer = csv.writer(csv_f, delimiter=',')
        csv_pointer.writerow(header_list)
        csv_pointer.writerows(row_array)

    print(f'Done! Check your directory for {file_name}.csv file!')


def scraper():
    """
    Scrapes the post info from the desired subreddit and stores it
    in the desired file.
    :return: None
    """
    subreddit = input('Enter the name of the subreddit: r/').lower()
    max_count = int(input('Enter the maximum number of entries to collect: '))

    # Generating the URL leading to the desired subreddit
    url = 'https://old.reddit.com/r/' + subreddit

    # Using a user-agent to mimic browser activity
    headers = {'User-Agent': 'Mozilla/5.0'}

    req = requests.get(url, headers=headers)

    if req.status_code == 200:
        # Parsing through the web page for obtaining the right html tags and
        # scraping the details required
        soup = BeautifulSoup(req.text, 'html.parser')
        print('\nCOLLECTING INFORMATION....')

        attrs = {'class': 'thing'}
        counter = 1
        full = 0
        reddit_info = []
        while 1:
            for post in soup.find_all('div', attrs=attrs):
                try:
                    title = post.find('a', class_='title').text

                    author = post.find('a', class_='author').text

                    time_stamp = post.time.attrs['title']

                    comments = post.find('a', class_='comments').text.split()[0]
                    if comments == 'comment':
                        comments = 0

                    upvotes = post.find('div', class_='score likes').text
                    if upvotes == '•':
                        upvotes = "None"

                    link = post.find('a', class_='title')['href']
                    link = 'www.reddit.com' + link

                    # Storing the scraped data in an array
                    reddit_info.append([title, author, time_stamp, upvotes,
                                        comments, link])

                    if counter == max_count:
                        full = 1
                        break

                    counter += 1
                except AttributeError:
                    continue

            if full:
                break

            try:
                # To go to the next page
                next_button = soup.find('span', class_='next-button')
                next_page_link = next_button.find('a').attrs['href']

                time.sleep(2)

                req = requests.get(next_page_link, headers=headers)
                soup = BeautifulSoup(req.text, 'html.parser')
            except:
                break

        # Writing the stored information in a .csv file
        print('DONE!\n')
        write_to_csv(reddit_info)

    else:
        print('Error fetching results.. Try again!')


if __name__ == '__main__':
    scraper()
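For reference, the CSV layout that `write_to_csv` produces can be reproduced on its own. A minimal sketch with placeholder data (the column headings come from the script above; the row values and the `sample.csv` filename are made up for illustration):

```python
import csv

# Same column headings the scraper writes as the first row of the file.
header_list = ['Title', 'Author', 'Date and Time', 'Upvotes', 'Comments', 'Url']

# Placeholder rows, in the same order the scraper appends them.
sample_rows = [
    ['Example post title', 'example_user', 'Mon Jan 1 00:00:00 2021 UTC',
     '123', '45', 'www.reddit.com/r/example/comments/abc123/example_post/'],
]

with open('sample.csv', 'w', newline='', encoding='utf-8') as csv_f:
    writer = csv.writer(csv_f, delimiter=',')
    writer.writerow(header_list)
    writer.writerows(sample_rows)
```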
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
beautifulsoup4==4.9.3
certifi==2020.12.5
chardet==4.0.0
idna==2.10
requests==2.25.1
soupsieve==2.2.1
urllib3==1.26.4
