Web Crawler

High Level Approach

This program first traverses to the main page of the social media page to gather the necessary cookies (csrf token and session id) and then passes that to the login page which passes that token along with login information over to the login page. From there, it keeps track of the session id and csrf token and pass those cookies to the requests whenever scraping a page. And then on the web-scraping portion, it uses HTMLParser (from html.parser) to search for <a> tags indicating a link, adding that link to a set of links if it has not already traversed it.

Challenges

Logging In

The first major issue that arose was logging in. This was particularly challenging because while it was clear that it was necessary to send some sort of 'payload' to get the login information through, it was unclear exactly how to go about that. What proved to be really helpful in this situation was using Postman to formulate how the initial few requests would go. For instance, I first tested with just sending a request to the main page to see what would get sent back (just the general format of the data). From there, I tested with a post request at the login page with the body being a 'x-www-form-urlencoded' with the username, password, and csrfmiddlewaretoken attached. I found how out to include these fields from viewing the 'payload' portion of the network tab (inspect element). It became obvious on how to approach the sending of the form data after viewing the code snippet for HTTP requests in Postman.

Chunked Encoding

Chunked encoding was initially an issue just because of how rare a chunked encoding segment was, I wasn't exactly sure how to ensure that the program read the hexadecimal based length field and then ignore the unneeded data after the semicolon. Also, another issue arose when I didn't even know if chunked encoding was necessary; namely I didn't check the header field. After discovering that, I made a fairly rudimentary transfer encoding parser that took every other line until a "0" showed up in an even numbered line. Then that would just be the "body"; primarily html.

Testing

Testing consisted primarily trial and error, running the code incrementally against the server in order to determine what the next required step to implement would be. When possible, I added the code to a separate file and manually tested the code with mock data. For instance, with chunked-encoding, to ensure that the program was parsing the chunked data correctly, I first wrote that code segment in a separate file and then fed that mock data. Mock data was gathered from the following links: HTTP Made Easy and Mozilla.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
.idea		.idea
.DS_Store		.DS_Store
Makefile		Makefile
README.md		README.md
secret_flags		secret_flags
web-crawler		web-crawler

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Web Crawler

High Level Approach

Challenges

Logging In

Chunked Encoding

Testing

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

gcooper407/webtrend

Folders and files

Latest commit

History

Repository files navigation

Web Crawler

High Level Approach

Challenges

Logging In

Chunked Encoding

Testing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages