Overview
I recently had a problem where I wanted to verify that a website:
- did not link to any website on a given blacklist
- did not return non-2XX status codes for any of its routes
Although there are many existing utilities you could probably piece together to solve this problem, I decided to write my own utility, Cerca
(which is Catalan for "search").
Problem
Given a website URL, search the website for all external and internal links. Follow all internal links while not revisiting past links.
Solution
I decided to use the following approach:
- Initialize a queue with the website's root (e.g. https://wikipedia.com).
- Pop a URL off the queue.
- Retrieve the page and, if the URL is internal, extract all links on it.
- Add each extracted link to the queue if it has not been seen before.
- Repeat until the queue is empty.
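The steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library's HTML parser (the project itself uses BeautifulSoup4), with the page fetcher injected as a function so the traversal can be exercised without network access:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(root, fetch):
    """BFS over a site starting at `root`.

    `fetch(url)` returns a page's HTML as a string; it is a parameter
    so the traversal is testable offline. Returns the set of all URLs
    discovered, internal and external.
    """
    root_host = urlparse(root).netloc
    seen = {root}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        # Only follow links found on internal pages.
        if urlparse(url).netloc != root_host:
            continue
        extractor = LinkExtractor()
        extractor.feed(fetch(url))
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```

Note that this sketch skips URL normalization (trailing slashes, fragments, query strings), which a real crawler needs to avoid revisiting the same page under different spellings.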
To do this, I decided to use the following:
- Language: Python 3
- HTML library: BeautifulSoup4
- Algorithm: simple breadth-first search (BFS) with a queue
The code can be seen here!
As part of this project, I also wanted to learn how to make a Python module. It is actually fairly straightforward:
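A minimal sketch of a `setup.py` for packaging a module like this one (the name, version, and entry point here are illustrative, not the project's actual metadata):

```python
# setup.py: minimal packaging sketch; metadata values are placeholders
from setuptools import setup, find_packages

setup(
    name="cerca",                 # distribution name (illustrative)
    version="0.1.0",
    packages=find_packages(),     # pick up the package directory automatically
    install_requires=["beautifulsoup4"],          # assumed dependency
    entry_points={
        "console_scripts": ["cerca=cerca.cli:main"],  # hypothetical CLI entry
    },
)
```

With a file like this in place, `pip install -e .` installs the module locally in editable mode for development.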
Example
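As an illustration of how a crawl's output could drive the two checks from the overview, here is a hypothetical helper (the function and parameter names are mine, not Cerca's actual API; the status fetcher is injected so the check is testable offline):

```python
from urllib.parse import urlparse


def audit(urls, blacklist, get_status):
    """Report problems in a set of crawled URLs.

    `urls` is the set of URLs a crawl discovered, `blacklist` is a set
    of forbidden hostnames, and `get_status(url)` returns the HTTP
    status code for a URL. Returns human-readable problem strings.
    """
    problems = []
    for url in sorted(urls):
        if urlparse(url).netloc in blacklist:
            problems.append(f"blacklisted link: {url}")
        status = get_status(url)
        if not 200 <= status < 300:  # flag anything outside 2XX
            problems.append(f"bad status {status}: {url}")
    return problems
```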