dan's internet pad

Overview

I recently had a problem where I wanted to verify that a website:

Although you could probably piece together existing utilities to solve this problem, I decided to write my own utility, Cerca (Catalan for "search").

Problem

Given a website URL, find all internal and external links on the site. Follow every internal link, without revisiting links that have already been seen.
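To make the internal/external distinction concrete, here is a minimal sketch using only the Python standard library. It is not Cerca's actual code: the names `LinkParser` and `split_links` are made up for illustration, and it assumes a link counts as internal when its host matches the host of the page it was found on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(page_url, html):
    """Return (internal, external) lists of absolute links found in html."""
    parser = LinkParser()
    parser.feed(html)
    root_host = urlparse(page_url).netloc
    internal, external = [], []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc == root_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

Comparing hosts exactly is a simplification: under this assumption, `www.example.com` and `example.com` are treated as different sites.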

Solution

I decided to use the following approach:

  1. Initialize a queue with the website’s root (e.g. https://wikipedia.com).
  2. Pop a URL off the queue.
  3. Retrieve the webpage and, if the URL is an internal link, extract all links on the page.
  4. Add each link to the queue if it has not been seen before.
  5. Repeat from step 2 until the queue is empty.
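The steps above amount to a breadth-first traversal. The sketch below is one way to write it, not the code Cerca ships: `crawl` and its `fetch` parameter are hypothetical, with `fetch(url)` assumed to return a `(status_code, links)` pair where `links` are the absolute URLs found on that page.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(root, fetch):
    """Breadth-first crawl from root; return a list of (status, url).

    fetch is a hypothetical callable: fetch(url) -> (status_code, links).
    """
    root_host = urlparse(root).netloc
    queue = deque([root])          # step 1: start with the root URL
    seen = {root}                  # URLs that have already been queued
    results = []
    while queue:                   # step 5: stop when the queue is empty
        url = queue.popleft()      # step 2: pop the next URL
        status, links = fetch(url)  # step 3: retrieve the page
        results.append((status, url))
        if urlparse(url).netloc != root_host:
            continue               # external page: record it, do not follow
        for link in links:         # step 4: queue links not seen before
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Passing `fetch` in as a parameter keeps the traversal independent of any HTTP library, so it can be exercised with a dictionary of fake pages before pointing it at a real site.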

To do this, I decided to use the following:

The code can be seen here!

As part of this project, I also wanted to learn how to make a Python module. It is actually fairly straightforward:

# Contains your package information
$ vim setup.py
# Create the distribution, register with PyPI (requires an account), and upload
$ python3.6 setup.py sdist register upload

Example

$ pip3 install cerca
$ cerca https://google.ca
200 https://google.ca
200 https://google.ca/preferences?hl=en
200 https://google.ca/advanced_search?hl=en&authuser=0
404 https://google.ca/language_tools?hl=en&authuser=0
200 https://google.ca/intl/en/ads/
200 https://google.ca/services/
200 https://google.ca/intl/en/about.html
...
