Computer, do this.

My thoughts, ideas and processes with computing

Web-scraping with Python

Posted: 2019-01-03 | Last updated: 2019-01-03

Doing a quick Google search for "best language for web scraping" brings up plenty of mentions of the Python programming language. I thought I'd explore this, since I am learning Python alongside my study of algorithmic computation, and the everlasting problem at my workplace is pricing (I will get an effective and efficient solution one day).

Initial idea

So today, I spent a quick half an hour trying to set up a Python IDE to get cracking (and because filling in spreadsheets is the root of all evil when you do it for eight hours a day). In the end I settled on the default Python IDLE, as everything else was either too confusing or didn't like what I was doing with the Windows PATH environment variables.

Within a couple of hours and a quick skim-read of the Beautiful Soup documentation, I had successfully fetched our website's home page, parsed all its links and separated them into domain, non-domain and non-TLD anchor links (just '/', a JavaScript function or 'mailto:'). This was neatly wrapped up in a few functions, so that I could just pass in a URL to fetch and it would separate the links, validate the URLs and print them nicely (or the first 10 of the 130 on the page).
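For anyone curious, the gist of it was something like the sketch below. This isn't my exact script, just a minimal approximation: it assumes the requests and beautifulsoup4 packages are installed, and the URL is a placeholder.

    # A rough sketch of the link-categorising experiment.
    # Assumes: pip install requests beautifulsoup4
    from urllib.parse import urlparse

    import requests
    from bs4 import BeautifulSoup

    def fetch_and_sort_links(url):
        """Fetch a page and split its anchor hrefs into domain links,
        other-domain links and non-TLD anchors ('/', javascript:, mailto: ...)."""
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        our_domain = urlparse(url).netloc

        domain_links, other_links, non_tld = [], [], []
        for anchor in soup.find_all("a", href=True):
            href = anchor["href"]
            netloc = urlparse(href).netloc
            if not netloc:
                non_tld.append(href)          # '/', 'javascript:...', 'mailto:...'
            elif netloc == our_domain:
                domain_links.append(href)
            else:
                other_links.append(href)
        return domain_links, other_links, non_tld

    domain, other, misc = fetch_and_sort_links("https://www.example.com/")
    print(domain[:10])   # just the first 10, as in the original run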

This amount of productivity, I thought, was quite impressive for a beginner Python coder in such a short amount of time, and I can see why it is such a popular language for this use case!

Next steps


So now I can fetch all the links on a page and categorise them loosely. Great but not very useful yet. So what's next?

A rough idea of an algorithm is forming in my head. The problem I have always found with scraping tools, whether hand-programmed or pre-built applications, is that we need to set them up and configure them for each shoddily made competitor site, and then they'll go and change their structure and the whole process has to be done all over again.

What we need is a quick, easy and flexible way to tell a tool what to fetch from a page, even if we have to enter the domain and then say, "pages with this element/id/class/picture of a frog are category pages; grab all the links from here when you get to one of these pages", with each run.

So, what we need is akin to a configuration for each run. OK, we can hold that in some kind of variable or ask for it to be input when the tool is run, perhaps even hold it in a persistent location like a file.
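To make that concrete, the configuration might end up looking something like this. The selectors and keys here are entirely hypothetical, just to illustrate the shape of the thing:

    # Hypothetical per-run configuration, held as a plain Python dict for now.
    # It could just as easily live in a JSON or YAML file loaded at start-up.
    config = {
        "domain": "https://www.competitor-example.com/",
        "page_types": {
            # "pages with this element/id/class are category pages,
            #  grab all the links from here when you get to one of these pages"
            "category": {"selector": "ul.category-listing"},
            # product pages are where the data we actually want lives
            "product": {
                "selector": "div#product-detail",
                "fields": {
                    "name": "h1.product-title",
                    "price": "span.price",
                },
            },
        },
    }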

Once we have this configuration we need to tell the tool to:

  1. Initialise a list urlsToCrawl to empty
  2. Fetch the content at the configured domain and store it in a variable page
  3. Fetch all of the links from page that we want to crawl and sort/validate them into urlsToCrawl, each with a boolean flag crawled set to False to say it has not yet been crawled
  4. for each link in urlsToCrawl with crawled set to False
    1. Determine the type of the page using our defined cases in the configuration
    2. if the type is of interest
      1. fetch the relevant information as determined by the configuration
      2. fetch all links and add new ones to urlsToCrawl with crawled as False
      3. set the current link in urlsToCrawl's crawled flag to True
  5. Go back to step 4 until no links in urlsToCrawl have crawled set to False

In my mind this will be the overall program algorithm, which I've tried to write out in structured English.
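To make the steps concrete, here's a rough Python sketch of that loop, reusing the hypothetical config dict and the requests/Beautiful Soup approach from earlier. A real version would need politeness delays, error handling and rather more care over duplicate URLs.

    # A sketch of the crawl loop described above (steps 1-5).
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(config):
        domain = config["domain"]
        # Steps 1-3 rolled together: seed the list with the configured domain.
        # urls_to_crawl maps each URL to its 'crawled' flag.
        urls_to_crawl = {domain: False}
        results = []

        # Steps 4 and 5: keep going while anything is still marked un-crawled.
        while any(not done for done in urls_to_crawl.values()):
            url = next(u for u, done in urls_to_crawl.items() if not done)
            urls_to_crawl[url] = True  # step 4.2.3

            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

            # Collect same-domain links from every page (a little looser than
            # step 4.2.2, which only takes links from pages of interest).
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if urlparse(link).netloc == urlparse(domain).netloc:
                    urls_to_crawl.setdefault(link, False)

            # Step 4.1: determine the page type from the configured selectors.
            page_type = next((name for name, rule in config["page_types"].items()
                              if soup.select_one(rule["selector"])), None)
            if page_type is None:
                continue  # not a page we care about

            # Step 4.2.1: pull out the configured fields, if any.
            rule = config["page_types"][page_type]
            item = {}
            for field, selector in rule.get("fields", {}).items():
                element = soup.select_one(selector)
                item[field] = element.get_text(strip=True) if element else None
            if item:
                results.append(item)

        return results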

I'm confident I could do this in any programming language I know: C++, Java, JavaScript, PHP... but Python appears to be the popular answer to the question. More research would be beneficial before implementation.

Persistence

Another brief thought I had while experimenting with my original Python program was: what do we do with the data once we have it?

Well, usually we output to .CSV (eurgh, more spreadsheets), where I would then use formulas to compare and present unsightly, humongous tables to my colleagues, with red where we are selling cheaper and green where we are making more... etc. etc.

What if we had some kind of database/files/serialised JSON where we could compare the data not only with our own but also with the last crawl of the site, to see how things have changed?
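As a very rough illustration of what I mean (the file layout and names are completely made up), each crawl could be dumped to a timestamped JSON file and diffed against the previous one:

    # Hypothetical persistence: save each crawl as timestamped JSON and
    # compare prices against the most recent previous crawl on disk.
    import json
    from datetime import datetime
    from pathlib import Path

    CRAWL_DIR = Path("crawls")

    def save_crawl(results):
        CRAWL_DIR.mkdir(exist_ok=True)
        path = CRAWL_DIR / f"{datetime.now():%Y-%m-%d_%H%M%S}.json"
        path.write_text(json.dumps(results, indent=2))
        return path

    def changes_since_last_crawl(results):
        """Return {name: (old price, new price)} for items whose price moved.
        Call this *before* save_crawl, so the newest file is still the old crawl."""
        previous_files = sorted(CRAWL_DIR.glob("*.json"))
        if not previous_files:
            return {}
        previous = {item["name"]: item["price"]
                    for item in json.loads(previous_files[-1].read_text())}
        return {item["name"]: (previous[item["name"]], item["price"])
                for item in results
                if item["name"] in previous and previous[item["name"]] != item["price"]}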

More thought is needed on this still. I've always liked the idea of a nice graph, but it doesn't give the precise breakdown we would eventually need.

Inspired!


Well, that brief foray into Python and Beautiful Soup did inspire me enough to write this article. It's something I would very much like to explore further.