Posted: 2019-01-03 | Last updated: 2019-01-03
Doing a quick Google search for "best language for web scraping" brings up plenty of mentions of the Python programming language. I thought I'd explore this, since I am learning Python alongside my study of algorithmic computation, and the everlasting problem at my workplace is pricing (I will get an effective and efficient solution one day).
So today I spent a quick half an hour trying to set up a Python IDE to get cracking (and because filling in spreadsheets is the root of all evil when you do it for eight hours a day). Incidentally, I eventually settled on the default Python IDLE, as everything else was either too confusing or didn't like what I was doing with the Windows PATH environment variables.
This amount of productivity, I thought, was quite impressive for a beginner Python coder in such a short amount of time, and I can see why it is such a popular language for this use case!
So now I can fetch all the links on a page and categorise them loosely. Great, but not very useful yet. So what's next?
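For a flavour of what that half hour produced, here is a minimal sketch of the sort of link-fetching I mean, using Beautiful Soup. The HTML, the URL shapes and the category names are all placeholder assumptions, not anyone's real site; in practice the page would come from an HTTP request rather than a hard-coded string.

```python
from bs4 import BeautifulSoup

# A stand-in for a fetched page; in a real run this would come from
# something like requests.get(url).text instead.
html = """
<html><body>
<a href="/category/widgets">Widgets</a>
<a href="/product/widget-42">Widget 42</a>
<a href="https://example.com/about">About</a>
</body></html>
"""

def categorise_links(html):
    """Collect every link on a page and bucket it loosely by URL shape."""
    soup = BeautifulSoup(html, "html.parser")
    links = {"category": [], "product": [], "other": []}
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if "/category/" in href:
            links["category"].append(href)
        elif "/product/" in href:
            links["product"].append(href)
        else:
            links["other"].append(href)
    return links

print(categorise_links(html))
```

The "loose" part is doing a lot of work here: bucketing on URL substrings only holds up while the site keeps its URL scheme, which is exactly the fragility discussed below.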
A rough idea of an algorithm is forming in my head. The problem I have always found with scraping tools, whether hand-programmed or pre-built applications, is that we have to set them up and configure them for each shoddily made competitor site; then they go and change their structure, and the whole process needs to be done again.
What we need is a quick, easy and flexible way to tell a tool what to fetch from a page, even if we have to enter the domain and then say, with each run: "pages with this element/id/class/picture of a frog are category pages; grab all the links when you get to one of these pages".
So what we need is akin to a configuration for each run. OK, we can hold that in some kind of variable, ask for it to be input when the tool is run, or even hold it in a persistent location like a file.
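A per-run configuration held in a file might look something like this. The field names and selectors here are pure assumptions for illustration, not a fixed schema; the point is only that the "fingerprints" describing a site live in data rather than in code, so a site redesign means editing one file rather than the tool.

```python
import json

# A hypothetical per-run configuration: the domain to crawl and the
# "fingerprints" that identify each kind of page. Every field name and
# selector here is a placeholder.
config = {
    "domain": "https://competitor.example.com",
    "category_page": {"selector": "ul.category-nav"},  # "pages with this element are category pages"
    "price_element": {"selector": "span.price"},
}

# Persist it so the next run can reuse (or tweak) the same setup...
with open("scrape_config.json", "w") as f:
    json.dump(config, f, indent=2)

# ...and load it back at the start of a run.
with open("scrape_config.json") as f:
    loaded = json.load(f)

print(loaded["domain"])
```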
Once we have this configuration, we need to tell the tool what to do with it. In my mind this will be the overall program algorithm, which I've tried to write out in structured English.
Another brief thought I had while experimenting with my original Python program: what do we do with the data once we have it?
Well, usually we output to .CSV (eurgh, more spreadsheets), where I would then use formulas to compare and present unsightly, humongous tables to my colleagues, with red where we are selling cheaper and green where we are making more... etc. etc.
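The .CSV step itself is the easy bit; Python's standard `csv` module handles it directly. The SKUs and prices below are made up for illustration.

```python
import csv

# Hypothetical scraped prices alongside our own, written out the usual way.
rows = [
    {"sku": "WIDGET-42", "our_price": 9.99, "their_price": 8.49},
    {"sku": "WIDGET-43", "our_price": 4.50, "their_price": 4.99},
]

with open("prices.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sku", "our_price", "their_price"])
    writer.writeheader()
    writer.writerows(rows)
```

The red/green colouring would still happen in the spreadsheet, which is rather the problem.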
What if we had some kind of database/files/serialised JSON where we could compare the data not only with our own prices but with the last crawl of the site, to see how things have changed?
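As a sketch of that idea: if each crawl is kept as serialised JSON keyed by product, diffing the current crawl against the last one is a few set operations. The two crawls below are invented data, and `diff_crawls` is a hypothetical helper, not part of any library.

```python
import json

# Two hypothetical crawls of the same site, keyed by product identifier.
last_crawl = {"WIDGET-42": 8.99, "WIDGET-43": 4.99}
this_crawl = {"WIDGET-42": 8.49, "WIDGET-44": 12.00}

def diff_crawls(old, new):
    """What changed since last time: price moves, new items, removed items."""
    changed = {sku: (old[sku], new[sku])
               for sku in old.keys() & new.keys() if old[sku] != new[sku]}
    added = {sku: new[sku] for sku in new.keys() - old.keys()}
    removed = {sku: old[sku] for sku in old.keys() - new.keys()}
    return {"changed": changed, "added": added, "removed": removed}

# Keeping each crawl on disk as JSON makes the "last crawl" trivially available.
print(diff_crawls(last_crawl, this_crawl))
```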
More thought is needed on this still. I've always liked the idea of a nice graph, but it doesn't give the precise breakdown we would eventually need.
Well, that brief foray into Python and Beautiful Soup did inspire me enough to write this article. It's something I would very much like to explore further.