Nowadays we can enjoy lots of information on the Internet. This is fantastic because we can use these information sources to enrich our skills but, sometimes, as developers, we want to use some pieces of this information in our awesome software projects. This is the moment when we face the problem of unstructured information from the Internet and need to find a way to structure these pieces of information.
Web scraping is a technique to programmatically extract data from websites. Basically, pieces of unstructured information become structures that can be used in software projects. In a few words, web scraping means parsing HTML documents and extracting from them some pieces of information to be used later in another context.
To build a simple web scraper, it is important to have some basic knowledge of HTML & XPath. HTML is so popular nowadays that I don't consider it necessary to give more details about it. The only 2 things I want to point out are:
- after all, an HTML document is structured much like an XML document;
- the DOM has a tree structure.
It is important to keep these two things in mind when we speak about XPath. XPath stands for XML Path Language and, in simple words, is used to navigate through DOM nodes.
Building a simple scraper in Ruby
I built a simple HTML page for this tutorial, to use as a guinea pig. The goal of this simple tutorial is to build a Ruby script that fetches a list of products from this page and exports it in JSON format.
First of all, I want to say a few words about Nokogiri. It is an awesome Ruby gem used to parse HTML and XML structures. It relies on native C libraries (libxml2) for parsing and is installed as a dependency of the Ruby on Rails framework.
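To install Nokogiri on your system, you need to have Ruby installed and run the first command below (use the sudo variant only if your Ruby installation requires root privileges to install gems):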
$ gem install nokogiri
$ sudo gem install nokogiri
To download the source code of the page we will use “open-uri”.
Once we have saved the HTML page in the “document” variable, we can start to process it. I recommend using Chrome Developer Tools to explore the DOM structure and the XPath Helper Chrome extension to test your XPaths. Chrome Developer Tools also offers the possibility to copy the XPath of an HTML element.
Exploring the DOM structure of the page, we can see that each product is held in an li element inside a ul with the class “products”, so to fetch the whole list of products we will use an XPath query along the lines of //ul[@class="products"]/li.
The explanation for this query is: search the entire DOM for a ul element that has the class “products” and, for this parent element, fetch all li children.
Next, we will collect in an array all the Nokogiri nodes corresponding to the product boxes, loop through this array and, on each iteration, build a hash object with all the product attributes.
The Ruby script and its output can be found in the following gist:
This is an extremely simple example of how to use XPath and Nokogiri to fetch data in JSON format from an HTML web page. All the tools described in this article are much more powerful and can be used to perform more complex tasks.