One problem I keep running into when building a web scraper is the execution time needed to process the DOM and the memory used for that task. If the HTML source is large and the XPath expressions are complex, processing time increases. In my opinion, a language such as Go is preferable in these situations.
In a previous article, I wrote about building a simple web scraper using Ruby and Nokogiri. That approach can become a bottleneck when you need speed and a low memory footprint.
Go is fast and uses memory efficiently. I found a nice collection of packages for evaluating XPath expressions. It contains three packages that evaluate XPath over HTML, XML, and JSON documents. I will use the HTML package, htmlquery, for this tutorial.
To install the htmlquery package, I simply run:
$ go get github.com/antchfx/htmlquery
Fetching DOM structure & type definitions
In the next step, I will create a new Go file called simple-scraper.go. It starts with the package definition; then I import the htmlquery package and fetch the HTML source of the page.
After that, I will define two types. The first is the Product type and the second is a list of products. These types will be used when I process the DOM. As in the Ruby tutorial, the goal is to fetch all products from the HTML page and output them as a JSON list.
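The two types could be sketched roughly like this. The field names (Name, Price, URL) and their JSON tags are assumptions for illustration, since the exact attributes depend on what the page exposes for each product.

```go
package main

import "fmt"

// Product holds the data scraped for a single item.
// The fields and JSON tags are hypothetical.
type Product struct {
	Name  string `json:"name"`
	Price string `json:"price"`
	URL   string `json:"url"`
}

// Products is the list that will be serialized at the end.
type Products []Product

func main() {
	var products Products
	products = append(products, Product{Name: "Keyboard", Price: "49.90", URL: "/p/1"})
	fmt.Println(len(products)) // prints 1
}
```

Keeping the price as a string avoids parsing decisions (currency symbols, decimal separators) at this stage; it can be converted later if numeric values are needed.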
DOM processing & building product list
Now it is time to process the DOM and build the product list. I will use the Find function to evaluate an XPath expression over the DOM. It returns a list of nodes, one for each product.
productNodes := htmlquery.Find(doc, "//ul[@class='products']/li")
I will iterate over these nodes and, at each step, build the current product and append it to the final product list.
I will use the json.Marshal function to serialize the resulting list. It returns a byte slice containing the serialized JSON, plus an error. If marshaling succeeds, the error is nil.
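The serialization step can be sketched as follows. The Product fields, the JSON tags, and the toJSON helper are hypothetical and only serve to show how the two return values of json.Marshal are handled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Product and its JSON tags are hypothetical.
type Product struct {
	Name  string `json:"name"`
	Price string `json:"price"`
	URL   string `json:"url"`
}

// toJSON serializes the product list and returns it as a string.
func toJSON(products []Product) (string, error) {
	b, err := json.Marshal(products)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	products := []Product{{Name: "Keyboard", Price: "49.90", URL: "/p/1"}}

	out, err := toJSON(products)
	if err != nil {
		log.Fatalf("serializing products: %v", err)
	}
	fmt.Println(out) // prints [{"name":"Keyboard","price":"49.90","url":"/p/1"}]
}
```

One detail worth knowing: a nil slice is marshaled as null, while an empty but non-nil slice is marshaled as []. If consumers of the output expect a JSON array even when no products are found, initialize the list with an empty slice literal.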
You can find the complete code in the gist below.