How to build a web scraper in Go

One of the problems I identified when I built a web scraper was the execution time needed to process the DOM and the memory used for this task. If the HTML source is considerably large and the XPaths are complex, the processing time will increase. In my opinion, a language such as Go is preferable in these situations.

In a previous article, I wrote about building a simple web scraper using Ruby and Nokogiri. That approach can be problematic when you need speed and a low memory footprint.

Go is extremely fast and efficient from a memory usage perspective. I found a nice package collection for evaluating XPath expressions. It contains three different packages that provide ways to evaluate XPath expressions over HTML, XML, and JSON documents. I will use the HTML one for this tutorial.

In this tutorial, I will show you how to build a web scraper using Go and the htmlquery package. For this experiment, I will use the same simple HTML page I built for the Ruby article.

Installation

To install the htmlquery package, I simply need to run:

$ go get github.com/antchfx/htmlquery

Fetching DOM structure & type definitions

In the next step, I will create a new Go file called simple-scraper.go. It will start with the package definition; then I will import the htmlquery package and fetch the HTML source of the page.

Note: I like to define an error handler function in my Go scripts to avoid repeating error-handling code. In this case, the definition can be found in the next gist.
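Put together, the start of the file could look something like the sketch below. This is an assumption on my part, not the exact code from the gist: the checkError helper name and the URL are placeholders of mine.

package main

import (
	"github.com/antchfx/htmlquery"
)

// checkError aborts the program on any non-nil error, so the
// repetitive `if err != nil` blocks collapse into one call.
// The name is a placeholder for the handler mentioned above.
func checkError(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	// Fetch and parse the HTML source of the page. The URL is a
	// placeholder; point it at wherever the sample page is served.
	doc, err := htmlquery.LoadURL("http://localhost:8080/products.html")
	checkError(err)

	_ = doc // doc will be used in the next steps
}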

After that, I will define two types. The first will be the Product type and the second will be a list of products. These types will be used when processing the DOM. As in the Ruby tutorial, the goal is to fetch all products from the HTML page and output them as a JSON list.
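A minimal sketch of the two types could look like this; the Name and Price fields are an assumption based on a typical product listing, so adjust them to whatever attributes the sample page actually has:

// Product holds the data extracted for a single product.
// The fields here are assumptions; match them to the sample page.
type Product struct {
	Name  string `json:"name"`
	Price string `json:"price"`
}

// Products is the list type that gets serialized to JSON at the end.
type Products []Product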

DOM processing & building product list

Now it is time to start processing the DOM and build the product list. I will use the Find function to evaluate an XPath expression over the DOM. This function will return a list of nodes containing the products.

productNodes := htmlquery.Find(doc, "//ul[@class='products']/li")

I will iterate over these nodes and, at each step, build the current product and append it to the final product list.
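A sketch of that loop might look like the following. The inner XPath expressions for the name and price are my own assumptions about the markup, not the exact ones from the gist:

var products Products

for _, node := range productNodes {
	// The inner XPaths are assumptions about how a product entry
	// is structured; adapt them to the real page. They assume the
	// matched elements exist, otherwise InnerText would panic.
	nameNode := htmlquery.FindOne(node, ".//span[@class='name']")
	priceNode := htmlquery.FindOne(node, ".//span[@class='price']")

	product := Product{
		Name:  htmlquery.InnerText(nameNode),
		Price: htmlquery.InnerText(priceNode),
	}
	products = append(products, product)
}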

JSON serialization

I will use the json.Marshal function to serialize the result array. This function returns a byte slice containing the serialization and an error. If marshaling succeeds, the error will be nil.
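A short sketch of this final step, reusing the checkError placeholder from earlier (it additionally needs the encoding/json and fmt imports):

// Serialize the product list and print the JSON output.
data, err := json.Marshal(products)
checkError(err)
fmt.Println(string(data))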

You can find the whole code in the gists below.
