Jonas is a developer based in São Paulo, Brazil. He started Ruby on Rails development early in 2008 and picked up other Ruby libraries later. Jonas is currently employed by WebGoal, where Ruby helps him develop high-quality software quickly and with a high return on investment.
Collecting data from websites manually can be very time consuming and error-prone.
One of our customers at WebGoal had 10 employees working 10 hours a day collecting data from websites. The company's leaders were complaining about the cost of this effort, so my team proposed automating the task.
After a day testing tools in several languages (PHP, Java, C++, C# and Ruby), we found that Hpricot was the most powerful, yet simplest to use, tool of its kind.
The company used PHP in all of its internal systems. After reading our document about the advantages of Hpricot and Ruby, they agreed to use them.
It helped them collect more data in less time than before, with fewer people on the job.
What is Hpricot?
According to Hpricot's wiki on GitHub, "Hpricot is a very flexible HTML parser, based on Tanaka Akira’s HTree and John Resig’s jQuery, but with the scanner recoded in C." You can use it to read, navigate and even modify any XML document.
Why should I choose Hpricot?
- It's simple to use.
  You can use CSS or XPath selectors. Any CSS selector that works in jQuery should work in Hpricot too, because Hpricot is based on it.
- It's fast.
  Hpricot's scanner was written in the C programming language.
- It's less verbose.
See for yourself:
Scenario: Extracting the team members’ names from the Rails Magazine website
Ruby + Hpricot
doc = Hpricot(open("http://railsmagazine.com/team"))
team = []
doc.search(".article-content td:nth(1) a").each do |a|
  team << a.inner_text
end
puts team.join("\n")
PHP + DOM Document
<?php
$doc = new DOMDocument();
$doc->loadHTMLFile("http://railsmagazine.com/team");
$team = array();
$trs = $doc->getElementsByTagName('div')
           ->item(0)->getElementsByTagName('tr');
foreach ($trs as $tr) {
  $a = $tr->getElementsByTagName('a')->item(0);
  $team[] = $a->nodeValue;
}
print(implode("\n", $team));
?>
A similar comparison was included in the document we composed to convince our customer to use Ruby and Hpricot.
Look at the search methods. Hpricot shines with CSS selectors, while PHP's DOM Document can only search by a single tag name or id at a time. With Hpricot's CSS selectors you can find the desired elements in a single search.
- It's smart.
  Hpricot tries to fix XHTML errors. In the PHP example, the DOM Document library emits 7 warnings about errors in the document; Hpricot doesn't.
- It's Ruby! :)
Let’s code!
The above example is very simple: it loads the /team page of the Rails Magazine website and searches for the members' names.
In real-life data extraction you will probably have to deal with pagination and authentication, search a page for things like ids, urls or names, use that data to load another page, and so on.
To show Hpricot's basic functionality, we are going to extract Ruby Inside's blog posts and their comments. The data we will retrieve includes each post's title, author name and text, plus its comments, each with its sender and text.
Let's start by creating classes to hold the blog post and comment data:
blog_post.rb
class BlogPost
  attr_accessor :title, :author, :text, :comments
end
comment.rb
class Comment
  attr_accessor :sender, :text
end
These are simple classes with some accessible (read and write) attributes.
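attr_accessor generates a reader and a writer method for each listed attribute, so the classes can be used like this (the title value is just an example):

```ruby
class BlogPost
  attr_accessor :title, :author, :text, :comments
end

post = BlogPost.new
post.title = 'Why Hpricot?'   # writer generated by attr_accessor
puts post.title               # reader generated by attr_accessor
```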
We will also create a class named RubyInsideExtractor, which will be responsible for retrieving the data from the blog:
ruby_inside_extractor.rb
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'blog_post'
require 'comment'
class RubyInsideExtractor
  attr_reader :blog_posts

  @@web_address = "http://www.rubyinside.com/"

  def initialize
    @blog_posts = []
  end

  def import!
    puts "not implemented"
  end
end
The @blog_posts array will hold all the blog posts. @@web_address has the blog address, so we don't need to repeat it.
The import! method is where we will do the extraction.
After that, we will need a script to call the extraction and show the results, let's call it main.rb:
main.rb
#!/usr/bin/env ruby
require 'ruby_inside_extractor'
ri_extractor = RubyInsideExtractor.new
ri_extractor.import!
ri_extractor.blog_posts.each { |post|
  puts post.title
  puts '=' * post.title.size
  puts 'by ' + post.author
  puts
  puts post.text
  puts
  post.comments.each do |comment|
    puts '~' * 10
    puts comment.sender + ' says:'
    puts comment.text
  end
  puts
}
After instantiating the extractor class and calling the import! method, this script prints each of the blog posts, including author and comments.
The very first thing we have to do is find out how many pages there are in the blog:
private

def page_count
  doc = Hpricot(open(@@web_address))
  # the number of the last page is in the penultimate link,
  # inside the div with the class "pagebar"
  # return doc.search("div.pagebar a")[-2].inner_text.to_i

  # I suggest forcing a low number, because it would take
  # long to extract all of the ~1060 posts
  return 3
end
The page_count method loads the blog's homepage and finds the last page number, located in the penultimate link inside the div containing pagination stuff, div.pagebar.
For this example, the most important line is commented out because it would take quite a while to extract all of the (currently) 107 pages.
The Hpricot method loads a document, and the search method returns an Array-like collection containing all occurrences of the given selector.
Now, we’re going to load the posts page once for each page. Change your import! method:
def import!
  1.upto(page_count) do |page_number|
    page_doc = Hpricot(open(@@web_address + 'page/' + page_number.to_s))
  end
end
This will load an Hpricot document for each of the blog pages. For instance, the address for the 5th page is http://www.rubyinside.com/page/5.
Let’s search for the url that leads to the page with the complete text and comments for each post:
def import!
  1.upto(page_count) do |page_number|
    page_doc = Hpricot(open(@@web_address + 'page/' + page_number.to_s))
    page_doc.search('.post.teaser').each do |entry_div|
      # we can access an element's attributes
      # as if it were a Hash
      post_url = entry_div.at('h2 > a')['href']
      @blog_posts << extract_blog_post(post_url)
    end
  end
end
If you look at the Ruby Inside HTML code, you'll find that each blog post teaser is inside a div with the post and teaser classes. The import! method iterates over each of these divs and retrieves the url of the full post with comments. This url is found in the link inside the post title.
After that, it calls the extract_blog_post method, which we will create next, and appends its return value to the @blog_posts array.
The at method searches for and returns the first occurrence of the selector.
Now, with this url in hand, we can load the page that holds the post title, full text and comments:
def extract_blog_post(post_url)
  blog_post = BlogPost.new
  post_doc = Hpricot(open(post_url))
  blog_post
end
Now, let's collect the post title, author and text:
def extract_blog_post(post_url)
  blog_post = BlogPost.new
  post_doc = Hpricot(open(post_url))

  blog_post.title = post_doc.at('.entryheader h1').inner_text
  blog_post.author = post_doc.at('p.byline a').inner_text

  text_div = post_doc.at('.entrytext')
  # removing unwanted elements
  text_div.search('noscript').remove
  blog_post.text = text_div.inner_text.strip

  blog_post.comments = extract_comments(post_doc.at('ol.commentlist'))
  blog_post
end
After retrieving the post title, author and text, we also call the extract_comments method. This method, which we will create next, returns an array of comments.
The remove method removes the matched elements from the document. We're using it because there is a <noscript> tag with text inside the div with the entrytext class.
Finally, we'll retrieve the post’s comments:
def extract_comments(comments_doc)
  comments = []
  comments_doc.search('li').each { |comment_doc|
    comment = Comment.new
    comment.sender = comment_doc.at('cite').inner_text
    comment.text = comment_doc.at('p').inner_text
    comments << comment
  } rescue nil
  comments
end
The rescue nil modifier simply skips comment extraction for posts with no comments, since at('ol.commentlist') returns nil in that case. After extracting every post and its comments, the Ruby Inside extractor is ready. Run your main.rb to see the result. :)