Website Scraping Using Ruby and Nokogiri

In this post, we're going to walk through how to scrape website data in Ruby. We'll use the Nokogiri gem to grab HTML data from Hostelworld, store that data in an array, then print hostel prices to a CSV file. This is an easily extensible block of code that you can use to pull data from around the web. 

Start by creating the two files you'll need for this project: a Ruby file to run the script and a CSV file to write the data to.

# in the command line
touch hostelworld.rb
touch sanfrancisco.csv

In the Ruby file, we'll start by requiring the necessary gems and libraries:

# hostelworld.rb
#! /usr/bin/env ruby

require 'httparty'
require 'nokogiri'
require 'json'
require 'csv'

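After this, we'll have to go back to the command line and install the gems. If you've listed them in a Gemfile, run bundle install; otherwise, install them directly (JSON and CSV ship with Ruby):

gem install httparty nokogiri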

In the next part, we'll call the page using an HTTParty GET request in hostelworld.rb, storing the data in a local variable, page. Then we'll store that data as a Nokogiri object. 

page = HTTParty.get('http://www.hostelworld.com/hostels/San-Francisco')
nokogiri_page = Nokogiri::HTML(page.body)

Note here that I didn't use the complicated URL which Hostelworld provides after doing a standard search on their site...

http://www.hostelworld.com/search?search_keywords=San+Francisco%2C+USA&country=USA&city=San-Francisco&date_from=2016-06-26&date_to=2016-06-29&number_of_guests=2

Instead, I used the simplest one possible, which doesn't filter based on current availability and makes it easy to pass parameters through later. For example, if you want to check page 2 of these results, simply use:

http://www.hostelworld.com/hostels/San-Francisco?page=2
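
If you plan to request several pages or cities, it can help to build these URLs in one place. The following is only a sketch: the helper name listing_url and the constant BASE_URL are made up for illustration, and it assumes the site keeps the /hostels/City?page=N pattern shown above.

```ruby
require 'uri'

# Assumption: listing pages follow the /hostels/<City>?page=<N> pattern shown above.
BASE_URL = 'http://www.hostelworld.com/hostels/'

# Hypothetical helper (not part of the original script): returns the listing
# URL for a city slug and an optional page number.
def listing_url(city_slug, page = 1)
  uri = URI.join(BASE_URL, city_slug)
  # Only add the query string for pages beyond the first.
  uri.query = URI.encode_www_form(page: page) if page > 1
  uri.to_s
end

puts listing_url('San-Francisco')
puts listing_url('San-Francisco', 2)
```

The returned string can be passed to HTTParty.get in place of the hard-coded URL.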

If we add a line "puts nokogiri_page" and run the file, we get a look at the Nokogiri object. What the heck is this? It's 4,200 lines of code! The object you get back to work with in your Ruby file is actually the same markup you can see by right-clicking on a web page and selecting "View Page Source."

Luckily, the Nokogiri library makes it easy to parse this data down to exactly what you're looking for. One of its handiest features is searching by CSS selector. For more information on Nokogiri's superpowers, I recommend this article from their website: Searching a XML/HTML document.

For our example, we're interested in the prices of hostels on Hostelworld. To pinpoint the elements we're looking for, right-click on the web page again, but this time select "Inspect". In the inspect panel, we can navigate around the page to see the CSS selectors for every portion of the page. The price always seems to sit in a <span class="price"> tag.

Now we can tell Nokogiri exactly what to narrow in on. You could look for names, ratings, or prices, among other options. Let's print out just one price to take a look at it. Add this line to hostelworld.rb, then run the file (ruby hostelworld.rb) in the command line.

puts nokogiri_page.css('span.price').first

Now, we get a more manageable portion of the page source. Mine looks like this: 

<span class="price"><a href="http://www.hostelworld.com/hosteldetails.php/Encore-Express-Hotel-Hostel/San-Francisco/33320"><span class="hide-for-small-only">From:</span> €32.69</a></span>

Now we can really home in on the price. To view the number as a simple integer, delete the currency symbol (along with any commas and spaces) and call the to_i method, which also drops the decimal part. Here, I create an empty array and then loop over every match on the first page, adding each value to prices_array.

prices_array = []

nokogiri_page.css('span.price').each do |price|
  # Strip the euro sign, commas, and spaces before converting to an integer
  amount = price.text.delete('€, ')
  prices_array.push(amount.to_i)
end

print prices_array
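
One thing to keep in mind: to_i returns 0 for any string that doesn't begin with a number, and it throws away the cents. If you'd rather keep the decimals, a regular expression can pull the number out wherever it sits in the text. This is a minimal sketch rather than part of the original script; the helper name parse_price is hypothetical, and the sample strings are modeled on the markup shown above.

```ruby
# Hypothetical helper: extracts the first number from a price label and
# returns it as a Float, or nil when the text contains no number.
def parse_price(text)
  # Drop thousands separators first, then grab digits with optional decimals.
  match = text.delete(',')[/\d+(?:\.\d+)?/]
  match && match.to_f
end

p parse_price('From: €32.69')     # => 32.69
p parse_price('From: €1,250.00')  # => 1250.0
p parse_price('Sold out')         # => nil
```

Returning nil for unparseable text makes it easy to drop those entries later with compact, instead of mixing real prices with placeholder zeroes.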

This gives us a list of all the prices on the page. But a closer look reveals that we've captured a few zeroes and that every hostel's price appears twice.

[0, 0, 0, 34, 34, 36, 36, 30, 30, 31, 31, 42, 42, 21, 21, 32, 32, 27, 27, 29, 29, 30, 30, 37, 37, 27, 27, 25, 25, 30, 30, 32, 32, 50, 50, 37, 37, 74, 74, 33, 33, 46, 46]

To clean up the result, add this:

prices_array.delete(0)
prices_array = prices_array.select.with_index { |_, i| i.odd? }
print prices_array
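
The same cleanup can also be written without mutating the array or relying on index parity. The sketch below uses a shortened sample of the output above; the variable names are illustrative only. Note that uniq would not be a safe shortcut here, because two different hostels can share the same price.

```ruby
# Shortened sample of the raw output shown above.
raw_prices = [0, 0, 0, 34, 34, 36, 36, 30, 30, 31, 31, 30, 30]

# reject returns a new array and leaves raw_prices untouched.
non_zero = raw_prices.reject(&:zero?)

# Each price appears twice in a row, so keep the first of every pair.
prices = non_zero.each_slice(2).map(&:first)

p prices  # => [34, 36, 30, 31, 30]
```

This assumes the duplicates always arrive as adjacent pairs, which is what the output above suggests.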

Finally, we have captured the prices in an array successfully! 

For the last step, we want to write this array in CSV format to a different file, sanfrancisco.csv. Using Ruby's CSV class, this is incredibly easy.

CSV.open('sanfrancisco.csv', 'w') do |csv|
  csv << prices_array
end
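
The block above writes all prices as a single row. If you'd prefer one price per row with a header, which tends to be easier to open in a spreadsheet or read back later, something like the following sketch would work. The file name sanfrancisco_rows.csv and the column name price_eur are made up for this example, and the prices are sample values from the output above.

```ruby
require 'csv'

prices = [34, 36, 30, 31]        # sample values from the output above
path = 'sanfrancisco_rows.csv'   # hypothetical file name

CSV.open(path, 'w') do |csv|
  csv << ['price_eur']                     # header row (hypothetical column name)
  prices.each { |price| csv << [price] }   # one price per row
end

# Read the file back; values come back as strings, so convert them.
rows = CSV.read(path, headers: true)
p rows['price_eur'].map(&:to_i)  # => [34, 36, 30, 31]
```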

The result isn't very sexy; we've just got a bunch of numbers in a file. But with this data, we can estimate the average price of the hostels listed on the first page of results for San Francisco.
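
As a quick illustration, here is how the average could be computed from the cleaned array. The numbers are the twenty prices from the output shown earlier; the median is included as an extra because a single expensive listing can pull the mean upward.

```ruby
# The twenty cleaned prices from the output shown earlier.
prices = [34, 36, 30, 31, 42, 21, 32, 27, 29, 30,
          37, 27, 25, 30, 32, 50, 37, 74, 33, 46]

# Mean: convert to Float so the division isn't truncated.
average = prices.sum.to_f / prices.size

# Median: middle value of the sorted list (mean of the two middle values
# when the list has an even length).
sorted = prices.sort
mid = sorted.size / 2
median = sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0

puts "Average: #{average.round(2)}"
puts "Median:  #{median}"
```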

More importantly, we've written a script that can access data from all corners of the internet. Maybe you want to compare prices in different cities around the globe, or compare the average prices returned by different search engines. It's also possible to save this data to a database or to make the search run automatically using the Mechanize gem.

Here's the completed Ruby file. In just 20 lines of code, we've opened up a whole world of possibilities. You can also check out the GitHub repository.