Screen scraping with PhantomJS

One of the more useful tools I’ve been playing around with for ‘scraping’ data from the internet is PhantomJS, a headless WebKit browser that is scriptable through a JavaScript API. Basically, it’s a web browser without a graphical user interface.


“Why would anyone want to use a web browser without a user interface?” you might ask.

Well, apart from being able to manipulate a web page’s DOM to scrape data and other useful nuggets of information (images, links, etc.), PhantomJS is also a great tool for functional website testing, screen capture, page automation, and network monitoring.

Anyhow, using PhantomJS is pretty straightforward; the only tricky bit for me was installing all the necessary files on my hosting server.

 JavaScript example
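console.log('Loading a web page');
var page = require('webpage').create();
var url = 'http://phantomjs.org/';
page.open(url, function (status) {
  // status is the string 'success' or 'fail'
  if (status !== 'success') {
    console.log('Failed to load ' + url);
    phantom.exit(1);
    return;
  }
  // Page is loaded!
  phantom.exit();
});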
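To go from loading a page to actually scraping it, a minimal sketch might look like the following. It assumes the links on the page are the data of interest; collectLinks is a hypothetical helper name, not part of PhantomJS, and the URL is the same one used in the example above. page.evaluate runs the function inside the loaded page, so it can only hand back plain, serializable values.

```javascript
// Hypothetical helper: runs inside the page and returns every link's URL.
// It relies on the global `document`, which exists in the page context.
function collectLinks() {
  var anchors = document.querySelectorAll('a');
  return Array.prototype.map.call(anchors, function (a) {
    return a.href;
  });
}

// Only drive the browser when running under PhantomJS itself.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var url = 'http://phantomjs.org/';
  page.open(url, function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + url);
      phantom.exit(1);
      return;
    }
    // evaluate() executes collectLinks in the page and returns its result.
    var links = page.evaluate(collectLinks);
    console.log(links.join('\n'));
    phantom.exit();
  });
}
```

Swapping the selector in collectLinks (for example to 'img', reading src instead of href) would be one way to pull out images rather than links.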
Installing PhantomJS on my hosting server

This was slightly tricky: I eventually figured out that I needed to install the 64-bit version to overcome a lot of the dependency issues I was getting. The basic steps were:

  1. Enable SSH on the GoDaddy hosting account.
  2. Open Terminal (on a Mac) and run:
    • ssh username@hostname
    • cd ~
    • wget (the 64-bit version of PhantomJS for Linux)
    • tar xvf (the downloaded archive)
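Once the archive is unpacked, one quick way to confirm the binary actually runs is a tiny script that prints the PhantomJS version. This is only a sketch: formatVersion and the file name version.js are made up for illustration, while phantom.version is the object PhantomJS exposes with major, minor and patch fields.

```javascript
// Hypothetical helper: turn a {major, minor, patch} object into "x.y.z".
function formatVersion(v) {
  return [v.major, v.minor, v.patch].join('.');
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  console.log('PhantomJS ' + formatVersion(phantom.version));
  phantom.exit();
}
```

Saved as, say, version.js, it would be run by passing the file to the phantomjs binary in the unpacked bin directory. If the dependency issues are still there, the binary should fail before the script prints anything.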
Test harness to execute

I created a cron job to run my script, but an easy way to test the JavaScript file is through a small PHP harness:

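<?php

// Run the scrape script with PhantomJS and capture whatever it prints to stdout.
$output = shell_exec("/[path to phantomjs]/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /[path to javascript file]/testMyScrape.js");

// Escape the output so any HTML in it is displayed rather than rendered.
echo "<pre>" . htmlspecialchars((string) $output) . "</pre>";

?>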
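Since shell_exec only hands back what the script writes to standard output, the script has to print its results with console.log. Below is a minimal sketch of what a file like testMyScrape.js could contain. The real script isn't shown in this post, so the buildReport helper, the JSON shape and the target URL (borrowed from the earlier example) are all assumptions.

```javascript
// Hypothetical helper: package the result as a JSON string for printing.
function buildReport(url, status, title) {
  return JSON.stringify({ url: url, status: status, title: title });
}

// Only drive the browser when running under PhantomJS itself.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var url = 'http://phantomjs.org/'; // assumed target
  page.open(url, function (status) {
    var title = null;
    if (status === 'success') {
      // Runs inside the page and returns the document title.
      title = page.evaluate(function () {
        return document.title;
      });
    }
    // Whatever is logged here is what the PHP harness receives.
    console.log(buildReport(url, status, title));
    phantom.exit(status === 'success' ? 0 : 1);
  });
}
```

Printing a single JSON line keeps the output easy to read in the harness and easy to parse later if the cron job needs to store it.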