The other day I noticed that a scraper I set up a month ago had stopped working. It turned out that the site I needed to scrape once a day had changed its pages to load all of their content via jQuery after the initial page load.
I had been using the standard urllib2 library, but since that only grabs the raw HTML before any JavaScript runs, I needed another solution.
Working on an Amazon EC2 instance with no GUI, I first tried ghost.py, but it requires PyQt or PySide to work properly, and if you've ever tried to get those installed on EC2, you know my frustration.
PhantomJS, on the other hand, is truly headless and requires no UI/X11-related libraries to work.
```shell
# Install the Selenium Python bindings
sudo pip install selenium

# Dependencies for PhantomJS on Amazon Linux
sudo yum install gcc gcc-c++ make git openssl-devel freetype-devel fontconfig-devel

# After downloading the PhantomJS 1.9.1 Linux x86_64 tarball from the
# PhantomJS download page, extract it and put the binary on your PATH
tar -xjvf phantomjs-1.9.1-linux-x86_64.tar.bz2
sudo cp phantomjs-1.9.1-linux-x86_64/bin/phantomjs /usr/bin/
```
Here is some sample Python code to test the setup (`http://example.com` is just a placeholder; point it at the JavaScript-heavy page you actually need):

```python
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://example.com')
print(driver.page_source)
```
For more info on finding elements by CSS selectors, clicking buttons on the page, and so on, check out the Selenium documentation.
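To give a rough idea of what that looks like, here is a minimal sketch of locating an element by CSS selector and clicking it. The `data:` URL and the `#load-more` selector are stand-ins I made up for illustration, not part of any real site:

```python
from selenium import webdriver

driver = webdriver.PhantomJS()

# A data: URL serves as a self-contained stand-in page;
# in practice you'd driver.get() the site you're scraping.
driver.get('data:text/html,<button id="load-more">More</button>')

# Find the (hypothetical) button by CSS selector and click it
button = driver.find_element_by_css_selector('#load-more')
button.click()

# page_source now reflects the DOM after the click
print(driver.page_source)
driver.quit()
```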
From this point you can use Selenium's own methods, or hand the page_source off to your favorite parsing library (BeautifulSoup + lxml), and churn through it however your needs dictate.
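As a quick sketch of that hand-off, here is page_source fed into BeautifulSoup. The HTML string below is a made-up stand-in for whatever `driver.page_source` returned, and it assumes `bs4` is installed:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the JavaScript has run
page_source = '<html><body><div id="content">Loaded by jQuery</div></body></html>'

soup = BeautifulSoup(page_source, 'html.parser')

# Pull out the element you care about
print(soup.find('div', id='content').get_text())  # -> Loaded by jQuery
```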
Hope this was helpful ☺
UPDATE: This post was on the frontpage of Hacker News all day (thanks everyone!), you can follow the conversation there.