PHP Abstract Episode 22: Screen Scraping

Check out the latest PHP Abstract podcast (episode 22) from Dev Zone. I’m the guest speaker! The podcast is on web scraping, a practice in which I have (unfortunately) become somewhat proficient. Leave a comment on Dev Zone or on this entry and let me know what you think!

2 Comments

  1. Damian Moore says:

    Thanks for the podcast. I just listened to it and found it very relevant to what I'm doing. I'm a third-year computer science student at Essex University, England. My dissertation involves web scraping with PHP, so I have started to become familiar with some of the libraries you mentioned. I will be extracting data from airline websites to compare prices between them. However, I have stumbled across websites that won't play ball. When http://www.britishairways.com/ is queried with a browser, the results are generated using JavaScript, so the retrieved HTML is useless. Just wondering if you had any tips on how to tackle this. Perhaps I could somehow use PHP to control Firefox and retrieve the DOM after the JavaScript has been interpreted, but this sounds a bit over the top. Any suggestions very much appreciated. Thanks again for the podcast.

  2. admin says:

    In short, if JavaScript is being used to display the information, then either the information is embedded in the JavaScript itself somehow (possibly obscured with a reversible encoding or two-way encryption) or the JavaScript is using an XMLHttpRequest to get the data after the initial page load. In the former case, you'd have to find a way to parse the JavaScript, either with a third-party JavaScript parsing or tokenizing library or with your own homegrown solution using something like regular expressions (though I suggest that only as a last resort). In the latter case, you can find the URI that the XMLHttpRequest targets and the parameters it sends, have your scraping application send the same request to that URI, and then parse the response as you normally would. (This may require spoofing the Referer HTTP header, which specifies the URL of the referring page, as some applications will check for it.) Good luck, and feel free to let me know if I can offer any other assistance.
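As a rough sketch of both approaches in PHP: the function names, the variable name `fares`, the endpoint and referring URLs, the form parameters, and the assumption that the data is JSON are all hypothetical and chosen only for illustration. A real site will differ on each of these, so treat this as a starting point rather than a working scraper.

```php
<?php
// Case 1: the data is embedded in an inline script as a literal,
// e.g. <script>var fares = {...};</script>. The regular expression is the
// "last resort" option; it can break if "};" appears inside a string value.
function extractEmbeddedJson(string $html, string $varName): ?array
{
    $pattern = '/\bvar\s+' . preg_quote($varName, '/') . '\s*=\s*(\{.*?\})\s*;/s';
    if (!preg_match($pattern, $html, $matches)) {
        return null; // Variable not found in the page source.
    }
    $data = json_decode($matches[1], true);
    return is_array($data) ? $data : null; // null if the literal is not valid JSON.
}

// Case 2: the page fetches its data with an XMLHttpRequest. Build HTTP
// context options that mimic that request, including a spoofed Referer.
function buildRequestOptions(string $referer, array $params): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => implode("\r\n", [
                'Content-Type: application/x-www-form-urlencoded',
                'Referer: ' . $referer,
                // Many JavaScript libraries send this header; some servers check it.
                'X-Requested-With: XMLHttpRequest',
            ]),
            'content' => http_build_query($params),
            'timeout' => 10,
        ],
    ];
}

// Send the request and decode the response, assuming the endpoint returns JSON.
function fetchXhrData(string $endpoint, string $referer, array $params): ?array
{
    $context = stream_context_create(buildRequestOptions($referer, $params));
    $body = @file_get_contents($endpoint, false, $context);
    if ($body === false) {
        return null; // Network failure or HTTP error status.
    }
    $data = json_decode($body, true);
    return is_array($data) ? $data : null;
}
```

With placeholder values, a call might look like `fetchXhrData('http://example.com/ajax/search', 'http://example.com/search', ['from' => 'LHR', 'to' => 'JFK'])`. The real URI and parameters would come from watching the requests the page makes in a browser debugging tool.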