<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: PHP Abstract Episode 22: Screen Scraping</title>
	<atom:link href="http://matthewturland.com/2007/10/18/php-abstract-episode-22-screen-scraping/feed/" rel="self" type="application/rss+xml" />
	<link>http://matthewturland.com/2007/10/18/php-abstract-episode-22-screen-scraping/</link>
	<description></description>
	<lastBuildDate>Sat, 31 Dec 2011 15:29:33 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: admin</title>
		<link>http://matthewturland.com/2007/10/18/php-abstract-episode-22-screen-scraping/comment-page-1/#comment-151</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Tue, 23 Oct 2007 19:03:34 +0000</pubDate>
		<guid isPermaLink="false">http://ishouldbecoding.com/2007/10/18/php-abstract-episode-22-screen-scraping/#comment-151</guid>
		<description>&lt;p&gt;In short, if JavaScript is being used to display the information, then either the information is embedded in the JavaScript itself somehow (possibly using two-way encryption) or the JavaScript is using an XMLHttpRequest to get the data after the initial page load. In the former case, you&#039;d have to find a way to parse the JavaScript, either with a third-party JavaScript parsing or tokenizing library or with your own homegrown solution using something like regular expressions (though I suggest that only as a last resort). In the latter case, you can simply find the URI and parameters used by the XMLHttpRequest, have your scraping application send a request there to get the information, and then parse the output of that request as usual. (This may require spoofing the Referer HTTP header, which specifies the URL of the referring page, as some applications check for it.) Good luck, and feel free to let me know if I can offer any other assistance.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>In short, if JavaScript is being used to display the information, then either the information is embedded in the JavaScript itself somehow (possibly using two-way encryption) or the JavaScript is using an XMLHttpRequest to get the data after the initial page load. In the former case, you&#8217;d have to find a way to parse the JavaScript, either with a third-party JavaScript parsing or tokenizing library or with your own homegrown solution using something like regular expressions (though I suggest that only as a last resort). In the latter case, you can simply find the URI and parameters used by the XMLHttpRequest, have your scraping application send a request there to get the information, and then parse the output of that request as usual. (This may require spoofing the Referer HTTP header, which specifies the URL of the referring page, as some applications check for it.) Good luck, and feel free to let me know if I can offer any other assistance.</p>
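<p>As a rough sketch of the latter case, here is one way the request could be reproduced in Python with only the standard library. The endpoint, the referring page, and the parameter names are all hypothetical placeholders; you would substitute whatever URI and parameters you observe the page's XMLHttpRequest actually sending. The X-Requested-With header is likewise an assumption, since only some JavaScript libraries send it.</p>

```python
import urllib.parse
import urllib.request

# Hypothetical values for illustration only; replace them with the URI
# and referring page observed for the real XMLHttpRequest.
ENDPOINT = "http://www.example.com/ajax/fares"
REFERRING_PAGE = "http://www.example.com/search"


def build_xhr_request(endpoint, params, referer):
    """Build a GET request that mimics the page's XMLHttpRequest."""
    url = endpoint + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url,
        headers={
            # Some applications check the URL of the referring page.
            "Referer": referer,
            # Assumption: sent by some JavaScript libraries, not all.
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def fetch(request, timeout=10):
    """Send the request and return the decoded response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Build the request without sending it; the parameter names are made up.
req = build_xhr_request(ENDPOINT, {"from": "LHR", "to": "JFK"}, REFERRING_PAGE)
print(req.full_url)
```

<p>Calling fetch(req) would then return the raw response body, which can be parsed with whatever suits the format the endpoint returns (JSON, XML, or an HTML fragment).</p>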
]]></content:encoded>
	</item>
	<item>
		<title>By: Damian Moore</title>
		<link>http://matthewturland.com/2007/10/18/php-abstract-episode-22-screen-scraping/comment-page-1/#comment-153</link>
		<dc:creator>Damian Moore</dc:creator>
		<pubDate>Tue, 23 Oct 2007 13:35:31 +0000</pubDate>
		<guid isPermaLink="false">http://ishouldbecoding.com/2007/10/18/php-abstract-episode-22-screen-scraping/#comment-153</guid>
		<description>Thanks for the podcast. I just listened to it and found it very relevant to what I&#039;m doing. I&#039;m a 3rd year computer science student studying at Essex University, England. My dissertation involves web scraping with PHP, so I have started to become familiar with some of the libraries you mentioned. I will be extracting data from airline websites to compare prices between them; however, I have stumbled across websites that won&#039;t play ball. When the website http://www.britishairways.com/ is queried with a browser, the results are generated using JavaScript, so the retrieved HTML is useless. Just wondering if you had any tips on how to tackle this - perhaps I could somehow use PHP to control Firefox and retrieve the DOM after JavaScript has been interpreted, but this sounds a bit over the top. Any suggestions would be very much appreciated. Thanks again for the podcast.</description>
		<content:encoded><![CDATA[<p>Thanks for the podcast. I just listened to it and found it very relevant to what I&#8217;m doing. I&#8217;m a 3rd year computer science student studying at Essex University, England. My dissertation involves web scraping with PHP, so I have started to become familiar with some of the libraries you mentioned. I will be extracting data from airline websites to compare prices between them; however, I have stumbled across websites that won&#8217;t play ball. When the website <a href="http://www.britishairways.com/" rel="nofollow">http://www.britishairways.com/</a> is queried with a browser, the results are generated using JavaScript, so the retrieved HTML is useless. Just wondering if you had any tips on how to tackle this &#8211; perhaps I could somehow use PHP to control Firefox and retrieve the DOM after JavaScript has been interpreted, but this sounds a bit over the top. Any suggestions would be very much appreciated. Thanks again for the podcast.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
