How-To (and How-Not-To) on Web Scraping

A friend of mine who shall remain nameless pointed a post out to me on the PHP DZone web site recently. Noting that the article’s content was misinformed at best and downright ignorant at worst, even when examining it sheerly from the author’s knowledge of PHP as a language, this friend asked that I set the author straight.

I gladly obliged with a comment on the post, having become somewhat of an authority on the application topic myself. As much of an unorthodox practice as web scraping may be, there are some methodologies for it that are obviously better than others. The aforementioned post illustrates a lot of the ones to avoid, and my arguments against them.

Later, I randomly encountered a post on the blog at xml.lt on the topic of web scraping using the DOM extension. This article showcases recommended practices and reasoned arguments against bad (and unfortunately common) alternatives. The author comes across as being significantly more informed on both the language and the application in the article’s content and code examples.

If you’re looking for references on topic of web scraping with PHP, there’s always the article I wrote for the December 2007 issue of php|architect magazine, of which you can still purchase an electronic copy in PDF format. At some point, I also hope to write a short book on the subject. Until then, if you have related questions, you can generally reach me in the #phpc channel on Freenode, under the nick Elazar. I’m always glad to give out advice on web scraping and PHP, as I’m sure my good friend Jared Folkins (who is also my “Little Sis” from the PHPWomen Big Sis/Little Sis mentoring program) will attest.

2 Comments

  1. …a href=”http://php.dzone.com/news/writing-website-scrapers-php”>recent articles covering it) on his blog today as an author of a previous article published in php|architect

  2. …a href=”http://php.dzone.com/news/writing-website-scrapers-php”>recent articles covering it) on his blog today as an author of a previous article published in php|architect