Posts tagged ‘Web Scraping’

Webscrapers Mailing List

Daniel Stenberg, one of the primary authors of the libcurl library on which the PHP cURL extension is based, was kind enough to comment on and clarify a recent blog post of mine regarding web scraping using PHP and cURL. He later sent me a tweet to invite me to a new mailing list for web scraping enthusiasts just before tweeting a public invitation. In addition to the mailing list itself, the web site also has links to books (including my book) and popular tools related to the subject. I think this is awesome and I encourage anyone with an interest in web scraping, professional or recreational, to join.

Gotcha on Scraping .NET Applications with PHP and cURL

Obligatory pitch: Many other useful tidbits like this can be yours by purchasing my book, php|architect’s Guide to Web Scraping with PHP.

I recently wrote a PHP script to scrape data from a .NET application. In the process of developing this script, I noticed something interesting that I thought I’d share. In this case, I was using the cURL extension, but the tip isn’t necessarily specific to that. One thing my script did was submit a POST request to simulate a form submission. The code looked something like the sample below.

$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_URL => 'http://...',
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => array(
        'field1' => 'value1',
        // ...
    ),
    // ...
));

The issue I ran into had to do with a behavior of the CURLOPT_POSTFIELDS setting that’s easy to overlook. This is a segment of its description from the PHP manual page for the curl_setopt() function.

If value is an array, the Content-Type header will be set to multipart/form-data.

If the form being submitted is not set to have an enctype attribute value of multipart/form-data in the form’s markup, .NET returns a 500-level HTTP response with no further information on what causes the error (for security purposes). This presumably happens because it’s expecting one value for the Content-Type request header and getting another.

Setting CURLOPT_HEADER and CURLOPT_VERBOSE to true helped to reveal that this was the issue. The fix is pretty simple: instead of passing the array itself for CURLOPT_POSTFIELDS, pass the result of wrapping it in a call to the http_build_query() function (see its PHP manual page). This converts it to a URL-encoded query string, which causes cURL to use the default Content-Type header value of application/x-www-form-urlencoded instead.
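Here’s a minimal sketch of that fix. The field names and the URL are placeholders I made up for illustration, not the ones from the real application, and the request itself is left commented out so the snippet runs without touching the network.

```php
<?php
// Hypothetical form fields; the real names come from the target form's markup.
$fields = array(
    'field1' => 'value1',
    'field2' => 'value with spaces',
);

// Encode the array as a URL-encoded string so cURL does not switch the
// request to multipart/form-data.
$body = http_build_query($fields);
echo $body, "\n";

// Only configure the handle if the cURL extension is loaded.
if (extension_loaded('curl')) {
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL => 'http://example.com/form.aspx', // placeholder URL
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $body,     // a string, not an array
        CURLOPT_RETURNTRANSFER => true,  // return the response instead of printing it
    ));
    // $response = curl_exec($ch); // uncomment to actually send the request
}
```

With the default PHP settings, the string printed is field1=value1&field2=value+with+spaces, which is the same shape of body a browser sends for a form without an enctype attribute.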

Tools like Firebug can help you to examine requests made by a browser. Together with these settings for cURL, you can modify your script’s requests to match those of your browser as closely as possible, making gotchas like this less likely to trip you up.

Renaming a DOMNode in PHP

A recent work assignment had me using PHP to pull HTML data into a DOMDocument instance and renaming some elements, such as b to strong or i to em. As it turns out, renaming elements using the DOM extension is rather tedious.

Version 3 of the DOM standard introduces a renameNode() method, but the PHP DOM extension doesn’t currently support it.

The $nodeName property of the DOMNode class is read-only, so it can’t be changed that way.

A node can be created with a different name in the same document, but if you specify a value to go along with it, any entities in that value are automatically encoded, so it’s not possible to pass in the intended inner content of a node if it contains other nodes.

The only method I’ve found that works is to replicate the attributes and child nodes of the original node. Attributes are fairly easy, but I ran into an issue replicating children where only the first child of any given node was replicated within its intended replacement and the remaining children were omitted. Here’s the original code that was exhibiting this behavior.

foreach ($oldNode->childNodes as $childNode) {
    $newNode->appendChild($childNode);
}

The reason for this behavior is that the $childNodes property of $oldNode is implicitly modified when $childNode is transferred from it to $newNode, so the internal pointer of $childNodes to the next child in the list is no longer accurate.

To get around this, I took advantage of the fact that any node with any child nodes will always have a $firstChild property pointing to the first one. The modified code that takes this approach is below and has the behavior I originally set out to implement.

while ($oldNode->firstChild) {
    $newNode->appendChild($oldNode->firstChild);
}
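An alternative I’d expect to work just as well, sketched here under the assumption that you’d rather keep a foreach loop, is to copy the live child list into a plain array first so that moving nodes can’t disturb the iteration. The element names in this example are made up for the demonstration.

```php
<?php
$doc = new DOMDocument();
// Hypothetical markup: <source> holds three children, <target> is empty.
$doc->loadXML('<root><source>a<x/>b</source><target/></root>');

$oldNode = $doc->getElementsByTagName('source')->item(0);
$newNode = $doc->getElementsByTagName('target')->item(0);

// iterator_to_array() takes a snapshot of the live DOMNodeList, so the
// array is unaffected when each child is moved out of $oldNode.
foreach (iterator_to_array($oldNode->childNodes) as $childNode) {
    $newNode->appendChild($childNode);
}

echo $doc->saveXML($doc->documentElement), "\n";
```

The snapshot costs an extra array, so for a straight move the while loop above is still the simpler choice.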

If you’re curious, below is the full code segment for renaming a node.

$newNode = $oldNode->ownerDocument->createElement('new_element_name');
if ($oldNode->attributes->length) {
    foreach ($oldNode->attributes as $attribute) {
        $newNode->setAttribute($attribute->nodeName, $attribute->nodeValue);
    }
}
while ($oldNode->firstChild) {
    $newNode->appendChild($oldNode->firstChild);
}
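// replaceChild() must be called on the old node's parent; calling it on the
// document only works when the old node is the document's direct child
$oldNode->parentNode->replaceChild($newNode, $oldNode);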

Another potential “gotcha” is the argument order of the replaceChild() method, which is the new node followed by the old node rather than the reverse that most people might expect. Thanks to Joshua May for pointing that one out to me; I might never have understood why I was getting a “Not Found Error” DOMException otherwise.
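To make the steps above reusable, they can be wrapped in a small helper. This is only a sketch: renameElement() is a name I’ve made up, not part of the DOM extension, and it assumes the node being renamed is still attached to a parent, since the replacement is done through that parent.

```php
<?php
/**
 * Hypothetical helper: replaces $oldNode with a new element named $newName,
 * carrying over its attributes and children. Assumes $oldNode has a parent.
 */
function renameElement(DOMElement $oldNode, string $newName): DOMElement
{
    $newNode = $oldNode->ownerDocument->createElement($newName);

    // Copy attributes; this does not modify the old node's attribute list.
    foreach ($oldNode->attributes as $attribute) {
        $newNode->setAttribute($attribute->nodeName, $attribute->nodeValue);
    }

    // Move children one at a time; firstChild updates after each move.
    while ($oldNode->firstChild) {
        $newNode->appendChild($oldNode->firstChild);
    }

    // New node first, old node second, called on the old node's parent.
    $oldNode->parentNode->replaceChild($newNode, $oldNode);

    return $newNode;
}

// Usage with made-up markup: rename <b> to <strong>.
$doc = new DOMDocument();
$doc->loadXML('<root><b class="x">Hi <i>there</i>!</b></root>');
renameElement($doc->getElementsByTagName('b')->item(0), 'strong');
echo $doc->saveXML($doc->documentElement), "\n";
```

This prints <root><strong class="x">Hi <i>there</i>!</strong></root>, with the attribute and all three children carried over to the renamed element.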

Webcast Slides

Hard to believe it’s been that long, but two months ago I mentioned the free webcast series sponsored by Adobe and leading up to php|tek 2009.

I’ve posted the slides from my webcast on February 27. If you weren’t able to make it, I gave an introduction to what web scraping is, basic details of the HTTP protocol, available resources for developing web scraping applications, and best practices. I know there are plans to make the audio from the webcast available, and I will update this post with a link once that happens.

If the slides and audio aren’t enough for you, I will in all likelihood be giving an extended version of the presentation that includes both retrieval and analysis as part of the Unconference event at php|tek. Look forward to seeing you there!

How-To (and How-Not-To) on Web Scraping

A friend of mine who shall remain nameless pointed a post out to me on the PHP DZone web site recently. Noting that the article’s content was misinformed at best and downright ignorant at worst, even when judging it purely on the author’s knowledge of PHP as a language, this friend asked that I set the author straight.

I gladly obliged with a comment on the post, having become somewhat of an authority on the application topic myself. As much of an unorthodox practice as web scraping may be, there are some methodologies for it that are obviously better than others. The aforementioned post illustrates a lot of the ones to avoid, and my comment lays out my arguments against them.

Later, I randomly encountered a post on the blog at xml.lt on the topic of web scraping using the DOM extension. This article showcases recommended practices and reasoned arguments against bad (and unfortunately common) alternatives. The author comes across as being significantly more informed on both the language and the application in the article’s content and code examples.

If you’re looking for references on the topic of web scraping with PHP, there’s always the article I wrote for the December 2007 issue of php|architect magazine, of which you can still purchase an electronic copy in PDF format. At some point, I also hope to write a short book on the subject. Until then, if you have related questions, you can generally reach me in the #phpc channel on Freenode, under the nick Elazar. I’m always glad to give out advice on web scraping and PHP, as I’m sure my good friend Jared Folkins (who is also my “Little Sis” from the PHPWomen Big Sis/Little Sis mentoring program) will attest.

Book Review: PHP Web 2.0 Mashup Projects

You can find this review in podcast form on the Zend Developer Zone PHP Abstract Podcast.

I received an e-mail recently from a very nice gentleman at Packt Publishing, a UK-based publishing company focused on providing hands-on application-oriented publications to IT professionals, particularly those specific to open source technologies. Their representative asked if I would be willing to review one of their books, namely PHP Web 2.0 Mashup Projects by Shu-Wai Chow. Reviewing books is not something I had done before, so I thought I would give it a good old-fashioned college try.

In a supersaturated market, it is difficult to make an impression with a PHP book these days. The books of real value are those that focus on ways to apply the language to real world problems. These books delve into the depths of a particular application domain, showing PHP code and outlining design principles along the way. They are useful to current and prospective PHP programmers alike because they introduce both groups not to PHP itself, but to an existing class of problems and how PHP can be applied to solve them. PHP Web 2.0 Mashup Projects is one of these books.

Most technology-related books on the shelves are several inches thick and an inherently daunting chore to sift through. Luckily, this book is not one of those. Do not let the size fool you, though; it is positively packed with useful information. It hits the high points of each topic it covers, giving you enough in the way of code samples and step-by-step explanations to get started, as well as resources to help you get better acquainted with topics that might be of particular interest to you.

The book is divided into six chapters, each of which covers a set of particular protocols, data formats, and APIs for acquiring and processing data in order to create a particular mashup application. These projects include:

  • A search engine to find products on Amazon by their Universal Product Code
  • A search engine to combine results from MSN and Yahoo!
  • A video jukebox that pulls songs from Last.fm and videos from YouTube
  • A traffic incident reporting application that sends SMS alerts
  • An illustrated tube station line map using Google Maps and Flickr for related photos

The book’s structure and layout make it easy to follow, whether you prefer to read it linearly or jump around to specific sections. It is an excellent reference that I can see myself returning to time and time again.

One of the strengths of the book is that it has a very wide base of coverage. It starts by introducing basics in interacting with web services and extracting the desired data from their responses using core PHP libraries. The REST, XML-RPC, and SOAP protocols and the WSDL standard are all covered in enough depth to get you started, so you can work with a web service regardless of the protocol or protocols it offers. The author does an excellent job of selecting example web services and data standards from large and well-known to small and obscure. For real world APIs, you will find the likes of Amazon, YouTube, Google, and Flickr, as well as sources that might not be household names, such as the Internet UPC Database. Data standards include general formats like XML, RDF, and JSON and more specialized formats like RSS and XSPF.

Another strength is that the book encourages good principles from the start. It advocates object-oriented design principles for code reuse and a DRY philosophy. It suggests using third-party libraries such as those in PEAR in order to avoid unnecessary reinvention of the wheel, but still shows you how to roll your own if and when it becomes necessary. The book also covers usability, particularly in the last chapter when it discusses AJAX and race conditions, and pays special attention to application security, an area of increasing concern in web applications. Unlike some books, this one includes tips for development outside its own showcased projects to spare you from having to spend your own time troubleshooting common issues or digging for solutions to “gotcha” situations.

And last but certainly not least, the book demonstrates that sometimes you have to be resourceful in locating and acquiring your data, particularly in Chapter 5 where one of my own areas of interest, web scraping, is covered. The topic is explained in plain language and supplemented with examples walking you through exactly how it can be used to acquire data for your own mashups. Web scraping is not a frequently broached topic and I applaud the author for making a point to include it. I believe it is a genuinely useful methodology that can help in data acquisition when no other options are available.

I cannot give the book an entirely glowing review, though. There are some errata present, both in content and code samples. Most are small, but some are enough to throw off a reader not already familiar with the material being covered. I’ve submitted some of these via the publisher’s web site already, though I have yet to receive any related communications or see them show up on the web site at the time that I write this review. These issues can be corrected, though, and the quality of the book’s content outshines them.

Overall, PHP Web 2.0 Mashup Projects is an excellent example of creativity in finding new ways to aggregate data sets in useful combinations. It is a testament to the possibilities of the internet when access to data is opened up and freedom to use that data enables developers to create exciting and inspiring new solutions. Mashups show the internet’s potential increasing in leaps and bounds and this book can get you on your way to contributing to their future development.

Web Scraping Article Published

Just a quick post to announce (albeit a little late) the December 2007 issue of php|architect, which includes my article on web scraping. Please buy a copy, give it a read, and feel free to post comments on the forum thread for the article. I’d love to hear some reader feedback!

You may have noticed that I’ve added a new page for publications. This will become the home for any content I produce that gains any sort of recognition, be it a podcast, article, book review, presentation slides, or what have you. Anytime anything new goes there, I’ll try to make a point to write a post about it.

Article for php|architect

One of the things that has kept me away from my blog for the past few weeks is an article I’ve been working on for php|architect magazine. It should be included in the December 2007 issue and is entitled “Web Scraping.” So, if the topic interests you, keep an eye out for it. If you aren’t sure if the topic interests you, you can check out my episode on the Zend Developer Zone PHP Abstract podcast for a brief high-level description. I’ll probably post about this again once the issue comes out, but I thought I’d give a heads up to anyone out there that might buy issues of the magazine on an issue-by-issue basis.

PHP Abstract Episode 22: Screen Scraping

Check out the latest PHP Abstract podcast (episode 22) from Dev Zone. I’m the guest speaker! The podcast is on web scraping, a practice in which I have (unfortunately) become somewhat proficient. Leave a comment on Dev Zone or on this entry and let me know what you think!