Gotcha on Scraping .NET Applications with PHP and cURL

Published in Web Scraping on Jul 7, 2010

Obligatory pitch: Many other useful tidbits like this can be yours by purchasing my book, php[architect]'s Guide to Web Scraping with PHP.

I recently wrote a PHP script to scrape data from a .NET application. While developing it, I noticed something interesting that I thought I'd share. I was using the cURL extension, but the tip isn't specific to it. Among other things, my script submitted a POST request to simulate a form submission. The code looked something like the sample below.

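$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_URL => 'http://...',
    // Send a POST request, as the browser would when submitting the form
    CURLOPT_POST => true,
    // Form fields, passed as an associative array
    CURLOPT_POSTFIELDS => array(
        'field1' => 'value1',
        // ...
    ),
    // ...
));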

The issue I ran into had to do with a behavior of the CURLOPT_POSTFIELDS setting that’s easy to overlook. Here is the relevant part of its description from the PHP manual page for the curl_setopt() function.

If value is an array, the Content-Type header will be set to multipart/form-data.

If the markup of the form being submitted does not set its enctype attribute to multipart/form-data, .NET returns a 500-level HTTP response and, for security purposes, gives no further information about what caused the error. This presumably happens because the application expects one value for the Content-Type request header and receives another.

Setting CURLOPT_HEADER and CURLOPT_VERBOSE to true helped reveal that this was the issue. The fix is pretty simple: instead of passing the array itself for CURLOPT_POSTFIELDS, pass the array to the http_build_query() function (see its PHP manual page) and use the string it returns. This converts the array to a URL-encoded query string, which causes cURL to use its default Content-Type header value of application/x-www-form-urlencoded instead.
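As a minimal sketch of that fix, the snippet below encodes the fields with http_build_query() before handing them to cURL. The field names and values, and the postUrlEncoded() helper, are hypothetical and only there for illustration. The separator is passed explicitly so the result does not depend on the arg_separator.output ini setting.

```php
<?php
// Hypothetical form fields; the second value contains characters that need encoding.
$fields = array(
    'field1' => 'value1',
    'field2' => 'a b&c',
);

// Convert the array to a URL-encoded string, forcing "&" as the separator.
$body = http_build_query($fields, '', '&');
echo $body, "\n"; // field1=value1&field2=a+b%26c

/**
 * Hypothetical helper: POST $fields to $url as application/x-www-form-urlencoded.
 * Returns the response body as a string, or false on failure.
 */
function postUrlEncoded($url, array $fields)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL => $url,
        CURLOPT_POST => true,
        // A string (not an array) keeps cURL on its default Content-Type.
        CURLOPT_POSTFIELDS => http_build_query($fields, '', '&'),
        // Return the response instead of printing it.
        CURLOPT_RETURNTRANSFER => true,
    ));

    return curl_exec($ch);
}
```

Calling postUrlEncoded() with the form's action URL and the same fields the browser submits should then send the request with the urlencoded Content-Type rather than multipart/form-data.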

Tools like Firebug can help you to examine requests made by a browser. Together with these settings for cURL, you can modify your script’s requests to match those of your browser as closely as possible, making gotchas like this less likely to trip you up.
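For reference, here is one way the CURLOPT_HEADER and CURLOPT_VERBOSE settings mentioned above might be wired up. This is a sketch, not the code from my script: the debugRequest() and splitResponse() helpers are hypothetical names, and sending the verbose log to a php://temp stream via CURLOPT_STDERR is my own choice so the log can be read back as a string instead of going to standard error.

```php
<?php
/**
 * Hypothetical helper: split a raw response captured with CURLOPT_HEADER
 * enabled into its header block and its body.
 */
function splitResponse($raw, $headerSize)
{
    return array(
        'headers' => (string) substr($raw, 0, $headerSize),
        'body'    => (string) substr($raw, $headerSize),
    );
}

/**
 * Hypothetical helper: perform a request with debugging output enabled.
 * $options holds any extra cURL options, such as the POST settings.
 */
function debugRequest($url, array $options = array())
{
    // Collect cURL's verbose log in a temporary stream.
    $log = fopen('php://temp', 'w+');

    $ch = curl_init();
    curl_setopt_array($ch, $options + array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER => true,   // include response headers in the output
        CURLOPT_VERBOSE => true,  // log request and connection details
        CURLOPT_STDERR => $log,   // write the verbose log to our stream
    ));

    $raw = curl_exec($ch);
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);

    // Read the verbose log back as a string.
    rewind($log);
    $verbose = stream_get_contents($log);
    fclose($log);

    if ($raw === false) {
        return array('headers' => '', 'body' => '', 'verbose' => $verbose);
    }

    return splitResponse($raw, $headerSize) + array('verbose' => $verbose);
}
```

The verbose log includes the request headers cURL actually sent, so the Content-Type value can be compared directly against what the browser sends for the same form.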