Gotcha on Scraping .NET Applications with PHP and cURL

Obligatory pitch: Many other useful tidbits like this can be yours by purchasing my book, php|architect’s Guide to Web Scraping with PHP.

I recently wrote a PHP script to scrape data from a .NET application. In the process of developing this script, I noticed something interesting that I thought I’d share. In this case, I was using the cURL extension, but the tip isn’t necessarily specific to that. One thing my script did was submit a POST request to simulate a form submission. The code looked something like the sample below.

$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_URL => 'http://...',
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => array(
        'field1' => 'value1',
        // ...
    ),
    // ...
));

The issue I ran into had to do with a behavior of the CURLOPT_POSTFIELDS setting that’s easy to overlook. This is a segment of its description from the PHP manual page for the curl_setopt() function.

If value is an array, the Content-Type header will be set to multipart/form-data.

If the form being submitted is not set to have an enctype attribute value of multipart/form-data in the form’s markup, .NET returns a 500-level HTTP response with no further information on what causes the error (for security purposes). This presumably happens because it’s expecting one value for the Content-Type request header and getting another.

Setting CURLOPT_HEADER and CURLOPT_VERBOSE to true helped to reveal that this was the issue. The fix is pretty simple: instead of passing the array itself for CURLOPT_POSTFIELDS, pass the result of wrapping it in a call to the http_build_query() function (see its PHP manual page). This converts it to a properly formatted query string, which causes cURL to use the default Content-Type header value of application/x-www-form-urlencoded instead.
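Here is a minimal sketch of that fix. The helper name build_form_post_options(), the example.com URL, and the field names are placeholders of mine for illustration, not part of the original script.

```php
<?php
// Hypothetical helper: builds cURL options for a URL-encoded form POST.
function build_form_post_options($url, array $fields)
{
    return array(
        CURLOPT_URL => $url,
        CURLOPT_POST => true,
        // A string body is sent as application/x-www-form-urlencoded;
        // an array here would be sent as multipart/form-data instead.
        CURLOPT_POSTFIELDS => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,
    );
}

// Placeholder URL and fields, assumed for this example only.
$options = build_form_post_options('http://example.com/form.aspx', array(
    'field1' => 'value1',
    'field2' => 'value 2',
));

echo $options[CURLOPT_POSTFIELDS], "\n"; // field1=value1&field2=value+2

$ch = curl_init();
curl_setopt_array($ch, $options);
// Call curl_exec($ch) to actually send the request.
```

Keeping the option-building in one place makes it harder to pass a raw array to CURLOPT_POSTFIELDS by accident.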

Tools like Firebug can help you to examine requests made by a browser. Together with these settings for cURL, you can modify your script’s requests to match those of your browser as closely as possible, making gotchas like this less likely to trip you up.
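For reference, this is roughly how those debugging settings can be wired up. It is only a sketch: the function name debug_form_post() is hypothetical, and capturing the verbose log in a php://temp stream via CURLOPT_STDERR is my own choice here rather than something the original script did.

```php
<?php
// Hypothetical helper: sends a URL-encoded POST and returns the raw
// response (with headers) together with cURL's verbose log.
function debug_form_post($url, array $fields)
{
    // Temporary stream that receives the verbose output.
    $log = fopen('php://temp', 'w+');

    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL => $url,
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER => true,   // include response headers in the result
        CURLOPT_VERBOSE => true,  // log the request headers and connection info
        CURLOPT_STDERR => $log,   // write that log to our stream, not stderr
    ));

    $response = curl_exec($ch); // string on success, false on failure

    rewind($log);
    $verbose = stream_get_contents($log);
    fclose($log);

    return array('response' => $response, 'log' => $verbose);
}

// Usage (the URL is a placeholder):
// $result = debug_form_post('http://example.com/form.aspx', array('field1' => 'value1'));
// echo $result['log'];
```

The Content-Type line in the logged request headers is what you would compare against the request your browser sends.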

4 Comments

  1. Daniel Ice says:

    Great tip. I have been doing lots of spidering over time and never seen this tip. Thanks.

  2. Hi.

    I’m the main author of libcurl, and while I’m not fluent in PHP nor in the curl extension for PHP, I must say that your post/explanation here greatly misses the point. The problem is really _not_ the Content-Type: header. Most likely servers don’t care one iota about that header.

    What counts is that you made a multipart formpost instead of a regular one. The entire POST was done in a completely different encoding, as url-encoding and multipart are far, far away from each other.

    (Me personally, I don’t like how the curl extension made the two different posts that easy to mix up.)

    Have fun with that scraping!

  3. @Daniel Thanks for taking the time to comment and make that clarification. I didn’t mean to imply that this was an error on the part of libcurl or its PHP extension, merely a behavior of the latter that’s easy to miss if you don’t read the docs carefully.

    I agree that the end result is that the POST data is formatted differently and the difference in header value is more a consequence of that and a symptom rather than a cause. One thing I didn’t mention explicitly is that I’ve never known PHP to behave like .NET does in this situation, so the behavior came as somewhat of a surprise to me.

  4. EvNix says:

    nice tutorial i prefer using snoopy class though!
    i guess snoopy uses fsockopen
    but i am not very sure