Gotcha on Scraping .NET Applications with PHP and cURL
Obligatory pitch: Many other useful tidbits like this can be yours by purchasing my book, php[architect]'s Guide to Web Scraping with PHP.
I recently wrote a PHP script to scrape data from a .NET application. In the process of developing this script, I noticed something interesting that I thought I'd share. In this case, I was using the cURL extension, but the tip isn’t necessarily specific to that. One thing my script did was submit a POST request to simulate a form submission. The code looked something like the sample below.
$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_URL => 'http://...',
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => array(
        'field1' => 'value1',
        // ...
    ),
    // ...
));
The issue I ran into had to do with a behavior of the CURLOPT_POSTFIELDS setting that’s easy to overlook. This is a segment of its description from the PHP manual page for the curl_setopt() function:
If value is an array, the Content-Type header will be set to multipart/form-data.
If the form being submitted is not set to have an enctype attribute value of multipart/form-data in the form's markup, .NET returns a 500-level HTTP response with no further information on what causes the error (for security purposes). This presumably happens because it's expecting one value for the Content-Type request header and getting another.
Enabling one of cURL's debugging settings by setting it to true helped to reveal that this was the issue. The fix is pretty simple: instead of passing the array itself for CURLOPT_POSTFIELDS, pass the result of wrapping it in a call to the http_build_query() function (see its PHP manual page). This converts it to a properly formatted query string, which causes cURL to use the default Content-Type header value of application/x-www-form-urlencoded.
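Here is a minimal sketch of that fix, reusing the placeholder field names from the sample above. The helper names buildFormBody() and postForm() are hypothetical, made up for this illustration, and CURLOPT_RETURNTRANSFER is an assumption about how you'd want to receive the response.

```php
<?php
// Hypothetical helper: URL-encode an array of form fields into one string.
// The separator is passed explicitly so the result doesn't depend on the
// arg_separator.output ini setting.
function buildFormBody(array $fields)
{
    return http_build_query($fields, '', '&');
}

// Hypothetical helper: POST the fields as a string rather than an array,
// so cURL does not switch the request to multipart/form-data.
function postForm($url, array $fields)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL => $url,
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => buildFormBody($fields), // string, not array
        CURLOPT_RETURNTRANSFER => true,               // return the body
    ));
    return curl_exec($ch);
}

// Usage, with the URL left elided as in the sample above:
// $response = postForm('http://...', array('field1' => 'value1'));
```

Keeping the encoding step in its own function makes it easy to print the body and compare it with what the browser sends.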
Tools like Firebug can help you to examine requests made by a browser. Together with these settings for cURL, you can modify your script’s requests to match those of your browser as closely as possible, making gotchas like this less likely to trip you up.
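To compare your script's request with the browser's, one option is to have cURL record the headers it actually sent. The sketch below uses CURLINFO_HEADER_OUT for that, which is my assumption about a suitable setting and not necessarily the one used when debugging the original script. The function names sentRequestHeaders() and extractContentType() are hypothetical.

```php
<?php
// Hypothetical helper: pull the Content-Type value out of a raw block of
// request headers. Returns null if the header is absent.
function extractContentType($rawHeaders)
{
    if (preg_match('/^Content-Type:\s*(.+)$/mi', $rawHeaders, $matches)) {
        return trim($matches[1]); // trim() drops the trailing "\r"
    }
    return null;
}

// Hypothetical helper: send a POST and return the request headers cURL
// sent. $postFields may be an array or a string, so both can be compared.
function sentRequestHeaders($url, $postFields)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL => $url,
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $postFields,
        CURLOPT_RETURNTRANSFER => true,
        CURLINFO_HEADER_OUT => true, // record the outgoing request headers
    ));
    curl_exec($ch);
    return (string) curl_getinfo($ch, CURLINFO_HEADER_OUT);
}

// Usage, with the URL left elided as in the sample above:
// $fields = array('field1' => 'value1');
// echo extractContentType(sentRequestHeaders('http://...', $fields)), "\n";
// echo extractContentType(sentRequestHeaders('http://...', http_build_query($fields))), "\n";
```

Running both calls against the same URL should make the difference in the Content-Type header visible right away.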