Gotcha on Scraping .NET Applications with PHP and cURL
Obligatory pitch: Many other useful tidbits like this can be yours by purchasing my book, php[architect]'s Guide to Web Scraping with PHP.
I recently wrote a PHP script to scrape data from a .NET application. In the process of developing this script, I noticed something interesting that I thought I'd share. In this case, I was using the cURL extension, but the tip isn’t necessarily specific to that. One thing my script did was submit a POST request to simulate a form submission. The code looked something like the sample below.
$ch = curl_init();
curl_setopt_array($ch, array(
CURLOPT_URL => 'http://...',
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => array(
'field1' => 'value1',
// ...
),
// ...
));
The issue I ran into had to do with a behavior of the CURLOPT_POSTFIELDS
setting that’s easy to overlook. This is a segment of its description from the PHP manual page for the curl_setopt()
function.
If value is an array, the Content-Type header will be set to multipart/form-data.
If the form being submitted is not set to have an enctype
attribute value of multipart/form-data
in the form's markup, .NET returns a 500-level HTTP response with no further information on what causes the error (for security purposes). This presumably happens because it's expecting one value for the Content-Type
request header and getting another.
Setting CURLOPT_HEADER
and CURLOPT_VERBOSE
to true
helped to reveal that this was the issue. The fix is pretty simple: instead of passing the array itself for CURLOPT_POSTFIELDS
, pass the result of wrapping it in a call to the http_build_query()
function (see its PHP manual page). This converts it to a properly formatted query string, which causes cURL to use the default Content-Type
header value of application/x-www-form-urlencoded
instead.
Tools like Firebug can help you to examine requests made by a browser. Together with these settings for cURL, you can modify your script’s requests to match those of your browser as closely as possible, making gotchas like this less likely to trip you up.