Posts tagged ‘.NET’

Gotcha on Scraping .NET Applications with PHP and cURL

Obligatory pitch: Many other useful tidbits like this can be yours by purchasing my book, php|architect’s Guide to Web Scraping with PHP.

I recently wrote a PHP script to scrape data from a .NET application. In the process of developing this script, I noticed something interesting that I thought I’d share. In this case, I was using the cURL extension, but the tip isn’t necessarily specific to that. One thing my script did was submit a POST request to simulate a form submission. The code looked something like the sample below.

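// Initialize a cURL handle and configure a POST request.
$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_URL => 'http://...',
    CURLOPT_POST => true,
    // The form fields are passed as an array here.
    CURLOPT_POSTFIELDS => array(
        'field1' => 'value1',
        // ...
    ),
    // ...
));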

The issue I ran into had to do with a behavior of the CURLOPT_POSTFIELDS setting that’s easy to overlook. This is a segment of its description from the PHP manual page for the curl_setopt() function.

If value is an array, the Content-Type header will be set to multipart/form-data.

If the form’s markup does not set its enctype attribute to multipart/form-data, .NET returns a 500-level HTTP response with no further information on what caused the error (for security purposes). This presumably happens because the application expects one value for the Content-Type request header and receives another.

Setting CURLOPT_HEADER and CURLOPT_VERBOSE to true helped to reveal that this was the issue. The fix is pretty simple: instead of passing the array itself for CURLOPT_POSTFIELDS, pass the result of wrapping it in a call to the http_build_query() function (see its PHP manual page). This converts the array to a URL-encoded query string, which causes cURL to use the default Content-Type header value of application/x-www-form-urlencoded instead.
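Here is a minimal sketch of that fix. The helper names encode_form_fields() and post_form() are hypothetical, as are any URL and field names you pass in; only http_build_query() and the CURLOPT_* options come from PHP itself. CURLOPT_RETURNTRANSFER is an extra I added so the response comes back as a string.

```php
<?php
// Hypothetical helper: encode form fields as a URL-encoded string.
function encode_form_fields(array $fields)
{
    // Passing '&' explicitly guards against a non-default
    // arg_separator.output setting in php.ini.
    return http_build_query($fields, '', '&');
}

// Hypothetical helper: POST the fields as a string rather than an array,
// so cURL sends application/x-www-form-urlencoded.
function post_form($url, array $fields)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL => $url,
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => encode_form_fields($fields),
        CURLOPT_RETURNTRANSFER => true, // return the response body as a string
    ));
    return curl_exec($ch);
}
```

The only change that matters for the gotcha is that CURLOPT_POSTFIELDS receives a string. Everything else in the request stays as it was.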

Tools like Firebug can help you to examine requests made by a browser. Together with these settings for cURL, you can modify your script’s requests to match those of your browser as closely as possible, making gotchas like this less likely to trip you up.
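For the cURL side of that comparison, something like the sketch below may help. The fetch_with_debug() helper is hypothetical, and routing the verbose log to an in-memory stream through CURLOPT_STDERR is my own choice rather than something the original script did. By default the verbose output goes to STDERR.

```php
<?php
// Hypothetical helper: run a request and capture cURL's verbose log.
function fetch_with_debug($url, array $options = array())
{
    $log = fopen('php://temp', 'w+'); // in-memory stream for the verbose log

    $ch = curl_init($url);
    // Caller-supplied options win over the debugging defaults below.
    curl_setopt_array($ch, $options + array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER => true,   // include response headers in the returned string
        CURLOPT_VERBOSE => true,  // log request headers and connection details
        CURLOPT_STDERR => $log,   // write that log to our stream
    ));
    $response = curl_exec($ch);

    rewind($log);
    $verbose = stream_get_contents($log);
    fclose($log);

    return array('response' => $response, 'verbose' => $verbose);
}
```

The 'verbose' entry should contain the request headers cURL actually sent, including Content-Type, which you can then compare with what the browser sends.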

The Yin and Yang of Typing

Without a little background in programming languages or computer science in general, it’s entirely possible that type systems have never crossed your mind. I thought I’d take a blog entry to share some of my thoughts on how they’re affecting the creation and evolution of languages.

First of all, Benjamin C. Pierce probably has a point: terminology used to refer to typing concepts is about as useful as buzzwords like AJAX or Web 2.0 these days. Be that as it may, I’m going to reach back into the recesses of the programming languages course I took in college to dig up some of this terminology.

If you aren’t familiar with static versus dynamic typing or strong versus weak typing, it may be worth it to read up on those before proceeding with the rest of this blog entry. Here are a few examples of each:

  • Static/weak – C
  • Static/strong – Java
  • Dynamic/weak – PHP
  • Dynamic/strong – Python
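To make the dynamic/weak corner of that list concrete, here is a small PHP sketch. The variable names are purely illustrative.

```php
<?php
// Weak typing: operands are coerced to fit the operator.
$sum = '5' + 3;             // numeric string becomes an integer: 8
$count = 5;
$joined = $count . ' apples'; // integer becomes a string: '5 apples'

// Loose comparison converts types; strict comparison does not.
$loose = ('42' == 42);      // true
$strict = ('42' === 42);    // false

// Dynamic typing: the same variable can hold values of different types.
$value = 'ten';
$value = 10;
```

A dynamic/strong language such as Python would reject the addition of a string and an integer at runtime rather than coercing one of them.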

The line between strong and weak typing seems to be blurring as languages like these evolve. The reason is that each side of typing has its advantages. Strong typing, particularly when paired with static typing, allows for compile-time checking, which can help eliminate human error, as well as performance optimizations that come from knowing types at compile time. It can also make source code more intuitive to follow in some respects. Weak typing, on the other hand, can allow for higher levels of abstraction and, by extension, less code to execute identical operations on multiple types. Dynamic languages can also allow for things like variable variables, variable functions, and other interesting features that are difficult or impossible in statically typed languages.
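For readers who haven’t seen them, here is a rough PHP sketch of variable variables, variable functions, and one function serving several types. The describe() function and the variable names are hypothetical.

```php
<?php
// Variable variable: the value of $name is used as a variable name.
$name = 'greeting';
$$name = 'hello';            // creates $greeting

// Variable function: the value of $fn is used as a function name.
$fn = 'strtoupper';
$result = $fn($greeting);    // calls strtoupper('hello')

// One untyped function handles several types through coercion.
function describe($value)
{
    return 'Value: ' . $value;
}

$fromInt = describe(42);
$fromString = describe('abc');
```

Both the variable name and the function name are resolved at runtime, which is why a compiler that checks types ahead of time has little to work with here.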

Yet languages on either side of the proverbial fence are drawing in strengths from the other side. Java, previously limited to the flexibility that polymorphism could provide while still maintaining strong typing, introduced generics in 1.5, whereby typing was still enforced but a higher level of logic abstraction was enabled for developers. By the same token, PHP has had explicit typecasting for a while and more recently introduced type hinting for object types in 5.0 and for arrays in 5.1 (which may extend to scalar types in later versions). C# 3.0, released alongside .NET 3.5, adds type inferencing, which, while only syntactic sugar, at least alleviates the need for verbosity when performing the most common method of initialization (i.e. setting a variable of a given class to an object instance of that class, as opposed to one involving a subclass of one or more of the classes involved).
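On the PHP side, the type hinting and explicit casting mentioned above look roughly like this. The Order class and both functions are hypothetical examples.

```php
<?php
// Hypothetical class used only to demonstrate an object type hint.
class Order
{
    public $total = 0;
}

// Array type hint: callers must pass an array.
function sum_items(array $items)
{
    return array_sum($items);
}

// Object type hint on $order; $percent is cast explicitly.
function apply_discount(Order $order, $percent)
{
    $order->total = $order->total * (1 - ((float) $percent) / 100);
    return $order;
}

// Explicit typecast: the leading digits of the string are kept.
$count = (int) '42 items';

$order = new Order();
$order->total = 200;
$discounted = apply_discount($order, '25');
```

Passing the wrong type to a hinted parameter is rejected at call time. In PHP 7 and later this surfaces as a TypeError, while PHP 5 raised a catchable fatal error.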

It’s also becoming commonplace for interpreters of dynamically typed languages to be ported to the Java and .NET platforms in order to leverage the native libraries of the host platform within the existing execution environment. Take these examples for instance.

In short, some level of control over typing is obviously a desired feature in any useful language. I also don’t think a language can be truly useful without having a bit of both worlds to some degree. Programming languages exist to enable developers to control machines whose primary purpose is to manipulate data (and which, as has been pointed out many times before, are stupid and do only what we tell them to do). If the typing system hampers control over that manipulation, it hampers the effectiveness of the language. In this, I have to agree with Ludwig Wittgenstein, who said, “The limits of my language mean the limits of my world.”