Archive for the ‘Uncategorized’ Category.

IE6 Compatibility Testing via Virtualization

In my last entry, I noted that I was testing my use of SWFUpload in both IE6 and IE7. You may wonder how I managed this. Google has plenty of information on “hack” methods to get both versions of IE to coexist on a single XP installation.

If you have the machine power for it, though, there is a method that is actually supported by Microsoft to accomplish this. It employs virtualization, which is becoming increasingly popular in the computing world. Virtualization is useful for two applications in particular: making server environment installations independent of the host operating system and hardware, and allowing multiple operating systems to coexist on the same hardware without the need to partition storage devices.

In 2003, Microsoft bought out a company called Connectix which specialized in virtualization software, one of their main products being Virtual PC. Microsoft subsequently released a rebranded version of VPC as a free download.

Once the need to test applications on IE6 and IE7 for cross-compatibility was realized, Microsoft also began sporadically releasing a freely downloadable series of up-to-date Windows XP images with IE6 pre-installed and expiration dates after which the images would no longer function. The latest image is set to expire in early June 2008.

With Microsoft pushing adoption of Windows Vista and IE7, it’s uncertain as to whether or not Microsoft will continue releasing these VPC images. However, it does appear that Microsoft plans to continue developing Virtual PC. Therefore, if you have an extra Windows XP license to spare, this arrangement is a nice and relatively lightweight solution to making IE6 and IE7 available for testing on the same machine.

How-To (and How-Not-To) on Web Scraping

A friend of mine who shall remain nameless pointed a post out to me on the PHP DZone web site recently. Noting that the article’s content was misinformed at best and downright ignorant at worst, even when examining it sheerly from the author’s knowledge of PHP as a language, this friend asked that I set the author straight.

I gladly obliged with a comment on the post, having become somewhat of an authority on the application topic myself. As much of an unorthodox practice as web scraping may be, there are some methodologies for it that are obviously better than others. The aforementioned post illustrates a lot of the ones to avoid, and my comment on it lays out my arguments against them.

Later, I randomly encountered a post on the blog at xml.lt on the topic of web scraping using the DOM extension. This article showcases recommended practices and reasoned arguments against bad (and unfortunately common) alternatives. The author comes across as being significantly more informed on both the language and the application in the article’s content and code examples.

If you’re looking for references on the topic of web scraping with PHP, there’s always the article I wrote for the December 2007 issue of php|architect magazine, of which you can still purchase an electronic copy in PDF format. At some point, I also hope to write a short book on the subject. Until then, if you have related questions, you can generally reach me in the #phpc channel on Freenode, under the nick Elazar. I’m always glad to give out advice on web scraping and PHP, as I’m sure my good friend Jared Folkins (who is also my “Little Sis” from the PHPWomen Big Sis/Little Sis mentoring program) will attest.

Graphs and Relational Databases

Situations involving hierarchical data and relational databases are quite common in web applications. Trees lend themselves quite well to providing organizational structures for a web site, such as sitemaps and breadcrumb trails. A slightly less common and different type of situation, where the application is just as useful and a solution is a bit more complex to derive, is one involving graphical data (as in graphs, not graphics) and relational databases. These situations have issues like the shortest path problem and find their solutions in graph theory, in algorithms such as A* or Dijkstra’s algorithm. An example of such a situation is an airline web site that requires the ability to locate connecting and round-trip flights and find the flight path with the lowest cost in terms of time or ticket price.
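To make the airline example a little more concrete, here's a minimal sketch of Dijkstra's algorithm in Java. The FlightGraph class, the airport codes, and the fares are all made up for illustration, and the graph lives in memory rather than in a database table, so treat it as a sketch of the idea and not a recommended design.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical example class: a directed graph of flight legs with a cost per leg.
public class FlightGraph {
    // Adjacency list: origin -> (destination -> cost of that leg)
    private final Map<String, Map<String, Integer>> legs = new HashMap<>();

    public void addLeg(String from, String to, int cost) {
        legs.computeIfAbsent(from, k -> new HashMap<>()).put(to, cost);
    }

    // Dijkstra's algorithm: returns the lowest total cost from origin to
    // destination, or -1 if the destination cannot be reached.
    public int cheapest(String origin, String destination) {
        Map<String, Integer> best = new HashMap<>();
        Comparator<Map.Entry<String, Integer>> byCost =
            (a, b) -> Integer.compare(a.getValue(), b.getValue());
        PriorityQueue<Map.Entry<String, Integer>> queue = new PriorityQueue<>(byCost);

        best.put(origin, 0);
        queue.add(new SimpleEntry<>(origin, 0));

        while (!queue.isEmpty()) {
            Map.Entry<String, Integer> current = queue.poll();
            String airport = current.getKey();
            int cost = current.getValue();
            if (cost > best.get(airport)) {
                continue; // stale queue entry; a cheaper route was already found
            }
            if (airport.equals(destination)) {
                return cost;
            }
            Map<String, Integer> outgoing =
                legs.getOrDefault(airport, Collections.<String, Integer>emptyMap());
            for (Map.Entry<String, Integer> leg : outgoing.entrySet()) {
                int candidate = cost + leg.getValue();
                Integer known = best.get(leg.getKey());
                if (known == null || candidate < known) {
                    best.put(leg.getKey(), candidate);
                    queue.add(new SimpleEntry<>(leg.getKey(), candidate));
                }
            }
        }
        return -1;
    }
}
```

With legs PHX to DFW at 120, DFW to JFK at 150, and PHX to JFK at 400, cheapest("PHX", "JFK") picks the connecting route for a total of 270. Swapping ticket price for flight time only changes what you pass as the cost.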

If you use MySQL, this chapter from “Get It Done with MySQL 5” is a fairly verbose but comprehensive guide to using MySQL to store graphical data. It includes background information such as terminology used in graph theory and has numerous implementation examples of adjacency list graph models, nested set graph models, and breadth-first and depth-first graph search algorithms.

For Oracle users, there’s a slightly more application-oriented tutorial that assumes more theoretical knowledge on the part of the reader. It shows that Oracle’s hierarchical data features unfortunately can’t be used in cases where cycles might exist in graphs (which is handy if you’re trying to detect them) and then goes on to show an implementation that uses temporary tables to store a summary of a graph analysis. A worthy side note is that part 2 of that tutorial deals with a more specialized approach using state machines that may or may not be applicable to your situation.

If you’d like more information on this topic, a good place to look is your local university. Most with a computer science program offer a course in theory of computation, which deals with topics like these as well as context-free grammars in the context of developing programming languages. Even if you never actually use this information to develop a language yourself, it can still serve a good purpose: it makes you more informed when engaging in discussions about language development, and it can increase your appreciation for the beauty of a language from a user perspective.

Goodbye WordPress, Hello Habari

So after eventually getting fed up with WordPress, especially after the WYSIWYG editor disappeared in the 2.3.3 update, I finally decided to bite the bullet and migrate my blog over to Habari. Once I’d been through the process, I thought I’d write a short blog entry about the experience.

First, there was the matter of content. Though it wasn’t as easy or intuitive as it could have been to track down how to migrate content from WordPress, once I knew how, it was a snap. Simply go to Admin > Plugins, activate the WordPress Importer plugin (which comes bundled with the release), then go to Admin > Import and you’ll have a WordPress Database option. From that point, it’s just a matter of putting in the authentication credentials to point Habari at the WordPress database and it seamlessly imports all your data into the Habari database.

Next came making Habari support my existing URL scheme from WordPress. It turns out that Habari has a database table for rewrite rules, but currently no section of the admin area to manage it. Ergo, the only way to add to or change these is to do it manually. Luckily, there was a blog entry from Michael Harris that detailed all this and even provided the exact INSERT statement needed.

After that came my blog theme. If the Habari developers are ex-WordPress developers as I’ve heard, they must not have liked the WordPress API much, because the two sure are different. This made theme migration look cumbersome enough that I decided to simply retire my old blog theme in favor of a slightly tweaked version of one of the stock themes available for Habari, namely Whitespace.

Finally, there were plugins. I wanted to continue using Akismet to manage content spam, as that had tended to serve me well while I was using WordPress. Luckily, Chris Davis has created an Akismet plugin. I downloaded the archive into /user/plugins, decompressed it, and then had to dig around in the plugin’s PHP file and add in my WordPress API key and blog URL. It would be nice if this was updated to use the configuration API that Habari offers for plugins. I tried the Blogroll plugin and didn’t really care for its interface. In that particular area, I actually liked how WordPress did things.

I experienced two particularly strange things during the process of migrating my blog. One occurred when I tried to swap out directories to make the new Habari-based version of my blog live. When I did that, all plugins mysteriously deactivated. I had to go back into Admin > Plugins and reactivate them individually. They all seemed to retain their settings, at least.

The other oddity happened after I activated the TinyMCE plugin so that I could use a browser-based WYSIWYG interface to edit content. The dashboard screen in the admin area (and only that screen, from what I can tell) started throwing an “exception without a stack frame” error. I’ve e-mailed the author on that one, so we’ll see what happens.

Overall, though, I’m very satisfied with Habari and look forward to using it to catch up on the backlog of post ideas I’ve managed to build up over the past few weeks.

The Yin and Yang of Typing

Without a little background in programming languages or computer science in general, it’s entirely possible that typing systems are not something that has crossed your mind. I thought I’d take a blog entry to share some of my thoughts on how they’re affecting the creation and evolution of languages.

First of all, Benjamin C. Pierce probably has a point: terminology used to refer to typing concepts is about as useful as buzzwords like AJAX or Web 2.0 these days. Be that as it may, I’m going to reach back into the recesses of what I recall from the programming languages course I took in college to recall some of this terminology.

If you aren’t familiar with static versus dynamic typing or strong versus weak typing, it may be worth it to read up on those before proceeding with the rest of this blog entry. Here are a few examples of each:

  • Static/weak – C
  • Static/strong – Java
  • Dynamic/weak – PHP
  • Dynamic/strong – Python

The line between strong and weak typing seems to be blurring as languages like these evolve. The reason for this is that each side of typing has its advantages. Strong typing, particularly when paired with static typing, allows for compile-time checking, which can serve to eliminate human error, as well as performance optimizations from being aware of types at compile-time. It can also serve to make source code more intuitive to follow in some respects. Weak typing, on the other hand, can allow for higher levels of abstraction and, by proxy, the need for less code in order to allow identical operations to be executed on multiple types. It can also allow for things like variable variables, variable functions, and other interesting features not possible in strongly-typed languages.

Yet languages on either side of the proverbial fence are drawing in strengths from the other side. Java, previously limited to the flexibility that could be provided by polymorphism while still maintaining strong typing, introduced generics in 1.5, whereby typing was still enforced but a higher level of logic abstraction was enabled for developers. By the same token, PHP has had explicit typecasting for a while and more recently introduced type hinting for object types in 5.0 and array types in 5.1 (which may extend to scalar types in later versions). C# 3.0, which ships alongside .NET 3.5, adds type inferencing, which while it’s only syntactic sugar at least alleviates the need for verbosity when performing the most common method of initialization (i.e. setting a variable of a given class to an object instance of that class, as opposed to one involving a subclass of one or more of the classes involved).
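As a small illustration of the Java side of that convergence, here's a sketch of a generic method. The class and method names (GenericsDemo, last) are my own, chosen for the example. One implementation serves any element type, yet the compiler still enforces the types at every call site.

```java
import java.util.Arrays;
import java.util.List;

public class GenericsDemo {
    // T is a type parameter: the method works for a List of anything,
    // and the return type is tied to the list's element type.
    public static <T> T last(List<T> items) {
        if (items.isEmpty()) {
            throw new IllegalArgumentException("list must not be empty");
        }
        return items.get(items.size() - 1);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Ada", "Grace");
        List<Integer> numbers = Arrays.asList(1, 2, 3);

        String name = last(names);   // no cast needed
        int number = last(numbers);  // Integer result, unboxed to int

        // String oops = last(numbers); // would be rejected at compile time

        System.out.println(name + " " + number);
    }
}
```

Before generics, a method like this would have had to return Object and leave the cast (and the chance of a ClassCastException at run time) to the caller.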

It’s also becoming commonplace for dynamically typed language interpreters to get ported to Java and .NET in order to leverage the features of those languages and the native libraries of the host language in the existing execution environment. Take these examples for instance.

In short, some level of control over typing is obviously a desired feature in any useful language. As well, I don’t think a language can be truly useful without having a bit of both worlds to some degree. The reason for the existence of programming languages is to enable developers to control machines whose primary purpose is to manipulate data (and, as has been pointed out many times before, are stupid and do what we tell them to do). If control over said manipulation is hampered by the typing system, it hampers the effectiveness of the language. In this, I have to agree with Ludwig Wittgenstein, who said, “The limits of my language mean the limits of my world.”

Zend Framework and Remember The Milk

I’ve posted a few times on Twitter related to my latest project and a few people have already asked me about it, so I figured it was worth a blog post.

My first project for the Zend Framework was Zend_Service_Simpy, a service module providing a lightweight wrapper around the API for the Simpy social bookmarking service.

My latest project is another service module for the Zend Framework. This time, though, it’s for the Remember The Milk API. RTM is basically a TODO list on serious steroids. It’s the Swiss Army Knife of task management. It allows you to manage multiple lists of tasks. You can add them easily from a variety of mediums, tag them, prioritize them, set deadlines for them, have them repeat, get reminders for them, tie them to physical real world locations, and share them. RTM offers great support for integration with Google applications including Google Calendar, iGoogle, and Gmail (plus offline access powered by Google Gears). They’re also very big into supporting mobile devices, including those running on Windows Mobile as well as the iPhone.

If you like, you can check out my original proposal for this module. I can already say that the API will end up changing a little, but it’s good enough to give you a general idea of what the capabilities of the finished service module will be. I only actively started implementation recently and things are progressing at a fairly rapid pace. I still have unit tests and documentation to handle, but hopefully there’s a shot at seeing it moved to core within the next two releases of the framework.

Google Reader and Yahoo! Pipes

I ran into a situation recently that I thought I’d share. I use Google Reader to manage the feeds that I read regularly. PHPDeveloper.org is among my favorite news syndication web sites. However, some of its posts, in particular those dealing with job posts or additions to CakePHP’s Bakery, aren’t interesting to me.

Eventually, I came to the conclusion that I could wrap the feed in a Yahoo! Pipe in order to filter out the uninteresting information. (I know, using Google and Yahoo products together might seem anything from ironic to downright unholy to some.)

Unfortunately, doing so meant that I had to remove the original PHPDeveloper.org feed from Google Reader and add the new pipe-wrapped feed in its place. Because (as best I can tell) certain things are tracked per feed rather than per URL (old items) or per item (read statuses), this meant losing all information specific to the old feed.

Granted, I only had to do this once, but I wish it had occurred to me earlier. Google Reader may have search capability (which took forever to be included), but that’s not the same as being able to have content filtering automatically handled for me whenever I view the contents of a feed.

So my line of thought continued. It would be nice if there was an easy way to maintain the user experience of adding feeds through my preferred browser, Mozilla Firefox, but to have new feeds be automatically wrapped in a Yahoo! Pipe “behind the scenes.” This would allow me to go back and manipulate feed content later if I saw a repeating pattern in specific content that didn’t interest me.

Another unfortunate trait of this situation is that Yahoo! Pipes doesn’t currently offer a web service API, which would make implementing my idea significantly easier. While the AJAX interface exposes server interaction logic, it’s obfuscated to the point where it makes reverse engineering attempts infeasible. It’s unfortunate, because I think a marriage of the features of each of these services would make the result all the more useful for their users.

More Oracle and Java Woes

Today I continued the trek toward completing the project described in my last entry. Though I don’t think I ran into as many issues today as I did in the past week or so of working on the project, today certainly had its fair share.

First up was a rather interesting exception being thrown by a JDBC operation, namely “java.sql.SQLException: SQL string is not Query.” This is apparently intended to be JDBC’s way of explaining that PreparedStatement.executeQuery() doesn’t work for DML operations. To execute one of those, you have to use either execute() or executeUpdate(). Thankfully, a forum thread was able to point me in the right direction on that one.
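For reference, here's a minimal sketch of running a DML statement through JDBC. The DmlExample class and runDml method are names I made up for illustration, and the table in the usage note is hypothetical too. The point is simply that the statement goes through executeUpdate(), which returns the affected row count, and not through executeQuery().

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper class for illustration.
public class DmlExample {
    // Runs an INSERT, UPDATE, or DELETE and returns the number of affected rows.
    // executeQuery() is only for statements that produce a ResultSet.
    public static int runDml(Connection conn, String sql, Object... params)
            throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            for (int i = 0; i < params.length; i++) {
                stmt.setObject(i + 1, params[i]); // JDBC parameter indexes are 1-based
            }
            return stmt.executeUpdate();
        }
    }
}
```

A call might look like runDml(conn, "UPDATE uploads SET status = ? WHERE id = ?", "done", 7), where uploads is a made-up table. If you don't know ahead of time whether a statement returns rows, execute() handles both cases and tells you which one you got.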

Next on the list, if Oracle JDeveloper 10.1.3.3.0 tells you “The WAR file is already up to date,” don’t believe it! I don’t know what logic it’s using to decide whether or not the class files constituting a WAR file are out-of-date, but there are definitely some cases where it’s flawed. I spent the better part of the morning trying to figure out why everything from undeploying and redeploying the EAR file to bouncing the OAS installation was still giving me illogical output. As it turned out, I hadn’t realized at the time that the stale WAR file was relevant to the problem, but it certainly proved to be in the end! I tried searching for a bug report on this, but came up empty, so maybe it’s just me.

Last but not least, I take issue with the language used in the mod_plsql User’s Guide to describe its process of file upload handling. Though it never explicitly states this, it seems to imply that the internal handling of performing an INSERT operation to place data for an uploaded file into the document table takes place in a separate transaction from that of the action procedure that gets executed afterward.

You have to go to the PL/SQL User’s Guide to read why this is not the case. To sum it up, a transaction can span multiple procedures. A procedure being executed as a data cartridge operates within a transaction that is implicitly committed when that procedure terminates so long as no uncaught exceptions are raised. However, until that point, the effects of any DML operations executed are only visible to the procedure. This includes the INSERT operation performed by mod_plsql on the document table. What this effectively means is that nothing other than the procedure can see that the inserted record exists unless the procedure does an explicit COMMIT.

If you read my last post, you know that I was calling a servlet from the data cartridge. You can probably imagine the amount of aggravation this caused me when I ran my servlet locally without issue, had to backtrack to figure out where the servlet was failing when it was deployed, and then found out that a single COMMIT statement at the beginning of my data cartridge procedure made things work as expected. So, yay for lacking Oracle documentation.

I did get the servlet working, though. It can now pull data from the database, convert it from Excel binary to CSV format, and put the converted data back into the database. So, the Clean Content API, while not specifically designed for the purpose for which I’m using it, is at least a somewhat capable solution. That basically sums up my day, folks. I’ll be back on Monday to do it again.

Extracting Data from Excel with Oracle Clean Content

I got assigned an interesting project at work recently. It involved receiving a file upload via PL/SQL. This in itself is relatively trivial and easy to accomplish when running data cartridges on Oracle Application Server via mod_plsql. What was more remarkable about the nature of the task was that the uploaded file was intended to be a Microsoft Excel binary file containing a single worksheet. Unfortunately, PL/SQL isn’t so diverse in its available native packages that it has readily available functionality to easily handle this situation.

Luckily, my boss had recently visited the annual Oracle OpenWorld conference and while there learned of a new technology of theirs that could help: the Outside In Clean Content API. I’m uncertain as to whether this product came under Oracle’s branding as the result of a merger, buyout, partnership, or what have you. After poking around the net, I saw that it has thus far received very little coverage, presumably because it was a relatively new release.

Clean Content’s primary purpose is to “identify and remove sensitive, confidential or proprietary metadata and hidden information from Microsoft Office documents.” Of course, to be able to accomplish this, it needs to be capable of extracting said data from these document formats. As a side feature of sorts, they expose this functionality in their API, which is available in the form of C++, C#, and Java libraries.

Originally, when I began work on the project, the requirements stated that the uploaded file would be in CSV format. The format requirement was changed later, after I had developed a prototype capable of handling a CSV file. To adapt my existing work to this new requirement, I developed a Java servlet to supplement it, which the data cartridge would call using UTL_HTTP.REQUEST.

This servlet received the name of an uploaded Excel file, used JDBC to pull the binary data from the local database, used the Clean Content API to convert it to CSV format, and used JDBC again to put the converted data back into the same table. It didn’t end up amounting to much in the way of LOC, but it did require some learning on my part.

First off, the Clean Content API is structured in a SAX-like fashion. The best resources to learn it are actually both included in the free download: the Developer’s Guide and the JavaDoc API documentation. Examples in the former show how to restrict the API to analysis only (i.e. not modifying the document data), how to provide in-memory data to the API (via a ByteBuffer), and how to specify a handler class to intercept events. You may have to peruse several examples to find all this out, but it’s all there if you take the time to read through it (and selectively skip all the parts having to do with document manipulation).

Your handler class has to extend the BaseElementHandler or GenericElementHandler class in the API. I recommend the latter during development, as its start() method can help in the debugging process by showing you what data is being extracted.

The startTextCell() method will indicate when the parser is within a spreadsheet cell containing textual data. However, the TextCellElement it receives contains only coordinate information, not the value of the cell. (Quick note: the coordinate system is 0-based, meaning that the coordinates of the first cell of the spreadsheet are 0, 0.) To actually capture the text, you have to use the text() method. This is a little confusing, but the reason is that it’s possible to encounter textual metadata outside of the spreadsheet cells. A simple class flag property can be used so you know when you are or aren’t within spreadsheet cells when this event occurs.
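The flag approach can be sketched roughly as follows. To be clear, this is a stand-in class of my own and not the real Clean Content API: the method names mirror the events described above, but the parameters are simplified guesses, and endTextCell() is a hypothetical counterpart event. Check the JavaDoc for the actual signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in sketch, NOT the real Clean Content handler classes. In real code
// this logic would live in a class extending the API's handler base class.
public class CellTextCollector {
    private boolean inTextCell = false; // true only while inside a text cell
    private int row;
    private int column;
    private final List<String> cells = new ArrayList<>();

    // Fired when a text cell starts; only the coordinates are known here.
    public void startTextCell(int row, int column) {
        this.inTextCell = true;
        this.row = row;
        this.column = column;
    }

    // Fired for any text, including metadata outside of spreadsheet cells.
    public void text(String value) {
        if (inTextCell) {
            cells.add(row + "," + column + "=" + value);
        }
    }

    // Hypothetical counterpart event that marks the end of the cell.
    public void endTextCell() {
        inTextCell = false;
    }

    public List<String> getCells() {
        return cells;
    }
}
```

Text that arrives while the flag is off (document metadata, for instance) is simply ignored, so only cell values end up in the list.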

The startDataCell() method indicates when numeric data is encountered. Something worth mentioning here is that the Excel binary format houses dates as serial day numbers. To convert such a number back to its equivalent date, take the date 12/30/1899 and add that number of days to it using GregorianCalendar.add(). (Excel numbers 1/1/1900 as day 1 and also counts a nonexistent 2/29/1900, so 12/30/1899 is the base that works for any date after February 1900.) An example of this is 39,085, which corresponds to 1/3/2007. You can format this further by passing the return value of GregorianCalendar.getTime() to SimpleDateFormat.format().
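Here's a minimal sketch of that conversion, assuming the number you get is the standard Excel serial day value. The ExcelDates class and serialToDate method are names I made up. Note that the sketch uses December 30, 1899 as the base date: Excel numbers January 1, 1900 as day 1 and also counts a nonexistent February 29, 1900, so for any date after February 1900 that is the base that makes the arithmetic work out.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical helper class for illustration.
public class ExcelDates {
    // Converts an Excel serial day number (valid for dates after February 1900)
    // into a string such as "1/3/2007".
    public static String serialToDate(int serial) {
        // Effective day zero for Excel's 1900 date system.
        GregorianCalendar calendar = new GregorianCalendar(1899, Calendar.DECEMBER, 30);
        calendar.add(Calendar.DAY_OF_MONTH, serial);

        SimpleDateFormat format = new SimpleDateFormat("M/d/yyyy", Locale.US);
        return format.format(calendar.getTime());
    }

    public static void main(String[] args) {
        System.out.println(serialToDate(39085)); // prints 1/3/2007
    }
}
```

The fractional part of a serial value, if present, represents the time of day, which this sketch ignores.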

One oddity I ran into during development that was unrelated to the Clean Content API was with the JDBC library. I executed a SELECT query, got back a ResultSet object, and then attempted to call ResultSet.getBytes() to place the value of a BLOB column into a byte array. This was so I could pass that to ByteBuffer.wrap() to be used with the Clean Content API later. However, the returned byte array always came back severely truncated judging by its length and the fact that the Clean Content API could not determine the document type based on it. I wasn’t able to get around to examining the content byte by byte to determine the cause of this, but I did find a solution: ResultSet.getBlob() returns a Blob object and Blob.getBytes() returns the needed (complete) byte array. Apparently Oracle condones this method of obtaining the value, so rather than beat myself up trying to figure out the weirdness that is this situation, I followed the well-beaten path.
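In code, the workaround looks something like the sketch below. The BlobReader class and its method names are mine, invented for the example. The JDBC calls themselves (ResultSet.getBlob(), Blob.length(), and Blob.getBytes()) are standard.

```java
import java.sql.Blob;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper class for illustration.
public class BlobReader {
    // Reads an entire BLOB into memory. Fine for modest uploads like a
    // spreadsheet; very large values should be streamed instead.
    public static byte[] readAll(Blob blob) throws SQLException {
        long length = blob.length();
        if (length == 0) {
            return new byte[0];
        }
        if (length > Integer.MAX_VALUE) {
            throw new SQLException("BLOB is too large for a single byte array");
        }
        // Blob positions are 1-based, unlike Java array indexes.
        return blob.getBytes(1, (int) length);
    }

    // Fetches the named BLOB column from the current row of the result set.
    public static byte[] readColumn(ResultSet rs, String column) throws SQLException {
        Blob blob = rs.getBlob(column);
        return blob == null ? null : readAll(blob);
    }
}
```

The resulting array can then be handed to ByteBuffer.wrap() as described above.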

Beyond troubleshooting these oddities, along with relearning how to write servlets and learning how to test them in Oracle JDeveloper and deploy them using Oracle Enterprise Manager (and running into this issue in the process), the process of implementing these project requirements was pretty straightforward. Hope my learning experiences end up helping someone else out there. I’m sure there are other existing solutions that could have been applied here, but if nothing else, it showed that there’s more than one way to skin this cat.

Web Scraping Article Published

Just a quick post to announce (albeit a little late) the December 2007 issue of php|architect, which includes my article on web scraping. Please buy a copy, give it a read, and feel free to post comments on the forum thread for the article. I’d love to hear some reader feedback!

You may have noticed that I’ve added a new page for publications. This will become the home for any content I produce that gains any sort of recognition, be it a podcast, article, book review, presentation slides, or what have you. Anytime anything new goes there, I’ll try to make a point to write a post about it.