Archive for December 2007

Google Reader and Yahoo! Pipes

I ran into a situation recently that I thought I’d share. I use Google Reader to manage the feeds that I read regularly. PHPDeveloper.org is among my favorite news syndication web sites. However, some of its posts, in particular those dealing with job posts or additions to CakePHP’s Bakery, aren’t interesting to me.

Eventually, I came to the conclusion that I could wrap the feed in a Yahoo! Pipe in order to filter out the uninteresting information. (I know, the irony of using Google and Yahoo products together might seem anything from ironic to downright unholy to some.)

Unfortunately, doing so meant that I had to remove the original PHPDeveloper.org feed from Google Reader and add the new pipe-wrapped feed in its place. Because (as best I can tell) certain things are tracked per feed rather than per URL (old items) or per item (read statuses), this meant losing all information specific to the old feed.

Granted, I only had to do this once, but I wish it had occurred to me earlier. Google Reader may have search capability (which took forever to be included), but that’s not the same as being able to have content filtering automatically handled for me whenever I view the contents of a feed.

So my line of thought continued. It would be nice if there was an easy way to maintain the user experience of adding feeds through my preferred browser, Mozilla Firefox, but to have new feeds be automatically wrapped in a Yahoo! Pipe “behind the scenes.” This would allow me to go back and manipulate feed content later if I saw a repeating pattern in specific content that didn’t interest me.

Another unfortunate trait of this situation is that Yahoo! Pipes doesn’t currently offer a web service API, or it might make implementing my idea significantly easier. While the AJAX interface exposes server interaction logic, it’s obfuscated to the point where it makes reverse engineering attempts infeasible. It’s unfortunate, because I think a marriage of the features of each of these services would make the result all the more useful for their users.

More Oracle and Java Woes

Today I continued the trek toward completing the project described in my last entry. Though I don’t think I ran into as many issues today as I did in the past week or so of working on the project, today certainly had it’s fair share.

First up was a rather interesting exception being thrown by a JDBC operation, namely “java.sql.SQLException: SQL string is not Query.” This is apparently intended to be JDBC’s way of explaining that PreparedStatement.executeQuery() doesn’t work for DML operations. To execute one of those, you have to use either execute() or executeUpdate(). Thankfully, a forum thread was able to point me in the right direction on that one.

Next on the list, if Oracle JDeveloper 10.1.3.3.0 tells you “The WAR file is already up to date,” don’t believe it! I don’t know what logic it’s using to decide whether or not the class files constituting a WAR file are out-of-date, but there are definitely some cases where it’s flawed. I spent a better part of the morning trying to figure out why everything from undeploying and redeploying the EAR file to bouncing the OAS installation was still giving me illogical output. Come to find out, I didn’t know the WAR file not being updated was relevant to the problem at the time, but it certainly proved to be in the end! Tried searching for a bug report on this, but came up empty, so maybe it’s just me.

Last but not least, I take issue with the language used in the mod_plsql User’s Guide to describe its process of file upload handling. Though it never explicitly states this, it seems to imply that the internal handling of performing an INSERT operation to place data for an uploaded file into the document table takes place in a separate transaction from that of the action procedure that gets executed afterward.

You have to go to the PL/SQL User’s Guide to read why this is not the case. To sum it up, a transaction can span multiple procedures. A procedure being executed as a data cartridge operates within a transaction that is implicitly committed when that procedure terminates so long as no uncaught exceptions are raised. However, until that point, the effects of any DML operations executed are only visible to the procedure. This includes the INSERT procedure performed by mod_plsql on the document table. What this effectively means is that the only way something other than the procedure can see that the inserted record exists unless the procedure does an explicit COMMIT.

If you read my last post, you know that I was calling a servlet from the data cartridge. You can probably imagine the amount of aggravation this caused me when I ran my servlet locally without issue, had to backtrack to figure out where the servlet was failing when it was deployed, and then found out that a single COMMIT statement at the beginning of my data cartridge procedure made things work as expected. So, yay for lacking Oracle documentation.

I did get the servlet working, though. It can now pull data from the database, convert it from Excel binary to CSV format, and put the converted data back into the database. So, the Clean Content API, while not specifically designed for the purpose for which I’m using it, is at least a somewhat capable solution. That basically sums up my day, folks. I’ll be back on Monday to do it again.

Extracting Data from Excel with Oracle Clean Content

I got assigned an interesting project at work recently. It involved receiving a file upload via PL/SQL. This in itself is relatively trivial and easy to accomplish when running data cartridges on Oracle Application Server via mod_plsql. What was less unremarkable about the nature of the task was that the uploaded file was intended to be a Microsoft Excel binary file containing a single worksheet. Unfortunately, PL/SQL isn’t so divergent in its available native packages that it has readily available functionality to easily handle this situation.

Luckily, my boss had recently visited the annual Oracle OpenWorld conference and while there learned of a new technology of theirs that could help: the Outside In Clean Content API. I’m uncertain as to whether this product came under Oracle’s branding as the result of a merger, buyout, partnership, or what have you. After poking around the net, I saw that it has thus far received very little coverage, presumably because it was a relatively new release.

Clean Content’s primary purpose is to “identify and remove sensitive, confidential or proprietary metadata and hidden information from Microsoft Office documents.” Of course, to be able to accomplish this, it needs to be capable of extracting said data from these document formats. As a side feature of sorts, they expose this functionality in their API, which is available in the form of C++, C#, and Java libraries.

Originally, when I began work on the project, the requirements stated that the uploaded file would be in CSV format. The format requirement was changed later, after I had developed a prototype capable of handling a CSV file. To adapt my existing work to this new requirement, I developed a Java servlet to supplement it, which the data cartridge would call using UTL_HTTP.REQUEST.

This servlet received the name of an uploaded Excel file, used JDBC to pull the binary data from the local database, used the Clean Content API to convert it to CSV format, and used JDBC again to put the converted data back into the same table. It didn’t end up amounting to much in the way of LOC, but it did require some learning on my part.

First off, the Clean Content API is structured in a SAX-like fashion. The best resources to learn it are actually both included in the free download: the Developer’s Guide and the JavaDoc API documentation. Examples in the former show how to restrict the API to analysis only (i.e. not modifying the document data), provide in-memory data to the API (via a ByteBuffer), and how to specify a handler class to intercept events. You may have to peruse several examples to find all this out, but it’s all there if you take the time to read through it (and selectively skip all the parts having to do with document manipulation).

Your handler class has to extend the BaseElementHandler or GenericElementHandler class in the API. I recommend the latter during development, as its start() method can help in the debugging process by showing you what data is being extracted.

The startTextCell() method will indicate when the parser is within a spreadsheet cell containing textual data. However, the TextCellElement it receives contains only coordinate information, not the value of the cell. (Quick note: the coordinate system is 0-based, meaning that the coordinates of the first cell of the spreadsheet are 0, 0.) To actually capture the text, you have to use the text() method. This is a little confusing, but the reason is that it’s possible to encounter textual metadata outside of the spreadsheet cells. A simple class flag property can be used so you know when you are or aren’t within spreadsheet cells when this event occurs.

The startDataCell() method indicates when numeric data is encountered. Something worth mentioning here is that the Excel binary format houses dates as integers. To convert such a number back to its equivalent date, take the date 1/1/1900 and add that number of days to it using GregorianCalendar.add(). An example of this is 39,085, which corresponds to 1/3/2007. You can further format this further by passing the return value of GregorianCalendar.getTime() to SimpleDateFormat.format().

One oddity I ran into during development that was unrelated to the Clean Content API was with the JDBC library. I executed a SELECT query, got back a ResultSet object, and then attempted to call ResultSet.getBytes() to place the value of a BLOB column into a byte array. This was so I could pass that to ByteBuffer.wrap() to be used with the Clean Content API later. However, the returned byte array always came back severely truncated judging by its length and the fact that the Clean Content API could not determine the document type based on it. I wasn’t able to get around to examining the content byte by byte to determine the cause of this, but I did find a solution: ResultSet.getBlob() returns a Blob object and Blob.getBytes() returns the needed (complete) byte array. Apparently Oracle condones this method of obtaining the value, so rather than beat myself up trying to figure out the weirdness that is this situation, I followed the well-beaten path.

Beyond troubleshooting these oddities, along with relearning how to write servlets and learning how to test them in Oracle JDeveloper and deploy them using Oracle Enterprise Manager (and running into this issue in the process), the process of implementing these project requirements was pretty straightforward. Hope my learning experiences end up helping someone else out there. I’m sure there are other existing solutions that could have been applied here, but if nothing else, it showed that there’s more than one way to skin this cat.

Web Scraping Article Published

Just a quick post to announce (albeit a little late) the December 2007 issue of php|architect, which includes my article on web scraping. Please buy a copy, give it a read, and feel free to post comments on the forum thread for the article. I’d love to hear some reader feedback!

You may noticed that I’ve added a new page for publications. This will become the home for any content I produce that gains any sort of recognition, be it a podcast, article, book review, presentation slides, or what have you. Anytime anything new goes there, I’ll try to make a point to write a post about it.

The Acme of Skill

OK, I know I promised a post on how NULL in Oracle scares me, but I think I’ll save that for another day. For the moment, I’ve had something else on my mind recently. Someone I know is apparently of the opinion that PHP is “on the way out.” I have to vehemently disagree with this, and not just because PHP is my language of preference.

For starters, there are major corporations that are actively using PHP. Yahoo, current employer of Sara Golemon, is a great example. Facebook, a social networking site whose advertising program threatens Google Adsense enough that they created the Open Social initiative and brought in other companies in order to compete, is another.

While usage of PHP took a slight dip about two years ago, probably due in part to hype growing popularity and advances in other technologies like .NET and Ruby as well as the low adoption rate of PHP 5, its use is back on the rise. The performance improvements and addition of new OOP features are only making PHP a better, more well-rounded solution for the enterprise.

Major corporations are finally getting away from fearing competition from the open source community and are starting to embrace it. Oracle has collaborated with Zend to enable them to produce Zend Core for Oracle. The Oracle Technology Network web site has a dedicated section for PHP Developers as well as a manual and a cookbook. Oracle develops the OCI8 PHP PDO driver and have made fairly recent updates to it to support database resident connection pooling, fast application notification, and other notable 11g features. (Check out Chris Jones’ blog for more info on that project.) Oracle is also beginning to release a substantial number of projects, particularly developer tools, as open source software.

Oracle isn’t the only one, either. Microsoft has even started getting into the game. A FastCGI add-on is now available for IIS 6, Microsoft’s web server. FastCGI is frequently used when Apache is not being run or running PHP as an Apache module is not an option for other reasons, such as shared hosting services that want to support both PHP 4 and PHP 5 on the same machine, so this is quite the boon for Microsoft shops. Microsoft is getting involved in the production of a new PHP database driver for SQL Server 2005. I can say on personal authority that multiple Microsoft representatives were present at ZendCon 2007 and made a presentation to the conference attendees on that very subject.

So this all seriously begs this question: why are these corporations, some of which have been specifically shown to be opposed to open source, now trying to play nice? Keep your friends close and your enemies closer, anyone? If you can’t beat them, join them? OK, enough cliche anecdotes. I think I’ve made my point here. PHP isn’t going anywhere and it’s certainly not “on the way out.” These companies are putting a significant time and energy into supporting integration with their products by open source software and I don’t think they’d make that investment if PHP’s overall outlook was limited to the short-term.

I don’t believe it was ever the specific intention of the open source movement to compete with large companies and their proprietary products, but merely to fill a gap in software needs perceived by the consumer. As such, I find a particular quote by Sun Tzu, author of “The Art of War,” to be appropriate here: “For to win one hundred victories in one hundred battles is not the acme of skill. To subdue the enemy without fighting is the acme of skill.”