Extracting Data from Excel with Oracle Clean Content

I got assigned an interesting project at work recently. It involved receiving a file upload via PL/SQL. This in itself is relatively trivial and easy to accomplish when running data cartridges on Oracle Application Server via mod_plsql. What was less unremarkable about the nature of the task was that the uploaded file was intended to be a Microsoft Excel binary file containing a single worksheet. Unfortunately, PL/SQL isn’t so divergent in its available native packages that it has readily available functionality to easily handle this situation.

Luckily, my boss had recently visited the annual Oracle OpenWorld conference and while there learned of a new technology of theirs that could help: the Outside In Clean Content API. I’m uncertain as to whether this product came under Oracle’s branding as the result of a merger, buyout, partnership, or what have you. After poking around the net, I saw that it has thus far received very little coverage, presumably because it was a relatively new release.

Clean Content’s primary purpose is to “identify and remove sensitive, confidential or proprietary metadata and hidden information from Microsoft Office documents.” Of course, to be able to accomplish this, it needs to be capable of extracting said data from these document formats. As a side feature of sorts, they expose this functionality in their API, which is available in the form of C++, C#, and Java libraries.

Originally, when I began work on the project, the requirements stated that the uploaded file would be in CSV format. The format requirement was changed later, after I had developed a prototype capable of handling a CSV file. To adapt my existing work to this new requirement, I developed a Java servlet to supplement it, which the data cartridge would call using UTL_HTTP.REQUEST.

This servlet received the name of an uploaded Excel file, used JDBC to pull the binary data from the local database, used the Clean Content API to convert it to CSV format, and used JDBC again to put the converted data back into the same table. It didn’t end up amounting to much in the way of LOC, but it did require some learning on my part.

First off, the Clean Content API is structured in a SAX-like fashion. The best resources to learn it are actually both included in the free download: the Developer’s Guide and the JavaDoc API documentation. Examples in the former show how to restrict the API to analysis only (i.e. not modifying the document data), provide in-memory data to the API (via a ByteBuffer), and how to specify a handler class to intercept events. You may have to peruse several examples to find all this out, but it’s all there if you take the time to read through it (and selectively skip all the parts having to do with document manipulation).

Your handler class has to extend the BaseElementHandler or GenericElementHandler class in the API. I recommend the latter during development, as its start() method can help in the debugging process by showing you what data is being extracted.

The startTextCell() method will indicate when the parser is within a spreadsheet cell containing textual data. However, the TextCellElement it receives contains only coordinate information, not the value of the cell. (Quick note: the coordinate system is 0-based, meaning that the coordinates of the first cell of the spreadsheet are 0, 0.) To actually capture the text, you have to use the text() method. This is a little confusing, but the reason is that it’s possible to encounter textual metadata outside of the spreadsheet cells. A simple class flag property can be used so you know when you are or aren’t within spreadsheet cells when this event occurs.

The startDataCell() method indicates when numeric data is encountered. Something worth mentioning here is that the Excel binary format houses dates as integers. To convert such a number back to its equivalent date, take the date 1/1/1900 and add that number of days to it using GregorianCalendar.add(). An example of this is 39,085, which corresponds to 1/3/2007. You can further format this further by passing the return value of GregorianCalendar.getTime() to SimpleDateFormat.format().

One oddity I ran into during development that was unrelated to the Clean Content API was with the JDBC library. I executed a SELECT query, got back a ResultSet object, and then attempted to call ResultSet.getBytes() to place the value of a BLOB column into a byte array. This was so I could pass that to ByteBuffer.wrap() to be used with the Clean Content API later. However, the returned byte array always came back severely truncated judging by its length and the fact that the Clean Content API could not determine the document type based on it. I wasn’t able to get around to examining the content byte by byte to determine the cause of this, but I did find a solution: ResultSet.getBlob() returns a Blob object and Blob.getBytes() returns the needed (complete) byte array. Apparently Oracle condones this method of obtaining the value, so rather than beat myself up trying to figure out the weirdness that is this situation, I followed the well-beaten path.

Beyond troubleshooting these oddities, along with relearning how to write servlets and learning how to test them in Oracle JDeveloper and deploy them using Oracle Enterprise Manager (and running into this issue in the process), the process of implementing these project requirements was pretty straightforward. Hope my learning experiences end up helping someone else out there. I’m sure there are other existing solutions that could have been applied here, but if nothing else, it showed that there’s more than one way to skin this cat.

Comments are closed.