— edited 18-Oct-2007 —
I wrote this post when I was tired and grumpy, particularly at PDF, and more because of the time I had wasted finding this out than because of any time I’ll waste in the future. I broke one of my own rules (#6) about blogging; although I wrote the rule about commenting, it applies to posting as well. This is still an annoyance, but I’ve been working with the workaround for a week now and it’s not that bad, really.
— end edit —
So, I was going to save this for another post, but in accordance with my UI quest, I’m currently using a Tablet PC. I didn’t jump into this project lightly; I tried out a Wacom tablet for a few weeks, then I tried a Nokia N800 (absolutely cool gadget; I’m going to be eBaying mine now that I’ve graduated to a full Tablet PC), and now I’m using a Fujitsu T4220, and it rocks out.
But that’s not the real reason for this post; I’ll talk about the emergent wonders of Tablet PCs later. The real reason for this post is this:
I wanted to get this tablet for a number of reasons, but one pretty major use case is that my graduate school career has led me to read a lot of research. I mean, a staggeringly large number of journal articles. Since I want to do this as efficiently as possible, I thought long and hard about how I process all this stuff, and in very brief summary, here’s what I came up with.
- I need a nice interface to read PDFs, since virtually everything available via online library databases is in PDF format.
- I need the ability to mark them up, highlight stuff that is interesting, and store them in a digital format.
- I need to classify and meta tag them, and put them in some sort of document repository.
Well, since this is the digital age, it makes sense that I ought to read the PDFs in digital form (this is a stretch for me; I really like paper), which is facilitated by a tablet since I can actually see the whole page when it’s in the portrait configuration. It also makes sense that I ought to mark up the file in Acrobat, using the native highlighting and searching tools, which is also facilitated by the tablet for obvious reasons.
Here’s the problem. Apparently *every* PDF file, in every digital library, is tagged with headers, or footers, or Bates numbers, or some other bit of text that halts the OCR of the PDF file. If you Google “This page contains renderable text”, you’ll see that this has been a complaint since Acrobat 6 at least. So you can’t just OCR the document and get a nice, mark-up-able document.
Now, I know what you’re thinking. There has to be a workaround, right? Of course there is. You can manually remove the headers and try again. Oh, now there’s a footer; you can take that out too (manually) and try again. Oh, now there’s a Bates number; okay, take that out too. There’s STILL some renderable text in there somewhere? Well, now you can either try to edit out the blocks of renderable text (again, manually, made more entertaining by the fact that you can’t just right-click on the page and say “remove renderable text”), or you can export the entire document to a graphics format (say, TIFF), re-convert it to a PDF file (which turns the entire document into a rasterized image), and THEN run the OCR tool to get an actual mark-up-able document. This process is made more enjoyable by the fact that Acrobat will turn that 300-page dissertation you’re reading as part of your research into 300 distinct TIFF files, which you then need to recombine into a single PDF file. Multiply this by 100, and you’ll see what sort of a barrier to productivity this is for me to get started organizing my existing document collection.
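One small trap in the recombining step is page order: sorted alphabetically, a file for page 10 lands ahead of the file for page 2. Here is a minimal Python sketch of putting the exported pages back in numeric order before rebuilding the PDF. The file names are made up for illustration (I’m not claiming this is how Acrobat names its exports), and `page_number` and `in_page_order` are hypothetical helpers, not part of any tool.

```python
import os
import re


def page_number(filename):
    """Return the last run of digits in the name (extension ignored) as an int.

    Assumption: the exporter puts the page number last in the file name,
    e.g. "dissertation_Page_10.tif". Names with no digits sort first (0).
    """
    stem, _extension = os.path.splitext(filename)
    digits = re.findall(r"\d+", stem)
    return int(digits[-1]) if digits else 0


def in_page_order(filenames):
    """Sort exported page files numerically rather than alphabetically."""
    return sorted(filenames, key=page_number)


# Hypothetical export of a three-page document.
pages = [
    "dissertation_Page_10.tif",
    "dissertation_Page_2.tif",
    "dissertation_Page_1.tif",
]
print(in_page_order(pages))
```

The sorted list is what you would then feed, in order, to whatever tool stitches the images back into one PDF.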
This is CLOSE TO THE DUMBEST THING I HAVE EVER SEEN. And I’ve seen a LOT of bad design. Rather than telling me “This document has renderable text” and giving me “Cancel” as the only option, any feature-driven developer would say, “Gosh, people get really frustrated by this. I know, because I can read the results of a simple Google search. We need to change this right away! Here, I’ll make it so that you can just click ‘Treat existing renderable text as white space’, or even prompt the user to rasterize the renderable text and embed it in the document, then OCR the resulting file!”
The only reason I can imagine that this hasn’t happened is that your lovable electronic document vendor wants to make it a colossally, enormously painful process for someone to actually do anything to the document they’re providing you to use. Thank you, electronic document vendor. You’re going to be wasting about 20% of the time that you’re saving me by giving me electronic access to this document in the first place.
Progress is grand. Collide it with self-interest, though, and progress seems to lose out more often than not.
Now, if you’ll pardon me, I’m going to go get some sleep. Then I’m going to get up in the morning and go to work. Then I’m going to come home, and instead of enjoying some family time with my kids, I’m going to fart around with manual document conversion.