Optical character recognition (OCR) in Google Docs

Tuesday, June 22, 2010

A couple of months ago, my co-worker, Mike, showed up at my desk with a pile of paper, each of the yellowed sheets densely covered with an ancient-looking typewriter font. His wife had recently discovered parts of her family chronicles in the attic, typed up by her grandmother many years ago! Now he was wondering if there was a way for her to continue writing the chronicles in Google Docs.

The papers sat on my desk for a while, but recently, I returned them to Mike with a smile, cheerfully telling him that what started as my 20% project is now ready for everyone to use -- Google Docs now officially supports importing scanned documents. What we launched as an experimental feature for the Documents List Data API last year is now available on the upload page: check the “Convert text from PDF or image files to Google Docs documents” box, upload your scanned images (JPEG, GIF, PNG) or PDFs, and Google Docs will extract text and formatting from the scans for you to edit away.

For the technically curious: we’re using Optical Character Recognition (OCR) that our friends from Google Books helped us set up. OCR works best with high-resolution images, and not all formatting may be preserved. The original images will be included in the new document to make it easier for you to correct mistakes. Supported languages include English, French, Italian, German and Spanish, with more languages and character sets on their way. We’re looking forward to getting your feedback while we keep improving the feature over the coming months.
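
If you script your uploads, a quick pre-check along these lines might save you a failed conversion. This is only a sketch: the function name `looks_convertible` and the extension list are assumptions based on the formats named earlier in this post (JPEG, GIF, PNG, PDF), and none of it is part of any Google Docs API.

```python
from pathlib import Path

# Assumption: extensions inferred from the formats named in this post
# (JPEG, GIF, PNG, PDF); this is not an official list.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".pdf"}


def looks_convertible(filename: str) -> bool:
    """Return True if the file extension matches a format named in the post."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    for name in ["chronicles_page1.PNG", "chronicles.pdf", "notes.tiff"]:
        print(name, looks_convertible(name))
```

The check only looks at the file name, so it cannot tell whether a scan is high-resolution enough for good OCR results.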

And Mike’s scanned family chronicles have even been extended by an additional chapter in Google Docs: his wife recently had a baby boy named James!


Antoine said...

Can you convert the text in an existing PDF that is already in Google Docs, or do you have to download it and then re-upload it to use the OCR?

Anoo said...

This is so awesome - please tell me it's a little more accurate than the voice transcription with google voice :)

(or at least as funny)

Can't wait to try this one out!

Mr. Quindazzi said...

I too am very curious about converting over .PDF files already stored in Google docs. This is a great new tool.

Bob said...

Great work. Perhaps the option of the OCR text remaining as meta data while the PDF stays intact. While this does not do much for editing it does a lot for search. The OCR does well enough for searching text. Just a thought. I don't know all of your goals with the project so this may be a lame idea. I appreciate the work you have done. This improvement to Docs will be of great value "as is." Thank you.

dima said...

Really it would be great to convert existing PDF files already uploaded to Google Docs.

searchengineman said...

I'm curious to know if this feature is being applied to all PDF documents indexed by Google.

Computator said...

now we just need OCR to google Translate!

Arthur Gouveia said...

Why are only the first ten pages converted? Google Docs simply ignores the final pages if the PDF file is more than 10 pages long.

James said...

potentially nice feature, but unfortunately all my non-OCR'd pdf's are in the 50-300 meg+ size range. The new tool seems to cap on anything over 25 megs, but not until waiting for the entire file to upload :(