WHAT'S OCR?
Almost everyone is familiar with a scanner. Scanners are great: they let us take
photographs and documents from the 3D world of pens and paper, and bring them
into the digital world inhabited by our computers. Sometimes, however, even
people who are familiar with scanners don't understand exactly what a computer
does with something once it has been scanned. To put it briefly, computers have
different ways of dealing with pictures and text. A scanner is more or less just
a digital camera. When you scan something, the scanner sends it to your computer
as though it were a picture. It doesn't matter whether you are scanning photos
of your beloved neighborhood movie theatre, or that old term paper you lovingly
typed out back when you were in college in the winter of 1975.
Why is that important? It's important because the computer can do all kinds of
neat things with text. Type up something in your word processor, and the
computer understands that you've given it a bunch of words and letters. If you
want to find how many times you used the word "stinky," no problem... the
computer can count them. If you decide that all of those letters are ugly, no
problem... you can tell the computer to change the font, and with one command
you can change the typeface on every single letter.
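To make the word-counting idea concrete, here is a minimal sketch in Python. The function name count_word and the sample sentence are made up for illustration; a real word processor does this in its own way, but the principle is the same: once the computer has actual text, counting is trivial.

```python
import re
from collections import Counter

def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    # Split the text into lowercase words (letters and apostrophes only).
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)[word.lower()]

# Hypothetical sample text, not taken from any real document.
sample = "That stinky sock is stinky. Stinky, stinky sock!"
print(count_word(sample, "stinky"))  # prints 4
```

None of this is possible with a scanned picture of the same sentence, because a picture contains pixels rather than words.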
The computer can do all kinds of neat things with pictures, too, but they aren't
really the kind of things that help you when you are dealing with the written
word. Want to erase that jerk standing in the background of your perfect
picture of the Gateway Arch? No problem: a couple minutes with an image editing
program like Paint Shop Pro or Photoshop, and he's gone. Want to remove that
wart from Aunt Hortense's forehead? Presto... it's gone.
But let's think about that old college term paper again. Once you've scanned it,
what if you decide you want to change the ugly monospaced font that your old
typewriter used? Remember: as far as the computer is concerned, at this point,
that term paper is still just a picture, no different from your picture of Aunt
Hortense. If you wanted to change the font, you could do it... but it would
require manually redrawing every last letter in the entire document.
More importantly, since your scan is still a picture, you can't do any of the
other important things that we can do with text, like automatically finding and
replacing misspelled words.
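As a rough illustration of automatic find-and-replace, the Python sketch below fixes a short list of misspellings. The CORRECTIONS table and the fix_spelling function are hypothetical examples, not part of any particular word processor, and the matching is deliberately simple (whole words, case-sensitive).

```python
import re

# Hypothetical table of misspellings and their corrections.
CORRECTIONS = {"teh": "the", "recieve": "receive"}

def fix_spelling(text: str) -> str:
    """Replace each listed misspelling when it appears as a whole word."""
    # \b marks a word boundary, so "teh" inside a longer word is left alone.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CORRECTIONS)) + r")\b")
    return pattern.sub(lambda m: CORRECTIONS[m.group(1)], text)

print(fix_spelling("I did not recieve teh term paper."))
# prints: I did not receive the term paper.
```

This only works because the input is text. Given a scanned image of the same sentence, there would be no words for the program to search.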
This is where OCR - Optical Character Recognition - comes to the rescue. An OCR
program can look at the "picture" of your document, "read" the document, and
convert it to text. Really, really smart programmers with big giant brains have
devised methods of looking at the little black areas on the white paper and
figuring out what character was typed there. Got a straight vertical line, with
a shorter horizontal line extending out from the bottom at a right angle? Oh...
that's a capital L. And so it goes.
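The shape-matching idea can be sketched as a toy program. The Python example below compares a small black-and-white grid against a few hand-made letter templates and picks the closest one. Everything here (the 5x5 grids, the TEMPLATES table, the recognize function) is an assumption made up for illustration; real OCR programs use far more sophisticated methods.

```python
# Toy letter templates: '#' is a black pixel, '.' is white paper.
# These 5x5 shapes are invented for this sketch.
TEMPLATES = {
    "L": ["#....", "#....", "#....", "#....", "#####"],
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
    "I": ["..#..", "..#..", "..#..", "..#..", "..#.."],
}

def score(glyph, template):
    """Count how many pixels agree between the glyph and the template."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )

def recognize(glyph):
    """Return the letter whose template agrees with the glyph on the most pixels."""
    return max(TEMPLATES, key=lambda letter: score(glyph, TEMPLATES[letter]))

# A slightly smudged capital L: one stray speck and one missing pixel.
scanned = ["#....", "#....", "#...#", "#....", "####."]
print(recognize(scanned))  # prints L
```

Even with two wrong pixels, the smudged shape still agrees with the L template far more than with T or I, so it is recognized correctly. As the damage grows, the scores get closer together and mistakes become more likely.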
There are some drawbacks, however. OCR programs are rarely perfect, and a poor
quality original - for example, a document that has been faxed and photocopied a
couple times - will be fraught with errors. My own experience with various
consumer-grade OCR programs has been mixed. They work well on a clearly typed
original document. But if the original isn't clear, well... if you're a fast
typist, you might find that it takes less time to retype the whole document
than to correct all the errors in the OCR output.
If you're going to do much scanning, it helps to know a little more about the
way pictures are stored on the computer. There isn't just one format for storing
photos; there are several, each with various advantages and disadvantages. Most
people are familiar with a few of these: GIFs, JPGs, and BMPs, for example.
Images stored in the .gif and .jpg formats are commonly used on websites. TIFF,
however, is a much better format for storing scanned documents. The TIFF file
specification (TIFF files usually have the extension .tif) includes a way for the computer to
recognize multi-page images. That means that if you scan a five-page document,
the computer can store it as a single file. Most other formats would require the
document to be stored as five separate image files (one for each page).
The problem with putting .tif images on a website is that most web browsers
don't know how to display them. There are add-ins that can be installed to give
the browser this ability, but few users would bother.
Then what is the best option for putting a scanned document on a website? From a
functional standpoint, the best option is to OCR it and then put it online as a
text or HTML document. Unfortunately, that's a lot of work, especially if you
don't have good OCR software and a good, high-quality original document. There
is, however, one other relatively easy workaround: You can convert your TIFF
files to PDFs. This is sometimes called "TIFF wrapped in PDF."
PDF, designed by Adobe, stands for "Portable Document Format." Files saved
as PDFs can be viewed on practically any type of computer. Almost all modern
computers will have a PDF viewer installed (usually Adobe's Acrobat Reader), and
most web browsers will automatically launch the viewer if a user tries to open a
PDF file via the web.
A "TIFF wrapped in PDF" isn't really a proper PDF file. In an ordinary PDF, the
computer can still recognize the text as being text; that is, it's still possible
to do things like copying that text and pasting it into another document, or
searching for a particular word or phrase within the document. That isn't
possible with a TIFF wrapped in PDF; as far as the computer is concerned, that
document is still really just a series of pictures. Wrapping TIFFs in PDF, then,
still isn't a perfect solution. But it does have the advantage of making scanned
documents easily readable via the web.
Copyright © 2000- VeryPDF.com, Inc. All rights reserved.