Optical character recognition
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.
OCR systems require calibration to read a specific font; early versions needed to be programmed with images of each character, and worked on one font at a time. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems are capable of reproducing formatted output that closely approximates the original scanned page including images, columns and other non-textual components.
In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by Paul W. Handel who obtained a US patent on OCR in USA in 1933 (U.S. Patent 1,915,993). In 1935 Tauschek was also granted a US patent on his method (U.S. Patent 2,026,329). Tauschek's machine was a mechanical device that used templates and a photodetector.
In 1949 RCA engineers worked on the first primitive computer-type OCR to help blind people for the US Veterans Administration, but instead of converting the printed characters to machine language, their device converted it to machine language and then spoke the letters. It proved far too expensive and was not pursued after testing.
In 1950, David H. Shepard, a cryptanalyst at the Armed Forces Security Agency in the United States, addressed the problem of converting printed messages into machine language for computer processing and built a machine to do this, reported in the Washington Daily News on 27 April 1951 and in the New York Times on 26 December 1953 after his U.S. Patent 2,663,758 was issued. Shepard then founded Intelligent Machines Research Corporation (IMR), which went on to deliver the world's first several OCR systems used in commercial operation.
In 1955, the first commercial system was installed at the Reader's Digest. The second system was sold to the Standard Oil Company for reading credit card imprints for billing purposes. Other systems sold by IMR during the late 1950s included a bill stub reader to the Ohio Bell Telephone Company and a page scanner to the United States Air Force for reading and transmitting by teletype typewritten messages. IBM and others were later licensed on Shepard's OCR patents.
In about 1965, Reader's Digest and RCA collaborated to build an OCR Document reader designed to digitise the serial numbers on Reader's Digest coupons returned from advertisements. The fonts used on the documents were printed by an RCA Drum printer using the OCR-A font. The reader was connected directly to an RCA 301 computer (one of the first solid state computers). This reader was followed by a specialised document reader installed at TWA where the reader processed Airline Ticket stock. The readers processed documents at a rate of 1,500 documents per minute, and checked each document, rejecting those it was not able to process correctly. The product became part of the RCA product line as a reader designed to process "Turn around Documents" such as those utility and insurance bills returned with payments.
The United States Postal Service has been using OCR machines to sort mail since 1965 based on technology devised primarily by the prolific inventor Jacob Rabinow. The first use of OCR in Europe was by the British General Post Office (GPO). In 1965 it began planning an entire banking system, the National Giro, using OCR technology, a process that revolutionized bill payment systems in the UK. Canada Post has been using OCR systems since 1971. OCR systems read the name and address of the addressee at the first mechanised sorting center, and print a routing bar code on the envelope based on the postal code. To avoid confusion with the human-readable address field which can be located anywhere on the letter, special ink (orange in visible light) is used that is clearly visible under ultraviolet light. Envelopes may then be processed with equipment based on simple barcode readers.
In 1974 Ray Kurzweil started the company Kurzweil Computer Products, Inc. and led development of the first omni-font optical character recognition system — a computer program capable of recognizing text printed in any normal font. He decided that the best application of this technology would be to create a reading machine for the blind, which would allow blind people to have a computer read text to them out loud. This device required the invention of two enabling technologies — the CCD flatbed scanner and the text-to-speech synthesizer. On January 13, 1976 the successful finished product was unveiled during a widely-reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind .
In 1978 Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload paper legal and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercializing paper-to-computer text conversion. Kurzweil Computer Products became a subsidiary of Xerox known as Scansoft, now Nuance Communications .
1992-1996 Commissioned by the U.S. Department of Energy (DOE), Information Science Research Institute (ISRI) conducted the most authoritative of the Annual Test of OCR Accuracy for 5 consecutive years in the mid-90s. Information Science Research Institute (ISRI) is a research and development unit of University of Nevada, Las Vegas. ISRI was established in 1990 with funding from the U.S. Department of Energy. Its mission is to foster the improvement of automated technologies for understanding machine printed documents .
Desktop & Server OCR Software
OCR software and ICR software technology are analytical artificial intelligence systems that consider sequences of characters rather than whole words or phrases. Based on the analysis of sequential lines and curves, OCR and ICR make 'best guesses' at characters using database look-up tables to closely associate or match the strings of characters that form words.
WebOCR & OnlineOCR
With IT technology development, the platform for people to use software has been changed from single PC platform to multi-platforms such as PC +Web-based+ Cloud Computing + Mobile devices. After 30 years development, OCR software started to adapt to new application requirements. WebOCR also known as OnlineOCR or Web-based OCR service, has been a new trend to meet larger volume and larger group of users after 30 years development of the desktop OCR. Internet and broadband technologies have made WebOCR & OnlineOCR practically available to both individual users and enterprise customers. Since 2000, some major OCR vendors began offering WebOCR & Online software, a number of new entrants companies to seize the opportunity to develop innovative Web-based OCR service, some of which are free of charge services.
Since OCR technology has been more and more widely applied to paper-intensive industry, it is facing more complex images environment in the real world. For example: complicated backgrounds, degraded-images, heavy-noise, paper skew, picture distortion, low-resolution, disturbed by grid & lines, text image consisting of special fonts, symbols, glossary words and etc. All the factors affect OCR products’ stability in recognition accuracy.
In recent years, the major OCR technology providers began to develop dedicated OCR systems, each for a special type of images. They combine various optimization methods related the special image, such as business rules, standard expression, glossary dictionary and rich information contained in color image, to improve the recognition accuracy.
Such strategy to customize OCR technology is called “Application-Oriented OCR” or "Customized OCR", widely used in the fields of Business-card OCR, Invoice OCR, Screenshot OCR, ID card OCR, Driver-license OCR or Auto plant OCR, and so on.
Current state of OCR technology
Recognition of Latin-script, typewritten text is still not 100% accurate even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 71% to 98%; total accuracy can be achieved only by human review. Other areas—including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those East Asian language characters which have many strokes for a single character)—are still the subject of active research.
Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, if word context (basically a lexicon of words) is not used to correct software finding non-existent words, a character error rate of 1% (99% accuracy) may result in an error rate of 5% (95% accuracy) or worse if the measurement is based on whether each whole word was recognized with no incorrect letters.
On-line character recognition is sometimes confused with Optical Character Recognition (see Handwriting recognition). OCR is an instance of off-line character recognition, where the system recognizes the fixed static shape of the character, while on-line character recognition instead recognizes the dynamic motion during handwriting. For example, on-line recognition, such as that used for gestures in the Penpoint OS or the Tablet PC can tell whether a horizontal mark was drawn right-to-left, or left-to-right. On-line character recognition is also referred to by other terms such as dynamic character recognition, real-time character recognition, and Intelligent Character Recognition or ICR.
On-line systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years (see Tablet PC history). Among these are the input devices for personal digital assistants such as those running Palm OS. The Apple Newton pioneered this product. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual lines segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.
Recognition of cursive text is an active area of research, with recognition rates even lower than that of hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the Amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun, for example, allowing greater accuracy. The shapes of individual cursive characters themselves simply do not contain enough information to accurately (greater than 98%) recognise all handwritten cursive script.
It is necessary to understand that OCR technology is a basic technology also used in advanced scanning applications. Due to this, an advanced scanning solution can be unique and patented and not easily copied despite being based on this basic OCR technology.
For more complex recognition problems, intelligent character recognition systems are generally used, as artificial neural networks can be made indifferent to both affine and non-linear transformations.
A technique which is having considerable success in recognising difficult words and character groups within documents generally amenable to computer OCR is to submit them automatically to humans in the reCAPTCHA system.
- List of optical character recognition software
- Automatic number plate recognition
- Book scanning
- Computational linguistics
- Computer vision
- Digital Library
- Digital pen
- Digital Mailroom
- Institutional repository
- Machine learning
- Music OCR
- Optical mark recognition
- Raster to vector
- Raymond Kurzweil
- Sketch recognition
- Speech recognition
- Voice recording
- ^ "Reading Machine Speaks Out Loud" , February 1949, Popular Science.
- ^ Holley, Rose (April 2009). "How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs". D-Lib Magazine. http://www.dlib.org/dlib/march09/holley/03holley.html. Retrieved 5 January 2011.
- ^ Suen, C.Y., et al (1987-05-29). Future Challenges in Handwriting and Computer Applications. 3rd International Symposium on Handwriting and Computer Applications, Montreal, May 29, 1987. http://users.erols.com/rwservices/pens/biblio88.html#Suen88. Retrieved 2008-10-03
- ^ Tappert, Charles C., et al (1990-08). The State of the Art in On-line Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 12 No 8, August 1990, pp 787-ff. http://users.erols.com/rwservices/pens/biblio90.html#Tappert90c. Retrieved 2008-10-03
- ^ LeNet-5, Convolutional Neural Networks
- Unicode OCR - Hex Range: 2440-245F Optical Character Recognition in Unicode
Optical character recognition software Free software Proprietary software See also
Wikimedia Foundation. 2010.
Look at other dictionaries:
optical character recognition — ➔ recognition * * * optical character recognition UK US noun [U] (ABBREVIATION OCR) IT ► the process by which a piece of electronic equipment recognizes printed or written letters or numbers: »The scanner uses optical character recognition to… … Financial and business terms
optical character recognition — (OCR) ability of a computer to recognize printed or handwritten characters by means of an optical scanner and specialized software … English contemporary dictionary
optical character recognition — n. electronic identification of alphanumeric characters, esp. those typewritten or printed on paper, for computer processing or storage … English World dictionary
Optical Character Recognition — Texterkennung oder auch Optische Zeichenerkennung (Abkürzung OCR von englisch Optical Character Recognition, selten auch: OZE) ist ein Begriff aus dem IT Bereich und beschreibt die automatische Texterkennung von einer gedruckten Vorlage.… … Deutsch Wikipedia
optical character recognition — Computers. the process or technology of reading data in printed form by a device (optical character reader) that scans and identifies characters. Abbr.: OCR [1960 65] * * * ˌoptical ˈcharacter recognition f38 [optical character recognition] noun … Useful english dictionary
Optical Character Recognition — Reconnaissance optique de caractères Pour les articles homonymes, voir ROC et OCR. La reconnaissance optique de caractères (ROC), ou encore appelé vidéocodage (traitement postal, chèque bancaire) désigne les procédés informatiques pour la… … Wikipédia en Français
Optical Character Recognition — Texterkennung; optische Zeichenerkennung; OCR * * * Optical Character Recognition, OCR, OCR Software, optische Zeichenerkennung … Universal-Lexikon
optical character recognition — optinis simbolių atpažinimas statusas T sritis automatika atitikmenys: angl. optical character recognition vok. optische Zeichenerkennung, f rus. оптическое распознавание символов, n pranc. reconnaissance optique de caractères, f … Automatikos terminų žodynas
optical character recognition — optinis ženklų atpažinimas statusas T sritis informatika apibrėžtis Spausdintų arba ranka rašytų ženklų vaizdų skaitymas ir atpažinimas pagal jiems būdingus požymius. Skaitomas teksto grafinis vaizdas ir persiunčiamas į kompiuterį. Speciali… … Enciklopedinis kompiuterijos žodynas
optical character recognition — optical character recog.nition n [U] technical computer software that recognizes numbers and letters of the alphabet which are written on paper, so that information from paper documents can be ↑scanned into a computer … Dictionary of contemporary English