Q: What are the main challenges in OCR?

Handwriting recognition: especially cursive writing or doctors' handwriting.
Low-quality scans: faded ink, skewed pages, coffee stains.
Complex layouts: tables, multi-column text, mixed text and images.
Non-Latin scripts and notation: Arabic (right-to-left), Chinese (thousands of characters), mathematical notation.
Historical documents: degraded paper, archaic fonts, unusual spelling conventions.

Q: What is the difference between OCR and a multimodal LLM for reading documents?

OCR extracts text: it converts image pixels to characters. A multimodal LLM can also understand the content, for example reading a contract and identifying the key clauses rather than just transcribing the words. For raw text extraction at scale, dedicated OCR tools (Tesseract, Amazon Textract, Google Document AI) are typically faster and cheaper. For understanding what the text means, multimodal LLMs add value beyond raw OCR.

OCR (Optical Character Recognition)

Q: What is OCR in simple terms?

OCR is technology that reads text from images the way a human reads a page: it looks at the visual shapes of letters and converts them into digital text. Take a photo of a receipt, run OCR, and you get itemised text that you can search, copy, and process. Scan a 100-year-old handwritten letter, run OCR, and you get a searchable digital transcript, although old handwriting usually still needs a human check.

Q: How does modern OCR work?

Modern OCR uses deep learning, typically a combination of a CNN (to extract visual features from the image), an RNN or transformer (to model the sequential nature of text), and CTC loss (to align the output character sequence with the input image without needing annotations of where each character sits). This end-to-end recognition approach replaced the earlier method of segmenting individual characters and matching each one against templates.
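
To make the CTC idea concrete, here is a minimal sketch of greedy CTC decoding, the step that turns per-frame model scores into a string at inference time. The alphabet, the "-" blank symbol, and the score values are all made up for illustration; a real recogniser would produce the scores with its CNN and sequence model, and may use beam search instead of the greedy rule shown here.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol standing in for the CTC blank label


def ctc_greedy_decode(frame_scores, alphabet):
    """Pick the best label per frame, merge repeats, then drop blanks."""
    # 1. Best-scoring label index for every time step (roughly, image column).
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. Collapse runs of the same label: c c - a t -  becomes  c - a t -
    collapsed = [index for index, _ in groupby(best)]
    # 3. Remove blanks, which only mean "no new character here".
    return "".join(alphabet[i] for i in collapsed if alphabet[i] != BLANK)


alphabet = [BLANK, "c", "a", "t", "o"]

# Six frames of invented scores, one score per alphabet entry.
frame_scores = [
    [0.10, 0.80, 0.05, 0.03, 0.02],  # "c"
    [0.10, 0.70, 0.10, 0.05, 0.05],  # "c" again (same letter spans two frames)
    [0.60, 0.10, 0.10, 0.10, 0.10],  # blank
    [0.10, 0.10, 0.70, 0.05, 0.05],  # "a"
    [0.10, 0.05, 0.10, 0.70, 0.05],  # "t"
    [0.70, 0.10, 0.10, 0.05, 0.05],  # blank
]

print(ctc_greedy_decode(frame_scores, alphabet))  # cat
```

The blank label is what lets the model emit genuine double letters: "t o - o" decodes to "too", while "t o o" collapses to "to". That is why training needs only the target string, not the position of each character.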

⚡ OCR (Optical Character Recognition) converts images of text — scanned documents, photos of signs, handwritten notes — into machine-readable text. One of AI’s oldest and most widely deployed capabilities, it enables document digitisation, automated data extraction, accessibility tools, and real-time translation of physical text. Modern deep learning OCR handles handwriting, complex layouts, and hundreds of languages.

Category: Computer Vision · Difficulty: Beginner · Last updated: 15 May 2026 · 4 min read

OCR — What It Is, How AI Reads Text from Images & Where It Is Deployed at Scale

What is OCR?

Countless documents exist only as physical paper or scanned images, inaccessible to search engines, databases, and AI systems that require digital text. Every handwritten form filled out at a hospital, every printed invoice from a supplier, every legal contract stored as a PDF scan, every historical archive: none of it is searchable or processable without OCR.

OCR bridges the physical and digital. It looks at an image of text and converts what it sees into characters a computer can store, search, and process. The output is not a picture of the letter A — it is the character “A” in digital form, fully searchable and editable.

OCR is one of the oldest AI problems: commercial systems appeared in the 1950s, and postal services began using it to sort mail in the 1960s. It is also one of the most widely deployed: every banking app that reads a cheque, every Google Lens session that translates a foreign menu, every hospital that digitises patient forms, every postal service that sorts mail by reading addresses.

How OCR works

Image preprocessing — deskewing (straightening tilted scans), noise removal, contrast enhancement, binarisation (converting to black and white).
Text detection — find regions of the image containing text. Object detection models locate text blocks regardless of position, orientation, or font.
Character recognition — a CNN extracts visual features from each detected text region. A transformer or RNN models the sequential nature of characters within words and lines.
Post-processing — language models correct obvious errors (OCR may misread “rn” as “m”), apply dictionary lookup, and reconstruct layout structure (tables, columns, headings).
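
The post-processing step above can be sketched as a dictionary lookup that undoes one known visual confusion at a time. This is only an illustration: the confusion pairs, the tiny vocabulary, and the function names are assumptions made for the example, and production systems usually rely on a language model rather than a hand-written table.

```python
# Hypothetical confusion pairs: (what OCR produced, what was probably printed).
CONFUSIONS = [("m", "rn"), ("rn", "m"), ("0", "o"), ("1", "l"), ("vv", "w")]


def candidates(word):
    """Yield every variant of `word` with exactly one confusion undone."""
    for seen, intended in CONFUSIONS:
        start = word.find(seen)
        while start != -1:
            yield word[:start] + intended + word[start + len(seen):]
            start = word.find(seen, start + 1)


def correct_word(word, vocabulary):
    """Keep known words; otherwise accept a one-swap variant found in the vocabulary."""
    if word.lower() in vocabulary:
        return word
    for variant in candidates(word.lower()):
        if variant in vocabulary:
            return variant  # note: corrected words come back lowercased
    return word  # no confident fix, so leave the raw OCR output untouched


def correct_text(text, vocabulary):
    # Whitespace tokenisation only; punctuation handling is left out of this sketch.
    return " ".join(correct_word(token, vocabulary) for token in text.split())


vocabulary = {"the", "modern", "system", "reads", "corner", "invoice", "total"}
print(correct_text("the modem system reads the comer inv0ice", vocabulary))
# the modern system reads the corner invoice
```

Leaving unknown words untouched is a deliberate choice here: names, product codes, and rare terms are often missing from any vocabulary, and an aggressive corrector can introduce more errors than it removes.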

Real-world examples

Not theory — what real teams actually shipped using this technique.

Google Lens — point your phone at a menu in Japanese, Korean, or Arabic and real-time OCR plus translation displays the English equivalent overlaid on the original text, live through the camera.
NHS document digitisation — the UK National Health Service has digitised millions of patient records using OCR, converting handwritten and typed clinical notes into searchable, structured electronic health records.
Invoice processing automation — accounts payable teams use OCR to extract supplier name, invoice number, line items, and totals from PDF invoices — replacing hours of manual data entry with seconds of automated extraction, feeding directly into ERP systems.
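
As a rough illustration of the invoice case, the sketch below pulls fields out of text that OCR has already produced. The invoice layout, the regular expressions, the "first line is the supplier" rule, and the function names are all assumptions made for this example. Real invoices vary far more, which is why commercial tools use trained layout models rather than fixed patterns.

```python
import re
from decimal import Decimal

# Made-up OCR output for a hypothetical invoice layout.
OCR_TEXT = """\
Example Supplies Ltd
Invoice No: INV-2041
Widget A   2 x 10.00   20.00
Widget B   1 x 5.50    5.50
Total: 25.50
"""

INVOICE_NO = re.compile(r"Invoice\s+No\.?\s*:?\s*(\S+)", re.IGNORECASE)
TOTAL = re.compile(r"Total\s*:?\s*(\d+\.\d{2})", re.IGNORECASE)
# description, quantity, "x", unit price, line amount
LINE_ITEM = re.compile(r"(.+?)\s+(\d+)\s*x\s*(\d+\.\d{2})\s+(\d+\.\d{2})")


def extract_fields(text):
    """Return supplier, invoice number, total, and line items found in OCR text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    fields = {
        "supplier": lines[0] if lines else None,  # assumption: supplier is the first line
        "invoice_number": None,
        "total": None,
        "line_items": [],
    }
    for line in lines:
        number = INVOICE_NO.search(line)
        total = TOTAL.fullmatch(line)
        item = LINE_ITEM.fullmatch(line)
        if number:
            fields["invoice_number"] = number.group(1)
        elif total:
            fields["total"] = Decimal(total.group(1))
        elif item:
            fields["line_items"].append({
                "description": item.group(1),
                "quantity": int(item.group(2)),
                "unit_price": Decimal(item.group(3)),
                "amount": Decimal(item.group(4)),
            })
    return fields


def totals_agree(fields):
    """Cheap sanity check: do the line amounts add up to the stated total?"""
    if fields["total"] is None:
        return False
    line_sum = sum((item["amount"] for item in fields["line_items"]), Decimal("0"))
    return line_sum == fields["total"]


fields = extract_fields(OCR_TEXT)
print(fields["invoice_number"], fields["total"], totals_agree(fields))
```

A cross-check like `totals_agree` is useful because a single misread digit is invisible to a regex but usually breaks the arithmetic, so a mismatch is a cheap signal to route the invoice to a person before it reaches the ERP system.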

Common pitfalls

Handwriting variance — cursive handwriting, personal abbreviations, and poor pen quality remain genuinely difficult. Modern handwriting recognition is good but not reliable enough for high-stakes applications without human review.
Layout complexity — multi-column documents, tables, and mixed text-image layouts confuse line-level OCR that assumes text flows left-to-right, top-to-bottom. Layout analysis models (like LayoutLM) address this.
Low-quality inputs — heavily degraded scans, watermarks, and poor lighting significantly reduce accuracy. Preprocessing quality directly determines OCR quality.
Language and font coverage — OCR systems trained on common fonts and languages perform poorly on rare scripts, historical fonts, or domain-specific symbols (mathematical notation, musical scores, chemical formulae).
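
On the low-quality inputs point, binarisation is one preprocessing step that is easy to show in a few lines. The sketch below uses Otsu's method, a standard way to choose a black/white threshold from the image histogram; it is one common choice, not necessarily what any particular OCR product does. The 4x4 "faded scan" and the function names are invented for the example, and pixels are assumed to be integer grey levels from 0 to 255.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (unnormalised): larger means a cleaner split.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarise(image, threshold):
    """Map each pixel to 0 (ink) or 255 (paper); pixels <= threshold count as ink."""
    return [[0 if value <= threshold else 255 for value in row] for row in image]


# Invented faded scan: ink is mid-grey (about 90-105), paper is light grey (about 185-200).
image = [
    [190, 185, 200, 195],
    [188,  95, 100, 192],
    [185,  90, 105, 198],
    [193, 187, 196, 190],
]
pixels = [value for row in image for value in row]
threshold = otsu_threshold(pixels)
clean = binarise(image, threshold)
for row in clean:
    print(row)
```

A single global threshold like this works when lighting is even. Scans with shadows or stains generally need an adaptive threshold computed per region, which is beyond this sketch.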

Frequently asked questions

QUESTION 1 What is OCR in simple terms?

ANSWER 1 Technology that reads text from images — converting pixel representations of letters into digital characters that can be searched, edited, and processed.

QUESTION 2 How does modern OCR work?

ANSWER 2 A CNN extracts visual features, a transformer or RNN models character sequences, and CTC loss aligns the output to the input. This end-to-end deep learning approach replaced the earlier segment-then-classify pipeline.

QUESTION 3 What are the main OCR challenges?

ANSWER 3 Handwriting recognition, low-quality scans, complex multi-column layouts, non-Latin scripts, and historical documents with degraded paper and archaic fonts.

QUESTION 4 What is the difference between OCR and a multimodal LLM?

ANSWER 4 OCR extracts text. A multimodal LLM understands the meaning — identifying key contract clauses, not just transcribing words. Dedicated OCR tools are faster and cheaper for raw extraction.
