Extract text from images and documents

Study notes

1. Read text from images - OCR technology:
  • Read API
    • Read small to large volumes of text (images and PDFs).
    • New OCR API generation - greater accuracy.
    • Can read printed text in multiple languages, handwritten English only.
    • Asynchronous operation. The initial call returns an operation ID that is used to retrieve the results.
    • The results from the Read API are read at line level and broken down by:
      • page
      • line
      • word
  • Image analysis API
    • Preview, with reading text functionality added
    • Read small amounts of text from images.
    • Returns contextual information, including line number and position.
    • Results are returned immediately (synchronous) from a single function call.
    • Analyzes images beyond extracting text, including detecting content ...
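
The asynchronous Read API pattern described above (submit, poll with the returned ID, then walk page, line, word) can be sketched as follows. This is a minimal illustration, not the SDK: `poll_until_done` and `flatten_read_results` are hypothetical helper names, `get_status` stands in for whatever call fetches the operation result, and the JSON shape (`analyzeResult.readResults[].lines[].words[]`) is assumed from the Read v3.2 response format.

```python
import time
from typing import Callable, Dict, List, Tuple


def poll_until_done(get_status: Callable[[], Dict],
                    interval: float = 1.0, max_tries: int = 10) -> Dict:
    """Call get_status repeatedly until the operation succeeds or fails."""
    for _ in range(max_tries):
        result = get_status()
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(interval)  # still "notStarted" or "running": wait and retry
    raise TimeoutError("operation did not finish in time")


def flatten_read_results(result: Dict) -> List[Tuple[int, str, str]]:
    """Walk page -> line -> word and return (page, line_text, word_text) tuples."""
    rows = []
    for page in result["analyzeResult"]["readResults"]:
        for line in page["lines"]:
            for word in line["words"]:
                rows.append((page["page"], line["text"], word["text"]))
    return rows
```

In a real client, `get_status` would issue a GET against the operation URL returned by the initial call; here it is left as a parameter so the polling logic can be tried without a service.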
2. Extract data from forms with Form Recognizer cognitive services
Use the Form Recognizer cognitive service.
It creates bounding boxes around detected objects in an image (text areas) and then extracts the text.
Form Recognizer provides underlying models that have been trained on thousands of form examples.
Component services:
  • Document analysis models
    • Take input files:
      • JPEG, PNG, PDF, and TIFF
      • less than 500 MB for paid (S0) tier and 4 MB for free (F0) tier
      • 50 x 50 pixels to 10000 x 10000 pixels
      • the training data set must not exceed 500 pages
    • Return a JSON file with the location of text in bounding boxes, text content, tables, selection marks (also known as checkboxes or radio buttons), and document structure.
  • Prebuilt models
    Detect and extract information from document images and return the extracted data in a structured JSON output.
    Prebuilt models:
    • W-2 forms
    • Invoices
    • Receipts
    • ID documents
  • Custom models
    Extract data from forms specific to your business.
    Can be trained by calling the Build model API, or using Form Recognizer Studio.
    • Take input files:
      • JPEG, PNG, PDF, and TIFF
      • less than 500 MB for paid (S0) tier and 4 MB for free (F0) tier
      • 50 x 50 pixels to 10000 x 10000 pixels
      • the training data set must not exceed 500 pages
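
The input requirements listed above can be checked before a file is sent to the service. The following is a rough sketch only: `check_input` is a hypothetical helper, the limits are copied from these notes, and treating 1 MB as 1024 x 1024 bytes is an assumption.

```python
import os
from typing import List, Optional

# Limits taken from the notes above; S0 is the paid tier, F0 the free tier.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".pdf", ".tif", ".tiff"}
MAX_BYTES = {"S0": 500 * 1024 * 1024, "F0": 4 * 1024 * 1024}
MIN_SIDE_PX, MAX_SIDE_PX = 50, 10_000


def check_input(filename: str, size_bytes: int,
                width_px: Optional[int] = None,
                height_px: Optional[int] = None,
                tier: str = "S0") -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes >= MAX_BYTES[tier]:
        problems.append(f"file too large for tier {tier}")
    # Dimensions are checked only when the caller supplies both values.
    if width_px is not None and height_px is not None:
        sides = (width_px, height_px)
        if not all(MIN_SIDE_PX <= side <= MAX_SIDE_PX for side in sides):
            problems.append("image dimensions outside 50 x 50 to 10000 x 10000 pixels")
    return problems
```

For example, `check_input("form.pdf", 1_000_000)` returns an empty list, while a 5 MB BMP that is 40 pixels wide on the F0 tier reports three problems.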
Process:
  • Subscribe to a resource:
    • Cognitive Service resource
    • Form Recognizer resource
  • Make sure input requirements are met
  • Decide which component of Form Recognizer to use (this applies to the document analysis models, not the prebuilt or custom models)
    • Layout model
      Analyzes and extracts text, tables, selection marks, and other structure elements like titles, section headings, page headers, page footers, and more
    • Read model
      Extracts printed and handwritten text, including words, locations, and detected languages.
    • General Document model
      Extracts key-value pairs in addition to text and document structure information.
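
The choice between the three document analysis models can be written down as a small decision helper. This is only a sketch: `choose_analysis_model` is a hypothetical function, and the returned strings assume the model IDs used by the v3.0 API (`prebuilt-read`, `prebuilt-layout`, `prebuilt-document`).

```python
def choose_analysis_model(need_key_value_pairs: bool = False,
                          need_tables_or_structure: bool = False) -> str:
    """Pick the simplest document analysis model that covers the requirement."""
    if need_key_value_pairs:
        # General Document: key-value pairs on top of text and structure.
        return "prebuilt-document"
    if need_tables_or_structure:
        # Layout: text plus tables, selection marks, and structure elements.
        return "prebuilt-layout"
    # Read: printed and handwritten text, words, locations, languages.
    return "prebuilt-read"
```

Key-value pairs take precedence because the General Document model also returns text and structure, so it covers the other two cases.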
To create an application that extracts data from common document types (such as W-2 forms, invoices, receipts, and ID documents), use a prebuilt model. These models do not need to be trained.
To create an application that extracts data from your industry-specific forms, create a custom model. This model needs to be trained:
  • The Form Recognizer service supports supervised machine learning. You can train custom models and create composite models with form documents and JSON documents that contain labeled fields.
  • Train using Cognitive services:
    • Store sample forms in an Azure blob container, along with JSON files containing layout and label field information
    • Generate a shared access signature (SAS) URL for the container.
    • Use the Build model REST API function (or equivalent SDK method).
    • Use the Get model REST API function (or equivalent SDK method) to get the trained model ID.
  • Train using Form Recognizer Studio:
    • Custom template models
      Accurately extract labeled key-value pairs, selection marks, tables, regions, and signatures from documents. Training only takes a few minutes, and more than 100 languages are supported.
    • Custom neural models
      Deep learned models that combine layout and language features to accurately extract labeled fields from documents.
      Best for semi-structured or unstructured documents.
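
The REST-based training steps above (blob container, SAS URL, Build model call) can be sketched by preparing the request without sending it. This is an illustration under assumptions: `build_model_request` is a hypothetical helper, and the URL path, `api-version=2022-08-31`, the `Ocp-Apim-Subscription-Key` header, and the body fields (`modelId`, `buildMode`, `azureBlobSource.containerUrl`) are assumed from the v3.0 REST API and should be checked against the API version you use.

```python
import json
import urllib.request


def build_model_request(endpoint: str, key: str, model_id: str,
                        container_sas_url: str,
                        build_mode: str = "template") -> urllib.request.Request:
    """Prepare (but do not send) a Build model request for a custom model."""
    if build_mode not in ("template", "neural"):
        raise ValueError("build_mode must be 'template' or 'neural'")
    url = (endpoint.rstrip("/")
           + "/formrecognizer/documentModels:build?api-version=2022-08-31")
    body = {
        "modelId": model_id,
        "buildMode": build_mode,
        # SAS URL of the blob container holding sample forms and label files.
        "azureBlobSource": {"containerUrl": container_sas_url},
    }
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": key,
    }
    return urllib.request.Request(url, data=json.dumps(body).encode("utf-8"),
                                  headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would submit the training job; the build mode corresponds to the custom template and custom neural model types listed above.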
a) Form Recognizer models using the REST API
Custom model
REST API: you need the model ID (available once training has finished).
Pass this ID when calling the get_analyze_result function.
The response is in JSON format and contains the locations of the text boxes and the text (words).
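
As a rough sketch of working with that JSON response, the helpers below build the analyze URL for a trained model ID and pull each word with its location out of a result. Both function names are hypothetical, and the URL path and the response shape (`analyzeResult.pages[].words[]` with `content` and `polygon`) are assumptions based on the v3.0 REST API; older API versions name these fields differently.

```python
from typing import Dict, List, Tuple


def analyze_url(endpoint: str, model_id: str) -> str:
    """Build the URL used to submit a document to a given (custom) model."""
    return (endpoint.rstrip("/")
            + f"/formrecognizer/documentModels/{model_id}:analyze"
            + "?api-version=2022-08-31")


def extract_words(result: Dict) -> List[Tuple[int, str, List[float]]]:
    """Return (page_number, word_text, polygon) for every word in the result."""
    words = []
    for page in result["analyzeResult"]["pages"]:
        for word in page.get("words", []):
            words.append((page["pageNumber"], word["content"], word["polygon"]))
    return words
```

The polygon is a flat list of x, y coordinates describing the bounding box of the word on its page.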

b) Extract data with Form Recognizer Studio
  • Document analysis models
    • Read
      Extract printed and handwritten text lines, words, locations, and detected languages from documents and images.
    • Layout
      Extract text, tables, selection marks, and structure information from documents (PDF and TIFF) and images (JPG, PNG, and BMP).
    • General Documents
      Extract key-value pairs, selection marks, and entities from documents.
  • Prebuilt models
  • Custom models (must train model):
    • Create a Form Recognizer or Cognitive Services resource
    • Collect at least 5-6 sample forms for training and upload them to your storage account container.
    • Configure cross-domain resource sharing (CORS).
    • Create a custom model project in Form Recognizer Studio.
    • Apply labels to text.
    • Train your model - receive a Model ID and Average Accuracy for tags.
    • Test model

References