Document Form Extraction

Much of society's most valuable data lives in formulaic documents, aka forms. Common documents like driver's licenses, passports, receipts, invoices, pay stubs, and—most recently and urgently—vaccination records are all forms.

Forms have standardized fields, but each instance has its own values. The pay stub below, for example, has fields (Pay period, Pay Day, Company, Employee, and so on) that are the same across all Gusto pay stubs.

Pay stub example
Example of a pay stub form. Source: Gusto blog.


We would like to make the information in each document programmatically accessible. If we were verifying income, for example, we would want to convert the pay stub above into a key-value map like:

{
    "Pay period": "Jul 12, 2019 - Jul 25, 2019",
    "Pay Day": "Aug 2, 2019",
    "Company": "Lockman-Klein",
    "Employee": "Belle Tremblay",
    "Gross Earnings": "$1,480.00"
}

Downstream processing might further parse the dates, date ranges, and currencies.

Extracting form information is not the coolest topic, but it's extremely valuable and challenging. A large line-up of tools promises state-of-the-art results with the latest and greatest AI, but we put those claims to the test and came away unimpressed.

Our study

We annotated a small dataset of television ad invoices and ran it through four off-the-shelf APIs for generic form extraction. We graded each product on functionality, accuracy, response time, ease-of-use, and business considerations like cost and data security. Please see the Methodology tab for more detail about our test protocol and evaluation criteria.

As a rule, we do not accept affiliate commissions, to keep our reports as unbiased as possible.

Recommendations

  • None of the products we tested could consistently find all of the form data. The best one found only 2.8 of 4 correct form fields (key and value, paired together) on average. For complex documents, this technology simply isn't ready.

  • For simpler documents, we suspect the results would be much better. In this case, we recommend Google Form Parser—it was the most accurate service in our test and by far the fastest. For PDF documents, it's also the only service fast enough to be part of a synchronous pipeline.

  • One alternative strategy is to first use unstructured text extraction, then write your own post-processing to match keys to values. For this approach, we recommend Amazon Textract. With Textract's unstructured text output and a fairly naïve post-processing heuristic, we found 2.9 out of 4 correct fields on average—better than any of the dedicated form extraction tools, and substantially cheaper.

  • ABBYY Cloud OCR SDK seems to require codification of form layouts in advance. This is not possible for the applications we have in mind, where many different flavors of a form have different layouts.

  • Microsoft Form Recognizer requires a custom model to be trained before extracting data. To us, this feels like a lot of extra overhead, which would only be justified if the model accuracy were far superior to competing products. Unfortunately, Form Recognizer's average recall was the worst of the services we tested.

  • There are specialized services to extract text from receipts, invoices, business cards, and tax documents. Given our poor results with general-purpose form extraction, we suggest exploring these specialized products if you have one of these form types.

Our pick: Google Form Parser

Ratings

We organized our notes and test results for each form extraction service into five dimensions, scored from 0 (total failure) to 100 (perfect). These were then combined into an overall score with a weighted average. The dimensions and weights are:

  • Functionality (35%): does the product do what we need and what it says it does?
  • Business considerations (20%): cost, data policies, platform ecosystem
  • Accuracy (20%): response rate and recall of correct key-value pairs.
  • Speed (15%): response time
  • Ease of use (10%): documentation quality, understandable response format, web-based demo, etc.

Please see the Methodology page for detailed definitions of each dimension.

The blue radar plots below show the dimension scores for each product. The gray background plot shows the breakdown for the best-rated product, for context.

Overall scores:

  • Google Form Parser: 75
  • Amazon Textract: 53
  • Microsoft Form Recognizer: 47
  • ABBYY Cloud OCR SDK: not scored

Forms are documents with standardized fields and variable values. They are one of the most elemental ways to store and communicate information, so they pop up everywhere. Some common examples include:

  • ID cards
  • tax forms
  • invoices and receipts
  • health and vaccination records

Why you might want document form extraction

Document form extraction is the process of turning forms into actionable data, in an automated, scalable fashion. With a pay stub, for example, we want to turn the document:

Pay stub example
Source: Gusto blog.


into a key-value map:

{
    "Pay period": "Jul 12, 2019 - Jul 25, 2019",
    "Pay Day": "Aug 2, 2019",
    "Company": "Lockman-Klein",
    "Employee": "Belle Tremblay",
    "Gross Earnings": "$1,480.00"
}

After a little bit of extra processing to cast the extracted strings into dates and numbers, we could use this data to verify the customer's employment, or help them track and forecast their savings over time, or compare their earnings to the industry standard—whatever our business use case might be.
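
As a rough sketch of that extra processing step (the field names and formats are taken from the Gusto example above; the helper functions are our own, for illustration only), the parsing might look like:

from datetime import datetime
from decimal import Decimal

record = {
    "Pay period": "Jul 12, 2019 - Jul 25, 2019",
    "Pay Day": "Aug 2, 2019",
    "Company": "Lockman-Klein",
    "Employee": "Belle Tremblay",
    "Gross Earnings": "$1,480.00",
}

def parse_date(text):
    # e.g. "Aug 2, 2019" -> datetime.date(2019, 8, 2)
    return datetime.strptime(text.strip(), "%b %d, %Y").date()

def parse_money(text):
    # e.g. "$1,480.00" -> Decimal("1480.00")
    return Decimal(text.replace("$", "").replace(",", ""))

period_start, period_end = (parse_date(p) for p in record["Pay period"].split(" - "))
pay_day = parse_date(record["Pay Day"])
gross_earnings = parse_money(record["Gross Earnings"])

Real pay stubs use far more date and currency formats than this sketch handles, which is part of why the downstream processing is usually form-specific.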

General-purpose document form extraction is relatively easy for most people, but very hard to automate. The pay stub example shows why.

Pay stub example
Source: Gusto blog. Annotations our own.


  • Some values have no explicit keys at all.1 Others have two keys because they're in tables, where the row and column labels together define the field. Tables also have the additional problem of substantial distance between the keys and the values.

  • The association between key and value depends on a subjective reading of page layout, punctuation, and style. Some keys and values are arranged vertically, others horizontally. Some keys are delineated by colons, others bold font.

  • Every payment processor uses a different layout. We could hard-code the location of the fields in the Gusto form, but the layout, style, and field names of ADP and Paychex pay stubs are different, even though the underlying information is the same.

A sprawling marketplace of solutions

The marketplace of text extraction products was vast and confusing before the AI revolution, and it has only grown worse. Broadly speaking, these tools operate at one of three levels.

Product types by complexity and value
Broadly speaking, there are three types of products that extract data from form documents.


At the most basic level is Optical Character Recognition (OCR), which extracts raw text from images. This is a well-established technology, but on its own it doesn't do much to unlock the business value in form documents.

At the top of our pyramid is the potentially most valuable task: template filling. In this scenario, we have our own fixed schema of keys and we want to find the values from each document that "fill" each slot in the template. As far as we can tell, this remains an ambitious research goal rather than a solved technology.

For this study, we focused on the second level: key-value mapping. These tools construct key-value pairs from the extracted text, but they don't attempt to match that information to a predetermined schema.

What to look for in a form extraction product

Within the class of key-value mapping tools, form extraction products differ along several dimensions.

Functionality
  • Does the service generally do what it claims?
  • Can we get an answer, regardless of accuracy and speed?
  • Range of input types and sizes allowed
  • Quotas and rate limits
Business considerations
  • Pricing model and estimated total cost
  • Data policies: privacy, encryption, retention
  • Active iteration on product development
  • Customer support
  • Reliability: service level agreement, up time
  • Ecosystem: how vibrant and developed is the surrounding platform?
Accuracy
  • Does the product find the values that we’re looking for?
  • Does the product find the keys that reference the correct values, even if those keys aren't matched to a standard schema?
  • Does the product correctly associate keys with values?
  • Is the tool more accurate than heuristic post-processing of unstructured text?
Speed
  • Synchronous vs. Asynchronous options: under what constraints is a synchronous call possible?
  • Distribution of response times
Ease of use
  • Navigating the vendor's product landscape
  • Documentation quality and completeness
  • Is there a GUI demo for getting started and sanity checking?
  • API design
  • Format of the output. Is it human-readable? Can it be serialized? How much post-processing is needed?
  • Other engineering "gotchas" or unpleasant surprises.


Another key question to ask that doesn't quite fit into this rubric is whether a specialized tool is available for your use case. For invoices and receipts, Taggun, Rossum, Google Procurement DocAI, and Microsoft Azure Form Recognizer Prebuilt Receipt model all explicitly target these kinds of documents. For identity verification, try Onfido, Trulioo, or Jumio.

Feature comparison

We have compiled information about functionality and business considerations from each product's website. The ref link in each cell indicates the source of the information. Please see the individual product reviews for the results of our hands-on evaluation.

Input file formats
  • ABBYY Cloud OCR SDK: BMP, DCX, PCX, PNG, JPEG, PDF, TIFF, GIF, DjVu, JBIG2 (ref)
  • Amazon Textract: JPG, PNG, PDF (ref)
  • Google Form Parser: PDF, TIFF, GIF (ref)
  • Microsoft Form Recognizer: JPG, PNG, PDF (text or scanned), TIFF (ref)

Input file size limit
  • ABBYY Cloud OCR SDK: 30 MB; 32K x 32K pixels for images. (ref)
  • Amazon Textract: 10 MB for JPG and PNG files, 500 MB for PDFs. (ref)
  • Google Form Parser: 20 MB (ref)
  • Microsoft Form Recognizer: 50 MB. For images: 10K x 10K pixels. For PDFs: 17 x 17 inches, 200 pages. (ref)

Processing model
  • ABBYY Cloud OCR SDK: Asynchronous (ref)
  • Amazon Textract: Async for all file types, with a synchronous option for JPG and PNG. (ref)
  • Google Form Parser: Synchronous up to 5 pages, async up to 100 pages or 2,000 pages (the docs are contradictory). (ref)
  • Microsoft Form Recognizer: API docs call it a "Long-Running Operation (LRO)". The call can be blocking or non-blocking. (ref)

Cost
  • ABBYY Cloud OCR SDK: Pre-paid or monthly subscription for a fixed number of pages. Subscriptions run from $30/month for 500 pages up to $840/month for 30K pages. Each form field counts as 0.2 pages. (ref)
  • Amazon Textract: Varies by region and depends on the desired layout complexity: lines, forms, and/or tables. For US-East-2 with unstructured text and forms output (but not tables): $0.05/page up to 1M pages, $0.04/page for pages over 1M. (ref)
  • Google Form Parser: $0.065/page for the first 5M pages/month, $0.05/page beyond 5M pages/month. (ref)
  • Microsoft Form Recognizer: $0.05/page, but unstructured text and custom forms extraction require separate calls, so $0.10/page total. (ref)

Quotas
  • ABBYY Cloud OCR SDK: Quotas are listed for free trials but not for paid accounts, which might suggest there are no limits for paid accounts. (ref)
  • Amazon Textract: 2 synchronous transactions/sec for US East and West; 1 synchronous transaction/sec for other supported regions; 2 async submissions/sec for all supported regions; 600 simultaneous async jobs in US East & West, 100 in other regions. (ref)
  • Google Form Parser: Usage counts against total Google Cloud Project quotas; 1,800 requests/user/min; 600 online requests/project/min; 5 concurrent async batch requests per project; 10,000 pages actively being processed simultaneously. (ref)
  • Microsoft Form Recognizer: Unclear

Miscellaneous
  • Microsoft Form Recognizer: Max training set size of 500 pages. (ref)


Notes

  1. We use the term key to mean the text that names a field within a given form.

Scope

This article is meant for applied data scientists and engineers, as well as data science and engineering team leads who want to understand more about document form extraction, or need to choose a service to use for document form extraction.

Scoring Rubric

We grouped our notes and ratings into five areas, based on the dimensions described in the Domain Guide. For each area, we score the products from 0 (nonexistent) to 100 (perfect), then compute the total score as a weighted average of the dimensions. Our weights for the dimensions are:

Dimension Weight
Functionality 35%
Business considerations 20%
Accuracy 20%
Speed 15%
Ease of use 10%
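
In code, the overall score is just this weighted sum of the five 0-100 dimension scores. A minimal sketch (the dimension scores below are placeholders, not our actual ratings):

WEIGHTS = {
    "Functionality": 0.35,
    "Business considerations": 0.20,
    "Accuracy": 0.20,
    "Speed": 0.15,
    "Ease of use": 0.10,
}

def overall_score(dimension_scores):
    # Weighted average of the 0-100 dimension scores.
    return sum(WEIGHTS[dim] * score for dim, score in dimension_scores.items())

overall_score({"Functionality": 80, "Business considerations": 70,
               "Accuracy": 60, "Speed": 90, "Ease of use": 75})
# -> 75.0 (up to floating-point rounding)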


Ease of use carries only a 10% weight in this comparison, much lower than in the Data App Frameworks comparison. Data app frameworks are largely about the development experience; document form extraction tools, by contrast, expose fairly standardized APIs.

Our accuracy measures are recall-based. For each test document, we count how many of the 4 ground-truth key-value pairs a form extraction service returns, ignoring any other output from the API. The final score for that service is the average recall over all test documents.

  • We use the Jaro-Winkler string similarity function to compare extracted and ground-truth text and decide if they match.

  • Some services return a confidence score with output text. We ignore this; a product scores a match if its output text matches one of the ground-truth pairs, regardless of the confidence score.
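
A minimal sketch of this recall computation, using the jellyfish library's Jaro-Winkler implementation (the 0.9 match threshold is an illustrative assumption, not the exact cutoff we used):

import jellyfish

def is_match(extracted, truth, threshold=0.9):
    # Jaro-Winkler similarity runs from 0 (no match) to 1 (identical strings).
    return jellyfish.jaro_winkler_similarity(extracted, truth) >= threshold

def document_recall(predicted_pairs, ground_truth_pairs):
    # Both arguments are lists of (key, value) string tuples; confidence
    # scores, if the API returned them, have already been dropped.
    found = 0
    for true_key, true_value in ground_truth_pairs:
        if any(is_match(key, true_key) and is_match(value, true_value)
               for key, value in predicted_pairs):
            found += 1
    return found / len(ground_truth_pairs)

The per-service accuracy score is then the mean of this document-level recall over all test documents.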

In our results table, we also have a row called "Mean recall of unstructured text plus custom key-value mapping". This is a baseline against which to compare each service's native key-value recall. For each service, we requested unstructured text in addition to the semi-structured key-value pairs. We then created our own set of key-value candidates by pairing each text block with the one that follows it. For example, if the unstructured output was

["Pay Day:", "Aug 2, 2019", "912 Silver St.", "Suite 1966"]
then our heuristic approach would return:

{
    "Pay Day:": "Aug 2, 2019",
    "Aug 2, 2019": "912 Silver St.",
    "912 Silver St.": "Suite 1966"
}

Most of these candidate pairs are nonsense, but because we evaluate based on recall, this method turns out to be a reasonable baseline.
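
In code, the heuristic is little more than pairing each extracted text block with the block that follows it; a minimal sketch:

def naive_key_value_candidates(text_blocks):
    # Pair each text block with the next one, in reading order.
    return {left: right for left, right in zip(text_blocks, text_blocks[1:])}

naive_key_value_candidates(["Pay Day:", "Aug 2, 2019", "912 Silver St.", "Suite 1966"])
# {'Pay Day:': 'Aug 2, 2019', 'Aug 2, 2019': '912 Silver St.', '912 Silver St.': 'Suite 1966'}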

Selecting the challengers

We first narrowed our set of potential products to those that:

  • Have either a free-trial or a pay-as-you-go pricing model, to avoid the enterprise sales process

  • Claim to be machine learning/AI-based, rather than relying on human processing

  • Have a self-service API.

Of the tools that met these criteria, we chose the four that seemed to best fit the requirements for our test scenario (details below).

For this evaluation, we're not worried about handwritten forms, languages other than English, or images of documents. We also assume we don't have a machine learning team on standby to train a custom model.

The Challenge

To extract metadata from political campaign advertising invoices

Suppose we want to build a service that helps political campaigns verify and track their ad spending. When a campaign receives an invoice from a broadcaster, they upload it to our hypothetical service, and we respond (quickly, if possible) with a verification that the invoice is legit and matches a planned outlay (or not). For this challenge, we want to extract the invoice number, advertiser, start date of the invoice, and the gross amount billed.

For example, suppose our customer submits this invoice:

Annotated invoice example 1


The correct answer for this invoice would be:

{
    "Contract #": "26795081",
    "Advertiser:": "House Majority Forward",
    "Flight:": "2/5/20 - 2/18/20",
    "Total $": "$33,500.00"
}

To extract answers like this at scale, we need a text extraction service with the following features:

  • Key-value mapping, not just OCR for unstructured text
  • Accepts PDFs
  • Responds quickly, preferably synchronously.
  • Handles forms with different flavors. Each broadcast network uses its own form, with different layout, style, and keys, even though the information is the same. Here's a second example with the corresponding correct answer:
Annotated invoice example 2

{
    "Contract #": "4382934",
    "Advertiser": "Diana Harshbarger-Congress-R (135459)",
    "schedule Dates": "05/28/20-06/03/20",
    "Grand Total:": "$1,230.00"
}

The first example uses the key "Flight" to indicate the starting and ending dates of the ad campaign, while the second says "Schedule Dates". There are other subtle differences in punctuation (colons), currency symbols, and date formatting.

Data

The documents in our test set are TV advertisement invoices for 2020 political campaigns. The documents were originally made available by the FCC, but we downloaded them from the Weights & Biases Project DeepForm competition (blog post, competition, code repo). Specifically, we randomly selected 51 documents from the 1,000 listed in Project DeepForm's 2020 manifest and downloaded the documents directly from the Project DeepForm Amazon S3 bucket fcc-updated-sample-2020.
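
A rough sketch of that sampling and download step is below; the manifest filename, the document-id format, and the object key pattern are assumptions for illustration, not the project's documented layout.

import os
import random
import boto3

BUCKET = "fcc-updated-sample-2020"  # Project DeepForm's S3 bucket
s3 = boto3.client("s3")
os.makedirs("invoices", exist_ok=True)

# Hypothetical local copy of the 2020 manifest: one document id per line.
with open("deepform_2020_manifest.txt") as f:
    doc_ids = [line.strip() for line in f if line.strip()]

random.seed(0)  # the actual seed we used is not part of this sketch
for doc_id in random.sample(doc_ids, 51):
    s3.download_file(BUCKET, f"{doc_id}.pdf", os.path.join("invoices", f"{doc_id}.pdf"))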

The DeepForm project did create ground truth annotations, but we ignored them. DeepForm is focused on end-to-end template filling, a much more challenging task than what we're asking our challenger products to do. We also noticed more errors in the DeepForm annotations than we were comfortable with. Creating our own ground truth allowed us to evaluate each service's ability to find relevant key-value pairs, without worrying about how those pairs should slot into our standard schema.

Annotating form documents is tricky, and we made many small decisions to keep the comparison as uniform and fair as possible.

  • Sometimes a PDF document's embedded text differs from what the naked eye reads on the page. We've gone with the visible text as much as possible.

  • Sometimes a key and value that should be paired are far apart on the page, usually because they're part of a table. The second example above illustrates this: the key "Grand Total:" is separated from its value "$1,230.00" by a different data element. We have included these in our annotations knowing this is a very difficult task for any automated system, although we chose fields that are not usually reported in tables.

  • Dates are arranged in many different ways. When presented as a range, we have included the whole range string as the correct answer, but when the start and end dates appear separately, we include only the start date.