Google Form Parser, a review and how-to

Warning

ML products evolve quickly and this review is likely out of date. It was our best understanding of this product as of June 2021.

Note

To avoid bias, Crosstab Data Science does not accept compensation from vendors for product reviews.

What do driver’s licenses, passports, receipts, invoices, pay stubs, and vaccination records have in common?

They are all types of forms, and they contain some of society’s most valuable information. The process of converting these documents into computationally usable data is called form extraction. Google’s form extraction solution is Form Parser, launched last year to compete with Amazon Textract and Microsoft Form Recognizer.

Form Parser, Textract, and Form Recognizer are general-purpose form extraction tools in that they attempt to find relevant data in any type of input document, not just common items like receipts or invoices. The Google and Amazon offerings are also fully off-the-shelf in that they don’t require any input data for model training or fine-tuning.

I was curious how Form Parser stacked up against the competition, so I tested the three services on real documents and evaluated each service’s ease of use, accuracy, and speed. Despite its youth, Form Parser was the best, with an average synchronous response time of 3.3 seconds and 70% recall. Whether this performance is good enough depends on your application, but the whole space clearly has a lot of room for improvement.

Google Form Parser Highlights

Surprise & Delight	Friction & Frustration
Fast response time for synchronous calls.	The output is not native Python, can’t be serialized with native Python tools, and is hard to explore from a Python REPL.
The most accurate of the services I tested.	Extracted text blocks only contain references to the actual text. You have to write a utility function to retrieve the actual text.

What is form extraction?

Suppose we want to use the information in our customers’ pay stubs, maybe to help them qualify for loans. A pay stub is a form; it is a stylized mapping of fields—let’s call them keys—to values. Take this example from the Gusto blog:

Our brains process this easily but the information is not accessible to a computer. We need the information in a data format more like this:

{
    "Pay period": "Jul 12, 2019 - Jul 25, 2019",
    "Pay Day": "Aug 2, 2019",
    "Company": "Lockman-Klein",
    "Employee": "Belle Tremblay",
    "Gross Earnings": "$1,480.00"
}

It’s helpful to step back and contrast this with two other information extraction tasks: optical character recognition (OCR) and template-filling. OCR tools extract raw text from images of documents; it is well-established technology but only the first step in capturing the meaning of a form.

Broadly speaking, there are three types of products that extract data from form documents. Image by author.

Template-filling, on the other hand, is the holy grail; it seeks to not only extract information but slot it correctly into a predetermined schema. So far, template-filling is only possible for some common and standardized types of forms like invoices, receipts, tax documents, and ID cards.¹

Key-value mapping is in the middle. It’s easier than template-filling, but still hard. The Gusto demo shows why:

Some values have no keys. Others have two keys because they’re in tables, where the row and column labels together define the field, even though they’re far apart on the page.
The association between key and value depends on a subjective reading of page layout, punctuation, and style. Some keys and values are arranged vertically, others horizontally. Some keys are delineated by colons, others bold font.
Manually specifying field locations and formats is a non-starter because form layouts and styles vary across processors and across time. It’s not just pay stubs; US driver’s licenses, as another example, vary across states and time. Medical record formats differ between providers, facilities, and database systems.

Google Form Parser, on paper

Ecosystem

Form Parser is available on Google Cloud Platform, as part of the Document AI tool suite.

It can be a bit of a chore to navigate Google’s product line-up. The Document AI suite has three kinds of tools, which it calls processors: General, Specialized, and Custom. The general-purpose tools are OCR, Document Splitter, and Form Parser. The specialized tools process specific kinds of documents, while the custom services allow you to train a custom document classification or entity extraction model.

To make things even more confusing, Google’s Cloud Vision service also does text extraction. What’s more, some of the Document AI services are currently invite-only and it’s not always clear which ones. Form Parser, fortunately, is one of the generally available services. A good place to get started is the Form Parser guide, in the Document AI documentation.

Features and limitations

Form Parser’s features and constraints are roughly on par with or slightly better than competitors’ products.

Form Parser can process documents synchronously or asynchronously.
The synchronous processor is limited to smaller files in both page count and bytes but accepts a broader range of file types. According to the Form Parser documentation, the synchronous processor accepts documents up to 5 pages long, in any of these formats: PDF, TIFF, GIF, JPEG, PNG, BMP, WEBP.
The async processor only accepts files in PDF, TIFF, or GIF format, but accepts documents with up to 2,000 pages.
Form Parser’s quotas and rate limits are the same as other Document AI services. They are more generous than other services I tried and should be plenty to build a new product on. Google allows 1,800 requests per minute, 5 concurrent async batch requests per project, and up to 10,000 pages in processing at any one time.
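As a rough illustration of the limits above, here is a minimal sketch of a routing helper that decides whether a document can go to the synchronous endpoint or needs the async one. The function name, the constants, and the MIME type strings are my own assumptions for illustration, not part of Google's SDK. The documentation also limits file size in bytes, which this sketch does not model.

```python
# Limits as described in this review (June 2021); treat them as assumptions.
SYNC_MAX_PAGES = 5
ASYNC_MAX_PAGES = 2000

SYNC_MIME_TYPES = {
    "application/pdf", "image/tiff", "image/gif", "image/jpeg",
    "image/png", "image/bmp", "image/webp",
}
ASYNC_MIME_TYPES = {"application/pdf", "image/tiff", "image/gif"}


def choose_processing_mode(page_count: int, mime_type: str) -> str:
    """Return "sync" or "async" for a document, or raise if neither fits.

    Hypothetical helper: it only encodes the page-count and file-type
    limits quoted in the text, and prefers the faster synchronous call.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count <= SYNC_MAX_PAGES and mime_type in SYNC_MIME_TYPES:
        return "sync"
    if page_count <= ASYNC_MAX_PAGES and mime_type in ASYNC_MIME_TYPES:
        return "async"
    raise ValueError(
        f"No processing mode accepts {page_count} pages of type {mime_type}"
    )


print(choose_processing_mode(3, "image/png"))          # short image: sync
print(choose_processing_mode(120, "application/pdf"))  # long PDF: async
```

One consequence worth noticing: a long document in JPEG, PNG, BMP, or WEBP format fits neither mode and would have to be converted or split first.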

Cost

Form Parser is noticeably more expensive than other services, at $0.065 per page up to 5 million pages in a month, and $0.05 per page above 5 million pages. Amazon Textract and Microsoft Form Recognizer both start at $0.05/page for generic forms.²
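To make the tiered pricing concrete, here is a small sketch that estimates a monthly bill from a page count. It assumes the prices quoted above still apply and that the lower rate applies only to pages beyond the first 5 million. The function name is hypothetical.

```python
from decimal import Decimal

# Prices quoted in this review (June 2021); check current pricing before use.
TIER_1_PRICE = Decimal("0.065")   # per page, first 5 million pages in a month
TIER_2_PRICE = Decimal("0.05")    # per page, beyond 5 million pages
TIER_1_LIMIT = 5_000_000


def estimate_monthly_cost(pages: int) -> Decimal:
    """Estimate the monthly Form Parser cost in US dollars for `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    tier_1_pages = min(pages, TIER_1_LIMIT)
    tier_2_pages = max(pages - TIER_1_LIMIT, 0)
    return tier_1_pages * TIER_1_PRICE + tier_2_pages * TIER_2_PRICE


print(estimate_monthly_cost(1_000))      # 65.000
print(estimate_monthly_cost(6_000_000))  # 375000.000
```

Decimal is used instead of float so that the per-page prices stay exact.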

Data policies

In terms of data policies, the Document AI Data Usage FAQ asserts that Google:

does not use user content for any purpose other than to provide the requested service,
does not store input documents on its servers except while the document is actively being processed,
does not currently use input documents to improve Document AI models,
does not claim ownership of input documents,
does not make input documents public or share them with any other parties except that it may share with a 3rd party vendor who provides some aspect of the Document AI service.

Google’s Data Processing terms document has more detail.

Developer experience

I started by setting up the Document AI service in the Google Cloud Platform console, following the quickstart set-up guide.

Form Parser’s documentation is…ok. The main Document AI page has a drag-and-drop demo to see what kind of information you can extract from one of your own documents. There is a very nice how-to guide with code samples for both synchronous and async API calls in several languages, and information about quotas, pricing, and data policies is easily discoverable and readable.

On the other hand, the code examples in the user guide only show how to work with PDFs, not images, and there is no dedicated explanation of how to work with the response object. The documentation is also sometimes confusing. For example, it uses the term “processor” for both the API call mode (synchronous vs. asynchronous) and the document type (receipt, invoice, etc.).

Calling Form Parser with the Python SDK is fairly straightforward. First set the GOOGLE_APPLICATION_CREDENTIALS environment variable, then copy or load the following items from the Google Cloud Console into the code: google_project_id, google_location, and google_processor_id.

Suppose we want to parse this PDF invoice (see why in the next section):³

Source: Project DeepForm, originally from the US Federal Communications Commission. See the *data* description below.

The following snippet constructs and executes a synchronous Form Parser request, given the document_path:

from google.cloud import documentai_v1beta3 as documentai

google_processor_name = (
    f"projects/{google_project_id}/locations/{google_location}/processors/"
    f"{google_processor_id}"
)

doc_ai = documentai.DocumentProcessorServiceClient()

with open(document_path, "rb") as f:
    document_bytes = f.read()

request = {
    "name": google_processor_name,
    "document": {"content": document_bytes, "mime_type": "application/pdf"},
}

response = doc_ai.process_document(request=request)
type(response)

google.cloud.documentai_v1beta3.types.document_processor_service.ProcessResponse

Parsing the response is not as simple as it should be, for two reasons. First, the output is a custom type (presumably based on protobuf) that is hard to explore in a Python REPL and can’t be serialized with standard Python tools.

To explore the response object’s structure, there is a hidden _pb attribute that is accessible with dir and tab-complete. That’s how I figured out what to do in the following snippets.

For serialization, my approach was to parse the response into a Pandas DataFrame before saving it to disk. Here’s the gist of it:

import pandas as pd

def convert_form_parser_output(response):
    """Convert Form Parser output to a DataFrame."""
    
    data = []
    for page in response.document.pages:
        for pair in page.form_fields:
            row = dict(
                page=page.page_number,
                key_text=slice_text(pair.field_name, response.document.text),
                key_confidence=pair.field_name.confidence,
                value_text=slice_text(pair.field_value, response.document.text),
                value_confidence=pair.field_value.confidence,
            )
            data.append(row)

    return pd.DataFrame(data)

df = convert_form_parser_output(response)
print(df.head())

   page         key_text  key_confidence                 value_text  value_confidence
0     1  Agency Order #:        0.996694                    9333209          0.996694
1     1           Buyer:        0.981321             Bassett, Laura          0.981321
2     1       Assistant:        0.975439  JENNA NUBAR\n202-872-5880          0.975439
3     1             CPE:        0.972257               509/545/8023          0.972257
4     1          Flight:        0.967986           2/5/20 - 2/18/20          0.967986

What does the slice_text function do? I’m glad you asked! The second annoyance in handling Form Parser output is that the individual text blocks don’t contain actual text. Instead, they contain start and end indexes relative to a giant blob of all document text.⁴ This probably saves some bandwidth and memory, but it makes it harder to learn how the output is structured and to sanity check the results.

The workaround is to write a quick utility function like this:

def slice_text(element, text) -> str:
    """Find the text represented by a Google Form Parser document element."""
    spans = element.text_anchor.text_segments
    element_text = ""
    for span in spans:
        try:
            start_index = span.start_index
        except AttributeError:
            start_index = 0

        element_text += text[start_index: span.end_index]

    return element_text.strip()
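To see how the index-based text anchors work without calling the API, here is a self-contained sketch that uses stand-in objects. The SimpleNamespace objects and the fake_element helper are hypothetical; they only mimic the text_anchor.text_segments shape used above. The slice_text here is a compact restatement of the utility, not a replacement for it.

```python
from types import SimpleNamespace


def slice_text(element, text) -> str:
    """Compact restatement of the utility above: join every referenced span."""
    segments = element.text_anchor.text_segments
    return "".join(text[s.start_index:s.end_index] for s in segments).strip()


# A stand-in for response.document.text: one big blob of all document text.
document_text = "Buyer:\nBassett, Laura\nFlight:\n2/5/20 - 2/18/20\n"


def fake_element(*snippets):
    """Build a mock element whose segments point at `snippets` in the blob."""
    segments = []
    for snippet in snippets:
        start = document_text.index(snippet)
        segments.append(
            SimpleNamespace(start_index=start, end_index=start + len(snippet))
        )
    return SimpleNamespace(text_anchor=SimpleNamespace(text_segments=segments))


key = fake_element("Buyer:")
value = fake_element("Bassett, Laura\n")
print(slice_text(key, document_text), "->", slice_text(value, document_text))

# An element can reference several segments; their text is concatenated.
two_part = fake_element("Flight:\n", "2/5/20 - 2/18/20")
print(repr(slice_text(two_part, document_text)))
```

The element itself carries only offsets, so the full document text always has to be passed alongside it.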

Test methodology

Choosing the candidates

I first narrowed the set of potential products to those that:

Have either a free trial or a pay-as-you-go pricing model, to avoid the enterprise sales process.
Claim to be machine learning/AI-based, rather than relying on human processing.
Don’t require form fields and locations to be specified manually in advance.
Have a self-service API.

Of the tools that met these criteria, Amazon Textract, Google Form Parser, and Microsoft Form Recognizer seemed to best fit the specifics of my test scenario.

The challenge

Suppose we want to build a service to verify and track ad campaign spending. When a customer receives an invoice from a broadcaster they upload it to our hypothetical service and we respond with a verification that the invoice is legit and matches a budgeted outlay (or not). For this challenge, we want to extract the invoice number, advertiser, start date of the invoice, and the gross amount billed.

To illustrate, the correct answers for the example invoice in the previous section are:

Source: Project DeepForm, originally from the US Federal Communications Commission.

{
    "Contract #": "26795081",
    "Advertiser:": "House Majority Forward",
    "Flight:": "2/5/20 - 2/18/20",
    "Total $": "$33,500.00"
}

To do this at scale, we need a form extraction service with some specific features:

Key-value mapping, not just OCR for unstructured text
Accepts PDFs
Responds quickly, preferably synchronously.
Handles forms with different flavors. Each broadcast network uses its own form, with a different layout and style.

We don’t need the service to handle handwritten forms, languages other than English, or images of documents. Let’s assume we don’t have a machine learning team on standby to train a custom model.

The data

The documents in my test set are TV advertisement invoices for 2020 US political campaigns. The documents were originally made available by the FCC, but I downloaded them from the Weights & Biases Project DeepForm competition. Specifically, I randomly selected 51 documents from the 1,000 listed in Project DeepForm’s 2020 manifest and downloaded the documents directly from the Project DeepForm Amazon S3 bucket fcc-updated-sample-2020.

I created my own ground-truth annotations for the 51 selected invoices because Project DeepForm’s annotations are meant for the more challenging task of template-filling.⁵

Evaluation criteria

I measured correctness with recall. For each test document, I count how many of the 4 ground-truth key-value pairs each service found, ignoring any other output from the API. The final score for each service is the average recall over all test documents.

I use the Jaro-Winkler string similarity function to compare extracted and ground-truth text and decide if they match.
Form Parser attaches a confidence score to each extracted snippet. I ignore this; if the correct answer is anywhere in the response, I count it as a successful hit.
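The scoring described above can be sketched roughly as follows. This is not my exact evaluation code: the pure-Python Jaro-Winkler implementation, the 0.9 match threshold, the text normalization, and the rule that both key and value must match are all assumptions made for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, char in enumerate(s1):
        # Look for an unused, equal character within the sliding window.
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len(s1) + matches / len(s2)
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro-Winkler similarity: Jaro plus a bonus for a shared prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def texts_match(extracted: str, truth: str, threshold: float = 0.9) -> bool:
    """Fuzzy comparison after lowercasing and collapsing whitespace."""
    clean = lambda text: " ".join(text.lower().split())
    return jaro_winkler(clean(extracted), clean(truth)) >= threshold


def count_hits(extracted_pairs, truth_pairs: dict) -> int:
    """Count ground-truth pairs found anywhere in the extracted pairs."""
    return sum(
        any(texts_match(k, t_key) and texts_match(v, t_val)
            for k, v in extracted_pairs)
        for t_key, t_val in truth_pairs.items()
    )


def mean_recall(hits_per_document, fields_per_document: int = 4) -> float:
    """Average fraction of ground-truth fields found per document."""
    return sum(hits_per_document) / (fields_per_document * len(hits_per_document))


truth = {
    "Contract #": "26795081",
    "Advertiser:": "House Majority Forward",
    "Flight:": "2/5/20 - 2/18/20",
    "Total $": "$33,500.00",
}
# Hypothetical service output, for illustration only.
extracted = [
    ("Flight:", "2/5/20 - 2/18/20"),
    ("Advertiser", "House Majority Forward"),
    ("Buyer:", "Bassett, Laura"),
]
print(count_hits(extracted, truth))  # 2 of the 4 fields found
```

Extra pairs in the output, like the Buyer field here, cost nothing under this measure, because recall only asks whether the ground-truth pairs were found.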

The results table (below) includes a row for custom key-value mapping. For each service, I requested unstructured text in addition to the key-value pairs, then created my own key-value pairs by associating each pair of successive text blocks. For example, if the unstructured output was

["Pay Day:", "Aug 2, 2019", "912 Silver St.", "Suite 1966"]

then this heuristic approach would return:

{
    "Pay Day:": "Aug 2, 2019",
    "Aug 2, 2019": "912 Silver St.",
    "912 Silver St.": "Suite 1966"
}

Most of these pairs are nonsense, but because I evaluated with recall, this naïve method turned out to be a reasonable baseline.
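The successive-block heuristic takes only a few lines. Here is a minimal sketch, with a hypothetical function name, that reproduces the example above:

```python
def pair_successive_blocks(blocks):
    """Treat each text block as the key for the block that follows it."""
    # zip stops at the shorter input, so the last block is only ever a value.
    return dict(zip(blocks, blocks[1:]))


blocks = ["Pay Day:", "Aug 2, 2019", "912 Silver St.", "Suite 1966"]
for key, value in pair_successive_blocks(blocks).items():
    print(f"{key!r}: {value!r}")
```

Because the result is a dict, a text block that appears more than once keeps only its last pairing. A list of tuples would preserve every pair if that matters for your documents.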

Results

Form Parser’s average synchronous response time in our test was an outstanding 3.3 seconds, far better than the competition. Even the 90th percentile of response time was only 4.7 seconds. Depending on the application, Form Parser may be a viable option for a real-time use case, i.e. a customer-facing product.
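If you want to collect the same kind of timing numbers for your own documents, a sketch along these lines should work. The helper names are hypothetical, and the percentile uses the default interpolation of Python's statistics.quantiles, which may differ slightly from the method behind the numbers in this review.

```python
import statistics
import time


def time_calls(func, inputs):
    """Call func on each input and return the wall-clock duration of each call."""
    durations = []
    for item in inputs:
        start = time.perf_counter()
        func(item)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the mean and 90th percentile of a list of durations (seconds)."""
    # quantiles(n=10) returns the 9 decile cut points; the last is the 90th
    # percentile. It needs at least two data points.
    deciles = statistics.quantiles(durations, n=10)
    return {"mean": statistics.mean(durations), "p90": deciles[-1]}


# Stand-in workload; in practice func would wrap the Form Parser request.
durations = time_calls(lambda n: sum(range(n)), [10_000, 20_000, 30_000])
print(summarize(durations))
```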

Form Parser was also more accurate than Textract and Form Recognizer, although the results were less decisive and even Form Parser was not especially accurate. Form Parser found only 2.8 out of 4 form fields correctly, on average (median: 3 out of 4).

Form Parser also did poorly as the first step in my custom, naïve key-value mapping, finding only 0.4 of 4 correct answers on average. This is a concern because this kind of custom mapping can be a good fallback for cases when the off-the-shelf form extraction fails, but it doesn’t seem to be a good option with Google’s service.

**Accuracy and speed results.** Bold italics indicate the best result for each measure. For Microsoft Form Recognizer, the results are averaged over 25 documents held out of model training, rather than the full set of 51 documents. For unstructured text with Microsoft’s product, the test set was only 19 documents because 6 requests failed. The response time averages for Amazon Textract are underestimates because some requests took longer than my async job poller timeout of 3 minutes.
Measure	Amazon Textract	Google Form Parser	Microsoft Form Recognizer
Mean recall (out of 4)	2.4	*2.8*	2.2
Median recall (out of 4)	2	3	3
Mean recall, custom key-value mapping (out of 4)	*2.9*	0.4	*2.9*
Mean response time (seconds)	65.4	*3.3*	25.3
90th percentile response time (seconds)	173.1	*4.7*	41.1

Final thoughts

Because forms contain so much valuable information, form extraction is an alluring field. My test showed that for general-purpose, off-the-shelf form extraction, Google Form Parser is a serious contender. Its response times were considerably faster than those of Microsoft Form Recognizer and Amazon Textract, and its recall was marginally better.

All in all, I was underwhelmed by the results, even for Form Parser. An average recall of 70% is probably not good enough for most applications. If your documents are less noisy than the invoices I used, Form Parser might work, but you will likely be better off using a service that focuses on your specific document type.

References

Listing image by salvatore ventura on Unsplash.

Footnotes

¹ For invoices and receipts, in particular, see Taggun, Rossum, Google Procurement DocAI, and Microsoft Azure Form Recognizer Prebuilt Receipt model. For identity verification, try Onfido, Trulioo, or Jumio.
² Microsoft Form Recognizer requires a separate call to get unstructured text, so depending on how you count, it could be $0.10/page instead of $0.05.
³ The main Form Parser documentation page kinda hints that images should be passed in a base 64 encoding, but this is not the case. Read them as regular bytes objects, just like a PDF, and use MIME type image/png.
⁴ According to the documentation, text anchors are supposed to contain actual text in the content field, but in our experience, they don’t. The value of that field is always an empty string.
⁵ Annotating form documents is very tricky, and I made many small decisions to keep the evaluation as simple and fair as possible. Sometimes a PDF document’s embedded text differs from the naked eye interpretation. I went with the visible text as much as possible. Sometimes a key and value that should be paired are far apart on the page, usually because they’re part of a table. I kept these in the dataset, although I chose fields that are not usually reported in tables. Dates are arranged in many different ways. When listed as a range, I have included the whole range string as the correct answer, but when separated I included only the start date.