January 22, 2025

We improved table extraction in DocParse with a new AI model

By Henry Lindeman, Engineer and Dhruv Kaliraman, Engineer

Aryn’s goal is to provide analytics over unstructured documents, typically PDFs. We believe that integral to that goal is the ability to break down these documents into a more structured form that we can use to extract more information than you might be able to by simply stripping out all of the text and embedding it with a language model. Since documents are typically intended for human consumption, and humans use eyeballs to ingest information from documents, we believe that the first step of computer document processing is fundamentally a computer vision problem. To that end, we created a computer vision model to segment documents into a collection of physical, typed elements representing how a real person would likely parse the document.

With many documents, a large portion of customer questions can be answered by a table or set of tables within the document. If I have Boeing’s 10-K form from 2024 and I want to know how many airplanes they sold, the answer is in a table. If I have the municipal budget report of Wilmington, NC, and I want to know what percent of their budget they spent on policing, the answer is in a table. As with documents generally, tables are laid out to optimize for human consumption - that is, visually. I can’t ask a PDF for an indexable data structure representing a particular table on a particular page, because all the PDF sees is a collection of rectangles with text inside them. For Aryn to query these tables, we first need a way to turn a picture of a table into such a data structure.

Concretely

Let me state the problem more precisely:

Given a picture of a table, output HTML that generates the table.

The choice of HTML might seem kind of arbitrary (and it is), but it is well understood, it can express almost anything recognizable as a table, and for most well-structured tables it can be converted to a Pandas DataFrame for downstream operations.
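
To make that last point concrete, here’s a quick illustration of turning an HTML table into a DataFrame. This assumes pandas and lxml are installed, and the table contents are made up purely for the example.

```python
# Convert an HTML <table> into a Pandas DataFrame for downstream operations.
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>Region</th><th>Deliveries</th></tr>
  <tr><td>North</td><td>12</td></tr>
  <tr><td>South</td><td>7</td></tr>
</table>
"""

# pd.read_html returns a list of DataFrames, one per <table> in the input.
df = pd.read_html(StringIO(html))[0]
print(df)
```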

We get the picture of the table from the whole-document segmentation model, which outputs ‘table’ elements (among others) with bounding boxes that we can use to identify the section of the document containing the table. From there, we need to extract the table’s contents.

How do we know if a table extraction solution is any good?

To determine whether a specific table extraction method is any good, we collected a number of metrics to measure it, mostly pulled from the research literature.

  1. TEDS: Short for ‘Tree Edit Distance Similarity,’ this metric converts the predicted and ground-truth HTML to trees and then measures tree edit distance. It is the academic standard metric for the table recognition problem, but it can be a little hard to form an intuition about what gets penalized - for example, depending on the shape of a table, dropping a row might be better or worse than dropping a column. TEDS comes in two flavors - TEDS-Struct and TEDS-Content. TEDS-Struct only cares about the shape of the tree, whereas TEDS-Content also penalizes a prediction for messing up the text in the cells (by adding a string edit distance cost component). A rough sketch of a TEDS-style computation follows this list.
  2. GRITS: Short for ‘Grid Table Similarity,’ this metric converts the predicted and ground-truth HTML into grids of cells and computes a 2-D analog of the largest common substructure. GRITS was invented by the team that created Table Transformer, a popular table recognition model. I find GRITS slightly easier to reason about - I think it better represents how similar two tables are. That said, computing it exactly is NP-hard, so the researchers implemented an O(n^4) approximation, which gets slow when tables get big. Furthermore, as table models get good, GRITS scores approach 1 a lot faster than TEDS scores, so it can be hard to make a meaningful decision based on them: the difference between two models’ GRITS scores can be less than 0.005, which is not a lot of signal. GRITS comes in three flavors - GRITS_Con, GRITS_Top, and GRITS_Loc. The nuances between them are out of scope for this post, but they largely agree with each other.
  3. Acc_Con: This one is simple. For what proportion of tables did the model predict exactly the ground truth?
  4. Manual Visual Inspection: At the end of the day, we need to decide whether model A or model B is better, and all the quantitative metrics really tell us is “they’re pretty close.” So, we can simply have both models make a bunch of predictions and then look at them side-by-side. This has the added bonus that it lets us evaluate models on data that hasn’t been labelled - since we’re just comparing tables with our eyes, we don’t need any ground truth. To get a quantitative measure out of this process, we classified each table by whether model A was better, model B was better, they were both good, or they both sucked.
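
As promised, here is a rough sketch of what a TEDS-style computation looks like, assuming the zss (Zhang-Shasha tree edit distance) package and lxml. It ignores cell text, so it is closer to TEDS-Struct than TEDS-Content, and it is not the exact implementation we use.

```python
# Sketch of a TEDS-Struct-style score: build trees from the HTML tag
# structure, compute tree edit distance, and normalize by tree size.
from lxml import etree
from zss import Node, simple_distance

def html_to_tree(html: str) -> Node:
    """Convert an HTML <table> string into a zss tree keyed on tag names."""
    def build(el):
        node = Node(el.tag)
        for child in el:
            node.addkid(build(child))
        return node
    return build(etree.fromstring(html, parser=etree.HTMLParser()))

def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(c) for c in node.children)

def teds_struct(pred_html: str, true_html: str) -> float:
    pred, true = html_to_tree(pred_html), html_to_tree(true_html)
    dist = simple_distance(pred, true)
    return 1.0 - dist / max(count_nodes(pred), count_nodes(true))
```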

We use the quantitative metrics to filter out models that definitely won’t meet our needs, and then we use visual inspection to select a model from a list of candidates that pass the bar.

Off-the-shelf solutions

We evaluated a number of off-the-shelf solutions during this work. I’ll go through a few of them here, showcase the kinds of issues we ran into with each, and explain why we decided to train our own model.

ChatGPT

In the age of the LLM, your first instinct might be to hand an LLM the table and have it do all the hard work. We tried this, and for sufficiently complicated tables, it didn’t do too well. Furthermore, being an opaque text-generation machine, there wasn’t much we could do to correct it, since we didn’t have access to any intermediate data. LLMs have gotten better since we started, so it’s entirely conceivable that something like OpenAI’s o3 can do this task better than its predecessors, but it’s also likely that asking o3 to extract 20 tables from a document will take an hour and cost $400 (assuming roughly $20 and 10 minutes per table, with some concurrency), which is slow and expensive.
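
For reference, this is roughly what “handing an LLM the table” looks like. It is a sketch using the OpenAI Python client; the model name and prompt are placeholders rather than exactly what we ran.

```python
# Ask a vision-capable LLM to transcribe a table image into HTML.
# The model name and prompt are illustrative; this is not our production pipeline.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("table.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this table as an HTML <table>. Output only the HTML."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
table_html = response.choices[0].message.content
```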

UniTable

UniTable is a recent, state-of-the-art, unified model for table extraction. It is an autoregressive vision model - that is, it looks at a picture of a table and then outputs HTML tags one at a time, like ChatGPT. The ‘unified’ part of its name refers to the fact that it outputs not only HTML tags but also bounding boxes and OCR’d text, such that all three sequences can be stitched together into a single HTML table with all of the content filled in.

UniTable did quite well… for sufficiently small tables. Once tables get big, two things happen: UniTable gets slow, since it predicts one cell at a time, and eventually it fills up its context window and halts. Additionally, if UniTable accidentally skips a cell, or produces the wrong number of columns, you can get results like this:

A UniTable output. Note the diagonal columns, spilling over because it generated the wrong number of columns.


We need something fast that can handle large tables, so UniTable does not fit our use case.

Amazon Textract

Textract is a powerful document processing and OCR service from AWS. They’ve spent a lot of time on table extraction, so maybe we can just leverage their experience and use it? In fact, Textract was one of the higher-quality solutions we looked at, but it was also quite slow, and quite expensive. Using Textract to process a single table costs more than a single page in Aryn DocParse, so it’s really a non-starter.

Table Transformer

I mentioned Table Transformer (TATR) earlier, and it solves a lot of these problems. First, as an object detection model, it can predict the entire table structure in a single inference pass, so it’s pretty fast. It’s also open source, so we can run it ourselves at low cost. This also lets us tweak the post-processing to handle various edge cases we’ve run into. Lastly, it posts strong scores on the quantitative benchmarks.
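
Running TATR yourself is straightforward. Here is a minimal sketch using the HuggingFace transformers structure-recognition checkpoint; the checkpoint name and confidence threshold are our assumptions for illustration, not necessarily what DocParse runs in production.

```python
# Predict table structure objects (rows, columns, spanning cells, ...) with TATR.
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

checkpoint = "microsoft/table-transformer-structure-recognition"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

image = Image.open("table.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to labeled bounding boxes above a confidence threshold.
results = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=torch.tensor([image.size[::-1]])
)[0]
for label, box in zip(results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], box.tolist())
```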

For these reasons, Table Transformer was the winner in our initial search, and for a few months we ran it in DocParse. However, it had some issues when parsing very large tables, which prompted us to dig deeper and eventually train our own model.

Deeper Dive into Table Transformer

DETR Architecture

Table Transformer is a DETR model. DETR is a popular architecture for modern object detection models, built on top of the Transformer architecture. It works like this: a backbone convolutional neural network (CNN) looks at the input picture and converts it to a series of feature vectors. Those feature vectors act as the input tokens to a transformer encoder stack. Finally, a transformer decoder stack uses the context from the encoder for cross-attention and predicts object bounding boxes and classes.

The number of objects a DETR model can predict is controlled by the constant n_queries - the decoder input is essentially just n_queries dummy tokens, which gather context from the cross-attention operations and coordinate with each other to predict all of the objects in the image.
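
Here is a toy illustration of the n_queries idea (not TATR’s actual code): the decoder starts from a fixed number of learned query embeddings, and each query can emit at most one object, so the model can never output more objects than it has queries.

```python
# Toy DETR-style decoding: n_queries learned "object slots" attend over the
# encoder's image features, and each slot predicts one class + one box.
import torch
import torch.nn as nn

n_queries, d_model, n_classes = 125, 256, 6   # 6 object classes + 1 "no object"

query_embed = nn.Embedding(n_queries, d_model)  # the learned dummy tokens
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8), num_layers=6
)
class_head = nn.Linear(d_model, n_classes + 1)
box_head = nn.Linear(d_model, 4)

encoder_memory = torch.randn(600, 1, d_model)   # stand-in for encoder output (seq, batch, dim)
queries = query_embed.weight.unsqueeze(1)       # (n_queries, batch=1, d_model)

hidden = decoder(queries, encoder_memory)       # cross-attention over image features
class_logits = class_head(hidden)               # which class (or "no object") per query
boxes = box_head(hidden).sigmoid()              # normalized (cx, cy, w, h) per query
```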

Table Transformer Specifics

Table Transformer uses the DETR architecture to predict rows, columns, spanning cells, projected row headers, column headers, and tables. These classes are sufficient to generate the HTML representation of a table, although doing so requires a fairly complex piece of post-processing. For example:

Rows are in blue, columns are in orange, spanning cells are in lime, the column header is in purple.


Those combine into the following HTML representation:

There are a few OCR errors - TATR relies on an external OCR model to give it the bounding boxes and contents of the text in order to slot the text into the correct positions in the table. A lot of the post-processing work goes toward this.
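
To give a flavor of that post-processing, here is a rough sketch - much simplified, ignoring spanning cells and headers, and not Aryn’s actual code - of turning row and column boxes plus OCR’d words into HTML: intersect each row with each column to get a cell, then assign each word to the cell that contains its center.

```python
# Turn predicted row/column boxes and OCR'd words into a simple HTML grid.
def intersect(a, b):
    """Intersection of two (x0, y0, x1, y1) boxes."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def center_in(box, cell):
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    return cell[0] <= cx <= cell[2] and cell[1] <= cy <= cell[3]

def to_html(rows, cols, words):
    """rows/cols: lists of (x0, y0, x1, y1); words: list of (text, box)."""
    html = ["<table>"]
    for r in sorted(rows, key=lambda b: b[1]):          # top-to-bottom
        html.append("<tr>")
        for c in sorted(cols, key=lambda b: b[0]):      # left-to-right
            cell = intersect(r, c)
            text = " ".join(t for t, b in words if center_in(b, cell))
            html.append(f"<td>{text}</td>")
        html.append("</tr>")
    html.append("</table>")
    return "".join(html)
```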

Upshot

TATR uses n_queries = 125, so when you have a very large table with more than 125 objects in it (rows + columns + spanning cells; the other classes should be O(1) per table), TATR is physically unable to generate enough objects to represent the table. For instance, a table with 100 rows and 30 columns already needs at least 130 objects. Even when the number of objects is close to but below 125, TATR has issues, because nearly every query has to produce a correct object, leaving little slack for mistakes. We tried setting n_queries to something larger, but since the model wasn’t trained with that setting it just added noise.

The intermediate table representation between objects and HTML - red boxes are cells derived from the larger objects. Note the several missing rows.

Let’s train a model!

To further improve DocParse’s table extraction quality, we need to create and train a new model that better handles these scenarios. Table Transformer mostly does what we need it to, so we’ll largely follow its architecture and methodology. That is, we’ll predict row and column objects with a DETR-style model and use those to generate HTML.

DETR-Style?

Yes! We use a modified DETR architecture called Deformable DETR, which does effectively the same thing but adds a few improvements. Firstly, it uses a special attention mechanism, Deformable Attention, which learns a small set of sampling offsets so that each query attends to just a few relevant locations rather than the whole feature map. Secondly, it uses the encoder to generate ‘candidate’ input tokens for the decoder, so the inputs to the decoder are more dynamic. These features combine to create a model that converges faster and is better at handling smaller objects.

This is perfect for our big-table problem - larger tables have smaller objects, because there are more of them and the CNN backbone requires us to scale all images to the same size. Additionally, we train with n_queries = 300, which is sufficient for pretty much every table that fits on a page.
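
As a sketch of what this looks like in code, HuggingFace transformers ships a Deformable DETR implementation. The label count, query budget, and two-stage/box-refinement flags below mirror the description above, but they are illustrative rather than our exact training configuration.

```python
# Configure a Deformable DETR object detector with a larger query budget.
from transformers import DeformableDetrConfig, DeformableDetrForObjectDetection

config = DeformableDetrConfig(
    num_queries=300,       # enough object slots for rows + columns + spanning cells
    num_labels=6,          # table, row, column, column header, projected row header, spanning cell
    with_box_refine=True,  # iterative box refinement, commonly paired with two-stage
    two_stage=True,        # let the encoder propose candidate queries for the decoder
)
model = DeformableDetrForObjectDetection(config)
```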

Training Data

Deep learning is, of course, all about the data. The original TATR was trained on a dataset called PubTables-1M and a canonicalized form of FinTabNet available on HuggingFace. PubTables consists of scientific articles from PubMed, and FinTabNet contains mostly financial documents and SEC filings. We train on these datasets.

We also created a canonicalized version of SynthTabNet, a synthetic dataset meant to broaden the horizons of prospective table models from just the finance and medical research domains. To canonicalize, we followed the same algorithm described by the TATR team here. In addition to IBM’s synthetic data, we generated two sets of our own synthetic data, one representing very large tables - on the order of 10-20 columns and 30-80 rows - and the other representing very small tables, 2 columns by 1-5 rows.
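
For the curious, generating the synthetic HTML itself is the easy part. Here is a minimal sketch of what a “very large table” generator might look like; the cell contents are throwaway, and a real pipeline would also need to render each table to an image and record ground-truth bounding boxes, which is omitted here.

```python
# Generate random table HTML in the size ranges described above.
import random

def synth_table_html(min_rows=30, max_rows=80, min_cols=10, max_cols=20):
    n_rows = random.randint(min_rows, max_rows)
    n_cols = random.randint(min_cols, max_cols)
    header = "<tr>" + "".join(f"<th>col {c}</th>" for c in range(n_cols)) + "</tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{random.randint(0, 9999)}</td>" for _ in range(n_cols)) + "</tr>"
        for _ in range(n_rows)
    )
    return f"<table>{header}{body}</table>"

big_table = synth_table_html()                                                   # 10-20 cols, 30-80 rows
small_table = synth_table_html(min_rows=1, max_rows=5, min_cols=2, max_cols=2)   # tiny tables
```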

Testing Data

Because of where we integrated the various metrics and the formats of the datasets, we tested different datasets with different metrics. For GRITS and Acc_Con we tested against the test splits of our training data, along with a canonicalized ICDAR-2013. ICDAR-2013 was a table recognition competition; its dataset contains just 258 high-quality hand-annotated tables. For TEDS we evaluated FinTabNet (the non-canonicalized version), PubTabNet, and KoneTabNet. PubTabNet is another set of PubMed articles, so we avoided training on it to prevent overlap with PubTables. KoneTabNet is a set of just 5 hand-labelled tables we created from a Kone elevator manual. These are particularly challenging large tables, like the one pictured above that TATR fails to fully generate.

Training Process

The model that ended up performing best was trained for 17 epochs total, with each epoch containing 720k datapoints sampled randomly from just FinTabNet and PubTables. We multiplied the learning rate by 0.9 every epoch, and at epochs 5 and 10 we multiplied it by an additional factor of 0.4. We tried about a million permutations of including the synthetic datasets in the training run, but in the end none of those checkpoints performed as well (per the quantitative metrics) as the model trained only on ‘real’ data.
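
In PyTorch terms, that learning-rate schedule looks roughly like the following; the optimizer choice, base learning rate, and stand-in model are assumptions for illustration.

```python
# Multiply the learning rate by 0.9 every epoch, with an extra 0.4 factor
# applied at epochs 5 and 10, as described above.
import torch

model = torch.nn.Linear(10, 10)            # stand-in for the detection model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def lr_factor(epoch: int) -> float:
    factor = 0.9 ** epoch
    if epoch >= 5:
        factor *= 0.4
    if epoch >= 10:
        factor *= 0.4
    return factor

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(17):
    # ... train for one epoch ...
    scheduler.step()
```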

Let’s evaluate a model!

GRITS Measurements

Here we compare Table Transformer with the model we trained, which we call rdd17 (real, double drop, cpt 17). GRITS and Acc_Con are heavily embedded in the training code, so it only made sense to run them as part of the training process - which means we could not use them to compare against non-TATR-style models.

I’ll note that we limited PubTables to 10,000 datapoints for evaluation, from the 100k in the dataset. This was purely for speed, but could introduce a little bit of noise into the measurements.

rdd17 and TATR are roughly equivalent where the GRITS metrics are concerned; rdd17 holds a slight edge in some categories and TATR holds a slight edge in others, although rdd17 fairly consistently outperformed on Acc_Con. We did not find a way to push the GRITS scores higher with more or different training, without sacrificing performance on some dataset.

TEDS Measurements

Here we compare our model with Table Transformer, as well as PaddleOCR’s table recognition solution and AWS Textract. We did not integrate UniTable into the evaluation pipeline because of the issues we discovered when using it in a standalone notebook.

I’ll explain the hybrid model in a minute. Paddle’s table model was trained on PubTabNet, so it’s not surprising that it performs quite well on that dataset. Textract does quite well across the board, though TATR is a little better on the open datasets. TATR fares significantly worse on the large tables of KoneTabNet, however. Our rdd17 model is a little bit worse overall than TATR on the open datasets, but strictly better than anything else on the big KoneTabNet tables.

Visual Inspections

Since TATR and rdd17 were essentially tied, we didn’t have a great way to decide that one was better than the other. rdd17 was better for some tables, and TATR was better for others. So, we decided to look at tables that customers sent us under both models. At the end of the day, whatever performs best on our customers’ data is what we should deploy, regardless of any open benchmarks.

In comparing hundreds of tables by hand, we discovered that, fairly reliably, rdd17 was better for big tables, and TATR was better for smaller tables. Surprising, I know. The new model whose raison d'être is to be better for big tables was better for big tables, and worse for small tables. Accordingly, we implemented a fancy machine learning technique known as an ‘if statement’ to use rdd17 for big tables and TATR for small tables. That’s the hybrid model above, and in our follow-up visual inspection we confirmed that it gives us the best of both worlds, as the TEDS metrics suggest. The visual inspection table is a little misleading: in general, where TATR was better it was only slightly better, but where the hybrid model was better the difference tended to be more significant.
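
The routing itself really is just an if statement. Here is a sketch; the size estimate, the threshold, and the model wrapper methods are hypothetical stand-ins rather than our exact production code.

```python
# The "hybrid model": route big tables to rdd17 and small tables to TATR.
# estimate_num_objects, .extract(), and the threshold are hypothetical stand-ins.
def extract_table_html(image, tatr, rdd17, big_table_threshold=60):
    n_objects = estimate_num_objects(image)    # rough count of rows + columns
    if n_objects >= big_table_threshold:
        return rdd17.extract(image)            # better on big tables
    return tatr.extract(image)                 # better on small tables
```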

Conclusion

So, if you need to parse documents, use Aryn DocParse. It’s really good, especially at tables.