March 12, 2025

Table Model Selection in DocParse

Henry Lindeman, Engineer

In a previous blog post we introduced a new hybrid table extraction model. It combines two models: deformable_detr, which is better at large tables, and table_transformer, which is better at smaller ones. A ‘large’ table is one with many rows and columns, but we can’t know the row and column counts before running table extraction, since extraction is what produces them. So when you use table extraction in DocParse, we approximate ‘largeness’ with the absolute size of the table in pixels and use that to decide which model to run.

Usually this does the right thing, but occasionally it doesn’t. For example, a customer submitted a document with a physically large table that was large only because the page was heavily zoomed in; in reality it had just 6 rows and 2 columns. Because of its pixel size we routed it to deformable_detr, when table_transformer would have extracted it better. Here’s some example code that extracts a table with few rows and columns that is nonetheless physically large in the document:

from aryn_sdk.partition import partition_file
from aryn_sdk.partition.art import draw_with_boxes

data = partition_file("examplepdf.pdf", use_ocr=True, extract_table_structure=True)
ims = draw_with_boxes("examplepdf.pdf", data, draw_table_cells=True)
ims[0]

Because the table was routed to deformable_detr, you’ll notice that the output missed a row and combined the top and bottom 2 rows incorrectly. To handle cases like this, we’re introducing an escape hatch parameter: table_extraction_options.model_selection. Simply set table_extraction_options={"model_selection": "table_transformer"} or table_extraction_options={"model_selection": "deformable_detr"} to specify the exact model you want to use. Setting model_selection to “table_transformer” on the above example table, we get the correct output:

from aryn_sdk.partition import partition_file
from aryn_sdk.partition.art import draw_with_boxes

data = partition_file(
    "examplepdf.pdf",
    use_ocr=True,
    extract_table_structure=True,
    table_extraction_options={"model_selection": "table_transformer"},
)
ims = draw_with_boxes("examplepdf.pdf", data, draw_table_cells=True)
ims[0]
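If it isn’t obvious which model suits a particular document, one option is to run both and compare the results side by side. The following is a minimal sketch of that idea, not part of the Aryn SDK: the helper name partition_with_each_model is hypothetical, and the partition function is passed in as an argument so the helper can be tried without an API key. The keyword arguments are the same ones used in the example above.

```python
from typing import Any, Callable, Dict

# The two model names supported by model_selection.
MODELS = ("table_transformer", "deformable_detr")


def partition_with_each_model(path: str, partition_fn: Callable[..., Any]) -> Dict[str, Any]:
    """Call partition_fn once per model and return the results keyed by model name.

    Hypothetical helper for illustration; partition_fn is expected to accept
    the same arguments as aryn_sdk.partition.partition_file.
    """
    results = {}
    for model in MODELS:
        results[model] = partition_fn(
            path,
            use_ocr=True,
            extract_table_structure=True,
            table_extraction_options={"model_selection": model},
        )
    return results


# Usage (requires an Aryn API key, so it is left commented out):
# from aryn_sdk.partition import partition_file
# results = partition_with_each_model("examplepdf.pdf", partition_file)
```

Each entry in the returned dictionary can then be passed to draw_with_boxes, as in the example above, to inspect the table cells each model produced.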

If you want to get real technical, you can also write conditional expressions to tell DocParse how to choose the model. More complete documentation is here, but as an overview, an expression is a series of statements of the form "metric comparison threshold -> model", separated by semicolons. A statement that is just a model name is an unconditional expression, so the single-model settings shown earlier are themselves valid expressions. Supported models are “table_transformer” and “deformable_detr”. Supported metrics are “pixels” (the number of pixels in the max dimension of the table) and “chars” (the number of OCR’d characters in the table). Here are some examples:

1. "pixels > 500 -> deformable_detr; table_transformer"
# If max dimension pixels > 500, use deformable detr
# Otherwise use table transformer

2. "chars < 200 -> table_transformer; pixels < 500 -> table_transformer; deformable_detr"
# If there are fewer than 200 characters in the table, use table transformer
# Otherwise, if the max dimension of the table is less than 500 pixels, use table transformer
# Otherwise use deformable detr

3. "chars > 300 -> deformable_detr"
# If there are more than 300 characters in the table, use deformable detr
# Otherwise use table transformer (this is the default in case no unconditional expression is specified)


4. "deformable_detr; pixels < 400 -> table_transformer"
# Use deformable detr always. Stuff after an unconditional expression is unprocessed.
# You can even use that space for comments! e.g.
# "deformable_detr; we use deformable detr here because we think it's neat"
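To make the evaluation order in these examples concrete, here is a small Python sketch of how such an expression could be evaluated. It is an illustration of the rules described above, not DocParse’s actual implementation. The function name select_model is hypothetical, and the sketch assumes integer thresholds and only the < and > comparisons, since those are the only ones that appear in the examples.

```python
import operator
import re

MODELS = {"table_transformer", "deformable_detr"}
# Fallback when no statement matches (see example 3).
DEFAULT_MODEL = "table_transformer"
# Assumption: only the comparisons shown in the examples are handled here.
COMPARATORS = {"<": operator.lt, ">": operator.gt}
# Matches statements such as "pixels > 500 -> deformable_detr".
CONDITIONAL = re.compile(r"^(pixels|chars)\s*(<|>)\s*(\d+)\s*->\s*(\w+)$")


def select_model(expression: str, pixels: int, chars: int) -> str:
    """Return the model chosen by expression for a table with the given metrics."""
    metrics = {"pixels": pixels, "chars": chars}
    for statement in expression.split(";"):
        statement = statement.strip()
        if not statement:
            continue  # tolerate a stray or trailing semicolon
        if statement in MODELS:
            # Unconditional statement: everything after it is never read.
            return statement
        match = CONDITIONAL.match(statement)
        if match is None:
            raise ValueError(f"cannot parse statement: {statement!r}")
        metric, comparison, threshold, model = match.groups()
        if model not in MODELS:
            raise ValueError(f"unknown model: {model!r}")
        if COMPARATORS[comparison](metrics[metric], int(threshold)):
            return model
    return DEFAULT_MODEL


# Example 1 applied to a table whose max dimension is 800 pixels.
print(select_model("pixels > 500 -> deformable_detr; table_transformer", pixels=800, chars=0))
```

Statements are checked left to right and the first one that applies wins, which is why the text after the unconditional statement in example 4 never gets parsed.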

Sign up here to get an API key and get started with the Aryn SDK here.

If you have any questions or feedback, please contact us on Slack!