Benchmarking PDF segmentation and parsing models

The Aryn Team

Sep 26, 2024

Updated: Nov 5, 2024

Much of today’s unstructured data is trapped deep in documents like PDFs that are difficult to parse and analyze. These documents are often rich with a variety of information: images, diagrams, tables, text and more. A key challenge for most document analysis workflows is to accurately unlock this data and appropriately transform and index it for a variety of use cases (LLM-based RAG applications, document processing workflows, unstructured data analytics and more).

In this article we discuss the results of a bakeoff in which we compare the PDF segmenting and labelling capabilities of several systems. We show that the Aryn Partitioner, whose first step is running the document through a segmenting and labelling model, significantly outperforms the others (1.5x to 2.4x higher mAP than the alternatives) on the DocLayNet competition dataset, a modern segmentation and labeling benchmark. While the Aryn Partitioner does a lot more, we focused on segmentation and labeling, the first step in the document ETL process, because that made a big difference in answer quality for our customers' complex workloads. With that established, let’s dive a bit deeper into the Aryn Partitioner, the evaluation results and the methodology we used.

Intro to the Aryn Partitioner

The Aryn Partitioner performs the first step of document processing workflows, by parsing and extracting all the information available in the PDF. The model breaks down your PDF into its constituent components (tables, captions, images, etc.), labels them, draws bounding boxes around the components, extracts the underlying text, table structure, and images, and then returns the output as JSON. For example, take a look at the document below:

Here we see boxes around the components the model has identified, each accompanied by a label describing what the box contains (Image, Caption, Table, etc.). In fact, the Partitioner does far more than segment and label your documents. It can detect table structure, identify individual cells within a table (as shown above), perform Optical Character Recognition (OCR) on the text in your document, and more!

As customers have been trying out the Partitioner, we’ve been receiving requests for a head-to-head comparison between our model and some of the open-source and commercially available alternatives that offer similar capabilities. In this article, we evaluate each of these models against a benchmark and explain our methodology. We limit the comparison to the models' ability to correctly segment and label PDFs. While the Aryn Partitioner supports several other features such as OCR and table structure detection, many of the alternatives do not provide these features, so we did not include them in the comparison. Additionally, we ran each of these models on a slightly modified version of the DocLayNet competition dataset to make the comparison fair. This benchmark contains a diverse range of documents including reports, manuals, patents and more.

Results

We used the Mean Average Precision (mAP) and Mean Average Recall (mAR) metrics to measure accuracy. A higher mAP score indicates a more comprehensive and precise segmentation of the document, which is a proxy for the quality of the partitioning. The following are the results:

DocLayNet Competition Dataset **	mAP	mAR	Aryn vs others (mAP)	Aryn vs others (mAR)
Aryn Partitioner	0.640	0.747	1.0 x	1.0 x
Amazon Textract	0.423	0.507	1.51 x	1.47 x
Unstructured.io (YOLOX)	0.347	0.505	1.85 x	1.48 x
Azure AI Document Intelligence	0.266	0.475	2.41 x	1.57 x
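
The two "Aryn vs others" columns are simple ratios of the Aryn Partitioner's score to each system's score. The sketch below recomputes them from the rounded values in the table; the function and variable names are illustrative only. Because the inputs are already rounded to three decimals, a recomputed ratio can differ from the published column in the last digit (for example, the Unstructured.io mAP ratio comes out as 1.84 here rather than 1.85).

```python
# Scores copied from the results table (already rounded to three decimals).
scores = {
    "Aryn Partitioner": (0.640, 0.747),
    "Amazon Textract": (0.423, 0.507),
    "Unstructured.io (YOLOX)": (0.347, 0.505),
    "Azure AI Document Intelligence": (0.266, 0.475),
}


def ratios_vs(baseline, scores):
    """Return {system: (mAP ratio, mAR ratio)} of the baseline relative to each system."""
    base_map, base_mar = scores[baseline]
    return {
        name: (base_map / m_ap, base_mar / m_ar)
        for name, (m_ap, m_ar) in scores.items()
    }


ratios = ratios_vs("Aryn Partitioner", scores)
for name, (r_map, r_mar) in ratios.items():
    print(f"{name}: {r_map:.2f}x mAP, {r_mar:.2f}x mAR")
```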

We used the publicly available COCO evaluation (primary challenge metric) to compare each model’s output to the ground truth labeling. As can be seen above, the Aryn Partitioner has an mAP that is 1.5x to 2.4x better and an mAR that is roughly 1.5x better than the above alternatives. Taking it a step further, when we've fine-tuned the underlying open source segmentation model for customer workloads, we've seen up to 6x higher mAP and 3.3x higher mAR than the above alternatives.
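
To make the metric more concrete: the COCO primary challenge metric averages precision over ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05, and a predicted box only counts as correct at a given threshold if its label agrees with the ground truth and the boxes overlap at least that much. The following is a minimal sketch of that matching test, not the evaluation code used for this article. The dictionary layout and the (x1, y1, x2, y2) box format are assumptions made for illustration (COCO itself stores boxes as x, y, width, height).

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds of the COCO primary metric: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def is_match(pred, truth, threshold):
    """A prediction is correct only if the label agrees and the boxes overlap enough."""
    return (
        pred["label"] == truth["label"]
        and iou(pred["bbox"], truth["bbox"]) >= threshold
    )


# Hypothetical example: the predicted table box is slightly too narrow.
truth = {"label": "Table", "bbox": (100, 100, 300, 200)}
pred = {"label": "Table", "bbox": (125, 100, 300, 200)}

overlap = iou(pred["bbox"], truth["bbox"])
matched_at = [t for t in IOU_THRESHOLDS if is_match(pred, truth, t)]
print(f"IoU = {overlap:.3f}, counted as correct at {len(matched_at)} of 10 thresholds")
```

A box that is only roughly right is therefore rewarded at the loose thresholds and penalized at the strict ones, which is why tight bounding boxes matter for the final score.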

** We only chose to compare the Aryn Partitioner to models and services that segment and label documents and provide bounding boxes around the components identified. These bounding boxes need to specify the coordinates of the components in the document.

Methodology

Benchmark Used

For the evaluation we used the DocLayNet competition dataset, which consists of 498 pages, each of them annotated with bounding boxes. This dataset includes a variety of different types of documents including reports, manuals, academic papers, patents, and more. There are 11 different types of labels that DocLayNet uses for components in documents:

Type	Description
Title	Large Text
Text	Regular Text
Caption	Description of an image or table
Footnote	Small text found near the bottom of the page
Formula	LaTeX or similar mathematical expression
List-item	Part of a list
Page-footer	Small text at bottom of page
Page-header	Small text at top of page
Image	A picture or diagram
Section-header	Medium-sized text marking a section
Table	A grid of text

It is important to note that the Aryn Partitioner was NOT trained on the DocLayNet competition dataset but on the DocLayNet dataset (which are two different sets of documents). The DocLayNet competition dataset has a different distribution of documents from the DocLayNet dataset.

Making the comparison fair

To ensure a fair and accurate comparison between the models, we had to take some steps to reconcile the differences between the labels each of these models/services use. While the Aryn Partitioner outputs the same 11 labelling classes as DocLayNet, the other models in this document do not. Here are the labels from the other systems:

Microsoft Document Intelligence	Textract	Unstructured
figure	LAYOUT_FIGURE	Image
footnote	LAYOUT_FOOTER	Footer
pageFooter
pageHeader	LAYOUT_HEADER	Header
sectionHeading	LAYOUT_SECTION_HEADER
table	LAYOUT_TABLE
		Table
text	LAYOUT_TEXT	NarrativeText
		UncategorizedText
title	LAYOUT_TITLE	Title
		Formula
pageNumber	LAYOUT_PAGE_NUMBER
	LAYOUT_KEY_VALUE
	LAYOUT_LIST	ListItem
		FigureCaption

Given that each of these models labels documents differently, we removed pages from the dataset that would unfairly benefit a particular model:

The Aryn Partitioner was the only model that used the “Formula” label. We removed the 4 pages (of the 498 in the benchmark) that had a “Formula” label in the ground truth.
Textract is the only model that has a “key value” label (the Aryn Partitioner does not explicitly use this label, but you can extract key-value pairs through our ETL library Sycamore). The competition dataset contained 1 page on which Textract detected a “key value” object. We removed this page from the dataset as well.
DocLayNet and the Aryn Partitioner handle lists in a very fine-grained manner. Each item of a list is labeled as a “list item”, whereas other models classify all these items as one block. Additionally, Aryn draws a bounding box around each list item, whereas some other models draw one bounding box around the entire list. A total of 161 pages in the DocLayNet competition dataset contained lists. We removed these from the comparison as well.
- Note: By removing pages with list items, we are avoiding penalizing models that don’t use list-item as a label. By default, these models would score poorly on those pages and we wanted to be as fair as possible to models that don’t follow DocLayNet guidelines.

Overall, this process took out 165 pages and left 333 pages from the DocLayNet competition dataset in the final benchmark set.
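
The page-removal rules above can be sketched as a simple filter. This is an illustration under stated assumptions rather than the script used for the benchmark: the per-page label sets, page ids, and function name are hypothetical, while the label names come from the tables in this article.

```python
# Ground-truth labels whose presence caused a page to be dropped.
EXCLUDED_GROUND_TRUTH_LABELS = {"Formula", "List-item"}
# Textract output labels whose presence caused a page to be dropped.
EXCLUDED_TEXTRACT_LABELS = {"LAYOUT_KEY_VALUE"}


def keep_page(ground_truth_labels, textract_labels):
    """Keep a page only if it has no formulas, no list items, and no key-value objects."""
    return (
        ground_truth_labels.isdisjoint(EXCLUDED_GROUND_TRUTH_LABELS)
        and textract_labels.isdisjoint(EXCLUDED_TEXTRACT_LABELS)
    )


# Hypothetical toy pages; the real benchmark has 498 annotated pages.
pages = [
    {"id": "p1", "ground_truth": {"Text", "Table"}, "textract": {"LAYOUT_TEXT", "LAYOUT_TABLE"}},
    {"id": "p2", "ground_truth": {"Text", "Formula"}, "textract": {"LAYOUT_TEXT"}},
    {"id": "p3", "ground_truth": {"List-item", "Title"}, "textract": {"LAYOUT_LIST"}},
    {"id": "p4", "ground_truth": {"Text"}, "textract": {"LAYOUT_KEY_VALUE", "LAYOUT_TEXT"}},
]

kept = [p["id"] for p in pages if keep_page(p["ground_truth"], p["textract"])]
print(kept)

# Page counts reported in this article.
TOTAL_PAGES, REMOVED_PAGES = 498, 165
remaining = TOTAL_PAGES - REMOVED_PAGES
print(remaining)
```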

A second issue was to ensure that we fairly compared each of these models to the ground truth. Since each of the models labels documents slightly differently, we created a common baseline that consisted of the following labels:

Page Header
Picture
Section Header
Table
Text

We believe these labels are the lowest common denominator across the label sets of all of the models. We then mapped each model’s labels to this set:

Document Intelligence	Textract	Unstructured	Aryn Partitioner
pageHeader → Page Header	LAYOUT_HEADER → Page Header	FigureCaption → Text	Caption → Text
figure → Picture	LAYOUT_FIGURE → Picture	Header → Section Header	Page Header → Page Header
title → Section Header	LAYOUT_SECTION_HEADER → Section Header	Image → Picture	Image → Picture
sectionHeading → Section Header	LAYOUT_TABLE → Table	Title → Section Header	Section Header → Section Header
table → Table	LAYOUT_TEXT → Text	Table → Table	Table → Table
text → Text	LAYOUT_TITLE → Section Header	NarrativeText → Text	Text → Text
		UncategorizedText → Text	Title → Section Header

In practice, this meant that when comparing Document Intelligence against the ground truth, if Document Intelligence labeled a bounding box as “figure”, we considered that correct as long as the ground truth labeled it as “Picture”. Similarly, if Unstructured labeled something as “NarrativeText”, it was considered correct if the ground truth labeled it as “Text”.

Some DocLayNet labels do not show up in the table above, such as Page Footer and Footnote. We removed them because each model handles these very differently, and including them would skew the results unfairly. For example, Unstructured does not have a Page Footer equivalent label, and merges both Page Footer and Footnote when it draws the bounding boxes.
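
The label reconciliation described above amounts to a per-system lookup table plus a rule that drops anything without an entry. Here is a minimal sketch with the mappings transcribed from the table in this section. The element dictionaries and the `normalize` helper are hypothetical and only illustrate the idea; they are not the code used for the benchmark.

```python
COMMON_LABELS = {"Page Header", "Picture", "Section Header", "Table", "Text"}

# Mappings transcribed from the mapping table in this section.
LABEL_MAPS = {
    "Document Intelligence": {
        "pageHeader": "Page Header",
        "figure": "Picture",
        "title": "Section Header",
        "sectionHeading": "Section Header",
        "table": "Table",
        "text": "Text",
    },
    "Textract": {
        "LAYOUT_HEADER": "Page Header",
        "LAYOUT_FIGURE": "Picture",
        "LAYOUT_SECTION_HEADER": "Section Header",
        "LAYOUT_TABLE": "Table",
        "LAYOUT_TEXT": "Text",
        "LAYOUT_TITLE": "Section Header",
    },
    "Unstructured": {
        "FigureCaption": "Text",
        "Header": "Section Header",
        "Image": "Picture",
        "Title": "Section Header",
        "Table": "Table",
        "NarrativeText": "Text",
        "UncategorizedText": "Text",
    },
    "Aryn Partitioner": {
        "Caption": "Text",
        "Page Header": "Page Header",
        "Image": "Picture",
        "Section Header": "Section Header",
        "Table": "Table",
        "Text": "Text",
        "Title": "Section Header",
    },
}


def normalize(system, elements):
    """Map each element's label to the common set; drop elements with no mapping."""
    mapping = LABEL_MAPS[system]
    return [
        {**el, "label": mapping[el["label"]]}
        for el in elements
        if el["label"] in mapping
    ]


# Hypothetical Unstructured output: the Footer element has no common label and is dropped.
predictions = [
    {"label": "NarrativeText", "bbox": (50, 80, 500, 140)},
    {"label": "Footer", "bbox": (50, 760, 500, 780)},
]
normalized = normalize("Unstructured", predictions)
print(normalized)
```

The same normalization has to be applied to the ground truth so that both sides of the comparison use the five common labels.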

Conclusion

As shown by our evaluation, the Aryn Partitioner clearly outperforms the alternatives when it comes to segmenting and labelling PDFs. Its underlying segmentation model achieves an mAP that is 1.5x to 2.4x higher and an mAR that is roughly 1.5x higher than the alternatives we tested. But the Aryn Partitioner does much more than segmentation and labeling. It also offers table segmentation, text extraction with and without OCR, and post-processing to optimize the output for RAG, GenAI, and compliance use cases. We have so much more planned; this is just the beginning! While your mileage may vary based on the document type and use case, we recommend that you try out the Aryn Partitioner and its state-of-the-art capabilities.

Get Started Today!

Try out the Aryn Partitioner in the Aryn Partitioning Service. All you need is an API key to get started (sign up here for free). You can access the cloud service through the Aryn Playground, the Aryn SDK, or a Sycamore script. You can also run the partitioner locally through a Sycamore script by setting the use_partitioning_service boolean to false. We’d love to hear your feedback on the service or any feature requests you have for your workloads.

To learn more, visit the Aryn Partitioning Service documentation.

Email us: info@aryn.ai

Join the Sycamore Slack