
Benchmarking PDF segmentation and parsing models

The Aryn Team

Updated: Nov 5, 2024

Much of today’s unstructured data is trapped deep in documents like PDFs that are difficult to parse and analyze. These documents are often rich with a variety of information: images, diagrams, tables, text and more. A key challenge for most document analysis workflows is to accurately unlock this data and appropriately transform and index it for a variety of use cases (LLM-based RAG applications, document processing workflows, unstructured data analytics and more).


In this article we discuss the results of a bakeoff in which we compare the PDF segmentation and labeling capabilities of several systems. We show that the Aryn Partitioner, whose first step is running the document through a segmentation and labeling model, significantly outperforms the others (1.5x to 2.4x better mAP than the alternatives) on the DocLayNet competition dataset, a modern segmentation and labeling benchmark. While the Aryn Partitioner does a lot more, we focused on segmentation and labeling, the first step in the document ETL process, because it made a big difference in answer quality for our customers' complex workloads. With that established, let’s dive a bit deeper into the Aryn Partitioner, the evaluation results and the methodology we used.


Intro to the Aryn Partitioner


The Aryn Partitioner performs the first step of document processing workflows: parsing and extracting all the information available in the PDF. The model breaks down your PDF into its constituent components (tables, captions, images, etc.), labels them, draws bounding boxes around the components, extracts the underlying text, table structure, and images, and then returns the output as JSON. For example, take a look at the document below:



Here we see boxes around the components the model has identified, each accompanied by a label describing what the box contains (Image, Caption, Table, etc.). The Partitioner, in fact, does far more than segment and label your documents. It can detect table structure, identify individual cells within your table (as shown above), perform Optical Character Recognition (OCR) on the text in your document, and more!
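To make the idea of labeled bounding boxes returned as JSON concrete, here is a minimal Python sketch that groups elements by label. The JSON shape and the field names ("elements", "type", "bbox", "text") are assumptions made up for illustration, not the Partitioner's documented schema, and the sample values are invented.

```python
import json
from collections import defaultdict

# Hypothetical partitioner output. The field names and values below are
# assumptions for illustration only, not the real response schema.
SAMPLE_OUTPUT = """
{
  "elements": [
    {"type": "Section-header", "bbox": [0.10, 0.05, 0.90, 0.09], "text": "Results"},
    {"type": "Text", "bbox": [0.10, 0.10, 0.90, 0.30], "text": "We measured accuracy."},
    {"type": "Table", "bbox": [0.10, 0.35, 0.90, 0.60], "text": "mAP 0.640"},
    {"type": "Caption", "bbox": [0.10, 0.61, 0.90, 0.64], "text": "Table 1: Scores"}
  ]
}
"""


def group_by_label(raw_json):
    """Return a dict mapping each label to the list of its bounding boxes."""
    grouped = defaultdict(list)
    for element in json.loads(raw_json)["elements"]:
        # Each box is (x1, y1, x2, y2); the coordinate convention is assumed.
        grouped[element["type"]].append(tuple(element["bbox"]))
    return dict(grouped)


boxes_by_label = group_by_label(SAMPLE_OUTPUT)
for label, boxes in boxes_by_label.items():
    print(label, boxes)
```

Grouping by label like this is a common first step before routing tables, images and text to different downstream processing.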


As customers have been trying out the Partitioner, we’ve been receiving requests to perform a head-to-head comparison between our model and some of the other open-source and commercially available alternatives that offer similar capabilities. In this article, we evaluate each of these models against a benchmark and explain our methodology. We limit the comparison to the models' ability to correctly segment and label PDFs. While the Aryn Partitioner supports several other features such as OCR and table structure detection, many of the alternatives do not provide these features, so we did not include them in the comparison. Additionally, we ran each of these models on a slightly modified version of the DocLayNet competition dataset to make the comparison fair. This benchmark contains a diverse range of documents including reports, manuals, patents and more.


Results


We used the Mean Average Precision (mAP) and Mean Average Recall (mAR) metrics to measure accuracy. A higher mAP score indicates a more comprehensive and precise segmentation of the document, which is a proxy for the quality of the partitioning. The following are the results:

| DocLayNet Competition Dataset ** | mAP | mAR | Aryn vs others (mAP) | Aryn vs others (mAR) |
|---|---|---|---|---|
| Aryn Partitioner | 0.640 | 0.747 | 1.0x | 1.0x |
| Amazon Textract | 0.423 | 0.507 | 1.51x | 1.47x |
| Unstructured.io (YOLOX) | 0.347 | 0.505 | 1.85x | 1.48x |
| Azure AI Document Intelligence | 0.266 | 0.475 | 2.41x | 1.57x |

We used the publicly available COCO evaluation (primary challenge metric) to compare each model’s output to the ground truth labeling. As can be seen above, the Aryn Partitioner has an mAP that is 1.5x to 2.4x better and an mAR that is roughly 1.5x better than the above alternatives. Taking it a step further, when we've fine-tuned the underlying open source segmentation model for customer workloads, we've seen up to 6x better mAP and 3.3x better mAR than the above alternatives.
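For readers unfamiliar with how box-level scoring works, the following Python sketch shows the building blocks: intersection over union (IoU) between two boxes, and precision and recall at a single IoU threshold. It is a simplified illustration, not the COCO evaluation code we ran; the COCO primary metric additionally averages precision over IoU thresholds from 0.50 to 0.95 and over classes. The function names and sample boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, threshold=0.5):
    """Greedy matching at one IoU threshold (a simplification of COCO).

    predictions: list of (label, box, score)
    ground_truth: list of (label, box)
    """
    matched = set()
    true_positives = 0
    # Visit predictions from most to least confident.
    for label, box, _score in sorted(predictions, key=lambda p: -p[2]):
        best_idx, best_iou = None, threshold
        for idx, (gt_label, gt_box) in enumerate(ground_truth):
            if idx in matched or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap >= best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx is not None:
            matched.add(best_idx)  # each ground-truth box is matched once
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Made-up example: two ground-truth boxes, three predictions.
ground_truth = [("Table", (0, 0, 10, 10)), ("Text", (0, 20, 10, 30))]
predictions = [
    ("Table", (0, 0, 10, 9), 0.9),
    ("Text", (0, 40, 10, 50), 0.8),  # overlaps nothing: false positive
    ("Text", (0, 21, 10, 30), 0.6),
]
precision, recall = precision_recall(predictions, ground_truth)
print(precision, recall)
```

A prediction only counts as correct when both its label matches and its box overlaps the ground truth enough, which is why both labeling and segmentation quality affect the scores.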


** We only chose to compare the Aryn Partitioner to models and services that segment and label documents and provide bounding boxes around the components identified. These bounding boxes need to specify the coordinates of the components in the document.


Methodology


Benchmark Used


For the evaluation we used the DocLayNet competition dataset, which consists of 498 pages, each annotated with bounding boxes. This dataset includes a variety of document types including reports, manuals, academic papers, patents, and more. DocLayNet uses 11 different labels for the components in documents:

| Type | Description |
|---|---|
| Title | Large text |
| Text | Regular text |
| Caption | Description of an image or table |
| Footnote | Small text found near the bottom of the page |
| Formula | LaTeX or similar mathematical expression |
| List-item | Part of a list |
| Page-footer | Small text at bottom of page |
| Page-header | Small text at top of page |
| Image | A picture or diagram |
| Section-header | Medium-sized text marking a section |
| Table | A grid of text |

It is important to note that the Aryn Partitioner was NOT trained on the DocLayNet competition dataset but on the DocLayNet dataset (which are two different sets of documents). The DocLayNet competition dataset has a different distribution of documents from the DocLayNet dataset.


Making the comparison fair


To ensure a fair and accurate comparison between the models, we had to take some steps to reconcile the differences between the labels each of these models and services uses. While the Aryn Partitioner outputs the same 11 labeling classes as DocLayNet, the other models in this comparison do not. Here are the labels from the other systems:

| Microsoft Document Intelligence | Textract | Unstructured |
|---|---|---|
| figure | LAYOUT_FIGURE | Image |
| footnote | LAYOUT_FOOTER | Footer |
| pageFooter | | |
| pageHeader | LAYOUT_HEADER | Header |
| sectionHeading | LAYOUT_SECTION_HEADER | |
| table | LAYOUT_TABLE | Table |
| text | LAYOUT_TEXT | NarrativeText |
| | | UncategorizedText |
| title | LAYOUT_TITLE | Title |
| | | Formula |
| pageNumber | LAYOUT_PAGE_NUMBER | |
| | LAYOUT_KEY_VALUE | |
| | LAYOUT_LIST | ListItem |
| | | FigureCaption |

Given that each of these models labels documents differently, we removed pages from the dataset that would unfairly benefit a particular model:

  • The Aryn Partitioner was the only model that used the “Formula” label. We removed the 4 pages (out of 498) that had a “Formula” label in the ground truth.

  • Textract is the only model that has a “key value” label (the Aryn Partitioner does not explicitly use this label, but you can extract key-value pairs through our ETL library Sycamore). The competition dataset contained 1 page on which Textract detected a “key value” object. We removed this page from the dataset as well.

  • DocLayNet and the Aryn Partitioner handle lists in a very fine-grained manner. Each item of a list is labeled as a “list item”, whereas other models classify all these items as one block. Additionally, Aryn draws a bounding box around each list item, whereas some other models draw one bounding box around the entire list. A total of 161 pages in the DocLayNet competition dataset contained lists. We removed these from the comparison as well.

    • Note: By removing pages with list items, we avoid penalizing models that don’t use list-item as a label. These models would otherwise score poorly on those pages, and we wanted to be as fair as possible to models that don’t follow DocLayNet guidelines.

Overall, this process took out 165 pages and left 333 pages from the DocLayNet competition dataset in the final benchmark set.
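The removal rules above can be sketched as a simple page filter. This is an illustrative Python sketch, not the script we ran: the page record layout ("id", "ground_truth_labels", "textract_labels") and the sample pages are hypothetical, while the label names come from the tables in this article.

```python
# Ground-truth labels whose presence removes a page from the benchmark.
EXCLUDED_GROUND_TRUTH_LABELS = {"Formula", "List-item"}


def keep_page(page):
    """Return True if the page stays in the final benchmark set."""
    gt_labels = set(page["ground_truth_labels"])
    # Rules 1 and 3: drop pages with formulas or list items in the ground truth.
    if gt_labels & EXCLUDED_GROUND_TRUTH_LABELS:
        return False
    # Rule 2: drop pages where Textract detected a key-value object.
    if "LAYOUT_KEY_VALUE" in page.get("textract_labels", ()):
        return False
    return True


# Made-up page records for illustration.
pages = [
    {"id": "p1", "ground_truth_labels": ["Text", "Table"],
     "textract_labels": ["LAYOUT_TEXT", "LAYOUT_TABLE"]},
    {"id": "p2", "ground_truth_labels": ["Text", "Formula"],
     "textract_labels": ["LAYOUT_TEXT"]},
    {"id": "p3", "ground_truth_labels": ["List-item"],
     "textract_labels": ["LAYOUT_LIST"]},
    {"id": "p4", "ground_truth_labels": ["Text"],
     "textract_labels": ["LAYOUT_KEY_VALUE"]},
]

kept = [page["id"] for page in pages if keep_page(page)]
print(kept)
```

Because a page can trigger more than one rule, filtering with a single predicate like this counts each removed page once.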

A second issue was to ensure that we fairly compared each of these models to the ground truth. Since each of the models labels documents slightly differently, we created a common baseline that consisted of the following labels:

  1. Page Header

  2. Picture

  3. Section Header

  4. Table

  5. Text

We believe these labels are the lowest common denominator across the label sets of all the models. We then mapped each model’s labels to this set:

| Document Intelligence | Textract | Unstructured | Aryn Partitioner |
|---|---|---|---|
| pageHeader → Page Header | LAYOUT_HEADER → Page Header | FigureCaption → Text | Caption → Text |
| figure → Picture | LAYOUT_FIGURE → Picture | Header → Section Header | Page Header → Page Header |
| title → Section Header | LAYOUT_SECTION_HEADER → Section Header | Image → Picture | Image → Picture |
| sectionHeading → Section Header | LAYOUT_TABLE → Table | Title → Section Header | Section Header → Section Header |
| table → Table | LAYOUT_TEXT → Text | Table → Table | Table → Table |
| text → Text | LAYOUT_TITLE → Section Header | NarrativeText → Text | Text → Text |
| | | UncategorizedText → Text | Title → Section Header |

For example, when comparing Document Intelligence to the ground truth, if Document Intelligence labeled a bounding box as “figure”, we considered that correct as long as the ground truth labeled it as “Picture”. Similarly, if Unstructured labeled something as “NarrativeText”, it was considered correct if the ground truth labeled it as “Text”.


Some DocLayNet labels, such as Page Footer and Footnote, do not show up in the table above. We removed them because each model handles them very differently, and including them would skew the results unfairly. For example, Unstructured does not have a label equivalent to Page Footer, and it merges Page Footer and Footnote when it draws the bounding boxes.
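The mapping described above amounts to a per-system lookup table applied before scoring. The Python sketch below copies the mappings from the table in this article; the dictionary keys for the systems, the normalize function and the sample boxes are hypothetical names chosen for illustration, and dropping unmapped labels is an assumption about how labels such as footers were excluded.

```python
COMMON_LABELS = {"Page Header", "Picture", "Section Header", "Table", "Text"}

# Per-system mappings, copied from the table in this article.
LABEL_MAPS = {
    "document_intelligence": {
        "pageHeader": "Page Header", "figure": "Picture",
        "title": "Section Header", "sectionHeading": "Section Header",
        "table": "Table", "text": "Text",
    },
    "textract": {
        "LAYOUT_HEADER": "Page Header", "LAYOUT_FIGURE": "Picture",
        "LAYOUT_SECTION_HEADER": "Section Header", "LAYOUT_TABLE": "Table",
        "LAYOUT_TEXT": "Text", "LAYOUT_TITLE": "Section Header",
    },
    "unstructured": {
        "FigureCaption": "Text", "Header": "Section Header",
        "Image": "Picture", "Title": "Section Header", "Table": "Table",
        "NarrativeText": "Text", "UncategorizedText": "Text",
    },
    "aryn": {
        "Caption": "Text", "Page Header": "Page Header", "Image": "Picture",
        "Section Header": "Section Header", "Table": "Table",
        "Text": "Text", "Title": "Section Header",
    },
}


def normalize(system, predictions):
    """Map (label, box) pairs to the common label set.

    Labels with no mapping (for example footers) are dropped; this is an
    assumption about how excluded labels were handled.
    """
    mapping = LABEL_MAPS[system]
    return [(mapping[label], box) for label, box in predictions if label in mapping]


# Made-up predictions for illustration.
raw = [("NarrativeText", (0, 0, 10, 5)), ("Footer", (0, 95, 10, 100))]
normalized = normalize("unstructured", raw)
print(normalized)
```

Applying the same normalization to the ground truth and to every system's output means all systems are scored against the same five classes.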


Conclusion


As shown by our evaluation, the Aryn Partitioner clearly outperforms the alternatives when it comes to segmenting and labeling PDFs. Its underlying segmentation model has a 1.5x to 2.4x better mAP score and a roughly 1.5x better mAR score than the alternatives we tested. But the Aryn Partitioner does much more than segmentation and labeling. It also offers table segmentation, text extraction with and without OCR, and post-processing to optimize the output for RAG, GenAI, and compliance use cases. We have so much more planned; this is just the beginning! While your mileage may vary based on the document type and use case, we recommend that you try out the Aryn Partitioner and its state-of-the-art capabilities.

Get Started Today!


Try out the Aryn Partitioner in the Aryn Partitioning Service - all you need is an API key to get started (sign up here for free). You can access the cloud service through the Aryn Playground, Aryn SDK or through a Sycamore script. You can also run the partitioner locally through a Sycamore script by setting the use_partitioning_service boolean to false. We’d love to hear your feedback on the service or any feature requests you have for your workloads.


Email us: info@aryn.ai

Join the Sycamore Slack

