AI-powered document parsing, table extraction, and ETL.

Accurately extract and transform tables, images, and text for LLM-based apps, RAG frameworks, and vector databases. 

Welcome

Up to 6x more accurate, 5x faster, 5x cheaper 

PDF partitioning
PDF to JSON

1. Use Aryn DocParse to easily chunk and extract data from your documents into structured JSON.

2. Use Aryn DocPrep to run additional ETL steps on the JSON output and load it into your vector database.
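
As a rough illustration of the hand-off between the two steps, the sketch below turns parsed JSON into flat records that are ready for embedding and loading. The JSON shape (`elements`, `type`, `text_representation`) and the `to_records` helper are assumptions made for this example, not a specification of DocParse output.

```python
import json

# Hypothetical DocParse-style output; the field names are assumptions.
SAMPLE_JSON = """
{
  "elements": [
    {"type": "Title", "text_representation": "Quarterly Report"},
    {"type": "Text", "text_representation": "Revenue grew in Q3."},
    {"type": "table", "text_representation": "Region | Revenue\\nEMEA | 10"},
    {"type": "Image", "text_representation": null}
  ]
}
"""


def to_records(parsed: dict, source: str) -> list[dict]:
    """Turn parsed elements into flat records ready for embedding and loading."""
    records = []
    for index, element in enumerate(parsed.get("elements", [])):
        text = element.get("text_representation")
        if not text:  # skip elements that carry no text, such as images
            continue
        records.append(
            {
                "id": f"{source}-{index}",  # stable ID: source name plus element position
                "type": element.get("type", "unknown"),
                "text": text.strip(),
            }
        )
    return records


records = to_records(json.loads(SAMPLE_JSON), source="report.pdf")
for record in records:
    print(record["id"], record["type"])
```

Keeping the element position in the ID means a re-run on the same document overwrites earlier records instead of duplicating them.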

Try your own doc in the Aryn Playground

Chunk, embed, and load your data 

Diagram: Aryn DocParse (Doc to JSON) outputs JSON, which Aryn DocPrep (ETL for Docs) loads into Pinecone, OpenSearch, Elasticsearch, Weaviate, or DuckDB.

Why Aryn?

Higher quality chunking

Aryn DocParse is up to 6x more accurate and 5x faster than alternatives. Structure and extract data from PDFs, HTML, presentations and more using purpose-built AI models. Tackle complex documents with tables, images, text, graphs, and infographics.

Use declarative dataflows

Aryn DocPrep generates ETL pipeline code for processing and loading your unstructured data into your vector databases. Choose from a variety of chunking strategies and vector embedding models. Customize your pipeline code with data extraction transforms and more.
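
To make "chunking strategy" concrete, here is a minimal sketch of one simple approach: greedily merging consecutive text elements until a size budget is reached. The `chunk_greedy` function and its `max_chars` parameter are hypothetical and shown only for illustration; they are not one of DocPrep's built-in strategies.

```python
def chunk_greedy(texts: list[str], max_chars: int = 200) -> list[str]:
    """Merge consecutive texts into chunks of at most max_chars where possible.

    A single text longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for text in texts:
        candidate = f"{current}\n{text}" if current else text
        if not current or len(candidate) <= max_chars:
            current = candidate  # still fits, or this is the first text of a new chunk
        else:
            chunks.append(current)  # budget exceeded: close the chunk and start a new one
            current = text
    if current:
        chunks.append(current)
    return chunks


elements = ["a" * 10, "b" * 10, "c" * 30]
chunks = chunk_greedy(elements, max_chars=25)
print(len(chunks))
```

Larger budgets give each chunk more context at the cost of less precise retrieval, so the right value depends on the embedding model and the documents.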

Reliably load vector databases

Easily load vector databases and hybrid search engines using Aryn DocPrep's connectors for targets such as Pinecone, OpenSearch, Weaviate, Elasticsearch, Qdrant, and DuckDB. DocPrep's generated ETL pipelines can scale from processing one document to thousands.
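
One common pattern behind loading at scale is writing records in fixed-size batches rather than one at a time. The sketch below shows that pattern with an in-memory stand-in; `InMemoryStore` and `batched` are hypothetical names for this example and do not represent any DocPrep connector or vector database client.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(records: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of up to `size` records until the input is exhausted."""
    iterator = iter(records)
    while batch := list(islice(iterator, size)):
        yield batch


class InMemoryStore:
    """Hypothetical stand-in for a vector database client."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}
        self.calls = 0

    def upsert(self, batch: list[dict]) -> None:
        """Insert or overwrite each record by ID; counts one call per batch."""
        self.calls += 1
        for record in batch:
            self.rows[record["id"]] = record


store = InMemoryStore()
docs = ({"id": f"doc-{i}", "text": f"chunk {i}"} for i in range(5))
for batch in batched(docs, size=2):
    store.upsert(batch)
print(store.calls, len(store.rows))
```

Because the input is consumed lazily, the same loop works for one document or for a stream of thousands without holding everything in memory.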

Open source and cloud native

Aryn DocParse's base AI model is open source and available on Hugging Face. Aryn DocPrep generates ETL pipelines using the Sycamore document ETL library, which is 100% open source (Apache License v2.0). It's customizable with data transforms and UDFs.

Use cases

Developers use Aryn in financial services, healthcare, manufacturing, eCommerce, and customer support. 

Research and discovery

Prepare data for apps that enable analysts and researchers to ask hard questions on complex documents that include tables, infographics, and complicated layouts. Discover and use critical information that would otherwise be missed.

Reporting on unstructured data feeds

Create structured reports from unstructured data to answer key business questions. Run scheduled pipelines that extract, enrich, and store information from diverse datasets, such as Salesforce data, health records, or contracts.

Technical knowledge bases

Empower technical knowledge workers with AI assistants by processing manuals, technical documents, installation guides, and catalogs for RAG systems. Answer technical questions and find information from properly chunked data.

Customer support

Deliver high-quality data to copilots that support customer service teams and healthcare professionals, or let customers directly query knowledge bases, support tickets, FAQs, healthcare records, and other information sources.

Installation

Installing the SDK for Aryn DocParse and the Sycamore library is quick and simple.
