  • Jon Fritz

Learn more about unstructured analytics in Aryn's new paper

The Aryn team just published a new paper on our approach to unstructured analytics, and we're excited to share it with you! The paper describes the components of a system that can run analytics queries on unstructured data. It builds on ideas from a previous blog post on this topic and walks through how document ETL, indexing, and LLM-powered query planners fit together to make up this system.



Here's the abstract: "LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users can specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents using LLMs. At the core of Aryn is Sycamore, a declarative document processing engine, built using Ray, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn also comprises Luna, a query planner that translates natural language queries to Sycamore scripts, and the Aryn Partitioner, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. Using Aryn, we demonstrate a real world use case for analyzing accident reports from the National Transportation Safety Board (NTSB), and discuss some of the major challenges we encountered in deploying Aryn in the wild."


You can read the paper here, and let us know what you think on Slack.
