
Building trust by closing the verification gap in AI-powered unstructured analytics

By: Matt Welsh (Chief Architect, Aryn) and Mehul Shah (CEO, Aryn)


Introduction


Modern AI models are incredible guessing machines. Composed of billions of parameters, these models can discern patterns across large data corpora or vast search spaces to generate predictions that now go beyond human capabilities.


As Terence Tao highlights, we have seen remarkable progress in a variety of fields where we can apply AI and independently verify its “guessed” output. For example, in materials science, AI can predict new materials and new properties of existing ones. We now have models that predict weather, and in particular inclement weather such as the paths of tropical storms, better than state-of-the-art simulations. In chemistry, AI models have solved a problem that seemed almost intractable, protein folding, and can now assist with the experimental determination of protein structure and with drug discovery. Most recently, going beyond solving Olympiad-level problems, mathematicians have used AI to assist in generating formal proofs of novel results. In the scientific disciplines, verification is a natural part of the process, and in mathematics, the Lean language enables automatic verification of formal proofs.


So with all this remarkable progress in science and mathematics, why is it still so hard to build a trustworthy question-answering system over a large corpus of data? The answer is simple: the output of current approaches is hard to verify. Even recent Large Language Models (LLMs) with strong reasoning capabilities, such as OpenAI’s o1, often come back with confident yet opaque responses that are difficult for humans to check. Although more accurate, these responses are not enough for many enterprise applications in, for example, financial services, pharmaceuticals, and healthcare.


We argue that AI-powered search and analytics systems for unstructured data fundamentally need to be more transparent and easier to verify in order for us to trust them in enterprise applications. For example, on a document corpus of airplane accidents, suppose you ask “How many red airplanes crashed in the Mojave Desert in the winter?”, and the system responds “839”. With today’s systems, you’d need to repeat the work manually to check that computation.


Instead, we posit that AI-powered analytics systems need to explain how an answer was computed, make it easy to inspect intermediate results, and audit the steps that generated the answer. Further, they need to enable iterative interactions with data, queries, and results, to allow human users to test hypotheses and navigate the data freely.


In this post, we describe our approach to explainability and verification in our LLM-powered unstructured analytics system, called Luna, built using LLMs “all the way down”.


The power of the interact-generate-verify loop with GenAI


LLMs are not oracles, and one of the largest obstacles to the adoption of generative AI in enterprises is their lack of reliability. This is inherent in their design, and we cannot overlook it or assume that the problem will one day just “go away” (though one can always hope). Despite their limitations, though, countless companies are finding that the current generative AI models bring tremendous value in applications where the penalty for errors is low or when humans can be effectively added “into the loop.”


In the past few years, for example, we have seen rapid adoption of coding assistants such as GitHub Copilot and ChatGPT. Developers no longer sift through documentation or online forum threads; instead, they interact with LLMs to generate completions while coding, or prompt them for skeletons of entire projects. Verifying that the code is correct is naturally part of a developer’s workflow: developers check that the output is correct through manual inspection, testing, and code reviews. They can then follow up and interact with LLMs to modify or generate more code as needed.


The effectiveness of the interact-generate-verify loop is not exclusive to coding: this pattern is useful in any task where AI can generate solutions that are hard for humans to come by but easy for humans to verify. We believe it can also be applied to analytics on unstructured datasets, where people rely on the results to drive critical business decisions. Unlike coding, however, asking humans to verify AI-generated answers to analytics questions is more challenging.


Current question-answering techniques like RAG lack the visibility and control needed for this: while they cite sources, exactly how an answer is reached remains opaque. These techniques do not offer the transparency and explainability that one would expect from reports created by humans. When a business analyst manually creates a report on some data, her natural workflow is to explain and break down her methodology and judgments, which helps build confidence in her conclusions.


Our approach in Luna follows a pattern similar to that of an analyst: AI computations are broken down into steps that humans can audit and follow up on, which helps them verify the answers and how they were reached.


Unstructured document analytics


One of the most interesting challenges we have seen from customers is extracting insight from datasets of unstructured documents (PDFs, Word documents, PowerPoint decks, HTML, or even plain text) that typically require a human analyst to read, interpret, and extract the relevant data.


Consider an enterprise with a large volume of insurance claim documents, equipment maintenance reports, or customer interview transcripts. These documents often mix semi-structured data (such as lists or tables) with unstructured human-generated text, graphs, figures, images, and infographics. Beyond simple metadata extraction and filtering, automated analysis of such documents is very challenging.


LLMs present a new opportunity to build an unstructured query engine that treats such documents much like tables in a relational database, processing them with a mix of traditional logical operations and new semantic operations performed by an LLM over the content of each document. This lets customers ask questions in natural language while the system automatically chooses the best strategy to answer them, relieving the burden on users and programmers.


As a simple example, consider a question such as “What were the top reasons for claim rejection related to generic medications?” At one level, this is a fairly straightforward query that could be posed against a relational database using SQL, but there are complications: interpreting “reasons for claim rejection” and “related to generic medications” requires more than simple text-based operations. This is where LLMs come in. Leveraging the semantic processing power of language models makes it possible, for the first time, to pose complex, open-ended queries over such data.
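To make the “semantic operation” idea concrete, here is a minimal sketch of an LLM-backed filter. Everything in it (the function, the prompt, and the ask_llm client it takes as a parameter) is a hypothetical illustration of the technique, not Luna’s actual API:

    from typing import Callable, Iterable

    # Stand-in for an LLM client: takes a prompt, returns the model's text reply.
    AskLLM = Callable[[str], str]

    def semantic_filter(docs: Iterable[str], predicate: str, ask_llm: AskLLM) -> list[str]:
        """Keep only the documents for which the LLM judges the predicate true.

        Unlike a SQL WHERE clause, the predicate is natural language, e.g.
        "this claim rejection is related to generic medications".
        """
        kept = []
        for doc in docs:
            prompt = (
                f"Document:\n{doc}\n\n"
                f"Does the following hold for this document? {predicate}\n"
                "Answer strictly YES or NO."
            )
            if ask_llm(prompt).strip().upper().startswith("YES"):
                kept.append(doc)
        return kept

An aggregation like “top reasons” can then be built the same way, with the LLM extracting a rejection reason from each surviving claim before a conventional group-by and count.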


Luna combines the power of LLMs with database-style query plans to enable an entirely new way to query unstructured data. Broadly, we use LLMs to translate natural-language questions into query plans that mix logical and LLM-based semantic operations. More details on the Luna approach can be found in this blog post.
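As a rough sketch of that translation step (the prompt and operator set below are our own illustration, not Luna’s internals), one can ask an LLM to emit the plan as JSON and parse it into operator nodes:

    import json

    # Illustrative planner prompt; Luna's actual prompting and operator set
    # are more involved than this.
    PLANNER_PROMPT = (
        "You are a query planner for a corpus of unstructured documents.\n"
        "Output a JSON list of operator nodes, using operators such as\n"
        "QueryDatabase (fetch and filter records) and TopK (semantic\n"
        "group-by and count).\n"
        "Question: {question}"
    )

    def plan_query(question: str, ask_llm) -> list[dict]:
        """Translate a natural-language question into a list of plan nodes."""
        raw = ask_llm(PLANNER_PROMPT.format(question=question))
        return json.loads(raw)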


Below is a screenshot of our Luna query UI, allowing users to ask complex questions of large, unstructured datasets, and to interact in a multimodal fashion with the AI data analyst. The screenshot shows a query over accident reports from the NTSB.



Explanation, inspection, and follow-on exploration are the key to verifiability


In building AI-powered analytics solutions for our customers, we’ve learned there are a few key features needed to enable verifiability and trust:


  1. Show the underlying query plan.

  2. Allow the user to inspect intermediate results.

  3. Allow the user to ask follow-up questions and guide the AI.


The first is to expose the query plan — that is, the set of computational operations being performed by the system — to the end user. Answering a complex analytics question typically involves multiple stages of a query pipeline involving filtering, aggregation, and summarization of data. The query plan reveals the computation being performed by the analytics system in a way that can be checked for correctness, independently of its execution.


Our Luna system exposes the plan generated from a user query as a simple JSON object representing the plan nodes (a GUI for non-technical users is coming soon). Given the query “What was the breakdown of aircraft types for incidents with substantial damage?”, we can see the resulting plan as a QueryDatabase operation followed by a TopK operation, which in our implementation also performs semantic group-bys and counts.
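As a rough illustration only (the field names are hypothetical, and Luna’s actual JSON schema may differ), a two-node plan of that shape might look like:

    # Hypothetical sketch of the plan for the query above.
    plan = [
        {
            "operator": "QueryDatabase",
            "description": "Retrieve incident records with substantial damage",
        },
        {
            "operator": "TopK",
            "description": "Semantically group by aircraft type and count each group",
        },
    ]

Even at this level of detail, a user can confirm that the system filters on damage before grouping by aircraft type, independently of how the plan is executed.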


While inspecting the query plan is often enough to convince oneself that the data generated by the query is likely to be correct, further validation is possible by inspecting the data flowing out of each query operation. The Luna UI allows the user to explore the raw data at each stage of the query plan, drilling down to individual records and linking back to the original source documents.
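As a sketch of how such stage-by-stage inspection can work (illustrative, not Luna’s actual implementation), the executor can simply materialize the output of every operator as it runs:

    def execute_with_trace(plan, records, run_op):
        """Run each plan node in order, snapshotting its output for auditing.

        run_op(node, records) applies a single operator (logical or
        LLM-based) and returns the transformed records.
        """
        trace = []
        for node in plan:
            records = run_op(node, records)
            trace.append((node, list(records)))  # snapshot after this stage
        return trace

    # A user can drill into trace[i] to see exactly which records flowed
    # out of stage i, and follow them back to the source documents.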


Finally, we find that supporting an iterative, exploratory mode of interaction with the AI is essential. Users can test hypotheses and explore different aspects of the data by asking follow-on questions, such as “what about incidents without substantial damage” or “show only results in California”. The conversational history with the AI allows a user to refer to previous queries or results implicitly, making the interaction feel natural, much like asking questions of a human analyst.
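One simple way to support such follow-ups, sketched here under the assumption that each follow-up is first rewritten into a standalone question before planning (Luna’s actual mechanism may differ):

    def contextualize(history, follow_up, ask_llm):
        """Rewrite a follow-up like "what about incidents without substantial
        damage" into a standalone question, using the conversation so far.

        history is a list of (question, answer) pairs from the session.
        """
        transcript = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
        prompt = (
            f"Conversation so far:\n{transcript}\n\n"
            f"Rewrite this follow-up as a standalone question: {follow_up}"
        )
        return ask_llm(prompt)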


What comes next


We’re only now starting to glimpse the potential of LLMs coupled with database-like query processing for document analytics. Luna, and systems like it, represent a new kind of query engine, one that leans heavily on the reasoning and multimodal abilities of LLMs to enable powerful query-time analyses that were previously infeasible for unstructured data.


We recognize that building confidence and ensuring verifiability are critical in enterprise use cases, and this new class of query engine needs to be created around these key tenets. With Luna, we have taken the first few steps in this direction and are excited to explore what’s next.


We’d love to show you Luna in action and learn more about how you want to interact with and explore your unstructured data. Contact us at info@aryn.ai.
