Aryn

August 28, 2024

Using the Aryn Partitioning Service with an LLM to analyze diagrams

By Abhijit Pujare, Product Management

Much of today’s unstructured data is trapped deep in documents like PDFs that are difficult to parse and analyze. The Aryn Partitioning Service helps you unlock this unstructured data by identifying images, diagrams, tables and more in your documents and extracting them into a JSON object you can use for further processing.

In this blog post we’ll walk you through this Colab notebook, which uses the Partitioning Service to extract a workflow diagram from a Tesla battery manual and then gets a step-by-step text description of it. You can do something similar for diagrams trapped within your own documents and use the output in a variety of applications. For example, you can use it to improve search across images, or embed it and ingest it into a vector database for LLM applications. Let’s dive right into it!

Before we get started, make sure you have an Aryn API key (visit https://www.aryn.ai/get-started to obtain one if you don’t have it yet) and an OpenAI API key (visit https://platform.openai.com/api-keys to get a key).

The first two cells install and import the packages we need. Let’s look at the third cell, which sets the API keys. If you haven’t done so yet, please navigate to the left pane, choose the key option to set your two keys, and toggle the “Notebook access” option for each.
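As a rough sketch of what a key-loading cell could look like, the snippet below reads a key from Colab's secrets store and falls back to an environment variable when run outside Colab. The helper name load_key and the secret names ARYN_API_KEY and OPENAI_API_KEY are assumptions made for illustration; the notebook itself may name them differently.

```python
import os


def load_key(name: str) -> str:
    """Return a secret from Colab if available, else from the environment."""
    try:
        # google.colab only exists inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        # Requires the "Notebook access" toggle to be on for this secret
        return userdata.get(name)

    value = os.environ.get(name)
    if not value:
        raise KeyError(f"{name} is not set as a Colab secret or environment variable")
    return value


# Hypothetical usage (secret names are assumptions):
# aryn_api_key = load_key("ARYN_API_KEY")
# openai_api_key = load_key("OPENAI_API_KEY")
```

Failing early with a clear error when a key is missing tends to be easier to debug than an authentication failure several cells later.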


Once you’ve done that, the next two cells will download the file from S3 and display the PDF.

Now, let’s look at the cell which makes a call to the Aryn Partitioning Service to segment the PDF.


# Call the Partitioning Service with extract_images set to True
partitioned_file = partition_file(
    file,
    aryn_api_key,
    extract_images=True,
    extract_table_structure=True,
    use_ocr=True,
)

# Show the PDF with bounding boxes drawn around the detected elements
draw_with_boxes(file_name, partitioned_file)[0]


The partitioned_file variable is a JSON object that contains the details of all the components the Partitioning Service detected. If you inspect the JSON, you’ll notice an ‘elements’ array that holds one entry per detected component.


'elements': [{'type': 'Section-header',
  'bbox': [0.042795360789579504,
   0.038177805813876066,
   0.6400381290211397,
   0.0610986328125],
  'properties': {'score': 0.5336548686027527, 'page_number': 1},
  'text_representation': 'Powerwall 3 Example System Configurations\n'},
 {'type': 'Section-header',
  'bbox': [0.04302589416503906,
   0.10283722617409446,
   0.3062180103975184,
   0.11810086337002841],
  'properties': {'score': 0.32144585251808167, 'page_number': 1},
  'text_representation': 'Powerwall 3 with Gateway 3\n'},
 {'type': 'Image',
  'bbox': [0.04383692124310662,
   0.10298660278320312,
   0.7640514418658089,
   0.3396760420365767],
  'properties': {'score': 0.6494213342666626,
   'image_size': [1244, 540],
   'image_mode': 'RGB',
   'image_format': None,
   ...
}

Elements are the constituent components that make up a document. Examples include tables, images, section headers and formulas. You can find the entire list here.
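To get a feel for the output, it can help to count the element types and pull out the ones you care about. The sketch below does this on a trimmed-down stand-in for the output shown above (bounding boxes and scores omitted); the helper names count_types and elements_of_type are made up for this example.

```python
from collections import Counter

# Trimmed-down stand-in for the partitioner output shown above
partitioned_file = {
    "elements": [
        {
            "type": "Section-header",
            "properties": {"page_number": 1},
            "text_representation": "Powerwall 3 Example System Configurations\n",
        },
        {
            "type": "Section-header",
            "properties": {"page_number": 1},
            "text_representation": "Powerwall 3 with Gateway 3\n",
        },
        {
            "type": "Image",
            "properties": {"image_size": [1244, 540], "image_mode": "RGB"},
        },
    ]
}


def count_types(doc: dict) -> Counter:
    """Count how many elements of each type were detected."""
    return Counter(element["type"] for element in doc["elements"])


def elements_of_type(doc: dict, element_type: str) -> list:
    """Return every element whose 'type' matches element_type."""
    return [e for e in doc["elements"] if e["type"] == element_type]


type_counts = count_types(partitioned_file)
images = elements_of_type(partitioned_file, "Image")

print(type_counts)
print(len(images), "image element(s) found")
```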

The draw_with_boxes function shows the PDF with bounding boxes drawn around the elements the service detected.


You'll notice that there are two overlapping bounding boxes, one for the section header "Powerwall 3 with Gateway 3" and one for the image. We care about the image, so we'll focus on that one. When you run into such a situation while using the Partitioning Service, you can pick either or both boxes to process for your use case.
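If you want to detect this kind of overlap programmatically, one option is to compare the bbox values directly. The sketch below assumes each bbox is [x1, y1, x2, y2] expressed as fractions of the page width and height, which is consistent with the values in the output above but is an assumption here; the helper names are hypothetical.

```python
def box_area(box):
    """Area of a box given as [x1, y1, x2, y2]."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a, b):
    """Area of the region shared by boxes a and b (0.0 if they are disjoint)."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, right - left) * max(0.0, bottom - top)


# bbox values copied from the 'elements' output above
header_bbox = [0.04302589416503906, 0.10283722617409446,
               0.3062180103975184, 0.11810086337002841]
image_bbox = [0.04383692124310662, 0.10298660278320312,
              0.7640514418658089, 0.3396760420365767]

# Fraction of the section header's box that falls inside the image's box
covered = intersection_area(header_bbox, image_bbox) / box_area(header_bbox)
print(f"{covered:.0%} of the header box lies inside the image box")
```

A high fraction like the one here suggests the header text is drawn on top of the diagram, so processing the image element alone should still capture it.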

The next few cells of the notebook extract the image element from the JSON output and turn its binary representation into a JPEG. Finally, let’s look at the last cell of the Colab notebook.
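As a minimal sketch of the decoding step, the snippet below assumes the image element carries its raw pixels as a base64 string under a key named binary_representation. Both the key name and the encoding are assumptions, since that part of the output is elided above. It decodes the bytes and checks that their length matches image_size and image_mode before any conversion is attempted.

```python
import base64

# Bytes per pixel for a few common image modes
BYTES_PER_PIXEL = {"L": 1, "RGB": 3, "RGBA": 4}


def decode_image_bytes(element: dict) -> bytes:
    """Decode an image element's pixels and sanity-check their length."""
    props = element["properties"]
    width, height = props["image_size"]
    # 'binary_representation' is an assumed key name holding base64 text
    raw = base64.b64decode(element["binary_representation"])
    expected = width * height * BYTES_PER_PIXEL[props["image_mode"]]
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    return raw


# Tiny synthetic 2x2 RGB element standing in for the real one
fake_element = {
    "type": "Image",
    "properties": {"image_size": [2, 2], "image_mode": "RGB"},
    "binary_representation": base64.b64encode(bytes(12)).decode("ascii"),
}

raw_pixels = decode_image_bytes(fake_element)
print(len(raw_pixels), "bytes decoded")
```

With Pillow installed, bytes like these could then be passed to Image.frombytes(mode, size, raw_pixels) and saved as a JPEG.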

It first uses Pydantic to define the response_format parameter for the OpenAI call, which tells the model to respond with a list of descriptions, one per step. It then makes the call to OpenAI and displays the JSON output:


{
  "steps": [
    {
      "image_title": "Grid Connection",
      "image_description": "The diagram begins with the grid icon representing the main electricity source."
    },
    {
      "image_title": "Meter Installation",
      "image_description": "Next, the electricity flows into a meter socket panel, which is connected to a meter that measures the energy consumption."
    },
    {
      "image_title": "Gateway 3",
      "image_description": "From the meter, the energy passes into Gateway 3, which manages the energy flow."
    },
    {
      "image_title": "Load Center",
      "image_description": "The energy then moves to the load center where it distributes power to various backup loads in the home."
    },
    {
      "image_title": "Powerwall 3",
      "image_description": "The Powerwall 3, depicted next, stores energy for backup purposes."
    },
    {
      "image_title": "Solar Installation (Optional)",
      "image_description": "An optional solar panel connection is shown, indicating that solar energy can also feed into the system for additional power supply."
    }
  ]
}
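The shape of this output can be captured with a pair of Pydantic models. The sketch below assumes Pydantic v2; the class names Step and DiagramSteps are hypothetical, while the field names come from the output above. It defines the models and validates a shortened copy of the response against them.

```python
from typing import List

from pydantic import BaseModel


class Step(BaseModel):
    """One step of the diagram, as described by the LLM."""
    image_title: str
    image_description: str


class DiagramSteps(BaseModel):
    """Top-level response: an ordered list of steps."""
    steps: List[Step]


# Shortened copy of the output above
sample = """
{
  "steps": [
    {
      "image_title": "Grid Connection",
      "image_description": "The diagram begins with the grid icon representing the main electricity source."
    },
    {
      "image_title": "Meter Installation",
      "image_description": "Next, the electricity flows into a meter socket panel, which is connected to a meter that measures the energy consumption."
    }
  ]
}
"""

parsed = DiagramSteps.model_validate_json(sample)
for step in parsed.steps:
    print(step.image_title)
```

A model class like DiagramSteps is the kind of object that would be supplied as response_format, so the LLM's reply arrives already matching this structure.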

Here we have a step-by-step description of the diagram in JSON! You can take this output and integrate it into your image processing workflows, for example as metadata that improves search.
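One simple way to use this output as search metadata is to flatten the steps into a single numbered text block that can be indexed or embedded. The sketch below uses only the standard library; the function name steps_to_text is hypothetical and the sample is a shortened copy of the output above.

```python
import json


def steps_to_text(llm_output: str) -> str:
    """Flatten the 'steps' JSON into one numbered line per step."""
    data = json.loads(llm_output)
    lines = [
        f"{number}. {step['image_title']}: {step['image_description']}"
        for number, step in enumerate(data["steps"], start=1)
    ]
    return "\n".join(lines)


sample_output = json.dumps({
    "steps": [
        {
            "image_title": "Grid Connection",
            "image_description": "The diagram begins with the grid icon representing the main electricity source.",
        },
        {
            "image_title": "Gateway 3",
            "image_description": "From the meter, the energy passes into Gateway 3, which manages the energy flow.",
        },
    ]
})

diagram_text = steps_to_text(sample_output)
print(diagram_text)
```

The resulting string can be stored alongside the image as a text field, or passed to an embedding model of your choice.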

In this blog post we walked through a Colab notebook that used the Aryn Partitioning Service’s model to segment a PDF, pulled a diagram out of it, and then asked an LLM to describe the diagram step by step. If you want to take this output and run it through a model to get embeddings that you can then ingest into a vector database, check out Aryn’s open source Sycamore ETL library.