Aryn

September 9, 2024

New materialize transform makes it easy to debug and checkpoint your Sycamore ETL pipelines

By Eric Anderson, Engineer

When creating more complex ETL pipelines, the process of iterating and debugging can be difficult and slow. Many Sycamore transforms will recompute the entire dataset, meaning rerunning partitioning calls to the Aryn Partitioning Service and LLM-based transforms, which are inherently slow (with calls out to services like OpenAI). As a result, developers want a way to inspect the intermediate state of the pipeline and avoid unnecessary recomputation. As we develop more complicated data preparation pipelines, we have found these problems have become more acute, and so we developed the materialize transform to ameliorate them.

‍

‍

Sycamore uses a data structure called DocSets, which are collections of documents defined by the pipeline that calculates them using lazy execution. DocSets support a variety of transformations to help customers easily handle unstructured data and prepare it for search and analytics in Sycamore. This model is similar to the Spark model of DataSets.

Materialize writes and reads executed DocSets with a variety of options in order to help debug pipelines and avoid unnecessary re-computation.

‍

Materialize use case 1: inspection/understanding pipelines

Developers often start from example scripts. Imagine you have an example script like the following that partitions the document, extracts entity information, and then embeds the document and writes it to OpenSearch:

‍

# note that code in here will need minor tweaking, e.g. defining paths to work.       
# completely working examples can be found in github examples and notebooks
# https://github.com/aryn-ai/sycamore/tree/main/notebooks
import sycamore
def custom_transform(d):
    if 'parent_id' in d:
        d.properties['has_parent'] = True
    else:
        d.properties['has_parent'] = False

pipeline = (
    sycamore.init()
    .read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .regex_replace(COALESCE_WHITESPACE)
    .extract_entity(entity_extractor=OpenAIEntityExtractor(
        "title", llm=davinci_llm, prompt_template=title_template))
    .map(f=custom_transform)
    .write.opensearch(os_client_args=os_client_args, index_name="example1")
)

‍

In a pipeline there are places where it could behave unexpectedly. For example, after partitioning, after LLM operations, or after custom transforms. Therefore we stick materialize transforms after those steps to store the intermediate DocSets so we can inspect them:

‍

import sycamore
def custom_transform(d):
    if 'parent_id' in d:
        d.properties['has_parent'] = True
    else:
        d.properties['has_parent'] = False
    
pipeline = (
    sycamore.init()
    .read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .materialize(path="/tmp/after_partition")
    .regex_replace(COALESCE_WHITESPACE)
    .extract_entity(entity_extractor=OpenAIEntityExtractor(
        "title", llm=davinci_llm, prompt_template=title_template))
    .materialize(path="/tmp/after_llm")
    .map(f=custom_transform)
    .materialize(path="/tmp/after_custom")
    .write.opensearch(os_client_args=os_client_args, index_name="example1")
)

‍

Now we can inspect the results by printing them out:

‍

sycamore.init().read.materialize("/tmp/after_custom").show(2)

‍

Which would display two elements of the DocSet, and would appear similar to:

‍

{'properties': {'path': 'lib/sycamore/sycamore/tests/resources/data/pdfs/Transformer.pdf',       
                'filetype': 'application/pdf',
                'title': '\n"Attention Is All You Need"',
                'has_parent': 'False'},
  'elements': [],
  'lineage_id': 'cec7ea43-d510-4e54-8960-84bb2458c349',
  'doc_id': 'a7e1b4f4-6c7e-11ef-90e5-e40d36f1e1ae',
  'type': 'pdf',
  'binary_representation': b'<569417 bytes>'},
{'properties': {'score': 0.8773394823074341,
              'page_number': 1,
              'page_numbers': [1],
              'path': 'lib/sycamore/sycamore/tests/resources/data/pdfs/Transformer.pdf',
              'title': '\n"Attention Is All You Need"',
              'has_parent': 'True'},
'type': 'Section',
'binary_representation': b'<438 bytes>',
'text_representation': 'Attention Is All You Need\n'
                      'Niki Parmar∗ Google Research nikip@google.com\n'
                      'Ashish Vaswani∗ Google Brain <326 chars>',
'bbox': (0.18969529095818013,
        0.1263285272771662,
        0.8113471536075367,
        0.4394565096768466),
'elements': [],
'lineage_id': '9c909411-6cc0-4d5c-93c0-030de330d64e',
'doc_id': '4bbd806f-dbd9-4290-96ee-514776122668',
'parent_id': 'a7e1b4f4-6c7e-11ef-90e5-e40d36f1e1ae'}

‍

or we can display them:

‍

from sycamore.utils.pdf_utils import show_pages     
show_pages(sycamore.init().read.materialize("/tmp/after_partition"))

‍

You can leave all of these in permanently (the performance impact for writing locally will be tiny) and then only inspect what happens when queries or other operations do not work as expected.

‍

Materialize use case 2: Checkpointing or sharing data

While working on a project, a single developer might get a mostly working version of their pipeline. They could then use materialize to checkpoint that version so that other people can use it while they continue to do development.

Also, when developers work together on a related problem, they want to be sure that they are operating on the same data. For example, they want to make sure they are all using the same partitioning of the data, or they are loading exactly the same documents into their indexes. Materialize enables this by allowing developers to write to shared storage:

‍

pipeline = (     
    sycamore.init()
    .read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    ...
    .materialize(path="s3://example-bucket/example-prefix")
)
pipeline.execute()

‍

The execute() forces the pipeline to run and discards the output. To generate the materialized output, it would work equally well to .take() or .write.opensearch(...) on the pipeline.

Then that materialized output can be used by many different developers:

‍

(
    sycamore.init()
    .read.materialize(path="s3://example-bucket/example-prefix")
    .write.pinecone(index_name="example")
)

‍

Materialize uses pyarrow.fs to resolve the paths, so any filesystem pyarrow supports will work.

‍

Materialize use case 3: Avoiding re-execution in your pipeline

In the previous two sections, we’ve seen how to write materialized data and then read it separately. You could use this to avoid re-execution of slow steps during development by manually materializing and explicitly reading. However, it is simpler to be able to have your entire pipeline inline and use materialized stages if they are stored. We can take our original example and change the source_mode to use_stored:

‍

pipeline = (
    sycamore.init()
    .read.binary(paths, binary_format="pdf")
    .partition(partitioner=ArynPartitioner())
    .materialize(path="/tmp/after_partition", source_mode=sycamore.MATERIALIZE_USE_STORED)
    .regex_replace(COALESCE_WHITESPACE)
    .extract_entity(entity_extractor=OpenAIEntityExtractor(
        "title", llm=davinci_llm, prompt_template=title_template))
    .materialize(path="/tmp/after_llm", source_mode=sycamore.MATERIALIZE_USE_STORED)
    .map(f=custom_transform)
    .materialize(path="/tmp/after_custom", source_mode=sycamore.MATERIALIZE_USE_STORED)
    .merge(merger=GreedyElementMerger())
    .spread_properties(["path"])
    .split_elements(tokenizer=tokenizer, max_tokens=512)
    .explode()
    .embed(embedder=SentenceTransformerEmbedder())
)
pipeline.write.opensearch(os_client_args=os_client_args, index_name="example1")

‍

Now if you run this code twice, only the code after the last materialize will re-execute. If you ran pipeline.clear_materialize(), the entire pipeline would re-execute. If you were to run pipeline.clear_materialize(path="/tmp/after_custom"), the pipeline would run from the after_llm materialize on down.

If you do not completely execute the pipeline, materialize will not use the partial execution as a source. Partial execution could happen as a result of hitting ctrl-c, or if you run docs = pipeline.take(1). In both cases, the pipeline will not execute to completion since it will stop as soon as it has a single document to return. In these situations, materialize will not use the partial results as inputs. If you explicitly specify reading from a materialized directory (use case #2 via sycamore.init().read.materialize(path)), materialize will always read from the path and will log a warning if the source was not the result of a complete execution.

‍

Materialize use case 4: Auto-materialize

If you have a very complicated pipeline, it can be annoying to have to stick materialize steps after all of the stages of the pipeline. Instead, AutoMaterialize can stick materialize steps after each stage of the pipeline for you:

‍

from sycamore.materialize import AutoMaterialize      
ctx = sycamore.init()
# enable AutoMaterialize and let it choose a temporary directory to use. It will 
# log the materialized directory.
ctx.rewrite_rules.append(AutoMaterialize())
pipeline = (
    ctx.read.binary(...)
    ...
    .write.opensearch(...)
)

‍

At the end of execution there will be a directory (usually in /tmp) which will contain subdirectories with materialized data from each of the transforms in the pipeline.

You can specify the base directory for auto-materialize and you can specify specific names for transforms:

‍

from sycamore.materialize import AutoMaterialize
ctx = sycamore.init()
# override base directory for materialization      
ctx.rewrite_rules.append(AutoMaterialize("/home/example/subdir"))      

# Name some of the directories for some stages in the pipeline. 
pipeline = (
    ctx
    .read.document(docs, materialize={"name": "reader"})
    .map(noop_fn, materialize={"name": "noop"})
    .execute()
)
# materialized output will be written to root/reader and root/noop
# the names specified here need to be unique

‍

Request for feedback

Materialize is a relatively new feature, so we encourage you to provide feedback after using it on our Slack channel. You can find detailed documentation for materialize here, and get started with Sycamore here. Internally, though we expected to use materialize primarily for debugging Sycamore scripts, we found the features to avoid re-computation and share DocSets just as useful. Give it a try and let us know what you like, what you don’t like, and what you want us to add.