Relevant Context, Part 2: Ingestion

Published: April 9, 2025

“Ingestion” generically refers to the process of transforming a source document into a target format with minimal - ideally zero - loss of information.

The dominant document formats used for commercial legal agreements are Microsoft Word and Adobe PDF. Adobe’s PDF format can be further subdivided into “text”-based PDFs - where the characters on each page are represented as Unicode characters - and “image”-based PDFs - where the contents of each page are effectively represented as an image file that must be passed through an optical character recognition (OCR) process during ingestion. Each of these formats presents its own unique challenges.

Reliable, high-quality, at-scale ingestion of the wide variety of document formats in common use today is an enduringly challenging problem - so much so that it continues to support a sizeable enterprise market of sustainably profitable ingestion-as-a-service offerings tailored to specialist domain niches.

On the other hand, there are now a number of mature open source projects in the space that have changed what is achievable for smaller teams on more constrained budgets. With modest engineering resources, it is now possible - although it remains non-trivial - to set up and maintain a custom in-house ingestion pipeline for your particular domain.

Representations Suitable for Use With LLMs

Binary formats (such as .docx and .pdf) cannot be directly used in an LLM context window: they must first be transformed into a pure text representation. But that does not imply that we are constrained to only representing textual information: document structure and formatting from the source document - elements of which are just as important for human comprehension as the text itself - can also be retained to the extent that they may be represented using text.
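
To make this concrete, here is a minimal sketch of the binary-to-text step using pypandoc, a Python wrapper around Pandoc and one of several mature options; the file name is illustrative, and Pandoc itself must be installed separately.

```python
import pypandoc

# Convert a Word document into GitHub-Flavoured Markdown. Pandoc maps
# headings, bold/italic runs, and simple tables to Markdown syntax;
# anything the target format cannot express (page breaks, precise
# spacing) is silently dropped.
markdown_text = pypandoc.convert_file("agreement.docx", "gfm")
print(markdown_text[:500])
```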

Even binary formats used in “what you see is what you get” (WYSIWYG) editors such as Microsoft Word represent some of their structure in text directly visible to the end-user: clause numbering in a legal contract, for example. But things like bolding, heading sizes, page breaks, and so forth, are represented using non-textual means - effectively a separate metadata layer which sits alongside the text itself, rather than existing as part of it. Structure and formatting which may be essential to the interpretation of their subject text will be lost to the extent that they cannot be converted to a textual representation during the ingestion process.
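
As an illustration, here is a hedged sketch of reading that metadata layer with the python-docx library; again, the file name is illustrative.

```python
from docx import Document

doc = Document("agreement.docx")
for para in doc.paragraphs:
    # The paragraph style (e.g. "Heading 1") is metadata sitting
    # alongside the text, not characters within it.
    style = para.style.name
    # Bolding is likewise a property of each run of characters and is
    # invisible in the raw string itself.
    bold_fragments = [run.text for run in para.runs if run.bold]
    if style.startswith("Heading") or bold_fragments:
        print(style, repr(para.text), bold_fragments)
```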

Happily, there now exist a number of mature, comparatively standardised, text-only representation formats. All of these are in principle suitable for use in LLM context windows, with the better-known variants being naturally preferred for their ubiquity in LLM training datasets. These formats enable varying levels of structural and formatting complexity depending on the particular exigencies of your use-case.

Perhaps best known - even outside of technical circles - is Markdown, which has been around since 2004 and was formalised into the CommonMark specification in 2014. Markdown’s biggest strength is its simplicity: the specification can be learned in an afternoon. But simplicity is also its biggest limitation: primarily focussed on web publishing, Markdown struggles with representations of document structure which go beyond simple heading-level demarcation. Although it has been extended to meet some of these challenges, and has the twin advantages of being both human-readable and well understood by LLMs, it remains a fundamentally limited choice for high-fidelity, pure-text representation of legal contracts.

It can be an excellent choice for certain LLM-backed point solutions where structure and formatting are less important; but in those cases, you should choose to transform a higher-quality, “lossless” representation of your document into Markdown as and when needed, rather than use Markdown itself as the canonical representation. Unlike transforming proprietary binary formats to pure-text representations, transforming between text-based representations is usually trivial.
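
For instance - still assuming pypandoc, as above - converting a reStructuredText fragment to Markdown is a single call:

```python
import pypandoc

# An illustrative reStructuredText fragment with two sections.
rst_fragment = """
Definitions
===========

The term *Agreement* means this contract.

Termination
===========

Either party may terminate on **30 days'** notice.
"""

# Render the fragment as GitHub-Flavoured Markdown on demand.
markdown_fragment = pypandoc.convert_text(rst_fragment, "gfm", format="rst")
print(markdown_fragment)
```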

Canonical Representation

Perhaps the best-known alternative pure-text representations are AsciiDoc and reStructuredText, both of which are better suited to preserving complex document structure while remaining human-legible. LLMs also appear to be comfortable with XML-based formats - although the trade-off there is perhaps reduced legibility for humans. Should we pick one of these and call it a day?

There may seem to be a neat symmetry in ingesting into, and storing your documents in, the same format that will ultimately be used with your LLM-backed features. But there are some big trade-offs to this approach, and we think many of the benefits are largely illusory, given the existence of better alternatives.

Let’s take a step back and think about what we’re trying to achieve and what we’ve established:

  • We want bulletproof reliability in the ingestion process: getting great results from our product features is hard enough without also having to deal with junk input data created by poor ingestion.
  • Certainly, we want to use our documents with LLMs without sacrificing important structure and formatting.
  • However, we also know that we will have uses for our documents which do not directly - or immediately - involve LLMs, and for which we may want to leverage document structure or features that cannot be represented in a simple text representation at all. Chief amongst these are what we have referred to in Part 1 of this series as “Document and Chunk Enhancement” processing.
  • We will almost certainly also want to be able to work with the document, or parts of the document, as structured objects in code, rather than just partial strings (see the sketch after this list).
  • Images cannot be losslessly represented using pure-text representations, and tables can be difficult, or even impossible, to represent, depending on their complexity.
  • Transforming between “near-text” representations and Markdown / AsciiDoc / reStructuredText is usually trivially easy - although it may not be lossless, depending on the target format chosen.
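
To make the “structured objects” point concrete, here is a hedged sketch of what such a canonical representation might look like; the class names and the Markdown mapping are illustrative, not a proposal for a standard.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Base class for one structural unit of the ingested document."""

    def to_markdown(self) -> str:
        raise NotImplementedError


@dataclass
class Heading(Block):
    level: int
    text: str

    def to_markdown(self) -> str:
        return f"{'#' * self.level} {self.text}"


@dataclass
class Paragraph(Block):
    text: str

    def to_markdown(self) -> str:
        return self.text


@dataclass
class Image(Block):
    path: str      # the binary payload lives outside the text entirely
    caption: str = ""

    def to_markdown(self) -> str:
        # Lossy by design: only the caption survives a pure-text rendering.
        return f"![{self.caption}]({self.path})"


@dataclass
class DocumentModel:
    blocks: list[Block] = field(default_factory=list)

    def to_markdown(self) -> str:
        return "\n\n".join(block.to_markdown() for block in self.blocks)
```

Producing an AsciiDoc or reStructuredText rendering then amounts to adding one more method per block type, while blocks such as Image retain references to extra-textual content that no pure-text format could carry.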

Laying it all out like that, it’s clear that our actual constraint isn’t that meaningful document structure and formatting must be immediately representable in pure text, but that it must be trivial to produce arbitrary pure-text representations from whatever interim canonical format we do choose to use. Loosening the requirement to store our canonical ingested documents in plain text means we can lean into a whole host of extra-textual features: storing and making sense of images and tables, working with structured objects in code, producing different target transformations for different purposes, and leveraging the richer structure and formatting captured during ingestion in our metadata extraction and enhancement activities. Any format with these features is a reasonable candidate for our canonical representation, and we should be indifferent between such formats, all else being equal.

As such, we should be much more interested in the robustness of the transformation performed by our ingestion software - including its performance across the major source formats commonly used in our domain - than in the particulars of the chosen representation format (assuming it is sufficiently high-fidelity).
