Building a Scalable Document Processing Pipeline With LlamaParse, Confluent Cloud, and MongoDB

Utsav Talwar, Shashank Pandey, Scott Sappenfield (Confluent), and Tuana Çelik (LlamaIndex)

As organisations generate increasing volumes of data, extracting meaningful insights from unstructured documents has become a significant challenge. This blog presents an advanced architecture that leverages cloud storage, streaming technology, machine learning, and a database to deliver a robust and efficient document processing pipeline.

Introduction: Modern document processing challenges

Organisations today are drowning in documents. PDFs, reports, contracts, and other text-heavy files contain valuable information, but extracting that knowledge efficiently presents significant challenges. Traditional document processing approaches often suffer from several limitations:

  • Scalability issues when processing large volumes of documents.
  • Limited intelligence in parsing complex document structures.
  • Batch processing delays that hinder timely, real-time insights.
  • Difficulty integrating processed data into downstream applications.

Modern businesses require solutions capable of processing documents at scale, extracting their semantic meaning, and making this information immediately available for applications like search, recommendation systems, or business intelligence. Our architecture meets these challenges through a streaming-based approach built on cutting-edge technologies.

Architecture overview: From raw files to structured data

Figure 1. Reference architecture.
Reference architecture diagram. At the top right is data storage with raw files in an AWS S3 bucket. The data flows through Confluent Cloud and lands in MongoDB as structured data.

Our solution features a sophisticated real-time data processing pipeline that combines cloud storage, stream processing, machine learning, and a persistent database to create a comprehensive system for document enrichment and analysis.

At its core, the architecture follows a streaming data pattern where information flows continuously from source to destination, being transformed and enriched along the way. Let's walk through the key components:

  • AWS S3 bucket: Serves as the primary data lake, storing raw PDF documents.
  • Python ingestion script: Reads files from S3 and coordinates document processing.
  • LlamaParse: Provides intelligent document parsing and chunking.
  • Confluent: Acts as the central nervous system with two main topics:
    • "raw": Contains parsed document chunks.
    • "summary_embedding": Stores processed chunks with embeddings.
  • Apache Flink: Processes streaming data and generates embeddings using ML.
  • Confluent Schema Registry: Handles data contracts, ensuring consistent data formats.
  • MongoDB: Stores the final processed documents with their embeddings.

This architecture excels in scenarios requiring real-time document processing with machine learning enrichment, such as semantic search applications, content classification, or knowledge management systems.

Data ingestion: Efficient document chunking with LlamaParse

The journey begins with PDF documents stored in an AWS S3 bucket. The ingestion layer, built with Python, handles the following tasks:

  • Document retrieval: The Python script connects to AWS S3 using configured credentials to access stored PDF documents (a minimal retrieval sketch appears at the end of this section).
  • Intelligent parsing with LlamaParse: The system fundamentally transforms how PDFs are processed. Instead of a simple extraction that treats these complex documents as mere sequences of text, it harnesses the power of LlamaParse. This sophisticated document parsing tool goes beyond simple character recognition, offering an intelligent understanding of document structure and layout. LlamaParse meticulously identifies and interprets critical formatting elements such as:
    • Tables: Accurately distinguishing rows, columns, and cell content, preserving tabular data integrity.
    • Images: Identifying images within the text, including additional context based on where each image sits in the overall layout.
    • Headers: Recognising the hierarchical headings and subheadings that are crucial for organising a document effectively.
    • Other formatting elements: Including lists, bolded text, italics, and various layout components, ensuring that the semantic meaning and visual presentation are preserved during parsing.

By leveraging LlamaParse, the system ensures that context is not lost across the document, employing a parsing strategy that can make use of classic OCR as well as LLMs and LVMs.

The following Python code demonstrates how to initialise your parser and extract relevant information.

import os

from llama_parse import LlamaParse

# Initialize LlamaParse with your API key
parser = LlamaParse(
    api_key=os.getenv("LLAMA_PARSE_API_KEY"),
    result_type="json",  # Get JSON output
)

# Parse the PDF bytes, passing the file name along as extra_info
parsed_doc = parser.parse(file_bytes, extra_info={"file_name": file_name})

  • Document chunking: LlamaParse breaks down documents into manageable chunks, typically at the page level, while preserving the context and metadata of each chunk. This chunking approach provides several benefits:
    • Better processing efficiency for large documents.
    • More precise context for embedding generation.
    • Improved search granularity in the final application.

The processed chunks are then ready for the next stage of the pipeline. The Python script handles any parsing errors gracefully, ensuring the pipeline remains robust even when encountering malformed documents.
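
Before any parsing happens, the raw PDF bytes need to be read out of S3. As a point of reference for the document retrieval step above, here is a minimal sketch using boto3; the bucket name, prefix, and credential handling are placeholders rather than the exact code from the repository.

import os

import boto3

# Credentials are resolved from the environment or an attached IAM role;
# the bucket and prefix below are placeholders for your own S3 layout.
s3 = boto3.client("s3")
bucket = os.getenv("S3_BUCKET", "my-document-bucket")  # hypothetical bucket name

# List PDF objects under a prefix and download each one as raw bytes
response = s3.list_objects_v2(Bucket=bucket, Prefix="incoming/")
for obj in response.get("Contents", []):
    key = obj["Key"]
    if not key.lower().endswith(".pdf"):
        continue
    file_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    file_name = key.split("/")[-1]
    # file_bytes and file_name feed straight into parser.parse(...) shown earlier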

Streaming infrastructure: Leveraging Confluent Cloud

Confluent Cloud, a fully managed Apache Kafka service, serves as the backbone of our architecture. This streaming platform offers several advantages:

  1. Decoupled components: Kafka separates data producers (document parsers) from consumers (processing engines), allowing each to operate at its own pace. LlamaParse, Flink, and MongoDB can therefore each process and ingest data at their own throughput, independently of one another.
  2. Scalability: The platform handles high throughput with configurable partitioning (6 partitions per topic in our implementation).
  3. Data resilience: Kafka's replication ensures no document chunks are lost during processing.
  4. Schema management: Confluent Schema Registry provides strong guarantees for schema evolution (forward and backward compatibility).

Our implementation uses two main Kafka topics:

  • raw: Contains the parsed document chunks from LlamaParse.
  • summary_embedding: Stores the processed chunks with their generated embeddings.
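
To show how parsed chunks might be published to the raw topic, below is a minimal producer sketch using the confluent_kafka Python client. The connection settings are placeholders, and the choice to send each chunk's text as the plain message value is an assumption that lines up with the CAST(val AS STRING) read in the Flink job later on.

from confluent_kafka import Producer

# Confluent Cloud connection details; all values below are placeholders
producer = Producer({
    "bootstrap.servers": "<BOOTSTRAP_SERVER>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
})

def publish_chunk(chunk_text: str) -> None:
    """Send one parsed chunk to the raw topic, with the chunk text as the message value."""
    producer.produce("raw", value=chunk_text.encode("utf-8"))

# ... call publish_chunk() for every chunk, then flush outstanding messages
producer.flush()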

The Avro schema for the embedding messages ensures consistency:

{
  "type": "record",
  "name": "summary_embedding_value",
  "namespace": "org.apache.flink.avro.generated.record",
  "fields": [
    {
      "name": "content",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "embeddings",
      "type": ["null", {"type": "array", "items": ["null", "float"]}],
      "default": null
    }
  ]
}

This schema defines the structure for each message, containing the original text content and its corresponding vector embeddings.

Stream processing: Generating embeddings with Apache Flink

Once document chunks are flowing through Apache Kafka, the real magic happens in the processing layer. Apache Flink, a powerful stream processing framework, consumes data from the raw topic, applies transformations to generate embeddings, and continuously produces the enriched summary_embedding stream back to Kafka.

Embeddings are numerical vector representations that capture the semantic meaning of text. They enable powerful capabilities like:

  • Semantic search (finding documents by meaning, not just keywords).
  • Document clustering and classification.
  • Similarity detection between documents.
  • Foundation for sophisticated AI applications.
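
To make "searching by meaning" concrete, the short snippet below compares two embedding vectors with cosine similarity, the measure typically used when querying stored embeddings. It is a standalone illustration rather than part of the pipeline code.

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine similarity of two embedding vectors (1.0 means identical direction)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Chunks about the same topic produce vectors pointing in similar directions,
# so their cosine similarity is close to 1 even when they share few keywords.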

Our implementation uses AWS Bedrock for embedding generation through Flink SQL:

-- Create the embedding model
CREATE MODEL AWSBedrockEmbedding INPUT (text STRING) OUTPUT (embeddings ARRAY<FLOAT>)
WITH (
    'bedrock.connection' = 'bedrock-connection',
    'task' = 'embedding',
    'provider' = 'BEDROCK'
);

-- Create the destination table
CREATE TABLE summary_embedding (
    content STRING,
    embeddings ARRAY<FLOAT>
);

-- Insert transformed data
INSERT INTO summary_embedding
SELECT
    CAST(val as STRING),
    embeddings
FROM
    raw,
    LATERAL TABLE (ML_PREDICT('AWSBedrockEmbedding', CAST(val as STRING)));

This SQL defines how Flink should:

  1. Connect to AWS Bedrock for ML capabilities.
  2. Define the destination structure for embeddings.
  3. Transform incoming text by generating embeddings through the ML_PREDICT function.

The result is a continuous stream of document chunks paired with their semantic vector representations.

Data consumption: Avro deserialization for schema evolution

On the consumption side, a dedicated consumer application reads from the summary_embedding topic. This application handles several important tasks:

  1. Message consumption: Efficiently reads messages from Kafka with proper offset management.
  2. Avro deserialization: Converts the binary Avro format back to usable objects using the Schema Registry.
  3. Error handling and retries: Manages potential failures in consumption or processing.

Avro deserialization is particularly important for maintaining compatibility as the pipeline evolves. The Schema Registry ensures that even if the schema changes over time (adding new fields, for example), the consumer can still correctly interpret messages produced with older schemas.

The consumer application is implemented with multi-threading to maximise throughput, allowing parallel processing of messages from different partitions.
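
A minimal sketch of such a consumer, using the confluent_kafka client together with its Schema Registry integration, might look like the following; the connection settings and consumer group name are placeholders, and retry logic is omitted for brevity.

from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField

# The Schema Registry client resolves the writer schema for each message
schema_registry = SchemaRegistryClient({
    "url": "<SCHEMA_REGISTRY_URL>",
    "basic.auth.user.info": "<SR_API_KEY>:<SR_API_SECRET>",
})
avro_deserializer = AvroDeserializer(schema_registry)

consumer = Consumer({
    "bootstrap.servers": "<BOOTSTRAP_SERVER>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
    "group.id": "embedding-writer",  # placeholder consumer group name
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["summary_embedding"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Deserialize the Avro value into a dict with "content" and "embeddings"
    record = avro_deserializer(
        msg.value(), SerializationContext(msg.topic(), MessageField.VALUE)
    )
    # record is now ready to be written to MongoDB (see the next section)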

Storage strategy: MongoDB for flexible document storage

The final destination for our processed document chunks is MongoDB, a document-oriented database well-suited for storing complex, nested data structures, including vector embeddings.

MongoDB offers several advantages for this architecture:

  1. Flexible schema: Accommodates varying document structures and metadata.
  2. Vector storage: Efficiently stores and indexes high-dimensional embedding vectors.
  3. Query capabilities: Supports semantic search through vector similarity queries.
  4. Scalability: Handles large document collections through sharding.
  5. Integration options: Easily connects with downstream applications and visualisation tools.

The consumer application inserts each processed document chunk into MongoDB, preserving both the original text content and the generated embeddings. This makes the data immediately available for applications that need to search or analyse the document collection.
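
A minimal sketch of that final write with pymongo could look like this; the connection string, database, and collection names are placeholders.

from pymongo import MongoClient

# Connection string, database, and collection names are placeholders
client = MongoClient("<MONGODB_ATLAS_CONNECTION_STRING>")
collection = client["documents"]["summary_embeddings"]

def store_record(record: dict) -> None:
    """Persist one enriched chunk: the original text plus its embedding vector."""
    collection.insert_one({
        "content": record["content"],
        "embeddings": record["embeddings"],
    })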

How MongoDB differentiates from other vector databases

MongoDB stands out as a versatile choice for vector storage, especially when compared to specialized vector databases. Here's why:

  1. Native integration: MongoDB's core strength lies in its ability to store and manage both structured and unstructured data, including vector embeddings, within a single platform. Unlike standalone vector databases that often require separate data synchronization and management, MongoDB Atlas Vector Search allows you to store your original data and its corresponding embeddings together in the same document. This eliminates data silos and simplifies your architecture.
  2. Flexible data model: MongoDB's document model provides unparalleled flexibility. You can store your raw text, metadata, and vector embeddings in a single JSON-like document. This means you don't need to normalize your data across multiple tables or systems, making schema evolution easier and reducing development complexity.
  3. Comprehensive query capabilities: Beyond simple vector similarity searches, MongoDB allows you to combine vector search with other powerful query operators, such as filtering by metadata, geospatial queries, or full-text search (a query sketch follows below). This enables more nuanced and precise retrieval of information, which is crucial for advanced AI applications.
  4. Operational maturity and ecosystem: MongoDB is a mature, battle-tested database with a robust ecosystem of tools, drivers, and integrations. It offers enterprise-grade features like scalability, high availability, security, and a rich set of developer tools. Specialized vector databases, while effective for their niche, may lack the broader operational capabilities and community support of a general-purpose database like MongoDB.
  5. Cost-effectiveness and simplification: By consolidating your data storage and vector search capabilities into a single database, you can reduce operational overhead and cost. You avoid the need to manage and scale separate database systems, simplifying your infrastructure and streamlining your development workflow.

In essence, while dedicated vector databases excel at one specific task, MongoDB provides a holistic solution that handles your entire data lifecycle, from ingestion and storage to advanced querying and analytics, all within a single, integrated platform.
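
For example, once an Atlas Vector Search index has been created on the embeddings field, a downstream application could combine vector similarity with ordinary aggregation stages. The index name, collection names, and query vector below are placeholders, not values from the reference implementation.

from pymongo import MongoClient

collection = MongoClient("<MONGODB_ATLAS_CONNECTION_STRING>")["documents"]["summary_embeddings"]

# The query vector would come from the same Bedrock embedding model used in the
# pipeline; a hard-coded placeholder of matching dimensionality stands in here.
query_embedding = [0.0] * 1024

results = collection.aggregate([
    {
        "$vectorSearch": {
            "index": "embedding_index",  # placeholder Atlas Vector Search index name
            "path": "embeddings",
            "queryVector": query_embedding,
            "numCandidates": 100,
            "limit": 5,
        }
    },
    # Keep only the chunk text and the similarity score for each hit
    {"$project": {"content": 1, "score": {"$meta": "vectorSearchScore"}}},
])

for doc in results:
    print(doc["score"], doc["content"][:80])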


By implementing this architecture, organisations can transform their document processing capabilities from static, batch-oriented systems to dynamic, real-time pipelines that extract meaningful insights from unstructured content. The combination of cloud storage, streaming processing, machine learning, and flexible storage creates a powerful foundation for document-centric applications that drive business value.

Get started today by exploring the complete implementation in the GitHub repo.

New to MongoDB? Deploy a free instance of MongoDB Atlas to see how it can effortlessly power your AI-driven applications.

Not yet a Confluent customer? Start your free trial of Confluent Cloud today and receive $400 to spend during your first 30 days!

Sign up to LlamaCloud to get started with LlamaParse, LlamaExtract and more! Want to build your own custom agentic workflow? Check out LlamaIndex.