Building a Scalable Document Processing Pipeline With LlamaParse, Confluent Cloud, and MongoDB

Utsav Talwar, Shashank Pandey, Scott Sappenfield (Confluent), and Tuana Çelik (LlamaIndex)

As organisations generate increasing volumes of data, extracting meaningful insights from unstructured documents has become a significant challenge. This blog presents an advanced architecture that leverages cloud storage, streaming technology, machine learning, and a database to deliver a robust and efficient document processing pipeline.

Introduction: Modern document processing challenges

Organisations today are drowning in documents. PDFs, reports, contracts, and other text-heavy files contain valuable information, but extracting that knowledge efficiently presents significant challenges. Traditional document processing approaches often suffer from several limitations:

  • Scalability issues when processing large volumes of documents.
  • Limited intelligence in parsing complex document structures.
  • Batch processing delays that hinder timely, real-time insights.
  • Difficulty integrating processed data into downstream applications.

Modern businesses require solutions capable of processing documents at scale, extracting their semantic meaning, and making this information immediately available for applications like search, recommendation systems, or business intelligence. Our architecture meets these challenges through a streaming-based approach built on cutting-edge technologies.

Architecture overview: From raw files to structured data

Figure 1. Reference architecture.
Reference architecture diagram. At the top right is data storage with raw files in an AWS S3 bucket. The data flows through Confluent Cloud and lands in MongoDB as structured data.

Our solution features a sophisticated real-time data processing pipeline that combines cloud storage, stream processing, machine learning, and a persistent database to create a comprehensive system for document enrichment and analysis.

At its core, the architecture follows a streaming data pattern where information flows continuously from source to destination, being transformed and enriched along the way. Let's walk through the key components:

  • AWS S3 bucket: Serves as the primary data lake, storing raw PDF documents.
  • Python ingestion script: Reads files from S3 and coordinates document processing.
  • LlamaParse: Provides intelligent document parsing and chunking.
  • Confluent: Acts as the central nervous system with two main topics:
    • "raw": Contains parsed document chunks.
    • "summary_embedding": Stores processed chunks with embeddings.
  • Apache Flink: Processes streaming data and generates embeddings using ML.
  • Confluent Schema Registry: Handles data contracts, ensuring consistent data formats.
  • MongoDB: Stores the final processed documents with their embeddings.

This architecture excels in scenarios requiring real-time document processing with machine learning enrichment, such as semantic search applications, content classification, or knowledge management systems.

Data ingestion: Efficient document chunking with LlamaParse

The journey begins with PDF documents stored in an AWS S3 bucket. The ingestion layer, built with Python, handles the following tasks:

  • Document retrieval: The Python script connects to AWS S3 using configured credentials to access stored PDF documents (a minimal retrieval sketch appears at the end of this section).
  • Intelligent parsing with LlamaParse: The system fundamentally transforms how PDFs are processed. Instead of a simple extraction that treats these complex documents as mere sequences of text, it harnesses the power of LlamaParse. This sophisticated document parsing tool goes beyond simple character recognition, offering an intelligent understanding of document structure and layout. LlamaParse meticulously identifies and interprets critical formatting elements such as:
    • Tables: Accurately distinguishing rows, columns, and cell content, preserving tabular data integrity.
    • Images: Identifying images within the text, including additional context based on where each image sits in the overall layout.
    • Headers: Recognising the hierarchical headings and subheadings that are crucial for organising a document effectively.
    • Other formatting elements: Including lists, bolded text, italics, and various layout components, ensuring that the semantic meaning and visual presentation are preserved during parsing.

By leveraging LlamaParse, the system ensures that context is not lost across the document, employing a parsing strategy that can make use of classic OCR as well as LLMs and LVMs.

The following Python code demonstrates how to initialise your parser and extract relevant information.

import os

from llama_parse import LlamaParse

# Initialize LlamaParse with your API key
parser = LlamaParse(
    api_key=os.getenv("LLAMA_PARSE_API_KEY"),
    result_type="json",  # Get JSON output
)

# Parse the PDF bytes, passing the file name along as extra_info
parsed_doc = parser.parse(file_bytes, extra_info={"file_name": file_name})

  • Document chunking: LlamaParse breaks down documents into manageable chunks, typically at the page level, while preserving the context and metadata of each chunk. This chunking approach provides several benefits:
    • Better processing efficiency for large documents.
    • More precise context for embedding generation.
    • Improved search granularity in the final application.

The processed chunks are then ready for the next stage of the pipeline. The Python script handles any parsing errors gracefully, ensuring the pipeline remains robust even when encountering malformed documents.
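
Before any parsing happens, the raw PDF bytes need to be read out of S3. As a point of reference for the document retrieval step above, here is a minimal sketch using boto3; the bucket name, prefix, and credential handling are placeholders rather than the exact code from the repository.

import os

import boto3

# Credentials are resolved from the environment or an attached IAM role;
# the bucket and prefix below are placeholders for your own S3 layout.
s3 = boto3.client("s3")
bucket = os.getenv("S3_BUCKET", "my-document-bucket")  # hypothetical bucket name

# List PDF objects under a prefix and download each one as raw bytes
response = s3.list_objects_v2(Bucket=bucket, Prefix="incoming/")
for obj in response.get("Contents", []):
    key = obj["Key"]
    if not key.lower().endswith(".pdf"):
        continue
    file_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    file_name = key.split("/")[-1]
    # file_bytes and file_name feed straight into parser.parse(...) shown earlier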

Streaming infrastructure: Leveraging Confluent Cloud

Confluent Cloud, a fully managed Apache Kafka service, serves as the backbone of our architecture. This streaming platform offers several advantages:

  1. Decoupled components: Kafka separates data producers (document parsers) from consumers (processing engines), allowing each to operate at its own pace. LlamaParse, Flink, and MongoDB can therefore each process and ingest data at their own throughput, independently of one another.
  2. Scalability: The platform handles high throughput with configurable partitioning (6 partitions per topic in our implementation).
  3. Data resilience: Kafka's replication ensures no document chunks are lost during processing.
  4. Schema management: Confluent Schema Registry provides strong guarantees for schema evolution (forward and backward compatibility).

Our implementation uses two main Kafka topics:

  • raw: Contains the parsed document chunks from LlamaParse.
  • summary_embedding: Stores the processed chunks with their generated embeddings.
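
To show how parsed chunks might be published to the raw topic, below is a minimal producer sketch using the confluent_kafka Python client. The connection settings are placeholders, and the choice to send each chunk's text as the plain message value is an assumption that lines up with the CAST(val AS STRING) read in the Flink job later on.

from confluent_kafka import Producer

# Confluent Cloud connection details; all values below are placeholders
producer = Producer({
    "bootstrap.servers": "<BOOTSTRAP_SERVER>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
})

def publish_chunk(chunk_text: str) -> None:
    """Send one parsed chunk to the raw topic, with the chunk text as the message value."""
    producer.produce("raw", value=chunk_text.encode("utf-8"))

# ... call publish_chunk() for every chunk, then flush outstanding messages
producer.flush()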

The Avro schema for the embedding messages ensures consistency:

{
  "type": "record",
  "name": "summary_embedding_value",
  "namespace": "org.apache.flink.avro.generated.record",
  "fields": [
    {
      "name": "content",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "embeddings",
      "type": ["null", {"type": "array", "items": ["null", "float"]}],
      "default": null
    }
  ]
}

This schema defines the structure for each message, containing the original text content and its corresponding vector embeddings.

Stream processing: Generating embeddings with Apache Flink

Once document chunks are flowing through Apache Kafka, the real magic happens in the processing layer. Apache Flink, a powerful stream processing framework, consumes data from the raw topic, applies transformations to generate embeddings, and continuously produces the enriched summary_embedding stream back to Kafka.

Embeddings are numerical vector representations that capture the semantic meaning of text. They enable powerful capabilities like:

  • Semantic search (finding documents by meaning, not just keywords).
  • Document clustering and classification.
  • Similarity detection between documents.
  • Foundation for sophisticated AI applications.
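
To make "searching by meaning" concrete, the short snippet below compares two embedding vectors with cosine similarity, the measure typically used when querying stored embeddings. It is a standalone illustration rather than part of the pipeline code.

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine similarity of two embedding vectors (1.0 means identical direction)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Chunks about the same topic produce vectors pointing in similar directions,
# so their cosine similarity is close to 1 even when they share few keywords.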

Our implementation uses AWS Bedrock for embedding generation through Flink SQL:

-- Create the embedding model
CREATE MODEL AWSBedrockEmbedding INPUT (text STRING) OUTPUT (embeddings ARRAY<FLOAT>)
WITH (
    'bedrock.connection' = 'bedrock-connection',
    'task' = 'embedding',
    'provider' = 'BEDROCK'
);

-- Create the destination table
CREATE TABLE summary_embedding (
    content STRING,
    embeddings ARRAY<FLOAT>
);

-- Insert transformed data
INSERT INTO summary_embedding
SELECT
    CAST(val as STRING),
    embeddings
FROM
    raw,
    LATERAL TABLE (ML_PREDICT('AWSBedrockEmbedding', CAST(val as STRING)));

This SQL defines how Flink should:

  1. Connect to AWS Bedrock for ML capabilities.
  2. Define the destination structure for embeddings.
  3. Transform incoming text by generating embeddings through the ML_PREDICT function.

The result is a continuous stream of document chunks paired with their semantic vector representations.

Data consumption: Avro deserialization for schema evolution

On the consumption side, a dedicated consumer application reads from the summary_embedding topic. This application handles several important tasks:

  1. Message consumption: Efficiently reads messages from Kafka with proper offset management.
  2. Avro deserialization: Converts the binary Avro format back to usable objects using the Schema Registry.
  3. Error handling and retries: Manages potential failures in consumption or processing.

Avro deserialization is particularly important for maintaining compatibility as the pipeline evolves. The Schema Registry ensures that even if the schema changes over time (adding new fields, for example), the consumer can still correctly interpret messages produced with older schemas.

The consumer application is implemented with multi-threading to maximise throughput, allowing parallel processing of messages from different partitions.
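
A minimal sketch of such a consumer, using the confluent_kafka client together with its Schema Registry integration, might look like the following; the connection settings and consumer group name are placeholders, and retry logic is omitted for brevity.

from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField

# The Schema Registry client resolves the writer schema for each message
schema_registry = SchemaRegistryClient({
    "url": "<SCHEMA_REGISTRY_URL>",
    "basic.auth.user.info": "<SR_API_KEY>:<SR_API_SECRET>",
})
avro_deserializer = AvroDeserializer(schema_registry)

consumer = Consumer({
    "bootstrap.servers": "<BOOTSTRAP_SERVER>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
    "group.id": "embedding-writer",  # placeholder consumer group name
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["summary_embedding"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Deserialize the Avro value into a dict with "content" and "embeddings"
    record = avro_deserializer(
        msg.value(), SerializationContext(msg.topic(), MessageField.VALUE)
    )
    # record is now ready to be written to MongoDB (see the next section)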

Storage strategy: MongoDB for flexible document storage

The final destination for our processed document chunks is MongoDB, a document-oriented database well-suited for storing complex, nested data structures, including vector embeddings.

MongoDB offers several advantages for this architecture:

  1. Flexible schema: Accommodates varying document structures and metadata.
  2. Vector storage: Efficiently stores and indexes high-dimensional embedding vectors.
  3. Query capabilities: Supports semantic search through vector similarity queries.
  4. Scalability: Handles large document collections through sharding.
  5. Integration options: Easily connects with downstream applications and visualisation tools.

The consumer application inserts each processed document chunk into MongoDB, preserving both the original text content and the generated embeddings. This makes the data immediately available for applications that need to search or analyse the document collection.
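
A minimal sketch of that final write with pymongo could look like this; the connection string, database, and collection names are placeholders.

from pymongo import MongoClient

# Connection string, database, and collection names are placeholders
client = MongoClient("<MONGODB_ATLAS_CONNECTION_STRING>")
collection = client["documents"]["summary_embeddings"]

def store_record(record: dict) -> None:
    """Persist one enriched chunk: the original text plus its embedding vector."""
    collection.insert_one({
        "content": record["content"],
        "embeddings": record["embeddings"],
    })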

How MongoDB differentiates from other vector databases

MongoDB stands out as a versatile choice for vector storage, especially when compared to specialized vector databases. Here's why:

  1. Native integration: MongoDB's core strength lies in its ability to store and manage both structured and unstructured data, including vector embeddings, within a single platform. Unlike standalone vector databases that often require separate data synchronization and management, MongoDB Atlas Vector Search allows you to store your original data and its corresponding embeddings together in the same document. This eliminates data silos and simplifies your architecture.
  2. Flexible data model: MongoDB's document model provides unparalleled flexibility. You can store your raw text, metadata, and vector embeddings in a single JSON-like document. This means you don't need to normalize your data across multiple tables or systems, making schema evolution easier and reducing development complexity.
  3. Comprehensive query capabilities: Beyond simple vector similarity searches, MongoDB allows you to combine vector search with other powerful query operators, such as filtering by metadata, geospatial queries, or full-text search (a query sketch follows below). This enables more nuanced and precise retrieval of information, which is crucial for advanced AI applications.
  4. Operational maturity and ecosystem: MongoDB is a mature, battle-tested database with a robust ecosystem of tools, drivers, and integrations. It offers enterprise-grade features like scalability, high availability, security, and a rich set of developer tools. Specialized vector databases, while effective for their niche, may lack the broader operational capabilities and community support of a general-purpose database like MongoDB.
  5. Cost-effectiveness and simplification: By consolidating your data storage and vector search capabilities into a single database, you can reduce operational overhead and cost. You avoid the need to manage and scale separate database systems, simplifying your infrastructure and streamlining your development workflow.

In essence, while dedicated vector databases excel at one specific task, MongoDB provides a holistic solution that handles your entire data lifecycle, from ingestion and storage to advanced querying and analytics, all within a single, integrated platform.
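
For example, once an Atlas Vector Search index has been created on the embeddings field, a downstream application could combine vector similarity with ordinary aggregation stages. The index name, collection names, and query vector below are placeholders, not values from the reference implementation.

from pymongo import MongoClient

collection = MongoClient("<MONGODB_ATLAS_CONNECTION_STRING>")["documents"]["summary_embeddings"]

# The query vector would come from the same Bedrock embedding model used in the
# pipeline; a hard-coded placeholder of matching dimensionality stands in here.
query_embedding = [0.0] * 1024

results = collection.aggregate([
    {
        "$vectorSearch": {
            "index": "embedding_index",  # placeholder Atlas Vector Search index name
            "path": "embeddings",
            "queryVector": query_embedding,
            "numCandidates": 100,
            "limit": 5,
        }
    },
    # Keep only the chunk text and the similarity score for each hit
    {"$project": {"content": 1, "score": {"$meta": "vectorSearchScore"}}},
])

for doc in results:
    print(doc["score"], doc["content"][:80])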


By implementing this architecture, organisations can transform their document processing capabilities from static, batch-oriented systems to dynamic, real-time pipelines that extract meaningful insights from unstructured content. The combination of cloud storage, streaming processing, machine learning, and flexible storage creates a powerful foundation for document-centric applications that drive business value.

Get started today by exploring the complete implementation in the GitHub repo.

New to MongoDB? Deploy a free instance of MongoDB Atlas to see how it can effortlessly power your AI-driven applications.

Not yet a Confluent customer? Start your free trial of Confluent Cloud today and receive $400 to spend during your first 30 days!

Sign up to LlamaCloud to get started with LlamaParse, LlamaExtract and more! Want to build your own custom agentic workflow? Check out LlamaIndex.