Building a Foreign Correspondent With MongoDB, Anthropic's Claude, and Python

Marko Aleksendrić • 14 min read • Published Dec 09, 2024 • Updated Dec 09, 2024
AI • Python
At the end of 2024, it seems the race between large language models (LLMs) is still one of the most interesting topics around. Small and large companies alike keep finding new ways to integrate these powerful neural networks into their own systems. In this article, we will explore a scenario in which one of the finest (and most human-like) LLMs—Anthropic's Claude family—is used to create brief news summaries in English (or any other language) starting from a list of Serbian news articles.
According to Wikipedia, the Serbian language (the author's mother tongue) currently sits around 30th place, used by roughly 0.1% of websites online. Put simply, this is very small compared to English (roughly 50%) and even Spanish, Russian, German, and French (each around 4% to 5%).
In this article, we will imagine that we are creating a simple web application aimed at helping our (virtual) friend: a foreign correspondent stationed in a country whose language is (still) an unknown pleasure to them.
The flow of data will be simplified: We will begin with a simple CSV file containing the articles. The articles, potentially coming from different news sources, have a minimal structure: a title, a body of content, and a date. They are in Serbian. The principle is simple: The articles are short enough that they will not be split and chunked. Each article will get its own embedding vector and be stored in a MongoDB collection.
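To make the expected input concrete, here is an illustration of the CSV shape. The column names title, content, and artDate match the code later in this article; the row values and the date format are invented placeholders:

title,content,artDate
"Naslov prvog članka","Tekst prvog članka...","2024-11-15"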
The embeddings that we will use—Embedić—are created specifically for the Serbian language.
As stated:
Novak Zivanic has made a significant contribution to the field of natural language processing with the release of Embedić, a suite of Serbian text embedding models. These models are specifically designed for information retrieval and retrieval-augmented generation (RAG) tasks.
Embedding models are fundamental when working with the meaning of words, sentences, and text in general. Mathematically, embeddings are rather simple. They represent high-dimensional data (in our case, text, but it could be images, sounds, or other meaningful data structures) in a lower-dimensional but denser vector space, while preserving semantic relationships. There is an excellent introduction to vector embeddings, and I wholeheartedly recommend it if you want to grasp the whole concept. Some LLM providers, such as OpenAI, generate vector embeddings, while others, like Anthropic, do not.
Since we want to use particular embeddings—aptly named Embedić—we will use the Hugging Face sentence-transformers library, which helps convert text into numerical vectors with semantic meaning. Sentence transformers excel at the tasks we need: finding similar sentences, semantic search, and text clustering. The library offers a myriad of models that are updated or released frequently, they are free to use—there are no API costs involved—and they can be fine-tuned. They might not provide state-of-the-art performance like OpenAI or Voyage models, though.
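As a quick taste of what a sentence transformer does, here is a minimal sketch (assuming sentence-transformers is installed; it uses the Embedić model introduced below) that encodes two short Serbian phrases and compares them with cosine similarity:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('djovak/embedic-large')
# Encode two short Serbian phrases and compare their meanings
emb = model.encode(["Investicije u rudarstvo", "Rudnik litijuma"])
print(util.cos_sim(emb[0], emb[1]))  # values closer to 1 mean more similar meaning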
Once the articles are "translated" into their universally understandable format (vectorized), we will save them to a MongoDB collection on an Atlas cluster and create a vector index. The MongoDB vector index enables fast and efficient vector searching and is [documented](https://www.mongodb.com/pt-br/docs/atlas/atlas-vector-search/vector-search-overview/).
Finally, with the vector index in place and after MongoDB Atlas completes the indexing process, the MongoDB database is ready to become the backbone of our correspondent system. Enter Claude, one of the top-performing LLM families, provided by Anthropic. While you could easily swap this setup for OpenAI or a model served by Groq, we will use the Anthropic Claude Opus model, which handles articulated text and summarization particularly well.
To generate the final text, we will implement a simple version of the RAG methodology, very similar to the one described in the excellent book Building AI-intensive Python Applications. Simply put, retrieval-augmented generation, or RAG, is an approach in which a number of documents (or pieces of a document) that are similar (mathematically speaking, close in a vector space) to the query (also a vector!) are returned from a database and then passed to the model with a set of instructions, in order to get a combined response that contains (hopefully) all the relevant data from the retrieved documents.
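In pseudocode, the RAG flow implemented in this article boils down to three steps (a sketch, not a definitive implementation):

# 1. Embed the (translated) query into the same vector space as the articles:
#        query_vector = embed(translate(query, target="Serbian"))
# 2. Retrieve the k nearest articles from MongoDB with $vectorSearch:
#        documents = vector_search(collection, query_vector, k)
# 3. Ask the LLM to synthesize a report from the retrieved articles:
#        report = llm(instructions + documents)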
After a MongoDB pipeline retrieves the relevant news articles—and here, we are really only scratching the surface of MongoDB's flexibility and scalability—the pieces are stitched together by Claude through a prompt. Again, these steps could be optimized in many different ways, but that is not the topic of this article. The retrieval process could be improved through reranking (although in this case, where each article is treated as a unit and assigned a single embedding, this is not so important). The retrieval query could build "topics" or "stories" and reference them while maintaining logical and chronological consistency, and so on. The context size of the latest LLM models has increased, which allows us to implement various approaches for this kind of project, but we will stick to a simple, illustrative solution.
This project is organized as a Jupyter notebook and is available on GitHub. The repo also contains the CSV file with the news articles used in the tutorial, as well as the requirements file listing the necessary packages.

Installing the dependencies

To run this project, we will need to install some dependencies. If you wish to reproduce the Jupyter notebook, create a new directory (mine is called Correspondent) and inside it, create a brand new virtual environment with the command:
python -m venv venv
Activate the virtual environment and create a requirements.txt file:
anthropic==0.38.0
ipykernel==6.29.5
ipython==8.28.0
pandas==2.2.3
pymongo==4.10.1
python-dotenv==1.0.1
scikit-learn==1.5.2
tqdm==4.66.5
transformers==4.46.0
Package versions change over time. To ensure an exact reproduction of this notebook, use the specific versions listed above. By the time you read this article, some of these packages will probably have newer versions. Note, however, that the scikit-learn package is optional (and quite heavy), so if you do not wish to play with classical machine learning algorithms, you can skip it.
The installation of the PyMongo driver and the Anthropic package is straightforward. The transformers library will be used to generate the embeddings (together with sentence-transformers, which the notebook installs separately), and the remaining packages are needed for the Jupyter notebook to run smoothly.
Now comes the installation of the packages:
pip install -r requirements.txt
Now, you are ready to create a Jupyter notebook and import the necessary packages in the first cell:
import json
import logging
import os
import time
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Third-party
import pandas as pd
from anthropic import Anthropic
from dotenv import load_dotenv
from pymongo import MongoClient
from pymongo.errors import BulkWriteError, ConnectionFailure
from sentence_transformers import SentenceTransformer
from tqdm.autonotebook import tqdm, trange

The data and the embeddings

First, let's read the news CSV file with pandas:
articles = pd.read_csv('fa_articles.csv', encoding='utf-8')
articles.head()
The articles won't mean much to you unless you are proficient in Serbian. These are articles encompassing politics, economy, social matters, and similar topics. They have been filtered—so no art, entertainment, or sports.
Now, let's install and import sentence-transformers and our embedding model of choice—Embedić. Yes, that is the letter ć, and it is pronounced like the soft "ch" sound in the Italian word cappuccino.
# Initialize Embedić model for Serbian language embeddings
# !pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm
from datetime import datetime
from typing import List, Dict
import json

model = SentenceTransformer('djovak/embedic-large')
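As a quick sanity check, you can ask the model for its output dimensionality (get_sentence_embedding_dimension is part of the SentenceTransformer API; for embedic-large, it should report 1024):

print(model.get_sentence_embedding_dimension())  # 1024 for embedic-large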
Now, we will create a function that concatenates the title and the content of each article and passes the result to the sentence transformer's embedding function, since the two need to be represented as a single compact unit:
def generate_embeddings(df: pd.DataFrame, batch_size: int = 32) -> pd.DataFrame:
    """
    Generate embeddings for concatenated title and content using Embedić.

    Args:
        df: DataFrame with 'title' and 'content' columns
        batch_size: Number of texts to process at once

    Returns:
        DataFrame with added 'embedding' column
    """
    # Create a copy of the dataframe
    df_emb = df.copy()

    # Concatenate title and content
    print("Concatenating title and content...")
    df_emb['text_for_embedding'] = df_emb['title'] + " " + df_emb['content']

    # Generate embeddings in batches
    print("Generating embeddings...")
    embeddings = []

    # Convert texts to list for batch processing
    texts = df_emb['text_for_embedding'].tolist()

    # Process in batches with progress bar
    for i in tqdm(range(0, len(texts), batch_size)):
        batch_text = texts[i:i + batch_size]
        batch_embeddings = model.encode(batch_text)
        embeddings.extend(batch_embeddings.tolist())

    # Add embeddings to dataframe
    df_emb['embedding'] = embeddings

    # Remove temporary concatenated text column
    df_emb = df_emb.drop('text_for_embedding', axis=1)

    print(f"Generated embeddings for {len(df_emb)} articles")
    print(f"Embedding dimension: {len(embeddings[0])}")

    return df_emb
The function is heavily commented, and it has a couple of informative print statements. If you run it, you will see that this particular embedding model (the large version) produces vectors with 1024 dimensions.
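The cells that follow assume the function has been applied to the articles DataFrame, along these lines:

df_with_embeddings = generate_embeddings(articles)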
To preview the generated embeddings, we can use the following Jupyter cell:
1print("\nDataFrame shape:", df_with_embeddings.shape)
2print("\nColumns:", df_with_embeddings.columns.tolist())
3
4print("\nSample embedding (first 5 dimensions):")
5print(df_with_embeddings['embedding'].iloc[0][:5])
6
7embedding_lengths = df_with_embeddings['embedding'].apply(len)
8print("\nAll embeddings have same dimension:", embedding_lengths.nunique() == 1)
9print("Embedding dimension:", embedding_lengths.iloc[0])

Prepare the data for MongoDB

The data is now ready to be inserted into a MongoDB collection, but it has to be processed a bit to become a list of dictionaries:
def prepare_for_mongodb(df: pd.DataFrame) -> List[Dict]:
    documents = []

    for _, row in tqdm(df.iterrows(), total=len(df)):
        document = {
            'title': row['title'],
            'content': row['content'],
            'date': row['artDate'],  # Assuming you have a date column
            'embedding': row['embedding'],
            'metadata': {
                'embedding_model': 'djovak/embedic-large',
                'created_at': datetime.now(),
                'last_updated': datetime.now()
            }
        }
        documents.append(document)

    return documents
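The documents list used in the next section is then produced like this:

mongodb_documents = prepare_for_mongodb(df_with_embeddings)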

Inserting the data into the collection

Now, the moment has finally come to access our MongoDB database. Let's add some imports:
import os
from datetime import datetime
from dotenv import load_dotenv
from pymongo import MongoClient
from typing import Dict, List, Optional, Tuple
import logging

load_dotenv()
MONGODB_URI = os.getenv("MONGODB_URI")
DB_NAME = os.getenv("DB_NAME")
COLLECTION_NAME = os.getenv("COLLECTION_NAME")

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
The dotenv library will be used to read the variables from the .env file, which has the following format:
1MONGODB_URI = "youURI"
2COLLECTION_NAME="articles"
3DB_NAME="rag_articles"
4ANTHROPIC_API_KEY=sk-xxx…
Finally, let's wrap the process of connecting to our MongoDB database:
def get_mongodb_connection():
    try:
        client = MongoClient(MONGODB_URI)
        client.admin.command('ping')
        logger.info("Successfully connected to MongoDB")

        db = client[DB_NAME]
        collection = db[COLLECTION_NAME]
        return client, db, collection

    except Exception as e:
        logger.error(f"Failed to connect to MongoDB. Error: {str(e)}")
        raise ConnectionFailure(f"MongoDB connection failed: {str(e)}")

client, db, collection = get_mongodb_connection()
After a bit of preparation, we are now ready to ingest the data into the collection:
def insert_documents_to_mongodb(documents: List[Dict], batch_size: int = 1000) -> Tuple[int, int]:
    successful = 0
    failed = 0

    for i in tqdm(range(0, len(documents), batch_size)):
        batch = documents[i:i + batch_size]
        try:
            # Insert batch with ordered=False for better performance
            result = collection.insert_many(batch, ordered=False)
            successful += len(result.inserted_ids)
        except BulkWriteError as e:
            # Handle partial failures in batch
            successful += e.details['nInserted']
            failed += len(batch) - e.details['nInserted']
            print(f"Batch {i//batch_size + 1} had {e.details['nInserted']} successful and "
                  f"{len(batch) - e.details['nInserted']} failed insertions")

    return successful, failed
Let's use our newly created insertion function to perform the actual insertion of the data, first making sure that the collection is empty:
collection.delete_many({})
# Insert documents
successful, failed = insert_documents_to_mongodb(mongodb_documents)
print(f"Successfully inserted: {successful} documents")
print(f"Failed insertions: {failed} documents")

Creating the vector index

Now that our data is inserted, let's create the vector index:
def create_vector_index(
    connection_string: str,
    database_name: str,
    collection_name: str,
    index_name: str = "vector_index"
) -> None:
    """
    Create a vector search index in MongoDB using PyMongo.

    Args:
        connection_string: MongoDB connection string
        database_name: Name of the database
        collection_name: Name of the collection
        index_name: Name of the vector index
    """
    # Connect to MongoDB using the provided parameters
    client = MongoClient(connection_string)

    # Get database and collection
    db = client[database_name]
    collection = db[collection_name]

    # Define the index configuration; the "vectorSearch" type is required
    # for Atlas Vector Search indexes
    index_config = {
        "name": index_name,
        "type": "vectorSearch",
        "definition": {
            "fields": [
                {
                    "numDimensions": 1024,
                    "path": "embedding",
                    "similarity": "cosine",
                    "type": "vector"
                }
            ]
        }
    }

    try:
        # Create the vector search index
        collection.create_search_index(index_config)
        print(f"Successfully created vector index '{index_name}'")

    except Exception as e:
        print(f"Error creating vector index: {str(e)}")
    finally:
        client.close()
The creation of the vector index in Python is deliberately a bit more verbose than it needs to be, but it is important to notice how simple and straightforward the vector index creation process actually is. We need to provide the connection, database, and collection data, and the following:
  • The desired index name—in our case, it is vector_index
  • The path or the field that you need to be indexed and that contains the vector array—in our articles’ case, it is called embedding
  • The similarity function used for measuring distances between vectors—we are using cosine similarity, a fast and simple measure that compares only the cosine of the angle between two vectors, with the vertex at the origin of the vector space. Another option is the Euclidean distance.
  • The type of the index, which is vector—that part is straightforward
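Putting it together, a call along these lines kicks off the index build, reusing the environment variables loaded earlier (note that Atlas builds the index asynchronously, so it may take a moment before it becomes queryable):

create_vector_index(MONGODB_URI, DB_NAME, COLLECTION_NAME)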

Creating a semantic search function

With everything in place—especially our precious vector index—we are now ready to implement the semantic search function. The function uses a very simple MongoDB aggregation pipeline to provide a similarity score along with the results:
def semantic_search(collection, query: str, k: int = 5) -> List[Dict]:
    """
    Perform semantic search using Embedić vector similarity.

    Args:
        collection: MongoDB collection
        query: Search query text - in Serbian!
        k: Number of results to return

    Returns:
        List of matching documents
    """
    try:
        # Generate embedding for query using Embedić
        query_embedding = model.encode(query).tolist()

        pipeline = [
            {
                "$vectorSearch": {
                    "index": "vector_index",
                    "queryVector": query_embedding,
                    "path": "embedding",
                    "numCandidates": 100,
                    "limit": k
                }
            },
            {
                "$project": {
                    "title": 1,
                    "content": 1,
                    "date": 1,
                    "score": { "$meta": "vectorSearchScore" }
                }
            }
        ]

        results = list(collection.aggregate(pipeline))
        return results

    except Exception as e:
        logger.error(f"Search error: {str(e)}")
        raise
This search function is rather simple. It performs only the vector search, but it expects a query in Serbian that is to be passed to the embedding model. Later, we will use the Anthropic client to make an extra call and translate the query from an arbitrary language to Serbian. For now, let’s stick with this. Some words to get you started might be “litijum” (for lithium), “ekonomija” (economy), and so on.
Let’s add a function for formatting the results:
def display_search_results(query: str, results: List[Dict]):
    """
    Display search results in a readable format
    """
    print(f"\nQuery: {query}")
    print("=" * 80)

    for i, doc in enumerate(results, 1):
        print(f"\n{i}. {doc['title']}")
        print(f"Score: {doc['score']:.3f}")
        print("-" * 80)

# Let's test the search
queries = [
    "Novi Sad", "Litijum", "Ekonomija"
]

for query in queries:
    results = semantic_search(collection, query, k=6)
    display_search_results(query, results)
    print("\n" + "=" * 100 + "\n")
After running the previous cell, you should get nicely formatted output with each article's title and a similarity score. The MongoDB documentation on vector search provides an excellent and complete guide. The book Building AI-Intensive Python Applications reads like an exciting sci-fi novel and provides excellent guidelines on best practices when using vector search.

Mixing it up with other algorithms

If you take the time to examine the provided Jupyter notebook, you will find a section on performing simple K-means clustering on the embeddings. Vectors are very suitable for these types of classical machine learning algorithms, and combining them with modern LLMs might yield even better results in particular fields. Let's not forget that ensemble models, for instance, still dominate numerous classification tasks. We will not delve deeper into the clustering here, but it is interesting to note that the algorithm, paired with a powerful LLM like Claude Opus, is able to identify and describe very precisely five to six clusters with very cohesive relationships.
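As a minimal sketch of that idea (the cluster count of six is illustrative, and scikit-learn is the optional dependency mentioned earlier):

import numpy as np
from sklearn.cluster import KMeans

# Stack the per-article embeddings into a matrix and cluster them
X = np.array(df_with_embeddings['embedding'].tolist())
kmeans = KMeans(n_clusters=6, random_state=42, n_init=10)
df_with_embeddings['cluster'] = kmeans.fit_predict(X)
print(df_with_embeddings['cluster'].value_counts())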

Claude enters the chat

It is time to introduce our second main actor to the stage: Anthropic's Claude:
#!pip install anthropic
from anthropic import Anthropic
import os
from typing import List, Dict, Optional
import time

load_dotenv()

# Initialize Claude client
anthropic = Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))
This simple article focuses solely on MongoDB (with PyMongo) and the Anthropic client, but you are more than welcome to adapt it to your needs by leveraging one of the LLM frameworks, like LangChain or LlamaIndex, which offer incredible integrations, code reusability, and flexibility.
The following function is a bit long, but it is intentionally verbose and explicitly encapsulates a couple of Claude prompts. In a real-world application, these prompts would be extracted and passed to the function after some trials and tests, but in this case, we will just hardcode them:
def generate_news_report(
    query: str,
    collection,
    model: SentenceTransformer,
    anthropic_client: Anthropic,
    language: str = "English",
    top_k: int = 10,
    max_retries: int = 3,
    retry_delay: int = 2
) -> str:
    # Translate query to Serbian using Claude
    translation_prompt = f"Translate the following {language} text to Serbian latin. Provide only the translation, nothing else: '{query}'"
    try:
        translation_response = anthropic_client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=100,
            temperature=0,
            messages=[{
                "role": "user",
                "content": translation_prompt
            }]
        )

        serbian_query = translation_response.content[0].text.strip()
    except Exception as e:
        raise Exception(f"Translation failed: {str(e)}")

    # Generate embedding for Serbian query
    query_embedding = model.encode(serbian_query).tolist()

    # Perform vector search
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 100,
                "limit": top_k
            }
        },
        {
            "$project": {
                "title": 1,
                "content": 1,
                "date": 1,
                "score": { "$meta": "vectorSearchScore" }
            }
        }
    ]

    results = list(collection.aggregate(pipeline))

    # Prepare content for Claude
    articles_text = "\n\n".join([
        f"Article {i+1}:\nDate: {doc['date']}\nTitle: {doc['title']}\nContent: {doc['content'][:500]}..."
        for i, doc in enumerate(results)
    ])

    # Create prompt for report generation
    prompt = f"""You are an expert journalist and news analyst. Based on the following {top_k} most relevant Serbian news articles about "{query}" (translated to Serbian as "{serbian_query}"),
create a concise, well-structured news report in {language}. The report should:
- Be around 250-300 words
- Start with a clear headline
- Include key facts, dates, and relevant context
- Maintain journalistic neutrality
- Focus on the most newsworthy aspects
- Include a brief conclusion or outlook
Here are the articles:
{articles_text}
Please write the report in a professional journalistic style."""

    # Get response from Claude with retries
    for attempt in range(max_retries):
        try:
            response = anthropic_client.messages.create(
                model="claude-3-opus-20240229",
                max_tokens=1000,
                temperature=0.3,
                messages=[{
                    "role": "user",
                    "content": prompt
                }]
            )

            report = response.content[0].text.strip()
            return report

        except Exception as e:
            if attempt == max_retries - 1:
                raise Exception(f"Failed to generate report: {str(e)}")
            time.sleep(retry_delay)
            continue
Although long, the function is rather simple. It performs a couple of tasks:
  • The query and the language of the report are provided, so first, we use Claude to translate the search query to Serbian to be properly evaluated in the embeddings space.
  • The translated query is fed to the semantic search pipeline that we used before, and a list of relevant articles is returned.
  • The magic ingredient is the Claude prompt. LLMs have been around for quite some time now, and I am sure you will find excellent guides on how to create meaningful and useful prompts. It is a bit of a mix between science and art, but in this case, we are opting for a simple solution.
  • Finally, the Claude API is called and the result is returned.
If you try the report generation function with the following parameters:
try:
    query = "Investments in mining"

    report = generate_news_report(
        query=query,
        collection=collection,
        model=model,  # our existing sentence transformer model
        anthropic_client=anthropic,
        top_k=10,
    )

    print("\nGenerated News Report:")
    print("=" * 80)
    print(report)

except Exception as e:
    print(f"Error generating report: {str(e)}")
And if you use the provided set of news articles, the resulting report should be something like the following:

Generated News Report

Headline: Serbia Grapples with Mining Investments Amidst Environmental Concerns and Political Debates
In recent weeks, the topic of mining investments in Serbia has been at the forefront of public discourse, with a particular focus on the potential lithium mine in the Jadar Valley. The Serbian government and mining giant Rio Tinto have been pushing for the project, while environmental activists and some opposition politicians have voiced their concerns…
You should try the function with other languages, too. I tried it with Italian, and the results were satisfying.
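For instance, here is a variation of the earlier call targeting Italian (the query text is just an example; the function translates it to Serbian internally before the vector search):

report_it = generate_news_report(
    query="Investimenti nel settore minerario",
    collection=collection,
    model=model,
    anthropic_client=anthropic,
    language="Italian",
    top_k=10,
)
print(report_it)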

Conclusion

Although this is a simple little project, one that can and should be implemented in a much more concise way, it offers plenty of room for experimenting. The Claude family provides three models: Opus, Sonnet, and Haiku. Each has different characteristics and is well worth trying out. The art of prompting was already mentioned; there is room for many improvements and much experimentation. Other models could be used, like OpenAI's or Llama models. The system, as is, is not meant to run as a server, so trying a locally hosted LLM through Ollama or a similar solution is also worthwhile.
You are invited to try out this example with your own data. It can be anything that interests you, like office reports or classical literature (you might need to perform some chunking if the content doesn't quite fit Anthropic's rather large context window). Even better, replace the LLM used here with another one—the latest ChatGPT or some open-source model—and compare the results.
In this type of setup, MongoDB proves to be an ideal database. The generated reports could be saved upon creation, revised or reviewed by a human (at least for fact-checking), and linked to the original articles, creating a cascade of more complex narratives that follow a logical and chronological path and can then be fed back to an LLM. The possibilities are really endless, and experimenting with MongoDB and AI unlocks interesting scenarios—and is also fun!
It is really impressive how quickly we can prototype a nontrivial generative AI-powered application with an appropriate LLM and MongoDB as the database, leveraging its powerful vector search capabilities!