Building a Foreign Correspondent With MongoDB, Anthropic's Claude, and Python
At the end of 2024, the race of large language models (LLMs) still seems to be one of the most interesting topics around. While companies small and large keep finding new ways to integrate these powerful neural networks into their own systems, in this article, we will explore a scenario in which one of the finest (and most human-like) LLM families—Anthropic's Claude—is used to create brief news summaries in English (or any other language) starting from a list of Serbian news articles.
According to Wikipedia, the Serbian language (the author's mother tongue) currently sits around 30th place among languages used on the web, accounting for roughly 0.1% of websites. Put simply, this is very small compared to English (roughly 50%) and even to Spanish, Russian, German, and French (each at around 4% to 5%).
In this article, we will imagine that we are creating a simple web application aimed at helping our (virtual) friend: a foreign correspondent situated in a country whose language is (still) an unknown pleasure for him/her.
The flow of data will be simplified: We will begin with a simple CSV file containing the articles. The articles, potentially coming from different news sources, have a minimal structure: a title, a body of content, and a date. They are in Serbian. The principle is simple: The articles are short enough that they will not be split and chunked. Each article will get its own embedding vector and be stored in a MongoDB collection.
The embeddings that we will use—Embedić—are created specifically for the Serbian language.
As stated:
Novak Zivanic has made a significant contribution to the field of natural language processing with the release of Embedić, a suite of Serbian text embedding models. These models are specifically designed for information retrieval and retrieval-augmented generation (RAG) tasks.
Embedding models are fundamental when working with the meaning of words, sentences, and text in general. Mathematically, embeddings are rather simple. They represent high-dimensional data (in our case, text, but it could be images, sounds, or other meaningful data structures) in a lower-dimensional but denser vector space, while preserving semantic relationships. There is an excellent introduction to vector embeddings, and I wholeheartedly recommend it if you want to grasp the whole concept. Some LLM providers, such as OpenAI, offer vector embedding models, while others, like Anthropic, do not.
Since we want to use particular embeddings—aptly named Embedić—we will use the Hugging Face Sentence Transformers library, which helps convert text into numerical vectors with semantic meaning. Sentence transformers excel at the tasks that we need: finding similar sentences, semantic search, and text clustering. They provide a myriad of models that are updated or released frequently, they are free to use—there are no API costs involved—and they can be fine-tuned, although they might not match the state-of-the-art performance of commercial embedding providers such as OpenAI or Voyage AI.
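To make the idea concrete, here is a tiny, self-contained sketch (the three Serbian sentences are invented purely for illustration; the model is the one used throughout this article) showing that semantically related sentences end up close to each other in the embedding space:

```python
from sentence_transformers import SentenceTransformer, util

# Load the Serbian embedding model used later in the article
model = SentenceTransformer('djovak/embedic-large')

# Two sentences about mining investments and one unrelated sentence (all invented)
sentences = [
    "Vlada je najavila nove investicije u rudarstvo.",
    "Rudnik litijuma privlači strane investitore.",
    "Fudbalska utakmica je odložena zbog kiše.",
]
embeddings = model.encode(sentences)

# Cosine similarity: the two mining-related sentences should score noticeably higher
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```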
Once the articles are "translated" into their universally understandable format (vectorized), we will save them to a MongoDB collection on an Atlas cluster and create a vector index. The MongoDB vector index enables fast and efficient vector searching and is [documented](https://www.mongodb.com/pt-br/docs/atlas/atlas-vector-search/vector-search-overview/).
Finally, with the vector index in place and after MongoDB Atlas completes the indexing process, the MongoDB database is ready to become the backbone of our correspondent system. Enter Claude, one of the top-performing LLM families, provided by Anthropic. While you could easily swap this setup for OpenAI or a model served by Groq, we will use the Anthropic Claude Opus model, which works particularly well with articulated text and summarization.
To generate the final text, we will implement a simple version of the RAG methodology, very similar to the one described in the excellent book Building AI-intensive Python Applications. Simply put, retrieval-augmented generation, or RAG, is an approach in which a number of documents (or pieces of a document) that are similar (mathematically speaking, close in a vector space) to the query (also a vector!) are returned from a database and then passed to the model with a set of instructions, in order to get a combined response that contains (hopefully) all the relevant data from the retrieved documents.
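As a rough conceptual sketch (every name and signature below is made up purely for illustration and is not part of the project code), the whole RAG loop boils down to three steps:

```python
from typing import Callable, List

def rag_answer(
    query: str,
    embed: Callable[[str], List[float]],                # text -> embedding vector
    retrieve: Callable[[List[float], int], List[str]],  # vector, k -> nearest documents
    llm: Callable[[str], str],                          # prompt -> generated text
) -> str:
    query_vector = embed(query)               # 1. turn the query into a vector
    documents = retrieve(query_vector, 5)     # 2. fetch the documents closest to it
    prompt = (
        "Answer using only these sources:\n\n"
        + "\n\n".join(documents)
        + f"\n\nQuestion: {query}"
    )
    return llm(prompt)                        # 3. let the model combine the sources
```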
After a MongoDB pipeline retrieves the relevant news articles—and here, we are really only scratching the surface of MongoDB's flexibility and scalability—the pieces are stitched together by Claude through a prompt. Again, these steps could be optimized in many different ways, but that is not the topic of this article. The retrieval process could be improved through reranking (although in this case, where each article is treated as a unit and assigned a single embedding, this is not so important). The retrieval query could group articles into "topics" or "stories" and cross-reference them while maintaining logical and chronological consistency, and so on. The context size of the latest LLM models has increased, which allows us to implement various approaches for this kind of project, but we will stick to a simple, illustrative solution.
To run this project, we will need to install some dependencies. If you wish to reproduce the Jupyter notebook, create a new directory (mine is called Correspondent) and inside it, create a brand new virtual environment with the command:
```
python -m venv venv
```
Activate the virtual environment and create a requirements.txt file:
```
anthropic==0.38.0
ipykernel==6.29.5
ipython==8.28.0
pandas==2.2.3
pymongo==4.10.1
python-dotenv==1.0.1
scikit-learn==1.5.2
tqdm==4.66.5
transformers==4.46.0
```
Package versions change over time. To ensure an exact reproduction of this notebook, the specific versions listed above should be used. By the time you read this article, it is very probable that some of these packages will have newer versions. Notice, however, that the scikit-learn package is optional (and quite heavy), so if you do not wish to play with classical machine learning algorithms, you can skip it.
While the installation of the PyMongo driver and the Anthropic package is straightforward, the transformers library (together with sentence-transformers, installed from within the notebook) will be used to generate the embeddings, and the remaining packages are needed for the Jupyter notebook to run smoothly.
Now comes the installation of the packages:
```
pip install -r requirements.txt
```
Now, you are ready to create a Jupyter notebook and import the necessary packages in the first cell:
```python
import json
import logging
import os
import time
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Third-party
import pandas as pd
from anthropic import Anthropic
from dotenv import load_dotenv
from pymongo import MongoClient
from pymongo.errors import BulkWriteError, ConnectionFailure
from sentence_transformers import SentenceTransformer
from tqdm.autonotebook import tqdm, trange
```
First, let's read the news CSV file with pandas:
```python
articles = pd.read_csv('fa_articles.csv', encoding='utf-8')
articles.head()
```
The articles won't mean much to you unless you are proficient in Serbian. These are articles encompassing politics, economy, social matters, and similar topics. They have been filtered—so no art, entertainment, or sports.
Now, let's install and import sentence-transformers and load our embedding model of choice—Embedić. Yes, that is the letter ć, and it is pronounced like the soft "ch" sound in the Italian word cappuccino.
```python
# Initialize Embedić model for Serbian language embeddings
# !pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm
from datetime import datetime
from typing import List, Dict
import json

model = SentenceTransformer('djovak/embedic-large')
```
Now, we will create a function that concatenates the title and the content of each article and passes the result to the sentence transformer's encoding function, since the title and content need to represent a single compact unit:
```python
def generate_embeddings(df: pd.DataFrame, batch_size: int = 32) -> pd.DataFrame:
    """
    Generate embeddings for concatenated title and content using Embedić.

    Args:
        df: DataFrame with 'title' and 'content' columns
        batch_size: Number of texts to process at once

    Returns:
        DataFrame with added 'embedding' column
    """
    # Create a copy of the dataframe
    df_emb = df.copy()

    # Concatenate title and content
    print("Concatenating title and content...")
    df_emb['text_for_embedding'] = df_emb['title'] + " " + df_emb['content']

    # Generate embeddings in batches
    print("Generating embeddings...")
    embeddings = []

    # Convert texts to list for batch processing
    texts = df_emb['text_for_embedding'].tolist()

    # Process in batches with progress bar
    for i in tqdm(range(0, len(texts), batch_size)):
        batch_text = texts[i:i + batch_size]
        batch_embeddings = model.encode(batch_text)
        embeddings.extend(batch_embeddings.tolist())

    # Add embeddings to dataframe
    df_emb['embedding'] = embeddings

    # Remove temporary concatenated text column
    df_emb = df_emb.drop('text_for_embedding', axis=1)

    print(f"Generated embeddings for {len(df_emb)} articles")
    print(f"Embedding dimension: {len(embeddings[0])}")

    return df_emb
```
The function is heavily commented, and it has a couple of informative print statements. If you run it, you will see that this particular embedding model (the large version) produces vectors with 1024 dimensions.
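The cell that actually calls the function is not reproduced above, but since the following cells reference a DataFrame named df_with_embeddings, the call presumably looks like this:

```python
df_with_embeddings = generate_embeddings(articles)
```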
To preview the generated embeddings, we can use the following Jupyter cell:
```python
print("\nDataFrame shape:", df_with_embeddings.shape)
print("\nColumns:", df_with_embeddings.columns.tolist())

print("\nSample embedding (first 5 dimensions):")
print(df_with_embeddings['embedding'].iloc[0][:5])

embedding_lengths = df_with_embeddings['embedding'].apply(len)
print("\nAll embeddings have same dimension:", embedding_lengths.nunique() == 1)
print("Embedding dimension:", embedding_lengths.iloc[0])
```
The data is now ready to be inserted into a MongoDB collection, but it has to be processed a bit to become a list of dictionaries:
```python
def prepare_for_mongodb(df: pd.DataFrame) -> List[Dict]:
    documents = []

    for _, row in tqdm(df.iterrows(), total=len(df)):
        document = {
            'title': row['title'],
            'content': row['content'],
            'date': row['artDate'],  # Assuming you have a date column
            'embedding': row['embedding'],
            'metadata': {
                'embedding_model': 'djovak/embedic-large',
                'created_at': datetime.now(),
                'last_updated': datetime.now()
            }
        }
        documents.append(document)

    return documents
```
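Again, the call itself is implied rather than shown; the insertion step further below expects a list named mongodb_documents, so the conversion is presumably run like this:

```python
mongodb_documents = prepare_for_mongodb(df_with_embeddings)
print(f"Prepared {len(mongodb_documents)} documents for insertion")
```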
Now, it is the moment to finally access our MongoDB database. Let's add some imports:
```python
import os
from datetime import datetime
from dotenv import load_dotenv
from pymongo import MongoClient
from typing import Dict, List, Optional, Tuple
import logging

load_dotenv()
MONGODB_URI = os.getenv("MONGODB_URI")
DB_NAME = os.getenv("DB_NAME")
COLLECTION_NAME = os.getenv("COLLECTION_NAME")

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
```
The dotenv library will be used to read the variables from the .env file which has the following format:
```
MONGODB_URI="yourURI"
COLLECTION_NAME="articles"
DB_NAME="rag_articles"
ANTHROPIC_API_KEY=sk-xxx…
```
Finally, let's wrap the process of connecting to our MongoDB database:
```python
def get_mongodb_connection():
    try:
        client = MongoClient(MONGODB_URI)
        client.admin.command('ping')
        logger.info("Successfully connected to MongoDB")

        db = client[DB_NAME]
        collection = db[COLLECTION_NAME]
        return client, db, collection

    except Exception as e:
        logger.error(f"Failed to connect to MongoDB. Error: {str(e)}")
        raise ConnectionFailure(f"MongoDB connection failed: {str(e)}")

client, db, collection = get_mongodb_connection()
```
After a bit of preparation, we are now ready to ingest the data into the collection:
```python
def insert_documents_to_mongodb(documents: List[Dict], batch_size: int = 1000) -> Tuple[int, int]:
    successful = 0
    failed = 0

    for i in tqdm(range(0, len(documents), batch_size)):
        batch = documents[i:i + batch_size]
        try:
            # Insert batch with ordered=False for better performance
            result = collection.insert_many(batch, ordered=False)
            successful += len(result.inserted_ids)
        except BulkWriteError as e:
            # Handle partial failures in batch
            successful += e.details['nInserted']
            failed += len(batch) - e.details['nInserted']
            print(f"Batch {i//batch_size + 1} had {e.details['nInserted']} successful and "
                  f"{len(batch) - e.details['nInserted']} failed insertions")

    return successful, failed
```
Let's use our newly created inserting function and perform the actual insertion of the data, making sure that the collection is empty:
```python
collection.delete_many({})

# Insert documents
successful, failed = insert_documents_to_mongodb(mongodb_documents)
print(f"Successfully inserted: {successful} documents")
print(f"Failed insertions: {failed} documents")
```
Now that our data is inserted, let's create the vector index:
```python
def create_vector_index(
    connection_string: str,
    database_name: str,
    collection_name: str,
    index_name: str = "vector_index"
) -> None:
    """
    Create a vector search index in MongoDB using PyMongo.

    Args:
        connection_string: MongoDB connection string
        database_name: Name of the database
        collection_name: Name of the collection
        index_name: Name of the vector index
    """
    # Connect to MongoDB
    client = MongoClient(connection_string)

    # Get database and collection
    db = client[database_name]
    collection = db[collection_name]

    # Define the index configuration
    index_config = {
        "name": index_name,
        "type": "vectorSearch",  # required so Atlas builds a vector search index
        "definition": {
            "fields": [
                {
                    "numDimensions": 1024,
                    "path": "embedding",
                    "similarity": "cosine",
                    "type": "vector"
                }
            ]
        }
    }

    try:
        # Create the vector search index
        collection.create_search_index(index_config)
        print(f"Successfully created vector index '{index_name}'")

    except Exception as e:
        print(f"Error creating vector index: {str(e)}")
    finally:
        client.close()
```
The creation of the vector index in Python is deliberately a bit more verbose than it needs to be, but it is important to notice how simple and straightforward the vector index creation process actually is. We need to provide the connection, database, and collection data, along with the following (a sample call follows the list below):
- The desired index name—in our case, it is vector_index
- The path or the field that you need to be indexed and that contains the vector array—in our articles’ case, it is called embedding
- The similarity function used for measuring distances between vectors—we are using cosine similarity, a fast and simple measure that compares only the cosine of the angle between two vectors, with the origin of the vector space as the vertex. Another option would be the Euclidean distance.
- The type of the index is vector, and that part is straightforward
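The call to create_vector_index is not shown in the original notebook excerpt; assuming the connection values loaded from the .env file earlier, it would look roughly like this:

```python
create_vector_index(
    connection_string=MONGODB_URI,
    database_name=DB_NAME,
    collection_name=COLLECTION_NAME,
    index_name="vector_index"
)
```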
With everything, especially our precious vector index, in place, we are now ready to implement the semantic search function. The function will be using a very simple MongoDB aggregation pipeline to provide the similarity score along with the results:
```python
def semantic_search(collection, query: str, k: int = 5) -> List[Dict]:
    """
    Perform semantic search using Embedić vector similarity.

    Args:
        collection: MongoDB collection
        query: Search query text - in Serbian!
        k: Number of results to return

    Returns:
        List of matching documents
    """
    try:
        # Generate embedding for query using Embedić
        query_embedding = model.encode(query).tolist()

        pipeline = [
            {
                "$vectorSearch": {
                    "index": "vector_index",
                    "queryVector": query_embedding,
                    "path": "embedding",
                    "numCandidates": 100,
                    "limit": k
                }
            },
            {
                "$project": {
                    "title": 1,
                    "content": 1,
                    "date": 1,
                    "score": { "$meta": "vectorSearchScore" }
                }
            }
        ]

        results = list(collection.aggregate(pipeline))
        return results

    except Exception as e:
        logger.error(f"Search error: {str(e)}")
        raise
```
This search function is rather simple. It performs only the vector search, but it expects a query in Serbian that is to be passed to the embedding model. Later, we will use the Anthropic client to make an extra call and translate the query from an arbitrary language to Serbian. For now, let’s stick with this. Some words to get you started might be “litijum” (for lithium), “ekonomija” (economy), and so on.
Let’s add a function for formatting the results:
```python
def display_search_results(query: str, results: List[Dict]):
    """
    Display search results in a readable format
    """
    print(f"\nQuery: {query}")
    print("=" * 80)

    for i, doc in enumerate(results, 1):
        print(f"\n{i}. {doc['title']}")
        print(f"Score: {doc['score']:.3f}")
        print("-" * 80)

# Let's test the search
queries = [
    "Novi Sad", "Litijum", "Ekonomija"
]

for query in queries:
    results = semantic_search(collection, query, k=6)
    display_search_results(query, results)
    print("\n" + "=" * 100 + "\n")
```
After running the previous cell, you should get a nicely formatted output with the title of each article and a similarity score. The MongoDB documentation on vector search provides an excellent and complete guide. The book Building AI Intensive Python Applications reads like an exciting sci-fi novel and provides excellent guidelines regarding best practices when using vector search.
If you take the time to examine the provided Jupyter notebook, you will find a subchapter on performing a simple K-means clustering on the embeddings. The vectors are very suitable for these types of classical machine learning algorithms, and combining them with modern LLMs might yield even better results in particular fields. Let's not forget that ensemble models, for instance, still dominate numerous classification tasks. Here, we will not delve deeper into the clustering, but it is interesting to note that the algorithm, paired with a powerful LLM like Claude Opus, is able to identify and describe very precisely five to six clusters with very cohesive relationships.
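The clustering subchapter itself is not reproduced here, but a minimal sketch with scikit-learn (the cluster count and random seed are arbitrary choices for illustration) would look something like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stack the per-article embeddings into a single matrix
X = np.array(df_with_embeddings['embedding'].tolist())

# Group the articles into a handful of thematic clusters
kmeans = KMeans(n_clusters=6, random_state=42, n_init="auto")
df_with_embeddings['cluster'] = kmeans.fit_predict(X)

# Peek at a few titles per cluster to get a feel for the topics
for cluster_id, group in df_with_embeddings.groupby('cluster'):
    print(f"\nCluster {cluster_id}:")
    print(group['title'].head(3).to_string(index=False))
```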
It is time to introduce our second main actor to the stage: Anthropic's Claude:
```python
# !pip install anthropic
from anthropic import Anthropic
import os
from typing import List, Dict, Optional
import time

load_dotenv()

# Initialize Claude client
anthropic = Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY'))
```
This simple article will focus solely on MongoDB (with PyMongo) and the Anthropic client, but you are more than welcome to adapt this to fit your needs by leveraging one of the LLM frameworks, like LangChain or LlamaIndex, which offer incredible integrations, code reusability, and flexibility.
The following function is a bit long, but it is intentionally verbose and explicitly encapsulates a couple of Claude prompts. In a real-world application, these prompts would be extracted and fed to the function itself after some trials and tests, but in this case, we will just hardcode them:
```python
def generate_news_report(
    query: str,
    collection,
    model: SentenceTransformer,
    anthropic_client: Anthropic,
    language: str = "English",
    top_k: int = 10,
    max_retries: int = 3,
    retry_delay: int = 2
) -> str:
    # Translate query to Serbian using Claude
    translation_prompt = f"Translate the following {language} text to Serbian latin. Provide only the translation, nothing else: '{query}'"
    try:
        translation_response = anthropic_client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=100,
            temperature=0,
            messages=[{
                "role": "user",
                "content": translation_prompt
            }]
        )

        serbian_query = translation_response.content[0].text.strip()
    except Exception as e:
        raise Exception(f"Translation failed: {str(e)}")

    # Generate embedding for Serbian query
    query_embedding = model.encode(serbian_query).tolist()

    # Perform vector search
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 100,
                "limit": top_k
            }
        },
        {
            "$project": {
                "title": 1,
                "content": 1,
                "date": 1,
                "score": { "$meta": "vectorSearchScore" }
            }
        }
    ]

    results = list(collection.aggregate(pipeline))

    # Prepare content for Claude
    articles_text = "\n\n".join([
        f"Article {i+1}:\nDate: {doc['date']}\nTitle: {doc['title']}\nContent: {doc['content'][:500]}..."
        for i, doc in enumerate(results)
    ])

    # Create prompt for report generation
    prompt = f"""You are an expert journalist and news analyst. Based on the following {top_k} most relevant Serbian news articles about "{query}" (translated to Serbian as "{serbian_query}"),
create a concise, well-structured news report in {language}. The report should:
- Be around 250-300 words
- Start with a clear headline
- Include key facts, dates, and relevant context
- Maintain journalistic neutrality
- Focus on the most newsworthy aspects
- Include a brief conclusion or outlook
Here are the articles:
{articles_text}
Please write the report in a professional journalistic style."""

    # Get response from Claude with retries
    for attempt in range(max_retries):
        try:
            response = anthropic_client.messages.create(
                model="claude-3-opus-20240229",
                max_tokens=1000,
                temperature=0.3,
                messages=[{
                    "role": "user",
                    "content": prompt
                }]
            )

            report = response.content[0].text.strip()
            return report

        except Exception as e:
            if attempt == max_retries - 1:
                raise Exception(f"Failed to generate report: {str(e)}")
            time.sleep(retry_delay)
            continue
```
Although long, the function is rather simple. It performs a couple of tasks:
- The query and the language of the report are provided, so first, we use Claude to translate the search query to Serbian so that it can be properly evaluated in the embedding space.
- The translated query is fed to the semantic search pipeline that we used before, and a list of relevant articles is returned.
- The magic ingredient is the Claude prompt. LLMs have been around for quite some time now, and I am sure you will find excellent guides on how to create meaningful and useful prompts. It is a bit of a mix between science and art, but in this case, we are opting for a simple solution.
- Finally, the Claude API is called and the result is returned.
If you try the report generation function with the following parameters:
```python
try:
    query = "Investments in mining"

    report = generate_news_report(
        query=query,
        collection=collection,
        model=model,  # our existing sentence transformer model
        anthropic_client=anthropic,
        top_k=10,
    )

    print("\nGenerated News Report:")
    print("=" * 80)
    print(report)

except Exception as e:
    print(f"Error generating report: {str(e)}")
```
And if you use the provided set of news articles, the resulting report should be something like the following:
Headline: Serbia Grapples with Mining Investments Amidst Environmental Concerns and Political Debates
In recent weeks, the topic of mining investments in Serbia has been at the forefront of public discourse, with a particular focus on the potential lithium mine in the Jadar Valley. The Serbian government and mining giant Rio Tinto have been pushing for the project, while environmental activists and some opposition politicians have voiced their concerns…
You should try the function with other languages too. I tried it with Italian and the results were satisfying.
Although this is a simple little project, which can and should be implemented in a much more concise way, it offers plenty of room for experimentation. Claude comes in three model families: Opus, Sonnet, and Haiku. Each of them has different characteristics and is well worth trying out. The art of prompting was already mentioned—there is room for many improvements and much experimentation. Other models could be used, like OpenAI's or Llama models. The system, as is, is not meant to run as a server, so trying a locally hosted LLM through Ollama or a similar solution is also worth exploring.
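For example, switching the report generation from Opus to Haiku is a one-line change inside generate_news_report (the model identifier below is Anthropic's published Haiku name at the time of writing; check the current model list before using it):

```python
# Inside generate_news_report, swap the hardcoded model identifier:
response = anthropic_client.messages.create(
    model="claude-3-haiku-20240307",  # faster and cheaper than Opus
    max_tokens=1000,
    temperature=0.3,
    messages=[{"role": "user", "content": prompt}]
)
```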
You are invited to try out this example and use your own data. It can be really anything that is of interest to you, like office reports or classical literature (you might need to perform some chunking if the content doesn't quite fit Anthropic's rather large context window). Even better, replace the LLM used here with another one—the latest ChatGPT or some open-source model—and compare the results.
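If your own documents do outgrow the one-embedding-per-article approach used here, a naive word-based chunker (a sketch only; the chunk size and overlap are arbitrary) could be a starting point:

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping word-based chunks (sizes are arbitrary)."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```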
In this type of setup, MongoDB proves to be an ideal database. The created reports could be saved upon generation, revised or reviewed by a human (at least, for fact-checking), and they could be linked to the original articles, creating a cascade of more complex narratives that follow a logical and chronological path that can then again be fed to an LLM. The possibilities are really endless, and experimenting with MongoDB and AI unlocks interesting scenarios and is also fun!
It is really impressive how quickly we can prototype a nontrivial generative AI-powered application with an appropriate LLM and MongoDB as the database, leveraging the powerful vector search capabilities!