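Perform Hybrid Search with the LangChain Integration
You can integrate Atlas Vector Search with LangChain to perform hybrid search. In this tutorial, you complete the following steps:
Set up the environment.
Use Atlas as a vector store.
Create an Atlas Vector Search index and an Atlas Search index on your data.
Run hybrid search queries.
Pass the query results to a RAG pipeline.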
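Prerequisites
To complete this tutorial, you need the following, each of which is used in the steps below:
An Atlas cluster that contains the sample_mflix sample data.
An OpenAI API key.
An environment in which to run interactive Python notebooks.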
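Set Up the Environment
Set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension. The notebook lets you run Python code snippets individually, and you use it to run the code in this tutorial.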
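Before running the tutorial code, it can help to confirm that the packages it imports are available in the notebook's kernel. The following is a minimal sketch: the module names are taken from the import statements used later in this tutorial, and missing_modules is a hypothetical helper, not part of any library.

```python
from importlib.util import find_spec

# Top-level modules imported by the code in this tutorial
REQUIRED_MODULES = ["langchain_mongodb", "langchain_openai", "langchain_core", "pymongo"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# An empty list means every module can be imported
print(missing_modules(REQUIRED_MODULES))
```

If the list is not empty, install the corresponding packages in the notebook's environment before continuing.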
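To set up your notebook environment:
Set the environment variables.
Run the following code to set the environment variables for this tutorial. When prompted, provide your OpenAI API key and the SRV connection string for your Atlas cluster.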
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
ATLAS_CONNECTION_STRING = getpass.getpass("MongoDB Atlas SRV Connection String:")
Note
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
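As a rough sanity check, you could verify that the string you entered has the shape shown above before passing it to the driver. This is only a sketch: looks_like_srv_uri is a hypothetical helper, it checks structure only (not whether the credentials or cluster exist), and it assumes that any special characters in the password are percent-encoded.

```python
from urllib.parse import urlsplit


def looks_like_srv_uri(uri: str) -> bool:
    """Check that a string has the general shape of an Atlas SRV connection string."""
    parts = urlsplit(uri)
    return (
        parts.scheme == "mongodb+srv"
        and bool(parts.username)
        and bool(parts.password)
        and bool(parts.hostname)
        and parts.hostname.endswith(".mongodb.net")
    )


# Example values are placeholders, not real credentials
print(looks_like_srv_uri("mongodb+srv://user:pass@cluster0.abcde.mongodb.net"))
print(looks_like_srv_uri("mongodb://localhost:27017"))
```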
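Use Atlas as a Vector Store
You must use Atlas as the vector store for your data. You can instantiate a vector store from an existing collection in Atlas.
Instantiate the vector store.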
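Paste and run the following code in your notebook to create a vector store instance named vector_store from the sample_mflix.embedded_movies namespace in Atlas. This code uses the from_connection_string method to create the MongoDBAtlasVectorSearch vector store and specifies the following parameters: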
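Your Atlas cluster's connection string.
An OpenAI embedding model as the model used to convert text into vector embeddings. By default, this model is text-embedding-ada-002.
sample_mflix.embedded_movies as the namespace to use.
plot as the field that contains the text.
plot_embedding as the field that contains the embeddings.
dotProduct as the relevance score function.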
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings

# Create the vector store
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string=ATLAS_CONNECTION_STRING,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    namespace="sample_mflix.embedded_movies",
    text_key="plot",
    embedding_key="plot_embedding",
    relevance_score_fn="dotProduct",
)
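One reason dotProduct can work as the relevance score function here is that, for unit-length vectors, the dot product equals cosine similarity. OpenAI describes its embeddings as normalized to length 1; treat that as an assumption to verify for your model. The sketch below uses small hand-written vectors, not real embeddings, to illustrate the relationship.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


raw_a, raw_b = [3.0, 4.0], [1.0, 2.0]
unit_a, unit_b = normalize(raw_a), normalize(raw_b)

# For unit vectors the two measures agree; for unnormalized vectors they differ
print(math.isclose(dot(unit_a, unit_b), cosine(unit_a, unit_b)))
print(math.isclose(dot(raw_a, raw_b), cosine(raw_a, raw_b)))
```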
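Create the Indexes
Note
To create an Atlas Vector Search or Atlas Search index, you must have Project Data Access Admin or higher access to the Atlas project.
To enable hybrid search queries on your vector store, create an Atlas Vector Search index and an Atlas Search index on the collection. You can create the indexes by using either the LangChain helper methods or the PyMongo driver methods: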
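Create the Atlas Search index by using the LangChain helper method.
Run the following code in your notebook to create a search index that indexes the plot field in the collection.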
from langchain_mongodb.index import create_fulltext_search_index
from pymongo import MongoClient

# Connect to your cluster
client = MongoClient(ATLAS_CONNECTION_STRING)

# Use helper method to create the search index
create_fulltext_search_index(
    collection=client["sample_mflix"]["embedded_movies"],
    field="plot",
    index_name="search_index",
)
Create the Atlas Vector Search index by using PyMongo.
Run the following code to create a vector search index that indexes the plot_embedding field in the collection.
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

# Connect to your cluster
client = MongoClient(ATLAS_CONNECTION_STRING)
collection = client["sample_mflix"]["embedded_movies"]

# Create your vector search index model, then create the index
vector_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "plot_embedding",
                "numDimensions": 1536,
                "similarity": "dotProduct",
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=vector_index_model)
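Create the Atlas Search index by using PyMongo.
This step is an alternative to the LangChain helper method; skip it if an index named search_index already exists on the collection. Otherwise, run the following code to create a search index that indexes the plot field in the collection.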
# Reuses SearchIndexModel and collection from the vector search index step
# Create your search index model, then create the search index
search_index_model = SearchIndexModel(
    definition={
        "mappings": {
            "dynamic": False,
            "fields": {
                "plot": {
                    "type": "string"
                }
            }
        }
    },
    name="search_index"
)
collection.create_search_index(model=search_index_model)
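The indexes take about one minute to build. While they build, they are in an initial sync state. After they finish building, you can start querying the data in your collection.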
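Rather than waiting a fixed amount of time, you could poll until an index reports that it is ready. The following is a sketch under stated assumptions: wait_until_queryable is a hypothetical helper, and it assumes that each index description returned by the listing call is a mapping with name and queryable keys, as the documents returned by PyMongo's collection.list_search_indexes() are expected to be on Atlas.

```python
import time


def wait_until_queryable(list_indexes, index_name, timeout_s=120.0, poll_s=5.0, sleep=time.sleep):
    """Poll list_indexes() until index_name reports queryable, or until the timeout elapses.

    list_indexes is any zero-argument callable that returns an iterable of
    index descriptions (mappings with "name" and "queryable" keys).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for index in list_indexes():
            if index.get("name") == index_name and index.get("queryable"):
                return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

With the collection object from the previous steps, a call might look like wait_until_queryable(lambda: collection.list_search_indexes(), "vector_index"), repeated for search_index.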
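Run a Hybrid Search Query
After Atlas builds your indexes, you can run hybrid search queries on your data. The following code uses the MongoDBAtlasHybridSearchRetriever retriever to perform a hybrid search for the string time travel. It also specifies the following parameters: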
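vectorstore: The name of the vector store instance.
search_index_name: The name of the Atlas Search index.
top_k: The number of documents to return.
fulltext_penalty: The penalty for full-text search. The lower the penalty, the higher the full-text search score.
vector_penalty: The penalty for vector search. The lower the penalty, the higher the vector search score.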
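The retriever returns a list of documents sorted by the sum of the full-text search score and the vector search score. The final output of the code example includes the title, the plot, and the individual scores for each document.
To learn more about hybrid search query results, see About the Query.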
from langchain_mongodb.retrievers.hybrid_search import MongoDBAtlasHybridSearchRetriever

# Initialize the retriever
retriever = MongoDBAtlasHybridSearchRetriever(
    vectorstore=vector_store,
    search_index_name="search_index",
    top_k=5,
    fulltext_penalty=50,
    vector_penalty=50,
)

# Define your query
query = "time travel"

# Print results
documents = retriever.invoke(query)
for doc in documents:
    print("Title: " + doc.metadata["title"])
    print("Plot: " + doc.page_content)
    print("Search score: {}".format(doc.metadata["fulltext_score"]))
    print("Vector Search score: {}".format(doc.metadata["vector_score"]))
    print("Total score: {}\n".format(doc.metadata["fulltext_score"] + doc.metadata["vector_score"]))
Title: Timecop
Plot: An officer for a security agency that regulates time travel, must fend for his life against a shady politician who has a tie to his past.
Search score: 0.019230769230769232
Vector Search score: 0.01818181818181818
Total score: 0.03741258741258741

Title: The Time Traveler's Wife
Plot: A romantic drama about a Chicago librarian with a gene that causes him to involuntarily time travel, and the complications it creates for his marriage.
Search score: 0.0196078431372549
Vector Search score: 0
Total score: 0.0196078431372549

Title: Thrill Seekers
Plot: A reporter, learning of time travelers visiting 20th century disasters, tries to change the history they know by averting upcoming disasters.
Search score: 0
Vector Search score: 0.0196078431372549
Total score: 0.0196078431372549

Title: About Time
Plot: At the age of 21, Tim discovers he can travel in time and change what happens and has happened in his own life. His decision to make his world a better place by getting a girlfriend turns out not to be as easy as you might think.
Search score: 0
Vector Search score: 0.019230769230769232
Total score: 0.019230769230769232

Title: My iz budushchego
Plot: My iz budushchego, or We Are from the Future, is a movie about time travel. Four 21st century treasure seekers are transported back into the middle of a WWII battle in Russia. The movie's ...
Search score: 0.018867924528301886
Vector Search score: 0
Total score: 0.018867924528301886
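The scores in the output above are consistent with reciprocal rank fusion, where each result list contributes 1 / (rank + penalty + 1) for a 0-based rank: with a penalty of 50, ranks 0, 1, 2, and 4 give 1/51, 1/52, 1/53, and 1/55, which match the printed values. The sketch below reproduces that arithmetic. The formula is inferred from the numbers rather than quoted from the library, rrf_score and fuse are hypothetical helpers, and the two rankings are reconstructed from the output, with placeholder titles where a rank cannot be recovered.

```python
import math


def rrf_score(rank, penalty):
    """Reciprocal-rank score for a 0-based rank (formula inferred from the output)."""
    return 1.0 / (rank + penalty + 1)


def fuse(fulltext_ranking, vector_ranking, fulltext_penalty=50, vector_penalty=50):
    """Combine two ranked lists of document ids, sorted by summed score, highest first."""
    scores = {}
    for rank, doc_id in enumerate(fulltext_ranking):
        entry = scores.setdefault(doc_id, {"fulltext_score": 0.0, "vector_score": 0.0})
        entry["fulltext_score"] = rrf_score(rank, fulltext_penalty)
    for rank, doc_id in enumerate(vector_ranking):
        entry = scores.setdefault(doc_id, {"fulltext_score": 0.0, "vector_score": 0.0})
        entry["vector_score"] = rrf_score(rank, vector_penalty)
    return sorted(
        scores.items(),
        key=lambda item: item[1]["fulltext_score"] + item[1]["vector_score"],
        reverse=True,
    )


# Rankings reconstructed from the printed scores; placeholders are hypothetical
fulltext_ranking = ["The Time Traveler's Wife", "Timecop", "My iz budushchego"]
vector_ranking = ["Thrill Seekers", "About Time", "placeholder-1", "placeholder-2", "Timecop"]

fused = fuse(fulltext_ranking, vector_ranking)
top_title, top_scores = fused[0]
print(top_title, top_scores["fulltext_score"] + top_scores["vector_score"])
```

Because a document that appears in both lists collects two contributions, Timecop ranks first even though it is not the top hit in either list. Raising one penalty shrinks that list's contributions and lets the other list dominate the ordering.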
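Pass Results to a RAG Pipeline
You can pass your hybrid search results into a RAG pipeline to generate responses based on the retrieved documents. The sample code does the following: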
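Defines a LangChain prompt template that instructs the LLM to use the retrieved documents as context for your query. LangChain passes these documents to the {context} input variable and your query to the {query} variable.
Constructs a chain that specifies the following:
The hybrid search retriever that you defined to retrieve relevant documents.
The prompt template that you defined.
An LLM from OpenAI to generate a context-aware response. By default, this is the gpt-3.5-turbo model.
Prompts the chain with a sample query and returns the response. The generated response might vary.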
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Define a prompt template
template = """
Use the following pieces of context to answer the question at the end.
{context}
Question: Can you recommend some movies about {query}?
"""
prompt = PromptTemplate.from_template(template)
model = ChatOpenAI()

# Construct a chain to answer questions on your data
chain = (
    {"context": retriever, "query": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

# Prompt the chain
query = "time travel"
answer = chain.invoke(query)
print(answer)
Based on the pieces of context provided, here are some movies about time travel that you may find interesting:

1. "Timecop" (1994) - A movie about a cop who is part of a law enforcement agency that regulates time travel, seeking justice and dealing with personal loss.
2. "The Time Traveler's Wife" (2009) - A romantic drama about a man with the ability to time travel involuntarily and the impact it has on his relationship with his wife.
3. "Thrill Seekers" (1999) - A movie about two reporters trying to prevent disasters by tracking down a time traveler witnessing major catastrophes.
4. "About Time" (2013) - A film about a man who discovers he can travel through time and uses this ability to improve his life and relationships.
5. "My iz budushchego" (2008) - A Russian movie where four treasure seekers from the 21st century are transported back to a WWII battle, exploring themes of action, drama, fantasy, and romance.

These movies offer a variety of perspectives on time travel and its impact on individuals and society.
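To see what the model receives, it can help to fill in the prompt template by hand. The sketch below uses plain Python string formatting rather than LangChain: Doc is a hypothetical stand-in for a retrieved document, and format_docs is a hypothetical helper that joins titles and plots into the context text. The chain in the tutorial passes the retrieved documents to {context} directly, so the exact text it produces may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


TEMPLATE = """
Use the following pieces of context to answer the question at the end.
{context}
Question: Can you recommend some movies about {query}?
"""


def format_docs(docs):
    """Join each document's title and plot into one block of context text."""
    return "\n\n".join(
        f"{doc.metadata.get('title', 'Untitled')}: {doc.page_content}" for doc in docs
    )


def build_prompt(docs, query):
    """Fill the template's {context} and {query} variables."""
    return TEMPLATE.format(context=format_docs(docs), query=query)


docs = [
    Doc(
        "An officer for a security agency that regulates time travel, must fend for "
        "his life against a shady politician who has a tie to his past.",
        {"title": "Timecop"},
    )
]
prompt_text = build_prompt(docs, "time travel")
print(prompt_text)
```

Formatting the documents explicitly like this is one way to control how much metadata reaches the model and how it is laid out.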