Unlocking Semantic Search: Building a Java-Powered Movie Search Engine with Atlas Vector Search and Spring Boot

Tim Kelly • 10 min read • Published Sep 18, 2024 • Updated Sep 18, 2024

In the rapidly evolving world of technology, the quest for more relevant, personalized, and intuitive search results has led to the rise in popularity of semantic search.

MongoDB's Vector Search lets you search your data by semantic relatedness, making it possible to search by meaning, not just by keyword matching.

In this tutorial, we'll look at how to build a Spring Boot application that can perform a semantic search on a collection of movies by their plot descriptions.

What we'll need

Before you get started, there are a few things you'll need.

Java 11 or higher
Maven or Gradle, but this tutorial will reference Maven
Your own MongoDB Atlas account
An OpenAI account, to generate our embeddings

Set up your MongoDB cluster

Visit the MongoDB Atlas dashboard and set up your cluster. In order to take advantage of the $vectorSearch operator in an aggregation pipeline, you need to run MongoDB Atlas 6.0.11 or higher.

The option to select your MongoDB Atlas version is at the bottom of the screen when configuring your cluster, under "Additional Settings."

When you're setting up your deployment, you'll be prompted to set up a database user and rules for your network connection.

If you need more help getting started with MongoDB Atlas, check out our tutorial.

For this project, we're going to use the sample data MongoDB provides. When you first log into the dashboard, you will see an option to load sample data into your database.

If you look in the sample_mflix database, you'll see a collection called embedded_movies. These documents contain a field called plot_embedding that holds an array of floating point numbers (our vectors). These numbers mean little to the human eye, but they are what allow us to perform our semantic search for movies based on their plot. If you want to embed your own data, you'll be able to use the methods in this tutorial to do so, or you can check out the tutorial that shows you how to set up a trigger in your database to automatically embed your data.

Create a vector search index

In order to use the $vectorSearch operator on our data, we need to set up an appropriate search index. Select the "Search" tab on your cluster and click "Create Search Index."

We want to choose the "JSON Editor" option and click "Next."

On this page, we're going to select our target database, sample_mflix, and collection, embedded_movies, for this tutorial.

The name isn't too important, but give the index a name (PlotVectorSearch, for example) and copy in the following JSON.

{
  "fields": [{
    "type": "vector",
    "path": "plot_embedding",
    "numDimensions": 1536,
    "similarity": "euclidean"
  }]
}

The fields specify the embedding field name in our documents, plot_embedding; the dimensions of the model used to embed, 1536; and the similarity function used to find the K-nearest neighbors, euclidean. It's very important that the dimensions in the index match those of the model used for embedding. This data has been embedded using the same model as the one we'll be using, but other models are available and may use different dimensions.
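To make the similarity setting more concrete, here is a minimal sketch in plain Java of the three functions a vector index can be configured with: euclidean, cosine, and dotProduct. Atlas computes these on the server, so this is only an illustration; the VectorSimilarity class and its requireSameDimensions guard are hypothetical names, not part of any MongoDB API. The guard mirrors the point above that both vectors must have the same number of dimensions.

```java
import java.util.List;

final class VectorSimilarity {

    // Hypothetical guard: both vectors must have the same number of dimensions.
    static void requireSameDimensions(List<Double> a, List<Double> b) {
        if (a.size() != b.size()) {
            throw new IllegalArgumentException(
                "Dimension mismatch: " + a.size() + " vs " + b.size());
        }
    }

    // Sum of element-wise products.
    static double dotProduct(List<Double> a, List<Double> b) {
        requireSameDimensions(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.size(); i++) {
            sum += a.get(i) * b.get(i);
        }
        return sum;
    }

    // Straight-line distance between the two points; smaller means closer.
    static double euclideanDistance(List<Double> a, List<Double> b) {
        requireSameDimensions(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.size(); i++) {
            double diff = a.get(i) - b.get(i);
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the vectors; ignores their lengths.
    static double cosineSimilarity(List<Double> a, List<Double> b) {
        return dotProduct(a, b)
            / (Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b)));
    }

    public static void main(String[] args) {
        List<Double> point = List.of(3.0, 4.0);
        List<Double> origin = List.of(0.0, 0.0);
        System.out.println(euclideanDistance(point, origin)); // prints 5.0
        System.out.println(
            dotProduct(List.of(1.0, 2.0, 3.0), List.of(4.0, 5.0, 6.0))); // prints 32.0
    }
}
```

Passing vectors of different lengths throws an IllegalArgumentException, which is the local equivalent of querying an index with an embedding from a model with different dimensions.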

Check out our Vector Search documentation for more information on these configuration settings.

Set up a Spring Boot project

To set up our project, let's use the Spring Initializr. This will generate our pom.xml file which will contain our dependencies for our project.

For this project, select the following options and create a JAR:

Project: Maven
Language: Java
Dependencies: Spring Reactive Web and Spring Data Reactive MongoDB
Generate: JAR

Open the Maven project in the IDE of your choice and let's write some code!

MongoDB model

Before we do anything, open your pom.xml file. In the properties section, you will need to add the following:

<mongodb.version>4.11.0</mongodb.version>

This will force your Spring Boot API to use version 4.11.0 of the MongoDB Java drivers. Feel free to use a more up-to-date version to pick up the latest features, such as the vectorSearch() method.

You will also notice that throughout this application we use the MongoDB Java Reactive Streams driver. This is because we are creating an asynchronous API. AI operations like generating embeddings can be compute-intensive and time-consuming. An asynchronous API allows these tasks to be processed in the background, freeing up the system to handle other requests or operations simultaneously.
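The benefit of a non-blocking call chain can be sketched with the JDK's CompletableFuture. This is a stand-in for the Reactor types the tutorial actually uses, chosen only so the example runs without extra dependencies. The embedAsync and searchAsync methods are hypothetical placeholders for a slow embedding call and a database lookup; they return made-up values.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class AsyncEmbeddingSketch {

    // Hypothetical stand-in for a slow remote embedding call.
    static CompletableFuture<List<Double>> embedAsync(String text) {
        return CompletableFuture.supplyAsync(() -> {
            pause(200); // simulate network and compute latency
            return List.of((double) text.length(), 0.5); // made-up "embedding"
        });
    }

    // Hypothetical stand-in for a database lookup driven by the vector.
    static CompletableFuture<String> searchAsync(List<Double> vector) {
        return CompletableFuture.supplyAsync(
            () -> "movie-for-" + vector.get(0).intValue());
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Chain the two steps without blocking the calling thread.
        CompletableFuture<String> result =
            embedAsync("toys").thenCompose(AsyncEmbeddingSketch::searchAsync);

        // The caller is free to do other work while the chain runs.
        System.out.println("Request submitted");

        // Block only at the very end, to print the outcome of the demo.
        System.out.println(result.join());
    }
}
```

The same shape appears later in the tutorial with Mono and Flux: each step describes what to do when the previous one completes, instead of waiting for it.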

To represent our document in Java, we will use plain old Java objects (POJOs). The data we're going to handle are the documents from the sample data you just loaded into your cluster. For each document and subdocument, we need a POJO. MongoDB documents already bear a lot of resemblance to POJOs and are straightforward to set up using the MongoDB driver.

In the main document, we have three subdocuments: Imdb, Tomatoes, and Viewer. Thus, we will need four POJOs for our Movie document.

We first need to create a package called com.example.mdbvectorsearch.model and add our class Movie.java.

We use the @BsonProperty("_id") annotation to map the _id field in our documents to the Id field in Java, so as not to violate Java naming conventions.

import org.bson.codecs.pojo.annotations.BsonProperty;
import org.bson.types.ObjectId;

import java.util.Date;
import java.util.List;

public class Movie {

    @BsonProperty("_id")
    private ObjectId Id;
    private String title;
    private int year;
    private int runtime;
    private Date released;
    private String poster;
    private String plot;
    private String fullplot;
    private String lastupdated;
    private String type;
    private List<String> directors;
    private Imdb imdb;
    private List<String> cast;
    private List<String> countries;
    private List<String> genres;
    private Tomatoes tomatoes;
    private int num_mflix_comments;
    // The embedding is an array of floating point numbers, matching the indexed field name
    private List<Double> plot_embedding;

    // Getters and setters for Movie fields

}

Add another class called Imdb.

public class Imdb {

    private double rating;
    private int votes;
    private int id;

    // Getters and setters for Imdb fields

}

Yet another called Tomatoes.

import java.util.Date;

public class Tomatoes {

    private Viewer viewer;
    private Date lastUpdated;

    // Getters and setters for Tomatoes fields

}

And finally, Viewer.

public class Viewer {

    private double rating;
    private int numReviews;

    // Getters and setters for Viewer fields

}

Tip: Many IDEs have shortcuts for generating getters and setters.

Connect to your database

In your project, set up a package com.example.mdbvectorsearch.config and add a class, MongodbConfig.java. This is where we will connect to our database, and create and configure our client. If you're used to using Spring Data MongoDB, a lot of this is usually hidden from you. We are doing it this way to take advantage of some of the latest features of the MongoDB Java driver to support vectors.

From the MongoDB Atlas interface, we'll get our connection string and add this to our application.properties file. We'll also specify the name of our database here.

mongodb.uri=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/
mongodb.database=sample_mflix

Now, in your MongodbConfig class, import these values, and denote this as a configuration class with the annotation @Configuration.

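import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.reactivestreams.client.*; // reactive MongoClient, MongoClients, MongoDatabase
import org.bson.codecs.configuration.*;
import org.bson.codecs.pojo.PojoCodecProvider;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.*;

@Configuration
public class MongodbConfig {

    @Value("${mongodb.uri}")
    private String MONGODB_URI;

    @Value("${mongodb.database}")
    private String MONGODB_DATABASE;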

Next, we need to create a client and configure it to handle the translation to and from BSON for our POJOs. Here we configure a CodecRegistry to handle these conversions, combining the default codecs, which can handle the major Java data types, with an automatic POJO codec provider. We then wrap these in a MongoClientSettings and create our MongoClient.

    @Bean
    public MongoClient mongoClient() {
        CodecRegistry pojoCodecRegistry = CodecRegistries.fromRegistries(
            MongoClientSettings.getDefaultCodecRegistry(),
            CodecRegistries.fromProviders(
                PojoCodecProvider.builder().automatic(true).build()
            )
        );

        MongoClientSettings settings = MongoClientSettings.builder()
            .applyConnectionString(new ConnectionString(MONGODB_URI))
            .codecRegistry(pojoCodecRegistry)
            .build();

        return MongoClients.create(settings);
    }

Our last step will then be to get our database, and we're done with this class.

    @Bean
    public MongoDatabase mongoDatabase(MongoClient mongoClient) {
        return mongoClient.getDatabase(MONGODB_DATABASE);
    }
}

Embed your data with the OpenAI API

We are going to send the prompt given by the user to the OpenAI API to be embedded. An embedding is a series (vector) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.

This will transform our natural language prompt, such as "Toys that come to life when no one is looking", to a large array of floating point numbers that will look something like this [-0.012670076, -0.008900887, ..., 0.0060262447, -0.031987168].
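To see how distance between embeddings turns into a ranking, here is a minimal brute-force sketch in plain Java. The two-dimensional vectors and the titles are made up for illustration (real embeddings from this model have 1536 dimensions), and the NearestPlots class is hypothetical. Atlas Vector Search does this work for us on the server, so the sketch only shows the idea.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NearestPlots {

    // Euclidean distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns up to `limit` titles whose vectors are closest to the query, nearest first.
    static List<String> nearest(Map<String, double[]> vectors, double[] query, int limit) {
        List<String> titles = new ArrayList<>(vectors.keySet());
        titles.sort(Comparator.comparingDouble(title -> distance(vectors.get(title), query)));
        return titles.subList(0, Math.min(limit, titles.size()));
    }

    public static void main(String[] args) {
        // Made-up toy vectors, not real embeddings.
        Map<String, double[]> plots = new LinkedHashMap<>();
        plots.put("Space documentary", new double[] {0.0, 1.0});
        plots.put("Heist thriller", new double[] {0.6, 0.5});
        plots.put("Buddy cop comedy", new double[] {0.9, 0.1});

        double[] query = {1.0, 0.0};
        System.out.println(nearest(plots, query, 2));
        // prints [Buddy cop comedy, Heist thriller]
    }
}
```

The smallest distance wins, so the plot whose vector sits closest to the query vector comes back first.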

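In order to do this, we need to create a few files. All of our code to interact with OpenAI will be contained in our OpenAIService.java class, which goes in the com.example.mdbvectorsearch.service package. The @Service annotation at the top of our class tells Spring Boot that it belongs to the service layer and contains business logic. The class reads your API key from the openai.api.key property, so add that property to application.properties alongside the MongoDB settings.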

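import com.example.mdbvectorsearch.model.EmbeddingResponse;
import jakarta.annotation.PostConstruct; // javax.annotation.PostConstruct on Spring Boot 2
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.MediaType;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

import java.util.List;
import java.util.Map;

@Service
public class OpenAIService {

    private static final String OPENAI_API_URL = "https://api.openai.com";

    @Value("${openai.api.key}")
    private String OPENAI_API_KEY;

    private WebClient webClient;

    // Build the client once the API key has been injected
    @PostConstruct
    void init() {
        this.webClient = WebClient.builder()
            .clientConnector(new ReactorClientHttpConnector())
            .baseUrl(OPENAI_API_URL)
            .defaultHeader("Content-Type", MediaType.APPLICATION_JSON_VALUE)
            .defaultHeader("Authorization", "Bearer " + OPENAI_API_KEY)
            .build();
    }

    // Send the text to the embeddings endpoint and return the first embedding in the response
    public Mono<List<Double>> createEmbedding(String text) {
        Map<String, Object> body = Map.of(
            "model", "text-embedding-ada-002",
            "input", text
        );

        return webClient.post()
            .uri("/v1/embeddings")
            .bodyValue(body)
            .retrieve()
            .bodyToMono(EmbeddingResponse.class)
            .map(EmbeddingResponse::getEmbedding);
    }
}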

We use the Spring WebClient to make the calls to the OpenAI API. We then create the embeddings. To do this, we pass in our text and specify our embedding model (e.g., text-embedding-ada-002). You can read more about the OpenAI API parameter options in their docs.

To pass in and receive the data from the OpenAI API, we need to specify our models for the data being received. We're going to add two models to our com.example.mdbvectorsearch.model package, EmbeddingData.java and EmbeddingResponse.java.

import java.util.List;

public class EmbeddingData {
    private List<Double> embedding;

    public List<Double> getEmbedding() {
        return embedding;
    }

    public void setEmbedding(List<Double> embedding) {
        this.embedding = embedding;
    }
}

import java.util.List;

public class EmbeddingResponse {
    private List<EmbeddingData> data;

    // Convenience accessor: we send a single input, so we read the first entry
    public List<Double> getEmbedding() {
        return data.get(0).getEmbedding();
    }

    public List<EmbeddingData> getData() {
        return data;
    }

    public void setData(List<EmbeddingData> data) {
        this.data = data;
    }
}

Your Atlas Vector Search aggregation pipeline in Spring Boot

We have our database. We are able to embed our data. We are ready to send and receive our movie documents. How do we actually perform our semantic search?

The data access layer of our API implementation takes place in the repository. Create a package com.example.mdbvectorsearch.repository and add the interface MovieRepository.java.

public interface MovieRepository {
    Flux<Movie> findMoviesByVector(List<Double> embedding);
}

Now, we implement the logic for our findMoviesByVector method in the implementation of this interface. Add a class MovieRepositoryImpl.java to the package. This method implements the data logic for our application: it takes the embedding of the user's input text, created using the OpenAI API, and runs the $vectorSearch aggregation stage against our embedded_movies collection, using the index we set up earlier.

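import static com.mongodb.client.model.Aggregates.vectorSearch;
import static com.mongodb.client.model.search.SearchPath.fieldPath;
import static java.util.Arrays.asList;

import com.example.mdbvectorsearch.model.Movie;
import com.mongodb.reactivestreams.client.MongoCollection;
import com.mongodb.reactivestreams.client.MongoDatabase;
import org.bson.conversions.Bson;
import org.springframework.stereotype.Repository;
import reactor.core.publisher.Flux;

import java.util.List;

@Repository
public class MovieRepositoryImpl implements MovieRepository {

    private final MongoDatabase mongoDatabase;

    public MovieRepositoryImpl(MongoDatabase mongoDatabase) {
        this.mongoDatabase = mongoDatabase;
    }

    private MongoCollection<Movie> getMovieCollection() {
        return mongoDatabase.getCollection("embedded_movies", Movie.class);
    }

    @Override
    public Flux<Movie> findMoviesByVector(List<Double> embedding) {
        String indexName = "PlotVectorSearch"; // the index name chosen earlier
        int numCandidates = 100;               // neighbors considered during the search
        int limit = 5;                         // documents returned

        List<Bson> pipeline = asList(
            vectorSearch(
                fieldPath("plot_embedding"),
                embedding,
                indexName,
                numCandidates,
                limit));

        return Flux.from(getMovieCollection().aggregate(pipeline, Movie.class));
    }
}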
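For readers who want to see what the pipeline stage contains, the sketch below assembles the same $vectorSearch stage as a plain Java Map instead of calling the driver. The field names (index, path, queryVector, numCandidates, limit) are the ones the $vectorSearch stage uses; the VectorSearchStageShape class itself is hypothetical and only meant for inspection, not for sending to the database.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class VectorSearchStageShape {

    // Builds a Map mirroring the structure of a $vectorSearch aggregation stage.
    static Map<String, Object> vectorSearchStage(
            String index, String path, List<Double> queryVector,
            int numCandidates, int limit) {
        // The stage needs at least as many candidates as results to return.
        if (numCandidates < limit) {
            throw new IllegalArgumentException("numCandidates must be >= limit");
        }
        Map<String, Object> options = new LinkedHashMap<>();
        options.put("index", index);                 // name of the vector search index
        options.put("path", path);                   // document field holding the embedding
        options.put("queryVector", queryVector);     // embedding of the user's prompt
        options.put("numCandidates", numCandidates); // neighbors considered during search
        options.put("limit", limit);                 // documents returned

        Map<String, Object> stage = new LinkedHashMap<>();
        stage.put("$vectorSearch", options);
        return stage;
    }

    public static void main(String[] args) {
        // A two-element vector keeps the printed output short; real query vectors have 1536 values.
        System.out.println(vectorSearchStage(
            "PlotVectorSearch", "plot_embedding", List.of(0.1, 0.2), 100, 5));
    }
}
```

A larger numCandidates generally trades some speed for better recall, while limit controls how many movies come back.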

For the business logic of our application, we need to create a service class. Create a class called MovieService.java in our service package.

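import com.example.mdbvectorsearch.model.Movie;
import com.example.mdbvectorsearch.repository.MovieRepository;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;

import java.util.List;

@Service
public class MovieService {

    private final MovieRepository movieRepository;
    private final OpenAIService embedder;

    @Autowired
    public MovieService(MovieRepository movieRepository, OpenAIService embedder) {
        this.movieRepository = movieRepository;
        this.embedder = embedder;
    }

    // Embed the description, run the vector search, and gather the matches into one list
    public Mono<List<Movie>> getMoviesSemanticSearch(String plotDescription) {
        return embedder.createEmbedding(plotDescription)
                .flatMapMany(movieRepository::findMoviesByVector)
                .collectList();
    }
}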

The getMoviesSemanticSearch method takes in the user's natural language plot description, embeds it using the OpenAI API, performs a vector search on our embedded_movies collection using the index we set up earlier, and returns the top five most similar results.

This returns a Mono wrapping our list of Movie objects. All that's left now is to actually pass in some data and call our function.

We've got the logic in our application. Now, let's make it an API! First, we need to set up our controller. This will allow us to take in the user input for our application. Let's set up an endpoint to take in the user's plot description and return our semantic search results. Create a com.example.mdbvectorsearch.controller package and add the class MovieController.java.

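import com.example.mdbvectorsearch.model.Movie;
import com.example.mdbvectorsearch.service.MovieService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

import java.util.List;

@RestController
public class MovieController {

    private final MovieService movieService;

    @Autowired
    public MovieController(MovieService movieService) {
        this.movieService = movieService;
    }

    @GetMapping("/movies/semantic-search")
    public Mono<List<Movie>> performSemanticSearch(@RequestParam("plotDescription") String plotDescription) {
        return movieService.getMoviesSemanticSearch(plotDescription);
    }
}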

We define an endpoint /movies/semantic-search that handles GET requests, captures the plotDescription as a query parameter, and delegates the search operation to the MovieService.

You can use your favorite tool to test the API endpoints, but we're just going to send a cURL command.

curl -X GET "http://localhost:8080/movies/semantic-search?plotDescription=A%20cop%20from%20china%20and%20cop%20from%20america%20save%20kidnapped%20girl"

Note: We use %20 to indicate spaces in our URL.
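If you would rather build the URL in code than encode it by hand, a minimal sketch using only the JDK could look like the following. The SearchUrlBuilder class and its method names are hypothetical. URLEncoder targets HTML form encoding and writes spaces as +, so the sketch swaps those for %20 to match the cURL command above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SearchUrlBuilder {

    // Percent-encodes a query parameter value, using %20 rather than + for spaces.
    static String encodeQueryValue(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Builds the full request URI for the semantic search endpoint.
    static URI searchUri(String baseUrl, String plotDescription) {
        return URI.create(baseUrl
            + "/movies/semantic-search?plotDescription="
            + encodeQueryValue(plotDescription));
    }

    public static void main(String[] args) {
        System.out.println(searchUri("http://localhost:8080", "A cop from china"));
        // prints http://localhost:8080/movies/semantic-search?plotDescription=A%20cop%20from%20china
    }
}
```

The resulting URI can be passed to any HTTP client, or pasted into the cURL command.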

Here we call our API with the query, "A cop from China and a cop from America save a kidnapped girl". There's no title in there but I think it's a fairly good description of a particular action/comedy movie starring Jackie Chan and Chris Tucker. Here's a slightly abbreviated version of my output. Let's check our results!

Movie title: Rush Hour
Plot: Two cops team up to get back a kidnapped daughter.

Movie title: Police Story 3: Supercop
Plot: A Hong Kong detective teams up with his female Red Chinese counterpart to stop a Chinese drug czar.

Movie title: Fuk sing go jiu
Plot: Two Hong-Kong cops are sent to Tokyo to catch an ex-cop who stole a large amount of money in diamonds. After one is captured by the Ninja-gang protecting the rogue cop, the other one gets ...

Movie title: Motorway
Plot: A rookie cop takes on a veteran escape driver in a death defying final showdown on the streets of Hong Kong.

Movie title: The Corruptor
Plot: With the aid from a NYC policeman, a top immigrant cop tries to stop drug-trafficking and corruption by immigrant Chinese Triads, but things complicate when the Triads try to bribe the policeman.

We found Rush Hour to be our top match. Just what I had in mind! If its premise resonates with you, there are a few other films you might enjoy.

You can test this yourself by changing the plotDescription we have in the cURL command.

Conclusion

This tutorial walked through the steps of building a semantic search application using MongoDB Atlas, OpenAI, and Spring Boot.

Semantic search offers a wide range of applications, from sophisticated product queries on e-commerce sites to tailored movie recommendations. This guide is designed to equip you with the essentials, paving the way for your next project.

Thinking about integrating vector search into your next project? Check out the article How to Model Your Documents for Vector Search to learn how to design your documents for vector search.

