Comparing NLP Techniques for Scalable Product Search
AS
VH
NC
MM
Adit Shrimal, Varun Hande, Neil Chitre, Manav Middha8 min read • Published Sep 23, 2024 • Updated Sep 23, 2024
Rate this article
In the rapidly expanding online retail industry, data is being generated at an unprecedented scale. To provide a seamless shopping experience to customers, it’s crucial to build an efficient, scalable product search architecture for storing and processing data. In this article, we will compare four popular natural language processing (NLP) techniques — namely BM25, TF-IDF, Word2Vec, and BERT — to empirically find the most optimal solution for retrieving the most relevant results for a search query from a large corpus of products.
Our data is sourced from Asos’s (an online retailer for clothing, footwear, accessories, etc.) real-time product data APIs. We utilized Apache Spark, Google Cloud Storage (GCS), and MongoDB to store and process data orchestrated by Apache Airflow.
Our data processing pipeline consists of three main parts:
We first fetch the product categories. Then, we get the products in each category and store this data to Google Cloud Storage. The Python snippet below illustrates how we fetched the data using RapidAPI and stored the data to GCS:
1 import json 2 import requests 3 4 url = "https://asos2.p.rapidapi.com/categories/list" 5 headers = { 6 "x-rapidapi-key": "YOUR_RAPIDAPI_KEY", 7 "x-rapidapi-host": "asos2.p.rapidapi.com", 8 } 9 response = requests.request("GET", url, headers=headers) 10 categories = json.loads(response.text) 11 for category in categories:a 12 url = f"https://asos2.p.rapidapi.com/products/v2/list/{category['id']}" 13 response = requests.request("GET", url, headers=headers) 14 products = json.loads(response.text) 15 # Store products in GCS 16 bucket = os.environ.get('GS_BUCKET_NAME') 17 product_filename = f"products_{datetime.now().strftime('%H%m%S')}.json" 18 write(bucket, str(date.today()) + "/" + product_filename, products)
The next step is preprocessing. Since product names are often not clean, we tokenized the data and removed stop words. For this, we used the
RegexTokenizer
and StopWordsRemover
from SparkML.1 from pyspark.ml.feature import RegexTokenizer, StopWordsRemover 2 3 # Instantiate and set input and output column parameters 4 regexTokenizer = RegexTokenizer( 5 inputCol="product_name", outputCol="words", pattern="\\W" 6 ) 7 remover = StopWordsRemover(inputCol="words", outputCol="filtered") 8 9 # Transform the data 10 regexTokenized = regexTokenizer.transform(df) 11 removedStopWords = remover.transform(regexTokenized)
The last step is to store the transformed documents in a MongoDB collection. MongoDB’s document model, ability to efficiently handle large data volumes, built-in full-text search functionality, and easy integrations with big data tools such as PySpark make it a fitting choice for our data processing pipeline. The following code snippet shows how to write data to MongoDB:
1 import json 2 3 from pymongo import MongoClient 4 5 client = MongoClient( 6 "mongodb://<username>:<password>@cluster0.mongodb.net/test?retryWrites=true&w=majority" 7 ) 8 db = client.get_database("product_db") 9 products = db.products 10 11 # Fetch data from GCS 12 for product in gcs_data: 13 products.insert_one(product)
Our primary objective was to find the most similar products for a given search query. We experimented with several methodologies:
TF-IDF, short for Term Frequency–Inverse Document Frequency, is a measure of importance of a word to a document in a corpus. It is a widely used method to encode text. Given a user query, we first encode it using TF-IDF and then use cosine similarity to find the most similar products from our catalog.
1 from pyspark.ml.feature import IDF, HashingTF 2 3 hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=20) 4 featurizedData = hashingTF.transform(removedStopWords) 5 6 idf = IDF(inputCol="rawFeatures", outputCol="features") 7 idfModel = idf.fit(featurizedData) 8 rescaledData = idfModel.transform(featurizedData)
Word2Vec is a family of model architectures to learn word embeddings from large datasets. Here, we use the Word2Vec implementation in SparkML to embed the user query and then use cosine similarity to find the most similar products from our catalog.
1 from pyspark.ml.feature import Word2Vec 2 3 word2Vec = Word2Vec( 4 vectorSize=300, minCount=0, inputCol="filtered", outputCol="features" 5 ) 6 model = word2Vec.fit(removedStopWords) 7 result = model.transform(removedStopWords)
BM25, short for “Best Match 25,” is a ranking function used in information retrieval systems to rank documents based on their relevance to a given search query. It is a probabilistic ranking models and is known for its ability to balance term frequency (TF) and inverse document frequency (IDF), thus making it a more sophisticated version of TF-IDF.
In our project, we implemented BM25 from scratch in PySpark due to the lack of a direct implementation in the library. Here is a more detailed look at our approach and code:
Firstly, we calculated the IDF values for each term, which is a measure of how much information the word provides, i.e., if it’s common or rare across all documents.
Secondly, we computed the term frequency in each document. Term frequency is simply the number of times a word appears in a document.
With these two metrics, we could then compute the BM25 score for each document relative to a query. Below is a simplified pseudocode for the process:
1 # Compute IDF values for each term 2 # Compute term frequency in each document for each term 3 # For each query term, do: 4 # For each document containing the term do: 5 # Compute BM25 Score 6 # Sort documents by score
In order to enhance the time efficiency of calculating TF and IDF every time for a product, we created an inverted index. This inverted index saves these important values beforehand to improve the retrieval time when someone searches for a product.
The first part of preprocessing involves stemming, which is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form. It’s done to get to the basic part of a word that carries the meaning. For example, the word stem of “running” is “run.” In this work, we used the Porter Stemming algorithm, a popular and robust stemming algorithm.
1 def stemming(words): 2 ps = PorterStemmer() 3 result = [] 4 5 for w in words: 6 root_word = ps.stem(w) 7 result.append(root_word) 8 9 return result 10 11 stemming_udf = udf(lambda x: stemming(x), ArrayType(elementType=StringType())) 12 preprocessed_data = preprocessed_data.withColumn( 13 "tokens", stemming_udf("filtered_words") 14 )
Next, we flatten the list of tokens into individual words and count the total number of unique words and total words.
1 words = preprocessed_data.select(explode("tokens").alias("word")) 2 total_unique_words = words.select("word").distinct().count() 3 total_words = words.count()
Once we have a list of words, we start building the inverted index. For each word in our dataset, we create an entry in our index, where each entry points to a list of documents that contain the word, along with the word’s frequency in each document.
1 id_str = concat_ws("", col("_id").cast(StringType())) 2 3 word_count_df = ( 4 preprocessed_data.select("_id", explode("tokens").alias("word")) 5 .groupBy("word", "_id") 6 .agg(size(collect_list("word")).alias("occurrence")) 7 .groupBy("word") 8 .agg( 9 map_from_entries(collect_list(struct(id_str.alias("id"), "occurrence"))).alias( 10 "documents" 11 ) 12 ) 13 .withColumnRenamed("word", "_id") 14 ).rdd.map(lambda row: (row["_id"], row["documents"])) 15 16 inverted_index = word_count_df.collectAsMap()
Finally, we also compute the length of all documents and the unique word count for each document. These values are used later in the BM25 formula.
1 doc_length_df = ( 2 preprocessed_data.select("_id", size("tokens").alias("length")) 3 .groupBy("_id") 4 .agg(sum(col("length")).alias("doc_length")) 5 .withColumn("_id", concat_ws("", col("_id").cast(StringType()))) 6 .rdd.map(lambda row: (row["_id"], row["doc_length"])) 7 ) 8 doc_length = doc_length_df.collectAsMap() 9 10 doc_length["word_counter"] = total_unique_words 11 doc_length["doc_counter"] = total_words 12 13 unique_word_count_df = preprocessed_data.select( 14 "_id", 15 size(array_distinct("tokens")).alias("num_unique_words"), 16 ).withColumn("_id", concat_ws("", col("_id").cast(StringType()))).rdd.map(lambda row: (row["_id"], row["num_unique_words"])) 17 unique_word = unique_word_count_df.collectAsMap()
This concludes the preprocessing and inverted index-building step. The inverted index is a critical component of the BM25 ranking function, as it enables fast retrieval of relevant documents for a given query.
Now, let’s dive into the implementation of the BM25 algorithm.
The mathematical formula for BM25 is as follows:
Where:
- f(q_i, D): Frequency of query term q_i in document D
- |D|: Word count of document D
- avgdl: Average word count across all documents
- k_1 and b: Tuning parameters, usually set as k_1 in [1.2, 2.0] and b=0.75
- IDF(q_i): Weight of query term q_i based on its rarity across all documents
Here’s how we’ve implemented the above algorithm in Python:
1 def bm25(doc_length, 2 inverted_index, 3 unique_word, 4 word_counter, 5 doc_counter, 6 query): 7 k1 = 1.2 8 b = 0.75 9 k2 = 100 10 N = len(doc_length) - 2 11 avgdl = doc_length["doc_counter"] / N 12 13 score = defaultdict(int) 14 for word in query: 15 if word in inverted_index: 16 n = len(inverted_index[word]) 17 idf = math.log((N - n + 0.5) / (n + 0.5)) 18 for doc, freq in inverted_index[word].items(): 19 f = freq 20 qf = query.count(word) 21 dl = doc_length[doc] 22 K = k1 * ((1 - b) + b * (float(dl) / float(avgdl))) 23 R1 = idf * ((f * (k1 + 1)) / (f + K)) 24 R2 = (qf * (k2 + 1)) / (qf + k2) 25 R = R1 * R2 26 score[doc] += R 27 else: 28 Continue 29 30 return score
This function takes the document length dictionary, the inverted index, the unique word count dictionary, the total word counter, the document counter, and the search query. It calculates the BM25 score for each document in relation to the query and returns a dictionary where the keys are the document IDs, and the values are the corresponding BM25 scores. The documents with the highest BM25 scores are the ones most relevant to the query. In this study, we chose values for k1, b, and k2 based on other existing studies.
BERT, short for Bidirectional Encoder Representations from Transformers, was one of the early examples of large language models (LLMs) and uses a transformer encoder architecture to represent text as vectors. We used a pre-trained BERT model to map the user query to a high-dimensional vector space and then use cosine similarity to find the most similar products from our catalog.
1 from sentence_transformers import SentenceTransformer 2 3 model = SentenceTransformer('bert-base-nli-mean-tokens') 4 sentence_embeddings = model.encode(data['product_name'])
Each of the models had their strengths and weaknesses. While TF-IDF was the easiest to interpret, it struggled to capture the semantic meaning of the data. Word2Vec was limited to handling queries present in the training corpus. BM25 was very quick, but it lacked semantic understanding. Among the algorithms we tested, the best model was BERT, as it could incorporate the semantic meaning of the data and handle queries in multiple languages.
For our project, we used a GPU-accelerated compute instance on AWS. Below are the details of the specifications we used:
- Worker/Driver types: g4dn.xlarge
- Number of workers: minimum 2, maximum 5
- Resources per worker: 32–80 GB Memory, 8–20 Cores
- Driver resources: 16 GB Memory, 4 Cores
We compared the performance of the four search algorithms (BM25, TF-IDF, Word2Vec, and BERT) on three distinct queries: "sneakers", "t-shirt", and "chappals" (which means sandals/slippers in Hindi). The results varied in terms of speed and relevance for each query and algorithm combination. A summary of the results is as follows:
- Sneakers query: BM25 was the fastest in returning relevant results. Both TF-IDF and Word2Vec were slower compared to BM25, while BERT offered a balance between speed and relevance.
- T-shirt query: While BM25 was the fastest to return results, it provided irrelevant results due to the limited vocabulary of the corpus. Word2Vec performed poorly in terms of both speed and relevance, similar to TF-IDF. BERT was moderately slow but provided highly relevant results.
- Chappals query: Chappal is a Hindi word for leather sandals. Again, BM25 excelled in terms of speed but delivered irrelevant results due to language limitations in the corpus. Word2Vec and TF-IDF had similar performance metrics as with the "t-shirt" query. BERT again demonstrated its effectiveness by delivering highly relevant results even with the complexity of handling multiple languages.
TF-IDF's results for the “chappals” query show the algorithm's struggle with multi-language queries.
Word2Vec results demonstrate its limitations in handling diverse languages.
BM25 responds quickly but produces irrelevant results due to language limitations in the corpus.
BERT successfully retrieves relevant results, showcasing its strength in handling multi-language queries.
In this article, we compared four different natural language processing (NLP) techniques — namely BM25, TF-IDF, Word2Vec, and BERT — to empirically find the most optimal solution for retrieving the most relevant results for a search query from a large corpus of products.
Our preliminary experiments show that the BM25 algorithm is the most time-efficient, while BERT provides the best trade-off between speed and relevance. This observation is in-line with what we see in practice in the real world, with BM25 being a popular algorithm for lexical search, and vector search being the preferred choice for semantic search. Hybrid search, a technique to combine the strengths of keyword and semantic search, is becoming increasingly common.
MongoDB Atlas supports lexical, vector, as well as hybrid search. Here are some resources to get you started:
Top Comments in Forums
There are no comments on this article yet.