
Written in the Stars: Predict Your Future With Tensorflow and MongoDB Charts

Anaiya Raisinghani • 15 min read • Published Aug 21, 2024 • Updated Aug 21, 2024
AI, Python, MongoDB
The girlies have been going through it. From situation-ships ending, to friend group drama, to little nuisances (like my packages getting stolen and then returned), it seems like the month of July has been hectic, to say the least, for myself and practically everyone I’ve talked to.
Every hangout session in the past couple weeks has ended with, “I don’t know what’s going on, but something is in the air!” Well, what if something isn’t in the air but rather, in the stars?
Let’s use TensorFlow and MongoDB Charts to analyze six months' worth of horoscope data to figure out if these mishaps are just down to chance or if they’re truly written into our destiny.
In this tutorial, we will use machine learning to perform sentiment analysis on a horoscopes dataset, TensorFlow to build and train a model using that dataset to see if things will get better in the near future, and MongoDB Charts to help visualize our data along the way.
Before we get started with uncovering our future, let’s first go over some of the important platforms used in this tutorial.

TensorFlow

TensorFlow is an open source library that developers can use to conduct all sorts of machine learning tasks. It was developed by Google Brain and is used across the board by developers and data scientists alike for a multitude of projects, such as natural language processing, image recognition, voice recognition, and so much more. TensorFlow will allow us to build out and train our machine learning model to let us know whether or not horoscopes are positive or negative, meaning we can have more insight into both past and future horoscopes.

MongoDB Charts

MongoDB Charts is a tool inside MongoDB Atlas that lets developers easily visualize the data stored in their collections. The greatest part about Charts is that users can see real-time visualizations as their databases change and evolve. This is perfect for our project since it will help us to better understand and visualize our data as we work through the tutorial.

Pre-requisites

In order to be successful with this tutorial, you will need:
  1. A basic understanding of TensorFlow (and a willingness to learn more!).
  2. An IDE of your choosing — this tutorial uses a Google Colab notebook.
  3. A MongoDB Atlas account.
  4. MongoDB Compass downloaded locally.
  5. The related dataset from Kaggle.
Once your MongoDB Atlas cluster has been created, we’re ready to start. Our goals for this tutorial are to use a pre-trained model to generate labels for our dataset, and then train a model from scratch using our labeled dataset. Keep this in mind while reading on through!

Dataset

To save us some time, I went ahead and scraped all the daily horoscopes from January 28, 2024 to (and including) July 28, 2024 for all 12 zodiac signs and saved them into a file with the date, the zodiac name, and the horoscope itself. These dates were chosen because they cover roughly the six months before this tutorial was created.
Please make sure you have downloaded the anaiya-six-months-horoscopes.csv file from Kaggle into your environment.
Once your file is downloaded and opened up, it’ll look like this:
[Image: .csv file with scraped horoscopes]
Once you have your dataset on hand, we can figure out each horoscope’s sentiment using sentiment analysis!

Sentiment analysis on my scraped horoscopes

Sentiment analysis is a way of figuring out emotional tone from a piece of text. We are only classifying our horoscopes as either positive or negative. We aren’t going to worry about any that are neutral.
Luckily for us, at this point in the tutorial, we don't need to build a model (yet!). We can use a pre-trained model to figure out whether or not our horoscopes from the past six months are positive or negative.
First, specify your imports. The important one in question is the transformers library from Hugging Face. This library includes a ton of pre-trained models that can help figure out text classification:
from transformers import AutoConfig
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import tensorflow as tf
Once everything is imported in, let's choose which pre-trained model we want to use. Since this is a TensorFlow tutorial, we can go ahead and use the "distilbert-base-uncased-finetuned-sst-2-english" model since it's compatible with TensorFlow, but there are a ton of options out there to choose from if you would like to switch it up.
In the code below, we load the pre-trained tokenizer, config, and model from the transformers library. The tokenizer converts our text into IDs, config holds all of our configuration information, and model is our pre-trained model loaded with TensorFlow. We then write a function that finds the sentiment of a piece of text: it tokenizes the text, runs it through the model, and turns the raw scores into probabilities with the softmax function. The label with the highest probability is returned.
# DistilBERT model we are using
distilbert = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(distilbert)
config = AutoConfig.from_pretrained(distilbert)
model = TFAutoModelForSequenceClassification.from_pretrained(distilbert)


def sentiment_finder(horoscope):
    # tokenize the text, truncating anything longer than 512 tokens
    inputs = tokenizer(horoscope, padding=True, truncation=True, max_length=512, return_tensors='tf')
    output = model(inputs)
    # turn the raw logits into probabilities
    scores = output.logits[0].numpy()
    scores = softmax(scores)
    # sort the class indices from most to least likely and return the top label
    ranking = np.argsort(scores)
    ranking = ranking[::-1]
    return config.id2label[ranking[0]]


# test and see if it works before we try it on our csv file
horoscope = "Things might get a bit confusing for you today, Capricorn. Don't feel like you need to make sense of it all. In fact, this task may be impossible. Just be yourself. Let your creative nature shine through. Other people are quite malleable, and you should feel free to take the lead in just about any situation. Make sure, however, that you consider other people's needs."
sentiment = sentiment_finder(horoscope)
print(f"Horoscope is {sentiment}")
Here, just to test and ensure the model was working properly, I copied in a horoscope that is on the negative spectrum.
[Image: testing distilbert model]
This is great! As we can see, that Capricorn horoscope is in fact negative, and we were able to use our pre-trained model to classify it.
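To make the softmax step a little less magical, here is a minimal sketch in plain Python of how two raw scores become probabilities and then a label. The logits are made-up numbers standing in for output.logits[0], and the id2label mapping is an assumption that mirrors what config.id2label holds for this model (0 for NEGATIVE, 1 for POSITIVE).

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # subtracting the largest score keeps exp() numerically stable
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# assumption: same label order as the model config (0 = NEGATIVE, 1 = POSITIVE)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# made-up logits standing in for output.logits[0]
logits = [1.2, -0.8]

probabilities = softmax(logits)
# pick the index with the highest probability, like ranking[0] in sentiment_finder
top_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
label = id2label[top_index]

print(probabilities)
print(label)
```

The bigger the gap between the two logits, the closer the winning probability gets to 1, which is a handy way to gauge how sure the model is about a given horoscope.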
But, now we need to make some changes because we don't want to put every horoscope in manually as individual strings. We want to:
  1. Use this pre-trained model, put in our .csv file of all our horoscopes, and figure out the sentiment analysis of each horoscope in our file.
  2. Incorporate a new "sentiment" column that will include 1 if the horoscope is positive and 0 if the horoscope is negative.
We are doing this so that when we get to the part where we build our own model, we have a labeled file of horoscopes and sentiments to train it on, and we can then test it on a completely new set of horoscopes it has not been trained on.
First, let’s write out a function to apply each sentiment to our horoscopes:
# function to apply sentiment against each horoscope
def apply_sentiment(horoscope):
    sentiment = sentiment_finder(horoscope)
    return 1 if sentiment == "POSITIVE" else 0
Now, we need to load in our anaiya-six-months-horoscopes.csv file. To make changes to our dataframe, let’s first go ahead and install the pandas library.
# import pandas
!pip install pandas
import pandas as pd
Read in our file:
df = pd.read_csv("anaiya-six-months-horoscopes.csv")
When we apply each sentiment, we have to make sure we have a column to hold them:
df["sentiment"] = df["horoscope"].apply(apply_sentiment)
Our last step for this part is to save our new dataframe to a new csv file. Feel free to name it whatever you like. I’m just naming it anaiya-six-months-horoscopes-sentiment.csv:
# index=False keeps pandas from writing its row index as an extra unnamed column
df.to_csv("anaiya-six-months-horoscopes-sentiment.csv", index=False)
print("saved to new file called anaiya-six-months-horoscopes-sentiment.csv")
If you’re using Colab, you’ll see the new file save on the left-hand side of the screen:
[Image: Our old horoscope file and our new file]
If we open it up, we can see that a new sentiment column has been added, with 1s for positive horoscopes and 0s for negative horoscopes.
[Image: our new sentiment .csv file]
Let’s go over to MongoDB Compass, upload our .csv file, and then visualize our data in MongoDB Charts, before we use this data to train and test a model using TensorFlow.

Visualize our data using MongoDB Charts

Since MongoDB Compass is great at taking in .csv files, let’s use Compass to our advantage.
You are not required to use MongoDB Compass to write in your data for this step, but it saves us a couple lines of code, so if you have it downloaded locally, please feel free to use it.
Whichever route you take, please make sure that you’ve successfully imported your data into a new database and collection. We are going to be using the same collection for MongoDB Charts, so this is very important.
Once that’s imported, click on the “Charts” tab at the top of your MongoDB Atlas dashboard and select your data source.
We can customize exactly how we want to see our data. I chose to put our sentiment on the Y-axis and our zodiacs on the X-axis. Then, I clicked “Subset” to see the spread across.
[Image: specifics of chart]
Let’s take a closer look at our newly developed chart:
[Image: Chart in MongoDB Charts of horoscopes and distribution]
So, from this chart we can see that there are a lot more positive horoscopes in our dataset than negative ones, and if we create another chart to showcase the totals, this is what we see:
[Image: Positive vs. Negative six-month chart]
We can see that there are 1,614 positive horoscopes and 582 negative horoscopes. This is actually great news — it means the stars aren’t against all of us! We can also see that Pisces and Leo have been thriving these past six months, whereas Scorpio is really getting the short end of the stick (sorry, Scorpio!).
This gives us great insight into how our dataset is skewed. It’s crucial to visualize your dataset before training a model so you can check that it’s balanced and that the model will learn from both classes.
From our visualization, it’s clear that our dataset is unbalanced, which means we are going to have to go through a couple steps to ensure we are feeding our model with a proper training set. More on this below in the “Balance dataset” section.
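To put a number on how unbalanced the data is, here is a quick arithmetic sketch using the two counts from the chart above. The "always positive" baseline is just an illustration I'm adding, not something the tutorial trains: it shows the accuracy a model would get by never predicting negative at all.

```python
# counts taken from the chart above
positive = 1614
negative = 582
total = positive + negative

positive_share = positive / total
negative_share = negative / total

# a "model" that answers positive every time would be right this often
baseline_accuracy = positive_share

print(f"total horoscopes: {total}")
print(f"positive share: {positive_share:.1%}")
print(f"negative share: {negative_share:.1%}")
print(f"always-positive baseline accuracy: {baseline_accuracy:.1%}")
```

Roughly three out of four horoscopes are positive, so a model could look decent on accuracy alone while learning nothing about negative horoscopes. That is the problem balancing is meant to address.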
Now that we have an idea of the data at hand, let’s get started to train and test our model in TensorFlow.

Train and test model in TensorFlow

We are using this dataset to train and test our model so that when we give it new horoscopes (for example, a weekly horoscope that isn’t in our training dataset), it can tell us whether they are positive or negative.
First, let’s import TensorFlow Hub:
import tensorflow_hub as hub
We can then load in our new .csv file, but we only need to focus on the columns we want to use, which are “horoscope” and “sentiment”:
df = pd.read_csv("anaiya-six-months-horoscopes-sentiment.csv")
df = df[["horoscope", "sentiment"]]
Our next step is to split up our sentiment dataset into three separate sets. We need a training set, a validation set, and a test set. Before we can do this, though, we need to ensure that our dataset is balanced!

Balance dataset

Balancing our dataset is necessary since we need to make sure that our model is trained on the same number of negative and positive horoscopes. Otherwise, our model might lean heavily toward predicting positive, simply because that is what it sees most, and it won’t learn to recognize negative horoscopes.
First, let’s shuffle our dataset using this simple line:
df = df.sample(frac=1, random_state=42)
Now, split our columns into X and y variables:
X = df["horoscope"]
y = df["sentiment"]
We are going to have to import some helpful libraries to achieve this. Because there are more positive horoscopes than negative ones, we will use Random Under Sampling, which removes random horoscopes from the majority class (the positive class) so that the model can learn from an equal distribution of positive and negative horoscopes. Note that we split off the test set first and only balance the training portion.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

# now do for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, shuffle=True, test_size=0.15, random_state=42)

# create our RandomUnderSampler object
rus = RandomUnderSampler(random_state=42, sampling_strategy='majority')

# apply our RUS technique
X_resampled, y_resampled = rus.fit_resample(X_train.to_frame(), y_train)

# now convert it back to our dataframe
balanced_trained = pd.DataFrame({'horoscope': X_resampled['horoscope'], 'sentiment': y_resampled})
Once this has been completed, let’s check to make sure we have the same number of negative and positive horoscopes in our training set:
sentiment_amount_training = balanced_trained['sentiment'].value_counts()
print(sentiment_amount_training)
[Image: amount of positive and negative horoscopes in our training set]
They both have 495! That’s fantastic.
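If you're curious what random undersampling boils down to, here is a hand-rolled sketch in plain Python. It is not how imblearn implements RandomUnderSampler; the random_undersample helper and the toy data are hypothetical, and the idea is simply to keep every example from the smaller class and a random sample of the same size from the larger one.

```python
import random
from collections import Counter


def random_undersample(texts, labels, seed=42):
    """Trim every class down to the size of the smallest class."""
    rng = random.Random(seed)

    # group the texts by their label
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    # the smallest class decides how many examples each class keeps
    smallest = min(len(items) for items in by_label.values())

    balanced = []
    for label, items in by_label.items():
        kept = rng.sample(items, smallest)
        balanced.extend((text, label) for text in kept)

    rng.shuffle(balanced)
    return balanced


# toy data: seven positive (1) and three negative (0) placeholder horoscopes
texts = [f"horoscope {i}" for i in range(10)]
labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

balanced = random_undersample(texts, labels)
counts = Counter(label for _, label in balanced)
print(counts)
```

The trade-off is that we throw away perfectly good positive horoscopes, which is one reason a larger dataset helps so much with this approach.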

Split up our dataset

Since our dataset is now balanced, we can go ahead and split up our dataset into our three separate sets for our TensorFlow model. Once again, they are our training set, our validation set, and our test set. You can split it up into any percentages that you like, but it’s recommended that you do an 80% training split and a 20% validation set split.
train, val = train_test_split(balanced_trained, test_size=0.2, stratify=balanced_trained['sentiment'], random_state=42)
Let’s quickly view the sizes of each set just to make sure we’re happy with the distribution:
print("Training set:", len(train))
print("Validation set:", len(val))
print("Test set:", len(X_test))
[Image: Our sets for training our model]
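As a sanity check, the set sizes can be worked out by hand from numbers already mentioned: 1,614 positive plus 582 negative horoscopes overall, and 495 of each class after balancing. This sketch assumes the held-out share of a split is rounded up to a whole row, so treat the results as approximate and compare them against your own printout.

```python
import math

# every labeled horoscope, before any splitting
total_rows = 1614 + 582

# 15% held out as the test set (assumption: the held-out share is rounded up)
test_rows = math.ceil(total_rows * 15 / 100)

# the balanced training data has 495 horoscopes per class
balanced_rows = 495 * 2

# 80/20 split of the balanced data into training and validation sets
val_rows = balanced_rows * 20 // 100
train_rows = balanced_rows - val_rows

print("Training set:", train_rows)
print("Validation set:", val_rows)
print("Test set:", test_rows)
```

Notice that the test set is not balanced. It keeps the original mix of positive and negative horoscopes, which is closer to what the model will see in the wild.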
Now, and this part is very crucial for the next section, we need to combine our X_test and y_test back into a single dataframe so that our column labels are in place when we convert to a TensorFlow dataset. They were separated when we split our data into X and y:
# combine back our X_test and y_test to a df
test = pd.DataFrame({'horoscope': X_test, 'sentiment': y_test})
Once you’ve run that, we can move on.

Convert to TensorFlow dataset

Now, let’s convert our dataframes to TensorFlow datasets so that we can properly load it into our model.
The function below, adapted from the TensorFlow documentation listed in the resources, converts each of our train, validation, and test dataframes into a TensorFlow dataset, and then shuffles everything again and batches the data for you!
# code changed to meet my specific needs
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    df = dataframe.copy()
    labels = df.pop('sentiment')
    df = df['horoscope']
    ds = tf.data.Dataset.from_tensor_slices((df, labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds
I’ve made a couple of changes to it, so if you’re using a different dataset, please feel free to incorporate your own changes.
train_data = df_to_dataset(train)
val_data = df_to_dataset(val)
test_data = df_to_dataset(test)
Just to check and make sure everything has been TensorFlow-ified, let’s write one line to see what our train_data looks like:
list(train_data)[0]
[Image: Our dataset has been converted to a TensorFlow dataset]
Looks great to me! Now, we want to embed and build out our model.
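For a feel of what the shuffle and batch steps do to the data, here is a rough plain-Python equivalent. It is only a sketch: shuffle_and_batch is a hypothetical helper, the examples are placeholders, and the prefetch step is left out because it affects loading speed rather than the contents of the batches.

```python
import random


def shuffle_and_batch(examples, batch_size=32, seed=42):
    """Shuffle the examples, then cut them into consecutive batches."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


# 100 placeholder (horoscope, sentiment) pairs
examples = [(f"horoscope {i}", i % 2) for i in range(100)]

batches = shuffle_and_batch(examples)
batch_sizes = [len(batch) for batch in batches]

print(len(batches))
print(batch_sizes)
```

The last batch is usually smaller than the rest because the number of examples rarely divides evenly by the batch size. That is also why list(train_data)[0] shows a batch of horoscopes and labels rather than a single pair.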

Embedding layer

The embedding layer we will be using is from TensorFlow Hub. You can do this step manually yourself, but there’s no need to reinvent the wheel! We can do this with two simple lines:
# using the embedding layer from tensorflow hub
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, dtype=tf.string, trainable=True)
Keep in mind that the trainable=True ensures that these layers also get trained along with the rest of the model we are building below.
This embedding layer converts our text into embeddings for the model to better understand, as machines don’t understand text, only numbers. Since we’re using the “nnlm-en-dim50/2” layer, each English horoscope we pass in is turned into a single 50-dimensional embedding.

Model

Now, we can get started on building out our neural network TensorFlow model. We want various layers here built using the Sequential model since it’s a way of stacking the layers one by one, and it is the easiest model to understand and visualize. We are also going to be using Dropout layers since it’s a good way to prevent overfitting, which can lead your model astray and provide inaccurate results.
We are going to be using three dropout layers with rates of 40%, 30%, and 20%, meaning that during training this percentage of the preceding layer’s outputs will be randomly dropped out, or set to zero, so that the model can’t lean too heavily on any single neuron and generalizes better to new data.
# model
model = tf.keras.Sequential() # since layer by layer so sequential. most basic form
model.add(hub_layer)
model.add(tf.keras.layers.Dense(128, activation='relu')) # first neural network layer
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(64, activation='relu')) # second layer
model.add(tf.keras.layers.Dropout(0.3)) # another dropout layer
model.add(tf.keras.layers.Dense(32, activation='relu')) # third layer
model.add(tf.keras.layers.Dropout(0.2)) # another dropout layer
# sigmoid is used for binary, so great for sentiment analysis
model.add(tf.keras.layers.Dense(1, activation='sigmoid')) # output layer
I decided to put in three different Dense layers since they extract and learn complex patterns and features from the embedding vectors and pass them on to the next layer.
In our output layer, we’ve used sigmoid as our activation function since it always returns a value between 0 and 1, so it’ll be perfect for sentiment analysis.
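To see what a Dropout layer does to the values flowing through it, here is a small plain-Python sketch of the idea. The dropout helper is hypothetical and works on a simple list rather than a tensor: during training, each value is zeroed with probability equal to the rate, and the survivors are scaled up by 1 / (1 - rate) so the overall magnitude stays about the same.

```python
import random


def dropout(activations, rate, seed=0):
    """Zero each value with probability `rate` and rescale the survivors."""
    rng = random.Random(seed)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale for value in activations]


# ten placeholder activations coming out of a Dense layer
activations = [1.0] * 10

dropped = dropout(activations, rate=0.4)
print(dropped)
```

Dropout is only active while the model is training. When you evaluate or predict, the values pass through untouched, so predictions stay deterministic.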
Let’s compile and then run our model:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
I am using a learning rate of 0.0001. The learning rate controls how large a step the optimizer takes each time it updates your model’s weights, so it has a big influence on how quickly (and how stably) the model learns. It’s normally a small number between 0 and 1, and you can use whichever value makes the most sense for your use case. Definitely play around with the learning rate and see your results.
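Here is a toy illustration of how the learning rate changes training. It uses plain gradient descent on a made-up one-variable function, (w - 3)^2, rather than the Adam optimizer and the real model, so take it as a sketch of the general idea only.

```python
def gradient_descent(learning_rate, steps=100, start=0.0):
    """Minimize (w - 3)^2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3.0)  # derivative of (w - 3)^2 with respect to w
        w -= learning_rate * gradient
    return w


# same number of steps, two very different step sizes
slow = gradient_descent(learning_rate=0.0001)
fast = gradient_descent(learning_rate=0.1)

print(f"learning rate 0.0001 ends at w = {slow:.4f}")
print(f"learning rate 0.1 ends at w = {fast:.4f}")
```

The best value of w is 3. With the tiny learning rate, 100 steps barely move away from the starting point, while the larger one gets there comfortably. Go too large, though, and the updates can overshoot and never settle, which is why it's worth experimenting.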
Once our model has been compiled, we can train our model on our training data while using the validation data to check how well the model is doing on data it isn’t being trained on after each epoch (a single pass through the entire training set) runs:
history = model.fit(train_data, epochs=5, validation_data=val_data)
[Image: output after running the model]
We can learn a ton about our model from the output. Here, we can see that our training loss decreased from about 0.66 to 0.47, while our validation loss decreased from about 0.68 to 0.59. We can also see that our training accuracy went from 60% to 89%, while our validation accuracy increased from 60% to 69%.
So it’s great that we have decreases in both our training loss and our validation loss. A decrease in training loss means the model is fitting the training data better, while a decrease in validation loss means it is also improving on data it wasn’t trained on. The gap between 89% training accuracy and 69% validation accuracy does suggest some overfitting, though.
Validation accuracy at 69% is fine, but to improve this number, we would probably need more epochs, a larger training set, more layers, etc. Have fun finetuning and seeing your results with these changes!
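It can help to know what the loss numbers actually measure. Below is a minimal sketch of binary cross-entropy for a single prediction, written in plain Python; the probabilities are made up for illustration and are not taken from the model.

```python
import math


def binary_cross_entropy(y_true, y_pred):
    """Loss for one example: y_true is 0 or 1, y_pred is the predicted probability of 1."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))


# three made-up predictions for a horoscope whose true label is positive (1)
coin_flip = binary_cross_entropy(1, 0.5)
confident_right = binary_cross_entropy(1, 0.9)
confident_wrong = binary_cross_entropy(1, 0.1)

print(f"predicting 0.5: {coin_flip:.4f}")
print(f"predicting 0.9: {confident_right:.4f}")
print(f"predicting 0.1: {confident_wrong:.4f}")
```

A model that shrugs and predicts 0.5 for everything scores about 0.69. Read as raw loss values, the starting losses reported above (about 0.66 and 0.68) sit right around that mark, so the first epoch is barely better than a coin flip and the later epochs are where the learning shows up.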
Let’s quickly evaluate our model on our test dataset and check the overall loss and accuracy:
loss, accuracy = model.evaluate(test_data)
print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')
[Image: after evaluating on our test dataset]
Here, we can see that our loss is fairly high at about 0.59 while our accuracy is 69%. Since our dataset is relatively small and we only used five epochs, this is honestly to be expected, and we should rather focus on the fact that our validation accuracy has been increasing with each epoch and its loss has been decreasing.
Let’s go ahead and see if we can now use our model to predict the sentiment of unseen horoscopes!

New horoscope prediction

For our function to predict a new horoscope, we have to do a couple of things:
  1. Convert our horoscope string to a tensor that TensorFlow can understand.
  2. Have our model make a prediction, which it returns as a probability.
From this probability, we can tell whether the new horoscope is positive or negative.
Getting the prediction from the probability proved to be the hardest part, but this can be done easily with tf.squeeze, which strips the extra dimensions from the model’s output so we are left with a single number.
We can do it as such:
def predict_sentiment(horoscope):
    # convert to a tensor so tensorflow can understand
    encoded_input = tf.constant([horoscope])

    prediction = model.predict(encoded_input)

    # prediction from probability
    probability = tf.squeeze(prediction).numpy()
    print(f"model probability: {probability}")

    # set it so that we can see if it's negative or positive
    sentiment = 1 if probability > 0.5 else 0

    return sentiment


# daily horoscope
positive_horoscope = "You're incredibly productive, with good business sense, Libra."
negative_horoscope = "This isn't the most cheerful time, Leo, because important issues are rearing their heads again and forcing you to address them."
pos_sentiment = predict_sentiment(positive_horoscope)
neg_sentiment = predict_sentiment(negative_horoscope)

print(f"This should be positive: {pos_sentiment}")
print(f"This should be negative: {neg_sentiment}")
[Image: Correctly predicted positive and negative horoscopes]
It seems as though our model is properly tuned to understanding positive and negative sentiment. It’s important to note here that horoscopes are meant to be vague. It’s rare to find a horoscope that is entirely negative or entirely positive. So it’s expected that the predictions of a model trained on horoscopes will often hover around the 0.5 boundary, since a lot of horoscopes are created to be up for interpretation.
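To show why predictions near 0.5 are the shaky ones, here is a short plain-Python sketch of the sigmoid function and the 0.5 cut-off. The raw scores are invented for illustration, and the to_label helper is hypothetical; it simply mirrors the comparison inside predict_sentiment.

```python
import math


def sigmoid(x):
    """Squash any raw score into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def to_label(probability, threshold=0.5):
    """Same rule as predict_sentiment: 1 for positive, 0 for negative."""
    return 1 if probability > threshold else 0


# made-up raw scores, from clearly negative to clearly positive
raw_scores = [-3.0, -0.2, 0.0, 0.2, 3.0]

for raw in raw_scores:
    probability = sigmoid(raw)
    distance = abs(probability - 0.5)
    print(f"raw {raw:+.1f} -> probability {probability:.3f} -> label {to_label(probability)} (distance from 0.5: {distance:.3f})")
```

A tiny nudge to a raw score near zero flips the label, while scores far from zero stay put. If you want to be cautious, you could treat anything within a small distance of 0.5 as "too vague to call" instead of forcing a positive or negative label.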

Will things get better in the future?

Now, we can go ahead and check if things will be getting better for us going forward! For this part, I’ve saved the upcoming weekly horoscope from www.horoscope.com for all 12 zodiac signs. You can find these in the new-week-horoscopes2.csv file, also in my Kaggle.
Let’s run them through our model and see the output:
file = 'new-week-horoscopes2.csv'
df = pd.read_csv(file)

df['sentiment'] = df['horoscope'].apply(predict_sentiment)

for index, row in df.iterrows():
    zodiac = row['zodiac']
    horoscope = row['horoscope']
    sentiment = row['sentiment']
    print(f"{zodiac} horoscope is {sentiment}")
[Image: Prediction of weekly horoscopes that model hasn’t seen before]
Now, let’s save it in MongoDB in the same database we created in the beginning of our tutorial, but under a new collection, so we can use MongoDB Charts to visualize our new horoscopes:
First, install pymongo:
!pip install pymongo
Now, we can connect to our MongoDB Atlas cluster and write in our data:
# first connect to MongoDB Atlas
import getpass
from pymongo import MongoClient

# set up your MongoDB connection
connection_string = getpass.getpass(prompt="Enter connection string WITH USER + PASS here")
client = MongoClient(connection_string)

# we are creating a new collection in the same database as before
database = client['horoscopes']
collection = database['new_week_horoscope']

for index, row in df.iterrows():
    zodiac = row['zodiac']
    horoscope = row['horoscope']
    sentiment = row['sentiment']

    # int() makes sure the label is a plain Python integer before it is stored
    document = {
        "zodiac": zodiac,
        "horoscope": horoscope,
        "sentiment": int(sentiment)
    }

    collection.insert_one(document)
print("saved in! go check")
[Image: New collection with our new_week_horoscope]
Our horoscopes are saved! Let’s go visualize them.
[Image: visualization using MongoDB Charts on our new horoscopes]
After graphing our new week horoscopes, it seems like everyone will have a great week except for Cancer and Pisces. We can’t all be winners.
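If you ever have more than a dozen rows to save, inserting them one at a time gets slow. A possible alternative is to build the whole list of documents first and send it in a single call. The sketch below only builds the list, using placeholder rows and a hypothetical rows_to_documents helper, so it runs without a database connection.

```python
def rows_to_documents(rows):
    """Turn row dictionaries into documents ready to be stored in MongoDB."""
    documents = []
    for row in rows:
        documents.append({
            "zodiac": row["zodiac"],
            "horoscope": row["horoscope"],
            # cast to a plain Python int so the value is stored as a number
            "sentiment": int(row["sentiment"]),
        })
    return documents


# placeholder rows standing in for the dataframe; df.to_dict("records") gives this shape
rows = [
    {"zodiac": "Aries", "horoscope": "placeholder horoscope text", "sentiment": 1},
    {"zodiac": "Taurus", "horoscope": "another placeholder horoscope", "sentiment": 0},
]

documents = rows_to_documents(rows)
print(documents)

# with a live connection, one call would then save everything:
# collection.insert_many(documents)
```

For 12 zodiac signs the difference is negligible, so the loop is perfectly fine here. This is more of a habit worth picking up for bigger datasets.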

Conclusion and learnings

In this tutorial, we conducted sentiment analysis on our .csv file to figure out whether our horoscopes were positive or negative. We then created a new .csv file that included our sentiment analysis to feed into a TensorFlow model so it could accurately predict the sentiment of horoscopes it hasn’t seen before, and then we used this trained model to predict our horoscopes going forward. Along the way, we utilized MongoDB Charts to visualize our data in a fun way!
This tutorial was super fun to create and I hope you all enjoyed reading it. My biggest piece of advice when working with TensorFlow models, and especially when training and building your own, is to ensure your dataset is balanced and to change things up when fine tuning. If you’re using your own data, I also highly recommend making sure your dataset is large enough. This was originally done on only three weeks' worth of horoscopes and the results were not great. For the model itself, you can play around with the layers, the dropout rate, the learning rate, and even the number of epochs. Please keep in mind too that if you’re reshuffling and rebalancing your training data, your results will vary each time you run your model. Good luck!
If you have any questions or want to connect, please join us in the MongoDB Developer Community.

Resources used to help build this tutorial

Sentiment analysis: Learn more about sentiment analysis with BERT.
Building and training your model: To understand how to build and train my model, I watched this video from freeCodeCamp.org and I used some skeleton code from these TensorFlow docs.
Balancing dataset: For help on balancing your dataset, please take a look at this resource from Medium and this resource from Semaphoreci.
Converting to a TensorFlow dataset: TensorFlow documentation