Written in the Stars: Predict Your Future With Tensorflow and MongoDB Charts
The girlies have been going through it. From situation-ships ending, to friend group drama, to little nuisances (like my packages getting stolen and then returned), it seems like the month of July has been hectic, to say the least, for myself and practically everyone I’ve talked to.
Every hangout session in the past couple weeks has ended with, “I don’t know what’s going on, but something is in the air!” Well, what if something isn’t in the air but rather, in the stars?
Let’s use TensorFlow and MongoDB Charts to analyze six months’ worth of horoscope data to figure out if these mishaps are just up to chance or if they’re truly written into our destiny.
In this tutorial, we will use machine learning to perform sentiment analysis on a horoscopes dataset, TensorFlow to build and train a model using that dataset to see if things will get better in the near future, and MongoDB Charts to help visualize our data along the way.
Before we get started with uncovering our future, let’s first go over some of the important platforms used in this tutorial.
TensorFlow is an open source library that developers can use to conduct all sorts of machine learning tasks. It was developed by Google Brain and is used across the board by developers and data scientists alike for a multitude of projects, such as natural language processing, image recognition, voice recognition, and so much more. TensorFlow will allow us to build out and train our machine learning model to let us know whether or not horoscopes are positive or negative, meaning we can have more insight into both past and future horoscopes.
MongoDB Charts is a tool inside of MongoDB Atlas that allows for developers to easily visualize the data stored in their collections. The greatest part about Charts is that it allows for users to see real-time visualizations as their databases change and evolve. This is perfect for our project since it will help us to better understand and visualize our data as we work through the tutorial.
In order to be successful with this tutorial, you will need:
- A basic understanding of TensorFlow (and a willingness to learn more!).
- A MongoDB Atlas account with a cluster deployed (the free tier works fine).
- MongoDB Compass downloaded locally.
Once your MongoDB Atlas cluster has been created, we’re ready to start. Our goals for this tutorial are to use a pre-trained model to generate labels for our dataset, and then train a model from scratch using our labeled dataset. Keep this in mind while reading on through!
To save us some time, I went ahead and scraped all the daily horoscopes from January 28, 2024 to (and including) July 28, 2024 for all 12 zodiacs and saved them into a file with the date, the zodiac name, and the horoscope itself. These dates were chosen because they are from about six months before this tutorial was created.
Please make sure you have downloaded the `anaiya-six-months-horoscopes.csv` file from Kaggle into your environment. Once your file is downloaded, when opened up, it’ll look like this:
Once you have your dataset on hand, we can figure out each horoscope’s sentiment using sentiment analysis!
Sentiment analysis is a way of figuring out the emotional tone of a piece of text. We are only classifying our horoscopes as either positive or negative; we aren’t going to worry about any that are neutral.
Luckily for us, at this point in the tutorial, we don't need to build a model (yet!). We can use a pre-trained model to figure out whether or not our horoscopes from the past six months are positive or negative.
First, specify your imports. The important one in question is the `transformers` library from Hugging Face. This library includes a ton of pre-trained models that can help with text classification:

```python
from transformers import AutoConfig
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import tensorflow as tf
```
Once everything is imported in, let's choose which pre-trained model we want to use. Since this is a TensorFlow tutorial, we can go ahead and use the "distilbert-base-uncased-finetuned-sst-2-english" model since it's compatible with TensorFlow, but there are a ton of options out there to choose from if you would like to switch it up.
In the code below, we load in the pre-trained `tokenizer`, `config`, and `model` from the `transformers` library. The `tokenizer` converts our text into IDs, `config` holds all of our configuration information, and `model` is our pre-trained model loaded with TensorFlow. We then write a function that finds the sentiment of a horoscope: it tokenizes the text, computes the probability of each sentiment using the `softmax` function, and returns the label with the highest probability.

```python
# distilbert model we are using
distilbert = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(distilbert)
config = AutoConfig.from_pretrained(distilbert)
model = TFAutoModelForSequenceClassification.from_pretrained(distilbert)


def sentiment_finder(horoscope):
    inputs = tokenizer(horoscope, padding=True, truncation=True, max_length=512, return_tensors='tf')
    output = model(inputs)
    scores = output.logits[0].numpy()
    scores = softmax(scores)
    ranking = np.argsort(scores)
    ranking = ranking[::-1]
    return config.id2label[ranking[0]]


# test and see if this works before we try it on our csv file
horoscope = "Things might get a bit confusing for you today, Capricorn. Don't feel like you need to make sense of it all. In fact, this task may be impossible. Just be yourself. Let your creative nature shine through. Other people are quite malleable, and you should feel free to take the lead in just about any situation. Make sure, however, that you consider other people's needs."
sentiment = sentiment_finder(horoscope)
print(f"Horoscope is {sentiment}")
```
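To make the `softmax` step concrete, here’s a tiny numeric sketch with made-up logits (real values would come from the model’s output):

```python
import numpy as np
from scipy.special import softmax

# hypothetical raw logits from the classifier: [NEGATIVE, POSITIVE]
logits = np.array([2.1, -1.3])

probs = softmax(logits)  # turns logits into probabilities that sum to 1
top = int(np.argsort(probs)[::-1][0])  # index of the most likely label

print(probs)  # the NEGATIVE entry dominates for these logits
print(top)    # 0, i.e., NEGATIVE
```

Whatever the logits are, `softmax` always produces values between 0 and 1 that sum to 1, which is why we can read them as probabilities.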
Here, just to test and ensure the model was working properly, I copied in a horoscope that is on the negative spectrum.
This is great! As we can see, that Capricorn horoscope is in fact negative, and we were able to use our pre-trained model to classify it.
But now we need to make some changes, because we don't want to put every horoscope in manually as individual strings. We want to:
- Use this pre-trained model on our `.csv` file of horoscopes to figure out the sentiment of each horoscope in the file.
- Incorporate a new "sentiment" column that will include 1 if the horoscope is positive and 0 if the horoscope is negative.
We are doing this so that when we build our own model later, we can train it on a labeled file of horoscope data and then use it on a completely new set of horoscopes it has never seen.
First, let’s write out a function to apply each sentiment to our horoscopes:
```python
# function to apply sentiment against each horoscope
def apply_sentiment(horoscope):
    sentiment = sentiment_finder(horoscope)
    return 1 if sentiment == "POSITIVE" else 0
```
Now, we need to load in our `anaiya-six-months-horoscopes.csv` file. To make changes to our dataframe, let’s first go ahead and install the `pandas` library:

```python
# install and import pandas
!pip install pandas
import pandas as pd
```
Read in our file:
```python
df = pd.read_csv("anaiya-six-months-horoscopes.csv")
```
Now, apply the function to every horoscope and store the results in a new "sentiment" column:

```python
df["sentiment"] = df["horoscope"].apply(apply_sentiment)
```
Our last step for this part is to save our new dataframe to a new `.csv` file. Feel free to name it whatever you like. I’m just naming it `anaiya-six-months-horoscopes-sentiment.csv`:

```python
df.to_csv("anaiya-six-months-horoscopes-sentiment.csv")
print("saved to new file called anaiya-six-months-horoscopes-sentiment.csv")
```
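One small gotcha worth flagging: by default, `to_csv` also writes the dataframe’s index as an extra, unnamed column, which will show up when the file is re-read later. Passing `index=False` avoids this. A quick illustration with a toy dataframe (not the tutorial’s real data):

```python
import pandas as pd

# toy dataframe just for the demonstration
demo_df = pd.DataFrame({"horoscope": ["a", "b"], "sentiment": [1, 0]})

demo_df.to_csv("with_index.csv")                  # index written as an extra column
demo_df.to_csv("without_index.csv", index=False)  # only the real columns

print(pd.read_csv("with_index.csv").columns.tolist())     # includes 'Unnamed: 0'
print(pd.read_csv("without_index.csv").columns.tolist())  # just horoscope and sentiment
```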
If you’re using Colab, you’ll see the new file save on the left-hand side of the screen:
If we open it up, we can see that a new sentiment column has been incorporated with 1s for positive horoscopes and 0s for negative horoscopes.
Let’s go over to MongoDB Compass, upload our `.csv` file, and then visualize our data in MongoDB Charts before we use this data to train and test a model using TensorFlow. It is not required to use MongoDB Compass in this step to write in your data, but it saves us a couple lines of code, so if you have it downloaded locally, please feel free to use it.
Please make sure it’s downloaded locally and that you’ve successfully imported your data into a new database and collection. We are going to be using the same collection for MongoDB Charts so this is very important.
Once that’s imported, click on the “Charts” tab at the top of your MongoDB Atlas dashboard and select your data source.
We can customize exactly how we want to see our data. I chose to put our sentiment on the Y-axis and our zodiacs on the X-axis. Then, I clicked “Subset” to see the spread across.
Let’s take a closer look at our newly developed chart:
So, from this chart we can see that there are a lot more positive horoscopes in our dataset than negative ones, and if we create another chart to showcase this amount, this is what we see:
We can see that there are 1,614 positive horoscopes and 582 negative horoscopes. This is actually great news — it means the stars aren’t against all of us! We can also see that Pisces and Leo have been thriving these past six months, whereas Scorpio is really getting the short end of the stick (sorry, Scorpio!).
This gives us great insight into how our dataset is skewed. It’s crucial to visualize your dataset prior to training a model to make sure that it’s balanced and your model can learn correctly.
From our visualization, it’s clear that our dataset is unbalanced, which means we are going to have to go through a couple steps to ensure we are feeding our model with a proper training set. More on this below in the “Balance dataset” section.
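Charts gives us the visual, but it’s also easy to sanity-check the imbalance in pandas before training. A sketch, using a toy dataframe in place of the real file (which showed 1,614 positive vs. 582 negative horoscopes):

```python
import pandas as pd

# toy stand-in for the labeled horoscope dataframe
toy = pd.DataFrame({"sentiment": [1, 1, 1, 1, 0]})

counts = toy["sentiment"].value_counts()
ratio = counts.max() / counts.min()
print(counts)
print(f"majority class is {ratio:.1f}x the size of the minority class")
```

On the real dataset, that ratio comes out to roughly 2.8x, which is more than enough skew to justify the balancing step below.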
Now that we have an idea of the data at hand, let’s train and test our model in TensorFlow.
We are using this dataset to train and test our model so that when we feed it new horoscopes (for example, a weekly horoscope that isn’t in the training data), it can tell us whether they are positive or negative.
First, let’s import our TensorFlow hub:
```python
import tensorflow_hub as hub
```
We can then load in our new `.csv` file, keeping only the columns we want to use, which are “horoscope” and “sentiment”:

```python
df = pd.read_csv("anaiya-six-months-horoscopes-sentiment.csv")
df = df[["horoscope", "sentiment"]]
```
Our next step is to split up our sentiment dataset into three separate sets. We need a training set, a validation set, and a test set. Before we can do this, though, we need to ensure that our dataset is balanced!
Balancing our dataset is necessary since we need to make sure that our model is trained on the same amount of negative and positive horoscopes. Otherwise, our model might only be trained on positive horoscopes and it won’t learn to recognize negative horoscopes.
First, let’s shuffle our dataset using this simple line:
```python
df = df.sample(frac=1, random_state=42)
```
Now, split our columns into X and y variables:

```python
X = df["horoscope"]
y = df["sentiment"]
```
We need to import some helpful libraries to achieve this. Since there are more positive horoscopes than negative ones, we are conducting random undersampling: it removes random horoscopes from the majority class (the positive horoscope class) to ensure a more equal distribution of positive and negative horoscopes for the model to learn from.
```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, shuffle=True, test_size=0.15, random_state=42)

# create our RandomUnderSampler object
rus = RandomUnderSampler(random_state=42, sampling_strategy='majority')

# apply our RUS technique
X_resampled, y_resampled = rus.fit_resample(X_train.to_frame(), y_train)

# now convert it back to a dataframe
balanced_trained = pd.DataFrame({'horoscope': X_resampled['horoscope'], 'sentiment': y_resampled})
```
Once this has been completed, let’s check to make sure we have the same number of negative and positive horoscopes in our training set:
```python
sentiment_amount_training = balanced_trained['sentiment'].value_counts()
print(sentiment_amount_training)
```
They both have 495! That’s fantastic.
Since our dataset is now balanced, we can go ahead and split up our dataset into our three separate sets for our TensorFlow model. Once again, they are our training set, our validation set, and our test set. You can split it up into any percentages that you like, but it’s recommended that you do an 80% training split and a 20% validation set split.
```python
train, val = train_test_split(balanced_trained, test_size=0.2, stratify=balanced_trained['sentiment'], random_state=42)
```
Let’s quickly view the sizes of each set just to make sure we’re happy with the distribution:
```python
print("Training set:", len(train))
print("Validation set:", len(val))
print("Test set:", len(X_test))
```
Now, and this part is very crucial for the next section, we need to combine our `X_test` and `y_test` back into a dataframe so that our column labels are back in place for when we convert to a TensorFlow dataset, as they were removed when balancing:

```python
# combine X_test and y_test back into a dataframe
test = pd.DataFrame({'horoscope': X_test, 'sentiment': y_test})
```
Once you’ve run that, we can move on.
Now, let’s convert our dataframes to TensorFlow datasets so that we can properly load them into our model.
The function below converts each of our train, validation, and test dataframes into a TensorFlow dataset, and then shuffles and batches the data for you!
```python
# code changed to meet my specific needs
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    df = dataframe.copy()
    labels = df.pop('sentiment')
    df = df['horoscope']
    ds = tf.data.Dataset.from_tensor_slices((df, labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds
```
I’ve made a couple of changes to it, but if you’re using a different dataset, please feel free to incorporate your own changes.
```python
train_data = df_to_dataset(train)
val_data = df_to_dataset(val)
test_data = df_to_dataset(test)
```
Just to check and make sure everything has been TensorFlow-ified, let’s write one line to see what our `train_data` looks like:

```python
list(train_data)[0]
```
Looks great to me! Now, we want to embed and build out our model.
The embedding layer we will be using is from TensorFlow Hub. You can do this step manually yourself, but there’s no need to reinvent the wheel! We can do this with two simple lines:
```python
# using the embedding layer from tensorflow hub
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, dtype=tf.string, trainable=True)
```
Keep in mind that `trainable=True` ensures that this layer also gets trained along with the rest of the model we are building below. The embedding layer converts our text into embeddings the model can better understand, as machines don’t understand text, only numbers. Since we’re using the “nnlm-en-dim50/2” layer, we’re producing 50-dimensional embeddings for English words.
Now, we can get started on building out our neural network TensorFlow model. We want various layers here built using the Sequential model since it’s a way of stacking the layers one by one, and it is the easiest model to understand and visualize. We are also going to be using Dropout layers since it’s a good way to prevent overfitting, which can lead your model astray and provide inaccurate results.
We are going to be using three dropout layers with rates of 40%, 30%, and 20%, meaning that this percentage of our neural network’s neurons will be randomly dropped out, or set to zero, during training so the model doesn’t become overly dependent on any particular neurons and overfit the data.
```python
# model
model = tf.keras.Sequential()  # layer by layer, so Sequential: the most basic form
model.add(hub_layer)
model.add(tf.keras.layers.Dense(128, activation='relu'))  # first neural network layer
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(64, activation='relu'))  # second layer
model.add(tf.keras.layers.Dropout(0.3))  # another dropout layer
model.add(tf.keras.layers.Dense(32, activation='relu'))  # third layer
model.add(tf.keras.layers.Dropout(0.2))  # another dropout layer
# sigmoid is used for binary classification, so great for sentiment analysis
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))  # output layer
```
I decided to put in three different Dense layers since they extract and learn complex patterns and features from the embedding vectors and pass them on to the next layer.
In our output layer, we’ve used `sigmoid` as our activation function since it always returns a value between 0 and 1, so it’ll be perfect for sentiment analysis. Let’s compile and then run our model:
```python
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
```
I am using a learning rate of 0.0001. The learning rate controls how much your model’s weights change with each update, so it has a big influence on how quickly the model learns; it’s normally a small number between 0 and 1. Use whichever learning rate makes the most sense for your use case, and definitely play around with it and compare your results.
Once our model has been compiled, we can train our model on our training data while using the validation data to check and see how well the model is doing on previously unseen data after each epoch (a single pass through the entire training cycle) runs:
```python
history = model.fit(train_data, epochs=5, validation_data=val_data)
```
We can learn a ton about our model from the output. Here, we can see that our training loss decreased from 0.66 to 0.47, while our validation loss decreased from 0.68 to 0.59. We can also see that our training accuracy went from 60% to 89%, while our validation accuracy increased from 60% to 69%.
It’s great that we have decreases in both our training loss and our validation loss. A decrease in training loss means the model is improving on the training data, while a decrease in validation loss means the model is generalizing well to unseen data.
Validation accuracy at 69% is fine, but to improve this number, we would probably need more epochs, a larger training set, more layers, etc. Have fun fine-tuning and seeing your results with these changes!
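One easy way to keep an eye on these numbers while you experiment is to plot the `history` object returned by `model.fit`. A sketch, with the per-epoch values hardcoded here for illustration (in practice you’d read them straight from `history.history`):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# in practice: hist = history.history (these values mirror the run described above)
hist = {
    "loss":         [0.66, 0.60, 0.55, 0.50, 0.47],
    "val_loss":     [0.68, 0.65, 0.62, 0.60, 0.59],
    "accuracy":     [0.60, 0.70, 0.78, 0.85, 0.89],
    "val_accuracy": [0.60, 0.63, 0.66, 0.68, 0.69],
}

epochs = range(1, len(hist["loss"]) + 1)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(epochs, hist["loss"], label="train loss")
ax1.plot(epochs, hist["val_loss"], label="val loss")
ax1.set_xlabel("epoch")
ax1.legend()
ax2.plot(epochs, hist["accuracy"], label="train accuracy")
ax2.plot(epochs, hist["val_accuracy"], label="val accuracy")
ax2.set_xlabel("epoch")
ax2.legend()
fig.savefig("training_curves.png")
```

A widening gap between the train and validation curves is the classic sign of overfitting, which is exactly what the dropout layers above are there to fight.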
Let’s quickly evaluate our model on our test dataset and check the overall loss and accuracy:
```python
loss, accuracy = model.evaluate(test_data)
print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')
```
Here, we can see that our loss is pretty high at 0.59 while our accuracy is 69%. Since our dataset is relatively small and we only used five epochs, this is honestly to be expected, and we should rather focus on the fact that our validation accuracy has been increasing with each epoch while its loss has been decreasing.
Let’s go ahead and see if we can now use our model to predict the sentiment of unseen horoscopes!
For our function to predict a new horoscope, we have to do a couple of things:
- Convert our horoscope string to a tensor that TensorFlow can understand.
- Have our model make a prediction, which returns a probability.
- From this probability, determine whether the new horoscope is positive or negative.
Getting the prediction from the probability proved to be the hardest part, but this can be done easily with `tf.squeeze`. We can do it as such:
```python
def predict_sentiment(horoscope):
    # convert to a tensor so tensorflow can understand it
    encoded_input = tf.constant([horoscope])

    prediction = model.predict(encoded_input)

    # extract the probability from the prediction
    probability = tf.squeeze(prediction).numpy()
    print(f"model probability: {probability}")

    # threshold at 0.5 to decide whether it's positive (1) or negative (0)
    sentiment = 1 if probability > 0.5 else 0

    return sentiment


# daily horoscopes
positive_horoscope = "You're incredibly productive, with good business sense, Libra."
negative_horoscope = "This isn't the most cheerful time, Leo, because important issues are rearing their heads again and forcing you to address them."
pos_sentiment = predict_sentiment(positive_horoscope)
neg_sentiment = predict_sentiment(negative_horoscope)

print(f"This should be positive: {pos_sentiment}")
print(f"This should be negative: {neg_sentiment}")
```
It seems as though our model is properly tuned to understanding positive and negative sentiment. It’s important to note here that horoscopes are meant to be vague. It’s rare to find a horoscope that is entirely negative or entirely positive. So it’s expected that a dataset that is trained on horoscopes likely hovers around the 0.5 boundary, since a lot of horoscopes are created to be up for interpretation.
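Since so many horoscopes sit near that 0.5 boundary, it can help to report how far from the boundary the model’s probability actually is, rather than just the 0/1 label. A small hypothetical helper (`sentiment_with_confidence` isn’t part of the tutorial’s model code, just an illustration of the idea):

```python
def sentiment_with_confidence(probability):
    """probability is the sigmoid output from model.predict, between 0 and 1."""
    label = 1 if probability > 0.5 else 0
    # 0.0 = sitting right on the boundary, 1.0 = completely certain
    confidence = abs(probability - 0.5) * 2
    return label, confidence


# a vague horoscope barely past the boundary vs. a clearly positive one
print(sentiment_with_confidence(0.52))
print(sentiment_with_confidence(0.97))
```

A low confidence value is a hint that the horoscope is one of those deliberately ambiguous, up-for-interpretation ones.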
Now, we can go ahead and check if things will be getting better for us going forward! For this part, I’ve saved the upcoming weekly horoscopes from www.horoscope.com for all 12 zodiac signs. You can find these in the `new-week-horoscopes2.csv` file, also in my Kaggle. Let’s run them through our model and see the output:
```python
file = 'new-week-horoscopes2.csv'
df = pd.read_csv(file)

df['sentiment'] = df['horoscope'].apply(predict_sentiment)

for index, row in df.iterrows():
    zodiac = row['zodiac']
    horoscope = row['horoscope']
    sentiment = row['sentiment']
    print(f"{zodiac} horoscope is {sentiment}")
```
Now, let’s save it in MongoDB in the same database we created in the beginning of our tutorial, but under a new collection, so we can use MongoDB Charts to visualize our new horoscopes:
First, install `pymongo`:

```
pip install pymongo
```
Now, we can connect to our MongoDB Atlas cluster and write in our data:
```python
# first connect to MongoDB Atlas
import getpass
from pymongo import MongoClient

# set up your MongoDB connection
connection_string = getpass.getpass(prompt="Enter connection string WITH USER + PASS here")
client = MongoClient(connection_string)

# we are creating a new collection in the same database as before
database = client['horoscopes']
collection = database['new_week_horoscope']

for index, row in df.iterrows():
    zodiac = row['zodiac']
    horoscope = row['horoscope']
    sentiment = row['sentiment']

    doc = {
        "zodiac": zodiac,
        "horoscope": horoscope,
        # cast from a numpy integer to a plain int so pymongo can encode it
        "sentiment": int(sentiment)
    }

    collection.insert_one(doc)

print("saved in! go check")
```
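As a side note, pymongo can also write all 12 documents in one round trip with `insert_many` instead of a loop of `insert_one` calls. A sketch of the document-building step using a toy dataframe (the `insert_many` call itself is left commented out since it needs a live cluster):

```python
import pandas as pd

# toy stand-in for the predictions dataframe
toy = pd.DataFrame({
    "zodiac": ["aries", "taurus"],
    "horoscope": ["...", "..."],
    "sentiment": [1, 0],
})

# build one document per row; cast sentiment so pymongo can encode it
docs = [
    {"zodiac": row["zodiac"], "horoscope": row["horoscope"], "sentiment": int(row["sentiment"])}
    for _, row in toy.iterrows()
]

print(len(docs))
# collection.insert_many(docs)  # one bulk write for the whole list
```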
Our horoscopes are saved! Let’s go visualize them.
After graphing our new week horoscopes, it seems like everyone will have a great week except for Cancer and Pisces. We can’t all be winners.
In this tutorial, we conducted sentiment analysis on our `.csv` file to figure out whether our horoscopes were positive or negative. We then created a new `.csv` file that included our sentiment labels and fed it into a TensorFlow model so it could accurately predict the sentiment of horoscopes it hasn’t seen before, and we used this trained model to predict our horoscopes going forward. Along the way, we utilized MongoDB Charts to visualize our data in a fun way!

This tutorial was super fun to create and I hope you all enjoyed reading it. My biggest piece of advice when working with TensorFlow models, especially when training and building your own, is to ensure your dataset is balanced and to change things up when fine-tuning. If you’re using your own data, I also highly recommend making sure your dataset is large enough: this was originally done on only three weeks’ worth of horoscopes, and the results were not great. For the model itself, you can play around with the layers, the dropout rates, the learning rate, and even the number of epochs. Keep in mind, too, that if you’re reshuffling and rebalancing your training data, your results will vary each time you run your model. Good luck!
Building and training your model:
To understand how to build and train my model, I watched this video from freeCodeCamp.org and I used some skeleton code from these TensorFlow docs.
Balancing dataset:
For help on balancing your dataset, please take a look at this resource from Medium and this resource from Semaphoreci.