Sip, Swig, and Search With Playwright, OpenAI, and MongoDB Atlas Search

Anaiya Raisinghani • 12 min read • Published Sep 27, 2024 • Updated Oct 01, 2024

AI, Python, Atlas

It’s not often that I find myself craving a 44-oz Diet Coke with cream, but ever since I watched the sensational Hulu original “The Secret Lives of Mormon Wives,” I’ve been thinking about it much more than I’d like to admit.

If you haven’t tried a “dirty” soda before, I highly recommend it. It’s a fun combination of any soda of your choice, half-and-half or creamer, and flavored syrups. If you don’t like soda, you’ll hate it but I fear that’s a given. I’ll admit I was skeptical, but if you prefer soda as your beverage, your biggest takeaway will be to wonder what other delicious things people in Utah are keeping from the rest of us.

After consulting with the professionals (a very quick Google search), I found that Swig is “home of the original dirty soda.” A look at their menu shows a ton of super unique drink options, with various flavor combinations. Since we’re heading into fall, let’s see if we can replace the traditional Pumpkin Spice Latte with something a bit less basic and a little worse for our teeth.

Let’s use Playwright to scrape the menu with the ingredients from their website, OpenAI’s structured outputs to help us decide which drinks are the most appropriate for each season, and MongoDB Atlas Search to filter out our “dirty” sodas based on their ingredients and what we’re craving.

Before we get started, let’s go over these platforms in a bit more detail just so we’re all on the same page.

What is Playwright and why are we using it?

Playwright is a powerful browser automation tool built by Microsoft. It drives browsers built on modern rendering engines, such as Chromium (which Google Chrome runs on), Firefox, and WebKit, and it lets developers open new browser pages, navigate to URLs, and interact with the elements on a page.

Playwright was chosen for this tutorial because of how simple it makes returning a website’s elements, especially from dynamic websites like the one we are scraping. Playwright has a ton of other incredible capabilities, but it is perfect for our use case because our drink items are loaded using JavaScript after the initial page loads. With other, more traditional Python web scrapers, I was getting time-out errors or empty lists since the menu items weren’t embedded in the raw HTML of the website. Playwright, on the other hand, handles JavaScript execution and can wait for the content to load before grabbing the information we need.

What are OpenAI’s Structured Outputs?

OpenAI’s new Structured Outputs make sure that API responses follow exactly the JSON schema the developer specifies. This works by constraining whichever supported model you use (we are using GPT-4o) to produce a response that matches that schema. This tutorial uses it to make sure that the drinks from the Swig menu come back in a structured JSON format, since we want to analyze the model’s responses later in the tutorial. I will go over how to do this in detail in the tutorial!

MongoDB Atlas Search

MongoDB Atlas Search is embedded full-text search inside of MongoDB Atlas, MongoDB’s cloud database service for developers. It’s crucial for this tutorial since we will be saving our scraped menu items into an Atlas cluster and then creating an aggregation pipeline on the data to find which drinks match our specific season and ingredients.

Tutorial prerequisites

These prerequisites are crucial to ensure we are successful in this tutorial.

An IDE of your choice: This tutorial uses a Google Colab notebook. Please feel free to follow along.
OpenAI API key: Using the API is a paid service, so you will need to pay for access.
MongoDB Atlas cluster: Please make sure that you are using the free tier, that you have set your IP address to “access from anywhere,” and that you have copied the connection string to a safe place for future reference.

Once you have your OpenAI API key and your connection string saved someplace safe and your MongoDB Atlas cluster provisioned, you are ready to begin!

Part 1: Scrape all menu items from Swig’s website

Inspect your website!

Before we write our main function to scrape the website, we need to inspect the website we want to scrape so that we can figure out where the information we need lives.

Head over to Swig, click on the “Dirty Dr Pepper” option or any soda option of your choice, and then click on the American Fork store. This is to see all the menu items available for a location. You can choose any location you like. I just picked American Fork since it was the first one shown.

Then, you’ll see all the menu items for that location as shown below:

Now, we can click on any soda, right click to inspect the page, and see what we’re working with.

The first thing I want to point out is that this website is a dynamic website. How do I know this?

I can see that when I try to inspect a menu item to figure out how to scrape our HTML, the information is located inside an iframe, meaning that the content I want to scrape is HTML that is embedded inside of another webpage. From this example, I can see that everything I want to find (and I know this because I highlighted and inspected other parts, such as the name “Dirty Dr Pepper” and the ingredients “Dr Pepper + Coconut (120 - 440 Calories)”) is nested under the same iframe URL, https://swig-orders.crispnow.com/tabs/locations/menu.

So what does a dynamic website mean? It means I cannot just scrape the HTML from the top level directly, as the website is changing and doesn’t load entirely in one go. Luckily, Playwright is a scraper that can handle waiting for the other page elements to load before scraping, but not all web scrapers are equal in this. Prior to realizing this, and when using a different scraper to try and achieve my goal, I was getting empty outputs every time I ran my code. So please, please inspect your web page before you try and scrape!
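To make the iframe point concrete, here is a minimal sketch that uses only Python's standard library. The `OUTER_HTML` string is a made-up stand-in for a top-level page (it is not Swig's real markup), and `PageInspector` and `inspect_html` are hypothetical helper names. The idea is simply that the outer HTML contains no menu cards, only an iframe whose `src` is the URL we actually want to scrape.

```python
from html.parser import HTMLParser

# Hypothetical outer page, written for illustration only: the menu is not in
# this HTML at all, it is loaded inside the iframe.
OUTER_HTML = """
<html>
  <body>
    <h1>Order online</h1>
    <iframe src="https://swig-orders.crispnow.com/tabs/locations/menu"></iframe>
  </body>
</html>
"""


class PageInspector(HTMLParser):
    """Collects iframe URLs and counts menu cards in an HTML string."""

    def __init__(self):
        super().__init__()
        self.iframe_sources = []
        self.card_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.iframe_sources.append(src)
        elif tag == "ion-card-content":
            self.card_count += 1


def inspect_html(html):
    """Return (list of iframe URLs, number of menu cards) for an HTML string."""
    inspector = PageInspector()
    inspector.feed(html)
    return inspector.iframe_sources, inspector.card_count


sources, cards = inspect_html(OUTER_HTML)
print(sources)  # the iframe URL(s) worth scraping instead of the outer page
print(cards)    # no menu cards live in the outer HTML
```

A scraper that only reads the outer HTML would come back empty in a situation like this, which matches the empty outputs described above.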

Once we have a solid understanding of the website we are attempting to scrape, let’s write up our scraping function.

Write our scraping function

So now, let’s scrape all of the menu items Swig offers so we can see which combinations are best suited for fall. We want to get the name of each menu item and its description. An example of this is the name “Dirty Dr Pepper” and the description “Dr Pepper + Coconut (120 - 440 Calories).”

To do this, let’s first install Playwright itself. This can be done with a simple pip statement, and please keep in mind we are running this in our Google Colab notebook:

!pip install playwright
!playwright install

Now, let’s define our imports. Because a notebook already runs its own event loop, Playwright’s sync API won’t work there, so we have to use the async API. If you’re working locally through your IDE, feel free to use sync.

import asyncio
from playwright.async_api import async_playwright
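If you are wondering how the async style differs between a notebook and a plain script, here is a small sketch. `fake_scraper` is a hypothetical stand-in coroutine (it does no real scraping) used only to show how an async function gets run in each environment.

```python
import asyncio


async def fake_scraper():
    """Hypothetical stand-in for an async scraping function."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [{"name": "Dirty Dr Pepper", "description": "Dr Pepper + Coconut"}]


def run_from_script():
    # In a plain .py script no event loop is running yet, so asyncio.run()
    # creates one, runs the coroutine to completion, and closes the loop.
    return asyncio.run(fake_scraper())


# In a notebook cell the loop is already running, so you would instead write:
#     menu = await fake_scraper()
```

Calling `asyncio.run()` inside a notebook cell typically raises an error because a loop is already running there, which is why the notebook code in this tutorial uses a top-level `await`.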

Let’s use the name swigScraper for our definition. Once again, we are going to use async and then headless mode since we are using a notebook. Learn more about when to use headless vs. headed mode.

We also want to make sure we are using the correct URL. Remember from above, we want to use the URL that is located inside of the iframe that our elements are being dynamically generated from. We don’t want the normal Swig website URL.

async def swigScraper():
    async with async_playwright() as playwright:

        browser = await playwright.chromium.launch(headless=True)
        page = await browser.new_page()

        # make sure to have the correct URL
        await page.goto('https://swig-orders.crispnow.com/tabs/locations/menu')

Since the web page we’re trying to scrape has a lot of hidden elements, let’s first scroll through the menu to see what loads after about a minute. Then, we can right click and inspect the page to see where our name and description are nested. After scrolling through for a minute, I highlighted the drink name and then right clicked and hit “inspect.” This screenshot shows my result:

As we can see from this screenshot, we need to wait for our ion-card-content elements to load before we can see where the information we want lives. This allows us to continue our function with a wait_for_selector call that waits (for up to 60 seconds) until that selector is attached to the page:

        await page.wait_for_selector('ion-card-content', state='attached', timeout=60000)

        # our item names and descriptions are all located in this area
        items = await page.query_selector_all('ion-card-content')

Now, we can create a list to store our menu, loop through the HTML and take what we need, extract our text, and then make it look pretty:

        menu = []

        for item in items:
            name = await item.query_selector('p.text-h3')
            description = await item.query_selector('p.text-b2')

            # use inner text to extract our info
            if name and description:
                result = {}
                result['name'] = await name.inner_text()
                result['description'] = await description.inner_text()
                menu.append(result)

        for item in menu:
            print(f"Name: {item['name']}, Description: {item['description']}")

        await browser.close()
        return menu

scraped_menu = await swigScraper()
print(scraped_menu)

The function prints the name and description of every drink it finds and then returns the full menu as a list of dictionaries.
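If you would like the ingredients as a list rather than one string, a sketch like the following could help. It assumes every description follows the same pattern as the example above (“Dr Pepper + Coconut (120 - 440 Calories)”), which may not hold for every menu item; `parse_description` is a hypothetical helper, not part of the scraper.

```python
import re

# Matches a trailing calorie range such as "(120 - 440 Calories)".
CALORIES = re.compile(r"\((\d+)\s*-\s*(\d+)\s*Calories\)\s*$")


def parse_description(description):
    """Split a menu description into ingredients and an optional calorie range."""
    calories = None
    match = CALORIES.search(description)
    if match:
        calories = (int(match.group(1)), int(match.group(2)))
        # Drop the calorie suffix so only the ingredient text remains.
        description = description[:match.start()]
    ingredients = [part.strip() for part in description.split("+") if part.strip()]
    return {"ingredients": ingredients, "calories": calories}


print(parse_description("Dr Pepper + Coconut (120 - 440 Calories)"))
# {'ingredients': ['Dr Pepper', 'Coconut'], 'calories': (120, 440)}
```

Descriptions that do not end in a calorie range simply come back with `calories` set to `None`.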

Now that we have all our menu items nicely formatted, let’s use OpenAI’s Structured Outputs to let us know which drinks are perfect for each season and why!

Part 2: OpenAI structured schema outputs

Please make sure you have your OpenAI API key ready! We will be using it in this section.

We are going to be using Structured Outputs so that we can let the model know exactly what we’re looking for and how we want our output to be styled.

Our first step is to install OpenAI:

!pip install openai

Now, let’s import openai along with json and getpass for our OpenAI API key.

import openai
import json
import getpass

Using getpass, input your API key so that it’s easy for us to use throughout this section of the tutorial.

# put in your OpenAI API key here
openai_api_key = getpass.getpass(prompt= "Put in OpenAI API Key here")

Before we can get started, we need to make sure our menu is properly formatted for OpenAI and our model to understand. We can do this by putting all of our drinks and their descriptions into a single string. Our prompt also needs to tell the model what role it is playing and which drinks and ingredients are available. I am going to let our model know that it is the best soda mixologist Utah has ever seen and that I am providing a list of our sodas with their descriptions. I also want to ask which sodas are best for each season (spring, summer, fall, winter) based on their descriptions:

def swigJoined(scraped_menu):
    drink_list = []

    # just formatting our menu from above
    for drink in scraped_menu:
        drink_format = f"{drink['name']}: {drink['description']}"
        drink_list.append(drink_format)

    # put all the drinks into a single string for OpenAI to understand it
    drink_string = "\n".join(drink_list)

    # we have to tell OpenAI which drinks/combinations are available
    prompt = (
        "You are the best soda mixologist Utah has ever seen! This is a list of sodas and their descriptions, or ingredients:\n"
        f"{drink_string}\n\n Please sort each and every drink provided into spring, summer, fall, or winter seasons based on their ingredients\n"
        "and give me reasonings as to why by stating which ingredients make it best for each season. For example, cinnamon is more fall, but peach\n"
        "is more summer."
    )

    return prompt

Now, let’s generate our prompt using the menu we scraped. We are going to be using our prompt down below in our structured outputs part of this tutorial:

my_prompt = swigJoined(scraped_menu)
openai.api_key = openai_api_key

Now that this is ready, we can use our structured call and JSON schema. For help on this part, please refer to OpenAI’s Structured Outputs documentation.

We can see under the “extracting structured data from unstructured data” section that the request should follow these specifics:

POST /v1/chat/completions
{
  "model": "gpt-4o-2024-08-06",
  "messages": [
    {
      "role": "system",
      "content": "Extract action items, due dates, and owners from meeting notes."
    },
    {
      "role": "user",
      "content": "...meeting notes go here..."
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "action_items",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "action_items": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "description": {
                  "type": "string",
                  "description": "Description of the action item."
                },
                "due_date": {
                  "type": ["string", "null"],
                  "description": "Due date for the action item, can be null if not specified."
                },
                "owner": {
                  "type": ["string", "null"],
                  "description": "Owner responsible for the action item, can be null if not specified."
                }
              },
              "required": ["description", "due_date", "owner"],
              "additionalProperties": false
            },
            "description": "List of action items from the meeting."
          }
        },
        "required": ["action_items"],
        "additionalProperties": false
      }
    }
  }
}

So, we can take this skeleton code and make it our own. This tutorial uses GPT-4o, but please feel free to use any other model that supports Structured Outputs:

response = openai.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are the best soda mixologist Utah has ever seen!"},
        {"role": "user", "content": my_prompt}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "drink_response",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "seasonal_drinks": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "drink": {"type": "string"},
                                "reason": {"type": "string"}
                            },
                            "required": ["drink", "reason"],
                            "additionalProperties": False
                        }
                    }
                },
                "required": ["seasonal_drinks"],
                "additionalProperties": False
            }
        }
    }
)

Now, let’s print it and see our structured response:

print(json.dumps(response.model_dump(), indent=2))

Here, we can see that the output we are looking for is located in the “content” field of the first choice’s message. Let’s extract it so that we can see a list of the drinks and the reasons why each drink is best for each season. We can do this by calling model_dump and indexing into the result:

content = response.model_dump()['choices'][0]['message']['content']
print(content)

It’s still in one line, so let’s print out the drinks so they look pretty:

parsed_drinks = json.loads(content)
seasonal_drinks_pretty = parsed_drinks['seasonal_drinks']
print(json.dumps(seasonal_drinks_pretty, indent=2))
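Before relying on the parsed list, a small sanity check can be reassuring. The sketch below is one way to do it using only the standard library; `parse_seasonal_drinks` is a hypothetical helper, and the `sample` string is hand-written to match the schema's shape rather than real model output.

```python
import json

# Keys every drink entry should carry, per the schema used in this tutorial.
REQUIRED_KEYS = {"drink", "reason"}


def parse_seasonal_drinks(content):
    """Parse the model's JSON string and check each entry has the expected keys."""
    parsed = json.loads(content)
    drinks = parsed["seasonal_drinks"]
    for item in drinks:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"Drink entry is missing keys: {sorted(missing)}")
    return drinks


# Hand-written sample shaped like the schema, not real model output.
sample = json.dumps({
    "seasonal_drinks": [
        {"drink": "Dr Spice", "reason": "Cinnamon suits fall and winter."}
    ]
})
print(parse_seasonal_drinks(sample))
```

With strict Structured Outputs the keys should already be present, so a check like this is mostly a guard against changes to the schema or the parsing code.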

Now, we can see all the drinks that are perfect for each season from the Swig menu! Let’s take a look at some of them.

OpenAI believes Dirty S.O.P is perfect for summer since, "The inclusion of peach makes this drink more suited for summer, as peach is typically associated with warm weather and summer harvests." A great drink for fall and winter is the Dr Spice: "Cinnamon and cinnamon stick are warm spices typically associated with fall and winter, making this drink best suited for chillier weather."
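Notice that the season only shows up inside the free-text reason. One optional variation, which is not the schema used in this tutorial, is to ask for the season as its own field with an `enum`. The sketch below builds such a `response_format` dictionary; the `season` field and the `build_response_format` helper are assumptions added for illustration.

```python
SEASONS = ["spring", "summer", "fall", "winter"]


def build_response_format(seasons=SEASONS):
    """Return a response_format dict whose drink entries carry an explicit season."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "drink_response",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "seasonal_drinks": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "drink": {"type": "string"},
                                # Assumed extra field, not in the tutorial's schema.
                                "season": {"type": "string", "enum": list(seasons)},
                                "reason": {"type": "string"},
                            },
                            # Strict mode expects every property to be required.
                            "required": ["drink", "season", "reason"],
                            "additionalProperties": False,
                        },
                    }
                },
                "required": ["seasonal_drinks"],
                "additionalProperties": False,
            },
        },
    }
```

Passing `build_response_format()` as the `response_format` argument would give each drink a season that can be filtered on directly, instead of depending on the word “fall” appearing in the reason. The rest of this tutorial keeps the original two-field schema.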

Now that we know which soda-based drinks are perfect for each season based on our output, let’s go ahead and insert our drinks as documents into our MongoDB Atlas cluster so we can run an aggregation pipeline on them and figure out which ones are perfect for our upcoming fall season.

Part 3: Insert into MongoDB Atlas and create an aggregation pipeline

Our first step is to install pymongo. PyMongo is the official MongoDB driver for Python applications.

Install it using pip:

!pip install pymongo

Here, we are going to import our MongoClient, set up our MongoDB connection using getpass, and then we can name our database and collection anything that we want since it’ll be created when we enter our data. I am naming my database “swig_menu” and the collection “seasonal_drinks.”

from pymongo import MongoClient

# set up your MongoDB connection
connection_string = getpass.getpass(prompt= "Enter connection string WITH USER + PASS here")
client = MongoClient(connection_string, appname = "devrel.showcase.swig_menu")

# name your database and collection anything you want since it will be created when you enter your data
database = client['swig_menu']
collection = database['seasonal_drinks']

# insert our seasonal drinks
collection.insert_many(seasonal_drinks_pretty)

Once you run this block, double-check that everything was inserted correctly.

Now, let’s create an Atlas Search index so we can use MongoDB’s Atlas Search on our documents!

Learn more about how to create an Atlas Search index.
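If you would rather create the index from your notebook than from the Atlas UI, recent PyMongo versions (4.5 and later, to my understanding) expose `create_search_index` on a collection. The sketch below is a hedged example: the index name “default” and the dynamic mapping are assumptions, and both helper functions are hypothetical names.

```python
def build_search_index_model(name="default"):
    """Return a search index model that indexes all fields dynamically."""
    return {
        "name": name,  # assumed index name
        "definition": {"mappings": {"dynamic": True}},
    }


def create_search_index(collection, name="default"):
    # `collection` is expected to be a pymongo Collection connected to an
    # Atlas cluster; the call returns the name of the new index.
    return collection.create_search_index(build_search_index_model(name))
```

You would call it as `create_search_index(collection)` with the collection from the previous step. Index builds are asynchronous on Atlas, so the index may take a short while before it can serve queries.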

Once you’ve created your search index, create an aggregation pipeline using the MongoDB Atlas UI. To do this, head over to “Collections” and then click on “Aggregation.” Here, we can search through our seasonal drinks and use Atlas Search’s exact match feature to figure out which drinks are best for fall!

Let’s first see all the fall drinks that our AI model found for us. To do this, we can add a $search stage to our aggregation pipeline and use the text operator to look for “fall” in the reason field:

{
    "text": {
      "query": "fall",
      "path": "reason"
    }
}

We have eight results!
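The same search can also be run from Python instead of the Atlas UI. Here is a minimal sketch that builds the pipeline; the index name “default” is an assumption (use whatever you named your index), and `build_text_search_pipeline` is a hypothetical helper.

```python
def build_text_search_pipeline(query, path="reason", index="default"):
    """Build an aggregation pipeline with a $search stage using the text operator."""
    return [
        {"$search": {"index": index, "text": {"query": query, "path": path}}},
        # Keep only the fields we care about, plus the relevance score.
        {"$project": {
            "_id": 0,
            "drink": 1,
            "reason": 1,
            "score": {"$meta": "searchScore"},
        }},
    ]


pipeline = build_text_search_pipeline("fall")
# With a pymongo collection you could then run:
#     results = list(collection.aggregate(pipeline))
```

The `$project` stage is optional; it is included here only to surface the search score next to each drink.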

Now, let’s say I want fall drinks that have the ingredient “apple” in them. To do this, we need the compound operator, which combines two or more queries. Here I want drinks whose reason contains “fall” AND “apple,” so both queries go in a “must” clause. If I wanted “fall” OR “apple,” I would use a “should” clause instead.

{
  "compound": {
    "must": [
      {
        "text": {
          "query": "fall",
          "path": "reason"
        }
      },
      {
        "text": {
          "query": "apple",
          "path": "reason"
        }
      }
    ]
  }
}

We have two great options for fall drinks that include apples!
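To switch between AND and OR behavior without rewriting the query by hand, a small builder like the one below can help. It is a sketch under the same assumptions as before (an index named “default” and a `reason` field); `build_compound_search` is a hypothetical helper.

```python
def build_compound_search(terms, path="reason", require_all=True, index="default"):
    """Build a $search stage that combines one text query per term.

    require_all=True puts the queries in "must" (every term has to match);
    require_all=False puts them in "should" (any one term is enough).
    """
    clauses = [{"text": {"query": term, "path": path}} for term in terms]
    clause_type = "must" if require_all else "should"
    return {"$search": {"index": index, "compound": {clause_type: clauses}}}


fall_and_apple = build_compound_search(["fall", "apple"])
fall_or_apple = build_compound_search(["fall", "apple"], require_all=False)
# With a pymongo collection:
#     results = list(collection.aggregate([fall_and_apple]))
```

When a compound query contains only “should” clauses, Atlas Search treats them as a logical OR, so `fall_or_apple` would return drinks that mention either word.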

Now, we can find drinks from Swig’s website that are super specific to what we are craving for each season.

Conclusion

In this tutorial, we scraped a website using Playwright, ran the scraped menu through OpenAI to get seasonal drink picks with reasoning as structured output, and then inserted those drinks and their reasons into MongoDB Atlas and used MongoDB Atlas Search to find fall drinks by ingredient!

I hope you enjoyed this tutorial. Please connect with us in the Developer Forum.

Resources used

Web scraping with Playwright: https://oxylabs.io/blog/playwright-web-scraping https://www.lambdatest.com/blog/playwright-for-web-scraping/

Query selector with Playwright: https://playwright.dev/python/docs/api/class-page#pagequery_selectorselector

Structured Outputs: https://openai.com/index/introducing-structured-outputs-in-the-api/

Exact matches in MongoDB Atlas Search: https://www.mongodb.com/pt-br/developer/products/atlas/atlas-search-exact-match/
