
Gen AI-powered video summarization

Build a YouTube transcription and summarization service that leverages a large language model (LLM) and semantic search.
Solution Overview

In today's instant-gratification world, it's essential to distill vast amounts of information quickly. With the amount of informational content available on platforms like YouTube, the ability to quickly find relevant videos, transcribe them, and, most importantly, summarize them can increase the speed of knowledge gathering. In this article, you'll learn about a generative AI-powered solution that does all of the above.

YouTube is the leading free and open video platform, and content relevant to a specific service, product, or application can range from proprietary company videos to influencer-produced explainers and reviews. Regardless of the source, a company can help its internal and external audiences find relevant content and decipher important takeaways much more quickly with a generative AI-powered video summarization app.

With this solution, you’ll build a GenAI service that pairs OpenAI’s GPT and text-embedding-ada-002 with Atlas Vector Search for sophisticated video-to-text generation and searching across semantically similar videos.

Other applicable industries and use cases:
  • Software development and IT: Stack Overflow’s developer report states 60% of developers leverage online videos for learning. As such, developers, architects, and other IT professionals can improve productivity and learn new technologies faster with a GenAI video summarization solution.

  • Retail: Many goods and services, especially those that are more expensive or technical in nature, lend themselves to more pre-purchase research. A solution like this could help companies minimize that research effort for their target customers.

  • Companies in any industry with B2B sales: Selling and buying between businesses requires a higher level of knowledge transfer. An internal-facing version of this solution could help sellers keep their company knowledge up to date, while an external-facing version could help customers and prospects gather the best information faster during their purchasing consideration.

Reference Architectures
  • Data Source: YouTube links that are processed for video metadata and transcripts.

    • Leveraging OCR and AI for real-time code analysis

      • To this end, the application integrates Optical Character Recognition (OCR) and Artificial Intelligence (AI) to perform real-time code analysis directly from the video frames (a minimal sketch follows this list). This feature goes beyond traditional metadata and transcripts, offering actionable insights into the code displayed in the video. Whether the code is part of a tutorial, a recorded meeting, or any other form of technical presentation, this layer produces a text-based, searchable version of it along with an AI-powered explanation, making your video resources more comprehensive and insightful than ever.
  • Processing Layer: Python script utilizing the OpenAI API and other libraries to fetch and summarize the transcript.

  • Orchestration Layer: In any software system, there is often a need to coordinate various services, modules, or components to accomplish more complex tasks. The orchestration layer serves this purpose, acting as a middleman that handles the logic required to manage several operations that are part of a bigger application flow. This is particularly beneficial in a microservices architecture, but it also has its place in monolithic or modular architectures. In our intelligent video processing system, the orchestration layer takes on a crucial role: we've conceptualized a VideoServiceFacade class that serves as the central orchestrator, mediating between services such as OpenAIService, VideoService, SearchService, and MongoDbRepository.

  • Output: JSON files with video metadata, full transcript, and its AI-generated summary.
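
To make the OCR-based code analysis concrete, here is a minimal, hypothetical sketch: frames are sampled from a downloaded video file, run through OCR, and the recovered text is handed to an LLM for explanation. The library choices (OpenCV, pytesseract, the openai Python client) and the gpt-3.5-turbo model name are assumptions for illustration, not necessarily what the reference implementation uses.

```python
# Hypothetical sketch: sample frames from a video, OCR any on-screen text,
# and ask an LLM to explain the code it finds. Library choices are assumptions.
import cv2
import pytesseract
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_text_from_frames(video_path: str, every_n_seconds: int = 10) -> list[str]:
    """Sample one frame every N seconds and OCR any visible text."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30
    texts, frame_index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % int(fps * every_n_seconds) == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            text = pytesseract.image_to_string(gray).strip()
            if text:
                texts.append(text)
        frame_index += 1
    capture.release()
    return texts


def explain_code(ocr_text: str) -> str:
    """Ask the LLM to clean up and explain code recovered by OCR."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You explain code extracted from video frames."},
            {"role": "user", "content": f"Explain what this code does:\n\n{ocr_text}"},
        ],
    )
    return response.choices[0].message.content
```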

Without MongoDB: [Reference architecture without MongoDB illustration]
With MongoDB: [Reference architecture with MongoDB illustration]
Data Model Approach

The data extracted from each YouTube video consists of the following:

  • Video URL: A direct link to the YouTube video.
  • Metadata: Video details such as title, uploader, date, etc.
  • Transcript: A textual representation of the content spoken in the video.
  • Summary: A concise version of the transcript, distilled using AI.
  • Analyzed code list: A list of code snippets extracted from the video and analyzed by AI.

The data is stored in JSON format, which provides flexibility for use across a variety of applications.
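
For illustration only, a stored document for a single video might look like the following sketch. The field names are hypothetical (chosen to mirror the list above), the embedding array is truncated, and summary_embedding is included because the solution later stores a vector embedding of the summary for Atlas Vector Search.

```json
{
  "video_url": "https://www.youtube.com/watch?v=<video-id>",
  "metadata": {
    "title": "Intro to Atlas Vector Search",
    "uploader": "MongoDB",
    "publish_date": "2024-01-15"
  },
  "transcript": "Full text of everything spoken in the video...",
  "summary": "A concise, AI-generated summary of the transcript...",
  "summary_embedding": [0.0123, -0.0456, 0.0789],
  "code_analyzed": [
    {
      "extracted_code": "db.videos.aggregate([ ... ])",
      "ai_explanation": "Runs an aggregation pipeline against the videos collection."
    }
  ]
}
```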

Building the Solution

The code for the following steps can be found in this GitHub repo.

  • Setting up the Environment: Start by ensuring you have all required libraries installed. This includes langchain, json, pymongo, and any other domain or service-specific libraries.

  • Configuration: Use the ApplicationConfiguration to load configurations such as the OpenAI API key and MongoDB connection details.

  • Loading YouTube videos: For each video link in MONGODB_YOUTUBE_VIDEO_LINKS, the YoutubeLoader fetches metadata and the transcript.

  • Summarizing the Transcript: OpenAI's GPT model, specified by the engine gpt-35-turbo (or a larger-context variant, such as the 16K version, for long video transcripts), is used to condense the content. The summarization process involves setting up a conversation prompt that helps the model generate a context-rich summary (see the ingest sketch below).

  • Handling errors: If the summarization process encounters an error, it's caught and stored in the summary field for that particular video.

  • Storing data locally: The compiled data, including the video summary, is serialized into a JSON format and saved to individual files, named video_transcript_<index>.json.

  • Storing data in MongoDB Atlas with Vector Search:

    • Convert the summarized transcript into embeddings suitable for vector search using the model "text-embedding-ada-002."

    • Store these embeddings in MongoDB Atlas. Using MongoDB Atlas Vector Search, you can index these embeddings to make the summarized content easily searchable with sophisticated approximate nearest neighbor (ANN) algorithms.

      • You’ll need to create a vector search index to make your embeddings searchable. You can choose from a few parameters, such as the number of dimensions (the maximum in Atlas Vector Search is 4096), the type of similarity function, and the number (K) of nearest neighbors. A sample index definition is included in the sketch below. You can also reference our docs.

      • With the power of vector search, users can now quickly retrieve relevant video summaries by querying with a phrase or sentence, enhancing content discoverability.

You can find more information in the MongoDB Atlas Vector Search documentation. Visit the Atlas Vector Search Quick Start guide and create your first index in minutes.
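
Putting the loading, summarization, embedding, and storage steps together, here is a minimal, hypothetical ingest sketch. It assumes the langchain-community, openai (v1+), and pymongo packages; the YoutubeLoader import path, the gpt-3.5-turbo model name, and the collection, field, and index names are illustrative and may differ from the GitHub repo.

```python
# Hypothetical ingest sketch: fetch a YouTube transcript, summarize it with GPT,
# embed the summary with text-embedding-ada-002, and store everything in Atlas.
from langchain_community.document_loaders import YoutubeLoader
from openai import OpenAI
from pymongo import MongoClient

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
collection = MongoClient("<your-atlas-connection-string>")["video_db"]["videos"]


def ingest_video(url: str) -> None:
    # 1. Fetch metadata and the transcript for the video.
    docs = YoutubeLoader.from_youtube_url(url, add_video_info=True).load()
    transcript = docs[0].page_content
    metadata = docs[0].metadata

    # 2. Summarize the transcript with a GPT chat model.
    summary = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarize video transcripts concisely."},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content

    # 3. Embed the summary with text-embedding-ada-002 (1536 dimensions).
    embedding = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=summary
    ).data[0].embedding

    # 4. Store the document; summary_embedding is the field indexed by
    #    the Atlas Vector Search index below.
    collection.insert_one({
        "video_url": url,
        "metadata": metadata,
        "transcript": transcript,
        "summary": summary,
        "summary_embedding": embedding,
    })


# Sample Atlas Vector Search index definition (created in the Atlas UI, CLI, or
# Admin API) for the summary_embedding field, assuming the vectorSearch index type:
# {
#   "fields": [
#     { "type": "vector", "path": "summary_embedding",
#       "numDimensions": 1536, "similarity": "cosine" }
#   ]
# }
```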

  • Create an orchestration layer that sits at the center of the application. Its role is to coordinate the services, manage complex workflows, and deliver a seamless experience. The VideoServiceFacade class within this layer acts as the orchestrator, effectively tying up all the loose ends.

    • VideoServiceFacade: Acts as the coordinator for OpenAIService, VideoService, SearchService, and MongoDbRepository.

    • VideoProcessResult: An encapsulation of the processed video results, including metadata, possible actions, and optional search query terms.

    • When the system is triggered, usually from a main function or API endpoint, it's the VideoServiceFacade that takes over. Based on user prompts and AI-generated suggestions, it triggers various processes. These could range from transcript generation and summarization to text-based searching within the stored video summaries.

Here's how it all comes together (a simplified sketch follows these steps):

  1. The VideoServiceFacade's processVideo method is called with a user prompt.
  2. The AI interprets this and determines what actions to take, encapsulated within a VideoProcessResult.
  3. If actions include summarization, VideoService is called to perform it.
  4. If there's a search query, SearchService is invoked to perform a search in MongoDB Atlas.
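
The sketch below is a simplified, hypothetical Python rendering of that flow. The class names mirror those described above (process_video stands in for the processVideo method), the service internals are stubbed, and the $vectorSearch stage assumes a vector index named vector_index over a summary_embedding field, so the details may differ from the actual repository.

```python
# Hypothetical orchestration sketch mirroring the flow described above.
# Service internals are stubbed; only the coordination logic and the
# Atlas $vectorSearch query are shown in any detail.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VideoProcessResult:
    # Actions the AI decided to take, e.g. ["summarize", "search"].
    actions: list[str] = field(default_factory=list)
    video_url: Optional[str] = None
    search_query: Optional[str] = None


class SearchService:
    def __init__(self, collection, openai_service):
        self.collection = collection          # pymongo collection
        self.openai_service = openai_service  # wraps text-embedding-ada-002

    def semantic_search(self, query: str, limit: int = 5):
        query_vector = self.openai_service.embed(query)
        return list(self.collection.aggregate([
            {"$vectorSearch": {
                "index": "vector_index",       # Atlas Vector Search index name
                "path": "summary_embedding",
                "queryVector": query_vector,
                "numCandidates": 100,
                "limit": limit,
            }},
            {"$project": {"metadata.title": 1, "summary": 1,
                          "score": {"$meta": "vectorSearchScore"}}},
        ]))


class VideoServiceFacade:
    def __init__(self, openai_service, video_service, search_service, mongo_repository):
        self.openai_service = openai_service
        self.video_service = video_service
        self.search_service = search_service
        self.mongo_repository = mongo_repository

    def process_video(self, user_prompt: str):
        # 1. The AI interprets the prompt and returns a VideoProcessResult.
        result = self.openai_service.interpret(user_prompt)

        # 2-3. Summarize and store the video if that action was requested.
        if "summarize" in result.actions and result.video_url:
            summary = self.video_service.summarize(result.video_url)
            self.mongo_repository.save(result.video_url, summary)

        # 4. Run a semantic search over stored summaries if a query was produced.
        if result.search_query:
            return self.search_service.semantic_search(result.search_query)
        return None
```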
Technologies and Products Used
Partner technologies:
  • Langchain (Python, JavaScript)
AI technologies:
  • OpenAI GPT (gpt-35-turbo)
  • OpenAI text-embedding-ada-002
Key Considerations
  • Creating and storing vector indexes in Atlas Vector Search
  • How to generate embeddings and outputs from an LLM, and store them in MongoDB Atlas
  • Leveraging the popular AI framework, Langchain, to facilitate building a Gen-AI powered application
  • Conducting semantic search to retrieve relevant data that may not have exact keyword matches
  • Orchestrating various AI-powered services in an application
Authors
  • Fabio Falavinha, MongoDB
  • David Macias, MongoDB
