Data Governance for Building Generative AI Applications with MongoDB

Pramod Borkar
December 14, 2023 | Updated: January 10, 2024
#Security #genAI

Generative AI (GenAI) has been evolving at a rapid pace. After OpenAI’s ChatGPT, powered by GPT-3.5, reached 100 million monthly active users in just two months, other major large language models (LLMs) followed in its footsteps. Cohere’s LLM supports more than 100 languages and is available on the company's AI platform, Google’s Med-PaLM was designed to provide high-quality answers to medical questions, OpenAI introduced GPT-4 (which OpenAI reports is 40% more likely than GPT-3.5 to produce factual responses), Microsoft integrated GPT-4 into its Office 365 suite, and Amazon introduced Bedrock, a fully managed service that makes foundation models available via API. These are just a few of the advancements in the generative AI market, and many enterprises and startups are adopting AI tools to solve their specific use cases. The developer community and the number of open-source models are also growing as companies adapt to this technology shift.

Building intelligent GenAI applications requires flexibility with data. One of the core requirements is data governance, the subject of this post. Data governance is a broad term encompassing everything you do to ensure data is secure, private, accurate, available, and usable. It includes the processes, policies, measures, technology, tools, and controls around the data lifecycle. When organizations build applications and move them to production, they often deal with personally identifiable information (PII) or commercially sensitive data, such as data related to intellectual property, and want to make sure all the necessary controls are in place.

When organizations are looking to build GenAI-powered apps, there are a few capabilities that are required to deliver intelligent and modern app experiences:

Handle data for both operational and analytical workloads
A data platform that is highly scalable and performant
An expressive query API that can work with any kind of data type
Tight integrations with established and open-source LLMs
Native vector search over embeddings, enabling semantic search and retrieval-augmented generation (RAG)
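
To make the last capability concrete from a governance angle, here is a minimal sketch of a retrieval step for RAG that only returns documents the caller is allowed to see. It builds an aggregation pipeline around the `$vectorSearch` stage. The index name, the field names (`embedding`, `text`, `classification`), and the helper function are assumptions made for illustration, and the filter only works if the filtered field is declared as a filter field in the vector search index.

```python
from typing import Any


def build_rag_retrieval_pipeline(
    query_vector: list[float],
    allowed_classifications: list[str],
    limit: int = 5,
) -> list[dict[str, Any]]:
    """Build a pipeline that retrieves only documents the caller may see."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",   # hypothetical index name
                "path": "embedding",       # hypothetical field holding the vector
                "queryVector": query_vector,
                # Nearest-neighbor candidates to consider before ranking
                "numCandidates": min(limit * 20, 10000),
                "limit": limit,
                # Pre-filter so restricted documents never reach the prompt
                "filter": {"classification": {"$in": allowed_classifications}},
            }
        },
        # Pass on only what the prompt needs, not the raw embedding
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "classification": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_rag_retrieval_pipeline([0.12, -0.03, 0.88], ["public", "internal"])
```

With PyMongo, a pipeline like this would typically be passed to `collection.aggregate(pipeline)`. Filtering inside the search stage, rather than after it, means restricted documents are excluded before any text is handed to the model.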

To learn more about the MongoDB developer data platform and how to embed generative AI into applications with MongoDB, you can refer to this paper. This blog goes into detail on the security controls of MongoDB Atlas that modern AI applications need.

Check out our AI resource page to learn more about building AI-powered apps with MongoDB.

What are some of the potential security risks while building GenAI applications?

According to Retool's State of AI 2023 report, data security and data accuracy are the top two pain points when developing AI applications. In the survey, a third of respondents cited data security as a primary pain point, and that share rises almost linearly with company size (refer to the MongoDB blog for more details).

Top pain points around developing AI apps. *Source: State of AI 2023 report by Retool*

While organizations leverage AI technology to improve their businesses, they should be wary of the potential risks. The risks above are more likely to surface as companies experiment with various models and AI tools. Even when organizations follow best practices and take a deliberate, structured approach to developing production-ready generative AI applications, they still need strict security controls to address the key security risks these applications pose.

Here are some considerations for securing AI applications and systems:

Data security and privacy: Generative AI foundation models rely on large amounts of data to both train against and generate new content. If the training data, or the data made available to the retrieval-augmented generation (RAG) process, includes personal or confidential data, that data may turn up in outputs in unpredictable ways. Strong governance and controls are therefore essential to keep confidential data out of outputs.
Intellectual property infringement: Organizations need to avoid the unauthorized use, duplication, or sale of works legally regarded as protected intellectual property. They also have to train AI models so that the output does not resemble existing works closely enough to infringe the original's copyright. Since this is still a new area for AI systems, the laws are evolving.
Regulatory compliance: AI applications have to comply with industry standards and regulations such as HIPAA in healthcare, PCI DSS for payment card data, GDPR for the personal data of individuals in the EU, CCPA, and more.
Explainability: AI systems and algorithms are sometimes perceived as opaque, making non-deterministic decisions. Explainability is the concept that a machine learning model and its output can be explained in a way that makes sense to a human being at an acceptable level and provides repeatable outputs given the same inputs. This is crucial for building trust and accountability in AI applications, especially in domains like healthcare, finance, and security.
AI Hallucinations: AI models may generate inaccurate information, also known as hallucinations. These are often caused by limitations in training data and algorithms. Hallucinations can result in regulatory violations in industries like finance, healthcare, and insurance, and, in the case of individuals, could be reputationally damaging or even defamatory.

These are just some of the considerations when using AI tools and systems. There are additional concerns when it comes to physical security, organizational measures, technical controls for the workforce — both internal and partners — and monitoring and auditing of the systems. By addressing each of these critical issues, organizations can ensure the AI applications they roll out to production are compliant and secure.

Let us look at how MongoDB’s developer data platform can help with some of these considerations around security controls and measures.

How does MongoDB address the security risks and data governance around GenAI?

MongoDB's developer data platform, built on MongoDB Atlas, unifies operational, analytical, and generative AI data services to streamline building intelligent applications. At the core of MongoDB Atlas is its flexible document data model and developer-native query API. Together, they enable developers to dramatically accelerate the speed of innovation, outpace competitors, and capitalize on new market opportunities presented by GenAI.

Developers and data science teams around the world are innovating with AI-powered applications on top of MongoDB. They span multiple use cases in various industry sectors and rely on the security controls MongoDB Atlas provides. Here is the library of sample case studies, white papers, and other resources about how MongoDB is helping customers build AI-powered applications.

MongoDB security & compliance capabilities

MongoDB Atlas offers built-in security controls for all organizational data. That data can be application data as well as vector embeddings and their associated metadata, so the same protection covers everything you use in GenAI-powered applications. Atlas provides enterprise-grade features that integrate with your existing security protocols and compliance standards. In addition, Atlas simplifies deploying and managing your databases while giving developers the versatility to build resilient applications. Security administrators can integrate Atlas with external systems, while developers stay focused on their business requirements. Key security features are enabled by default, and MongoDB Atlas is designed with security controls that meet enterprise security requirements. Here's how these controls help organizations build their AI applications on MongoDB’s platform and address the considerations discussed above:

Data security

MongoDB has access and authentication controls enabled by default. Customers can authenticate to the platform using mechanisms including SCRAM, x.509 certificates, LDAP, passwordless authentication with AWS IAM, and OpenID Connect. MongoDB also provides role-based access control (RBAC) to determine each user's privileges on the various resources within the platform. Data scientists and developers building AI applications can use any of these controls to fine-tune user access and privileges while training or prompting their AI models, and organizations can restrict access to the data to authorized personnel only.
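
As an illustration of least-privilege RBAC for a RAG workload, the sketch below builds a `createRole` command document that grants only the `find` action on a single collection. The role, database, and collection names are hypothetical. It also assumes a self-managed deployment where the command can be run directly; on Atlas, custom roles are generally defined through the Atlas UI or administration API instead.

```python
from typing import Any


def read_only_role_command(role_name: str, db: str, collection: str) -> dict[str, Any]:
    """Build a createRole command granting read access to one collection."""
    return {
        # The command name must be the first key; Python dicts keep insertion order
        "createRole": role_name,
        "privileges": [
            {
                "resource": {"db": db, "collection": collection},
                "actions": ["find"],  # read-only: no insert, update, or remove
            }
        ],
        "roles": [],  # inherit nothing from other roles
    }


# Hypothetical names for a service that only retrieves context for prompts
command = read_only_role_command("ragRetriever", "knowledge_base", "documents")
```

With PyMongo, a document like this would be passed to `client["knowledge_base"].command(command)` by an administrator. A retrieval service that authenticates as a user holding only this role cannot modify source documents or read other collections.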

End-to-end encryption of data: MongoDB’s data encryption tools offer robust features to protect your data while in transit (network), at rest (storage), and in use (memory and logs). Customers can use automatic encryption of key data fields like personally identifiable information (PII), protected health information (PHI), or any data deemed sensitive, ensuring data is encrypted throughout its lifecycle. Going beyond encryption at rest and in transit, MongoDB has released Queryable Encryption to encrypt data in use. Queryable Encryption enables an application to encrypt sensitive data on the client side, store the encrypted data in the MongoDB database, and run server-side queries on the encrypted data without having to decrypt it. Because the server only ever sees ciphertext, Queryable Encryption makes sensitive data opaque. This technology can be leveraged when company-specific data containing confidential information is drawn from the MongoDB database for the RAG process and needs to be protected, or when you are storing sensitive data in the database.
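
To show what choosing the fields to encrypt can look like, here is a minimal sketch that builds an `encryptedFields` description for Queryable Encryption. The field paths and the helper function are hypothetical. Setting `keyId` to `None` assumes the driver's collection-creation helper will generate data keys, which should be checked against the driver documentation for your version.

```python
from typing import Any


def encrypted_fields(
    queryable: list[str],
    encrypted_only: list[str],
) -> dict[str, Any]:
    """Describe which fields are encrypted and which also allow equality queries."""
    fields: list[dict[str, Any]] = []
    for path in queryable:
        fields.append({
            "path": path,
            "bsonType": "string",
            "keyId": None,  # assumption: the driver helper creates the data key
            "queries": {"queryType": "equality"},
        })
    for path in encrypted_only:
        # Encrypted, but not searchable on the server
        fields.append({"path": path, "bsonType": "string", "keyId": None})
    return {"fields": fields}


# Hypothetical patient record: look up by SSN, never query free-text notes
config = encrypted_fields(queryable=["patient.ssn"], encrypted_only=["patient.notes"])
```

Keeping the list of queryable fields short is a deliberate choice here: only fields that the application genuinely needs to search should carry a `queries` entry.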

Regulatory compliance and data privacy

Many uses of generative AI are subject to existing laws and regulations that govern data privacy, intellectual property, and other related areas. New laws and regulations aimed specifically at AI are in the works around the world.

The MongoDB developer data platform undergoes independent verification of platform security, privacy, and compliance controls to help customers meet their regulatory and policy objectives, including the unique compliance needs of highly regulated industries and U.S. government agencies. Refer to the MongoDB Atlas Trust Center for our current certifications and assessments.

Regular security audits

Organizations should conduct regular security audits to identify potential vulnerabilities in their data security practices, so that any weaknesses are found and addressed promptly. Audits help identify and mitigate risks and errors in your AI models and data, and help ensure that you comply with regulations and standards. MongoDB offers granular auditing that provides a trail of what data was used and how, and is designed to monitor and detect unauthorized access to data.
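
As one way to put an audit trail to work, the sketch below scans JSON audit log lines for authorization checks that were denied. It assumes the JSON audit event layout with `atype`, `users`, `param`, and `result` fields, where a `result` of 13 means "Unauthorized". The sample events, user names, and namespaces are invented for illustration.

```python
import json
from typing import Any, Iterable

UNAUTHORIZED = 13  # audit result code for a denied authorization check


def unauthorized_access_events(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Return a summary of every authCheck event that was denied."""
    flagged = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the log
        event = json.loads(line)
        if event.get("atype") == "authCheck" and event.get("result") == UNAUTHORIZED:
            param = event.get("param", {})
            flagged.append({
                "time": event.get("ts", {}).get("$date"),
                "users": [u.get("user") for u in event.get("users", [])],
                "namespace": param.get("ns"),
                "command": param.get("command"),
            })
    return flagged


# Invented sample events: one allowed read, one denied read
sample_log = [
    json.dumps({"atype": "authCheck", "ts": {"$date": "2024-01-10T09:00:00.000Z"},
                "users": [{"user": "rag_service", "db": "admin"}],
                "param": {"command": "find", "ns": "knowledge_base.documents"},
                "result": 0}),
    json.dumps({"atype": "authCheck", "ts": {"$date": "2024-01-10T09:00:05.000Z"},
                "users": [{"user": "intern_app", "db": "admin"}],
                "param": {"command": "find", "ns": "hr.salaries"},
                "result": UNAUTHORIZED}),
]

flagged = unauthorized_access_events(sample_log)
```

A report like this could feed an alerting pipeline, so that repeated denied reads against sensitive collections are reviewed rather than left in the log.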

What are additional best practices and considerations while working with AI models?

While it is essential to work with a trusted data platform, it is also important to prioritize security and data governance as discussed. In addition to data security, compliance, and data privacy as mentioned above, here are additional best practices and considerations.

Data quality
Monitor and assess the quality of input data to avoid biases in foundation models. Make sure that your training data is representative of the domain in which your model will be applied. If your model is expected to generalize to real-world scenarios, keep monitoring both the training data and the data made available to the RAG process.
Secure deployment
Use secure and encrypted channels for deploying foundation models. Implement robust authentication and authorization mechanisms to ensure that only authorized users and systems can access sensitive data and AI models. Enforce mechanisms to anonymize sensitive information to protect user privacy.
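
A minimal sketch of the anonymization step mentioned above: masking obvious identifiers in text before it is embedded or sent to a model. The two patterns (email addresses and US Social Security numbers) and the placeholder labels are assumptions for illustration. Regular expressions catch only well-formed identifiers, so this should be read as a first screen rather than complete PII detection.

```python
import re

# Illustrative patterns only; real deployments need broader coverage
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


redacted = redact("Contact jane.doe@example.com about claim 123-45-6789.")
# redacted == "Contact [EMAIL] about claim [SSN]."
```

Running the redaction before embedding, rather than on model output, keeps the sensitive values out of both the vector store and the prompt.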
Audit trails and monitoring
Maintain detailed audit trails and logs of model training, evaluation, and deployment activities. Implement continuous monitoring of both data inputs and model outputs for unexpected patterns or deviations.

MongoDB maintains audit trails and logs of all the data operations and data processing. Customers can use the audit logs for monitoring, troubleshooting, and security purposes, including intrusion detection. We utilize a combination of automated scanning, automated alerting, and human review to monitor the data.
Secure data storage
Implement secure storage practices for both raw and processed data. Use encryption for data at rest and in transit as discussed above.

Encryption at rest is turned on automatically on MongoDB servers. The encryption occurs transparently in the storage layer; that is, all data files are fully encrypted from a filesystem perspective, and data exists in an unencrypted state only in memory and during transmission, which is why encryption in transit is used alongside it.

Conclusion

As generative AI tools grow in popularity, it matters more than ever how an organization understands, protects, and uses its data: defining the roles, controls, processes, and policies for interacting with it. As modern enterprises use generative AI and LLMs to better serve customers and extract insights from their data, strong data governance becomes essential. By understanding the potential risks and carefully evaluating the capabilities of the platform that hosts their data, organizations can confidently harness the power of these tools.

For more details on MongoDB’s trusted platform, refer to these links.
