Databases vs. Data Warehouses vs. Data Lakes

What are databases, data warehouses, and data lakes? Let's examine the key differences and when should you use each one.

Perhaps you've heard the terms "database," "data warehouse," and "data lake," and you've got some questions. Are these different words to describe the same thing? If not, what are the differences? And when should you choose one over the other? We'll explore answers to these questions and more in this article.

Table of Contents:

What is a Database?
OLAP + data warehouses and data lakes
What is a Data Warehouse?
What is a Data Lake?
What are the key differences between a database, data warehouse, and data lake?
Database vs. data warehouse vs. data lake: which is right for me?
Using MongoDB Atlas databases and data lakes

What is a database?

A database is a collection of data or information. Databases are typically accessed electronically and are used to support Online Transaction Processing (OLTP). Database Management Systems (DBMS) store data in the database and enable users and applications to interact with the data. The term “database” is commonly used to reference both the database itself as well as the DBMS.

Database characteristics

A variety of database types have emerged over the last several decades. All databases store information, but each database will have its own characteristics. Relational databases store data in tables with fixed rows and columns. Non-relational databases (also known as NoSQL databases) store data in a variety of models including JSON (JavaScript Object Notation), BSON (Binary JSON), key-value pairs, tables with rows and dynamic columns, and nodes and edges. Databases store structured and/or semi-structured data, depending on the type.

You may also find database characteristics like:

Security features to ensure the data can only be accessed by authorized users.
ACID (Atomicity, Consistency, Isolation, Durability) transactions to ensure data integrity.
Query languages and APIs to easily interact with the data in the database.
Indexes to optimize query performance.
Full-text search.
Optimizations for mobile devices.
Flexible deployment topologies to isolate workloads (e.g., analytics workloads) to a specific set of resources.
On-premises, private cloud, public cloud, hybrid cloud, and/or multi-cloud hosting options.

Why use a database?

If your application needs to store data (and nearly every interactive application does), your application needs a database. Applications across industries and use cases are built on databases. Many types of data can be stored in databases, including:

Patient medical records
Items in an online store
Financial records
Articles and blog entries
Sports scores and statistics
Online gaming information
Student grades and scores
IoT device readings
Mobile application information

Database examples

A myriad of databases exist. Examples include:

Relational databases: Oracle, MySQL, Microsoft SQL Server, and PostgreSQL
Document databases: MongoDB and CouchDB
Key-value databases: Redis and DynamoDB
Wide-column stores: Cassandra and HBase
Graph databases: Neo4j and Amazon Neptune

OLAP + data warehouses and data lakes

Both data warehouses and data lakes are meant to support Online Analytical Processing (OLAP). OLAP systems are typically used to collect data from a variety of sources. The data is then used to power a range of analytical use cases ranging from business intelligence and reporting (e.g., quarterly sales reports by store) to forecasting (e.g., predicting home sales for the next six months based on historical trends).

With that in mind, let’s compare these two approaches to OLAP.

What is a data warehouse?

A data warehouse is a system that stores highly structured information from various sources. Data warehouses typically store current and historical data from one or more systems. The goal of using a data warehouse is to combine disparate data sources in order to analyze the data, look for insights, and create business intelligence (BI) in the form of reports and dashboards.

You might be wondering, "Is a data warehouse a database?" Yes, a data warehouse is a giant database that is optimized for analytics.

Data warehouse characteristics

Data warehouses store large amounts of current and historical data from various sources. They contain a range of data, from raw ingested data to highly curated, cleansed, filtered, and aggregated data.

Extract, transform, load (ETL) processes move data from its original source to the data warehouse. The ETL processes move data on a regular schedule (for example, hourly or daily), so data in the data warehouse may not reflect the most up-to-date state of the systems.

Data warehouses typically have a pre-defined and fixed relational schema. Therefore, they work well with structured data. Some data warehouses also support semi-structured data.

Once the data is in the warehouse, business analysts can connect data warehouses with BI tools. These tools allow business analysts and data scientists to explore the data, look for insights, and generate reports for business stakeholders.

Why use a data warehouse?

Data warehouses are a good option when you need to store large amounts of historical data and/or perform in-depth analysis of your data to generate business intelligence. Due to their highly structured nature, analyzing the data in data warehouses is relatively straightforward and can be performed by business analysts and data scientists.

Note that data warehouses are not intended to satisfy the transaction and concurrency needs of an application. If an organization determines they will benefit from a data warehouse, they will need a separate database or databases to power their daily operations.

Data warehouse examples

Examples of data warehouses include:

Amazon Redshift.
Google BigQuery.
IBM Db2 Warehouse.
Microsoft Azure Synapse.
Oracle Autonomous Data Warehouse.
Snowflake.
Teradata Vantage.

What is a data lake?

A data lake is a repository of data from disparate sources that is stored in its original, raw format. Like data warehouses, data lakes store large amounts of current and historical data. What sets data lakes apart is their ability to store data in a variety of formats including JSON, BSON, CSV, TSV, Avro, ORC, and Parquet.

Typically, the primary purpose of a data lake is to analyze the data to gain insights. However, organizations sometimes use data lakes simply for their cheap storage with the idea that the data may be used for analytics in the future.

Is a data lake a database?

You might be wondering, "Is a data lake a database?" A data lake is a repository for data stored in a variety of ways including databases. With modern tools and technologies, a data lake can also form the storage layer of a database. Tools like Starburst, Presto, Dremio, and Atlas Data Lake can give a database-like view into the data stored in your data lake. In many cases, these tools can power the same analytical workloads as a data warehouse.

Data lake characteristics

Data lakes store large amounts of structured, semi-structured, and unstructured data. They can contain everything from relational data to JSON documents to PDFs to audio files.

Data does not need to be transformed in order to be added to the data lake, which means data can be added (or “ingested”) incredibly efficiently without upfront planning.

The primary users of a data lake can vary based on the structure of the data. Business analysts will be able to gain insights when the data is more structured. When the data is more unstructured, data analysis will likely require the expertise of developers, data scientists, or data engineers.

The flexible nature of data lakes enables business analysts and data scientists to look for unexpected patterns and insights. The raw nature of the data combined with its volume allows users to solve problems they may not have been aware of when they initially configured the data lake.

Data in data lakes can be processed with a variety of OLAP systems and visualized with BI tools.

Why use a data lake?

Data lakes are a cost-effective way to store huge amounts of data. Use a data lake when you want to gain insights into your current and historical data in its raw form without having to transform and move it. Data lakes also support machine learning and predictive analytics.

Like data warehouses, data lakes are not intended to satisfy the transaction and concurrency needs of an application.

Data lake examples

Data lakes can provide storage and compute capabilities, either independently or together.

The following are examples of technology that provide flexible and scalable storage for building data lakes:

AWS S3
Azure Data Lake Storage Gen2
Google Cloud Storage

Other technologies enable organizing and querying data in data lakes, including:

AWS Athena.
Presto.
Starburst.
Databricks SQL Analytics.

What are the key differences between a database, data warehouse, and data lake?

Databases, data warehouses, and data lakes are all used to store data. So what's the difference?

The key differences between a database, a data warehouse, and a data lake are that:

A database stores the current data required to power an application.
A data warehouse stores current and historical data from one or more systems in a predefined and fixed schema, which allows business analysts and data scientists to easily analyze the data.
A data lake stores current and historical data from one or more systems in its raw form, which allows business analysts and data scientists to easily analyze the data.

The table below summarizes similarities and differences between databases, data warehouses, and data lakes.

	Database	Data Lake	Data Warehouse
Workloads	Operational and transactional	Analytical	Analytical
Data Type	Structured or semi-structured	Structured, semi-structured, and/or unstructured	Structured and/or semi-structured
Schema Flexibility	Rigid or flexible schema depending on database type	No schema definition required for ingest (schema on read)	Pre-defined and fixed schema definition for ingest (schema on write and read)
Data Freshness	Real time	May not be up-to-date based on frequency of ETL processes	May not be up-to-date based on frequency of ETL processes
Users	Application developers	Business analysts, application developers, and data scientists	Business analysts and data scientists
Pros	Fast queries for storing and updating data	Easy data storage simplifies ingesting raw data A schema is applied afterwards to make working with the data easy for business analysts Separate storage and compute	The fixed schema makes working with the data easy for business analysts
Cons	May have limited analytics capabilities	Requires effort to organize and prepare data for use	Difficult to design and evolve schema Scaling compute may require unnecessary scaling of storage, because they are tightly coupled

Database vs. data warehouse vs. data lake: which is right for me?

Nearly every interactive application will require a database.

When organizations want to analyze their data from multiple sources, they may choose to complement their databases with a data warehouse, a data lake, or both. When determining if a data lake and/or data warehouse is right for your organization, consider the following questions:

Is my data structured, semi-structured, or unstructured? Data warehouses support structured and semi-structured data whereas data lakes support all three.
Will my analysis benefit from having a pre-defined, fixed schema? Data warehouses require users to create a pre-defined, fixed schema upfront, which lends itself to more limited (but easier) data analysis. Data lakes allow users to store data in its raw, original format, which makes it easier to store data without having to apply and maintain structure.
Where is my data currently stored? Data warehouses require you to create ETL processes to move your data into the warehouse. Depending on where the data is stored, a data lake may not require any data to be moved. For example, MongoDB Atlas Data Lake is able to access data stored in an Amazon S3 bucket, which can be quite advantageous for organizations who are already storing their data there.

Using MongoDB Atlas databases and data lakes

MongoDB Atlas is a fully-managed database-as-a-service that supports creating MongoDB databases with a few clicks. MongoDB databases have flexible schemas that support structured or semi-structured data.

In many cases, the MongoDB data platform provides enough support for analytics that a data warehouse or a data lake is not required. Some of the features that MongoDB provides to support analytics include:

A powerful aggregation pipeline that allows for data to be aggregated and analyzed in real time.
Support for analytics nodes that are designated for analytic workloads. This means that running analytics will not impact the performance of an application's critical operational workloads.
The MongoDB BI Connector, which allows you to connect your MongoDB data to BI and analytics platforms for further visualizations and analysis.

Summary

Databases, data warehouses, and data lakes each have their own purpose. Nearly every modern application will require a database to store the current application data. Organizations that want to analyze their applications' current and historical data may choose to complement their databases with a data warehouse, a data lake, or both.

AWs re:Invent 2022 Presentation - From RDBMS to NoSQL

Ready to get started?

Get started today with a free Atlas database.

FAQ

What is the difference between a database and a data lake?

A database stores the current data required to power an application. A data lake stores current and historical data for one or more systems in its raw form for the purpose of analyzing the data.

What is a database vs. a data warehouse?

A database stores the current data required to power an application whereas a data warehouse stores current and historical data for one or more systems in a predefined and fixed schema for the purpose of analyzing the data.

What is a data warehouse vs. a data lake?

Both data lakes and data warehouses store current and historical data for one or more systems. Data warehouses store data using a predefined and fixed schema whereas data lakes store data in their raw form.

When to use a data lake vs. a data warehouse?

An organization can choose to use a data lake, a data warehouse, or both when they want to analyze data from one or more systems in order to gain insights. Data lakes are a good option when an organization wants to store raw data in its original raw format. Data warehouses are a good choice when an organization wants to store data in a highly structured format.

What is a data lake used for?

Data lakes are used to store current and historical data for one or more systems. Data lakes store data in its raw (untransformed) form, which allows developers, data scientists, and data engineers to run ad-hoc analytics.

Is data warehousing dead?

No, data warehousing is not dead. The need for analytics to help a company gain insights and make decisions is not going away.

Data lakes are an alternative approach to data warehousing. A data lake can be a powerful complement to a data warehouse when an organization is struggling to handle the variety and ever-changing nature of its data sources.

Will big data replace data warehouses?

Big data and data warehouses are two different concepts. Big data refers to data that has high volume, velocity, and variety. Big data could be stored in a data lake or a data warehouse.

How can I learn how to use databases?

To get started using a database, you'll typically begin by creating a database and then learning to run the CRUD (create, read, update, and delete) operations. Each database will have its own unique flavor of how to get started. To learn how to use MongoDB, visit Get Started with MongoDB.

How do databases work?

Databases utilize storage engines, which manage how data is stored and retrieved. To learn more about MongoDB storage engines, visit FAQ: MongoDB Storage.