Perhaps you've heard the terms "database," "data warehouse," and "data lake," and you've got some questions. Are these different words to describe the same thing? If not, what are the differences? And when should you choose one over the other? We'll explore answers to these questions and more in this article.
Table of Contents:
A database is a collection of data or information. Databases are typically accessed electronically and are used to support Online Transaction Processing (OLTP). Database Management Systems (DBMS) store data in the database and enable users and applications to interact with the data. The term “database” is commonly used to reference both the database itself as well as the DBMS.
A variety of database types have emerged over the last several decades. All databases store information, but each database will have its own characteristics. Relational databases store data in tables with fixed rows and columns. Non-relational databases (also known as NoSQL databases) store data in a variety of models including JSON (JavaScript Object Notation), BSON (Binary JSON), key-value pairs, tables with rows and dynamic columns, and nodes and edges. Databases store structured and/or semi-structured data, depending on the type.
You may also find database characteristics like:
If your application needs to store data (and nearly every interactive application does), your application needs a database. Applications across industries and use cases are built on databases. Many types of data can be stored in databases, including:
A myriad of databases exist. Examples include:
Both data warehouses and data lakes are meant to support Online Analytical Processing (OLAP). OLAP systems are typically used to collect data from a variety of sources. The data is then used to power a range of analytical use cases ranging from business intelligence and reporting (e.g., quarterly sales reports by store) to forecasting (e.g., predicting home sales for the next six months based on historical trends).
With that in mind, let’s compare these two approaches to OLAP.
A data warehouse is a system that stores highly structured information from various sources. Data warehouses typically store current and historical data from one or more systems. The goal of using a data warehouse is to combine disparate data sources in order to analyze the data, look for insights, and create business intelligence (BI) in the form of reports and dashboards.
You might be wondering, "Is a data warehouse a database?" Yes, a data warehouse is a giant database that is optimized for analytics.
Data warehouses store large amounts of current and historical data from various sources. They contain a range of data, from raw ingested data to highly curated, cleansed, filtered, and aggregated data.
Extract, transform, load (ETL) processes move data from its original source to the data warehouse. The ETL processes move data on a regular schedule (for example, hourly or daily), so data in the data warehouse may not reflect the most up-to-date state of the systems.
Data warehouses typically have a pre-defined and fixed relational schema. Therefore, they work well with structured data. Some data warehouses also support semi-structured data.
Once the data is in the warehouse, business analysts can connect data warehouses with BI tools. These tools allow business analysts and data scientists to explore the data, look for insights, and generate reports for business stakeholders.
Data warehouses are a good option when you need to store large amounts of historical data and/or perform in-depth analysis of your data to generate business intelligence. Due to their highly structured nature, analyzing the data in data warehouses is relatively straightforward and can be performed by business analysts and data scientists.
Note that data warehouses are not intended to satisfy the transaction and concurrency needs of an application. If an organization determines they will benefit from a data warehouse, they will need a separate database or databases to power their daily operations.
Examples of data warehouses include:
A data lake is a repository of data from disparate sources that is stored in its original, raw format. Like data warehouses, data lakes store large amounts of current and historical data. What sets data lakes apart is their ability to store data in a variety of formats including JSON, BSON, CSV, TSV, Avro, ORC, and Parquet.
Typically, the primary purpose of a data lake is to analyze the data to gain insights. However, organizations sometimes use data lakes simply for their cheap storage with the idea that the data may be used for analytics in the future.
You might be wondering, "Is a data lake a database?" A data lake is a repository for data stored in a variety of ways including databases. With modern tools and technologies, a data lake can also form the storage layer of a database. Tools like Starburst, Presto, Dremio, and Atlas Data Lake can give a database-like view into the data stored in your data lake. In many cases, these tools can power the same analytical workloads as a data warehouse.
Data lakes store large amounts of structured, semi-structured, and unstructured data. They can contain everything from relational data to JSON documents to PDFs to audio files.
Data does not need to be transformed in order to be added to the data lake, which means data can be added (or “ingested”) incredibly efficiently without upfront planning.
The primary users of a data lake can vary based on the structure of the data. Business analysts will be able to gain insights when the data is more structured. When the data is more unstructured, data analysis will likely require the expertise of developers, data scientists, or data engineers.
The flexible nature of data lakes enables business analysts and data scientists to look for unexpected patterns and insights. The raw nature of the data combined with its volume allows users to solve problems they may not have been aware of when they initially configured the data lake.
Data in data lakes can be processed with a variety of OLAP systems and visualized with BI tools.
Data lakes are a cost-effective way to store huge amounts of data. Use a data lake when you want to gain insights into your current and historical data in its raw form without having to transform and move it. Data lakes also support machine learning and predictive analytics.
Like data warehouses, data lakes are not intended to satisfy the transaction and concurrency needs of an application.
Data lakes can provide storage and compute capabilities, either independently or together.
The following are examples of technology that provide flexible and scalable storage for building data lakes:
Other technologies enable organizing and querying data in data lakes, including:
Databases, data warehouses, and data lakes are all used to store data. So what's the difference?
The key differences between a database, a data warehouse, and a data lake are that:
The table below summarizes similarities and differences between databases, data warehouses, and data lakes.
Database | Data Lake | Data Warehouse | |
---|---|---|---|
Workloads | Operational and transactional | Analytical | Analytical |
Data Type | Structured or semi-structured | Structured, semi-structured, and/or unstructured | Structured and/or semi-structured |
Schema Flexibility | Rigid or flexible schema depending on database type | No schema definition required for ingest (schema on read) | Pre-defined and fixed schema definition for ingest (schema on write and read) |
Data Freshness | Real time | May not be up-to-date based on frequency of ETL processes | May not be up-to-date based on frequency of ETL processes |
Users | Application developers | Business analysts, application developers, and data scientists | Business analysts and data scientists |
Pros | Fast queries for storing and updating data | Easy data storage simplifies ingesting raw data A schema is applied afterwards to make working with the data easy for business analysts Separate storage and compute | The fixed schema makes working with the data easy for business analysts |
Cons | May have limited analytics capabilities | Requires effort to organize and prepare data for use | Difficult to design and evolve schema Scaling compute may require unnecessary scaling of storage, because they are tightly coupled |
Nearly every interactive application will require a database.
When organizations want to analyze their data from multiple sources, they may choose to complement their databases with a data warehouse, a data lake, or both. When determining if a data lake and/or data warehouse is right for your organization, consider the following questions:
MongoDB Atlas is a fully-managed database-as-a-service that supports creating MongoDB databases with a few clicks. MongoDB databases have flexible schemas that support structured or semi-structured data.
In many cases, the MongoDB data platform provides enough support for analytics that a data warehouse or a data lake is not required. Some of the features that MongoDB provides to support analytics include:
Databases, data warehouses, and data lakes each have their own purpose. Nearly every modern application will require a database to store the current application data. Organizations that want to analyze their applications' current and historical data may choose to complement their databases with a data warehouse, a data lake, or both.
Get started today with a free Atlas database.
A database stores the current data required to power an application whereas a data warehouse stores current and historical data for one or more systems in a predefined and fixed schema for the purpose of analyzing the data.
Data lakes are used to store current and historical data for one or more systems. Data lakes store data in its raw (untransformed) form, which allows developers, data scientists, and data engineers to run ad-hoc analytics.
No, data warehousing is not dead. The need for analytics to help a company gain insights and make decisions is not going away.
Data lakes are an alternative approach to data warehousing. A data lake can be a powerful complement to a data warehouse when an organization is struggling to handle the variety and ever-changing nature of its data sources.