When you need to ingest, process, and analyze data sets that are too large or complex for conventional relational databases, the answer is a set of technologies organized into a structure called a Big Data architecture. Common use cases include:
Storage and processing of data in very large volumes: generally, anything over 100 GB in size
Aggregation and transformation of large sets of unstructured data for analysis and reporting
The capture, processing, and analysis of streaming data in real-time or near-real-time
Big Data architectures have a number of layers or components. These are the most common:
Data is sourced from multiple inputs in a variety of formats, including both structured and unstructured. Sources include relational databases tied to applications such as ERP or CRM systems, data warehouses, mobile devices, social media, email, and real-time streaming inputs such as IoT devices. Data can be ingested in batch mode or in real time.
This is the data receiving layer, which ingests data, stores it, and converts unstructured data into a format analytic tools can work with. Structured data is often stored in a relational database, while unstructured data can be housed in a NoSQL database such as MongoDB Atlas. A specialized distributed system like Hadoop Distributed File System (HDFS) is a good option for high-volume batch processed data in various formats.
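To make the storage split concrete, here is a minimal sketch, assuming SQLite as a stand-in for any relational store and a local MongoDB instance for unstructured documents. The connection string, database, collection, and field names are illustrative placeholders, not part of the original text.

```python
# Minimal sketch: structured records go to a relational store, flexible
# documents go to a NoSQL store such as MongoDB.
import sqlite3
from pymongo import MongoClient

# Structured data: fixed schema, relational storage (SQLite as a stand-in)
rdb = sqlite3.connect("orders.db")
rdb.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL)")
rdb.execute("INSERT INTO orders (amount) VALUES (?)", (42.50,))
rdb.commit()

# Unstructured/semi-structured data: schema can vary from document to document
client = MongoClient("mongodb://localhost:27017")  # placeholder URI
events = client["bigdata_demo"]["raw_events"]      # placeholder names
events.insert_one({
    "source": "mobile_app",
    "payload": {"screen": "checkout", "duration_ms": 1830},
})
```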
With very large data sets, long-running batch jobs are required to filter, combine, and generally render the data usable for analysis. Source files are typically read and processed, with the output written to new files. Hadoop is a common solution for this.
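The sketch below illustrates that read-process-write pattern as a small PySpark batch job; Spark is one common engine in the Hadoop ecosystem for this kind of work. The paths and column names are placeholders, not prescribed by the text.

```python
# Minimal PySpark batch job: read raw source files, filter and aggregate,
# then write the prepared output to new files for downstream analysis.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-batch").getOrCreate()

raw = spark.read.json("hdfs:///data/raw/events/")            # placeholder path
cleaned = raw.filter(F.col("event_type").isNotNull())        # drop malformed records
summary = cleaned.groupBy("event_type").agg(F.count("*").alias("events"))
summary.write.mode("overwrite").parquet("hdfs:///data/curated/event_counts/")
```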
This component categorizes and routes incoming data so it can move smoothly into the deeper layers of the environment. An architecture designed for real-time sources needs a mechanism to ingest and store real-time messages for stream processing. Messages can sometimes simply be dropped into a folder, but in other cases a message capture store is necessary to buffer them and to enable scale-out processing, reliable delivery, and other queuing requirements.
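As a minimal sketch of a message capture store, the example below publishes IoT readings to Apache Kafka. Kafka is not named in the text; it is one common choice, and the broker address, topic, and message fields here are assumptions for illustration only.

```python
# Minimal sketch: buffer real-time messages in a message store (Kafka here)
# so the stream-processing layer can consume them reliably and at scale.
import json
from kafka import KafkaProducer  # kafka-python package

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each reading is appended to a topic, where it is durably buffered until consumed.
producer.send("sensor-readings", {"device_id": "therm-17", "temp_c": 21.4})
producer.flush()
```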
Once captured, the real-time messages have to be filtered, aggregated, and otherwise prepared for analysis, after which they are written to an output sink. Options for this phase include Azure Stream Analytics, Apache Storm, and Apache Spark Streaming.
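The text names Spark Streaming as one option; below is a minimal Spark Structured Streaming sketch that consumes the messages buffered in the previous step, aggregates them, and writes the results to an output sink (the console, for illustration). Broker, topic, and field names are placeholders, and running it requires Spark's Kafka connector package on the classpath.

```python
# Minimal stream-processing sketch: read messages, aggregate per device over
# one-minute windows, and write the running results to an output sink.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-agg").getOrCreate()

readings = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
    .option("subscribe", "sensor-readings")                # placeholder topic
    .load())

counts = (readings
    .selectExpr("CAST(value AS STRING) AS json", "timestamp")
    .select(F.get_json_object("json", "$.device_id").alias("device_id"), "timestamp")
    .groupBy(F.window("timestamp", "1 minute"), "device_id")
    .count())

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```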
The processed data can now be presented in a structured format – such as a relational data warehouse – for querying by analytical tools, as is the case with traditional business intelligence (BI) platforms. Alternatives for serving the data include low-latency NoSQL technologies or an interactive Hive database.
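To make the serving step concrete, here is a minimal sketch that assumes the curated Parquet output from the batch sketch above and exposes it as a SQL-queryable view, so analysts and BI tools can issue ordinary queries. Paths and table names are illustrative.

```python
# Minimal serving-layer sketch: register prepared data as a SQL-queryable table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serving-layer").getOrCreate()

curated = spark.read.parquet("hdfs:///data/curated/event_counts/")  # placeholder path
curated.createOrReplaceTempView("event_counts")

# Reporting tools and analysts can now run ordinary SQL against the prepared data
top_events = spark.sql("""
    SELECT event_type, events
    FROM event_counts
    ORDER BY events DESC
    LIMIT 10
""")
top_events.show()
```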
Most Big Data platforms are geared to extracting business insights from the stored data via analysis and reporting, which requires multiple tools. Structured data is relatively easy to handle, while more advanced and specialized techniques are required for unstructured data. Data scientists may undertake interactive data exploration using various notebooks and toolsets. A data modeling layer might also be included in the architecture, enabling self-service BI with popular visualization and modeling techniques.
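As a minimal sketch of the interactive-exploration step, the snippet below uses pandas in a notebook-style session. The file name and columns are hypothetical, standing in for data pulled from the serving layer.

```python
# Minimal exploration sketch: profile and rank data exported from the
# analytical store, as a data scientist might do in a notebook.
import pandas as pd

df = pd.read_parquet("event_counts.parquet")   # placeholder export from the serving layer

print(df.describe())                                        # quick statistical profile
print(df.sort_values("events", ascending=False).head(10))   # largest categories first
```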
Analytics results are sent to the reporting component, which replicates them to various output systems for human viewers, business processes, and applications. Once visualized in reports or dashboards, the analytic results support data-driven business decision making.
A typical Big Data analysis cycle involves multiple data processing operations, followed by data transformation, movement among sources and sinks, and loading of the prepared data into an analytical data store. These workflows can be automated with orchestration and data-movement tools such as Apache Oozie and Apache Sqoop, or with Azure Data Factory.
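The named tools each express such workflows declaratively, but the underlying pattern is an ordered, repeatable chain of steps. The sketch below illustrates only that idea in plain Python; the step functions are hypothetical placeholders, and a real orchestrator adds scheduling, retries, and dependency tracking.

```python
# Minimal orchestration sketch: run pipeline steps in a fixed, repeatable order.
def ingest_raw_files():
    """Hypothetical step: pull source files into the raw zone."""

def transform_and_clean():
    """Hypothetical step: filter, join, and reshape the raw data."""

def load_to_warehouse():
    """Hypothetical step: load prepared data into the analytical store."""

PIPELINE = [ingest_raw_files, transform_and_clean, load_to_warehouse]

def run_pipeline():
    for step in PIPELINE:
        print(f"running {step.__name__}")
        step()

if __name__ == "__main__":
    run_pipeline()
```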
To process large data sets quickly, Big Data architectures use parallel computing, in which many processors perform calculations at the same time. Large problems are broken into smaller units that can be solved simultaneously.
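Here is a minimal single-machine sketch of that divide-and-compute idea, using Python's multiprocessing module; distributed engines apply the same pattern across many servers. The workload and chunk size are arbitrary examples.

```python
# Minimal parallel-computing sketch: split a large problem into chunks and
# process the chunks simultaneously across worker processes.
from multiprocessing import Pool

def chunk_sum(chunk):
    return sum(x * x for x in chunk)   # stand-in for real per-chunk work

if __name__ == "__main__":
    total = 10_000_000
    step = 1_000_000
    chunks = [range(i, min(i + step, total)) for i in range(0, total, step)]
    with Pool() as pool:
        partials = pool.map(chunk_sum, chunks)   # chunks computed in parallel
    print(sum(partials))                         # combine the partial results
```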
Big Data architectures can be scaled horizontally, enabling the environment to be adjusted to the size of each workload. Big Data solutions are usually run in the cloud, where you only pay for the storage and computing resources you actually use.
The marketplace offers many solutions and platforms for use in Big Data architectures, such as Azure managed services, MongoDB Atlas, and Apache technologies. You can combine solutions to get the best fit for your various workloads, existing systems, and IT skill sets.
You can create integrated platforms across different types of workloads, leveraging Big Data architecture components for IoT processing and BI as well as analytics workflows.
Static Big Data – data at rest – is usually stored in a centralized data lake. Robust security is required to ensure your data stays protected from intrusion and theft. But secure access can be difficult to set up, as other applications need to consume the data as well.
A Big Data architecture typically contains many interlocking moving parts. These include multiple data sources with separate data-ingestion components and numerous cross-component configuration settings to optimize performance. Building, testing, and troubleshooting Big Data processes are challenges that take high levels of knowledge and skill.
It’s important to choose the right solutions and components to meet the business objectives of your Big Data initiatives. This can be daunting, as many Big Data technologies, practices, and standards are relatively new and still in a process of evolution. Core Hadoop components such as Hive and Pig have attained a level of stability, but other technologies and services remain immature and are likely to change over time.
Big Data APIs built on mainstream languages are gradually coming into use. Nevertheless, Big Data architectures and solutions do generally employ atypical, highly specialized languages and frameworks that impose a considerable learning curve for developers and data analysts alike.