MongoDB Connector for Apache Spark
We live in a world of “big data”. But it isn’t just the data itself that is valuable – it’s the insight it can generate. How quickly an organization can unlock and act on that insight has become a major source of competitive advantage. Collecting data in operational systems and then relying on nightly batch extract, transform, load (ETL) processes to update the enterprise data warehouse (EDW) is no longer sufficient.
Unlock the power of Apache Spark
The MongoDB Connector for Apache Spark exposes all of Spark’s libraries and can be used from Scala, Java, Python, and R. MongoDB data is materialized as DataFrames and Datasets for analysis with Spark’s machine learning, graph, streaming, and SQL APIs.
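A minimal sketch of loading a MongoDB collection into a Spark DataFrame with the connector, assuming a local MongoDB instance and a hypothetical shop.customers collection. The format name and configuration key follow the 10.x connector (earlier releases use the "mongo" format and spark.mongodb.input.uri); the snippet is intended for a spark-shell or notebook session started with the connector package on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: read a MongoDB collection into a Spark DataFrame.
// Connection URI, database, and collection names are placeholders.
val spark = SparkSession.builder()
  .appName("MongoSparkExample")
  .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017")
  .getOrCreate()

// The connector infers a schema by sampling documents in the collection.
val customers = spark.read
  .format("mongodb")
  .option("database", "shop")
  .option("collection", "customers")
  .load()

customers.printSchema()

// DataFrames registered as views are queryable with Spark SQL.
customers.createOrReplaceTempView("customers")
spark.sql("SELECT country, COUNT(*) AS n FROM customers GROUP BY country").show()
```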
Leverage the power of MongoDB
The MongoDB Connector for Apache Spark can take advantage of MongoDB’s aggregation pipeline and rich secondary indexes to extract, filter, and process only the data it needs – for example, analyzing all customers located in a specific geography. Traditional NoSQL datastores do not offer secondary indexes or in-database aggregations. In these cases, Spark would need to extract all data based on a simple primary key, even if only a subset of that data is required for the Spark process. The MongoDB Connector for Apache Spark also co-locates Resilient Distributed Datasets (RDDs) with the source MongoDB node to minimize data movement across the cluster and reduce latency.
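As a hedged illustration of that pushdown behavior, the sketch below reuses the hypothetical spark session and shop.customers collection from the earlier example. It applies a filter and projection through the DataFrame API; where possible, the connector translates these into MongoDB-side query and aggregation operations so that only matching documents and fields leave the database.

```scala
// Sketch: the filter and column selection are candidates for pushdown,
// letting MongoDB use its indexes and aggregation pipeline to return
// only customers in one geography. Field names are illustrative.
val spanishCustomers = spark.read
  .format("mongodb")
  .option("database", "shop")
  .option("collection", "customers")
  .load()
  .filter("country = 'Spain'")
  .select("customerId", "name", "city")

// The physical plan shows which predicates and columns were pushed down.
spanishCustomers.explain()
```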
MongoDB and Apache Spark: Working for Data Science Teams Today

China Eastern Airlines
Stratio
Stratio implemented its Pure Spark big data platform, combining MongoDB with Apache Spark, Zeppelin, and Kafka, to build an operational data lake for Mutua Madrileña, one of Spain’s largest insurance companies. Machine learning models personalize the customer experience, while analysis of marketing campaign data measures impact and improves performance.
Global Airline
A global airline has consolidated customer data scattered across more than 100 systems into a single view stored in MongoDB. Spark processes run against the live operational data in MongoDB to update customer classifications and personalize offers in real time, while the customer is browsing the website or speaking with the call center.
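A sketch of that read-classify-write loop, with hypothetical database, collection, and field names, and a placeholder segmentation rule standing in for the airline's actual models:

```scala
import org.apache.spark.sql.functions.{col, when}

// Read the live single-view collection, derive a simple classification,
// and write the results back to MongoDB. The segmentation logic here is
// a placeholder, not the airline's real model.
val singleView = spark.read
  .format("mongodb")
  .option("database", "singleview")
  .option("collection", "customers")
  .load()

val classified = singleView.withColumn(
  "segment",
  when(col("lifetimeValue") > 10000, "platinum")
    .when(col("lifetimeValue") > 1000, "gold")
    .otherwise("standard")
)

classified.write
  .format("mongodb")
  .mode("append")
  .option("database", "singleview")
  .option("collection", "customerSegments")
  .save()
```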