APACHE Spark

Apache Spark is a unified analytics engine for Big Data processing, with built-in modules for streaming, SQL, Machine Learning and graph processing.

Apache Spark™

An integrated part of CDH and supported with Cloudera Enterprise, Apache Spark is the open standard for flexible in-memory data processing that enables batch, real-time, and advanced analytics on the Apache Hadoop platform. Via the One Platform Initiative, Cloudera is committed to helping the ecosystem adopt Spark as the default data execution engine for analytic workloads. Apache Spark is the open standard for flexible in-memory data processing that enables batch, real-time, and advanced analytics on the Apache Hadoop platform. Via the One Platform Initiative, Cloudera is committed to helping the ecosystem adopt Spark as the default data execution engine for analytic workloads. machine-learning built-in.

Easy, Productive Development

Simple, yet rich, APIs for Java, Scala, and Python open up data for interactive discovery and iterative development of applications. Through shared common code, data scientists and developers can increase productivity with rapid prototyping for batch and streaming applications, using the language and third-party tools on which they already rely.

Fast Processing

Take advantage of Spark’s distributed in-memory storage for high-performance processing across a variety of use cases, including batch processing, real-time streaming, and advanced modeling and analytics. With significant performance improvements over MapReduce, Spark is the tool of choice for data scientists and analysts to turn their data into real results.

The Cloudera difference for Apache Spark

The first integrated solution to support Apache Spark, Cloudera not only has the most experience — with production customers across industries — but also has built the deepest engineering integration between Spark and the rest of the ecosystem, including bringing Spark to YARN and adding necessary security and management integrations (500+ patches contributed, to date). Cloudera also has multiple Spark committers on staff, so you get direct access and influence to the roadmap based on your needs and use cases.

The One Platform Initiative

Apache Spark is well-positioned to replace MapReduce as the default data-processing engine in the Hadoop ecosystem, but for customers to fully embrace Spark for all production workloads, there is still work to be done to make it enterprise-grade. Cloudera’s One Platform Initiative focuses on the need to deeply integrate Spark with the Hadoop ecosystem so users get maximum benefits from their big data infrastructure. To achieve this vision, Cloudera’s committers, working alongside the community, will specifically address the issues shown in the diagram to the right (with some items already done).