Snowflake Data Source for Spark provides a fully managed, governed, and secure elastic cloud data warehouse for Apache Spark data

SAN MATEO, Calif. – June 6, 2016 – Snowflake Computing, the cloud data warehousing company, today announced Snowflake Data Source for Spark — a native connector that joins the power of Snowflake’s cloud data warehouse with Apache Spark. This tight integration provides Spark developers a ready-to-use platform for diverse data that offers advanced security, high concurrency, a robust ANSI SQL dialect, and exceptional performance at any scale of dataset or workload — all without cluster setup or database management and tuning complexities.

Until now, developers using Spark had to plan and build their own infrastructure to store and query all of their Spark data. This included implementing a distributed or clustered infrastructure to scale capacity and support large datasets, which added significant effort and complexity to provisioning, managing, securing, and governing that environment.

The new Snowflake Data Source for Spark, which is built on Spark’s DataFrame API, provides developers a fully managed and governed warehouse platform for all their diverse data (JSON, Avro, CSV, XML, machine data, and more) that offers a fast, higher-level connection to data through Spark’s API. The results are increased developer productivity and a simple, agile, easy-to-deploy platform that makes it significantly easier and faster to develop and execute successful Spark projects.

Companies using Snowflake’s Data Source for Spark are able to concentrate on implementing Spark-based applications without the unnecessary complexity and delays of securing and managing their own Spark data storage infrastructure. In addition, Snowflake’s unique architecture and workload management provide a high level of concurrent query support for multiple Spark workgroups and the ability to fully query relational and nested Spark data stored in Snowflake.

Furthermore, Snowflake is built on top of the elasticity, flexibility, and resiliency of Amazon Web Services (AWS). This delivers differentiated performance and flexibility at any scale of data and analytics while allowing developers to pay for only the compute time and storage capacity they need, when they need it.

As Spark’s popularity and usage continue to grow, IT architects and Spark users want an innovative, robust, and easy-to-use data warehouse infrastructure solution that integrates seamlessly with and complements the in-memory processing power of Spark. “Snowflake’s Data Source for Spark marks a significant milestone for Snowflake by bringing together our highly-concurrent, elastic data warehouse service with Spark’s ‘Lingua Franca’ APIs for Big Data,” said Matt Glickman, VP of product at Snowflake Computing. “This highly-parallelized integration will free Spark developers from having to manage data infrastructure and instead allow them to focus on machine learning, graph traversal algorithms, and ETL pipelines. In addition, the same data can be easily queried by SQL BI tools with complete workload isolation.”

Snowflake’s Data Source for Spark enables parallel, bidirectional data movement between systems. Developers can seamlessly populate a Spark DataFrame from a Snowflake table or query and vice versa.
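The bidirectional movement described above can be sketched with the connector’s DataFrame read/write interface. This is a minimal illustration, not official sample code: the account URL, credentials, warehouse, and table names are all placeholders, and the option names follow the connector’s documented settings.

```python
# Sketch of moving data between Snowflake and Spark through the
# Snowflake Data Source for Spark. All connection values below are
# hypothetical placeholders.
SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"

sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",  # placeholder account URL
    "sfUser": "spark_user",                       # placeholder credentials
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "SPARK_WH",
}

def read_snowflake_table(spark, table):
    """Populate a Spark DataFrame from a Snowflake table."""
    return (spark.read
            .format(SNOWFLAKE_SOURCE)
            .options(**sf_options)
            .option("dbtable", table)
            .load())

def write_snowflake_table(df, table):
    """Write a Spark DataFrame back into a Snowflake table."""
    (df.write
        .format(SNOWFLAKE_SOURCE)
        .options(**sf_options)
        .option("dbtable", table)
        .mode("overwrite")
        .save())
```

Because the connector is a native Spark data source, the same `read`/`write` calls used for files or JDBC apply here; Snowflake handles storage, security, and scaling behind them.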

Use case examples of Snowflake and Spark working together include:

  • Streaming/IoT Data Ingestion: Stream data into Snowflake in near real time and combine it with other datasets to build a more complete picture and gain better insights. The data is available inside Snowflake and can be queried using familiar ANSI SQL semantics.
  • Complex ETL: Perform complex ETL in Spark, such as sessionization, and then store the data in Snowflake for broad, self-service access across the organization using SQL and SQL tools.
  • Machine Learning: Use Spark for machine learning and predictive analytics functionality while leveraging Snowflake for algorithm training and testing, Spark data dashboard reporting, DML support, and ANSI SQL query support.
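For the machine-learning case above, a Spark job might push a SQL query down to Snowflake and load only the result set as a DataFrame for model training. The sketch below assumes a `sf_options` dictionary of connection settings as described in the connector documentation; the query, schema, and column names are illustrative.

```python
# Illustrative sketch: run an ANSI SQL query inside Snowflake and load
# the result into Spark as a training set. The table and columns are
# hypothetical placeholders.
training_query = """
    SELECT user_id, session_count, total_spend, churned
    FROM ANALYTICS.PUBLIC.USER_FEATURES
    WHERE event_date >= '2016-01-01'
"""

def load_training_set(spark, sf_options):
    """Execute the query in Snowflake; only the results move to Spark."""
    return (spark.read
            .format("net.snowflake.spark.snowflake")
            .options(**sf_options)
            .option("query", training_query)
            .load())
```

Using the `query` option rather than `dbtable` keeps filtering and projection inside the warehouse, so Spark receives only the rows and columns the model needs.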

“By using Snowflake, Spark, and Databricks, we developed a platform to speed the processing of raw data, analyze deep-nested level data, and enable self-service for analysts,” said Grega Kešpret, director of engineering, analytics at Celtra.

Snowflake has an ecosystem of popular data visualization and BI tools and enables rapid development of data-driven applications. This allows Snowflake customers like Celtra and Sharethrough to employ Spark and Snowflake together for machine learning analytics, while taking advantage of cloud scale and elasticity without the complexity and inflexibility of other data analytics platform solutions.

Snowflake is a sponsor of Spark Summit, happening June 6 – 8 in San Francisco. To hear more, stop by the Snowflake booth to speak with Snowflake experts and see the technology in action. For more details and to register, visit:

Tweet This: @SnowflakeDB Sparkles: announces Snowflake Data Source for Spark – native connector joins #ApacheSpark & #ElasticDW



About Snowflake

Snowflake Computing, the cloud data warehousing company, has reinvented the data warehouse for the cloud and today’s data. The Snowflake Elastic Data Warehouse is built from the cloud up with a patent-pending new architecture that delivers the power of data warehousing, the flexibility of big data platforms and the elasticity of the cloud – at a fraction of the cost of traditional solutions. Snowflake can be found online at


Media Contact
Danielle Salvato-Earl
Kulesa Faul for Snowflake Computing
(650) 922-7287