Global Snowflake: Loading Data into Snowflake from Azure Blob Storage

Snowflake is available in a number of AWS regions as we continue to expand our geographic presence to meet customer demand, reduce latency and satisfy industry compliance and country-specific regulations.

As the global adoption of Snowflake continues, we also expect customers to transfer increasingly large data volumes between regions. Several scenarios, taken together, are driving this trend, including:

  • Data sharing across regions. A data set maintained by a Snowflake account in a specific region can be easily shared and consumed by others in different regions.
  • Synchronizing Snowflake data between different geographic regions for disaster recovery purposes.
  • Importing raw data from, and exporting it to, storage in other cloud providers. A common request from customers running Snowflake on AWS has been to support import from and export to Blob Storage in Microsoft Azure.

Therefore, we are happy to announce additional integration capabilities with Blob Storage in Microsoft Azure. This is part of Snowflake’s commitment to becoming a global cloud provider, supporting customers no matter which cloud provider(s) they choose. The integration with Azure Blob Storage complements Snowflake’s existing functionality for data loading and unloading. From now on, the Snowflake SQL commands to define an external stage support the URL and credential specifications for Azure Blob Storage. This allows customers to import data from and export data to Azure Blob Storage containers.

This capability targets architectures with data sets in Azure: customers using Microsoft Azure can now easily exchange data with Snowflake and benefit from Snowflake’s leading analytics capabilities. The following picture illustrates the overall approach:

 

The key integration point between Azure and Snowflake is a container in Azure Blob Storage. Snowflake expects that any data to be loaded from the Azure application into Snowflake is placed in a container. This container is then registered as an external stage in Snowflake using the following Snowflake SQL command:

CREATE STAGE azstage
  URL = 'azure://<account>.blob.core.windows.net/<container>/<path>'
  CREDENTIALS = (AZURE_SAS_TOKEN = '…');

The URL parameter of the CREATE STAGE command now supports URLs for Azure Blob Storage service endpoints. The endpoint for a given account can be found in the overview pane of the storage account in the Azure portal, as shown in the following figure:

When defining an external stage in Azure Blob Storage, the service endpoint URL should be followed by the container name and can include additional path specifications. Note that the URL in the stage definition replaces the ‘https:’ scheme of the endpoint URL with ‘azure:’. Client-side encryption is supported for files encrypted with an Azure Storage SDK or a compatible encryption library; customers can provide their encryption key to the CREATE STAGE command. Details on creating Azure stages can be found in the Snowflake documentation here.
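For example, a stage that supplies a customer-managed key for client-side encryption could be defined along the following lines (a sketch only; the ENCRYPTION parameter shown here should be checked against the Snowflake documentation, and all values are placeholders):

-- Illustrative only: confirm the exact ENCRYPTION syntax in the documentation
CREATE STAGE azstage_encrypted
  URL = 'azure://<account>.blob.core.windows.net/<container>/<path>'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas_token>')
  ENCRYPTION = (TYPE = 'AZURE_CSE' MASTER_KEY = '<base64_encoded_key>');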

After defining the external stage, customers can use Snowflake’s familiar COPY syntax to refer to the stage. For example, the following statement loads a batch of data files from the Azure Blob Storage container into a target table T1 in Snowflake:

COPY INTO T1
FROM @azstage/newbatch;

Similarly, the following COPY statement exports the contents of an existing table T2 in Snowflake to a set of files in the Azure external stage:

COPY INTO @azstage/t2data
FROM T2;
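Since these are ordinary COPY statements, the familiar COPY options also apply when working with Azure stages. For instance, a load could be restricted to CSV files matching a name pattern (an illustrative sketch; the file format settings below are assumptions about the data, not requirements):

COPY INTO T1
FROM @azstage/newbatch
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1)  -- illustrative format settings
PATTERN = '.*[.]csv';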

External stage support for Azure Blob Storage complements Snowflake’s expansion across Amazon data centers worldwide, making it easy to access data in Azure storage using built-in Snowflake functionality.

While cloud providers do not charge for data ingress, they do charge for data egress. For Snowflake customers using the new Azure external stages, importing data from Azure Blob Storage into Snowflake will incur data egress charges for the amount of data transferred out of their Blob Storage accounts. The charges accrue to the Azure subscription that the Blob Storage accounts belong to. The rate at which these charges occur depends on the geographical location of the Blob Storage account and the volume of data transferred during a billing period. You can find more details here.   

Starting on February 1, 2018, we will pass through the cost of data egress from AWS and Microsoft Azure incurred via external stages. This cost will appear on your bill. Egress through JDBC and other drivers will continue to be free. Because data egress is uncommon among Snowflake customers, our initial analysis shows that many customers will not be affected by this change.

The Billing & Usage page in the Snowflake web portal tracks data transfer volumes similarly to warehouse utilization and storage volumes. More information on Snowflake data transfer prices can be found here.

To give an example, exporting a 1 TB table across cloud providers, from Snowflake running in US West on AWS into an Azure Blob Storage container in the Azure East US region, costs $90. Exporting the same table within AWS into an S3 bucket in US East costs $20, cheaper because the transfer stays within AWS US regions (the two figures correspond to roughly $0.09 and $0.02 per GB, respectively). Data transferred into an external stage in the same region and on the same cloud provider continues to be free of charge; for example, exporting the table from the previous example into an AWS S3 bucket in US West is free.

Check out the Snowflake documentation for detailed information on exchanging data between Snowflake and Azure Blob Storage, as well as the new external stage support for Azure Blob Storage.

 

Data is Only Transformative with Transformative Technology

At the recent AWS re:Invent show in Las Vegas, The Cube host, Lisa Martin, had the chance to sit down with Bob Muglia, CEO and President of Snowflake. Bob shared his thoughts on Snowflake’s latest addition to its cloud-built data warehouse, Snowpipe, while looking back at Snowflake’s origins and ahead to its future in order to enable the data-driven enterprise.

What is Snowpipe, and how do customers get started with it?

Muglia: Snowpipe is a way of ingesting data into Snowflake in a streaming, continuous way. You simply drop new data that’s coming into S3 and we ingest it for you automatically. Snowpipe makes it simple to bring the data into your data warehouse on a continuous basis, ensuring that you’re always up-to-date and that your analysts are getting the latest insights and the latest data.

In the five years since you launched, how has the opportunity around cloud data warehousing changed? How has Snowflake evolved to become a leader in this space?

Muglia: If you go back five years, this was a timeframe where NoSQL was all the rage. Everybody was talking about how SQL was passé and something you’re not going to see in the future. Our founders had a different view. They had been working on true relational databases for almost 20 years, and they recognized the power of SQL and relational database technology. But they also saw that customers were experiencing significant limitations with existing technology. They saw in the cloud, and in what Amazon had done, the ability to build an all new database that takes advantage of the full elasticity and power of the cloud to deliver whatever analytics the business requires. However much data you want, however many queries you want to run simultaneously, Snowflake takes what you love about a relational database and allows you to operate in a very different way. Our founders had that vision five years ago and successfully executed on it. The product has worked beyond the dreams of our customers, and that response from our customers is what we get so excited about.

How did you identify what data should even be streamed to Snowpipe?

Muglia: As an example, in entertainment we’re experiencing a data explosion. You have streaming video data, subscription data, billing data, social media data and on and on. None of this is arriving in any sort of regular format. It’s coming as semi-structured data, like JSON or XML. Up until Snowflake came onto the scene with a truly cloud-based solution for data warehousing, everyone was struggling to wrangle all these data sets. Snowpipe lets you bring in multiple data sets, merge them in real-time and get the analytics back to your business in an agile way that’s never been seen before.

How does your partnership with AWS extend Snowflake’s capabilities?

Muglia: People don’t want their data scattered all over the place. With the cloud, with what Amazon’s done and with a product like Snowflake, you can bring all of your data together. That can change the culture of a company and the way people work. All of a sudden, data is not power. Data is available to everyone, and it’s democratized so every person can work with that data and help to bring the business forward. It can really change the dynamics around the way people work.

Tell us a little bit about Snowflake’s collaboration with its customers. How are they helping to influence your future?

Muglia: As a company, we run Snowflake on Snowflake. All of our data is in Snowflake, all of our sales data, our financial data, our marketing data, our product support data and our engineering data. Every time a user runs a query, that query is logged in Snowflake and the intrinsics about it are logged. When you have a tool with the power of Snowflake, you can effectively answer any business question in just a matter of minutes. And that’s transformative to the way people work. And to me, that’s what it means to build a data-driven culture: The answers to business questions are inside what customers are doing and are encapsulated in the data.

Try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Snowpipe: Serverless Loading for Streaming Data

Today’s organizations struggle to load, integrate and analyze the streaming data sources they rely on to advance their businesses. The Internet of Things, mobile devices, adtech, telemetry systems and application health monitoring are just some of the scenarios driving this trend. To remain competitive, organizations need this data to drive their analytics in near real time.

Despite the rapid rise in streaming data, and the infinite scalability of the cloud, traditional data warehousing solutions can’t deliver seamless ingestion for streaming data. They still require processes originally designed for batch loading, happening once or a few times per day. This unnecessary latency is a drag on an organization’s analytics. Workarounds using micro-batching provide some relief but are difficult to implement and require careful tuning. Equally, serverless computing is still foreign to most traditional data warehouse solutions. In fact, most cloud data warehouses, which are “cloud-washed” versions of on-premises solutions, do not offer a serverless experience.  

Snowpipe tackles both continuous loading for streaming data and serverless computing for data loading into Snowflake. With Snowpipe, AWS S3 event notifications automatically trigger Snowflake to load data into target tables. Snowflake SQL queries can retrieve the most recent data within a minute of it arriving in the S3 bucket.

The “pipe” is the key concept in the surface area that Snowpipe adds to Snowflake. A pipe definition wraps the familiar COPY statement used for data loading with Snowflake, and most of the semantics of the existing COPY statement carry forward to a pipe. The main difference is that a pipe continuously watches for new data and continuously loads it from the stage the pipe references.
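For example, a pipe wrapping a COPY statement over an existing stage could be defined roughly as follows (a sketch; the table, stage and file format names are hypothetical):

CREATE PIPE mypipe AS
  COPY INTO mytable
  FROM @mystage
  FILE_FORMAT = (TYPE = 'JSON');  -- hypothetical target table, stage and format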

Snowpipe’s surface area provides two different levels of control for pipes. For most use cases, Snowpipe can rely on Amazon SQS notifications from an S3 bucket to trigger data loads. This requires only a one-time configuration of an S3 bucket and a set of target tables, which usually takes less than 15 minutes. It’s completely configuration-based, with no application code to write other than a few Snowflake DDL statements. This experience is available for preview in December 2017. The following diagram illustrates this approach:

 

 
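With this notification-based experience, the only code involved is Snowflake DDL along the following lines (a sketch; AUTO_INGEST marks the pipe as driven by S3 event notifications, and the object names are hypothetical):

CREATE PIPE mypipe_auto
  AUTO_INGEST = TRUE  -- pipe is triggered by S3 event notifications delivered via SQS
  AS COPY INTO mytable FROM @mystage FILE_FORMAT = (TYPE = 'JSON');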

For use cases that require more control or deeper integration into applications, Snowpipe also provides a programmatic REST API to notify Snowflake of new data in S3. The REST API is available today in preview for Snowflake customers in the US West region. The following diagram shows an architectural overview of this approach:

 

For both Snowpipe experiences, Snowflake runs and manages a fleet of servers that asynchronously perform the actual data loading into the target tables. This server fleet is completely internal to Snowflake, which automatically adds or removes servers from the fleet depending on the current Snowpipe load. Customers don’t need to worry about monitoring or managing compute capacity when using Snowpipe for their data loads.

Snowpipe utilization, billing and cost

Snowpipe uses a serverless billing model: customers are charged based on their actual compute resource utilization rather than on capacity reservations that may sit idle or become overutilized. Snowpipe tracks the resource consumption of the pipes in a given Snowflake account for the load requests each pipe processes, with per-second/per-core granularity, and the recorded utilization is translated into familiar Snowflake credits. Snowpipe utilization shows up as Snowflake credits on the bill, and account administrators can track it on their Snowflake account pages during the course of the month. Snowpipe utilization appears as a special Snowflake warehouse (indicated by the Snowflake logo preceding the warehouse name) in the Warehouse tab of Billing & Usage on the Snowflake web portal.

 

A table function called pipe_utilization_history in Snowflake SQL allows you to drill into Snowpipe utilization details over specific periods of time or for specific pipes.
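For example, utilization for the last week could be retrieved with a query along these lines (a sketch; the schema qualification and argument names are assumptions patterned after Snowflake’s other account-usage table functions, so check the documentation for the exact signature):

SELECT *
FROM TABLE(information_schema.pipe_utilization_history(
  date_range_start => DATEADD('day', -7, CURRENT_TIMESTAMP()),
  date_range_end   => CURRENT_TIMESTAMP()));  -- argument names assumed; see the documentation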

Give Snowpipe a spin today and let us know your feedback. Snowpipe using REST-based notifications is available today. You can find the documentation and information on how to get started here.

Snowpipe with auto-ingest using SQS is available in December. If you are interested in participating in a private preview for this capability, please let us know here. Make sure to also read part two of this blog about Snowpipe here.

You can also try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

How Snowpipe Streamlines Your Continuous Data Loading and Your Business

For anyone who harbors a love-hate relationship with data loading, it’s time to tip the scales.

We all know data can be difficult to work with. The challenges start with the varying formats and complexity of the data itself. This is especially the case with semi-structured data such as JSON, Avro and XML, and it continues with the significant programming skills needed to extract and process data from multiple sources. Making matters worse, traditional on-premises and cloud data warehouses require batch loading of data (with limitations on the size of the data files ingested) and huge manual effort to run and manage servers.

The results? Poor, slow performance and the inability to extract immediate insights from all your data. Data scientists and analysts are forced to wait days or even weeks before they can use the data to develop accurate models, spot trends and identify opportunities. Consequently, executives don’t get the up-to-the-minute insights they need to make real-time decisions with confidence and speed.

Common problems that affect data loading include:

  • Legacy architecture – Tightly coupled storage and compute force data loading to contend with queries for resources.
  • Stale data – Batch loading prevents organizations from acquiring instant, data-driven insight.
  • Limited data – Lack of support for semi-structured data requires transforming newer data types and defining a schema before loading, which introduces delays.
  • Manageability – Dedicated clusters or warehouses are required to handle the loading of data.
  • High-maintenance – Traditional data warehouse tools result in unnecessary overhead in the form of constant indexing, tuning, sorting and vacuuming.

These obstacles all point to the need for a solution that allows continuous data loading without impacting other workloads, without requiring the management of servers and without crippling the performance of your data warehouse.

Introducing Snowpipe, our continuous, automated and cost-effective service that loads all of your data quickly and efficiently without any manual effort. How does Snowpipe work?

Snowpipe automatically listens for new data as it arrives in your cloud storage environment and continuously loads it into Snowflake. With Snowpipe’s unlimited concurrency, other workloads are never impacted, and you benefit from serverless, continuous loading without ever worrying about provisioning. That’s right. There are no servers to manage and no manual effort is required. Snowpipe makes all this happen automatically.

The direct benefits of Snowpipe’s continuous data loading include:

  • Instant insights – Immediately provide fresh data to all your business users without contention.
  • Cost-effectiveness – Pay only for the per-second compute utilized to load data rather than running a warehouse continuously or by the hour. 
  • Ease-of-use – Point Snowpipe at an S3 bucket from within the Snowflake UI and data will automatically load asynchronously as it arrives.
  • Flexibility – Technical resources can interface directly with the programmatic REST API, using Java and Python SDKs to enable highly customized loading use cases.
  • Zero management – Snowpipe automatically provisions the correct capacity for the data being loaded. No servers or management to worry about.

Snowpipe frees up resources across your organization so you can focus on analyzing your data, not managing it. Snowpipe puts your data on pace with near real-time analytics. At Snowflake, we tip the scales on your love-hate relationship with data so you can cherish your data without reservation.

Read more about the technical aspects of Snowpipe on our engineering blog. For an in-depth look at Snowpipe in action, you can also join us for a live webinar on December 14th.

Try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

 

Snowflake and Spark, Part 1: Why Spark?

This is the first post in a two-part series describing Snowflake’s integration with Spark. In this post, we introduce the Snowflake Connector for Spark (package available from Maven Central or Spark Packages, source code in GitHub) and make the case for using it to bring Spark and Snowflake together to power your data-driven solutions.

What is Spark?

Apache Spark is a distributed data processing system that supports functional, declarative and imperative programming styles. Its popularity and power lie in its range of programming paradigms, its supported APIs (Scala, R and Python), its machine-learning libraries, and its tight integration with the Hadoop ecosystem. Spark also provides connectivity to a wide variety of data sources, including SQL-based relational database systems and NoSQL systems. As a result, Spark has become the tool of choice for data engineering tasks.

With the introduction of the Snowflake Connector for Spark in June 2016, Snowflake enabled connectivity to and from Spark. The connector provides the Spark ecosystem with access to Snowflake as a fully-managed and governed repository for all data types, including JSON, Avro, CSV, XML, machine-born data, etc. The connector also enables powerful use cases that integrate Spark and Snowflake, including:

  • Complex ETL: Using Spark, you can easily build complex, functionally rich and highly scalable data ingestion pipelines for Snowflake. With a large set of readily-available connectors to diverse data sources, Spark facilitates data extraction, which is typically the first part of any complex ETL pipeline. Spark also helps with computationally-involved data transformation tasks such as sessionization, data cleansing, data consolidation, and data unification, which usually happen at later stages in the ETL pipeline. Using the Snowflake Connector for Spark, the data produced by these complex ETL pipelines can now easily be stored in Snowflake for broad, self-service access across the organization using standard SQL and SQL tools.
  • Machine Learning: Spark provides a rich ecosystem for machine learning and predictive analytics functionality, e.g. the popular machine learning library, MLlib. With the integration between Spark and Snowflake, Snowflake provides you with an elastic, scalable repository for all the data underlying your algorithm training and testing. With machine learning, processing capacity needs can fluctuate heavily. Snowflake can easily expand its compute capacity to allow your machine learning in Spark to process vast amounts of data.

Enabling Spark in AWS EMR with Snowflake

With the Snowflake Connector for Spark, you can use Spark clusters, e.g. in AWS EMR or Databricks, and connect them easily to Snowflake. For example, you can create an EMR cluster with Spark pre-installed by selecting Spark as an application. Before we dive into the details of using Snowflake with Spark, the following code samples illustrate how to create and connect to a Spark cluster in AWS EMR and start a spark-shell that uses the connector:

STEP 1: Create a Spark cluster in AWS EMR 5.4.0 with Spark 2.1 using the AWS CLI. For example, in US-West-2:

aws emr create-cluster \
   --applications Name=Ganglia Name=Spark \
   --ec2-attributes '{"KeyName":"","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":""}' \
   --service-role EMR_DefaultRole \
   --enable-debugging \
   --release-label emr-5.4.0 \
   --log-uri 's3n:///elasticmapreduce/' \
   --name '' \
   --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master Instance Group"},{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core Instance Group"}]' \
   --configurations '[{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"},"Configurations":[]}]' \
   --scale-down-behavior TERMINATE_AT_INSTANCE_HOUR \
   --region us-west-2

STEP 2: Connect to the cluster using ssh:

ssh -i ~/.pem hadoop@.compute.amazonaws.com

STEP 3: Start spark-shell with the Snowflake connector packages. Alternatively, you can also pre-load the packages using the packages option when creating the cluster. For example:

spark-shell --packages net.snowflake:snowflake-jdbc:3.0.14,net.snowflake:spark-snowflake_2.11:2.1.3

Where:

  • spark-snowflake_2.11 specifies the connector artifact ID (for Scala 2.11).
  • 2.1.3 specifies the connector version. Note that this version is for Spark 2.1. For Spark 2.0, use 2.1.3-spark_2.0 instead.

Also, note that, if you are not running from an EMR cluster, you need to add the package for AWS support to the packages list. For instance, when you run spark-shell from a local installation, your packages list will look like this:

spark-shell --packages net.snowflake:snowflake-jdbc:3.0.14,net.snowflake:spark-snowflake_2.11:2.1.3,org.apache.hadoop:hadoop-aws:2.8.0

STEP 4: In spark-shell, you then need to define which Snowflake database and virtual warehouse to use. The connector also needs access to a staging area in AWS S3, which must be configured as well. You can do that with the following Scala commands in spark-shell:

// Configuration of the staging area for the connector in AWS S3 
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "")

// Snowflake instance parameters
val defaultOptions = Map(
  "sfURL" -> "",
  "sfAccount" > "",
  "sfUser" -> "",
  "sfPassword" -> "",
  "sfDatabase" -> "",
  "sfSchema" -> "public",
  "sfWarehouse" -> "",
  "awsAccessKey" -> sc.hadoopConfiguration.get("fs.s3n.awsAccessKeyId"),
  "awsSecretKey" -> sc.hadoopConfiguration.get("fs.s3n.awsSecretAccessKey"),
  "tempdir" -> "s3n:///"
)

With these steps, you now have a Spark cluster and a spark-shell running in AWS EMR that you can use to showcase some of the most common use cases, such as complex ETL and machine learning.

Overview of Loading Data from Spark into Snowflake

Before we get into our discussion of ETL using Spark and Snowflake, let’s first take a look at what happens behind the scenes when the Snowflake Spark connector persists Spark data frames in Snowflake.

For data loading operations, the Spark connector for Snowflake performs three separate steps, similar to the spark-redshift connector:

  1. The first step persists the contents of the data frame in the staging area in AWS S3. This step currently uses the S3 bucket provided during the configuration of the Spark connector (as described in the Snowflake documentation). Future versions of the connector will not require an S3 bucket.
  2. The second step then connects to Snowflake from Spark using JDBC. On that JDBC connection, it issues a COPY command to load the data.
  3. As the third and final step, the COPY command retrieves the data from the staging area in S3 and uses the current virtual warehouse to load it into tables in the Snowflake database (sketched below).
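Conceptually, these last two steps amount to Snowflake SQL along the following lines. This is a sketch only: the connector generates and manages the actual statements, and the stage, credentials and file format shown here are hypothetical.

-- Hypothetical illustration of the staging and COPY performed by the connector over JDBC
CREATE TEMPORARY STAGE spark_tmp_stage
  URL = 's3://<tempdir-bucket>/<tempdir-path>'
  CREDENTIALS = (AWS_KEY_ID = '<awsAccessKey>' AWS_SECRET_KEY = '<awsSecretKey>');

COPY INTO <target_table>         -- the table named via the "dbtable" option
  FROM @spark_tmp_stage
  FILE_FORMAT = (TYPE = CSV);    -- the connector chooses the actual format and options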

Note that in this architecture the slave nodes in the Spark cluster and the compute nodes in Snowflake (i.e. the virtual warehouse) carry the data flow. This allows for much greater scale than a conventional approach where the data flows through the JDBC connection itself: you can scale the Spark cluster or the Snowflake virtual warehouse independently to increase data transfer bandwidth between Spark and Snowflake, whereas the conventional approach is always limited by the bandwidth of a single JDBC connection.

Complex ETL in Spark with Snowflake

Today, data engineers need to continuously extract and transform data from various sources before it can be made available for consumption by business users through querying and analytics. Given the volume of data that today’s organizations face, a significant amount of compute and storage capacity needs to be made available to perform these tasks in a timely and scalable fashion. Spark has become the tool of choice for many data engineers to implement the computation steps along their data processing pipelines. This is due to the high efficiency of Spark with its in-memory execution capabilities, the availability of libraries from the Spark ecosystem, and the ease of development with languages such as Scala and Python.

The following example illustrates how Spark can be used to implement a simple data ingestion pipeline that performs several transformations on the new data before storing it in Snowflake. The example uses a web log scenario. Assume that new data is read from a web server log file, in this case using the Apache web log format. Log lines are made available as a list in Scala. The sample parses the IP addresses from the log lines and transforms them into ZIP codes using REST calls to the FreeGeoIP web service. The resulting list of ZIP codes is then stored in Snowflake.

// *************************************
// Geo mapping of Apache web log lines 
// *************************************
import scala.util.parsing.json.JSON._
import scala.io.Source.{fromInputStream}
import java.net._

// Example data: Apache web server log lines
var weblog: List[String] = List(
 """8.8.8.8 - - [07/Mar/2017:00:05:49 -0800] "GET /foo/index.html HTTP/1.1" 401 12846""",
 """8.8.8.8 - - [07/Mar/2017:00:06:51 -0800] "GET /bar/index.html HTTP/1.1" 200 4523""")
// Function to parse a web log line and resolve its IP address to a zip code
def ParseLogLine (strLine : String) : Int = {
 var Array(ip, d1, d2, dt, tz, request, url, proto, ret, size) = strLine.split(" ")

 // REST call to the FreeGeoIP service for the IP address
 var url2 = "http://freegeoip.net/json/".concat(s"$ip")
 var result2 = scala.io.Source.fromURL(url2).mkString

 // Parse the JSON response and extract the zip code
 var json: Option[Any] = parseFull(result2)
 var map: Map[String,Any] = json.get.asInstanceOf[Map[String, Any]]
 var zip: String = map("zip_code").asInstanceOf[String]

 var intZip: Int = zip.trim.toInt

 return intZip
}
// Parse the web log
var zips: List[Int] = weblog.map { line => ParseLogLine(line) }
// Let's take a look at the zip codes we found in the log
zips.foreach(println(_))

// Get a data frame for the zip codes list
var df = zips.toDF()

// Push the list/df content to Snowflake as a new table
import org.apache.spark.sql.SaveMode
df.write.format("net.snowflake.spark.snowflake").options(defaultOptions).option("dbtable", "zip_codes").mode(SaveMode.Overwrite).save()

Now we have the zip codes in Snowflake and can start using them in Snowflake queries and BI tools that connect to Snowflake.
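For instance, a quick sanity check on the new table directly in Snowflake could look like this (assuming the data frame’s default column name, value, carried over to the table; adjust the identifier if your connector version maps it differently):

-- Column name assumed from the data frame's default ("value"); verify against the table definition
SELECT "value" AS zip_code, COUNT(*) AS request_count
FROM zip_codes
GROUP BY zip_code
ORDER BY request_count DESC;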

Machine Learning in Spark with Snowflake Connectivity

Part of Spark’s appeal is how easy it is to use machine learning capabilities over the data that has been made available to Spark. For example, MLlib, a popular library for machine learning, comes as part of the standard Spark configuration. With these machine learning capabilities in hand, organizations can easily gain new insights and business value from the data that they acquire.

Expanding on our previous web log example, you may wonder what zip codes, or broader geographical areas, the requests in the web server logs are coming from. The following Scala code illustrates how to retrieve the result of a Snowflake query into Spark and apply machine learning functions to it:

// *************************************
// Retrieve zip codes stored in Snowflake into Spark
// *************************************

// Function to retrieve a Snowflake query result into a Spark data frame
def snowflakedf(sql: String) = {
 spark.read
 .format("net.snowflake.spark.snowflake")
 .options(defaultOptions)
 .option("query", sql)
 .load()
}

// Snowflake SQL query to retrieve all ZIP codes
val df2 = snowflakedf("SELECT * FROM zip_codes")

// *************************************
// Machine learning over zip codes stored in Snowflake 
// *************************************
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Convert DF with SQL result into Vectors for the MLlib API
var vectors = df2.rdd.map(r => Vectors.dense(r.getDecimal(0).doubleValue()))
vectors.cache

var numClusters = 2
var numIterations = 20
var clusters = KMeans.train(vectors, numClusters, numIterations)
clusters.clusterCenters

You can now use snowflakedf() to define Spark data frames that are populated with the results of arbitrary Snowflake queries.

Summary and Next Steps

As we’ve illustrated in this post, Spark is a powerful tool for data wrangling. Its rich ecosystem provides compelling capabilities for complex ETL and machine learning. With the deep integration into Spark provided by the connector, Snowflake can now serve as the fully-managed and governed database for all your Spark data, including traditional relational data, JSON, Avro, CSV, XML, machine-born data, etc. This makes Snowflake your repository of choice in any Spark-powered solution.

So what’s next? We encourage you to try Snowflake and its integration with Spark in your data processing solutions today.

Also, are you interested in helping design and build the next-generation Spark-Snowflake integration? If so, we invite you to take a look at the open Engineering positions on our careers page.

In part 2 of this series, we’ll take a look behind the scenes to better understand how the Spark connector processes queries that retrieve data from Snowflake into Spark and how this can be used to enable high-performance ELT solutions using Spark and Snowflake. In the meantime, keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.