Posted by Torsten Grabs
Apr 20, 2017

Contributing author: Edward Ma.

This is the first post in an ongoing series describing Snowflake’s integration with Spark. In this post, we introduce the Snowflake Connector for Spark (package available from Maven Central or Spark Packages, source code in Github) and make the case for using it to bring Spark and Snowflake together to power your data-driven solutions.

What is Spark?

Apache Spark is a distributed data processing system with support for functional, declarative and imperative programming styles. Its popularity and power lie in its myriad of programming paradigms, supported APIs (Scala, Java, Python, and R), machine-learning libraries, and tight integration with the Hadoop ecosystem. Spark also provides connectivity to a wide variety of data sources, including SQL-based relational database systems and NoSQL systems. As a result, Spark has become the tool of choice for data engineering tasks.

With the introduction of the Snowflake Connector for Spark in June 2016, Snowflake enabled connectivity to and from Spark. The connector provides the Spark ecosystem with access to Snowflake as a fully-managed and governed repository for all data types, including JSON, Avro, CSV, XML, machine-generated data, etc. The connector also enables powerful use cases that integrate Spark and Snowflake, including:

  • Complex ETL: Using Spark, you can easily build complex, functionally rich and highly scalable data ingestion pipelines for Snowflake. With a large set of readily-available connectors to diverse data sources, Spark facilitates data extraction, which is typically the first step in any complex ETL pipeline. Spark also helps with computationally-involved tasks for data transformation such as sessionization, data cleansing, data consolidation, and data unification, which usually happen at later stages in the ETL pipeline. Using the Snowflake Connector for Spark, the data produced by these complex ETL pipelines can now easily be stored in Snowflake for broad, self-service access across the organization using standard SQL and SQL tools.
  • Machine Learning: Spark provides a rich ecosystem for machine learning and predictive analytics functionality, e.g. the popular machine learning library, MLlib. With the integration between Spark and Snowflake, Snowflake provides you with an elastic, scalable repository for all the data underlying your algorithm training and testing. With machine learning, processing capacity needs can fluctuate heavily. Snowflake can easily expand its compute capacity to allow your machine learning in Spark to process vast amounts of data.

Enabling Spark in AWS EMR with Snowflake

With the Snowflake Connector for Spark, you can use Spark clusters, e.g. in AWS EMR or Databricks, and connect them easily with Snowflake. For example, you can create an EMR cluster with Spark pre-installed by selecting Spark as the application. Before we dive into the details of using Snowflake with Spark, the following code samples illustrate how to create and connect to a Spark cluster in AWS EMR and start a spark-shell using the connector:

STEP 1: Create a Spark cluster in AWS EMR 5.4.0 with Spark 2.1 using the AWS CLI. For example, in US-West-2:

aws emr create-cluster \
   --applications Name=Ganglia Name=Spark \
   --ec2-attributes '{"KeyName":"","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":""}' \
   --service-role EMR_DefaultRole \
   --enable-debugging \
   --release-label emr-5.4.0 \
   --log-uri 's3n:///elasticmapreduce/' \
   --name '' \
   --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master Instance Group"},{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core Instance Group"}]' \
   --configurations '[{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"},"Configurations":[]}]' \
   --scale-down-behavior TERMINATE_AT_INSTANCE_HOUR \
   --region us-west-2

STEP 2: Connect to the cluster using ssh:

ssh -i ~/.pem hadoop@.compute.amazonaws.com

STEP 3: Start spark-shell with the Snowflake connector packages. Alternatively, you can pre-load the packages using the packages option when creating the cluster. For example:

spark-shell --packages net.snowflake:snowflake-jdbc:3.0.14,net.snowflake:spark-snowflake_2.11:2.1.3

Where:

  • spark-snowflake_2.11 specifies the connector artifact ID (for Scala 2.11).
  • 2.1.3 specifies the connector version. Note that this version is for Spark 2.1. For Spark 2.0, use 2.1.3-spark_2.0 instead.

Also note that if you are not running on an EMR cluster, you need to add the package for AWS support to the packages list. For instance, when you run spark-shell from a local installation, your packages list will look like this:

spark-shell --packages net.snowflake:snowflake-jdbc:3.0.14,net.snowflake:spark-snowflake_2.11:2.1.3,org.apache.hadoop:hadoop-aws:2.8.0

STEP 4: In spark-shell, you then need to define which Snowflake database and virtual warehouse to use. The connector also needs access to a staging area in AWS S3, which you must configure. You can do that with the following Scala commands in spark-shell:

// Configuration of the staging area for the connector in AWS S3 
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "")

// Snowflake instance parameters
val defaultOptions = Map(
  "sfURL" -> "",
  "sfAccount" -> "",
  "sfUser" -> "",
  "sfPassword" -> "",
  "sfDatabase" -> "",
  "sfSchema" -> "public",
  "sfWarehouse" -> "",
  "awsAccessKey" -> sc.hadoopConfiguration.get("fs.s3n.awsAccessKeyId"),
  "awsSecretKey" -> sc.hadoopConfiguration.get("fs.s3n.awsSecretAccessKey"),
  "tempdir" -> "s3n:///"
)

With these steps, you now have a Spark cluster and a spark-shell running in AWS EMR that you can use to showcase some of the most common use cases, such as complex ETL and machine learning.

Overview of Loading Data from Spark into Snowflake

Before we get into our discussion of ETL using Spark and Snowflake, let’s first take a look at what happens behind the scenes when the Snowflake Spark connector persists Spark data frames in Snowflake.

For data loading operations, the Spark connector for Snowflake performs three separate steps, similar to the spark-redshift connector:

  1. The first step persists the contents of the data frame in the staging area in AWS S3. This step currently uses the S3 bucket provided during the configuration of the Spark connector (as described in the Snowflake documentation). Future versions of the connector will not require an S3 bucket.
  2. The second step then connects to Snowflake from Spark using JDBC. On that JDBC connection, it issues a COPY command to load the data.
  3. As the third and final step, the COPY command retrieves the data from the staging area in S3 and uses the current virtual warehouse to load it into tables in the Snowflake database.
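Conceptually, the COPY command the connector issues in step 2 looks like the following sketch. This is illustrative only: the stage path, file format, and credentials shown here are placeholders, not the statement the connector actually generates.

```sql
-- Illustrative sketch only: bucket path, file format, and credentials are placeholders
COPY INTO zip_codes
  FROM 's3://my-staging-bucket/spark-connector-tmp/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
  FILE_FORMAT = (TYPE = CSV);
```

The key point is that the bulk data movement happens between S3 and the Snowflake virtual warehouse, not over the JDBC connection.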

Note that the slave nodes in the Spark cluster and the compute nodes in Snowflake (i.e. the virtual warehouse) process the data flow directly. This approach allows for much greater scale than a more conventional approach where the data flows through the JDBC connection. You can scale the Spark cluster or the Snowflake virtual warehouse independently to increase the data transfer bandwidth between Spark and Snowflake, whereas the conventional approach is always limited to the bandwidth of a single JDBC connection.

Complex ETL in Spark with Snowflake

Today, data engineers need to continuously extract and transform data from various sources before it can be made available for consumption by business users through querying and analytics. Given the volume of data that today’s organizations face, a significant amount of compute and storage capacity needs to be made available to perform these tasks in a timely and scalable fashion. Spark has become the tool of choice for many data engineers to implement the computation steps along their data processing pipelines. This is due to the high efficiency of Spark with its in-memory execution capabilities, the availability of libraries from the Spark ecosystem, and the ease of development with languages such as Scala and Python.

The following example illustrates how Spark can be used to implement a simple data ingestion pipeline that performs several transformations on the new data before storing it in Snowflake. The example uses a web log scenario. Assume that new data is read from a web server log file, in this case using the Apache web log format. Log lines are made available as a list in Scala. The sample parses the IP addresses from the log lines and transforms them into ZIP codes using REST calls to the FreeGeoIP web service. The resulting list of ZIP codes is then stored in Snowflake.

// *************************************
// Geo mapping of Apache web log lines 
// *************************************
import scala.util.parsing.json.JSON._
import scala.io.Source.{fromInputStream}
import java.net._

// Example data: Apache web server log lines
var weblog: List[String] = List(
 """8.8.8.8 - - [07/Mar/2017:00:05:49 -0800] "GET /foo/index.html HTTP/1.1" 401 12846""",
 """8.8.8.8 - - [07/Mar/2017:00:06:51 -0800] "GET /bar/index.html HTTP/1.1" 200 4523""")
// Function to parse a web log line and map its IP address to a zip code
def ParseLogLine (strLine : String) : Integer = {
  val Array(ip, d1, d2, dt, tz, request, url, proto, ret, size) = strLine.split(" ")

  // REST call to the FreeGeoIP web service for this IP address
  val url2 = s"http://freegeoip.net/json/$ip"
  val result2 = scala.io.Source.fromURL(url2).mkString

  // Parse the JSON response and extract the zip code
  val json: Option[Any] = parseFull(result2)
  val map: Map[String, Any] = json.get.asInstanceOf[Map[String, Any]]
  val zip: String = map("zip_code").asInstanceOf[String]

  zip.trim.toInt
}

// Parse the web log lines into zip codes
var zips: List[Integer] = weblog.map(line => ParseLogLine(line))
// Let's take a look at the zip codes we found in the log
zips.foreach(println(_))

// Get a data frame for the zip codes list
var df = zips.toDF()

// Push the list/df content to Snowflake as a new table
import org.apache.spark.sql.SaveMode
df.write
  .format("net.snowflake.spark.snowflake")
  .options(defaultOptions)
  .option("dbtable", "zip_codes")
  .mode(SaveMode.Overwrite)
  .save()

Now we have the zip codes in Snowflake and can start using them in Snowflake queries and BI tools that connect to Snowflake.
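For example, since toDF() on a list of integers produces a single column that Spark names value by default, a simple aggregation over the new table might look like the following sketch (assuming the zip_codes table created above):

```sql
-- Count requests per ZIP code; "value" is the default column name from toDF()
SELECT value AS zip_code, COUNT(*) AS request_count
FROM zip_codes
GROUP BY value
ORDER BY request_count DESC;
```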

Machine Learning in Spark with Snowflake Connectivity

Part of Spark’s appeal is how easy it is to use machine learning capabilities over the data that has been made available to Spark. For example, MLlib, a popular library for machine learning, comes as part of the standard Spark configuration. With these machine learning capabilities in hand, organizations can easily gain new insights and business value from the data that they acquire.

Expanding on our previous web log example, you may wonder what zip codes or broader geographical areas the requests in the web server logs are coming from. The following Scala code illustrates how to run a query against Snowflake, retrieve the result into a Spark data frame, and apply machine learning functions to it:

// *************************************
// Retrieve zip codes stored in Snowflake into Spark
// *************************************

// Function to retrieve a Snowflake query result into a Spark data frame
def snowflakedf(sql: String) = {
  spark.read
    .format("net.snowflake.spark.snowflake")
    .options(defaultOptions)
    .option("query", sql)
    .load()
}

// Snowflake SQL query to retrieve all ZIP codes
val df2 = snowflakedf("SELECT * FROM zip_codes")

// *************************************
// Machine learning over zip codes stored in Snowflake 
// *************************************
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Convert DF with SQL result into Vectors for the MLlib API
var vectors = df2.rdd.map(r => Vectors.dense(r.getDecimal(0).doubleValue()))
vectors.cache

var numClusters = 2
var numIterations = 20
var clusters = KMeans.train(vectors, numClusters, numIterations)
clusters.clusterCenters

You can now use snowflakedf() to define Spark data frames that are populated with data from a Snowflake query.
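For instance, here is a sketch of pushing a filter down into the Snowflake query rather than filtering in Spark; the predicate and the value column name (the default produced by toDF() above) are illustrative:

```scala
// Sketch: let Snowflake evaluate the predicate before the data reaches Spark
val westCoastZips = snowflakedf("SELECT value FROM zip_codes WHERE value BETWEEN 90000 AND 99999")
westCoastZips.show()
```

Filtering on the Snowflake side reduces the amount of data staged and transferred into the Spark cluster.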

Summary and Next Steps

As we’ve illustrated in this post, Spark is a powerful tool for data wrangling. Its rich ecosystem provides compelling capabilities for complex ETL and machine learning. With the deep integration into Spark provided by the connector, Snowflake can now serve as the fully-managed and governed database for all your Spark data, including traditional relational data, JSON, Avro, CSV, XML, machine-generated data, etc. This makes Snowflake your repository of choice in any Spark-powered solution.

So what’s next? We encourage you to try Snowflake and its integration with Spark in your data processing solutions today.

Also, are you interested in helping design and build the next-generation Spark-Snowflake integration? If so, we invite you to take a look at the open Engineering positions on our careers page.

In part 2 of this series, we’ll take a look behind the scenes to better understand how the Spark connector processes queries that retrieve data from Snowflake into Spark and how this can be used to enable high-performance ELT solutions using Spark and Snowflake. In the meantime, keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.