How to Use AWS Glue with Snowflake

The process of extraction, transformation and load (ETL) is central to any data warehousing initiative. With advances in cloud data warehouse architectures, customers are also benefiting from the alternative approach of extraction, load, and transformation (ELT), where data processing is pushed to the database.

With either approach, the debate continues: should you take a hand-coded approach or leverage any number of the available ETL or ELT data integration tools? While there are advantages to both, and some will choose a “one or the other” approach, many organizations select a combination of a data integration tool and hand coding. Code provides developers with the flexibility to build using preferred languages while maintaining a high level of control over integration processes and structures. The challenge has been that hand-coding options are traditionally more complex and costly to maintain. However, with AWS Glue, developers now have an option to easily build and manage their data preparation and loading processes with generated code that is customizable, reusable and portable, with no infrastructure to buy, set up or manage.

In this blog, we’ll cover how to leverage the power of AWS Glue with Snowflake and how processing is optimized through the use of query pushdown for ELT.

Why AWS Glue with Snowflake

Snowflake customers now have a simple option to manage their programmatic data integration processes without worrying about servers, Spark clusters or the ongoing maintenance traditionally associated with these systems. AWS Glue provides a fully managed environment which integrates easily with Snowflake’s data warehouse-as-a-service. Together, these two solutions enable customers to manage their data ingestion and transformation pipelines with more ease and flexibility than ever before. With AWS Glue and Snowflake, customers get the added benefit of Snowflake’s query pushdown which automatically pushes Spark workloads, translated to SQL, into Snowflake. Customers can focus on writing their code and instrumenting their pipelines without having to worry about optimizing Spark performance (For more on this, read our “Why Spark” and our “Pushing Spark Query Processing to Snowflake” blogs). With AWS Glue and Snowflake, customers can reap the benefits of optimized ELT processing that is low cost and easy to use and maintain.

AWS Glue and Snowflake in Action

Prerequisites:

  • The latest Snowflake Spark Connector .jar file
  • The latest Snowflake JDBC driver .jar file
  • An AWS account with access to S3 and AWS Glue
  • A Snowflake account and the credentials of a user with the appropriate privileges

Setup

  1. Log into AWS.
  2. Search for and click on the S3 link.
    1. Create an S3 bucket and folder.
    2. Add the Spark Connector and JDBC .jar files to the folder.
    3. Create another folder in the same bucket to be used as the Glue temporary directory in later steps (see below).
  3. Switch to the AWS Glue Service.
  4. Click on Jobs on the left panel under ETL.
  5. Add a job by clicking Add job.
    1. Provide a name for the job.
    2. Select an IAM role. Create a new IAM role if one doesn’t already exist and be sure to add all Glue policies to this role.
    3. Select the option for A new script to be authored by you.
    4. Give the script a name.
    5. Set the temporary directory to the one you created in step 2.3.
    6. Expand Script libraries and job parameters:
      1. Under Dependent jars path, add entries for both .jar files from step 2.2.

[NOTE: You have to add the full path to the actual .jar files. Example: s3://[bucket_name]/GlueJars/spark-snowflake_2.11-2.2.6.jar,s3://[bucket_name]/GlueJars/snowflake-jdbc-3.2.4.jar]

      2. Under Job parameters, enter your Snowflake account information as key/value pairs (the parameter keys the sample script expects are listed after these setup steps). Make sure to include the two dashes before each key.

[NOTE: Storing your account information and credentials this way will expose them to anyone with access to this job. This can be useful for testing purposes, but it is recommended that you store your credentials securely as outlined in the Securing credentials section below.]

  6. Click Next, click Next again, then click Finish.
  7. You will be prompted with a blank script interface.
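
For reference, the sample script in the next section expects the following parameter keys (the values shown are placeholders; substitute your own Snowflake account details, and note that the exact URL format depends on your account's region and deployment):

  --URL        [account].snowflakecomputing.com
  --ACCOUNT    [account_name]
  --WAREHOUSE  [warehouse_name]
  --DB         [database_name]
  --SCHEMA     [schema_name]
  --USERNAME   [snowflake_user]
  --PASSWORD   [snowflake_password]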

Sample script

Use the following sample script to test the integration between AWS Glue and your Snowflake account. This script assumes you have stored your account information and credentials using Job parameters as described in section 5.6.2.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from py4j.java_gateway import java_import
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

## @params: [JOB_NAME, URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA', 'USERNAME', 'PASSWORD'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
java_import(spark._jvm, "net.snowflake.spark.snowflake")

## uj = sc._jvm.net.snowflake.spark.snowflake
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
sfOptions = {
  "sfURL" : args['URL'],
  "sfAccount" : args['ACCOUNT'],
  "sfUser" : args['USERNAME'],
  "sfPassword" : args['PASSWORD'],
  "sfDatabase" : args['DB'],
  "sfSchema" : args['SCHEMA'],
  "sfWarehouse" : args['WAREHOUSE'],
}

## Read from a Snowflake table into a Spark Data Frame
df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "[table_name]").load()

## Perform any kind of transformations on your data and save as a new Data Frame
## df1 = df.[Insert any filter, transformation, or other operation]

## Write the Data Frame contents back to Snowflake in a new table
df1.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "[new_table_name]").mode("overwrite").save()

job.commit()

Securing credentials

To securely store your account information and credentials, see the following article which describes how this is accomplished with EC2: How to Securely Store Credentials with EC2.
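
As an alternative (a sketch of my own, not from the article referenced above), you can store the values as SecureString parameters in AWS Systems Manager Parameter Store and read them from the Glue job with boto3. This assumes parameters such as /SNOWFLAKE/USERNAME already exist under a /SNOWFLAKE/ prefix and that the Glue job's IAM role is allowed to call ssm:GetParameters and kms:Decrypt:

import boto3

def get_snowflake_options(region='us-east-1'):
    # Assumed parameter names; adjust the prefix to whatever you configured in SSM
    names = ['/SNOWFLAKE/URL', '/SNOWFLAKE/ACCOUNT', '/SNOWFLAKE/USERNAME',
             '/SNOWFLAKE/PASSWORD', '/SNOWFLAKE/DB', '/SNOWFLAKE/SCHEMA',
             '/SNOWFLAKE/WAREHOUSE']
    ssm = boto3.client('ssm', region)
    response = ssm.get_parameters(Names=names, WithDecryption=True)
    values = {p['Name']: p['Value'] for p in response['Parameters']}
    return {
        "sfURL": values['/SNOWFLAKE/URL'],
        "sfAccount": values['/SNOWFLAKE/ACCOUNT'],
        "sfUser": values['/SNOWFLAKE/USERNAME'],
        "sfPassword": values['/SNOWFLAKE/PASSWORD'],
        "sfDatabase": values['/SNOWFLAKE/DB'],
        "sfSchema": values['/SNOWFLAKE/SCHEMA'],
        "sfWarehouse": values['/SNOWFLAKE/WAREHOUSE'],
    }

sfOptions = get_snowflake_options()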

Conclusion

AWS Glue and Snowflake make it easy to get started and manage your programmatic data integration processes. AWS Glue can be used standalone or in conjunction with a data integration tool without adding significant overhead. With native query pushdown through the Snowflake Spark connector, this approach optimizes both processing and cost for true ELT processing. With AWS Glue and Snowflake, customers get a fully managed, fully optimized platform to support a wide range of custom data integration requirements.

 


Connecting a Jupyter Notebook to Snowflake via Spark (Part 4)

In the third part of this series, we learned how to connect Sagemaker to Snowflake using the Python connector. In this fourth and final post, we’ll cover how to connect Sagemaker to Snowflake with the Spark connector. If you haven’t already downloaded the Jupyter Notebooks, you can find them here.  

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.  

Spark Connector – local Spark

We’ll start with building a notebook that uses a local Spark instance. The Snowflake JDBC driver and the Spark connector must both be installed on your local machine. Installation of the drivers happens automatically in the Jupyter Notebook, so there’s no need for you to manually download the files. However, as a reference, the drivers can be downloaded here.

First, let’s review the installation process. The step outlined below handles downloading all of the necessary files plus the installation and configuration. You can initiate this step by performing the following actions:

  • Create a directory for the snowflake jar files
  • Define the drivers to be downloaded
  • Identify the latest version of the driver
  • Download the driver
%%bash
SFC_DIR=/home/ec2-user/snowflake
[ ! -d "$SFC_DIR" ] && mkdir $SFC_DIR 
cd $SFC_DIR
PRODUCTS='snowflake-jdbc spark-snowflake_2.11'
for PRODUCT in $PRODUCTS
do
   wget "https://repo1.maven.org/maven2/net/snowflake/$PRODUCT/maven-metadata.xml" 2> /dev/null
   VERSION=$(grep latest maven-metadata.xml | awk -F">" '{ print $2 }' | awk -F"<" '{ print $1 }')
   DRIVER=$PRODUCT-$VERSION.jar
   if [[ ! -e $DRIVER ]]
   then
      rm $PRODUCT* 2>/dev/null
      wget "https://repo1.maven.org/maven2/net/snowflake/$PRODUCT/$VERSION/$DRIVER" 2> /dev/null
   fi
   [ -e maven-metadata.xml ] && rm maven-metadata.xml
done

After both drivers are installed, you’re ready to create the SparkContext. But first, let’s review how the step below accomplishes this task.

To successfully build the SparkContext, you must add the newly installed libraries to the CLASSPATH. You can do this by running a shell command that lists the contents of the installation directory and adds the result to the CLASSPATH. With the Spark configuration pointing to all of the required libraries, you’re now ready to build both the Spark and SQL contexts.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext,SparkSession
from pyspark.sql.types import *
from sagemaker_pyspark import IAMRole, classpath_jars
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

sfc_jars=!ls -d /home/ec2-user/snowflake/*.jar

conf = (SparkConf()
        .set("spark.driver.extraClassPath", (":".join(classpath_jars())+":"+":".join(sfc_jars)))
        .setMaster('local')
        .setAppName('local-spark-test'))
sc=SparkContext(conf=conf)

spark = SQLContext(sc)
sc

With the SparkContext now created, you’re ready to load your credentials. You can complete this step following the same instructions covered in part three of this series.

import boto3

params=['/SNOWFLAKE/URL','/SNOWFLAKE/ACCOUNT_ID'
        ,'/SNOWFLAKE/USER_ID','/SNOWFLAKE/PASSWORD'
        ,'/SNOWFLAKE/DATABASE','/SNOWFLAKE/SCHEMA'
        ,'/SNOWFLAKE/WAREHOUSE','/SNOWFLAKE/BUCKET'
        ,'/SNOWFLAKE/PREFIX']

region='us-east-1'

def get_credentials(params):
   ssm = boto3.client('ssm',region)
   response = ssm.get_parameters(
      Names=params,
      WithDecryption=True
   )
   #Build dict of credentials
   param_values={k['Name']:k['Value'] for k in  response['Parameters']}
   return param_values

param_values=get_credentials(params)

After the SparkContext is up and running, you’re ready to begin reading data from Snowflake through the spark.read method. For this example, we’ll be reading 50 million rows.

sfOptions = {
  "sfURL" : param_values['/SNOWFLAKE/URL'],
  "sfAccount" : param_values['/SNOWFLAKE/ACCOUNT_ID'],
  "sfUser" : param_values['/SNOWFLAKE/USER_ID'],
  "sfPassword" : param_values['/SNOWFLAKE/PASSWORD'],
  "sfDatabase" : param_values['/SNOWFLAKE/DATABASE'],
  "sfSchema" : param_values['/SNOWFLAKE/SCHEMA'],
  "sfWarehouse" : param_values['/SNOWFLAKE/WAREHOUSE'],
}

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
  .options(**sfOptions) \
  .option("query", \
"select (V:main.temp_max - 273.15) * 1.8000 + 32.00 as temp_max_far, " +\
"       (V:main.temp_min - 273.15) * 1.8000 + 32.00 as temp_min_far, " +\
"       cast(V:time as timestamp) time, " +\
"       V:city.coord.lat lat, " +\
"       V:city.coord.lon lon " +\
"from snowflake_sample_data.weather.weather_14_total limit 5000000").load()

Here, you’ll see that I’m running a Spark instance on a single machine (i.e., the notebook instance server). On my notebook instance, it took about 2 minutes to first read 50 million rows from Snowflake and compute the statistical information.

Reading the full dataset (225 million rows) can render the notebook instance unresponsive. This is likely due to running out of memory. To mitigate this issue, you can either build a bigger notebook instance by choosing a different instance type or run Spark on an EMR cluster. The first option is usually referred to as scaling up, while the latter is called scaling out. Scaling out is more complex, but it also provides you with more flexibility. As such, we’ll review how to run the notebook instance against a Spark cluster.

Using the Spark Connector to create an EMR cluster

Harnessing the power of Spark requires connecting to a Spark cluster rather than a local Spark instance. Building a Spark cluster that is accessible by the Sagemaker Jupyter Notebook requires the following steps:

  • The Sagemaker server needs to be built in a VPC and therefore within a subnet
  • Build a new security group to allow incoming requests from the Sagemaker subnet via port 8998 (Livy API) and SSH (port 22) from your own machine (Note: This is for test purposes)
  • Advanced Options
    • Use the Advanced options link to configure all of the necessary options
  • Software and Steps
    • Pick Hadoop and Spark
    • Optionally, you can select Zeppelin and Ganglia
  • Hardware
    • Validate the VPC (Network). Note: The Sagemaker host needs to be created in the same VPC as the EMR cluster
    • Optionally, you can also change the instance types and indicate whether or not to use spot pricing
  • General Cluster Settings
    • Set the cluster name
    • Keep Logging for troubleshooting problems
  • Security
    • Pick an EC2 key pair (create one if you don’t have one already). Without the key pair, you won’t be able to access the master node via ssh to finalize the setup.
    • Create an additional security group to enable access via SSH and Livy
  • On the EMR master node, install pip packages sagemaker_pyspark, boto3 and sagemaker for python 2.7 and 3.4
  • Install the Snowflake Spark & JDBC driver
  • Update Driver & Executor extra Class Path to include Snowflake driver jar files

Let’s walk through this next process step-by-step. In the AWS console, find the EMR service, click “Create Cluster”, then click “Advanced Options”.

Creating a Spark cluster is a four-step process. Step one requires selecting the software configuration for your EMR cluster. (Note: Uncheck all other packages, then check Hadoop, Livy, and Spark only).

Step two specifies the hardware (i.e., the types of virtual machines you want to provision). For a test EMR cluster, I usually select spot pricing. As of the writing of this post, an on-demand M4.LARGE EC2 instance costs $0.10 per hour. I can typically get the same machine for $0.04, which includes a 32 GB SSD drive.

Step three defines the general cluster settings. Be sure to check “Logging” so you can troubleshoot if your Spark cluster doesn’t start. Next, configure a custom bootstrap action (You can download the file here).

The script performs the following steps:

  1. Installation of the python packages sagemaker_pyspark, boto3, and sagemaker for python 2.7 and 3.4
  2. Installation of the Snowflake JDBC and Spark drivers. As of writing this post, the newest versions are 3.5.3 (jdbc) and 2.3.1 (spark 2.11)
  3. Creation of a script to update the extraClassPath for the properties spark.driver and spark.executor
  4. Creation of a start script to call the script listed above

Step 4 may not look familiar to some of you; however, it’s necessary because when AWS creates the EMR servers, it also starts the bootstrap action. At this stage, the Spark configuration files aren’t yet installed; therefore the extra CLASSPATH properties can’t be updated. Step 4 starts a script that waits until the EMR build is complete, then runs the script necessary for updating the configuration.

The last step required for creating the Spark cluster focuses on security.

To enable the permissions necessary to decrypt the credentials configured in the Jupyter Notebook, you must first grant the EMR nodes access to the Systems Manager. In part 3 of this blog series, decryption of the credentials was managed by a process running with your account context, whereas here, in part 4, decryption is managed by a process running under the EMR context. As such, the EMR process context needs the same Systems Manager permissions granted by the policy created in part 3, which is the SagemakerCredentialsPolicy.

Next, click on “EMR_EC2_DefaultRole” and “Attach policy,” then find the SagemakerCredentialsPolicy.

At this stage, you must grant the Sagemaker Notebook instance permissions so it can communicate with the EMR cluster. Start by creating a new security group. (I named mine SagemakerEMR). Within the SagemakerEMR security group, you also need to create two inbound rules.

The first rule (SSH) enables you to establish an SSH session from the client machine (e.g., your laptop) to the EMR master. While this step isn’t necessary, it makes troubleshooting much easier.

The second rule (Custom TCP) is for port 8998, which is the Livy API. This rule enables the Sagemaker Notebook instance to communicate with the EMR cluster through the Livy API. The easiest way to accomplish this is to create the Sagemaker Notebook instance in the default VPC, then select the default VPC security group as a source for inbound traffic through port 8998.

After you’ve created the new security group, select it as an “Additional Security Group” for the EMR Master.

Next, click “Create Cluster” to launch the roughly 10-minute process. When the cluster is ready, it will display as “waiting.”

You now have your EMR cluster. Next, you need to find the local IP of the EMR master node, because the EMR master node hosts the Livy API, which is, in turn, used by the Sagemaker Notebook instance to communicate with the Spark cluster. To find the local IP, select your cluster, open the Hardware tab and select your EMR master. Next, scroll down to find the private IP and make a note of it, as you will need it for the Sagemaker configuration.
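
If you prefer not to click through the console, a small boto3 sketch of my own (assuming you know the cluster ID and your credentials allow emr:ListInstances) returns the same private IP:

import boto3

def get_emr_master_private_ip(cluster_id, region='us-east-1'):
    # Look up the single instance in the MASTER instance group and return its private IP
    emr = boto3.client('emr', region)
    response = emr.list_instances(ClusterId=cluster_id,
                                  InstanceGroupTypes=['MASTER'])
    return response['Instances'][0]['PrivateIpAddress']

print(get_emr_master_private_ip('j-XXXXXXXXXXXXX'))   # hypothetical cluster ID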

Build the Sagemaker Notebook instance

To utilize the EMR cluster, you first need to create a new Sagemaker Notebook instance in a VPC. To minimize inter-AZ network traffic, I usually co-locate the notebook instance on the same subnet I use for the EMR cluster. Finally, choose the VPC’s default security group as the security group for the Sagemaker Notebook instance (Note: For security reasons, direct internet access should be disabled).

When the build process for the Sagemaker Notebook instance is complete, download the Jupyter Spark-EMR-Snowflake Notebook to your local machine, then upload it to your Sagemaker Notebook instance.

Next, review the first task in the Sagemaker Notebook and update the environment variable EMR_MASTER_INTERNAL_IP with the internal IP from the EMR cluster and run the step (Note: In the example above, it appears as ip-172-31-61-244.ec2.internal).

If the Sparkmagic configuration file doesn’t exist, this step will automatically download it, then update it so that it points to the EMR cluster rather than localhost. To effect the change, restart the kernel.

%%bash
EMR_MASTER_INTERNAL_IP=ip-172-31-58-190.ec2.internal
CONF=/home/ec2-user/.sparkmagic/config.json
if [[ ! -e $CONF.bk ]]
then
   wget "https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/master/sparkmagic/example_config.json" \
-P /home/ec2-user/.sparkmagic -O /home/ec2-user/.sparkmagic/config.json.bk 2>/dev/null
fi
cat $CONF.bk | sed "s/localhost/$EMR_MASTER_INTERNAL_IP/" > $CONF.new
if [[ $(diff $CONF.new $CONF) ]]
then
   echo "Configuration has changed; Restart Kernel"
fi
cp $CONF.new $CONF

After restarting the kernel, the following step checks the configuration to ensure that it is pointing to the correct EMR master. If it is correct, the process moves on without updating the configuration.
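
If you want to eyeball the result yourself, here is a small check of my own (not one of the notebook’s steps; run it on the notebook instance itself, for example in a %%local cell or a plain Python kernel) that prints the Livy endpoint from the Sparkmagic configuration:

import json

# Read the Sparkmagic config and confirm the Livy URL references the EMR master, not localhost
with open('/home/ec2-user/.sparkmagic/config.json') as f:
    conf = json.load(f)

print(conf['kernel_python_credentials']['url'])   # e.g. http://ip-172-31-58-190.ec2.internal:8998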

Upon running the first step on the Spark cluster, the Pyspark kernel automatically starts a SparkContext.

import boto3

params=['/SNOWFLAKE/URL','/SNOWFLAKE/ACCOUNT_ID'
        ,'/SNOWFLAKE/USER_ID','/SNOWFLAKE/PASSWORD'
        ,'/SNOWFLAKE/DATABASE','/SNOWFLAKE/SCHEMA'
        ,'/SNOWFLAKE/WAREHOUSE','/SNOWFLAKE/BUCKET'
        ,'/SNOWFLAKE/PREFIX']

region='us-east-1'

def get_credentials(params):
   ssm = boto3.client('ssm',region)
   response = ssm.get_parameters(
      Names=params,
      WithDecryption=True
   )
   #Build dict of credentials
   param_values={k['Name']:k['Value'] for k in  response['Parameters']}
   return param_values

param_values=get_credentials(params)

 

Congratulations! You have now successfully configured Sagemaker and EMR. You’re ready to read the dataset from Snowflake. This time, however, there’s no need to limit the number of results, and, as you will see, you will ingest all 225 million rows.

sfOptions = {
  "sfURL" : param_values['/SNOWFLAKE/URL'],
  "sfAccount" : param_values['/SNOWFLAKE/ACCOUNT_ID'],
  "sfUser" : param_values['/SNOWFLAKE/USER_ID'],
  "sfPassword" : param_values['/SNOWFLAKE/PASSWORD'],
  "sfDatabase" : param_values['/SNOWFLAKE/DATABASE'],
  "sfSchema" : param_values['/SNOWFLAKE/SCHEMA'],
  "sfWarehouse" : param_values['/SNOWFLAKE/WAREHOUSE'],
}

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
  .options(**sfOptions) \
  .option("query", \
"select (V:main.temp_max - 273.15) * 1.8000 + 32.00 as temp_max_far, " +\
"       (V:main.temp_min - 273.15) * 1.8000 + 32.00 as temp_min_far, " +\
"       cast(V:time as timestamp) time, " +\
"       V:city.coord.lat lat, " +\
"       V:city.coord.lon lon " +\
"from snowflake_sample_data.weather.weather_14_total").load()
df.describe().show()

Conclusion

Cloud-based SaaS solutions have greatly simplified the build-out and setup of end-to-end machine learning (ML) solutions and have made ML available to even the smallest companies. What once took a significant amount of time, money and effort can now be accomplished with a fraction of the resources.

For more information on working with Spark, please review the excellent two-part post from Torsten Grabs and Edward Ma. The first part, Why Spark, explains benefits of using Spark and how to use the Spark shell against an EMR cluster to process data in Snowflake. The second part, Pushing Spark Query Processing to Snowflake, provides an excellent explanation of how Spark with query pushdown provides a significant performance boost over regular Spark processing.

A Sagemaker / Snowflake setup makes ML available to even the smallest budget. That leaves only one question. What will you do with your data?

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.  

 


How to Load terabytes into Snowflake – Speeds, Feeds and Techniques

We often get these questions from customers facing an initial data load into Snowflake or a large-scale daily ingestion: “What’s the fastest way to load terabytes of data?” and “What incoming data format do you recommend?” Here’s an example of a data load that provides answers to both of those questions, and more.

We recently used data from the 10TB TPCDS Benchmark data set to explore a few alternatives. This data is available to all Snowflake customers through the database named SNOWFLAKE_SAMPLE_DATA, schema TPCDS_SF10TCL.

The largest table in that database is STORE_SALES, containing 28.8 billion rows representing 4.7 TB of uncompressed data, roughly 164 bytes per row. Snowflake compresses this down to 1.3 TB internally. The table contains five years of daily transaction history and 23 columns split between integer and decimal data.

We loaded three different source data formats for this table:

  1. CSV files gzipped
  2. Date-partitioned Parquet files (snappy compressed)
  3. Date-partitioned ORC files (snappy compressed)

When loading Parquet and ORC into Snowflake, you have the choice of storing entire rows within a Snowflake VARIANT, or extracting the individual columns into a structured schema.  We tested both approaches for load performance. But for data with a fixed schema such as TPCDS, we prefer to store it structured.

Partitioned Parquet and ORC are interesting in other ways. When using HIVE partitioning for these formats within a data-lake environment, the value of the partitioning data column is typically represented by a portion of the file name, rather than by a value inside of the data itself.  This means that during data loading, we must capture and potentially manipulate the file name by referencing Snowflake’s METADATA$FILENAME property, when using the COPY command.

Load performance

First, let’s look at the raw performance of loading the data using a Snowflake 2X-large cluster:

Source Format            Target Layout      Load Time (sec)    TB/Hr (uncompressed)
CSV (Gzipped)            Structured               1104                15.4
Parquet (Snappy comp)    Semi-structured          3518                 4.8
Parquet (Snappy comp)    Structured               3095                 5.4
ORC (Snappy comp)        Semi-structured          3845                 4.4
ORC (Snappy comp)        Structured               2820                 6.0

A few points jump right out:

  • Loading from Gzipped CSV is several times faster than loading from ORC and Parquet at an impressive 15 TB/Hour. While 5-6 TB/hour is decent if your data is originally in ORC or Parquet, don’t go out of your way to create ORC or Parquet files from CSV in the hope that they will load into Snowflake faster.
  • Loading data into a fully structured (columnarized) schema is ~10-20% faster than landing it into a VARIANT.

When we tested loading the same data using different warehouse sizes, we found that load speed scaled linearly with the size of the warehouse, as expected. For example, a 3X-large warehouse, which is twice the scale of a 2X-large, loaded the same CSV data at a rate of 28 TB/Hour. Conversely, an X-large loaded at ~7 TB/Hour, and a Large loaded at a rate of ~3.5 TB/hr. Because a warehouse half the size loads at roughly half the speed but also consumes credits at roughly half the rate, you will spend about the same number of Snowflake credits to load a given data set regardless of the cluster size you use, as long as you suspend the warehouse when done to avoid idle time.

Load rates for your own data files may differ based on a number of factors:

  • Location of your S3 buckets – For our test, both our Snowflake deployment and S3 buckets were located in us-west-2
  • Number and types of columns – A larger number of columns may require more time relative to number of bytes in the files.
  • Gzip Compression efficiency – More data read from S3 per uncompressed byte may lead to longer load times.

(In all cases, be sure to use a sufficient number of load files to keep all loading threads busy. For a 2X-large, there are 256 such threads, and we had ~2000 load files to cover the five years of history.)

Best Practices for Parquet and ORC

While we are considering Parquet and ORC, let’s look at the technique we used to populate the fully-structured version of STORE_SALES using partitioned Parquet data.

First, consider that the date-partitioned Parquet files reside in an S3 bucket with the following prefix naming convention, where the integer at the end of the prefix is one of the values of the partitioning key. In STORE_SALES, it is an integer surrogate key for the sold_date column named ss_sold_date_sk:

  S3://<my_bucket>/10tb_parquet/store_sales/ss_sold_date_sk=2451132/

If the data files were originally generated by HIVE, there will also be a prefix representing data for which the partitioning key is NULL:

  S3://<my_bucket>/10tb_parquet/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/

Finally, HIVE will create a series of 0-byte “tag” files in the bucket that need to be ignored during ingestion. These files have a format in the pattern:

  S3://<my_bucket>/10tb_parquet/store_sales/ss_sold_date_sk=2451132_$folder$

As we noted earlier, the data files themselves do not contain a column or value for ss_sold_date_sk even though it is part of the table’s definition. Instead, the value must be derived from the prefix name in S3.

To handle ingestion from this bucket, we first defined an external stage for Snowflake as:

  create or replace stage parquet_test
     url='s3://<my_bucket>/tpcds/10tb_parquet/'
     credentials = (aws_key_id=…,aws_secret_key=…)
     FILE_FORMAT = (TYPE = 'PARQUET');

Finally, our command to load all of the Parquet data into the fully structured STORE_SALES table will look like this:

  copy into STORE_SALES from (
   select 
     NULLIF(
       regexp_replace (
       METADATA$FILENAME,
       '.*\\=(.*)\\/.*',
       '\\1'), 
       '__HIVE_DEFAULT_PARTITION__'
     )                         as ss_sold_date_sk,
     $1:ss_sold_time_sk        as ss_sold_time_sk,
     $1:ss_item_sk             as ss_item_sk,
     $1:ss_customer_sk         as ss_customer_sk,
     $1:ss_cdemo_sk            as ss_cdemo_sk,
     $1:ss_hdemo_sk            as ss_hdemo_sk,
     $1:ss_addr_sk             as ss_addr_sk,
     $1:ss_store_sk            as ss_store_sk,
     $1:ss_promo_sk            as ss_promo_sk,
     $1:ss_ticket_number       as ss_ticket_number,
     $1:ss_quantity            as ss_quantity,
     $1:ss_wholesale_cost      as ss_wholesale_cost,
     $1:ss_list_price          as ss_list_price,
     $1:ss_sales_price         as ss_sales_price,
     $1:ss_ext_discount_amt    as ss_ext_discount_amt,
     $1:ss_ext_sales_price     as ss_ext_sales_price,
     $1:ss_ext_wholesale_cost  as ss_ext_wholesale_cost,
     $1:ss_ext_list_price      as ss_ext_list_price,
     $1:ss_ext_tax             as ss_ext_tax,
     $1:ss_coupon_amt          as ss_coupon_amt,
     $1:ss_net_paid            as ss_net_paid,
     $1:ss_net_paid_inc_tax    as ss_net_paid_inc_tax,
     $1:ss_net_profit          as ss_net_profit
   from @parquet_test/store_sales/)
   pattern= '.*/.*/.*/ss_sold_date_sk=.*/.*'
  ;   

Notice that we are using the “transform” feature of the COPY command to parse and manipulate the semi-structured Parquet format. The main body of the COPY includes extraction of the labeled fields contained in the Parquet data, mapping them directly to the corresponding column in STORE_SALES.  E.g. in the expression:

     $1:ss_net_paid as ss_net_paid,

$1 refers to the contents of the single column representing an entire Parquet row of input data as a set of key-value pairs, and $1:ss_net_paid represents the value associated with the ss_net_paid key in that row.

Let’s take a closer look at the two highlighted expressions from the script above.

The first expression,

     NULLIF(
       regexp_replace (
       METADATA$FILENAME,
       '.*\\=(.*)\\/.*',
       '\\1'), 
       '__HIVE_DEFAULT_PARTITION__'
      ) 

is used to populate the ss_sold_date_sk column, which is the value used to partition the input data. The REGEXP_REPLACE function transforms the fully qualified S3 file name into just the integer value representing the date_key embedded in the prefix. It does this by searching for the characters in the file path following the ‘=’ sign, up to the next ‘/’. NULLIF is used to replace the partitions named __HIVE_DEFAULT_PARTITION__ with the value NULL for the date key.

The final expression

  pattern= '.*/.*/.*/ss_sold_date_sk=.*/.*'

serves as a filter on the input files, forcing COPY to ignore the 0-byte placeholder files in the bucket.

Loading ORC entails exactly the same process, changing only the FILE_FORMAT definition in the CREATE STAGE command.

Conclusion

Loading data into Snowflake is fast and flexible. You get the greatest speed when working with CSV files, but Snowflake’s expressiveness in handling semi-structured data allows even complex partitioning schemes for existing ORC and Parquet data sets to be easily ingested into fully structured Snowflake tables.

 


Connecting a Jupyter Notebook to Snowflake through Python (Part 3)

In part two of this four-part series, we learned how to create a Sagemaker Notebook instance. In part three, we’ll learn how to connect that Sagemaker Notebook instance to Snowflake. If you’ve completed the steps outlined in part one and part two, the Jupyter Notebook instance is up and running and you have access to your Snowflake instance, including the demo dataset. Now, you’re ready to connect the two platforms.

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.  

The Sagemaker Console

The first step is to open the Jupyter service using the link on the Sagemaker console.

 

There are two options for creating a Jupyter Notebook. You can create the notebook from scratch by following the step-by-step instructions below, or you can download sample notebooks here. If you decide to build the notebook from scratch, select the conda_python3 kernel. Alternatively, if you decide to work with a pre-made sample, make sure to upload it to your Sagemaker notebook instance first.

There are several options for connecting Sagemaker to Snowflake. The simplest way to get connected is through the Snowflake Connector for Python. The connector doesn’t come pre-installed with Sagemaker, so you will need to install it through the Python package manager, pip. (Note: As of the writing of this post, the Snowflake Python connector has a dependency on the C Foreign Function Interface library, cffi, and requires a minimum cffi version of 1.11.) Sagemaker, on the other hand, comes preinstalled with cffi 1.10, which unfortunately causes the Snowflake Python connector to fail.

The step outlined below automatically detects the failure and, if necessary, triggers de-installation and re-installation of the most recent version of libcffi. Note: If re-installation of the CFFI package is necessary, you must also restart the Kernel so the new version is recognized.

%%bash
CFFI_VERSION=$(pip list 2>/dev/null | grep cffi )
echo $CFFI_VERSION
if [[ "$CFFI_VERSION" == "cffi (1.10.0)" ]]
then 
   pip uninstall --yes cffi
fi
yum_log=$(sudo yum install -y libffi-devel openssl-devel)
pip_log=$(pip install --upgrade snowflake-connector-python)  
if [[ "$CFFI_VERSION" == "cffi (1.10.0)" ]]
then 
   echo "configuration has changed; restart notebook"
fi

This  problem is persistent and needs to be resolved each time the Sagemaker server is shut down and restarted. The Jupyter Kernel also needs to be restarted after each shutdown. The good news is that the cffi package will eventually be updated on the Sagemaker AMI, and the step indicated above will begin to automatically recognize the update.

 

The next step is to connect to the Snowflake instance with your credentials.

import snowflake.connector
# Connecting to Snowflake using the default authenticator
ctx = snowflake.connector.connect(
  user=<user>,
  password=<password>,
  account=<account>
)

Here you have the option to hard code all credentials and other specific information, including the S3 bucket names. However, for security reasons, it’s advisable to not store credentials in the notebook. Another option is to enter your credentials every time you run the notebook.
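
If you do choose to enter credentials at run time, a minimal sketch (my own, not from the original post) is to prompt for them with Python’s getpass module so nothing sensitive is ever saved in the notebook:

import getpass
import snowflake.connector

# Prompt for credentials interactively; getpass hides the password as it is typed
sf_user = input('Snowflake user: ')
sf_account = input('Snowflake account: ')
sf_password = getpass.getpass('Snowflake password: ')

ctx = snowflake.connector.connect(
    user=sf_user,
    password=sf_password,
    account=sf_account
)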

Rather than storing credentials directly in the notebook, I opted to store a reference to the credentials. The actual credentials are automatically stored in a secure key/value management system called AWS Systems Manager Parameter Store (SSM).  

With most AWS systems, the first step requires setting up permissions for SSM through AWS IAM. Please ask your AWS security admin to create another policy with the following Actions on KMS and SSM.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ssm:PutParameter",
                "ssm:DeleteParameter",
                "kms:Decrypt",
                "ssm:GetParameterHistory",
                "ssm:GetParametersByPath",
                "ssm:GetParameters",
                "ssm:GetParameter",
                "ssm:DeleteParameters"
            ],
            "Resource": [
                "arn:aws:kms:<region>:<accountid>:key/SNOWFLAKE/*",
                "arn:aws:ssm:<region>:<accountid>:parameter/SNOWFLAKE/*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "ssm:DescribeParameters",
                "kms:ListAliases"
            ],
            "Resource": "*"
        }
    ]
}

Adhering to the best-practice principle of least privilege, I recommend limiting the Actions to the specific Resources shown. Also, be sure to change the region and accountid in the code segment shown above or, alternatively, grant access to all resources (i.e., “*”).

In the code segment shown above, I created a root name of “SNOWFLAKE”. This is only an example. You’re free to create your own, unique naming convention.

Next, check permissions for your login. Assuming the new policy has been called SagemakerCredentialsPolicy, permissions for your login should look like the example shown below.
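
As an additional check of my own (not the console view referenced above), boto3 can list the managed policies attached to your IAM user so you can confirm SagemakerCredentialsPolicy is among them (policies attached via a group will not appear in this call):

import boto3

iam = boto3.client('iam')
response = iam.list_attached_user_policies(UserName='<your_iam_user>')   # hypothetical user name
for policy in response['AttachedPolicies']:
    print(policy['PolicyName'])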

With the SagemakerCredentialsPolicy in place, you’re ready to begin configuring all your secrets (i.e., credentials) in SSM. Be sure to use the same namespace that you used in the credentials policy and apply it to the prefixes of your secrets.
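
However you create the parameters (the AWS console works fine), here is an illustrative sketch of my own that writes one of them with boto3, assuming the /SNOWFLAKE/ prefix used throughout this post:

import boto3
import getpass

ssm = boto3.client('ssm', 'us-east-1')

# Store the password as an encrypted SecureString under the /SNOWFLAKE/ prefix
ssm.put_parameter(
    Name='/SNOWFLAKE/PASSWORD',
    Value=getpass.getpass('Snowflake password: '),
    Type='SecureString',
    Overwrite=True
)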

After setting up your key/value pairs in SSM, use the following step to read the key/value pairs into your Jupyter Notebook.

import boto3

params=['/SNOWFLAKE/URL','/SNOWFLAKE/ACCOUNT_ID'
        ,'/SNOWFLAKE/USER_ID','/SNOWFLAKE/PASSWORD'
        ,'/SNOWFLAKE/DATABASE','/SNOWFLAKE/SCHEMA'
        ,'/SNOWFLAKE/WAREHOUSE','/SNOWFLAKE/BUCKET'
        ,'/SNOWFLAKE/PREFIX']
region='us-east-1'

def get_credentials(params):
   ssm = boto3.client('ssm',region)
   response = ssm.get_parameters(
      Names=params,
      WithDecryption=True
   )
   #Build dict of credentials
   param_values={k['Name']:k['Value'] for k in  response['Parameters']}
   return param_values

param_values=get_credentials(params)

Instead of hard coding the credentials, you can reference the key/value pairs via the variable param_values. In addition to the credentials (account ID, user ID and password), I also stored the warehouse, database and schema.

import snowflake.connector
# Connecting to Snowflake using the default authenticator
ctx = snowflake.connector.connect(
  user=param_values['/SNOWFLAKE/USER_ID'],
  password=param_values['/SNOWFLAKE/PASSWORD'],
  account=param_values['/SNOWFLAKE/ACCOUNT_ID'],
  warehouse=param_values['/SNOWFLAKE/WAREHOUSE'],
  database=param_values['/SNOWFLAKE/DATABASE'],
  schema=param_values['/SNOWFLAKE/SCHEMA']
)

Now, you’re ready to read data from Snowflake. To illustrate the benefits of using data in Snowflake, we will read semi-structured data from the sample database, SNOWFLAKE_SAMPLE_DATA.

When data is stored in Snowflake, you can use the Snowflake JSON parser and the SQL engine to easily query, transform, cast and filter JSON data before it gets to the Jupyter Notebook.

From the JSON documents stored in WEATHER_14_TOTAL, the following step shows the minimum and maximum temperature values, a date and timestamp and the latitude/longitude coordinates for New York City.

cs=ctx.cursor()
allrows=cs.execute( \
"select (V:main.temp_max - 273.15) * 1.8000 + 32.00 as temp_max_far, " +\
"       (V:main.temp_min - 273.15) * 1.8000 + 32.00 as temp_min_far, " +\
"       cast(V:time as timestamp) time, " +\
"       V:city.coord.lat lat, " +\
"       V:city.coord.lon lon " +\
"from snowflake_sample_data.weather.weather_14_total " +\
"where v:city.name = 'New York' " +\
"and   v:city.country = 'US' ").fetchall()

The final step converts the result set into a Pandas DataFrame, which is suitable for machine learning algorithms.

import pandas as pd                               # For munging tabular data

data = pd.DataFrame(allrows)
data.columns=['temp_max_far','temp_min_far','time','lat','lon']
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 10)         # Keep the output on one page
data

 

Conclusion

Now that we’ve connected a Jupyter Notebook in Sagemaker to the data in Snowflake using the Snowflake Connector for Python, we’re ready for the final stage: Connecting Sagemaker and a Jupyter Notebook to both a local Spark instance and a multi-node EMR Spark cluster. I’ll cover how to accomplish this connection in the fourth and final installment of this series — Connecting a Jupyter Notebook to Snowflake via Spark.

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.  

 


How Usage-Based Pricing Delivers a Budget-Friendly Cloud Data Warehouse

When you have budgets and business plans you need to stick to, you want clear pricing when operating your cloud data warehouse. However, it can be a frustrating task to understand how some cloud data warehouses charge for services. Each vendor has its own pricing model, and users expect differences. What’s truly frustrating are the hidden charges, quotas and other penalties.

Here are just some of the ways Snowflake helps you to manage your data warehouse budget easily, cost-effectively and efficiently.

A simple data warehouse pricing model

Snowflake’s pricing model includes only two items: the cost of storage and the cost of compute resources consumed. The charge for storage is per terabyte, compressed, per month. The charge for compute is based on the processing units, which we refer to as credits, consumed to run your queries or perform a service (for example, Snowpipe data loading). Compute charges are billed on actual usage, per second. All of our Snowflake editions and pricing details can be found here on our website.   

All charges are usage-based

As an example, using US pricing as a reference, Snowflake storage costs begin at a flat rate of $23/TB (average compressed amount) per month, accrued daily. Compute costs $0.00056 per second, per credit, for our Snowflake On Demand Standard Edition. Our Snowflake On Demand Enterprise Sensitive Data Edition (which includes HIPAA compliance, PCI compliance, customer-managed encryption keys and other security-hardened features) is $0.0011 per second, per compute credit.
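
To put the per-second rate in perspective, $0.00056 per second works out to roughly $2 for one credit consumed continuously for an hour (0.00056 × 3,600 ≈ $2.02), and the Enterprise Sensitive Data Edition rate of $0.0011 per second comes to roughly $4 per credit-hour.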

Clear and straightforward compute sizing  

Once you’re in Snowflake, you can enable any number of “virtual data warehouses”, which are effectively the compute engines that power query execution. Virtual data warehouses are available in eight “T-shirt” style sizes: X-Small, Small, Medium, Large, and X-Large through 4X-Large. Each data warehouse size has a compute credit designation, and credit usage doubles with each step up in size. Refer to Table 1.

Table 1 – Snowflake Warehouse Sizes and Credit Usage

Based on our internal benchmarks, performance improves linearly as the size of the warehouse increases.

No charge for idle compute time

Queries run on the virtual data warehouse from which they are launched. What’s cool about Snowflake is that it allows you to configure a warehouse to shift into suspend mode when no queries are actively running. Once suspended, charges are also suspended, so you pay nothing for idle compute time (Figure 1).

Figure 1 – Snowflake Pricing Tracks With Actual Usage

The warehouse automatically restarts to process a new query, with no manual re-provisioning required. This is more graceful than terminating a service to stop charges, as required with other cloud-based solutions.

With other cloud data warehouse solutions, when you terminate services (to stop billing), you lose the data because it is unloaded from the data warehouse. If you later need to run queries against that data, you have to re-provision the data warehouse and reload the data all over again. This is disruptive and inefficient. Because of this disruption, those cloud data warehouse solutions must remain on and active, 24×7, with the meter running, whether you are running queries or not. Further, with other cloud data warehouses, if terminated services are restarted, unused credits do not roll over.

No hidden quotas or price bumps

When you run queries in Snowflake, there are no added usage quotas or hidden price premiums. You pay only for what you use.

Other cloud data warehouse solutions may not charge based on time, but rather will charge based on how many terabytes are scanned or how many terabytes are returned from a query. In addition, you may have no ability to control the compute resources available to you because you are given an allocation of processing units or slots. And if the query you run consumes more than the allocation provided to you, you are charged a premium.

True cloud-scale elasticity

On-premises data warehouses, in particular, make it nearly impossible to scale down hardware once installed and provisioned. This is cost prohibitive because most on-premises implementations are pre-sized and purchased for the peak demand period, which could be just a few days a month or year, leaving massive investments idle the rest of the time.

Keeping it simple, cost-effective and price-efficient

Unlike traditional data warehouse technology (on-premises or in the cloud) that was never designed for the cloud, Snowflake’s cloud-built data warehouse-as-a-service makes analytics and budgeting easy. We also make it as cost-effective as possible for you to get all the insight from all your data in a fraction of the time required by your current solution. By avoiding the headaches of traditional solutions, Snowflake enables you to focus on strategic data analytics and engineering projects that advance your business.

 

The Dream Data Warehouse Development Environment

Earlier this month, Snowflake’s Customer Enablement Team was assigned an email from one of our customers. The customer stated that he was not happy about the idea of cloning full copies of production databases for development purposes. “Do we really want to develop and maintain a system to copy a fraction of the production DB to dev?” he asked, the reason for his message being that, by just copying the entire production database, the dev team would have access to too much data. Being a veteran of Snowflake, I initially dismissed his concern because of Snowflake’s zero-copy clone capability, as outlined in this article. From my perspective, a zero-copy clone would not incur any additional cost for development purposes, so why not give the dev team all of the data?

The answer of course, as the customer pointed out, has to do with making the development effort more efficient. The assumption of zero-copy clone equating to zero-cost development is, of course, incorrect. There’s the cost of querying the data (requiring virtual warehouse credits) and the cost behind each development hour. After all, longer query times lead to longer development iterations and longer testing cycles. To create a “blissful” development environment in Snowflake, we need a more refined approach towards building the development data set.

The approach outlined by the customer was rooted in complementing Snowflake’s zero-copy clone with the additional technique of using Block Sampling. In other words, he proposed creating a view containing a sample set from the original table. This approach enables an administrator to quickly set up a dev environment with minimal data. Controlling the amount of sample data is a good thing for many development situations because developers seldom require access to the full dataset.

Ok, let’s take a look at SAMPLE / TABLESAMPLE and see how we can do this. The syntax is quite simple:

 
 SELECT ...
 FROM ...
   { SAMPLE | TABLESAMPLE } [ samplingMethod ] ( <probability> ) [ { REPEATABLE | SEED } ( <seed> ) ]
 [ ... ]
 
 -- Where:
    samplingMethod :: = { { BERNOULLI | ROW } | { SYSTEM | BLOCK } }


Note that there are some interchangeable terms we should pay attention to when writing the query. These terms are synonyms and only differ in syntax:

SAMPLE | TABLESAMPLE

BERNOULLI | ROW

SYSTEM | BLOCK

REPEATABLE | SEED

The two main methods of sampling are ROW (or BERNOULLI) and BLOCK (or SYSTEM) sampling. Let’s take a closer look at each one. 

Row Sampling

This approach uses the Bernoulli principle to select data by applying a probability of p/100 to each row. In the Snowflake documentation, we mention that this is similar to “flipping a weighted coin” on each row. The number of sampled (selected) rows should be equal to (p/100) * n, where n is the total number of rows in the table and p is the sample probability value set by the user.
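
For example, sampling a 10-million-row table with p = 20 should return approximately (20/100) * 10,000,000 = 2,000,000 rows, although the exact count will vary slightly from run to run.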

This method implies that we will iterate through each row and calculate a probability at each row to decide whether it is selected. This iteration is going to impact query performance. However, the advantage here is that we will end up with a formal and precise distribution of samples from our data. Regarding creating a dev environment, using the Bernoulli sampling method to create a view does not result in better performance. However, you could certainly create a new data table based on this sampling method and still scan less data. For example:

create table dev_Sales_LineItem_Bernoulli_20 as
    select *
    from SALES.PUBLIC.LINEITEM
    sample bernoulli (20);

This statement will create a development table from the LINEITEM table in our SALES database with 20% of the rows from the original table.

I won’t focus too much on this method in this post. You can feel free to experiment on your own to see if this sampling method would suit your needs. Instead, I will talk more about Block Sampling.

Block Sampling

In this approach, we apply a probability of p/100 to each block of rows. For those who are familiar with Snowflake’s micro partitions (details below), block sampling chooses individual partitions based on a specific probability. If you simply want quick access to production data and to run queries against a small percentage of the rows in a table, leveraging Block Sampling is a good way to go.

Let’s look at this with a specific example using the Block Sampling method on the LINEITEM table in the SALES database. Here is the table metadata:

In this example, I have a developer who is working on an aggregation query against the LINEITEM table. He created a simple query like this:

select
       l_returnflag,
       l_linestatus,
       sum(l_quantity) as sum_qty,
       sum(l_extendedprice) as sum_base_price,
       sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
       avg(l_quantity) as avg_qty,
       avg(l_extendedprice) as avg_price,
       avg(l_discount) as avg_disc,
       count(*) as count_order
  from
       lineitem
 where
       l_orderdate <= dateadd(day, -90, to_date('1998-12-01'))
 group by
       l_returnflag,
       l_linestatus
 order by
       l_returnflag,
       l_linestatus;

Result:

 

This query reports the amount of business that was billed, shipped and returned. If the developer has decided to issue a query against this table, we would see the following outcome in our query plan. (Note: the below result was performed on an XS warehouse):


It took about 19 seconds on an XS warehouse, which is not bad by any means. However, we can see that the query still performed a full table scan and did a fair amount of aggregation.

Now, let’s see how we can improve the experience for the developer by creating a sample set of the LINEITEM table. We’ll create a special view (against our cloned dev database) for this developer and call it V_SALES_LINEITEM_SAMPLE_10. In this view, we’ll use the Block Sampling technique and only scan 10% of the partitions required by this query.

First create the view:

create view v_Sales_LineItem_Sample_10 as
    select *
    from SALES.PUBLIC.LINEITEM
    sample block (10);

Next, let’s update the query to use the view instead.

select
       l_returnflag,
       l_linestatus,
       sum(l_quantity) as sum_qty,
       sum(l_extendedprice) as sum_base_price,
       sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
       avg(l_quantity) as avg_qty,
       avg(l_extendedprice) as avg_price,
       avg(l_discount) as avg_disc,
       count(*) as count_order
  from
       v_Sales_LineItem_Sample_10
 where
       l_orderdate <= dateadd(day, -90, to_date('1998-12-01'))
 group by
       l_returnflag,
       l_linestatus
 order by
       l_returnflag,
       l_linestatus;

Result:

 

Let’s take a look at the query profile for this query:

 

This query ran in about 2.5 seconds on an XS warehouse, and we scanned only about 10% of the table’s partitions. This result is much better! Snowflake samples 10% of the partitions and then applies the WHERE filter to that subset. The result should still be accurate enough to let the developer know this query is working as expected.

In conclusion, leveraging Snowflake’s SAMPLE clause considerably reduces the size of the test dataset for a developer without losing data integrity. Even if a specific ID or timestamp biases the sampled data, chances are the developer is going to accept the nuance and continue with development instead of spending additional compute credits and hours waiting for the results to return. An alternative is to use the LIMIT clause to reduce the number of rows returned. However, this involves modifying the original query or creating a view that still needs to be maintained. Using the SAMPLE clause, you can reliably subset a large table with an acceptable result mix and performance. I hope you find this helpful. Feel free to leave comments in our community and let me know your thoughts.

 

Boost Your Analytics with Machine Learning and Advanced Data Preparation

Enterprises can now harness the power of Apache Spark to quickly and easily prepare data and build Machine Learning (ML) models directly against that data in Snowflake. Snowflake and Qubole make it easy to get started by embedding required drivers, securing credentials, simplifying connection setup and optimizing in-database processing. Customers can focus on getting started quickly with their data preparation and ML initiatives instead of worrying about complex integrations and the cost of moving large data sets.

Setting up the Snowflake connection and getting started takes only a few minutes. Customers first create a Snowflake data store in Qubole and enter details for their Snowflake data warehouse. All drivers and packages are preloaded and kept up to date, eliminating manual bootstrapping of jars into the Spark cluster. There is no further configuration or tuning required and there are no added costs for the integration. Once the connection is saved, customers can browse their Snowflake tables, view metadata and see a preview of the Snowflake data, all from the Qubole interface. They can then use Zeppelin notebooks to get started reading and writing data to Snowflake as they begin exploring data preparation and ML use cases.

Below is an example of the object browser view showing the available tables and properties:

Security is also handled seamlessly so customers can focus on getting started with their data, without the worry of over-protecting their credentials. Qubole provides centralized and secure credential management which eliminates the need to specify any credentials in plain text. Username and password are entered only when setting up the data store, but are otherwise inaccessible.

The solution is also designed for enterprise requirements and allows customers to use federated authentication and SSO via the embedded Snowflake drivers. With SSO enabled, customers can authenticate through an external, SAML 2.0-compliant identity provider (IdP) and achieve a higher level of security and simplicity. These capabilities help customers more easily share notebooks and collaborate on projects with little risk of sensitive information being exposed.

Below is a sample Scala program showing how to read from Snowflake using the data store object without specifying any credentials in plain text:

Beyond the simplicity and security of the integration, which helps customers get started quickly, customers will also benefit from a highly optimized Spark integration that uses Snowflake query pushdown to balance the query-processing power of Snowflake with the computational capabilities of Apache Spark. From simple projection and filter operations to advanced operations such as joins, aggregations and even scalar SQL functions, query pushdown runs these operations in Snowflake (where the data resides) to help refine and pre-filter the data before it is read into Spark. The traditional performance and cost challenges associated with moving large amounts of data for processing are thereby eliminated without additional overhead or management.

With Snowflake and Qubole, customers get an optimized platform for advanced data preparation and ML that makes it simple to get started. Customers can complement their existing cloud data warehouse strategy or get more value out of ML initiatives by opening access to more data. To learn more about ML and advanced data preparation with Qubole, visit the Qubole blog.

Try Snowflake for free. Sign up and receive $400 US dollars worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Quickstart Guide for Sagemaker + Snowflake (Part One)

Machine Learning (ML) and predictive analytics are quickly becoming irreplaceable tools for small startups and large enterprises. The questions that ML can answer are boundless. For example, you might want to ask, “What job would appeal to someone based on their  interests or the interests of jobseekers like them?” Or, “Is this attempt to access the network an indication of an intruder?” Or, “What type of credit card usage indicates fraud?”

Setting up a machine learning environment, in particular for on-premise infrastructure, has its challenges. An infrastructure team must request physical and/or virtual machines and then build and integrate those resources. This approach is both time-consuming and error prone due to the number of manual steps involved. It may work in a small environment, but the task becomes exponentially more complicated and impractical at scale.

There are many different ML systems to choose from, including  TensorFlow, XGBoost, Spark ML and MXNet, to name a few. They all come with their own installation guides, system requirements and dependencies. What’s more, implementation is just the first step. The next challenge is figuring out how to make the output from the machine learning step (e.g. a model) available for consumption. Then, all of the components for building a model within the machine learning tier and the access to the model in the API tier need to scale to provide predictions in real-time. Last but not least, the team needs to figure out where to store all the data needed to build the model.

Managing this whole process from end-to-end becomes significantly easier when using cloud-based technologies. The ability to provision infrastructure on demand (IaaS) solves the problem of manually requesting virtual machines. It also provides immediate access to compute resources whenever they are needed. But that still leaves the administrative overhead of managing the ML software and the platform to store and manage the data.   

At last year’s AWS developer conference, AWS announced Sagemaker, a “Fully managed end-to-end machine learning service that enables data scientists, developers and machine learning experts to quickly build, train and host machine learning models at scale.”

Sagemaker can access data from many different sources (specifically, through the underlying kernels like Python, PySpark, Spark and R), including data provided by Snowflake. Storing data in Snowflake also has significant advantages.

Single source of truth

If data is stored in multiple locations, inevitably those locations will get out of sync. Even if data is supposed to be immutable, often one location is modified to fix a problem for one system while other locations are not. In contrast, if data is stored in one central, enterprise-grade, scalable repository, it serves as a “single source of truth” because keeping data in sync is made easy. Different tools are not required for structured and semi-structured data. Data can be modified transactionally, which immediately reduces the risk of problems.

Shorten the data preparation cycle

According to a study published in Forbes, data preparation accounts for about 80% of the work performed by data scientists. Shortening the data preparation cycle therefore has a major impact on the overall efficiency of data scientists.

Snowflake is uniquely positioned to shorten the data preparation cycle due to its excellent support for both structured and semi-structured data in one language: SQL. This means that semi-structured data and structured data can be seamlessly parsed, combined, joined and modified through SQL statements in set-based operations. This enables data scientists to use the power of a full SQL engine for rapid data cleansing and preparation.

Scale as you go

Another problem that ML implementers frequently encounter is what we at Snowflake call “works on my machine” syndrome. Small datasets work easily on a local machine, but when the data grows to production size, reading it all into a single machine doesn’t scale, or it may behave unexpectedly. Even if the job does finish, it can take hours to load a terabyte-sized dataset. In Snowflake there is no infrastructure that needs to be provisioned, and Snowflake’s elasticity allows you to scale horizontally as well as vertically, all with the push of a button.

Connecting Sagemaker and Snowflake

Sagemaker and Snowflake both utilize cloud infrastructure-as-a-service (IaaS) offerings from AWS, which enables us to build the infrastructure when we need it, where we need it (geographically) and at any scale required.

Since Sagemaker and Snowflake simplify building these services, the question becomes how to connect the two. And that’s exactly the subject of the following posts in this series. How do you get started? What additional configuration do you need in AWS for security and networking? How do you store credentials?

In part two of this four-part blog, I’ll explain how to build a Sagemaker ML environment in AWS from scratch. In the third post, I will put it all together and show you how to connect a Jupyter Notebook to Snowflake via the Snowflake Python connector.  With the Python connector, you can import data from Snowflake into a Jupyter Notebook. Once connected, you can begin to explore data, run statistical analysis, visualize the data and call the Sagemaker ML interfaces.

However, to perform any analysis at scale, you really don’t want to use a single server setup like Jupyter running a python kernel. Jupyter running a PySpark kernel against a Spark cluster on EMR is a much better solution for that use case. So, in part four of this series I’ll connect a Jupyter Notebook to a local Spark instance and an EMR cluster using the Snowflake Spark connector.

The Snowflake difference

Snowflake is the only data warehouse built for the cloud. Snowflake delivers the performance, concurrency and simplicity needed to store and analyze all data available to an organization in one location. Snowflake’s technology combines the power of data warehousing, the flexibility of big data platforms, the elasticity of the cloud, and live data sharing at a fraction of the cost of traditional solutions. Snowflake: Your data, no limits.

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.