How to Use AWS Glue with Snowflake

The process of extraction, transformation and load (ETL) is central to any data warehousing initiative. With advances in cloud data warehouse architectures, customers are also benefiting from the alternative approach of extraction, load, and transformation (ELT), where data processing is pushed to the database.

With either approach, the debate continues. Should you take a hand-coded method or leverage any number of the available ETL or ELT data integration tools? While there are advantages to both, and some will choose a “one or the other” approach, many organizations select a combination of a data integration tool and hand coding. Code provides developers with the flexibility to build using preferred languages while maintaining a high level of control over integration processes and structures. The challenge has been that hand-coding options are traditionally more complex and costly to maintain. However, with AWS Glue, developers now have an option to easily build and manage their data preparation and loading processes with generated code that is customizable, reusable and portable, with no infrastructure to buy, set up or manage.

In this blog, we’ll cover how to leverage the power of AWS Glue with Snowflake and how processing is optimized through the use of query pushdown for ELT.

Why AWS Glue with Snowflake

Snowflake customers now have a simple option to manage their programmatic data integration processes without worrying about servers, Spark clusters or the ongoing maintenance traditionally associated with these systems. AWS Glue provides a fully managed environment which integrates easily with Snowflake’s data warehouse-as-a-service. Together, these two solutions enable customers to manage their data ingestion and transformation pipelines with more ease and flexibility than ever before. With AWS Glue and Snowflake, customers get the added benefit of Snowflake’s query pushdown which automatically pushes Spark workloads, translated to SQL, into Snowflake. Customers can focus on writing their code and instrumenting their pipelines without having to worry about optimizing Spark performance (For more on this, read our “Why Spark” and our “Pushing Spark Query Processing to Snowflake” blogs). With AWS Glue and Snowflake, customers can reap the benefits of optimized ELT processing that is low cost and easy to use and maintain.

AWS Glue and Snowflake in Action

Prerequisites:

Setup

  1. Log into AWS.
  2. Search for and click on the S3 link.
    1. Create an S3 bucket and folder.
    2. Add the Spark Connector and JDBC .jar files to the folder.
    3. Create another folder in the same bucket to be used as the Glue temporary directory in later steps (see below).
  3. Switch to the AWS Glue Service.
  4. Click on Jobs on the left panel under ETL.
  5. Add a job by clicking Add job, click Next, click Next again, then click Finish.
    1. Provide a name for the job.
    2. Select an IAM role. Create a new IAM role if one doesn’t already exist and be sure to add all Glue policies to this role.
    3. Select the option for A new script to be authored by you.
    4. Give the script a name.
    5. Set the temporary directory to the one you created in step 2c.
    6. Expand Script libraries and job parameters:
      1. Under Dependent jars path, add entries for both .jar files from 2b.

[NOTE: You have to add the full path to the actual .jar files. Example: s3://[bucket_name]/GlueJars/spark-snowflake_2.11-2.2.6.jar,s3://[bucket_name]/GlueJars/snowflake-jdbc-3.2.4.jar]

2. Under Job parameters, enter your Snowflake account information as key/value pairs (see the example listing below). Make sure to include the two dashes before each key.

[NOTE: Storing your account information and credentials this way will expose them to anyone with access to this job. This can be useful for testing purposes, but it is recommended that you store your credentials securely as outlined in the section: Securing credentials.]
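
For reference, the keys below correspond to the parameters the sample script later in this post reads with getResolvedOptions; the values are placeholders you would replace with your own account details (the URL format shown is an assumption based on the standard Snowflake account URL pattern).

  --URL        https://<account_name>.snowflakecomputing.com
  --ACCOUNT    <account_name>
  --WAREHOUSE  <warehouse_name>
  --DB         <database_name>
  --SCHEMA     <schema_name>
  --USERNAME   <snowflake_user>
  --PASSWORD   <snowflake_password>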

  1. Click Next, click Next again, then click Finish.
  2. You will be prompted with a blank script interface.

Sample script

Use the following sample script to test the integration between AWS Glue and your Snowflake account. This script assumes you have stored your account information and credentials using Job parameters as described above.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from py4j.java_gateway import java_import
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

## @params: [JOB_NAME, URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA', 'USERNAME', 'PASSWORD'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
java_import(spark._jvm, "net.snowflake.spark.snowflake")

## uj = sc._jvm.net.snowflake.spark.snowflake
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
sfOptions = {
    "sfURL" : args['URL'],
    "sfAccount" : args['ACCOUNT'],
    "sfUser" : args['USERNAME'],
    "sfPassword" : args['PASSWORD'],
    "sfDatabase" : args['DB'],
    "sfSchema" : args['SCHEMA'],
    "sfWarehouse" : args['WAREHOUSE'],
}

## Read from a Snowflake table into a Spark Data Frame
df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "[table_name]").load()

## Perform any kind of transformations on your data and save as a new Data Frame:
## df1 = df.[Insert any filter, transformation, or other operation]

## Write the Data Frame contents back to Snowflake in a new table:
## df1.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "[new_table_name]").mode("overwrite").save()

job.commit()

Securing credentials

To securely store your account information and credentials, see the following article which describes how this is accomplished with EC2: How to Securely Store Credentials with EC2.

Conclusion

AWS Glue and Snowflake make it easy to get started and manage your programmatic data integration processes. AWS Glue can be used standalone or in conjunction with a data integration tool without adding significant overhead. With native query pushdown through the Snowflake Spark connector, this approach optimizes both processing and cost for true ELT processing. With AWS Glue and Snowflake, customers get a fully managed, fully optimized platform to support a wide range of custom data integration requirements.

 


How the Modern Data Warehouse Helps Ease the Burden of GDPR

While many activities driving an organization’s GDPR compliance are in the organization’s own hands, its IT vendors should help satisfy their customers’ compliance requirements. At a minimum, an organization’s SaaS vendors should satisfy the security requirements that wholly reside in their domain but impact their customer’s business and data security. Snowflake was built from the ground up in ways that can ease the burden of complying with GDPR, especially for security- and customer-focused organizations.

Snowflake was designed from the beginning to handle an enormous amount of structured and semi-structured data with the ease of standard SQL. The accessibility and simplicity of SQL gives organizations the flexibility to seamlessly make any updates, changes or deletions required under GDPR. Snowflake’s support for semi-structured data can make it easier to adapt to new fields and other record changes. In addition, delivering industry-best security has been fundamental to the architecture, implementation and operation of Snowflake’s data warehouse-as-a-service since day one.

A core principle of GDPR

A major GDPR compliance factor is understanding what data an organization holds and to whom it relates. This requirement demands that data is structured, organized and easy to search.

Snowflake’s relational, SQL database architecture provides a significantly simplified structure and organization, ensuring that each record has a unique and easily identified location within the database. Snowflake customers can also combine relational storage with Snowflake’s VARIANT column type for semi-structured data. This approach extends the simplicity of the relational format to the schema flexibility of semi-structured data.
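
As a simple illustration of that combination (the table and column names here are hypothetical), a single Snowflake table can hold conventional relational columns alongside a VARIANT column containing raw semi-structured records:

  -- Relational columns and a VARIANT column for raw JSON in one table
  create table customer_events (
      customer_id   integer,
      event_time    timestamp_ntz,
      event_payload variant
  );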

Snowflake is made even more powerful by its ability to support massive concurrency. In larger organizations, there may be dozens or even hundreds of concurrent data modifications, queries and searches occurring at any one time. Traditional data warehouses can’t scale beyond a single cluster of compute at any given time, leading to long queues and delayed compliance. Snowflake’s multi-cluster, shared data architecture solves this problem by enabling as many unique clusters of compute resources as needed, for any purpose, leading to more efficient workload isolation and query throughput. Anyone can store, organize, modify, search and query very large amounts of data with as many concurrent users or operations as necessary.

Data subject rights

Organizations affected by GDPR must ensure they can comply with data subject requests. Individuals now have significantly expanded rights to learn what type of data an organization holds about them, along with the right to request access to and/or correction of their data, to have the data deleted, and/or to port the data to a new provider. When providing these services, organizations must respond fairly quickly, generally within 30 days. Therefore, they must quickly search their business systems and data warehouse to locate all personal data related to an individual and take action.

Organizations can greatly benefit from storing all their data in a data warehouse-as-a-service with full DML and SQL capabilities. It eases the burden of searching various discrete business systems and data stores to locate the relevant data. This helps to ensure that individual records can be searched, deleted, restricted, updated, truncated, split and otherwise manipulated to align with data subject requests. It also makes it possible to move data to comply with a “right to portability” request. From the beginning, Snowflake was architected with ANSI-standard SQL and full DML to ensure these types of operations are possible.
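
With full DML, the data subject requests described above reduce to ordinary SQL statements. The following is a minimal sketch, assuming a hypothetical customer_events table keyed by email:

  -- Right to erasure: remove an individual's records
  delete from customer_events where customer_email = 'person@example.com';

  -- Right to rectification: correct an individual's data
  update customer_events
     set customer_name = 'Corrected Name'
   where customer_email = 'person@example.com';

  -- Right of access / portability: retrieve everything held about the individual
  select * from customer_events where customer_email = 'person@example.com';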

Security

Unfortunately, many traditional data warehouses require security to be home-built and cobbled together with services outside of the core offering. What’s more, they may not even enable encryption out of the box.

As the modern data warehouse built for the cloud, Snowflake was built to ensure stringent security as a key feature of the service. Snowflake has key, built-in protections and security features, including:

  • Zero management – Snowflake reduces complexity with built-in performance, security and high availability so there’s no infrastructure to tweak, no knobs to turn and no tuning required.
  • Encryption everywhere – Snowflake automatically encrypts all data at rest and in transit.
  • Comprehensive protection – Security features include multi-factor authentication, role-based access control, IP address whitelisting, federated authentication and annual rekeying of encrypted data.
  • Tri-Secret Secure – Ensures customer control and data protection by combining a customer-provided encryption key along with a Snowflake-provided encryption key and user credentials.
  • Support for AWS PrivateLink – Customers can transmit data between their virtual private network and Snowflake without traversing the public Internet, making inter-network connectivity secure and easier to manage.
  • Stronger intra-company data demarcation through Snowflake Data Sharing – Leverage Snowflake’s data sharing features to share non-PII data with teams in your organization that don’t need access to PII, enforcing stronger security and GDPR controls.
  • Private deployment – Enterprises can get a dedicated and managed instance of Snowflake within a separate AWS Virtual Private Cloud (VPC).

Accountability

To add to the complexity, organizations must also ensure they, and the organizations and tools they work with, are able to demonstrate compliance. Snowflake aggressively audits and perfects its security practices on an ongoing basis, with regular penetration testing. The Snowflake data warehouse-as-a-service is SOC 2 Type II certified, PCI DSS compliant and supports HIPAA compliance. Customers can also audit data as it is manipulated to comply with data subject requests.

In addition to these out of the box capabilities and validations, Snowflake also provides customers with our Data Protection Addendum, which is tightly aligned with GDPR requirements. Snowflake also adheres to robust contractual security commitments to facilitate more efficient transactions and simplified due diligence.

Conclusion

Under the GDPR, companies must implement technical measures that will help them respond to the data protection and privacy needs of their customers. Snowflake not only provides the benefit of storing all critical customer data in a single location, it also enables rapid location and retrieval of that data so businesses can take action.

 


How to Make Data Protection and High Availability for Analytics Fast and Easy

When moving enterprise data warehouse analytic workloads to the cloud, it’s important to consider data protection and high availability (HA) services that will keep your valuable data preserved and your analytics running. Possible events such as human error, infrastructure failure, an unfortunate act of nature or any activity that places your data at risk can’t be ignored.

All the while, data protection and HA should be fast and easy. It shouldn’t take days, weeks or months, nor an army of technical specialists or a big budget, to make these critical safety measures happen.  

What data-driven companies depend on

With real-time dashboards, organizations depend on data analytics to communicate the status of business operations. Increasingly, companies embed self-service analytics into customer-facing applications. Therefore, enterprises spend enormous amounts of effort, energy and resources to gather and cultivate data about their customers. With all this activity around data, loss of any data or processing capability could result in catastrophic consequences for an organization.

How Snowflake protects your data and services

For these reasons, Snowflake innovates and integrates data protection and HA as core capabilities of our cloud-built data warehouse-as-a-service. In Figure 1, you’ll see what makes Snowflake different: our protection capabilities are all built in and orchestrated with metadata across your entire service. The figure also illustrates how Snowflake resilience is automatically distributed across three availability zones.

  • Built-in data protection: Over and above standard cloud data protection, Snowflake Time Travel enables you to recover data from any point in time, up to 90 days back. In addition, it’s all accomplished automatically. Other than specifying the number of days for Time Travel retention at setup (the default is 24 hours for Snowflake Enterprise and above), you do not have to initiate a thing or manage snapshots.


        

    Figure 1. Snowflake Built-in Data Protection and High Availability

      

This brings significant advantages for business analysts performing scenario-based analytics on changed datasets, or for data scientists who want to train new models and algorithms on old data.
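
As a minimal sketch of how Time Travel is used (the table name is hypothetical; retention beyond one day requires Snowflake Enterprise edition or above):

  -- Query a table as it existed 24 hours ago
  select * from orders at(offset => -60*60*24);

  -- Recover an accidentally dropped table within the retention window
  undrop table orders;

  -- Adjust the retention period for a table, up to 90 days
  alter table orders set data_retention_time_in_days = 90;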

  • Built-in service protection against node failures:  The impact of node failures can be tricky to figure out with different cloud implementations offered by different cloud data warehouse vendors. While other cloud data warehouse or querying services may provide some level of redundancy for current data, mechanisms to protect against data corruption or data loss in the event of a node failure vary.

    In most cases, the burden is on you to create a cluster (i.e., a system with a node count of greater than one) to protect against node failures. Typically, this means added cost (hardware, storage, and software instances), as well as added complexity, to account for the additional nodes. Some competing services may have a performance penalty on data writes. This exists because, under the covers, redundant nodes are being written using compute resources. We see this most frequently with on-premises data warehouse environments retrofitted for the cloud. Moreover, there also could be hidden costs in the form of your cluster going down and not being accessible for queries or updates during the time a failed node is being reconstructed.

    Because the Snowflake architecture separates the compute, storage and service layers, Snowflake assures resiliency and data consistency in the event of node failures. Depending on the severity of the failure, Snowflake may automatically reissue (retry) queries without any user involvement. And there is no impact on write (or read) performance. In addition, you can take advantage of lower cost storage. Competing services may strongly encourage, or even require, the use of premium-cost storage.
  • Built-in high availability: Providing an even higher degree of data protection and service resilience, within the same deployment region, Snowflake provides standard failover protection across three availability zones (including the primary active zone). Your data and business are protected. As you ingest your data, it is synchronously and transparently replicated across availability zones. This protection is automatically extended from Snowflake to customers, at no added charge.

    Further, all the metadata, the magic of Snowflake services, is also protected.

 

Table 1. Snowflake Data Protection and High Availability

Summary

Bottom line, within the same deployment region, you do not have to configure or struggle with manually building an HA infrastructure. Our data warehouse-as-a-service takes care of this for you, automatically. Snowflake makes data protection and high availability fast and easy. You can mitigate risks with speed, cost-effectiveness, confidence and peace of mind.

 


Snowflake Europe: What a Difference a Year Makes

Since Snowflake Computing first opened its doors in London a year ago, we’ve seen unprecedented growth. Starting from zero presence in Europe, we have onboarded 130 new customers in 9 countries, including Capital One and Deliveroo. In the first quarter of 2018 alone, we grew revenue nearly 1.5x compared to all of 2017. We’ve also opened offices in four additional European locations – Paris, Munich, Amsterdam and Stockholm.

This time last year, it was just four of us building the European arm of the business. We’ve now grown to 34 employees, spread across our five locations, developing an impressive portfolio of European customers and partners in the process.

In the past year, we’ve held 22 industry events in EMEA alone, including the European leg of our Cloud Analytics World Tour, helping us better connect with customers, partners and other industry leaders.

One of the key drivers of our success has been our ability to understand the value of data for businesses. Traditional big data platform vendors are trying to adapt to the changing landscape of the cloud, but retrofitting their infrastructure isn’t enough. Rather than patching and adding to old systems, we’ve created a data warehouse built for the cloud to handle any data analytics challenge our customers face.

But this is only the beginning of our story. The future is very promising for Snowflake. We’ll continue to deliver new innovations for our data warehouse-as-a-service to meet every aspect of our customers’ needs, especially with the challenges GDPR legislation brings. There are already some impressive customers on the horizon that we’re excited to serve.

A big focus for us will be meeting the needs of the data-intensive financial services industry. Our product offering dedicated to financial services helps our customers navigate the complex PSD2 and Open Banking regulations. Additionally, the retail sector will also be a significant focus for us as we help retailers of all sizes capitalise on their vast data stores to better personalise their customers’ experiences.

Since our Silicon Valley inception in 2012, Snowflake has launched in eight countries, attracting more than 1,000 customers globally. We also secured our most recent round of funding – US$263M in January 2018. With such unprecedented growth in just 12 months, we can’t wait to see where the next 12 months take us!


How to Load Terabytes into Snowflake – Speeds, Feeds and Techniques

We often get these questions from customers facing an initial data load into Snowflake or a large-scale daily ingestion: “What’s the fastest way to load terabytes of data?” and “What incoming data format do you recommend?” Here’s an example of a data load that provides answers to both of those questions, and more.

We recently used data from the 10TB TPCDS Benchmark data set to explore a few alternatives. This data is available to all Snowflake customers through the database named SNOWFLAKE_SAMPLE_DATA, schema TPCDS_SF10TCL.

The largest table in that database is STORE_SALES, containing 28.8 billion rows representing 4.7 TB of uncompressed data, roughly 164 bytes per row. Snowflake compresses this down to 1.3 TB internally. The table contains five years of daily transaction history and 23 columns split between integer and decimal data.

We loaded three different source data formats for this table:

  1. CSV files gzipped
  2. Date-partitioned Parquet files (snappy compressed)
  3. Date-partitioned ORC files (snappy compressed)

When loading Parquet and ORC into Snowflake, you have the choice of storing entire rows within a Snowflake VARIANT, or extracting the individual columns into a structured schema.  We tested both approaches for load performance. But for data with a fixed schema such as TPCDS, we prefer to store it structured.
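
For reference, the semi-structured flavor of this load is simply a COPY into a single-VARIANT-column table. The sketch below assumes the parquet_test external stage and file pattern defined later in this post:

  -- Each Parquet row lands as one VARIANT value
  create or replace table store_sales_variant (v variant);

  copy into store_sales_variant
    from @parquet_test/store_sales/
    pattern = '.*/.*/.*/ss_sold_date_sk=.*/.*';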

Partitioned Parquet and ORC are interesting in other ways. When using HIVE partitioning for these formats within a data-lake environment, the value of the partitioning data column is typically represented by a portion of the file name, rather than by a value inside of the data itself.  This means that during data loading, we must capture and potentially manipulate the file name by referencing Snowflake’s METADATA$FILENAME property, when using the COPY command.

Load performance

First, let’s look at the raw performance of loading the data using a Snowflake 2X-large cluster:

Source Format            Target Layout      Load Time (sec)   TB/Hr (uncompressed)
CSV (Gzipped)            Structured         1104              15.4
Parquet (Snappy comp)    Semi-structured    3518              4.8
Parquet (Snappy comp)    Structured         3095              5.4
ORC (Snappy comp)        Semi-structured    3845              4.4
ORC (Snappy comp)        Structured         2820              6.0

A few points jump right out:

  • Loading from Gzipped CSV is several times faster than loading from ORC and Parquet, at an impressive 15 TB/Hour. While 5-6 TB/hour is decent if your data is originally in ORC or Parquet, don’t go out of your way to create ORC or Parquet files from CSV in the hope that they will load into Snowflake faster.
  • Loading data into a fully structured (columnarized) schema is ~10-20% faster than landing it in a VARIANT.

When we tested loading the same data using different warehouse sizes, we found that load speed scaled roughly linearly with the size of the warehouse, as expected. For example, a 3X-large warehouse, which is twice the scale of a 2X-large, loaded the same CSV data at a rate of 28 TB/Hour. Conversely, an X-large loaded at ~7 TB/Hour, and a large loaded at a rate of ~3.5 TB/hr. This means you will spend about the same number of Snowflake credits to load a given data set regardless of the cluster size you use, as long as you suspend the warehouse when done to avoid idle time.
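
As a back-of-envelope check (assuming the standard credit rates of 32 credits/hour for a 2X-large and 16 credits/hour for an X-large): loading the 4.7 TB STORE_SALES data from CSV at 15.4 TB/hour on a 2X-large takes roughly 0.3 hours, or about 10 credits, while the same load at ~7 TB/hour on an X-large takes roughly 0.67 hours, which is also about 10-11 credits.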

Load rates for your own data files may differ based on a number of factors:

  • Location of your S3 buckets – For our test, both our Snowflake deployment and S3 buckets were located in us-west-2
  • Number and types of columns – A larger number of columns may require more time relative to number of bytes in the files.
  • Gzip Compression efficiency – More data read from S3 per uncompressed byte may lead to longer load times.

(In all cases, be sure to use a sufficient number of load files to keep all loading threads busy. For a 2X-large, there are 256 such threads, and we had ~2000 load files to cover the five years of history.)

Best Practices for Parquet and ORC

While we are considering Parquet and ORC, let’s look at the technique we used to populate the fully-structured version of STORE_SALES using partitioned Parquet data.

First, consider that the date-partitioned Parquet files reside in an S3 bucket with the following prefix naming convention, where the trailing integer is one of the values of the partitioning key. In STORE_SALES, it is an integer surrogate key for the sold_date column named ss_sold_date_sk:

  S3://<my_bucket>/10tb_parquet/store_sales/ss_sold_date_sk=2451132/

If the data files were originally generated by HIVE, there will also be a prefix representing data for which the partitioning key is NULL:

  S3://<my_bucket>/10tb_parquet/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/

Finally, HIVE will create a series of 0-byte “tag” files in the bucket that need to be ignored during ingestion. These files have a format in the pattern:

  S3://<my_bucket>/10tb_parquet/store_sales/ss_sold_date_sk=2451132_$folder$

As we noted earlier, the data files themselves do not contain a column or value for ss_sold_date_sk even though it is part of the table’s definition. Instead, the value must be derived from the prefix name in S3.

To handle ingestion from this bucket, we first defined an external stage for Snowflake as:

  create or replace stage parquet_test
     url='s3://<my_bucket>/tpcds/10tb_parquet/'
     credentials = (aws_key_id=…,aws_secret_key=…)
     FILE_FORMAT = (TYPE = 'PARQUET');

Finally, our command to load all of the Parquet data into the fully structured STORE_SALES table will look like this:

  copy into STORE_SALES from (
   select 
     NULLIF(
       regexp_replace (
       METADATA$FILENAME,
       '.*\\=(.*)\\/.*',
       '\\1'), 
       '__HIVE_DEFAULT_PARTITION__'
     )                         as ss_sold_date_sk,
     $1:ss_sold_time_sk        as ss_sold_time_sk,
     $1:ss_item_sk             as ss_item_sk,
     $1:ss_customer_sk         as ss_customer_sk,
     $1:ss_cdemo_sk            as ss_cdemo_sk,
     $1:ss_hdemo_sk            as ss_hdemo_sk,
     $1:ss_addr_sk             as ss_addr_sk,
     $1:ss_store_sk            as ss_store_sk,
     $1:ss_promo_sk            as ss_promo_sk,
     $1:ss_ticket_number       as ss_ticket_number,
     $1:ss_quantity            as ss_quantity,
     $1:ss_wholesale_cost      as ss_wholesale_cost,
     $1:ss_list_price          as ss_list_price,
     $1:ss_sales_price         as ss_sales_price,
     $1:ss_ext_discount_amt    as ss_ext_discount_amt,
     $1:ss_ext_sales_price     as ss_ext_sales_price,
     $1:ss_ext_wholesale_cost  as ss_ext_wholesale_cost,
     $1:ss_ext_list_price      as ss_ext_list_price,
     $1:ss_ext_tax             as ss_ext_tax,
     $1:ss_coupon_amt          as ss_coupon_amt,
     $1:ss_net_paid            as ss_net_paid,
     $1:ss_net_paid_inc_tax    as ss_net_paid_inc_tax,
     $1:ss_net_profit          as ss_net_profit
   from @parquet_test/store_sales/)
   pattern= '.*/.*/.*/ss_sold_date_sk=.*/.*'
  ;   

Notice that we are using the “transform” feature of the COPY command to parse and manipulate the semi-structured Parquet format. The main body of the COPY extracts the labeled fields contained in the Parquet data, mapping them directly to the corresponding columns in STORE_SALES. For example, in the expression:

     $1:ss_net_paid as ss_net_paid,

$1 refers to the contents of the single column representing an entire Parquet row of input data as a set of key-value pairs, and $1:ss_net_paid represents the value associated with the ss_net_paid key in that row.

Let’s take a closer look at two key expressions from the script above.

The first expression,

     NULLIF(
       regexp_replace (
       METADATA$FILENAME,
       '.*\\=(.*)\\/.*',
       '\\1'), 
       '__HIVE_DEFAULT_PARTITION__'
      ) 

is used to populate the ss_sold_date_sk column, which is the value used to partition the input data. The REGEXP_REPLACE function transforms the fully qualified S3 file name into just the integer value representing the date key embedded in the prefix. It does this by searching for the characters in the file path following the ‘=’ sign, up to the next ‘/’. NULLIF is used to replace the partitions named __HIVE_DEFAULT_PARTITION__ with the value NULL for the date key.

The final expression

  pattern= '.*/.*/.*/ss_sold_date_sk=.*/.*'

serves as a filter on the input files, forcing COPY to ignore the 0-byte placeholder files in the bucket.

Loading ORC entails exactly the same process, changing only the FILE_FORMAT definition in the CREATE STAGE command (see the sketch below).
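
For example, the ORC version of the external stage might look like the following sketch (the bucket path is illustrative, mirroring the Parquet stage above):

  create or replace stage orc_test
     url='s3://<my_bucket>/tpcds/10tb_orc/'
     credentials = (aws_key_id=…,aws_secret_key=…)
     FILE_FORMAT = (TYPE = 'ORC');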

Conclusion

Loading data into Snowflake is fast and flexible. You get the greatest speed when working with CSV files, but Snowflake’s expressiveness in handling semi-structured data allows even complex partitioning schemes for existing ORC and Parquet data sets to be easily ingested into fully structured Snowflake tables.

 


The Dream Data Warehouse Development Environment

Earlier this month, Snowflake’s Customer Enablement Team was assigned an email from one of our customers. The customer stated that he was not happy about the idea of cloning full copies of production databases for development purposes. “Do we really want to develop and maintain a system to copy a fraction of the production DB to dev?”, he asked, explaining that by just copying the entire production database, the dev team would have access to too much data. Being a veteran of Snowflake, I initially dismissed his concern because of Snowflake’s zero-copy clone capability, as outlined in this article. From my perspective, the zero-copy clone would not incur any additional cost for development purposes, so why not give the dev team all of the data?

The answer of course, as the customer pointed out, has to do with making the development effort more efficient. The assumption of zero-copy clone equating to zero-cost development is, of course, incorrect. There’s the cost of querying the data (requiring virtual warehouse credits) and the cost behind each development hour. After all, longer query times lead to longer development iterations and longer testing cycles. To create a “blissful” development environment in Snowflake, we need a more refined approach towards building the development data set.

The approach outlined by the customer was rooted in complementing Snowflake’s zero-copy clone with the additional technique of Block Sampling. In other words, they proposed creating a view containing a sample set from the original table. This approach enables an administrator to quickly set up a dev environment with minimal data. Controlling the amount of sample data is a good thing for many development situations because developers seldom require access to the full dataset.

Ok, let’s take a look at SAMPLE / TABLESAMPLE and see how we can do this. The syntax is quite simple:

 
 SELECT ...
 FROM ...
   { SAMPLE | TABLESAMPLE } [ samplingMethod ] ( <probability> ) [ { REPEATABLE | SEED } ( <seed> ) ]
 [ ... ]
 
 -- Where:
    samplingMethod :: = { { BERNOULLI | ROW } | { SYSTEM | BLOCK } }


Note that there are some interchangeable terms we should pay attention to when writing the query. These terms are synonyms and only differ in syntax:

SAMPLE | TABLESAMPLE

BERNOULLI | ROW

SYSTEM | BLOCK

REPEATABLE | SEED
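
For example, following the grammar above, the two statements below are equivalent ways of requesting the same repeatable block sample (LINEITEM is the table used in the examples later in this post):

  select * from lineitem sample block (10) seed (42);
  select * from lineitem tablesample system (10) repeatable (42);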

The two main methods of sampling are ROW (or BERNOULLI) and BLOCK (or SYSTEM) sampling. Let’s take a closer look at each one. 

Row Sampling

This approach uses the Bernoulli principle to select data by applying a probability of p/100 to each row. In the Snowflake documentation, we mention that this is similar to “flipping a weighted coin” on each row. The number of sampled (selected) rows should be approximately (p/100) * n, where n is the total number of rows in the table and p is the sample probability value set by the user.

This method implies that we will iterate through each row and calculate a probability at each row to decide whether it becomes a sample candidate. This iteration is going to impact query performance. However, the advantage is that we end up with a precise, well-distributed sample of our data. For creating a dev environment, using the Bernoulli sampling method in a view does not result in better performance. However, you could certainly create a new data table based on this sampling method and still scan less data. For example:

create table dev_Sales_LineItem_Bernoulli_20 as
    select *
    from SALES.PUBLIC.LINEITEM
    sample bernoulli (20);

This statement will create a development table from the LINEITEM table in our SALES database with 20% of the rows from the original table.

I won’t focus too much on this method in this post. You can feel free to experiment on your own to see if this sampling method would suit your needs. Instead, I will talk more about Block Sampling.

Block Sampling

In this approach, we apply a probability of p/100 to each block of rows. For those who are familiar with Snowflake’s micro partitions (details below), block sampling chooses individual partitions based on a specific probability. If you simply want quick access to production data and to run queries against a small percentage of the rows in a table, leveraging Block Sampling is a good way to go.

Let’s look at this with a specific example using the Block Sampling method on the LINEITEM table in the SALES database. Here is the table metadata:

In this example, I have a developer who is working on an aggregation query against the LINEITEM table. He created a simple query like this:

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_orderdate <= dateadd(day, -90, to_date('1998-12-01'))
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

Result:

 

This query reports the amount of business that was billed, shipped and returned. If the developer issued this query against the full table, we would see the following outcome in the query plan. (Note: the result below was produced on an XS warehouse):


It took about 19 seconds on an XS warehouse, which is not bad by any means. However, we can see that the query still performed a full table scan and a fair amount of aggregation.

Now, let’s see how we can improve the experience for the developer by creating a sample set of the LINEITEM table. We’ll create a special view (against our cloned dev database) for this developer and call it V_SALES_LINEITEM_SAMPLE_10. In this view, we’ll use the Block Sampling technique and only scan 10% of the partitions required by this query.

First create the view:

create view v_Sales_LineItem_Sample_10 as
    select *
    from SALES.PUBLIC.LINEITEM
    sample block (10);

Next, let’s update the query to use the view instead.

select
       l_returnflag,
       l_linestatus,
       sum(l_quantity) as sum_qty,
       sum(l_extendedprice) as sum_base_price,
       sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
       avg(l_quantity) as avg_qty,
       avg(l_extendedprice) as avg_price,
       avg(l_discount) as avg_disc,
       count(*) as count_order
  from
       v_Sales_LineItem_Sample_10
 where
       l_orderdate <= dateadd(day, -90, to_date('1998-12-01'))
 group by
       l_returnflag,
       l_linestatus
 order by
       l_returnflag,
       l_linestatus;

Result:

 

Let’s take a look at the query profile for this query:

 

This query ran in about 2.5 seconds on an XS warehouse, and we scanned only about 10% of the table’s partitions. This result is much better! Snowflake samples 10% of the partitions and then applies the WHERE filter. The result should still be accurate enough to let the developer know this query is working as expected.

In conclusion, leveraging Snowflake’s SAMPLE clause considerably reduces the size of the test dataset for a developer without losing data integrity. Even if a specific id or a timestamp biases the native data, chances are the developer is going to accept the nuance and continue with development instead of spending additional compute credits and hours waiting for results to return. An alternative is to use LIMIT to reduce the number of rows returned; however, this involves modifying the original query or creating a view that still needs to be maintained (see the quick comparison below). Using the SAMPLE clause, you can reliably subset a large table with an acceptable result mix and performance. I hope you find this helpful. Feel free to leave comments in our community and let me know your thoughts.
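
As a quick illustrative comparison, both statements below reduce the data a developer works with, but only SAMPLE preserves a representative spread of the table:

  -- Returns whichever 1000 rows the engine produces first; cheap, but not representative
  select * from lineitem limit 1000;

  -- Returns roughly 10% of the table's micro-partitions; representative and still fast
  select * from lineitem sample block (10);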

 

Boost Your Analytics with Machine Learning and Advanced Data Preparation

Enterprises can now harness the power of Apache Spark to quickly and easily prepare data and build Machine Learning (ML) models directly against that data in Snowflake. Snowflake and Qubole make it easy to get started by embedding required drivers, securing credentials, simplifying connection setup and optimizing in-database processing. Customers can focus on getting started quickly with their data preparation and ML initiatives instead of worrying about complex integrations and the cost of moving large data sets.

Setting up the Snowflake connection and getting started takes only a few minutes. Customers first create a Snowflake data store in Qubole and enter details for their Snowflake data warehouse. All drivers and packages are preloaded and kept up to date, eliminating manual bootstrapping of jars into the Spark cluster. There is no further configuration or tuning required and there are no added costs for the integration. Once the connection is saved, customers can browse their Snowflake tables, view metadata and see a preview of the Snowflake data, all from the Qubole interface. They can then use Zeppelin notebooks to get started reading and writing data to Snowflake as they begin exploring data preparation and ML use cases.

Below is an example of the object browser view showing the available tables and properties:

Security is also handled seamlessly, so customers can focus on getting started with their data without worrying about protecting their credentials. Qubole provides centralized and secure credential management, which eliminates the need to specify any credentials in plain text. Username and password are entered only when setting up the data store, but are otherwise inaccessible.

The solution is also designed for enterprise requirements and allows customers to use federated authentication and SSO via the embedded Snowflake drivers. With SSO enabled, customers can authenticate through an external, SAML 2.0-compliant identity provider (IdP) and achieve a higher level of security and simplicity. These capabilities help customers more easily share notebooks and collaborate on projects with little risk of sensitive information being exposed.

Below is a sample Scala program showing how to read from Snowflake using the data store object without specifying any credentials in plain text:

Beyond the simplicity and security of the integration, which helps customers get started quickly, customers will also benefit from a highly optimized Spark integration that uses Snowflake query pushdown to balance the query-processing power of Snowflake with the computational capabilities of Apache Spark. From simple projection and filter operations to advanced operations such as joins, aggregations and even scalar SQL functions, query pushdown runs these operations in Snowflake (where the data resides) to help refine and pre-filter the data before it is read into Spark. The traditional performance and cost challenges associated with moving large amounts of data for processing are thereby eliminated without additional overhead or management.

With Snowflake and Qubole, customers get an optimized platform for advanced data preparation and ML that makes it simple to get started. Customers can complement their existing cloud data warehouse strategy or get more value out of ML initiatives by opening access to more data. To learn more about ML and advanced data preparation with Qubole, visit the Qubole blog.


Quickstart Guide for Sagemaker + Snowflake (Part One)

Machine Learning (ML) and predictive analytics are quickly becoming irreplaceable tools for small startups and large enterprises. The questions that ML can answer are boundless. For example, you might want to ask, “What job would appeal to someone based on their  interests or the interests of jobseekers like them?” Or, “Is this attempt to access the network an indication of an intruder?” Or, “What type of credit card usage indicates fraud?”

Setting up a machine learning environment, in particular for on-premise infrastructure, has its challenges. An infrastructure team must request physical and/or virtual machines and then build and integrate those resources. This approach is both time-consuming and error prone due to the number of manual steps involved. It may work in a small environment, but the task becomes exponentially more complicated and impractical at scale.

There are many different ML systems to choose from, including  TensorFlow, XGBoost, Spark ML and MXNet, to name a few. They all come with their own installation guides, system requirements and dependencies. What’s more, implementation is just the first step. The next challenge is figuring out how to make the output from the machine learning step (e.g. a model) available for consumption. Then, all of the components for building a model within the machine learning tier and the access to the model in the API tier need to scale to provide predictions in real-time. Last but not least, the team needs to figure out where to store all the data needed to build the model.

Managing this whole process from end-to-end becomes significantly easier when using cloud-based technologies. The ability to provision infrastructure on demand (IaaS) solves the problem of manually requesting virtual machines. It also provides immediate access to compute resources whenever they are needed. But that still leaves the administrative overhead of managing the ML software and the platform to store and manage the data.   

At last year’s AWS developer conference, AWS announced Sagemaker, a “Fully managed end-to-end machine learning service that enables data scientists, developers and machine learning experts to quickly build, train and host machine learning models at scale.”

Sagemaker can access data from many different sources through its underlying kernels (Python, PySpark, Spark and R), including data provided by Snowflake. Storing data in Snowflake also has significant advantages.

Single source of truth

If data is stored in multiple locations, inevitably those locations will get out of sync. Even if data is supposed to be immutable, often one location is modified to fix a problem for one system while other locations are not. In contrast, if data is stored in one central, enterprise-grade, scalable repository, it serves as a “single source of truth” because keeping data in sync is made easy. Different tools are not required for structured and semi-structured data. Data can be modified transactionally, which immediately reduces the risk of problems.

Shorten the data preparation cycle

According to a study published in Forbes, data preparation accounts for about 80% of the work performed by data scientists. Shortening the data preparation cycle therefore has a major impact on the overall efficiency of data scientists.

Snowflake is uniquely positioned to shorten the data preparation cycle due to its excellent support for both structured and semi-structured data through a single language, SQL. This means that semi-structured data and structured data can be seamlessly parsed, combined, joined and modified through SQL statements in set-based operations. This enables data scientists to use the power of a full SQL engine for rapid data cleansing and preparation.
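
A minimal sketch of what that looks like in practice (the tables, columns and JSON shape here are hypothetical): a raw JSON payload stored in a VARIANT column can be flattened and joined to a conventional relational table in a single statement:

  -- Combine structured customer attributes with semi-structured clickstream JSON
  select c.customer_id,
         c.segment,
         f.value:page::string   as page_visited,
         f.value:duration::int  as seconds_on_page
  from   clickstream_raw r,
         lateral flatten(input => r.payload:visits) f,
         customers c
  where  c.customer_id = r.payload:customer_id::int;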

Scale as you go

Another problem that ML implementers frequently encounter is what we at Snowflake call “works on my machine” syndrome. Small datasets work fine on a local machine, but when migrated to production dataset sizes, reading all the data into a single machine doesn’t scale, or it may behave unexpectedly. Even if it does finish the job, it can take hours to load a terabyte-sized dataset. In Snowflake there is no infrastructure that needs to be provisioned, and Snowflake’s elasticity allows you to scale horizontally as well as vertically, all with the push of a button.
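
For illustration (the warehouse name is hypothetical), resizing a virtual warehouse or enabling multi-cluster scale-out is a single statement:

  -- Scale up for a heavy data preparation job, then back down when finished
  alter warehouse analytics_wh set warehouse_size = 'X-LARGE';
  alter warehouse analytics_wh set warehouse_size = 'SMALL';

  -- Horizontal scaling via multi-cluster warehouses (Enterprise edition and above)
  alter warehouse analytics_wh set min_cluster_count = 1 max_cluster_count = 4;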

Connecting Sagemaker and Snowflake

Sagemaker and Snowflake both utilize AWS’s infrastructure-as-a-service offerings, which enables us to build the infrastructure when we need it, where we need it (geographically) and at any scale required.

Since building the services becomes simplified with Sagemaker and Snowflake, the question becomes how to connect the two services. And that’s exactly the subject of the following parts of this post. How do you get started? What additional configuration do you need in AWS for security and networking? How do you store credentials?

In part two of this four-part blog, I’ll explain how to build a Sagemaker ML environment in AWS from scratch. In the third post, I will put it all together and show you how to connect a Jupyter Notebook to Snowflake via the Snowflake Python connector.  With the Python connector, you can import data from Snowflake into a Jupyter Notebook. Once connected, you can begin to explore data, run statistical analysis, visualize the data and call the Sagemaker ML interfaces.

However, to perform any analysis at scale, you really don’t want to use a single-server setup like Jupyter running a Python kernel. Jupyter running a PySpark kernel against a Spark cluster on EMR is a much better solution for that use case. So, in part four of this series I’ll connect a Jupyter Notebook to a local Spark instance and an EMR cluster using the Snowflake Spark connector.

The Snowflake difference

Snowflake is the only data warehouse built for the cloud. Snowflake delivers the performance, concurrency and simplicity needed to store and analyze all data available to an organization in one location. Snowflake’s technology combines the power of data warehousing, the flexibility of big data platforms, the elasticity of the cloud, and live data sharing at a fraction of the cost of traditional solutions. Snowflake: Your data, no limits.

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.