How the Modern Data Warehouse Helps Ease the Burden of GDPR

While many of the activities driving an organization’s GDPR compliance are in the organization’s own hands, its IT vendors should help satisfy their customers’ compliance requirements. At a minimum, an organization’s SaaS vendors should satisfy the security requirements that reside wholly in their domain but affect their customers’ business and data security. Snowflake was built from the ground up in ways that can ease the burden of complying with GDPR, especially for security- and customer-focused organizations.

Snowflake was designed from the beginning to handle an enormous amount of structured and semi-structured data with the ease of standard SQL. The accessibility and simplicity of SQL gives organizations the flexibility to seamlessly make any updates, changes or deletions required under GDPR. Snowflake’s support for semi-structured data can make it easier to adapt to new fields and other record changes. In addition, delivering industry-best security has been fundamental to the architecture, implementation and operation of Snowflake’s data warehouse-as-a-service since day one.

A core principle of GDPR

A major GDPR compliance factor is understanding what data an organization holds and to whom it relates. This requirement demands that data is structured, organized and easy to search.

Snowflake’s relational, SQL database architecture provides a significantly simplified structure and organization, ensuring that each record has a unique and easily identified location within the database. Snowflake customers can also combine relational storage with Snowflake’s Variant column type for semi-structured data. This approach extends the simplicity of the relational format to the schema flexibility of semi-structured data.
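
As a minimal, hypothetical sketch (the table and column names are illustrative, not from the original post), a single table can combine relational columns with a VARIANT column for semi-structured attributes, and standard SQL reaches both:

  -- Relational columns alongside a VARIANT column for semi-structured data
  create table customers (
    customer_id   integer,
    email         string,
    profile       variant    -- raw semi-structured attributes (e.g., JSON)
  );

  -- Query structured and semi-structured fields side by side
  select customer_id,
         email,
         profile:preferences.locale::string as preferred_locale
  from customers
  where profile:marketing_opt_in::boolean = true;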

Snowflake is made even more powerful by its ability to support massive concurrency. In larger organizations, there may be dozens or even hundreds of concurrent data modifications, queries and searches occurring at any one time. Traditional data warehouses can’t scale beyond a single cluster of compute at any given time, leading to long queues and delayed compliance. Snowflake’s multi-cluster, shared data architecture solves this problem by enabling as many unique clusters of compute resources as needed for any purpose, leading to better workload isolation and query throughput. Anyone can store, organize, modify, search and query very large amounts of data with as many concurrent users or operations as necessary.

Data subject rights

Organizations affected by GDPR must ensure they can comply with data subject requests. Individuals now have significantly expanded rights to learn what type of data an organization holds about them, along with the right to request access to and correction of that data, to have the data deleted, and/or to have the data ported to a new provider. Organizations must respond to these requests fairly quickly, generally within 30 days. Therefore, they must be able to quickly search their business systems and data warehouse to locate all personal data related to an individual and take action.

Organizations can greatly benefit from storing all their data in a data warehouse-as-a-service with full DML and SQL capabilities. It eases the burden of searching various discrete business systems and data stores to locate the relevant data. This helps to ensure that individual records can be searched, deleted, restricted, updated, truncated, split and otherwise manipulated to align with data subject requests. It also makes it possible to move data to comply with a “right to portability” request. From the beginning, Snowflake was architected with ANSI-standard SQL and full DML to ensure these types of operations are possible.
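
For illustration only, here is a minimal sketch of the kinds of SQL and DML statements such requests can translate into; the table, column and stage names are hypothetical:

  -- Access request: locate everything held about one individual
  select * from customer_data where email = 'jane.doe@example.com';

  -- Rectification request: correct an inaccurate record
  update customer_data
     set postal_address = '123 New Street, Berlin'
   where email = 'jane.doe@example.com';

  -- Erasure ("right to be forgotten") request
  delete from customer_data where email = 'jane.doe@example.com';

  -- Portability request: unload the individual's data for transfer
  copy into @export_stage/jane_doe/
    from (select * from customer_data where email = 'jane.doe@example.com')
    file_format = (type = 'CSV' compression = 'GZIP');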

Security

Unfortunately, many traditional data warehouses require security to be home-built and cobbled together with services outside of the core offering. What’s more, they may not even enable encryption out of the box.

As the modern data warehouse built for the cloud, Snowflake was built to ensure stringent security as a key feature of the service. Snowflake has key, built-in protections and security features, including:

  • Zero management – Snowflake reduces complexity with built-in performance, security and high availability so there’s no infrastructure to tweak, no knobs to turn and no tuning required.
  • Encryption everywhere – Snowflake automatically encrypts all data at rest and in transit.
  • Comprehensive protection – Security features include multi-factor authentication, role-based access control (see the brief sketch after this list), IP address whitelisting, federated authentication and annual rekeying of encrypted data.
  • Tri-Secret Secure – Ensures customer control and data protection by combining a customer-provided encryption key along with a Snowflake-provided encryption key and user credentials.
  • Support for AWS PrivateLink – Customers can transmit data between their virtual private network and Snowflake without traversing the public Internet, making inter-network connectivity secure and easier to manage.
  • Stronger intra-company data demarcation through Snowflake Data Sharing – Use Snowflake’s data sharing features to share only non-PII data with teams in your organization that don’t need access to personal data, enforcing stronger security and GDPR controls.
  • Private deployment – Enterprises can get a dedicated and managed instance of Snowflake within a separate AWS Virtual Private Cloud (VPC).
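
To make the role-based access control point concrete, here is a minimal, hypothetical sketch of granting a role only the access it needs (all names are illustrative):

  -- Create a role for analysts and grant it only the objects it needs
  create role analyst_role;
  grant usage on database sales_db to role analyst_role;
  grant usage on schema sales_db.public to role analyst_role;
  grant select on table sales_db.public.orders to role analyst_role;

  -- Assign the role to a user
  grant role analyst_role to user jsmith;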

Accountability

To add to the complexity, organizations must also ensure that they, and the organizations and tools they work with, can demonstrate compliance. Snowflake audits and refines its security practices on an ongoing basis, including regular penetration testing. The Snowflake data warehouse-as-a-service is SOC 2 Type II certified and PCI DSS compliant, and it supports HIPAA compliance. Customers can also audit how data has been manipulated to comply with data subject requests.

In addition to these out-of-the-box capabilities and validations, Snowflake provides customers with our Data Protection Addendum, which is tightly aligned with GDPR requirements. Snowflake also adheres to robust contractual security commitments to facilitate more efficient transactions and simplified due diligence.

Conclusion

Under the GDPR, companies must implement technical measures that will help them respond to the data protection and privacy needs of their customers. Snowflake not only provides the benefit of storing all critical customer data in a single location, it also enables rapid location and retrieval of that data so businesses can take action.

 


How to Make Data Protection and High Availability for Analytics Fast and Easy

When moving enterprise data warehouse analytic workloads to the cloud, it’s important to consider data protection and high availability (HA) services that will keep your valuable data preserved and your analytics running. Possible events such as human error, infrastructure failure, an unfortunate act of nature or any activity that places your data at risk can’t be ignored.

All the while, data protection and HA should be fast and easy. It shouldn’t take days, weeks or months, nor an army of technical specialists or a big budget, to make these critical safety measures happen.  

What data-driven companies depend on

With real-time dashboards, organizations depend on data analytics to communicate the status of business operations. Increasingly, companies embed self-service analytics into customer-facing applications. Therefore, enterprises spend enormous amounts of effort, energy and resources to gather and cultivate data about their customers. With all this activity around data, loss of any data or processing capability could result in catastrophic consequences for an organization.

How Snowflake protects your data and services

For these reasons, Snowflake innovates and integrates data protection and HA as core capabilities of our cloud-built data warehouse-as-a-service. In Figure 1, you’ll see what makes Snowflake different: our protection capabilities are all built in and orchestrated with metadata across your entire service. The figure also illustrates how Snowflake resilience is automatically distributed across three availability zones.

  • Built-in data protection: Over and above standard cloud data protection, Snowflake Time Travel enables you to recover data from any point within the retention period, up to 90 days. In addition, it’s all accomplished automatically. Other than specifying the number of days of Time Travel retention at setup (the default is 24 hours; retention of up to 90 days is available with Snowflake Enterprise Edition and above), you do not have to initiate a thing or manage snapshots.

    Figure 1. Snowflake Built-in Data Protection and High Availability

This brings significant advantages for business analysts performing scenario-based analytics on changed datasets, or for data scientists who want to train new models and algorithms on old data.
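
As a brief, hypothetical sketch of how this looks in practice (the table name and intervals are illustrative):

  -- Query a table as it existed 24 hours ago
  select * from orders at(offset => -60*60*24);

  -- Query the table as of a specific point in time
  select * from orders at(timestamp => '2018-05-01 09:00:00'::timestamp_ltz);

  -- Recover an accidentally dropped table within the retention window
  undrop table orders;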

  • Built-in service protection against node failures:  The impact of node failures can be tricky to figure out with different cloud implementations offered by different cloud data warehouse vendors. While other cloud data warehouse or querying services may provide some level of redundancy for current data, mechanisms to protect against data corruption or data loss in the event of a node failure vary.

    In most cases, the burden is on you to create a cluster (i.e., a system with a node count of greater than one) to protect against node failures. Typically, this means added cost (hardware, storage, and software instances), as well as added complexity, to account for the additional nodes. Some competing services may have a performance penalty on data writes. This exists because, under the covers, redundant nodes are being written using compute resources. We see this most frequently with on-premises data warehouse environments retrofitted for the cloud. Moreover, there also could be hidden costs in the form of your cluster going down and not being accessible for queries or updates during the time a failed node is being reconstructed.

    Because the Snowflake architecture separates the compute, storage and service layers, Snowflake ensures resiliency and data consistency in the event of node failures. Depending on the severity of a failure, Snowflake may automatically reissue (retry) queries without any user involvement. There is also no impact on write (or read) performance. In addition, you can take advantage of lower-cost storage, whereas competing services may strongly encourage, or even require, the use of premium-cost storage.
  • Built-in high availability: Providing an even higher degree of data protection and service resilience, within the same deployment region, Snowflake provides standard failover protection across three availability zones (including the primary active zone). Your data and business are protected. As you ingest your data, it is synchronously and transparently replicated across availability zones. This protection is automatically extended from Snowflake to customers, at no added charge.

    Further, all the metadata, the magic of Snowflake services, is also protected.

 

Table 1. Snowflake Data Protection and High Availability

Summary

Bottom line, within the same deployment region, you do not have to configure or struggle with manually building an HA infrastructure. Our data warehouse-as-a-service takes care of this for you, automatically. Snowflake makes data protection and high availability fast and easy. You can mitigate risks with speed, cost-effectiveness, confidence and peace of mind.

 


How to Load Terabytes into Snowflake – Speeds, Feeds and Techniques

We often get these questions from customers facing an initial data load into Snowflake, or a large-scale daily ingestion: “What’s the fastest way to load terabytes of data?” and “What incoming data format do you recommend?” Here’s an example of a data load that provides answers to both of those questions, and more.

We recently used data from the 10TB TPCDS Benchmark data set to explore a few alternatives. This data is available to all Snowflake customers through the database named SNOWFLAKE_SAMPLE_DATA, schema TPCDS_SF10TCL.

The largest table in that database is STORE_SALES, containing 28.8 billion rows representing 4.7 TB of uncompressed data, roughly 164 bytes per row. Snowflake compresses this down to 1.3 TB internally. The table contains five years of daily transaction history and 23 columns split between integer and decimal data.

We loaded three different source data formats for this table:

  1. CSV files gzipped
  2. Date-partitioned Parquet files (snappy compressed)
  3. Date-partitioned ORC files (snappy compressed)

When loading Parquet and ORC into Snowflake, you have the choice of storing entire rows within a Snowflake VARIANT, or extracting the individual columns into a structured schema.  We tested both approaches for load performance. But for data with a fixed schema such as TPCDS, we prefer to store it structured.

Partitioned Parquet and ORC are interesting in other ways. When using HIVE partitioning for these formats within a data-lake environment, the value of the partitioning data column is typically represented by a portion of the file name, rather than by a value inside of the data itself. This means that during data loading, we must capture and potentially manipulate the file name by referencing Snowflake’s METADATA$FILENAME property when using the COPY command.

Load performance

First, let’s look at the raw performance of loading the data using a Snowflake 2X-large cluster:

Source Format          Target Layout      Load Time (sec)   TB/Hr (uncompressed)
CSV (Gzipped)          Structured         1104              15.4
Parquet (Snappy comp)  Semi-structured    3518               4.8
Parquet (Snappy comp)  Structured         3095               5.4
ORC (Snappy comp)      Semi-structured    3845               4.4
ORC (Snappy comp)      Structured         2820               6.0

A few points jump right out:

  • Loading from gzipped CSV is several times faster than loading from ORC or Parquet, at an impressive 15 TB/hour. While 5-6 TB/hour is decent if your data is originally in ORC or Parquet, don’t go out of your way to create ORC or Parquet files from CSV in the hope that they will load into Snowflake faster.
  • Loading data into a fully structured (columnarized) schema is ~10-20% faster than landing it in a VARIANT column.

When we tested loading the same data using different warehouse sizes, we found that load throughput scaled in proportion to the warehouse size, as expected. For example, a 3X-large warehouse, which is twice the scale of a 2X-large, loaded the same CSV data at a rate of 28 TB/hour. Conversely, an X-large loaded at ~7 TB/hour, and a large loaded at ~3.5 TB/hour. This means you will spend about the same number of Snowflake credits to load a given data set regardless of the cluster size you use, as long as you suspend the warehouse when done to avoid idle time.
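
As a minimal, hypothetical sketch of putting this into practice (the warehouse, stage and file-format details are illustrative), you can size the warehouse for the load and suspend it as soon as the load finishes:

  -- Scale up for the bulk load
  alter warehouse load_wh set warehouse_size = 'XXLARGE';

  copy into store_sales from @csv_stage/store_sales/
    file_format = (type = 'CSV' compression = 'GZIP');

  -- Suspend immediately afterward to stop consuming credits
  alter warehouse load_wh suspend;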

Load rates for your own data files may differ based on a number of factors:

  • Location of your S3 buckets – For our test, both our Snowflake deployment and S3 buckets were located in us-west-2
  • Number and types of columns – A larger number of columns may require more time relative to the number of bytes in the files.
  • Gzip compression efficiency – The more data that has to be read from S3 per uncompressed byte, the longer the load may take.

(In all cases, be sure to use a sufficient number of load files to keep all loading threads busy. For a 2X-large, there are 256 such threads, and we had ~2000 load files to cover the five years of history.)

Best Practices for Parquet and ORC

While we are considering Parquet and ORC, let’s look at the technique we used to populate the fully-structured version of STORE_SALES using partitioned Parquet data.

First, consider that the date-partitioned Parquet files reside in an S3 bucket with the following prefix naming convention, where the trailing integer is one of the values of the partitioning key. In STORE_SALES, it is an integer surrogate key for the sold_date column, named ss_sold_date_sk:

  S3://<my_bucket>/10tb_parquet/store_sales/ss_sold_date_sk=2451132/

If the data files were originally generated by HIVE, there will also be a prefix representing data for which the partitioning key is NULL:

  S3://<my_bucket>/10tb_parquet/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__/

Finally, HIVE will create a series of 0-byte “tag” files in the bucket that need to be ignored during ingestion. These files have names matching the pattern:

  S3://<my_bucket>/10tb_parquet/store_sales/ss_sold_date_sk=2451132_$folder$

As we noted earlier, the data files themselves do not contain a column or value for ss_sold_date_sk even though it is part of the table’s definition. Instead, the value must be derived from the prefix name in S3.

To handle ingestion from this bucket, we first defined an external stage for Snowflake as:

  create or replace stage parquet_test
     url='s3://<my_bucket>/tpcds/10tb_parquet/'
     credentials = (aws_key_id=…,aws_secret_key=…)
     FILE_FORMAT = (TYPE = 'PARQUET');

Finally, our command to load all of the Parquet data into the fully structured STORE_SALES table will look like this:

  copy into STORE_SALES from (
   select 
     NULLIF(
       regexp_replace (
       METADATA$FILENAME,
       '.*\\=(.*)\\/.*',
       '\\1'), 
       '__HIVE_DEFAULT_PARTITION__'
     )                         as ss_sold_date_sk,
     $1:ss_sold_time_sk        as ss_sold_time_sk,
     $1:ss_item_sk             as ss_item_sk,
     $1:ss_customer_sk         as ss_customer_sk,
     $1:ss_cdemo_sk            as ss_cdemo_sk,
     $1:ss_hdemo_sk            as ss_hdemo_sk,
     $1:ss_addr_sk             as ss_addr_sk,
     $1:ss_store_sk            as ss_store_sk,
     $1:ss_promo_sk            as ss_promo_sk,
     $1:ss_ticket_number       as ss_ticket_number,
     $1:ss_quantity            as ss_quantity,
     $1:ss_wholesale_cost      as ss_wholesale_cost,
     $1:ss_list_price          as ss_list_price,
     $1:ss_sales_price         as ss_sales_price,
     $1:ss_ext_discount_amt    as ss_ext_discount_amt,
     $1:ss_ext_sales_price     as ss_ext_sales_price,
     $1:ss_ext_wholesale_cost  as ss_ext_wholesale_cost,
     $1:ss_ext_list_price      as ss_ext_list_price,
     $1:ss_ext_tax             as ss_ext_tax,
     $1:ss_coupon_amt          as ss_coupon_amt,
     $1:ss_net_paid            as ss_net_paid,
     $1:ss_net_paid_inc_tax    as ss_net_paid_inc_tax,
     $1:ss_net_profit          as ss_net_profit
   from @parquet_test/store_sales/)
   pattern= '.*/.*/.*/ss_sold_date_sk=.*/.*'
  ;   

Notice that we are using the “transform” feature of the COPY command to parse and manipulate the semi-structured Parquet format. The main body of the COPY includes extraction of the labeled fields contained in the Parquet data, mapping them directly to the corresponding column in STORE_SALES.  E.g. in the expression:

     $1:ss_net_paid as ss_net_paid,

$1 refers to the contents of the single column representing an entire Parquet row of input data as a set of key-value pairs, and $1:ss_net_paid represents the value associated with the ss_net_paid key in that row.

Let’s take a closer look at two key expressions from the script above.

The first expression,

     NULLIF(
       regexp_replace (
       METADATA$FILENAME,
       '.*\\=(.*)\\/.*',
       '\\1'), 
       '__HIVE_DEFAULT_PARTITION__'
      ) 

is used to populate the ss_sold_date_sk column, which is the value used to partition the input data. The REGEXP_REPLACE function transforms the fully qualified S3 file name into just the integer value representing the date key embedded in the prefix. It does this by searching for the characters in the file path following the ‘=’ sign, up to the next ‘/’. NULLIF is then used to replace the partitions named __HIVE_DEFAULT_PARTITION__ with the value NULL for the date key.

The final expression

  pattern= '.*/.*/.*/ss_sold_date_sk=.*/.*'

serves as a filter on the input files, forcing COPY to ignore the 0-byte placeholder files in the bucket.

Loading ORC entails exactly the same process, changing only the FILE_FORMAT definition in the CREATE STAGE command.
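
For example, a stage for the ORC version of the data might look like the following minimal sketch; the bucket path is hypothetical, and the COPY statement itself is unchanged:

  create or replace stage orc_test
     url='s3://<my_bucket>/tpcds/10tb_orc/'
     credentials = (aws_key_id=…,aws_secret_key=…)
     FILE_FORMAT = (TYPE = 'ORC');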

Conclusion

Loading data into Snowflake is fast and flexible. You get the greatest speed when working with CSV files, but Snowflake’s expressiveness in handling semi-structured data allows even complex partitioning schemes for existing ORC and Parquet data sets to be easily ingested into fully structured Snowflake tables.

 


Connecting a Jupyter Notebook to Snowflake through Python (Part 3)

In part two of this four-part series, we learned how to create a Sagemaker Notebook instance. In part three, we’ll learn how to connect that Sagemaker Notebook instance to Snowflake. If you’ve completed the steps outlined in part one and part two, the Jupyter Notebook instance is up and running and you have access to your Snowflake instance, including the demo dataset. Now, you’re ready to connect the two platforms.

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.  

The Sagemaker Console

The first step is to open the Jupyter service using the link on the Sagemaker console.

 

There are two options for creating a Jupyter Notebook. You can create the notebook from scratch by following the step-by-step instructions below, or you can download sample notebooks here. If you decide to build the notebook from scratch, select the conda_python3 kernel. Alternatively, if you decide to work with a pre-made sample, make sure to upload it to your Sagemaker notebook instance first.

There are several options for connecting Sagemaker to Snowflake. The simplest way to get connected is through the Snowflake Connector for Python. Note that the connector doesn’t come pre-installed with Sagemaker, so you will need to install it through the Python package manager. (As of the writing of this post, the Snowflake Python connector has a dependency on the C Foreign Function Interface (CFFI) package and requires a minimum version of 1.11.) Sagemaker, on the other hand, comes preinstalled with cffi 1.10, which unfortunately causes the Snowflake Python connector to fail.

The step outlined below automatically detects the failure and, if necessary, triggers de-installation and re-installation of the most recent version of libcffi. Note: If re-installation of the CFFI package is necessary, you must also restart the Kernel so the new version is recognized.

%%bash
# Check which version of cffi is currently installed
CFFI_VERSION=$(pip list 2>/dev/null | grep cffi )
echo $CFFI_VERSION
# The preinstalled cffi 1.10.0 breaks the Snowflake connector, so remove it first
if [[ "$CFFI_VERSION" == "cffi (1.10.0)" ]]
then
   pip uninstall --yes cffi
fi
# Install build prerequisites and the Snowflake connector
# (which pulls in a current cffi as a dependency)
yum_log=$(sudo yum install -y libffi-devel openssl-devel)
pip_log=$(pip install --upgrade snowflake-connector-python)
# If cffi was replaced, the kernel must be restarted to pick up the new version
if [[ "$CFFI_VERSION" == "cffi (1.10.0)" ]]
then
   echo "configuration has changed; restart notebook"
fi

This problem is persistent and needs to be resolved each time the Sagemaker server is shut down and restarted, and the Jupyter kernel also needs to be restarted after each shutdown. The good news is that the cffi package will eventually be updated on the Sagemaker AMI, at which point the step above will simply detect the current version and leave it in place.

 

The next step is to connect to the Snowflake instance with your credentials.

import snowflake.connector
# Connecting to Snowflake using the default authenticator
ctx = snowflake.connector.connect(
  user=<user>,
  password=<password>,
  account=<account>
)

Here you have the option to hard code all credentials and other specific information, including the S3 bucket names. However, for security reasons, it’s advisable to not store credentials in the notebook. Another option is to enter your credentials every time you run the notebook.

Rather than storing credentials directly in the notebook, I opted to store a reference to the credentials. The actual credentials are automatically stored in a secure key/value management system called AWS Systems Manager Parameter Store (SSM).  

With most AWS systems, the first step requires setting up permissions for SSM through AWS IAM. Please ask your AWS security admin to create another policy with the following Actions on KMS and SSM.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ssm:PutParameter",
                "ssm:DeleteParameter",
                "kms:Decrypt",
                "ssm:GetParameterHistory",
                "ssm:GetParametersByPath",
                "ssm:GetParameters",
                "ssm:GetParameter",
                "ssm:DeleteParameters"
            ],
            "Resource": [
                "arn:aws:kms:<region>:<accountid>:key/SNOWFLAKE/*",
                "arn:aws:ssm:<region>:<accountid>:parameter/SNOWFLAKE/*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "ssm:DescribeParameters",
                "kms:ListAliases"
            ],
            "Resource": "*"
        }
    ]
}

Adhering to the best-practice principle of least privilege, I recommend restricting the Actions to only the Resources that are needed. Also, be sure to change the region and accountid in the code segment shown above or, alternatively, grant access to all resources (i.e., “*”).

In the code segment shown above, I created a root name of “SNOWFLAKE”. This is only an example. You’re free to create your own, unique naming convention.

Next, check permissions for your login. Assuming the new policy has been called SagemakerCredentialsPolicy, permissions for your login should look like the example shown below.

With the SagemakerCredentialsPolicy in place, you’re ready to begin configuring all your secrets (i.e., credentials) in SSM. Be sure to use the same namespace you used in the credentials policy as the prefix for your secrets.

After setting up your key/value pairs in SSM, use the following step to read the key/value pairs into your Jupyter Notebook.

import boto3

params=['/SNOWFLAKE/URL','/SNOWFLAKE/ACCOUNT_ID'
        ,'/SNOWFLAKE/USER_ID','/SNOWFLAKE/PASSWORD'
        ,'/SNOWFLAKE/DATABASE','/SNOWFLAKE/SCHEMA'
        ,'/SNOWFLAKE/WAREHOUSE','/SNOWFLAKE/BUCKET'
        ,'/SNOWFLAKE/PREFIX']
region='us-east-1'

def get_credentials(params):
   ssm = boto3.client('ssm',region)
   response = ssm.get_parameters(
      Names=params,
      WithDecryption=True
   )
   #Build dict of credentials
   param_values={k['Name']:k['Value'] for k in  response['Parameters']}
   return param_values

param_values=get_credentials(params)

Instead of hard coding the credentials, you can now reference the key/value pairs via the variable param_values. In addition to the credentials (account_id, user_id, password), I also stored the warehouse, database and schema.

import snowflake.connector
# Connecting to Snowflake using the default authenticator
ctx = snowflake.connector.connect(
  user=param_values['/SNOWFLAKE/USER_ID'],
  password=param_values['/SNOWFLAKE/PASSWORD'],
  account=param_values['/SNOWFLAKE/ACCOUNT_ID'],
  warehouse=param_values['/SNOWFLAKE/WAREHOUSE'],
  database=param_values['/SNOWFLAKE/DATABASE'],
  schema=param_values['/SNOWFLAKE/SCHEMA']
)

Now, you’re ready to read data from Snowflake. To illustrate the benefits of using data in Snowflake, we will read semi-structured data from the SNOWFLAKE_SAMPLE_DATA database.

When data is stored in Snowflake, you can use the Snowflake JSON parser and the SQL engine to easily query, transform, cast and filter JSON data before it gets to the Jupyter Notebook.

From the JSON documents stored in WEATHER_14_TOTAL, the following step shows the minimum and maximum temperature values, a date and timestamp and the latitude/longitude coordinates for New York City.

cs=ctx.cursor()
allrows=cs.execute( \
"select (V:main.temp_max - 273.15) * 1.8000 + 32.00 as temp_max_far, " +\
"       (V:main.temp_min - 273.15) * 1.8000 + 32.00 as temp_min_far, " +\
"       cast(V:time as timestamp) time, " +\
"       V:city.coord.lat lat, " +\
"       V:city.coord.lon lon " +\
"from snowflake_sample_data.weather.weather_14_total " +\
"where v:city.name = 'New York' " +\
"and   v:city.country = 'US' ").fetchall()

The final step converts the result set into a Pandas DataFrame, which is suitable for machine learning algorithms.

import pandas as pd                               # For munging tabular data

data = pd.DataFrame(allrows)
data.columns=['temp_max_far','temp_min_far','time','lat','lon']
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 10)         # Keep the output on one page
data                                          # Display the DataFrame

 

Conclusion

Now that we’ve connected a Jupyter Notebook in Sagemaker to the data in Snowflake using the Snowflake Connector for Python, we’re ready for the final stage: Connecting Sagemaker and a Jupyter Notebook to both a local Spark instance and a multi-node EMR Spark cluster. I’ll cover how to accomplish this connection in the fourth and final installment of this series — Connecting a Jupyter Notebook to Snowflake via Spark.

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.  

 


Top 9 Best Practices for Data Warehouse Development

When planning a modern cloud data warehousing development project, having a clear outline of the business and IT needs and pain points will be key to the ultimate success of your venture. Being able to tell the right story will give the business the structure it needs to be successful in its data warehousing efforts.

Here are 9 things you should know about staying current in data warehouse development, but won’t necessarily hear from your current IT staff and consultants.

1) Have a data model. Getting a common understanding of what information is important to the business will be vital to the success of the data warehouse. Sometimes the businesses themselves don’t know their own data needs or landscape. They will be using different words for the same data sets, the same words for different data sets, etc. Modeling the business’ information can be a real eye opener for all parties concerned.

2) Have a data flow diagram. Knowing where all the business’ data repositories are and how the data travels within the company in a diagram format allows everyone to determine the best steps for moving forward. You can’t get where you want to be if you don’t know where you are.

3) Build a source-agnostic integration layer. The integration layer’s sole purpose is to pull together information from multiple sources. This is generally done to allow better business reporting. Unless the company has a custom application developed with a business-aligned data model on the back end, choosing a third-party source to align to defeats that purpose. Integration MUST align with the business model.

4) Adopt a recognized data warehouse architecture standard (e.g., 3NF, star schema [dimensional], Data Vault). Regardless of the actual approach chosen, picking a standard and sticking with it will enable efficiency within a data warehouse development approach. Supporting a single methodology for support and troubleshooting allows new staff to join the team and ramp up faster.

5) Consider adopting an agile data warehouse methodology. Data warehouses no longer have to be large, monolithic, multi-quarter or multi-year efforts. With proper planning that aligns to a single integration layer, data warehouse projects can be broken down into smaller, faster deliverables that return value much more quickly. This also allows you to re-prioritize the warehouse as business needs change.

6) Favor ELT over ETL. Moving corporate data, as is, to a single platform should be job #1. Then legacy systems can be bypassed and retired along the way, helping the business realize savings faster. Once data is colocated, it is much more efficient to let the power of a single cloud engine do integrations and transformations (i.e. fewer moving parts, push down optimizations, etc.).

7) Adopt a data warehouse automation tool. Automation allows you to leverage your IT resources more fully, iterate faster through projects and enforce coding standards (i.e. Wherescape, AnalytixDS, Ajilius, homespun, etc.) for easier support and ramp-up. 

8) Get your staff trained in modern approaches. Giving your team knowledge of the advantages of newer technologies and approaches lets your IT staff become more self-sufficient and effective. This will also open up more understanding and options in hiring and contracting with the best resources that the IT industry has to offer.

9) Pick a cloud-based data warehouse environment. For the least initial investment, the storage and compute elasticity coupled with the pay-as-you-go nature of cloud-based services provide the most flexible data warehousing solution on the market. 

 


Data is Only Transformative with Transformative Technology

At the recent AWS re:Invent show in Las Vegas, The Cube host, Lisa Martin, had the chance to sit down with Bob Muglia, CEO and President of Snowflake. Bob shared his thoughts on Snowflake’s latest addition to its cloud-built data warehouse, Snowpipe, while looking back at Snowflake’s origins and ahead to its future in order to enable the data-driven enterprise.

What is Snowpipe, and how do customers get started with it?

Muglia: Snowpipe is a way of ingesting data into Snowflake in a streaming, continuous way. You simply drop new data that’s coming into S3 and we ingest it for you automatically. Snowpipe makes it simple to bring the data into your data warehouse on a continuous basis, ensuring that you’re always up-to-date and that your analysts are getting the latest insights and the latest data.

In the five years since you launched, how has the opportunity around cloud data warehousing changed? How has Snowflake evolved to become a leader in this space?

Muglia: If you go back five years, this was a timeframe where NoSQL was all the rage. Everybody was talking about how SQL was passé and something you’re not going to see in the future. Our founders had a different view. They had been working on true relational databases for almost 20 years, and they recognized the power of SQL and relational database technology. But they also saw that customers were experiencing significant limitations with existing technology. They saw in the cloud, and in what Amazon had done, the ability to build an all new database that takes advantage of the full elasticity and power of the cloud to deliver whatever analytics the business requires. However much data you want, however many queries you want to run simultaneously, Snowflake takes what you love about a relational database and allows you to operate in a very different way. Our founders had that vision five years ago and successfully executed on it. The product has worked beyond the dreams of our customers, and that response from our customers is what we get so excited about.

How did you identify what data should even be streamed to Snowpipe?

Muglia: As an example, in entertainment we’re experiencing a data explosion. You have streaming video data, subscription data, billing data, social media data and on and on. None of this is arriving in any sort of regular format. It’s coming as semi-structured data, like JSON or XML. Up until Snowflake came onto the scene with a truly cloud-based solution for data warehousing, everyone was struggling to wrangle all these data sets. Snowpipe lets you bring in multiple data sets, merge them in real-time and get the analytics back to your business in an agile way that’s never been seen before.

How does your partnership with AWS extend Snowflake’s capabilities?

Muglia: People don’t want their data scattered all over the place. With the cloud, with what Amazon’s done and with a product like Snowflake, you can bring all of your data together. That can change the culture of a company and the way people work. All of a sudden, data is not power. Data is available to everyone, and it’s democratized so every person can work with that data and help to bring the business forward. It can really change the dynamics around the way people work.

Tell us little bit about Snowflake’s collaboration with its customers. How are they helping to influence your future?

Muglia: As a company, we run Snowflake on Snowflake. All of our data is in Snowflake, all of our sales data, our financial data, our marketing data, our product support data and our engineering data. Every time a user runs a query, that query is logged in Snowflake and the intrinsics about it are logged. When you have a tool with the power of Snowflake, you can effectively answer any business question in just a matter of minutes. And that’s transformative to the way people work. And to me, that’s what it means to build a data-driven culture: The answers to business questions are inside what customers are doing and are encapsulated in the data.

Try Snowflake for free. Sign up and receive $400 US dollars worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

New Snowflake Features Released in H2’17

Looking back at the first half of this year, the main theme at Snowflake has been centered around scaling – the team, the service and our customer base. We introduced and improved the self-service onboarding experience, grew our ecosystem by collaborating with major technology partners and SIs and invested in making our service faster, more secure and easier to use.

In the summer, we announced the general availability of Snowflake’s Instant Data Sharing. As we continue to grow, this will provide the foundation for data-driven organizations of all sizes to instantly exchange data, create new insights and discover new revenue streams.

In the second half of this year, we started a journey to evolve the service in substantial ways and this will continue throughout 2018. 

Instant elasticity, per-second billing and 4XL virtual warehouses
Imagine you have instant access to performance and scalability – and truly pay only for what you use.

Many cloud DW solutions make the claim in the headline above. Nonetheless, a closer look exposes real differences in the degree of scalability and concurrency supported, the overall user experience and the types of workloads a user can run at scale. One of our key investments in this area resulted in our instant elasticity feature:

  • The introduction of per-second pricing represents one crucial building block. Now, our customers are billed to the second instead of a full hour when running workloads in Snowflake.
  • We also improved our provisioning algorithms, using past usage data to better provision compute capacity for future needs.
  • We added support for our new 4XL virtual warehouses, doubling the maximum compute configuration and allowing our customers to tackle their most challenging workloads and performance SLAs.

Based on customer feedback, the decrease in wait time from minutes to seconds is a huge benefit. You can now ensure the right performance for your executive dashboards and not worry anymore about a full-hour charge. You can also scale your load instantly and have fast access to the most current data.
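
As a minimal, hypothetical sketch of what this elasticity looks like in SQL (the warehouse name and sizes are illustrative):

  -- Create a warehouse that suspends itself after 60 seconds of inactivity
  create warehouse dashboard_wh with
    warehouse_size = 'XLARGE'
    auto_suspend   = 60
    auto_resume    = true;

  -- Scale up instantly for a heavy workload, then back down;
  -- with per-second billing you pay only for the seconds used
  alter warehouse dashboard_wh set warehouse_size = 'XXXLARGE';
  alter warehouse dashboard_wh set warehouse_size = 'XLARGE';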

Improving performance and SQL programmability
Our ongoing mission is to build the fastest data warehouse with the SQL you love.

Throughout the last quarter, we released a wide range of SQL capabilities that address database migration challenges and increase both compatibility with SQL ANSI and overall SQL programmability:

  • Support for JavaScript Table UDFs (in public preview).
  • Support for Double-Dot notation for specifying database objects, to ease migration from on-premises database systems.
  • Support for Table Literals.
  • INSERT OVERWRITE: allows rebuilding tables, to ease migration from Hadoop systems.
  • LEAD/LAG function enhancement: the ability to ignore NULL values (see the brief sketch after this list).
  • SAMPLE/TABLESAMPLE – announced general availability.
  • We introduced functions to estimate frequent values in near-linear time.
  • More flexibility in configuring the start-of-week and week-of-year policies.
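
Here is a brief, hypothetical sketch of two of these capabilities in action (table and column names are illustrative):

  -- LEAD that skips over NULL values
  select event_time,
         lead(status) ignore nulls over (order by event_time) as next_status
  from events;

  -- Sample roughly 10% of a table's rows
  select * from events sample (10);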

In addition, we have been working on substantial performance improvements that address ease-of-use and automation. Please stay tuned for major announcements in early 2018.

Staying ahead with enterprise-ready security and compliance
From day one, security has always been core to Snowflake’s design.

  • Programmatic SSO support – Now, a Snowflake user can leverage browser-assisted single-sign-on programmatically through our Snowflake ODBC, JDBC, and Python drivers and our command line tool, SnowSQL. In contrast to some of our competitors, you don’t need to write code to connect to your SAML 2.0-compliant IDP. (Programmatic SSO support is available in our Enterprise Edition).
  • PrivateLink support for Snowflake – We worked with AWS to integrate PrivateLink with Snowflake. With PrivateLink, Snowflake users can now connect to Snowflake by bypassing the public internet. No cumbersome proxies need to be set up between Snowflake and their network. Users have full control over egress traffic.

Improving our ecosystem and data loading
Enabling developers and builders to onboard new workloads and create applications with their favorite tools, drivers and languages remains a top priority.

Snowpipe
We recently announced Snowpipe, which represents an important milestone in many ways. First, users can now efficiently address continuous loading use cases for streaming data. Secondly, with Snowpipe, we introduced the first service in addition to the core data warehousing service. Snowpipe is serverless so there’s no need to manage a virtual warehouse for data loading into Snowflake. Finally, we continue to deeply integrate with the entire AWS stack: Snowpipe allows customers to take advantage of AWS S3 event notifications to automatically trigger Snowflake data loads into target tables. You can read more about Snowpipe here: Overview and First Steps.
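
As a minimal, hypothetical sketch (the pipe, table and stage names are illustrative), a pipe binds a COPY statement to a stage, and with AUTO_INGEST enabled it is triggered by S3 event notifications:

  -- Continuously load JSON files landing in the stage into the target table
  create pipe click_events_pipe auto_ingest = true as
    copy into click_events
    from @click_events_stage
    file_format = (type = 'JSON');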

Broader Ecosystem
While we continue adding functionality and enriching Snowflake with additional services, we strongly believe in freedom of choice. Instead of preaching a unified platform and a one-solution-fits-all philosophy, our key objectives are to make integrations easier and to improve their performance.

  • For enterprise-class ETL, data integration and replication:
    • Added PowerCenter 10.2 support for Snowflake.
    • Added support for the ETL use case with AWS Glue – build modern, performant ETL jobs via our Spark connector, pushing compute into Snowflake.
  • For data analytics and data science:
    • Added a native Snowflake connector for Alteryx supporting in-DB components.
    • Upgraded our Snowflake dplyr library.
  • For business intelligence (BI):
    • Added a native Snowflake connector for Chartio (in preview).
  • For parallel data loading & unloading via the COPY command, developers can now:
    • Load non-UTF-8 character-encoded data.
    • Unload to the Parquet file format.

Increasing transparency and usability
These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the service itself.

  • Resource monitors can now be created through the Snowflake UI. As a user creates a resource monitor, they can also set up notifications. Notifications for “suspend” and “suspend immediately” actions are sent out when quota thresholds are reached for a particular resource monitor. In addition, we added support for up to 5 warnings that can be defined for each resource monitor. (A brief SQL sketch of these capabilities follows this list.)
  • New MONITOR USAGE privilege to grant access to review billing and usage information via SQL or through the Snowflake UI. This gives non-account administrators the ability to review information about data stored in databases and stages, as well as gain insight into warehouse usage.
  • For improved usability and convenience, we added a new option (COPY GRANTS) which allows users to preserve or copy grants as part of executing the CREATE OR REPLACE TABLE, CREATE OR REPLACE VIEW, CREATE TABLE LIKE, and CREATE TABLE CLONE variations.  
  • We also completely overhauled our worksheet and improved the overall experience. We are going to roll these changes out in stages in the coming weeks, with much more to come in 2018.
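
Here is a brief, hypothetical sketch of the first three items (names, quotas and thresholds are illustrative, and resource monitors can also be created entirely through the UI):

  -- Resource monitor with a warning threshold and a suspend threshold
  create resource monitor monthly_quota with credit_quota = 100
    triggers on 80 percent do notify
             on 100 percent do suspend;

  -- Let a non-admin role review billing and usage information
  grant monitor usage on account to role finance_role;

  -- Preserve existing grants when rebuilding a table
  create or replace table sales copy grants as select * from sales_staging;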

Scaling and investing in service robustness
These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • Our global expansion continues with a new Snowflake region in Sydney, Australia.
  • We introduced a new product edition: Virtual Private Snowflake (VPS). With VPS, customers get a dedicated and managed instance of Snowflake within a separate dedicated AWS VPC, with all the performance, simplicity and concurrency inherent to Snowflake. This also includes completely dedicated metadata services isolated from the metadata activity of other Snowflake customers.
  • We continued our investments in system stability, improving and strengthening various components of our cloud services layer.

Conclusion

We are well underway to reach 1000 customers by the end of 2017, with tens of compressed PBs stored in Snowflake and millions of data processing jobs successfully executed daily. We grew to over 300 Snowflakes and expanded our global presence. 2017 has been truly a defining year in Snowflake’s young history: Self-service, instant Data Sharing, broadening our ecosystem, instant elasticity, Snowpipe, VPS and PrivateLink for Snowflake.  We look forward to an exciting 2018 as we continue to help our customers solve their data challenges at scale.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

5 Must-Have Features for High Concurrency Environments

Whether you’re new to cloud data warehousing or comparing multiple cloud data warehouse technologies, it’s critical to assess whether your data warehouse environment will need to support concurrent processing of any sort. Unless you’re a one-person database shop, in all likelihood the answer is yes, you will. Concurrency, or concurrent data processing, is simultaneous access and/or manipulation of the same data. This is not to be confused with parallel processing, which is multiple operations happening at the same time, but not against the same data.

Concurrency can take the form of multiple users interactively exploring and manipulating a particular data set, concurrent applications querying and visualizing the same data set, transactional updates to the data, or even concurrent loading of new data or change data into the data set. And if you thought to yourself, “What? I can concurrently load new data, while supporting queries on the same data set?,” then keep reading.

If you require concurrency at any capacity, you will want these five cloud data warehouse features:

  • High relational performance across a broad range of data types: Of course, you want high performance–that’s a given. However, the more challenging aspect to plan for is fast queries on a broad range of data types, including semi-structured/JSON data. You don’t want your relational data warehouse to bog down and hold up corporate users because now it must handle non-traditional data from groups like the web team or product engineering. JSON is the new normal.  
  • Automatic and instant warehouse scaling: Let’s say your warehouse can handle a high number of concurrent accesses, but it bogs down during a period of high demand. Now what? This is a very important question to have answered as you compare technologies. Will you have to kick-off users? Will you have to schedule after-hour jobs? Will you have to add nodes? Will this require you to redistribute data? If so, redistributing data takes time and it’s a double-whammy because the existing data in the warehouse has to be unloaded. This is a huge disruption to your users. Instead, you want the ability to load balance across new virtual warehouses (compute engines) and have your cloud data warehouse continue to execute at fast speeds–including loading new data–all against the same data set. The more automatic this load balancing happens, the better.
  • ACID compliance: Right up there with poor performance is inaccurate results or reporting due to inconsistent data or dirty reads. The more people, workgroups or applications accessing the same data simultaneously, the more pressure there is to maintain data consistency. A data warehouse with ACID compliance ensures consistency and data integrity without requiring you to write scripts or manually manage data integrity and consistency yourself.
  • Multi-statement transaction support: Tied to ACID compliance is multi-statement transaction support. If you have users who nest multiple statements within a single transaction, you want a warehouse solution you can trust to execute the transaction completely and with integrity, or to automatically roll it back should any statement in the sequence fail.
  • Data sharing: Data integrity and freshness should not stop just because you have to share data with an external stakeholder. Traditional approaches require engaging in an ETL process or spending time manually deconstructing, securing and transferring data. Data consumers on the receiving end must also spend time reversing your steps and reconstructing the data. This process is slow and manual, and it does not ensure your data consumers are always working with live, up-to-date data. Data sharing allows you to eliminate all of this effort and ensure access to live data (see the brief sketch after this list).
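
As a minimal, hypothetical sketch of the provider side of data sharing (database, schema, table, share and account names are illustrative):

  -- Create a share and grant access to specific objects
  create share sales_share;
  grant usage on database sales_db to share sales_share;
  grant usage on schema sales_db.public to share sales_share;
  grant select on table sales_db.public.daily_orders to share sales_share;

  -- Make the share visible to a consumer account; the consumer reads live data,
  -- with no copies created or transferred
  alter share sales_share add accounts = partner_account;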

The business value for proactively planning for concurrency is that you want to ensure your cloud data warehouse can support your environment, regardless of what it throws at you. Especially during times of sudden, unpredictable, heavy query loads against a common data set.

Try Snowflake for free. Sign up and receive $400 US dollars worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Let History Halve Your Next Data Analytics Purchase

If you’re lucky, you’ll spend just six to 12 months considering and buying your next enterprise software solution. If it’s a seven-figure purchase, plan for an additional six to 12 months to confirm your organization has made the best investment possible.

During that process, dozens of your IT and business leaders will engage your shortlist of on-premises and software-as-a-service (SaaS) vendors to compare these competing technologies based on architecture, features, performance, business benefits and cost of ownership. Buying a data warehouse is no different.

But what if you could shorten that process? What if you had fact-based information arranged in a non-traditional but highly effective method to help you confidently narrow your search and quicken your time to market (TTM) with your chosen solution? What would that be worth to your organization and your peace of mind?

All enterprise software decisions include alternatives that span the decades. Meaning, your organization is likely to consider upgrading an existing technology you already own, and consider solutions that represent today’s latest SaaS offerings.

How so? One of your alternatives might be to upgrade the existing, on-premises data warehouse your organization purchased 10 to 15 years ago, a solution that wasn’t much different when it first emerged in the 1990s and hasn’t advanced much since it was first deployed in your data center.

You may also own or are considering an on-premises NoSQL solution such as Hadoop. This technology emerged just over a decade ago, challenging the very existence of the legacy data warehouse in order to accommodate the exponential increase in the volume, variety and velocity of existing and new data types.

Since the advent of Hadoop, many traditional data warehouse vendors now offer a cloud version of their on-premises solution. With all this said, there are now many more solution categories, and many new vendors, that you must consider for your next data warehouse purchase.

Herein lies the rub. Nearly all data warehouse purchase decisions, and all enterprise software decisions for that matter, take the form of a side-by-side-by-side, laundry-list comparison. That’s a significant amount of ground to cover, especially for data warehousing, which is four decades old. Your review of competing alternatives becomes even more protracted when the architectures and features of traditional and more modern products don’t align, which they never do.

Instead, think linear. Think like a historian. Many of the solutions you’ll consider are a response to the drawbacks, and benefits, of preceding, competing technologies. For example, at Snowflake, many of our customers had previously used a legacy data warehouse for years and then added a Hadoop solution to expand their data analytics platform. Unfortunately, that combination of technologies did not meet their ever-increasing requirements.

So, before you kick off a necessary, side-by-side comparison, consider your initial group of alternative technologies as building blocks, from bottom to top, from the oldest to the most recent. Then eliminate from contention those that do not add anything new or innovative. This approach enables you to focus on more recent technologies that truly deliver better value, and will enable you to continue to innovate well into the future. In the end, you’ll get to a shorter shortlist, and quickly, thus speeding your decision-making process and TTM by up to 50 percent.

At Snowflake, we’ve done the hard work for you. We invite you to read this short but revealing ebook that details the benefits and drawbacks of each succeeding data analytics technology – from the birth of the legacy data warehouse, all the way to today’s modern, built-for-the-cloud data warehouse. We’re confident it will provide the insight you need to quicken your next data warehouse purchase by rapidly reviewing every technology that got us to where we are today.