Global Snowflake: Loading Data into Snowflake from Azure Blob Storage

Snowflake is available in a number of AWS regions as we continue to expand our geographic presence to meet customer demand, reduce latency and satisfy industry compliance and country-specific regulations.

As the global adoption of Snowflake continues, we also expect customers to transfer increasingly higher data volumes between regions. There are several potential scenarios that, taken together, are driving this trend, including:

  • Data sharing across regions. A data set maintained by a Snowflake account in a specific region can be easily shared and consumed by others in different regions.
  • Synchronizing Snowflake data between different geographic regions for disaster recovery purposes.
  • Importing raw data from, and exporting it to, storage in other cloud providers. A common request from customers running Snowflake on AWS has been to support import from and export to Blob Storage in Microsoft Azure.

Therefore, we are happy to announce additional integration capabilities with Blob Storage in Microsoft Azure. This is part of Snowflake’s commitment to becoming a global cloud provider, supporting customers no matter which cloud provider(s) they choose. The integration with Azure Blob Storage complements Snowflake’s existing functionality for data loading and unloading. From now on, the Snowflake SQL commands to define an external stage support the URL and credential specifications for Azure Blob Storage. This allows customers to import data from and export data to Azure Blob Storage containers.

This capability targets architectures in which data sets reside in Azure. Customers using Microsoft Azure can now easily exchange data with Snowflake to benefit from Snowflake’s leading analytics capabilities. The following picture illustrates the overall approach:

 

The key integration point between Azure and Snowflake is a container in Azure Blob Storage. Snowflake expects that any data to be loaded from the Azure application into Snowflake is placed in a container. This container is then registered as an external stage in Snowflake using the following Snowflake SQL command:

CREATE STAGE azstage
URL = azure://<account>.blob.core.windows.net/<container>/<path>
CREDENTIALS=(AZURE_SAS_TOKEN=…)

The URL parameter of the CREATE STAGE command now supports URLs for Azure Blob Storage service endpoints. The endpoint for a given account can be found in the overview pane of the storage account in the Azure portal, as shown in the following figure:

When defining an external stage in Azure Blob Storage, the service endpoint URL should be followed by the container name and can include additional path specifications. Note that the URL in the stage definition replaces ‘https:’ from the endpoint URL with ‘azure’. Client-side encryption is supported for files that are encrypted using an Azure Storage SDK or compatible encryption library. Customers can provide their encryption key to the CREATE STAGE command. Details on creating Azure stages can be found in the Snowflake documentation here.
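
As a sketch of how these pieces fit together (the account, container, SAS token and key below are placeholders, not working values), a stage definition that combines the SAS token credential with a client-side encryption key might look like this:

CREATE OR REPLACE STAGE azstage
  URL = 'azure://<account>.blob.core.windows.net/<container>/<path>'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
  ENCRYPTION = (TYPE = 'AZURE_CSE' MASTER_KEY = '<base64-encoded-key>');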

After defining the external stage, customers can use Snowflake’s familiar COPY syntax to refer to the stage. For example, the following statement loads a batch of data files from the Azure Blob Storage container into a target table T1 in Snowflake:

COPY INTO T1
FROM @azstage/newbatch

Similarly, the following COPY statement exports the contents of an existing table T2 in Snowflake to a set of files in the Azure external stage:

COPY INTO @azstage/t2data
FROM T2

The Snowflake external stage support for Azure Blob Storage complements Snowflake’s expansion across Amazon data centers worldwide. It now provides the ability to easily access data in Azure storage using built-in Snowflake functionality.

While cloud providers do not charge for data ingress, they do charge for data egress. For Snowflake customers using the new Azure external stages, importing data from Azure Blob Storage into Snowflake will incur data egress charges for the amount of data transferred out of their Blob Storage accounts. The charges accrue to the Azure subscription that the Blob Storage accounts belong to. The rate at which these charges occur depends on the geographical location of the Blob Storage account and the volume of data transferred during a billing period. You can find more details here.   

Starting on February 1, 2018, we will pass through the cost of data egress from AWS and Microsoft incurred via external stages. This cost will appear on your bill. Egress through JDBC and other drivers will continue to be free. Our initial analysis shows that data egress is uncommon among Snowflake customers, so many customers will not be affected by this change.

The Billing & Usage page in the Snowflake web portal tracks data transfer volumes similarly to warehouse utilization and storage volumes. More information on Snowflake data transfer prices can be found here.

To give an example, exporting a 1TB table across cloud providers, from Snowflake running in US West on AWS into an Azure Blob Storage container located in the Azure East US region, costs $90. Exporting the table within AWS and into an AWS S3 bucket located in US East costs $20 (cheaper since the transfer is within AWS US regions). Data transferred into an external stage in the same region and the same cloud provider continues to be free of charge, e.g., exporting the table from the previous example into an AWS S3 bucket in US West is free of charge.
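
For a rough sense of the arithmetic behind these numbers, assuming the AWS data transfer rates published at the time (roughly $0.09 per GB for transfers out to the internet or another cloud, and $0.02 per GB between AWS US regions): 1 TB is about 1,000 GB, so 1,000 GB × $0.09/GB ≈ $90 for the cross-cloud export, and 1,000 GB × $0.02/GB ≈ $20 for the cross-region export within AWS.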

Check out the Snowflake documentation for detailed information on exchanging data between Snowflake and Azure Blob Storage, as well as the new external stage support for Azure Blob Storage.

Try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Modern Data Sharing: The Opportunities Are Endless

Data sharing between organizations for commercial purposes has been around for over 100 years. But until very recently, enterprises have been forced to rely on traditional data sharing methods that are labor-intensive, costly and error-prone. These methods are also more open to hackers and produce stale data. Snowflake Data Sharing, one of the newest innovations to Snowflake’s cloud-built data warehouse, has eliminated those barriers and enabled enterprises to easily share live data in real time via one-to-one, one-to-many and many-to-many relationships. Best of all, the shared data between data providers and data consumers doesn’t move.

Below is an example of how Snowflake Data Sharing reduced the time to create a live, secure data share to a fraction of the time and cost of a standard method. Most interestingly, the application of Snowflake Data Sharing in this instance reveals that the solutions addressed by modern data sharing are endless.

The Challenge

The Federal Risk and Authorization Management Program (FedRAMP) “is a government-wide program that provides a standardized approach to security assessment, authorization, and continuous monitoring for cloud products and services.” Complying with the program’s approximately 670 security requirements and collecting supporting evidence is a significant challenge. But having to do so monthly, as required, is a Sisyphean task if you attempt it manually. Cloud providers must complete an inventory of all of their FedRAMP assets, which includes binaries, running network services, asset types and more. Automation is really the only logical approach to solving this FedRAMP inventory conundrum.

The Run-of-the-mill Method

Often, people gravitate to what’s familiar, so it’s no surprise we initially considered a solution combining an AWS tool, an IT automation tool and some Python to clean up the data. However, we estimated a significant effort to develop, test and deploy the code, on top of the required, ongoing maintenance. Instead, we took a different approach.

The Solution

Snowflake’s Data Sharing technology is a fast, simple and powerful solution that allows us to maintain our security governance posture while sharing live data in real time without having to move the data. With modern data sharing, there is the concept of a data provider and a data consumer. For this project we engaged Lacework, a Snowflake security partner, as our initial data provider.

Lacework provides us with advanced threat detection by running their agents on our servers to help capture relevant activities, organize events logically, baseline behaviors, and identify deviations. In doing so, Lacework looks for file modifications, running processes, installed software packages, network sockets, and monitors for suspicious activities in one’s AWS account. All of that data is then analyzed and stored in their Snowflake account. Lacework is both a Snowflake vendor and a Snowflake data warehouse-as-a-service customer. They use Snowflake to provide their security services to Snowflake. Basically, Lacework already collects the data required to complete the FedRAMP system inventory collection task.

We contacted Lacework and presented them with our FedRAMP challenge and suggested leveraging data sharing between their Snowflake account and our Security Snowflake account. Within a couple of hours, they provided us with live FedRAMP data through Snowflake Data Sharing. Yes, you read that right. It only took a couple of hours. The following describes the steps for creating, sharing, and consuming the data:                                                                                                                               

Data Provider (Lacework) steps

  1. Create a share and give it a name
  2. Grant privilege to the database and its objects (schemas, tables, views)
  3. Alter share to add other Snowflake accounts to the share

Data Consumer (Snowflake Security) steps

  1. Create Database from the share
  2. Perform some SQL commands and voila! We have our FedRAMP system inventory data (results redacted for security reasons). The SQL sketch below illustrates both sides of the exchange.
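
As a rough sketch of what these steps look like in Snowflake SQL (the database, table and account names below are illustrative, not Lacework’s actual objects):

-- Data provider side: create the share, grant access, add the consumer account
create share fedramp_share;
grant usage on database fedramp_db to share fedramp_share;
grant usage on schema fedramp_db.public to share fedramp_share;
grant select on table fedramp_db.public.system_inventory to share fedramp_share;
alter share fedramp_share add accounts = consumer_account;

-- Data consumer side: create a database from the share and query it
create database fedramp_inventory from share provider_account.fedramp_share;
select * from fedramp_inventory.public.system_inventory;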

 

 

 

Again, the whole data sharing effort took just a few hours.

Beyond FedRAMP

You are probably asking yourself at this point, “Why didn’t you just ask Lacework to generate a FedRAMP report instead of doing data sharing?” I would wholeheartedly agree with you if we were dealing with a conventional data warehouse built on a 1990s share-nothing philosophy. But Snowflake is the farthest thing from a conventional data warehouse, and data sharing is nothing short of transformational. How so?

In addition to consuming data from Lacework, we also consume data from other data providers that share application logs, JIRA cases and more. We combine these data sources to automatically track software packages to determine if they are approved or not. Before data sharing, this activity was a time-consuming, manual process. Now, the team is free to focus on more critical security activities since data sharing has helped with FedRAMP and has improved our overall security posture.

Conclusion

As I wrote this blog, I watched my kids enthusiastically playing with their new Lego set. It’s remarkable how simple blocks can form such complex structures. Modern data sharing has similar properties: it gives data consumers and data providers infinite ways of solving challenges, creating boundless business opportunities.

Learn more about Lacework and FedRAMP.

Try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Snowflake’s Remediation Plans for the Meltdown and Spectre Vulnerabilities

Meltdown

Meltdown is a hardware vulnerability that primarily affects Intel x86 processors. An attacker must have local access on the target system and must be able to run rogue code to successfully exploit the Meltdown vulnerability. Moreover, security researchers have determined that Meltdown poses a clear risk to virtualized environments.

In light of that, we commend the major cloud IaaS providers such as AWS for recognizing the threat, rolling up their sleeves and quickly deploying a remediation security update. All indications from AWS suggest they have successfully remediated this vulnerability. Since Snowflake security tightly controls the code that can be run on our production servers, the main threat for data exposure is cross-VM attacks, which AWS has remediated with its hypervisor patch.

There has also been a lot of concern about performance degradation after AWS deployed the security update. Our current internal performance results fall well within the noise range. In other words, we have not detected any significant impact to performance.

In addition, AWS has published an AWS kernel update so customers can deploy it to their respective VMs. However, Snowflake’s defense-in-depth approach adequately addresses the impact of Meltdown on the Snowflake service because we tightly control who can access our production environment. We limit this access to only those who need to perform administrative and security support. We also enforce several forms of multi-factor authentication before anyone can access the production VPC, and we monitor all system changes on our servers and ensure those changes are authorized and secure. Although our security architecture does not require the AWS kernel patch for security reasons, we are evaluating the performance impact of this patch and will install it in all situations that do not materially impact the experience of our customers. Moreover, we have updated all of our Snowflake endpoints, such as our company laptops.

Spectre

Snowflake currently considers the Spectre Variant 1 vulnerability (CVE-2017-5753) the riskiest of the three new classes of speculative-execution attacks (Spectre Variant 1, Spectre Variant 2 and Meltdown) because it can exploit browsers via JavaScript. Therefore, we have deployed all available browser Spectre patches to all of our Snowflake endpoints, and we will continue to quickly deploy new browser Spectre patches as they become publicly available.

Outside of the browser attack surface, we will continue to remediate this vulnerability across our environment as vendors take proactive measures by releasing security updates. For example, we have deployed a vendor’s Spectre security update in our test environment and we are currently running regression and performance tests. We expect to deploy such a patch to the production environment shortly.

In the interim, we are monitoring our environment and continuing to research potential exploits with our security partners.

Customers

It is also critical that our customers update their own systems, especially those that may execute untrusted code and could therefore be vulnerable to Meltdown or Spectre. This includes updating user web browsers with vendor-provided updates as soon as possible. We also recommend that customers leverage two-factor authentication whenever possible. In particular, Snowflake recommends that customers use our MFA service and our IP whitelisting features for interactive logins to their Snowflake account for defense-in-depth.

Conclusion

We will continue to send customer updates as we reach patch deployment milestones or if we detect significant system performance issues with the mitigations associated with these vulnerabilities.

Try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

New Snowflake Features Released in H2’17

Looking back at the first half of this year, the main theme at Snowflake has been centered around scaling – the team, the service and our customer base. We introduced and improved the self-service onboarding experience, grew our ecosystem by collaborating with major technology partners and SIs and invested in making our service faster, more secure and easier to use.

In the summer, we announced the general availability of Snowflake’s Instant Data Sharing. As we continue to grow, this will provide the foundation for data-driven organizations of all sizes to instantly exchange data, create new insights and discover new revenue streams.

In the second half of this year, we started a journey to evolve the service in substantial ways and this will continue throughout 2018. 

Instant elasticity, per-second billing and 4XL virtual warehouses
Imagine you have instant access to performance and scalability – and truly pay only for what you use.

Many cloud DW solutions make the claim in the headline above. Nonetheless, a closer look reveals differences in the degree of scalability and concurrency supported, the overall user experience, and the types of workloads a user can run at scale. One of our key investments in this area resulted in our instant elasticity feature:

  • The introduction of per-second pricing represents one crucial building block. Our customers are now billed by the second instead of by the full hour when running workloads in Snowflake.
  • We also improved our provisioning algorithms, using past usage data to better provision compute capacity for future needs.
  • We added support for our new 4XL virtual warehouses, doubling the maximum compute configuration and allowing our customers to tackle their most challenging workloads and performance SLAs.

Based on customer feedback, the decrease in wait time from minutes to seconds is a huge benefit. You can now ensure the right performance for your executive dashboards and not worry anymore about a full-hour charge. You can also scale your load instantly and have fast access to the most current data.
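
For illustration, here is what instant elasticity looks like in practice (the warehouse name and parameter values below are hypothetical and depend on your workload):

-- A warehouse sized for a heavy batch job, suspended as soon as it goes idle
create or replace warehouse batch_wh
  warehouse_size = '4X-LARGE'   -- the new 4XL size
  auto_suspend = 60             -- suspend after 60 seconds of inactivity
  auto_resume = true;

-- Resize on the fly for lighter workloads; you are billed per second of actual use
alter warehouse batch_wh set warehouse_size = 'MEDIUM';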

Improving performance and SQL programmability
Our ongoing mission is to build the fastest data warehouse with the SQL you love.

Throughout the last quarter, we released a wide range of SQL capabilities that address database migration challenges and increase both ANSI SQL compatibility and overall SQL programmability (a few of these are sketched after the list):

  • Support for JavaScript Table UDFs (in public preview).
  • Support for Double-Dot notation for specifying database objects to ease migration from on-premise database systems.
  • Support for Table Literals.
  • INSERT OVERWRITE: allows rebuilding tables to ease migration from Hadoop systems.
  • LEAD/LAG function enhancement: Ignoring NULL values.
  • SAMPLE/TABLESAMPLE – announced general availability.
  • We introduced functions to estimate frequent values in near-linear time.
  • More flexibility in configuring the start of the week and the week-of-year policy.
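
Here are a few of these features in action (table, column and parameter values are made up for illustration):

-- Double-dot notation: analytics..orders resolves to analytics.public.orders
select count(*) from analytics..orders;

-- INSERT OVERWRITE: truncate and rebuild a table in a single statement
insert overwrite into orders_copy select * from analytics.public.orders;

-- LEAD ignoring NULL values
select ts, lead(status) ignore nulls over (order by ts) as next_status
from events;

-- Estimate the most frequent values in near-linear time
select approx_top_k(page_id, 10) from page_views;

-- Configure the start of the week and the week-of-year policy
alter session set week_start = 1;
alter session set week_of_year_policy = 1;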

In addition, we have been working on substantial performance improvements that address ease-of-use and automation. Please stay tuned for major announcements in early 2018.

Staying ahead with enterprise-ready security and compliance
From day one, security has always been core to Snowflake’s design.

  • Programmatic SSO support – Now, a Snowflake user can leverage browser-assisted single sign-on programmatically through our Snowflake ODBC, JDBC, and Python drivers and our command line tool, SnowSQL. In contrast to some of our competitors, you don’t need to write code to connect to your SAML 2.0-compliant IdP. (Programmatic SSO support is available in our Enterprise Edition.)
  • PrivateLink support for Snowflake – We worked with AWS to integrate PrivateLink with Snowflake. With PrivateLink, Snowflake users can now connect to Snowflake by bypassing the public internet. No cumbersome proxies need to be set up between Snowflake and their network. Users have full control over egress traffic.

Improving our ecosystem and data loading
Enabling developers and builders to onboard new workloads and create applications with their favorite tools, drivers and languages remains a top priority.

Snowpipe
We recently announced Snowpipe, which represents an important milestone in many ways. First, users can now efficiently address continuous loading use cases for streaming data. Secondly, with Snowpipe, we introduced the first service in addition to the core data warehousing service. Snowpipe is serverless so there’s no need to manage a virtual warehouse for data loading into Snowflake. Finally, we continue to deeply integrate with the entire AWS stack: Snowpipe allows customers to take advantage of AWS S3 event notifications to automatically trigger Snowflake data loads into target tables. You can read more about Snowpipe here: Overview and First Steps.

Broader Ecosystem
While we continue adding functionality and enriching Snowflake with additional services, we strongly believe in freedom of choice. Instead of preaching a unified platform and a one-solution-fits-all philosophy, our key objectives are to make integrations easier and more performant.

  • For enterprise-class ETL, data integration and replication:
    • Added PowerCenter 10.2 support for Snowflake.
    • Added support for the ETL use case with AWS Glue – build modern, performant ETL jobs via our Spark connector, pushing compute into Snowflake.
  • For data analytics and data science:
    • Added a native Snowflake connector for Alteryx supporting in-DB components.
    • Upgraded our Snowflake dplyr library.
  • For business intelligence (BI):
    • Added a native Snowflake connector for Chartio (in preview).
  • For parallel data loading & unloading via the COPY command (sketched after this list), developers can now:
    • Load non-UTF-8 character-encoded data.
    • Unload to the Parquet file format.
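
A sketch of these new COPY options (the stage, table and file names are illustrative):

-- Load a file that is not UTF-8 encoded by declaring its character encoding
copy into customers
  from @mystage/customers_latin1.csv
  file_format = (type = csv encoding = 'ISO-8859-1');

-- Unload a table to Parquet files
copy into @mystage/export/
  from customers
  file_format = (type = parquet);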

Increasing transparency and usability
These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the service itself.

  • Resource monitors can now be created through the Snowflake UI. As a user creates resource monitors, they can also set up notifications. Notifications for “suspend” and “suspend immediately” actions are sent out when quota thresholds are reached for a particular resource monitor. In addition, we added support for up to 5 warnings that can be defined for each resource monitor.
  • New MONITOR USAGE privilege to grant access to review billing and usage information via SQL or through the Snowflake UI. This provides non-account administrators with the ability to review information about data stored in databases and stages, as well as gain insight into warehouse usage.
  • For improved usability and convenience, we added a new option (COPY GRANTS) which allows users to preserve or copy grants as part of executing the CREATE OR REPLACE TABLE, CREATE OR REPLACE VIEW, CREATE TABLE LIKE, and CREATE TABLE CLONE variations (see the sketch after this list).
  • We also completely overhauled our worksheet and improved the overall experience. We are going to roll these changes out in stages in the coming weeks, with much more to come in 2018.
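
A short sketch of these capabilities in SQL (names and thresholds are illustrative; creating resource monitors requires an account administrator):

-- Resource monitor with a warning plus suspend actions, assigned to a warehouse
create resource monitor marketing_rm with credit_quota = 500
  triggers on 75 percent do notify
           on 100 percent do suspend
           on 110 percent do suspend_immediate;
alter warehouse marketing_wh set resource_monitor = marketing_rm;

-- Let a non-account-administrator role review billing and usage information
grant monitor usage on account to role bi_admin;

-- Preserve existing grants when rebuilding a table
create or replace table orders copy grants as select * from orders_staging;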

Scaling and investing in service robustness
These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • Our global expansion continues with a new Snowflake region in Sydney, Australia.
  • We introduced a new product edition: Virtual Private Snowflake (VPS). With VPS, customers get a dedicated and managed instance of Snowflake within a separate dedicated AWS VPC, with all the performance, simplicity and concurrency inherent to Snowflake. This also includes completely dedicated metadata services isolated from the metadata activity of other Snowflake customers.
  • We continued our investments in system stability, improving and strengthening various components of our cloud services layer.

Conclusion

We are well underway to reach 1000 customers by the end of 2017, with tens of compressed PBs stored in Snowflake and millions of data processing jobs successfully executed daily. We grew to over 300 Snowflakes and expanded our global presence. 2017 has been truly a defining year in Snowflake’s young history: Self-service, instant Data Sharing, broadening our ecosystem, instant elasticity, Snowpipe, VPS and PrivateLink for Snowflake.  We look forward to an exciting 2018 as we continue to help our customers solve their data challenges at scale.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Snowpipe: Serverless Loading for Streaming Data

Today’s organizations struggle to load, integrate and analyze the sources of streaming data they rely on to advance their businesses. The Internet of Things, mobile devices, adtech, telemetry systems and application health monitoring are just some of the sources driving this trend. To remain competitive, organizations require this data to drive their analytics in near real time.

Despite the rapid rise in streaming data, and the infinite scalability of the cloud, traditional data warehousing solutions can’t deliver seamless ingestion for streaming data. They still require processes originally designed for batch loading, happening once or a few times per day. This unnecessary latency is a drag on an organization’s analytics. Workarounds using micro-batching provide some relief but are difficult to implement and require careful tuning. Equally, serverless computing is still foreign to most traditional data warehouse solutions. In fact, most cloud data warehouses, which are “cloud-washed” versions of on-premises solutions, do not offer a serverless experience.  

Snowpipe tackles both continuous loading for streaming data and serverless computing for data loading into Snowflake. With Snowpipe, AWS S3 event notifications automatically trigger Snowflake to load data into target tables. Snowflake SQL queries can retrieve the most recent data within a minute of it arriving in the S3 bucket.

The “pipe” is a key concept in the surface area that Snowpipe adds to Snowflake. A pipe definition wraps the familiar COPY statement for data loading with Snowflake. Most of the semantics from the existing COPY statement carry forward to a pipe in Snowpipe. The main difference, though, is that pipes are continuously watching for new data and are continuously loading data from the stage used by the pipe.

Snowpipe’s surface area provides two different levels of control for pipes. With most use cases, Snowpipe can rely on Amazon SQS notifications from an S3 bucket to trigger data loads. It requires a one-time configuration for an S3 bucket and a set of target tables. This usually takes less than 15 minutes to set up. It’s completely configuration-based, with no need to write any application code other than some Snowflake DDL statements. This experience is available for preview in December 2017. The following diagram illustrates this approach:

 

 

For use cases that require more control or deeper integration into applications, Snowpipe also provides a programmatic REST API to notify Snowflake of new data in S3. The REST API is available today in preview for Snowflake customers in the US West region. The following diagram shows an architectural overview of this approach:

 

For both Snowpipe experiences, Snowflake runs and manages a fleet of servers that asynchronously perform the actual data loading into the target tables. This server fleet is completely internal to Snowflake, which automatically adds or removes servers from the fleet depending on the current Snowpipe load. Customers don’t need to worry about monitoring or managing compute capacity when using Snowpipe for their data loads.

Snowpipe utilization, billing and cost

Snowpipe uses a serverless billing model. Customers are charged based on their actual compute resource utilization rather than on capacity reservations that may be idle or overutilized. Snowpipe tracks the resource consumption of pipes in a given Snowflake account for the load requests that each pipe processed, with per-second/per-core granularity. The utilization recorded is then translated into familiar Snowflake credits. Snowpipe utilization shows up in the form of Snowflake credits on the bill, and account administrators can track Snowpipe utilization on their Snowflake account pages during the course of the month. Snowpipe utilization is shown as a special Snowflake warehouse – indicated by the Snowflake logo preceding the warehouse name – in the Warehouse tab in Billing & Usage on the Snowflake web portal.

 

A table function called pipe_utilization_history in Snowflake SQL allows you to drill into Snowpipe utilization details over specific periods of time or for specific pipes.
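
As a sketch of what such a drill-down could look like (the exact function name and signature may differ from the preview; in current Snowflake releases this information is exposed through the INFORMATION_SCHEMA.PIPE_USAGE_HISTORY table function):

select *
from table(information_schema.pipe_usage_history(
    date_range_start => dateadd('hour', -24, current_timestamp()),
    pipe_name => 'snowpipe.public.snowpipe'));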

Give Snowpipe a spin today and let us know your feedback. Snowpipe using REST-based notifications is available today. You can find the documentation and information on how to get started here.

Snowpipe with auto-ingest using SQS is available in December. If you are interested in participating in a private preview for this capability, please let us know here. Make sure to also read part two of this blog about Snowpipe here.

You can also try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Your First Steps with Snowpipe

Please note that Snowpipe is in public preview in the US West Region of AWS. Auto Ingest will be available by February 2018. 

This blog post walks you through your first steps with deploying Snowpipe. We start with the steps for configuring Snowpipe so that it continuously loads data from Amazon S3 into a target table in Snowflake. Next, we explore the easy-to-configure mode of Snowpipe, where S3 event notifications sent to a dedicated queue in Amazon Simple Queue Service (SQS) inform Snowpipe about new data that is ready for loading. Finally, this post explains how to exert tighter programmatic control over when and what notifications are sent using the Snowpipe REST API.

If you haven’t read our blog about what Snowpipe is, click here to understand how Snowpipe works and the business benefits it provides. Otherwise, keep reading this blog and learn how to configure Snowpipe.

Snowpipe one-time configuration in SQL

The following SQL statements show the one-time configuration experience for setting up Snowpipe. They include familiar DDL, such as creating an external stage and a new table, as well as how to create a pipe, a new database object in Snowflake.

In the example below, we use a VARIANT column in Snowflake to store incoming data. Semi-structured data types in Snowflake are particularly well-suited for streaming or continuous data loading scenarios. They allow for seamless schema evolution in your application, over time, without making any changes to the Snowflake table schema. Snowflake automatically applies an efficient columnar representation to the data while it’s being loaded.

This makes it easy to leave the incoming continuous data in its original format. Note that a conversion into a different format, such as Parquet, is not needed for efficient loading or querying in Snowflake:

use role snowpipe_role;
use database snowpipe;

create or replace stage snowpipe.public.snowstage
    url='s3://snowpipe-demo/'
    credentials = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...' );
show stages;

-- Create target table for JSON data
create or replace table snowpipe.public.snowtable(jsontext variant);
show tables;

-- Create a pipe to ingest JSON data
create or replace pipe snowpipe.public.snowpipe auto_ingest=true as
    copy into snowpipe.public.snowtable
    from @snowpipe.public.snowstage
    file_format = (type = 'JSON');
show pipes;

The key database concept that Snowpipe introduces to Snowflake’s SQL language is called a “pipe”. Think of a pipe as a wrapper around Snowflake’s familiar COPY statement for data loading. A Snowpipe pipe continuously looks for new data, then loads it using the COPY statement in the pipe definition.

In the current example, the pipe is defined with AUTO_INGEST=true, which tells Snowflake to use an SQS queue in AWS to receive event notifications from an S3 bucket pertaining to new data that is ready to load. The name of the SQS queue is shown in a new column within the results of the SHOW STAGES command. This queue name is used to configure S3 event notifications.

Snowpipe one-time configuration in AWS S3

With Snowpipe auto-ingest, available in February 2018, the remaining step requires configuring event notifications for the S3 bucket so the pipe recognizes when there is new data available for loading. Auto-ingest relies on SQS queues to deliver the notifications from S3 to Snowpipe. All Snowpipe SQS queues are created and managed by Snowflake so you don’t have to worry about managing yet another AWS service for your application.

The name of the SQS queue is available from the SHOW STAGES command in a new column called NOTIFICATION_CHANNEL. The following screenshot shows the event notification configuration for the S3 bucket. Note the use of the name (or ARN, to be precise) of the SQS queue at the bottom of the dialog box, along with the check mark for the ObjectCreate notifications.

 

After saving this notification configuration, place new files in the bucket. Snowflake will receive automatic notifications about the files from S3. Then, Snowpipe will automatically begin loading them into whatever table you have defined as the target in your pipe configuration.

Continuously delivering data

Snowpipe requires placing data into S3 before loading. This design choice takes advantage of the rich ecosystem of tools designed for storing data in S3. Snowpipe’s continued use of S3 as the stage for data loading allows you to use the tool of your choice for delivering data to an S3 bucket. This includes services such as AWS Kinesis with Firehose, and Kafka with its S3 connector. Both are popular choices for transporting continuous and streaming data.

The architectural take-away from this is important: Snowflake can automatically and continuously load data delivered to S3 from Kinesis or Kafka.

Querying continuously loading data

A simple, but useful, example query for Snowpipe tracks the arrival of rows over time with a JSON field called 'created_at':

select
    date_trunc('MINUTE', to_timestamp(jsontext:created_at)),
    count(*)
from snowpipe.public.snowtable
group by date_trunc('MINUTE', to_timestamp(jsontext:created_at));

This illustrates how many rows arrived each minute, serving as a starting point for more expressive analytics over time windows.

It’s worth pointing out the powerful date, time and windowing functions Snowflake’s SQL language provides. This makes it easy to analyze data, as it arrives over time, and group it into time windows for examining trends.
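
For instance, building on the query above, a window function can turn the per-minute counts into a rolling 10-minute arrival count (a sketch, reusing the same hypothetical table):

select minute_bucket,
       rows_in_minute,
       sum(rows_in_minute) over (
           order by minute_bucket
           rows between 9 preceding and current row) as rows_last_10_minutes
from (
    select date_trunc('MINUTE', to_timestamp(jsontext:created_at)) as minute_bucket,
           count(*) as rows_in_minute
    from snowpipe.public.snowtable
    group by 1
);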

Full control for Snowpipe

Using S3 event notifications to tell Snowpipe about new data to load from an S3 bucket may not be appropriate in some use cases. Consider an application that first needs to perform some processing in AWS Lambda before performing the load into Snowflake. In that case, we recommend timing notifications to Snowpipe more tightly.

Snowpipe allows for defining pipes with AUTO_INGEST set to ‘false’ while using exactly the same DDL, as shown above, to create stages and pipes. Without automatic ingestion pipes, however, an application needs to invoke a REST API with the filename, or a set of file names, to tell Snowpipe that the file is ready for loading. For instance, you can call the Snowpipe REST API after finishing the preprocessing from within your AWS Lambda code.

In general, any application capable of invoking a REST API can notify Snowpipe about new files. To make developing against the REST API easier, SDKs for both Python and Java are available. The following code snippet shows an example of the proxy-like abstraction that the Python SDK provides. Note how the ingest_files call can consume a set of files in addition to just a single file:

from snowflake.ingest import SimpleIngestManager, StagedFile

# Connect to the Snowpipe REST endpoint for a specific pipe
ingest_manager = SimpleIngestManager(account='...',
                                     user='...',
                                     pipe='...',
                                     private_key=...)

# Build the list of staged files to load; ingest_files accepts one or many
staged_file_list = []
staged_file_list.append(StagedFile(filename, None))
resp = ingest_manager.ingest_files(staged_file_list)

You can find the latest SDK versions on PyPI or Maven Central by using the search terms “Snowflake ingest”.

Try Snowpipe today

Please give Snowpipe a spin today and let us know your feedback. Snowpipe using REST-based notifications is now available in US West. You can find the documentation and information on how to get started here.

Snowpipe with auto-ingest using SQS will be available for preview in December, 2017. If you are interested in participating in a private preview for this capability, please let us know here.

You can also try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

PrivateLink for Snowflake: No Internet Required

Improve Security and Simplify Connectivity with PrivateLink for Snowflake

AWS recently announced PrivateLink, the newest generation of VPC Endpoints that allows direct and secure connectivity between AWS VPCs, without traversing the public Internet. We’ve been working closely with the AWS product team to integrate PrivateLink with Snowflake and we’re excited to be among the first launch partners. By integrating with PrivateLink, we allow customers with strict security policies to connect to Snowflake without exposing their data to the Internet. In this blog post, we’ll highlight how PrivateLink enhances our existing security capabilities, and how customers can easily set up PrivateLink with Snowflake.

Snowflake is an enterprise-grade, cloud data warehouse with a unique, multi-cluster, shared data architecture purpose-built for the cloud. From day one, security has been a central pillar of Snowflake’s architecture, with advanced security features baked into the solution. Customers get varying levels of security from Snowflake’s five different product editions: Standard, Premier, Enterprise, Enterprise for Sensitive Data (ESD) and Virtual Private Snowflake (VPS).

Across all editions, Snowflake provides a secure environment for customer data, protecting it in-transit and at rest. All customer data is encrypted by default using the latest security standards and best practices, and validated by compliance with industry-standard security protocols. In addition, customers have access to a host of security features and data protection enhancements such as IP whitelisting, role-based access control, and multi-factor authentication.

As shown in figure 1 below, Snowflake’s multi-tenant service runs inside a Virtual Private Cloud (VPC), isolating and limiting access to its internal components. Incoming traffic from customer VPCs is routed through an Elastic Load Balancer (ELB) to the Snowflake VPC.

For customers working with highly sensitive data or with specific compliance requirements, such as HIPAA and PCI, Snowflake offers Enterprise for Sensitive Data (ESD). With ESD edition, customer data is encrypted in transit across all networks including within Snowflake’s own VPC. ESD customers also benefit from additional security features such as Tri-Secret Secure, giving them full control over access to their data. See figure 2 below.

Earlier this year, we also introduced a private, single-tenant version of the Snowflake service – Virtual Private Snowflake. VPS, which is the most advanced and secure edition of Snowflake, includes all features of ESD and addresses the specific needs of regulated companies, such as those in the financial industry. With VPS, customers get a dedicated and managed instance of Snowflake within a separate, dedicated VPC. Additionally, VPS customers can use secure proxies for egress traffic control to minimize risks associated with their internal users and systems communicating with unauthorized external hosts, as shown in figure 3 below:

But we recognize that a key area of concern for some customers has been around how data is sent from their private subnet to Snowflake. These customers need to enforce restrictive firewall rules on egress traffic. Others have restrictive policies about their resources accessing the Internet at all. So, how do you send data without allowing unrestricted outbound access to the public Internet and without violating existing security compliance requirements?

Enter AWS PrivateLink: a purpose-built technology that enables direct, secure connectivity among VPCs while keeping network traffic within the AWS network. Using PrivateLink, customers can connect to Snowflake without going over the public Internet, and without requiring proxies to be set up between Snowflake and their network as a stand-in solution for egress traffic control. Instead, all communication between the customer VPC and Snowflake is performed within the AWS private network backbone.

Snowflake leverages PrivateLink by running its service behind a Network Load Balancer (NLB) and shares the endpoint with customers’ VPCs. The Snowflake endpoint appears in the customer VPC, enabling direct connectivity to Snowflake via private IP addresses. Customers can then accept the endpoint and choose which of their VPCs and subnets should have access to Snowflake. This effectively allows Snowflake to function like a service that is hosted directly on the customer’s private network. Figures 4 and 5 show PrivateLink connectivity from customer VPCs to Snowflake in both multi-tenant (ESD) and single-tenant (VPS) scenarios.

Additionally, customers can access PrivateLink endpoints from their on-premises network via AWS Direct Connect, allowing them to connect all their virtual and physical environments in a single, private network. As such, Direct Connect can be used in conjunction with PrivateLink to connect a customer’s datacenter to Snowflake. See figure 6 below.

Snowflake already delivers the world’s most secure data warehouse built for the cloud. Our ESD and VPS product editions are designed to address the highest security needs and compliance requirements of organizations large and small. With PrivateLink, we’re taking that a step further by allowing our customers to establish direct and private connectivity to Snowflake, without ever exposing their data to the public Internet.

PrivateLink is available to all Snowflake customers with ESD and VPS product editions. You can visit our user guide for instructions on how to get started with PrivateLink.

You can also try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Snowflake’s Seattle Office Welcomes New Engineering Talent

In early October, Snowflake announced the expansion of its Seattle office and welcomed two new engineers, Polita and Murali, to the rapidly growing team. We sat down with Polita and Murali to get some insights into their individual career paths and find out what ultimately drew them to Snowflake.

Snowflake Senior Engineer, Polita, wanted to be part of Snowflake’s customer-first culture.

In February 2017, Polita was happily working at a major software company in the Seattle area. It had been her home for more than 15 years, and she wasn’t looking to go anywhere else.

But then she came upon an exciting opportunity: come work at a startup that is fundamentally changing the way organizations work with data.

“I knew that Snowflake is a company with a great future,” Polita said. “The leadership and employees are so passionate about this important product they’re building. This was such a compelling opportunity to join a company that cares so much about the customer that I couldn’t turn it down.”

Today, Polita is a senior engineer working on security and identity in Snowflake’s new Seattle office. She says the way the company responds to customer needs is unique in the industry.

“I’m working on a feature specifically because a customer asked for it, and I know why they need it,” she says. “It’s not always the case that you can be this connected to the customer and their ‘why,’ but at Snowflake, I am. And that shows in the product that we ultimately build.”

 

Snowflake Senior Engineer, Murali, returns to databases to find a whole new world.

After beginning his career in databases, Murali explored other areas in tech. He worked at several high-profile companies before deciding that he wanted to get back to his database roots. After all, it’s an exciting time to be working with data.

“Organizations are producing so much more data than ever before, and that data is key to their success. But how can they process all of it?” Murali said. “These realities have changed the way we think about databases, and I was really impressed with the way Snowflake is applying this thinking to the cloud.”

Murali had worked with a few Snowflake employees in previous jobs, and had a lot of respect for their smarts. After he dove into the product, he was equally impressed. He joined the company as a software engineer in September 2017 and went to work on improving data ingestion in Snowflake.

“Snowflake, because of the way it’s architected in the cloud from the ground up, is ideal for helping customers get insight into their data,” he said. “The product has been around for about three years, and it is phenomenal how extensible and reliable it is. It really does make data easier to organize and manage for all kinds of customers.”

As Snowflake continues to grow in Seattle, we are looking for brilliant and dedicated engineers and product managers to join our team. Check the Snowflake Careers page for current openings.

TPC-DS at 100TB and 10TB Scale Now Available in Snowflake’s Samples

We are happy to announce that a full 100 TB version of TPC-DS data, along with samples of all the benchmark’s 99 queries, is available now to all Snowflake customers for exploration and testing. We also provide a 10TB version if you are interested in smaller scale testing.

The STORE_SALES sub-schema from the TPC-DS Benchmark
Source: TPC Benchmark™ DS Specification

You can find the tables in:

  • Database: SNOWFLAKE_SAMPLE_DATA
  • Schema: TPCDS_SF100TCL (100TB version) or TPCDS_SF10TCL (10TB version).

(Note that the raw data compresses in Snowflake to less than 1/3 of its original size.)
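
For example, you can point a session at the 10TB sample schema and start querying it right away:

use database snowflake_sample_data;
use schema tpcds_sf10tcl;
select count(*) from store_sales;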

Sample TPC-DS queries are available as a tutorial under the + menu in the Snowflake Worksheet UI:

Accessing Sample TPC-DS queries in the Snowflake Worksheet UI

What is TPC-DS?

TPC-DS data has been used extensively by Database and Big Data companies for testing performance, scalability and SQL compatibility across a range of Data Warehouse queries — from fast, interactive reports to complex analytics. It reflects a multi-dimensional data model of a retail enterprise selling through 3 channels (stores, web, and catalogs), while the data is sliced across 17 dimensions including Customer, Store, Time, Item, etc. The bulk of the data is contained in the large fact tables: Store Sales, Catalog Sales, Web Sales — representing daily transactions spanning 5 years.

The 100TB version of TPC-DS is the largest public sample relational database we know of available on any platform for public testing and evaluation. For perspective, the STORE_SALES table alone contains over 280 billion rows loaded using 42 terabytes of CSV files.

Full details of the TPC-DS schema and queries, including business descriptions of each query, can be found in the TPC Benchmark™ DS Specification. To test examples of different types of queries, consider:

  • Interactive (1-3 months of data scanned) — Simple star-join queries: 19, 42, 52, 55
  • Reporting (1 year of data scanned) — Simple star-join queries: 3, 7, 53, 89
  • Analytic (Multiple years, customer patterns) — Customer extracts, star joins: 34, 34, 59
  • Complex — Fact-to-fact joins, windows, extensive subqueries: 23, 36, 64, 94
  • At 10 TB scale, the full set of 99 queries should complete in well under 2 hours on a Snowflake 2X-Large virtual warehouse.
  • At 100 TB, we recommend using the largest size virtual warehouse available. For example, on a 3X-Large warehouse, you can expect all 99 queries to complete within 7 hours.

Note that, if you plan to run identical queries multiple times or concurrently, be sure to disable result caching in Snowflake when you run tests by adding the following to your script:

alter session set use_cached_result = false;

TPC-DS Benchmark Kit and Working with Date Ranges

While we provide samples of the 99 queries containing specific parameter values, the TPC-DS Benchmark Kit includes tools for generating random permutations of parameters for each query — which is what we use in our internal testing.

In all queries, the date ranges are supplied using predicates on the DATE_DIM table — as specified by the TPC-DS benchmark — rather than using Date Key restrictions directly on the large fact tables (a strategy that some vendors have used to unrealistically simplify queries). If you want to create variations on these queries without using the benchmark kit, you can create versions that scan different ranges by changing the year, month and day restrictions in the WHERE clauses.
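
For example, a date restriction in the TPC-DS style joins the fact table to DATE_DIM and varies the year and month predicates (a simplified sketch using the schema’s column names):

select count(*)
from store_sales, date_dim
where ss_sold_date_sk = d_date_sk
  and d_year = 2001
  and d_moy between 1 and 3;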

Conclusion

TPC-DS data (and other sample data sets) are made available to you through Snowflake’s unique Data Sharing feature, which allows the contents of any database in Snowflake to be shared with other Snowflake customers without requiring copies of the data.

We hope you enjoy working with this demanding and diverse workload, and invite you to compare your Snowflake results with other platforms.

And, be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb) for all the news and happenings here at Snowflake.

New Snowflake features released in Q2’17

It has been an incredible few months at Snowflake. Along with the introduction of self-service and numerous other features added in the last quarter, we have witnessed:

  • Exponential growth of our customer base, with large numbers of applications in full production.
  • Billions of analytical jobs successfully executed this year alone, with petabytes of data stored in Snowflake today, and without a single failed deployment to date.
  • A strong interest in pushing the boundaries for data warehousing even further by allowing everyone in organizations to share, access and analyze data.

Continuing to engage closely with our customers during this rapid growth period, we rolled out key new product capabilities throughout the second quarter.

Instantly sharing data without limits – Introducing the Data Sharehouse  

Giving users the ability to easily and securely share data in the right format without the need for cumbersome data integration pipelines.

One of our highlights was the general availability of Snowflake’s Data Sharing feature. It allows multiple Snowflake customers to instantly share data with each other.

With data sharing, there is no need to use the expensive, insecure, and often complicated and error-prone procedure of transferring large numbers of single files from one location to another. Instead, customers can now simply share data via SQL statements and secure views fully integrated with our role-based access control model.

To learn more about how Snowflake customers are already leveraging this capability to build new products and features to drive their business, we encourage you to read our customer stories.

Improving out-of-the box performance & SQL programmability  

Our ongoing mission is to build the fastest database for data warehousing with the SQL you love and no concurrency limits.

  • We continued to make end-to-end query execution faster with more efficient pruning for sub-columns in VARIANT types (JSON), general JOIN performance improvements, and faster global median calculations.
  • We addressed popular customer requests for improved programmability via SQL.
  • Customers who are using our multi-cluster warehouses auto-scale feature for spiking workloads can now specify up to 10 compute clusters. This allows running hundreds of queries without any query queuing.

Staying ahead with enterprise-ready security and compliance

From day one, security has always been core to Snowflake’s design.

  • One of our exciting new capabilities this quarter is the general availability of customer-managed keys, which add an additional layer of security for customers with sensitive data. This feature is the primary component of Tri-Secret Secure, a Snowflake Enterprise Edition for Sensitive Data (ESD) feature. You can find more details in our engineering blog about customer-managed keys in Snowflake.
  • We also improved the ability for our users to monitor and filter the QUERY_HISTORY Information Schema table function by specific SQL command types.
  • After its preview in Q1, secure views reached general availability in Q2 (see the sketch after this list).
  • In terms of certification, Snowflake achieved PCI DSS compliance – a common requirement for customers in banking, financial services, retail, services, and more. Customers and prospects who have PCI requirements will need to subscribe to the Snowflake Enterprise Edition for Sensitive Data (ESD).
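
A brief sketch of two of these capabilities (object names are illustrative):

-- A secure view hides its definition and underlying details from non-owners
create or replace secure view sales.public.revenue_by_region as
    select region, sum(amount) as revenue
    from sales.public.orders
    group by region;

-- Filter recent query history by SQL command type
select query_type, query_text, start_time
from table(information_schema.query_history(result_limit => 100))
where query_type = 'SELECT';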

Improving our ecosystem and data loading  

Enabling developers and builders to create applications with their favorite tools, drivers, and languages remains a top priority.

  • For data warehouse automation, Wherescape added support for Snowflake.
  • We enhanced our parallel data loading & unloading via the COPY command with new options for developers.
  • Expanding our driver support, we announced a preview version of our open-source Go driver, available in Snowflake’s Github repo.

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the service itself.

  • Snowflake users can now set up MFA via the UI; they no longer need to reach out to our support team to enable the feature.
  • Building on the general availability of Query Profile in Q1, we added a number of additional usability enhancements that can be leveraged to better understand where time is spent during query execution.   
  • The AWS Key ID is now displayed in the DESC STAGE output and in the web interface.
  • We added support for specifying file extensions for data unloading operations, and changed the default behavior when loading files containing byte order marks: we now detect and skip the marks instead of throwing an error because the data could not be converted to the proper data types.
  • To allow Snowflake users to better control consumption of compute resources, we also enhanced resource monitors (currently in preview). Users can now explicitly assign them to virtual warehouses and specify certain actions if a credit threshold is met or exceeded. Please stay tuned for a separate blog on this important capability.  

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • Given our rapid growth since the beginning of this year, we continued working on product scale, reliability, and availability to ensure our readiness for the next phase of hyper-growth.
  • We’re already seeing our efforts to automate all operations, particularly Snowflake deployments, pay off. We were able to quickly roll out a new region (US East – Virginia in Q2) while working on additional deployments around the globe at the same time.

Acknowledgements and conclusion

As always, we want to first thank our customers for their continuous feedback. Additionally, we want to recognize that moving rapidly while also scaling would not be possible without our mighty Engineering Team, which has proven once again that it’s possible to ship high-quality features while serving our existing users’ needs at the same time.

Now, onwards to Q3 and the next set of exciting capabilities you will hear about very soon!

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.