Data is Only Transformative with Transformative Technology

At the recent AWS re:Invent show in Las Vegas, theCUBE host Lisa Martin had the chance to sit down with Bob Muglia, CEO and President of Snowflake. Bob shared his thoughts on Snowpipe, the latest addition to Snowflake’s cloud-built data warehouse, while looking back at Snowflake’s origins and ahead to its future enabling the data-driven enterprise.

What is Snowpipe, and how do customers get started with it?

Muglia: Snowpipe is a way of ingesting data into Snowflake in a streaming, continuous way. You simply drop new data that’s coming into S3 and we ingest it for you automatically. Snowpipe makes it simple to bring the data into your data warehouse on a continuous basis, ensuring that you’re always up-to-date and that your analysts are getting the latest insights and the latest data.

In the five years since you launched, how has the opportunity around cloud data warehousing changed? How has Snowflake evolved to become a leader in this space?

Muglia: If you go back five years, this was a timeframe where NoSQL was all the rage. Everybody was talking about how SQL was passé and something you wouldn’t see in the future. Our founders had a different view. They had been working on true relational databases for almost 20 years, and they recognized the power of SQL and relational database technology. But they also saw that customers were experiencing significant limitations with existing technology. They saw in the cloud, and in what Amazon had done, the ability to build an all-new database that takes advantage of the full elasticity and power of the cloud to deliver whatever analytics the business requires – however much data you have, however many queries you want to run simultaneously. Snowflake takes what you love about a relational database and allows you to operate in a very different way. Our founders had that vision five years ago and successfully executed on it. The product has worked beyond the dreams of our customers, and that response from our customers is what we get so excited about.

How did you identify what data should even be streamed to Snowpipe?

Muglia: As an example, in entertainment we’re experiencing a data explosion. You have streaming video data, subscription data, billing data, social media data and on and on. None of this is arriving in any sort of regular format. It’s coming as semi-structured data, like JSON or XML. Up until Snowflake came onto the scene with a truly cloud-based solution for data warehousing, everyone was struggling to wrangle all these data sets. Snowpipe lets you bring in multiple data sets, merge them in real-time and get the analytics back to your business in an agile way that’s never been seen before.

How does your partnership with AWS extend Snowflake’s capabilities?

Muglia: People don’t want their data scattered all over the place. With the cloud, with what Amazon’s done and with a product like Snowflake, you can bring all of your data together. That can change the culture of a company and the way people work. All of a sudden, data is no longer hoarded as a source of power. Data is available to everyone, and it’s democratized so every person can work with that data and help move the business forward. It can really change the dynamics around the way people work.

Tell us a little bit about Snowflake’s collaboration with its customers. How are they helping to influence your future?

Muglia: As a company, we run Snowflake on Snowflake. All of our data is in Snowflake: our sales data, our financial data, our marketing data, our product support data and our engineering data. Every time a user runs a query, that query and the details about it are logged in Snowflake. When you have a tool with the power of Snowflake, you can effectively answer any business question in just a matter of minutes. And that’s transformative to the way people work. And to me, that’s what it means to build a data-driven culture: The answers to business questions are inside what customers are doing and are encapsulated in the data.

Try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

New Snowflake Features Released in H2’17

Looking back at the first half of this year, the main theme at Snowflake centered on scaling – the team, the service and our customer base. We introduced and improved the self-service onboarding experience, grew our ecosystem by collaborating with major technology partners and SIs, and invested in making our service faster, more secure and easier to use.

In the summer, we announced the general availability of Snowflake’s Instant Data Sharing. As we continue to grow, this will provide the foundation for data-driven organizations of all sizes to instantly exchange data, create new insights and discover new revenue streams.
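
As a rough illustration of the provider-side flow (all object and account names here are hypothetical), sharing a table with another Snowflake account takes only a few statements:

create share sales_share;
grant usage on database sales_db to share sales_share;
grant usage on schema sales_db.public to share sales_share;
grant select on table sales_db.public.orders to share sales_share;
alter share sales_share add accounts = partner_account;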

In the second half of this year, we started a journey to evolve the service in substantial ways, and this will continue throughout 2018.

Instant elasticity, per-second billing and 4XL virtual warehouses
Imagine you have instant access to performance and scalability – and truly pay only for what you use.

Many cloud data warehouse solutions make the claim in the headline above. A closer look, however, reveals real differences in the degree of scalability and concurrency supported, the overall user experience and the types of workloads a user can run at scale. One of our key investments in this area resulted in our instant elasticity feature:

  • The introduction of per-second pricing is one crucial building block. Our customers are now billed by the second instead of by the full hour when running workloads in Snowflake.
  • We also improved our provisioning algorithms, using past usage data to better provision compute capacity for future demand.
  • We added support for our new 4XL virtual warehouses, doubling our maximum compute configuration and allowing customers to tackle their most challenging workloads and performance SLAs.

Based on customer feedback, the decrease in wait time from minutes to seconds is a huge benefit. You can now ensure the right performance for your executive dashboards without worrying about a full-hour charge. You can also scale your load instantly and get fast access to the most current data.
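
As a rough sketch of what instant elasticity looks like in practice (the warehouse name and settings below are hypothetical), you can create a warehouse at the new largest size, let it suspend automatically when idle, and pay only for the seconds it actually runs:

create warehouse if not exists dashboard_wh
    warehouse_size = 'X4LARGE'      -- the new largest size
    auto_suspend = 60               -- suspend after 60 seconds of inactivity
    auto_resume = true              -- resume automatically on the next query
    initially_suspended = true;

-- Resize on demand; the change takes effect in seconds
alter warehouse dashboard_wh set warehouse_size = 'XSMALL';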

Improving performance and SQL programmability
Our ongoing mission is to build the fastest data warehouse with the SQL you love.

Throughout the last quarter, we released a wide range of SQL capabilities that address database migration challenges and increase both ANSI SQL compatibility and overall SQL programmability (a few illustrative snippets follow the list):

  • Support for JavaScript table UDFs (in public preview).
  • Support for double-dot notation for specifying database objects, easing migration from on-premises database systems.
  • Support for table literals.
  • INSERT OVERWRITE, which rebuilds a table in a single statement to ease migration from Hadoop systems.
  • LEAD/LAG function enhancement: support for ignoring NULL values.
  • General availability of SAMPLE/TABLESAMPLE.
  • New functions to estimate frequent values in near-linear time.
  • More flexibility in configuring the start-of-week and week-of-year policies.
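
A few illustrative snippets for some of the capabilities listed above, using hypothetical database, table and column names:

-- Double-dot notation: resolves to the PUBLIC schema of the named database
select * from sales_db..orders;

-- Table literal: refer to a table by a string literal
select * from table('sales_db.public.orders');

-- INSERT OVERWRITE: truncate and rebuild a table in a single statement
insert overwrite into daily_summary
    select order_date, sum(amount) from sales_db.public.orders group by order_date;

-- LAG with IGNORE NULLS: carry the last non-NULL reading forward
select ts, lag(reading) ignore nulls over (order by ts) as last_reading
from sensor_data;

-- SAMPLE: query an approximate 10% sample of a table
select * from sales_db.public.orders sample (10);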

In addition, we have been working on substantial performance improvements that address ease-of-use and automation. Please stay tuned for major announcements in early 2018.

Staying ahead with enterprise-ready security and compliance
From day one, security has always been core to Snowflake’s design.

  • Programmatic SSO support – A Snowflake user can now leverage browser-assisted single sign-on programmatically through our Snowflake ODBC, JDBC and Python drivers and our command line tool, SnowSQL. In contrast to some of our competitors, you don’t need to write code to connect to your SAML 2.0-compliant IdP. (Programmatic SSO support is available in our Enterprise Edition.)
  • PrivateLink support for Snowflake – We worked with AWS to integrate PrivateLink with Snowflake. With PrivateLink, Snowflake users can now connect to Snowflake without traversing the public internet, no cumbersome proxies need to be set up between Snowflake and their network, and users retain full control over egress traffic.

Improving our ecosystem and data loading
Enabling developers and builders to onboard new workloads and create applications with their favorite tools, drivers and languages remains a top priority.

Snowpipe
We recently announced Snowpipe, which represents an important milestone in many ways. First, users can now efficiently address continuous loading use cases for streaming data. Second, with Snowpipe we introduced our first service beyond the core data warehousing service. Snowpipe is serverless, so there’s no need to manage a virtual warehouse for data loading into Snowflake. Finally, we continue to integrate deeply with the entire AWS stack: Snowpipe lets customers take advantage of AWS S3 event notifications to automatically trigger Snowflake data loads into target tables. You can read more about Snowpipe here: Overview and First Steps.

Broader Ecosystem
While we continue adding functionality and enriching Snowflake with additional services, we strongly believe in freedom of choice. Instead of preaching a unified platform and a one-size-fits-all philosophy, we focus on making integrations easier and more performant.

  • For enterprise-class ETL, data integration and replication:
    • Added PowerCenter 10.2 support for Snowflake.
    • Added support for the ETL use case with AWS Glue – build modern, performant ETL jobs via our Spark connector, pushing compute into Snowflake.
  • For data analytics and data science:
    • Added a native Snowflake connector for Alteryx supporting in-DB components.
    • Upgraded our Snowflake dplyr library.
  • For business intelligence (BI):
    • Added a native Snowflake connector for Chartio (in preview).
  • For parallel data loading and unloading via the COPY command, developers can now:
    • Load non-UTF-8 character-encoded data.
    • Unload to the Parquet file format.

Increasing transparency and usability
These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the service itself.

  • Resource monitors can now be created through the Snowflake UI. When creating a resource monitor, users can also set up notifications: notifications for “suspend” and “suspend immediately” actions are sent out when quota thresholds are reached for a particular resource monitor. In addition, up to five warning thresholds can now be defined for each resource monitor.
  • The new MONITOR USAGE privilege grants access to billing and usage information via SQL or through the Snowflake UI. This lets non-account administrators review information about data stored in databases and stages, as well as warehouse usage.
  • For improved usability and convenience, we added a new option (COPY GRANTS) that lets users preserve or copy grants when executing the CREATE OR REPLACE TABLE, CREATE OR REPLACE VIEW, CREATE TABLE LIKE and CREATE TABLE CLONE variations (a brief SQL sketch follows this list).
  • We also completely overhauled our worksheet and improved the overall experience. We are going to roll these changes out in stages in the coming weeks, with much more to come in 2018.
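
As a rough SQL sketch of these options (object names and thresholds below are hypothetical; resource monitors can just as easily be created through the UI as described above):

-- Let a non-account-administrator role review billing and usage information
grant monitor usage on account to role finance_reader;

-- A resource monitor with warning and suspend thresholds
create resource monitor analytics_quota with credit_quota = 100
    triggers on 75 percent do notify
             on 100 percent do suspend
             on 110 percent do suspend_immediate;

-- Preserve existing grants while rebuilding a table
create or replace table orders_v2 copy grants as
    select * from orders;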

Scaling and investing in service robustness
These service enhancements aren’t customer-visible, but they are crucial for scaling to meet the demands of our rapidly growing customer base.

  • Our global expansion continues with a new Snowflake region in Sydney, Australia.
  • We introduced a new product edition: Virtual Private Snowflake (VPS). With VPS, customers get a dedicated and managed instance of Snowflake within a separate dedicated AWS VPC, with all the performance, simplicity and concurrency inherent to Snowflake. This also includes completely dedicated metadata services isolated from the metadata activity of other Snowflake customers.
  • We continued our investments in system stability, improving and strengthening various components of our cloud services layer.

Conclusion

We are well on our way to reaching 1,000 customers by the end of 2017, with tens of compressed petabytes stored in Snowflake and millions of data processing jobs successfully executed daily. We grew the team to over 300 Snowflakes and expanded our global presence. 2017 has truly been a defining year in Snowflake’s young history: self-service onboarding, instant Data Sharing, a broader ecosystem, instant elasticity, Snowpipe, VPS and PrivateLink for Snowflake. We look forward to an exciting 2018 as we continue to help our customers solve their data challenges at scale.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Snowpipe: Serverless Loading for Streaming Data

Today’s organizations struggle to load, integrate and analyze the sources of streaming data they rely on to advance their businesses. The Internet of Things, mobile devices, adtech, telemetry systems and application health monitoring are just some of the sources driving this trend. To remain competitive, organizations need this data to drive their analytics in near real time.

Despite the rapid rise in streaming data, and the infinite scalability of the cloud, traditional data warehousing solutions can’t deliver seamless ingestion for streaming data. They still require processes originally designed for batch loading, happening once or a few times per day. This unnecessary latency is a drag on an organization’s analytics. Workarounds using micro-batching provide some relief but are difficult to implement and require careful tuning. Equally, serverless computing is still foreign to most traditional data warehouse solutions. In fact, most cloud data warehouses, which are “cloud-washed” versions of on-premises solutions, do not offer a serverless experience.  

Snowpipe tackles both continuous loading for streaming data and serverless computing for data loading into Snowflake. With Snowpipe, AWS S3 event notifications automatically trigger Snowflake to load data into target tables. Snowflake SQL queries can retrieve the most recent data within a minute of its arrival in the S3 bucket.

The “pipe” is a key concept in the surface area that Snowpipe adds to Snowflake. A pipe definition wraps the familiar COPY statement for data loading with Snowflake. Most of the semantics of the existing COPY statement carry forward to a pipe in Snowpipe. The main difference is that a pipe continuously watches for new data and continuously loads it from the stage referenced by the pipe.
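
As a minimal sketch (object names are hypothetical; a complete walkthrough appears in the “First Steps” post below), a pipe definition simply wraps a COPY statement:

create pipe mydb.public.load_events as
    copy into mydb.public.events
    from @mydb.public.events_stage
    file_format = (type = 'JSON');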

Snowpipe’s surface area provides two different levels of control for pipes. With most use cases, Snowpipe can rely on Amazon SQS notifications from an S3 bucket to trigger data loads. It requires a one-time configuration for an S3 bucket and a set of target tables. This usually takes less than 15 minutes to set up. It’s completely configuration-based, with no need to write any application code other than some Snowflake DDL statements. This experience is available for preview in December 2017. The following diagram illustrates this approach:

[Diagram: auto-ingest data loading triggered by S3 event notifications delivered to Snowpipe through SQS]

For use cases that require more control or deeper integration into applications, Snowpipe also provides a programmatic REST API to notify Snowflake of new data in S3. The REST API is available today in preview for Snowflake customers in the US West region. The following diagram shows an architectural overview of this approach:

[Diagram: architectural overview of data loading triggered through the Snowpipe REST API]

For both Snowpipe experiences, Snowflake runs and manages a fleet of servers that asynchronously perform the actual data loading into the target tables. This server fleet is completely internal to Snowflake, which automatically adds or removes servers from the fleet depending on the current Snowpipe load. Customers don’t need to worry about monitoring or managing compute capacity when using Snowpipe for their data loads.

Snowpipe utilization, billing and cost

Snowpipe uses a serverless billing model. Customers are charged based on their actual compute resource utilization rather than for capacity reservations that may sit idle or be overutilized. Snowpipe tracks the resource consumption of pipes in a given Snowflake account for the load requests each pipe processes, with per-second/per-core granularity, and the recorded utilization is translated into familiar Snowflake credits. Snowpipe utilization shows up as Snowflake credits on the bill, and account administrators can track Snowpipe utilization on their Snowflake account pages during the course of the month. Snowpipe utilization appears as a special Snowflake warehouse – indicated by the Snowflake logo preceding the warehouse name – in the Warehouse tab in Billing & Usage on the Snowflake web portal.


A table function called pipe_utilization_history in Snowflake SQL allows you to drill into Snowpipe utilization details over specific periods of time or for specific pipes.
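
A query against this function might look like the following sketch; the argument names are assumptions modeled on Snowflake’s other history table functions, so consult the documentation for the exact signature:

select *
from table(pipe_utilization_history(
    date_range_start => dateadd('hour', -24, current_timestamp()),
    date_range_end   => current_timestamp(),
    pipe_name        => 'snowpipe.public.snowpipe'));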

Give Snowpipe a spin today and let us know your feedback. Snowpipe using REST-based notifications is available today. You can find the documentation and information on how to get started here.

Snowpipe with auto-ingest using SQS is available in December. If you are interested in participating in a private preview for this capability, please let us know here. Make sure to also read part two of this blog about Snowpipe here.

You can also try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Your First Steps with Snowpipe

Please note that Snowpipe is in public preview in the US West region of AWS. Auto-ingest will be available by February 2018.

This blog post walks you through your first steps with deploying Snowpipe. We start with the steps for configuring Snowpipe so that it continuously loads data from Amazon S3 into a target table in Snowflake. Next, we explore the easy-to-configure mode of Snowpipe, where S3 event notifications sent to a dedicated queue in Amazon Simple Queue Service (SQS) inform Snowpipe about new data that is ready for loading. Finally, this post explains how to exert tighter programmatic control over when and what notifications are sent using the Snowpipe REST API.

If you haven’t read our blog about what Snowpipe is, click here to understand how Snowpipe works and the business benefits it provides. Otherwise, keep reading to learn how to configure Snowpipe.

Snowpipe one-time configuration in SQL

The following SQL statements show the one-time configuration experience for setting up Snowpipe. They include familiar DDL, such as creating an external stage and a new table, as well as creating a pipe, which is a new database object in Snowflake.

In the example below, we use a VARIANT column in Snowflake to store incoming data. Semi-structured data types in Snowflake are particularly well suited for streaming or continuous data loading scenarios. They allow for seamless schema evolution in your application over time, without requiring any changes to the Snowflake table schema. Snowflake automatically applies an efficient columnar representation to the data while it’s being loaded.

This makes it easy to leave the incoming continuous data in its original format. Note that a conversion into a different format, such as Parquet, is not needed for efficient loading or querying in Snowflake:

use role snowpipe_role;
use database snowpipe;

create or replace stage snowpipe.public.snowstage
    url='s3://snowpipe-demo/'
    credentials = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...' );
show stages;

-- Create target table for JSON data
create or replace table snowpipe.public.snowtable(jsontext variant);
show tables;

-- Create a pipe to ingest JSON data
create or replace pipe snowpipe.public.snowpipe auto_ingest=true as
    copy into snowpipe.public.snowtable
    from @snowpipe.public.snowstage
    file_format = (type = 'JSON');
show pipes;

The key database concept that Snowpipe introduces to Snowflake’s SQL language is called a “pipe”. Think of a pipe as a wrapper around Snowflake’s familiar COPY statement for data loading. A Snowpipe pipe continuously looks for new data, then loads it using the COPY statement in the pipe definition.

In the current example, the pipe is defined with AUTO_INGEST=true, which tells Snowflake to use an SQS queue in AWS to receive event notifications from an S3 bucket pertaining to new data that is ready to load. The name of the SQS queue is shown in a new column within the results of the SHOW STAGES command. This queue name is used to configure S3 event notifications.

Snowpipe one-time configuration in AWS S3

With Snowpipe auto-ingest, available in February 2018, the remaining step is to configure event notifications for the S3 bucket so the pipe recognizes when there is new data available for loading. Auto-ingest relies on SQS queues to deliver the notifications from S3 to Snowpipe. All Snowpipe SQS queues are created and managed by Snowflake, so you don’t have to worry about managing yet another AWS service for your application.

The name of the SQS queue is available from the SHOW STAGES command in a new column called NOTIFICATION_CHANNEL. The following screenshot shows the event notification configuration for the S3 bucket. Note the use of the name (or ARN, to be precise) of the SQS queue at the bottom of the dialog box, along with the check mark for the ObjectCreate notifications.

[Screenshot: S3 event notification configuration showing the SQS queue ARN and the ObjectCreate event selection]

After saving this notification configuration, place new files in the bucket. Snowflake will receive automatic notifications about the files from S3. Then, Snowpipe will automatically begin loading them into whatever table you have defined as the target in your pipe configuration.

Continuously delivering data

Snowpipe requires placing data into S3 before loading. This design choice takes advantage of the rich ecosystem of tools designed for storing data in S3. Snowpipe’s continued use of S3 as the stage for data loading allows you to use the tool of your choice for delivering data to an S3 bucket. This includes services such as Amazon Kinesis Firehose and Kafka with its S3 connector, both popular choices for transporting continuous and streaming data.

The architectural take-away from this is important: Snowflake can automatically and continuously load data delivered to S3 from Kinesis or Kafka.

Querying continuously loading data

A simple but useful example query for Snowpipe tracks the arrival of rows over time using a JSON field called 'created_at':

select
    date_trunc('MINUTE', to_timestamp(jsontext:created_at)),
    count(*)
from snowpipe.public.snowtable
group by date_trunc('MINUTE', to_timestamp(jsontext:created_at));

This illustrates how many rows arrived each minute, serving as a starting point for more expressive analytics over time windows.

It’s worth pointing out the powerful date, time and windowing functions that Snowflake’s SQL language provides. These make it easy to analyze data as it arrives over time and to group it into time windows for examining trends.
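
For example, building on the query above (and reusing the same hypothetical table and JSON field), a sketch that buckets arrivals into five-minute windows could look like this:

select
    time_slice(to_timestamp(jsontext:created_at), 5, 'MINUTE') as five_minute_window,
    count(*) as rows_loaded
from snowpipe.public.snowtable
group by 1
order by 1;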

Full control for Snowpipe

Using S3 event notifications to tell Snowpipe about new data to load from an S3 bucket may not be appropriate in some use cases. Consider an application that first needs to perform some processing in AWS Lambda before loading the data into Snowflake. In that case, we recommend controlling the timing of notifications to Snowpipe more tightly.

Snowpipe allows defining pipes with AUTO_INGEST set to ‘false’ while using exactly the same DDL shown above to create stages and pipes. Without automatic ingestion, however, an application needs to invoke the REST API with a file name, or a set of file names, to tell Snowpipe that the files are ready for loading. For instance, you can call the Snowpipe REST API after finishing the preprocessing from within your AWS Lambda code.
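
For example, reusing the stage and target table from the walkthrough above, a REST-triggered pipe (with a hypothetical name) differs only in its AUTO_INGEST setting:

create or replace pipe snowpipe.public.snowpipe_manual auto_ingest=false as
    copy into snowpipe.public.snowtable
    from @snowpipe.public.snowstage
    file_format = (type = 'JSON');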

In general, any application capable of invoking a REST API can notify Snowpipe about new files. To make developing against the REST API easier, SDKs for both Python and Java are available. The following code snippet shows an example of the proxy-like abstraction that the Python SDK provides. Note how the ingest_files call can consume a set of files in addition to just a single file:

# Requires the Snowflake ingest SDK for Python (search "Snowflake ingest" on PyPI);
# account, user, pipe, private_key and filename are placeholders to fill in.
from snowflake.ingest import SimpleIngestManager, StagedFile

ingest_manager = SimpleIngestManager(account='...',
                                     user='...',
                                     pipe='...',
                                     private_key=...)
# A single ingest_files call can carry one file or a whole set of files
staged_file_list = []
staged_file_list.append(StagedFile(filename, None))
resp = ingest_manager.ingest_files(staged_file_list)

You can find the latest SDK versions on PyPI or Maven Central by using the search terms “Snowflake ingest”.

Try Snowpipe today

Please give Snowpipe a spin today and let us know your feedback. Snowpipe using REST-based notifications is now available in US West. You can find the documentation and information on how to get started here.

Snowpipe with auto-ingest using SQS will be available for preview in December 2017. If you are interested in participating in a private preview for this capability, please let us know here.

You can also try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

How Snowpipe Streamlines Your Continuous Data Loading and Your Business

For anyone who harbors a love-hate relationship with data loading, it’s time to tip the scales.

We all know data can be difficult to work with. The challenges start with the varying formats and complexity of the data itself. This is especially the case with semi-structured data such as JSON, Avro and XML, and it continues with the significant programming skills needed to extract and process data from multiple sources. Making matters worse, traditional on-premises and cloud data warehouses require batch loading of data (with limitations on the size of data files ingested) and huge manual efforts to run and manage servers.

The results? Poor, slow performance and the inability to extract immediate insights from all your data. Data scientists and analysts are forced to wait days or even weeks before they can use the data to develop accurate models, spot trends and identify opportunities. Consequently, executives don’t get the up-to-the-minute insights they need to make real-time decisions with confidence and speed.

Common problems that affect data loading include:

  • Legacy architecture – Tightly coupled storage and compute force data loading to contend with queries for resources.
  • Stale data – Batch loading prevents organizations from acquiring instant, data-driven insight.
  • Limited data – Lack of support for semi-structured data requires transforming newer data types and defining a schema before loading, which introduces delays.
  • Manageability – Dedicated clusters or warehouses are required to handle the loading of data.
  • High-maintenance – Traditional data warehouse tools result in unnecessary overhead in the form of constant indexing, tuning, sorting and vacuuming.

These obstacles all point to the need for a solution that allows continuous data loading without impacting other workloads, without requiring the management of servers and without crippling the performance of your data warehouse.

Introducing Snowpipe, our continuous, automated and cost-effective service that loads all of your data quickly and efficiently without any manual effort. How does Snowpipe work?

Snowpipe automatically listens for new data as it arrives in your cloud storage environment and continuously loads it into Snowflake. With Snowpipe’s unlimited concurrency, other workloads are never impacted, and you benefit from serverless, continuous loading without ever worrying about provisioning. That’s right. There are no servers to manage and no manual effort is required. Snowpipe makes all this happen automatically.

The direct benefits of Snowpipe’s continuous data loading include:

  • Instant insights – Immediately provide fresh data to all your business users without contention.
  • Cost-effectiveness – Pay only for the per-second compute used to load data rather than running a warehouse continuously or by the hour.
  • Ease-of-use – Point Snowpipe at an S3 bucket from within the Snowflake UI and data will automatically load asynchronously as it arrives.
  • Flexibility – Technical resources can interface directly with the programmatic REST API, using Java and Python SDKs to enable highly customized loading use cases.
  • Zero management – Snowpipe automatically provisions the correct capacity for the data being loaded. No servers or management to worry about.

Snowpipe frees up resources across your organization so you can focus on analyzing your data, not managing it. Snowpipe puts your data on pace with near real-time analytics. At Snowflake, we tip the scales on your love-hate relationship with data so you can cherish your data without reservation.

Read more about the technical aspects of Snowpipe on our engineering blog. For an in-depth look at Snowpipe in action, you can also join us for a live webinar on December 14th.

Try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.