Data is Only Transformative with Transformative Technology

At the recent AWS re:Invent show in Las Vegas, The Cube host Lisa Martin had the chance to sit down with Bob Muglia, CEO and President of Snowflake. Bob shared his thoughts on Snowflake’s latest addition to its cloud-built data warehouse, Snowpipe, while looking back at Snowflake’s origins and ahead to its future enabling the data-driven enterprise.

What is Snowpipe, and how do customers get started with it?

Muglia: Snowpipe is a way of ingesting data into Snowflake in a streaming, continuous way. You simply drop new data that’s coming into S3 and we ingest it for you automatically. Snowpipe makes it simple to bring the data into your data warehouse on a continuous basis, ensuring that you’re always up-to-date and that your analysts are getting the latest insights and the latest data.

In the five years since you launched, how has the opportunity around cloud data warehousing changed? How has Snowflake evolved to become a leader in this space?

Muglia: If you go back five years, this was a timeframe where NoSQL was all the rage. Everybody was talking about how SQL was passé and something you’re not going to see in the future. Our founders had a different view. They had been working on true relational databases for almost 20 years, and they recognized the power of SQL and relational database technology. But they also saw that customers were experiencing significant limitations with existing technology. They saw in the cloud, and in what Amazon had done, the ability to build an all new database that takes advantage of the full elasticity and power of the cloud to deliver whatever analytics the business requires. However much data you want, however many queries you want to run simultaneously, Snowflake takes what you love about a relational database and allows you to operate in a very different way. Our founders had that vision five years ago and successfully executed on it. The product has worked beyond the dreams of our customers, and that response from our customers is what we get so excited about.

How did you identify what data should even be streamed to Snowpipe?

Muglia: As an example, in entertainment we’re experiencing a data explosion. You have streaming video data, subscription data, billing data, social media data and on and on. None of this is arriving in any sort of regular format. It’s coming as semi-structured data, like JSON or XML. Up until Snowflake came onto the scene with a truly cloud-based solution for data warehousing, everyone was struggling to wrangle all these data sets. Snowpipe lets you bring in multiple data sets, merge them in real-time and get the analytics back to your business in an agile way that’s never been seen before.

How does your partnership with AWS extend Snowflake’s capabilities?

Muglia: People don’t want their data scattered all over the place. With the cloud, with what Amazon’s done and with a product like Snowflake, you can bring all of your data together. That can change the culture of a company and the way people work. All of a sudden, data is not power. Data is available to everyone, and it’s democratized so every person can work with that data and help to bring the business forward. It can really change the dynamics around the way people work.

Tell us a little bit about Snowflake’s collaboration with its customers. How are they helping to influence your future?

Muglia: As a company, we run Snowflake on Snowflake. All of our data is in Snowflake, all of our sales data, our financial data, our marketing data, our product support data and our engineering data. Every time a user runs a query, that query is logged in Snowflake and the intrinsics about it are logged. When you have a tool with the power of Snowflake, you can effectively answer any business question in just a matter of minutes. And that’s transformative to the way people work. And to me, that’s what it means to build a data-driven culture: The answers to business questions are inside what customers are doing and are encapsulated in the data.

Try Snowflake for free. Sign up and receive US $400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

New Snowflake Features Released in H2’17

Looking back at the first half of this year, the main theme at Snowflake was scaling – the team, the service and our customer base. We introduced and improved the self-service onboarding experience, grew our ecosystem by collaborating with major technology partners and SIs and invested in making our service faster, more secure and easier to use.

In the summer, we announced the general availability of Snowflake’s Instant Data Sharing. As we continue to grow, this will provide the foundation for data-driven organizations of all sizes to instantly exchange data, create new insights and discover new revenue streams.

In the second half of this year, we started a journey to evolve the service in substantial ways, and this will continue throughout 2018.

Instant elasticity, per-second billing and 4XL virtual warehouses
Imagine you have instant access to performance and scalability – and truly pay only for what you use.

Many cloud DW solutions make the claim in the headline above. However, a closer look exposes differences in the degree of scalability and concurrency supported, the overall user experience and the types of workloads a user can run at scale. One of our key investments in this area resulted in our instant elasticity feature:

  • The introduction of per-second pricing represents one crucial building block. Now, our customers are billed to the second instead of a full hour when running workloads in Snowflake.
  • We also improved our provisioning algorithms, using past usage data to better provision compute capacity for future needs.
  • We added support for our new 4XL virtual warehouses, doubling the maximum compute configuration and allowing our customers to tackle their most challenging workloads and performance SLAs.

Based on customer feedback, the decrease in wait time from minutes to seconds is a huge benefit. You can now ensure the right performance for your executive dashboards and not worry anymore about a full-hour charge. You can also scale your load instantly and have fast access to the most current data.
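
As a rough sketch with a hypothetical warehouse name, the combination of per-second billing, auto-suspend and the new 4XL size can be exercised with a few SQL statements:

create warehouse report_wh with
    warehouse_size = 'XSMALL'
    auto_suspend = 60      -- suspend after 60 seconds of inactivity
    auto_resume = true;    -- resume automatically when the next query arrives

-- Scale up to the new 4XL size for a demanding workload...
alter warehouse report_wh set warehouse_size = '4X-LARGE';

-- ...then scale back down or suspend once the work completes,
-- so per-second billing only accrues while the warehouse runs.
alter warehouse report_wh set warehouse_size = 'XSMALL';
alter warehouse report_wh suspend;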

Improving performance and SQL programmability
Our ongoing mission is to build the fastest data warehouse with the SQL you love.

Throughout the last quarter, we released a wide range of SQL capabilities that address database migration challenges and increase both ANSI SQL compatibility and overall SQL programmability (a few of these are sketched right after the list):

  • Support for JavaScript Table UDFs (in public preview).
  • Support for Double-Dot notation for specifying database objects to ease migration from on-premise database systems.
  • Support for Table Literals.
  • INSERT OVERWRITE: allows rebuilding tables to ease migration from Hadoop systems.
  • LEAD/LAG function enhancement: support for ignoring NULL values.
  • SAMPLE/TABLESAMPLE – announced general availability.
  • We introduced functions to estimate frequent values in near-linear time.
  • More flexibility in configuring the start-of-week and week-of-year policies.
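
A few of these capabilities, sketched with hypothetical database, table and column names:

-- Double-dot notation: omit the schema and default to PUBLIC
select * from analytics..orders;

-- Rebuild a table in place during a migration
insert overwrite into orders_copy select * from analytics..orders;

-- LEAD that skips NULL values
select order_id,
       lead(ship_date) ignore nulls over (order by order_date) as next_ship_date
from analytics..orders;

-- Sample roughly 10% of the rows of a table
select * from analytics..orders tablesample (10);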

In addition, we have been working on substantial performance improvements that address ease-of-use and automation. Please stay tuned for major announcements in early 2018.

Staying ahead with enterprise-ready security and compliance
From day one, security has always been core to Snowflake’s design.

  • Programmatic SSO support – Now, a Snowflake user can leverage browser-assisted single-sign-on programmatically through our Snowflake ODBC, JDBC, and Python drivers and our command line tool, SnowSQL. In contrast to some of our competitors, you don’t need to write code to connect to your SAML 2.0-compliant IDP. (Programmatic SSO support is available in our Enterprise Edition).
  • PrivateLink support for Snowflake – We worked with AWS to integrate PrivateLink with Snowflake. With PrivateLink, Snowflake users can now connect to Snowflake by bypassing the public internet. No cumbersome proxies need to be set up between Snowflake and their network. Users have full control over egress traffic.

Improving our ecosystem and data loading
Enabling developers and builders to onboard new workloads and create applications with their favorite tools, drivers and languages remains a top priority.

Snowpipe
We recently announced Snowpipe, which represents an important milestone in many ways. First, users can now efficiently address continuous loading use cases for streaming data. Secondly, with Snowpipe, we introduced the first service in addition to the core data warehousing service. Snowpipe is serverless so there’s no need to manage a virtual warehouse for data loading into Snowflake. Finally, we continue to deeply integrate with the entire AWS stack: Snowpipe allows customers to take advantage of AWS S3 event notifications to automatically trigger Snowflake data loads into target tables. You can read more about Snowpipe here: Overview and First Steps.

Broader Ecosystem
While we continue adding functionality and enriching Snowflake with additional services, we strongly believe in freedom of choice. Instead of preaching a unified platform and a one-solution-fits-all philosophy, our key objective is to make integrations easier and more performant.

  • For enterprise-class ETL, data integration and replication:
    • Added PowerCenter 10.2 support for Snowflake.
    • Added support for the ETL use case with AWS Glue – build modern, performant ETL jobs via our Spark connector, pushing compute into Snowflake.
  • For data analytics and data science:
    • Added a native Snowflake connector for Alteryx supporting in-DB components.
    • Upgraded our Snowflake dplyr library.
  • For  business intelligence (BI):  
    • Added a native Snowflake connector for ChartIO (in preview).
  • For parallel data loading & unloading via the COPY command, developers can now:
    • Load non-UTF-8 character-encoded data.
    • Unload to the Parquet file format.
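
A rough sketch of the new COPY options (stage, table and format settings below are hypothetical):

-- Load CSV files that are encoded in ISO-8859-1 rather than UTF-8
copy into mytable
from @mystage
file_format = (type = 'CSV' encoding = 'ISO-8859-1');

-- Unload query results to Parquet files on a stage
copy into @mystage/unload/
from (select * from mytable)
file_format = (type = 'PARQUET');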

Increasing transparency and usability
These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the service itself.

  • Resource monitors can now be created through the Snowflake UI. As a user creates resource monitors, they can also set up notifications. Notifications for “suspend” and “suspend immediately” actions are sent out when quota thresholds are reached for a particular resource monitor. In addition, we added support for up to 5 warnings that can be defined for each resource monitor.
  • New MONITOR USAGE privilege to grant access to review billing and usage information via SQL or through the Snowflake UI. This provides non-account administrators with the ability to review information about data stored in databases and stages, as well as gain insight into warehouse usage.
  • For improved usability and convenience, we added a new option (COPY GRANTS) which allows users to preserve or copy grants as part of executing the CREATE OR REPLACE TABLE, CREATE OR REPLACE VIEW, CREATE TABLE LIKE, and CREATE TABLE CLONE variations (a rough SQL sketch of these options follows this list).
  • We also completely overhauled our worksheet and improved the overall experience. We are going to roll these changes out in stages in the coming weeks, with much more to come in 2018.
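
A rough SQL sketch of these capabilities (monitor, warehouse, role and table names are hypothetical):

-- Resource monitor with a warning at 80% and suspension at 100% of quota
create resource monitor monthly_quota with credit_quota = 100
    triggers on 80 percent do notify
             on 100 percent do suspend;
alter warehouse report_wh set resource_monitor = monthly_quota;

-- Allow a non-account-administrator role to review billing and usage data
grant monitor usage on account to role finance_analyst;

-- Preserve existing grants while rebuilding a table
create or replace table orders copy grants as select * from orders_staging;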

Scaling and investing in service robustness
These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • Our global expansion continues with a new Snowflake region in Sydney, Australia.
  • We introduced a new product edition: Virtual Private Snowflake (VPS). With VPS, customers get a dedicated and managed instance of Snowflake within a separate dedicated AWS VPC, with all the performance, simplicity and concurrency inherent to Snowflake. This also includes completely dedicated metadata services isolated from the metadata activity of other Snowflake customers.
  • We continued our investments in system stability, improving and strengthening various components of our cloud services layer.

Conclusion

We are well on our way to reaching 1,000 customers by the end of 2017, with tens of compressed PBs stored in Snowflake and millions of data processing jobs successfully executed daily. We grew to over 300 Snowflakes (employees) and expanded our global presence. 2017 has truly been a defining year in Snowflake’s young history: self-service, instant Data Sharing, broadening our ecosystem, instant elasticity, Snowpipe, VPS and PrivateLink for Snowflake. We look forward to an exciting 2018 as we continue to help our customers solve their data challenges at scale.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Your First Steps with Snowpipe

Please note that Snowpipe is in public preview in the US West Region of AWS. Auto Ingest will be available by February 2018. 

This blog post walks you through your first steps with deploying Snowpipe. We start with the steps for configuring Snowpipe so that it continuously loads data from Amazon S3 into a target table in Snowflake. Next, we explore the easy-to-configure mode of Snowpipe, where S3 event notifications sent to a dedicated queue in Amazon Simple Queue Service (SQS) inform Snowpipe about new data that is ready for loading. Finally, this post explains how to exert tighter programmatic control over when and what notifications are sent using the Snowpipe REST API.

If you haven’t read our blog about what Snowpipe is, click here to understand how Snowpipe works and the business benefits it provides. Otherwise, keep reading this blog and learn how to configure Snowpipe.

Snowpipe one-time configuration in SQL

The following SQL statements show the one-time configuration experience for setting up Snowpipe. They include familiar DDL, such as creating an external stage and a new table, as well as how to create a pipe, a new database object in Snowflake.

In the example below, we use a VARIANT column in Snowflake to store incoming data. Semi-structured data types in Snowflake are particularly well-suited for streaming or continuous data loading scenarios. They allow for seamless schema evolution in your application, over time, without making any changes to the Snowflake table schema. Snowflake automatically applies an efficient columnar representation to the data while it’s being loaded.

This makes it easy to leave the incoming continuous data in its original format. Note that a conversion into a different format, such as Parquet, is not needed for efficient loading or querying in Snowflake:

use role snowpipe_role;
use database snowpipe;

create or replace stage snowpipe.public.snowstage
    url='s3://snowpipe-demo/'
    credentials = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...' );
show stages;

-- Create target table for JSON data
create or replace table snowpipe.public.snowtable(jsontext variant);
show tables;

-- Create a pipe to ingest JSON data
create or replace pipe snowpipe.public.snowpipe auto_ingest=true as
    copy into snowpipe.public.snowtable
    from @snowpipe.public.snowstage
    file_format = (type = 'JSON');
show pipes;

The key database concept that Snowpipe introduces to Snowflake’s SQL language is called a “pipe”. Think of a pipe as a wrapper around Snowflake’s familiar COPY statement for data loading. A Snowpipe pipe continuously looks for new data, then loads it using the COPY statement in the pipe definition.

In the current example, the pipe is defined with AUTO_INGEST=true, which tells Snowflake to use an SQS queue in AWS to receive event notifications from an S3 bucket pertaining to new data that is ready to load. The name of the SQS queue is shown in a new column within the results of the SHOW STAGES command. This queue name is used to configure S3 event notifications.
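
As a quick sanity check, the SYSTEM$PIPE_STATUS function can be used to inspect the state of the pipe at any time; it returns a JSON document describing the pipe (the fully qualified pipe name below matches the example above):

select system$pipe_status('snowpipe.public.snowpipe');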

Snowpipe one-time configuration in AWS S3

With Snowpipe auto-ingest, available in February 2018, the remaining step requires configuring event notifications for the S3 bucket so the pipe recognizes when there is new data available for loading. Auto-ingest relies on SQS queues to deliver the notifications from S3 to Snowpipe. All Snowpipe SQS queues are created and managed by Snowflake so you don’t have to worry about managing yet another AWS service for your application.

The name of the SQS queue is available from the SHOW STAGES command in a new column called NOTIFICATION_CHANNEL. The following screenshot shows the event notification configuration for the S3 bucket. Note the use of the name (or ARN, to be precise) of the SQS queue at the bottom of the dialog box, along with the check mark for the ObjectCreated notifications.

 

After saving this notification configuration, place new files in the bucket. Snowflake will receive automatic notifications about the files from S3. Then, Snowpipe will automatically begin loading them into whatever table you have defined as the target in your pipe configuration.

Continuously delivering data

Snowpipe requires placing data into S3 before loading. This design choice takes advantage of the rich ecosystem of tools designed for storing data in S3. Snowpipe’s continued use of S3 as the stage for data loading allows you to use the tool of your choice for delivering data into an S3 bucket. This includes services such as AWS Kinesis with Firehose, and Kafka with its S3 connector. Both are popular choices for transporting continuous and streaming data.

The architectural take-away from this is important: Snowflake can automatically and continuously load data delivered to S3 from Kinesis or Kafka.

Querying continuously loading data

A simple, but useful, example query for Snowpipe tracks the arrival of rows over time with a JSON field called 'created_at':

select
    date_trunc('MINUTE', to_timestamp(jsontext:created_at)),
    count(*)
from snowpipe.public.snowtable
group by date_trunc('MINUTE', to_timestamp(jsontext:created_at));

This illustrates how many rows arrived each minute, serving as a starting point for more expressive analytics over time windows.

It’s worth pointing out the powerful date, time and windowing functions Snowflake’s SQL language provides. This makes it easy to analyze data, as it arrives over time, and group it into time windows for examining trends.

Full control for Snowpipe

Using S3 event notifications to tell Snowpipe about new data to load from an S3 bucket may not be appropriate in some use cases. Consider an application that first needs to perform some processing in AWS Lambda before performing the load into Snowflake. In that case, we recommend controlling the timing of the notifications to Snowpipe more tightly.

Snowpipe allows for defining pipes with AUTO_INGEST set to ‘false’ while using exactly the same DDL, as shown above, to create stages and pipes. Without automatic ingestion pipes, however, an application needs to invoke a REST API with the filename, or a set of file names, to tell Snowpipe that the files are ready for loading. For instance, you can call the Snowpipe REST API after finishing the preprocessing from within your AWS Lambda code.

In general, any application capable of invoking a REST API can notify Snowpipe about new files. To make developing against the REST API easier, SDKs for both Python and Java are available. The following code snippet shows an example of the proxy-like abstraction that the Python SDK provides. Note how the ingest_files call can consume a set of files in addition to just a single file:

from snowflake.ingest import SimpleIngestManager, StagedFile

# Account, user, pipe and private key are placeholders ('...').
ingest_manager = SimpleIngestManager(account='...',
                                     user='...',
                                     pipe='...',
                                     private_key=...)
# 'filename' refers to a file already staged in S3 for this pipe.
staged_file_list = []
staged_file_list.append(StagedFile(filename, None))
resp = ingest_manager.ingest_files(staged_file_list)

You can find the latest SDK versions on PyPI or Maven Central by using the search terms “Snowflake ingest”.

Try Snowpipe today

Please give Snowpipe a spin today and let us know your feedback. Snowpipe using REST-based notifications is now available in US West. You can find the documentation and information on how to get started here.

Snowpipe with auto-ingest using SQS will be available for preview in December 2017. If you are interested in participating in a private preview for this capability, please let us know here.

You can also try Snowflake for free. Sign up and receive US $400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

How to analyze JSON with SQL

One of the key reasons I joined Snowflake is its built-in support for loading and querying semi-structured data such as JSON. In most conventional data warehouse and Big Data environments, you have to first load this type of data to a Hadoop or NoSQL platform. Then, you must parse it with a framework such as MapReduce so you can load the data into tables in a relational database. Then, and only then, you’re ready to analyze the data with SQL queries or a BI/Analytics tool. But why take extra steps when there is a faster, easier way to get the job done?

With Snowflake, you can load your semi-structured data directly into a relational table, query the data with a SQL statement and then join it to other structured data – all while not fretting about future changes to the schema of that data. Snowflake actually keeps track of the self-describing schema so you don’t have to. No ETL or shredding required.
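
For instance, assuming a table with a single VARIANT column named v that holds JSON documents (table and attribute names here are hypothetical), the path notation lets you query nested attributes directly and cast them to SQL types:

-- Query nested JSON attributes directly with SQL
select v:customer.name::string      as customer_name,
       v:order.total::number(10,2)  as order_total
from raw_json_orders
where v:order.status = 'shipped';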

This means you can leverage your existing knowledge and skills in SQL to jump into the world of big data. Even with this feature, there is still a tiny bit to learn. However, it’s easy with Snowflake’s extensions to SQL. To get you started, we have produced a handy ebook that takes you step-by-step through loading some JSON into Snowflake, then querying that data with SQL. The book is called How to analyze JSON with SQL: Schema-on-read made easy.

Get your free ebook now and start your journey to analyzing big data in the cloud!

Be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) for all the news and happenings here at Snowflake.

P.S. If you don’t already have a Snowflake account, you can sign up for a self-service account here and get a jumpstart with $400 in free credits!

 

What makes Snowflake a data warehouse?

One of the most common questions I get when speaking to people about Snowflake is: “Why do you call it a data warehouse and not a database?” This is a very reasonable question given some of the characteristics of Snowflake.

At Snowflake, part of our answer is that we are a full relational database management system (RDBMS) built for the cloud. We are ACID compliant and we support standard SQL. Sounds like a database to me, too. Let’s take a closer look just to be sure.

What is a database?

A database is a collection of information organized to be easily accessed, managed and updated. While there are many types of databases available today, the most common is an RDBMS. But when most folks say “database”, they usually mean a traditional RDBMS that handles Online Transaction Processing (OLTP).

So, what are some of the defining characteristics of an OLTP database?

  • Designed for rapid storage and retrieval of small sets of current data records in support of transactions and interactions within an enterprise.
  • Data is organized in tables and columns, allowing users access via structured query language (SQL).
  • Handles quick, real-time activity such as entering a customer name, recording a sale and recording all accounting activity of that sale.
  • Works well for basic operational reporting of a limited number of records. Analytic reporting is relegated to simple, static reports often driven by IT.

What is a data warehouse?

Some of the defining characteristics of a data warehouse are:

  • A database designed to store and process large volumes of current and historical data collected from multiple sources inside and outside the enterprise for deep analysis.
  • Organizes data into tables and columns, and allows users access via SQL.
  • Optimized for loading, integrating and analyzing very large amounts of data.
  • Designed to support descriptive, diagnostic, predictive and prescriptive analytic workloads.

Snowflake definitely includes the overlapping characteristics of both a database and a data warehouse: ACID compliance, support for standard SQL, etc. But Snowflake also embodies all of the defining characteristics of a data warehouse.

One of the key differentiators of Snowflake, from other solutions, is that it’s specifically designed for data warehousing and high speed analytic processing. Rather than a generalized SQL database that has been “tuned” or even adapted to handle these types of workloads, Snowflake was built from the ground up for the cloud to optimize loading, processing and query performance for very large volumes of data. Therefore, hands down, Snowflake is a data warehouse.

So, why do we still need a specialized data warehouse engine?

As OLTP databases have been able to scale higher and innovations like in-memory databases have emerged, some organizations have questioned whether they still need a separate technology or specialized system for reporting and analytics. The answer, again, requires us to look at the basics: What benefits emerge from storing and analyzing data in a separate system?

  1. It eases the burden of reporting from transactional systems by removing the contention for limited and expensive resources.
  2. It produces more business-friendly data results by allowing the data to be restructured to a more suitable format.
  3. It provides access to a wider array of reports more quickly because all the resources in the data warehouse are dedicated to reporting and analysis.
  4. It integrates valuable data from across the enterprise for richer insight. Something that can’t (and shouldn’t) be done in an OLTP system.

For more information on how you can up your data warehousing game with a modern, built-for-the-cloud approach, check out some of our free resources such as our ebook The Data Warehouse: The Engine That Drives Analytics. We would love to help you on your journey to the cloud so keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) to keep up with all the news and happenings here at Snowflake.

New Snowflake features released in Q1’17

We recently celebrated an important milestone in reaching 500+ customers since Snowflake became generally available in June 2015. As companies of all sizes increasingly adopt Snowflake, we wanted to look back and provide an overview of the major new Snowflake features we released during Q1 of this year, and highlight the value these features provide for our customers.

Expanding global reach and simplifying on-boarding experience

Giving our customers freedom of choice, along with a simple, secure, and guided “Getting Started” experience, was a major focus of the last quarter.

  • We added a new region outside of the US; customers now have the option to analyze and store their data in Snowflake accounts deployed in EU-Frankfurt. Choosing the appropriate region is integrated into our self-service portal when new customers sign up.
  • In addition, we added our high-value product editions, Enterprise and Enterprise for Sensitive Data (ESD), to our self-service offerings across all available regions. For example, with Enterprise, customers can quickly implement auto-scale mode for multi-cluster warehouses to support varying, high concurrency workloads. And customers requiring HIPAA compliance can choose ESD.
  • Exploring other venues for enabling enterprises to get started quickly with Snowflake, we partnered with the AWS Marketplace team to include our on-demand Snowflake offerings, including the EU-Frankfurt option, in their newly-launched SaaS subscriptions.

Improving out-of-the-box performance & SQL coverage

We are committed to building the fastest cloud DW for your concurrent workloads with the SQL you love.

  • One key performance improvement introduced this quarter was the reduction of compilation times for JSON data. Internal TPC-DS tests demonstrate a reduction of 30-60% for most of the TPC-DS queries (single stream on a single, 100TB JSON table). In parallel, we worked on improving query compile time in general, providing up to a 50% improvement in performance for short queries.
  • Another new key capability is the support for bulk data inserts on a table concurrently with other DML operations (e.g. DELETE, UPDATE, MERGE). By introducing more fine-grained locking at the micro-partition level, we are able to allow concurrent DML statements on the same table.
  • To improve our data clustering feature (currently in preview), we added support for specifying expressions on table columns in clustering keys. This enables more fine-grained control over the data in the columns used for clustering (a hypothetical example follows this list).
  • Also, we reduced the startup time for virtual warehouses (up to XL in size) to a few seconds, ensuring almost instantaneous provisioning for most virtual warehouses.
  • We extended our SQL by adding support for the ANSI SQL TABLESAMPLE clause. This is useful when a user wants to limit a query operation performed on a table to only a random subset of rows from the table.
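
For example, a clustering key can now be defined on an expression rather than only on a raw column (the table and column names here are hypothetical):

-- Cluster the table on the day portion of an event timestamp
alter table events cluster by (to_date(event_ts));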

Staying Ahead with Enterprise-ready Security

From day one, security has always been core to Snowflake’s design.

  • We expanded Snowflake’s federated authentication and single sign-on capability by integrating with many of the most popular SAML 2.0-compliant identity providers. Now, in addition to Okta, Snowflake supports ADFS/AD, Azure AD, Centrify, and OneLogin, to name just a few.
  • To advance Snowflake’s built-in auditing, we introduced new Information Schema table functions (LOGIN_HISTORY and LOGIN_HISTORY_BY_USER) that users can query to retrieve the short-term history of all successful and failed login requests in the previous 7 days. If required, users can maintain a long-term history by copying the output from these functions into regular SQL tables.
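
For example, a query along these lines (the user name is hypothetical) returns the most recent login attempts for a single user:

select event_timestamp, user_name, is_success, error_message
from table(information_schema.login_history_by_user(user_name => 'JSMITH', result_limit => 100))
order by event_timestamp desc;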

Improving our ecosystem

Enabling developers and builders to create applications with their favorite tools and languages remains a high priority for us.

  • With respect to enterprise-class ETL, we successfully collaborated with Talend in building a native Snowflake connector based on Talend’s new and modern connector SDK. The connector, currently in preview, has already been deployed by a number of joint customers with great initial feedback on performance and ease-of-use.
  • To tighten the integration of our Snowflake service with platforms suited for machine learning and advanced data transformations, we released a new version of our Snowflake Connector for Spark, drastically improving performance by pushing more query operations, including JOINs and various aggregation functions, down to Snowflake. Our internal 10 TB TPC-DS performance benchmark tests demonstrate that running TPC-DS queries using this new v2 Spark connector is up to 70% faster compared to executing SQL in Spark with Parquet or CSV (see this Blog post for details).
  • We continue to improve our drivers for our developer community. Listening to feedback from our large Python developer community, we worked on a new version of Snowflake’s native Python client driver, resulting in up to 40% performance improvements when fetching result sets from Snowflake. And, after we open-sourced our JDBC driver last quarter, we have now made the entire source code available on our official GitHub repository.
  • And, last, but not least, to enhance our parallel data loading via the COPY command, ETL developers can now dynamically add file metadata information, such as the actual file name and row number, which might not be part of the initial payload.
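
A sketch of what this looks like during a load (stage, table and column names are hypothetical):

-- Capture the source file name and row number alongside each loaded row
copy into mytable (src_file, src_row, raw_data)
from (select metadata$filename, metadata$file_row_number, t.$1
      from @mystage t);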

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the running service.

  • One major addition to our service is Query Profile, now generally available and fully integrated into Snowflake’s web interface. Query Profile is a graphical tool you can use to detect performance bottlenecks and areas for improving query performance.
  • Various UI enhancements were implemented: Snowflake’s History page now supports additional filtering by the actual SQL text and query identifier. We also added UI support for creating a Parquet file format in preparation for loading Parquet data into variant-type table columns in Snowflake.
  • A new Information Schema table function (TABLE_STORAGE_METRICS) exposes information about the data storage for individual tables. In particular, a user can now better understand how tables are impacted by Continuous Data Protection, particularly Time Travel and Fail-safe retention periods, as well as which tables contain cloned data.
  • We also recently introduced smarter virtual warehouse billing through Warehouse Billing Continuation (see this Blog post for details). If a warehouse is suspended and resumed within 60 minutes of the last charge, we do not charge again for the servers in the warehouse. WBC eliminates additional credit charges, and we hope it will reduce the need for our customers to strictly monitor and control when warehouses are suspended and resized.

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • As part of rolling out the new EU (Frankfurt) region, we increased automation of our internal deployment procedures to (a) further improve engineering efficiency while (b) laying the foundation for rapidly adding new regions based on customer feedback.
  • We further streamlined and strengthened our various internal testing and pre-release activities, allowing us to ship new features to our customers on a weekly basis – all in a fully transparent fashion with no downtime or impact to users.

Conclusion and Acknowledgements

This summary list of features delivered in Q1 highlights the high velocity and broad range of features the Snowflake Engineering Team has successfully delivered in a short period of time. We are committed to putting our customers first and maintaining this steady pace of shipping enterprise-ready features each quarter. Stay tuned for another feature-rich Q2.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

RI (Referential Integrity) Constraints: 3 Reasons to Include Them in Your Data Warehouse

Over the years, I have had numerous conversations about the value of having referential integrity (RI) constraints, such as primary and foreign keys, in a relational data warehouse or data mart.

Many DBAs object that RI constraints slow the load process. This is a valid point if you are talking about enforced constraints that are checked in real time during the load. But this is not an issue if you define the constraints as disabled.

Which then leads to this common question:

Is there any reason to maintain a permanently disabled FK in the data model?  If it is not going to be enabled, then from my perspective, it doesn’t make any sense to define the FK.  Instead, the relationship can be described in the comment of the child column.

So, why would I want RI constraints in my data warehouse?

Mostly it has to do with good design and development best practices. Here is my rationale for why you should consider including RI constraints in your data warehouse design.

#1 – Design Metadata

RI constraints are valuable metadata/documentation. If somebody reverse engineers the database (say with ERWin or Oracle Data Modeler), the PKs and FKs show up in the diagram (much better than having to read a column comment to discover a relationship). This is quite valuable for new people on your project to orient themselves to the existing schema design and understand how the various tables in your data warehouse are related to each other.

RI in a diagram
A picture is worth a thousand words

#2 – BI Metadata

If you want to use any sort of reporting or BI tool against the database (it is a data warehouse,  after all), most modern business intelligence and visualization tools import the foreign key definitions with the tables and build the proper join conditions. This is much better than having someone guess what the join will be and then manually adding it to the metadata layer in the reporting tool. This also ensures that different developers don’t interpret the joins differently.

Examples of tools that can read the Snowflake data dictionary include Looker, Tableau, COGNOS, MicroStrategy, and many others. Some of these tools actually use the FK definitions for join culling to provide better query performance.

#3 – QA your ETL/ELT code

I know you think your ETL code is perfect.

But does every developer test to the same standards? Do you maybe have a QA team who separately validates that the ETL is doing what you expect?

If so, having declared primary, unique, and foreign key constraints in your data warehouse gives the team more information they can use to ensure the quality of the data. In fact, using the Snowflake Information Schema, a QA engineer can potentially generate SQL to test that the loaded data conforms to the defined constraints.
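
For example, with a primary key declared on a table, a QA check for duplicate keys can be as simple as this (table and column names are hypothetical):

-- Flag any rows that violate the declared (but unenforced) primary key
select customer_id, load_dts, count(*) as dup_count
from dim_customer
group by customer_id, load_dts
having count(*) > 1;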

Defining RI Constraints in Snowflake

You, of course, can (and IMHO should) define RI constraints in Snowflake. You can define primary keys, unique keys, foreign keys, and NOT NULL constraints. Because Snowflake is specifically engineered for data warehousing, only the NOT NULL constraints are enforced. The rest are always created as disabled.

The syntax is standard SQL. You can define the constraints both inline and out-of-line.

Here is a simple example of inline constraints:

Create or replace TABLE SAT_REGIONS (
HUB_REGION_KEY NUMBER(38,0) NOT NULL,
SAT_LOAD_DTS DATE NOT NULL,
REGION_COMMENT VARCHAR(152),
HASH_DIFF VARCHAR(32) NOT NULL,
SAT_REC_SRC VARCHAR(50) NOT NULL,
constraint SAT_REGIONS_PK primary key (HUB_REGION_KEY, SAT_LOAD_DTS),
constraint SAT_REGIONS_FK1 foreign key (HUB_REGION_KEY)
references KENT_DB.DATAVAULT_MAIN.HUB_REGION(HUB_REGION_KEY)
);

Example of an out-of-line constraint:

ALTER TABLE SAT_REGIONS ADD constraint 
SAT_REGIONS_PK primary key (HUB_REGION_KEY, SAT_LOAD_DTS);

It’s that easy. So why don’t you have constraints in your data warehouse?

 

Did you hear? Snowflake was declared the #1 Cloud Data Warehouse in a recent GigaOM analyst report. Go here to get your copy of the report.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for updates on all the action and activities here at Snowflake Computing.

SnowSQL Part I: Introducing SnowSQL – a modern command line tool built for the cloud

What & Why

This month, we introduced a new, modern command line tool that provides an interactive query experience with the Snowflake Elastic Data Warehouse cloud service. Yes indeed – a new command line tool!

Now you might ask – why did Snowflake decide to invest in building a new SQL command line tool in the 21st century when there are so many different ways of accessing and developing against the Snowflake service? For example, you could use our current user interface, or various drivers, or other SQL editor tools available today.

The answer is simple:

  • First of all, developers still care very deeply about light-weight mechanisms to quickly ask SQL questions via a text-based terminal with simple text input and output.
  • Secondly, we recognized the need for a more modern command line tool designed for the cloud that is easy to use, built on high security standards, and is tightly integrated with the actual core Snowflake elastic cloud data service.
  • Finally, we wanted to build a command line tool which offers users more powerful scripting capabilities overall.   

The result of these efforts is SnowSQL – our new SQL command line tool that is entirely built in Python and leverages Snowflake’s Python connector underneath.

Looking at existing, more generic command line tools, such as SQLLine or HenPlus, it became obvious that these tools lack ease-of-use and do not offer sufficient scripting features. Also, Snowflake’s approach was to consider the command line tool as part of the overall cloud service. That is, SnowSQL can be seen as an extension of the Snowflake Elastic DW service. This cloud-first philosophy has important implications for the entire life cycle of Snowflake’s command line tool, including the agility of delivering new SnowSQL features as part of the frequent Snowflake service updates (see the auto-upgrade section below).

Getting Started

To get started, you can download SnowSQL from the Snowflake UI after logging into your Snowflake account. SnowSQL is currently supported on all three major platforms including:

  • Linux 64-bit,
  • Mac OS X 10.6+, and
  • Windows 64-bit

We provide a native installer for each platform with easy-to-follow installation steps. Once you have installed SnowSQL, open a new terminal and type:

$ SNOWSQL_PWD=<password> snowsql -a <account_name> -u <user_name>

Upon first use, SnowSQL users will notice a progress bar which indicates the download of the initial version of SnowSQL. This is a one-time operation. Any future downloads will happen in the background and remain entirely transparent to the user. The ability to automatically and fully transparently upgrade the command line tool is one of the key benefits of building a command line tool as a cloud extension rather than as a stand-alone software component interacting with a remote service (see the auto-upgrade section below).

Using SnowSQL

There are a few capabilities we would like to highlight in this blog.  

SnowSQL Commands

First, SnowSQL offers a wide range of commands a user can take advantage of. As a general rule, all SnowSQL commands start with a bang character ‘!’. For a complete list of currently supported commands, please see our documentation here.

Auto-Complete and Syntax Highlighting

Secondly, with our context-sensitive auto-complete feature, SnowSQL users are relieved of the cumbersome and error-prone typing of long object names. Instead, they can complete SQL keywords and functions after typing the first three letters. By leveraging auto-complete, SnowSQL users can become more productive and quickly explore data in Snowflake. Furthermore, SQL statements are highlighted in different colors, resulting in better readability when interacting with a terminal.

 

Auto-Upgrade

Thinking as a service, the auto-upgrade framework enables users to always stay up-to-date with both Snowflake’s and SnowSQL’s latest features, streamlining the end user experience. No additional downloads or tedious re-installations are needed. The upgrade is transparent to the end user and takes place as a background process when you start SnowSQL. The next time a user runs SnowSQL, the new version is picked up automatically while their workflows remain unaffected and uninterrupted during the upgrade.

Secure Connection and Encryption

Finally, security is core to SnowSQL’s design. SnowSQL secures connections to Snowflake using TLS (Transport Layer Security) with OCSP (Online Certificate Status Protocol) checks. The auto-upgrade binaries are always validated using an RSA signature. In addition to the secured connection, SnowSQL provides end-to-end security of data moving in and out of Snowflake by using AES (Advanced Encryption Standard) for Snowflake’s PUT and GET commands.

In the next few weeks, we will provide a deeper dive into some of the unique capabilities of SnowSQL. Please stay tuned as we gather and incorporate feedback from our customers. We would like to acknowledge our main software engineers Shige Takeda (@smtakeda) and Baptiste Vauthey (@thabaptiser) for their contributions.

Finally, all of this would not have been possible without the active Python community who inspired us in many ways and offered us tools and packages to build SnowSQL. Thank you.   

As always, keep an eye on this blog site, and our Snowflake Twitter feed (@SnowflakeDB) for updates on all the action and activities here at Snowflake Computing.

Elasticity & Separation of Compute and Storage

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#4 Elasticity & Separation of Storage and Compute

The innovative, patent-pending, Multi-Cluster, Shared Data Architecture in Snowflake takes full advantage of the elasticity of the cloud. With the advantage of building a data warehouse from the ground up, Snowflake has built an environment that really allows the customer to scale their data warehouse as they want. This unique environment gives the users the ability to add more compute power on demand, even in the middle of a process.

How it works

  1. Data is stored once, on elastic disk storage in the cloud (for example, it is currently stored in AWS S3).
  2. Compute clusters (for example, using AWS EC2 nodes) are allocated dynamically as required to do work (queries or loads).
  3. The compute clusters are connected to the data storage. Access to the data is based on the approved security credentials of the user running the query.

The important point is that storage and warehouse are not tightly coupled together. This means you can grow and reduce both independently of each other.

More Power at the Push of a Button

Say you are running a series of complex SQL queries on a Small (2 node) warehouse in Snowflake. Normally you have until the close of business to run the reports and prepare the data. But today, at 8 AM, your management requests the results of that analysis for a board meeting at noon. So now you have half the time to get the reports to run. Yikes!

In a traditional data warehouse you would pretty much have only one option: kick everyone off the system and hope things run faster.

If, however, you are running the reports on the Snowflake Elastic Data Warehouse, all you have to do is increase the size of the warehouse to a Medium, which doubles the horsepower of the data warehouse (i.e. it doubles the number of nodes). And you can perform this change with a few clicks in our easy-to-use web-based interface, even if the process has already started.

Warehouse Resize 1

Warehouse Resize 2

Once you bump up the size of the warehouse, the next statement that starts will immediately use the additional resources. Notice that the first statement running with the Medium warehouse runs in about half the time of the prior statement, with about double the data!

Warehouse Resize 3

You don’t need to interrupt the process, start over, or restart the service. It just takes effect automatically, at the push of a button. And once the process is done, you can reduce the warehouse back to a Small or even just turn it off.
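For those who prefer SQL to the web interface, the same resize can be expressed in a couple of statements (the warehouse name here is hypothetical):

-- Double the horsepower mid-process...
alter warehouse reporting_wh set warehouse_size = 'MEDIUM';

-- ...then scale back down, or suspend entirely, once the reports are done.
alter warehouse reporting_wh set warehouse_size = 'SMALL';
alter warehouse reporting_wh suspend;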

And in this way, the unique cloud-based architecture of Snowflake allows you to truly take advantage of the elasticity of the cloud for your data warehouse.

Alternatives

In many other architectures, cloud and on-premises appliances alike, you have a certain amount of disk space associated with the compute nodes. If you need more storage, you must add more nodes. If you need more compute power, likewise, you add more nodes. Plus, other architectures may not allow you to scale down. Also, in most of these cases, when adding more nodes you must shut down the environment or make it read-only for a time while you redistribute the data manually across the new nodes.

Thus in the legacy data warehouses, elasticity is limited. And this forces you to plan for, acquire and manage the resources for your expected, future peak capacity and peak data demand. Otherwise you risk having your queries or your hardware fail (because you run out of disk capacity or CPU power, or both). In addition, trying to scale beyond existing capacity may lead to disruptions to the service (as you rebuild/extend the hardware), or some workloads being rejected in favor of others. And if you over-provision, and buy too much capacity, there may not be a way to scale back after the fact.

The Snowflake Advantage

For Snowflake, this is not an issue. You can grow and shrink the environment dynamically. The data storage grows and shrinks as you add or remove data, while the compute nodes can be ramped up or down, or turned off, as you require. You are not forced to pay for capacity up front, or kick other workloads off, or plan downtimes when ramping up your data warehouse capacity. That is the promise of cloud. This is why Snowflake is the real elastic Data Warehouse as a Service.

Very cool indeed.

As always, keep an eye on this blog site, our Snowflake Twitter feed (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

Kent Graziano and Saqib Mustafa