What makes Snowflake a data warehouse?

One of the most common questions I get when speaking to people about Snowflake is: “Why do you call it a data warehouse and not a database?” This is a very reasonable question given some of the characteristics of Snowflake.

At Snowflake, part of our answer is that we are a full relational database management system (RDBMS) built for the cloud. We are ACID compliant and we support standard SQL. Sounds like a database to me, too. Let’s take a closer look just to be sure.

What is a database?

A database is a collection of information organized to be easily accessed, managed and updated. While there are many types of databases available today, the most common is an RDBMS. But when most folks say “database”, they usually mean a traditional RDBMS that handles Online Transaction Processing (OLTP).

So, what are some of the defining characteristics of an OLTP database?

  • Designed for rapid storage and retrieval of small sets of current data records in support of transactions and interactions within an enterprise.
  • Data is organized in tables and columns, allowing users access via structured query language (SQL).
  • Handles quick, real-time activity such as entering a customer name, recording a sale and recording all accounting activity of that sale.
  • Works well for basic operational reporting of a limited number of records. Analytic reporting is relegated to simple, static reports often driven by IT.

What is a data warehouse?

Some of the defining characteristics of a data warehouse are:

  • A database designed to store and process large volumes of current and historical data collected from multiple sources inside and outside the enterprise for deep analysis.
  • Organizes data into tables and columns, and allows users access via SQL.
  • Optimized for loading, integrating and analyzing very large amounts of data.
  • Designed to support descriptive, diagnostic, predictive and prescriptive analytic workloads.

Snowflake definitely includes the overlapping characteristics of both a database and a data warehouse: ACID compliance, support for standard SQL, and so on. But Snowflake also embodies all of the defining characteristics of a data warehouse.

One of the key differentiators of Snowflake from other solutions is that it was specifically designed for data warehousing and high-speed analytic processing. Rather than a generalized SQL database that has been “tuned” or adapted to handle these types of workloads, Snowflake was built from the ground up for the cloud to optimize loading, processing, and query performance for very large volumes of data. Therefore, hands down, Snowflake is a data warehouse.

So, why do we still need a specialized data warehouse engine?

As OLTP databases have been able to scale higher and innovations like in-memory databases have emerged, some organizations have questioned whether they still need a separate technology or specialized system for reporting and analytics. The answer, again, requires us to look at the basics: What benefits emerge from storing and analyzing data in a separate system?

  1. It eases the burden of reporting from transactional systems by removing the contention for limited and expensive resources.
  2. It produces more business-friendly data results by allowing the data to be restructured to a more suitable format.
  3. It provides access to a wider array of reports more quickly because all the resources in the data warehouse are dedicated to reporting and analysis.
  4. It integrates valuable data from across the enterprise for richer insight, something that can’t (and shouldn’t) be done in an OLTP system.

For more information on how you can up your data warehousing game with a modern, built-for-the-cloud approach, check out some of our free resources such as our ebook The Data Warehouse: The Engine That Drives Analytics. We would love to help you on your journey to the cloud so keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) to keep up with all the news and happenings here at Snowflake.

Understanding Your Snowflake Utilization: Part 3 – Query Profiling

This article is the third in a three-part series to help you use the functionality and data in Snowflake’s Information Schema to better understand and effectively utilize Snowflake.

As a Customer Success Engineer, my daily job entails helping our customers get the most value from our service. In my first post, I discussed getting a handle on your utilization of compute resources by using various Information Schema views and functions to profile your virtual warehouse usage. In my second post, I showed you how to get a handle on your storage usage.

In this final post, I will deep-dive into understanding query profiling. To do this, I will show you examples using the QUERY_HISTORY family of functions. I will also show you a handy page in the UI that provides a graphical view of each query. Keep in mind that you’ll need warehouse MONITOR privileges to perform the tasks described in this post. Typically, the SYSADMIN role has the necessary warehouse MONITOR privileges across your entire account; however, other lower-level roles may also have the necessary privileges.
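If you need to grant that privilege to another role, a minimal sketch looks like this (the warehouse and role names are hypothetical):

grant monitor on warehouse analytics_wh to role reporting_role;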

Ready to get started? Here we go!

Query History Profiling

Query profiling is perhaps one of the more popular topics I field questions about. Many customers are interested in improving their query performance. Although every development team should strive to periodically refactor their code, many find it challenging to determine where to start. Going through this analysis should help with identifying a good starting point.

Let’s look at some syntax, per our documentation for QUERY_HISTORY:

select *
from table(information_schema.query_history(dateadd('hours',-1, current_timestamp()),current_timestamp()))
order by start_time;

This query provides a view into all of the queries run by the current user in the past hour:

[Screenshot: Query Profiling - History]

We can also leverage the QUERY_HISTORY companion functions to narrow the focus:

  • QUERY_HISTORY_BY_SESSION
  • QUERY_HISTORY_BY_USER
  • QUERY_HISTORY_BY_WAREHOUSE

These are particularly useful if you have identified specific workflow issues you need to address.
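For example, here is a sketch using QUERY_HISTORY_BY_USER to review one user’s recent queries (the user name and result limit are hypothetical):

select query_text, start_time, total_elapsed_time
from table(information_schema.query_history_by_user(user_name => 'JSMITH', result_limit => 100))
order by start_time desc;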

Profiling Tip #1: Using HASH()

Now for a particularly useful tip: using HASH on the QUERY_TEXT column can help you consolidate and group similar queries (the HASH function returns the same result for queries whose text is exactly the same). As a general rule, identifying query groups and finding the max and min query runtimes should help you sort through specific workflows. In the example below, I’m doing an analysis on average compile and execution time. Additionally, I’m collecting a count of the queries with the same syntax:

select hash(query_text), query_text, count(*), avg(compilation_time), avg(execution_time)
from table(information_schema.query_history(dateadd('hours',-1,current_timestamp()),current_timestamp()))
group by hash(query_text), query_text
order by count(*) desc;

Output:

[Screenshot: Query Profiling - Query Groups]

Using the HASH function further allows a user to easily query a particular instance of this query from the QUERY_HISTORY function. In the example above, I could check for specific queries where the query text hashes to the value -102792116783286838. For example:

select *
from table(information_schema.query_history())
where hash(query_text) = -102792116783286838
order by start_time;

Output:

[Screenshot: Query Profiling - Single Query (SQL)]

The above result shows you all of the times you have issued this particular query (going back 7 days). Pay specific attention to the following columns:

  • COMPILATION_TIME
  • EXECUTION_TIME
  • QUEUED times

If a query is spending more time compiling (COMPILATION_TIME) than executing (EXECUTION_TIME), it may be time to review the complexity of the query. Snowflake’s query compiler will optimize your query and identify all of the resources required to perform it in the most efficient manner. If a query is overly complex, the compiler needs to spend more time sorting through the query logic. Take a look at your query and see whether there are many nested subqueries or unnecessary joins. Additionally, if more columns are being selected than required, be more specific in your SELECT statement by naming only the columns you need.
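As a sketch, you could flag query groups whose average compile time exceeds their average execution time over the past hour (both columns are reported in milliseconds):

select hash(query_text) as query_hash,
       avg(compilation_time) as avg_compile_ms,
       avg(execution_time) as avg_exec_ms,
       count(*) as runs
from table(information_schema.query_history(dateadd('hours',-1,current_timestamp()),current_timestamp()))
group by hash(query_text)
having avg(compilation_time) > avg(execution_time)
order by runs desc;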

QUEUED time is interesting because it can be an indicator of your warehouse size and the amount of workload you’ve placed on the warehouse. Snowflake is able to run concurrent queries and does a very good job of doing so. However, there will be times when a particularly large query requires more resources and, thus, causes other queries to queue while they wait for compute resources to be freed up. If you see a lot of queries spending a long time in the queue, you could either:

  • Dedicate a warehouse to these large complex running queries, or
  • Utilize Snowflake’s multi-clustering warehouse feature to allow more parallel execution of the queries. For more information about multi-cluster warehouses, see the Snowflake documentation.

In recent updates to our QUERY_HISTORY_* Information Schema functions, we added more metadata to the results, so you now have a range of metadata at your disposal:

  • SESSION_ID
  • USER_NAME, ROLE_NAME
  • DATABASE_NAME, SCHEMA_NAME
  • WAREHOUSE_NAME, WAREHOUSE_SIZE, WAREHOUSE_TYPE

These columns will help you identify the origin of the queries and help you fine-tune your workflow. A simple example would be to find the warehouse with the longest-running queries, or to find the user who typically issues these queries.
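Here is a sketch of both lookups, assuming a one-day window (the hash value is the one from the earlier example):

-- Warehouses ranked by their longest-running query
select warehouse_name, max(execution_time) as max_exec_ms
from table(information_schema.query_history(dateadd('days',-1,current_timestamp()),current_timestamp()))
group by warehouse_name
order by max_exec_ms desc;

-- Users who most frequently issue a particular query
select user_name, count(*) as runs
from table(information_schema.query_history(dateadd('days',-1,current_timestamp()),current_timestamp()))
where hash(query_text) = -102792116783286838
group by user_name
order by runs desc;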

Profiling Tip #2: Using the UI

Once you have identified a particular query you would like to review, you can copy its QUERY_ID value and use this value to view its query plan in Snowflake’s Query Profile. To do this, click on the History icon, add a QUERY ID filter, and paste the QUERY_ID in question. For example:

[Screenshot: Query Profile - Using the UI]

Hint: If you don’t see a result, make sure you are using a role with the necessary warehouse MONITOR privilege (e.g. SYSADMIN or ACCOUNTADMIN) and you’ve selected the correct QUERY ID.

Once the search is complete, you should be able to click on the link provided under the Query ID column to go to the query’s detail page:

[Screenshot: Query Profiling - Profile Page in UI]

Now click on the Profile tab.

You should see a visualization of the Query Profile. In my example, Snowflake shows that this particular query executed in two steps:

Step 1:

[Screenshot: Query Profiling - Profile Step 1 in UI]

Step 2:

[Screenshot: Query Profiling - Profile Step 2 in UI]

The online documentation provides in-depth details on how to interpret this view. Pay particular attention to the orange bar in this view. It indicates the percentage of overall query time spent on this particular process. In this instance, we can see our query spent most of the time reading data from the table. If we run this query often enough, we should see this time decrease because we’ll be reading the data from cache instead of disk.

Conclusion

By utilizing the UI and the Information Schema functions and views described in this post, you can profile your queries to help you understand your current workflow and identify queries that can be better optimized. This will help save you money in the long run and also improve your user experience. Snowflake will continue to invest in tools like these to help our users better understand and use our platform.

To dig in some more on this subject, check out our online documentation.

I hope this article and this series gave you some inspiration for how you would like to manage your Snowflake instance. There are a lot of options to play with and they’re all intended to provide you with the flexibility and control you need to best use Snowflake. Please share your thoughts with us! We would love to help you on your journey to the cloud.

For more information, please feel free to reach out to us at info@snowflake.net. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Data Encryption with Customer-Managed Keys

The security of customer data is Snowflake’s first priority. All customer data is encrypted using industry-standard techniques such as AES-256. Encryption keys are organized hierarchically, rooted in a hardware security module (HSM). This allows complete isolation of customer data and greatly reduces the attack surface.

For customers with the highest security requirements, we are adding another security component: customer-managed keys. With customer-managed keys, the customer manages the encryption key and makes it available to Snowflake. The customer has full control over this key. If the customer disables access to the encryption key, Snowflake can no longer access the customer’s data. Your data. Your encryption keys.

In this blog post, we will explain the benefits of customer-managed keys and their implementation in the Snowflake cloud data warehouse.

Benefits

Customer-managed keys provide the following benefits:

More Control over Data Access: Customer-managed keys make it impossible for Snowflake to comply with requests to access customer data. If data is encrypted using customer-managed keys and the customer disables access to the encryption key, it is technically impossible for Snowflake to decrypt the data. It is therefore the customer’s responsibility to comply with such requests directly.

Stop Data Breaches: If a customer experiences a data breach, they may disable Snowflake’s access to their customer-managed key. This will halt all running queries in Snowflake, including queries that may inspect data or unload data. Disabling the customer-managed key allows customers to stop the ongoing exfiltration of their data.

More Control over Data Lifecycle: The final reason customers require this feature is a reluctance to fully entrust their most sensitive data to any cloud provider. Using customer-managed keys, such sensitive data is ultimately encrypted with the customer’s key. It is impossible for Snowflake to decrypt this data without the customer’s consent. The customer has full control over the data’s lifecycle.

Implementation

Before we explain the implementation of customer-managed keys, we should first give a background of Snowflake’s key hierarchy and Amazon’s key management service.

Background 1: Snowflake’s Key Hierarchy

Snowflake manages encryption keys hierarchically. Within this key hierarchy, a parent key encrypts all of its child keys. When a key encrypts another key, it is called “wrapping”. When the key is decrypted again, it is called “unwrapping”.


Figure 1: Encryption key hierarchy in Snowflake.

Figure 1 shows Snowflake’s hierarchy of encryption keys. The top-most root keys are stored in a hardware security module (or CloudHSM). A root key wraps account master keys. Each account master key corresponds to one customer account in Snowflake. Account master keys, in turn, wrap all data-level keys, including table master keys, stage master keys, and result master keys. In addition, every single data file is encrypted with a separate key. A detailed overview of Snowflake’s encryption key management is provided in this Blog post.

Background 2: AWS Key Management Service

Amazon’s AWS Key Management Service (KMS) is a service to store encryption keys and tightly control access to them. Amazon provides an audit log of all operations and interactions with KMS by using CloudTrail. This allows customers to manage their own encryption keys and validate their usage via the audit log. KMS also allows customers to disable access to any keys at any time. Combining KMS with Snowflake’s encryption key hierarchy allows us to implement customer-managed keys. More details about AWS KMS can be found on the Amazon website.

Implementation of Customer-Managed Keys

The implementation of customer-managed keys changes the way account master keys (AMKs) are stored within Snowflake’s encryption key hierarchy. Normally, as shown in Figure 1 above, an AMK is wrapped by the root key stored in CloudHSM. For customer-managed keys, this is only partly true. There are two AMKs involved: a first key is wrapped by the root key stored in the CloudHSM and a second key is wrapped by the customer key in KMS. Unwrapping and combining these two keys leads to the composed account master key, which then wraps and unwraps all underlying keys in the hierarchy (table master keys, result master keys, etc.).


Figure 2: Account master key composed of AMK-S and AMK-C. AMK-C is wrapped by KMS.

Figure 2 shows this concept in detail. With customer-managed keys, the AMK is composed of two keys: AMK-S and AMK-C. AMK-S is a random 256-bit key that is wrapped with the root key stored in HSM. AMK-C is a second random 256-bit key that is wrapped with the customer key stored in KMS. AMK-S and AMK-C are completely random and unrelated. Both wrapped keys are stored in Snowflake’s encryption key hierarchy.

Figure 3: Unwrapping and composing of AMK.

When the customer runs a query in Snowflake that requires access to customer data, the composed AMK is produced as follows (see Figure 3). Both wrapped keys, AMK-S and AMK-C, are retrieved from the encryption key hierarchy. AMK-S is unwrapped using the root key in HSM. AMK-C is unwrapped using the customer key in KMS. The KMS audit log records an access event for the customer key. Both unwrapped 256-bit keys are combined using XOR to form the composed AMK. The composed AMK is then used to unwrap the underlying table master keys to access the customer data.

The composed AMK is cached within the Snowflake data warehouse for performance reasons. This cache has a timeout period after which the cached AMK is not accessible anymore. The cache is refreshed in the background such that continuous queries are not impacted by any latency to KMS. If access to KMS is revoked, refreshing the cache fails and the AMK is removed from the cache immediately. Any running queries are aborted. New queries fail to start because no AMK can be composed. The customer’s data can no longer be decrypted by the Snowflake service.

Summary

Customer-managed keys provide an extra level of security for customers with sensitive data. With this feature, the customer manages the encryption key themselves and makes it accessible to Snowflake. If the customer decides to disable access, data can no longer be decrypted. In addition, all running queries are aborted. This has the following benefits for customers: (a) it makes it technically impossible for Snowflake to comply with requests for access to customer data, (b) the customer can actively mitigate data breaches and limit data exfiltration, and (c) it gives the customer full control over data lifecycle.

Availability

Customer-managed keys are a primary component of Tri-Secret Secure, a Snowflake Enterprise Edition for Sensitive Data (ESD) feature. To enable Tri-Secret Secure for your ESD account, you need to first create a key in AWS KMS (in your AWS account) and then contact Snowflake Support.

Acknowledgements

We want to thank Difei Zhang for his contributions to this project.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and developments here at Snowflake Computing.

New Snowflake features released in Q1’17

We recently celebrated an important milestone in reaching 500+ customers since Snowflake became generally available in June 2015. As companies of all sizes increasingly adopt Snowflake, we wanted to look back and provide an overview of the major new Snowflake features we released during Q1 of this year, and highlight the value these features provide for our customers.

Expanding global reach and simplifying on-boarding experience

Giving our customers freedom of choice, along with a simple, secure, and guided “Getting Started” experience, was a major focus of the last quarter.

  • We added a new region outside of the US; customers now have the option to analyze and store their data in Snowflake accounts deployed in EU-Frankfurt. Choosing the appropriate region is integrated into our self-service portal when new customers sign up.
  • In addition, we added our high-value product editions, Enterprise and Enterprise for Sensitive Data (ESD), to our self-service offerings across all available regions. For example, with Enterprise, customers can quickly implement auto-scale mode for multi-cluster warehouses to support varying, high concurrency workloads. And customers requiring HIPAA compliance can choose ESD.
  • Exploring other avenues for enabling enterprises to get started quickly with Snowflake, we partnered with the AWS Marketplace team to include our on-demand Snowflake offerings, including the EU-Frankfurt option, in their newly-launched SaaS subscriptions.

Improving out-of-the-box performance & SQL coverage

We are committed to building the fastest cloud DW for your concurrent workloads with the SQL you love.

  • One key performance improvement introduced this quarter was the reduction of compilation times for JSON data. Internal TPC-DS tests demonstrate a 30-60% reduction for most of the TPC-DS queries (single stream on a single, 100TB JSON table). In parallel, we worked on improving query compile time in general, providing up to a 50% performance improvement for short queries.
  • Another new key capability is the support for bulk data inserts on a table concurrently with other DML operations (e.g. DELETE, UPDATE, MERGE). By introducing more fine-grained locking at the micro-partition level, we are able to allow concurrent DML statements on the same table.
  • To improve our data clustering feature (currently in preview), we added support for specifying expressions on table columns in clustering keys. This enables more fine-grained control over the data in the columns used for clustering.
  • Also, we reduced the startup time for virtual warehouses (up to XL in size) to a few seconds, ensuring almost instantaneous provisioning for most virtual warehouses.
  • We extended our SQL by adding support for the ANSI SQL TABLESAMPLE clause. This is useful when a user wants to limit a query operation performed on a table to only a random subset of rows from the table.
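For example, a quick sketch of sampling (the table name is hypothetical):

-- Return roughly 10% of the rows, chosen at random
select * from lineitem tablesample (10);

-- Or sample a fixed number of rows
select * from lineitem tablesample (1000 rows);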

Staying Ahead with Enterprise-ready Security

From day one, security has always been core to Snowflake’s design.

  • We expanded Snowflake’s federated authentication and single sign-on capability by integrating with many of the most popular SAML 2.0-compliant identity providers. In addition to Okta, Snowflake now supports ADFS/AD, Azure AD, Centrify, and OneLogin, to name just a few.
  • To advance Snowflake’s built-in auditing, we introduced new Information Schema table functions (LOGIN_HISTORY and LOGIN_HISTORY_BY_USER) that users can query to retrieve the short-term history of all successful and failed login requests in the previous 7 days. If required, users can maintain a long-term history by copying the output from these functions into regular SQL tables.
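As a sketch, a query for failed logins over the past day might look like this (see the documentation for the full output column list):

select event_timestamp, user_name, error_message
from table(information_schema.login_history(dateadd('hours',-24,current_timestamp()),current_timestamp()))
where is_success = 'NO'
order by event_timestamp;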

Improving our ecosystem

Enabling developers and builders to create applications with their favorite tools and languages remains a high priority for us.

  • With respect to enterprise-class ETL, we successfully collaborated with Talend in building a native Snowflake connector based on Talend’s new and modern connector SDK. The connector, currently in preview, has already been deployed by a number of joint customers with great initial feedback on performance and ease-of-use.
  • To tighten the integration of our Snowflake service with platforms suited for machine learning and advanced data transformations, we released a new version of our Snowflake Connector for Spark, drastically improving performance by pushing more query operations, including JOINs and various aggregation functions, down to Snowflake. Our internal 10 TB TPC-DS performance benchmark tests demonstrate that running TPC-DS queries using this new v2 Spark connector is up to 70% faster compared to executing SQL in Spark with Parquet or CSV (see this Blog post for details).
  • We continue to improve our drivers for our developer community. Listening to feedback from our large Python developer community, we worked on a new version of Snowflake’s native Python client driver, resulting in up to 40% performance improvements when fetching result sets from Snowflake. And, after we open-sourced our JDBC driver last quarter, we have now made the entire source code available on our official GitHub repository.
  • And, last, but not least, to enhance our parallel data loading via the COPY command, ETL developers can now dynamically add file metadata information, such as the actual file name and row number, which might not be part of the initial payload.
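A sketch of what this looks like in a COPY transformation, using the METADATA$FILENAME and METADATA$FILE_ROW_NUMBER pseudo-columns (the table, stage, and column names are hypothetical):

copy into my_table (src_file, src_row_num, col1)
from (
  select metadata$filename, metadata$file_row_number, t.$1
  from @my_stage t
);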

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the running service.

  • One major addition to our service is Query Profile, now generally available and fully integrated into Snowflake’s web interface. Query Profile is a graphical tool you can use to detect performance bottlenecks and areas for improving query performance.
  • Various UI enhancements were implemented: Snowflake’s History page now supports additional filtering by the actual SQL text and query identifier. We also added UI support for creating a Parquet file format in preparation for loading Parquet data into variant-type table columns in Snowflake.
  • A new Information Schema table function (TABLE_STORAGE_METRICS) exposes information about the data storage for individual tables. In particular, a user can now better understand how tables are impacted by Continuous Data Protection, particularly Time Travel and Fail-safe retention periods, as well as which tables contain cloned data.
  • We also recently introduced smarter virtual warehouse billing through Warehouse Billing Continuation (see this Blog post for details). If a warehouse is suspended and resumed within 60 minutes of the last charge, we do not charge again for the servers in the warehouse. WBC eliminates additional credit charges, and we hope it will reduce the need for our customers to strictly monitor and control when warehouses are suspended and resized.

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • As part of rolling out the new EU (Frankfurt) region, we increased automation of our internal deployment procedures to (a) further improve engineering efficiency while (b) laying the foundation for rapidly adding new regions based on customer feedback.
  • We further streamlined and strengthened our various internal testing and pre-release activities, allowing us to ship new features to our customers on a weekly basis – all in a fully transparent fashion with no downtime or impact to users.

Conclusion and Acknowledgements

This summary list of features delivered in Q1 highlights the high velocity and broad range of features the Snowflake Engineering Team has successfully delivered in a short period of time. We are committed to putting our customers first and maintaining this steady pace of shipping enterprise-ready features each quarter. Stay tuned for another feature-rich Q2.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Understanding Your Snowflake Utilization, Part 2: Storage Profiling

This article is the second in a three-part series to help you use Snowflake’s Information Schema to better understand and effectively utilize Snowflake.

As a Customer Success Engineer, my daily job entails helping our customers get the most value from our service. And I’m now passing along some of what I’ve learned to help you become more self-sufficient. In my first post, I discussed getting a handle on your utilization of compute resources by using various Information Schema views and functions to profile your virtual warehouse usage.

In this post, I provide a deep-dive into understanding how you are utilizing data storage in Snowflake at the database, stage, and table level. To do this, I will show you examples of two functions and a view provided in the Information Schema for monitoring storage usage. I will also show you a handy page in the UI that provides an account-level view of your storage. Keep in mind that you need ACCOUNTADMIN access to perform any of the tasks described in this post.

Let’s get started.

Summary Storage Profiling in the UI

Before diving into our detailed analysis of data storage, let’s take a quick look at the summary, account-level storage view provided by Snowflake. As a user with the ACCOUNTADMIN role, you can navigate to the Account page in the Snowflake UI to get a visual overview of the data storage for your account.


This page provides a view, by month, of the average and daily storage usage across your entire account. You can use the filters on the page to filter by database, Snowflake stage, and data maintained in Fail-safe (for disaster recovery).

Detailed Storage Profiling Using the Information Schema

The Snowflake Information Schema provides two functions and one view for monitoring detailed data storage at the database, stage, and table level:

  • DATABASE_STORAGE_USAGE_HISTORY (function)
  • STAGE_STORAGE_USAGE_HISTORY (function)
  • TABLE_STORAGE_METRICS (view)

The DATABASE_STORAGE_USAGE_HISTORY table function shows your database status and usage for all databases in your account or a specified database. Here’s an example of the usage over the last 10 days for a database named sales:

use warehouse mywarehouse;

select * from table(sales.information_schema.database_storage_usage_history(dateadd('days',-10,current_date()),current_date(), 'SALES'));

Note that the above screenshot only displays some of the output columns. For full details about the output, see the online documentation. Also, per the Snowflake documentation:

If a database has been dropped and its data retention period has passed (i.e. database cannot be recovered using Time Travel), then the database name is reported as DROPPED_id.

At its core, the most useful insight from this function is the average growth in your database. Keep in mind, the output includes both AVERAGE_DATABASE_BYTES and AVERAGE_FAILSAFE_BYTES. Leveraging these data points to derive a percentage of Fail-safe over actual database size should give you an idea of how much you should be investing towards your Fail-safe storage. If certain data is not mission critical and doesn’t require Fail-safe, try setting these tables to transient. More granular information about Fail-safe data is provided in TABLE_STORAGE_METRICS, which we will look at more closely later in this post.
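As a sketch, deriving that Fail-safe percentage from the function output might look like this (reusing the sales database from the earlier example):

select usage_date,
       average_failsafe_bytes / nullif(average_database_bytes, 0) as failsafe_ratio
from table(sales.information_schema.database_storage_usage_history(dateadd('days',-10,current_date()),current_date(), 'SALES'))
order by usage_date;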

Next, let’s look at STAGE_STORAGE_USAGE_HISTORY. This function shows you how much storage is being used for staged files across all your Snowflake staging locations, including named internal stages. Note that this function does not allow querying storage on individual stages.

Here’s an example of staged file usage for the last 10 days:

select *
from table(sales.information_schema.stage_storage_usage_history(dateadd('days',-10,current_date()),current_date()));

Note that the above screenshot only displays some of the output columns. For full details about the output, see the online documentation.

Also note that you can only query up to 6 months’ worth of data using this function. Some of our users like to use Snowflake stages to store their raw data. For example, one user leverages table staging locations for raw data storage just in case they need to access the data in the future. There’s nothing wrong with this approach, and since Snowflake compresses your staged data files, it certainly makes sense; however, keep in mind that only the last 6 months of staged data storage history is available.

Finally, the TABLE_STORAGE_METRICS view shows your table-level storage at runtime. This is a snapshot of your table storage, including your active and Fail-safe storage. Additionally, you can derive cloned storage using the CLONE_GROUP_ID column. As of today, this is the most granular level of storage detail available to users.

Here’s a general use example:

select *
from sales.information_schema.table_storage_metrics
where table_catalog = 'SALES';

Note that the above screenshot only shows a portion of the output columns. For full details about the output, see the online documentation.

One interesting analysis I’ve been helping our customers with is deriving how much of their table storage is based on cloned data. In Snowflake, cloning data has no additional costs (until the data is modified or deleted) and it’s done very quickly. All users benefit from “zero-copy cloning”, but some are curious to know exactly what percentage of their table storage actually came from cloned data. To determine this, we’ll leverage the CLONE_GROUP_ID column in TABLE_STORAGE_METRICS.

For example:

with storage_sum as (
  select clone_group_id,
         sum(owned_active_and_time_travel_bytes) as owned_bytes,
         sum(active_bytes) + sum(time_travel_bytes) as referred_bytes
  from concurrency_wh.information_schema.table_storage_metrics
  where active_bytes > 0
  group by 1)
select *, referred_bytes / owned_bytes as ratio
from storage_sum
where referred_bytes > 0 and ratio > 1
order by owned_bytes desc;

The ratio in the above query gives you an idea of how much of the original data is being “referred to” by the clone. In general, when you make a clone of a table, the CLONE_GROUP_ID for the original table is assigned to the new, cloned table. As you perform DML on the new table, your REFERRED_BYTES value gets updated. If you join the CLONE_GROUP_ID back to the original view, you get the output of the original table along with the cloned table. A ratio of 1 in the above example means the table data is not cloned.

If you need to find out the exact table name from the above query, then simply join the CTE back to the TABLE_STORAGE_METRICS view and ask for the TABLE_NAME column.

For example:

with storage_sum as (
  select clone_group_id,
         sum(owned_active_and_time_travel_bytes) as owned_bytes,
         sum(active_bytes) + sum(time_travel_bytes) as referred_bytes
  from concurrency_wh.information_schema.table_storage_metrics
  where active_bytes > 0
  group by 1)
select b.table_name, a.*, referred_bytes / owned_bytes as ratio
from storage_sum a
join concurrency_wh.information_schema.table_storage_metrics b
on a.clone_group_id = b.clone_group_id
where referred_bytes > 0 and ratio > 1
order by owned_bytes desc;

[Screenshot: Storage Profiling - Example]

Conclusion

By utilizing the UI and the Information Schema functions and views described in this post, you can profile your data storage to help you keep your storage costs under control and understand how your business is growing over time. It’s a good idea to take regular snapshots of your storage so that you can analyze your growth month-over-month. This will help you both formulate usage insight and take actions.

To dig in some more on this subject, check out our online documentation.

I hope this article has given you some good ideas for how to manage your Snowflake instance. Look for Part 3 of this series in coming weeks where I will show you how to analyze your query performance. As already shown in Parts 1 and 2, there are a lot of options to play with in Snowflake and they’re all intended to give you the flexibility and control you need to best use Snowflake. Please share your thoughts with us!

Also, for more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Snowflake Vision Emerges as Industry Benchmark

Technology research and analysis firm Gigaom has ranked Snowflake as the #1 cloud data warehouse in a recent study. We surpassed enterprise data warehouse products including Google BigQuery, Teradata, IBM dashDB, HPE Vertica, Microsoft Azure SQL, SAP HANA, and Oracle Exadata. Snowflake emerged with a top score of 4.85 out of a possible 5.0. The competition averaged a score of 3.5. The six “disruption vectors” Gigaom used as its key scoring criteria are congruent with what we wanted to achieve back in the summer of 2012, when we started Snowflake.

But long before we wrote the first line of Snowflake code, we asked one another: “What should a data warehouse deliver that no other product has before? How can we enable organizations to make the best, data-driven decisions? And how will the world’s most powerful data warehouse help organizations achieve their existing goals and help reveal their future goals?” We then set out to answer those questions.

We wanted to enable organizations to easily and affordably store all of their data in one location, and make that data accessible to all concurrent users without degrading performance. We also wanted Snowflake to scale infinitely, with ease, and cost effectively so organizations would only pay for the compute and storage they used. And the product had to work with the tools that users already knew and loved. Finally, we wanted a data warehouse that required zero management by our customers – nothing to tweak, no tuning required. These defining qualities aligned with the new world of cloud services, and they are what formed the foundation of Snowflake.

What’s happened since the early days of Snowflake? We got to work, and we stuck to hiring the best engineers the world has to offer. We built Snowflake from the ground up, for the cloud, and incorporated all of these elements as the core of the product. In early 2015, we offered the first commercial version of Snowflake – the one and only data warehouse built for the cloud. Since then, our engineering team has added more and more industry-leading capabilities to Snowflake, leapfrogging the traditional data warehouse vendors.

Along the way, we’ve hired high-calibre teams to execute the sales, marketing and finance functions of the company so our customers and partners get the highest value from working with Snowflake. We also built a great customer support organization, providing the level of service our users love. In more recent times, we’ve expanded operations outside of North America to Europe, with Asia-Pacific and other regions coming online soon. We’ve also added Snowflake On Demand™ – the easiest way to get started with Snowflake by simply signing up on our website with just a credit card. All of these efforts over the past four years have led to Snowflake’s most recent inflection point – being chosen as the number one cloud data warehouse.

What does all this mean? Snowflake’s current and future customers have every opportunity to explore all of their data in ways they never thought possible. They can gain the insight, solve the problems and create the opportunities they simply couldn’t with their previous data platforms. We committed to building the world’s best data warehouse – the only data warehouse built for the cloud. Our customers, our partners and now the industry have indicated we’ve likely achieved what we set out to do back in the summer of 2012. Going forward, we’ll continue to serve our customers and partners with the best technology, the best solutions and the best services available.

Read the full report >

Migrating to the Cloud? Why you should start with your EDW

Many organizations we engage with are seriously considering transforming their business and moving some (or all) of their IT operations into the cloud. A lot of executives I have encountered are struggling with the same question: “How do I get started?” There is a strong case to be made that starting with your Enterprise Data Warehouse (EDW), or at least a data mart, is the fastest and most risk-free path, with added upside potential to increase revenue and set you up for future growth. As operational data volumes continue to grow at exponential rates, it’s not a matter of if you go to the cloud to manage your enterprise data, but when.

Before going too far on your cloud journey, I would recommend an exercise in segmenting your business from an IT perspective in a very simple way. To get you started, let me suggest five possible categories, along with some risks to consider for each:

  • Customer-facing Applications – This is the heart and soul of your business. If something goes wrong, you lose business and revenue, and people potentially get fired. Risk: HIGH
  • Internal Applications – Mail, Payroll, General Ledger, AP, AR, things like that. Every person inside the organization relies on at least one of these services, and a lot of analysis needs to take place to figure out all the integration points to ensure nothing gets missed during a migration to the cloud. Risk: HIGH
  • Desktop/Laptop OS and Applications – There are whole books and schools of thought about how to migrate these, which means it’s a big decision and a big deal. Impacting everyone in the company on your first cloud initiative? Risk: HIGH
  • Operations Monitoring and Alerting – Got a Network Operation Center (NOC)? These guys are integrated with every system that is important, so moving them to the cloud could be a large undertaking. Risk: HIGH
  • Reporting and Analytics – Hmmm….if my constituents don’t get their weekly or monthly reports on time, is that a disaster? Can they get by with a small outage during the migration? Risk: LOW

Starting with the Data

Let’s take a closer look at why starting your cloud journey with your EDW could be a viable option, and even have some benefits that could help sell the idea (of the cloud) internally. In no particular order, I would highlight these points:

  • Doesn’t disrupt the business – Many EDW implementations are not mission critical today (as compared to enterprise applications). As more data becomes available through social media or Internet of Things (IOT) applications, businesses need access to much larger volumes of data and they will want access to it earlier in the data pipeline. Traditional DWs contain aggregations and are used for doing trend analysis, analyzing data over a period of time to make strategic, rather than tactical decisions. They are not architected to handle this new influx of raw data in a cost-effective manner. By starting your cloud journey with the EDW, you reduce risk (by going to a more flexible architecture) while getting your team early exposure to working with cloud services.
  • Doesn’t disrupt internal users – When moving to the cloud, you want to show incremental success and don’t want to add a lot of unnecessary risk. It’s simple to keep running your existing EDW in parallel with your new cloud DW, giving you a built-in fall-back plan for the early stages. Or you may decide to start with a small data mart as a pilot project.
  • Start-up costs are a fraction of on-premises, appliance solutions – Some of our customers invested as much as $10 million (or more) years ago on a data warehouse appliance that is now outdated technologically. And the renewal costs to keep that tech going are coming due. If they re-invest another huge sum of money, this will delay them getting to the cloud by another 4-5 years, putting them behind their competition. Rather than outlaying a large capital expenditure to extend the life of the older technology, it may make better sense to move to the cloud. The cloud offers a utility-based model, allowing you to pay for what you use and when you use it, as opposed to what you think you are going to need 2-3 years in the future. As a result, not only is the cost of entry lower, but you are not risking a huge sum of money to make the move.
  • Data is growing at an exponential rate – Will you ever have less data to worry about in your business? If you plan on being successful, I don’t think so. Many organizations are looking at new and different ways to manage and analyze ever-increasing volumes of data coming in various formats from multiple sources (such as semi-structured web logs). Your current on-premises EDW was not designed for this kind of workload or data.  If you are considering changing infrastructure platforms to accommodate it, why not select tools that were built for today’s modern data challenges instead of legacy-based architectures? Moving to the cloud also gives you the opportunity to consolidate operations and streamline business processes.
  • Enable new capability – There are some new analytic paradigms happening in the cloud (such as machine learning). Cloud-based platforms allow you to work with both detailed and aggregated data at scales never imagined (see the case study about DoubleDown as an example). Need to run a complex analytic job with a 256-node Massively Parallel Processing (MPP) cluster for an hour, and then shut it down? No problem. Can your platform support a thousand users without concurrency issues? How would that change your business if it could dynamically adjust to handle those new demands?

As with any infrastructure move, the benefits have to be clear enough that the status quo mentality can be overcome and analysis paralysis doesn’t push out your journey to the cloud for months or even years. The beauty of the cloud model is that it is easy to start small and scale without risking a huge investment up front. Every business needs some proof before committing time and resources to move anything to the cloud and your EDW is a perfect candidate. Snowflake is the first and only EDW built for the cloud to be truly elastic for all of your analytic and big data needs.

Please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Is My Data Safe in the Cloud?

This post comes to us from a guest blogger, Robert Lockard. Robert is an Oracle ACE with over 30 years of experience as an Oracle DBA, designer, and developer. He has spent the past 10 years focusing on database security. Robert frequently speaks at conferences all over the world, teaching DBAs and developers how to secure their data. You can follow him on Twitter at @YourNavionPilot and read more of his blogs at http://www.oraclewizard.com.

I have four mantras for any system I’m involved in: Stability, Security, Accuracy and Performance.

When it comes to information security in the cloud, there are lots of recommendations for vendors and customers to consider. When I am called into customer sites to ensure their data is properly protected, my first step is making sure the person connecting to the database is authenticated properly and that encryption has been properly set up, along with a host of other related things. Getting data encrypted, making sure there is no ghost data, and ensuring there is no spillage is a difficult process. From my perspective, Snowflake makes end-to-end encryption easy because encryption is turned on by default. Your data is encrypted from day one.

Here is my best advice on how to ensure your critical business data is secure in the cloud, along with my review of how Snowflake addresses these issues.

#1 – Demand Multi-factor Authentication

Snowflake supports DUO multi-factor authentication to ensure the user authenticating to the database is also the authorized user. Stolen credentials are one of the top ways bad actors get in where they do not belong. The username / password paradigm of authentication has been around for several decades, and many of the data breaches that begin with username / password harvesting could be prevented by following best practices. One such best practice is multi-factor authentication: something you know plus something you have. When you authenticate with a username / password, a token displays a code; you enter that code into the system and then you are authenticated. This is something Snowflake has built into their system. Multi-factor authentication is not an add-on option; it comes standard with the product.

In addition, Snowflake’s implementation of DUO allows a verification request to be sent to a mobile device. When you enter your username / password, a request is sent to your mobile device. If you approve the request, you are connected. The beauty of this is that most of us always have our mobile phones with us. If you get a request that you did not originate, you know someone else is trying to use your credentials to get into the system. When you reject the request, they are locked out.

#2 – Demand End-to-end Encryption

The customer should require end-to-end encryption from any cloud provider they decide to use. Snowflake delivers AES-256 encryption at no extra cost. The AES standard was adopted in 2001 by the National Institute of Standards and Technology (NIST) and has displaced the older DES standard. AES uses a 128-bit block size with three key sizes: 128, 192, and 256 bits. AES encryption is the US Government standard for strong encryption and has been adopted by financial, government, and other commercial entities worldwide.

#2a – Demand Sophisticated Encryption Key Management

Snowflake uses the Amazon Hardware Security Module (HSM) to manage sophisticated key structures such as key wrapping, key rotation and re-keying.

Key Wrapping (aka Hierarchical Keys)

Key wrapping has become the industry standard. Oracle provides it in Transparent Data Encryption, with a master key stored outside the database used to decrypt the keys that secure the data. Microsoft SQL Server uses a certificate stored outside the database to decrypt the key that is used to decrypt the data. Snowflake uses four levels of keys to encrypt customer data.

Using this model, each customer’s data is encrypted with a unique set of keys. This way, customer “A”s data is encrypted with a different set of keys than customer “B”s. By using multiple customer keys, each customer’s data is segregated, and secured, from the others.

See the Snowflake Encryption Key Management blog post for more details.

Cryptologic Key Lifetime

NIST recommends having a policy in place so that each key has a limited lifetime. The longer a key is in use, the greater the odds it will be compromised (see the NIST Encryption Key Management Recommendation for more details). Snowflake uses two methods to control the lifetime of keys: key rotation and rekeying.

1) Cryptologic Key Rotation

One way to minimize the impact of a compromised cryptologic key is to rotate keys at a set interval. Snowflake rotates keys at a system-defined interval, ensuring that if a key is compromised, the amount of data at risk is minimized.

The easy way to understand key rotation is this: data is encrypted and decrypted using a key, and after a set period of time, another key is added; all new data is encrypted and decrypted with the new key, while old data is decrypted with the old key. Again, see the Snowflake Encryption Key Management blog post for more details.

The advantage of key rotation is twofold: (a) because you are adding a key rather than replacing one, data that has already been encrypted does not need to be re-keyed, so there is no performance penalty; and (b) because there are now multiple keys, if any one key is compromised, only a subset of the information is vulnerable.

2) Cryptologic Rekeying

As an additional security option, customers can choose to have their data rekeyed annually. By rekeying the data annually, Snowflake minimizes the amount of time a key is in use, thereby minimizing the odds a cryptologic key will be compromised. Once data is rekeyed, the old keys are destroyed.

Network Encryption

Now we need to deal with data that is moving over the network. All network communications between loading the data and analyzing the data are encrypted using TLS (see Advantages of Using TLS for more details). In addition, for ESD (Enterprise for Sensitive Data) customers, all internal network communications are also encrypted using TLS.

By using the TLS standard, Snowflake has implemented the industry best practice.

What about Performance and Stability?

There is a common perception that encryption carries a CPU performance penalty. In Snowflake’s design, the HSM manages encrypting and decrypting keys and handles key rotation and rekeying, but it is only used to wrap and unwrap account master keys. All remaining encryption is performed by other services in the background, resulting in no impact on customer workloads.

The hardware security module is also tamper-resistant: efforts to tamper with it will shut the HSM down. At the same time, Snowflake has configured the HSM to be stable and reliable.

Conclusion

Based on my evaluation, by using Snowflake, the customer gets three of my four mantras:

Security: You get multi-factor authentication and end-to-end encryption right out of the box.

Stability: Snowflake has configured the HSM to be very stable, plus it lives on AWS which has proven to be a very stable cloud platform.

Performance: Snowflake’s use of the Amazon Elastic Compute Cloud gives performance on demand.

My fourth mantra, Accuracy, is up to you.

SnowSQL Part I: Introducing SnowSQL – a modern command line tool built for the cloud

What & Why

This month, we introduced a new, modern command line tool for an interactive query experience with the Snowflake Elastic Data Warehouse cloud service. Yes indeed – a new command line tool!

Now you might ask – why did Snowflake decide to invest in building a new SQL command line tool in the 21st century when there are so many different ways of accessing and developing against the Snowflake service? For example, you could use our current user interface, or various drivers, or other SQL editor tools available today.

The answer is simple:

  • First of all, developers still care very deeply about light-weight mechanisms to quickly ask SQL questions via a text-based terminal with simple text input and output.
  • Secondly, we recognized the need for a more modern command line tool designed for the cloud that is easy to use, built on high security standards, and is tightly integrated with the actual core Snowflake elastic cloud data service.
  • Finally, we wanted to build a command line tool which offers users more powerful scripting capabilities overall.   

The result of these efforts is SnowSQL – our new SQL command line tool that is entirely built in Python and leverages Snowflake’s Python connector underneath.

Looking at existing, more generic command line tools such as SQLLine or HenPlus, it became obvious that these tools lack ease of use and sufficient scripting features. Also, Snowflake’s approach was to consider the command line tool part of the overall cloud service. That is, SnowSQL can be seen as an extension of the Snowflake Elastic DW service. This cloud-first philosophy has important implications for the entire life cycle of Snowflake’s command line tool, including the agility of delivering new SnowSQL features as part of the frequent Snowflake service updates (see the auto-upgrade section below).

Getting Started

To get started, you can download SnowSQL from the Snowflake UI after logging into your Snowflake account. SnowSQL is currently supported on all three major platforms including:

  • Linux 64-bit,
  • Mac OS X 10.6+, and
  • Windows 64-bit

We provide a native installer for each platform with easy-to-follow installation steps. Once you have installed SnowSQL, open a new terminal and type:

$ SNOWSQL_PWD=<password> snowsql -a <account_name> -u <user_name>

Upon first use, SnowSQL users will notice a progress bar indicating the download of the initial version of SnowSQL. This is a one-time operation. Any future downloads happen in the background and remain entirely transparent to the user. The ability to automatically and fully transparently upgrade the command line tool is one of the key benefits of building it as a cloud extension rather than as a stand-alone software component interacting with a remote service (see the auto-upgrade section below).

Using SnowSQL

There are a few capabilities we would like to highlight in this blog.  

SnowSQL Commands

First, SnowSQL offers a wide range of commands. As a general rule, all SnowSQL commands start with a bang character ‘!’. For a complete list of currently supported commands, please see our documentation.
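For example, here are a few commands you are likely to reach for early on (the descriptions are ours, and the command set continues to evolve, so treat the documentation as authoritative):

  • !help – lists all available commands and their options
  • !set output_format=csv – changes how result sets are rendered
  • !source setup.sql – executes the SQL statements in a local file
  • !spool results.txt – writes subsequent query output to a file
  • !quit – ends the session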

Auto-Complete and Syntax Highlighting

Secondly, with our context-sensitive auto-complete feature, SnowSQL users are relieved of the cumbersome and error-prone typing of long object names. Instead, they can complete SQL keywords and functions after typing the first three letters. By leveraging auto-complete, SnowSQL users become more productive and can quickly explore data in Snowflake. Furthermore, SQL statements are highlighted in different colors, improving readability when working in a terminal.

Auto-Upgrade

Thinking as a service, the auto-upgrade framework keeps users up to date with the latest Snowflake and SnowSQL features to streamline the end-user experience. No additional downloads or tedious re-installations are needed. The upgrade is transparent to the end user and takes place as a background process when you start SnowSQL. The next time a user runs SnowSQL, the new version is picked up automatically, while running workflows remain unaffected and are not interrupted during the upgrade.
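One small escape hatch worth knowing: if you ever need to reproduce behavior against a specific release, SnowSQL lets you launch a particular installed version explicitly via its -v option (the version number below is a placeholder):

$ snowsql -v 1.0.0 -a <account_name> -u <user_name>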

Secure Connection and Encryption

Finally, security is core to SnowSQL’s design. SnowSQL secures connections to Snowflake using TLS (Transport Layer Security) with OCSP (Online Certificate Status Protocol) checks. The auto-upgrade binaries are always validated using an RSA signature. In addition to securing the connection, SnowSQL provides end-to-end security for data moving in and out of Snowflake by using AES (Advanced Encryption Standard) encryption for Snowflake’s PUT and GET commands.
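In practice, a round trip through a stage looks like the sketch below (the stage path and file names are placeholders; PUT compresses the file by default, hence the .gz suffix on the GET):

put file:///tmp/orders.csv @~/staged;
get @~/staged/orders.csv.gz file:///tmp/downloads/;

The data is encrypted on the client side before it leaves your machine and decrypted transparently when it is retrieved.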

In the next few weeks, we will provide a deeper dive into some of the unique capabilities of SnowSQL. Please stay tuned as we gather and incorporate feedback from our customers. We would like to acknowledge our software engineers Shige Takeda (@smtakeda) and Baptiste Vauthey (@thabaptiser) for their major contributions.

Finally, all of this would not have been possible without the active Python community, which inspired us in many ways and offered the tools and packages we used to build SnowSQL. Thank you.

As always, keep an eye on this blog site, and our Snowflake Twitter feed (@SnowflakeDB) for updates on all the action and activities here at Snowflake Computing.

Automatic Concurrency Scaling in Snowflake – Another Way the Cloud Changes the Game

The challenge with concurrency

Today we take a major step forward by extending our elastic architecture to solve another major pain point in existing on-premises and cloud data warehousing solutions: how to run massively concurrent workloads at scale in a single system.

Have you had any of the following experiences when building mission-critical applications that incorporate data analytics?

  • My application can only support a certain level of user concurrency due to the underlying data warehouse, which only allows 32-50 concurrent user queries.
  • To build my application, I need to acquire multiple data warehouse instances in order to isolate numerous workloads and users from each other. This adds to costs and complexity.
  • We have built our own scheduling policies around the data warehouse. We use query queues to control and prioritize incoming queries issued by our numerous users.
  • During peak times, users are getting frustrated because their requests are getting queued or fail entirely.

At Snowflake, we separate compute from storage through the unique concept of virtual data warehouses. This makes it possible to instantly resize virtual warehouses or pause them entirely. It is also why Snowflake is the only cloud data warehousing solution that allows concurrent workloads to run without impacting each other.

However, we saw the need to go a step further to offer a service that adapts to changing workloads and addresses concurrency at the same time:

  • Imagine you didn’t have any concurrency limitations on your mission-critical business application.
  • Imagine you didn’t need users to adjust their workloads to accommodate data warehouse bottlenecks.
  • Imagine the data warehouse itself could detect increasing workloads and add compute resources as needed, or shut down or pause compute resources when workload activity subsides.
  • Imagine your application could scale out of the box with one single (virtual) data warehouse, without the need to provision additional data warehouses.
  • Imagine a world without any scheduling scripts and queued queries – a world in which you can leverage a smart data warehousing service that ensures all your users get their questions answered within the application’s SLA.

With Snowflake, you can do all of this for real, not just in your imagination, with our new multi-cluster data warehouse feature.

Multi-cluster data warehouses

A virtual warehouse represents a number of physical nodes a user can provision to perform data warehousing tasks, e.g. running analytical queries. While a user can instantly resize a warehouse by choosing a different size (e.g. from small to 3X large), until now a virtual data warehouse in Snowflake always consisted of one physical cluster.

With the recent introduction of multi-cluster warehouses, Snowflake supports allocating, either statically or dynamically, more resources for a warehouse by specifying additional clusters for the warehouse.

Figure 1: How automatic scaling works

The figure above shows a multi-cluster DW that consists of three compute clusters. All compute clusters in the warehouse are of the same size. The user can choose from two different modes for the warehouse:

  • Maximized: When the warehouse is started, Snowflake always starts all the clusters to ensure maximum resources are available while the warehouse is running.
  • Auto-scale: Snowflake starts and stops clusters as needed to dynamically manage the workload on the warehouse.

As always, in Snowflake a user can either leverage the user interface or use SQL to specify the minimum/maximum number of clusters per multi-cluster DW:

Create Warehouse UI wizard


Figure 2: Create Warehouse in UI Wizard

Create Warehouse SQL script

Figure 3: Create Warehouse via SQL
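For reference, the statement in Figure 3 looks roughly like the sketch below. The warehouse name and sizes are placeholders; min_cluster_count and max_cluster_count are the parameters that control the cluster range:

create warehouse analytics_wh with
  warehouse_size = 'MEDIUM'   -- size shared by every cluster in the warehouse
  min_cluster_count = 1       -- clusters always running while the warehouse is up
  max_cluster_count = 3       -- upper bound the scheduler may scale out to
  auto_suspend = 300          -- pause after 5 minutes of inactivity
  auto_resume = true;         -- wake up automatically when a query arrives

Setting min_cluster_count equal to max_cluster_count gives you the maximized mode described above; a minimum lower than the maximum puts the warehouse in auto-scale mode.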

 

Similar to regular virtual warehouses, a user can instantly resize all clusters of a multi-cluster warehouse by choosing a different size (e.g. XS, S, M, L, …), either through the UI or programmatically via the corresponding SQL DDL statements. In auto-scale mode, Snowflake automatically adds or resumes additional clusters (up to the maximum defined by the user) as soon as the workload increases, and shuts down or pauses the additional clusters if the load subsides again. No user interaction is required – this all takes place transparently to the end user.
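For example, the programmatic route looks roughly like this (again with a placeholder warehouse name):

alter warehouse analytics_wh set warehouse_size = 'LARGE';                     -- resize every cluster instantly
alter warehouse analytics_wh set min_cluster_count = 2 max_cluster_count = 5;  -- widen the auto-scale range

Both statements take effect immediately for new work, without interrupting queries already running on the warehouse.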

Internally, the query scheduler takes multiple factors into account when making these decisions. The two main factors in this context are:

  1. The memory capacity of the cluster, i.e. whether clusters have reached their maximum memory capacity
  2. The degree of concurrency in a particular cluster, i.e. whether there are many queries executing concurrently on the cluster

As we learn more from our customers’ use cases, we will extend this feature further and share interesting use cases where multi-cluster data warehouses make a difference. Please stay tuned as we continue reinventing modern data warehousing and analytics by leveraging the core principles of cloud computing.

We would like to thank our co-authors, Florian Funke and Benoit Dageville, for their significant contributions as main engineer and architect, respectively, to making multi-cluster warehouses a reality. As always, keep an eye on the blog and our Snowflake Twitter feed (@SnowflakeDB) for updates on Snowflake Computing.