Financial Services: Welcome to Virtual Private Snowflake

Correct, consistent data is the lifeblood of the financial services industry. If your data is correct and consistent, it’s valuable. If it’s wrong or inconsistent, it’s useless and may be dangerous to your organization.

I saw this firsthand during the financial meltdown of 2007/08. At that time, I had been working in the industry for nearly 20 years as a platform architect. Financial services companies needed that “single source of truth” more than ever. To remain viable, we needed to consolidate siloed data sets before we could calculate risk exposure. Most financial services companies were on the brink of collapse. Those that survived did so because they had access to the right data.

At my employer, we looked for a way to achieve this single source with in-house resources, but my team and I quickly realized it would be an extraordinary challenge. Multiple data marts were sprawled across the entire enterprise, and multiple sets of the same data existed in different places, so the numbers didn’t add up. In a global financial services company, even a one percent difference can represent billions of dollars and major risk.

We ultimately built an analytics platform powered by a data warehouse. It was a huge success, so successful that everybody wanted to use it for wide-ranging production use cases. However, it couldn’t keep up with demand, and no amount of additional investment would solve that problem.

That’s when I began my quest to find a platform that could provide universal access, true data consistency and unlimited concurrency. And for financial services, it had to be more secure than anything enterprises were already using. I knew the cloud could address most of these needs. However, even with the right leap forward in technical innovation, would the industry accept it as secure? Then I found Snowflake. But my story doesn’t end there.

I knew Snowflake, the company, was onto something. So, I left financial services to join Snowflake and lead its product team. Snowflake represents a cloud-first approach to data warehousing, with a level of security and unlimited concurrency that financial services companies demand.

We’ve since taken that a step further with Virtual Private Snowflake (VPS) – our most secure version of Snowflake. VPS gives each customer a dedicated and managed instance of Snowflake within a separate Amazon Web Services (AWS) Virtual Private Cloud (VPC). In addition, customers get our existing, best-in-class Snowflake security features, including end-to-end encryption of data at rest and in transit. VPS also includes Tri-Secret Secure, which combines a customer-provided encryption key, a Snowflake-provided encryption key and user credentials that authenticate approved users. Together, these features thwart an attempted data decryption attack by instantly rendering the data unreadable.

VPS is more secure than any on-premises solution and provides unlimited access to a single source of data without degrading performance. This means financial services companies don’t have to look at the cloud as a compromise between security and performance.

To find out more, read our VPS white paper and solution brief: Snowflake for Sensitive Data.


TPC-DS at 100TB and 10TB Scale Now Available in Snowflake’s Samples

We are happy to announce that a full 100 TB version of TPC-DS data, along with samples of all 99 of the benchmark’s queries, is available now to all Snowflake customers for exploration and testing. We also provide a 10 TB version if you are interested in smaller-scale testing.

The STORE_SALES sub-schema from the TPC-DS Benchmark (Source: TPC Benchmark™ DS Specification)

You can find the tables in:

  • Database: SNOWFLAKE_SAMPLE_DATA
  • Schema: TPCDS_SF100TCL (100TB version) or TPCDS_SF10TCL (10TB version).

(Note that the raw data compresses in Snowflake to less than 1/3 of its original size.)
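
If you just want a quick sanity check that the shared data is visible to you, a minimal query against the sample schema (assuming your role has access to the SNOWFLAKE_SAMPLE_DATA database) looks like this:

select count(*) as row_count
from snowflake_sample_data.tpcds_sf10tcl.store_sales;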

Sample TPC-DS queries are available as a tutorial under the + menu in the Snowflake Worksheet UI:

Accessing the sample TPC-DS queries in the Snowflake Worksheet UI

What is TPC-DS?

TPC-DS data has been used extensively by Database and Big Data companies for testing performance, scalability and SQL compatibility across a range of Data Warehouse queries — from fast, interactive reports to complex analytics. It reflects a multi-dimensional data model of a retail enterprise selling through 3 channels (stores, web, and catalogs), while the data is sliced across 17 dimensions including Customer, Store, Time, Item, etc. The bulk of the data is contained in the large fact tables: Store Sales, Catalog Sales, Web Sales — representing daily transactions spanning 5 years.

The 100TB version of TPC-DS is the largest sample relational database we know of that is available on any platform for public testing and evaluation. For perspective, the STORE_SALES table alone contains over 280 billion rows loaded from 42 terabytes of CSV files.

Full details of the TPC-DS schema and queries, including business descriptions of each query, can be found in the TPC Benchmark™ DS Specification. To test examples of different types of queries, consider:

| Type | Queries |
|------|---------|
| Interactive (1-3 months of data scanned) — Simple star-join queries | 19, 42, 52, 55 |
| Reporting (1 year of data scanned) — Simple star-join queries | 3, 7, 53, 89 |
| Analytic (Multiple years, customer patterns) — Customer extracts, star joins | 34, 34, 59 |
| Complex — Fact-to-fact joins, windows, extensive subqueries | 23, 36, 64, 94 |
  • At 10 TB scale, the full set of 99 queries should complete in well under 2 hours on a Snowflake 2X-Large virtual warehouse.
  • At 100 TB, we recommend using the largest size virtual warehouse available. For example, on a 3X-Large warehouse, you can expect all 99 queries to complete within 7 hours.

Note that if you plan to run identical queries multiple times or concurrently, be sure to disable result caching in Snowflake when you run tests by adding the following to your script:

alter session set use_cached_result = false;

TPC-DS Benchmark Kit and Working with Date Ranges

While we provide samples of the 99 queries containing specific parameter values, the TPC-DS Benchmark Kit includes tools for generating random permutations of parameters for each query — which is what we use in our internal testing.

In all queries, the date ranges are supplied using predicates on the DATE_DIM table — as specified by the TPC-DS benchmark — rather than using Date Key restrictions directly on the large fact tables (a strategy that some vendors have used to unrealistically simplify queries). If you want to create variations on these queries without using the benchmark kit, you can create versions that scan different ranges by changing the year, month and day restrictions in the WHERE clauses.
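
As an illustration only (not one of the official benchmark queries), here is a minimal star-join sketch in the spirit of the interactive queries, restricting the scan through DATE_DIM rather than through date keys on the fact table; column names follow the TPC-DS specification, and you can vary the year and month to change the range scanned:

select d.d_year,
       i.i_brand,
       sum(ss.ss_ext_sales_price) as total_sales
from snowflake_sample_data.tpcds_sf10tcl.store_sales ss
  join snowflake_sample_data.tpcds_sf10tcl.date_dim d
    on ss.ss_sold_date_sk = d.d_date_sk
  join snowflake_sample_data.tpcds_sf10tcl.item i
    on ss.ss_item_sk = i.i_item_sk
where d.d_year = 2000
  and d.d_moy = 12
group by d.d_year, i.i_brand
order by total_sales desc
limit 10;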

Conclusion

TPC-DS data (and other sample data sets) are made available to you through Snowflake’s unique Data Sharing feature, which allows the contents of any database in Snowflake to be shared with other Snowflake customers without requiring copies of the data.

We hope you enjoy working with this demanding and diverse workload, and invite you to compare your Snowflake results with other platforms.

And, be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb) for all the news and happenings here at Snowflake.

Rethink What You Know About Creating a Data Lake for JSON

Over the last 10 years, the notion has been that to quickly and cost-effectively gain insights from a variety of data sources, you need a Hadoop platform. Sources of data could be weblogs, clickstreams, events, IoT and other machine-born JSON or semi-structured data. The proposition with Hadoop-based data processing is having a single repository (a data lake) with the flexibility, capacity and performance to store and analyze an array of data types.

It shouldn’t be complicated

In reality, analyzing data with a Hadoop-based platform is not simple. Hadoop platforms start you with an HDFS file system, or equivalent. You then must piece together about a half-dozen software packages (minimum) just to provide basic enterprise-level functionality, such as provisioning, security, system management, data protection, database management and the necessary interface to explore and query data.

Despite the efforts of open-source communities to provide tools that bring Hadoop platforms up to the highest enterprise-class level, there is a constant need for highly skilled resources to keep Hadoop up and running while enabling users to do more than just explore data. This all adds up to unnecessary complexity.

A much simpler proposition

Snowflake, which is built for the cloud and delivered as a service, provides you with a different option for handling JSON and semi-structured data. Just point your data pipelines to Snowflake, land the data in our elastic storage repository and you have instant access to a bottomless data lake. You also have access to a full-fledged data warehouse. With Snowflake, you can easily load JSON and query the data with relational, robust SQL. You can mix JSON with traditional structured data and data from other sources, all from within the same database. Moreover, you can also support endless concurrent analytic workloads and work groups against the JSON data. Whether it is one workload or 1,000 workloads, Snowflake can handle it all with ease.
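
As a rough sketch of what that looks like in practice (the table and stage names here are hypothetical), you land raw JSON in a VARIANT column and query it with ordinary SQL:

-- Land raw JSON in a VARIANT column (stage name is hypothetical)
create or replace table click_events (event variant);

copy into click_events
  from @my_json_stage
  file_format = (type = 'json');

-- Query nested JSON attributes with path notation and casts
select event:page::string    as page,
       event:user.id::number as user_id,
       count(*)              as views
from click_events
group by 1, 2
order by views desc;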

As a combined data lake and data warehouse platform, Snowflake allows you to do much more. Read more about it in our new eBook, Beyond Hadoop: Modern Cloud Data Warehousing.

Try Snowflake for free. Sign up and receive $400 of free usage credits. You can create a sandbox or launch a production implementation from the same Snowflake account.

Rethink what you’ve been told

Hadoop is not a prerequisite for gaining insights from JSON or other machine data.

When you need to store, warehouse and analyze JSON and other machine data, rethink what you’ve been told. Snowflake easily allows you to develop insights or uncover relationships that can drive your business forward. You can support all of your structured and semi-structured data warehousing and analytic workload needs with a single tool that is built for the cloud and is ACID-compliant. Unlike a Hadoop platform, which often requires special skills to operate, Snowflake is a fully relational SQL environment that uses the familiar semantics and commands known to millions of SQL users and programmers, and thousands of SQL tools.

Be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb and @miclnixon1) for all the news and happenings here at Snowflake.


New Snowflake features released in Q2’17

It has been an incredible few months at Snowflake. Along with the introduction of self-service and numerous other features added in the last quarter, we have witnessed:

  • Our customer base has grown exponentially with large numbers of applications in full production.
  • Billions of analytical jobs successfully executed this year alone, with petabytes of data stored in Snowflake today, and without a single failed deployment to-date.
  • A strong interest in pushing the boundaries for data warehousing even further by allowing everyone in organizations to share, access and analyze data.

Continuing to engage closely with our customers during this rapid growth period, we rolled out key new product capabilities throughout the second quarter.

Instantly sharing data without limits – Introducing the Data Sharehouse  

Giving users the ability to easily and securely share data in the right format without the need for cumbersome data integration pipelines.

One of our highlights was the general availability of Snowflake’s Data Sharing feature. It allows multiple Snowflake customers to instantly share data with each other.

With data sharing, there is no need to use the expensive, insecure, and often complicated and error-prone procedure of transferring large numbers of single files from one location to another. Instead, customers can now simply share data via SQL statements and secure views fully integrated with our role-based access control model.

To learn more about how Snowflake customers are already leveraging this capability to build new products and features to drive their business, we encourage you to read:

Improving out-of-the box performance & SQL programmability  

Our ongoing mission is to build the fastest database for data warehousing with the SQL you love and no concurrency limits.

  • We continued to make end-to-end query execution faster with more efficient pruning for sub-columns in VARIANT types (JSON), general JOIN performance improvements, and faster global median calculations.
  • We addressed popular customer requests for improved programmability via SQL by:
  • Customers who are using our multi-cluster warehouse auto-scale feature for spiking workloads can now specify up to 10 compute clusters. This allows running hundreds of queries without any query queuing.

Staying ahead with enterprise-ready security and compliance

From day one, security has always been core to Snowflake’s design.

  • One of our exciting new capabilities this quarter is the general availability of customer-managed keys which added an additional layer of security for customers with sensitive data. This feature is the primary component of Tri-Secret Secure, a Snowflake Enterprise Edition for Sensitive Data (ESD) feature. You can find more details in our engineering blog about customer-managed keys in Snowflake.  
  • We also improved the ability for users to monitor and filter the output of the QUERY_HISTORY Information Schema table function by more specific SQL command types.
  • After its preview in Q1, secure views reached general availability in Q2.
  • In terms of certification, Snowflake received PCI DSS compliance – a common requirement for customers in banking, financial services, retail, services, and more. Customers and prospects who have PCI requirements will need to subscribe to the Snowflake Enterprise Edition for Sensitive Data (ESD).  

Improving our ecosystem and data loading  

Enabling developers and builders to create applications with their favorite tools, drivers, and languages remains a top priority.

  • For data warehouse automation, Wherescape added support for Snowflake.
  • We enhanced our parallel data loading & unloading via the COPY command; developers can now:
  • Expanding our driver support, we announced a preview version of our open-source Go driver, available in Snowflake’s Github repo.

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the service itself.

  • Snowflake users can now set up MFA via the UI; they no longer need to reach out to our support team to enable the feature.
  • Building on the general availability of Query Profile in Q1, we added a number of additional usability enhancements that can be leveraged to better understand where time is spent during query execution.   
  • The AWS key ID is now displayed in the DESC STAGE output and in the web interface.
  • We added support for leveraging file extensions in data unloading operations, and changed the default behavior when loading files containing byte order marks: we now detect and skip the marks instead of throwing an error because the data could not be converted to the proper data types.
  • To allow Snowflake users to better control consumption of compute resources, we also enhanced resource monitors (currently in preview). Users can now explicitly assign them to virtual warehouses and specify certain actions if a credit threshold is met or exceeded. Please stay tuned for a separate blog on this important capability.  
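
As a minimal sketch of how that assignment could look (the warehouse name, monitor name and quota are hypothetical, and the syntax may evolve while the feature is in preview):

-- Requires a role with the appropriate privileges (typically ACCOUNTADMIN)
create resource monitor analytics_rm with credit_quota = 100
  triggers on 90 percent do notify
           on 100 percent do suspend;

alter warehouse analytics_wh set resource_monitor = analytics_rm;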

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • Given our rapid growth since the beginning of this year, we continued working on product scale, reliability, and availability to ensure our readiness for the next phase of hyper-growth.
  • We’re already seeing our efforts to automate all operations, particularly Snowflake deployments, pay off. We were able to quickly roll out a new region (US East – Virginia in Q2) while working on additional deployments around the globe at the same time.

Acknowledgements and conclusion

As always, we want to first thank our customers for their continuous feedback. Additionally, we want to recognize that moving rapidly while also scaling would not be possible without our mighty Engineering Team, which has proven once again that it’s possible to ship high-quality features while serving our existing users’ needs at the same time.

Now, onwards to Q3 and the next set of exciting capabilities you will hear about very soon!

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

3 Ways a Data Sharehouse™ Can Catapult Your Data-Driven Initiatives

How would your organization benefit from effortlessly sharing limitless amounts of data with business partners and commercializing that data to share with other organizations?

With most data sharing methods today, the best you can do is imagine these benefits because it’s cumbersome, time consuming and costly to share even small slices of data. If you share data using an FTP approach, you will spend time deconstructing, scripting, securing and governing the data, and your data consumers will spend time reconstructing, scripting and rebuilding the data. Sharing data via email can be very slow and insecure. Email is not even a practical option for large data sizes and is exponentially more difficult if sharing a large database.

Sharing data using a basic cloud storage service is equally inefficient, including the inability for you or your data consumers to query the data directly. And if you want to enable “direct” data queries, without loading, you will likely evaluate a traditional cloud data warehouse or Hadoop platform that is “integrated” with a data lake storage platform. Do you think this will be simple? Not so fast! You’ll have to factor in the need to manage a separate data catalogue and external tables to share data, or you’ll have to contend with data inconsistency and performance issues.

These limitations make it difficult for traditional data sharing options to take you to the next level of capabilities, which goes beyond finding insights to commercializing your data. If any of this feels familiar, consider these three reasons why Snowflake Data Sharing – the technology making it possible to create a Data Sharehouse™ – is so compelling.

1 – Drive simplicity: Allocate more of your time and resources to strategic data sharing projects

Traditional options to share data create unnecessary data processing and platform complexity. Complexity adds burdens and requires a lot of extra time and resources, including infrastructure costs, for you and for your data consumers. 

Snowflake Data Sharing, on the other hand, is an all-in-one, data warehouse-as-service solution that makes sharing data simple and easy. The vision for the Snowflake architecture, from day one, featured separation of compute, storage, and services to allow unlimited scaling, concurrency and data sharing.  


For example, with Snowflake, it is not necessary to manage an external data catalogue, and there is no separate effort required to build data security. It’s all built-in. In addition, Snowflake metadata management makes it possible to keep tabs on all data warehouse activities. To share data, follow these steps:

  • Point your pipeline to land your data into Snowflake and set up your warehouse.
  • To share any portion of your data warehouse, use Snowflake’s included SQL client to CREATE a new database shell that can reference the entire database or any portion of it, along with a reference list of the objects you wish to share.
  • Issue GRANT statements that enable access to the shared database and any objects referenced within “the share” (see the SQL sketch below).

These are all the steps you will need to quickly and easily share data. All of the related metadata activity, including cataloguing, happens automatically within the Snowflake data warehouse service. No additional infrastructure or separate cataloguing effort required.
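
A minimal sketch of those steps in SQL might look like the following; the account, database and object names are hypothetical, and your share will reference your own objects:

-- Provider side: create the share and grant access to the objects in it
create share sales_share;
grant usage on database sales to share sales_share;
grant usage on schema sales.public to share sales_share;
grant select on view sales.public.regional_summary to share sales_share;
alter share sales_share add accounts = partner_account;

-- Consumer side: create a database from the share and query it directly
create database sales_from_partner from share provider_account.sales_share;
select * from sales_from_partner.public.regional_summary limit 10;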

2 – Enable more valuable insights: Your data consumers will always operate with live and current data

Most options to share data require you to unload the data from your data warehouse in order to transmit or email it to your data consumers. A cloud data warehouse platform that avoids this step instead relies on a physically separate storage pool in order to scale and share data. The downside with either option is that the data set is read-only and disconnected from any updates that occur within the data warehouse. If you modify the data set, or if the data set is regularly updated, you must reload (ETL) it into the data warehouse, operate on it, and then unload it again to retransmit or place it in the shared storage pool.

For data consumers that received or were connected to the old data set, this means there is a period of time during which they’ll be exposed to stale and inconsistent data. Fresh data won’t be available until a new transmission is made or until a new connection is made to the fresh data. No one wants to run analytics on stale data.

Snowflake Data Sharing delivers a better choice. Because data sets shared with data consumers within the Snowflake environment are live, in real time, and come directly from the data warehouse, data consumers immediately see fresh data as soon as an update or transaction is committed and successfully executed. Data consistency is maintained without any extra effort from you, and your data consumers will not have to struggle with the decision to run analytics now or wait for a new update.

3 – Support any scale, with ease: Seamlessly and cost-effectively share data with any number of data consumers

At the end of the day, it’s all about growing and expanding your business and services while providing excellent experiences for the customers and consumers of your data. If you anticipate sharing data with tens, hundreds, or thousands of data consumers, each of which with unique data sharing requirements, how can you easily support this? And support the growth without manually building more clusters, managing external tables or metadata stores, suffering through performance penalties or creating data inconsistencies? It would be very difficult or impossible with other architecture approaches to accomplish these objectives.

Snowflake data sharing allows you to scale and easily add data consumers and specify granular secure views, all at the highest performance profile and with data consistency. On the other end, with Snowflake, your data consumers can immediately use and query the shared data, also at the highest performance profile.

Summary: Share and imagine more

These are just a few compelling examples of how Snowflake Data Sharing, and the Data Sharehouse approach, can transform the creation of high-value business assets from data. It’s a new, exciting, powerful and easy-to-use feature of all Snowflake data warehouse deployments.

Also check out data sharing blogs from Snowflake customers Snagajob and Playfab. To learn more, you can easily sign up for Snowflake here. Or jump to our webpage to review our data sharing ebook, white paper, demo, and other resources. In future posts, we’ll cover more details about the technical capabilities and business-value advantages of Snowflake Data Sharing – built for the cloud. Be sure to follow us on Twitter (@snowflakedb and @miclnixon1) for all the news and happenings here at Snowflake. Stay tuned!

Understanding Your Snowflake Utilization: Part 3 – Query Profiling

This article is the third in a three-part series to help you utilize the functionality and data in Snowflake’s Information Schema to better understand and more effectively use Snowflake.

As a Customer Success Engineer, my daily job entails helping our customers get the most value from our service. In my first post, I discussed getting a handle on your utilization of compute resources by using various Information Schema views and functions to profile your virtual warehouse usage. In my second post, I showed you how to get a handle on your storage usage.

In this final post, I will deep-dive into understanding query profiling. To do this, I will show you examples using the QUERY_HISTORY family of functions. I will also show you a handy page in the UI that provides a graphical view of each query. Keep in mind that you’ll need warehouse MONITOR privileges to perform the tasks described in this post. Typically, the SYSADMIN role has the necessary warehouse MONITOR privileges across your entire account; however, other lower-level roles may also have the necessary privileges.

Ready to get started? Here we go!

Query History Profiling

Query profiling is perhaps one of the more popular topics I field questions about. Many customers are interested in improving their query performance. Although every development team should strive to periodically refactor their code, many find it challenging to determine where to start. Going through this analysis should help with identifying a good starting point.

Let’s look at some syntax, per our documentation for QUERY_HISTORY:

select *
from table(information_schema.query_history(dateadd('hours',-1, current_timestamp()),current_timestamp()))
order by start_time;

This query provides a view into all of the queries run by the current user in the past hour:

Query Profiling - History

We can also leverage the QUERY_HISTORY companion functions to narrow down your focus:

  • QUERY_HISTORY_BY_SESSION
  • QUERY_HISTORY_BY_USER
  • QUERY_HISTORY_BY_WAREHOUSE

These are particularly useful if you have identified specific workflow issues you need to address.

Profiling Tip #1: Using HASH()

Now for a particularly useful tip: utilizing HASH on the QUERY_TEXT column can help you consolidate and group on similar queries (the HASH function will return the same result if any queries are exactly the same). As a general rule, identifying query groups and finding the max and min query runtime should help you sort through specific workflows. In the example below, I’m doing an analysis on average compile and execution time. Additionally, I’m collecting a count of the queries with the same syntax:

select hash(query_text), query_text, count(*), avg(compilation_time), avg(execution_time)
from table(information_schema.query_history(dateadd('hours',-1,current_timestamp()),current_timestamp()))
group by hash(query_text), query_text
order by count(*) desc;

Output:

Query Profiling - Query Groups

Using the HASH function further allows a user to easily query a particular instance of this query from the QUERY_HISTORY function. In the example above, I could check for specific queries where the HASH of the query text converts to the value -102792116783286838. For example:

select *
from table(information_schema.query_history())
where hash(query_text) = -102792116783286838
order by start_time;

Output:

Query Profiling - Single Query (SQL)

The above result shows you all of the times you have issued this particular query (going back 7 days). Pay specific attention to the following columns:

  • COMPILATION_TIME
  • EXECUTION_TIME
  • QUEUED (times)

If a query is spending more time compiling (COMPILATION_TIME) than executing (EXECUTION_TIME), perhaps it is time to review the complexity of the query. Snowflake’s query compiler will optimize your query and identify all of the resources required to perform the query in the most efficient manner. If a query is overly complex, the compiler needs to spend more time sorting through the query logic. Take a look at your query and see if there are many nested subqueries or unnecessary joins. Additionally, if there are more columns being selected than required, then perhaps be more specific in your SELECT statement by specifying certain columns.
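
For example, a small variation on the earlier query can surface statements from the past hour where compilation took longer than execution:

select query_text,
       compilation_time,
       execution_time
from table(information_schema.query_history(dateadd('hours',-1,current_timestamp()),current_timestamp()))
where compilation_time > execution_time
order by compilation_time desc;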

QUEUED time is interesting because it could be an indicator about your warehouse size and the amount of workload you’ve placed on the warehouse. Snowflake is able to run concurrent queries and it does a very good job in doing so. However, there will be times when a particularly large query will require more resources and, thus, cause other queries to queue as they wait for compute resources to be freed up. If you see a lot of queries spending a long time in queue, you could either:

  • Dedicate a warehouse to these large complex running queries, or
  • Utilize Snowflake’s multi-clustering warehouse feature to allow more parallel execution of the queries. For more information about multi-cluster warehouses, see the Snowflake documentation.
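
If you go the multi-cluster route, the change is just a warehouse setting. A minimal sketch, assuming a hypothetical warehouse name and an Enterprise Edition account:

alter warehouse reporting_wh set
  min_cluster_count = 1
  max_cluster_count = 4;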

In the recent updates to our QUERY_HISTORY_* Information Schema functions, we have added more metadata references to the results and now you should have a range of metadata at your disposal:

  • SESSION_ID
  • USER_NAME , ROLE_NAME
  • DATABASE_NAME , SCHEMA_NAME
  • WAREHOUSE_NAME , WAREHOUSE_SIZE , WAREHOUSE_TYPE

These columns will help you identify the origin of the queries and help you fine tune your workflow. A simple example would be to find the warehouse with the longest-running queries. Or, find the user who typically issues these queries.
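
For instance, a rough rollup by warehouse over the past day (times are reported in milliseconds) could look like this:

select warehouse_name,
       count(*)            as query_count,
       avg(execution_time) as avg_execution_time_ms,
       max(execution_time) as max_execution_time_ms
from table(information_schema.query_history(dateadd('hours',-24,current_timestamp()),current_timestamp()))
group by warehouse_name
order by max_execution_time_ms desc;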

Profiling Tip #2: Using the UI

Once you have identified a particular query you would like to review, you can copy its QUERY_ID value and use this value to view its query plan in Snowflake’s Query Profile. To do this, click on the History icon, add a QUERY ID filter, and paste the QUERY_ID in question. For example:

Query Profile - Using the UI

Hint: If you don’t see a result, make sure you are using a role with the necessary warehouse MONITOR privilege (e.g. SYSADMIN or ACCOUNTADMIN) and you’ve selected the correct QUERY ID.

Once the search is complete, you should be able to click on the link provided under the Query ID column to go to the query’s detail page:

Query Profiling - Profile Page in UI

Now click on the Profile tab.

You should see a visualization of the Query Profile. In my example, Snowflake shows that this particular query executed in two steps:

Step 1:

Query Profiling - Profile Step 1 in UI

Step 2:

Query Profiling - Profile Step 2 in UI

The online documentation provides in-depth details on how to interpret this view. Pay particular attention to the orange bar in this view. It indicates the percentage of overall query time spent on this particular process. In this instance, we can see our query spent most of the time reading data from the table. If we run this query often enough, we should see this time decrease because we’ll be reading the data from cache instead of disk.

Conclusion

By utilizing the UI and the Information Schema functions and views described in this post, you can profile your queries to help you understand your current workflow and identify queries that can be better optimized. This will help save you money in the long run and also improve your user experience. Snowflake will continue to invest in tools like these to help our users better understand and use our platform.

To dig in some more on this subject, check out our online documentation:

I hope this article and this series gave you some inspiration for how you would like to manage your Snowflake instance. There are a lot of options to play with and they’re all intended to provide you with the flexibility and control you need to best use Snowflake. Please share your thoughts with us! We would love to help you on your journey to the cloud.

For more information, please feel free to reach out to us at info@snowflake.net. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

New Snowflake features released in Q1’17

We recently celebrated an important milestone in reaching 500+ customers since Snowflake became generally available in June 2015. As companies of all sizes increasingly adopt Snowflake, we wanted to look back and provide an overview of the major new Snowflake features we released during Q1 of this year, and highlight the value these features provide for our customers.

Expanding global reach and simplifying on-boarding experience

Giving our customers freedom of choice, along with a simple, secure, and guided “Getting Started” experience, was a major focus of the last quarter.

  • We added a new region outside of the US; customers now have the option to analyze and store their data in Snowflake accounts deployed in EU-Frankfurt. Choosing the appropriate region is integrated into our self-service portal when new customers sign up.
  • In addition, we added our high-value product editions, Enterprise and Enterprise for Sensitive Data (ESD), to our self-service offerings across all available regions. For example, with Enterprise, customers can quickly implement auto-scale mode for multi-cluster warehouses to support varying, high concurrency workloads. And customers requiring HIPAA compliance can choose ESD.
  • Exploring other venues for enabling enterprises to get started quickly with Snowflake, we partnered with the AWS Marketplace team to include our on-demand Snowflake offerings, including the EU-Frankfurt option, in their newly-launched SaaS subscriptions.

Improving out-of-the-box performance & SQL coverage

We are committed to building the fastest cloud DW for your concurrent workloads with the SQL you love.

  • One key performance improvement introduced this quarter was the reduction of compilation times for JSON data. Internal TPC-DS tests demonstrate a 30-60% reduction for most of the TPC-DS queries (single stream on a single, 100TB JSON table). In parallel, we worked on improving query compile time in general, providing up to a 50% improvement in performance for short queries.
  • Another new key capability is the support for bulk data inserts on a table concurrently with other DML operations (e.g. DELETE, UPDATE, MERGE). By introducing more fine-grained locking at the micro-partition level, we are able to allow concurrent DML statements on the same table.
  • To improve our data clustering feature (currently in preview), we added support for specifying expressions on table columns in clustering keys. This enables more fine-grained control over the data in the columns used for clustering.
  • Also, we reduced the startup time for virtual warehouses (up to XL in size) to a few seconds, ensuring almost instantaneous provisioning for most virtual warehouses.
  • We extended our SQL by adding support for the ANSI SQL TABLESAMPLE clause. This is useful when a user wants to limit a query operation performed on a table to only a random subset of rows from the table.
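
For example (the table name here is hypothetical), sampling roughly 10% of a table’s rows:

select count(*)
from orders tablesample bernoulli (10);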

Staying Ahead with Enterprise-ready Security

From day one, security has always been core to Snowflake’s design.

  • We expanded Snowflake’s federated authentication and single sign-on capability by integrating with many of the most popular SAML 2.0-compliant identity providers. In addition to Okta, Snowflake now supports ADFS/AD, Azure AD, Centrify, and OneLogin, to name just a few.
  • To advance Snowflake’s built-in auditing, we introduced new Information Schema table functions (LOGIN_HISTORY and LOGIN_HISTORY_BY_USER) that users can query to retrieve the short-term history of all successful and failed login requests in the previous 7 days. If required, users can maintain a long-term history by copying the output from these functions into regular SQL tables.
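
A quick sketch of reviewing that history (the exact output columns are described in the documentation):

select event_timestamp, user_name, is_success, error_message
from table(information_schema.login_history(dateadd('days',-7,current_timestamp()),current_timestamp()))
order by event_timestamp;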

Improving our ecosystem

Enabling developers and builders to create applications with their favorite tools and languages remains a high priority for us.

  • With respect to enterprise-class ETL, we successfully collaborated with Talend in building a native Snowflake connector based on Talend’s new and modern connector SDK. The connector, currently in preview, has already been deployed by a number of joint customers with great initial feedback on performance and ease-of-use.
  • To tighten the integration of our Snowflake service with platforms suited for machine learning and advanced data transformations, we released a new version of our Snowflake Connector for Spark, drastically improving performance by pushing more query operations, including JOINs and various aggregation functions, down to Snowflake. Our internal 10 TB TPC-DS performance benchmark tests demonstrate that running TPC-DS queries using this new v2 Spark connector is up to 70% faster compared to executing SQL in Spark with Parquet or CSV (see this Blog post for details).
  • We continue to improve our drivers for our developer community. Listening to feedback from our large Python developer community, we worked on a new version of Snowflake’s native Python client driver, resulting in up to 40% performance improvements when fetching result sets from Snowflake. And, after we open-sourced our JDBC driver last quarter, we have now made the entire source code available on our official GitHub repository.
  • And, last, but not least, to enhance our parallel data loading via the COPY command, ETL developers can now dynamically add file metadata information, such as the actual file name and row number, which might not be part of the initial payload.

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the running service.

  • One major addition to our service is Query Profile, now generally available and fully integrated into Snowflake’s web interface. Query Profile is a graphical tool you can use to detect performance bottlenecks and areas for improving query performance.
  • Various UI enhancements were implemented: Snowflake’s History page now supports additional filtering by the actual SQL text and query identifier. We also added UI support for creating a Parquet file format in preparation for loading Parquet data into variant-type table columns in Snowflake.
  • A new Information Schema table function (TABLE_STORAGE_METRICS) exposes information about the data storage for individual tables. In particular, a user can now better understand how tables are impacted by Continuous Data Protection, particularly Time Travel and Fail-safe retention periods, as well as which tables contain cloned data.
  • We also recently introduced smarter virtual warehouse billing through Warehouse Billing Continuation (see this Blog post for details). If a warehouse is suspended and resumed within 60 minutes of the last charge, we do not charge again for the servers in the warehouse. WBC eliminates additional credit charges, and we hope it will reduce the need for our customers to strictly monitor and control when warehouses are suspended and resized.

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • As part of rolling out the new EU (Frankfurt) region, we increased automation of our internal deployment procedures to (a) further improve engineering efficiency while (b) laying the foundation for rapidly adding new regions based on customer feedback.
  • We further streamlined and strengthened our various internal testing and pre-release activities, allowing us to ship new features to our customers on a weekly basis – all in a fully transparent fashion with no downtime or impact to users.

Conclusion and Acknowledgements

This summary list of features delivered in Q1 highlights the high velocity and broad range of features the Snowflake Engineering Team has successfully delivered in a short period of time. We are committed to putting our customers first and maintaining this steady pace of shipping enterprise-ready features each quarter. Stay tuned for another feature-rich Q2.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Virtual Warehouse Billing Made Smarter

One of our most important commitments to our users is reducing/eliminating the management and tuning tasks imposed by other systems. We are always on the lookout for new and innovative ways of making our service easier to use.

Most recently, we looked at how our users manage their Snowflake virtual warehouses and realized we could be smarter about how we charge for the compute resources used in these warehouses.

First, Some Background Information…

To begin, here are some details about what happens when a virtual warehouse (or simply “warehouse”) is resumed and suspended in Snowflake:

  1. Each time the warehouse is resumed:
    • Snowflake provisions servers from AWS EC2 for a minimum of one hour. This is based on how AWS charges for EC2 resources.
    • The number of Snowflake credits charged depends on the number of servers provisioned, which is determined by the size of the warehouse (XS, S, M, L, XL, etc.) and the number of clusters in the warehouse (if multi-cluster warehouses are enabled).
  2. When the warehouse is suspended, the servers are shut down and credits are no longer charged.
  3. When the warehouse is resumed again, the servers are re-provisioned and the charges start over, regardless of how much time has passed since the warehouse was last charged. As a result, if the same warehouse is resumed multiple times within the same hour, credits are charged each time. As stated earlier, this follows the AWS EC2 model.

What We Changed

We noticed that some of our users were spending time and effort managing their virtual warehouses to ensure that credits were not consumed unnecessarily. We decided to eliminate this extra work by introducing Warehouse Billing Continuation (WBC).

With WBC, we now track the last time each individual server in a warehouse was charged and, if the warehouse is suspended and resumed within 60 minutes of the last charge, we don’t charge again for the server. The charge is continued from the last time as if the warehouse had never been suspended. This eliminates any extra charges, thereby reducing the need for strictly monitoring and controlling when warehouses are suspended and resumed.

How Does WBC Work?

The simple answer is it just works, regardless of how often you resume and suspend your virtual warehouses. If this answer satisfies your curiosity, you can skip now to the end of this post. Otherwise, read on for the gory details…

The best way to explain WBC is with examples. Say you have a Small warehouse (2 servers) that’s been suspended for longer than an hour. Now, imagine this warehouse goes through the following status changes:

| Resumed | Suspended | Credits Charged before WBC | Credits Charged with WBC |
|---------|-----------|----------------------------|--------------------------|
| 09:15 | 09:25 | 2 | 2 |
| 09:40 | 09:50 | 2 | |
| 10:05 | 10:10 | 2 | |
| 10:30 | 10:50 | 2 | 2 |
| 11:20 | 11:40 | 2 | 2 (at 11:30) |

Before WBC, every time the warehouse was resumed, we charged 2 credits (1 credit per server in the warehouse). For example, in the scenario described above, between 09:15 and 11:40, the warehouse incurred 5 charges for a total of 10 credits.

With WBC, the behavior is different. The warehouse only incurs 3 charges for a total of 6 credits. The following diagram provides a more detailed explanation of what actually happens:

Figure 1: Example of charges for resuming and suspending a Small virtual warehouse
  1. At 09:15, 2 credits are charged for the servers in the warehouse because the warehouse has been suspended for longer than an hour and there’s no previous charge to continue.
  2. At 09:40 and 10:05, we recognize that the warehouse was charged within the last hour so no additional charges are incurred. In other words, the current charge doesn’t expire until 60 minutes after it was first incurred, so the earliest the warehouse will be charged again is 10:15, regardless of how many times the warehouse is suspended and resumed during this period.
  3. At 10:15, the warehouse isn’t charged because it’s not running at that time.
  4. At 10:30, the 2 servers are charged again because more than 60 minutes have elapsed since the initial charge. More importantly, this new time is now used to track all subsequent charges, i.e. the earliest time for the next charge is 11:30.
  5. At 11:20, no charge is incurred because 60 minutes haven’t elapsed since the last charge.
  6. At 11:30, the warehouse is charged again because 60 minutes have now elapsed since the last charge and the warehouse is running.

This example covers a relatively simple case for a Small warehouse. Each successively larger warehouse has more servers, so the scenarios are slightly more involved, especially if the warehouse is resized (or multi-cluster warehouses are being used); however, the mechanics and calculations are all the same. The most important thing to remember is that every server in a warehouse is charged independently.

WBC Example with Virtual Warehouse Resizing

Consider the same example from earlier, but starting with a Medium warehouse (4 servers) and the following resize events:

  • Warehouse resized down to Small (2 servers) at 09:30, 10:00, and 11:00.
  • Warehouse resized back to Medium (4 servers) at 09:45, 10:45, and 11:15.
| Resumed | Credits Charged for Resume | Resized | Credits Charged for Resize |
|---------|----------------------------|---------|----------------------------|
| 09:15 | 4 | 09:30 (Small) | |
| 09:40 | | 09:45 (Medium) | |
| | | 10:00 (Small) | |
| 10:05 | | | |
| 10:30 | 2 | 10:45 (Medium) | 2 |
| | | 11:00 (Small) | |
| | | 11:15 (Medium) | |
| 11:20 | 2 (at 11:30) | | 2 (at 11:45) |

The total credits charged would be (4+2+2) + (2+2) = 12. The following diagram shows how the charges are incurred (remember that I warned you earlier about the gory details):

Figure 2: Example of charges for resuming and suspending a Medium virtual warehouse with resizing
  1. At 09:15, the initial charge for the Medium warehouse is 4 credits.
  2. At 10:30, the new charge is 2 credits (reflecting the Small size at the time the warehouse is resumed). These 2 servers will be charged next in 60 minutes at 11:30.
  3. At 10:45, a new, separate charge for 2 additional servers is incurred (due to the resize to Medium). These 2 additional servers will be charged next at 11:45, independently of the other 2 servers.
  4. All the other resizing events increase or decrease the number of servers running at that time, but incur no additional charges.

Note that this diagram also illustrates how we remove servers from a warehouse: we always start with the most recently provisioned servers (i.e. LIFO) and we add them back in the same order. This matters because each server is charged according to its individual start time, and because there’s no benefit to reducing the size of a warehouse within each hour that it runs; the servers have already been charged for the hour.

So, What Next?

There’s no need to make any changes to your virtual warehouses. We’ve enabled Warehouse Billing Continuation by default for all accounts and warehouses, including any existing warehouses. In fact, we implemented WBC near the end of March so you probably have already noticed reduced charges for some of your warehouses, particularly the ones that you resume and suspend frequently.

However, to take full advantage of this change, you might want to revisit your warehouse procedures and settings. For example, you can now set auto-suspend for a warehouse to a shorter value (e.g. 5 or 10 minutes) without worrying about being charged additional credits each time the warehouse resumes within the hour.
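
For instance, a one-line sketch with a hypothetical warehouse name (AUTO_SUSPEND is specified in seconds):

alter warehouse etl_wh set auto_suspend = 300;  -- suspend after 5 idle minutes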

Interested in learning more? Check out the Snowflake documentation for more in-depth information about warehouses:

Also, keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Understanding Your Snowflake Utilization, Part 2: Storage Profiling

This article is the second in a three-part series to help you use Snowflake’s Information Schema to better understand and effectively utilize Snowflake.

As a Customer Success Engineer, my daily job entails helping our customers get the most value from our service. And I’m now passing along some of what I’ve learned to help you become more self-sufficient. In my first post, I discussed getting a handle on your utilization of compute resources by using various Information Schema views and functions to profile your virtual warehouse usage.

In this post, I provide a deep-dive into understanding how you are utilizing data storage in Snowflake at the database, stage, and table level. To do this, I will show you examples of two functions and a view provided in the Information Schema for monitoring storage usage. I will also show you a handy page in the UI that provides an account-level view of your storage. Keep in mind that you need ACCOUNTADMIN access to perform any of the tasks described in this post.

Let’s get started.

Summary Storage Profiling in the UI

Before diving into our detailed analysis of data storage, let’s take a quick look at the summary, account-level storage view provided by Snowflake. As a user with the ACCOUNTADMIN role, you can navigate to the Account page in the Snowflake UI to get a visual overview of the data storage for your account.


This page provides a view, by month, of the average and daily storage usage across your entire account. You can use the filters on the page to filter by database, Snowflake stage, and data maintained in Fail-safe (for disaster recovery).

Detailed Storage Profiling Using the Information Schema

The Snowflake Information Schema provides two functions and one view for monitoring detailed data storage at the database, stage, and table level:

  • DATABASE_STORAGE_USAGE_HISTORY (function)
  • STAGE_STORAGE_USAGE_HISTORY (function)
  • TABLE_STORAGE_METRICS (view)

The DATABASE_STORAGE_USAGE_HISTORY table function shows your database status and usage for all databases in your account or a specified database. Here’s an example of the usage over the last 10 days for a database named sales:

use warehouse mywarehouse;

select * from table(sales.information_schema.database_storage_usage_history(dateadd('days',-10,current_date()),current_date(),'SALES'));

Note that the above screenshot only displays some of the output columns. For full details about the output, see the online documentation. Also, per the Snowflake documentation:

If a database has been dropped and its data retention period has passed (i.e. database cannot be recovered using Time Travel), then the database name is reported as DROPPED_id.

At its core, the most useful insight from this function is the average growth in your database. Keep in mind, the output includes both AVERAGE_DATABASE_BYTES and AVERAGE_FAILSAFE_BYTES. Leveraging these data points to derive a percentage of Fail-safe over actual database size should give you an idea of how much you should be investing towards your Fail-safe storage. If certain data is not mission critical and doesn’t require Fail-safe, try setting these tables to transient. More granular information about Fail-safe data is provided in TABLE_STORAGE_METRICS, which we will look at more closely later in this post.
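
As a rough sketch of that analysis against the sales database used above (column names follow the function’s documented output), you can compute a daily Fail-safe ratio and, where appropriate, recreate non-critical tables as transient; the table names below are hypothetical:

select usage_date,
       average_database_bytes,
       average_failsafe_bytes,
       average_failsafe_bytes / nullif(average_database_bytes, 0) as failsafe_ratio
from table(sales.information_schema.database_storage_usage_history(dateadd('days',-10,current_date()),current_date(),'SALES'))
order by usage_date;

-- LIKE copies only the table structure; reload or CTAS the data as needed
create transient table raw_staging_t like raw_staging;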

Next, let’s look at STAGE_STORAGE_USAGE_HISTORY. This function shows you how much storage is being used for staged files across all your Snowflake staging locations, including named, internal stages. Note that this function does not allow querying storage on individual stages.

Here’s an example of staged file usage for the last 10 days:

select *
from table(sales.information_schema.stage_storage_usage_history(dateadd('days',-10,current_date()),current_date()));

Note that the above screenshot only displays some of the output columns. For full details about the output, see the online documentation.

Also note that you can only query up to 6 months’ worth of data using this function. Some of our users like to use Snowflake stages to store their raw data. For example, one user leverages table staging locations for their raw data storage just in case they need to access the data in the future. There’s nothing wrong with this approach, and since Snowflake compresses your staged data files, it certainly makes sense; just keep in mind that only the last 6 months of staged-storage history is available through this function.

Finally, the TABLE_STORAGE_METRICS view shows your table-level storage at runtime. This is a snapshot of your table storage which includes your active and Fail-safe storage. Additionally, you can derive cloned storage as well utilizing the CLONE_GROUP_ID column. As of today, this is the most granular level of storage detail available to users.

Here’s a general use example:

select *
from sales.information_schema.table_storage_metrics
where table_catalog = 'SALES';

Note that the above screenshot only shows a portion of the output columns. For full details about the output, see the online documentation.

One interesting analysis I’ve been helping our customers with is deriving how much of their table storage is based on cloned data. In Snowflake, cloning data has no additional costs (until the data is modified or deleted) and it’s done very quickly. All users benefit from “zero-copy cloning”, but some are curious to know exactly what percentage of their table storage actually came from cloned data. To determine this, we’ll leverage the CLONE_GROUP_ID column in TABLE_STORAGE_METRICS.

For example:

with storage_sum as (
  select clone_group_id,
         sum(owned_active_and_time_travel_bytes) as owned_bytes,
         sum(active_bytes) + sum(time_travel_bytes) as referred_bytes
  from concurrency_wh.information_schema.table_storage_metrics
  where active_bytes > 0
  group by 1)
select * , referred_bytes / owned_bytes as ratio
from storage_sum
where referred_bytes > 0 and ratio > 1
order by owned_bytes desc;

The ratio in the above query gives you an idea of how much of the original data is being “referred to” by the clone. In general, when you make a clone of a table, the CLONE_GROUP_ID for the original table is assigned to the new, cloned table. As you perform DML on the new table, your REFERRED_BYTES value gets updated.  If you join the CLONE_GROUP_ID back into the original view, you get the output of the original table along with the cloned table. A ratio of 1 in the above example means the table data is not cloned.

If you need to find out the exact table name from the above query, then simply join the CTE back to the TABLE_STORAGE_METRICS view and ask for the TABLE_NAME column.

For example:

with storage_sum as (
  select clone_group_id,
         sum(owned_active_and_time_travel_bytes) as owned_bytes,
         sum(active_bytes) + sum(time_travel_bytes) as referred_bytes
  from concurrency_wh.information_schema.table_storage_metrics
  where active_bytes > 0
  group by 1)
select b.table_name, a.*, referred_bytes / owned_bytes as ratio
from storage_sum a
join concurrency_wh.information_schema.table_storage_metrics b
on a.clone_group_id = b.clone_group_id
where referred_bytes > 0 and ratio > 1
order by owned_bytes desc;

Storage Profiling - Example

Conclusion

By utilizing the UI and the Information Schema functions and views described in this post, you can profile your data storage to help you keep your storage costs under control and understand how your business is growing over time. It’s a good idea to take regular snapshots of your storage so that you can analyze your growth month-over-month. This will help you both develop usage insights and take action.

To dig in some more on this subject, check out our online documentation:

I hope this article has given you some good ideas for how to manage your Snowflake instance. Look for Part 3 of this series in coming weeks where I will show you how to analyze your query performance. As already shown in Parts 1 and 2, there are a lot of options to play with in Snowflake and they’re all intended to give you the flexibility and control you need to best use Snowflake. Please share your thoughts with us!

Also, for more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Understanding Your Snowflake Utilization, Part 1: Warehouse Profiling

This is the first in a series of follow-up posts to Kent Graziano’s earlier post, Using the Snowflake Information Schema. This series will take a deeper dive into the Information Schema (Snowflake’s data dictionary) and show you some practical ways to use this data to better understand your usage of Snowflake.

As a Customer Success Engineer, much of my day-to-day activity involves profiling the accounts of some of our most important customers to understand exactly how they use the service. I measure their compute credit usage, storage growth, and query performance. Many of these customers ask me how they can implement similar profiling for their own Snowflake accounts.

The answer is to utilize the Information Schema. Aside from providing a set of detailed views into Snowflake’s metadata store, the Information Schema goes a step further and provides several powerful table functions that can be called directly in SQL. These functions can be used to return historical information about executed queries, data stored in databases and stages, and virtual warehouse (i.e. compute) usage.

In addition to these functions, I also recommend leveraging the recently implemented TABLE_STORAGE_METRICS view (also in the Information Schema) to dive even deeper into your analysis.

In this post, I will show you how to leverage these easy-to-use functions to gather detailed information about the usage of your virtual warehouses. So let’s get started.

Warehouse Profiling

To profile your current warehouse usage, use the WAREHOUSE_LOAD_HISTORY and WAREHOUSE_METERING_HISTORY functions. A good way to think about the relationship between these two functions is that the first one shows how much work was done over a period of time (load) and the second one shows the cost for doing the work (metering).

The syntax for calling these functions is simple, and the queries can be executed from a Worksheet in the Snowflake web interface. For example:

use warehouse mywarehouse;

select * from table(information_schema.warehouse_load_history(date_range_start=>dateadd('hour',-1,current_timestamp())));

select * from table(information_schema.warehouse_metering_history(dateadd('hour',-1,current_timestamp()),current_timestamp()));

The above queries show warehouse load and credits used for the past hour. Be sure to check out the Usage Notes section in the documentation for each function to understand all the requirements and rules. For example, the WAREHOUSE_LOAD_HISTORY function returns results in different intervals based on the timeframe you specify (an example follows the list):

  • 5-second intervals when the timeframe is less than 7 hours.
  • 5-minute intervals when the timeframe is greater than 7 hours.
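For instance, here’s a minimal sketch of a call that requests a two-day window and therefore returns rows at 5-minute intervals; the window itself is just an example:

-- Two-day window: results come back at 5-minute intervals.
select *
from table(information_schema.warehouse_load_history(
       date_range_start => dateadd('day', -2, current_timestamp()),
       date_range_end   => current_timestamp()));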

Here’s an example of the output from the WAREHOUSE_LOAD_HISTORY query against SNOWHOUSE, a warehouse that we use internally:

Looking at Warehouse Load History

Per our documentation:

  • AVG_RUNNING – Average number of queries executed.
  • AVG_QUEUE_LOAD – Average number of queries queued because the warehouse was overloaded.
  • AVG_QUEUE_PROVISION – Average number of queries queued because the warehouse was being provisioned.
  • AVG_BLOCKED – Average number of queries blocked by a transaction lock.

And here’s an example of the output from the WAREHOUSE_METERING_HISTORY query against SNOWHOUSE:

Looking at Warehouse Metering History

Now that we know the amount of work that was performed during the time period (via WAREHOUSE_LOAD_HISTORY) and the cost per time period (via WAREHOUSE_METERING_HISTORY), we can perform a simple efficiency ratio calculation for a particular warehouse. This example returns this information for a warehouse named XSMALL:

with cte as (
  select date_trunc('hour', start_time) as start_time, end_time, warehouse_name, credits_used
  from table(information_schema.warehouse_metering_history(dateadd('days',-1,current_date()),current_date()))
  where warehouse_name = 'XSMALL')
select date_trunc('hour', a.start_time) as start_time, avg(AVG_RUNNING), avg(credits_used), avg(AVG_RUNNING) / avg(credits_used) * 100 as efficiency_ratio
from table(information_schema.warehouse_load_history(dateadd('days',-1,current_date()),current_date())) a
join cte b on date_trunc('hour', a.start_time) = b.start_time
where a.warehouse_name = 'XSMALL'
group by 1
order by 1;

In the above query, we treat the average of AVG_RUNNING as work and the average of CREDITS_USED as cost, and we apply a simple efficiency ratio to the two values. Feel free to experiment in any way you like.

Calculating an Efficiency Ratio

Next, let’s talk about the specific use of WAREHOUSE_LOAD_HISTORY in our example above:

select date_trunc('hour', start_time), hour(start_time), avg(avg_running)
from table(information_schema.warehouse_load_history(date_range_start=>dateadd('day',-1,current_timestamp())))
group by date_trunc('hour', start_time), hour(start_time)
order by date_trunc('hour', start_time) asc;

Here is the output:

Calculating Warehouse Workload over Time

In this case, I’m indeed asking for an average of an average. I’m grouping the values by hour so I can get a general overview of my warehouse workload. I can see that my warehouse is busy for almost the full day. However, if I see time gaps in this output, I might do some additional investigation around those times to see whether the warehouse should be doing work.

Another thing you can see in the output from this function is whether these time gaps repeat over a few days. If they do, then I would recommend that you script the warehouse to sleep when not in use (i.e. to save money), or enable AUTO_SUSPEND and AUTO_RESUME for that warehouse.
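If you go the AUTO_SUSPEND / AUTO_RESUME route, the change is a single ALTER statement. Here’s a minimal sketch; the warehouse name and the five-minute timeout are just placeholders:

-- Suspend the warehouse after 300 seconds of inactivity and resume it
-- automatically when the next query arrives.
alter warehouse mywarehouse set auto_suspend = 300 auto_resume = true;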

The Snowflake web interface also has a nice visual representation of this function (under the Warehouse tab):

Web UI - Warehouse Load over Time

For more details on this chart, see our online documentation.

Whether you use the visual chart or the manual query, pay particular attention to AVG_RUNNING among the four available metrics. This should give you an idea of how each warehouse performs. If you have split your workload across several warehouses, it should also tell you how well your queries are distributed.

AVG_QUEUE_LOAD and AVG_BLOCKED are also interesting and should provide you with good insight about how well your warehouses are sized. Keep in mind that queuing is not necessarily a bad thing and you shouldn’t expect zero queuing. The idea is to accept a certain amount of queuing per time period based on your usage requirements.
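One way to put a number on “acceptable queuing” is to compare the queued load against the running load per hour. Here’s a minimal sketch against the current warehouse; the one-day window is just an example:

-- Hourly ratio of queued load to running load for the current warehouse.
select date_trunc('hour', start_time) as hour,
       avg(avg_running)    as avg_running,
       avg(avg_queue_load) as avg_queued,
       avg(avg_queue_load) / nullif(avg(avg_running), 0) as queue_ratio
from table(information_schema.warehouse_load_history(date_range_start=>dateadd('day',-1,current_timestamp())))
group by 1
order by 1;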

Using these metrics, you can determine what to do (the corresponding commands are sketched after this list):

  • Increasing the warehouse size will provide more throughput in processing the queries and thereby can help reduce the queuing time.
  • Increasing the cluster count (if using a multi-cluster warehouse) will allow more concurrency, which should also help reduce queuing and blocking.
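Both adjustments are simple ALTER WAREHOUSE statements. Here’s a minimal sketch; the warehouse name, size, and cluster counts are placeholders, and the cluster settings apply only if you have the multi-cluster warehouse feature:

-- Scale up for more throughput per query.
alter warehouse mywarehouse set warehouse_size = 'MEDIUM';

-- Scale out for more concurrency (multi-cluster warehouses only).
alter warehouse mywarehouse set min_cluster_count = 1 max_cluster_count = 3;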

Finding an Underutilized Warehouse

Is there a warehouse that’s underutilized? For example, similarly sized warehouses being shared across several users could potentially be consolidated into a single warehouse. You can surface this information by comparing your AVG_RUNNING and AVG_QUEUE_LOAD scores across your warehouses (see the sketch after this list):

  • If you see a warehouse with a very low number of queries running, you may want to turn that warehouse off and redirect the queries to another less used warehouse.
  • If a warehouse is running queries and queuing, perhaps it’s time to review your workflow to increase your warehouse sizes.
  • If you have built your own client application to interface with Snowflake, reviewing your client scripts / application code should also reveal any biases towards one warehouse over another.
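Because the load history examples above run against a single warehouse at a time, a quick way to compare warehouses side by side is to group WAREHOUSE_METERING_HISTORY by warehouse name. Here’s a minimal sketch over the last seven days; the window is just an example:

-- Credits consumed per warehouse over the last 7 days; warehouses near the
-- bottom of the list are candidates for consolidation.
select warehouse_name,
       sum(credits_used) as credits_used
from table(information_schema.warehouse_metering_history(dateadd('day', -7, current_date()), current_date()))
group by 1
order by credits_used desc;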

Putting it all together…

The Information Schema views and functions are provided to help you dissect your query, warehouse, and database usage. How you implement and analyze them depends on your specific needs.

To dig in some more on these functions, you can check out our online documentation:

https://docs.snowflake.net/manuals/sql-reference/info-schema.html
https://docs.snowflake.net/manuals/sql-reference/functions/warehouse_load_history.html
https://docs.snowflake.net/manuals/sql-reference/functions/warehouse_metering_history.html

Look for Part 2 of this series in the coming weeks, where I will show you how to analyze your storage utilization.

Until then, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.