New Snowflake features released in Q2’17

It has been an incredible few months at Snowflake. Along with the introduction of self-service and numerous other features added in the last quarter, we have witnessed:

  • Our customer base has grown exponentially with large numbers of applications in full production.
  • Billions of analytical jobs successfully executed this year alone, with petabytes of data stored in Snowflake today, and without a single failed deployment to date.
  • A strong interest in pushing the boundaries for data warehousing even further by allowing everyone in organizations to share, access and analyze data.

Continuing to engage closely with our customers during this rapid growth period, we rolled out key new product capabilities throughout the second quarter.

Instantly sharing data without limits – Introducing the Data Sharehouse  

Giving users the ability to easily and securely share data in the right format without the need for cumbersome data integration pipelines.

One of our highlights was the general availability of Snowflake’s Data Sharing feature. It allows multiple Snowflake customers to instantly share data with each other.

With data sharing, there is no need to use the expensive, insecure, and often complicated and error-prone procedure of transferring large numbers of single files from one location to another. Instead, customers can now simply share data via SQL statements and secure views fully integrated with our role-based access control model.

To learn more about how Snowflake customers are already leveraging this capability to build new products and features to drive their business, we encourage you to read:

Improving out-of-the-box performance & SQL programmability  

Our ongoing mission is to build the fastest database for data warehousing with the SQL you love and no concurrency limits.

  • We continued to make end-to-end query execution faster with more efficient pruning for sub-columns in VARIANT types (JSON), general JOIN performance improvements, and faster global median calculations.
  • We addressed popular customer requests for improved programmability via SQL by:
  • Customers who are using our multi-cluster warehouses auto-scale feature for spiking workloads can now specify up to 10 compute clusters. This allows running hundreds of queries without any query queuing.
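
For illustration, here is a rough sketch of the kind of statement involved; the warehouse name is hypothetical and this assumes an edition with multi-cluster warehouses enabled:

alter warehouse my_wh set
  min_cluster_count = 1
  max_cluster_count = 10        -- scale out to as many as 10 clusters under load
  scaling_policy = 'STANDARD';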

Staying ahead with enterprise-ready security and compliance

From day one, security has always been core to Snowflake’s design.

  • One of our exciting new capabilities this quarter is the general availability of customer-managed keys, which adds an additional layer of security for customers with sensitive data. This feature is the primary component of Tri-Secret Secure, a Snowflake Enterprise Edition for Sensitive Data (ESD) feature. You can find more details in our engineering blog about customer-managed keys in Snowflake.
  • We also improved the QUERY_HISTORY Information Schema table function, giving users the ability to monitor and filter queries by more specific SQL command types.
  • After its preview in Q1, secure views reached general availability in Q2.
  • In terms of certification, Snowflake achieved PCI DSS compliance – a common requirement for customers in banking, financial services, retail, services, and more. Customers and prospects who have PCI requirements will need to subscribe to the Snowflake Enterprise Edition for Sensitive Data (ESD).

Improving our ecosystem and data loading  

Enabling developers and builders to create applications with their favorite tools, drivers, and languages remains a top priority.

  • For data warehouse automation, Wherescape added support for Snowflake.
  • We enhanced our parallel data loading & unloading via the COPY command; developers can now:
  • Expanding our driver support, we announced a preview version of our open-source Go driver, available in Snowflake’s Github repo.

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the service itself.

  • Snowflake users can now set up MFA via the UI; they no longer need to reach out to our support team to enable the feature.
  • Building on the general availability of Query Profile in Q1, we added a number of additional usability enhancements that can be leveraged to better understand where time is spent during query execution.   
  • The AWS Key ID is now displayed in the DESC STAGE output and in the web interface.
  • We added support for specifying the file extension used for data unloading operations, and changed the default behavior when loading files containing byte order marks: we now detect and skip the marks instead of throwing an error because the data could not be converted to the proper data types.
  • To allow Snowflake users to better control consumption of compute resources, we also enhanced resource monitors (currently in preview). Users can now explicitly assign them to virtual warehouses and specify certain actions if a credit threshold is met or exceeded. Please stay tuned for a separate blog on this important capability.  

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • Given our rapid growth since the beginning of this year, we continued working on product scale, reliability, and availability to ensure our readiness for the next phase of hyper-growth.
  • We’re already seeing our efforts to automate all operations, particularly Snowflake deployments, pay off. We were able to quickly roll out a new region (US East – Virginia) in Q2 while working on additional deployments around the globe at the same time.

Acknowledgements and conclusion

As always, we want to first thank our customers for their continuous feedback. Additionally, we want to recognize that moving rapidly while also scaling would not be possible without our mighty Engineering Team, which has proven once again that it’s possible to ship high-quality features while serving our existing users’ needs at the same time.

Now, onwards to Q3 and the next set of exciting capabilities you will hear about very soon!

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

3 Ways a Data Sharehouse™ Can Catapult Your Data-Driven Initiatives

How would your organization benefit from effortlessly sharing limitless amounts of data with business partners and commercializing that data to share with other organizations?

With most data sharing methods today, the best you can do is imagine these benefits because it’s cumbersome, time consuming and costly to share even small slices of data. If you share data using an FTP approach, you will spend time deconstructing, scripting, securing and governing the data, and your data consumers will spend time reconstructing, scripting and rebuilding the data. Sharing data via email can be very slow and insecure. Email is not even a practical option for large data sizes and is exponentially more difficult if sharing a large database.

Sharing data using a basic cloud storage service is equally inefficient, including the lack of ability for you or for your data consumers to query the data directly. And if you want to enable “direct” data queries, without loading, you will likely evaluate a traditional cloud data warehouse or Hadoop platform that is “integrated” with a data lake storage platform. Do you think this will be simple? Not so fast! You’ll have to factor in the need to manage a separate data catalogue and external tables to share data, or you’ll have to contend with data inconsistency and performance issues.

These limitations all add up: traditional options for sharing data make it difficult to take data sharing to the next level of capability, which goes beyond finding insights to commercializing your data. If any of this feels familiar, consider these three reasons why Snowflake Data Sharing – the technology making it possible to create a Data Sharehouse™ – is so compelling.

1 – Drive simplicity: Allocate more of your time and resources to strategic data sharing projects

Traditional options to share data create unnecessary data processing and platform complexity. Complexity adds burdens and requires a lot of extra time and resources, including infrastructure costs, for you and for your data consumers. 

Snowflake Data Sharing, on the other hand, is an all-in-one, data-warehouse-as-a-service solution that makes sharing data simple and easy. The vision for the Snowflake architecture, from day one, featured separation of compute, storage, and services to allow unlimited scaling, concurrency and data sharing.

 

 

For example, with Snowflake, it is not necessary to manage an external data catalogue, nor is a separate effort required to build data security. It’s all built-in. In addition, Snowflake metadata management makes it possible to keep tabs on and track all data warehouse activity. Thus, to share data, follow these steps:

  • Point your pipeline to land your data into Snowflake and set up your warehouse.
  • To share any portions of your data warehouse, use Snowflake’s included SQL client to CREATE a new database shell that could contain the entire database or any portions of it. CREATE a reference list for the objects you wish to share.
  • Issue GRANT statements that will enable access to the shared-database and any objects referenced within “the share”.

These are all the steps you will need to quickly and easily share data. All of the related metadata activity, including cataloguing, happens automatically within the Snowflake data warehouse service. No additional infrastructure or separate cataloguing effort required.
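
As a rough illustration of those steps (the share, database, schema, view, and consumer account names below are hypothetical), the SQL involved might look like this:

create share sales_share;
grant usage on database sales_db to share sales_share;
grant usage on schema sales_db.public to share sales_share;
grant select on view sales_db.public.regional_summary to share sales_share;
alter share sales_share add accounts = consumer_acct;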

2 – Enable more valuable insights: Your data consumers will always operate with live and current data

Most options to share data require you to unload the data from your data warehouse in order to transmit or email it to your data consumers. A cloud data warehouse platform that avoids this step instead relies on a physically separate storage pool in order to scale and share data. The downside with either option is that the data set is read-only and disconnected from any updates that occur within the data warehouse. If you modify the data set, or if the data set is regularly updated, you must reload (ETL) it into the data warehouse, operate on it, and then unload it again to retransmit or place it in the shared storage pool.

For data consumers that received or were connected to the old data set, this means there is a period of time during which they’ll be exposed to stale and inconsistent data. Fresh data won’t be available until a new transmission is made or until a new connection is made to the fresh data. No one wants to run analytics on stale data.

Snowflake Data Sharing delivers a better choice. Because data sets shared with data consumers within the Snowflake environment are live, in real time, and served directly from the data warehouse, data consumers immediately see fresh data as soon as an update or transaction is committed and successfully executed. Data consistency is maintained, without any extra effort from you. Your data consumers will not have to struggle with the decision to run analytics now or wait for a new update.

3 – Support any scale, with ease: Seamlessly and cost-effectively share data with any number of data consumers

At the end of the day, it’s all about growing and expanding your business and services while providing excellent experiences for the customers and consumers of your data. If you anticipate sharing data with tens, hundreds, or thousands of data consumers, each of which with unique data sharing requirements, how can you easily support this? And support the growth without manually building more clusters, managing external tables or metadata stores, suffering through performance penalties or creating data inconsistencies? It would be very difficult or impossible with other architecture approaches to accomplish these objectives.

Snowflake data sharing allows you to scale and easily add data consumers and specify granular secure views, all at the highest performance profile and with data consistency. On the other end, with Snowflake, your data consumers can immediately use and query the shared data, also at the highest performance profile.

Summary: Share and imagine more

These are just a few compelling examples of how Snowflake Data Sharing, and the Data Sharehouse approach, can transform the creation of high-value business assets from data. It’s a new, exciting, powerful and easy-to-use feature of all Snowflake data warehouse deployments.

Also check out data sharing blogs from Snowflake customers Snagajob and Playfab. To learn more, you can easily sign up for Snowflake here. Or jump to our webpage to review our data sharing ebook, white paper, demo, and other resources. In future posts, we’ll cover more details about the technical capabilities and business-value advantages of Snowflake Data Sharing – built for the cloud. Stay tuned!

Understanding Your Snowflake Utilization: Part 3 – Query Profiling

This article is the third in a three-part series to help you utilize the functionality and data in Snowflake’s Information Schema to better understand and effectively use Snowflake.

As a Customer Success Engineer, my daily job entails helping our customers get the most value from our service. In my first post, I discussed getting a handle on your utilization of compute resources by using various Information Schema views and functions to profile your virtual warehouse usage. In my second post, I showed you how to get a handle on your storage usage.

In this final post, I will deep-dive into understanding query profiling. To do this, I will show you examples using the QUERY_HISTORY family of  functions. I will also show you a handy page in the UI that provides a graphical view of each query. Keep in mind that you’ll need warehouse MONITOR privileges to perform the tasks described in this post. Typically, the SYSADMIN role has the necessary warehouse MONITOR privileges across your entire account; however, other lower-level roles may also have the necessary privileges.

Ready to get started? Here we go!

Query History Profiling

Query profiling is perhaps one of the more popular topics I field questions about. Many customers are interested in improving their query performance. Although every development team should strive to periodically refactor their code, many find it challenging to determine where to start. Going through this analysis should help with identifying a good starting point.

Let’s look at some syntax, per our documentation for QUERY_HISTORY:

select *
from table(information_schema.query_history(dateadd('hours',-1, current_timestamp()),current_timestamp()))
order by start_time;

This query provides a view into all of the queries run by the current user in the past hour:

Query Profiling - History

We can also leverage the QUERY_HISTORY companion functions to narrow down your focus:

  • QUERY_HISTORY_BY_SESSION
  • QUERY_HISTORY_BY_USER
  • QUERY_HISTORY_BY_WAREHOUSE

These are particularly useful if you have identified specific workflow issues you need to address.

Profiling Tip #1: Using HASH()

Now for a particularly useful tip: utilizing HASH on the QUERY_TEXT column can help you consolidate and group similar queries (the HASH function returns the same result for queries whose text is exactly the same). As a general rule, identifying query groups and finding the max and min query runtime should help you sort through specific workflows. In the example below, I’m doing an analysis on average compile and execution time. Additionally, I’m collecting a count of the queries with the same syntax:

select hash(query_text), query_text, count(*), avg(compilation_time), avg(execution_time)
from table(information_schema.query_history(dateadd('hours',-1,current_timestamp()),current_timestamp()))
group by hash(query_text), query_text
order by count(*) desc;

Output:

Query Profiling - Query Groups

Using the HASH function further allows a user to easily query a particular instance of this query from the QUERY_HISTORY function. In the example above, I could check for specific queries where the HASH of the query text evaluates to the value -102792116783286838. For example:

select *
from table(information_schema.query_history())
where hash(query_text) = -102792116783286838
order by start_time;

Output:

Query Profiling - Single Query (SQL)

The above result shows you all of the times you have issued this particular query (going back 7 days). Pay specific attention to the following columns:

  • COMPILATION_TIME
  • EXECUTION_TIME
  • QUEUED (times)

If a query is spending more time compiling (COMPILATION_TIME) than executing (EXECUTION_TIME), perhaps it is time to review the complexity of the query. Snowflake’s query compiler will optimize your query and identify all of the resources required to perform the query in the most efficient manner. If a query is overly complex, the compiler needs to spend more time sorting through the query logic. Take a look at your query and see if there are many nested subqueries or unnecessary joins. Additionally, if there are more columns being selected than required, then perhaps be more specific in your SELECT statement by specifying certain columns.

QUEUED time is interesting because it could be an indicator about your warehouse size and the amount of workload you’ve placed on the warehouse. Snowflake is able to run concurrent queries and it does a very good job in doing so. However, there will be times when a particularly large query will require more resources and, thus, cause other queries to queue as they wait for compute resources to be freed up. If you see a lot of queries spending a long time in queue, you could either:

  • Dedicate a warehouse to these large complex running queries, or
  • Utilize Snowflake’s multi-clustering warehouse feature to allow more parallel execution of the queries. For more information about multi-cluster warehouses, see the Snowflake documentation.

In the recent updates to our QUERY_HISTORY_* Information Schema functions, we have added more metadata references to the results and now you should have a range of metadata at your disposal:

  • SESSION_ID
  • USER_NAME , ROLE_NAME
  • DATABASE_NAME , SCHEMA_NAME
  • WAREHOUSE_NAME , WAREHOUSE_SIZE , WAREHOUSE_TYPE

These columns will help you identify the origin of the queries and help you fine tune your workflow. A simple example would be to find the warehouse with the longest-running queries. Or, find the user who typically issues these queries.
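
For example, a simple sketch of a query that surfaces the warehouses running the longest queries over the past day might look like this (EXECUTION_TIME is reported in milliseconds):

select warehouse_name,
       count(*) as query_count,
       avg(execution_time) as avg_execution_ms,
       max(execution_time) as max_execution_ms
from table(information_schema.query_history(dateadd('days',-1,current_timestamp()),current_timestamp()))
group by warehouse_name
order by max_execution_ms desc;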

Profiling Tip #2: Using the UI

Once you have identified a particular query you would like to review, you can copy its QUERY_ID value and use this value to view its query plan in Snowflake’s Query Profile. To do this, click on the History icon, add a QUERY ID filter, and paste the QUERY_ID in question. For example:

Query Profile - Using the UI

Hint: If you don’t see a result, make sure you are using a role with the necessary warehouse MONITOR privilege (e.g. SYSADMIN or ACCOUNTADMIN) and you’ve selected the correct QUERY ID.

Once the search is complete, you should be able to click on the link provided under the Query ID column to go to the query’s detail page:

Query Profiling - Profile Page in UI

Now click on the Profile tab.

You should see a visualization of the Query Profile. In my example, Snowflake shows that this particular query executed in two steps:

Step 1:

Query Profiling - Profile Step 1 in UI

Step 2:

Query Profiling - Profile Step 2 in UI

The online documentation provides in-depth details on how to interpret this view. Pay particular attention to the orange bar in this view. It indicates the percentage of overall query time spent on this particular process. In this instance, we can see our query spent most of the time reading data from the table. If we run this query often enough, we should see this time decrease because we’ll be reading the data from cache instead of disk.

Conclusion

By utilizing the UI and the Information Schema functions and views described in this post, you can profile your queries to help you understand your current workflow and identify queries that can be better optimized. This will help save you money in the long run and also improve your user experience. Snowflake will continue to invest in tools like these to help our users better understand and use our platform.

To dig in some more on this subject, check out our online documentation:

I hope this article and this series gave you some inspiration for how you would like to manage your Snowflake instance. There are a lot of options to play with and they’re all intended to provide you with the flexibility and control you need to best use Snowflake. Please share your thoughts with us! We would love to help you on your journey to the cloud.

For more information, please feel free to reach out to us at info@snowflake.net. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

New Snowflake features released in Q1’17

We recently celebrated an important milestone in reaching 500+ customers since Snowflake became generally available in June 2015. As companies of all sizes increasingly adopt Snowflake, we wanted to look back and provide an overview of the major new Snowflake features we released during Q1 of this year, and highlight the value these features provide for our customers.

Expanding global reach and simplifying on-boarding experience

Giving our customers freedom of choice, along with a simple, secure, and guided “Getting Started” experience, was a major focus of the last quarter.

  • We added a new region outside of the US; customers now have the option to analyze and store their data in Snowflake accounts deployed in EU-Frankfurt. Choosing the appropriate region is integrated into our self-service portal when new customers sign up.
  • In addition, we added our high-value product editions, Enterprise and Enterprise for Sensitive Data (ESD), to our self-service offerings across all available regions. For example, with Enterprise, customers can quickly implement auto-scale mode for multi-cluster warehouses to support varying, high concurrency workloads. And customers requiring HIPAA compliance can choose ESD.
  • Exploring other venues for enabling enterprises to get started quickly with Snowflake, we partnered with the AWS Marketplace team to include our on-demand Snowflake offerings, including the EU-Frankfurt option, in their newly-launched SaaS subscriptions.

Improving out-of-the-box performance & SQL coverage

We are committed to building the fastest cloud DW for your concurrent workloads with the SQL you love.

  • One key performance improvement introduced this quarter was the reduction of compilation times for JSON data. Internal TPC-DS tests demonstrate a reduction of 30-60% for most of the TPC-DS queries (single stream on a single, 100TB JSON table). In parallel, we worked on improving query compile time in general, providing up to a 50% improvement in performance for short queries.
  • Another new key capability is the support for bulk data inserts on a table concurrently with other DML operations (e.g. DELETE, UPDATE, MERGE). By introducing more fine-grained locking at the micro-partition level, we are able to allow concurrent DML statements on the same table.
  • To improve our data clustering feature (currently in preview), we added support for specifying expressions on table columns in clustering keys. This enables more fine-grained control over the data in the columns used for clustering.
  • Also, we reduced the startup time for virtual warehouses (up to XL in size) to a few seconds, ensuring almost instantaneous provisioning for most virtual warehouses.
  • We extended our SQL by adding support for the ANSI SQL TABLESAMPLE clause. This is useful when a user wants to limit a query operation performed on a table to only a random subset of rows from the table.
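
For example, assuming a hypothetical table named orders, a query sampling roughly 10% of its rows looks like this:

select *
from orders tablesample bernoulli (10);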

Staying Ahead with Enterprise-ready Security

From day one, security has always been core to Snowflake’s design.

  • We expanded Snowflake’s federated authentication and single sign-on capability by integrating with many of the most popular SAML 2.0-compliant identity providers. In addition to Okta, Snowflake now supports ADFS/AD, Azure AD, Centrify, and OneLogin, to name just a few.
  • To advance Snowflake’s built-in auditing, we introduced new Information Schema table functions (LOGIN_HISTORY and LOGIN_HISTORY_BY_USER) that users can query to retrieve the short-term history of all successful and failed login requests in the previous 7 days. If required, users can maintain a long-term history by copying the output from these functions into regular SQL tables.
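
As a minimal sketch of that pattern (the LOGIN_AUDIT table name is hypothetical), you could materialize the last 7 days of login activity like this:

create table if not exists login_audit as
select *
from table(information_schema.login_history(time_range_start => dateadd('days',-7,current_timestamp())));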

Improving our ecosystem

Enabling developers and builders to create applications with their favorite tools and languages remains a high priority for us.

  • With respect to enterprise-class ETL, we successfully collaborated with Talend in building a native Snowflake connector based on Talend’s new and modern connector SDK. The connector, currently in preview, has already been deployed by a number of joint customers with great initial feedback on performance and ease-of-use.
  • To tighten the integration of our Snowflake service with platforms suited for machine learning and advanced data transformations, we released a new version of our Snowflake Connector for Spark, drastically improving performance by pushing more query operations, including JOINs and various aggregation functions, down to Snowflake. Our internal 10 TB TPC-DS performance benchmark tests demonstrate that running TPC-DS queries using this new v2 Spark connector is up to 70% faster compared to executing SQL in Spark with Parquet or CSV (see this Blog post for details).
  • We continue to improve our drivers for our developer community. Listening to feedback from our large Python developer community, we worked on a new version of Snowflake’s native Python client driver, resulting in up to 40% performance improvements when fetching result sets from Snowflake. And, after we open-sourced our JDBC driver last quarter, we have now made the entire source code available on our official GitHub repository.
  • And, last, but not least, to enhance our parallel data loading via the COPY command, ETL developers can now dynamically add file metadata information, such as the actual file name and row number, which might not be part of the initial payload.
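
As a rough sketch (the stage name, target table, and column layout here are hypothetical), the file metadata can be selected as part of a COPY transformation:

copy into raw_events (source_file, source_row, event_id, payload)
from (
  select metadata$filename, metadata$file_row_number, t.$1, t.$2
  from @my_stage t
);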

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the running service.

  • One major addition to our service is Query Profile, now generally available and fully integrated into Snowflake’s web interface. Query Profile is a graphical tool you can use to detect performance bottlenecks and areas for improving query performance.
  • Various UI enhancements were implemented: Snowflake’s History page now supports additional filtering by the actual SQL text and query identifier. We also added UI support for creating a Parquet file format in preparation for loading Parquet data into variant-type table columns in Snowflake.
  • A new Information Schema table function (TABLE_STORAGE_METRICS) exposes information about the data storage for individual tables. In particular, a user can now better understand how tables are impacted by Continuous Data Protection, particularly Time Travel and Fail-safe retention periods, as well as which tables contain cloned data.
  • We also recently introduced smarter virtual warehouse billing through Warehouse Billing Continuation (see this Blog post for details). If a warehouse is suspended and resumed within 60 minutes of the last charge, we do not charge again for the servers in the warehouse. WBC eliminates additional credit charges, and we hope it will reduce the need for our customers to strictly monitor and control when warehouses are suspended and resized.

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • As part of rolling out the new EU (Frankfurt) region, we increased automation of our internal deployment procedures to (a) further improve engineering efficiency while (b) laying the foundation for rapidly adding new regions based on customer feedback.
  • We further streamlined and strengthened our various internal testing and pre-release activities, allowing us to ship new features to our customers on a weekly basis – all in a fully transparent fashion with no downtime or impact to users.

Conclusion and Acknowledgements

This summary list of features delivered in Q1 highlights the high velocity and broad range of features the Snowflake Engineering Team has successfully delivered in a short period of time. We are committed to putting our customers first and maintaining this steady pace of shipping enterprise-ready features each quarter. Stay tuned for another feature-rich Q2.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Virtual Warehouse Billing Made Smarter

One of our most important commitments to our users is reducing/eliminating the management and tuning tasks imposed by other systems. We are always on the lookout for new and innovative ways of making our service easier to use.

Most recently, we looked at how our users manage their Snowflake virtual warehouses and realized we could be smarter about how we charge for the compute resources used in these warehouses.

First, Some Background Information…

To begin, here are some details about what happens when a virtual warehouse (or simply “warehouse”) is resumed and suspended in Snowflake:

  1. Each time the warehouse is resumed:
    • Snowflake provisions servers from AWS EC2 for a minimum of one hour. This is based on how AWS charges for EC2 resources.
    • The number of Snowflake credits charged depends on the number of servers provisioned, which is determined by the size of the warehouse (XS, S, M, L, XL, etc.) and the number of clusters in the warehouse (if multi-cluster warehouses are enabled).
  2. When the warehouse is suspended, the servers are shut down and credits are no longer charged.
  3. When the warehouse is resumed again, the servers are re-provisioned and the charges start over, regardless of how much time has passed since the warehouse was last charged. As a result, if the same warehouse is resumed multiple times within the same hour, credits are charged each time. As stated earlier, this follows the AWS EC2 model.

What We Changed

We noticed that some of our users were spending time and effort managing their virtual warehouses to ensure that credits were not consumed unnecessarily. We decided to eliminate this extra work by introducing Warehouse Billing Continuation (WBC).

With WBC, we now track the last time each individual server in a warehouse was charged and, if the warehouse is suspended and resumed within 60 minutes of the last charge, we don’t charge again for the server. The charge is continued from the last time as if the warehouse had never been suspended. This eliminates any extra charges, thereby reducing the need for strictly monitoring and controlling when warehouses are suspended and resumed.

How Does WBC Work?

The simple answer is it just works, regardless of how often you resume and suspend your virtual warehouses. If this answer satisfies your curiosity, you can skip now to the end of this post. Otherwise, read on for the gory details…

The best way to explain WBC is with examples. Say you have a Small warehouse (2 servers) that’s been suspended for longer than an hour. Now, imagine this warehouse goes through the following status changes:

Resumed   Suspended   Credits Charged (before WBC)   Credits Charged (with WBC)
09:15     09:25       2                              2
09:40     09:50       2                              -
10:05     10:10       2                              -
10:30     10:50       2                              2
11:20     11:40       2                              2 (at 11:30)

Before WBC, every time the warehouse was resumed, we charged 2 credits (1 credit per server in the warehouse). For example, in the scenario described above, between 09:15 and 11:40, the warehouse incurred 5 charges for a total of 10 credits.

With WBC, the behavior is different. The warehouse only incurs 3 charges for a total of 6 credits. The following diagram provides a more detailed explanation of what actually happens:

Example of charges for resuming and suspending a Small virtual warehouse
Figure 1: Example of charges for resuming and suspending a Small virtual warehouse
  1. At 09:15, 2 credits are charged for the servers in the warehouse because the warehouse has been suspended for longer than an hour and there’s no previous charge to continue.
  2. At 9:40 and 10:05, we recognize that the warehouse was charged within the last hour so no additional charges are incurred. In other words, the current charge doesn’t expire until 60 minutes after it was first incurred, so the earliest the warehouse will be charged again is 10:15, regardless of how many times the warehouse is suspended and resumed during this period.
  3. At 10:15, the warehouse isn’t charged because it’s not running at that time.
  4. At 10:30, the 2 servers are charged again because more than 60 minutes have elapsed since the initial charge. More importantly, this new time is now used to track all subsequent charges, i.e. the earliest time for the next charge is 11:30.
  5. At 11:20, no charge is incurred because 60 minutes haven’t elapsed since the last charge.
  6. At 11:30, the warehouse is charged again because 60 minutes have now elapsed since the last charge and the warehouse is running.

This example covers a relatively simple case for a Small warehouse. Each successively larger warehouse has more servers, so the scenarios are slightly more involved, especially if the warehouse is resized (or multi-cluster warehouses are being used); however, the mechanics and calculations are all the same. The most important thing to remember is that every server in a warehouse is charged independently.

WBC Example with Virtual Warehouse Resizing

Consider the same example from earlier, but starting with a Medium warehouse (4 servers) and the following resize events:

  • Warehouse resized down to Small (2 servers) at 09:30, 10:00, and 11:00.
  • Warehouse resized back to Medium (4 servers) at 09:45, 10:45, and 11:15.

Resumed   Credits Charged for Resume   Resized          Credits Charged for Resize
09:15     4                            09:30 (Small)    -
09:40     -                            09:45 (Medium)   -
                                       10:00 (Small)    -
10:05     -
10:30     2                            10:45 (Medium)   2
                                       11:00 (Small)    -
                                       11:15 (Medium)   -
11:20     2 (at 11:30)                                  2 (at 11:45)

The total credits charged would be (4+2+2) + (2+2) = 12. The following diagram shows how the charges are incurred (remember that I warned you earlier about the gory details):

Example of charges for resuming and suspending a Medium virtual warehouse with resizing
Figure 2: Example of charges for resuming and suspending a Medium virtual warehouse with resizing
  1. At 09:15, the initial charge for the Medium warehouse is 4 credits.
  2. At 10:30, the new charge is 2 credits (reflecting the Small size at the time the warehouse is resumed). These 2 servers will be charged next in 60 minutes at 11:30.
  3. At 10:45, a new, separate charge for 2 additional servers is incurred (due to the resize to Medium). These 2 additional servers will be charged next at 11:45, independently of the other 2 servers.
  4. All the other resizing events increase or decrease the number of servers running at that time, but incur no additional charges.

Note that this diagram also illustrates how we remove servers from a warehouse: we always start with the most recently-provisioned servers (i.e. LIFO) and we add them back in the same order. This is important because each server is charged according to its individual start time. It also means there’s no benefit to reducing the size of a warehouse within an hour for which its servers have already been charged.

So, What Next?

There’s no need to make any changes to your virtual warehouses. We’ve enabled Warehouse Billing Continuation by default for all accounts and warehouses, including any existing warehouses. In fact, we implemented WBC near the end of March so you probably have already noticed reduced charges for some of your warehouses, particularly the ones that you resume and suspend frequently.

However, to take full advantage of this change, you might want to revisit your warehouse procedures and settings. For example, you can now set auto-suspend for a warehouse to a shorter value (e.g. 5 or 10 minutes) without worrying about being charged additional credits each time the warehouse resumes within the hour.
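
For example (the warehouse name is hypothetical, and AUTO_SUSPEND is specified in seconds):

alter warehouse my_wh set auto_suspend = 300;  -- suspend after 5 minutes of inactivity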

Interested in learning more? Check out the Snowflake documentation for more in-depth information about warehouses:

Also, keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Understanding Your Snowflake Utilization, Part 2: Storage Profiling

This article is the second in a three-part series to help you use Snowflake’s Information Schema to better understand and effectively utilize Snowflake.

As a Customer Success Engineer, my daily job entails helping our customers get the most value from our service. And I’m now passing along some of what I’ve learned to help you become more self-sufficient. In my first post, I discussed getting a handle on your utilization of compute resources by using various Information Schema views and functions to profile your virtual warehouse usage.

In this post, I provide a deep-dive into understanding how you are utilizing data storage in Snowflake at the database, stage, and table level. To do this, I will show you examples of two functions and a view provided in the Information Schema for monitoring storage usage. I will also show you a handy page in the UI that provides an account-level view of your storage. Keep in mind that you need ACCOUNTADMIN access to perform any of the tasks described in this post.

Let’s get started.

Summary Storage Profiling in the UI

Before diving into our detailed analysis of data storage, let’s take a quick look at the summary, account-level storage view provided by Snowflake. As a user with the ACCOUNTADMIN role, you can navigate to the Account page in the Snowflake UI to get a visual overview of the data storage for your account.


This page provides a view, by month, of the average and daily storage usage across your entire account. You can use the filters on the page to filter by database, Snowflake stage, and data maintained in Fail-safe (for disaster recovery).

Detailed Storage Profiling Using the Information Schema

The Snowflake Information Schema provides two functions and one view for monitoring detailed data storage at the database, stage, and table level:

  • DATABASE_STORAGE_USAGE_HISTORY (function)
  • STAGE_STORAGE_USAGE_HISTORY (function)
  • TABLE_STORAGE_METRICS (view)

The DATABASE_STORAGE_USAGE_HISTORY table function shows your database status and usage for all databases in your account or a specified database. Here’s an example of the usage over the last 10 days for a database named sales:

use warehouse mywarehouse;

select * from table(sales.information_schema.database_storage_usage_history(dateadd('days',-10,current_date()),current_date(), 'SALES'));

Note that the above screenshot only displays some of the output columns. For full details about the output, see the online documentation. Also, per the Snowflake documentation:

If a database has been dropped and its data retention period has passed (i.e. database cannot be recovered using Time Travel), then the database name is reported as DROPPED_id.

At its core, the most useful insight from this function is the average growth in your database. Keep in mind, the output includes both AVERAGE_DATABASE_BYTES and AVERAGE_FAILSAFE_BYTES. Leveraging these data points to derive a percentage of Fail-safe over actual database size should give you an idea of how much you should be investing towards your Fail-safe storage. If certain data is not mission critical and doesn’t require Fail-safe, try setting these tables to transient. More granular information about Fail-safe data is provided in TABLE_STORAGE_METRICS, which we will look at more closely later in this post.
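
As a rough sketch of that calculation for the same SALES database (assuming the USAGE_DATE, AVERAGE_DATABASE_BYTES, and AVERAGE_FAILSAFE_BYTES output columns described in our documentation):

select usage_date,
       average_database_bytes,
       average_failsafe_bytes,
       average_failsafe_bytes / nullif(average_database_bytes, 0) * 100 as failsafe_pct
from table(sales.information_schema.database_storage_usage_history(dateadd('days',-10,current_date()),current_date(),'SALES'))
order by usage_date;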

Next, let’s look at STAGE_STORAGE_USAGE_HISTORY. This function shows you how much storage is being used for staged files across all your Snowflake staging locations, including named, internal stages. Note that this function does not allow querying storage on individual stages.

Here’s an example of staged file usage for the last 10 days:

select *
from table(sales.information_schema.stage_storage_usage_history(dateadd('days',-10,current_date()),current_date()));

Note that the above screenshot only displays some of the output columns. For full details about the output, see the online documentation.

Also note that you can only query up to 6 months worth of data using this function. Some of our users like to use Snowflake stages to store their raw data. For example, one user leverages table staging locations for their raw data storage just in case they need to access the data in the future. There’s nothing wrong with this approach, and since Snowflake compresses your staged data files, it certainly makes sense; however, only the last 6 months of staged data storage is available.

Finally, the TABLE_STORAGE_METRICS view shows your table-level storage at runtime. This is a snapshot of your table storage which includes your active and Fail-safe storage. Additionally, you can derive cloned storage as well utilizing the CLONE_GROUP_ID column. As of today, this is the most granular level of storage detail available to users.

Here’s a general use example:

select *
from sales.information_schema.table_storage_metrics
where table_catalog = 'SALES';

Note that the above screenshot only shows a portion of the output columns. For full details about the output, see the online documentation.

One interesting analysis I’ve been helping our customers with is deriving how much of their table storage is based on cloned data. In Snowflake, cloning data has no additional costs (until the data is modified or deleted) and it’s done very quickly. All users benefit from “zero-copy cloning”, but some are curious to know exactly what percentage of their table storage actually came from cloned data. To determine this, we’ll leverage the CLONE_GROUP_ID column in TABLE_STORAGE_METRICS.

For example:

with storage_sum as (
  select clone_group_id,
         sum(owned_active_and_time_travel_bytes) as owned_bytes,
         sum(active_bytes) + sum(time_travel_bytes) as referred_bytes
  from concurrency_wh.information_schema.table_storage_metrics
  where active_bytes > 0
  group by 1)
select * , referred_bytes / owned_bytes as ratio
from storage_sum
where referred_bytes > 0 and ratio > 1
order by owned_bytes desc;

The ratio in the above query gives you an idea of how much of the original data is being “referred to” by the clone. In general, when you make a clone of a table, the CLONE_GROUP_ID for the original table is assigned to the new, cloned table. As you perform DML on the new table, your REFERRED_BYTES value gets updated.  If you join the CLONE_GROUP_ID back into the original view, you get the output of the original table along with the cloned table. A ratio of 1 in the above example means the table data is not cloned.

If you need to find out the exact table name from the above query, then simply join the CTE back to the TABLE_STORAGE_METRICS view and ask for the TABLE_NAME column.

For example:

with storage_sum as (
  select clone_group_id,
         sum(owned_active_and_time_travel_bytes) as owned_bytes,
         sum(active_bytes) + sum(time_travel_bytes) as referred_bytes
  from concurrency_wh.information_schema.table_storage_metrics
  where active_bytes > 0
  group by 1)
select b.table_name, a.*, referred_bytes / owned_bytes as ratio
from storage_sum a
join concurrency_wh.information_schema.table_storage_metrics b
on a.clone_group_id = b.clone_group_id
where referred_bytes > 0 and ratio > 1
order by owned_bytes desc;

Storage Profiling - Example

Conclusion

By utilizing the UI and the Information Schema functions and views described in this post, you can profile your data storage to help you keep your storage costs under control and understand how your business is growing over time. It’s a good idea to take regular snapshots of your storage so that you can analyze your growth month-over-month. This will help you both formulate usage insight and take actions.

To dig in some more on this subject, check out our online documentation:

I hope this article has given you some good ideas for how to manage your Snowflake instance. Look for Part 3 of this series in coming weeks where I will show you how to analyze your query performance. As already shown in Parts 1 and 2, there are a lot of options to play with in Snowflake and they’re all intended to give you the flexibility and control you need to best use Snowflake. Please share your thoughts with us!

Also, for more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Understanding Your Snowflake Utilization, Part 1: Warehouse Profiling

This is the first in a series of follow-up posts to Kent Graziano’s earlier post, Using the Snowflake Information Schema. This series will take a deeper dive into the Information Schema (Snowflake’s data dictionary) and show you some practical ways to use this data to better understand your usage of Snowflake.

As a Customer Success Engineer, much of my day-to-day activity involves profiling some of our most important customers and understanding exactly how the service is being used. I measure their compute credit usage, storage growth, and query performance. Many of these customers often ask me how they can implement similar profiling for their Snowflake account on their own.

The answer is to utilize the Information Schema. Aside from providing a set of detailed views into Snowflake’s metadata store, the Information Schema goes a step further and provides several powerful table functions that can be called directly in SQL. These functions can be used to return historical information about executed queries, data stored in databases and stages, and virtual warehouse (i.e. compute) usage.

In addition to these functions, I also recommend leveraging the recently implemented TABLE_STORAGE_METRICS view (also in the Information Schema) to dive even deeper into your analysis.

In this post, I will show you how to leverage these easy-to-use functions to gather detailed information about the usage of your virtual warehouses. So let’s get started.

Warehouse Profiling

To profile your current warehouse usage, use the WAREHOUSE_LOAD_HISTORY and WAREHOUSE_METERING_HISTORY functions. A good way to think about the relationship between these two functions is that the first one shows how much work was done over a period of time (load) and the second one shows the cost for doing the work (metering).

The syntax for calling these functions is simple, and can be executed in the Worksheet in the Snowflake web interface. For example:

use warehouse mywarehouse;

select * from table(information_schema.warehouse_load_history(date_range_start=>dateadd('hour',-1,current_timestamp())));

select * from table(information_schema.warehouse_metering_history(dateadd('hour',-1,current_date()),current_date()));

The above queries show warehouse load and credits used for the past hour for all your warehouses. Be sure to check out the Usage Notes section (in the documentation) for each function to understand all the requirements and rules. For example, the WAREHOUSE_LOAD_HISTORY function returns results in different intervals based on the timeframe you specify:

  • 5-second intervals when the timeframe is less than 7 hours.
  • 5-minute intervals when the timeframe is greater than 7 hours.

Here’s an example of the output from the WAREHOUSE_LOAD_HISTORY query against SNOWHOUSE, a warehouse that we use internally:

Looking at Warehouse Load History

Per our documentation:

  • AVG_RUNNING – Average number of queries executed.
  • AVG_QUEUE_LOAD – Average number of queries queued because the warehouse was overloaded.
  • AVG_QUEUE_PROVISION – Average number of queries queued because the warehouse was being provisioned.
  • AVG_BLOCKED – Average number of queries blocked by a transaction lock.

And here’s an example of the output from the WAREHOUSE_METERING_HISTORY query against SNOWHOUSE:

Looking at Warehouse Metering History

Now that we know the amount of work that was performed during the time period (via WAREHOUSE_LOAD_HISTORY) and the cost per time period (via WAREHOUSE_METERING_HISTORY),  we can perform a simple efficiency ratio calculation for a particular warehouse. This example returns this information for a warehouse named XSMALL:

with cte as (
  select date_trunc('hour', start_time) as start_time, end_time, warehouse_name, credits_used
  from table(information_schema.warehouse_metering_history(dateadd('days',-1,current_date()),current_date()))
  where warehouse_name = 'XSMALL')
select date_trunc('hour', a.start_time) as start_time, avg(AVG_RUNNING), avg(credits_used), avg(AVG_RUNNING) / avg(credits_used) * 100 
from table(information_schema.warehouse_load_history(dateadd('days',-1,current_date()),current_date())) a
join cte b on date_trunc('hour', a.start_time) = b.start_time
where a.warehouse_name = 'XSMALL'
group by 1
order by 1;

In the above query, we are treating the average of AVG_RUNNING as work and the average of CREDITS_USED as cost and we apply a simple efficiency ratio on both of these values. Feel free to experiment any way you like.

Calculating an Efficiency Ratio

Next, let’s talk about the specific use of WAREHOUSE_LOAD_HISTORY in our example above:

select date_trunc('hour', start_time), hour(start_time), avg(avg_running)
from table(information_schema.warehouse_load_history(date_range_start=>dateadd('day',-1,current_timestamp())))
group by date_trunc('hour', start_time), hour(start_time)
order by date_trunc('hour', start_time) asc;

Here is the output:

Calculating Warehouse Workload over Time

In this case, I’m indeed asking for an average of an average. I’m grouping the values by hours so I can get a general overview of my warehouse workload. I can see my warehouse is working almost a full day. However, if I see some time gaps in this output, then I might do some additional investigation around those times and see if the warehouse should be doing work.

Another thing you can see in the output from this function is whether these time gaps repeat over a few days. If they do, then I would recommend that you script the warehouse to sleep when not in use (i.e. to save money), or enable AUTO_SUSPEND and AUTO_RESUME for that warehouse.

The Snowflake web interface also has a nice visual representation of this function (under the Warehouse tab):

Web UI - Warehouse Load over Time

For more details on this chart, see our online documentation.

Whether you use the visual chart or the manual query, for the four available metrics, pay particular attention to AVG_RUNNING. This should give you an idea how each warehouse performs. If you have split your workload across several different warehouses, it should tell you how well your queries are distributed.

AVG_QUEUE_LOAD and AVG_BLOCKED are also interesting and should provide you with good insight about how well your warehouses are sized. Keep in mind that queuing is not necessarily a bad thing and you shouldn’t expect zero queuing. The idea is to accept a certain amount of queuing per time period based on your usage requirements.

Using these metrics, you can determine what to do:

  • Increasing the warehouse size will provide more throughput in processing the queries and thereby can help reduce the queuing time.
  • Increasing the cluster count (if using a multi-cluster warehouse) will allow more concurrency, which should also help reduce queuing and blocking.

Finding an Underutilized Warehouse

Is there a warehouse that’s underutilized? For example, any similar sized warehouses being shared across several users could potentially be consolidated to a single warehouse. You can surface this information by comparing your AVG_RUNNING and AVG_QUEUE_LOAD scores across your warehouses:

  • If you see a warehouse with a very low number of queries running, you may want to turn that warehouse off and redirect the queries to another less used warehouse.
  • If a warehouse is running queries and queuing, perhaps it’s time to review your workflow to increase your warehouse sizes.
  • If you have built your own client application to interface with Snowflake, reviewing your client scripts / application code should also reveal any biases towards one warehouse over another.

Putting it all together…

The Information Schema views and functions are provided to help you with dissecting your query, warehouse, and database usage. Actual implementation and analysis is based on your specific needs.

To dig in some more on these functions, you can check out our online documentation:

https://docs.snowflake.net/manuals/sql-reference/info-schema.html
https://docs.snowflake.net/manuals/sql-reference/functions/warehouse_load_history.html
https://docs.snowflake.net/manuals/sql-reference/functions/warehouse_metering_history.html

Look for Part 2 of this series in coming weeks where I will show you how to analyze your storage utilization.

Until then, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Part II: Designing Big Data Stream Ingestion Architectures Using Snowflake

In PART I of this blog post, we discussed some of the architectural decisions for building a streaming data pipeline and how Snowflake can best be used as both your Enterprise Data Warehouse (EDW) and your Big Data platform. Conceptually, you want to set yourself up for analyzing data in near “real-time”, providing your business users with access to data in minutes, instead of the hours or days they are accustomed to. This is what the Lambda Architecture (http://lambda-architecture.net/) attempts to achieve and this is what Snowflake customers love about our platform.

Snowflake provides the best of both worlds, i.e. users can query semi-structured data using SQL while data is being loaded, with no contention and full transactional consistency. If you are like me, you are saying to yourself, “Really?” In short, the answer is a resounding “YES!” So what’s our secret? We take advantage of being the first cloud-native database hosted on AWS and of using the Amazon S3 storage service’s capabilities to their fullest.

PART II explores the Snowflake COPY command and its powerful options for gradually adding capacity and capability to your data pipeline for error handling and transformation. It also discusses options for validating and transforming data after it’s been loaded.

Snowflake’s COPY Command

The COPY command is extremely feature rich, giving you the flexibility to decide where in the pipeline to handle certain functions or transformations, such as using ELT (Extract/Load/Transform) instead of traditional ETL.

Using the COPY command, data can be loaded from S3 in parallel, enabling loading hundreds of Terabytes (TB) in hours, without compromising near real-time access to data that has been loaded. The COPY command can be executed from the Worksheet tab in the UI, just like any other DDL or DML operation, or programmatically using one of the supported languages, such as Python, Node.js, ODBC, JDBC, or Spark.

Tip: As a best practice, we recommend testing the syntax of your commands through the UI before embedding them in code.

Here are the highlights of the many options that the COPY command provides for managing the loading process end-to-end.

Compression

Because Snowflake uses a columnar Massively Parallel Processing (MPP) implementation, data is compressed for optimized storage and query performance. In addition, data can be loaded from files that are already compressed, as sketched below:

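As a minimal sketch (the stage and table names are hypothetical), a COPY command loading gzip-compressed JSON files might look like this:

-- my_s3_stage and elt_table are hypothetical names
COPY INTO elt_table
     FROM @my_s3_stage/streamdata/
     FILE_FORMAT = (TYPE = 'JSON' COMPRESSION = 'GZIP');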

Tip: Design your data-collection application to output files of about 100 MB in size, which will not only give the best storage value, but also optimal query performance based on Snowflake’s underlying architecture.

Parallel Load

Snowflake’s ability to load multiple files in parallel with a single COPY command means each virtual warehouse (VW) node can operate on multiple files simultaneously. The design consideration here is whether to create one large file (e.g. multiple GB) or many small files for optimal load performance. The Snowflake COPY command can handle either configuration, but will load the smaller, more numerous files much faster. Smaller files also allow for easier error handling if there is a problem with the data.

Error Handling

COPY command options that should be considered for error handling include the following (a sketch follows the list):

  • Validate before loading: Snowflake allows you to run the COPY command in validation mode without actually loading the data. It can either return the number of rows that failed or return all errors. This feature is nice for smaller data sets, but has the downside of requiring the command to be executed twice: once for error checking and again for the actual load, which could be excessive overhead depending on your use case.
  • Validate on load: The COPY command offers multiple options for controlling behavior in the event of error during the load:
    • The simplest strategy is to continue on error, which for some use cases or initial testing purposes is acceptable.
    • The next option is to skip the file if any errors occur, or if a specific number of errors occur, or if a specific percentage of the records contain errors. Keep in mind that Snowflake maintains metadata about the files that have been loaded and, unless specifically configured to do so, the COPY command won’t load the same file twice into the same table. This allows you to use the same COPY command with the same pattern match over and over again as more files are added to the same directory structure. However, keep this behavior in mind if you ever need to reload a file (the FORCE option overrides it).
    • Lastly, you can abort on error and stop loading completely.
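As a rough sketch (stage and table names are hypothetical), these options map onto the COPY command's VALIDATION_MODE and ON_ERROR parameters:

-- Check the files for errors without loading any data
COPY INTO elt_table FROM @my_s3_stage/streamdata/
     VALIDATION_MODE = 'RETURN_ERRORS';

-- Skip any file where more than 5% of the records fail; ON_ERROR also accepts
-- CONTINUE, SKIP_FILE, SKIP_FILE_<num> and ABORT_STATEMENT
COPY INTO elt_table FROM @my_s3_stage/streamdata/
     ON_ERROR = 'SKIP_FILE_5%';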

File Cleanup

The COPY command has a parameter (PURGE) to automatically remove files from the S3 staging area once they have loaded successfully, or to leave them in place for regulatory purposes or long-term storage. Most clients choose to retain these files for at least a short period of time. Once the data is validated, you can decide whether to keep the files in their current location (i.e. in the staging area), move them to cheaper, long-term storage (such as Amazon Glacier), or delete them completely.
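For example (again with hypothetical stage and table names), PURGE = TRUE removes each file from the stage once it has loaded successfully:

COPY INTO elt_table FROM @my_s3_stage/streamdata/ PURGE = TRUE;  -- hypothetical names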

Loading to Production Tables or Staging Tables?

You can choose to load directly into your production table, or you can stage your data for post-processing once ingested. There are benefits and trade-offs to both. The simplest and most straightforward implementation is to build all transformational and error logic around your production-level tables. Depending on your data volumes and how much you trust the quality of the data, this could work fine. The downside of this approach is reduced flexibility. As you can imagine, you have now set yourself up to let this table (or set of tables) grow forever.

Depending on your use case, if you decide to do transformations, you may find yourself working from extremely large, potentially unclean data sets, which might not be optimal. Most clients start with this method, then transition to the architecture depicted in PART I of this blog. It’s relatively easy to transition any code written to work with staging tables so that it writes smaller chunks of data into production tables instead.

Post-Load Processing

Once data has been loaded using the COPY command, Snowflake supports additional post-processing options.

Dynamic Handling of Semi-structured Data

Most streaming data that our customers load into Snowflake arrives as semi-structured data, such as JSON, Avro, or XML. One of the biggest benefits of using Snowflake for processing this data is that code changes are not required in the data pipeline as the structure of the data changes. Most databases require shredding prior to ingest to put the data into predefined tables, which is what enables access to it via SQL. We have several customers whose data format changes daily and, prior to Snowflake, the application development team had to prioritize which elements would be exposed and which parts of the pipeline would stay fixed.

Not any longer with Snowflake!

Our VARIANT data type automatically adapts, allowing new elements to be ingested and made available right away without changing any of your application code or database schema. Each element becomes an internal sub-column that can be referenced using dot notation, e.g.:

SELECT COUNT(fieldname:objectname) FROM elt_table;

This provides end users instant access to new data as it becomes part of the stream, without having to ask the development team to do new work. AWESOME!

Data Validation

Similar to the Validation mode that can be used for error handling before loading, Snowflake users can tap into the metadata collected for every file loaded. For each COPY command, the VALIDATE table function provides information on the types of errors encountered, the name of the file containing the error, and the line and byte offset of the error, which can be used to analyze malformed data in the stream. A best practice would be to execute this function after each COPY command and save the results into an error table for debugging at a later time, e.g.:

INSERT INTO save_copy_errors SELECT * FROM TABLE(VALIDATE(elt_table, Job_ID => '_LAST'));

Data Transformation

Depending on your use cases, the skills of your end users, the types of analytics you perform, and your query tools, you may want (or need) to transform semi-structured data (including arrays and objects in non-relational structures) into relational tables with regular data types (INTEGER, CHAR, DATE, etc.), including data models using Third Normal Form, star or snowflake schemas, or any other kind of schema. Snowflake is a fully functional, ANSI-compliant RDBMS that doesn’t limit your data model creation. Snowflake’s built-in functionality promotes ELT (Extract/Load/Transform) over traditional ETL or Data Lake architectures that rely on Hadoop for transformation. These features include:

  • Extreme Performance using SQL with semi-structured data: Snowflake automatically shreds semi-structured objects into sub-columns. An added benefit is that each of these columns has statistical metadata captured on ingest; this is used to optimize query performance, similar to regular relational columns. This allows us to deliver improved performance over Hive or other Hadoop/SQL overlay technologies by a factor of 10 to 1000 times. Moreover, you can always improve performance by scaling your VW without reloading data.
  • User Defined Function (UDF) Support: If Snowflake does not provide a native function, aggregation or transformation for the operation you wish to perform, you can easily create SQL UDFs and JavaScript UDFs.
  • Data Typecasting: Snowflake simplifies data typecasting with both a CAST function and the functionally equivalent ‘::’ notation for explicitly converting data types. Take special care when converting timestamps and dates so that further casting downstream in the pipeline during end-user queries is avoided (see the sketch after this list).
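As one small illustration (the function, table, column, and element names here are all made up), a SQL UDF combined with the ‘::’ cast operator can pull a typed value out of a VARIANT column:

-- event_ts, elt_table, the VARIANT column v, and the event_time element are hypothetical
CREATE OR REPLACE FUNCTION event_ts(v VARIANT)
     RETURNS TIMESTAMP_NTZ
     AS $$ TO_TIMESTAMP_NTZ(GET(v, 'event_time')::STRING) $$;

SELECT event_ts(v) AS event_time, COUNT(*) FROM elt_table GROUP BY 1;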

Conclusion

Streamlining the data pipeline can have a huge impact on the value of data flowing into your business. End users that used to wait days to see the results of changes they made to operational systems will be able to see those results in minutes. Utilizing the power of a cloud-based Massively Parallel Processing RDBMS to do ELT means you don’t have to move TBs of data between your Hadoop infrastructure and your EDW, which significantly shortens or eliminates data load windows. The Snowflake platform allows you to start small with little complexity and then scale your architecture as the needs of your business change.

I have really only scratched the surface on the capabilities of Snowflake related to processing semi-structured data, so look for more posts as I work with customers on their most challenging Big Data analytic scenarios and please get in touch with any questions! And, as always, make sure to follow our Snowflake-related Twitter feeds (@SnowflakeDB) for all the latest developments and announcements.

Part I: Designing Big Data Stream Ingestion Architectures Using Snowflake

This post highlights some of Snowflake’s advanced functionality and pinpoints key architectural decisions that might be influenced when designing pipelines for ingesting streamed data.

What is Streamed Data?

Streamed data is information that is generated continuously by data sources such as social networks, connected devices, video games, sensors, etc. In the age of social media and the Internet of Things (IoT), streaming architectures for data capture and analysis have become extremely popular. Building data pipelines to collect semi-structured data from sources such as Facebook, Twitter and LinkedIn requires a platform capable of handling these new constructs and this new approach to analytics.

Why Snowflake?

Snowflake’s elastic cloud data warehouse is extremely well-suited to handle the volume, variety and velocity challenges of working with Big Data. The prevailing best practice for stream ingestion is the Lambda Architecture (http://lambda-architecture.net/). Snowflake can easily be used as a core component of Lambda, simplifying the architecture and speeding access to data in both the batch layer and the speed layer.

The following diagram provides a high-level view of a data stream ingestion architecture, incorporating both cloud infrastructure and Snowflake elements:

[Diagram: Ingest Architecture]

When designing complex architectures, it helps to break the problem into manageable components, so I will divide this discussion into two parts:

  • Decisions to make before loading
  • Decisions to make about loading (using the COPY command) and post-processing.

This blog post focuses on decisions to make before loading data. A follow-on post will focus on the COPY command and how it handles errors and transformations during loading, as well as a discussion of post-processing options after your data is loaded.

Decisions to Make Before Loading

Snowflake’s Elastic Data Warehouse runs on Amazon’s AWS platform, which offers a wide variety of choices for building your data pipeline. It’s best to start small with a sample data set and a limited number of services, becoming familiar with the nuances of each. One of the biggest benefits of using Snowflake as your Big Data platform is our VARIANT data type implementation for semi-structured data. Due to the variable nature of semi-structured data, new elements are often added to the stream, forcing developers to constantly update their ingest scripts and internal data structures. Snowflake changes this dynamic by automatically shredding these objects into sub-columns on ingest, giving users instant access to the data via SQL (see below for more details).

Programming Language

For streaming applications, choose a language that works well with the cloud platform services (i.e. the AWS services surrounding Snowflake) that you will leverage for the following components:

  • Data collection (e.g. Kinesis, Firehose, SQS)
  • Notification (e.g. SNS)
  • Control flow execution (e.g. Data Pipeline, Lambda)

Snowflake supports multiple programming languages and connectivity options, including but not limited to Python, Node.js, ODBC, JDBC, and Spark. All connections into Snowflake use TLS and come through our secure REST API.

File System Directory Structure

Snowflake supports loading directly from Amazon Simple Storage Service (S3) using files of type Avro, JSON, XML and CSV/TSV. Most streaming architectures are designed to write files (either compressed or uncompressed) to a temporary staging area for fault tolerance and manageability (Amazon S3 has some really nice features for moving data around). When setting up S3 buckets and directory structures, consider that the most efficient way to load data into Snowflake is to use micro-batching and the COPY command (more on this in upcoming posts). If you use a directory and file naming convention such as s3://<bucketname>/<projectname>/<tablename>/<year>/<month>/<day>/<hour>, you can use the PATTERN option in the COPY command. The PATTERN option uses regular expressions to match specific files and/or directories to load. By segmenting the data and files by time, data can be loaded in subsets or all at once, depending on your use case.
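As a sketch, assuming an external stage named my_s3_stage points at the bucket and the files follow the naming convention above, you could load just one hour's worth of data:

-- Stage, table, and paths are hypothetical; the regex matches a single hour's directory
COPY INTO click_events
     FROM @my_s3_stage/myproject/click_events/
     PATTERN = '.*/2017/07/15/10/.*'
     FILE_FORMAT = (TYPE = 'JSON');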

Tip: Snowflake’s COPY command only loads each file into a table once unless explicitly instructed to reload files. This is a great feature because you can stream your data into the same directory and continue executing the same COPY command over and over again, and only the new files will be loaded.  However, as you evolve your load process, don’t abuse this feature. For each COPY command, Snowflake must first get the list of files in the directory to do the matching and, if the list grows too large, the command may spend more time figuring out which files to load than actually loading the data.

Warehouse for Loading

Snowflake has the unique ability to spin compute resources up and down, and those resources can be configured in different sizes. This enables storing and analyzing petabyte-scale data sets. To load or query data, a Virtual Warehouse (VW) must be running and in use within the session. Since Snowflake offers a utility-based pricing model similar to Amazon’s (i.e. we charge for compute resources by the hour), you will want to optimize your VW for loading based on your use case (i.e. how much data you are loading and how quickly you need access to it). You can bring a VW up or down using SQL or the web user interface, and you can also set parameters for auto-suspend and auto-resume. Here is an example command to create and start a VW:

CREATE WAREHOUSE load_warehouse WITH
                 WAREHOUSE_SIZE = 'MEDIUM'
                 WAREHOUSE_TYPE = 'STANDARD'
                 AUTO_SUSPEND = 1800 AUTO_RESUME = TRUE
                 MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 1;

Tip: Snowflake loads data from files in parallel extremely efficiently, so start with an x-small or small size, and use a single VW for both loading and querying. As your data volumes increase or SLAs for query performance tighten, you can separate out these functions into separate VWs — one for loading, one for querying — and size each for the appropriate workload.
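Suspending, resuming, and resizing a VW are each a single statement; here is a sketch using the warehouse created above:

ALTER WAREHOUSE load_warehouse SUSPEND;
ALTER WAREHOUSE load_warehouse RESUME;
ALTER WAREHOUSE load_warehouse SET WAREHOUSE_SIZE = 'LARGE';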

Table Definition

Before loading data into Snowflake, you must define a structure. Since Snowflake is ANSI SQL-compliant, you can execute DDL statements such as CREATE TABLE from within the web interface (in the Worksheet tab) or through the command line interface (snowsql), or you can simply use the Database tab in the web interface. There are two key elements to defining a table:

  • Table Type
  • Table Structure

Table Type

Snowflake supports three table types: permanent, transient, and temporary. The differences between the three are twofold: how long the table persists and how much Time Travel and Fail-safe protection is provided (which contributes to storage cost). A sketch of all three follows the list below.

  • Temporary tables are just that — they exist for the length of your session. They are typically used as a temporary place to store and act on data while programming. They do not provide any Time Travel or Fail-safe protection so they incur no additional storage footprint beyond the data in the table, which is purged from Snowflake when the session ends.
  • Transient tables are often used as staging areas for data that could easily be reloaded if lost or corrupted. Like temporary tables, they provide no Fail-safe protection, but data persists for a day after you delete, drop or truncate the table, allowing for Time Travel (https://docs.snowflake.net/manuals/user-guide/data-time-travel.html). Otherwise, transient tables persist data exactly like permanent tables, except for the Fail-safe aspect.
  • Permanent tables are the default type created by the CREATE TABLE command. By default, they provide 1 day of Time Travel and have the added benefit of seven days of Fail-safe protection (https://docs.snowflake.net/manuals/user-guide/data-cdp-storage-costs.html). And, with our Enterprise offering, permanent tables can be configured for up to 90 days of Time Travel.
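A quick sketch of all three types (the table names are made up; the 90-day retention setting assumes the Enterprise offering mentioned above):

CREATE TEMPORARY TABLE scratch_events (v VARIANT);  -- lives only for the session
CREATE TRANSIENT TABLE stage_events (v VARIANT);    -- no Fail-safe protection
CREATE TABLE prod_events (v VARIANT);               -- permanent: Time Travel plus Fail-safe

-- Extend Time Travel on the permanent table (up to 90 days with Enterprise)
ALTER TABLE prod_events SET DATA_RETENTION_TIME_IN_DAYS = 90;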

Table Structure

The structure you choose for your table depends on the type of data you are loading:

  • For structured data (CSV/TSV), we highly recommend pre-defining each column/field in your target table with the proper data type before import. Snowflake provides a File Format object that can help with date conversion, column and row separators, null value definition, etc.
  • For semi-structured data (JSON, Avro, XML), define a table with a single column using VARIANT as the data type (both cases are sketched below).

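A minimal sketch of both cases (the object names and format options are illustrative):

-- Structured data: a File Format object plus a conventional table definition
CREATE OR REPLACE FILE FORMAT csv_load_format
     TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1
     NULL_IF = ('', 'NULL') DATE_FORMAT = 'YYYY-MM-DD';

CREATE OR REPLACE TABLE click_events_csv (
     event_id INTEGER,
     event_date DATE,
     user_name VARCHAR(100));

-- Semi-structured data: a single VARIANT column is all you need
CREATE OR REPLACE TABLE elt_table (v VARIANT);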

When you load semi-structured data into the VARIANT column, Snowflake automatically shreds the objects into sub-columns optimized for query processing and accessible via SQL using dot notation; e.g.:

SELECT COUNT(fieldname:objectname) FROM table;

Conclusion

Ultimately, designing streaming architectures and analyzing the resulting data can be simplified by making the right architectural decisions and leveraging the platform’s functionality to your advantage. Start small and continue to add capability as your data volumes, data types, and use cases expand.

In PART II of this blog series (coming soon), we will look at how data is loaded into Snowflake and what decisions should be made regarding in-database transformations, such as moving from the traditional ETL model to ELT, which can be much more efficient for big data processing.

And, as always, keep watching this blog site and our Snowflake-related Twitter feed (@SnowflakeDB) for more useful information and up-to-the-minute news about Snowflake Computing.