New Snowflake features released in Q2’17

It has been an incredible few months at Snowflake. Along with the introduction of self-service and numerous other features added in the last quarter, we have seen:

  • Exponential growth of our customer base, with large numbers of applications in full production.
  • Billions of analytical jobs successfully executed this year alone, petabytes of data stored in Snowflake today, and not a single failed deployment to date.
  • Strong interest in pushing the boundaries of data warehousing even further by allowing everyone in an organization to share, access, and analyze data.

Continuing to engage closely with our customers during this rapid growth period, we rolled out key new product capabilities throughout the second quarter.

Instantly sharing data without limits – Introducing the Data Sharehouse  

Giving users the ability to easily and securely share data in the right format without the need for cumbersome data integration pipelines.

One of our highlights was the general availability of Snowflake’s Data Sharing feature. It allows multiple Snowflake customers to instantly share data with each other.

With data sharing, there is no need to use the expensive, insecure, and often complicated and error-prone procedure of transferring large numbers of single files from one location to another. Instead, customers can now simply share data via SQL statements and secure views fully integrated with our role-based access control model.
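
As a rough illustration of the provider side of a share (the share, database, and view names below are hypothetical), the whole workflow boils down to a few SQL statements:

-- Provider account: create a share and expose a secure view through it
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON VIEW sales_db.public.orders_secure_v TO SHARE sales_share;

-- Add the consumer account(s) that may create a database from the share
ALTER SHARE sales_share ADD ACCOUNTS = consumer_acct;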

To learn more about how Snowflake customers are already leveraging this capability to build new products and features that drive their business, we encourage you to read our customers' data sharing stories.

Improving out-of-the box performance & SQL programmability  

Our ongoing mission is to build the fastest database for data warehousing with the SQL you love and no concurrency limits.

  • We continued to make end-to-end query execution faster, with more efficient pruning for sub-columns in VARIANT types (JSON), general JOIN performance improvements, and faster global median calculations.
  • We addressed popular customer requests for improved programmability via SQL.
  • Customers using our multi-cluster warehouse auto-scale feature for spiking workloads can now specify up to 10 compute clusters, allowing hundreds of queries to run without any query queuing (see the sketch after this list).
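
For illustration, a multi-cluster warehouse that auto-scales up to 10 clusters can be defined roughly as follows (the warehouse name and sizing are hypothetical):

CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 10      -- scale out to 10 clusters under load
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 300
  AUTO_RESUME       = TRUE;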

Staying ahead with enterprise-ready security and compliance

From day one, security has always been core to Snowflake’s design.

  • One of our most exciting new capabilities this quarter is the general availability of customer-managed keys, which adds an additional layer of security for customers with sensitive data. This feature is the primary component of Tri-Secret Secure, a Snowflake Enterprise Edition for Sensitive Data (ESD) feature. You can find more details in our engineering blog about customer-managed keys in Snowflake.
  • We also improved the ability to monitor and filter the QUERY_HISTORY Information Schema table functions by more specific SQL command types (see the example after this list).
  • After its preview in Q1, secure views reached general availability in Q2.
  • In terms of certification, Snowflake received PCI DSS compliance – a common requirement for customers in banking, financial services, retail, and other industries. Customers and prospects with PCI requirements will need to subscribe to the Snowflake Enterprise Edition for Sensitive Data (ESD).
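
As a rough sketch, filtering recent query history down to specific command types might look like this (the time range and filter values are illustrative):

SELECT query_id, query_type, warehouse_name, total_elapsed_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(
       END_TIME_RANGE_START => DATEADD('hours', -24, CURRENT_TIMESTAMP()),
       RESULT_LIMIT         => 1000))
WHERE query_type IN ('CREATE_TABLE_AS_SELECT', 'DELETE', 'UPDATE');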

Improving our ecosystem and data loading  

Enabling developers and builders to create applications with their favorite tools, drivers, and languages remains a top priority.

  • For data warehouse automation, WhereScape added support for Snowflake.
  • We enhanced parallel data loading and unloading via the COPY command with new capabilities for developers.
  • Expanding our driver support, we announced a preview version of our open-source Go driver, available in Snowflake's GitHub repo.

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the service itself.

  • Snowflake users can now set up MFA via the UI; they no longer need to reach out to our support team to enable the feature.
  • Building on the general availability of Query Profile in Q1, we added a number of usability enhancements that help users better understand where time is spent during query execution.
  • The AWS key ID is now displayed in the DESC STAGE command output and in the web interface.
  • We added support for specifying the file extensions used for data unloading operations, and changed the default behavior when loading files containing byte order marks: we now detect and skip the marks instead of failing because the data could not be converted to the proper data types.
  • To allow Snowflake users to better control consumption of compute resources, we also enhanced resource monitors (currently in preview). Users can now explicitly assign them to virtual warehouses and specify actions to take when a credit threshold is met or exceeded. Please stay tuned for a separate blog post on this important capability.
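
As a rough sketch of how this might look (the monitor, quota, and warehouse names below are hypothetical), a resource monitor can be created and attached to a warehouse like so:

CREATE RESOURCE MONITOR monthly_quota
  WITH CREDIT_QUOTA = 1000
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE reporting_wh SET RESOURCE_MONITOR = monthly_quota;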

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • Given our rapid growth since the beginning of this year, we continued working on product scale, reliability, and availability to ensure our readiness for the next phase of hyper-growth.
  • We’re already seeing our efforts to automate all operations, particularly Snowflake deployments, pay off. We were able to quickly roll out a new region (US East – Virginia in Q2) while working on additional deployments around the globe at the same time.

Acknowledgements and conclusion

As always, we want to first thank our customers for their continuous feedback. Additionally, we want to recognize that moving rapidly while also scaling would not be possible without our mighty Engineering Team, which has proven once again that it’s possible to ship high-quality features while serving our existing users’ needs at the same time.

Now, onwards to Q3 and the next set of exciting capabilities you will hear about very soon!

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

New Snowflake features released in Q1’17

We recently celebrated an important milestone in reaching 500+ customers since Snowflake became generally available in June 2015. As companies of all sizes increasingly adopt Snowflake, we wanted to look back and provide an overview of the major new Snowflake features we released during Q1 of this year, and highlight the value these features provide for our customers.

Expanding global reach and simplifying on-boarding experience

Giving our customers freedom of choice, along with a simple, secure, and guided “Getting Started” experience, was a major focus of the last quarter.

  • We added a new region outside of the US; customers now have the option to analyze and store their data in Snowflake accounts deployed in EU-Frankfurt. Choosing the appropriate region is integrated into our self-service portal when new customers sign up.
  • In addition, we added our high-value product editions, Enterprise and Enterprise for Sensitive Data (ESD), to our self-service offerings across all available regions. For example, with Enterprise, customers can quickly implement auto-scale mode for multi-cluster warehouses to support varying, high concurrency workloads. And customers requiring HIPAA compliance can choose ESD.
  • Exploring other venues for enabling enterprises to get started quickly with Snowflake, we partnered with the AWS Marketplace team to include our on-demand Snowflake offerings, including the EU-Frankfurt option, in their newly-launched SaaS subscriptions.

Improving out-of-the-box performance & SQL coverage

We are committed to building the fastest cloud DW for your concurrent workloads with the SQL you love.

  • One key performance improvement introduced this quarter was the reduction of compilation times for JSON data. Internal TPC-DS tests demonstrate a reduction of 30-60% for most of the TPC-DS queries (single stream on a single, 100TB JSON table). In parallel, we worked on improving query compile time in general, providing up to a 50% performance improvement for short queries.
  • Another new key capability is the support for bulk data inserts on a table concurrently with other DML operations (e.g. DELETE, UPDATE, MERGE). By introducing more fine-grained locking at the micro-partition level, we are able to allow concurrent DML statements on the same table.
  • To improve our data clustering feature (currently in preview), we added support for specifying expressions on table columns in clustering keys. This enables more fine-grained control over the data in the columns used for clustering.
  • Also, we reduced the startup time for virtual warehouses (up to XL in size) to a few seconds, ensuring almost instantaneous provisioning for most virtual warehouses.
  • We extended our SQL by adding support for the ANSI SQL TABLESAMPLE clause. This is useful when a user wants to limit a query operation performed on a table to only a random subset of rows from the table.
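
A quick illustration of the last two items (the table and column names are hypothetical): a clustering key can be defined on an expression over a column, and TABLESAMPLE limits a query to a random subset of rows.

-- Cluster a table on an expression over one of its columns
ALTER TABLE events CLUSTER BY (TO_DATE(event_ts));

-- Query a random ~10% sample of the table's rows
SELECT COUNT(*) FROM events TABLESAMPLE (10);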

Staying Ahead with Enterprise-ready Security

From day one, security has always been core to Snowflake’s design.

  • We expanded Snowflake's federated authentication and single sign-on capability by integrating with many of the most popular SAML 2.0-compliant identity providers. In addition to Okta, Snowflake now supports ADFS/AD, Azure AD, Centrify, and OneLogin, to name just a few.
  • To advance Snowflake’s built-in auditing, we introduced new Information Schema table functions (LOGIN_HISTORY and LOGIN_HISTORY_BY_USER) that users can query to retrieve the short-term history of all successful and failed login requests in the previous 7 days. If required, users can maintain a long-term history by copying the output from these functions into regular SQL tables.
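
For example, the login history for the past seven days can be retrieved with something along these lines:

SELECT event_timestamp, user_name, client_ip, is_success, error_message
FROM TABLE(INFORMATION_SCHEMA.LOGIN_HISTORY(
       TIME_RANGE_START => DATEADD('days', -7, CURRENT_TIMESTAMP())));
-- For long-term retention, this output can be inserted into a regular table on a schedule.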

Improving our ecosystem

Enabling developers and builders to create applications with their favorite tools and languages remains a high priority for us.

  • With respect to enterprise-class ETL, we successfully collaborated with Talend in building a native Snowflake connector based on Talend’s new and modern connector SDK. The connector, currently in preview, has already been deployed by a number of joint customers with great initial feedback on performance and ease-of-use.
  • To tighten the integration of our Snowflake service with platforms suited for machine learning and advanced data transformations, we released a new version of our Snowflake Connector for Spark, drastically improving performance by pushing more query operations, including JOINs and various aggregation functions, down to Snowflake. Our internal 10 TB TPC-DS performance benchmark tests demonstrate that running TPC-DS queries using this new v2 Spark connector is up to 70% faster compared to executing SQL in Spark with Parquet or CSV (see this Blog post for details).
  • We continue to improve our drivers for our developer community. Listening to feedback from our large Python developer community, we worked on a new version of Snowflake’s native Python client driver, resulting in up to 40% performance improvements when fetching result sets from Snowflake. And, after we open-sourced our JDBC driver last quarter, we have now made the entire source code available on our official GitHub repository.
  • And, last, but not least, to enhance our parallel data loading via the COPY command, ETL developers can now dynamically add file metadata information, such as the actual file name and row number, which might not be part of the initial payload.
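
A rough sketch of what this looks like in a COPY statement (the stage and table names are hypothetical), using the METADATA$FILENAME and METADATA$FILE_ROW_NUMBER columns:

COPY INTO landing_table (src_file, src_row, col1, col2)
FROM (
  SELECT METADATA$FILENAME,        -- name of the staged file
         METADATA$FILE_ROW_NUMBER, -- row number within that file
         t.$1, t.$2
  FROM @my_stage t
)
FILE_FORMAT = (TYPE = 'CSV');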

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the running service.

  • One major addition to our service is Query Profile, now generally available and fully integrated into Snowflake's web interface. Query Profile is a graphical tool you can use to detect performance bottlenecks and identify areas for improving query performance.
  • Various UI enhancements were implemented: Snowflake’s History page now supports additional filtering by the actual SQL text and query identifier. We also added UI support for creating a Parquet file format in preparation for loading Parquet data into variant-type table columns in Snowflake.
  • A new Information Schema table function (TABLE_STORAGE_METRICS) exposes information about the data storage for individual tables. In particular, a user can now better understand how tables are impacted by Continuous Data Protection, particularly Time Travel and Fail-safe retention periods, as well as which tables contain cloned data.
  • We also recently introduced smarter virtual warehouse billing through Warehouse Billing Continuation (see this Blog post for details). If a warehouse is suspended and resumed within 60 minutes of the last charge, we do not charge again for the servers in the warehouse. WBC eliminates additional credit charges, and we hope it will reduce the need for our customers to strictly monitor and control when warehouses are suspended and resized.

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • As part of rolling out the new EU (Frankfurt) region, we increased automation of our internal deployment procedures to (a) further improve engineering efficiency while (b) laying the foundation for rapidly adding new regions based on customer feedback.
  • We further streamlined and strengthened our various internal testing and pre-release activities, allowing us to ship new features to our customers on a weekly basis – all in a fully transparent fashion with no downtime or impact on users.

Conclusion and Acknowledgements

This summary list of features delivered in Q1 highlights the high velocity and broad range of features the Snowflake Engineering Team has successfully delivered in a short period of time. We are committed to putting our customers first and maintaining this steady pace of shipping enterprise-ready features each quarter. Stay tuned for another feature-rich Q2.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Virtual Warehouse Billing Made Smarter

One of our most important commitments to our users is reducing/eliminating the management and tuning tasks imposed by other systems. We are always on the lookout for new and innovative ways of making our service easier to use.

Most recently, we looked at how our users manage their Snowflake virtual warehouses and realized we could be smarter about how we charge for the compute resources used in these warehouses.

First, Some Background Information…

To begin, here are some details about what happens when a virtual warehouse (or simply “warehouse”) is resumed and suspended in Snowflake:

  1. Each time the warehouse is resumed:
    • Snowflake provisions servers from AWS EC2 for a minimum of one hour. This is based on how AWS charges for EC2 resources.
    • The number of Snowflake credits charged depends on the number of servers provisioned, which is determined by the size of the warehouse (XS, S, M, L, XL, etc.) and the number of clusters in the warehouse (if multi-cluster warehouses are enabled).
  2. When the warehouse is suspended, the servers are shut down and credits are no longer charged.
  3. When the warehouse is resumed again, the servers are re-provisioned and the charges start over, regardless of how much time has passed since the warehouse was last charged. As a result, if the same warehouse is resumed multiple times within the same hour, credits are charged each time. As stated earlier, this follows the AWS EC2 model.

What We Changed

We noticed that some of our users were spending time and effort managing their virtual warehouses to ensure that credits were not consumed unnecessarily. We decided to eliminate this extra work by introducing Warehouse Billing Continuation (WBC).

With WBC, we now track the last time each individual server in a warehouse was charged and, if the warehouse is suspended and resumed within 60 minutes of the last charge, we don’t charge again for the server. The charge is continued from the last time as if the warehouse had never been suspended. This eliminates any extra charges, thereby reducing the need for strictly monitoring and controlling when warehouses are suspended and resumed.

How Does WBC Work?

The simple answer is it just works, regardless of how often you resume and suspend your virtual warehouses. If this answer satisfies your curiosity, you can skip now to the end of this post. Otherwise, read on for the gory details…

The best way to explain WBC is with examples. Say you have a Small warehouse (2 servers) that’s been suspended for longer than an hour. Now, imagine this warehouse goes through the following status changes:

| Resumed | Suspended | Credits Charged before WBC | Credits Charged with WBC |
|---------|-----------|----------------------------|--------------------------|
| 09:15   | 09:25     | 2                          | 2                        |
| 09:40   | 09:50     | 2                          |                          |
| 10:05   | 10:10     | 2                          |                          |
| 10:30   | 10:50     | 2                          | 2                        |
| 11:20   | 11:40     | 2                          | 2 (at 11:30)             |

Before WBC, every time the warehouse was resumed, we charged 2 credits (1 credit per server in the warehouse). For example, in the scenario described above, between 09:15 and 11:40 the warehouse incurred 5 charges for a total of 10 credits.

With WBC, the behavior is different. The warehouse only incurs 3 charges for a total of 6 credits. The following diagram provides a more detailed explanation of what actually happens:

Figure 1: Example of charges for resuming and suspending a Small virtual warehouse
  1. At 09:15, 2 credits are charged for the servers in the warehouse because the warehouse has been suspended for longer than an hour and there’s no previous charge to continue.
  2. At 09:40 and 10:05, we recognize that the warehouse was charged within the last hour, so no additional charges are incurred. In other words, the current charge doesn't expire until 60 minutes after it was first incurred, so the earliest the warehouse will be charged again is 10:15, regardless of how many times the warehouse is suspended and resumed during this period.
  3. At 10:15, the warehouse isn’t charged because it’s not running at that time.
  4. At 10:30, the 2 servers are charged again because more than 60 minutes have elapsed since the initial charge. More importantly, this new time is now used to track all subsequent charges, i.e. the earliest time for the next charge is 11:30.
  5. At 11:20, no charge is incurred because 60 minutes haven’t elapsed since the last charge.
  6. At 11:30, the warehouse is charged again because 60 minutes have now elapsed since the last charge and the warehouse is running.

This example covers a relatively simple case for a Small warehouse. Each successively larger warehouse has more servers, so the scenarios are slightly more involved, especially if the warehouse is resized (or multi-cluster warehouses are being used); however, the mechanics and calculations are all the same. The most important thing to remember is that every server in a warehouse is charged independently.

WBC Example with Virtual Warehouse Resizing

Consider the same example from earlier, but starting with a Medium warehouse (4 servers) and the following resize events:

  • Warehouse resized down to Small (2 servers) at 09:30, 10:00, and 11:00.
  • Warehouse resized back to Medium (4 servers) at 09:45, 10:45, and 11:15.

| Resumed | Credits Charged for Resume | Resized        | Credits Charged for Resize |
|---------|----------------------------|----------------|----------------------------|
| 09:15   | 4                          | 09:30 (Small)  |                            |
| 09:40   |                            | 09:45 (Medium) |                            |
|         |                            | 10:00 (Small)  |                            |
| 10:05   |                            |                |                            |
| 10:30   | 2                          | 10:45 (Medium) | 2                          |
|         |                            | 11:00 (Small)  |                            |
|         |                            | 11:15 (Medium) |                            |
| 11:20   | 2 (at 11:30)               |                | 2 (at 11:45)               |

The total credits charged would be (4+2+2) + (2+2) = 12. The following diagram shows how the charges are incurred (remember that I warned you earlier about the gory details):

Figure 2: Example of charges for resuming and suspending a Medium virtual warehouse with resizing
  1. At 09:15, the initial charge for the Medium warehouse is 4 credits.
  2. At 10:30, the new charge is 2 credits (reflecting the Small size at the time the warehouse is resumed). These 2 servers will be charged next in 60 minutes at 11:30.
  3. At 10:45, a new, separate charge for 2 additional servers is incurred (due to the resize to Medium). These 2 additional servers will be charged next at 11:45, independently of the other 2 servers.
  4. All the other resizing events increase or decrease the number of servers running at that time, but incur no additional charges.

Note that this diagram also illustrates how we remove servers from a warehouse: we always remove the most recently provisioned servers first (i.e. LIFO) and add them back in the same order. This matters because each server is charged according to its individual start time. It also means there is no benefit to reducing the size of a warehouse within the hour it runs, because those servers have already been charged for the hour.

So, What Next?

There’s no need to make any changes to your virtual warehouses. We’ve enabled Warehouse Billing Continuation by default for all accounts and warehouses, including any existing warehouses. In fact, we implemented WBC near the end of March so you probably have already noticed reduced charges for some of your warehouses, particularly the ones that you resume and suspend frequently.

However, to take full advantage of this change, you might want to revisit your warehouse procedures and settings. For example, you can now set auto-suspend for a warehouse to a shorter value (e.g. 5 or 10 minutes) without worrying about being charged additional credits each time the warehouse resumes within the hour.
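
For example, lowering auto-suspend to 10 minutes is a one-line change (the warehouse name is hypothetical; AUTO_SUSPEND is specified in seconds):

ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 600;  -- suspend after 10 minutes of inactivity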

Interested in learning more? Check out the Snowflake documentation for more in-depth information about warehouses.

Also, keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Tech Tip: Quick Start for Getting Your First Data into Snowflake

If you are like me and fairly new to this whole cloud thing, then one of your main questions is likely:

“How do I get data from my desktop (or server) into Snowflake so I can query it?”

Which, in reality, translates to:

“How do I load data in the cloud?”

Well, there are several methods, including a variety of integration and ETL tools. In this tech tip I will show you the simplest way to accomplish this via the Snowflake Web UI.

Our brilliant engineers built a nice little wizard to make it simple for cloud novices (like me) to easily upload a data file from your desktop and load it into a table in Snowflake.

Getting Started

First off, you need a table in the database that matches the structure of the data file you are going to load (you modeled that table, right?). You can build it by running a CREATE TABLE DDL statement in a worksheet once you have logged in.

So something like this:

CREATE TABLE Region_t1
  (
    Name               VARCHAR2 (200) NOT NULL ,
    Code               VARCHAR2 (2) NOT NULL ,
    Abbreviation       VARCHAR2 (2) NOT NULL ,
    Sort_Order         NUMBER (2) NOT NULL ,
    External_Reference VARCHAR2 (1000) NOT NULL ,
    Record_Source      VARCHAR2 (100) NOT NULL ,
    Record_Owner       NUMBER (1) NOT NULL ,
    Comments           VARCHAR2 (2000)
  ) ;

ALTER TABLE Region_t1 ADD CONSTRAINT PK_Region_t1 PRIMARY KEY ( Code ) ;

ALTER TABLE Region_t1 ADD CONSTRAINT UK_Region_t1 UNIQUE ( Name ) ;

Invoke the Load Wizard

To begin the load process, navigate to the Database tab in the UI, drill into your database and then click on the table name. Once you do that, the Load Table widget is displayed (fourth from the left on the toolbar).  Click on the widget, and the Load Data wizard opens and asks for a warehouse. Select a virtual warehouse to use from the drop-down list. (Note: The list displays any warehouse you have the USAGE privilege on.)

Select the File to Load

In the next step, you select the source file to load. You have options to get a file from your file system, or from an AWS S3 bucket if you have already moved some files to the cloud.

Note that you can also load multiple files! The select dialog box supports multi-select. But keep in mind that the files must all be the same type with the same structure because they’re all going into the same table.

In the next step, you pick a file format from the drop-down list. In this case, I have selected a format that specifies a CSV type file with comma delimiters.

Creating a File Format (if necessary)

If you do not have an appropriate file format already, simply click the plus sign (+) next to the drop-down arrow in the UI to define a new one. For a standard CSV file that uses commas as the delimiter, you really just need to give the file format a name and click Finish. However, if the first row of the file includes column headings, be sure to set Header lines to skip to 1.

Setting Load Options

The next step of the wizard allows you to choose how to handle load errors. There are four options. I chose to have the load continue by skipping any problem records (which I will deal with after the load).

At this point, you are set to go. If you want to see the code for the load (so you can learn something new, right?), just click the Show SQL link before you click Load.

Running the Load

Once you click Load, the process starts and does the following:

  1. Uses PUT to copy the file to a Snowflake internal staging area, automatically encrypting the data during the process.
  2. Uses COPY to load the data into the target table and deletes the file from the staging area upon completion.
  3. Reports the results.
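
If you prefer the command line, the wizard is roughly equivalent to running a PUT followed by a COPY yourself (the file path and options below are illustrative):

-- Stage the local file in the table's internal stage (encrypted in transit)
PUT file:///tmp/region.csv @%region_t1;

-- Load the staged file into the table, skipping problem records
COPY INTO region_t1
  FROM @%region_t1
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
  ON_ERROR = 'CONTINUE'
  PURGE = TRUE;  -- remove the staged file after a successful load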

And voila, you have now loaded data from your desktop into the cloud-native Snowflake Elastic Data Warehouse. Congratulations.

For a much deeper dive into loading data into Snowflake check out the Data Loading topic in our user guide.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more tips and tricks, along with updates on all the action and activities here at Snowflake Computing.

Power BI Enables Connectivity to Snowflake

Guest post by Miguel Llopis, Senior Program Manager for Power BI at Microsoft. This post can also be found on the Power BI Blog.

As part of the Power BI Desktop August Update we are very excited to announce a preview of a new data connector for Snowflake.

Snowflake is a cloud-native elastic data warehouse service that makes it easy to bring together data from disparate sources and make it available to all users and systems that need to analyze it. With the new Power BI Desktop connector, users can import data from Snowflake, combine it with data from other sources, and build rich data visuals as part of their Power BI reports.

In order to use this new connector, users need to first enable the Preview feature option within the Options dialog (File -> Options and Settings -> Options).

As a prerequisite, the Snowflake connector requires users to install the Snowflake ODBC driver on their machines, matching the architecture of their Power BI Desktop installation (i.e. 32-bit vs. 64-bit). The Snowflake ODBC driver can be downloaded from this location: http://go.microsoft.com/fwlink/?LinkID=823762

After the feature has been enabled, users can find the Snowflake connector within the “Get Data” dialog, under the Database category.

In the connector dialog, users can specify the Snowflake Server and Warehouse to connect to.

Once the connection details have been specified, Power BI Desktop will attempt to connect to Snowflake, which will require users to specify credentials. In Power BI Desktop, credentials are only asked from the user on first-time connection, after which they are stored as part of the user profile and can be managed from the “Data Source Settings” dialog (under File -> Options and Settings). Subsequent connections to Snowflake will reuse the stored credentials for the current user.

After the connection has been successfully established and the user has been authenticated, the Navigator dialog will allow users to browse the list of databases available within the specified warehouse. In this dialog, users can preview and select one or multiple tables to use in their Power BI report.

Once one or more tables have been selected in the Navigator dialog, users can either:

  • Load: This will load all data rows from each of the tables selected into the local Power BI Desktop data model, from where users can build their reports.
  • Edit: This option will bring users into the Query Editor dialog, from where they can perform data transformations and filters on a sample subset of the data, without the need to download the entire data set. Operations applied within the Query Editor dialog will run on the Snowflake backend, as long as the operations are supported by the source (e.g. filters).

We’re very glad to enable connectivity to Snowflake from Power BI and are looking forward to customers’ feedback. We encourage you to give it a try and share with us any feedback or issues that you encounter via the “Send a Frown” feature in Power BI Desktop. In subsequent updates, we will also enable DirectQuery capabilities for this connector, as well as support for refreshing datasets based on Snowflake data in the Power BI service via the Power BI Data Gateway.

Stay tuned for improvements to the Snowflake connector and for many other connectors in upcoming releases. If yours isn’t supported yet, just let us know.

Miguel Llopis

Automatic Query Optimization. No Tuning!

Hopefully you had a chance to read all our previous top 10 posts. We now conclude the series with a final deep dive into my #1 Cool Feature from Snowflake:

#1 Automatic Query Optimization. No Tuning!

As a long time data architect, and not a DBA, this is my favorite part of Snowflake. I do not have to worry about query performance at all. It is all handled “auto-magically” via a dynamic optimization engine in our cloud services layer. I just model, load, and query the data.

Cloud Services

Snowflake is an MPP, columnar store, designed by definition for high-speed analytic queries. Our cloud services layer does all the query planning and optimization, based on data profiles that are collected automatically as the data is loaded. It automatically collects and maintains the required statistics to determine how to most effectively distribute the data and queries across the available compute nodes.

So, no indexes, no need to figure out partitions and partition keys, no need to pre-shard any data for even distribution, and no need to remember to update statistics.

Another key point to remember is that not only does this dynamic optimization auto-tune queries that work against traditional structured data, it also works against semi-structured data (like JSON, Avro, or XML). When semi-structured data is loaded into our VARIANT data type, as mentioned in an earlier blog post, the data is profiled and columnarized behind the scenes. With this approach, querying the data (using our JSON or XML extensions) is optimized just like for structured relational data, without having to spend the time and effort to split it out into relational tables and columns. Because of this unique approach to handling semi-structured data, Snowflake not only supports joining structured and semi-structured data with SQL, it optimizes those joins too.
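
To make that concrete, here is a rough sketch of joining semi-structured and structured data in a single query (the events table and its JSON shape are hypothetical; customer is the familiar TPC-H table):

-- events.payload is a VARIANT column holding JSON documents
SELECT c.c_name,
       e.payload:page::string AS page,
       COUNT(*)               AS clicks
FROM events e
JOIN customer c
  ON c.c_custkey = e.payload:custkey::number
WHERE e.payload:event_type::string = 'click'
GROUP BY 1, 2;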

Let go Luke – Trust the Force!

As you can imagine, there are some folks out there who do not like change and want us to add more knobs so they can manually tune their queries.

Why? Well, they say that they know their workload better than Snowflake and therefore can optimize it better than Snowflake software. Others say that they’ve already done all the work to optimize another platform for their workload and don’t want to “waste” all that time and effort.

Since I come from a traditional data warehouse background, and know many expert data warehouse architects who are world class at tuning specific databases to do specific types of queries,  I definitely understand and can sympathize with this point of view.

But the world of data has changed! The way we build, deploy, and use data (of all kinds) is continually evolving, and we need to evolve our approach to be successful.

Because of that, there are a few things to consider with regard to tuning:

  1. Your workloads will change.
  2. The sheer number of different workloads is going to grow faster than your bandwidth to manually tune (so someone has to prioritize who wins).
  3. What happens if your expert query tuner decides to sell real estate instead?

Relying on yourself, and other expert DBA tuning staff, is ultimately going to result in a performance tuning bottleneck, which will in turn lead to increased contention for resources in the data warehouse. No matter how good you are, in the end you will not be able to keep up. So why not use a data warehouse service to handle that work for you so you can focus on getting the business people access to all this data in a form they can use to make better business decisions? In short, people are the hardest thing to scale in any data analytics solution. Thus any solution that allows you to scale without bottlenecks on people is a winner.

And if the query is still not fast enough, with Snowflake you always have the option to resize your virtual warehouse to add additional compute resources, as we described in the post about elasticity. Without the benefit of a dynamic optimization engine, you could not change the compute configuration on the fly and see immediate improvements. In many other environments, if you changed the number of compute nodes you would have to redistribute the data, or do some other manual tuning, to get the improved performance. That would not be elastic at all.

For all these reasons, our patent-pending dynamic optimization is, to me, one of the most important features when it comes to making the Snowflake Elastic Data Warehouse a true SaaS offering and enabling agile, high-speed analytics at any scale. As part of our vision, we developed an offering that helps customers concentrate on understanding and deriving value from their data rather than spending time refining, managing, and optimizing the environment. Business users and data scientists can easily analyze their data by connecting natively to Snowflake through various tools, and not have to worry about tuning and query optimization.

As I said before, as an architect and data warehouse designer, I really like this model!

End of this trail…more to come

Well, that is the last of my Top 10 list of cool features in Snowflake.  I want to thank Saqib Mustafa for all his help collaborating on this series of posts.

There are of course many other cool features in Snowflake, and more to come in the future. So keep your eyes on this blog for future posts with more details on the Snowflake Elastic Data Warehouse, the cloud-native data warehouse as a service.

As always, keep an eye on our Snowflake-related Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for continuing updates on all the action and activities here at Snowflake Computing.

Support for Multiple Workloads

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#3 Support for Multiple Workloads

With our unique multi-cluster, shared data architecture, Snowflake can easily support multiple, disparate workloads. Workload contention is a common issue in traditional data warehouses, so it makes total sense to keep disparate workloads truly separate and avoid resource contention altogether, rather than just saying we support “mixed” workloads.

In legacy data warehouse environments, we often found ourselves constrained by what we could run and when we could run it for fear of resource contention, especially with the CPUs. In many cases it was impossible to refresh the data during the day because the highly parallelized, batch ETL process, while tuned for maximum throughput, usually hogged all the CPUs while it ran. That meant virtually no reporting queries could get resources so they would just hang. Likewise a complex report with calculations and massive aggregations would cause normally fast, simple reports to languish. And there was no way you could let any business users in the system to do exploratory queries as those might also cause everything else to hang.

Because of the separation of compute and storage native to Snowflake’s architecture, as previously highlighted, you can easily spin up a set of compute nodes (we call them Virtual Warehouses) to run your ELT processes, and another set to support your BI report users, and a third set to support data scientists and data miners. In fact you can spin up (or down!) as many virtual warehouses as you need to execute all the workloads you have.
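
A rough sketch of what this looks like in practice (the warehouse names and sizes are hypothetical), with one warehouse per workload, all sharing the same data:

CREATE WAREHOUSE etl_wh     WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE bi_wh      WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE science_wh WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

-- Each team points its sessions at its own warehouse; the data is shared
USE WAREHOUSE bi_wh;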

So not only do all the virtual warehouses share the same data (ensuring consistent results), they do so without being affected by operations launched in other virtual warehouses, because each uses completely separate resources. Hence, there is no more resource contention!

With the Snowflake Elastic Data Warehouse, there is no more need to run the data loads at night just to avoid slowing down the reports. No more worry that one runaway query will impact the loads or other users. You can now run loads (e.g., real time, micro-batch, etc) at any time and thus provide your analysts and users current data on a more frequent basis.

And even better – no special skills or secret configuration settings are required to make this work. It is the way Snowflake’s Data Warehouse as a Service (DWaaS) is built by design.

Nice!

Thanks to Saqib Mustafa for his help and suggestions on this post.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

ANSI SQL with Analytic Functions

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#5 ANSI compliant SQL with Analytic Functions

At Snowflake, we believe that it should be easy to access, query, and derive insights from your data. To support that, we provide our users with the ability to query all their data using ANSI-compliant SQL. (Hard to call yourself a relational database otherwise, right?)

However, Snowflake goes beyond  basic SQL, delivering sophisticated analytic and windowing functions as part of our data warehouse service. Functions like:

  • CUME_DIST
  • DENSE_RANK
  • FIRST_VALUE
  • LAG
  • LAST_VALUE
  • LEAD
  • NTILE
  • PERCENT_RANK
  • RANK
  • ROW_NUMBER

select Nation, Customer, Total
from (select n.n_name Nation,
             c.c_name Customer,
             sum(o.o_totalprice) Total,
             rank() over (partition by n.n_name
                          order by sum(o.o_totalprice) desc) customer_rank
      from orders o,
           customer c,
           nation n
      where o.o_custkey = c.c_custkey
        and c.c_nationkey = n.n_nationkey
      group by 1, 2)
where customer_rank <= 3
order by 1, customer_rank;

As you can see in the example, we support not only analytic windowing functions, but all the other features you would expect in SQL. This includes, but is not limited to, general aggregation functions (e.g. SUM), nested virtual tables, subqueries, ORDER BY, and GROUP BY.

In addition to general aggregation functions, we also have:

  • Bitwise aggregation functions
  • Linear regressions functions and
  • Cardinality estimation functions (i.e., HyperLogLog)
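
For instance, a quick cardinality estimate over the orders table from the example above might look like this (a sketch; HLL is an alias for APPROX_COUNT_DISTINCT):

-- Estimate the number of distinct customers that placed orders
SELECT APPROX_COUNT_DISTINCT(o_custkey) AS approx_customers
FROM orders;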

So, if your existing queries are written with standard SQL, they will run in Snowflake. And, as we noted in the previous blog on JSON, you can apply all these functions to your semi-structured data natively using Snowflake.

Another reason to love the Snowflake Elastic Data Warehouse.

As always, keep an eye on this blog site (or better – sign up for the RSS feed), and our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

Snowflake Fast Clone

Hopefully you had a chance to read our previous posts: Query Results Sets available in history, Ability to connect with JDBC, and Undrop a table, schema or a database. As promised in the original Top 10 post, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#7 Fast Clone

Even cooler than UNDROP is our fast clone feature.

Have you ever had to wait hours, days, maybe weeks to spin up a copy of your production data warehouse so you could have a test or a development environment? And have to pay extra for the test or development environment to able to hold all the replicated data? Or maybe you have some new data scientists that just want to play around with the data but they really need their own copy?

With the simple Snowflake CLONE command, our customers can create multiple copies of their tables, schemas, and databases without replicating the data itself. This gives our customers the ability to almost instantly make the data available for use by multiple user groups, without the additional cost (or time) of actually replicating the data.

Magic??

Almost –

Magic with Meta Data.

Based on our unique solution, Snowflake users are able to clone their tables, schemas, or databases without creating additional copies. Snowflake stores data in immutable, encrypted files as part of our architecture. Our cloud services layer, with its metadata repository, records information about the files being stored, the file locations, and a reference to a certain version of the data; this metadata is also kept encrypted. In addition, when any data changes, the metadata repository is automatically updated to provide a pointer to the changed data. All of this is performed in the background by the software without any involvement from the user, and the metadata repository retains the record for all versions of the data set.

How Cloning works

Because of the data in the metadata store, the user can quickly create a clone of the table. All the user has to do is to submit the clone command. This command can be submitted at any time.

CREATE OR REPLACE TABLE MyTable_V2 CLONE MyTable

As a result of the CLONE command, the system simply creates a new entry in the metadata store to keep track of the new clone.

Time Traveling Clones! Oh My!

In addition to simple cloning of objects, you can blend cloning with Snowflake time travel to clone tables, schemas, or even databases at a point in time in the past AT or BEFORE a specific timestamp. Here is an example:

CREATE SCHEMA mytestschema_clone_restore CLONE testschema BEFORE (TIMESTAMP => TO_TIMESTAMP(40*365*86400));

This command once executed will create a clone of the entire schema (tables, views, etc.) as it existed before the specified timestamp.

Hmm…time traveling clones…sounds like an episode of the X-Files (also cool).

Because Snowflake maintains the history of queries performed and identifies each by a unique ID, we can also create a clone AT or BEFORE the point at which a specific statement (identified by its query ID) ran. This would allow you, for example, to run a revised set of scripts against an older data set and then compare the results to the current data set.
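
A sketch of that, with a placeholder query ID:

CREATE TABLE mytable_before_update CLONE mytable
  BEFORE (STATEMENT => '<query_id>');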

In addition, cloned objects are independent of each other. Despite that independence, there are no additional storage requirements and thus no additional charges (unless you add or modify records), since the clones share the underlying files. Snowflake thus allows its customers to clone at multiple levels: table, schema (including file formats, views, sequences, and stages), and database, and at points in time. And because the clones are independent, updates to one are not visible in the others.

Very helpful for experimentation and data exploration!

Cloning Example

As an example of what it looks like in the Snowflake UI, here is a snapshot of one of my demo databases with Twitter data. It is about 2 TB of data.

Now here is me launching the Clone via our Web UI at 3:02:49 PM.

And here is the result showing the cloned db created at 3:03:55 PM. Barely a minute to create a clone of a 2TB database with 10 tables!

And I am not a DBA let alone a “cloud” DBA. This is all part of Snowflake being a Data Warehouse as a Service (DWaaS). Another reason why I think it is way cool!

Why is this hard to do in traditional systems?

In traditional RDBMS systems, if you wanted to clone an existing data warehouse and give a different set of users access to the environment, you typically had to create a whole new, separate deployment of the environment. In Snowflake, as I have just shown, you can do this with ease with just a few commands (one of the many benefits of having written Snowflake from scratch for the cloud). Moreover, while most traditional data warehouse systems allow you to create snapshots of the data, this generally results in consuming more storage (which of course costs more $$).

All these reasons are why Fast Cloning made my Top 10 list of really cool features in the Snowflake Elastic Data Warehouse.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano),  (@cloudsommelier), and #ElasticDW for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

Special thanks to Ashish Motivala from Snowflake Engineering for helping us with some of the more technical details on how this works so well.

Kent Graziano and Saqib Mustafa

Undrop a Table, Database or Schema

Hopefully you had a chance to read our previous posts, #10 Query Results Sets available in history and #9 Ability to connect with JDBC. As promised in the original Top 10 post, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#8 UNDROP

Have you ever accidentally dropped a table? Maybe even a table with hundreds of millions of rows? Ouch. Or maybe someone ran the wrong script with a DROP in Production. And then you have to rush to the backup only to discover that the backup is 24 hours old! And then when you restore, it takes 3 times as long as it did to do the original backup, and even then the restored copy is 24 hours old.

UNDROP is a really cool Snowflake feature that makes it much easier to retrieve lost data, because we store encrypted versions of the data and objects for 24 hours by default (with an option to purchase even longer retention periods if you need them).

With the UNDROP feature in Snowflake you can recover that lost data instantaneously with a single command:

UNDROP TABLE <tablename>

No need to reload last night’s backup to do the restore. No need to wait while all that data is pulled back in. No need to find the DBA who knows how to do it and has done it before.

It just happens!

In addition to UNDROP TABLE, we also have UNDROP SCHEMA and UNDROP DATABASE for when someone makes an even bigger goof (it really does happen – “Really, I thought I was logged into the dev database!”).

Again these are simple commands:

UNDROP SCHEMA <schema name>

UNDROP DATABASE <db name>

Both of these instantaneously restore all the objects that were previously dropped. UNDROP SCHEMA recovers all the tables, views, and sequences that were in that schema, while UNDROP DATABASE recovers all the schemas, tables, views, stage areas, and sequences in that database.

For example, here is a sample database history display (the result of running SHOW DATABASES HISTORY) that shows database SALES_DEV was dropped on 2/23/2016 at 8:49 AM. Oops.

Here is the history again after we execute the command UNDROP DATABASE SALES_DEV using the Snowflake Web UI.

In the Worksheet log you can see that the UNDROP was executed at 10:58 AM and took all of 591 milliseconds to execute. In the database history readout you can see that the “dropped_on” column now shows NULL for SALES_DEV indicating it is available for use again.

Very few databases allow you to UNDROP a table, schema, or database this easily and quickly, using a simple SQL command.

Now that is a huge time (and life) saver. Ask anyone who has dropped a production database or table by accident. And that is why it made the Top 10 list of really cool features in the Snowflake Elastic Data Warehouse.

As always, keep an eye on this blog site, our Snowflake Twitter feed (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier), and our LinkedIn page for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

Kent Graziano and Saqib Mustafa