Financial Services: Welcome to Virtual Private Snowflake

Correct, consistent data is the lifeblood of the financial services industry. If your data is correct and consistent, it’s valuable. If it’s wrong or inconsistent, it’s useless and may be dangerous to your organization.

I saw this firsthand during the financial meltdown of 2007/08. At that time, I had been working in the industry for nearly 20 years as a platform architect. Financial services companies needed that “single source of truth” more than ever. To remain viable, we needed to consolidate siloed data sets before we could calculate risk exposure. Most financial services companies were on the brink of collapse. Those that survived did so because they had access to the right data.

At my employer, we looked for a way to achieve this single source with in-house resources, but my team and I quickly realized it would be an extraordinary challenge. Multiple data marts were sprawled across the entire enterprise, and multiple sets of the same data existed in different places, so the numbers didn’t add up. In a global financial services company, even a one percent difference can represent billions of dollars and major risk.

We ultimately built an analytics platform powered by a data warehouse. It was a huge success. So successful, in fact, that everybody wanted to use it for wide-ranging production use cases. However, it couldn’t keep up with demand, and no amount of additional investment would solve that problem.

That’s when I began my quest to find a platform that could provide universal access, true data consistency and unlimited concurrency. And for financial services, it had to be more secure than anything enterprises were already using. I knew the cloud could address most of these needs. However, even with the right leap forward in technical innovation, would the industry accept it as secure? Then I found Snowflake. But my story doesn’t end there.

I knew Snowflake, the company, was onto something. So, I left financial services to join Snowflake and lead its product team. Snowflake represents a cloud-first approach to data warehousing, with a level of security and unlimited concurrency that financial services companies demand.

We’ve since taken that a step further with Virtual Private Snowflake (VPS) – our most secure version of Snowflake. VPS gives each customer a dedicated, managed instance of Snowflake within a separate Amazon Web Services (AWS) Virtual Private Cloud (VPC). In addition, customers get our existing, best-in-class Snowflake security features, including end-to-end encryption of data at rest and in transit. VPS also includes Tri-Secret Secure, which combines a customer-provided encryption key, a Snowflake-provided encryption key and user credentials. Together, these features thwart an attempted data decryption attack by instantly rendering the data unreadable, while the user-credential component authenticates approved users.

VPS is more secure than any on-premises solution and provides unlimited access to a single source of data without degrading performance. This means financial services companies don’t have to look at the cloud as a compromise between security and performance.

To find out more, read our VPS white paper and solution brief: Snowflake for Sensitive Data.

 

New Snowflake features released in Q2’17

It has been an incredible few months at Snowflake. Along with the introduction of self-service and numerous other features added in the last quarter, we have seen:

  • Exponential growth of our customer base, with large numbers of applications in full production.
  • Billions of analytical jobs successfully executed this year alone, petabytes of data stored in Snowflake today, and not a single failed deployment to date.
  • Strong interest in pushing the boundaries of data warehousing even further by allowing everyone in an organization to share, access and analyze data.

Continuing to engage closely with our customers during this rapid growth period, we rolled out key new product capabilities throughout the second quarter.

Instantly sharing data without limits – Introducing the Data Sharehouse  

Giving users the ability to easily and securely share data in the right format without the need for cumbersome data integration pipelines.

One of our highlights was the general availability of Snowflake’s Data Sharing feature. It allows multiple Snowflake customers to instantly share data with each other.

With data sharing, there is no need to use the expensive, insecure, and often complicated and error-prone procedure of transferring large numbers of single files from one location to another. Instead, customers can now simply share data via SQL statements and secure views fully integrated with our role-based access control model.
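For example, a provider can publish a secure view to another account with a handful of statements. Here is a minimal sketch (the database, view, share, and account names are hypothetical):

CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON VIEW sales_db.public.orders_v TO SHARE sales_share;  // a secure view
ALTER SHARE sales_share ADD ACCOUNTS = partner_account;  // the consumer account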

To learn more about how Snowflake customers are already leveraging this capability to build new products and features to drive their business, we encourage you to read:

Improving out-of-the-box performance & SQL programmability  

Our ongoing mission is to build the fastest database for data warehousing with the SQL you love and no concurrency limits.

  • We continued to make end-to-end query execution faster with more efficient pruning for sub-columns in VARIANT types (JSON), general JOIN performance improvements, and faster global median calculations.
  • We addressed popular customer requests for improved programmability via SQL by:
  • Customers who are using our multi-cluster warehouses auto-scale feature for spiking workloads can now specify up to 10 compute clusters. This allows running hundreds of queries without any query queuing.
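For illustration, a multi-cluster, auto-scaling warehouse might be defined like this (the name and size are hypothetical):

CREATE WAREHOUSE reporting_wh WITH
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 10  // auto-scale up to 10 compute clusters
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;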

Staying ahead with enterprise-ready security and compliance

From day one, security has always been core to Snowflake’s design.

  • One of our exciting new capabilities this quarter is the general availability of customer-managed keys which added an additional layer of security for customers with sensitive data. This feature is the primary component of Tri-Secret Secure, a Snowflake Enterprise Edition for Sensitive Data (ESD) feature. You can find more details in our engineering blog about customer-managed keys in Snowflake.  
  • We also improved the ability for our users to monitor and filter the query history Information Schema table functions by specific SQL command types (see the sketch after this list).
  • After its preview in Q1, secure views reached general availability in Q2.
  • In terms of certification, Snowflake achieved PCI DSS compliance – a common requirement for customers in banking, financial services, retail, services, and more. Customers and prospects with PCI requirements will need to subscribe to Snowflake Enterprise Edition for Sensitive Data (ESD).  
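As a sketch of that query history filtering, the following pulls recent queries and keeps only one command type (the limit and type shown are arbitrary examples):

SELECT query_text, query_type, total_elapsed_time
FROM TABLE(information_schema.query_history(result_limit => 1000))
WHERE query_type = 'SELECT';  // filter to a specific SQL command type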

Improving our ecosystem and data loading  

Enabling developers and builders to create applications with their favorite tools, drivers, and languages remains a top priority.

  • For data warehouse automation, WhereScape added support for Snowflake.
  • We enhanced our parallel data loading & unloading via the COPY command; developers can now:
  • Expanding our driver support, we announced a preview version of our open-source Go driver, available in Snowflake’s GitHub repo.

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the service itself.

  • Snowflake users can now set up MFA via the UI; they no longer need to reach out to our support team to enable the feature.
  • Building on the general availability of Query Profile in Q1, we added a number of additional usability enhancements that can be leveraged to better understand where time is spent during query execution.   
  • The AWS Key ID is now displayed in the DESC STAGE output and in the web interface.
  • We added support for specifying file extensions for data unloading operations, and we changed the default behavior when loading files containing byte order marks: we now detect and skip the marks instead of throwing an error because the data could not be converted to the proper data types.
  • To allow Snowflake users to better control consumption of compute resources, we also enhanced resource monitors (currently in preview). Users can now explicitly assign them to virtual warehouses and specify certain actions if a credit threshold is met or exceeded. Please stay tuned for a separate blog on this important capability.  
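As a rough sketch of how that assignment might look (the monitor name, quota, and thresholds are hypothetical):

CREATE RESOURCE MONITOR analyst_rm WITH CREDIT_QUOTA = 100
  TRIGGERS ON 80 PERCENT DO NOTIFY     // warn as the quota approaches
           ON 100 PERCENT DO SUSPEND;  // stop consuming credits at the quota
ALTER WAREHOUSE analyst_wh SET RESOURCE_MONITOR = analyst_rm;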

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • Given our rapid growth since the beginning of this year, we continued working on product scale, reliability, and availability to ensure our readiness for the next phase of hyper-growth.
  • We’re already seeing our efforts to automate all operations, particularly Snowflake deployments, pay off. We were able to quickly roll out a new region (US East – Virginia) in Q2 while simultaneously working on additional deployments around the globe.

Acknowledgements and conclusion

As always, we want to first thank our customers for their continuous feedback. Additionally, we want to recognize that moving rapidly while also scaling would not be possible without our mighty Engineering Team, which has proven once again that it’s possible to ship high-quality features while serving our existing users’ needs at the same time.

Now, onwards to Q3 and the next set of exciting capabilities you will hear about very soon!

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

New Snowflake features released in Q1’17

We recently celebrated an important milestone in reaching 500+ customers since Snowflake became generally available in June 2015. As companies of all sizes increasingly adopt Snowflake, we wanted to look back and provide an overview of the major new Snowflake features we released during Q1 of this year, and highlight the value these features provide for our customers.

Expanding global reach and simplifying on-boarding experience

Giving our customers freedom of choice, along with a simple, secure, and guided “Getting Started” experience, was a major focus of the last quarter.

  • We added a new region outside of the US; customers now have the option to analyze and store their data in Snowflake accounts deployed in EU-Frankfurt. Choosing the appropriate region is integrated into our self-service portal when new customers sign up.
  • In addition, we added our high-value product editions, Enterprise and Enterprise for Sensitive Data (ESD), to our self-service offerings across all available regions. For example, with Enterprise, customers can quickly implement auto-scale mode for multi-cluster warehouses to support varying, high concurrency workloads. And customers requiring HIPAA compliance can choose ESD.
  • Exploring other avenues for enabling enterprises to get started quickly with Snowflake, we partnered with the AWS Marketplace team to include our on-demand Snowflake offerings, including the EU-Frankfurt option, in their newly-launched SaaS subscriptions.

Improving out-of-the-box performance & SQL coverage

We are committed to building the fastest cloud DW for your concurrent workloads with the SQL you love.

  • One key performance improvement introduced this quarter was the reduction of compilation times for JSON data. Internal TPC-DS tests demonstrate a 30-60% reduction for most of the TPC-DS queries (single stream on a single, 100TB JSON table). In parallel, we worked on improving query compile time in general, providing up to a 50% performance improvement for short queries.
  • Another new key capability is the support for bulk data inserts on a table concurrently with other DML operations (e.g. DELETE, UPDATE, MERGE). By introducing more fine-grained locking at the micro-partition level, we are able to allow concurrent DML statements on the same table.
  • To improve our data clustering feature (currently in preview), we added support for specifying expressions on table columns in clustering keys. This enables more fine-grained control over the data in the columns used for clustering.
  • Also, we reduced the startup time for virtual warehouses (up to XL in size) to a few seconds, ensuring almost instantaneous provisioning for most virtual warehouses.
  • We extended our SQL by adding support for the ANSI SQL TABLESAMPLE clause. This is useful when a user wants to limit a query operation performed on a table to only a random subset of rows from the table.
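For example, to compute aggregates over roughly 10% of the rows in a (hypothetical) orders table:

SELECT COUNT(*), AVG(o_totalprice)
FROM orders TABLESAMPLE (10);  // operate on a random ~10% sample of rows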

Staying Ahead with Enterprise-ready Security

From day one, security has always been core to Snowflake’s design.

  • We expanded Snowflake’s federated authentication and single sign-on capability by integrating with many of the most popular SAML 2.0-compliant identity providers. In addition to Okta, Snowflake now supports ADFS/AD, Azure AD, Centrify, and OneLogin, to name just a few.
  • To advance Snowflake’s built-in auditing, we introduced new Information Schema table functions (LOGIN_HISTORY and LOGIN_HISTORY_BY_USER) that users can query to retrieve the short-term history of all successful and failed login requests in the previous 7 days. If required, users can maintain a long-term history by copying the output from these functions into regular SQL tables.
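A minimal sketch of both steps, assuming a hypothetical user and table name:

CREATE TABLE login_audit AS  // persist the short-term history into a regular table
SELECT event_timestamp, user_name, is_success, error_message
FROM TABLE(information_schema.login_history_by_user(user_name => 'ANALYST1', result_limit => 1000));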

Improving our ecosystem

Enabling developers and builders to create applications with their favorite tools and languages remains a high priority for us.

  • With respect to enterprise-class ETL, we successfully collaborated with Talend in building a native Snowflake connector based on Talend’s new and modern connector SDK. The connector, currently in preview, has already been deployed by a number of joint customers with great initial feedback on performance and ease-of-use.
  • To tighten the integration of our Snowflake service with platforms suited for machine learning and advanced data transformations, we released a new version of our Snowflake Connector for Spark, drastically improving performance by pushing more query operations, including JOINs and various aggregation functions, down to Snowflake. Our internal 10 TB TPC-DS performance benchmark tests demonstrate that running TPC-DS queries using this new v2 Spark connector is up to 70% faster compared to executing SQL in Spark with Parquet or CSV (see this Blog post for details).
  • We continue to improve our drivers for our developer community. Listening to feedback from our large Python developer community, we worked on a new version of Snowflake’s native Python client driver, resulting in up to 40% performance improvements when fetching result sets from Snowflake. And, after we open-sourced our JDBC driver last quarter, we have now made the entire source code available on our official GitHub repository.
  • And, last, but not least, to enhance our parallel data loading via the COPY command, ETL developers can now dynamically add file metadata information, such as the actual file name and row number, which might not be part of the initial payload.
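Here is a rough sketch of such a COPY transformation (the table, stage, and column names are hypothetical):

COPY INTO my_table (src_file, src_row, payload)
FROM (SELECT metadata$filename, metadata$file_row_number, t.$1
      FROM @my_stage t);  // file name and row number are added alongside the data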

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the running service.

  • One major addition to our service is Query Profile, now generally available and fully integrated into Snowflake’s web interface. Query Profile is a graphical tool you can use to detect performance bottlenecks and areas for improving query performance.
  • Various UI enhancements were implemented: Snowflake’s History page now supports additional filtering by the actual SQL text and query identifier. We also added UI support for creating a Parquet file format in preparation for loading Parquet data into variant-type table columns in Snowflake.
  • A new Information Schema table function (TABLE_STORAGE_METRICS) exposes information about the data storage for individual tables. In particular, a user can now better understand how tables are impacted by Continuous Data Protection, particularly Time Travel and Fail-safe retention periods, as well as which tables contain cloned data.
  • We also recently introduced smarter virtual warehouse billing through Warehouse Billing Continuation (see this Blog post for details). If a warehouse is suspended and resumed within 60 minutes of the last charge, we do not charge again for the servers in the warehouse. WBC eliminates additional credit charges, and we hope it will reduce the need for our customers to strictly monitor and control when warehouses are suspended and resized.

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • As part of rolling out the new EU (Frankfurt) region, we increased automation of our internal deployment procedures to (a) further improve engineering efficiency while (b) laying the foundation for rapidly adding new regions based on customer feedback.
  • We further streamlined and strengthened our various internal testing and pre-release activities, allowing us to ship new features to our customers on a weekly basis – all in a fully transparent fashion with no downtime or impact to users.

Conclusion and Acknowledgements

This summary list of features delivered in Q1 highlights the high velocity and broad range of features the Snowflake Engineering Team has successfully delivered in a short period of time. We are committed to putting our customers first and maintaining this steady pace of shipping enterprise-ready features each quarter. Stay tuned for another feature-rich Q2.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Virtual Warehouse Billing Made Smarter

One of our most important commitments to our users is reducing or eliminating the management and tuning tasks imposed by other systems. We are always on the lookout for new and innovative ways of making our service easier to use.

Most recently, we looked at how our users manage their Snowflake virtual warehouses and realized we could be smarter about how we charge for the compute resources used in these warehouses.

First, Some Background Information…

To begin, here are some details about what happens when a virtual warehouse (or simply “warehouse”) is resumed and suspended in Snowflake:

  1. Each time the warehouse is resumed:
    • Snowflake provisions servers from AWS EC2 for a minimum of one hour. This is based on how AWS charges for EC2 resources.
    • The number of Snowflake credits charged depends on the number of servers provisioned, which is determined by the size of the warehouse (XS, S, M, L, XL, etc.) and the number of clusters in the warehouse (if multi-cluster warehouses are enabled).
  2. When the warehouse is suspended, the servers are shut down and credits are no longer charged.
  3. When the warehouse is resumed again, the servers are re-provisioned and the charges start over, regardless of how much time has passed since the warehouse was last charged. As a result, if the same warehouse is resumed multiple times within the same hour, credits are charged each time. As stated earlier, this follows the AWS EC2 model.

What We Changed

We noticed that some of our users were spending time and effort managing their virtual warehouses to ensure that credits were not consumed unnecessarily. We decided to eliminate this extra work by introducing Warehouse Billing Continuation (WBC).

With WBC, we now track the last time each individual server in a warehouse was charged and, if the warehouse is suspended and resumed within 60 minutes of the last charge, we don’t charge again for the server. The charge is continued from the last time as if the warehouse had never been suspended. This eliminates any extra charges, thereby reducing the need for strictly monitoring and controlling when warehouses are suspended and resumed.

How Does WBC Work?

The simple answer is it just works, regardless of how often you resume and suspend your virtual warehouses. If this answer satisfies your curiosity, you can skip now to the end of this post. Otherwise, read on for the gory details…

The best way to explain WBC is with examples. Say you have a Small warehouse (2 servers) that’s been suspended for longer than an hour. Now, imagine this warehouse goes through the following status changes:

Resumed | Suspended | Credits charged before WBC | Credits charged with WBC
09:15   | 09:25     | 2                          | 2
09:40   | 09:50     | 2                          | –
10:05   | 10:10     | 2                          | –
10:30   | 10:50     | 2                          | 2
11:20   | 11:40     | 2                          | 2 (at 11:30)

Before WBC, every time the warehouse was resumed, we charged 2 credits (1 credit per server in the warehouse). For example, in the scenario described above, between 09:15 and 11:40 the warehouse incurred 5 charges for a total of 10 credits.

With WBC, the behavior is different. The warehouse only incurs 3 charges for a total of 6 credits. The following diagram provides a more detailed explanation of what actually happens:

Figure 1: Example of charges for resuming and suspending a Small virtual warehouse
  1. At 09:15, 2 credits are charged for the servers in the warehouse because the warehouse has been suspended for longer than an hour and there’s no previous charge to continue.
  2. At 09:40 and 10:05, we recognize that the warehouse was charged within the last hour, so no additional charges are incurred. In other words, the current charge doesn’t expire until 60 minutes after it was first incurred, so the earliest the warehouse will be charged again is 10:15, regardless of how many times the warehouse is suspended and resumed during this period.
  3. At 10:15, the warehouse isn’t charged because it’s not running at that time.
  4. At 10:30, the 2 servers are charged again because more than 60 minutes have elapsed since the initial charge. More importantly, this new time is now used to track all subsequent charges, i.e. the earliest time for the next charge is 11:30.
  5. At 11:20, no charge is incurred because 60 minutes haven’t elapsed since the last charge.
  6. At 11:30, the warehouse is charged again because 60 minutes have now elapsed since the last charge and the warehouse is running.

This example covers a relatively simple case for a Small warehouse. Each successively larger warehouse has more servers, so the scenarios are slightly more involved, especially if the warehouse is resized (or multi-cluster warehouses are being used); however, the mechanics and calculations are all the same. The most important thing to remember is that every server in a warehouse is charged independently.

WBC Example with Virtual Warehouse Resizing

Consider the same example from earlier, but starting with a Medium warehouse (4 servers) and the following resize events:

  • Warehouse resized down to Small (2 servers) at 09:30, 10:00, and 11:00.
  • Warehouse resized back to Medium (4 servers) at 09:45, 10:45, and 11:15.
Resumed | Credits charged for resume | Resized        | Credits charged for resize
09:15   | 4                          | 09:30 (Small)  | –
09:40   | –                          | 09:45 (Medium) | –
        |                            | 10:00 (Small)  | –
10:05   | –                          |                |
10:30   | 2                          | 10:45 (Medium) | 2
        |                            | 11:00 (Small)  | –
        |                            | 11:15 (Medium) | –
11:20   | 2 (at 11:30)               |                | 2 (at 11:45)

The total credits charged would be (4+2+2) + (2+2) = 12. The following diagram shows how the charges are incurred (remember that I warned you earlier about the gory details):

Figure 2: Example of charges for resuming and suspending a Medium virtual warehouse with resizing
  1. At 09:15, the initial charge for the Medium warehouse is 4 credits.
  2. At 10:30, the new charge is 2 credits (reflecting the Small size at the time the warehouse is resumed). These 2 servers will be charged next in 60 minutes at 11:30.
  3. At 10:45, a new, separate charge for 2 additional servers is incurred (due to the resize to Medium). These 2 additional servers will be charged next at 11:45, independently of the other 2 servers.
  4. All the other resizing events increase or decrease the number of servers running at that time, but incur no additional charges.

Note that this diagram also illustrates how we remove servers from a warehouse: we always remove the most recently-provisioned servers first (i.e. LIFO) and add them back in the same order. This matters because each server is charged according to its individual start time. It also means there is no benefit to reducing the size of a warehouse within an hour that has already been charged: the servers have already been billed for that hour.
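For reference, resizing itself is a single statement; a minimal sketch (the warehouse name is hypothetical):

ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'SMALL';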

So, What Next?

There’s no need to make any changes to your virtual warehouses. We’ve enabled Warehouse Billing Continuation by default for all accounts and warehouses, including any existing warehouses. In fact, we implemented WBC near the end of March so you probably have already noticed reduced charges for some of your warehouses, particularly the ones that you resume and suspend frequently.

However, to take full advantage of this change, you might want to revisit your warehouse procedures and settings. For example, you can now set auto-suspend for a warehouse to a shorter value (e.g. 5 or 10 minutes) without worrying about being charged additional credits each time the warehouse resumes within the hour.
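For example, assuming a hypothetical warehouse name, a 5-minute auto-suspend (the value is specified in seconds) looks like this:

ALTER WAREHOUSE etl_wh SET AUTO_SUSPEND = 300;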

Interested in learning more? Check out the Snowflake documentation for more in-depth information about warehouses:

Also, keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Tech Tip: Quick Start for Getting Your First Data into Snowflake

If you are like me and fairly new to this whole cloud thing, then one of your main questions is likely:

“How do I get data from my desktop (or server) into Snowflake so I can query it?”

Which, in reality, translates to:

“How do I load data in the cloud?”

Well, there are several methods, including a variety of integration and ETL tools. In this tech tip I will show you the simplest way to accomplish this via the Snowflake Web UI.

Our brilliant engineers built a nice little wizard to make it simple for cloud novices (like me) to easily upload a data file from your desktop and load it into a table in Snowflake.

Getting Started

First off, you need a table in the database that matches the structure of the data file you are going to load (you modeled that table, right?). You can build that table by running a CREATE TABLE DDL statement in a worksheet once you have logged in.

So something like this:

CREATE TABLE Region_t1
  (
    Name               VARCHAR (200) NOT NULL,
    Code               VARCHAR (2) NOT NULL,
    Abbreviation       VARCHAR (2) NOT NULL,
    Sort_Order         NUMBER (2) NOT NULL,
    External_Reference VARCHAR (1000) NOT NULL,
    Record_Source      VARCHAR (100) NOT NULL,
    Record_Owner       NUMBER (1) NOT NULL,
    Comments           VARCHAR (2000)
  );

ALTER TABLE Region_t1 ADD CONSTRAINT PK_Region_t1 PRIMARY KEY ( Code );

ALTER TABLE Region_t1 ADD CONSTRAINT UK_Region_t1 UNIQUE ( Name );

Invoke the Load Wizard

To begin the load process, navigate to the Database tab in the UI, drill into your database and then click on the table name. Once you do that, the Load Table widget is displayed (fourth from the left on the toolbar).  Click on the widget, and the Load Data wizard opens and asks for a warehouse. Select a virtual warehouse to use from the drop-down list. (Note: The list displays any warehouse you have the USAGE privilege on.)


Select the File to Load

In the next step, you select the source file to load. You have options to get a file from your file system, or from an AWS S3 bucket if you have already moved some files to the cloud.


Note that you can also load multiple files! The select dialog box supports multi-select. But keep in mind that the files must all be the same type with the same structure because they’re all going into the same table.

In the next step, you pick a file format from the drop-down list. In this case, I have selected a format that specifies a CSV type file with comma delimiters.


Creating a File Format (if necessary)

If you do not have an appropriate file format already, simply click the plus sign (+) next to the drop-down arrow in the UI to define a new one. For a standard CSV file that uses commas as the delimiter, you really just need to give the file format a name and click Finish. However, if the first row of the file includes column headings, be sure to set Header lines to skip to 1.
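If you prefer SQL, a roughly equivalent file format definition is sketched below (the format name is hypothetical):

CREATE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;  // skip the column-heading row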


Setting Load Options

The next step of the wizard allows you to choose how to handle load errors. There are four options. I chose to have the load continue by skipping any problem records (which I will deal with after the load).


At this point, you are set to go. If you want to see the code for the load (so you can learn something new, right?), just click the Show SQL link before you click Load.


Running the Load

Once you click Load, the process starts and does the following:

  1. Uses PUT to copy the file to a Snowflake internal staging area, automatically encrypting the data during the process.
  2. Uses COPY to load the data into the target table and deletes the file from the staging area upon completion.
  3. Reports the results.
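For a rough idea of what Show SQL displays, the generated statements look something like this sketch (the local file path and format name are hypothetical):

PUT file:///tmp/region.csv @%Region_t1;  // upload to the table's internal stage
COPY INTO Region_t1
  FROM @%Region_t1
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  ON_ERROR = 'CONTINUE';  // skip problem records, as chosen in the wizard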


And voila, you have now loaded data from your desktop into the cloud-native Snowflake Elastic Data Warehouse. Congratulations.

For a much deeper dive into loading data into Snowflake check out the Data Loading topic in our user guide.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more tips and tricks, along with updates on all the action and activities here at Snowflake Computing.

Power BI Enables Connectivity to Snowflake

Guest post by Miguel Llopis, Senior Program Manager for Power BI at Microsoft. This post can also be found on the Power BI Blog.

As part of the Power BI Desktop August Update we are very excited to announce a preview of a new data connector for Snowflake.

Snowflake is a cloud-native elastic data warehouse service that makes it easy to bring together data from disparate sources and make it available to all users and systems that need to analyze it. With the new Power BI Desktop connector, users can import data from Snowflake, combine it with data from other sources, and build rich data visuals as part of their Power BI reports.

In order to use this new connector, users need to first enable the Preview feature option within the Options dialog (File -> Options and Settings -> Options).


As a prerequisite, the Snowflake connector requires users to install the Snowflake ODBC driver on their machines, matching the architecture of the Power BI Desktop installation (i.e. 32-bit vs. 64-bit). The Snowflake ODBC driver can be downloaded from this location: http://go.microsoft.com/fwlink/?LinkID=823762

After the feature has been enabled, users can find the Snowflake connector within the “Get Data” dialog, under the Database category.


In the connector dialog, users can specify the Snowflake Server and Warehouse to connect to.


Once the connection details have been specified, Power BI Desktop will attempt to connect to Snowflake, which will require users to specify credentials. In Power BI Desktop, credentials are only asked from the user on first-time connection, after which they are stored as part of the user profile and can be managed from the “Data Source Settings” dialog (under File -> Options and Settings). Subsequent connections to Snowflake will reuse the stored credentials for the current user.


After the connection has been successfully established and the user has been authenticated, the Navigator dialog will allow users to browse the list of databases available within the specified warehouse. In this dialog, users can preview and select one or multiple tables to use in their Power BI report.

Once one or more tables have been selected in the Navigator dialog, users can either:

  • Load: This will load all data rows from each of the tables selected into the local Power BI Desktop data model, from where users can build their reports.
  • Edit: This option will bring users into the Query Editor dialog, from where they can perform data transformations and filters on a sample subset of the data, without the need to download the entire data set. Operations applied within the Query Editor dialog will run on the Snowflake backend, as long as the operations are supported in the source (i.e. filters, etc.)

We’re very glad to enable connectivity to Snowflake from Power BI and are looking forward to customers’ feedback. We encourage you to give it a try and share with us any feedback or issues that you encounter via the “Send a Frown” feature in Power BI Desktop. In subsequent updates, we will also enable DirectQuery capabilities for this connector, as well as support for refreshing datasets based on Snowflake data in the Power BI service via the Power BI Data Gateway.

Stay tuned for improvements to the Snowflake connector and for many other connectors in upcoming releases. If yours isn’t supported yet, just let us know.

Miguel Llopis

Saving Time & Space: Simplifying DevOps with Fast Cloning

As I have written before, I think our fast cloning feature is pretty cool. Not only is it very fast (as in it took me barely a minute to clone a 2TB database!), it allows you to have multiple copies of your data without the additional cost of storage usually associated with replicating data.  

That is all well and good, but out in the real world, folks want to see lots of examples on how to make good use of this feature. A common question from customers is how to utilize the Snowflake fast clone feature to support DevOps. Specifically, can it be used to support a promotion process from development to integration/testing to production?

The answer is yes! Following are a few examples of how that might work.

Getting Started: Cloning Your Production Database

Starting out, let’s assume you have one database in Snowflake (i.e. production by default) and it has a schema named PUBLIC. In this database you have two tables (A and B) and they are being populated daily by a load process.


Now you find you need to make changes and additions and realize you should not do those in production so now you need a development database. How do you do that in Snowflake?

You could do it the old-fashioned way and just create a new database, then copy the existing tables and data over from production (using a “CREATE TABLE AS SELECT …”, or CTAS, operation). If the tables are large, that would take some time and cost extra, since it also requires more storage. Using Snowflake’s fast clone feature, however, you can do it far faster and without incurring any cost for extra space.

Here is how to do it:

CREATE DATABASE Dev CLONE Prod;

That’s it, one command! It creates the new database, schema, and tables and gives you a logical copy of the production data to work with almost instantly. 

While you are at it, you can also  create the integration testing database (INT database) too:

CREATE DATABASE Int CLONE Prod;

Now you are set to continue your data warehouse development efforts with a proper environment in place.

Scenario 2: Promoting New Development

Based on what we did above, your current Development database has tables A and B, cloned from Prod. As you begin your development iteration, you can add more tables. In this example, we added two new tables C and D, which are under development.

Initial State:

Prod database: Tables A, B

Int database: Tables A, B (previously cloned from Prod)

Dev database:

  • Tables A, B (previously cloned from Prod)
  • Tables C, D (created as part of new development)

Mini Release:

Now suppose you want to do a mini-release, wherein you:

  • Promote only Table C to Int for testing
  • Leave Table D as is in Dev (work-in-progress, not ready to deploy)

To do this mini-release you need to execute the following steps in a script, using the command-line client, or in the Snowflake worksheet:

  1. Deploy Table C to Int by simple clone:
        • If Table C already contains production data (copy the data):
          USE Int.public;
          CREATE TABLE C CLONE Dev.public.C;  // Yes, that simple!
          • Load incremental new data into Table C
          • Validate results
        • If Table C has only test data and you do not want to copy the data:
          USE Int.public;
          CREATE TABLE C LIKE Dev.public.C;  // copies the table structure only
          • Load new data into Table C
          • Validate results
  2. If the validation of Table C succeeds in Int, then deploy it to Prod:
        USE Prod.public;
        CREATE TABLE C LIKE Int.public.C;
        • Load production data into Table C

Now you have the new Table C live in your Prod database.


But you are not quite done yet. You need to complete the cycle.

Refresh Dev:

We now want to refresh Dev to be in sync with Prod. To do that we need to create a production clone of Table C and update Tables A and B with refreshed data. This will also serve to reduce your storage by replacing the original version of Table C in Dev with a cloned table. Likewise, you should sync Int as well.

If there are just a few tables in Prod (as in our example) then:

USE Dev.public;
CREATE OR REPLACE TABLE A CLONE Prod.public.A;
CREATE OR REPLACE TABLE B CLONE Prod.public.B;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;
CREATE OR REPLACE DATABASE Int CLONE Prod;  // replaces the entire Int database with a fresh clone

Note: CREATE OR REPLACE TABLE will drop and rebuild the table. This will release any space used in Dev (or Int) by those tables subsequent to the last clone operation.

Hint: If there are many tables in Prod, and only a few new tables in Dev, then this approach will be more efficient:

CREATE DATABASE Dev2 CLONE Prod;
USE Dev2.public;
CREATE TABLE D CLONE Dev.public.D; // preserving the WIP table in Dev
DROP DATABASE Dev;
ALTER DATABASE Dev2 RENAME TO Dev;

This process removes all previous clones and allows you to preserve Table D as it is currently in development. When you are done with development on Table D, follow the same process as we did for Table C to promote Table D to Int for testing, then up to Prod for go-live. After the promotion, be sure to replace Table D in Dev and Int with a fresh clone from Prod in order to release the space.

Scenario 3: Promoting a Change to an Existing Table

Now let’s suppose you have three tables already in production and you need to alter one of those tables (i.e., add a new column) and then promote the revised table to production. This is an important scenario, especially when rows are constantly being added to the production table. In this case you do not want to reload the modified table, but rather alter it in place and then continue with incremental loads.

Initial State:

Prod database: Tables A, B, C

Int database: Tables A, B, C (cloned from Prod)

Dev database: Tables A, B, C (cloned from Prod)


Mini Release:

In this scenario, you want to do a mini-release, wherein you will:

  • Alter Table C by adding one new column
  • Adjust the load process to include that column
  • Promote Table C to Int for testing

To do this mini-release you need to execute the following steps in a script or by hand in the Snowflake worksheet:  

    1. Change the table and the code (in Dev), then refresh the table
USE Dev.public;
ALTER TABLE C ADD COLUMN NEW_COL1 VARCHAR(200);
      • Modify load to include NEW_COL1
      • Write a one-time update script to update existing rows to populate NEW_COL1
        • Pull data from Production source if possible
        • Review and validate results
      • Execute modified load routine to load incremental new rows
        • Review and validate results

2. Deploy Table C to Int by simple clone:

USE Int.public;
CREATE OR REPLACE TABLE C CLONE Dev.public.C;
      • Promote revised load process to Int environment
      • Promote Update script to Int environment
      • Execute Update script
        • Review and validate results
      • Execute load process to load incremental new data into Table C
        • Review and validate results

3. If promotion to Int succeeds, then deploy to Prod:

USE Prod.public;
ALTER TABLE C ADD COLUMN NEW_COL1 VARCHAR(200);
      • Promote revised load process to Prod environment
      • Promote Update script to Prod environment
      • Execute Update script
        • Review and validate results
      • Execute load process to load incremental new data into Table C
        • Review and validate results
      • Start writing queries and reports that use the new column

Refresh Dev:

Once the promotion to Production is complete, you can refresh Dev and Int again using the same process as mentioned above.

USE Dev.public;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;
USE Int.public;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;

Now you have a production clone of Table C in Dev & Int with the new column added and updated with fresh production data.

Want to see cloning in action? Check out this video:

Conclusion

These are just a few examples of how you might use Snowflake’s fast cloning to support an agile, lightweight DevOps process that does not require a lot of DBA management and does not take up extra space! Hopefully this gives you some ideas for how you can use fast cloning in your own development environment.

Thanks to Saqib Mustafa (@cloudsommelier) for his help on this post!

As always, keep an eye on this blog site, our Snowflake-related Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

 

Automatic Query Optimization. No Tuning!

Hopefully you had a chance to read all our previous top 10 posts. We now conclude the series with a final deep dive into my #1 Cool Feature from Snowflake:

#1 Automatic Query Optimization. No Tuning!

As a long-time data architect, and not a DBA, I consider this my favorite part of Snowflake. I do not have to worry about query performance at all. It is all handled “auto-magically” via a dynamic optimization engine in our cloud services layer. I just model, load, and query the data.

Cloud Services

Snowflake is an MPP, columnar store, designed from the ground up for high-speed analytic queries. Our cloud services layer does all the query planning and optimization based on data profiles that are collected automatically as the data is loaded. It automatically collects and maintains the required statistics to determine how to most effectively distribute the data and queries across the available compute nodes.

So, no indexes, no need to figure out partitions and partition keys, no need to pre-shard any data for even distribution, and no need to remember to update statistics.

Another key point to remember is that not only does this dynamic optimization auto-tune queries that work against traditional structured data, it also works against semi-structured data (like JSON, Avro, or XML). When semi-structured data is loaded into our VARIANT data type, as mentioned in an earlier blog post, the data is profiled and columnarized behind the scenes. With this approach, querying the data (using our JSON or XML extensions) is optimized just like for structured relational data, without having to spend the time and effort to split it out into relational tables and columns. Because of this unique approach to handling semi-structured data, Snowflake not only supports joining structured and semi-structured data with SQL, it optimizes those joins too.
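For instance, a query against a (hypothetical) table with a VARIANT column v can mix path notation, casts, and standard aggregation, and the optimizer treats it like any relational query:

SELECT v:customer.name::string AS customer_name,
       SUM(v:sale.amount::number) AS total_amount
FROM json_events  // hypothetical table holding raw JSON in column v
GROUP BY 1;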

Let go Luke – Trust the Force!

As you can imagine, there are some folks out there who do not like change and want us to add more knobs so they can manually tune their queries.

Why? Well, they say that they know their workload better than Snowflake and therefore can optimize it better than Snowflake software. Others say that they’ve already done all the work to optimize another platform for their workload and don’t want to “waste” all that time and effort.

Since I come from a traditional data warehouse background, and know many expert data warehouse architects who are world class at tuning specific databases to do specific types of queries,  I definitely understand and can sympathize with this point of view.

But the world of data has changed! The way we build, deploy, and use data (of all kinds) is continually evolving, and we need to evolve our approach to be successful.

Because of that, there are a few things to consider with regard to tuning:

  1. Your workloads will change.
  2. The sheer number of different workloads is going to grow faster than your bandwidth to manually tune (so someone has to prioritize who wins).
  3. What happens if your expert query tuner decides to sell real estate instead?

Relying on yourself, and other expert DBA tuning staff, is ultimately going to result in a performance tuning bottleneck, which will in turn lead to increased contention for resources in the data warehouse. No matter how good you are, in the end you will not be able to keep up. So why not use a data warehouse service to handle that work for you so you can focus on getting the business people access to all this data in a form they can use to make better business decisions? In short, people are the hardest thing to scale in any data analytics solution. Thus any solution that allows you to scale without bottlenecks on people is a winner.

And if the query is still not fast enough, with Snowflake you always have the option to resize your virtual warehouse to add additional compute resources, as we described in the post about elasticity. Without the benefit of a dynamic optimization engine, you could not change the compute configuration on the fly and see immediate improvements. In many other environments, if you changed the number of compute nodes you would have to redistribute the data, or do some other manual tuning, in order to get the improved performance. That would not be elastic at all.

For all these reasons, our patent-pending dynamic optimization is, to me, one of the most important features when it comes to making the Snowflake Elastic Data Warehouse a true SaaS offering and enabling agile, high-speed analytics at any scale. As part of our vision, we developed an offering that helps customers concentrate on understanding and deriving value from their data rather than spending time refining, managing, and optimizing the environment. Business users and data scientists can easily analyze their data by connecting natively to Snowflake through various tools, and not have to worry about tuning and query optimization.

As I said before, as an architect and data warehouse designer, I really like this model!

End of this trail…more to come

Well, that is the last of my Top 10 list of cool features in Snowflake.  I want to thank Saqib Mustafa for all his help collaborating on this series of posts.

There are of course many other cool features in Snowflake, and more to come in the future. So keep your eyes on this blog for future posts with more details on the Snowflake Elastic Data Warehouse, the cloud-native data warehouse as a service.

As always, keep an eye on our Snowflake-related Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for continuing updates on all the action and activities here at Snowflake Computing.

 

Support for Multiple Workloads

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#3 Support for Multiple Workloads

With our unique multi-cluster, shared data architecture, Snowflake can easily support multiple, disparate workloads. Resource contention is a common issue in traditional data warehouses, so it makes sense to keep disparate workloads truly separate, rather than just claiming support for “mixed” workloads.

In legacy data warehouse environments, we often found ourselves constrained by what we could run and when we could run it for fear of resource contention, especially with the CPUs. In many cases it was impossible to refresh the data during the day because the highly parallelized, batch ETL process, while tuned for maximum throughput, usually hogged all the CPUs while it ran. That meant virtually no reporting queries could get resources so they would just hang. Likewise a complex report with calculations and massive aggregations would cause normally fast, simple reports to languish. And there was no way you could let any business users in the system to do exploratory queries as those might also cause everything else to hang.

Because of the separation of compute and storage native to Snowflake’s architecture, as previously highlighted, you can easily spin up a set of compute nodes (we call them Virtual Warehouses) to run your ELT processes, and another set to support your BI report users, and a third set to support data scientists and data miners. In fact you can spin up (or down!) as many virtual warehouses as you need to execute all the workloads you have.
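As a sketch (the names and sizes are hypothetical), each workload simply gets its own warehouse against the very same data:

CREATE WAREHOUSE elt_wh WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE bi_wh WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE science_wh WITH WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;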


So not only does each virtual warehouse share the same data (ensuring consistent results), they do so without being affected by operations launched in other virtual warehouses, because each uses completely separate resources. Hence, there is no more resource contention!

With the Snowflake Elastic Data Warehouse, there is no more need to run the data loads at night just to avoid slowing down the reports. No more worry that one runaway query will impact the loads or other users. You can now run loads (e.g., real time, micro-batch, etc) at any time and thus provide your analysts and users current data on a more frequent basis.

And even better – no special skills or secret configuration settings are required to make this work. It is the way Snowflake’s Data Warehouse as a Service (DWaaS) is built by design.

Nice!

For a quick look at how this works, check out this video.

Thanks to Saqib Mustafa for his help and suggestions on this post.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

 

ANSI SQL with Analytic Functions

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#5 ANSI compliant SQL with Analytic Functions

At Snowflake, we believe that it should be easy to access, query, and derive insights from your data. To support that, we provide our users with the ability to query all their data using ANSI-compliant SQL. (Hard to call yourself a relational database otherwise, right?)

However, Snowflake goes beyond basic SQL, delivering sophisticated analytic and windowing functions as part of our data warehouse service. Functions like:

  • CUME_DIST
  • DENSE_RANK
  • FIRST_VALUE
  • LAG
  • LAST_VALUE
  • LEAD
  • NTILE
  • PERCENT_RANK
  • RANK
  • ROW_NUMBER
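For example, the following query uses RANK over a window partitioned by nation to find the top three customers by total order value in each nation (using the TPC-H sample tables):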

select nation, customer, total
from (select n.n_name as nation,
             c.c_name as customer,
             sum(o.o_totalprice) as total,
             rank() over (partition by n.n_name
                          order by sum(o.o_totalprice) desc) as customer_rank
      from orders o, customer c, nation n
      where o.o_custkey = c.c_custkey
        and c.c_nationkey = n.n_nationkey
      group by 1, 2)
where customer_rank <= 3
order by 1, customer_rank;

As you can see in the example, we support not only analytic windowing functions but also all the other features you would expect in SQL, including (but not limited to) general aggregation functions (e.g. SUM), nested virtual tables, subqueries, ORDER BY, and GROUP BY.

In addition to general aggregation functions, we also have:

  • Bitwise aggregation functions
  • Linear regressions functions and
  • Cardinality estimation functions (i.e., HyperLogLog)

So, if your existing queries are written with standard SQL, they will run in Snowflake. And, as we noted in the previous blog on JSON, you can apply all these functions to your semi-structured data natively using Snowflake.

Another reason to love the Snowflake Elastic Data Warehouse.

As always, keep an eye on this blog site (or better – sign up for the RSS feed), and our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.