Data is Only Transformative with Transformative Technology

At the recent AWS re:Invent show in Las Vegas, theCUBE host Lisa Martin sat down with Bob Muglia, CEO and President of Snowflake. Bob shared his thoughts on Snowpipe, the latest addition to Snowflake’s cloud-built data warehouse, while looking back at Snowflake’s origins and ahead to its future enabling the data-driven enterprise.

What is Snowpipe, and how do customers get started with it?

Muglia: Snowpipe is a way of ingesting data into Snowflake in a streaming, continuous way. You simply drop new data that’s coming into S3 and we ingest it for you automatically. Snowpipe makes it simple to bring the data into your data warehouse on a continuous basis, ensuring that you’re always up-to-date and that your analysts are getting the latest insights and the latest data.

In the five years since you launched, how has the opportunity around cloud data warehousing changed? How has Snowflake evolved to become a leader in this space?

Muglia: If you go back five years, this was a timeframe where NoSQL was all the rage. Everybody was talking about how SQL was passé and something you’re not going to see in the future. Our founders had a different view. They had been working on true relational databases for almost 20 years, and they recognized the power of SQL and relational database technology. But they also saw that customers were experiencing significant limitations with existing technology. They saw in the cloud, and in what Amazon had done, the ability to build an all new database that takes advantage of the full elasticity and power of the cloud to deliver whatever analytics the business requires. However much data you want, however many queries you want to run simultaneously, Snowflake takes what you love about a relational database and allows you to operate in a very different way. Our founders had that vision five years ago and successfully executed on it. The product has worked beyond the dreams of our customers, and that response from our customers is what we get so excited about.

How did you identify what data should even be streamed to Snowpipe?

Muglia: As an example, in entertainment we’re experiencing a data explosion. You have streaming video data, subscription data, billing data, social media data and on and on. None of this is arriving in any sort of regular format. It’s coming as semi-structured data, like JSON or XML. Up until Snowflake came onto the scene with a truly cloud-based solution for data warehousing, everyone was struggling to wrangle all these data sets. Snowpipe lets you bring in multiple data sets, merge them in real-time and get the analytics back to your business in an agile way that’s never been seen before.

How does your partnership with AWS extend Snowflake’s capabilities?

Muglia: People don’t want their data scattered all over the place. With the cloud, with what Amazon’s done and with a product like Snowflake, you can bring all of your data together. That can change the culture of a company and the way people work. All of a sudden, data is not power. Data is available to everyone, and it’s democratized so every person can work with that data and help to bring the business forward. It can really change the dynamics around the way people work.

Tell us a little bit about Snowflake’s collaboration with its customers. How are they helping to influence your future?

Muglia: As a company, we run Snowflake on Snowflake. All of our data is in Snowflake, all of our sales data, our financial data, our marketing data, our product support data and our engineering data. Every time a user runs a query, that query is logged in Snowflake and the intrinsics about it are logged. When you have a tool with the power of Snowflake, you can effectively answer any business question in just a matter of minutes. And that’s transformative to the way people work. And to me, that’s what it means to build a data-driven culture: The answers to business questions are inside what customers are doing and are encapsulated in the data.

Try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

New Snowflake Features Released in H2’17

Looking back at the first half of this year, the main theme at Snowflake centered on scaling – the team, the service and our customer base. We introduced and improved the self-service onboarding experience, grew our ecosystem by collaborating with major technology partners and SIs, and invested in making our service faster, more secure and easier to use.

In the summer, we announced the general availability of Snowflake’s Instant Data Sharing. As we continue to grow, this will provide the foundation for data-driven organizations of all sizes to instantly exchange data, create new insights and discover new revenue streams.

In the second half of this year, we started a journey to evolve the service in substantial ways and this will continue throughout 2018. 

Instant elasticity, per-second billing and 4XL virtual warehouses
Imagine you have instant access to performance and scalability – and truly pay only for what you use.

Many cloud DW solutions make the claim in the headline above. However, a closer look reveals significant differences in the degree of scalability and concurrency supported, the overall user experience and the types of workloads a user can run at scale. One of our key investments in this area resulted in our instant elasticity feature:

  • The introduction of per-second pricing is one crucial building block. Our customers are now billed by the second, rather than for a full hour, when running workloads in Snowflake.
  • We also improved our provisioning algorithms, using past usage data to better provision compute capacity for future demand.
  • We added support for our new 4XL virtual warehouses, doubling the maximum compute configuration and allowing our customers to tackle their most challenging workloads and performance SLAs.

Based on customer feedback, the decrease in wait time from minutes to seconds is a huge benefit. You can now ensure the right performance for your executive dashboards and not worry anymore about a full-hour charge. You can also scale your load instantly and have fast access to the most current data.
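
To make this concrete, here is a minimal sketch of the DDL involved (the warehouse name is made up; the size keywords follow Snowflake’s documented values):

CREATE WAREHOUSE ANALYTICS_WH
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60      -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE;    -- resume automatically when the next query arrives

-- Scale up for a demanding workload, then back down when it completes.
-- With per-second billing, you pay only for the seconds each size actually runs.
ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'X4LARGE';
ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XSMALL';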

Improving performance and SQL programmability
Our ongoing mission is to build the fastest data warehouse with the SQL you love.

Throughout the last quarter, we released a wide range of SQL capabilities that address database migration challenges and increase both ANSI SQL compatibility and overall SQL programmability (a couple of these are sketched after the list below):

  • Support for JavaScript Table UDFs (in public preview).
  • Support for double-dot notation for specifying database objects, easing migration from on-premises database systems.
  • Support for table literals.
  • INSERT OVERWRITE: allows rebuilding tables to ease migration from Hadoop systems.
  • LEAD/LAG function enhancement: support for ignoring NULL values.
  • SAMPLE/TABLESAMPLE – announced general availability.
  • New functions to estimate frequent values in near-linear time.
  • More flexibility in configuring the start-of-week and week-of-year policies.
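
Here is a small sketch of two of these in action (table and column names are invented for illustration):

-- INSERT OVERWRITE truncates the target table and replaces its contents with the query result.
INSERT OVERWRITE INTO daily_sales
  SELECT sale_date, SUM(amount)
  FROM raw_sales
  GROUP BY sale_date;

-- LAG ... IGNORE NULLS carries the most recent non-NULL reading forward.
SELECT device_id,
       reading_ts,
       COALESCE(reading,
                LAG(reading) IGNORE NULLS
                  OVER (PARTITION BY device_id ORDER BY reading_ts)) AS filled_reading
FROM sensor_readings;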

In addition, we have been working on substantial performance improvements that address ease-of-use and automation. Please stay tuned for major announcements in early 2018.

Staying ahead with enterprise-ready security and compliance
From day one, security has always been core to Snowflake’s design.

  • Programmatic SSO support – Now, a Snowflake user can leverage browser-assisted single-sign-on programmatically through our Snowflake ODBC, JDBC, and Python drivers and our command line tool, SnowSQL. In contrast to some of our competitors, you don’t need to write code to connect to your SAML 2.0-compliant IDP. (Programmatic SSO support is available in our Enterprise Edition).
  • PrivateLink support for Snowflake – We worked with AWS to integrate PrivateLink with Snowflake. With PrivateLink, Snowflake users can now connect to Snowflake by bypassing the public internet. No cumbersome proxies need to be set up between Snowflake and their network. Users have full control over egress traffic.

Improving our ecosystem and data loading
Enabling developers and builders to onboard new workloads and create applications with their favorite tools, drivers and languages remains a top priority.

Snowpipe
We recently announced Snowpipe, which represents an important milestone in many ways. First, users can now efficiently address continuous loading use cases for streaming data. Second, with Snowpipe we introduced the first service in addition to the core data warehousing service. Snowpipe is serverless, so there’s no need to manage a virtual warehouse for loading data into Snowflake. Finally, we continue to integrate deeply with the entire AWS stack: Snowpipe allows customers to take advantage of AWS S3 event notifications to automatically trigger Snowflake data loads into target tables. You can read more about Snowpipe here: Overview and First Steps.
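
For the curious, a pipe definition is a thin wrapper around a COPY statement; the following is a minimal sketch with made-up database, stage and table names:

CREATE PIPE mydb.public.load_events
  AUTO_INGEST = TRUE
  AS
  COPY INTO mydb.public.events
  FROM @mydb.public.events_stage
  FILE_FORMAT = (TYPE = 'JSON');

-- SHOW PIPES returns the notification_channel (an SQS ARN) to plug into your
-- S3 bucket's event notification configuration so new files trigger loads automatically.
SHOW PIPES;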

Broader Ecosystem
While we continue adding functionality and enriching Snowflake with additional services, we strongly believe in freedom of choice. Instead of preaching a unified platform and a one-solution-fits-all philosophy, our key objective is to make integrations easier and faster.

  • For enterprise-class ETL, data integration and replication:
    • Added PowerCenter 10.2 support for Snowflake.
    • Added support for the ETL use case with AWS Glue – build modern, performant ETL jobs via our Spark connector, pushing compute down into Snowflake.
  • For data analytics and data science:
    • Added a native Snowflake connector for Alteryx supporting in-DB components.
    • Upgraded our Snowflake dplyr library.
  • For business intelligence (BI):
    • Added a native Snowflake connector for Chartio (in preview).
  • For parallel data loading & unloading via the COPY command, developers can now (sketched below):
    • Load non-UTF-8 character-encoded data.
    • Unload to the Parquet file format.
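
As a rough sketch of those last two COPY capabilities (stage, table and file names are made up):

-- Load a Latin-1 encoded CSV file; the ENCODING file format option handles non-UTF-8 input.
COPY INTO customers
  FROM @my_stage/customers_latin1.csv
  FILE_FORMAT = (TYPE = 'CSV' ENCODING = 'ISO-8859-1');

-- Unload query results to Parquet files on a stage.
COPY INTO @my_stage/exports/
  FROM (SELECT * FROM customers)
  FILE_FORMAT = (TYPE = 'PARQUET');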

Increasing transparency and usability
These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the service itself.

  • Resource monitors can now be created through the Snowflake UI. As a user creates a resource monitor, they can also set up notifications. Notifications for “suspend” and “suspend immediately” actions are sent out when quota thresholds are reached for a particular resource monitor. In addition, we added support for defining up to five warnings for each resource monitor.
  • A new MONITOR USAGE privilege grants access to review billing and usage information via SQL or through the Snowflake UI. This provides non-account administrators with the ability to review information about data stored in databases and stages, as well as gain insight into warehouse usage.
  • For improved usability and convenience, we added a new option (COPY GRANTS) that allows users to preserve or copy grants when executing the CREATE OR REPLACE TABLE, CREATE OR REPLACE VIEW, CREATE TABLE LIKE, and CREATE TABLE CLONE variations (see the sketch after this list).
  • We also completely overhauled our worksheets and improved the overall experience. We will roll these changes out in stages in the coming weeks, with much more to come in 2018.
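
Here is a minimal sketch of how these pieces fit together in SQL (monitor, warehouse, role and table names are invented):

-- A resource monitor with a credit quota, a warning at 80% and a suspend action at 100%.
CREATE RESOURCE MONITOR monthly_quota
  WITH CREDIT_QUOTA = 1000
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_quota;

-- Allow a non-account-administrator role to review billing and usage information.
GRANT MONITOR USAGE ON ACCOUNT TO ROLE finance_analyst;

-- Recreate a table while preserving the grants defined on the original.
CREATE OR REPLACE TABLE customers COPY GRANTS AS
  SELECT * FROM customers_staging;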

Scaling and investing in service robustness
These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • Our global expansion continues with a new Snowflake region in Sydney, Australia.
  • We introduced a new product edition: Virtual Private Snowflake (VPS). With VPS, customers get a dedicated and managed instance of Snowflake within a separate dedicated AWS VPC, with all the performance, simplicity and concurrency inherent to Snowflake. This also includes completely dedicated metadata services isolated from the metadata activity of other Snowflake customers.
  • We continued our investments in system stability, improving and strengthening various components of our cloud services layer.

Conclusion

We are well on our way to reaching 1,000 customers by the end of 2017, with tens of compressed PBs stored in Snowflake and millions of data processing jobs successfully executed daily. We grew to over 300 Snowflakes and expanded our global presence. 2017 has truly been a defining year in Snowflake’s young history: self-service, instant Data Sharing, a broader ecosystem, instant elasticity, Snowpipe, VPS and PrivateLink for Snowflake. We look forward to an exciting 2018 as we continue to help our customers solve their data challenges at scale.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

5 Must-Have Features for High Concurrency Environments

Whether you’re new to cloud data warehousing or comparing multiple cloud data warehouse technologies, it’s critical to assess whether your data warehouse environment will need to support concurrent processing of any sort. Unless you’re a one-person database shop, in all likelihood the answer is yes. Concurrency, or concurrent data processing, is simultaneous access and/or manipulation of the same data. This is not to be confused with parallel processing, which is multiple operations happening at the same time, but not against the same data.

Concurrency can take the form of multiple users interactively exploring and manipulating a particular data set, concurrent applications querying and visualizing the same data set, transactional updates to the data, or even concurrent loading of new data or change data into the data set. And if you thought to yourself, “What? I can concurrently load new data, while supporting queries on the same data set?,” then keep reading.

If you require concurrency at any capacity, you will want these five cloud data warehouse features:

  • High relational performance across a broad range of data types: Of course, you want high performance–that’s a given. However, the more challenging aspect to plan for is fast queries on a broad range of data types, including semi-structured/JSON data. You don’t want your relational data warehouse to bog down and hold up corporate users because now it must handle non-traditional data from groups like the web team or product engineering. JSON is the new normal.  
  • Automatic and instant warehouse scaling: Let’s say your warehouse can handle a high number of concurrent accesses, but it bogs down during a period of high demand. Now what? This is a very important question to have answered as you compare technologies. Will you have to kick off users? Will you have to schedule after-hours jobs? Will you have to add nodes? Will this require you to redistribute data? If so, redistributing data takes time, and it’s a double whammy because the existing data in the warehouse has to be unloaded first. This is a huge disruption to your users. Instead, you want the ability to load balance across new virtual warehouses (compute engines) and have your cloud data warehouse continue to execute at fast speeds, including loading new data, all against the same data set. The more automatically this load balancing happens, the better.
  • ACID compliance: Right up there with poor performance is inaccurate results or reporting due to inconsistent data or dirty reads. The more people, workgroups or applications accessing the same data simultaneously, the more pressure there is to maintain data consistency. A data warehouse with ACID compliance ensures consistency and data integrity without you having to write scripts or manually manage data integrity and consistency yourself.
  • Multi-statement transaction support: Tied to ACID compliance is multi-statement transaction support. If your users group multiple statements into a single transaction, you want a warehouse solution you can trust to execute the transaction completely and with integrity, or to automatically roll it back should any statement in the line fail (a minimal sketch follows this list).
  • Data sharing: Data integrity and freshness should not stop just because you have to share data with an external stakeholder. Traditional approaches require engaging in an ETL process or spending time manually deconstructing, securing and transferring data. Data consumers on the receiving end must also spend time reversing your steps and reconstructing the data. This process is slow and manual, and does not ensure your data consumers are always working with live, up-to-date data. Data sharing allows you to eliminate all of this effort and ensure access to live data.
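
As a minimal sketch of what multi-statement transaction support looks like in practice (table and column names are made up), either both updates below take effect or neither does:

BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;  -- on an error, issue ROLLBACK instead and nothing is left half-applied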

The business value of proactively planning for concurrency is ensuring your cloud data warehouse can support your environment regardless of what it throws at you, especially during times of sudden, unpredictable, heavy query loads against a common data set.

Try Snowflake for free. Sign up and receive US$400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Let History Halve Your Next Data Analytics Purchase

If you’re lucky, you’ll spend just six to 12 months considering and buying your next enterprise software solution. If it’s a seven-figure purchase, plan for an additional six to 12 months to confirm your organization has made the best investment possible.

During that process, dozens of your IT and business leaders will engage your shortlist of on-premises and software-as-a-service (SaaS) vendors to compare these competing technologies based on architecture, features, performance, business benefits and cost of ownership. Buying a data warehouse is no different.

But what if you could shorten that process? What if you had fact-based information arranged in a non-traditional but highly effective method to help you confidently narrow your search and quicken your time to market (TTM) with your chosen solution? What would that be worth to your organization and your peace of mind?

All enterprise software decisions include alternatives that span the decades. That means your organization is likely to consider upgrading an existing technology you already own, as well as solutions that represent today’s latest SaaS offerings.

How so? One of your alternatives might be to upgrade the existing, on-premises data warehouse your organization purchased 10 to 15 years ago. That solution wasn’t much different when it first emerged in the 1990s, and it hasn’t advanced much since it was first deployed in your data center.

You may also own or are considering an on-premises NoSQL solution such as Hadoop. This technology emerged just over a decade ago, challenging the very existence of the legacy data warehouse in order to accommodate the exponential increase in the volume, variety and velocity of existing and new data types.

Since the advent of Hadoop, many traditional data warehouse vendors now offer a cloud version of their on-premises solution. With all this said, there are now many more solution categories, and many new vendors, that you must consider for your next data warehouse purchase.

Herein lies the rub. Nearly all data warehouse purchase decisions, and all enterprise software decisions for that matter, take the form of a side-by-side-by-side, laundry-list comparison. That’s a significant amount of ground to cover, especially for data warehousing, which is four decades old. Your review of competing alternatives becomes even more protracted when the architectures and features of traditional and more modern products don’t align, which they never do.

Instead, think linear. Think like a historian. Many of the solutions you’ll consider are a response to the drawbacks, and benefits, of preceding, competing technologies. For example, at Snowflake, many of our customers had previously used a legacy data warehouse for years and then added a Hadoop solution to expand their data analytics platform. Unfortunately, that combination of technologies did not meet their ever-increasing requirements.

So, before you kick off a necessary, side-by-side comparison, consider your initial group of alternative technologies as building blocks, from bottom to top, from the oldest to the most recent. Then eliminate from contention those that do not add anything new or innovative. This approach enables you to focus on more recent technologies that truly deliver better value, and that will enable you to continue innovating well into the future. In the end, you’ll get to a shorter shortlist, and quickly, thus speeding your decision-making process and TTM by up to 50 percent.

At Snowflake, we’ve done the hard work for you. We invite you to read this short but revealing ebook that details the benefits and drawbacks of each succeeding data analytics technology – from the birth of the legacy data warehouse, all the way to today’s modern, built-for-the-cloud data warehouse. We’re confident it will provide the insight you need to quicken your next data warehouse purchase by rapidly reviewing every technology that got us to where we are today.

The Virtual Private Snowflake Story

Snowflake was built as a secure, multi-tenant SaaS data warehouse. We’ve been offering a multi-tenant product designed for high-security needs for years now. But we knew from previous discussions with Capital One and other financial services companies that a dedicated solution would be required to meet their regulatory requirements.

The input we received from the financial services industry gave rise to our idea for a new product: Virtual Private Snowflake (VPS). Those deep discussions with industry customers helped shape and define the fundamental requirements of VPS:

  • Certifiably Secure with PCI support, validated by the customer’s security team.
  • The customer is in complete Control of the data with comprehensive Auditability. Customer-managed keys are a must, as is a complete record of all operations performed by the data warehouse.
  • Resilience to failures with a roadmap to cross-region business continuity. The long-term goal is a 15-minute recovery time from a total regional failure.
  • Isolation that delivers a dedicated instance of Snowflake, running in a separate Virtual Private Cloud.

Snowflake and our financial services customers have worked together since that time to further define the product requirements. Today, I’m proud to announce that our journey to building the most secure cloud data warehouse has reached its first milestone. Virtual Private Snowflake is now commercially available, with Capital One as our first customer.

VPS delivers the full power of the Snowflake service in the form of a fully managed, dedicated pod running in Amazon Web Services (AWS). Financial services, healthcare and companies from many other industries that handle sensitive data get the best of all worlds with VPS: a dedicated instance of the best cloud data warehouse.

There are many milestones that are still ahead for the most secure, flexible and powerful cloud data warehouse available. But helping customers succeed with their data projects is what we do at Snowflake. That commitment is true for all of our customers.

We thank Capital One and our other financial services customers for their support of VPS. And we look forward to working with all of our customers to help them solve their toughest data challenges.

To learn more about VPS and about Snowflake’s approach to securing sensitive data, click here.

How to analyze JSON with SQL

One of the key reasons I joined Snowflake is its built-in support to load and query semi-structured data such as JSON. In most conventional data warehouse and Big Data environments, you have to first load this type of data into a Hadoop or NoSQL platform. Then, you must parse it with a framework such as MapReduce so you can load the data into tables in a relational database. Then, and only then, you’re ready to analyze the data with SQL queries or a BI/Analytics tool. But why take extra steps when there is a faster, easier way to get the job done?

With Snowflake, you can load your semi-structured data directly into a relational table, query the data with a SQL statement and then join it to other structured data – all without fretting about future changes to the schema of that data. Snowflake actually keeps track of the self-describing schema so you don’t have to. No ETL or shredding required.

This means you can leverage your existing knowledge and skills in SQL to jump into the world of big data. Even with this feature, there is still a tiny bit to learn. However, it’s easy with Snowflake’s extensions to SQL. To get you started, we have produced a handy ebook that takes you step-by-step through loading some JSON into Snowflake, then querying that data with SQL. The book is called How to analyze JSON with SQL: Schema-on-read made easy.
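
To give you a taste before you dive into the ebook, here is a minimal sketch (stage, table and attribute names are invented):

-- Land raw JSON in a VARIANT column, no up-front schema required.
CREATE OR REPLACE TABLE raw_clickstream (v VARIANT);
COPY INTO raw_clickstream
  FROM @json_stage/clicks/
  FILE_FORMAT = (TYPE = 'JSON');

-- Query nested attributes with plain SQL, casting as needed, and flatten arrays.
SELECT v:user.id::STRING    AS user_id,
       v:event.type::STRING AS event_type,
       f.value:page::STRING AS page_visited
FROM raw_clickstream,
     LATERAL FLATTEN(input => v:event.pages) f;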

Get your free ebook now and start your journey to analyzing big data in the cloud!

Be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) for all the news and happenings here at Snowflake.

P.S. If you don’t already have a Snowflake account, you can sign up for a self-service account here and get a jumpstart with $400 in free credits!

 

New Snowflake features released in Q1’17

We recently celebrated an important milestone in reaching 500+ customers since Snowflake became generally available in June 2015. As companies of all sizes increasingly adopt Snowflake, we wanted to look back and provide an overview of the major new Snowflake features we released during Q1 of this year, and highlight the value these features provide for our customers.

Expanding global reach and simplifying the onboarding experience

Giving our customers freedom of choice, along with a simple, secure, and guided “Getting Started” experience, was a major focus of the last quarter.

  • We added a new region outside of the US; customers now have the option to analyze and store their data in Snowflake accounts deployed in EU-Frankfurt. Choosing the appropriate region is integrated into our self-service portal when new customers sign up.
  • In addition, we added our high-value product editions, Enterprise and Enterprise for Sensitive Data (ESD), to our self-service offerings across all available regions. For example, with Enterprise, customers can quickly implement auto-scale mode for multi-cluster warehouses to support varying, high-concurrency workloads (a sketch follows this list). And customers requiring HIPAA compliance can choose ESD.
  • Exploring other avenues for enabling enterprises to get started quickly with Snowflake, we partnered with the AWS Marketplace team to include our on-demand Snowflake offerings, including the EU-Frankfurt option, in their newly launched SaaS subscriptions.
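
As a sketch of the auto-scale capability mentioned above (warehouse name and limits are made up), a multi-cluster warehouse simply declares a minimum and maximum cluster count:

CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1        -- start with a single cluster
  MAX_CLUSTER_COUNT = 5        -- add clusters automatically as concurrency grows
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;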

Improving out-of-the-box performance & SQL coverage

We are committed to building the fastest cloud DW for your concurrent workloads with the SQL you love.

  • One key performance improvement introduced this quarter was the reduction of compilation times for JSON data. Internal TPC-DS tests demonstrate a reduction of 30-60% for most of the TPC-DS queries (single stream on a single, 100TB JSON table). In parallel, we worked on improving query compile time in general, providing up to a 50% improvement in performance for short queries.
  • Another new key capability is the support for bulk data inserts on a table concurrently with other DML operations (e.g. DELETE, UPDATE, MERGE). By introducing more fine-grained locking at the micro-partition level, we are able to allow concurrent DML statements on the same table.
  • To improve our data clustering feature (currently in preview), we added support for specifying expressions on table columns in clustering keys (see the sketch after this list). This enables more fine-grained control over the data in the columns used for clustering.
  • Also, we reduced the startup time for virtual warehouses (up to XL in size) to a few seconds, ensuring almost instantaneous provisioning for most virtual warehouses.
  • We extended our SQL by adding support for the ANSI SQL TABLESAMPLE clause. This is useful when a user wants to limit a query operation performed on a table to only a random subset of rows from the table.
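
Two quick sketches of the clustering and sampling items above (table and column names are invented):

-- Cluster a table on an expression over one of its columns.
CREATE OR REPLACE TABLE events (
  event_ts TIMESTAMP_NTZ,
  payload  VARIANT
)
CLUSTER BY (TO_DATE(event_ts));

-- ANSI TABLESAMPLE: run the query against a random ~10% of the table's rows.
SELECT COUNT(*) FROM events TABLESAMPLE (10);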

Staying Ahead with Enterprise-ready Security

From day one, security has always been core to Snowflake’s design.

  • We expanded Snowflake’s federated authentication and single sign-on capability by integrating with many of the most popular SAML 2.0-compliant identity providers. In addition to Okta, Snowflake now supports ADFS/AD, Azure AD, Centrify, and OneLogin, to name just a few.
  • To advance Snowflake’s built-in auditing, we introduced new Information Schema table functions (LOGIN_HISTORY and LOGIN_HISTORY_BY_USER) that users can query to retrieve the short-term history of all successful and failed login requests in the previous 7 days. If required, users can maintain a long-term history by copying the output from these functions into regular SQL tables.
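
For example, a query along these lines returns the last week of login activity for the account:

SELECT event_timestamp, user_name, client_ip, is_success, error_message
FROM TABLE(INFORMATION_SCHEMA.LOGIN_HISTORY(
       TIME_RANGE_START => DATEADD('day', -7, CURRENT_TIMESTAMP()),
       TIME_RANGE_END   => CURRENT_TIMESTAMP()))
ORDER BY event_timestamp;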

Improving our ecosystem

Enabling developers and builders to create applications with their favorite tools and languages remains a high priority for us.

  • With respect to enterprise-class ETL, we successfully collaborated with Talend in building a native Snowflake connector based on Talend’s new and modern connector SDK. The connector, currently in preview, has already been deployed by a number of joint customers with great initial feedback on performance and ease-of-use.
  • To tighten the integration of our Snowflake service with platforms suited for machine learning and advanced data transformations, we released a new version of our Snowflake Connector for Spark, drastically improving performance by pushing more query operations, including JOINs and various aggregation functions, down to Snowflake. Our internal 10 TB TPC-DS performance benchmark tests demonstrate that running TPC-DS queries using this new v2 Spark connector is up to 70% faster compared to executing SQL in Spark with Parquet or CSV (see this Blog post for details).
  • We continue to improve our drivers for our developer community. Listening to feedback from our large Python developer community, we worked on a new version of Snowflake’s native Python client driver, resulting in up to 40% performance improvements when fetching result sets from Snowflake. And, after we open-sourced our JDBC driver last quarter, we have now made the entire source code available on our official GitHub repository.
  • And, last but not least, to enhance parallel data loading via the COPY command, ETL developers can now dynamically add file metadata information, such as the actual file name and row number, which might not be part of the initial payload.
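
A rough sketch of what that looks like (stage, table and column names are made up):

COPY INTO orders_landing (src_file, src_row, order_data)
FROM (SELECT METADATA$FILENAME,        -- name of the staged file being loaded
             METADATA$FILE_ROW_NUMBER, -- row number within that file
             $1
      FROM @orders_stage)
FILE_FORMAT = (TYPE = 'JSON');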

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the running service.

  • One major addition to our service is Query Profile, now generally available and fully integrated into Snowflake’s web interface. Query Profile is a graphical tool you can use to detect performance bottlenecks and areas for improving query performance.
  • Various UI enhancements were implemented: Snowflake’s History page now supports additional filtering by the actual SQL text and query identifier. We also added UI support for creating a Parquet file format in preparation for loading Parquet data into variant-type table columns in Snowflake.
  • A new Information Schema table function (TABLE_STORAGE_METRICS) exposes information about the data storage for individual tables. In particular, a user can now better understand how tables are impacted by Continuous Data Protection, particularly Time Travel and Fail-safe retention periods, as well as which tables contain cloned data.
  • We also recently introduced smarter virtual warehouse billing through Warehouse Billing Continuation (see this Blog post for details). If a warehouse is suspended and resumed within 60 minutes of the last charge, we do not charge again for the servers in the warehouse. WBC eliminates additional credit charges, and we hope it will reduce the need for our customers to strictly monitor and control when warehouses are suspended and resumed.

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • As part of rolling out the new EU (Frankfurt) region, we increased automation of our internal deployment procedures to (a) further improve engineering efficiency while (b) laying the foundation for rapidly adding new regions based on customer feedback.
  • We further streamlined and strengthened our various internal testing and pre-release activities, allowing us to ship new features to our customers on a weekly basis – all in a fully transparent fashion with no downtime or impact to users.

Conclusion and Acknowledgements

This summary list of features delivered in Q1 highlights the high velocity and broad range of features the Snowflake Engineering Team has successfully delivered in a short period of time. We are committed to putting our customers first and maintaining this steady pace of shipping enterprise-ready features each quarter. Stay tuned for another feature-rich Q2.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Virtual Warehouse Billing Made Smarter

One of our most important commitments to our users is reducing/eliminating the management and tuning tasks imposed by other systems. We are always on the lookout for new and innovative ways of making our service easier to use.

Most recently, we looked at how our users manage their Snowflake virtual warehouses and realized we could be smarter about how we charge for the compute resources used in these warehouses.

First, Some Background Information…

To begin, here are some details about what happens when a virtual warehouse (or simply “warehouse”) is resumed and suspended in Snowflake:

  1. Each time the warehouse is resumed:
    • Snowflake provisions servers from AWS EC2 for a minimum of one hour. This is based on how AWS charges for EC2 resources.
    • The number of Snowflake credits charged depends on the number of servers provisioned, which is determined by the size of the warehouse (XS, S, M, L, XL, etc.) and the number of clusters in the warehouse (if multi-cluster warehouses are enabled).
  2. When the warehouse is suspended, the servers are shut down and credits are no longer charged.
  3. When the warehouse is resumed again, the servers are re-provisioned and the charges start over, regardless of how much time has passed since the warehouse was last charged. As a result, if the same warehouse is resumed multiple times within the same hour, credits are charged each time. As stated earlier, this follows the AWS EC2 model.

What We Changed

We noticed that some of our users were spending time and effort managing their virtual warehouses to ensure that credits were not consumed unnecessarily. We decided to eliminate this extra work by introducing Warehouse Billing Continuation (WBC).

With WBC, we now track the last time each individual server in a warehouse was charged and, if the warehouse is suspended and resumed within 60 minutes of the last charge, we don’t charge again for the server. The charge is continued from the last time as if the warehouse had never been suspended. This eliminates any extra charges, thereby reducing the need for strictly monitoring and controlling when warehouses are suspended and resumed.

How Does WBC Work?

The simple answer is it just works, regardless of how often you resume and suspend your virtual warehouses. If this answer satisfies your curiosity, you can skip now to the end of this post. Otherwise, read on for the gory details…

The best way to explain WBC is with examples. Say you have a Small warehouse (2 servers) that’s been suspended for longer than an hour. Now, imagine this warehouse goes through the following status changes:

Resumed   Suspended   Credits Charged before WBC   Credits Charged with WBC
09:15     09:25       2                            2
09:40     09:50       2                            0
10:05     10:10       2                            0
10:30     10:50       2                            2
11:20     11:40       2                            2 (at 11:30)

Before WBC, every time the warehouse was resumed, we charged 2 credits (1 credit per server in the warehouse). For example, in the scenario described above, between 09:15 and 11:40, the warehouse incurred 5 charges for a total of 10 credits.

With WBC, the behavior is different. The warehouse only incurs 3 charges for a total of 6 credits. The following diagram provides a more detailed explanation of what actually happens:

Figure 1: Example of charges for resuming and suspending a Small virtual warehouse
  1. At 09:15, 2 credits are charged for the servers in the warehouse because the warehouse has been suspended for longer than an hour and there’s no previous charge to continue.
  2. At 09:40 and 10:05, we recognize that the warehouse was charged within the last hour so no additional charges are incurred. In other words, the current charge doesn’t expire until 60 minutes after it was first incurred, so the earliest the warehouse will be charged again is 10:15, regardless of how many times the warehouse is suspended and resumed during this period.
  3. At 10:15, the warehouse isn’t charged because it’s not running at that time.
  4. At 10:30, the 2 servers are charged again because more than 60 minutes have elapsed since the initial charge. More importantly, this new time is now used to track all subsequent charges, i.e. the earliest time for the next charge is 11:30.
  5. At 11:20, no charge is incurred because 60 minutes haven’t elapsed since the last charge.
  6. At 11:30, the warehouse is charged again because 60 minutes have now elapsed since the last charge and the warehouse is running.

This example covers a relatively simple case for a Small warehouse. Each successively larger warehouse has more servers, so the scenarios are slightly more involved, especially if the warehouse is resized (or multi-cluster warehouses are being used); however, the mechanics and calculations are all the same. The most important thing to remember is that every server in a warehouse is charged independently.

WBC Example with Virtual Warehouse Resizing

Consider the same example from earlier, but starting with a Medium warehouse (4 servers) and the following resize events:

  • Warehouse resized down to Small (2 servers) at 09:30, 10:00, and 11:00.
  • Warehouse resized back to Medium (4 servers) at 09:45, 10:45, and 11:15.
Resumed   Credits Charged for Resume   Resized          Credits Charged for Resize
09:15     4                            09:30 (Small)
09:40                                  09:45 (Medium)
                                       10:00 (Small)
10:05
10:30     2                            10:45 (Medium)   2
                                       11:00 (Small)
                                       11:15 (Medium)
11:20     2 (at 11:30)                                  2 (at 11:45)

The total credits charged would be (4+2+2) + (2+2) = 12. The following diagram shows how the charges are incurred (remember that I warned you earlier about the gory details):

Figure 2: Example of charges for resuming and suspending a Medium virtual warehouse with resizing
  1. At 09:15, the initial charge for the Medium warehouse is 4 credits.
  2. At 10:30, the new charge is 2 credits (reflecting the Small size at the time the warehouse is resumed). These 2 servers will be charged next in 60 minutes at 11:30.
  3. At 10:45, a new, separate charge for 2 additional servers is incurred (due to the resize to Medium). These 2 additional servers will be charged next at 11:45, independently of the other 2 servers.
  4. All the other resizing events increase or decrease the number of servers running at that time, but incur no additional charges.

Note that this diagram illustrates how we remove servers from a warehouse, i.e. we always start with the most recently-provisioned servers (i.e. LIFO) and we add them back in the same order. This is important because servers are each charged according to their individual start times. This is also important because there’s no benefit to reducing the size of a warehouse within each hour that it runs because the servers have already been charged for the hour.

So, What Next?

There’s no need to make any changes to your virtual warehouses. We’ve enabled Warehouse Billing Continuation by default for all accounts and warehouses, including any existing warehouses. In fact, we implemented WBC near the end of March so you probably have already noticed reduced charges for some of your warehouses, particularly the ones that you resume and suspend frequently.

However, to take full advantage of this change, you might want to revisit your warehouse procedures and settings. For example, you can now set auto-suspend for a warehouse to a shorter value (e.g. 5 or 10 minutes) without worrying about being charged additional credits each time the warehouse resumes within the hour.
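
For example (the warehouse name is made up; AUTO_SUSPEND is specified in seconds):

ALTER WAREHOUSE analytics_wh SET AUTO_SUSPEND = 300;  -- suspend after 5 minutes of inactivity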

Interested in learning more? Check out the Snowflake documentation for more in-depth information about virtual warehouses.

Also, keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

RI (Referential Integrity) Constraints: 3 Reasons to Include Them in Your Data Warehouse

Over the years, I have had numerous conversations about the value of having referential integrity (RI) constraints, such as primary and foreign keys, in a relational data warehouse or data mart.

Many DBAs object that RI constraints slow the load process. This is a valid point if you are talking about enforced constraints that are checked in real time during the load. But this is not an issue if you define the constraints as disabled.

Which then leads to this common question:

Is there any reason to maintain a permanently disabled FK in the data model?  If it is not going to be enabled, then from my perspective, it doesn’t make any sense to define the FK.  Instead, the relationship can be described in the comment of the child column.

So, why would I want RI constraints in my data warehouse?

Mostly it has to do with good design and development best practices. Here is my rationale for why you should consider including RI constraints in your data warehouse design.

#1 – Design Metadata

RI constraints are valuable metadata/documentation. If somebody reverse engineers the database (say with ERWin or Oracle Data Modeler), the PKs and FKs show up in the diagram (much better than having to read a column comment to discover a relationship). This is quite valuable for new people on your project to orient themselves to the existing schema design and understand how the various tables in your data warehouse are related to each other.

RI in a diagram
A picture is worth a thousand words

#2 – BI Metadata

If you want to use any sort of reporting or BI tool against the database (it is a data warehouse,  after all), most modern business intelligence and visualization tools import the foreign key definitions with the tables and build the proper join conditions. This is much better than having someone guess what the join will be and then manually adding it to the metadata layer in the reporting tool. This also ensures that different developers don’t interpret the joins differently.

Examples of tools that can read the Snowflake data dictionary include Looker, Tableau, COGNOS, MicroStrategy, and many others. Some of these tools actually use the FK definitions for join culling to provide better query performance.

#3 – QA your ETL/ELT code

I know you think your ETL code is perfect.

But does every developer test to the same standards? Do you maybe have a QA team who separately validates that the ETL is doing what you expect?

If so, having declared primary, unique, and foreign key constraints in your data warehouse gives the team more information they can use to ensure the quality of the data. In fact, using the Snowflake Information Schema, a QA engineer can potentially generate SQL to test that the loaded data conforms to the defined constraints.
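
As a sketch of what such a generated check might look like, using the SAT_REGIONS and HUB_REGION tables from the example later in this post:

-- List the declared constraints in a schema as the starting point for generated tests.
SELECT table_name, constraint_name, constraint_type
FROM KENT_DB.INFORMATION_SCHEMA.TABLE_CONSTRAINTS
WHERE table_schema = 'DATAVAULT_MAIN';

-- An orphan check for the foreign key: child keys with no matching parent row.
SELECT COUNT(*) AS orphan_rows
FROM SAT_REGIONS s
LEFT JOIN KENT_DB.DATAVAULT_MAIN.HUB_REGION h
  ON s.HUB_REGION_KEY = h.HUB_REGION_KEY
WHERE h.HUB_REGION_KEY IS NULL;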

Defining RI Constraints in Snowflake

You, of course, can (and IMHO should) define RI constraints in Snowflake. You can define primary keys, unique keys, foreign keys, and NOT NULL constraints. Because Snowflake is specifically engineered for data warehousing, only the NOT NULL constraints are enforced. The rest are always created as disabled.

The syntax is standard SQL. You can define the constraints both inline and out-of-line.

Here is a simple example of inline constraints:

CREATE OR REPLACE TABLE SAT_REGIONS (
  HUB_REGION_KEY NUMBER(38,0) NOT NULL,
  SAT_LOAD_DTS   DATE NOT NULL,
  REGION_COMMENT VARCHAR(152),
  HASH_DIFF      VARCHAR(32) NOT NULL,
  SAT_REC_SRC    VARCHAR(50) NOT NULL,
  CONSTRAINT SAT_REGIONS_PK PRIMARY KEY (HUB_REGION_KEY, SAT_LOAD_DTS),
  CONSTRAINT SAT_REGIONS_FK1 FOREIGN KEY (HUB_REGION_KEY)
    REFERENCES KENT_DB.DATAVAULT_MAIN.HUB_REGION (HUB_REGION_KEY)
);

Example of an out-of-line constraint:

ALTER TABLE SAT_REGIONS ADD CONSTRAINT
  SAT_REGIONS_PK PRIMARY KEY (HUB_REGION_KEY, SAT_LOAD_DTS);

It’s that easy. So why don’t you have constraints in your data warehouse?

 

Did you hear? Snowflake was declared the #1 Cloud Data Warehouse in a recent GigaOM analyst report. Go here to get your copy of the report.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for updates on all the action and activities here at Snowflake Computing.