Tech Tip: Quick Start for Getting Your First Data into Snowflake

If you are like me and fairly new to this whole cloud thing, then one of your main questions is likely:

“How do I get data from my desktop (or server) into Snowflake so I can query it?”

Which, in reality, translates to:

“How do I load data in the cloud?”

Well, there are several methods, including a variety of integration and ETL tools. In this tech tip I will show you the simplest way to accomplish this via the Snowflake Web UI.

Our brilliant engineers built a nice little wizard to make it simple for cloud novices (like me) to easily upload a data file from your desktop and load it into a table in Snowflake.

Getting Started

First off, you need a table in the database that matches the structure of the data file you are going to load (you modeled that table, right?). You can build that by running a CREATE TABLE DDL in a worksheet once you have logged in.

So something like this:

CREATE TABLE Region_t1
  (
    Name               VARCHAR(200)  NOT NULL,
    Code               VARCHAR(2)    NOT NULL,
    Abbreviation       VARCHAR(2)    NOT NULL,
    Sort_Order         NUMBER(2)     NOT NULL,
    External_Reference VARCHAR(1000) NOT NULL,
    Record_Source      VARCHAR(100)  NOT NULL,
    Record_Owner       NUMBER(1)     NOT NULL,
    Comments           VARCHAR(2000)
  );

ALTER TABLE Region_t1 ADD CONSTRAINT PK_Region_t1 PRIMARY KEY ( Code );

ALTER TABLE Region_t1 ADD CONSTRAINT UK_Region_t1 UNIQUE ( Name );

Invoke the Load Wizard

To begin the load process, navigate to the Database tab in the UI, drill into your database and then click on the table name. Once you do that, the Load Table widget is displayed (fourth from the left on the toolbar).  Click on the widget, and the Load Data wizard opens and asks for a warehouse. Select a virtual warehouse to use from the drop-down list. (Note: The list displays any warehouse you have the USAGE privilege on.)

[Screenshot: selecting a warehouse for the load]

Select the File to Load

In the next step, you select the source file to load. You have options to get a file from your file system, or from an AWS S3 bucket if you have already moved some files to the cloud.

[Screenshot: source file selected for loading]

Note that you can also load multiple files! The select dialog box supports multi-select. But keep in mind that the files must all be the same type with the same structure because they’re all going into the same table.

In the next step, you pick a file format from the drop-down list. In this case, I have selected a format that specifies a CSV type file with comma delimiters.

[Screenshot: picking a file format for the load]

Creating a File Format (if necessary)

If you do not have an appropriate file format already, simply click the plus sign (+) next to the drop-down arrow in the UI to define a new one. For a standard CSV file that uses commas as the delimiter, you really just need to give the file format a name and click Finish. However, if the first row of the file includes column headings, be sure to set Header lines to skip to 1.
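
If you prefer to script this step instead, the equivalent DDL is roughly the following (a minimal sketch; the format name is just an example):

-- a minimal CSV format; SKIP_HEADER is only needed if the first row holds column headings
CREATE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;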

[Screenshot: creating a new file format]

Setting Load Options

The next step of the wizard allows you to choose how to handle load errors. There are four options. I chose to have the load continue by skipping any problem records (which I will deal with after the load).

[Screenshot: error-handling options for the load]

At this point, you are set to go. If you want to see the code for the load (so you can learn something new, right?), just click the Show SQL link before you click Load.

[Screenshot: SQL generated for the load]

Running the Load

Once you click Load, the process starts and does the following (a rough sketch of the equivalent SQL follows the list):

  1. Uses PUT to copy the file to a Snowflake internal staging area, automatically encrypting the data during the process.
  2. Uses COPY to load the data into the target table and deletes the file from the staging area upon completion.
  3. Reports the results.
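
For reference, the SQL behind those steps looks roughly like the sketch below; the local file path, the table stage, and the file format name are illustrative, and the statements the wizard actually generates may differ slightly:

-- stage the local file in the table's internal stage (the file is encrypted on the way up)
PUT file:///tmp/region.csv @%region_t1;

-- load the staged file, skipping problem records per the load option chosen above,
-- and remove it from the stage once it has loaded successfully
COPY INTO region_t1
  FROM @%region_t1
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  ON_ERROR = 'CONTINUE'
  PURGE = TRUE;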

[Screenshot: load results]

And voila, you have now loaded data from your desktop into the cloud-native Snowflake Elastic Data Warehouse. Congratulations.

For a much deeper dive into loading data into Snowflake, check out the Data Loading topic in our user guide.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more tips and tricks, along with updates on all the action and activities here at Snowflake Computing.

Cloud Analytics in a City Near You: More Learnings and a City Tour Sneak Peek

Last week we discussed a theme that we saw in the Cloud Analytics Symposiums that we had earlier this year – that the use of cloud for data analytics is actually changing the way people work, making not only greater collaboration between people and groups possible, but also enabling more experimentation and creativity in the work of data analysis.

As we kick off our fall Cloud Analytics City Tour today in New York, we wanted to discuss another lesson learned from these past symposiums and to provide a sneak peek of what to expect at our upcoming City Tour events.

This spring we heard thought leaders discuss ways in which organizations are looking to self-service analytics, streaming data, and multi-source data exploration to connect not only formerly siloed information within an organization, but also to include diverse external data that had previously been difficult to integrate and leverage. Beyond just bringing together data, these companies are hoping to create a business mindset where every decision possible becomes a data-driven decision.

However, the goal of using data-driven decision making across the organization has all too often been stymied by critical obstacles. Tamara Dull, a speaker at our spring symposium who was named #13 on the Big Data 2015: Top 100 Influencers in Big Data list, pointed out that a shortage of people resources, along with organizational assumptions about big data solutions, often hinders the implementation of successful big data infrastructure and processes.

While self-service infrastructure can address part of this conundrum, there has been a painful shortage of qualified people who can implement, manage, and maintain the existing "big data" solutions, as well as a shortage of skilled data scientists to support the analytics that decision-makers depend on.

Analytics tools including Tableau and Looker help data analysts work with data more easily, but when plugged into a modern, agile data warehousing solution these services are super-charged – able to provide more insight with more data, faster, and at any scale required. Dean Abbott, Co-founder at SmarterHQ and a data scientist with over 21 years of experience, noted that Snowflake’s unique support for semi-structured data can make it possible for analysts to access and join a greater diversity of data faster than ever before, in minutes rather than hours.

And what's more, with a data warehouse built for the cloud, the entrenched impression that an organization needs a new breed of IT staff, and more of them, to handle the care and feeding of the data platform no longer holds true. In fact, it doesn't take a huge IT organization to support big data. Snowflake client PlayStudios made that point clear at our symposiums.

And the benefits are not limited to technology companies. The University of Notre Dame spoke about how they improved both performance and analyst productivity by migrating from an in-house data warehouse appliance to Snowflake.

Building on what we learned at our symposiums this spring, we approach our fall Cloud Analytics City Tour looking for new insights into how people and organizations are using data. Today in New York, we'll hear David Linthicum, CTO & CEO at Cloud Technology Partners and the #1 Cloud Industry Analyst Influencer as rated by Apollo Research, share insight into how cloud is being used for data management and analysis across different industries. We will also hear from Accordant Media about how the use of cloud technology in their data handling has super-charged their products and businesses.

If you haven’t signed up already, you can do so at http://cloudanalyticscitytour.com/new-york . We’ll also be bringing our City Tour to other cities including Denver, Dallas, Seattle, San Francisco, Chicago and Los Angeles where we’ll have the chance to hear from thought leaders including Dave Menninger of Ventana Research, Claudia Imhoff from Intelligent Solutions, William McKnight from the McKnight Consulting Group, Tony Cosentino, Noel Yuhanna of Forrester, and more. You can see the full schedule and register for a city near you at http://cloudanalyticscitytour.com.

Cloud Analytics: Sharing Information, Insights and Innovations

While it’s been a busy summer here at Snowflake, we are now picking up even more momentum as we head into the fall. In addition to showcasing our cloud data warehouse at industry conferences such as the upcoming Strata + Hadoop World conference in New York, our team will be hitting the road for our fall Cloud Analytics City Tour, kicking off next week in New York and Boston. (You can see the full fall schedule and register for a city near you here.)

To provide some background, this past spring Snowflake invited data professionals to participate in cloud analytic symposiums held in Chicago and Los Angeles. Attendees included a broad array of people interested in innovation in analytics. Our goal was to create a forum where a diverse set of cloud and data analytics professionals could discuss how and why the cloud is playing an increasingly prominent role in analytics and share experiences and recommendations for managing and using data in the cloud.

The knowledge exchanged at the events and the feedback we received from attendees demonstrated the value of a forum where data professionals can learn and share information and ideas. Speakers shared insights about how the cloud is not simply about efficiencies that save time and budget (one of the first things people think of when it comes to cloud solutions in general); rather, cloud is causing a sea change in the status quo, creating new opportunities for a broader array of data users to experiment, innovate and make exponential progress in putting data to work.

The Creative Destruction Cycle in Analytics

Dean Abbott, co-founder and Chief Data Scientist of SmarterHQ and a speaker at our symposium in Los Angeles, described this change as making possible a rapid and continuous cycle of creation, destruction and re-creation that is enabling him and his team of data scientists to test and iterate on the fly. In on-premises data analytics environments, that team would have been required to do careful planning well in advance because of the upfront investment required to make sure that the appropriate resources were purchased and deployed to support projects. In the cloud, environments can be created, populated, used, and destroyed on the fly, making it easy to experiment and iterate rapidly. This different way of working is giving his team’s work added relevancy as a result of being able to test more hypotheses within the same budget and time frame. Moreover, because of the simpler, more agile environment that the cloud enables, the team has access to fresher data, making it possible for new data from the field and other sources to be quickly normalized, integrated and examined against historical data in a way that had not been possible prior to the use of cloud services for data handling and report generation.

This continuous, iterative cycle of data experimentation is vastly different from the traditional lock-step, labor-intensive framework in which data users labored and debated incessantly over the hypotheses they would be testing before undertaking the report generation phase. That report generation phase alone could take days, even weeks, to complete. And even before that, IT had to choose the technology and systems to run the reports before the data had even been examined. Before the advent of the cloud data warehouse and cloud analytics tools, ambiguities, inconsistencies and incongruities were common and difficult to test. The data analysts' workflow contained a substantial amount of guesswork, with gaps and delays developing between what the scientists already knew, what data still needed to be tested, and any new data coming in.

You Have All That Data, Now What Do You Do With It?

Tamara Dull, Director of Emerging Technologies at SAS and #13 on the Big Data 2015: Top 100 Influencers in Big Data list, pointed out that cloud has made utilizing the benefits of a data warehouse more accessible to a wider diversity of types and sizes of organizations than in the past. This new accessibility is not only improving data management and enhancing security but also, similar to the experience shared by Dean Abbott, creating new opportunities in data discovery and providing a platform for advanced analytics.

In the past, when data was more homogenous and there were fewer data sources, new and old data could be integrated via complex data integration pipelines, carefully planned data warehouses and sometimes some very large Excel spreadsheets. But with the advent of new data sources and formats such as web application data, mobile user data, and now IoT data streams, traditional systems can no longer keep up. The result is that these gaps have been getting bigger, potentially at great cost to accuracy and effectiveness for those still using these old systems.

It was clear from the presentations and discussions that a variety of organizations, from revolutionary start-ups to reinvented Fortune 500s, are building and rebuilding their data-driven operations in the cloud to ensure that their data management infrastructures are as flexible as the incoming data. The outcome of this new paradigm is that while approaches and methodologies for using data can differ vastly between organizations, storing and using data in the cloud opens up exciting new possibilities for data analytics. These changes aren't just better for business, but have become a requirement for thriving in an increasingly data-driven world.

Coming Up Next

We’re looking forward to more discussions and insights in this fall’s City Tour.  We hope you’ll join us to share in the discussion and add your own insights.


SnowSQL Part II: Interactivity

In our previous blog, we introduced you to SnowSQL – Snowflake’s new modern command line tool that we built from scratch. Today we are going to talk about how SnowSQL lets you interactively query your data.

Making SnowSQL Yours

While SnowSQL works out of the box after installation, you can customize it to take full advantage of its features.

For example, by defining connections in the config file, I can preset my environment (account, warehouse, database, etc.) and don't have to worry about connection strings or exposed passwords.
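
For example, a named connection in the config file (~/.snowsql/config by default) might look like the following sketch, where the connection name and all bracketed values are placeholders:

[connections.my_dev]
accountname   = <account_name>
username      = <user_name>
password      = <password>
dbname        = <database>
schemaname    = <schema>
warehousename = <warehouse>

[options]
output_format = fancy_grid

Starting SnowSQL with snowsql -c my_dev then picks up all of those settings.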

In order to keep track of which environment I am in, I customized my prompt to show only the information I need. (At this stage I am not concerned with the warehouse.)

Default prompt:

[user]#[warehouse]@[database].[schema]>

[Screenshot: default prompt]

My prompt:

[Screenshot: customized prompt]

By the same token, I set output_format to fancy_grid, making it much easier to read the output on my screen.

Although you could do without a customizable prompt (as users of most other SQL clients must), it is nice to be able to tailor things to your liking. This, combined with progress bars for the PUT and GET commands and a few other customization options, makes for a slick user experience, unlike other paleolithic SQL command line tools.

Using SnowSQL

SnowSQL also includes very powerful line editing commands to simplify the manipulation of queries. All the usual shortcuts are there: CTRL-R to search the history and the arrow keys to navigate up and down through it. By using the !edit command, I can modify large queries in my favorite editor (sorry emacs users, it's vim for me), rather than on the command line.


Since some queries can take a long time to run, I run them concurrently to avoid disrupting my workflow. To run a query in the background, just add > after the semicolon.

I can then check all the queries that are running or finished by using the !queries command. With variable_substitution turned on, the !queries command will bind a variable to each query ID.

I can abort any running query with !abort and fetch a query's result with !result, passing the bound variable (for example, !abort &1). If the query is still running, !result waits until the result comes back. If the query has finished, it simply fetches the result, so I can always go back to previous queries.
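
Put together, a background-query session looks something like this sketch (the query and the bound variable numbers are just for illustration):

select count(*) from big_table;>   (the trailing > sends the query to the background)
!queries                           (lists queries and binds &1, &2, ... to their IDs)
!result &1                         (fetches the result of the first query, waiting if it is still running)
!abort &2                          (cancels the second query)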


We want SnowSQL to be your tool of choice. If you come across any features that you like or miss, please let us know. We would like to acknowledge our main software engineers Shige Takeda (@smtakeda) and Baptiste Vauthey (@thabaptiser) for their main contributions. As always, keep an eye on this blog site, and our Snowflake Twitter feed (@SnowflakeDB) for updates on all the action and activities here at Snowflake Computing.

In the meantime, stay tuned for blog #3, where we will demonstrate SnowSQL’s scripting capabilities.

Mobile customer experience matters

Customers are not only interacting with your app; they are also using multiple social networks from their mobile devices to interact with your organization, making the mobile experience essential. Customer behavior has multiple sources; understanding that behavior requires a data platform that can easily incorporate and process all these sources of data along with the web traffic being generated by your app. It can be difficult to integrate these disparate sources given key requirements such as:

  • Ingesting different types of data
  • Accommodating data changes
  • Scaling in sync as the user base grows, to ensure the supply of analytics resources matches demand

Regardless, organizations with a successful mobile strategy have been able to find a way. They use data analytics solutions to differentiate themselves and personalize their customers' mobile experience. They use these solutions to better understand how customers interact with their organization through a mobile platform, to get a stronger grip on customer experience, and to fix unforeseen problems.

One such organization is Chime. Chime is revolutionizing banking for the mobile generation by designing their mobile presence around the customer. Their app is aimed at helping people lead healthier financial lives and automate their savings. "Chime is designed for the millennial generation who expect services to be personalized and mobile-first," says Ethan Erchinger, Chime's Director of Technical Operations.

See how Chime is able to easily ingest and analyze data from 14 different sources including Facebook, Google and applications that emit JSON data in order to effectively analyze customer experience and feedback within their mobile platform. For example, Chime is able to collect and analyze feedback from app users to personalize their experience based on their geographic location, so that they can get helpful hints about saving money in their area. Read Chime’s story here.

SnowSQL Part I: Introducing SnowSQL – a modern command line tool built for the cloud

What & Why

This month, we introduced a new, modern command line tool for an interactive query experience with the Snowflake Elastic Data Warehouse cloud service. Yes indeed – a new command line tool!

Now you might ask – why did Snowflake decide to invest in building a new SQL command line tool in the 21st century when there are so many different ways of accessing and developing against the Snowflake service? For example, you could use our current user interface, or various drivers, or other SQL editor tools available today.

The answer is simple:

  • First of all, developers still care very deeply about light-weight mechanisms to quickly ask SQL questions via a text-based terminal with simple text input and output.
  • Secondly, we recognized the need for a more modern command line tool designed for the cloud that is easy to use, built on high security standards, and is tightly integrated with the actual core Snowflake elastic cloud data service.
  • Finally, we wanted to build a command line tool which offers users more powerful scripting capabilities overall.   

The result of these efforts is SnowSQL – our new SQL command line tool that is entirely built in Python and leverages Snowflake’s Python connector underneath.

Looking at existing, more generic command line tools such as SQLLine or HenPlus, it became obvious that these tools lack ease of use and do not offer sufficient scripting features. Also, Snowflake's approach was to consider the command line tool as part of the overall cloud service. That is, SnowSQL can be seen as an extension of the Snowflake Elastic DW service. This cloud-first philosophy has important implications for the entire life cycle of Snowflake's command line tool, including the agility of delivering new SnowSQL features as part of the frequent Snowflake service updates (see the auto-upgrade section below).

Getting Started

To get started, you can download SnowSQL from the Snowflake UI after logging into your Snowflake account. SnowSQL is currently supported on all three major platforms including:

  • Linux 64-bit,
  • Mac OS X 10.6+, and
  • Windows 64-bit

We provide a native installer for each platform with easy-to-follow installation steps. Once you have installed SnowSQL, open a new terminal and type:

$ SNOWSQL_PWD=<password> snowsql -a <account_name> -u <user_name>

Upon first use, SnowSQL users will notice a progress bar that indicates the download of the initial version of SnowSQL. This is a one-time operation. Any future downloads happen in the background and remain entirely transparent to the user. The ability to automatically and fully transparently upgrade the command line tool is one of the key benefits of building it as a cloud extension rather than as a stand-alone software component interacting with a remote service (see the auto-upgrade section below).

Using SnowSQL

There are a few capabilities we would like to highlight in this blog.  

SnowSQL Commands

First, SnowSQL offers a wide range of commands a user can make use of. As a general rule, all SnowSQL commands start with a bang character ('!'). For a complete list of currently supported commands, please see our documentation here.
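
A few that I find myself reaching for regularly (the full, current list is in the documentation):

!help                              (list all available commands)
!set output_format=fancy_grid      (change an option for the current session)
!edit                              (open the previous query in your editor)
!queries                           (show running and recently completed queries)
!quit                              (exit SnowSQL)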

Auto-Complete and Syntax Highlighting

Secondly, with our context-sensitive auto-complete feature, SnowSQL users are relieved of the cumbersome and error-prone typing of long object names. Instead, they can complete SQL keywords and functions after typing the first three letters. By leveraging auto-complete, SnowSQL users become more productive and can quickly explore data in Snowflake. Furthermore, SQL statements are highlighted in different colors, resulting in better readability when working in a terminal.

 

Auto-Upgrade

Because SnowSQL is delivered as part of the service, the auto-upgrade framework enables users to always stay up to date with the latest Snowflake and SnowSQL features, streamlining the end-user experience. No additional downloads or tedious re-installations are needed. The upgrade is transparent to the end user and takes place as a background process when you start SnowSQL. The next time a user runs SnowSQL, the new version is picked up automatically, while their workflows remain unaffected and uninterrupted during the upgrade.

Secure Connection and Encryption

Finally, security is core to SnowSQL's design. SnowSQL secures connections to Snowflake using TLS (Transport Layer Security) with OCSP (Online Certificate Status Protocol) checks. The auto-upgrade binaries are always validated using an RSA signature. In addition to the secured connection, SnowSQL provides end-to-end security for data moving in and out of Snowflake by using AES (Advanced Encryption Standard) encryption for Snowflake's PUT and GET commands.

In the next few weeks, we will provide a deeper dive into some of the unique capabilities of SnowSQL. Please stay tuned as we gather and incorporate feedback from our customers. We would like to acknowledge our main software engineers Shige Takeda (@smtakeda) and Baptiste Vauthey (@thabaptiser) for their main contributions.

Finally, all of this would not have been possible without the active Python community who inspired us in many ways and offered us tools and packages to build SnowSQL. Thank you.   

As always, keep an eye on this blog site, and our Snowflake Twitter feed (@SnowflakeDB) for updates on all the action and activities here at Snowflake Computing.

Empowering Data Driven Businesses via Scalable Delivery

The accessibility of data has changed the way that businesses are run. Data has become essential to decision makers at all levels, and getting access to relevant data is now a top priority within many organizations. This change has created huge demands on those who manage the data itself. Data infrastructures often cannot handle the demands of the large numbers of business users, and the result is often widespread frustration.

The concept of an Enterprise Data Warehouse offers a framework to provide a cleansed and organized view of all critical business data within an organization. However, this concept requires a mechanism to distribute the data efficiently to a wide variety of audiences. Executives most often require a highly summarized version of the data for quick consumption. Analysts want the ability to ‘slice and dice’ the data to find interesting insights. Managers need a filtered view that allows them to track their team’s data in the context of the organization as a whole. Strategists and economists need highly specialized data that can require highly complex algorithms and tools to generate. If one system is to serve the data for all of these audiences, along with numerous others, it must be able to scale to provide access to all of them.

Traditional Data Warehouse architecture has not been able to adequately respond to this demand for data. Many organizations have been forced to move to hybrid architectures in response. OLAP tools and datamarts became commonplace as mechanisms to distribute data to business users. Statisticians and later data scientists often had to create and maintain entirely separate systems to handle data volumes and workloads that would overwhelm the data warehouse. Data is copied repeatedly into a variety of systems and formats in order to make it available to wider audiences. A vice president at a large retail company openly referred to their data warehouse as a data pump that was entirely consumed by the loading of data on one side and the copying of that data out to all sorts of systems on the other side. This also leads to challenges in ensuring there is one vision of the data inside the organization and synchronizing all the siloed data.

During my time working in a business intelligence role for a number of the largest organizations in the world, I encountered all of these problems. While supporting the Mobile Shopping Team at a major online retailer, I had access to a wide variety of powerful tools for delivering data to my coworkers that relied on that data for making key decisions about the business. However, coordinating all of these systems so that they produced consistent results across the board was a huge headache. On top of that, many of the resources were shared with other internal teams, which meant that we were regularly competing for access to a limited pool of resources. There were a number of situations where I could not get data that was requested by my VP in a timely manner because other teams had consumed all available resources in the warehouse environment.
The architects behind Snowflake were all too familiar with these problems, and they came to the conclusion that the situation demanded an entirely new architecture. The rise of cloud computing provided the raw materials: vast, scalable storage completely abstracted from the idea of server hardware or storage arrays, plus elastic compute. More importantly, this storage is self-replicating, automatically creating additional copies of items in storage if they become 'hot'. This new storage, combined with the ability to leverage nearly unlimited computing power, is at the heart of the multi-cluster shared data architecture that is central to Snowflake's Elastic Data Warehouse.


At Snowflake, data is stored once, but can be accessed by as many different compute engines as are required to provide responses to any number of requests. Data workers no longer need to focus on building out frameworks to deal with concurrency issues. They can keep all of their data in one system and provide access without having to consider how different groups of users might impact one another. The focus can then be placed where it belongs, on deriving value from the data itself. Thus you can load data and at the same time have multiple groups, each with their own computing resources, query the data. No more waiting for some other groups to free up resources. Your business can now deliver analytics to the various users: executives, analysts, data scientists, and managers, without the environment getting in the way of performance.

As always, keep an eye on our Snowflake-related Twitter feeds (@SnowflakeDB and @ToddBeauchene) for continuing updates on all the action and activities here at Snowflake Computing.

Automatic Encryption of Data

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#2 Automatic Encryption of Data

One of the biggest worries people have about moving to the cloud is security. One key piece of providing enterprise class security is the ability to encrypt the data in your data warehouse environment. With Snowflake, your data is automatically encrypted by default.

No setup, no configuration, no add-on costs for high security features. Data is encrypted during its entire lifecycle. From loading data to storing data at rest, we apply end-to-end encryption, such that only the customer can read the data, and no one else. It is just part of the Snowflake service! That is a huge win for anyone who has ever tried to set up database security of any kind. In addition, this gives Snowflake a significant advantage compared to environments like Hadoop, where encryption and security are almost entirely left up to the customer to implement and maintain.

So what level of encryption?

For all data within Snowflake, we use strong AES 256-bit keys. Your data is encrypted as you load it. That is the default, and you cannot turn it off. In addition, our Snowflake security framework includes additional security best practices such as the use of a hierarchical key model and regular key rotation. All of this is automatic and transparent to our customers. In this way we provide our customers best-in-class data security as a service. For even more details on our approach to end-to-end encryption and how we secure your data, check out these blogs: end-to-end encryption and encryption key management.

Remember the #10 Top feature – persistent result sets? Well those query results are also encrypted with 256-bit encryption keys (all the data, all the time).

So do you have your data at rest encrypted in your data warehouse today? Are your loading tools and staging environments also encrypted?

If not, then put your data into the Snowflake Elastic Data Warehouse and rest easy knowing your data is safe and secure. Snowflake brings enterprise grade security to data warehousing in the cloud, with end-to-end encryption as a major part of our offering.

As a co-writer of this series, I would like to thank Kent Graziano, who put a lot of effort into bringing the thoughts behind this series to the audience. Without his persistence and vision, this would not have been possible. As always, keep an eye on this blog site and our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

Kent Graziano and Saqib Mustafa

Support for Multiple Workloads

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#3 Support for Multiple Workloads

With our unique Multi-cluster, shared data architecture, Snowflake can easily support multiple, disparate workloads. Workload contention is a common issue in traditional data warehouses, so it makes total sense to keep disparate workloads truly separate and avoid resource contention, rather than just claiming to support "mixed" workloads.

In legacy data warehouse environments, we often found ourselves constrained in what we could run and when we could run it, for fear of resource contention, especially on the CPUs. In many cases it was impossible to refresh the data during the day because the highly parallelized batch ETL process, while tuned for maximum throughput, usually hogged all the CPUs while it ran. That meant virtually no reporting queries could get resources, so they would just hang. Likewise, a complex report with calculations and massive aggregations would cause normally fast, simple reports to languish. And there was no way you could let business users into the system to run exploratory queries, as those might cause everything else to hang.

Because of the separation of compute and storage native to Snowflake’s architecture, as previously highlighted, you can easily spin up a set of compute nodes (we call them Virtual Warehouses) to run your ELT processes, and another set to support your BI report users, and a third set to support data scientists and data miners. In fact you can spin up (or down!) as many virtual warehouses as you need to execute all the workloads you have.
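
For example, a sketch of what that might look like in SQL (the warehouse names and sizes are only examples):

-- one warehouse per workload; adjust names and sizes to suit
CREATE WAREHOUSE elt_wh WITH WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;  -- ELT loads
CREATE WAREHOUSE bi_wh  WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;  -- BI report users
CREATE WAREHOUSE ds_wh  WITH WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 600 AUTO_RESUME = TRUE;  -- data science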


So not only does each virtual warehouse share the same data (ensuring consistent results), each also runs unaffected by operations launched in other virtual warehouses, because they use completely separate resources. Hence, there is no more resource contention!

With the Snowflake Elastic Data Warehouse, there is no more need to run the data loads at night just to avoid slowing down the reports. No more worry that one runaway query will impact the loads or other users. You can now run loads (e.g., real-time, micro-batch, etc.) at any time and thus provide your analysts and users current data on a more frequent basis.

And even better – no special skills or secret configuration settings are required to make this work. It is the way Snowflake’s Data Warehouse as a Service (DWaaS) is built by design.

Nice!

For a quick look at how this works, check out this video.

Thanks to Saqib Mustafa for his help and suggestions on this post.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

 

Automatic Concurrency Scaling in Snowflake – Another Way the Cloud Changes the Game

The challenge with concurrency

Today we take a major step forward by extending our elastic architecture to solve another major pain point in existing on-premises and cloud data warehousing solutions: how to run massively concurrent workloads at scale in a single system.

Have you had the following experiences when building mission-critical applications that incorporate data analytics:

  • My application can only support a certain level of user concurrency due to the underlying data warehouse, which only allows 32-50 concurrent user queries.
  • To build my application, I need to acquire multiple data warehouse instances in order to isolate numerous workloads and users from each other. This adds to costs and complexity.
  • We have built our own scheduling policies around the data warehouse. We use query queues to control and prioritize incoming queries issued by our numerous users.
  • During peak times, users are getting frustrated because their requests are getting queued or fail entirely.

At Snowflake, we separate compute from storage through the unique concept of virtual data warehouses. That concept makes it possible to instantly resize virtual warehouses or pause them entirely, and it is also why Snowflake is the only cloud data warehousing solution that allows concurrent workloads to run without impacting each other.

However, we saw the need to go a step further to offer a service that adapts to changing workloads and addresses concurrency at the same time:

  • Imagine you didn’t have any concurrency limitations on your mission-critical business application.
  • Imagine you didn’t need users to adjust their workloads to accommodate data warehouse bottlenecks.
  • Imagine the data warehouse itself could detect increasing workloads and add additional compute resources as needed, or shut down/pause compute resources when workload activity subsides.
  • Imagine your application could scale out-of-the-box with one single (virtual) data warehouse without the need to provision additional data warehouses.
  • Imagine a world without any scheduling scripts and queued queries – a world in which you can leverage a smart data warehousing service that ensures all your users get their questions answered within the application’s SLA.

With Snowflake, you can do all of this for real, not just in your imagination, with our new multi-cluster data warehouse feature.

Multi-cluster data warehouses

A virtual warehouse represents a number of physical nodes a user can provision to perform data warehousing tasks, e.g. running analytical queries. While a user can instantly resize a warehouse by choosing a different size (e.g. from small to 3X large), until now a virtual data warehouse in Snowflake always consisted of one physical cluster.

With the recent introduction of multi-cluster warehouses, Snowflake supports allocating, either statically or dynamically, more resources for a warehouse by specifying additional clusters for the warehouse.


Figure 1: How automatic scaling works

 

The figure above shows a multi-cluster DW that consists of three compute clusters. All compute clusters in the warehouse are of the same size. The user can choose from two different modes for the warehouse:

  • Maximized: When the warehouse is started, Snowflake always starts all the clusters to ensure maximum resources are available while the warehouse is running.
  • Auto-scale: Snowflake starts and stops clusters as needed to dynamically manage the workload on the warehouse.

As always, in Snowflake a user can either leverage the user interface or use SQL to specify the minimum/maximum number of clusters per multi-cluster DW:

Create Warehouse UI wizard


Figure 2: Create Warehouse in UI Wizard

 

Create Warehouse SQL script


Figure 3: Create Warehouse via SQL
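
In SQL form, the statement behind Figure 3 looks roughly like this sketch (the warehouse name, size, and cluster counts are illustrative):

-- name, size, and cluster counts are illustrative
CREATE WAREHOUSE report_wh WITH
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  AUTO_SUSPEND      = 300
  AUTO_RESUME       = TRUE;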

 

Similar to regular virtual warehouses, a user can instantly resize all clusters of a multi-cluster warehouse by choosing a different size (e.g. XS, S, M, L, …) either through the UI or programmatically via the corresponding SQL DDL statements. In auto-scale mode, Snowflake automatically adds or resumes additional clusters (up to the maximum number defined by the user) as soon as the workload increases. If the load subsides again, Snowflake shuts down or pauses the additional clusters. No user interaction is required – this all takes place transparently to the end user.
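
For instance, adjusting the size or the cluster limits later is a single statement, along these lines (again illustrative):

-- later, change the size or the cluster limits with one statement
ALTER WAREHOUSE report_wh SET
  WAREHOUSE_SIZE    = 'LARGE'
  MAX_CLUSTER_COUNT = 5;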

Internally, the query scheduler takes multiple factors into account when making these decisions. The two main factors considered in this context are:

  1. The memory capacity of the cluster, i.e. whether clusters have reached their maximum memory capacity
  2. The degree of concurrency in a particular cluster, i.e. whether there are many queries executing concurrently on the cluster

As we learn more from our customers’ use cases, we will extend this feature further and share interesting use cases where multi-cluster data warehouses make a difference. Please stay tuned as we continue reinventing modern data warehousing and analytics by leveraging the core principles of cloud computing.

We would like to thank our co-authors, Florian Funke and Benoit Dageville, for their significant contributions as main engineer and architect, respectively, to making multi-cluster warehouses a reality. As always, keep an eye on the blog and our Snowflake Twitter feed (@SnowflakeDB) for updates on Snowflake Computing.