Why You Need a Cloud Data Warehouse

Are you new to the concepts of data warehousing? Do you want to know how your enterprise can benefit from using a data warehouse to streamline your business, better serve your customers and create new market opportunities? If so, this blog is for you. Let’s cover the basics.

It begins with production data

Day-to-day, nearly all enterprise-class companies process data as part of core business operations. Banks process debit and credit transactions for account holders. Brick-and-mortar and online retailers process in-store and website purchases, respectively. Insurance companies maintain and update customer profile and insurance policy information for policyholders.

These production systems are transactional by nature and require databases that can capture, write, update or delete information at the pace of business operations. The systems behind these transactions are online transaction processing (OLTP) databases. For example, OLTP databases for investment firms must operate at lightning speed to keep up with high-volume stock and bond trading activity that occurs in fractions of a second.

The need for a data warehouse solution

In addition to capturing transactions, another aspect of business operations is understanding what’s happening, or what has happened, based on the information captured in OLTP databases. By this, I mean companies must not only know how much revenue is coming in, but also where that revenue is coming from, the profile of the customers making purchases, whether business trends are up or down, which products and services are being purchased and when those transactions take place. And, certainly, businesses need to know what it will take for customers to remain loyal and buy more. Answers and insights to these questions are necessary to develop strategic business plans and new products that keep businesses growing.

Why transactional (OLTP) systems are not optimized for data warehousing

Acquiring these insights requires accumulating, synthesizing and analyzing the influx of data from OLTP databases. The aggregation of all this data results in very large data sets for analytics. In contrast, when OLTP systems capture and update data, the amount of data touched by each transaction is actually very small. However, OLTP systems will execute thousands upon thousands of these small transactions at a time. This is what OLTP systems are optimized to do; they are not optimized for the analysis of large to extremely large data sets.

This is why data warehousing solutions emerged. A data warehouse holds a copy of the data stored in OLTP databases. Data warehouses also hold exponentially larger amounts of data accessed by enterprises, thanks to the enormous volume of Internet and cloud-born data. Ideally, data warehouses should be optimized to handle analytics on data sets of any size. A typical data warehouse has two primary components: one, a database (or a collection of databases) to store all of the data copied from the production systems; and two, a query engine, which enables a user, a program or an application to ask questions of the data and present an answer.

Benefits of deploying a data warehouse

As previously stated, with a data warehouse you can ask and find answers to questions such as the following (a sample query appears after the list):

  • What’s the revenue?
  • Who’s buying?
  • What’s the profile of customers?
  • What pages did they visit on our website?
  • What caught their attention?
  • Which customers are buying which products?
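As a rough illustration, a single SQL query against the warehouse can answer several of these questions at once. This is a minimal sketch; the table and column names are hypothetical:

-- Hypothetical schema: revenue and distinct buyers by customer segment and product
select c.customer_segment,
       p.product_name,
       count(distinct o.customer_id) as buyers,
       sum(o.order_amount) as revenue
from orders o
join customers c on c.customer_id = o.customer_id
join products p on p.product_id = o.product_id
where o.order_date >= dateadd(month, -3, current_date)
group by 1, 2
order by revenue desc;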

With natural language processing and other deep learning capabilities gaining popularity, you can even develop insights into the sentiment of prospects and potential customers as they journey toward your enterprise.

Benefits of data warehousing… in the cloud

Many data warehouses deployed today were designed during the 1980s and built for the on-premises data centers typical of that time. These solutions still exist, including “cloud-washed” versions. Both options typically involve upfront licensing charges to buy and maintain these legacy data warehouses. Yet neither legacy data warehouses nor current-generation data lakes based on Hadoop can elastically scale up, scale down or suspend as needed to meet the continuously varying demands of today’s enterprises.

As a result, these types of solutions require a lot of attention to low-level infrastructure tasks, which diverts IT and data science teams from truly strategic analytics projects that advance the business.

With modern, cloud-built data warehouse technology now available, such as Snowflake, you can gather even more data from a multitude of data sources and instantly and elastically scale to support virtually unlimited users and workloads.

All of this is accomplished while ensuring the integrity and consistency of a single source of truth, without a fight for computing resources. That single source can include a mix of data varieties, such as structured and semi-structured data. Because Snowflake is a modern cloud service, any number of users can query the data easily, in a fully relational manner using familiar tools, with better security, performance, data protection and ease of use built in.

For these reasons, you can expect enterprises to turn to companies like Snowflake to help propel insights from your data in new directions and at new speeds, regardless of the size of your business or the industry in which you compete.


The Dream Data Warehouse Development Environment

Earlier this month, Snowflake’s Customer Enablement Team was assigned an email from one of our customers. The customer stated that he was not happy about the idea of cloning full copies of production databases for development purposes: “Do we really want to develop and maintain a system to copy a fraction of the production DB to dev?” His concern was that simply copying the entire production database would give the dev team access to too much data. Being a veteran of Snowflake, I initially dismissed his concern because of Snowflake’s zero-copy clone capability, as outlined in this article. From my perspective, a zero-copy clone would not incur any additional cost for development purposes, so why not give the dev team all of the data?

The answer, of course, as the customer pointed out, has to do with making the development effort more efficient. The assumption that zero-copy clone equates to zero-cost development is incorrect. There’s the cost of querying the data (which requires virtual warehouse credits) and the cost behind each development hour. After all, longer query times lead to longer development iterations and longer testing cycles. To create a “blissful” development environment in Snowflake, we need a more refined approach to building the development data set.

The approach outlined by the customer was rooted in complementing Snowflake’s zero-copy clone with the additional technique of Block Sampling. In other words, they proposed creating a view containing a sample set from the original table. This approach enables an administrator to quickly set up a dev environment with minimal data. Controlling the amount of sample data is a good thing for many development situations because developers seldom require access to the full dataset.
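A minimal sketch of that combination, using hypothetical database, schema and view names, might look like this (the SAMPLE syntax itself is covered below):

-- Zero-copy clone: instant, and no additional storage until the data diverges
create database SALES_DEV clone SALES;

-- A sampled view inside the clone, scanning roughly 10% of the table's blocks
create view SALES_DEV.PUBLIC.V_LINEITEM_SAMPLE_10 as
    select *
    from SALES_DEV.PUBLIC.LINEITEM
    sample block (10);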

Ok, let’s take a look at SAMPLE / TABLESAMPLE and see how we can do this. The syntax is quite simple:

 
SELECT ...
FROM ...
  { SAMPLE | TABLESAMPLE } [ samplingMethod ] ( <probability> ) [ { REPEATABLE | SEED } ( <seed> ) ]
[ ... ]

-- Where:
--   samplingMethod ::= { { BERNOULLI | ROW } | { SYSTEM | BLOCK } }


Note that there are some interchangeable terms we should pay attention to when writing the query. These terms are synonyms; each pair behaves identically and differs only in name:

  • SAMPLE | TABLESAMPLE
  • BERNOULLI | ROW
  • SYSTEM | BLOCK
  • REPEATABLE | SEED

The two main methods of sampling are ROW (or BERNOULLI) and BLOCK (or SYSTEM) sampling. Let’s take a closer look at each one. 

Row Sampling

This approach uses the Bernoulli principle to select data by applying a probability of p/100 to each row. In the Snowflake documentation, we mention that this is similar to “flipping a weighted coin” for each row. The number of sampled (selected) rows should be approximately (p/100) * n, where n is the total number of rows in the table and p is the sample probability value set by the user.

This method implies that we iterate through every row and compute a probability for each one to decide whether it becomes a sample candidate. That per-row iteration impacts query performance. The advantage is that we end up with a statistically representative sample of our data. For creating a dev environment, using the Bernoulli sampling method in a view does not improve performance, because each query against the view must still scan the full underlying table. However, you can create a new table based on this sampling method and then scan less data. For example:

create table dev_Sales_LineItem_Bernoulli_20 as
    select *
    from SALES.PUBLIC.LINEITEM
    sample bernoulli (20);

This statement creates a development table from the LINEITEM table in our SALES database with approximately 20% of the rows from the original table.
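If the development table needs to be reproducible across rebuilds, the REPEATABLE | SEED synonym pair noted above can be added. This is a minimal sketch (note that, per the SAMPLE documentation, sampling with a seed is not supported on views or subqueries; the seed value here is arbitrary):

create or replace table dev_Sales_LineItem_Bernoulli_20 as
    select *
    from SALES.PUBLIC.LINEITEM
    sample bernoulli (20) repeatable (42);

As long as the underlying table does not change, rebuilding with the same seed and probability should return the same set of rows.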

I won’t focus too much on this method in this post. Feel free to experiment on your own to see whether this sampling method suits your needs. Instead, I will talk more about Block Sampling.

Block Sampling

In this approach, we apply a probability of p/100 to each block of rows. For those who are familiar with Snowflake’s micro-partitions (details below), block sampling chooses individual partitions based on a specific probability. If you simply want quick access to production data and to run queries against a small percentage of the rows in a table, Block Sampling is a good way to go.

Let’s look at a specific example using the Block Sampling method on the LINEITEM table in the SALES database.

In this example, I have a developer who is working on an aggregation query against the LINEITEM table. He created a simple query like this:

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_orderdate <= dateadd(day, -90, to_date('1998-12-01'))
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

Result:

 

This query reports the amount of business that was billed, shipped and returned. When the developer issues this query against the full table, we see the following outcome in the query plan. (Note: the result below was obtained on an XS warehouse.)


The query took about 19 seconds on an XS warehouse, which is not bad by any means. However, we can see that it still performed a full table scan and a substantial amount of aggregation.

Now, let’s see how we can improve the experience for the developer by creating a sample set of the LINEITEM table. We’ll create a special view (against our cloned dev database) for this developer and call it V_SALES_LINEITEM_SAMPLE_10. In this view, we’ll use the Block Sampling technique so that queries against it scan only about 10% of the table’s micro-partitions.

First create the view:

create view v_Sales_LineItem_Sample_10 as
    select *
    from SALES.PUBLIC.LINEITEM
    sample block (10);

Next, let’s update the query to use the view instead.

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    v_Sales_LineItem_Sample_10
where
    l_orderdate <= dateadd(day, -90, to_date('1998-12-01'))
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

Result:

 

Let’s take a look at the query profile for this query:

 

This query ran in about 2.5 seconds on an XS warehouse, and we scanned only about 10% of the table’s partitions. This result is much better! Snowflake samples roughly 10% of the micro-partitions and then applies the WHERE filter to the sampled data. The result should still be accurate enough to let the developer know the query is working as expected.

In conclusion, leveraging Snowflake’s SAMPLE clause considerably reduces the size of the test dataset for a developer without losing data integrity. Even if a specific id or timestamp biases the sampled data, chances are the developer will accept the nuance and continue with development instead of spending additional compute credits and hours waiting for results to return. An alternative is to use LIMIT to reduce the number of rows returned, but this involves modifying the original query or creating a view that still needs to be maintained. Using the SAMPLE clause, you can reliably subset a large table with an acceptable result mix and good performance. I hope you find this helpful. Feel free to leave comments in our community and let me know your thoughts.

 

Data is Only Transformative with Transformative Technology

At the recent AWS re:Invent show in Las Vegas, The Cube host, Lisa Martin, had the chance to sit down with Bob Muglia, CEO and President of Snowflake. Bob shared his thoughts on Snowflake’s latest addition to its cloud-built data warehouse, Snowpipe, while looking back at Snowflake’s origins and ahead to its future in order to enable the data-driven enterprise.

What is Snowpipe, and how do customers get started with it?

Muglia: Snowpipe is a way of ingesting data into Snowflake in a streaming, continuous way. You simply drop new data that’s coming into S3 and we ingest it for you automatically. Snowpipe makes it simple to bring the data into your data warehouse on a continuous basis, ensuring that you’re always up-to-date and that your analysts are getting the latest insights and the latest data.
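For readers wondering what “dropping new data into S3” translates to on the Snowflake side, the continuous-ingestion pattern is typically a stage plus a pipe with auto-ingest enabled. This is a minimal sketch with hypothetical table, stage and pipe names (credentials, storage integration and the S3 event notification setup are omitted):

-- Landing table for the raw, semi-structured events
create table raw_events (v variant);

-- External stage pointing at the S3 location where new files arrive
create stage raw_events_stage
    url = 's3://my-bucket/events/'
    file_format = (type = 'json');

-- Pipe that automatically copies newly arriving files into the table
create pipe raw_events_pipe auto_ingest = true as
    copy into raw_events
    from @raw_events_stage;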

In the five years since you launched, how has the opportunity around cloud data warehousing changed? How has Snowflake evolved to become a leader in this space?

Muglia: If you go back five years, this was a timeframe where NoSQL was all the rage. Everybody was talking about how SQL was passé and something you’re not going to see in the future. Our founders had a different view. They had been working on true relational databases for almost 20 years, and they recognized the power of SQL and relational database technology. But they also saw that customers were experiencing significant limitations with existing technology. They saw in the cloud, and in what Amazon had done, the ability to build an all new database that takes advantage of the full elasticity and power of the cloud to deliver whatever analytics the business requires. However much data you want, however many queries you want to run simultaneously, Snowflake takes what you love about a relational database and allows you to operate in a very different way. Our founders had that vision five years ago and successfully executed on it. The product has worked beyond the dreams of our customers, and that response from our customers is what we get so excited about.

How did you identify what data should even be streamed to Snowpipe?

Muglia: As an example, in entertainment we’re experiencing a data explosion. You have streaming video data, subscription data, billing data, social media data and on and on. None of this is arriving in any sort of regular format. It’s coming as semi-structured data, like JSON or XML. Up until Snowflake came onto the scene with a truly cloud-based solution for data warehousing, everyone was struggling to wrangle all these data sets. Snowpipe lets you bring in multiple data sets, merge them in real-time and get the analytics back to your business in an agile way that’s never been seen before.

How does your partnership with AWS extend Snowflake’s capabilities?

Muglia: People don’t want their data scattered all over the place. With the cloud, with what Amazon’s done and with a product like Snowflake, you can bring all of your data together. That can change the culture of a company and the way people work. All of a sudden, data is not power. Data is available to everyone, and it’s democratized so every person can work with that data and help to bring the business forward. It can really change the dynamics around the way people work.

Tell us a little bit about Snowflake’s collaboration with its customers. How are they helping to influence your future?

Muglia: As a company, we run Snowflake on Snowflake. All of our data is in Snowflake, all of our sales data, our financial data, our marketing data, our product support data and our engineering data. Every time a user runs a query, that query is logged in Snowflake and the intrinsics about it are logged. When you have a tool with the power of Snowflake, you can effectively answer any business question in just a matter of minutes. And that’s transformative to the way people work. And to me, that’s what it means to build a data-driven culture: The answers to business questions are inside what customers are doing and are encapsulated in the data.

Try Snowflake for free. Sign up and receive $400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Deliveroo Delivers with Real-time Data

In a field of struggling food delivery startups, one notable success story has emerged from the fray. Termed “the European unicorn” by TechCrunch, Deliveroo is a British startup that offers fast and reliable food delivery service from a premium network of restaurants.

Deliveroo recently raised a $385 million funding round, boasts an estimated $2 billion valuation and is credited with transforming the way people think about food delivery. What is this unicorn doing differently? How has it found success where so many others have failed?

“Data is baked into every aspect of the organization,” said Henry Crawford, Deliveroo’s head of business intelligence. “Having instant access to data reveals which geographic areas are experiencing a shortage of restaurants and a shortage of particular cuisines, so we can create these hubs right at the consumer’s doorstep.”

Using data-driven insights, Deliveroo analyzes customer behavior, spots market trends and responds with swift decisions and rapid execution. Snowflake makes all of this possible.

“With data coming from a variety of sources, including web traffic, transactions and customer behavior, having a data warehouse built for the cloud provides one repository for a single source of truth,” Henry explains. “The shift to Snowflake’s cloud data warehouse has enabled us to make good on our promise that got Deliveroo started: To connect consumers with great food from great restaurants, wherever you are, and whatever it takes.”

Snowflake also accommodated Deliveroo’s 650% growth in 2016. Such rapid momentum prompted Deliveroo to expand its business intelligence team from two employees to 14. The additional team members needed more access to the same data, but without impacting performance.

Since Snowflake is built for the cloud, an unlimited number of users can access all of an organization’s data from a single repository, which is critical to Deliveroo’s success. There’s no replicating data, shifting queries and other workloads to non-business hours, or queueing users to preserve performance. Instead, Snowflake’s true cloud elasticity means Deliveroo can automatically scale up, down and out (concurrency) to load and analyze data without disruption.
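As a rough illustration of that elasticity, scaling out for concurrency is a property of the virtual warehouse configuration rather than something users manage by hand. This is a hedged sketch with a hypothetical warehouse name and sizing (multi-cluster scaling requires the appropriate Snowflake edition):

-- A warehouse that suspends when idle, resumes on demand and
-- adds clusters automatically as concurrent load grows
create warehouse bi_wh
    warehouse_size = 'MEDIUM'
    min_cluster_count = 1
    max_cluster_count = 4
    auto_suspend = 300
    auto_resume = true;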

“None of these future plans would be possible without real-time, concurrent access to massive volumes of data,” Henry said.

What’s next for Deliveroo? Using real-time logistics algorithms to increase the number and the speed of deliveries. Deliveroo’s expansion plans also include an “Editions” program—delivery-only kitchens so partner restaurants can expand their footprint without opening brick-and-mortar locations.

Learn more about how Snowflake can accelerate your data storage and analytics initiatives.

Snowflake’s Seattle Office Welcomes New Engineering Talent

In early October, Snowflake announced the expansion of its Seattle office and welcomed two new engineers, Polita and Murali, to the rapidly growing team. We sat down with Polita and Murali to get some insights into their individual career paths and find out what ultimately drew them to Snowflake.

Snowflake Senior Engineer, Polita, wanted to be part of Snowflake’s customer-first culture.

In February 2017, Polita was happily working at a major software company in the Seattle area. It had been her home for more than 15 years, and she wasn’t looking to go anywhere else.

But then she came upon an exciting opportunity: come work at a startup that is fundamentally changing the way organizations work with data.

“I knew that Snowflake is a company with a great future,” Polita said. “The leadership and employees are so passionate about this important product they’re building. This was such a compelling opportunity to join a company that cares so much about the customer that I couldn’t turn it down.”

Today, Polita is a senior engineer working on security and identity in Snowflake’s new Seattle office. She says the way the company responds to customer needs is unique in the industry.

“I’m working on a feature specifically because a customer asked for it, and I know why they need it,” she says. “It’s not always the case that you can be this connected to the customer and their ‘why,’ but at Snowflake, I am. And that shows in the product that we ultimately build.”

 

Snowflake Senior Engineer, Murali, returns to databases to find a whole new world.

After beginning his career in databases, Murali explored other areas in tech. He worked at several high-profile companies before deciding that he wanted to get back to his database roots. After all, it’s an exciting time to be working with data.

“Organizations are producing so much more data than ever before, and that data is key to their success. But how can they process all of it?” Murali said. “These realities have changed the way we think about databases, and I was really impressed with the way Snowflake is applying this thinking to the cloud.”

Murali had worked with a few Snowflake employees in previous jobs, and had a lot of respect for their smarts. After he dove into the product, he was equally impressed. He joined the company as a software engineer in September 2017 and went to work on improving data ingestion in Snowflake.

“Snowflake, because of the way it’s architected in the cloud from the ground up, is ideal for helping customers get insight into their data,” he said. “The product has been around for about three years, and it is phenomenal how extensible and reliable it is. It really does make data easier to organize and manage for all kinds of customers.”

As Snowflake continues to grow in Seattle, we are looking for brilliant and dedicated engineers and product managers to join our team. Check the Snowflake Careers page for current openings.

Rethink What You Know About Creating a Data Lake for JSON

Over the last 10 years, the notion has been that to quickly and cost-effectively gain insights from a variety of data sources, you need a Hadoop platform. Sources of data could be weblogs, clickstreams, events, IoT and other machine-born JSON or semi-structured data. The proposition with Hadoop-based data processing is having a single repository (a data lake) with the flexibility, capacity and performance to store and analyze an array of data types.

It shouldn’t be complicated

In reality, analyzing data with a Hadoop-based platform is not simple. Hadoop platforms start you with an HDFS file system, or equivalent. You then must piece together at least a half-dozen software packages just to provide basic enterprise-level functionality, such as provisioning, security, system management, data protection, database management and the necessary interface to explore and query data.

Despite the efforts of open-source communities to provide tools that bring Hadoop platforms up to an enterprise-class level of operation, there is a constant need for highly skilled resources to keep Hadoop up and running while enabling users to do more than just explore data. This all adds up to unnecessary complexity.

A much simpler proposition

Snowflake, which is built for the cloud and delivered as a service, provides a different option for handling JSON and semi-structured data. Just point your data pipelines to Snowflake, land the data in our elastic storage repository, and you have instant access to a bottomless data lake. You also have access to a full-fledged data warehouse. With Snowflake, you can easily load JSON and query the data with robust, relational SQL. You can mix JSON with traditional structured data and data from other sources, all within the same database. You can also support endless concurrent analytic workloads and work groups against the JSON data. Whether it is one workload or 1,000 workloads, Snowflake can handle it all with ease.
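To make that pattern concrete, here is a minimal sketch of loading JSON into a VARIANT column and querying it with ordinary SQL. The table, stage and field names are hypothetical:

-- Semi-structured data lands in a single VARIANT column
create table weblogs (raw variant);

-- Load JSON files from a stage (stage definition omitted)
copy into weblogs
from @weblog_stage
file_format = (type = 'json');

-- Query JSON attributes with relational SQL, casting as needed
select raw:page::string as page,
       raw:visitor.country::string as country,
       count(*) as visits
from weblogs
group by 1, 2
order by visits desc;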

As a combined data lake and data warehouse platform, Snowflake allows you to do much more. Read more about it in our new eBook, Beyond Hadoop: Modern Cloud Data Warehousing.

Try Snowflake for free. Sign up and receive $400 worth of free usage credits. You can create a sandbox or launch a production implementation from the same Snowflake account.

Rethink what you’ve been told

Hadoop is not a prerequisite for gaining insights from JSON or other machine data.

When you need to store, warehouse and analyze JSON and other machine data, rethink what you’ve been told. Snowflake allows you to easily develop insights or uncover relationships that can drive the business forward. You can support all of your structured and semi-structured data warehousing and analytic workload needs with a single tool that is built for the cloud and is ACID-compliant. Unlike Hadoop platforms, which often require special skills to operate, Snowflake is a fully relational SQL environment that uses the familiar semantics and commands known to millions of SQL users and programmers, and supported by thousands of SQL tools.

Be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb and @miclnixon1) for all the news and happenings here at Snowflake.


What makes Snowflake a data warehouse?

One of the most common questions I get when speaking to people about Snowflake is: “Why do you call it a data warehouse and not a database?” This is a very reasonable question given some of the characteristics of Snowflake.

At Snowflake, we say, in part, that we are a complete relational database management system (RDBMS) built for the cloud. We are ACID-compliant and we support standard SQL. Sounds like a database to me, too. Let’s take a closer look just to be sure.

What is a database?

A database is a collection of information organized to be easily accessed, managed and updated. While there are many types of databases available today, the most common is an RDBMS. But when most folks say “database”, they usually mean a traditional RDBMS that handles Online Transaction Processing (OLTP).

So, what are some of the defining characteristics of an OLTP database?

  • Designed for rapid storage and retrieval of small sets of current data records in support of transactions and interactions within an enterprise.
  • Data is organized in tables and columns, allowing users access via structured query language (SQL).
  • Handles quick, real-time activity such as entering a customer name, recording a sale and recording all accounting activity of that sale.
  • Works well for basic operational reporting of a limited number of records. Analytic reporting is relegated to simple, static reports often driven by IT.

What is a data warehouse?

Some of the defining characteristics of a data warehouse are:

  • A database designed to store and process large volumes of current and historical data collected from multiple sources inside and outside the enterprise for deep analysis.
  • Organizes data into tables and columns, and allows users access via SQL.
  • Optimized for loading, integrating and analyzing very large amounts of data.
  • Designed to support descriptive, diagnostic, predictive and prescriptive analytic workloads.

Snowflake definitely includes the overlapping characteristics of both a database and a data warehouse: ACID compliance, support for standard SQL, and so on. But Snowflake also embodies all of the defining characteristics of a data warehouse.

One of the key differentiators of Snowflake from other solutions is that it is specifically designed for data warehousing and high-speed analytic processing. Rather than a generalized SQL database that has been “tuned” or adapted to handle these types of workloads, Snowflake was built from the ground up for the cloud to optimize loading, processing and query performance for very large volumes of data. Therefore, hands down, Snowflake is a data warehouse.

So, why do we still need a specialized data warehouse engine?

As OLTP databases have been able to scale higher and innovations like in-memory databases have emerged, some organizations have questioned whether they still need a separate technology or specialized system for reporting and analytics. The answer, again, requires us to look at the basics: What benefits emerge from storing and analyzing data in a separate system?

  1. It eases the burden of reporting from transactional systems by removing the contention for limited and expensive resources.
  2. It produces more business-friendly data results by allowing the data to be restructured to a more suitable format.
  3. It provides access to a wider array of reports more quickly because all the resources in the data warehouse are dedicated to reporting and analysis.
  4. It integrates valuable data from across the enterprise for richer insight, something that can’t (and shouldn’t) be done in an OLTP system.

For more information on how you can up your data warehousing game with a modern, built-for-the-cloud approach, check out some of our free resources such as our ebook The Data Warehouse: The Engine That Drives Analytics. We would love to help you on your journey to the cloud so keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) to keep up with all the news and happenings here at Snowflake.