Architecture

The Snowflake Architectural Difference

Snowflake is a fully relational SQL data warehouse. It is all new and built for the cloud on Amazon Web Services (AWS). Snowflake provides complete relational database support for both structured and semi-structured data (JSON, AVRO, XML) and implements comprehensive support for the SQL language. It requires no administration and is delivered as a turn-key cloud service. Snowflake provides broad support for ETL and BI tools and enables developers to build modern data applications. It is secure by design.

Beyond these attributes, what makes Snowflake different from a traditional legacy data warehouse, Hadoop system, or other cloud database?

In one word, architecture.

Snowflake has introduced a patent-pending, multi-cluster, shared data architecture that was born and built for the cloud to revolutionize data analysis.

Snowflake Architecture - June 2015

Other data systems are built using a shared-disk or shared-nothing architecture. These systems tightly couple data processing to a single cluster. The Snowflake multi-cluster, shared data architecture separates data analysis into three distinct layers:

  • Storage
  • Compute
  • Service

Storage

Built on cloud-native S3 storage, Snowflake utilizes micro-partitions to securely and efficiently store customer data. When loaded into Snowflake, data is automatically split into modest-sized micro-partitions and metadata is extracted to enable efficient query processing. The micro-partitions are then columnar compressed and fully encrypted using a secure key hierarchy.
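The loading step described above can be sketched as a toy model. This is only an illustration under stated assumptions, not Snowflake's implementation: the names `MicroPartition`, `load`, and `scan_equal` are hypothetical, the metadata is assumed to be a per-column minimum and maximum, and compression and encryption are left out.

```python
from dataclasses import dataclass


@dataclass
class MicroPartition:
    rows: list      # rows stored in this partition (each row is a dict)
    min_max: dict   # assumed metadata: column name -> (min value, max value)


def load(rows, partition_size):
    """Split rows into fixed-size partitions and record per-column min/max."""
    partitions = []
    for start in range(0, len(rows), partition_size):
        chunk = rows[start:start + partition_size]
        stats = {
            col: (min(r[col] for r in chunk), max(r[col] for r in chunk))
            for col in chunk[0]
        }
        partitions.append(MicroPartition(chunk, stats))
    return partitions


def scan_equal(partitions, column, value):
    """Return matching rows and the number of partitions actually read."""
    scanned = 0
    hits = []
    for p in partitions:
        low, high = p.min_max[column]
        if value < low or value > high:
            continue  # skipped using metadata alone; row data is never touched
        scanned += 1
        hits.extend(r for r in p.rows if r[column] == value)
    return hits, scanned


rows = [{"order_id": i, "day": i // 10} for i in range(100)]
partitions = load(rows, partition_size=10)
hits, scanned = scan_equal(partitions, "day", 3)
```

In this toy data set each partition happens to hold a single `day` value, so a filter on `day` reads one partition out of ten. Real data is rarely this cleanly ordered, and the benefit depends on how values are distributed across partitions.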

Compute

All data processing within Snowflake is performed by virtual warehouses. A virtual warehouse is one or more clusters of compute nodes. When performing a query, the virtual warehouse retrieves the minimum data required from the storage micro-partitions to satisfy the query. As data is retrieved, it is cached locally to improve the performance of future queries. The Compute layer is designed to process enormous quantities of data with maximum speed and efficiency.
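The local caching behavior can be pictured with a small sketch. It assumes a plain dictionary as the cache and another dictionary standing in for remote storage; the `Warehouse` class and its `remote_reads` counter are hypothetical names used only for illustration, and eviction is ignored.

```python
class Warehouse:
    """Toy compute cluster that caches partitions it has already fetched."""

    def __init__(self, storage):
        self.storage = storage   # stand-in for remote storage: id -> data
        self.cache = {}          # local cache: id -> data
        self.remote_reads = 0    # how many times remote storage was hit

    def read(self, partition_id):
        if partition_id not in self.cache:
            # Cache miss: fetch from remote storage and keep a local copy.
            self.remote_reads += 1
            self.cache[partition_id] = self.storage[partition_id]
        return self.cache[partition_id]


storage = {"p1": [1, 2], "p2": [3]}
wh = Warehouse(storage)
first = sum(wh.read("p1"))    # fetched from storage
second = sum(wh.read("p1"))   # served from the local cache
```

The second read returns the same result without touching remote storage, which is the effect the paragraph above describes for repeated queries.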

Unique to Snowflake, multiple virtual warehouses can operate on the same data at the same time while fully enforcing global, system-wide transactional integrity. Read operations (select) always see a consistent view of the data, and write operations never block readers. Transactional integrity across virtual warehouses is achieved by maintaining all transaction state within the Service layer.

The ability to simultaneously operate on the same data across multiple virtual warehouses enables Snowflake to achieve effectively unlimited scale and concurrency.
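One way to picture consistent reads alongside concurrent writes is a versioned table whose committed versions are never modified. The sketch below is a simplification under that assumption; `TableState`, `snapshot`, and `commit` are hypothetical names, and the brief lock only protects the version list in this toy, it does not model how Snowflake coordinates transactions.

```python
import threading


class TableState:
    """Toy stand-in for transaction state kept in one central place."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = [()]   # each version is an immutable tuple of rows

    def snapshot(self):
        # A reader takes a reference to the latest committed version.
        with self._lock:
            return self._versions[-1]

    def commit(self, new_rows):
        # A writer publishes a new version; older versions are left untouched.
        with self._lock:
            self._versions.append(self._versions[-1] + tuple(new_rows))
            return len(self._versions) - 1


state = TableState()
state.commit(["a", "b"])
reader_view = state.snapshot()   # a query starts in one virtual warehouse
state.commit(["c"])              # a load commits from another warehouse
new_view = state.snapshot()      # a query started afterwards sees the commit
```

The running reader keeps its original view because the version it holds is immutable, while a query that starts after the commit sees the new row.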

Service

In addition to fully separating Storage and Compute, Snowflake utilizes a Service layer to authenticate user sessions, provide management and security functions, perform query compilation and optimization, and coordinate all transactions. The Service layer consists of a set of stateless nodes running across multiple AWS availability zones and utilizes a highly available, distributed metadata store for global state management.

If the Compute layer is the brawn of Snowflake, then the Service layer is the brain. It provides all security and encryption key management and enables all DDL functions. Queries are compiled within the Service layer, and metadata is used to determine the micro-partitions and columns that need to be scanned. All operational state is maintained within the Service layer, which performs transaction coordination across all virtual warehouses.

Snowflake Architecture Benefits

Transactional SQL Data Warehouse

  • Separation of Services from Storage and Compute allows multiple virtual warehouses to simultaneously operate on the same data. Concurrency is unlimited and, with a multi-cluster warehouse, can be automatically scaled.
  • Activity in one virtual warehouse has zero impact on all other virtual warehouses. For example, data loading in a virtual warehouse has no performance impact on queries running in other virtual warehouses, even when they are accessing the same data.
  • Full ACID transactional integrity is maintained across separate virtual warehouses. Queries always see a consistent view of data, and transaction commits are immediately visible to new queries running in all virtual warehouses.
  • Zero-copy clones enable databases or tables to be duplicated in a matter of seconds. A clone is a full replica of the original object with an independent lifecycle.
  • Time travel enables any select statement or zero-copy clone to view the database in a consistent, “as of” state from the past over a user-determined retention period.
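Zero-copy clones and time travel from the list above can be illustrated with a toy model in which a table is only a history of references to immutable partitions. The `Table` class and the integer version numbers are assumptions made for this sketch; retention periods and timestamps are not modeled.

```python
class Table:
    """Toy table: a history of tuples that reference immutable partitions."""

    def __init__(self, history=None):
        # history[i] is the tuple of partitions visible at version i
        self.history = list(history) if history else [()]

    def append(self, partition):
        # A write creates a new version; earlier versions stay readable.
        self.history.append(self.history[-1] + (partition,))

    def as_of(self, version):
        # "Time travel": read the table as it was at an earlier version.
        return self.history[version]

    def clone(self, version=-1):
        # Copy only the references to partitions, never the partitions.
        return Table([self.history[version]])


p1, p2, p3 = ["row1", "row2"], ["row3"], ["row4"]
orders = Table()
orders.append(p1)                  # version 1
orders.append(p2)                  # version 2
dev = orders.clone()               # clone of the current state
dev.append(p3)                     # the clone now evolves independently
past = orders.clone(version=1)     # clone "as of" an earlier version
```

Because the clone shares partition objects with its source, creating it costs no more than copying a short list of references, and later writes to either table do not affect the other.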

Performance and Throughput

  • Snowflake outperforms all other solutions. Compute resources scale linearly and efficient query optimization delivers answers in a fraction of the time of legacy systems.
  • Performance issues can be addressed in seconds. Virtual warehouse size is specified based on the service level and performance required – and can be changed at any time, even while a warehouse is running.
  • Compute charges only accrue when work is performed. Virtual warehouses can be paused at night and during other downtime, releasing all resources and eliminating compute charges.
  • Multi-cluster warehouses deliver a consistent SLA to an unlimited number of concurrent users. As the concurrent workload increases, Snowflake automatically adds clusters to the virtual warehouse and distributes queries across those clusters. When the workload decreases, the clusters are paused. Charges only accrue for active clusters, so you pay for performance and throughput only when you need them.
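The effect of paying only for active clusters can be shown with simple arithmetic. The sketch below assumes a hypothetical scaling rule (one cluster per 8 concurrent queries, capped at 3 clusters) and an invented hourly load; neither the rule nor the numbers come from Snowflake.

```python
import math


def clusters_needed(concurrent_queries, per_cluster_capacity, max_clusters):
    """Clusters active for a given load under the assumed scaling rule."""
    if concurrent_queries == 0:
        return 0  # warehouse is paused, so nothing is running
    wanted = math.ceil(concurrent_queries / per_cluster_capacity)
    return min(max_clusters, wanted)


# Hypothetical number of concurrent queries in each of seven hours.
hourly_load = [0, 0, 4, 12, 25, 9, 0]

active = [clusters_needed(q, per_cluster_capacity=8, max_clusters=3)
          for q in hourly_load]
cluster_hours = sum(active)              # hours that would accrue charges
always_on_hours = 3 * len(hourly_load)   # three clusters running all the time
```

Under these assumptions the warehouse accrues 8 cluster-hours instead of the 21 it would use if three clusters ran for the whole period.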

Storage and Support for All Data

  • Storage is inexpensive and scales indefinitely. Snowflake is the optimal platform for a data lake, delivering cost-effective and highly performant support for multi-petabyte databases. All storage costs are based on actual usage for data after compression and are measured in TB stored per month.
  • Both structured and machine-generated, semi-structured data (e.g., JSON, AVRO, XML) can be queried using relational SQL operators with similar performance characteristics. Loading semi-structured data is painless; the schema is dynamic and is automatically discovered during load. This support for dynamic schemas enables efficient query execution using natural extensions to SQL.
  • With Snowflake, there is no need to implement separate systems to process structured and semi-structured data. The complex Hadoop + data warehouse pipeline can be eliminated. Snowflake can perform both roles much more efficiently with better business results and lower cost.
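Schema discovery during load, mentioned in the list above, can be sketched as a walk over each JSON record that collects every path and the value types seen there. The function `discover_schema` and the dotted-path notation are assumptions made for this illustration and say nothing about how Snowflake stores or discovers schemas internally.

```python
import json


def discover_schema(records):
    """Map each dotted path to the set of Python type names seen at it."""
    schema = {}

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            # Leaf value (lists are treated as leaves in this sketch).
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema


lines = [
    '{"user": {"id": 1, "name": "ann"}, "event": "click"}',
    '{"user": {"id": 2}, "event": "view", "duration_ms": 1200}',
]
schema = discover_schema(json.loads(line) for line in lines)
```

The second record introduces `duration_ms` and omits `user.name`, and the discovered schema simply grows to cover both shapes without any declaration up front.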

Availability and Security

  • High availability is achieved using a scale-out architecture that is fully distributed across multiple Amazon availability zones. Snowflake can continue operations and withstand the loss of an Amazon availability zone. The system is designed to tolerate failures with minimal impact to the customer.
  • Snowflake is secure by design. All data is encrypted over the Internet and on disk. Two-factor and federation authentication with single sign-on is supported. Authorization is role-based. Policies can be applied to limit access to predefined client addresses.
  • Snowflake is SOC 2 Type 2 certified, and support for PHI data for HIPAA customers is available with a Business Associate Agreement. Additional levels of security, such as encryption across all network communications and virtual private or dedicated isolation, are also available.

(Snowflake) just works, getting us answers an order of magnitude faster, without manual tuning and management. As a result we can do 100 times more queries per day, helping us give our clients richer analysis.

BALAJI RAO

VP, TECHNOLOGY, ACCORDANT MEDIA