Mar 19, 2018
Machine Learning (ML) and predictive analytics are quickly becoming irreplaceable tools for small startups and large enterprises. The questions that ML can answer are boundless. For example, you might want to ask, “What job would appeal to someone based on their interests or the interests of jobseekers like them?” Or, “Is this attempt to access the network an indication of an intruder?” Or, “What type of credit card usage indicates fraud?”
Setting up a machine learning environment, particularly for on-premises infrastructure, has its challenges. An infrastructure team must request physical and/or virtual machines and then build and integrate those resources. This approach is both time-consuming and error-prone because of the number of manual steps involved. It may work in a small environment, but the task becomes exponentially more complicated and impractical at scale.
There are many different ML systems to choose from, including TensorFlow, XGBoost, Spark ML and MXNet, to name a few. They all come with their own installation guides, system requirements and dependencies. What’s more, implementation is just the first step. The next challenge is figuring out how to make the output from the machine learning step (e.g. a model) available for consumption. Then, all of the components for building a model within the machine learning tier and the access to the model in the API tier need to scale to provide predictions in real-time. Last but not least, the team needs to figure out where to store all the data needed to build the model.
Managing this whole process from end-to-end becomes significantly easier when using cloud-based technologies. The ability to provision infrastructure on demand (IaaS) solves the problem of manually requesting virtual machines. It also provides immediate access to compute resources whenever they are needed. But that still leaves the administrative overhead of managing the ML software and the platform to store and manage the data.
At last year’s AWS re:Invent conference, AWS announced SageMaker, a “fully managed end-to-end machine learning service that enables data scientists, developers and machine learning experts to quickly build, train and host machine learning models at scale.”
SageMaker notebooks can access data from many different sources through their underlying kernels (Python, PySpark, Spark and R), including data stored in Snowflake. Storing data in Snowflake also has significant advantages.
Single source of truth
If data is stored in multiple locations, those locations will inevitably get out of sync. Even if data is supposed to be immutable, often one location is modified to fix a problem for one system while the others are not. In contrast, if data is stored in one central, enterprise-grade, scalable repository, it serves as a “single source of truth” because there is no second copy to keep in sync. Different tools are not required for structured and semi-structured data, and data can be modified transactionally, which immediately reduces the risk of problems.
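For example, a one-off correction can be applied once, transactionally, in the central repository and every downstream consumer sees the fix (the table and column names here are hypothetical):

```sql
-- Hypothetical table; the fix is applied atomically for every consumer.
BEGIN;
UPDATE customers
   SET email = LOWER(email)
 WHERE email <> LOWER(email);
COMMIT;
```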
Shorten the data preparation cycle
According to a study published in Forbes, data preparation accounts for about 80% of the work performed by data scientists. Shortening the data preparation cycle therefore has a major impact on the overall efficiency of data scientists.
Snowflake is uniquely positioned to shorten the data preparation cycle due to its support for both structured and semi-structured data through a single language: SQL. This means that semi-structured data and structured data can be seamlessly parsed, combined, joined and modified through SQL statements in set-based operations, giving data scientists the power of a full SQL engine for rapid data cleansing and preparation.
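As a sketch of what this looks like in practice (the table and attribute names below are hypothetical), a column of raw JSON can be queried and joined directly against an ordinary relational table in a single statement:

```sql
-- events.payload is a VARIANT column holding raw JSON;
-- users is an ordinary structured table.
SELECT u.user_id,
       u.signup_date,
       e.payload:page::STRING  AS page,
       e.payload:duration::INT AS duration_secs
FROM   events e
JOIN   users  u
  ON   u.user_id = e.payload:user_id::INT
WHERE  e.payload:event_type::STRING = 'page_view';
```

No upfront schema definition or separate parsing pipeline is needed for the JSON side; the path-and-cast syntax treats semi-structured attributes like regular columns.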
Scale as you go
Another problem that ML implementers frequently encounter is what we at Snowflake call “works on my machine” syndrome. Small datasets work easily on a local machine, but at production dataset sizes, reading all the data into a single machine doesn’t scale and may behave unexpectedly. Even if the job does finish, it can take hours just to load a terabyte-sized dataset. With Snowflake there is no infrastructure to provision, and Snowflake’s elasticity allows you to scale horizontally as well as vertically, all with the push of a button.
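Both directions of scaling are a single statement (the warehouse name is hypothetical, and multi-cluster scaling requires the appropriate Snowflake edition):

```sql
-- Scale up: more compute per cluster for a heavy preparation job.
ALTER WAREHOUSE ml_wh SET WAREHOUSE_SIZE = 'X-LARGE';

-- Scale out: let Snowflake add clusters under concurrent load.
ALTER WAREHOUSE ml_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4;
```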
Connecting SageMaker and Snowflake
SageMaker and Snowflake both utilize AWS’s infrastructure-as-a-service offerings, which enables us to build the infrastructure when we need it, where we need it (geographically) and at any scale required.
Since SageMaker and Snowflake make standing up the services simple, the question becomes how to connect the two. That’s exactly the subject of the following parts of this post: How do you get started? What additional configuration do you need in AWS for security and networking? How do you store credentials?
In part two of this four-part blog series, I’ll explain how to build a SageMaker ML environment in AWS from scratch. In the third post, I’ll put it all together and show you how to connect a Jupyter Notebook to Snowflake via the Snowflake Python connector. With the Python connector, you can import data from Snowflake into a Jupyter Notebook. Once connected, you can begin to explore data, run statistical analysis, visualize the data and call the SageMaker ML interfaces.
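As a preview, a minimal sketch of that connection might look like the following. It assumes the snowflake-connector-python package is installed; the account, credentials and table name are placeholders, not values from this post.

```python
# Sketch of pulling a sample of Snowflake data into a notebook.
# All connection values are placeholders.

def build_sample_query(table: str, limit: int = 1000) -> str:
    """Compose a simple exploratory query for a first look at a table."""
    return f"SELECT * FROM {table} LIMIT {limit}"

def fetch_sample(user: str, password: str, account: str, table: str):
    # Imported here so the pure helper above works without the driver installed.
    import snowflake.connector

    ctx = snowflake.connector.connect(user=user, password=password, account=account)
    try:
        cur = ctx.cursor()
        cur.execute(build_sample_query(table))
        return cur.fetchall()
    finally:
        ctx.close()
```

Part three walks through this in detail, including how to handle credentials more safely than inline strings.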
However, to perform analysis at scale, you really don’t want a single-server setup like Jupyter running a Python kernel. Jupyter running a PySpark kernel against a Spark cluster on EMR is a much better solution for that use case. So, in part four of this series, I’ll connect a Jupyter Notebook to a local Spark instance and to an EMR cluster using the Snowflake Spark connector.
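As a preview, the Spark side might be sketched as follows. It assumes the spark-snowflake connector JAR is available on the cluster’s classpath, and all connection values are placeholders.

```python
# Sketch of reading a Snowflake table into a distributed Spark DataFrame.
# "net.snowflake.spark.snowflake" is the Snowflake Spark connector's source name.

SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"

def snowflake_options(account: str, user: str, password: str,
                      database: str, schema: str, warehouse: str) -> dict:
    """Connection options the Snowflake Spark connector expects."""
    return {
        "sfURL": f"{account}.snowflakecomputing.com",
        "sfUser": user,
        "sfPassword": password,
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
    }

def read_table(spark, options: dict, table: str):
    """Load a Snowflake table (or subquery) as a Spark DataFrame."""
    return (spark.read
            .format(SNOWFLAKE_SOURCE)
            .options(**options)
            .option("dbtable", table)
            .load())
```

Because the read is distributed across the cluster, the “load a terabyte onto one machine” bottleneck described above disappears.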
The Snowflake difference
Snowflake is the only data warehouse built for the cloud. Snowflake delivers the performance, concurrency and simplicity needed to store and analyze all data available to an organization in one location. Snowflake’s technology combines the power of data warehousing, the flexibility of big data platforms, the elasticity of the cloud, and live data sharing at a fraction of the cost of traditional solutions. Snowflake: Your data, no limits.