Happy Tweet, Sad Tweet – Building a Naive Bayes Classifier in SQL

We love to talk about how Snowflake is built and what makes Snowflake stand out from the data warehousing crowd, but we also love to show how you can put Snowflake to use for interesting big data challenges.  In this blog post we show how to build a machine learning classifier on top of Snowflake using SQL and Snowflake’s JSON extensions.  The classifier will learn to separate happy emotions from sad emotions in short text snippets.  As a data source we use the publicly available Twitter data stream that can be downloaded as raw JSON.

Twitter is a never-ending source of news, links, comments, and personal chatter. Some of this communication is highly emotional: it’s happy, it’s sad, it’s neither, or both at the same time. By loading Twitter data into Snowflake, we can build a machine learning classifier that detects such emotions automatically. The machine learning classifier can then be used to detect emotions in any tweet-sized text snippet, such as “The sun is shining, the weather is sweet again” (happy), “I must’ve called a thousand times” (sad), and “Make America Great Again” (happy).

Or, in SQL:

select * from result order by id;
----+------------------------------------------------+-------+
 ID |                      TEXT                      | LABEL |
----+------------------------------------------------+-------+
 1  | The sun is shining, the weather is sweet again | happy |
 2  | I must've called a thousand times              | sad   |
 3  | Make America Great Again                       | happy |
----+------------------------------------------------+-------+

In this blog post we will build a Naive Bayes machine learning classifier using SQL. We do this in three steps:

  1. We create a training data set by labeling tweets as happy or sad using emojis.
  2. We compute a machine learning model using a Naive Bayes classifier.
  3. We validate our model using a test data set and detect emotions in other text snippets.

Let’s quickly dive into what Twitter data looks like, what it takes to build a machine learning model, and what a Naive Bayes classifier is anyway.

Twitter Data

Twitter allows you to download a one percent sample of all public tweets for free. The format of this data is JSON. Each tweet is a separate JSON object that looks like this:

{
  "created_at": "Tue Feb 02 21:09:01 +0000 2016",
  "entities": {
    "hashtags": [ ... ],
    "media": [],
    "symbols": [],
    "trends": [],
    "urls": [],
    "user_mentions": [ ... ]
  },
  "id": 694751783172739072,
  "lang": "en",
  "retweeted_status": { ... },
  "text": "We are delighted to inform you that your submission 900: 
           The Snowflake Elastic Data Warehouse has been accepted. 
           #SIGMOD @SnowflakeDB",
  "user": { "screen_name": "hemasail" ... }
  ...
}

This tweet was sent by user @hemasail—which is me, coincidentally—on February 2, 2016. Every JSON object representing a tweet has mandatory fields and optional fields. Mandatory fields include the tweet’s id, the user that sent the tweet, and the language the tweet was written in (which is apparently detected by Twitter’s language classifier). Optional fields include “retweeted_status,” containing information about the retweeted tweet (if any); and “entities,” containing information about hashtags, media, URLs, etc. that were found in the tweet. There is much more information in a tweet’s JSON that is not shown in the example above. It is always interesting to see the full amount of refined information from “140 characters or less.”

Twitter data formatted in JSON is a great fit for Snowflake. Snowflake supports JSON natively: loading, querying, and unloading of JSON data is supported without conversion. Snowflake also supports user-defined functions in JavaScript that can operate on JSON data directly.

Machine Learning

Machine learning is a method in computer science for automatically learning characteristics of existing data and applying the findings to new data. In this blog we use supervised machine learning, a branch of machine learning in which the existing data set must be labeled. That is, each item in the data set—each tweet, for example—is assigned a label: this one is a happy tweet, this one is a sad tweet. There is also unsupervised machine learning, where the data set does not need to be labeled and the algorithm tries to find labels itself. However, we will use a labeled data set.

There are many ways to generate labels. Usually, labels are generated by humans. In our case, that would mean that we sit down and label lots of tweets as happy or sad. This would take a lot of time. We could also outsource label generation to services such as Amazon Mechanical Turk, but this would take a lot of money. So we won’t do either. Instead, we will assign labels to tweets using emojis. If a tweet contains a happy emoji, we will assign a label “happy” to this tweet. If a tweet contains a sad emoji, we will assign a label “sad” to this tweet. If there are no emojis or both types of emojis in a tweet, we will assign labels “none” or “both,” respectively.

Given a labeled data set, we will split it into a training data set and a test data set. The training data set will be used to train the machine learning classifier. The test data set will be used to test how good our classifier is. We will split our data set into 80% training data and 20% test data.

Naive Bayes Classifier

A Naive Bayes classifier is a simple type of machine learning model based on probabilities. In our case, a Naive Bayes classifier uses word probabilities to classify a tweet as happy or sad. Simply put, a Naive Bayes classifier counts all words and computes probabilities of how often they appear in one category or another. For example, the word “great” appears more often in happy tweets than in sad tweets. The word “must’ve” appears more often in sad tweets than in happy tweets. By multiplying all word probabilities together (in practice, summing their logarithms), we get an overall probability that a text belongs to one category or another. For example, by combining all word probabilities in “Make America Great Again,” we see that it is more happy than sad. At least, our training data from Twitter tells us so.

A much better explanation of the Naive Bayes classifier can be found in the slides from Stanford University.
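To make the scoring concrete before we express it in SQL, here is a minimal sketch in plain JavaScript of the same log-space, smoothed Naive Bayes computation: a log prior per class plus a sum of smoothed log word likelihoods, taking the class with the highest score. The word and document counts here are made up purely for illustration, not taken from the Twitter data set:

```javascript
// Illustrative Naive Bayes scoring sketch (invented counts, not real data).
const counts = {
  happy: { great: 50, sun: 30, called: 5 },   // word counts per class
  sad:   { great: 10, sun: 5,  called: 40 },
};
const docs = { happy: 100, sad: 100 };        // documents per class
const vocab = 3;                              // |Vocabulary|
const totalDocs = docs.happy + docs.sad;

function wordsPerClass(label) {
  return Object.values(counts[label]).reduce((a, b) => a + b, 0);
}

function score(words, label) {
  // log10 P(class) + sum of log10 P(word | class), with add-0.1 smoothing
  let p = Math.log10(docs[label] / totalDocs);
  for (const w of words) {
    const nj = counts[label][w] || 0;           // unseen words get count 0
    p += Math.log10((nj + 0.1) / (wordsPerClass(label) + 0.1 * vocab));
  }
  return p;
}

function classify(text) {
  const words = text.toLowerCase().split(/\s+/);
  return score(words, "happy") >= score(words, "sad") ? "happy" : "sad";
}

console.log(classify("great sun"));  // → "happy"
console.log(classify("called"));     // → "sad"
```

The same structure (log prior, smoothed log likelihoods, argmax over classes) reappears later as the `pc` and `pw` columns in the SQL that computes classification probabilities.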

Building a Naive Bayes Classifier in SQL

We create a training data set by first labeling tweets as happy or sad. Labels will be assigned using emojis. If a tweet has a happy emoji, we assign the label “happy.” If a tweet has a sad emoji, we assign the label “sad.” If the tweet does not contain an emoji, or if it contains both types of emojis, we assign the labels “none” or “both” respectively.

To compute labels, we will use the following user-defined function (UDF) in JavaScript. The function takes a string as input and outputs a label based on the type of emojis found. It tests the string against one regular expression for happy emojis and one for sad emojis. We call this function “happyLabel”:

create or replace function happyLabel(TEXT string)
returns string
language javascript
as '
 var happyRegex = /:-?\\)/;
 var sadRegex = /:-?\\(/;
 var happyEmoji = happyRegex.test(TEXT);
 var sadEmoji = sadRegex.test(TEXT);
 if (happyEmoji && sadEmoji) return "both";
 if (happyEmoji) return "happy";
 if (sadEmoji) return "sad";
 return "none";
';
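For quick experimentation outside Snowflake, the same logic can be run as plain JavaScript. Note that inside the UDF body the backslashes are doubled because the body is embedded in a SQL string literal; in a standalone script they are written singly:

```javascript
// Plain-JavaScript version of the happyLabel UDF logic.
function happyLabel(text) {
  var happyEmoji = /:-?\)/.test(text);  // matches :) and :-)
  var sadEmoji = /:-?\(/.test(text);    // matches :( and :-(
  if (happyEmoji && sadEmoji) return "both";
  if (happyEmoji) return "happy";
  if (sadEmoji) return "sad";
  return "none";
}

console.log(happyLabel("great demo :-)"));  // → "happy"
```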

We will use another JavaScript UDF that cleans a tweet and splits it into words. This function takes a string as input and outputs a variant. A variant is a Snowflake-specific data type that can contain semi-structured data like JSON objects and JSON arrays, as well as native data such as strings and numbers. This function outputs a variant containing a JSON array. Before splitting a tweet into words, the function removes any HTML entities such as &amp; and &lt;, any mentions of other Twitter users, and any punctuation. It outputs a list of lower-case words.

create or replace function splitText(TEXT string)
returns variant
language javascript
as '
 var words = TEXT
   .replace(/&\\w+;/g, " ")         // remove HTML entities
   .replace(/@[^\\s]+/g, "")        // remove mentions
   .replace(/[^#\'\\w\\s]|_/g, " ") // remove punctuation
   .toLowerCase()
   .trim()
   .split(/\\s+/g);                 // split on whitespace
 return words;
';
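Again, the cleaning pipeline is easy to try out in standalone JavaScript (with single backslashes, since there is no enclosing SQL string literal). The sample input below is invented for illustration:

```javascript
// Plain-JavaScript version of the splitText UDF logic.
function splitText(text) {
  return text
    .replace(/&\w+;/g, " ")        // remove HTML entities like &amp;
    .replace(/@[^\s]+/g, "")       // remove @mentions
    .replace(/[^#'\w\s]|_/g, " ")  // remove punctuation, keep # and '
    .toLowerCase()
    .trim()
    .split(/\s+/g);                // split on whitespace
}

console.log(splitText("We're delighted!! #SIGMOD @SnowflakeDB &amp; friends"));
// → ["we're", "delighted", "#sigmod", "friends"]
```

Hashtags and apostrophes survive the cleaning, while entities, mentions, and other punctuation are stripped before the lower-cased words are returned.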

Now we will write SQL to generate the labeled data set. To generate the labels, we will call the JavaScript UDFs defined above. We will balance the labeled data set such that there are the same number of happy tweets and sad tweets. From this balanced data set, we will use 80 percent of the tweets for training the classifier. We will compute all probabilities and counts necessary for the Naive Bayes classifier. For a complete summary of all formulas needed to implement the classifier, see the slides from Stanford University. For any questions regarding JSON processing in Snowflake, refer to the Snowflake documentation.

-- label tweets and only select happy and sad tweets
create or replace table labeledTweets as
select tweet,
      happyLabel(tweet:text) as label,
      splitText(tweet:text) as words,
      uniform(0::float, 1::float, random()) as rand
from twitter.public.tweets_1p
where created_at > '2016-05-01'::date      -- tweets from May 2016
 and tweet:lang = 'en'                     -- english language only
 and array_size(tweet:entities:urls) = 0   -- no tweets with links
 and tweet:entities:media is null          -- no tweets with media (images etc.)
 and tweet:retweeted_status is null        -- no retweets, only originals
 and (label = 'happy' or label = 'sad');

-- balance tweets to have same number of happy and sad tweets
create or replace table balancedLabeledTweets as
select *
from (select *,
       rank() over (partition by label order by tweet:id) as rnk
     from labeledTweets) x
where rnk <= (select min(cnt)
             from (select label,count(*) as cnt
                   from labeledTweets
                   group by label));
-- training set, pick random 80 percent of tweets
create or replace table training as
select * from balancedLabeledTweets where rand < 0.8;

-- training helper table: tweet id, word, label
create or replace table training_helper as
select tweet:id as id, value as word, label
from training,
    lateral flatten(words);
 
-- number of total docs
create or replace table docs as
select count(*) as docs from training;

-- number of docs per class j
create or replace table docsj as
select label,count(*) as docsj
from training
group by label;

-- number of distinct words = |Vocabulary|
create or replace table voc as
select count(distinct word) as voc
from training_helper;

-- number of words per class j
create or replace table wordsj as
select label,count(*) as wordsj
from training_helper
group by label;

-- count per word n and class j
create or replace table wordsnj as
select a.label,a.word,ifnull(wordsnj,0) as wordsnj
from
(select label,word
from (select distinct label from training_helper) c
 join
    (select distinct word from training_helper) w) a
left outer join
 (select label,word,count(*) as wordsnj
 from training_helper
 group by label,word) r
on a.label = r.label and a.word = r.word;

To test our Naive Bayes classifier, we will use the remaining 20 percent of the labeled tweets. For every tweet in the test set, we will compute the actual label that the classifier assigns to the tweet. The classifier does not look at the emojis, of course. That would be cheating. We will then compare the actual labels with the expected labels. This will give us a percentage of correctly classified tweets.

-- test set
create or replace table test as
select * from balancedLabeledTweets where rand >= 0.8;

-- test helper table: tweet id, word, label
create or replace table test_helper as
select tweet:id as id,value as word,label
from test,
    lateral flatten(words);

-- classification probabilities
create or replace table probs as
select id,label,max(pc)+sum(pw) as p,expected
from (
select id,t.word,n.label,wordsnj,wordsj,docsj,docs,voc,
 log(10,docsj::real/docs) as pc,
 log(10,(wordsnj::real+0.1)/(wordsj+0.1*voc)) as pw,
 t.label as expected
from test_helper t
 inner join wordsnj n on t.word = n.word
 inner join wordsj j on n.label = j.label
 inner join docsj on docsj.label = n.label
 cross join docs
 cross join voc) x
group by id,label,expected;

-- classification result
create or replace table testResult as
select p1.id,t.tweet:text::string as text,p1.expected,p1.label as actual
from probs p1
 inner join (
   select id,max(p) as maxp
   from probs
   group by id) p2
 on p1.id = p2.id and p1.p = p2.maxp
 inner join test t on p1.id = t.tweet:id;

-- correctly classified tweets: "win probability"
select sum(win),count(*),sum(win)::real/count(*) as winprob
from (
select id,expected,actual,
 iff(expected = actual,1,0) as win
from testResult);

The output of the last query is the following:
---------+-----------+-----------------+
SUM(WIN) | COUNT(*)  |     WINPROB     |
---------+-----------+-----------------+
43926    | 56298     |   0.7802408611  |
---------+-----------+-----------------+

That means our Naive Bayes classifier classified 78% of all test tweets correctly. This is not too bad given that we did not do much data cleansing, spam detection, or word stemming. The baseline winning probability is 50% because the number of happy tweets is the same as the number of sad tweets in our data set. Thus, our classifier gives us a significant boost.

We can now use the trained classifier to label any text snippets. For example, we can label text snippets such as “The sun is shining, the weather is sweet again,” “I must’ve called a thousand times,” and “Make America Great Again.” To do this, we create a new table with all text snippets that we want to classify. To compute the classification, we split each text into words and use the word probabilities from our training set to ultimately assign a label to each text snippet. The results are stored in a table called "result".

-- create new table with any text snippets
create or replace table query(id int, text varchar(500));
insert into query values (1, 'The sun is shining, the weather is sweet again');
insert into query values (2, 'I must\'ve called a thousand times');
insert into query values (3, 'Make America Great Again');
 
-- split texts into words
create or replace table query_helper as
select id,value as word
from query,
    lateral flatten(splitText(text));

-- compute probabilities using data from training set
create or replace table probs as
select id,label,max(pc)+sum(pw) as p
from (
select id,t.word,n.label,wordsnj,wordsj,docsj,docs,voc,
 log(10,docsj::real/docs) as pc,
 log(10,(wordsnj::real+0.1)/(wordsj+0.1*voc)) as pw
from query_helper t
 inner join wordsnj n on t.word = n.word
 inner join wordsj j on n.label = j.label
 inner join docsj on docsj.label = n.label
 cross join docs
 cross join voc) x
group by id,label;

-- assign labels to text snippets
create or replace table result as
select p1.id as id, text, p1.label as label
from probs p1
 inner join (
   select id,max(p) as maxp
   from probs
   group by id) p2
 on p1.id = p2.id and p1.p = p2.maxp
 inner join query q on p1.id = q.id;

And, ta-dah, here are the results of our classification:

select * from result order by id;
---+------------------------------------------------+-------+
ID |                      TEXT                      | LABEL |
---+------------------------------------------------+-------+
1  | The sun is shining, the weather is sweet again | happy |
2  | I must’ve called a thousand times              | sad   |
3  | Make America Great Again                       | happy |
---+------------------------------------------------+-------+

Summary

In this blog post, we built a complete Naive Bayes classifier using SQL. The classifier learned to distinguish happy tweets from sad tweets. Our classifier has a success rate of 78%, which is a significant improvement over the baseline of 50%. We used the classifier on other text snippets to show its applicability beyond tweets. Of course, this is only a demo and further data cleansing, spam detection, word stemming, and other natural language processing tricks are necessary to increase the success rate of the classifier. 

As always, keep an eye on this blog site, our Snowflake Twitter feed (@SnowflakeDB), and my personal Twitter feed (@hemasail) for updates on all the action and activities here at Snowflake Computing.