Kentucky Derby 2015 - Twitter Analysis

Collection

I initially planned to collect tweets for just the Kentucky Derby, but to test the water I also collected tweets for the Kentucky Oaks. I used the streamR R package, created by Pablo Barberá. I attempted to collect tweets from 7pm (UK time, 2pm EST) to 1am (8pm EST), which allowed a few hours before the Oaks and Derby races and an hour or so after each. Unfortunately, on Derby day collection ended at around midnight (7pm EST), I believe because of the number of tweets collected. I tracked a couple of popular hashtags, as well as the possible runners in the Oaks and in the Derby. The table below shows, for each race, when collection started and ended, the search terms, the number of tweets collected, and a link to the .RData workspace containing the tweets and search terms:

| Race | Start | End | Searched For | # Tweets | Data |
|---|---|---|---|---|---|
| Oaks | 7pm (2pm EST) | 1am (8pm EST) | KYOaks, KentuckyOaks, Forever Unbridled, Shook Up, Include Betty, Eskenformoney, Condo Commando, Angela Renee, Lovely Maria, I’m A Chatterbox, Money’soncharlotte, Oceanwave, Sarah Sis, Stellar Wind, Birdatthewire, Puca, Peace And War | 7973 | data |
| Derby | 7pm (2pm EST) | ~12am (7pm EST) | KYDerby, Kentucky Derby, Ocho Ocho Ocho, Carpe Diem, Materiality, Tencendur, Danzig Moon, Mubtaahij, El Kabeir, Dortmund, Bolo, Firing Line, Stanford, International Star, Itsaknockout, Keen Ice, Frosted, War Story, Mr Z, American Pharoah, Upstart, Far Right, Frammento, Tale Of Verve | 148725 | data |
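As a rough illustration of how a collection run like this could be set up, here is a minimal sketch. The `streamR::filterStream` and `streamR::parseTweets` calls are the package's documented functions, but the wrapper `collect_tweets`, the file name, the `my_oauth` credential object and the shortened term list are all hypothetical, and the streaming endpoint the package relied on may no longer be available.

```r
# A subset of the Oaks search terms from the table above.
oaks_terms <- c("KYOaks", "KentuckyOaks", "Lovely Maria", "Condo Commando")

# 7pm to 1am UK time is six hours; filterStream takes its timeout in seconds.
collection_seconds <- 6 * 60 * 60

# Hypothetical wrapper: stream matching tweets to a file, then parse the
# file into a data frame. Nothing is contacted until the function is called.
collect_tweets <- function(terms, file, seconds, oauth) {
  streamR::filterStream(file.name = file, track = terms,
                        timeout = seconds, oauth = oauth)
  streamR::parseTweets(file)
}

# Not run here: needs streamR, valid credentials and a live connection.
# `my_oauth` is assumed to be an OAuth object created beforehand.
# oaks <- collect_tweets(oaks_terms, "oaks.json", collection_seconds, my_oauth)
```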

There are a couple of issues worth addressing now. Collection is indiscriminate: for example, tweets that mention Dortmund, a runner in the Derby, could be talking about the German city or, more likely, the football club after which the horse was named. Ocho Ocho Ocho, another runner in the Derby, will have led to tweets being collected if they mentioned ocho just once. Cleaning the tweets should reduce these mistaken collections somewhat. The number of tweets collected per race is also very different. Part of this is likely due to the Derby reaching a wider audience (once-a-year racing fans), and part because Ocho Ocho Ocho, Dortmund, Stanford, Frosted and Carpe Diem are more likely to appear in tweets not about racing than any of the Oaks runners.

Cleaning

To clean tweets (remove unnecessary characters, links, etc.), I use the tweet_cleaner function, which I wrote when I collected tweets for the Cheltenham Festival; it can be seen here. Below are 6 raw tweets from Derby day, followed by the same tweets after cleaning (the second tweet mentions Stanford, but is talking about the college, not the horse):

## [1] "RT @SportsCenter: So this is happening today:\n\n<U+0095> Day 3, NFL Draft\n<U+0095> NHL Playoffs\n<U+0095> Yankees at Red Sox\n<U+0095> Kentucky Derby   \n<U+0095> Spurs vs Clippe<U+0085>"
## [2] "RT @SBNationCFB: Most picks through four:\n\nFlorida State: 7\nMiami: 6\nAlabama, Florida, Louisville, Stanford: 5"                                
## [3] "Are you a fan of the Kentucky Derby? Try these cocktails out for something different.  http://t.co/sdlBJrHa7t"                                     
## [4] "RT @SportsCenter: So this is happening today:\n\n<U+0095> Day 3, NFL Draft\n<U+0095> NHL Playoffs\n<U+0095> Yankees at Red Sox\n<U+0095> Kentucky Derby   \n<U+0095> Spurs vs Clippe<U+0085>"
## [5] "Goal is to one day go to the Kentucky Derby and wear a big hat <ed><U+00A0><U+00BD><ed><U+00B1><U+0097><ed><U+00A0><U+00BD><ed><U+00B1><U+0092>"   
## [6] "RT @ESPNNFL: The Super Bowl champs are at the Kentucky Derby. (via Tom Brady/Facebook) http://t.co/ckxBtiBQY8"
tweet_cleaner(tweets = example_tweets)
## [1] "rt @sportscenter: so this is happening today: day , nfl draft nhl playoffs yankees at red sox kentucky derby spurs vs clippe"
## [2] "rt @sbnationcfb: most picks through four: florida state: miami: alabama, florida, louisville, stanford:"                     
## [3] "are you a fan of the kentucky derby? try these cocktails out for something different."                                       
## [4] "rt @sportscenter: so this is happening today: day , nfl draft nhl playoffs yankees at red sox kentucky derby spurs vs clippe"
## [5] "goal is to one day go to the kentucky derby and wear a big hat"                                                              
## [6] "rt @espnnfl: the super bowl champs are at the kentucky derby. (via tom brady/facebook)"
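For readers without the linked function to hand, the kind of cleaning shown above can be approximated in a few lines of base R. This is only a sketch inferred from the output above (links, digits and non-ASCII characters dropped, text lower-cased, whitespace collapsed); `simple_clean` is a hypothetical name and is not the tweet_cleaner function itself.

```r
# Hypothetical stand-in for a tweet cleaner, inferred from the output above.
simple_clean <- function(tweets) {
  x <- gsub("http\\S+", "", tweets, perl = TRUE)         # drop links
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII characters
  x <- gsub("[0-9]+", "", x)                             # drop digits
  x <- tolower(x)                                        # lower-case everything
  x <- gsub("\\s+", " ", x, perl = TRUE)                 # collapse newlines and runs of spaces
  trimws(x)                                              # trim leading/trailing space
}

raw <- c(
  "Are you a fan of the Kentucky Derby? Try these cocktails out for something different.  http://t.co/sdlBJrHa7t",
  "Florida State: 7\nMiami: 6"
)
simple_clean(raw)
```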

Cleaning tweets also makes it easier to identify tweets that belong to each race. For any number of terms, the function can search them out and concatenate them, so Ocho Ocho Ocho becomes ochoochoocho; any tweets that mention ocho just once can then be removed. This is required because streamR::filterStream collects tweets that mention a term, but if the term has multiple words (such as “Kentucky Derby”) then those words can appear anywhere in a tweet, not necessarily together or in the desired order. In the above tweets, we want to concatenate ‘Kentucky Derby’:

tweet_cleaner(tweets = example_tweets, concat_terms = "Kentucky Derby")
## [1] "rt @sportscenter: so this is happening today: day , nfl draft nhl playoffs yankees at red sox kentuckyderby spurs vs clippe"
## [2] "rt @sbnationcfb: most picks through four: florida state: miami: alabama, florida, louisville, stanford:"                    
## [3] "are you a fan of the kentuckyderby ? try these cocktails out for something different."                                      
## [4] "rt @sportscenter: so this is happening today: day , nfl draft nhl playoffs yankees at red sox kentuckyderby spurs vs clippe"
## [5] "goal is to one day go to the kentuckyderby and wear a big hat"                                                              
## [6] "rt @espnnfl: the super bowl champs are at the kentuckyderby . (via tom brady/facebook)"
# we can also remove punctuation
tweet_cleaner(tweets = example_tweets, concat_terms = "Kentucky Derby", rm_punct = TRUE)
## [1] "rt sportscenter so this is happening today day nfl draft nhl playoffs yankees at red sox kentuckyderby spurs vs clippe"
## [2] "rt sbnationcfb most picks through four florida state miami alabama florida louisville stanford"                        
## [3] "are you a fan of the kentuckyderby try these cocktails out for something different"                                    
## [4] "rt sportscenter so this is happening today day nfl draft nhl playoffs yankees at red sox kentuckyderby spurs vs clippe"
## [5] "goal is to one day go to the kentuckyderby and wear a big hat"                                                         
## [6] "rt espnnfl the super bowl champs are at the kentuckyderby via tom brady facebook"
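The concatenation idea can be sketched independently of tweet_cleaner. The helper below, `concat_term`, is a hypothetical illustration rather than the author's code: it assumes tweets are already lower-cased and that the term contains no regular-expression metacharacters.

```r
# Hypothetical helper: join the words of a multi-word term when they
# appear consecutively, so that partial mentions can be told apart.
concat_term <- function(tweets, term) {
  words <- strsplit(tolower(term), " +")[[1]]
  pattern <- paste0("\\b", paste(words, collapse = " +"), "\\b")
  gsub(pattern, paste(words, collapse = ""), tweets, perl = TRUE)
}

cleaned <- c("ocho ocho ocho looks well in the paddock",
             "uno dos tres cuatro cinco seis siete ocho")
joined <- concat_term(cleaned, "Ocho Ocho Ocho")

# Only the tweet containing the full name mentions the joined term.
keep <- grepl("ochoochoocho", joined, fixed = TRUE)
```

With these two made-up tweets, only the first survives the filter, because the second mentions ocho just once.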

The times at which tweets were sent need to be converted ahead of plotting: the tweets were collected from the UK, so I need to convert to a US timezone. I’ll admit that this confused me somewhat, what with GMT, BST, UTC, EST, etc. In the end I believe the correct timezone code is EST5EDT, but I would welcome input.

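# created_at carries its own UTC offset (%z); tz sets the zone the parsed time is shown in.
# as.POSIXct() stores the result as POSIXct, which behaves better as a data frame column than POSIXlt.
oaks$time <- as.POSIXct(strptime(oaks$created_at, "%a %b %d %H:%M:%S %z %Y", tz = "EST5EDT"))
derby$time <- as.POSIXct(strptime(derby$created_at, "%a %b %d %H:%M:%S %z %Y", tz = "EST5EDT"))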
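One way to sanity-check the timezone choice is to parse a single timestamp and print it in several zones. The sketch below uses a made-up timestamp in the same created_at format (not a real tweet) and assumes the system's timezone database includes the zone names used.

```r
# Day and month names in created_at are English, so parse them in the C locale.
invisible(Sys.setlocale("LC_TIME", "C"))

# Illustrative timestamp in the created_at format (an assumption, not real data).
created_at <- "Sat May 02 22:34:00 +0000 2015"

# Parse once as an absolute instant, using the +0000 offset in the string.
instant <- as.POSIXct(created_at, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

# The same instant displayed in UK time and in US Eastern time.
format(instant, tz = "Europe/London", usetz = TRUE)
format(instant, tz = "America/New_York", usetz = TRUE)
format(instant, tz = "EST5EDT", usetz = TRUE)
```

In early May both countries are on daylight saving time, so 22:34 UTC should display as 23:34 in London and 18:34 in New York. EST5EDT follows the same daylight saving rules as America/New_York for this date, whereas plain "EST" is a fixed UTC-5 offset and would show an hour earlier.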

Classifying

Once tweets have been cleaned, it is easier to remove mistakenly collected tweets: for the 6 tweets above, we want to identify whether each contains a runner or phrase we want. The findtweets function (seen here) looks through cleaned tweets for any mention of a collection of terms and returns a logical vector. All the tweets above mention either a runner or a phrase we tracked via streamR::filterStream (Stanford mentions will now be removed because the horse was a non-runner), so as an example of findtweets we’ll just look for “Kentucky Derby”:

findtweets(tweets = example_tweets, searchfor = "Kentucky Derby")
## [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
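A function of this shape is straightforward to approximate. The sketch below, `find_any`, is a hypothetical stand-in rather than findtweets itself: it does a case-insensitive literal search for each term and flags a tweet if any term is present.

```r
# Hypothetical stand-in: TRUE where a tweet contains at least one term.
find_any <- function(tweets, searchfor) {
  tweets <- tolower(tweets)
  # One logical vector per term, using literal (non-regex) matching.
  hits <- lapply(tolower(searchfor),
                 function(term) grepl(term, tweets, fixed = TRUE))
  # Combine the per-term vectors with element-wise OR.
  Reduce(`|`, hits, rep(FALSE, length(tweets)))
}

sample_tweets <- c("Goal is to one day go to the Kentucky Derby and wear a big hat",
                   "Alabama, Florida, Louisville, Stanford: 5")
wanted <- find_any(sample_tweets, searchfor = c("Kentucky Derby", "KYDerby"))

# Tweets that were collected but match none of the wanted terms.
n_mistaken <- sum(!wanted)
```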

For the Oaks there were 763 mistakenly collected tweets, i.e. the words of a runner, perhaps Oaks runner Sarah Sis, appear in a tweet, but the tweet is “my sister Sarah …”. For the Derby there were 17439. Ideally other tweets would also be removed, such as those that mention the football team Dortmund rather than the horse, but this is a very tricky task; any outside input would be welcomed!

Sentiment

Finally, I score the sentiment of each tweet using the senti_score function (found here) and the positive and negative lexicons found in the data folder.

positive <- scan("data/positive-lexicon.txt", what = "character")
negative <- scan("data/negative-lexicon.txt", what = "character")

oaks$senti_score <- senti_score(tweets = oaks$tweet, pos_words = positive, neg_words = negative)
derby$senti_score <- senti_score(tweets = derby$tweet, pos_words = positive, neg_words = negative)
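The linked senti_score function is not reproduced here, but a common way to score text against two lexicons is to count positive words and subtract the count of negative words. The sketch below assumes that approach; `simple_senti` and the tiny word lists are hypothetical and stand in for the real function and lexicon files.

```r
# Hypothetical lexicon scorer: (number of positive words) - (number of negative words).
simple_senti <- function(tweets, pos_words, neg_words) {
  # Split each tweet into lower-case words on anything that is not a letter or apostrophe.
  tokens <- strsplit(tolower(tweets), "[^a-z']+")
  vapply(tokens, function(words) {
    sum(words %in% pos_words) - sum(words %in% neg_words)
  }, integer(1))
}

# Tiny illustrative lexicons (assumptions, not the files in the data folder).
pos <- c("great", "win")
neg <- c("awful", "terrible")

scores <- simple_senti(c("what a great win",
                         "awful ride, terrible luck",
                         "the race is on"),
                       pos_words = pos, neg_words = neg)
```

A positive score suggests a tweet leans positive, a negative score the opposite, and zero means either no lexicon words were found or they cancelled out.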