cleaning

Because of the number of tweets, this R Markdown document serves only as an example of the cleaning and preparation carried out. The actual work was performed once and produced a workspace of clean Twitter data. The R script found here carries out the cleaning work and includes comments. The clean workspace, clean_tweets.RData, is in the data folder of the gh-pages branch.

tweets

A few examples of the dirty tweets can be seen below:

## [1] "Think Faugheen has it personally <ed><U+00A0><U+00BD><ed><U+00B1><U+008F>"                                                                                              
## [2] "No point backing Douvan in the opener when you get your money back if ya bet loses with billy hills!"                                                                   
## [3] "RT @horseandhound: Denman and Kauto Star among former stars to parade at Cheltenham Festival today http://t.co/lP7qIrPmoV http://t.co/yyoJS…"                           
## [4] "20p e/w Yankee for fun\n1:30 Shaneshill\n2:05 Clarcam\n3:20 Hurricane Fly \n4:40 Corgy\n£4.40 &gt; £8413 \n#LooooooongShot\n#CheltenhamFestival"                        
## [5] "RT @BoyleSports: ONE DAY LEFT! RT &amp; answer this #CheltFest question to #win a #free €/£500 bet! Enter here &gt; http://t.co/BpaX6UTPVk http://…"                    
## [6] "<ed><U+00A0><U+00BC><ed><U+00BF><U+0087>RUBY WALSH WAGER<ed><U+00A0><U+00BC><ed><U+00BF><U+0087> Have the HUGE 5/1(From 1/20) RUBY WALSH To Win ANY Race At Cheltenham<U+27A1> http://t.co/4pFaUYuRQ8 http://t.co/4ebIBcSIor"

We can see that the tweets require some cleaning. Thankfully, I can reuse a function I wrote when looking at tweets from last year. This function, tweet_cleaner, removes emoticons and unrecognised characters, control characters, links and digits, strips spaces from the start and end, and collapses double spaces. It can also perform some tasks specific to these tweets. It has the following parameters:

args(tweet_cleaner)

## function (tweets, concat_terms = NULL, rename_odds = FALSE, rm_punct = FALSE) 
## NULL
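The generic steps listed above can be sketched in base R. This is only an illustration under stated assumptions, not the author's tweet_cleaner: the function name basic_clean and every regular expression in it are hypothetical, and the lower-casing step is inferred from the cleaned output shown further down rather than from the description.

```r
# Hypothetical helper, NOT the author's tweet_cleaner: one possible way to
# write the generic cleaning steps with base R only.
basic_clean <- function(tweets) {
  # Drop anything that cannot be represented in ASCII (emoji, odd bytes)
  x <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")
  x <- gsub("http[^[:space:]]*", " ", x)  # remove links
  x <- gsub("[[:cntrl:]]", " ", x)        # remove control characters such as \n
  x <- gsub("[[:digit:]]+", " ", x)       # remove digits
  x <- gsub(" {2,}", " ", x)              # collapse repeated spaces
  x <- trimws(x)                          # strip spaces from start and end
  tolower(x)                              # assumption: output is lower case
}

basic_clean("RT  check this\nout http://t.co/abc 123 ")
```

Links are removed before control characters so that a URL followed by a newline still ends at the newline instead of running into the next word.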

The concat_terms parameter looks for specific phrases in the tweets and concatenates them. For example, the term “hello birdy” is matched using the regex “hello ?birdy” and converted to “hellobirdy”. This helps identify tweets that belong to the Festival, because Twitter's streaming API returns tweets that mention both “hello” and “birdy”, but not necessarily together or in the desired order, i.e. a tweet saying “hello little birdy” could be collected. Because we collected tweets mentioning horses, the order of the words matters if we are to analyse these tweets, and a number of the collected tweets are unlikely to be related to the Festival. The rename_odds parameter converts odds such as 3/1, 20/1, 2-1 and 3.75 to the word “odds”. Finally, the rm_punct parameter removes punctuation.
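The two options described above might look roughly like the sketch below. The helper names concat_term and odds_to_word are hypothetical, and the odds pattern is an assumption that covers only the forms listed in the text (3/1, 20/1, 2-1, 3.75); the real tweet_cleaner may use different patterns, and its example output suggests it also catches times such as 1:30.

```r
# Hypothetical helpers illustrating concat_terms and rename_odds;
# they are not taken from the author's script.
concat_term <- function(tweets, term) {
  pattern <- gsub(" ", " ?", term, fixed = TRUE)  # "hello birdy" -> "hello ?birdy"
  joined  <- gsub(" ", "", term, fixed = TRUE)    # "hello birdy" -> "hellobirdy"
  gsub(pattern, joined, tweets, ignore.case = TRUE)
}

odds_to_word <- function(tweets) {
  # Assumed pattern: digits, then "/", "." or "-", then digits
  gsub("[0-9]+[/.-][0-9]+", "odds", tweets)
}

concat_term(c("hello birdy", "hello little birdy"), "hello birdy")
odds_to_word("3/1 20/1 2-1 3.75")
```

Only the first string is changed by concat_term, because the optional space in the pattern allows nothing else between the two words. A term containing regex metacharacters would need escaping first, which this sketch does not attempt.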

tweet_cleaner(tweets = eg_tweets, rename_odds = TRUE, rm_punct = TRUE)

## [1] "think faugheen has it personally"                                                                                         
## [2] "no point backing douvan in the opener when you get your money back if ya bet loses with billy hills"                      
## [3] "rt horseandhound denman and kauto star among former stars to parade at cheltenham festival today"                         
## [4] "p e w yankee for fun odds shaneshill odds clarcam odds hurricane fly odds corgy odds gt looooooongshot cheltenhamfestival"
## [5] "rt boylesports one day left rt amp answer this cheltfest question to win a free bet enter here gt"                        
## [6] "ruby walsh wager have the huge odds from odds ruby walsh to win any race at cheltenham"

Look at tweet number 4: it mentions a horse we wanted to collect, “hurricane fly”. By supplying the term “hurricane fly” as the concat_terms parameter, tweet_cleaner will locate the horse's name and concatenate it. This helps distinguish a tweet that mentions the horse from a tweet that merely happens to include the words “hurricane” and “fly”.

(eg_tweets <- tweet_cleaner(tweets = eg_tweets, concat_terms = "hurricane fly", rename_odds = TRUE, rm_punct = TRUE))

## [1] "think faugheen has it personally"                                                                                        
## [2] "no point backing douvan in the opener when you get your money back if ya bet loses with billy hills"                     
## [3] "rt horseandhound denman and kauto star among former stars to parade at cheltenham festival today"                        
## [4] "p e w yankee for fun odds shaneshill odds clarcam odds hurricanefly odds corgy odds gt looooooongshot cheltenhamfestival"
## [5] "rt boylesports one day left rt amp answer this cheltfest question to win a free bet enter here gt"                       
## [6] "ruby walsh wager have the huge odds from odds ruby walsh to win any race at cheltenham"

concat_terms can take a vector of terms. For the real cleaning, these will be the ~500 horses that ran in any of the races over the four festival days.
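As a rough illustration of passing several terms at once, the sketch below loops over a vector and applies the same "optional space" regex to each term. The function concat_all is hypothetical and is not how tweet_cleaner is known to work internally; the two horse names stand in for the full list of runners.

```r
# Hypothetical vectorised helper; the real concat_terms handling may differ.
concat_all <- function(tweets, terms) {
  for (term in terms) {
    pattern <- gsub(" ", " ?", term, fixed = TRUE)  # allow an optional space
    joined  <- gsub(" ", "", term, fixed = TRUE)    # term with spaces removed
    tweets  <- gsub(pattern, joined, tweets, ignore.case = TRUE)
  }
  tweets
}

horses  <- c("hurricane fly", "kauto star")  # stand-in for the ~500 runners
example <- c("odds hurricane fly odds corgy",
             "denman and kauto star among former stars")
concat_all(example, horses)
```

A plain loop keeps the sketch easy to read. With around 500 terms and a large number of tweets, a single combined pattern might be faster, but that is outside the scope of this example.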

Cheltenham Festival 2015 - Twitter Analysis
