Data rarely comes ready for analysis, and whatever you call it, cleaning, tidying, munging, it is a very important part of the process. There are a number of helper functions in the RcappeR
package that can help get data into a cleaner, more usable, format, some of these include:
This vignette will run through a few of them using a dummy dataset included in the package, this dataset contains fictional data to highlight the use of the functions, but hopefully the scope of their use will become apparent. There are likely many other cleaning tasks that are perhaps not present in the current version of RcappeR
, if you have a demand for any, don’t hesitate to ask - either look me up on twitter or add an issue on github.
Already, the dataset being used has been cleaned a little, ie. it’s in a nice neat dataframe. However some of the variables are in need of further polishing. The dataset can be seen below:
data(example_race)
(df <- example_race)
## date time pos horse wintime going code btn_l dist
## 1 01/01/01 2:20 1 aardvark 1m39.99 good flat 0.00 8
## 2 01/01/01 2:20 2 badger 1m39.99 good flat 0.25 8
## 3 01/01/01 2:20 3 cat 1m39.99 good flat 0.75 8
## 4 01/01/01 2:20 4 dog 1m39.99 good flat 1.00 8
## 5 01/01/01 2:20 5 eagle 1m39.99 good flat 1.75 8
## 6 01/01/01 2:20 6 frog 1m39.99 good flat 2.50 8
## 7 01/01/01 2:20 7 giraffe 1m39.99 good flat 4.00 8
## 8 01/01/01 2:20 8 hamster 1m39.99 good flat 6.00 8
## 9 02/02/02 3:30 1 andrew 1:12.78 good flat 0.00 6
## 10 02/02/02 3:30 2 barry 1:12.78 good flat 1.00 6
## 11 02/02/02 3:30 3 charlie 1:12.78 good flat 1.50 6
## 12 02/02/02 3:30 4 david 1:12.78 good flat 2.50 6
## 13 02/02/02 3:30 5 eddie 1:12.78 good flat 5.00 6
The dataset contains two unique races, which can be identified by the date
and time
variables, but it is often a good idea to create a race_id
variable to help identify the unique races and their runners. Simply concatenating the date and races’ start time should be sufficient:
df$race_id <- paste(df$date, df$time, sep = "_")
unique(df$race_id)
## [1] "01/01/01_2:20" "02/02/02_3:30"
The winners’ time is available in the wintime
variable, but are in character format, 1m39.99
and 1:12.78
, what is desired is a time in numeric format, in seconds. The conv_times
function will convert these times (or similar), by splitting times up according to a regular expression, "[[:punct:]]\\s?|m\\s?|\\s+"
, this should cover a healthy number of formats that races come in, the table below lists a few.
df$new_time <- conv_times(df$wintime)
## the new times look like
df$new_time
## [1] 99.99 99.99 99.99 99.99 99.99 99.99 99.99 99.99 72.78 72.78 72.78
## [12] 72.78 72.78
The table below contains various formats of times that will be converted, but if times to be converted are not covered then the regex
parameter in the conv_times
function allows users to create unique regular expression.
old time | regex used | splits old time | new time |
---|---|---|---|
“1m39.99” | default | “1”, “39”, “99” | 99.99 |
“1 39.99” | default | “1”, “39”, “99” | 99.99 |
“1-39.99” | default | “1”, “39”, “99” | 99.99 |
“1:39.99” | default | “1”, “39”, “99” | 99.99 |
“1min 39.99” | “min |[[:punct:]]” | “1”, “39”, “99” | 99.99 |
The margins between horses are typically recorded in lengths, the conv_margins
function can help convert margins into seconds, and calculate finishing times for each runner. The usage of the conv_margins
function is shown below. In its simplest invocation it uses a scale of 5 lengths per second, the US convention, to calculate the difference in seconds between runners.
A quick reminder of the lengths between our runners
## [1] 0.00 0.25 0.75 1.00 1.75 2.50 4.00 6.00 0.00 1.00 1.50 2.50 5.00
conv_margins(btn_l = df$btn_l)
## [1] 0.00 0.05 0.15 0.20 0.35 0.50 0.80 1.20 0.00 0.20 0.30 0.50 1.00
Entering the winners time into the win_time
parameter, returned by the conv_times
function above, it will convert the margins into individual times for runners.
conv_margins(btn_l = df$btn_l, win_time = df$new_time)
## [1] 99.99 100.04 100.14 100.19 100.34 100.49 100.79 101.19 72.78 72.98
## [11] 73.08 73.28 73.78
The BHA changes the lengths per second scale according to the conditions of the race, conditions being the ground and the type of race (national hunt or flat). The conditions
parameter allows an abbreviated version of these conditions to be entered, returning the scale employed by the BHA given the conditions. See ?bha_ls
and ?bhascale
for more details about the abbreviated conditions.
## our races are flat races, and going is good or quicker, so conditions = "f-gq"
conv_margins(btn_l = df$btn_l, win_time = df$new_time, conditions = "f-gq")
## [1] 99.99 100.03 100.11 100.16 100.28 100.41 100.66 100.99 72.78 72.95
## [11] 73.03 73.20 73.61
Finally, if margins between horses aren’t cumulative, but are the lengths between a runner and the horse in front, as opposed to the winner, then entering FALSE
to the cum_l
parameter will make the adjustment. However, to do this the races need to be split up, to prevent the losing margin from a runner in one race being added to the winner of another race. Here it is best to use the package dplyr
, grouping by our race_id
, and then calculating the beaten margins.
library(dplyr)
df <- df %>%
group_by(race_id) %>%
mutate(ind_times = conv_margins(btn_l = btn_l,
cum_l = FALSE,
win_time = new_time))
select(df, race_id, pos, wintime, new_time, btn_l, ind_times)
## Source: local data frame [13 x 6]
## Groups: race_id [2]
##
## race_id pos wintime new_time btn_l ind_times
## (chr) (int) (chr) (dbl) (dbl) (dbl)
## 1 01/01/01_2:20 1 1m39.99 99.99 0.00 99.99
## 2 01/01/01_2:20 2 1m39.99 99.99 0.25 100.04
## 3 01/01/01_2:20 3 1m39.99 99.99 0.75 100.19
## 4 01/01/01_2:20 4 1m39.99 99.99 1.00 100.39
## 5 01/01/01_2:20 5 1m39.99 99.99 1.75 100.74
## 6 01/01/01_2:20 6 1m39.99 99.99 2.50 101.24
## 7 01/01/01_2:20 7 1m39.99 99.99 4.00 102.04
## 8 01/01/01_2:20 8 1m39.99 99.99 6.00 103.24
## 9 02/02/02_3:30 1 1:12.78 72.78 0.00 72.78
## 10 02/02/02_3:30 2 1:12.78 72.78 1.00 72.98
## 11 02/02/02_3:30 3 1:12.78 72.78 1.50 73.28
## 12 02/02/02_3:30 4 1:12.78 72.78 2.50 73.78
## 13 02/02/02_3:30 5 1:12.78 72.78 5.00 74.78
If times for individual runners already exist in a dataset, but margins in seconds are needed, the btn_sec
function will do the work for you. It is a very simple function, which substracts the fastest time (hopefully the winner) from the other times. If applying to a dataset of a number of races, the race_id
must be used again:
df %>%
group_by(race_id) %>%
mutate(btn_sec = btn_sec(times = ind_times))
Using the dplyr
for just the last task neednt be the case, instead of writing lines and lines of code, using the $
to access variables that need cleaning, the cleaning can be chained together:
example_race %>%
group_by(date, time) %>%
mutate(race_id = paste(date, time, sep = "_"),
new_time = conv_times(times = wintime),
new_margins = conv_margins(btn_l = btn_l,
cum_l = FALSE,
win_time = new_time,
conditions = "f-gq")) %>%
select(race_id, pos, horse, wintime, new_time, btn_l, new_margins)
## Source: local data frame [13 x 9]
## Groups: date, time [2]
##
## date time race_id pos horse wintime new_time btn_l
## (chr) (chr) (chr) (int) (chr) (chr) (dbl) (dbl)
## 1 01/01/01 2:20 01/01/01_2:20 1 aardvark 1m39.99 99.99 0.00
## 2 01/01/01 2:20 01/01/01_2:20 2 badger 1m39.99 99.99 0.25
## 3 01/01/01 2:20 01/01/01_2:20 3 cat 1m39.99 99.99 0.75
## 4 01/01/01 2:20 01/01/01_2:20 4 dog 1m39.99 99.99 1.00
## 5 01/01/01 2:20 01/01/01_2:20 5 eagle 1m39.99 99.99 1.75
## 6 01/01/01 2:20 01/01/01_2:20 6 frog 1m39.99 99.99 2.50
## 7 01/01/01 2:20 01/01/01_2:20 7 giraffe 1m39.99 99.99 4.00
## 8 01/01/01 2:20 01/01/01_2:20 8 hamster 1m39.99 99.99 6.00
## 9 02/02/02 3:30 02/02/02_3:30 1 andrew 1:12.78 72.78 0.00
## 10 02/02/02 3:30 02/02/02_3:30 2 barry 1:12.78 72.78 1.00
## 11 02/02/02 3:30 02/02/02_3:30 3 charlie 1:12.78 72.78 1.50
## 12 02/02/02 3:30 02/02/02_3:30 4 david 1:12.78 72.78 2.50
## 13 02/02/02 3:30 02/02/02_3:30 5 eddie 1:12.78 72.78 5.00
## Variables not shown: new_margins (dbl)