Data Processing – Performance Visualisation in R
Up until this point, we have been working with base R. Now we are going to import some packages. Packages (sometimes referred to as libraries) provide additional functionality in your R code.
We are going to be using the tidyverse (link) and StatsBombR (link) packages to process data from the UEFA Women’s Euro 2022, which is provided free by StatsBomb (link). Data needs to be processed in order to create effective visualisations.
In this lesson, we will be creating a script that will process a large data file. To get started, create a new R project and a script file, and save them within a folder. In the source pane (top left), we need to load the StatsBombR and tidyverse packages.
#---------- Load packages
# An if statement is used to evaluate whether the required package is
# already installed. If the package is not installed, the code will
# install it. The package is then loaded with the library() function.
if (require("tidyverse") == FALSE) {
install.packages("tidyverse")
}
library(tidyverse)
#--
if (require("devtools") == FALSE) {
install.packages("devtools")
}
library(devtools)
#--
if (require("StatsBombR") == FALSE) {
devtools::install_github("statsbomb/SDMTools")
devtools::install_github("statsbomb/StatsBombR")
}
library(StatsBombR)
From the StatsBombR library, we can use the FreeCompetitions() function to pull a list of competitions for which data is freely available. Let’s save this to a variable called competitions_available.
# Saves list of all available competitions to a variable
competitions_available <- FreeCompetitions()
Run the code and the competitions_available variable will appear in the environment pane (top right).
# The below code will open the variable in a tabular format.
view(competitions_available)
Since we want to use data from the UEFA Women’s Euro 2022, we need to inspect the data to ascertain the competition_id and season_name for our competition of interest.
The competition_id and season_name can be used to filter out competitions we are not analysing. The code below pulls the list of all the competitions, filters the data set to just contain the UEFA Women’s Euro 2022 and stores the filtered data in a new variable called comps. The comps variable is then used inside of the FreeMatches() function to pull the list of matches, which is saved to a new variable called matches. The matches variable is then used inside the free_allevents() function to pull the data set we will be filtering.
Note the %>% syntax; this comes from the tidyverse (via the magrittr package) and allows us to pipe our code. In the code below, the FreeCompetitions() function acquires a list of competitions, which is then piped into the filter() function, allowing us to filter for the specific competition we are using.
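As a minimal illustration of what piping does, here is a toy sketch on an invented vector (nothing to do with the football data):

```r
# A minimal sketch of how %>% works, using a toy vector
# (the %>% operator comes from the magrittr package, which tidyverse loads)
library(tidyverse)

values <- c(1, 4, 9)

# Without piping, calls are nested and read inside-out:
nested <- sqrt(values)

# With piping, the left-hand side is passed as the first argument:
piped <- values %>% sqrt()

identical(nested, piped)  # TRUE
```

Both forms compute the same thing; piping simply lets a chain of steps read top-to-bottom, which is why the tidyverse code in this lesson is written that way.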
# Filters UEFA Women's Euro 2022
comps <- FreeCompetitions() %>%
filter(competition_id == 53, season_name == "2022")
# Gets the matches from the UEFA Women's Euro 2022
matches <- FreeMatches(comps)
# Pulls available data for all matches and cleans it
data <- free_allevents(MatchesDF = matches, Parallel = T) %>%
  allclean()
To get an idea of how large this data set is, you can view it and print the names of the columns.
# The below code will open the variable in a tabular format.
view(data)
# The below code will print into the console all the column names
colnames(data)
This dataset has a total of 179 columns and 105,157 rows, containing every event for every match in the competition. This is long-format data, which isn’t easily interpreted.
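To see what long versus wide format means in miniature, here is a sketch on an invented data frame (none of these players or numbers come from the StatsBomb data):

```r
# Hedged sketch of long vs wide format on made-up data
library(tidyverse)

# Long format: one row per (player, metric) pair
long_data <- tibble(
  player = c("A", "A", "B", "B"),
  metric = c("shots", "goals", "shots", "goals"),
  value  = c(10, 3, 7, 1)
)

# Wide format: one row per player, one column per metric
wide_data <- long_data %>%
  pivot_wider(names_from = metric, values_from = value)

print(wide_data)  # two rows, columns: player, shots, goals
```

The event data we just pulled is closer to the long shape, which is why later in the lesson we summarise it into a wide, one-row-per-player table before plotting.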
When handling large data sets, it’s crucial to know what you want the end result to be. Our end result is to create a visualisation to assess performance. We are going to assess performance by calculating the following variables: shots per 90, goals per 90, and non-penalty expected goals per 90. We will then visualise these variables in a scatterplot. Additionally, we are going to create a shot map. For the shot map, we need to know the player’s location when they took the shot and the outcome of the shot.
Note: This lesson talks you through what to do with this dataset to give you an introduction to using R. If you go on to independently process large data sets, you will discover that a lot of time is spent looking at the data. You need to assess what data is available and the format it is in. You’ll also need to plan what data you can use to get your desired output and what steps you need to take in order to achieve that.
Let’s start with calculating the per-90 variables. For this, we need to know how many minutes each player has played. StatsBombR has a function which will return the number of minutes each player played per match. The code below uses the get.minutesplayed() function on the data variable and saves the result into the mins variable.
mins <- get.minutesplayed(data)
If you view the mins variable, you will see it contains: “player.id”, “match_id”, “team.id”, “TimeOn”, “TimeOff”, “GameEnd”, “MinutesPlayed”. The player.id column refers to players by a number, rather than their name. In order for the data to be interpretable, we need to rectify this. The data variable (the huge one) contains both the player.id column and a column called player.name which, as you’d imagine, contains the player’s name. We can use the code below to process the data in a way that will result in us having a variable that has the player’s name, team and how many minutes they’ve played.
players <- data %>% # 1)
distinct(player.id, player.name, team.name) %>% # 2)
left_join(mins, by = "player.id") %>% # 3)
na.omit() # 4)
# 1) Save the following to the players variable. First take the data variable.
# 2) Use the distinct() function with the player.id, player.name,
# 2) and team.name arguments (arguments are what we call inputs to functions).
# 2) The distinct() function will return every unique row of player.id,
# 2) player.name and team.name.
# 2) The reason we do this is because the data variable has rows which repeat
# 2) player names, since the data variable contains information by event.
# 2) Right now we need a list of player names, player ids and team names
# 2) in order to join them to the mins variable.
# 3) We then add the mins variable to the list of player names
# 3) with the left_join() function, using the player.id column to match up
# 3) the corresponding mins data to the appropriate player.
# 3) Remember that the mins variable contains data on minutes played
# 3) for every player in every match.
# 4) Finally, we use the na.omit() function, which removes any empty data entries
# 4) since they're not relevant to us.
# view the players variable created above
view(players)
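If the distinct() and left_join() steps feel abstract, here is a minimal sketch on invented data (the names and minutes below are made up, and tidyverse is assumed to be loaded, as in this lesson):

```r
# Toy illustration of distinct() + left_join() on made-up data
library(tidyverse)

# Event-style data: the same player appears on multiple rows
events <- tibble(
  player.id   = c(1, 1, 2),
  player.name = c("Player One", "Player One", "Player Two")
)
# Minutes-style data: one row per player
minutes <- tibble(
  player.id     = c(1, 2),
  MinutesPlayed = c(90, 45)
)

joined <- events %>%
  distinct(player.id, player.name) %>%   # drop the repeated row
  left_join(minutes, by = "player.id")   # attach minutes by matching id

print(joined)  # two rows: Player One / 90, Player Two / 45
```

This mirrors what happens with the real data: distinct() collapses the repeated event rows down to one row per player, and left_join() then attaches the minutes information by player.id.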
In the players variable, we now have how many minutes each player played in each match. Next, we need to calculate the total minutes each player has played so we can then derive the nineties.
mins_stats <- players %>% # 1)
group_by(player.name, team.name) %>% # 2)
summarise(total_mins = round(sum(MinutesPlayed), digits = 2)) %>% # 3)
mutate(nineties = round(total_mins / 90, digits = 2)) %>% # 4)
arrange(-nineties) %>% # 5)
distinct() %>% # 6)
ungroup() # 7)
# 1) Save the following to the mins_stats variable.
# 1) First take the players variable.
# 2) Group the data by player.name (team.name is included as an argument
# 2) to preserve that information). Grouping the data allows us to
# 2) calculate nineties for each player.
# 3) This may look a bit messy with all the nesting. What is happening
# 3) is that the summarise() function creates a new data frame,
# 3) with a column called total_mins.
# 3) The total minutes are calculated by summing the MinutesPlayed,
# 3) which are summed per player because we used group_by() function.
# 3) The values are rounded to two decimal places.
# 4) The mutate() function creates a column called nineties.
# 4) The nineties are calculated by dividing the total minutes played
# 4) by 90 and rounding to two decimal places.
# 5) The arrange() function orders the rows by values in the nineties column,
# 5) starting with high values.
# 6) Like we've done previously, we are using the distinct() function
# 6) to return unique rows, since we don't need any repeated values.
# 7) The ungroup() function reverses the group_by() function,
# 7) we do this to prevent the group_by() function interfering with
# 7) any future manipulations we apply to the dataset.
# We can get the players with the highest and lowest nineties values
# using the head() and tail() functions.
# The head() function returns rows from the start of a dataframe.
# Since the mins_stats variable is ordered by the nineties column,
# head() will return the players with the highest nineties.
# The head() function will print to console unless you specify otherwise.
print("Players with highest nineties:")
head(mins_stats)
# The tail() function returns the last rows in a dataframe.
# It will therefore return the players with the lowest nineties.
# The tail() function will print to console unless you specify otherwise.
print("Players with lowest nineties:")
tail(mins_stats)
We will use the nineties values stored in the mins_stats variable to calculate shots, goals and non-penalty expected goals per 90. If you haven’t done so already, consider the difference between comparing players on goals per 90 and comparing players on total goals. Total goals is an absolute value, whereas goals per 90 is a relative one. If a player has less time on the pitch, they have less opportunity to score goals. Using raw goal counts doesn’t control for the variation in time each player is on the pitch. Goals per 90, although not perfect itself, allows for better comparison between players, since you can compare how effective a player is with their time on the pitch.
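The per-90 reasoning can be made concrete with some made-up numbers (these players and totals are invented for illustration only):

```r
# Worked example of the per-90 idea with made-up numbers
goals   <- c(A = 5, B = 3)
minutes <- c(A = 900, B = 270)

nineties  <- minutes / 90      # A: 10 nineties, B: 3 nineties
goals_p90 <- goals / nineties  # A: 0.5 per 90,  B: 1.0 per 90

# Player B has fewer total goals but scores at twice Player A's rate
print(goals_p90)
```

On total goals, Player A looks better; per 90, Player B does. This is exactly the adjustment the summary pipeline below performs on the real data.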
In order to calculate goals, shots and expected goals per 90, we need to know these values in their absolute form, then divide each value by the nineties we’ve just calculated.
summary <- data %>% # 1)
mutate( # 2)
is_goal = ifelse(shot.outcome.name == "Goal", 1, 0), # 2)
is_shot = ifelse(type.name == "Shot", 1, 0) # 2)
) %>%
filter(is_shot == 1) %>% # 3)
group_by(player.name) %>% # 4)
summarise( # 5)
shots = sum(is_shot), # 5)
goals = sum(is_goal), # 5)
npxg = sum(shot.statsbomb_xg) # 5)
) %>%
left_join(mins_stats, by = "player.name") %>% # 6)
ungroup() %>% # 7)
mutate( # 8)
shots_p90 = shots / nineties, # 8)
goals_p90 = goals / nineties, # 8)
npxg_p90 = npxg / nineties # 8)
) %>%
arrange(-npxg_p90) # 9)
write.csv(summary, "summary.csv", row.names = FALSE) # 10)
# 1) Save the following to the summary variable.
# 1) First take the data variable, this is the HUGE data set.
# 2) Using the mutate() function we create two columns
# 2) called is_goal and is_shot. In order to obtain the total
# 2) number of shots and goals for each player we need
# 2) numerical data. The is_goal and is_shot columns are
# 2) computed using ifelse statements, like we did
# 2) in the previous lesson.
# 3) The filter() function filters the data, specifying that
# 3) the only rows of data that will remain are those relating to
# 3) when a shot has occurred.
# 4) We use group_by() in the same fashion we have previously to calculate
# 4) values per player.
# 5) The summarise() function creates a new dataframe in which
# 5) we calculate the absolute values of goals, shots and non-penalty
# 5) expected goals. For goals and shots we use the is_goal
# 5) and is_shot variables we created. For non-penalty expected goals we use
# 5) StatsBomb's calculation of expected goals.
# You can read more about how StatsBomb calculate expected goals
# by visiting the following link:
# https://statsbomb.com/soccer-metrics/expected-goals-xg-explained/
# 6) To the dataframe containing the absolute values we add
# 6) the mins_stats variable and match the data by player.name.
# 7) ungroup() is applied to prevent interference from the
# 7) group_by() function.
# 8) mutate() adds three new columns to the dataframe, all calculated
# 8) in the same fashion. For example, the total number of goals
# 8) a player has scored is divided by nineties to produce
# 8) the goals_p90 column, which contains goals per 90.
# 9) The dataset is arranged with players with higher non-penalty
# 9) expected goals per 90 values at the top.
# 10) Finally, we save the summary variable to a .csv file.
# 10) The file is saved to the same folder as your R Project.
# 10) This data file will be used to create our
# 10) scatterplot visualisations.
# 10) The summary data file is, as the name suggests, a summary.
# 10) In data science terminology, it would be described
# 10) as data in wide format. Wide format data is more
# 10) easily interpreted.
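The is_goal and is_shot columns come down to ifelse() applied to a character vector; here is a toy sketch with invented shot outcomes:

```r
# Toy sketch of the ifelse() trick used for is_goal (invented outcomes)
outcomes <- c("Goal", "Saved", "Goal", "Off T")
is_goal  <- ifelse(outcomes == "Goal", 1, 0)

print(is_goal)   # 1 0 1 0
sum(is_goal)     # 2 goals out of 4 shots
```

Because ifelse() is vectorised, the comparison runs on every row at once, and the resulting 1s and 0s can simply be summed to count goals per player.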
At this stage, we have our data sorted for making the scatterplot visualisations – great! Next, we need to wrangle the data for shot maps.
shot_data <- data %>% # 1)
filter(player.name %in% summary$player.name, # 2)
type.name == "Shot") %>% # 2)
mutate(is_goal = ifelse(shot.outcome.name == "Goal", 1, 0)) %>% # 3)
select( # 4)
player.name, # 4)
location.x, # 4)
location.y, # 4)
is_goal, # 4)
shot.outcome.name, # 4)
shot.statsbomb_xg, # 4)
team.name # 4)
)
write.csv(shot_data, "shot_data.csv", row.names = FALSE) # 5)
# 1) Save the following to the shot_data variable.
# 1) First take the data variable, this is the HUGE data set.
# 2) Filter the data variable to keep only the player names that appear in
# 2) the summary variable's player.name column, and only the
# 2) rows relating to shots.
# Note 1. %in% is the "in" operator. In this application the
# in operator is checking whether the player.name values in the data
# variable appear in the summary variable.
# Note 2. The dollar $ sign is used to specify a column
# in a dataframe. For example, summary$player.name is
# accessing the player.name column from the summary variable.
# 3) As done previously, we are converting the shot.outcome.name
# 3) into numerical values.
# 4) The select() function is specifying what column names are
# 4) going to be saved from the huge data variable into
# 4) the shot_data variable
# 5) Finally, we save the shot_data variable to a .csv file.
# 5) The file is saved to the same folder as your R Project.
# 5) This data file will be used to create shot maps.
# 5) The shot_data data file would be described
# 5) as data in long format. Long format data is
# 5) harder to interpret in tabular form.
# 5) One of the main reasons for producing visualisations
# 5) is to make data easy to interpret. When this
# 5) shot_data is presented in a shot map it will be easier to interpret.
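The %in% and $ notes above can be tried on a small invented example using base R only (the names here are made up):

```r
# Toy illustration of %in% and $ on made-up data (base R only)
df   <- data.frame(player.name = c("Ann", "Ben", "Cat"))
keep <- c("Ann", "Cat")

df$player.name                        # $ extracts the player.name column
matches  <- df$player.name %in% keep  # TRUE FALSE TRUE
filtered <- df[matches, , drop = FALSE]

print(filtered)  # rows for Ann and Cat only
```

This is the same pattern the shot_data pipeline uses: build a logical vector with %in%, then keep only the matching rows.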
In the same folder as your R project, you should now have a summary.csv file and a shot_data.csv file. We will be using these in the next lesson to create visualisations!
Want the code from this lesson in one chunk?
You can download the whole data processing script from the course files (opens in new tab).
Alternatively, you can copy and paste the below:
#----
#---------- Load packages
# An if statement is used to evaluate whether the required package is
# already installed. If the package is not installed, the code will
# install it. The package is then loaded with the library() function.
if (require("tidyverse") == FALSE) {
install.packages("tidyverse")
}
library(tidyverse)
#--
if (require("devtools") == FALSE) {
install.packages("devtools")
}
library(devtools)
#--
if (require("StatsBombR") == FALSE) {
devtools::install_github("statsbomb/SDMTools")
devtools::install_github("statsbomb/StatsBombR")
}
library(StatsBombR)
##---------------------------
#----
#---------- Saves list of all available competitions to a variable
competitions_available <- FreeCompetitions()
##---------------------------
#----
#---------- View the competitions_available variable in a tabular format
view(competitions_available)
##---------------------------
#----
#---------- Filter competitions of interest, below is UEFA Women's Euro 2022
comps <- FreeCompetitions() %>%
filter(competition_id == 53, season_name == "2022")
##---------------------------
#----
#------ Gets the matches from the filtered competitions
matches <- FreeMatches(comps)
##---------------------------
#----
#------- Pulls available data for all matches and cleans it
data <- free_allevents(MatchesDF = matches, Parallel = T) %>%
allclean()
##---------------------------
#----
#------ We can use view(data) to open up the data file in a tabular format
view(data)
#----
#------ We can print into the console all the column names
colnames(data)
##---------------------------
#----
#------- In order to calculate per 90 variables we need to know how many minutes each player has played.
#------- The StatsBombR library has a function specifically for this
mins <- get.minutesplayed(data) # Get how many mins each player has played in each match
print("Got the mins played") # useful when first creating code, so we know where we are up to when running it
view(mins)
#----
#------- Extracting player names and team names from the data variable
#------- and combining with the mins variable to create the players variable.
players <- data %>% # 1)
distinct(player.id, player.name, team.name) %>% # 2)
left_join(mins, by = "player.id") %>% # 3)
na.omit() # 4)
# 1) Save the following to the players variable. First take the data variable.
# 2) Use the distinct() function with the player.id, player.name,
# 2) and team.name arguments (arguments are what we call inputs to functions).
# 2) The distinct() function will return every unique row of player.id,
# 2) player.name and team.name.
# 2) The reason we do this is because the data variable has rows which repeat
# 2) player names, since the data variable contains information by event.
# 2) Right now we need a list of player names, player ids and team names
# 2) in order to join them to the mins variable.
# 3) We then add the mins variable to the list of player names
# 3) with the left_join() function, using the player.id column to match up
# 3) the corresponding mins data to the appropriate player.
# 3) Remember that the mins variable contains data on minutes played
# 3) for every player in every match.
# 4) Finally, we use the na.omit() function, which removes any empty data entries
# 4) since they're not relevant to us.
# view the players variable created above
view(players)
##---------------------------
#----
#------- Sum player minutes and calculate 90s.
mins_stats <- players %>% # 1)
group_by(player.name, team.name) %>% # 2)
summarise(total_mins = round(sum(MinutesPlayed), digits = 2)) %>% # 3)
mutate(nineties = round(total_mins / 90, digits = 2)) %>% # 4)
arrange(-nineties) %>% # 5)
distinct() %>% # 6)
ungroup() # 7)
# 1) Save the following to the mins_stats variable.
# 1) First take the players variable.
# 2) Group the data by player.name (team.name is included as an argument
# 2) to preserve that information). Grouping the data allows us to
# 2) calculate nineties for each player.
# 3) This may look a bit messy with all the nesting. What is happening
# 3) is that the summarise() function creates a new data frame,
# 3) with a column called total_mins.
# 3) The total minutes are calculated by summing the MinutesPlayed,
# 3) which are summed per player because we used group_by() function.
# 3) The values are rounded to two decimal places.
# 4) The mutate() function creates a column called nineties.
# 4) The nineties are calculated by dividing the total minutes played
# 4) by 90 and rounding to two decimal places.
# 5) The arrange() function orders the rows by values in the nineties column,
# 5) starting with high values.
# 6) Like we've done previously, we are using the distinct() function
# 6) to return unique rows, since we don't need any repeated values.
# 7) The ungroup() function reverses the group_by() function,
# 7) we do this to prevent the group_by() function interfering with
# 7) any future manipulations we apply to the dataset.
#----
# We can get the players with the highest and lowest nineties values
# using the head() and tail() functions.
# The head() function returns rows from the start of a dataframe.
# Since the mins_stats variable is ordered by the nineties column,
# head() will return the players with the highest nineties.
# The head() function will print to console unless you specify otherwise.
print("Players with highest nineties:")
head(mins_stats)
# The tail() function returns the last rows in a dataframe.
# It will therefore return the players with the lowest nineties.
# The tail() function will print to console unless you specify otherwise.
print("Players with lowest nineties:")
tail(mins_stats)
##---------------------------
#----
#------- Calculate goals, shots and expected goals per 90.
summary <- data %>% # 1)
mutate( # 2)
is_goal = ifelse(shot.outcome.name == "Goal", 1, 0), # 2)
is_shot = ifelse(type.name == "Shot", 1, 0) # 2)
) %>%
filter(is_shot == 1) %>% # 3)
group_by(player.name) %>% # 4)
summarise( # 5)
shots = sum(is_shot), # 5)
goals = sum(is_goal), # 5)
npxg = sum(shot.statsbomb_xg) # 5)
) %>%
left_join(mins_stats, by = "player.name") %>% # 6)
ungroup() %>% # 7)
mutate( # 8)
shots_p90 = shots / nineties, # 8)
goals_p90 = goals / nineties, # 8)
npxg_p90 = npxg / nineties # 8)
) %>%
arrange(-npxg_p90) # 9)
write.csv(summary, "summary.csv", row.names = FALSE) # 10)
# 1) Save the following to the summary variable.
# 1) First take the data variable, this is the HUGE data set.
# 2) Using the mutate() function we create two columns
# 2) called is_goal and is_shot. In order to obtain the total
# 2) number of shots and goals for each player we need
# 2) numerical data. The is_goal and is_shot columns are
# 2) computed using ifelse statements, like we did
# 2) in the previous lesson.
# 3) The filter() function filters the data, specifying that
# 3) the only rows of data that will remain are those relating to
# 3) when a shot has occurred.
# 4) We use group_by() in the same fashion we have previously to calculate
# 4) values per player.
# 5) The summarise() function creates a new dataframe in which
# 5) we calculate the absolute values of goals, shots and non-penalty
# 5) expected goals. For goals and shots we use the is_goal
# 5) and is_shot variables we created. For non-penalty expected goals we use
# 5) StatsBomb's calculation of expected goals.
# You can read more about how StatsBomb calculate expected goals
# by visiting the following link:
# https://statsbomb.com/soccer-metrics/expected-goals-xg-explained/
# 6) To the dataframe containing the absolute values we add
# 6) the mins_stats variable and match the data by player.name.
# 7) ungroup() is applied to prevent interference from the
# 7) group_by() function.
# 8) mutate() adds three new columns to the dataframe, all calculated
# 8) in the same fashion. For example, the total number of goals
# 8) a player has scored is divided by nineties to produce
# 8) the goals_p90 column, which contains goals per 90.
# 9) The dataset is arranged with players with higher non-penalty
# 9) expected goals per 90 values at the top.
# 10) Finally, we save the summary variable to a .csv file.
# 10) The file is saved to the same folder as your R Project.
# 10) This data file will be used to create our
# 10) scatterplot visualisations.
# 10) The summary data file is, as the name suggests, a summary.
# 10) In data science terminology, it would be described
# 10) as data in wide format. Wide format data is more
# 10) easily interpreted.
##---------------------------
#----
#------- Prepare the data required for creating shot maps.
shot_data <- data %>% # 1)
filter(player.name %in% summary$player.name, # 2)
type.name == "Shot") %>% # 2)
mutate(is_goal = ifelse(shot.outcome.name == "Goal", 1, 0)) %>% # 3)
select( # 4)
player.name, # 4)
location.x, # 4)
location.y, # 4)
is_goal, # 4)
shot.outcome.name, # 4)
shot.statsbomb_xg, # 4)
team.name # 4)
)
write.csv(shot_data, "shot_data.csv", row.names = FALSE) # 5)
# 1) Save the following to the shot_data variable.
# 1) First take the data variable, this is the HUGE data set.
# 2) Filter the data variable to keep only the player names that appear in
# 2) the summary variable's player.name column, and only the
# 2) rows relating to shots.
# Note 1. %in% is the "in" operator. In this application the
# in operator is checking whether the player.name values in the data
# variable appear in the summary variable.
# Note 2. The dollar $ sign is used to specify a column
# in a dataframe. For example, summary$player.name is
# accessing the player.name column from the summary variable.
# 3) As done previously, we are converting the shot.outcome.name
# 3) into numerical values.
# 4) The select() function is specifying what column names are
# 4) going to be saved from the huge data variable into
# 4) the shot_data variable
# 5) Finally, we save the shot_data variable to a .csv file.
# 5) The file is saved to the same folder as your R Project.
# 5) This data file will be used to create shot maps.
# 5) The shot_data data file would be described
# 5) as data in long format. Long format data is
# 5) harder to interpret in tabular form.
# 5) One of the main reasons for producing visualisations
# 5) is to make data easy to interpret. When this
# 5) shot_data is presented in a shot map it will be easier to interpret.
##---------------------------
print("Tada! Data is processed")
Further resources:
If you wish to pursue R independently, I recommend the following online resources to aid you on your learning journey.