Tiderverse Descriptive Statistics and Data Wrangling

Filtering and Selecting Data

In the following coding examples, you will see three important functions, a pipe, select(), and filter(). First, a pipe is written as a %>% or |>. Pipes are a powerful tool that help us clearly expressing a sequence of functions. Pipes tell R that you want to use a designated set for a chain of functions that build off of each other. A way to think about is in math terms.

\[ F(G(H(X))), X=Data\]

In this series of functions, the first step is you solve \(H(X)\), then using that solution you solve \(G()\), then after you solve \(F()\). The final output is the solution of those chain of events with the information of \(X\). This is how pipes work! Pipes use a set of information, typically a data frame, and then apply it to a series of functions that build off each other.

In the following example, I use the select() and filter() to simplify the data. The select() allows me to select certain variables from the data and eliminate the rest. filter() does something similar but instead of variables, it allows me to simplify the data based on values based on a set of programmed logical operators. In my first example, I select for the circuit, year, constructor, surname of the driver, and the number of points awarded during that race. I use the unique() to remove duplicates, and then I filter for the race information for the constructor McLaren during the 2023 season.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RandomData)


race_stats |>
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
  unique() |>
  filter(constructor == "McLaren" & year == 2023)

# A tibble: 40 × 5
   circuit                         year constructor surname points
   <chr>                          <dbl> <chr>       <chr>    <dbl>
 1 Bahrain International Circuit   2023 McLaren     Norris       0
 2 Jeddah Corniche Circuit         2023 McLaren     Norris       0
 3 Jeddah Corniche Circuit         2023 McLaren     Piastri      0
 4 Albert Park Grand Prix Circuit  2023 McLaren     Norris       8
 5 Albert Park Grand Prix Circuit  2023 McLaren     Piastri      4
 6 Baku City Circuit               2023 McLaren     Norris       2
 7 Baku City Circuit               2023 McLaren     Piastri      0
 8 Miami International Autodrome   2023 McLaren     Norris       0
 9 Miami International Autodrome   2023 McLaren     Piastri      0
10 Circuit de Monaco               2023 McLaren     Norris       2
# ℹ 30 more rows

What if we wanted two different constructors? Then we want to use the logical operator for | and also differentiate it from the next instruction for filter by using a , or (). In the example bellow, I filter for the race information for the constructor Mercades and Red Bull during the 2021 season.

race_stats |>
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
    unique() |>
    filter(constructor == "Mercades" |  constructor == "Red Bull", year == 2021) |>
    print()

# A tibble: 40 × 5
   circuit                             year constructor surname    points
   <chr>                              <dbl> <chr>       <chr>       <dbl>
 1 Losail International Circuit        2021 Red Bull    Pérez          12
 2 Losail International Circuit        2021 Red Bull    Verstappen     19
 3 Bahrain International Circuit       2021 Red Bull    Pérez          10
 4 Bahrain International Circuit       2021 Red Bull    Verstappen     18
 5 Autodromo Enzo e Dino Ferrari       2021 Red Bull    Pérez           0
 6 Autodromo Enzo e Dino Ferrari       2021 Red Bull    Verstappen     25
 7 Autódromo Internacional do Algarve  2021 Red Bull    Pérez          12
 8 Autódromo Internacional do Algarve  2021 Red Bull    Verstappen     18
 9 Circuit de Barcelona-Catalunya      2021 Red Bull    Pérez          10
10 Circuit de Barcelona-Catalunya      2021 Red Bull    Verstappen     19
# ℹ 30 more rows

However, if we wanted to use the modified data, we could not because it is not saved. If we want to use this data we need to save it as a new object!

season_2021 <- race_stats |>
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
    unique() |>
    filter(constructor == "Mercades" |  constructor == "Red Bull", year == 2021) |>
    print()

# A tibble: 40 × 5
   circuit                             year constructor surname    points
   <chr>                              <dbl> <chr>       <chr>       <dbl>
 1 Losail International Circuit        2021 Red Bull    Pérez          12
 2 Losail International Circuit        2021 Red Bull    Verstappen     19
 3 Bahrain International Circuit       2021 Red Bull    Pérez          10
 4 Bahrain International Circuit       2021 Red Bull    Verstappen     18
 5 Autodromo Enzo e Dino Ferrari       2021 Red Bull    Pérez           0
 6 Autodromo Enzo e Dino Ferrari       2021 Red Bull    Verstappen     25
 7 Autódromo Internacional do Algarve  2021 Red Bull    Pérez          12
 8 Autódromo Internacional do Algarve  2021 Red Bull    Verstappen     18
 9 Circuit de Barcelona-Catalunya      2021 Red Bull    Pérez          10
10 Circuit de Barcelona-Catalunya      2021 Red Bull    Verstappen     19
# ℹ 30 more rows

Now, we can use this information!

Creating New Variables

We can also add to our data sets! If we want to make a new variable, we can use the mutate(), which uses existing data to create new variables. The correct format of the function is the following, first is the new variable name and then an equal sign =, followed by how the new variable will be formed and what existing variables are used to make it.

mutate(new_column_name = function(old_variable_1))

In the example below, we calculate the average lap time by taking the total time for the driver to finish the race divided by the number of laps.

season_2023 <- race_stats |>
  select(circuit, year, constructor, surname, time, laps) |>
  filter(constructor == "McLaren" & year == 2023) |>
  mutate(avg_laptime = sum(time)/laps)

season_2023 |> 
   select(circuit, year, constructor, surname, avg_laptime) |>
   # remove duplicates
   unique() |>
   print()

# A tibble: 40 × 5
   circuit                         year constructor surname avg_laptime
   <chr>                          <dbl> <chr>       <chr>         <dbl>
 1 Bahrain International Circuit   2023 McLaren     Norris      646286.
 2 Jeddah Corniche Circuit         2023 McLaren     Norris      710915.
 3 Jeddah Corniche Circuit         2023 McLaren     Piastri     710915.
 4 Albert Park Grand Prix Circuit  2023 McLaren     Norris      612858.
 5 Albert Park Grand Prix Circuit  2023 McLaren     Piastri     612858.
 6 Baku City Circuit               2023 McLaren     Norris      696975.
 7 Baku City Circuit               2023 McLaren     Piastri     696975.
 8 Miami International Autodrome   2023 McLaren     Norris      623609.
 9 Miami International Autodrome   2023 McLaren     Piastri     634745.
10 Circuit de Monaco               2023 McLaren     Norris      461633.
# ℹ 30 more rows

If we wanted to make a new character variable we would use the case_when() in the mutate() function. In this example, I make a new variable based on the final position of the race for the two McLaren drivers, Piastri and Norris, in the 2023 season. I use the existing variables the circuit and surname to make this new variable.

McLarenStandings_2023 <- race_stats |>
  select(circuit, year, constructor, surname) |>
  # remove duplicates
  unique() |>
  filter(constructor == "McLaren" & year == 2023) |>
  mutate(
    final_position = case_when(
      #PIASTRI
      circuit == "Bahrain International Circuit" & surname == "Piastri" ~ "DNF",
      circuit == "Jeddah Corniche Circuit" & surname == "Piastri" ~ "15",
      circuit == "Albert Park Grand Prix Circuit" & surname == "Piastri" ~ "8", 
      circuit ==  "Baku City Circuit" & surname == "Piastri" ~ "11",
      circuit ==  "Miami International Autodrome" & surname == "Piastri" ~ "19",
      circuit == "Circuit de Monaco" & surname == "Piastri" ~ "10",
      circuit == "Circuit de Barcelona-Catalunya" & surname == "Piastri" ~ "13",
      circuit == "Circuit Gilles Villeneuve" & surname == "Piastri" ~ "11", 
      circuit == "Red Bull Ring" & surname == "Piastri" ~ "16",
      circuit == "Silverstone Circuit" & surname == "Piastri" ~ "4",
      circuit == "Hungaroring" & surname == "Piastri" ~ "5",
      circuit == "Circuit de Spa-Francorchamps" & surname == "Piastri" ~ "DNF",
      circuit == "Circuit Park Zandvoort" & surname == "Piastri" ~ "9",
      circuit == "Autodromo Nazionale di Monza" & surname == "Piastri" ~ "12",
      circuit == "Marina Bay Street Circuit" & surname == "Piastri" ~ "7",
      circuit == "Suzuka Circuit" & surname == "Piastri" ~ "3",
      circuit == "Losail International Circuit" & surname == "Piastri" ~ "2",
      circuit == "Circuit of the Americas" & surname == "Piastri" ~ "DNF",
      circuit == "Autódromo Hermanos Rodríguez" & surname == "Piastri" ~ "8",
      circuit == "Autódromo José Carlos Pace" ~ "14",
      circuit == "Las Vegas Strip Street Circuit" & surname == "Piastri" ~ "10",
      circuit == "Yas Marina Circuit" & surname == "Piastri" ~ "6",
        
        # NORRIS
        circuit == "Bahrain International Circuit" & surname == "Norris" ~ "17",
        circuit == "Jeddah Corniche Circuit" & surname == "Norris" ~ "17",
        circuit == "Albert Park Grand Prix Circuit" & surname == "Norris" ~ "6",
        circuit ==  "Baku City Circuit" & surname == "Norris" ~ "9",
        circuit ==  "Miami International Autodrome" & surname == "Norris" ~ "17", 
        circuit == "Circuit de Monaco" & surname == "Norris" ~ "9", 
        circuit == "Circuit de Barcelona-Catalunya" & surname == "Norris" ~ "17", 
        circuit == "Circuit Gilles Villeneuve" & surname == "Norris" ~ "13",
        circuit == "Red Bull Ring" & surname == "Norris" ~ "4", 
        circuit == "Silverstone Circuit" & surname == "Norris" ~ "2",
        circuit == "Hungaroring" & surname == "Norris" ~ "2",
        circuit == "Circuit de Spa-Francorchamps" & surname == "Norris" ~ "7",
        circuit == "Circuit Park Zandvoort" & surname == "Norris" ~ "9",
        circuit == "Autodromo Nazionale di Monza" & surname == "Norris" ~"8",
        circuit == "Marina Bay Street Circuit" & surname == "Norris" ~ "2",
        circuit == "Suzuka Circuit" & surname == "Norris" ~ "2",
        circuit == "Losail International Circuit" & surname == "Norris" ~ "3",
        circuit == "Circuit of the Americas" & surname == "Norris" ~ "3",
        circuit == "Autódromo Hermanos Rodríguez" & surname == "Norris" ~ "5",
        circuit == "Autódromo José Carlos Pace" ~ "2",
        circuit == "Las Vegas Strip Street Circuit" & surname == "Norris" ~ "DNF",
        circuit == "Yas Marina Circuit" & surname == "Norris" ~ "5"
        )
  ) 

print(McLarenStandings_2023)

# A tibble: 40 × 5
   circuit                         year constructor surname final_position
   <chr>                          <dbl> <chr>       <chr>   <chr>         
 1 Bahrain International Circuit   2023 McLaren     Norris  17            
 2 Jeddah Corniche Circuit         2023 McLaren     Norris  17            
 3 Jeddah Corniche Circuit         2023 McLaren     Piastri 15            
 4 Albert Park Grand Prix Circuit  2023 McLaren     Norris  6             
 5 Albert Park Grand Prix Circuit  2023 McLaren     Piastri 8             
 6 Baku City Circuit               2023 McLaren     Norris  9             
 7 Baku City Circuit               2023 McLaren     Piastri 11            
 8 Miami International Autodrome   2023 McLaren     Norris  17            
 9 Miami International Autodrome   2023 McLaren     Piastri 19            
10 Circuit de Monaco               2023 McLaren     Norris  9             
# ℹ 30 more rows

Descriptive Statistics Using Tidyverse

Now that we know some basic ways to manipulate the data frame, lets look at different way to do basic descriptive statistics! In this section we will be using the function, summarize(). This function is similar to the mutate function, except instead of adding a variable, it makes a new data frame based on existing variables. You will also see the function group_by(). This function allows us to organize the data by telling it to group things by a variable(s). Essentially, the functions splits things into groups.

For this example we are going to find the total points for each team in the 2023 season!

TeamStandings_2023 <- race_stats |>
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
  unique() |>
  filter(year==2023) |>
  group_by(constructor) |>
  summarize(
    total_points = sum(points)
  )

print(TeamStandings_2023)

# A tibble: 10 × 2
   constructor    total_points
   <chr>                 <dbl>
 1 Alfa Romeo               16
 2 AlphaTauri               22
 3 Alpine F1 Team          110
 4 Aston Martin            266
 5 Ferrari                 363
 6 Haas F1 Team              9
 7 McLaren                 266
 8 Mercedes                374
 9 Red Bull                790
10 Williams                 26

TeamStandings_2023 <- race_stats |>
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
  unique() |>
  filter(year==2023) |>
  group_by(constructor) |>
  summarize(
    total_points = sum(points)
  ) |>
  arrange(desc(total_points))

print(TeamStandings_2023)

# A tibble: 10 × 2
   constructor    total_points
   <chr>                 <dbl>
 1 Red Bull                790
 2 Mercedes                374
 3 Ferrari                 363
 4 Aston Martin            266
 5 McLaren                 266
 6 Alpine F1 Team          110
 7 Williams                 26
 8 AlphaTauri               22
 9 Alfa Romeo               16
10 Haas F1 Team              9

What if we wanted to know the percentage of points each driver contributed to the teams total?

TeamStandings_2023 <- race_stats |>
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
  unique() |>
  filter(year==2023) |>
  group_by(constructor) |>
  mutate(total_points = sum(points, na.rm = TRUE)) |>
  ungroup() |>  # Ungroup to avoid issues with the next group_by
  group_by(surname) |>
  summarize(
  perc_points = sum(points, na.rm = TRUE) / unique(total_points) * 100)  |>
  arrange(desc(perc_points))

print(TeamStandings_2023)

# A tibble: 22 × 2
   surname    perc_points
   <chr>            <dbl>
 1 Albon             96.2
 2 Alonso            74.4
 3 Norris            69.2
 4 Verstappen        67.1
 5 Hülkenberg        66.7
 6 Tsunoda           63.6
 7 Bottas            62.5
 8 Hamilton          58.0
 9 Leclerc           51.0
10 Ocon              50.9
# ℹ 12 more rows