Descriptive Statistics Using Tidyverse

Now that we know some basic ways to manipulate a data frame, let's look at different ways to compute basic descriptive statistics! In this section we will use the function summarize(). It is similar to mutate(), except that instead of adding a variable to the existing data frame, it returns a new data frame, built from the existing variables, with one row per group. You will also see the function group_by(), which tells the functions that follow it to treat the data as groups defined by one or more variables. Essentially, it splits the data into groups, and summarize() then computes one result for each group.
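To make the difference between mutate() and summarize() concrete, here is a minimal sketch on a small made-up table. The `toy` data, its column names, and the values are invented for illustration and are not part of the race data used below.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- tibble(
  team   = c("A", "A", "B", "B", "B"),
  points = c(10, 5, 8, 0, 2)
)

# mutate() keeps all five rows and adds the group total to each one
with_totals <- toy |>
  group_by(team) |>
  mutate(team_total = sum(points)) |>
  ungroup()

# summarize() collapses each group to a single row
totals <- toy |>
  group_by(team) |>
  summarize(team_total = sum(points))

nrow(with_totals)  # 5 rows, one per original row
nrow(totals)       # 2 rows, one per team
```

Both versions compute the same totals (15 for team A and 10 for team B); the only difference is whether the original rows are kept.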

For this example we are going to find the total points for each team in the 2023 season!

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RandomData)

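TeamStandings_2023 <- race_stats |>
  # keep only the columns we need
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
  unique() |>
  # keep only the 2023 season
  filter(year == 2023) |>
  # one group per team, then one row of output per group
  group_by(constructor) |>
  summarize(
    total_points = sum(points)
  )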

print(TeamStandings_2023)

# A tibble: 10 × 2
   constructor    total_points
   <chr>                 <dbl>
 1 Alfa Romeo               16
 2 AlphaTauri               22
 3 Alpine F1 Team          110
 4 Aston Martin            266
 5 Ferrari                 363
 6 Haas F1 Team              9
 7 McLaren                 266
 8 Mercedes                374
 9 Red Bull                790
10 Williams                 26

The rows come back in alphabetical order of the grouping variable, constructor. To see the standings from most points to fewest, we can add arrange() with desc() to the end of the pipeline:

TeamStandings_2023 <- race_stats |>
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
  unique() |>
  filter(year == 2023) |>
  group_by(constructor) |>
  summarize(
    total_points = sum(points)
  ) |>
  # sort from the highest total to the lowest
  arrange(desc(total_points))

print(TeamStandings_2023)

# A tibble: 10 × 2
   constructor    total_points
   <chr>                 <dbl>
 1 Red Bull                790
 2 Mercedes                374
 3 Ferrari                 363
 4 Aston Martin            266
 5 McLaren                 266
 6 Alpine F1 Team          110
 7 Williams                 26
 8 AlphaTauri               22
 9 Alfa Romeo               16
10 Haas F1 Team              9
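
Notice that Aston Martin and McLaren are tied at 266 points in the table above. As a small sketch of one way to handle ties, we can give arrange() a second sort key and use min_rank() so tied teams share a rank. The `standings` tibble below just retypes four rows from the output above so the example runs on its own; the `rank` column name is our own choice.

```r
library(dplyr)

# Four rows retyped from the standings table above
standings <- tibble(
  constructor  = c("McLaren", "Ferrari", "Aston Martin", "Alpine F1 Team"),
  total_points = c(266, 363, 266, 110)
)

ranked <- standings |>
  # sort by points (descending), then by name to break ties predictably
  arrange(desc(total_points), constructor) |>
  # tied totals receive the same rank; the next rank is skipped
  mutate(rank = min_rank(desc(total_points)))

print(ranked)
```

Within these four rows, Ferrari is ranked 1, Aston Martin and McLaren share rank 2, and Alpine F1 Team is ranked 4.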

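What if we wanted to know the percentage of points each driver contributed to their team's total? This time we use mutate() after group_by(constructor), so every row keeps its team's total, and then regroup by driver to compute each driver's share.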

TeamStandings_2023 <- race_stats |>
  select(circuit, year, constructor, surname, points) |>
  # remove duplicates
  unique() |>
  filter(year == 2023) |>
  group_by(constructor) |>
  # mutate() keeps every row and attaches the team's total to each one
  mutate(total_points = sum(points, na.rm = TRUE)) |>
  ungroup() |>  # drop the constructor grouping before regrouping by driver
  group_by(surname) |>
  summarize(
    # unique(total_points) is a single value only if each driver
    # appears under one constructor within the season
    perc_points = sum(points, na.rm = TRUE) / unique(total_points) * 100
  ) |>
  arrange(desc(perc_points))

print(TeamStandings_2023)

# A tibble: 22 × 2
   surname    perc_points
   <chr>            <dbl>
 1 Albon             96.2
 2 Alonso            74.4
 3 Norris            69.2
 4 Verstappen        67.1
 5 Hülkenberg        66.7
 6 Tsunoda           63.6
 7 Bottas            62.5
 8 Hamilton          58.0
 9 Leclerc           51.0
10 Ocon              50.9
# ℹ 12 more rows
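
The pipeline above relies on each driver appearing under a single constructor, because unique(total_points) would return more than one value for a driver who switched teams. As a hedged alternative, here is a sketch that groups by both constructor and surname, so the share is always computed within a team. The `results` data, the driver names, and the `driver_points` column are made up for illustration; this is one possible approach, not the only one.

```r
library(dplyr)

# Toy results, invented for illustration; "Lee" drives for two teams
results <- tibble(
  constructor = c("A", "A", "A", "B", "B"),
  surname     = c("Kim", "Lee", "Kim", "Lee", "Park"),
  points      = c(10, 5, 5, 4, 6)
)

driver_share <- results |>
  group_by(constructor, surname) |>
  # one row per driver per team; keep the constructor grouping afterwards
  summarize(driver_points = sum(points, na.rm = TRUE), .groups = "drop_last") |>
  # still grouped by constructor here, so sum() gives the team total
  mutate(perc_points = driver_points / sum(driver_points) * 100) |>
  ungroup() |>
  arrange(desc(perc_points))

print(driver_share)
```

Because the result keeps the constructor column, a driver who raced for two teams gets one row per team, and the percentages within each team add up to 100.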