Correlations

A correlation is used to understand the relationship between two variables.

It is important to understand that a correlation DOES NOT say anything about how much Y changes when X changes. It only describes how consistently the two variables move together.

Correlations can describe two things: direction and strength of a relationship.

Direction (+/-)

  • A positive correlation means that when one variable increases, the other variable tends to increase as well.

  • A negative correlation means that when one variable increases, the other variable tends to decrease.

Strength

  • Correlations range from -1.0 to +1.0.

  • The closer it is to +1 or -1, the more closely the two variables are related. (Strong)

  • If the correlation is close to 0, it means there is little or no linear relationship. (Weak or none)
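A value near 0 only rules out a straight-line relationship. As a minimal sketch (the toy values below are made up for illustration and are not part of the examples in this tutorial), here y is completely determined by x, yet the correlation is essentially zero because the relationship is curved rather than linear:

```r
# Toy data: y depends perfectly on x, but along a U-shaped curve
x <- -5:5
y <- x^2

# cor() uses the Pearson method by default, which only measures linear association
r <- cor(x, y)
print(r)
```

This is one reason to look at a scatter plot in addition to the number.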

To describe a correlation, we state both its direction (positive or negative) and its strength (how close the number is to +1 or -1):

  • Strong: the relationship is very consistent

  • Weak: the relationship exists, but is not very strong or consistent

  • Positive: when one variable increases, the other tends to increase

  • Negative: when one variable increases, the other tends to decrease

REMINDER: THE NUMERICAL VALUE DOES NOT DESCRIBE HOW MUCH ONE VARIABLE CHANGES, ONLY THE STRENGTH OF THE RELATIONSHIP.
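The reminder above can be checked with a small sketch. It assumes simulated data (the names y_small and y_large are hypothetical) and uses lm() only to show the slope, which is what actually measures how much y changes per unit of x. Multiplying y by 10 makes the slope ten times larger, but the correlation stays the same:

```r
set.seed(42)

x <- rnorm(200)
y_small <- 0.5 * x + rnorm(200)  # y changes a little as x changes
y_large <- 10 * y_small          # same pattern, ten times the change

# The correlation is unaffected by rescaling y
r_small <- cor(x, y_small)
r_large <- cor(x, y_large)

# The regression slope is what captures "how much" y changes
slope_small <- unname(coef(lm(y_small ~ x))[2])
slope_large <- unname(coef(lm(y_large ~ x))[2])

print(c(r_small = r_small, r_large = r_large))
print(c(slope_small = slope_small, slope_large = slope_large))
```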

Example 1:

In the first example, we look at randomly generated data to see how correlations can be identified visually from scatter plots. I have provided five examples.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(123)

x <- rnorm(100, mean = 1, sd = 0.5)
y <- rnorm(100, mean = 1, sd = 0.5)

d1 <- data.frame(x = x,
                 y = y + x*.5)
d2 <- data.frame(x = x,
                 y = y + x*-.5)
d3 <- data.frame(x = x,
                 y = y + x*1)
d4 <- data.frame(x = x,
                 y = y + x*-1)
d5 <- data.frame(x = x,
                 y = y)

# visualize the correlations through a scatter plot

# weak positive correlation
ggplot(d1, aes(x, y)) +
  geom_point() +
  labs(title = "Weak Positive Correlation Between X and Y")

# weak negative correlation
ggplot(d2, aes(x, y)) +
  geom_point() +
  labs(title = "Weak Negative Correlation Between X and Y")

# strong positive correlation
ggplot(d3, aes(x, y)) +
  geom_point() +
  labs(title = "Strong Positive Correlation Between X and Y")

# strong negative correlation
ggplot(d4, aes(x, y)) +
  geom_point() +
  labs(title = "Strong Negative Correlation Between X and Y")

# no correlation
ggplot(d5, aes(x, y)) +
  geom_point() +
  labs(title = "No Correlation Between X and Y")
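To put numbers next to the five scatter plots, the following sketch rebuilds the same data with the same seed and computes one correlation per data set. The list name datasets and the vector r_values are my own additions for illustration. The values should come out ordered from most negative (d4) to most positive (d3), with d5 near zero:

```r
set.seed(123)

# Same recipe as the example data: x and y are independent draws,
# and each data set adds a different multiple of x to y
x <- rnorm(100, mean = 1, sd = 0.5)
y <- rnorm(100, mean = 1, sd = 0.5)

datasets <- list(
  d1 = data.frame(x = x, y = y + x * 0.5),
  d2 = data.frame(x = x, y = y + x * -0.5),
  d3 = data.frame(x = x, y = y + x * 1),
  d4 = data.frame(x = x, y = y + x * -1),
  d5 = data.frame(x = x, y = y)
)

# One correlation per data set, returned as a named numeric vector
r_values <- sapply(datasets, function(d) cor(d$x, d$y))
print(round(r_values, 2))
```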

Example 2:

In Example 2, I have randomly generated a fake data set to show how to use the cor() function and how to plot the resulting correlations.

age <- sample(18:100, 100, replace = TRUE)
gender <- sample(c(1, 0), 100, replace = TRUE)  # 1 = Female, 0 = Male
kids <- sample(0:5, 100, replace = TRUE)

# Now build the data.frame with dependent vars defined inline
data <- data.frame(
  age = age,
  gender = gender,
  height = rnorm(100, mean = 5.5, sd = 0.75) + gender * -0.25,
  kids = kids,
  income = rnorm(100, mean = 45000, sd = 10000) + age * 1000 + gender * (-2000) + kids * (-3000),
  years_edu = sample(10:23, 100, replace = TRUE) + age * 1 + gender * 2
)

head(data)
  age gender   height kids    income years_edu
1  40      0 5.480122    3  73470.91        53
2  84      0 5.324767    1 135526.80       100
3  26      1 4.814127    5  51245.85        46
4  78      0 5.015744    5  95671.94        99
5  56      0 5.242842    2  97375.04        68
6  48      0 4.243953    1 100512.59        70

The cor() function allows us to calculate the correlation between two or more variables. My first example looks at only two variables and then visualizes them with a scatter plot.

# correlation for only two variables
cor(data$age, data$income)
[1] 0.9260033
# visualize the correlations through a scatter plot
ggplot(data, aes(x = age, y = income)) +
geom_point()

Since the correlation is about 0.93, we can conclude that there is a strong positive relationship between age and income: as age increases, income tends to increase as well. The scatter plot shows the same pattern.

Next, we will examine our full data set. This only works when every column is numeric; cor() does not accept non-numeric columns.

# correlations for all variables in the dataset
cor(data)
                  age      gender      height        kids      income
age        1.00000000  0.01613562  0.12926437 -0.04294040  0.92600327
gender     0.01613562  1.00000000 -0.24154026  0.02198794 -0.02963598
height     0.12926437 -0.24154026  1.00000000  0.01487277  0.09756631
kids      -0.04294040  0.02198794  0.01487277  1.00000000 -0.26536753
income     0.92600327 -0.02963598  0.09756631 -0.26536753  1.00000000
years_edu  0.98540459  0.08326801  0.11777500 -0.05497819  0.91578590
            years_edu
age        0.98540459
gender     0.08326801
height     0.11777500
kids      -0.05497819
income     0.91578590
years_edu  1.00000000
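If a data frame also contains non-numeric columns, one option is to keep only the numeric columns before calling cor(). The sketch below assumes a small made-up data frame called people (not the data set used above) with a text column:

```r
# Made-up toy data with one non-numeric column
people <- data.frame(
  name = c("Ana", "Ben", "Cho", "Dev", "Eli"),
  age = c(23, 35, 41, 52, 60),
  income = c(31000, 48000, 52000, 61000, 75000),
  stringsAsFactors = FALSE
)

# cor(people) would stop with an error because `name` is not numeric,
# so keep only the columns for which is.numeric() is TRUE
numeric_cols <- people[sapply(people, is.numeric)]

cor_people <- cor(numeric_cols)
print(round(cor_people, 2))
```

With the tidyverse loaded, select(people, where(is.numeric)) should give the same subset of columns.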

We can visualize the correlation matrix using the ggcorrplot() function.

# install.packages("ggcorrplot")
library(ggcorrplot) # for ggcorrplot()

# save correlation values as a variable
cor_data <- cor(data)

# plot the correlation values
ggcorrplot(cor_data)
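The default plot repeats every pair twice, because a correlation matrix is symmetric. As a hedged sketch, ggcorrplot() also accepts type = "lower" to show each pair once and lab = TRUE to print the values in the tiles. The data frame demo below is hypothetical stand-in data, and the plot is only drawn if the ggcorrplot package is installed:

```r
set.seed(123)

# Hypothetical stand-in data: d is built from a, so those two should correlate
demo <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
demo$d <- demo$a + rnorm(50, sd = 0.5)

# Round so the printed labels stay readable
cor_demo <- round(cor(demo), 2)
print(cor_demo)

# Draw the lower triangle only, with the correlation printed in each tile
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  p <- ggcorrplot::ggcorrplot(cor_demo, type = "lower", lab = TRUE)
  print(p)
}
```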