Simple Linear Regression

Welcome

In the previous lab we learned about correlation. We visualised relationships between different types of variables, and we computed one correlation measure for two numeric variables, Pearson’s correlation. We used Pearson’s correlation to summarise the strength and direction of a linear association between two numeric variables. However, it has some limitations: it can only be used for numeric variables, it handles only one pair of variables at a time, and it is only appropriate for describing a linear relationship.

Linear regression can overcome some of these limitations, and it can be extended for further purposes, such as including multiple variables or making predictions for new scenarios. In fact, it is one of the most widely used techniques in quantitative research in the social sciences. Getting familiar with it will therefore be important not only for performing your own analyses, but also for reading and critically interpreting the academic literature.

Linear regression is appropriate only when the dependent variable is numeric (interval/ratio).

However, the independent variables can be categorical, ordinal, or numeric, and you can include more than one independent variable to evaluate how they relate to the dependent variable. In this lab we will start by using only one explanatory variable. This is known as simple linear regression.

Tip: Don’t Panic

We will be covering linear regression over three weeks. Do not worry if everything this week doesn’t immediately make sense. We will revisit and build upon key ideas over labs 9 and 10 as well.

Please also do let me (Alasdair) know if anything is unclear or confusing. We have been updating the lab content this year based on student feedback from previous years, and we value any feedback on the newest version so we can continue to make improvements.

As they cover some key concepts relevant for the summative assignment, I will aim to make further updates this year to Labs 8 - 10 based on any initial feedback.

Setup

R Markdown

For this lab, we will continue using the 2012 NILT survey to introduce linear models.

As usual, we will need to set up an R Markdown file for today’s lab. Remember, if you are unsure about any of the steps below you can look back at Lab 6, which explains each of the steps in more detail.

To set up a new R Markdown file for this lab, please use the following steps:

  1. Please go to your ‘Lab Group ##’ in Posit Cloud (log in if necessary);
  2. Open your own copy of the ‘NILT’ project from the ‘Lab Group ##’;
  3. Within the Files tab (bottom-right pane) click ‘New File’, then ‘R Markdown’ from the drop-down list of options;
  4. Within the ‘Create a New File in Current Directory’ dialogue, name it ‘Lab-8-Simple-Linear-Regression.Rmd’ and click OK.
  5. Feel free to make any adjustments to the YAML header.
  6. Create a new code chunk, with ```{r setup, include=FALSE} and in the chunk:
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
  7. Create another code chunk, named preamble and again with include=FALSE. Then add the following in the chunk:
# Load the packages
library(tidyverse)
library(haven)

# Read NILT
nilt <- readRDS("data/nilt_r_object.rds")

Please note, we are loading the haven package again this week as we will be coercing more variables.

  8. Run the preamble code chunk; if there are no errors, then everything is set up for this session.

Dataframe Setup

Let’s revisit the example from last week of the relationship between a respondent’s age and their spouse or partner’s age.

Firstly, let’s coerce the variables for the age of the respondent’s spouse/partner:

# Age of Respondent's Spouse / Partner
nilt$spage <- as.numeric(nilt$spage)

Bonus Activity: We are coercing a variable again each week as a reminder of how to do it. However, we could remove the need to coerce variables we are re-using across each R Markdown file. In Lab 3 we created the 01_prep_nilt_2012.R script in a sub-folder named R. Reviewing it, can you see where you could add a line of code to coerce spage? If so, modify your script, rerun it in full, then re-run the preamble code chunk in your R Markdown file for this week. After doing that, you’ll know you have successfully modified the script if you can run class(nilt$spage) in the Console and receive "numeric" as the output.

OK, we are going to use scatterplots to help visualise concepts related to linear regression. However, the NILT has hundreds of respondents, which makes for very noisy plots. So, next let’s create a minimal sample of the dataset that includes only 40 random observations:

# set a seed for randomisation
set.seed(3)

# Filter where partner's age is not NA and take a random sample of 40
nilt_sample <- nilt |>
  filter(!is.na(spage)) |>
  sample_n(40)

# Select only respondent's age and spouse/partner's age
nilt_sample <- nilt_sample |>
  select(rage, spage)

Please note, we are creating a nilt_sample dataframe with a random selection of 40 observations - i.e. respondents - from the nilt data solely to make the examples below easier to visualise.

To break down some of the code here:

  • The set.seed() function fixes the starting point of the random number generator. The numbers drawn are still effectively random, but using the same seed and running the code in the same order makes the results reproducible.
  • We then filter(!is.na(spage)). ! is the “NOT” operator, and is.na() checks whether a value is NA (i.e. missing data). So, combined it is equivalent to saying “filter to remove all observations where the value for spage is NA”.
  • sample_n() is then used to randomly select 40 of the observations that remain.
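To see what this sample looks like, we can draw a quick scatterplot with a best-fit line, following the same ggplot pattern as last week (a minimal sketch; the variable names match the nilt_sample we just created):

# Scatterplot of the 40 sampled observations with a linear best-fit line
ggplot(nilt_sample, aes(x = rage, y = spage)) +
  geom_point() +
  geom_smooth(method = 'lm')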

Formal specification

Now you are ready for the formal specification of the linear model.

After the introduction above, it is easy to take the pieces apart. In essence, the simple linear model below tells us that the dependent value \(y_i\) is defined by a line that intersects the vertical axis at \(\beta_0\), plus the slope \(\beta_1\) multiplied by the value of \(x_i\), plus some residual \(\varepsilon_i\). The second part of the equation tells us that the residuals are normally distributed (that is, a histogram of the residuals follows a bell-shaped curve).

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \]
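For example, if \(\beta_0 = 2\), \(\beta_1 = 0.9\) and a particular observation has \(x_i = 30\), the line part of the model gives \(2 + 0.9 \times 30 = 29\); the residual \(\varepsilon_i\) is simply the gap between that value and the observed \(y_i\). (These numbers are made up purely for illustration.)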

You do not need to memorise this equation, but being familiar with this type of notation will help you to expand your skills in quantitative research.

Fitting linear regression in R

R syntax

We already fitted our first linear model using the lm() function above, which of course stands for linear model, but let’s go back through it in relation to the formal specification.

The first argument is a formula: the dependent variable comes first, then a tilde ~ expresses that it is modelled by the independent variable that follows. The data set is then specified as the second argument, separated by a comma, as shown below:

Example Code - do not run
lm(dependent_variable ~ independent_variable, data)
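To make this concrete, the model referenced later as m1 could be fitted to our 40-observation sample as follows (a sketch, assuming m1 was fitted on nilt_sample as in the earlier example):

# Fit a simple linear model of partner's age on respondent's age
m1 <- lm(spage ~ rage, data = nilt_sample)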

Running a simple linear regression

Now, let’s try with the full nilt data set.

First, we will fit a linear model using the same variables as the example above but including all the observations - remember that before we were working only with a small random sample of 40 observations.

The first argument is the formula: the respondent’s spouse/partner age spage, followed by a tilde ~ and the respondent’s age rage. Let’s assign the model to an object called m2 (for ‘model 2’) and then print it.

m2 <- lm(spage ~ rage, data = nilt)
m2

Call:
lm(formula = spage ~ rage, data = nilt)

Coefficients:
(Intercept)         rage  
     3.7039       0.9287  

The output first displays the formula employed to estimate the model, under Call:. Then, it shows us the estimated values for the intercept and the slope for the age of the respondent; these estimated values are called coefficients.

In this model, the coefficients differ from those in the example above (m1). This is because, now that we are using the full nilt dataframe, we have more information to fit the line. Despite the difference, we see that the relationship is still in the same direction (positive). In fact, the value of the slope \(\beta_1\) did not change much if you look at it carefully (0.87 vs 0.92).

Interpreting results

But what are the slope and intercept telling us? The first thing to note is that the slope is positive, which means that the respondent’s age and their spouse/partner’s age move in the same direction: when the age of the respondent goes up, the age of their spouse/partner is expected to go up as well.

In our model, the intercept does not have a meaningful interpretation on its own. What we mean by saying it is not meaningful is that it would not be logical to say that when the respondent’s age is 0 their partner would be expected to be 3.70 years old. Instead, the interpretations should be made only within the range of the values used to fit the model (the youngest age in this data set is 18).

The slope can be interpreted as follows: for every year older the respondent is, the partner’s age is expected to change by 0.92 years. To put it more simply, for each 1-year increase in the respondent’s age, the expected spouse/partner age increases by 0.92 years. Because the slope is slightly below 1, the predicted partner age rises slightly more slowly than the respondent’s own age.

In this case, the units of the dependent variable are the same as those of the independent variable (age vs age), which makes it easy to interpret. In reality, there might be more factors that potentially affect how people select their partners, such as gender, education, race/ethnicity, etc. For now, we are keeping things ‘simple’ by using only these two variables.
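If you want R to compute an expected value for you instead of doing the arithmetic by hand, base R’s predict() function can be applied to the fitted model. A minimal sketch (the age of 40 is an arbitrary illustrative value):

# Predicted spouse/partner age for a respondent aged 40
predict(m2, newdata = data.frame(rage = 40))

# By hand: 3.7039 + 0.9287 * 40 is roughly 40.85 years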

Let’s practice with another example. What if we are interested in income?

We can use annual personal income persinc2 as the dependent variable and the number of hours worked a week rhourswk as the independent variable. We’ll assign the result to an object called m3 (for ‘model 3’).

m3 <- lm(persinc2 ~ rhourswk, data = nilt)
m3

Call:
lm(formula = persinc2 ~ rhourswk, data = nilt)

Coefficients:
(Intercept)     rhourswk  
     5170.4        463.2  

This time, the coefficients look quite different. The results tell us that, for every additional hour worked a week, a respondent is expected to earn £463.20 more a year.

Note that the units are not the same in the interpretation: the dependent variable is in pounds earned a year and the independent variable is in hours worked a week. Does this mean that someone who works 0 hours a week is expected to earn £5170.40 a year (given the fitted model \(\widehat{\text{persinc2}} = 5170.4 + 463.2 \times 0\)), or that someone who works 110 hours a week (highly implausible!) is expected to earn around £56K a year? We should only make reasonable interpretations of the coefficients within the range of values analysed.
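If you would rather check this kind of calculation in R than by hand, coef() extracts the estimated coefficients from a fitted model. A small sketch reproducing the implausible 110-hour extrapolation discussed above:

# Extract the estimated coefficients of m3 (intercept and slope)
b <- coef(m3)

# Extrapolated prediction at 110 hours a week: about £56K
b[1] + b[2] * 110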

Finally, an important thing to keep in mind when interpreting linear models fitted to cross-sectional data (data collected at a single time point) is that correlation does not imply causation. Variables are often correlated with other, non-observed variables which may be what actually causes the changes observed in the dependent variable. This is why we cannot always be sure that X is causing the observed changes in Y.

Model evaluation

Model summary

So far, we have talked about the basics of the linear model. If we print a summary of the model using the handy summary() function, we can obtain more information to evaluate it:

summary(m2)
Call:
lm(formula = spage ~ rage, data = nilt)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.211  -2.494   0.075   2.505  17.578 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.70389    0.66566   5.564    4e-08 ***
rage         0.92869    0.01283  72.386   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.898 on 589 degrees of freedom
  (613 observations deleted due to missingness)
Multiple R-squared:  0.8989,    Adjusted R-squared:  0.8988 
F-statistic:  5240 on 1 and 589 DF,  p-value: < 2.2e-16

The summary output has several parts. This can look intimidating at first, but we can break it down bit-by-bit. Let’s focus on:

  • first the call
  • second the residuals
  • third the coefficients
  • fourth the goodness-of-fit measures

These are highlighted in the screenshot below:

[Screenshot: model summary output with the four parts highlighted]

The first part, Call, simply prints the formula and variables that were used to produce the model.

The second part, the Residuals, provides a small summary of the residuals by quartile, including the minimum and maximum residual values. Remember the second part of the formal specification of the model? Well, this is where we can see the distribution of the residuals in practice. The quartiles show typical prediction errors: half of the residuals lie between the 1st and 3rd quartiles. So, by looking at the values for the first and third quartiles, we can see that half of the model’s predictions are within about \(\pm\) 2.5 years of the observed partner’s age… not that bad.
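You can reproduce this part of the output yourself: applying summary() to the residuals returns the same quartile summary (plus the mean). A quick sketch:

# Quartile summary of the residuals of m2 (plus the mean)
summary(residuals(m2))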

The third part is about the coefficients. This time, the summary provides more information about the intercept and the slope than just the estimated coefficients as before. Usually, we focus on the independent variables in this section, rage in our case. The second column, the standard error, tells us how large the error of the slope estimate is, which is small for our variable. The third and fourth columns provide measures of the statistical significance of the relationship between the dependent and independent variable.

The most commonly reported is the p-value (the fourth column). R uses scientific notation for very small p-values: in our result the p-value is <2e-16, which means a number less than 0.0000000000000002 (2e-16 is scientific notation for \(2 \times 10^{-16}\)).

As a generally agreed convention, if the p-value is equal to or less than 0.05 (\(p \leq 0.05\)), we can reject the null hypothesis and say that the relationship between the dependent variable and the independent variable is significant. In our case, the p-value is far lower than 0.05, so we can say that the relationship between the respondent’s spouse/partner age and the respondent’s age is significant. In fact, R includes significance codes for each coefficient to help us identify the significance at a glance. In our result, we got three stars/asterisks because the p-value is less than 0.001; R assigns two stars if the p-value is less than 0.01, and one star if it is less than 0.05. (P‑values, however, are just one part of the story; always read them alongside effect size, confidence intervals, and substantive meaning.)
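Since confidence intervals are part of that story, it is worth knowing that base R’s confint() returns them for each coefficient. A minimal sketch:

# 95% confidence intervals for the intercept and slope of m2
confint(m2, level = 0.95)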

In the fourth area, we have various goodness-of-fit measures. Overall, these measures tell us how well the model represents our data. For now, let’s focus on the adjusted R-squared.

R-squared is the proportion of variation in the outcome explained by the model. It ranges from 0 to 1: 0 means none of the variance is explained; 1 means all variance in the sample is explained by the model. Adjusted R-squared penalises adding variables that do not improve the model. In the social sciences, modest R-squared values are common.

Adjusted R-squared, then, can be understood as the proportion of variance explained by the model (multiply it by 100 to report a percentage), adjusted for the number of independent variables used. In our age-age example, the model explains 89.88% of the total variance. This is a good fit, but that is because we are modelling a quite “obvious” relationship. The R-squared describes the model fit; it does not judge the quality of our analysis.
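If you want to extract this value directly rather than reading it off the printed output, it is stored in the model summary. A one-line sketch:

# Pull the adjusted R-squared out of the model summary
summary(m2)$adj.r.squared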

You may wonder why we do not report the sum of squared residuals (SSR), as we covered in our initial example. This is because the SSR is expressed in the units used to fit the model, which would make it difficult to compare our model with other models that use different units or different variables. Conversely, the adjusted R-squared is expressed in relative terms, which makes it a more suitable measure for comparing models. An important thing to note is that the output tells us that this model excluded 613 observations which were missing in our dataset. This is because we do not have the age of the respondent’s spouse/partner in 613 of the rows. Missing spouse/partner ages can occur for several reasons; for example, the respondent may not have a partner, or may choose not to answer.
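You can verify these counts yourself. As a brief sketch: sum(is.na()) counts missing values, and nobs() returns the number of observations a model actually used. (Note that lm() also drops rows where rage is missing, so the first count need not match the 613 deleted observations exactly.)

# How many respondents have a missing spouse/partner age?
sum(is.na(nilt$spage))

# How many observations were used to fit m2?
nobs(m2)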

In academic papers in the social sciences, usually the measures reported are the coefficients, p-values, the adjusted R-squared, and the size of the sample used to fit the model (number of observations).

Further checks

The linear model, like many other techniques in statistics, relies on assumptions. These are characteristics of the data that are taken for granted when generating the estimates. It is the task of the modeller/analyst to make sure the data used meets these assumptions. One simple yet important check is to examine the distribution of the residuals, which ought to follow a normal distribution.

Do the residuals follow a normal distribution in our model, m2? We can plot a histogram to evaluate this graphically. Use the function residuals() to extract the residuals from our object m2, and then plot a histogram with the base R function hist().

hist(residuals(m2), breaks = 20)

Overall, it seems that the residuals follow the normal distribution reasonably well, with the exception of the extreme negative values to the left of the plot. Very often, when the distribution of the residuals looks strange or departs from the normal, it is because one of the assumptions is violated. If that is the case, we cannot trust the coefficient estimates. In the next lab we will talk more about the assumptions of the linear model. For now, we will leave it here and get some practice.
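As a small preview of next lab, another common graphical check is the normal Q-Q plot: if the residuals are normally distributed, the points fall close to the reference line. A minimal sketch using base R:

# Q-Q plot of the residuals of m2 against a normal distribution
qqnorm(residuals(m2))
qqline(residuals(m2))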

Lab activities

In your R Markdown file, use the nilt dataset to:

  1. Plot a scatter plot using ggplot. In the aesthetics, put rhourswk on the X axis and persinc2 on the Y axis. In geom_point(), jitter the points by specifying position = 'jitter'. Also, include the best-fit line using the geom_smooth() function, specifying method = 'lm' inside.
  2. Print the summary of m3 using the summary() function.
  3. Is the relationship between hours worked a week and annual income statistically significant?
  4. What is the adjusted R-squared? How would you interpret it?
  5. What is the sample size used to fit the model?
  6. What is the expected income in pounds a year for a respondent who works 30 hours a week, according to the coefficients of this model?
  7. Plot a histogram of the residuals of m3 using the residuals() function inside hist(). Do the residuals look normally distributed (as in a bell-shaped curve)?
  8. Discuss your answers with your neighbour or tutor.