Week 7 figures - Lectures 12 and 13

linear models
Author
Affiliation
Lecture date

May 12, 2025

Modified

May 8, 2025

0. Set up

Code
# cleaning
library(tidyverse)

# visualization
theme_set(theme_classic() +
            theme(panel.grid = element_blank(),
                  axis.text = element_text(size = 18),
                  axis.title = element_text(size = 18),
                  text = element_text(family = "Lato")))
library(patchwork)
library(ggeffects)
library(flextable)
library(modelsummary)
library(gtsummary)
library(readxl)
library(here)
library(janitor)

# analysis
library(car)
library(performance)
library(broom)

1. Math

a. Residuals

Residuals are the difference between the actual observed value (\(y_i\)) and the model prediction (\(\hat{y}\)) at some value of \(x\).

\[ residual = y_i - \hat{y} \]

Ordinary least squares minimizes the sum of squares of the residuals.

b. Model equations

OLS gives you an equation for a line.

$$ \[\begin{align} y &= b + mx \\ y &= \beta_0 + \beta_1x + \epsilon \\ \end{align}\] $$ \(\beta\) terms (“betas”) are often referred to as “model coefficients”.

c. Mathematical hypothesis

Statistically, the hypotheses are:

H_0_: the predictor variable does not predict the response
H_A_: the predictor variable does predict the response

Mathematically, you might express that as:

\[ H_0: \beta_1 = 0 \\ H_A: \beta_1 \neq 0 \]

d. R2

\[ \begin{align} R^2 &= 1 - \frac{\sum_{i = 1}^{n}(y_i - \hat{y})^2}{\sum_{i = 1}^{n}(y_i - \bar{y})^2} \\ &= 1 - \frac{SS_{residuals}}{SS_{total}} \end{align} \]

e. Pearson’s correlation

formula for Pearson’s correlation

\[ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i-\bar{x})^2}\sqrt{\sum(y_i - \bar{y})^2}} \]

test statistic for pearson correlation

\[ \begin{align} t &= \frac{r\sqrt{n - 2}}{\sqrt{1-r^2}} \\ df &= n -2 \end{align} \]

2. R2

Code
df <- tibble(
  x = seq(from = 1, to = 20, by = 1),
  r2_1 = 3*x + 1,
  r2_between = runif(n = 20, min = 1, max = 5)*x + runif(n = 20, min = 1, max = 5),
  r2_0 = runif(n = 20, min = 1, max = 20)
)

lm(r2_1 ~ x, data = df) |> 
  summary()

Call:
lm(formula = r2_1 ~ x, data = df)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.676e-15 -1.312e-15 -5.240e-17  8.432e-16  7.220e-15 

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 1.000e+00  1.142e-15 8.757e+14   <2e-16 ***
x           3.000e+00  9.532e-17 3.147e+16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.458e-15 on 18 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 9.905e+32 on 1 and 18 DF,  p-value: < 2.2e-16
Code
ggplot(df,
       aes(x = x,
           y = r2_1)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm")

Code
lm(r2_between ~ x, data = df) |> 
  summary()

Call:
lm(formula = r2_between ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.344  -8.201   0.994   9.517  32.971 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0684     6.6511   0.762  0.45591    
x             2.5942     0.5552   4.672  0.00019 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.32 on 18 degrees of freedom
Multiple R-squared:  0.5481,    Adjusted R-squared:  0.523 
F-statistic: 21.83 on 1 and 18 DF,  p-value: 0.0001896
Code
ggplot(df,
       aes(x = x,
           y = r2_between)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm",
              se = FALSE)

Code
lm(r2_0 ~ x, data = df) |> 
  summary()

Call:
lm(formula = r2_0 ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2426 -1.7958  0.6555  1.7988  6.0232 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0045     1.6667   3.003 0.007641 ** 
x             0.6143     0.1391   4.415 0.000334 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.588 on 18 degrees of freedom
Multiple R-squared:  0.5199,    Adjusted R-squared:  0.4932 
F-statistic: 19.49 on 1 and 18 DF,  p-value: 0.0003342
Code
ggplot(df,
       aes(x = x,
           y = r2_0)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm",
              se = FALSE)

3. Red abalone example

This example is inspired in by Hamilton et al. 2022.

We work with the linear model using the real data in class, but it’s hard to see the difference in diagnostic plots between a “good” vs “bad” linear model with that data set.

This is the fake data generated to compare different diagnostic plots.

a. generating the data

Code
set.seed(666)
abalone <- tibble(
  temperature = seq(from = 10, to = 18, by = 1),
  growth1 = runif(length(temperature), min = -0.7, max = -0.63)*temperature + runif(length(temperature), min = 15, max = 17),
  growth2 = runif(length(temperature), min = -0.7, max = -0.63)*temperature + runif(length(temperature), min = 15, max = 17),
  growth3 = runif(length(temperature), min = -0.7, max = -0.63)*temperature + runif(length(temperature), min = 15, max = 17)
) |> 
  pivot_longer(cols = growth1:growth3,
               names_to = "rep",
               values_to = "growth") |> 
  mutate(growth = round(growth, digits = 1)) |> 
  select(temperature, growth) |> 
  rename(x = temperature,
         y = growth)

# look at your data:
head(abalone, 10)
# A tibble: 10 × 2
       x     y
   <dbl> <dbl>
 1    10   9.1
 2    10   9.6
 3    10   9.7
 4    11   9  
 5    11   8.8
 6    11   8.8
 7    12   7.5
 8    12   9.1
 9    12   8.5
10    13   6.3
Code
ggplot(data = abalone,
       aes(x = x,
           y = y)) +
  geom_point() 

Seems like there is a linear relationship between temperature and abalone growth. As temperature increases, abalone growth decreases.

b. fitting a model

Code
abalone_model <- lm(
  y ~ x,
  data = abalone
)

# just checking DHARMa residuals just in case
DHARMa::simulateResiduals(abalone_model, plot = TRUE)

Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
 
Scaled residual values: 0.328 0.604 0.656 0.62 0.548 0.472 0.228 0.976 0.716 0.052 0.496 0.528 0.128 0.132 0.92 0.352 0.684 0.152 0.98 0.764 ...
Code
# base R residuals
par(mfrow = c(2, 2))
plot(abalone_model)

c. looking at model coefficients

Code
summary(abalone_model)

Call:
lm(formula = y ~ x, data = abalone)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15370 -0.50093  0.06019  0.34491  1.24630 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16.40926    0.69633   23.57  < 2e-16 ***
x           -0.69722    0.04891  -14.25 1.65e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6562 on 25 degrees of freedom
Multiple R-squared:  0.8904,    Adjusted R-squared:  0.8861 
F-statistic: 203.2 on 1 and 25 DF,  p-value: 1.65e-13
Code
# common way of representing model summaries
# from flextable package
flextable::as_flextable(abalone_model)

Estimate

Standard Error

t value

Pr(>|t|)

(Intercept)

16.409

0.696

23.565

0.0000

***

x

-0.697

0.049

-14.254

0.0000

***

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 0.6562 on 25 degrees of freedom

Multiple R-squared: 0.8904, Adjusted R-squared: 0.8861

F-statistic: 203.2 on 25 and 1 DF, p-value: 0.0000

Code
# better table
flextable::as_flextable(abalone_model) |> 
  set_formatter(values = list("p.value" = function(x){ # special function to represent p < 0.001
    z <- scales::label_pvalue()(x)
    z[!is.finite(x)] <- ""
    z
  }))

Estimate

Standard Error

t value

Pr(>|t|)

(Intercept)

16.409

0.696

23.565

<0.001

***

x

-0.697

0.049

-14.254

<0.001

***

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 0.6562 on 25 degrees of freedom

Multiple R-squared: 0.8904, Adjusted R-squared: 0.8861

F-statistic: 203.2 on 25 and 1 DF, p-value: 0.0000

Code
# somewhat more customizable way
# from modelsummary package
modelsummary(abalone_model)
(1)
(Intercept) 16.409
(0.696)
x -0.697
(0.049)
Num.Obs. 27
R2 0.890
R2 Adj. 0.886
AIC 57.8
BIC 61.7
Log.Lik. -25.899
F 203.189
RMSE 0.63
Code
# better table
modelsummary(list("Abalone model" = abalone_model), # naming the model
             fmt = 2, # rounding digits to 2 decimal places
             estimate = "{estimate} [{conf.low}, {conf.high}] ({p.value})", # customizing appearance
             statistic = NULL, # not displaying standard error
             gof_omit = 'DF|AIC|BIC|Log.Lik.|RMSE') # taking out some extraneous info
Abalone model
(Intercept) 16.41 [14.98, 17.84] (<0.01)
x -0.70 [-0.80, -0.60] (<0.01)
Num.Obs. 27
R2 0.890
R2 Adj. 0.886
F 203.189
Code
# using gtsummary package
tbl_regression(abalone_model,
               intercept = TRUE)
Characteristic Beta 95% CI p-value
(Intercept) 16 15, 18 <0.001
x -0.70 -0.80, -0.60 <0.001
Abbreviation: CI = Confidence Interval
Code
# more customizing
tbl_regression(abalone_model, # model object
               intercept = TRUE) |> # show the intercept
  as_flex_table() # turn it into a flextable (easier to save)

Characteristic

Beta

95% CI

p-value

(Intercept)

16

15, 18

<0.001

x

-0.70

-0.80, -0.60

<0.001

Abbreviation: CI = Confidence Interval

Code
anova(abalone_model)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)    
x          1 87.501  87.501  203.19 1.65e-13 ***
Residuals 25 10.766   0.431                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

d. visualizing the model

Code
model_preds <- ggpredict(
  abalone_model,
  terms = "x"
)

# look at the output:
model_preds
# Predicted values of y

 x | Predicted |     95% CI
---------------------------
10 |      9.44 | 8.96, 9.92
11 |      8.74 | 8.34, 9.14
12 |      8.04 | 7.71, 8.37
13 |      7.35 | 7.07, 7.62
14 |      6.65 | 6.39, 6.91
15 |      5.95 | 5.67, 6.23
16 |      5.25 | 4.92, 5.58
18 |      3.86 | 3.38, 4.34
Code
ggpredict(abalone_model, 
          terms = "x[18]")
# Predicted values of y

 x | Predicted |     95% CI
---------------------------
18 |      3.86 | 3.38, 4.34
Code
# plotting without 95% CI
ggplot(abalone, # using the actual data
       aes(x = x, # x-axis
           y = y)) + # y-axis
  geom_point(color = "cornflowerblue", # each point is an individual abalone
             size = 3) +
  
  # model prediction: actual model line
  geom_line(data = model_preds, # model prediction table
            aes(x = x, # x-axis
                y = predicted), # y-axis
            linewidth = 1)  # line width

Code
# plotting
ggplot(abalone, # using the actual data
       aes(x = x, # x-axis
           y = y)) + # y-axis
  
  # plot the data first
  # each point is an individual abalone
  geom_point(color = "cornflowerblue",
             size = 3) + 
  
  # model prediction: 95% CI
  geom_ribbon(data = model_preds, # model prediction table
              aes(x = x, # x-axis
                  y = predicted, # y-axis
                  ymin = conf.low, # lower bound of 95% CI
                  ymax = conf.high), # upper bound of 95% CI
              alpha = 0.2) + # transparency) 
  
  # model prediction: actual model line
  geom_line(data = model_preds, # model prediction table
            aes(x = x, # x-axis
                y = predicted), # y-axis
            linewidth = 1) # line width

Code
# compare with:
ggplot(abalone,
       aes(x = x,
           y = y)) +
  geom_point() +
  geom_smooth(method = "lm")

e. outliers

Code
abalone |> 
  mutate(outlier = ifelse(row_number() %in% c(19, 21, 27), "yes", "no")) |> 
  ggplot(aes(x = x,
             y = y)) +
  geom_point(aes(color = outlier), 
             size = 3) + 
  geom_line(data = model_preds,
            aes(x = x,
                y = predicted),
            linewidth = 1) +
  scale_color_manual(values = c("yes" = "red", "no" = "cornflowerblue")) +
  theme(legend.position = "none")

Code
# new abalone data frame
abalone2 <- abalone |> 
  slice(-c(19, 21, 27))

abalone_model2 <- lm(y ~ x,
                    data = abalone2)

summary(abalone_model2)

Call:
lm(formula = y ~ x, data = abalone2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.01641 -0.44359  0.08359  0.34163  1.05272 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16.81772    0.60270   27.90  < 2e-16 ***
x           -0.73087    0.04336  -16.85 4.61e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.537 on 22 degrees of freedom
Multiple R-squared:  0.9281,    Adjusted R-squared:  0.9249 
F-statistic: 284.1 on 1 and 22 DF,  p-value: 4.606e-14
Code
model_preds2 <- ggpredict(abalone_model2, terms = "x")

ggplot(data = abalone2,
       aes(x = x,
           y = y)) +
  geom_point(color = "cornflowerblue",
             size = 3) +
  geom_line(data = model_preds2,
            aes(x = x,
                y = predicted), 
            linewidth = 1)

f. actual data

Data from: Hamilton et al. 2022. Aquaculture. “Integrated multi-trophic aquaculture mitigates the effects of ocean acidification: Seaweeds raise system pH and improve growth of juvenile abalone.”

Thank you to Scott for being willing to share this data!

Code
abalone <- read_xlsx(here("data", "Abalone IMTA_growth and pH.xlsx")) |> 
  clean_names()

# creating clean dataset
abalone_clean <- abalone |> # start with abalone object
  # select columns of interest
  select(mean_p_h, change_in_area_mm2_d_1_25, recirculation_treatment) |> 
  # rename columns
  rename(ph = mean_p_h, 
         change_in_area = change_in_area_mm2_d_1_25,
         recirculation = recirculation_treatment) |> 
  # recod recirculation
  mutate(recirculation = case_when(
    recirculation == 0 ~ "0% recirculation",
    recirculation == 0.3 ~ "30% recirculation",
    recirculation == 0.65 ~ "65% recirculation"
  ))

# base layer: ggplot
ggplot(data = abalone_clean,
       aes(x = ph,
           y = change_in_area)) +
  # first layer: points representing abalones
  geom_point(size = 4,
             stroke = 1,
             fill = "firebrick3",
             shape = 21)

Code
abalone_clean |> 
  select(ph, change_in_area) |> 
  slice_sample(n = 5)
# A tibble: 5 × 2
     ph change_in_area
  <dbl>          <dbl>
1  7.95           5.40
2  8.23           5.74
3  8.15           5.48
4  8.14           6.02
5  7.95           5.47
Code
abalone_model <- lm(
  change_in_area ~ ph,  # formula: change in area as a function of pH
  data = abalone_clean  # data frame: abalone_clean
)

DHARMa::simulateResiduals(abalone_model) |> 
  plot()

Code
par(mfrow = c(2, 2))   # creating a 2x2 grid
plot(abalone_model)    # plot diagnostic plots

Code
# more information about the model
summary(abalone_model)

Call:
lm(formula = change_in_area ~ ph, data = abalone_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42714 -0.29282  0.08769  0.21258  0.43965 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -18.5010     6.1601  -3.003  0.01488 * 
ph            2.9827     0.7596   3.926  0.00348 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3063 on 9 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.6314,    Adjusted R-squared:  0.5905 
F-statistic: 15.42 on 1 and 9 DF,  p-value: 0.003477
Code
# creating a new object called abalone_preds
abalone_preds <- ggpredict(
  abalone_model,      # model object
  terms = "ph"        # predictor (in quotation marks)
)

# display the predictions
abalone_preds
# Predicted values of change_in_area

  ph | Predicted |     95% CI
-----------------------------
7.94 |      5.19 | 4.83, 5.54
7.95 |      5.21 | 4.86, 5.55
8.06 |      5.53 | 5.30, 5.76
8.10 |      5.67 | 5.46, 5.88
8.10 |      5.67 | 5.46, 5.88
8.14 |      5.77 | 5.55, 5.98
8.15 |      5.81 | 5.59, 6.04
8.33 |      6.36 | 5.92, 6.80
Code
# find predicted growth at ph = 7.9
ggpredict(
  abalone_model,
  terms = "ph[7.9]"
)
# Predicted values of change_in_area

  ph | Predicted |     95% CI
-----------------------------
7.90 |      5.06 | 4.65, 5.48
Code
# find predicted growth at ph = 7.9
ggpredict(
  abalone_model,
  terms = "ph[8.23]"
)
# Predicted values of change_in_area

  ph | Predicted |     95% CI
-----------------------------
8.23 |      6.05 | 5.75, 6.34
Code
# find predicted growth at ph = 8
ggpredict(
  abalone_model,
  terms = "ph[8]"
)
# Predicted values of change_in_area

ph | Predicted |     95% CI
---------------------------
 8 |      5.36 | 5.08, 5.64
Code
# model summary in a neat table
as_flextable(abalone_model)

Estimate

Standard Error

t value

Pr(>|t|)

(Intercept)

-18.501

6.160

-3.003

0.0149

*

ph

2.983

0.760

3.926

0.0035

**

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 0.3063 on 9 degrees of freedom

Multiple R-squared: 0.6314, Adjusted R-squared: 0.5905

F-statistic: 15.42 on 9 and 1 DF, p-value: 0.0035

Code
# model summary in a neat table
tbl_regression(abalone_model,
               intercept = TRUE,
               label = list(`(Intercept)` = "Intercept", 
                            `ph` = "pH")) |> 
  as_flex_table() 

Characteristic

Beta

95% CI

p-value

Intercept

-19

-32, -4.6

0.015

pH

3.0

1.3, 4.7

0.003

Abbreviation: CI = Confidence Interval

Code
# look at the column names
colnames(abalone_preds)
[1] "x"         "predicted" "std.error" "conf.low"  "conf.high" "group"    
Code
# look at the "class" (i.e. "type") of object
class(abalone_preds)
[1] "ggeffects"  "data.frame"
Code
# base layer: ggplot
# using clean data frame
ggplot(data = abalone_clean,
       aes(x = ph,
           y = change_in_area)) +
  # first layer: points representing abalones
  geom_point(size = 4,
             stroke = 1,
             fill = "firebrick3",
             shape = 21) +
  # second layer: ribbon representing confidence interval
  # using predictions data frame
  geom_ribbon(data = abalone_preds,
              aes(x = x,
                  y = predicted,
                  ymin = conf.low,
                  ymax = conf.high),
              alpha = 0.1) +
  # third layer: line representing model predictions
  # using predictions data frame
  geom_line(data = abalone_preds,
            aes(x = x,
                y = predicted),
            linewidth = 1) +
  # axis labels
  labs(x = "pH", 
       y = expression("Change in shell area ("*mm^{2}~d^-1*")"))

Code
ggplot(data = abalone_clean,
       aes(x = ph,
           y = change_in_area)) +
  # first layer: points representing abalones
  geom_point(size = 4,
             stroke = 1,
             fill = "firebrick3",
             shape = 21) +
  geom_smooth(method = "lm")

4. Exponential growth example

To compare with abalone example

Code
df_ex <- tibble(
  x = seq(from = 10, to = 18, length = 27),
  y = c(15, 15, 15, 
        15, 14.3, 14.2, 
        14.1, 14, 13.9,
        13.9, 13.8, 13.7,
        13.2, 13.1, 13,
        12.5, 11.1, 10,
        9.9, 7, 5,
        3, 1.7, 1,
        0.5, 0.3, 0.1)
) 

lm_ex <- lm(y ~ x, data = df_ex)

summary(lm_ex)

Call:
lm(formula = y ~ x, data = df_ex)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2540 -2.0605 -0.4009  2.3533  3.6288 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  38.5833     2.6772   14.41 1.29e-13 ***
x            -2.0329     0.1885  -10.79 6.82e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.347 on 25 degrees of freedom
Multiple R-squared:  0.8231,    Adjusted R-squared:  0.816 
F-statistic: 116.3 on 1 and 25 DF,  p-value: 6.823e-11
Code
ggplot(data = df_ex,
       aes(x = x,
           y = y)) +
  geom_point()

Code
lm_pred <- ggpredict(lm_ex, terms = ~x)

ex_plot_noline <- ggplot(df_ex, aes(x= x, y = y)) +
  geom_point(size = 3, color = "orange") +
  theme_classic() +
  theme(text = element_text(size = 14))

ex_plot_noline

Code
ex_plot <- ggplot(df_ex, aes(x= x, y = y)) +
  geom_point(size = 3, color = "orange") +
  geom_line(data = lm_pred, aes(x = x, y = predicted), linewidth = 1) +
  theme(text = element_text(size = 14)) 

ex_plot

Code
par(mfrow = c(2, 2))
plot(lm_ex)

Code
DHARMa::simulateResiduals(lm_ex) |> plot()

Code
# if using quarto, don't label chunk with a table... so weird
anova_tbl <- broom::tidy(anova(model1)) |> 
  mutate(across(where(is.numeric), ~ round(.x, digits = 2))) |> 
  mutate(p.value = case_when(
    p.value < 0.001 ~ "< 0.001"
  )) 

flextable(anova_tbl) |> 
  set_header_labels(term = "Term", 
                    df = "Degrees of freedom", 
                    sumsq = "Sum of squares", 
                    meansq = "Mean squares", 
                    statistic = "F-statistic", 
                    p.value = "p-value") |> 
  set_table_properties(layout = "autofit", width = 0.8)

5. Temperature/elevation example

Code
# data
sonadora <- read_csv(here::here("data", "knb-lter-luq.183.877108", "Temp_SonadoraGradient_Daily.csv"))

# cleaning
sonadora_clean <- sonadora |> 
  janitor::clean_names() |> 
  pivot_longer(cols = plot_250:plot_1000,
               names_to = "plot_name",
               values_to = "temp_c") |> 
  separate_wider_delim(cols = plot_name,
                       delim = "_",
                       names = c("plot", "elevation_m"),
                       cols_remove = FALSE) |> 
  select(-plot) |> 
  mutate(elevation_m = as.numeric(elevation_m))

# summarizing
sonadora_sum <- sonadora_clean |> 
  group_by(plot_name, elevation_m) |> 
  reframe(mean_temp_c = mean(temp_c, na.rm = TRUE)) |> 
  arrange(elevation_m)

# model
model <- lm(mean_temp_c ~ elevation_m, data = sonadora_sum)

# visualization
ggplot(data = sonadora_sum, 
       aes(x = elevation_m,
           y = mean_temp_c)) +
  geom_point(size = 3) +
  labs(x = "Elevation (m)",
       y = "Temperature (°C)")

Code
# diagnostics from DHARMa
DHARMa::simulateResiduals(model, plot = TRUE)

Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
 
Scaled residual values: 0.484 0.476 0.264 0.764 0.472 0.36 0.336 0.396 0.564 0.44 0.548 0.792 0.856 0.948 0.604 0
Code
# summary
summary(model)

Call:
lm(formula = mean_temp_c ~ elevation_m, data = sonadora_sum)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.67288 -0.07577  0.00022  0.09393  0.37042 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 24.7807991  0.1719438  144.12  < 2e-16 ***
elevation_m -0.0055446  0.0002581  -21.48 4.08e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.238 on 14 degrees of freedom
Multiple R-squared:  0.9706,    Adjusted R-squared:  0.9684 
F-statistic: 461.4 on 1 and 14 DF,  p-value: 4.075e-12
Code
tidy(model)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 24.8      0.172        144.  1.32e-23
2 elevation_m -0.00554  0.000258     -21.5 4.08e-12
Code
# model predictions
ggpredict(model) |> plot(show_data = TRUE)

Code
# diagnostics
par(mfrow = c(2, 2))
plot(model)

6. correlation

a. abalone correlation

Code
cor.test(abalone_clean$ph, abalone_clean$change_in_area,
         method = "pearson")

    Pearson's product-moment correlation

data:  abalone_clean$ph and abalone_clean$change_in_area
t = 3.9265, df = 9, p-value = 0.003477
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3721101 0.9443471
sample estimates:
      cor 
0.7946121 

b. correlation but different slopes

Code
corr_df <- tibble(
  x = seq(from = 1, to = 10, by = 0.5),
  y1 = 0.25*x,
  y2 = 1*x
) 

corr_0.25 <- ggplot(data = corr_df,
                    aes(x = x,
                        y = y1)) +
  geom_point(size = 3) +
  scale_y_continuous(limits = c(0, 10))

corr_1 <- ggplot(data = corr_df,
                 aes(x = x,
                     y = y2)) +
  geom_point(size = 3) +
  scale_y_continuous(limits = c(0, 10))

corr_0.25 + corr_1

c. no correlation but clear relationship

Code
x_lm <- seq(from = 1, to = 30, length.out = 50)
# y = a( x – h) 2 + k
df_para <- cbind(
  x = x_lm,
  y = 0.1*(x_lm - 15)^2 + 12
) |> 
  as_tibble()

ggplot(data = df_para, 
       aes(x = x, 
           y = y)) +
  geom_point(size = 3) 

Code
cor.test(df_para$x, df_para$y, method = "pearson")

    Pearson's product-moment correlation

data:  df_para$x and df_para$y
t = 0.90749, df = 48, p-value = 0.3687
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1540408  0.3939806
sample estimates:
      cor 
0.1298756 

exponential growth example

Code
x_ex <- seq(from = 5, to = 9, length = 30)

y_ex <- exp(x_ex)

df_ex <- cbind(
  x = x_ex,
  y = -exp(x_ex)
) |> 
  as_tibble()

lm_ex <- lm(y ~ x, data = df_ex)

lm_ex

Call:
lm(formula = y ~ x, data = df_ex)

Coefficients:
(Intercept)            x  
       9431        -1642  

model summary

Code
summary(lm_ex)

Call:
lm(formula = y ~ x, data = df_ex)

Residuals:
    Min      1Q  Median      3Q     Max 
-2756.1  -660.5   226.3   843.2  1081.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9431.3     1111.3   8.486 3.16e-09 ***
x            -1642.0      156.5 -10.492 3.30e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1023 on 28 degrees of freedom
Multiple R-squared:  0.7972,    Adjusted R-squared:   0.79 
F-statistic: 110.1 on 1 and 28 DF,  p-value: 3.298e-11

model plots

Code
lm_pred <- ggpredict(lm_ex, terms = ~x)

ex_plot_noline <- ggplot(df_ex, aes(x= x, y = y)) +
  geom_point(shape = 17, size = 3, color = "orange") +
  theme_classic() +
  theme(text = element_text(size = 14))

ex_plot <- ggplot(df_ex, aes(x= x, y = y)) +
  geom_point(shape = 17, size = 3, color = "orange") +
  geom_line(data = lm_pred, aes(x = x, y = predicted), linewidth = 1) +
  theme_classic() +
  theme(text = element_text(size = 14))

diagnostic plots

Code
par(mfrow = c(2, 4))
plot(model1, which = c(1), col = "cornflowerblue", pch = 19)
plot(lm_ex, which = c(1), col = "orange", pch = 17)
plot(model1, which = c(2), col = "cornflowerblue", pch = 19)
plot(lm_ex, which = c(2), col = "orange", pch = 17)
plot(model1, which = c(3), col = "cornflowerblue", pch = 19)
plot(lm_ex, which = c(3), col = "orange", pch = 17)
plot(model1, which = c(5), col = "cornflowerblue", pch = 19)
plot(lm_ex, which = c(5), col = "orange", pch = 17)
dev.off()

Citation

BibTeX citation:
@online{bui2025,
  author = {Bui, An},
  title = {Week 7 Figures - {Lectures} 12 and 13},
  date = {2025-05-12},
  url = {https://spring-2025.envs-193ds.com/lecture/lecture_week-07.html},
  langid = {en}
}
For attribution, please cite this work as:
Bui, An. 2025. “Week 7 Figures - Lectures 12 and 13.” May 12, 2025. https://spring-2025.envs-193ds.com/lecture/lecture_week-07.html.