Coding workshop: Week 2

using Quarto, data wrangling, visualizing uncertainty

tidyverse
janitor
read_csv
pipe operators
|>
clean_names
mutate
select
pivot_longer
group_by
summarize
sd
qt
deframe
ggplot
geom_histogram
geom_point
geom_errorbar
geom_pointrange
geom_line
Author
Affiliation
Modified

April 9, 2025

Workshop dates: April 10 (Thursday), April 11 (Friday)

1. Summary

Packages

  • tidyverse
  • janitor

Operations

  • read in data using read_csv()
  • chain functions together using |>
  • clean column names using clean_names()
  • create new columns using mutate()
  • select columns using select()
  • make data frame longer using pivot_longer()
  • group data using group_by()
  • summarize data using summarize()
  • calculate standard deviation using sd()
  • calculate t-values using qt()
  • expand data frames using deframe()
  • visualize data using ggplot()
  • create histograms using geom_histogram()
  • visualize means and raw data using geom_point()
  • visualize standard deviation, standard error, and confidence intervals using geom_errorbar() and geom_pointrange()
  • visualize trends through time using geom_point() and geom_line()

Data source

This week, we’ll work with data on seafood production types (aquaculture or capture). This workshop’s data comes from Tidy Tuesday 2021-10-12, which was from OurWorldinData.org.

2. Code

1. Set up

This section of code includes reading in the packages you’ll need: tidyverse and janitor.

You’ll also read in the data using read_csv() and store the data in an object called production.

# load in packages
library(tidyverse)
library(janitor)
# read in data
production <- read_csv("captured_vs_farmed.csv")
Note

Remember to look at the data before working with it! You can use View(production) in the console, or click on the production object in the Environment tab in the top right.

Before you start: think about what the differences might be between aquaculture and capture production in the context of fisheries. Which production type do you think would produce more seafood?

insert your best guess here

2. Cleaning up

The data comes in what’s called “wide format”, meaning that each row represents multiple observations. For example, the first row contains the production from Afghanistan (country code AFG) in 1969 for aquaculture and capture production.

We want to convert the data into “long format” so that it’s easier to work with. A dataset is in long format if each row represents an observation.

In this chunk of code, we’ll:

  1. clean the column names using clean_names()
  2. filter to only include the “entity” we want using filter()
  3. select the columns of interest using select()
  4. make the data frame longer using pivot_longer()
  5. manipulate the type column to change the long names (e.g. aquaculture_production_metric_tons) to short names (e.g. aquaculture) using mutate() and case_when()
  6. use mutate() to create a new column called metric_tons_mil
production_clean <- production |> # use the production data frame
  clean_names() |> # clean up column names
  filter(entity == "United States") |> # filter to only include observations from the US
  select(year, aquaculture_production_metric_tons, capture_fisheries_production_metric_tons) |> # select columns of interest 
  pivot_longer(cols = aquaculture_production_metric_tons:capture_fisheries_production_metric_tons, # choose columns to pivot
               names_to = "type", # name the column name with fishery type "type"
               values_to = "catch_metric_tons") |> # name the column name with the catch amount "catch_metric_tons"
  mutate(type = case_when( # mutate the existing type column
    type == "aquaculture_production_metric_tons" ~ "aquaculture", # when "aquaculture_production_metric_tons" appears in the "type" column, fill in "aquaculture"
    type == "capture_fisheries_production_metric_tons" ~ "capture" # when "capture_fisheries_production_metric_tons" appears in the "type" column, fill in "capture"
  )) |> 
  mutate(metric_tons_mil = catch_metric_tons/1000000) # convert catch in metric tons to millions

3. Making a boxplot/jitter plot

Last week in workshop, we made a boxplot. The boxplot shows useful summary statistics (displays the central tendency and spread), while the jitter plot shows the actual observations (in this case, each point is the catch for a fishery in a given year). One way to display the underlying data is to combine a boxplot with a jitter plot.

ggplot(data = production_clean, # start with the production_clean data frame
       aes(x = type, # x-axis should be type of production
           y = metric_tons_mil, # y-axis should be metric tons of production (in millions)
           color = type)) + # coloring by production type
  geom_boxplot() + # first layer should be a boxplot
  geom_jitter(width = 0.2, # making the points jitter horizontally
              height = 0) + # making sure points don't jitter vertically
  labs(x = "Type", # labelling the x-axis
       y = "Metric tons of production (in millions)") # labelling the y-axis

On average, a) which type of production produces more fish, and b) what components of the plot are you using to come up with your answer?

a) Capture production, because b) the median of the boxplot is way higher than the median for aquaculture

4. Making a histogram

This chunk of code creates a histogram. Note that for a histogram, you only need to fill in the aes() argument for the x-axis (x), not the y-axis. This is because ggplot() counts the number of observations in each bin for you.

Within the geom_histogram() call, you’ll need to tell R what number of bins you want using the bins argument. In this case (using the Rice Rule to determine the appropriate number of bins), we’ll use 10 bins.

ggplot(data = production_clean,
       aes(x = metric_tons_mil,
           fill = type)) + # fill the histogram based on the fishery type
  geom_histogram(bins = 10, # set the number of bins
                 color = "black") # make the border of the columns black

Can you tell from looking at the histogram which production type tends to produce more fish? Why or why not?

Yes. There are no observations for aquaculture at high catch (in millions); most of the observations for aquaculture are lower than 1 million tons, while capture production ranges up to 6 million tons.

5. Visualizing spread, variance, and confidence

a. Calculations

# calculate the confidence interval "by hand"
production_summary <- production_clean |> # start with the production_clean data frame
  group_by(type) |> # group by production type
  summarize(mean = mean(metric_tons_mil), # calculate the mean
            n = length(metric_tons_mil), # count the number of observations
            df = n - 1, # calculate the degrees of freedom
            sd = sd(metric_tons_mil), # calculate the standard deviation
            se = sd/sqrt(n), # calculate the standard error
            tval = qt(p = 0.05/2, df = df, lower.tail = FALSE), # find the t value
            margin = tval*se, # calculate the margin of error
            ci_lower = mean - tval*se, # calculate the lower bound of the confidence interval
            ci_higher = mean + tval*se # calculate the upper bound of the confidence interval
          ) 

production_summary
# A tibble: 2 × 10
  type         mean     n    df    sd     se  tval margin ci_lower ci_higher
  <chr>       <dbl> <int> <dbl> <dbl>  <dbl> <dbl>  <dbl>    <dbl>     <dbl>
1 aquaculture 0.324    59    58 0.149 0.0194  2.00 0.0389    0.285     0.363
2 capture     4.38     59    58 1.20  0.156   2.00 0.313     4.07      4.69 
# use a function to calculate the confidence interval
production_ci <- production_clean |>  
  group_by(type) |>  
  summarize(ci = mean_cl_normal(metric_tons_mil)) |>  # calculate the CI using a function
  deframe() # expand the data frame

production_ci
                   y     ymin     ymax
aquaculture 0.323932 0.285032 0.362832
capture     4.381083 4.068038 4.694128

When you compare the 95% CI from production_summary and production_ci, they should be about the same.

b. Visualizations

When visualizing the central tendency (in this case, mean) and spread (standard deviation) or variance (standard error) or confidence (confidence intervals), you can stack geoms on top of each other.

To visualize the mean, we’ll use geom_point(). Remember that geom_point() can be used for any plot you want to make that involves a point.

To visualize the spread/variance/confidence, we’ll use geom_errorbar(). This is the geom that creates two lines that can extend away from a point.

Standard deviation

First, we’ll visualize standard deviation.

ggplot(data = production_summary, # use the summary data frame
       aes(x = type, # x-axis should be production type
           y = mean, # y-axis should show the mean production
           color = type)) + # color the points by fishery type
  geom_point(size = 2) + # plot the mean
  geom_errorbar(aes(ymin = mean - sd, # plot the standard deviation
                    ymax = mean + sd),
                width = 0.1) + # make the bars narrower
  labs(title = "Standard deviation",
       x = "Type",
       y = "Mean and SD million metric tons production")

Standard error

Then, we want to visualize standard error.

ggplot(data = production_summary, # use the summary data frame
       aes(x = type, 
           y = mean, 
           color = type)) + # color the points by fishery type
  geom_point(size = 2) + # plot the mean
  geom_errorbar(aes(ymin = mean - se, # plot the standard error
                    ymax = mean + se),
                width = 0.1) +
  labs(title = "Standard error",
       x = "Type",
       y = "Mean and SE million metric tons production")

95% confidence interval

Then, we want to visualize the 95% confidence interval.

ggplot(data = production_summary, # use the summary data frame
       aes(x = type, 
           y = mean, 
           color = type)) + # color the points by fishery type
  geom_point(size = 2) + # plot the mean
  geom_errorbar(aes(ymin = mean - margin, # plot the margin of error
                    ymax = mean + margin),
                width = 0.1) +
  labs(title = "Confidence interval",
       x = "Type",
       y = "Mean and 95% CI million metric tons production")

c. geom_pointrange()

We can also visualize means and spread/variance/confidence intervals using the geom_pointrange() function.

ggplot(data = production_summary, # use the summary data frame
       aes(x = type, 
           y = mean, 
           color = type)) + # color the points by fishery type
  geom_pointrange(aes(ymin = mean - margin, 
                      ymax = mean + margin)) +
  labs(title = "Confidence interval (using geom_pointrange)",
       x = "Type",
       y = "Mean and 95% CI million metric tons production")

d. Visualizing with the underlying data

Lastly, we want to visualize the 95% confidence interval with the underlying data.

# base layer: ggplot
ggplot(data = production_clean,
       aes(x = type, 
           y = metric_tons_mil, 
           color = type)) +
  # first layer: adding data (each point shows an observation)
  geom_jitter(width = 0.1,
              height = 0,
              alpha = 0.4,
              shape = 21) +
  # second layer: means and 95% CI
  geom_pointrange(data = production_summary,
                  aes(x = type, 
                      y = mean, 
                      ymin = mean - margin, 
                      ymax = mean + margin)) +
  # changing appearance: colors, labels, and theme
  scale_color_manual(values = c("aquaculture" = "deeppink3",
                                "capture" = "slateblue4")) +
  labs(x = "Production type",
       y = "Million metric tons of production",
       color = "Production type",
       title = "Capture produces more than aquaculture") +
  theme_light()

There are lots of themes in ggplot to play around with. These are nice to use to get rid of the grey background that is the default, and generally make your plot look cleaner.

A list of built-in themes and their theme_() function calls is here.

e. Visualizing through time

Then, if we want to visualize production through time:

ggplot(data = production_clean,
       aes(x = year,
           y = metric_tons_mil,
           color = type,
           shape = type)) +
  geom_point() +
  geom_line() +
  scale_color_manual(values = c("aquaculture" = "deeppink3",
                                "capture" = "slateblue4")) +
  labs(x = "Year",
       y = "Million metric tons of production",
       color = "Production type",
       shape = "Production type") +
  theme_minimal()

END OF WORKSHOP 2