Due on Friday April 25 (Week 4) at 11:59 PM
Read the instructions carefully and double check that you have everything on the checklist.
Part 1. Tasks
Remember that you are not expected to turn anything in for tasks, but you should complete all the material.
Task 1. Set up your folders and Rproject.
a. Create a new folder for this homework assignment within your ENVS-193DS
folder.
Within your ENVS-193DS
folder, create a new folder for Homework 2. Name it whatever you want (a logical name could be homework-02
).
b. Download the files from Canvas
Download the homework files from Canvas into your homework folder. This includes:
- the homework template
sbpm.csv
c. Create an Rproject for this homework assignment
Create an Rproj file within your homework-02
folder. If you need help with this, watch the “Creating an Rproject” video on Canvas.
d. In the template, write code to load in the packages you need.
You will probably need to use the tidyverse
, and potentially janitor
.
e. Read the Canvas data into R.
Store this as an object called sbpm
.
Task 2. Read about the Environmental Protection Agency’s Outdoor Air Quality data.
a. Learn about PM 2.5
Read the EPA’s description of particulate matter to understand the definition of PM 2.5.
b. Read about the EPA’s data collection
Read the EPA’s description of their Air Data to understand the source of the data you’ll be working with in Problem 2.
Task 3. Enter your data for your personal data project.
a. Create a spreadsheet to enter your data.
Two good options include Google Sheets or Microsoft Excel. This will be the spreadsheet that you continually update with data for new observations, so store it in a logical place so that you can find it later.
b. Create the columns of your spreadsheet.
If you have organized your data sheet in “long” format, in which each row is an observation, then your spreadsheet columns will be the same as your data sheet columns. If not, that’s ok; just make sure you’re entering your data in a way that makes sense based on how you collected it.
c. Enter your data.
Double check your values!
d. Save your spreadsheet as a .csv file in your homework-02
folder.
e. Read your data into R.
Include the code to to this at the top of the template. You will want to do this the same way you read any data into R: by creating a new object (you could call this my_data
) and using the left arrow operator to store and read in the data using read_csv()
.
You are now ready to start your homework!
Part 2. Problems
Problem 1. Burrowing owl abundance (12 points)
Managers at Dangermond Reserve are interested in the return of burrowing owls to the reserve. You conduct weekly surveys for burrowing owls from October to February and collect the following data on the number of burrowing owls you see each week:
\[ 0, 2, 0, 3, 1, 4, 1, 2 \]
- What kind of data did you collect, and why? Explain in 1-2 sentences. (2 points)
- What is a better description of the variability in burrowing owl count, standard deviation or standard error? Explain why in 1 sentence. Calculate the metric of your choice, showing your work. Round your final answer to 1 decimal point, and show the correct units. (5 points)
- What is a better description of the uncertainty in burrowing owl count, standard deviation or standard error? Explain why in 1 sentence. Calculate the metric of your choice, showing your work. Round your final answer to 1 decimal point, and show the correct units. (5 points)
Problem 2. Fire and particulate matter (96 points)
In 2017-2018, the Thomas Fire burned 281,893 acres in Santa Barbara County and Ventura County. At the time, it was the largest wildfire in California history. The fire started on December 4th and was fully contained on January 12, 2018. You’ll be working with EPA data on particulate matter during the fire.
In this problem, you will visualize the data two different ways. You will then answer the question: from the start of the Thomas Fire on December 4th until it was contained on January 12th, was there a difference in PM2.5 between Goleta and Santa Barbara?
Create a line graph with date on the x-axis and PM 2.5 on the y-axis. Color each line by the location of the sensor (hint: there should be 5 locations). Label the x-axis, y-axis, and legend colors. (18 points)
Create a new object called
gol_sb
to filter thesbpm
data frame to only include observations from Goleta and Santa Barbara. Only show the code; do not display the data frame. (2 points)In one sentence, write your hypotheses to answer the question: from the start of the fire until when it was contained, was there a difference in PM2.5 between Goleta and Santa Barbara? in biological terms (not statistical or mathematical terms). Make sure you have a null and alternative hypothesis. (2 points)
Using the
gol_sb
object, create a boxplot with jittered points, with location on the x-axis and PM 2.5 on the y-axis. Make sure the jittered points do not move along the y-axis. Label the x and y-axes. Additionally:- color by site and change the colors from the default
- make sure each site has a different shape
- use a ggplot theme that is not the default
- make sure the legend is not showing
(26 points)
- color by site and change the colors from the default
Using the
gol_sb
object, make a QQ plot. Make sure there are two panels for each location. You do not need to label the x and y-axes. (10 points)In one sentence, describe whether the variable (PM 2.5) is normally distributed or not. Use visual components (e.g. shape, distribution) of the QQ plot you made to justify your characterization of the variable. (4 points)
Check your variances using
var.test()
. In one sentence, describe whether the groups have equal variances or not. (4 points)Do a t-test using
t.test()
. (4 points)
Double check your arguments to make sure you’re running the right test.
In one sentence each, describe:
- Why a t-test would have been appropriate for testing your hypothesis in part a
- How you evaluated normality and homogeneity of variance
- If the variable was not normally distributed, how you justified using a t-test
(6 points)
- Why a t-test would have been appropriate for testing your hypothesis in part a
Describe the results in 1-2 full sentences in your own words. Make sure to include the:
- Test you ran
- Number of observations for each location
- Significance level
- Degrees of freedom
- Test statistic
- p-value
(20 points)
- Test you ran
Round any numbers with decimals to two decimal points.
You will be graded on how you synthesize information. See lecture notes for an example of how to summarize the results of a statistical test in parentheses.
Problem 3. Personal data (24 points)
By now, you have some observations on your data sheet for your personal data. Even though it’s early on in your data collection, it’s a good idea to practice good data management. For this problem, you’ll enter your data, read it into R, and create a visualization. If you get stuck at any step, you’ll know there’s something you need to fix.
- Create a visualization with a categorical predictor variable on the x-axis and your response variable on the y-axis. Label the x- and y-axes. (8 points)
- Create a visualization with a continuous predictor variable on the x-axis and your response variable on the y-axis. Label the x- and y-axes. (8 points)
- In 2-5 sentences, describe what insights (if any) you can gain about your data from visualizations like these. Once you collect more data and update these figures, would these insights change? (4 points)
- In 2-5 sentences, describe the process of getting your data from your spreadsheet into R. Did you encounter any challenges? If so, why do you think those challenges arose, and how did you fix them? If not, why do you think your system for collecting your data worked? (4 points)
If you found that entering your data from your spreadsheet and getting it into the right format to be used in R was challenging, that’s ok! This happens a lot with data collection. Feel free to change your data sheet so that you’re collecting data in a way that makes your life easier as you’re reading your data into R and using it.
Problem 4. Statistical critique (24 points)
Check the Google sheet and choose a paper to use for your critique based on An’s/Thuy-Tien’s/Grace’s recommendations. Answer the following questions about the paper in 1-2 sentences each:
- Why were you interested in this paper? (4 points)
- What questions/hypotheses are the authors addressing? (4 points)
- Which statistical test (from Homework 1) is included in this paper? What is the response variable? What is the predictor variable? (4 points)
- How does this test address the main question(s) presented by the authors? For example, how would the authors interpret a “significant” result? (4 points)
- Find the figure(s) and/or table(s) in the paper that are associated with the statistical tests from question (c). If there are multiple figures and tables relating to the statistical test, find the best one that demonstrates the relationship betwee the predictor and the response. Take a screenshot, and insert it into your document. 4 points
If your screenshot for part e is not visible, you will not receive points for part f.
- If you have a figure: describe the x- and y-axes, and what the figure is supposed to show (i.e. what is the main message of the figure)? If you have a table, what are the rows and columns, and what is in each cell of the table? (4 points)
Double check your assignment!
Your assignment should:
Your responses should include:
Lastly, check out the rubric on Canvas to see the point breakdown in more detail.
171 points total
General formatting points
You will only receive full points for annotations if you have:
- comments on each line of visualization code and/or ggplot geom/theme call (not needed for each argument, though good to have)
- comments for each argument of a test call (e.g.
var.test()
,t.test()
)
You will only receive full points for readability if:
- all messages/warnings are hidden
- all code is contained in code chunks (double check line breaks in comments once you render your document)
- all text is where it’s supposed to be (headers and main text show up correctly)