Wednesday, August 28, 2019

Statistics: R-Practice: 101 Years of Regional High Temperatures: Brunswick, Georgia

Statistics: R-Practice: 101 Years of Regional High Temperatures - Part 1
8/27/2019



From https://cran.cnr.berkeley.edu/ I downloaded the latest R version.

I downloaded RStudio from https://www.rstudio.com/products/rstudio/download/#download.

I also used OpenOffice from https://www.openoffice.org/.

R is the statistical program. RStudio is an interface that makes R easier to use. OpenOffice Calc is a spreadsheet program, and I used it to format large amounts of data that I could then import into RStudio.

Within RStudio, I installed the dplyr package as an option for testing and summarizing data, whether or not I reported such tests in this post. I installed the ggplot2 package for graph work. I installed gvlma as an easier way to test the assumptions of a regression. To activate the packages just installed, I clicked the check box beside each package name in the package list.
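If you prefer typing to clicking, the same installing and activating can be done from the R console. A minimal sketch, using the package names from this post:

#Install the packages (a one-time step) and load them (each session).
install.packages("dplyr")    #testing and summarizing data
install.packages("ggplot2")  #graph work
install.packages("gvlma")    #easier testing of regression assumptions
library(dplyr)
library(ggplot2)
library(gvlma)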
From https://w2.weather.gov/climate/ I clicked the Southeast Georgia, Northeast Florida region; the green dot near Jacksonville, Florida.

After clicking the region:


I then selected the NOWData Tab.
From NOWData I selected an area (example: Jacksonville Area), ticked the radio button for Monthly Summarized Data, entered the year range 1918-2018, and selected from the drop-down menus the Variable "Max temp" and the Summary "Daily Maximum".

Explanation of this data: the value provided is the highest temperature recorded during the month. A made-up example: if the temperature was 50F every day of January 1950 except for a 90F reading on the 15th, the data would report only that 90F high for the month.
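In code terms, each monthly value is just the max() of that month's daily readings. A tiny R sketch with made-up numbers mirroring the example above:

#Made-up daily highs for January 1950: 50F every day except the 15th.
jan_1950 <- rep(50, 31)
jan_1950[15] <- 90
max(jan_1950)  #returns 90, the only value the monthly summary keeps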


I then downloaded data for the region, from 10 different locations:
Brunswick, GA
Douglas, GA
Fargo, GA
Fernandina Beach, FL
Homerville, GA
Jacksonville, FL
Lake City, FL
Nahunta, GA
Sapelo Island, GA
Waycross, GA

To get to Sapelo Island, GA data, I went back and clicked in the area of the green dot for Savannah, Georgia.


I copied, pasted, and saved the data into a spreadsheet with OpenOffice Calc. For each location, I obtained the highest temperature recorded each month of each year, where the temperature was available.




From 1918 to 2018 is 101 total years. A temperature for each month allows for 1212 temperature readings per location (101*12=1212). Note: although Annual data was provided, I often omitted it.

Data that is not available, or missing, has an M designation.


In OpenOffice, I used formulas to count the amount of missing, "M", data for each location.

         Count  Formula Used
Missing    492  =COUNTIF(B2:M102;"M")
Present    720  =COUNT(B2:M102)
Total     1212  =SUM(P1:P2)

I calculated the amount and percentage of missing data for each location.

Location              Missing Data  % Missing
Brunswick, GA                   24       1.98
Douglas, GA                    292      24.09
Fargo, GA                      796      65.68
Fernandina Beach, FL            53       4.37
Homerville, GA                 492      40.59
Jacksonville, FL                 0       0.00
Lake City, FL                    7       0.58
Nahunta, GA                    514      42.41
Sapelo Island, GA              514      42.41
Waycross, GA                    95       7.84
Sum                           2787      23.00
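For example, Brunswick: 24 missing out of 1212 possible readings is 24/1212*100 ≈ 1.98%. The Sum row is out of all ten locations combined, 10*1212 = 12,120 possible readings: 2787/12120*100 ≈ 23.00%.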





Why is some data missing? That is unknown. Maybe the weather station was not yet in place, readings were lost, natural disasters happened, etc. Some stations are missing less data than others. I mention it here to evaluate the amount of data available, and to note that the results will not include missing data.

To help analyze the data, I needed to "clean it up" or "reformat" it. One such change was moving the Month columns into a single Month column, while keeping each entry with its year. This is easier said than done, and better methods may exist (one R alternative is sketched below), but I:
Selected the data in OpenOffice Calc.
Then went to Data > Pivot Table.
I moved the Year to the Row Fields.
I moved the Months to the Data Fields column.
The pivot table formatted the Year, Month, and Temperature closer to how I wanted.

I further cleaned up the table by filling out the year in the blank cells and setting the Months.


For me, RStudio had problems processing missing values, so I deleted the blank placeholder cells for missing values.
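As an aside, the same wide-to-long reshape and cleanup can be done in R instead of a pivot table. A minimal sketch, assuming a hypothetical data frame named wide with a Year column plus one column per month (Jan through Dec), matching the spreadsheet layout described above:

#Move the month columns into a single Month column, keeping each Year.
library(dplyr)
library(tidyr)

long <- wide %>%
  gather(key = "Month", value = "Temp", Jan:Dec) %>%  #months become one column
  filter(Temp != "M") %>%                             #drop the "M" missing markers
  mutate(Temp = as.integer(Temp))                     #text to numbers once the Ms are gone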

With the data formatted how I wanted it, I then saved the OpenOffice document as a .csv file.

I imported the .csv file with the data into RStudio. Once in RStudio, I could run some queries and generate some statistics.



What is the hottest temperature ever recorded for each of the ten locations?
Note my data is named p6.
I entered this into RStudio:
> p6 %>% 
+   group_by(City) %>% summarise(Temp = max(Temp))
It returned:
# A tibble: 10 x 2
   City           Temp
   <fct>         <int>
 1 Brunswick       106
 2 Douglas         106
 3 Fargo           105
 4 Fernandina      104
 5 Homerville      104
 6 Jacksonville    103
 7 Lake City       106
 8 Nahunta         104
 9 Sapelo Island   105
10 Waycross        108

Starting from scratch:
Import the data; in this example, it is labeled p6.
I then checked the tick boxes beside the dplyr, ggplot2, and gvlma packages.
These tasks can also be performed by entering code, as shown below.
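A minimal sketch of doing those same tasks in code; the file name here is hypothetical, so point read.csv at wherever you saved the .csv:

#Import the .csv data and activate the packages from the console.
p6 <- read.csv("p6.csv")  #hypothetical file name

library(dplyr)    #grouping and summarizing (%>%, group_by, summarise, filter)
library(ggplot2)  #graphing
library(gvlma)    #quick checks of regression assumptions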


The data, p6, looks like this:



#I'll start using this blue color to show code#

t10 <- p6 %>%
  group_by(City, Year) %>% summarise(Temp = max(Temp)) %>%
  filter(City == "Brunswick")

gg <- ggplot(t10, aes(Year, Temp, colour = Temp)) +
  geom_point(colour = "red") +           #red points at the default size
  #geom_smooth(method="loess", se=F)+    #disabled by the #; kept to show the option
  geom_smooth(method = "lm", se = F) +
  ylim(c(90, 110)) + xlim(c(1918, 2018)) +
  labs(subtitle="Brunswick 101 Year High Temperature",
       y="High Temperature",
       x="Year from 1918 to 2018",
       title="Southeast Regional High Temperature",
       caption="Source:https://w2.weather.gov/climate/")
plot(gg)

I'll try to translate. For the first part, I am creating data called "t10" from the data p6, which is adjusted %>% by grouping City and Year, adjusted %>% by using the summarise function on Temperature to find the maximum temperature, and adjusted %>% by a filter keeping information only on the city of Brunswick.

The second part creates the data gg by using the ggplot function on t10 for Year and Temperature, mapping the temperature to a colour. The geom_point function orders the graph to be a point graph; setting colour = "red" makes the points red, at the default point size. The #geom_smooth portion is disabled by the #; it is not used, and I left it there to show that. Next, geom_smooth is utilized with the method "lm", and that created the blue 'best fit' line that, in the graph, makes it look like the high temperatures are decreasing. ylim and xlim define where the y and x axes start and stop. labs puts the headings and subheadings around the graph. plot(gg) tells the program to show the created graph.


I then entered:
######################################
#Adds regression line statistics
fit1 <- lm(Temp ~ Year, data = t10)
summary(fit1)
######################################

Lines beginning with # are comments; they do nothing but show text. fit1 is data created with the lm function for Temperature as a function of Year, taken from the t10 data. The summary function displays the information generated for fit1:
Call:
lm(formula = Temp ~ Year, data = t10)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6279 -1.5277 -0.5232  1.3220  6.3493 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 108.695842  14.385123   7.556 2.12e-11 ***
Year         -0.004554   0.007309  -0.623    0.535    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.141 on 99 degrees of freedom
Multiple R-squared:  0.003907, Adjusted R-squared:  -0.006154 
F-statistic: 0.3883 on 1 and 99 DF,  p-value: 0.5346
This code provided the same result:
#This runs the Regression test.
#I believe the results provide the equation for the line.
linearMod <- lm(formula = Temp ~ Year, data = t10)
summary(linearMod)
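To read the coefficients as an equation, the fitted line is Temp = 108.695842 - 0.004554*Year. A minimal sketch of using the fitted model to see what that line predicts (numbers taken straight from the summary output above):

#Read off the fitted line and predict the yearly high at the data's endpoints.
coef(fit1)  #(Intercept) 108.695842, Year -0.004554
predict(fit1, newdata = data.frame(Year = c(1918, 2018)))
#Roughly 99.96 for 1918 versus 99.51 for 2018: about half a degree lower
#over the century, a difference the p-value (0.535) says could easily be chance.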


How did I know to do all this? Do you have more questions about what this all means?
Here is a citation with an explanation: http://r-statistics.co/Linear-Regression.html
It is a good read. It further mentions assumptions. Assumptions must be met if the statistics
shown are to mean anything useful.
Another citation: http://r-statistics.co/Assumptions-of-Linear-Regression.html

#1. The regression model must be linear in parameters.
#I guess look at it: does it look like a line? The blue line does to me.

#2. The mean of the residuals is zero. To test this:
#This tests for Assumption #2 of a regression,
#that the mean of the residuals is zero.
mean(fit1$residuals)
#If this number is super small, it is close to zero.

> mean(fit1$residuals)
[1] -9.510063e-17
The number is very small; it is close to zero, so this assumption is good to go.

######################################
#This tests for assumption #3:
#Homoscedasticity of residuals, or equal variance.
par(mfrow=c(2,2))  # set a 2-row by 2-column plot layout
mod_1 <- lm(Temp ~ Year, data = t10)  # linear model
plot(mod_1)
#4 plots should appear.
#If the red lines in the top-left and bottom-left plots are roughly flat,
#then homoscedasticity appears to hold.

######################################
#Assumption 4: no autocorrelation of residuals.
lmMod <- lm(Temp ~ Year, data = t10)
acf(lmMod$residuals)
#A graph will appear. If the lines rising from the center
#go beyond the blue dashed lines (a lot), then
#autocorrelation might be present.
#So staying within the blue lines is a good thing,
#meaning the assumption holds.

######################################
#Assumption 5:
#The X variables and residuals are uncorrelated.
lmMod3 <- lm(Temp ~ Year, data = t10)
cor.test(t10$Year, lmMod3$residuals)
#If the p-value is near 1, the assumption holds true.
#All is good. A small value near zero would be bad,
#as in not meeting the assumption.

######################################
#There are 5 more assumptions, but it got repetitive.

#I installed the gvlma package to run the following:
#This code checks assumptions for regression, the easy way.
#It even explains stuff.
par(mfrow=c(2,2))  # draw 4 plots in same window
mod <- lm(Temp ~ Year, data = t10)
gvlma::gvlma(mod)

So what does this double rainbow mean? I circled the p-values. They are all above 0.05, meaning nothing appears out of the ordinary. Do consult a textbook, but regressions try to make a line to predict where a point will fall... sort of. From our graph and the line, if another year was added on either end or squeezed in between the others, its data point would most likely be similar to the others.

Main Takeaway: the Year has no effect on the High Temperature. For this 101-year look, it appears the high temperatures have stayed the same. Once told this, people have asked me: so global warming isn't really happening? No, that is not what this is telling us. This is for high temperatures, for the past 101 years, taken from a weather station in Brunswick, GA. If you want to talk global warming, you need more than one location. Having more than 101 years of data can help too. The global warming discussion is not over, but noting what we are analyzing, this post alone would not prove or disprove it.

What this does kind of tell me is that in recent years, the hottest the temperature has gotten has been less than the average of the yearly highs over the past 101 years, but I just looked at the dots to come up with that, not the statistics. How can I use this data? I can go back to my Granny in Brunswick and tell her that a local weather station reports that since ~2010 the temperature has not gotten as high as it was in the 90s and early 2000s.
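If you wanted to check that eyeballed claim with numbers instead of dots, here is a minimal sketch reusing the t10 data from above:

#Compare the 101-year average of the yearly highs to the yearly highs since 2010.
mean(t10$Temp)                               #long-run average of the yearly highs
t10 %>% filter(Year >= 2010) %>% pull(Temp)  #each yearly high from 2010 on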


But then again, what do science and statistics know? Granny tells me every year that it is hotter than the previous year. She told me today that this year was hotter than ever!
Reenactment of what actually took place:


I do hope to make more posts on this data, including information for the whole region. Yet to begin, I started with just Brunswick, and this post became very long.

Thank you for viewing! Have a Great day!
Questions or comments, let me know!