Sunday, February 26, 2017

Statistical Analysis of the Number of Seed Capsules from the Portulaca Measurements 1/30/2017

Seed capsules of Portulaca amilis.
After determining a method to evaluate and identify a difference in plant height between the P (potting soil) and M (50/50 potting soil and sand) groups, I will now attempt to analyze the seed pod measurements.

Here is the raw data table:
Tag   Type   # of Plants   Tallest Plant   # of Pods   Weight (nearest 5 g)
TP2   P      3             22.5            27          335
AP1   P      5             21.5            27          335
LP1   P      10            22              33          335
EP2   P      1             24              22          330
RP2   P      2             22.5            17          330
CP1   P      1             22              19          335
LP2   P      6             22              37          335
FP2   P      8             23              56          330
AP2   P      6             21              35          335
YP1   P      5             22              30          335
CP2   P      3             24              30          330
UP2   P      2             22.5            29          330
RP1   P      1             19              16          330
YP2   P      8             20              24          330
TP1   P      1             14.5            3           330
FP1   P      4             20.5            34          330
UP1   P      1             21              14          330
BP2   P      4             21.5            25          330
BP1   P      9             23.5            40          330
BM2   M      5             19.5            20          360
TM2   M      3             17.5            8           360
AM2   M      4             18              18          360
UM2   M      4             18.5            27          360
YM2   M      5             18.5            25          360
FM1   M      6             17.5            25          360
FM2   M      4             18.5            15          360
WM1   M      1             19              6           360
LM1   M      4             18              22          360
RM1   M      1             18.5            8           360
AM1   M      6             18              18          360
LM2   M      7             19              20          360
RM2   M      2             19.5            9           360
TM1   M      2             17.5            12          360
UM1   M      1             19.5            9           360
YM1   M      7             16.5            27          360
BM1   M      4             16.5            17          360

The main columns of interest are the Type and # of Pods columns. I need to get the data into a form I can enter into the R program. Instead of typing the vectors out by hand, I went to the http://astatsa.com/WilcoxonTest/ website and copied and pasted my values. With the click of a button, the site formatted my numbers, and I took the information I needed.
Portulaca amilis plants of the experiment. Left: plants grown in the potting soil (P) portion of the experiment. Right: plants grown in the 50/50 potting soil and sand (M) portion of the experiment.
P <- c(27, 27, 33, 22, 17, 19, 37, 56, 35, 30, 30, 29, 16, 24, 3, 34, 14, 25, 40)
M <- c(20, 8, 18, 27, 25, 25, 15, 6, 22, 8, 18, 20, 9, 12, 9, 27, 17)
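
For reference, the same vectors could also be pulled straight from the raw data table in R rather than through the website. This is a rough sketch; the file name pods.csv and its column names are assumptions, not something from my files:

# read the raw data table (pods.csv and its column names are hypothetical)
pods <- read.csv("pods.csv")
P <- pods$Pods[pods$Type == "P"]   # pod counts for the potting soil jars
M <- pods$Pods[pods$Type == "M"]   # pod counts for the 50/50 potting soil and sand jars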

Part 1. This is my work to figure out what to do. Part 1 is THE WRONG WAY.

I must check the assumptions of a parametric test. First, a test for equal variances (note that the R code is sensitive to capitalization):

var.test(P,M)

                F test to compare two variances
data:  P and M
F = 2.5685, num df = 18, denom df = 16,
p-value = 0.06375
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.9453428 6.7817415
sample estimates:
ratio of variances
            2.5685

From this result, my data is just barely within equal variances. Then a thought came to me: Levene's test. That was the name of the test I was instructed to use in my statistics class to check for equal variances. The process could not be straightforward; instead I had to take another detour that might change my results for the tallest plant. After investigating, I updated my tallest plant post to use this form of the equal variance test. Now to continue. For additional reading: https://www.r-bloggers.com/example-9-36-levenes-test-for-equal-variances/ and http://stats.stackexchange.com/questions/15722/how-to-use-levene-test-function-in-r.

To conform to the code in those examples, I changed P and M to sample1 and sample2:

# levene.test() comes from the lawstat package, which must be loaded first
library(lawstat)

sample1 <- c(27, 27, 33, 22, 17, 19, 37, 56, 35, 30, 30, 29, 16, 24, 3, 34, 14, 25, 40)
sample2 <- c(20, 8, 18, 27, 25, 25, 15, 6, 22, 8, 18, 20, 9, 12, 9, 27, 17)
y <- c(sample1, sample2)
group <- as.factor(c(rep(1, length(sample1)), rep(2, length(sample2))))
levene.test(y, group)

It returned:

                modified robust Brown-Forsythe Levene-type
                test based on the absolute deviations from
                the median
data:  y
Test Statistic = 1.3307, p-value = 0.2567

The results from Levene's test indicate the samples have equal variances. This differs from the var.test() result. I will accept Levene's test as the better method.

Next I check the data for normal distributions using shapiro.test(). Note sample1 is P, the data for potting soil samples; sample2 is M, the data for the 50/50 potting soil and sand samples.

shapiro.test(sample1)

                Shapiro-Wilk normality test
data:  sample1
W = 0.97128, p-value = 0.8017

shapiro.test(sample2)

                Shapiro-Wilk normality test
data:  sample2
W = 0.92503, p-value = 0.1796

Both sets of data meet the assumption of normal distribution.

The data meet both assumptions, equal variances and normal distribution, so a parametric test can be performed. Since I am comparing two groups of data that qualify for a parametric test, I use the two sample t test. I will mention that for tallest plants I had two groups of data that required a non-parametric test, which is why I used the Mann-Whitney U test there. The t test is as follows:

t.test(sample1,sample2, var.equal = TRUE)

                Two Sample t-test
data:  sample1 and sample2
t = 3.2435, df = 34, p-value = 0.002648
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  3.898681 16.980576
sample estimates:
mean of x mean of y
 27.26316  16.82353
Portulaca amilis plants from the potting soil (P) portion of the experiment.
The t.test results have a p-value of 0.002648, indicating the groups have a statistically significant difference. Another way to determine this: looking at a t table, going to the 0.05 column for a two-tailed test and down to the row for 34 degrees of freedom gives a critical value of about 2.03; my statistic of 3.2435 is larger than the critical value, and larger means a statistically significant difference.
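
As a quick check, base R's qt() function returns the same critical value:

# two-tailed critical t value at the 0.05 level with 34 degrees of freedom
qt(0.975, df = 34)
# about 2.03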

Some notes on the t test: https://www.cliffsnotes.com/study-guides/statistics/univariate-inferential-tests/two-sample-t-test-for-comparing-two-means. This thread provided me the explanation for fractional degrees of freedom and the fix for equal variances, "var.equal = TRUE": http://www.talkstats.com/showthread.php/49431-Fractional-degrees-of-freedom-when-running-t.test().

The above is incorrect because I have unequal sample sizes (see https://en.wikipedia.org/wiki/Student's_t-test, under the independent two-sample t test with unequal sample sizes and equal variance, for the equation).

This post is gold: http://daniellakens.blogspot.com/2015/01/always-use-welchs-t-test-instead-of.html, again from the master I referenced in my permits work. He explains why one should always use Welch's test. He also provides the citation that should be used when the test is applied: http://www.jstor.org/stable/2684403?seq=1#page_scan_tab_contents.

This is basically what I would have input before. I changed P and M to sample1 and sample2 so that I could perform Levene's test.

# levene.test() again requires the lawstat package
library(lawstat)

sample1 <- c(27, 27, 33, 22, 17, 19, 37, 56, 35, 30, 30, 29, 16, 24, 3, 34, 14, 25, 40)
sample2 <- c(20, 8, 18, 27, 25, 25, 15, 6, 22, 8, 18, 20, 9, 12, 9, 27, 17)
y <- c(sample1, sample2)
group <- as.factor(c(rep(1, length(sample1)), rep(2, length(sample2))))
levene.test(y, group)
shapiro.test(sample1)
shapiro.test(sample2)
t.test(sample1, sample2)   # without var.equal = TRUE, R runs Welch's test by default

After the literature review, the Levene’s test is not necessary.

PART 2. THE RIGHT WAY

P <- c(27, 27, 33, 22, 17, 19, 37, 56, 35, 30, 30, 29, 16, 24, 3, 34, 14, 25, 40)
M <- c(20, 8, 18, 27, 25, 25, 15, 6, 22, 8, 18, 20, 9, 12, 9, 27, 17)
shapiro.test(P)   # normality check for the potting soil data
shapiro.test(M)   # normality check for the 50/50 mix data
t.test(P, M)      # Welch's two sample t-test (R's default)
Portulaca amilis plants from the 50/50 potting soil and sand (M) portion of the experiment.
Performing the statistics by hand.
I followed the equation listed in https://en.wikipedia.org/wiki/Welch's_t-test, but I could not get it to match: I read it as using the sample variances where it actually uses s, the sample standard deviation (which then gets squared). Reading http://www.sthda.com/english/wiki/welch-t-test, I found I should use the standard deviations in the equation rather than the variances, and then my results started to match the R program. That page, though, had a flaw in how it calculates the degrees of freedom, which had me go back to the other website citation.

For each group of data I calculated the average, standard deviation, and variance. I also counted the sample size.

           Potting    50/50
           27         20
           27         8
           33         18
           22         27
           17         25
           19         25
           37         15
           56         6
           35         22
           30         8
           30         18
           29         20
           16         9
           24         12
           3          9
           34         27
           14         17
           25
           40

Average    27.26316   16.82353
Count      19         17
Stdev      11.42046   7.125967
Variance   130.4269   50.77941

Note: Excel does calculate the standard deviation "=STDEV()" and variance "=VAR()" correctly. I performed these by hand to double-check. The counts and averages are also calculated correctly.
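
The same summary statistics can also be checked in R with base functions (using the P and M vectors defined above):

mean(P); sd(P); var(P); length(P)   # 27.26316, 11.42046, 130.4269, 19
mean(M); sd(M); var(M); length(M)   # 16.82353, 7.125967, 50.77941, 17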

This is the equation for the Welch’s Test I obtained from http://www.sthda.com/english/wiki/welch-t-test:
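In symbols, with X̄ for the group means, s for the sample standard deviations, and n for the sample sizes, the test statistic is:

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \]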
 
Here is the Excel machine:
Mean 1 (X1)                    27.26316
Mean 2 (X2)                    16.82353
Std dev 1 (s1)                 11.42046
Std dev 2 (s2)                 7.125967
Sample size 1 (n1)             19
Sample size 2 (n2)             17

X1 - X2                        10.43963
s1^2 / n1                      6.864574
s2^2 / n2                      2.987024
Added                          9.851598
Square root                    3.138726
Test statistic (top / bottom)  3.326072

This test statistic matches that of the R Program.
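
The same arithmetic can be reproduced in R in one line (using the P and M vectors from above):

# Welch test statistic: difference in means over the standard error of the difference
(mean(P) - mean(M)) / sqrt(var(P)/length(P) + var(M)/length(M))
# about 3.326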

I calculated the degrees of freedom from the equation found at https://en.wikipedia.org/wiki/Welch's_t-test with help from http://www.sthda.com/english/wiki/welch-t-test. These equations are not easy, and both websites had flaws in their equations. Between the two of them, I was able to piece together the correct equation. I hope I wrote it correctly (this uses the same symbols as the previous equation I hand wrote):
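
Written out with the same symbols, the Welch-Satterthwaite degrees of freedom are:

\[ \mathrm{df} = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} \]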
Again I hope I wrote it correctly. I added my calculations to the machine:
Mean 1 (X1)                    27.26316
Mean 2 (X2)                    16.82353
Std dev 1 (s1)                 11.42046
Std dev 2 (s2)                 7.125967
Sample size 1 (n1)             19
Sample size 2 (n2)             17

X1 - X2                        10.43963
s1^2 / n1                      6.864574
s2^2 / n2                      2.987024
Added                          9.851598
Square root                    3.138726
Test statistic (top / bottom)  3.326072

Degrees of freedom
Top (added)                    9.851598
Top squared                    97.05398
Bottom left                    2.61791
Bottom right                   0.557645
Bottom (added)                 3.175554
df (top / bottom)              30.56285

To obtain a p-value, I tried Excel using "=T.DIST.2T()", which returned 0.0023349788208801. I then went to http://www.socscistatistics.com/pvalues/tdistribution.aspx and entered the t statistic and df I calculated above, selected the 0.05 significance level and two-tailed, and got 0.002301. I trust the website version more than Excel, since Excel's T.DIST.2T() truncates the degrees of freedom to a whole number, which likely explains the small difference. My results: t = 3.326072; df = 30.56285; p = 0.002301.
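
R can also produce this p-value directly from the t distribution, keeping the fractional degrees of freedom:

# two-tailed p-value for t = 3.326072 with df = 30.56285
2 * pt(-abs(3.326072), df = 30.56285)
# about 0.0023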

After completing the work by hand, I will use the R Program:

P <- c(27, 27, 33, 22, 17, 19, 37, 56, 35, 30, 30, 29, 16, 24, 3, 34, 14, 25, 40)
M <- c(20, 8, 18, 27, 25, 25, 15, 6, 22, 8, 18, 20, 9, 12, 9, 27, 17)
shapiro.test(P)
shapiro.test(M)
t.test(P,M)

The results are:

> shapiro.test(P)
                Shapiro-Wilk normality test
data:  P
W = 0.97128, p-value = 0.8017

> shapiro.test(M)
                Shapiro-Wilk normality test
data:  M
W = 0.92503, p-value = 0.1796

> t.test(P,M)
                Welch Two Sample t-test
data:  P and M
t = 3.3261, df = 30.563, p-value =
0.002301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  4.034442 16.844815
sample estimates:
mean of x mean of y
 27.26316  16.82353

Looks good to me.
Portulaca amilis plants of the experiment. Left: plants grown in the potting soil (P) portion of the experiment. Right: plants grown in the 50/50 potting soil and sand (M) portion of the experiment.
Conclusion:
Plants grown in potting soil had more seed capsules than those grown in a 50/50 mixture of potting soil and sand (potting soil mean of 27.3 capsules per jar, n = 19; 50/50 potting soil and sand mixture mean of 16.8 capsules per jar, n = 17; Welch's two sample t test: t = 3.326072, df = 30.56285, p = 0.002301).
Figure. Seed capsules per jar by soil type (values are significantly different: Welch’s unpaired t test: t=3.326072, df=30.56285, p = 0.002301; error bars are ± one standard error).

How to read the results: Looking at the bars, potting soil had more seed capsules. The standard error bars do not touch, so there could be a significant difference, which is confirmed in the text and caption with a p-value below 0.05.
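For anyone wanting to reproduce a figure like this, here is a rough sketch in base R using the P and M vectors from above (not necessarily how the original figure was made):

# group means and standard errors
means <- c(`Potting soil` = mean(P), `50/50 mix` = mean(M))
ses   <- c(sd(P) / sqrt(length(P)), sd(M) / sqrt(length(M)))

# bar plot with +/- one standard error bars
mids <- barplot(means, ylab = "Seed capsules per jar",
                ylim = c(0, max(means + ses) * 1.2))
arrows(mids, means - ses, mids, means + ses, angle = 90, code = 3, length = 0.1)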

Thank you for viewing. Comment if you want. Have a nice day!