Table of Contents
  1. Basic Statistical Analysis
    1. Descriptive Statistics
      (1) Overview of Methods
      (2) Descriptive Statistics by Group
    2. Frequency and Contingency Tables
      (1) Generating Frequency Tables
      (2) Tests of Independence
      (3) Measures of Association
    3. Correlation
      (1) Types of Correlation
      (2) Significance Tests for Correlations
    4. t-Tests
      (1) Independent-Samples t-Test
      (2) Dependent-Samples t-Test
    5. Nonparametric Tests of Group Differences
      (1) Two Groups
      (2) More Than Two Groups

Basic Statistical Analysis

1. Descriptive Statistics

(1) Overview of Methods

> # summary
> vars <- c("mpg", "hp", "wt")  # the variables of interest (defined earlier in the session)
> new_var<-mtcars[vars]
> summary(new_var)
mpg hp wt
Min. :10.40 Min. : 52.0 Min. :1.513
1st Qu.:15.43 1st Qu.: 96.5 1st Qu.:2.581
Median :19.20 Median :123.0 Median :3.325
Mean :20.09 Mean :146.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:180.0 3rd Qu.:3.610
Max. :33.90 Max. :335.0 Max. :5.424
>
> # fivenum() returns Tukey's five-number summary: minimum, lower hinge, median, upper hinge, and maximum
> sapply(new_var, fivenum)
mpg hp wt
[1,] 10.40 52 1.5130
[2,] 15.35 96 2.5425
[3,] 19.20 123 3.3250
[4,] 22.80 180 3.6500
[5,] 33.90 335 5.4240
>
> # a custom function to compute skewness and kurtosis
> mystats<-function(x,na.omit=FALSE){
+ if(na.omit)
+ x<-x[!is.na(x)]
+ m<-mean(x)
+ n<-length(x)
+ s<-sd(x)
+ skew<-sum(((x-m)/s)^3)/n
+ kurt<-sum(((x-m)/s)^4)/n-3
+ return(c(n=n, mean=m,stdev=s,skew=skew,kurtosis=kurt))
+ }
>
> sapply(new_var, mystats)
mpg hp wt
n 32.000000 32.0000000 32.00000000
mean 20.090625 146.6875000 3.21725000
stdev 6.026948 68.5628685 0.97845744
skew 0.610655 0.7260237 0.42314646
kurtosis -0.372766 -0.1355511 -0.02271075
>
> # the third-party packages Hmisc, pastecs, and psych provide many descriptive-statistics functions
>
> library(Hmisc)
> # describe() returns the numbers of variables and observations, the numbers of missing and unique values, the mean, quantiles, and the five highest and five lowest values
> describe(new_var)
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61 -0.37 1.07
hp 2 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73 -0.14 12.12
wt 3 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42 -0.02 0.17
>
> # the pastecs function stat.desc() computes a wide range of descriptive statistics
> library(pastecs)
> stat.desc(new_var)
mpg hp wt
nbr.val 32.0000000 32.0000000 32.0000000
nbr.null 0.0000000 0.0000000 0.0000000
nbr.na 0.0000000 0.0000000 0.0000000
min 10.4000000 52.0000000 1.5130000
max 33.9000000 335.0000000 5.4240000
range 23.5000000 283.0000000 3.9110000
sum 642.9000000 4694.0000000 102.9520000
median 19.2000000 123.0000000 3.3250000
mean 20.0906250 146.6875000 3.2172500
SE.mean 1.0654240 12.1203173 0.1729685
CI.mean.0.95 2.1729465 24.7195501 0.3527715
var 36.3241028 4700.8669355 0.9573790
std.dev 6.0269481 68.5628685 0.9784574
coef.var 0.2999881 0.4674077 0.3041285
>
> # the psych function describe() computes the number of non-missing values, the mean, standard deviation, median, trimmed mean, median absolute deviation, minimum, maximum, range, skewness, kurtosis, and standard error of the mean
> library(psych)
> # when several loaded packages provide a function with the same name, the most recently loaded one wins; disambiguate with the package name, e.g. Hmisc::describe(x)
> describe(new_var)
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61 -0.37 1.07
hp 2 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73 -0.14 12.12
wt 3 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42 -0.02 0.17

(2) Descriptive Statistics by Group

> # descriptive statistics by group
> # aggregate() can apply only a single function to the data
> aggregate(mtcars[vars],by=list(am=mtcars$am),mean)
am mpg hp wt
1 0 17.14737 160.2632 3.768895
2 1 24.39231 126.8462 2.411000
>
> # use by(data, INDICES, FUN) to apply a function that returns several statistics
>
> dstats<-function(x)(
+ c(mean=mean(x), sd=sd(x))
+ )
>
> by(mtcars[vars], mtcars$am, mean)
mtcars$am: 0
[1] NA
----------------------------------------------------------------------------------------------------------
mtcars$am: 1
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
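The NAs above arise because `mean()` cannot summarize a whole data frame. A minimal corrected sketch, assuming the intent was to apply the `dstats` helper to each column within each transmission group:

```r
vars <- c("mpg", "hp", "wt")
# dstats summarizes a single numeric vector
dstats <- function(x) c(mean = mean(x), sd = sd(x))
# apply it column-wise within each level of am
by(mtcars[vars], mtcars$am, function(df) sapply(df, dstats))
```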
>
> # use the reshape package to derive descriptive statistics flexibly
> dstats<-function(x)(
+ c(n=length(x),mean=mean(x), sd=sd(x))
+ )
> library(reshape)
> # first, melt the data frame
> dfm <- melt(mtcars, measure.vars=c("mpg","hp","wt"), id.vars = c("am","cyl"))
>
> # recast (variable denotes the measure variable in the melted data; "variable" is its default name)
> cast(dfm, am + cyl + variable ~., dstats)
am cyl variable n mean sd
1 0 4 mpg 3 22.900000 1.4525839
2 0 4 hp 3 84.666667 19.6553640
3 0 4 wt 3 2.935000 0.4075230
4 0 6 mpg 4 19.125000 1.6317169
5 0 6 hp 4 115.250000 9.1787799
6 0 6 wt 4 3.388750 0.1162164
7 0 8 mpg 12 15.050000 2.7743959
8 0 8 hp 12 194.166667 33.3598379
9 0 8 wt 12 4.104083 0.7683069
10 1 4 mpg 8 28.075000 4.4838599
11 1 4 hp 8 81.875000 22.6554156
12 1 4 wt 8 2.042250 0.4093485
13 1 6 mpg 3 20.566667 0.7505553
14 1 6 hp 3 131.666667 37.5277675
15 1 6 wt 3 2.755000 0.1281601
16 1 8 mpg 2 15.400000 0.5656854
17 1 8 hp 2 299.500000 50.2045815
18 1 8 wt 2 3.370000 0.2828427

2. Frequency and Contingency Tables

(1) Generating Frequency Tables

Function | Description
table(var1, var2, …, varN) | creates an N-way contingency table from N categorical variables (factors)
xtabs(formula, data) | creates an N-way contingency table from a formula and a matrix or data frame
prop.table(table, margins) | expresses table entries as fractions of the marginal table defined by margins
margin.table(table, margins) | computes the sums of table entries for the marginal table defined by margins
addmargins(table, margins) | adds summary margins (sums by default) to a table
ftable(table) | creates a compact, "flat" contingency table
> # frequency tables
> library(vcd)
> art=Arthritis
>
> head(art)
ID Treatment Sex Age Improved
1 57 Treated Male 27 Some
2 46 Treated Male 29 None
3 77 Treated Male 30 None
4 17 Treated Male 32 Marked
5 36 Treated Male 46 Marked
6 23 Treated Male 58 Marked
>
> # one-way contingency table
> mytable <- with(art, table(Improved))
> mytable
Improved
None Some Marked
42 14 28
> # prop.table() converts frequencies into proportions
> prop.table(mytable)*100
Improved
None Some Marked
50.00000 16.66667 33.33333
>
>
> # two-way contingency table
> # table() ignores missing values by default; add useNA="ifany" to include them
> # in xtabs(), a variable on the left-hand side of the formula is treated as a vector of frequencies
> mytable <- xtabs(~ Treatment+Improved, data=art)
> mytable
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
> # marginal frequencies
> margin.table(mytable,2)
Improved
None Some Marked
42 14 28
>
> # proportions; the index specifies the variable over which proportions are computed;
> # if omitted, proportions are computed over all cells
> prop.table(mytable,1)
Improved
Treatment None Some Marked
Placebo 0.6744186 0.1627907 0.1627907
Treated 0.3170732 0.1707317 0.5121951
>
> # addmargins() adds marginal sums to the result
> addmargins(mytable)
Improved
Treatment None Some Marked Sum
Placebo 29 7 7 43
Treated 13 7 21 41
Sum 42 14 28 84
>
> addmargins(prop.table(mytable))
Improved
Treatment None Some Marked Sum
Placebo 0.34523810 0.08333333 0.08333333 0.51190476
Treated 0.15476190 0.08333333 0.25000000 0.48809524
Sum 0.50000000 0.16666667 0.33333333 1.00000000
>
> # by default both row and column sums are produced; pass a margin argument to display only one
> addmargins(prop.table(mytable),2)
Improved
Treatment None Some Marked Sum
Placebo 0.34523810 0.08333333 0.08333333 0.51190476
Treated 0.15476190 0.08333333 0.25000000 0.48809524
> addmargins(prop.table(mytable),1)
Improved
Treatment None Some Marked
Placebo 0.34523810 0.08333333 0.08333333
Treated 0.15476190 0.08333333 0.25000000
Sum 0.50000000 0.16666667 0.33333333
> # use CrossTable() from gmodels to create a two-way table
> library(gmodels)
> ss=CrossTable(art$Treatment,art$Improved)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 84
| art$Improved
art$Treatment | None | Some | Marked | Row Total |
--------------|-----------|-----------|-----------|-----------|
Placebo | 29 | 7 | 7 | 43 |
| 2.616 | 0.004 | 3.752 | |
| 0.674 | 0.163 | 0.163 | 0.512 |
| 0.690 | 0.500 | 0.250 | |
| 0.345 | 0.083 | 0.083 | |
--------------|-----------|-----------|-----------|-----------|
Treated | 13 | 7 | 21 | 41 |
| 2.744 | 0.004 | 3.935 | |
| 0.317 | 0.171 | 0.512 | 0.488 |
| 0.310 | 0.500 | 0.750 | |
| 0.155 | 0.083 | 0.250 | |
--------------|-----------|-----------|-----------|-----------|
Column Total | 42 | 14 | 28 | 84 |
| 0.500 | 0.167 | 0.333 | |
--------------|-----------|-----------|-----------|-----------|
>
>
> # multidimensional contingency tables
> mytable<-xtabs(~Treatment+Sex+Improved,data=art)
> mytable
, , Improved = None
Sex
Treatment Female Male
Placebo 19 10
Treated 6 7
, , Improved = Some
Sex
Treatment Female Male
Placebo 7 0
Treated 5 2
, , Improved = Marked
Sex
Treatment Female Male
Placebo 6 1
Treated 16 5
> ftable(mytable)
Improved None Some Marked
Treatment Sex
Placebo Female 19 7 6
Male 10 0 1
Treated Female 6 5 16
Male 7 2 5
>
> # marginal frequencies
> margin.table(mytable,1)
Treatment
Placebo Treated
43 41
> margin.table(mytable,2)
Sex
Female Male
59 25
> margin.table(mytable,3)
Improved
None Some Marked
42 14 28
> margin.table(mytable,c(1,3))
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
> ftable(prop.table(mytable,c(1,2)))
Improved None Some Marked
Treatment Sex
Placebo Female 0.59375000 0.21875000 0.18750000
Male 0.90909091 0.00000000 0.09090909
Treated Female 0.22222222 0.18518519 0.59259259
Male 0.50000000 0.14285714 0.35714286
>
> ftable(addmargins(prop.table(mytable,c(1,2)),3))
Improved None Some Marked Sum
Treatment Sex
Placebo Female 0.59375000 0.21875000 0.18750000 1.00000000
Male 0.90909091 0.00000000 0.09090909 1.00000000
Treated Female 0.22222222 0.18518519 0.59259259 1.00000000
Male 0.50000000 0.14285714 0.35714286 1.00000000

(2) Tests of Independence

> # tests of independence
>
> # chi-square test of independence
> # a small p-value is evidence against independence; 0.01 is a common cutoff
> library(vcd)
> mytable<-xtabs(~Treatment+Improved,data=Arthritis)
> chisq.test(mytable)
Pearson's Chi-squared test
data: mytable
X-squared = 13.055, df = 2, p-value = 0.001463
>
> mytable<-xtabs(~Improved+Sex,data=Arthritis)
> chisq.test(mytable)
Pearson's Chi-squared test
data: mytable
X-squared = 4.8407, df = 2, p-value = 0.08889
Warning message:
In chisq.test(mytable) : Chi-squared approximation may be incorrect
>
> # Fisher's exact test
> mytable<-xtabs(~Treatment+Improved,data=Arthritis)
> fisher.test(mytable)
Fisher's Exact Test for Count Data
data: mytable
p-value = 0.001393
alternative hypothesis: two.sided
>
> # Cochran-Mantel-Haenszel test
> # the null hypothesis is that the two nominal variables are conditionally independent within each stratum of the third variable
> mytable<-xtabs(~Treatment+Improved+Sex,data=Arthritis)
> mantelhaen.test(mytable)
Cochran-Mantel-Haenszel test
data: mytable
Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647

(3) Measures of Association

> # measures of association for two-way tables
> # the phi coefficient, contingency coefficient, and Cramer's V
> library(vcd)
> mytable<-xtabs(~Treatment+Improved, data = Arthritis)
> assocstats(mytable)
X^2 df P(> X^2)
Likelihood Ratio 13.530 2 0.0011536
Pearson 13.055 2 0.0014626
Phi-Coefficient : NA
Contingency Coeff.: 0.367
Cramer's V : 0.394
>
>
> # convert a contingency table into a flat (case-form) table
> table2flat <- function(mytable) {
+ # convert the table to a data frame with a Freq column
+ df <- as.data.frame(mytable)
+ # get the numbers of rows and columns of the data frame
+ rows <- dim(df)[1]
+ cols <- dim(df)[2]
+ x <- NULL
+
+ # rebuild each case, replicating rows according to their frequencies
+ for (i in 1:rows) {
+ for (j in 1:df$Freq[i]) {
+ row <- df[i, c(1:(cols - 1))]
+ x <- rbind(x,row)
+ }
+ }
+
+ # renumber the row names
+ row.names(x) <- c(1:dim(x)[1])
+ return(x)
+ }
> table2flat(mytable)
Treatment Improved
1 Placebo None
2 Placebo None
3 Placebo None
4 Placebo None
5 Placebo None
6 Placebo None
7 Placebo None

A function to convert a contingency table into a flat table

# function to convert a contingency table into a flat (case-form) table:
table2flat <- function(mytable) {
  # convert the table to a data frame with a Freq column
  df <- as.data.frame(mytable)
  # get the numbers of rows and columns of the data frame
  rows <- dim(df)[1]
  cols <- dim(df)[2]
  x <- NULL
  # rebuild each case, replicating rows according to their frequencies
  for (i in 1:rows) {
    for (j in 1:df$Freq[i]) {
      row <- df[i, c(1:(cols - 1))]
      x <- rbind(x, row)
    }
  }
  # renumber the row names
  row.names(x) <- c(1:dim(x)[1])
  return(x)
}

3. Correlation

(1) Types of Correlation

Correlation coefficients include the Pearson, Spearman, and Kendall correlations, as well as partial, polychoric, and polyserial correlations. Let's look at each in turn.

Pearson, Spearman, and Kendall correlations

  • The Pearson product-moment correlation measures the degree of linear association between two quantitative variables.
  • Spearman's rank-order correlation measures the degree of association between two rank-ordered variables.
  • Kendall's tau is also a nonparametric measure of rank correlation.

cor() computes all three kinds of correlation coefficient via cor(x, use=, method=); cov() computes covariances.

Parameter | Description
x | a matrix or data frame
use | specifies the handling of missing data. Options are all.obs (assumes no missing data; errors if any are encountered), everything (the default; correlations involving missing data are set to missing), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion)
method | specifies the type of correlation. Options are pearson (the default), spearman, and kendall
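The effect of the use= options can be sketched on a copy of the data with one value deliberately set to missing (a toy illustration, not from the original transcript):

```r
m <- state.x77[, 1:3]          # Population, Income, Illiteracy
m[1, "Income"] <- NA           # introduce a missing value
cor(m)                                 # default use="everything": correlations involving Income become NA
cor(m, use = "complete.obs")           # listwise deletion: drop the row containing the NA
cor(m, use = "pairwise.complete.obs")  # pairwise deletion
```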
> # correlation coefficients
>
> state <- state.x77[,1:6]
>
> # covariances
> cov(state)
Population Income Illiteracy Life Exp Murder HS Grad
Population 19931683.7588 571229.7796 292.8679592 -407.8424612 5663.523714 -3551.509551
Income 571229.7796 377573.3061 -163.7020408 280.6631837 -521.894286 3076.768980
Illiteracy 292.8680 -163.7020 0.3715306 -0.4815122 1.581776 -3.235469
Life Exp -407.8425 280.6632 -0.4815122 1.8020204 -3.869480 6.312685
Murder 5663.5237 -521.8943 1.5817755 -3.8694804 13.627465 -14.549616
HS Grad -3551.5096 3076.7690 -3.2354694 6.3126849 -14.549616 65.237894
>
> # Pearson correlations (the default)
> cor(state)
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.00000000 0.2082276 0.1076224 -0.06805195 0.3436428 -0.09848975
Income 0.20822756 1.0000000 -0.4370752 0.34025534 -0.2300776 0.61993232
Illiteracy 0.10762237 -0.4370752 1.0000000 -0.58847793 0.7029752 -0.65718861
Life Exp -0.06805195 0.3402553 -0.5884779 1.00000000 -0.7808458 0.58221620
Murder 0.34364275 -0.2300776 0.7029752 -0.78084575 1.0000000 -0.48797102
HS Grad -0.09848975 0.6199323 -0.6571886 0.58221620 -0.4879710 1.00000000
>
> # Spearman rank correlations
> cor(state, method="spearman")
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.0000000 0.1246098 0.3130496 -0.1040171 0.3457401 -0.3833649
Income 0.1246098 1.0000000 -0.3145948 0.3241050 -0.2174623 0.5104809
Illiteracy 0.3130496 -0.3145948 1.0000000 -0.5553735 0.6723592 -0.6545396
Life Exp -0.1040171 0.3241050 -0.5553735 1.0000000 -0.7802406 0.5239410
Murder 0.3457401 -0.2174623 0.6723592 -0.7802406 1.0000000 -0.4367330
HS Grad -0.3833649 0.5104809 -0.6545396 0.5239410 -0.4367330 1.0000000
>
> # by default the result is a square matrix, but a non-square result can also be produced
> x<-state[,c("Population","Income","Illiteracy","HS Grad")]
> y<-state[,c("Life Exp","Murder")]
>
> cor(x,y)
Life Exp Murder
Population -0.06805195 0.3436428
Income 0.34025534 -0.2300776
Illiteracy -0.58847793 0.7029752
HS Grad 0.58221620 -0.4879710
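Kendall's tau is mentioned above but not demonstrated in the transcript; it follows the same pattern (a sketch; output omitted):

```r
state <- state.x77[, 1:6]
# Kendall's tau rank correlations for the same six variables
cor(state, method = "kendall")
```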

Partial correlation
pcor(u, S)

> # partial correlation
> # pcor(u, S): u is a numeric vector whose first two entries index the variables to correlate; the remaining entries index the control variables. S is the covariance matrix.
>
> library(ggm)
>
> # partial correlation between population and murder rate, controlling for income, illiteracy rate, and HS graduation rate
> pcor(c(1,5,2,3,6), cov(state))
[1] 0.3462724

(2) Significance Tests for Correlations

> # significance tests for correlations
>
> # cor.test(x, y, alternative = , method = )
> # x, y: the variables to test; alternative specifies a two- or one-sided test: "two.sided" (the default), "less" (correlation < 0), or "greater" (correlation > 0)
> # method specifies the type of correlation: pearson, kendall, or spearman
>
> cor.test(state[,3],state[,5])
Pearson's product-moment correlation
data: state[, 3] and state[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5279280 0.8207295
sample estimates:
cor
0.7029752
>
>
> # cor.test() handles only a single pair of variables; corr.test() in the psych package produces correlation matrices for several variables along with their significance levels
>
> library(psych)
>
> corr.test(state, use="complete")
Call:corr.test(x = state, use = "complete")
Correlation matrix
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.00 0.21 0.11 -0.07 0.34 -0.10
Income 0.21 1.00 -0.44 0.34 -0.23 0.62
Illiteracy 0.11 -0.44 1.00 -0.59 0.70 -0.66
Life Exp -0.07 0.34 -0.59 1.00 -0.78 0.58
Murder 0.34 -0.23 0.70 -0.78 1.00 -0.49
HS Grad -0.10 0.62 -0.66 0.58 -0.49 1.00
Sample Size
[1] 50
Probability values (Entries above the diagonal are adjusted for multiple tests.)
Population Income Illiteracy Life Exp Murder HS Grad
Population 0.00 0.59 1.00 1.0 0.10 1
Income 0.15 0.00 0.01 0.1 0.54 0
Illiteracy 0.46 0.00 0.00 0.0 0.00 0
Life Exp 0.64 0.02 0.00 0.0 0.00 0
Murder 0.01 0.11 0.00 0.0 0.00 0
HS Grad 0.50 0.00 0.00 0.0 0.00 0
To see confidence intervals of the correlations, print with the short=FALSE option
>
> # use= can be "pairwise" or "complete" (indicating pairwise or listwise deletion
> # of missing values, respectively); method= can be "pearson" (the default), "spearman", or "kendall"
> # the significance of a partial correlation can be tested with pcor.test(r, q, n): r = the partial correlation, q = the number of variables being controlled, n = the sample size
> # (note: three variables are controlled in the pcor() call below, so q should arguably be 3, not 2)
>
> pcor.test(pcor(c(1,5,2,3,6), cov(state)),2,50)
$tval
[1] 2.503409
$df
[1] 46
$pvalue
[1] 0.01591253

4. t-Tests

(1) Independent-Samples t-Test

> # t-tests
> # we use the UScrime data set from the MASS package, which describes the effect of punishment
> # regimes on crime rates in 47 US states in 1960. The outcome variables of interest are Prob (the
> # probability of imprisonment), U1 (the unemployment rate of urban males aged 14-24), and U2 (the
> # unemployment rate of urban males aged 35-39). The categorical variable So (an indicator of
> # whether a state is in the South) will serve as the grouping variable.
> library(MASS)
>
> describe(UScrime)
vars n mean sd median trimmed mad min max range skew kurtosis se
M 1 47 138.57 12.57 136.00 137.54 11.86 119.00 177.00 58.00 0.82 0.38 1.83
So 2 47 0.34 0.48 0.00 0.31 0.00 0.00 1.00 1.00 0.65 -1.61 0.07
Ed 3 47 105.64 11.19 108.00 105.90 11.86 87.00 122.00 35.00 -0.32 -1.15 1.63
Po1 4 47 85.00 29.72 78.00 82.13 28.17 45.00 166.00 121.00 0.89 0.16 4.33
Po2 5 47 80.23 27.96 73.00 77.59 28.17 41.00 157.00 116.00 0.84 0.01 4.08
LF 6 47 561.19 40.41 560.00 559.87 45.96 480.00 641.00 161.00 0.27 -0.89 5.89
M.F 7 47 983.02 29.47 977.00 980.23 19.27 934.00 1071.00 137.00 0.99 0.65 4.30
Pop 8 47 36.62 38.07 25.00 29.95 22.24 3.00 168.00 165.00 1.85 3.08 5.55
NW 9 47 101.13 102.83 76.00 85.56 77.10 2.00 423.00 421.00 1.38 1.08 15.00
U1 10 47 95.47 18.03 92.00 93.69 17.79 70.00 142.00 72.00 0.77 -0.13 2.63
U2 11 47 33.98 8.45 34.00 33.49 8.90 20.00 58.00 38.00 0.54 0.17 1.23
GDP 12 47 525.38 96.49 537.00 528.67 111.19 288.00 689.00 401.00 -0.38 -0.61 14.07
Ineq 13 47 194.00 39.90 176.00 192.79 35.58 126.00 276.00 150.00 0.37 -1.14 5.82
Prob 14 47 0.05 0.02 0.04 0.05 0.02 0.01 0.12 0.11 0.88 0.75 0.00
Time 15 47 26.60 7.09 25.80 26.35 6.37 12.20 44.00 31.80 0.37 -0.41 1.03
y 16 47 905.09 386.76 831.00 863.05 314.31 342.00 1993.00 1651.00 1.05 0.78 56.42
>
> # independent-samples t-test
> # the t-test assumes that the two groups are independent and drawn from normal populations; calling formats:
> # t.test(y~x, data): y is numeric and x is a dichotomous variable
> # t.test(y1, y2): both numeric; data is optional, a matrix or data frame containing the variables
>
> # by default the t-test does not assume equal variances; set var.equal=TRUE to assume equal variances
> # the default alternative hypothesis is two-sided (the means differ, but the direction is unspecified)
> # add alternative="less" or alternative="greater" for a directional test
>
> t.test(Prob~So, data=UScrime)
Welch Two Sample t-test
data: Prob by So
t = -3.8954, df = 24.925, p-value = 0.0006506
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.03852569 -0.01187439
sample estimates:
mean in group 0 mean in group 1
0.03851265 0.06371269
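The directional alternatives mentioned above can be sketched like this (with the formula interface, alternative="less" tests whether the first group's mean, here So = 0, is smaller than the second's; output omitted):

```r
library(MASS)
# one-sided test: is the non-Southern (So = 0) mean of Prob smaller
# than the Southern (So = 1) mean?
t.test(Prob ~ So, data = UScrime, alternative = "less")
```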

(2) Dependent-Samples t-Test

> # dependent-samples (paired) t-test
> # t.test(y1, y2, paired=TRUE)
>
> with(UScrime, t.test(U1,U2, paired = TRUE))
Paired t-test
data: U1 and U2
t = 32.407, df = 46, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
57.67003 65.30870
sample estimates:
mean of the differences
61.48936
>
> # for comparing more than two groups, use analysis of variance; see chapter 7

5. Nonparametric Tests of Group Differences

(1) Two Groups

> # nonparametric tests of group differences
> # when the data can't meet the parametric assumptions of the t-test or ANOVA (for example, when the
> # outcome variable is severely skewed or merely ordinal), nonparametric methods can be used instead
>
> # two-group comparison: the Wilcoxon rank-sum test (also known as the Mann-Whitney U test)
> # wilcox.test(y~x, data)
> # or wilcox.test(y1, y2); data is optional, as with the t-test
>
> with(UScrime, by(Prob,So,median))
So: 0
[1] 0.038201
-------------------------------------------------------------------------------
So: 1
[1] 0.055552
>
>
> wilcox.test(Prob~So, data=UScrime)
Wilcoxon rank sum test
data: Prob by So
W = 81, p-value = 8.488e-05
alternative hypothesis: true location shift is not equal to 0
>
> # p < 0.001, so we reject the null hypothesis
> # the Wilcoxon signed-rank test is a nonparametric alternative to the dependent-samples t-test; it suits
> # paired data and situations where normality cannot be guaranteed
>
> # the dependent (paired) case
> with(UScrime,wilcox.test(U1, U2, paired=TRUE))
Wilcoxon signed rank test with continuity correction
data: U1 and U2
V = 1128, p-value = 2.464e-09
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(U1, U2, paired = TRUE) :
cannot compute exact p-value with ties

(2) More Than Two Groups

> # more than two groups
> # if the groups are independent, the Kruskal-Wallis test is a useful approach; if they are not
> # (as in a repeated-measures or randomized-block design), the Friedman test is more appropriate
> #
> # kruskal.test(y~A, data): y is a numeric variable and A is a grouping variable with two or more levels
> #
> # friedman.test(y~A|B, data): y is numeric, A is a grouping variable, and B is a blocking variable identifying matched observations
> #
> # the illiteracy question
> # add region names to the data
> states<- as.data.frame(cbind(state.region, state.x77))
>
> kruskal.test(Illiteracy~state.region,data=states)
Kruskal-Wallis rank sum test
data: Illiteracy by state.region
Kruskal-Wallis chi-squared = 22.672, df = 3, p-value = 4.726e-05
> # the result indicates that illiteracy rates differ across regions (p < 0.001)
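friedman.test() is described above but not run; here is a minimal sketch on invented blocked data (the ratings, treatments, and judges below are made up purely for illustration):

```r
# hypothetical data: 6 judges (blocks) each rate 3 treatments once
scores <- data.frame(
  rating    = c(4, 5, 3, 2, 4, 3, 5, 5, 4, 3, 4, 2, 4, 3, 3, 5, 4, 4),
  treatment = factor(rep(c("A", "B", "C"), times = 6)),
  judge     = factor(rep(1:6, each = 3))
)
# y ~ A | B: rating grouped by treatment, blocked by judge
friedman.test(rating ~ treatment | judge, data = scores)
```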