重抽样与自助法

1、置换检验

> #置换方法并不是将统计量与理论分布进行比较，而是将其与置换观测数据后获得的经验分布进行比较，
> #根据统计量值的极端性判断是否有足够理由拒绝零假设。比如判断两组处理数据的内在差异性
> #将所有观测组数据混合后重新随机拆分至所有组合计算统计量如（t值），所有统计量升序排列，根据第一个统计量
> #（如t0）进行判断
> #置换检验真正发挥功用的地方是处理非正态数据（如分布偏倚很大）、存在离群点、
> #样本很小或无法做参数检验等情况。不过，如果初始样本对感兴趣的总体情况代表性很差，即使
> #是置换检验也无法提高推断效果。
> #置换检验主要用于生成检验零假设的p值，它有助于回答“效应是否存在”这样的问题。

2、使用coin包进行置换检验

#独立2样本检验和K样本检验
> library(coin)
> score <- c(40,57,45,55,58,57,64,55,62,65)
> treatment <- factor(c(rep("A",5), rep("B",5)))
> mydata <- data.frame(treatment, score)
> t.test(score~treatment, data=mydata,  var.equal = TRUE)
	Two Sample t-test
data:  score by treatment
t = -2.345, df = 8, p-value = 0.04705
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -19.0405455  -0.1594545
sample estimates:
mean in group A mean in group B 
           51.0            60.6 
> #distribution = "exact"，那么在零假设条件下，分布的计算是精确的（即依据所有
> #可能的排列组合）,当前仅可用于两样本问题
> oneway_test(score~treatment, data=mydata,  distribution='exact')
	Exact Two-Sample Fisher-Pitman Permutation Test
data:  score by treatment (A, B)
Z = -1.9147, p-value = 0.07143
alternative hypothesis: true mu is not equal to 0
> 
> #测试美国南部与非南部的监禁概率
> library(MASS)
> UScrime <- transform(UScrime, So = factor(So))
> wilcox_test(Prob ~ So, data=UScrime, distribution="exact")
	Exact Wilcoxon-Mann-Whitney Test
data:  Prob by So (0, 1)
Z = -3.7493, p-value = 8.488e-05
alternative hypothesis: true mu is not equal to 0
> #监禁在南部可能更多
> 
> #K样本问题, 五种治疗方案的差异性,参考分布得自于数据9999次的置换
> library(multcomp)
> set.seed(1234)
> oneway_test(response ~ trt, data = cholesterol, distribution=approximate(B=9999))
	Approximative K-Sample Fisher-Pitman Permutation Test
data:  response by trt (1time, 2times, 4times, drugD, drugE)
chi-squared = 36.381, p-value < 2.2e-16
> 
> 
> #列联表中的独立性
> #检验两类别型变量的独立性
> library(coin)
> library(vcd)
> #转换治疗效果变量为分类因子，有序因子coin会生产一个线性与线性的趋势检验，而不是卡方检验
> Arthritis <- transform(Arthritis, Improved=as.factor(as.numeric(Improved)))
> set.seed(1234)
> chisq_test(Treatment ~ Improved, data=Arthritis, distribution=approximate(B=9999))
	Approximative Pearson Chi-Squared Test
data:  Treatment by Improved (1, 2, 3)
chi-squared = 13.055, p-value = 0.0018
> 
> 
> #两数值型变量的独立性置换检验，文盲率与谋杀率
> #转换为数据框，x77是一个矩阵
> states <- as.data.frame(state.x77)
> set.seed(1234)
> spearman_test(Illiteracy~Murder, data=states, distribution=approximate(B=9999))
	Approximative Spearman Correlation Test
data:  Illiteracy by Murder
Z = 4.7065, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
> 
> #两样本和K样本相关性检验
> library(coin)
> library(MASS)
> wilcoxsign_test(U1~U2, data=UScrime, distribution="exact")
	Exact Wilcoxon-Pratt Signed-Rank Test
data:  y by x (pos, neg) 
	 stratified by block
Z = 5.9691, p-value = 1.421e-14
alternative hypothesis: true mu is not equal to 0

3、ImPerm包置换检验

#ImPerm包 提供线性模型的置换方法，回归和方差分析
> #简单回归
> library(lmPerm)
> fit<-lmp(weight~height, data=women, perm="Prob")
[1] "Settings:  unique SS : numeric variables centered"
> summary(fit)
Call:
lmp(formula = weight ~ height, data = women, perm = "Prob")
Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 
Coefficients:
       Estimate Iter Pr(Prob)    
height     3.45 5000   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-Squared: 0.991,	Adjusted R-squared: 0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14 
> 
> #多项式回归
> library(lmPerm)
> set.seed(1234)
> fit<-lmp(weight ~ height + I(height^2), data = women, perm = "Prob")
[1] "Settings:  unique SS : numeric variables centered"
> summary(fit)
Call:
lmp(formula = weight ~ height + I(height^2), data = women, perm = "Prob")
Residuals:
      Min        1Q    Median        3Q       Max 
-0.509405 -0.296105 -0.009405  0.286151  0.597059 
Coefficients:
            Estimate Iter Pr(Prob)    
height      -7.34832 5000   <2e-16 ***
I(height^2)  0.08306 5000   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3841 on 12 degrees of freedom
Multiple R-Squared: 0.9995,	Adjusted R-squared: 0.9994 
F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16 
> 
> 
> #多元回归
> library(lmPerm)
> set.seed(1234)
> states <- as.data.frame(state.x77)
> fit <- lmp(Murder ~ Population + Illiteracy + Income + Frost, data=states, perm="Prob")
[1] "Settings:  unique SS : numeric variables centered"
> summary(fit)
Call:
lmp(formula = Murder ~ Population + Illiteracy + Income + Frost, 
    data = states, perm = "Prob")
Residuals:
     Min       1Q   Median       3Q      Max 
-4.79597 -1.64946 -0.08112  1.48150  7.62104 
Coefficients:
            Estimate Iter Pr(Prob)    
Population 2.237e-04   51   1.0000    
Illiteracy 4.143e+00 5000   0.0004 ***
Income     6.442e-05   51   1.0000    
Frost      5.813e-04   51   0.8627    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-Squared: 0.567,	Adjusted R-squared: 0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08 
> 
> #单因素方差分析
> library(lmPerm)
> library(multcomp)
> set.seed(1234)
> fit <- aovp(response~trt, data=cholesterol, perm="Prob")
[1] "Settings:  unique SS "
> summary(fit)
Component 1 :
            Df R Sum Sq R Mean Sq Iter  Pr(Prob)    
trt          4  1351.37    337.84 5000 < 2.2e-16 ***
Residuals   45   468.75     10.42                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> #单因素协方差分析
> library(lmPerm)
> set.seed(1234)
> fit <- aovp(weight ~ gesttime + dose, data=litter, perm="Prob")
[1] "Settings:  unique SS : numeric variables centered"
> summary(fit)
Component 1 :
            Df R Sum Sq R Mean Sq Iter Pr(Prob)    
gesttime     1   161.49   161.493 5000   0.0006 ***
dose         3   137.12    45.708 5000   0.0392 *  
Residuals   69  1151.27    16.685                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> #双因素方差分析
> library(lmPerm)
> set.seed(1234)
> fit <- aovp(len ~ supp * dose, data=ToothGrowth, perm="Prob")
[1] "Settings:  unique SS : numeric variables centered"
> summary(fit)
Component 1 :
            Df R Sum Sq R Mean Sq Iter Pr(Prob)    
supp         1   205.35    205.35 5000  < 2e-16 ***
dose         1  2224.30   2224.30 5000  < 2e-16 ***
supp:dose    1    88.92     88.92 2032  0.04724 *  
Residuals   56   933.63     16.67                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

4、自助法

#boot包中的自助法，即通过一个函数重复执行多次取得统计量的序列分布
> #对单个统计量使用自助法
> rsq <- function(formula, data, indicates) {
+     d <- data[indicates,]
+     fit <- lm(formula, data=d)
+     return(summary(fit)$r.square)
+ }
> library(boot)
> set.seed(1234)
> results <- boot(data=mtcars, statistic = rsq, R=1000, formula=mpg~wt+disp)
> 
> print(results)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ 
    wt + disp)
Bootstrap Statistics :
     original    bias    std. error
t1* 0.7809306 0.0133367  0.05068926
> plot(results)
#置信区间都不包含R平方值0，所以拒绝零假设
> boot.ci(results, type=c("perc", "bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL : 
boot.ci(boot.out = results, type = c("perc", "bca"))
Intervals : 
Level     Percentile            BCa          
95%   ( 0.6838,  0.8833 )   ( 0.6344,  0.8549 )  
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
>

> #多个统计量自助法
> bs <- function(formula, data, indices) {
+   d <- data[indices,]
+   fit <- lm(formula, data=d)
+   return(coef(fit))
+ }
> 
> library(boot)
> set.seed(1234)
> results <- boot(data=mtcars, statistic = bs, R=1000, formula=mpg~wt+disp)
> 
> print(results)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = mtcars, statistic = bs, R = 1000, formula = mpg ~ 
    wt + disp)
Bootstrap Statistics :
       original        bias    std. error
t1* 34.96055404  0.1378732942 2.485756673
t2* -3.35082533 -0.0539040965 1.170427539
t3* -0.01772474 -0.0001208075 0.008786803
> #查看车重的抽样结果
> plot(results, index=2)
> #查看车重的95置信区间
> boot.ci(results, type= "bca", index=2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL : 
boot.ci(boot.out = results, type = "bca", index = 2)
Intervals : 
Level       BCa          
95%   (-5.655, -1.194 )  
Calculations and Intervals on Original Scale
> #查看排量的95置信区间
> boot.ci(results, type= "bca", index=3)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL : 
boot.ci(boot.out = results, type = "bca", index = 3)
Intervals : 
Level       BCa          
95%   (-0.0331,  0.0010 )  
Calculations and Intervals on Original Scale

Peter I/O

chapter12-重抽样与自助法

重抽样与自助法

1、置换检验

2、使用coin包进行置换检验

3、ImPerm包置换检验

4、自助法