今天是各类统计方法R语言实现的第11期,我们主要介绍Logistic回归。Logistic回归属于广义线性回归,因此我们从广义线性回归讲起。
广义线性回归
线性回归模型要求因变量服从正态分布,但是当结果变量是分类型(有无,患病与否等,二分类常用Logistic回归)、计数型(某地区某年发生肿瘤患者的人数等,常用泊松回归)或者临床上经常使用的无复发生存期数据等(常用Cox回归),因变量不符合正态分布,无法直接使用线性回归。而广义线性模型扩展了线性模型的框架,可以进行非正态因变量的分析,在R语言中可以通过glm()函数实现。
glm()函数的参数
分布族 | 默认的连接函数 |
---|---|
binomial | (link = “logit”) |
gaussian | (link = “identity”) |
gamma | (link = “inverse”) |
inverse.gaussian | (link = “1/mu^2”) |
poisson | (link = “log”) |
quasi | (link = “identity”, variance = “constant”) |
quasibinomial | (link = “logit”) |
quasipoisson | (link = “log”) |
连用的函数
函数 | 描述 |
---|---|
summary() | 展示拟合模型的细节 |
coefficients(), coef() | 列出拟合模型的参数(截距项和斜率) |
confint() | 给出模型参数的置信区间(默认为95%) |
residuals() | 列出拟合模型的残差值 |
anova() | 生成两个拟合模型的方差分析表 |
plot() | 生成评价拟合模型的诊断图 |
predict() | 用拟合模型对新数据集进行预测 |
Logistic回归
二分类因变量常用Logistic回归,假设因变量Y服从二项分布,查表得(link = “logit”)
此处是一份婚外情数据,我们用性别、年龄等因素预测参与者是否发生婚外情affairs。
# 载入数据
data(Affairs, package = "AER")
summary(Affairs)
## affairs gender age yearsmarried children
## Min. : 0.000 female:315 Min. :17.50 Min. : 0.125 no :171
## 1st Qu.: 0.000 male :286 1st Qu.:27.00 1st Qu.: 4.000 yes:430
## Median : 0.000 Median :32.00 Median : 7.000
## Mean : 1.456 Mean :32.49 Mean : 8.178
## 3rd Qu.: 0.000 3rd Qu.:37.00 3rd Qu.:15.000
## Max. :12.000 Max. :57.00 Max. :15.000
## religiousness education occupation rating
## Min. :1.000 Min. : 9.00 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:14.00 1st Qu.:3.000 1st Qu.:3.000
## Median :3.000 Median :16.00 Median :5.000 Median :4.000
## Mean :3.116 Mean :16.17 Mean :4.195 Mean :3.932
## 3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:6.000 3rd Qu.:5.000
## Max. :5.000 Max. :20.00 Max. :7.000 Max. :5.000
table(Affairs$affairs)
##
## 0 1 2 3 7 12
## 451 34 17 19 42 38
# 创建二分类因变量,1表示发生婚外情,0表示不发生婚外情
Affairs$ynaffair[Affairs$affairs > 0] <- 1
Affairs$ynaffair[Affairs$affairs == 0] <- 0
Affairs$ynaffair <- factor(Affairs$ynaffair, levels = c(0, 1), labels = c("No", "Yes"))
table(Affairs$ynaffair)
##
## No Yes
## 451 150
# 拟合模型(link = “logit”)
fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children + religiousness +
education + occupation + rating, data = Affairs, family = binomial())
summary(fit.full)
##
## Call:
## glm(formula = ynaffair ~ gender + age + yearsmarried + children +
## religiousness + education + occupation + rating, family = binomial(),
## data = Affairs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5713 -0.7499 -0.5690 -0.2539 2.5191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.37726 0.88776 1.551 0.120807
## gendermale 0.28029 0.23909 1.172 0.241083
## age -0.04426 0.01825 -2.425 0.015301 *
## yearsmarried 0.09477 0.03221 2.942 0.003262 **
## childrenyes 0.39767 0.29151 1.364 0.172508
## religiousness -0.32472 0.08975 -3.618 0.000297 ***
## education 0.02105 0.05051 0.417 0.676851
## occupation 0.03092 0.07178 0.431 0.666630
## rating -0.46845 0.09091 -5.153 2.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 675.38 on 600 degrees of freedom
## Residual deviance: 609.51 on 592 degrees of freedom
## AIC: 627.51
##
## Number of Fisher Scoring iterations: 4
从结果中,我们发现性别gendermale、是否有孩子childrenyes、教育水平education和职业occupation对于模型贡献不显著,因此去除这些变量重新拟合模型。
# 重新拟合模型
fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, data = Affairs, family = binomial())
summary(fit.reduced)
##
## Call:
## glm(formula = ynaffair ~ age + yearsmarried + religiousness +
## rating, family = binomial(), data = Affairs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6278 -0.7550 -0.5701 -0.2624 2.3998
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.93083 0.61032 3.164 0.001558 **
## age -0.03527 0.01736 -2.032 0.042127 *
## yearsmarried 0.10062 0.02921 3.445 0.000571 ***
## religiousness -0.32902 0.08945 -3.678 0.000235 ***
## rating -0.46136 0.08884 -5.193 2.06e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 675.38 on 600 degrees of freedom
## Residual deviance: 615.36 on 596 degrees of freedom
## AIC: 625.36
##
## Number of Fisher Scoring iterations: 4
此时,每一个变量对模型贡献都非常显著,我们可以使用卡方检验比较两个模型。
# 比较模型
anova(fit.reduced, fit.full, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: ynaffair ~ age + yearsmarried + religiousness + rating
## Model 2: ynaffair ~ gender + age + yearsmarried + children + religiousness +
## education + occupation + rating
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 596 615.36
## 2 592 609.51 4 5.8474 0.2108
发现最终p=0.21,表明两个模型没有显著差异,因此用age + yearsmarried + religiousness + rating四个变量就能很好地预测是否发生婚外情。
# 输出回归系数,解释回归系数
coef(fit.reduced)
## (Intercept) age yearsmarried religiousness rating
## 1.93083017 -0.03527112 0.10062274 -0.32902386 -0.46136144
exp(coef(fit.reduced))
## (Intercept) age yearsmarried religiousness rating
## 6.8952321 0.9653437 1.1058594 0.7196258 0.6304248
#置信区间
exp (confint(fit.reduced))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 2.1255764 23.3506030
## age 0.9323342 0.9981470
## yearsmarried 1.0448584 1.1718250
## religiousness 0.6026782 0.8562807
## rating 0.5286586 0.7493370
回归系数可用coef()获取,exp(回归系数)可以求得自变量引起因变量变化的优势比OR,对于发病率较低的慢性疾病,OR可作为相对危险度RR的估计。
OR=1,自变量X对于应变量发生与否不起作用,OR>1是一个危险因素,OR<1是一个?;ひ蛩亍#ǖ?表示发生,0表示不发生的情况)
此处结婚年龄yearsmarried是发生婚外情的危险因素,age、religiousness、rating是保护因素。
评价预测变量对结果概率的影响
我们可以假定其他因素不变,仅改变其中一个变量,从而评价这个变量对于结果概率的影响。
# 评价婚姻评分rating
testdata <- data.frame(rating = c(1, 2, 3, 4, 5), age = mean(Affairs$age), yearsmarried = mean(Affairs$yearsmarried), religiousness = mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced, newdata = testdata, type = "response")
testdata
## rating age yearsmarried religiousness prob
## 1 1 32.48752 8.177696 3.116473 0.5302296
## 2 2 32.48752 8.177696 3.116473 0.4157377
## 3 3 32.48752 8.177696 3.116473 0.3096712
## 4 4 32.48752 8.177696 3.116473 0.2204547
## 5 5 32.48752 8.177696 3.116473 0.1513079
婚姻评分从1到5,婚外情概率从0.53降到0.15
# 评价年龄age
testdata <- data.frame(rating = mean(Affairs$rating), age = seq(17, 57, 10), yearsmarried = mean(Affairs$yearsmarried), religiousness = mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced, newdata = testdata, type = "response")
testdata
## rating age yearsmarried religiousness prob
## 1 3.93178 17 8.177696 3.116473 0.3350834
## 2 3.93178 27 8.177696 3.116473 0.2615373
## 3 3.93178 37 8.177696 3.116473 0.1992953
## 4 3.93178 47 8.177696 3.116473 0.1488796
## 5 3.93178 57 8.177696 3.116473 0.1094738
年龄从17到57,婚外情概率从0.34降到0.11
过度离势
过度离势是指观测到的响应变量的方差大于期望的二项分布的方差,过度离势会导致奇异的标准误检验和不精确的显著性检验。
deviance(fit.reduced)/df.residual(fit.reduced)
## [1] 1.03248
结果非常接近1,表示没有过度离势。
fit <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, family = binomial(), data = Affairs)
fit.od <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, family = quasibinomial(), data = Affairs)
pchisq(summary(fit.od)$dispersion * fit$df.residual, fit$df.residual, lower = F)
## [1] 0.340122
p=0.34,二者之间没有显著差异,表明没有过度离势。
如果存在过度离势,可使用类二项分布 family = quasibinomial()。
条件logistic回归
使用survival包中的clogit(),用于分析配对数据
library(survival)
## Warning: package 'survival' was built under R version 3.6.3
data(logan)
summary(logan)
## occupation focc education race
## farm : 19 farm : 92 Min. : 2.00 non-black:764
## operatives :217 operatives :235 1st Qu.:12.00 black : 74
## craftsmen :202 craftsmen :232 Median :13.00
## sales :105 sales : 82 Mean :13.58
## professional:295 professional:197 3rd Qu.:16.00
## Max. :20.00
#整理数据
resp <- levels(logan$occupation)
n <- nrow(logan)
indx <- rep(1:n, length(resp))
logan2 <- data.frame(logan[indx,],
id = indx,
tocc = factor(rep(resp, each=n)))
logan2$case <- (logan2$occupation == logan2$tocc)
# strata(id)表示配对样本的编号,其余与之前一致,不过此处分析了交互作用
model <-clogit(case ~ tocc + tocc:education + strata(id), logan2)
summary(model)
## Call:
## coxph(formula = Surv(rep(1, 4190L), case) ~ tocc + tocc:education +
## strata(id), data = logan2, method = "exact")
##
## n= 4190, number of events= 838
##
## coef exp(coef) se(coef) z Pr(>|z|)
## toccfarm -1.8964629 0.1500986 1.3807822 -1.373 0.16961
## toccoperatives 1.1667502 3.2115388 0.5656465 2.063 0.03914
## toccprofessional -8.1005492 0.0003034 0.6987244 -11.593 < 2e-16
## toccsales -5.0292297 0.0065438 0.7700862 -6.531 6.54e-11
## tocccraftsmen:education -0.3322842 0.7172835 0.0568682 -5.843 5.13e-09
## toccfarm:education -0.3702858 0.6905370 0.1164100 -3.181 0.00147
## toccoperatives:education -0.4222188 0.6555906 0.0584328 -7.226 4.98e-13
## toccprofessional:education 0.2782469 1.3208122 0.0510212 5.454 4.94e-08
## toccsales:education NA NA 0.0000000 NA NA
##
## toccfarm
## toccoperatives *
## toccprofessional ***
## toccsales ***
## tocccraftsmen:education ***
## toccfarm:education **
## toccoperatives:education ***
## toccprofessional:education ***
## toccsales:education
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## toccfarm 0.1500986 6.6623 1.002e-02 2.247505
## toccoperatives 3.2115388 0.3114 1.060e+00 9.731781
## toccprofessional 0.0003034 3296.2778 7.713e-05 0.001193
## toccsales 0.0065438 152.8152 1.447e-03 0.029603
## tocccraftsmen:education 0.7172835 1.3941 6.416e-01 0.801857
## toccfarm:education 0.6905370 1.4481 5.497e-01 0.867512
## toccoperatives:education 0.6555906 1.5253 5.846e-01 0.735141
## toccprofessional:education 1.3208122 0.7571 1.195e+00 1.459723
## toccsales:education NA NA NA NA
##
## Concordance= 0.766 (se = 0.012 )
## Likelihood ratio test= 665.5 on 8 df, p=<2e-16
## Wald test = 413.5 on 8 df, p=<2e-16
## Score (logrank) test = 682.1 on 8 df, p=<2e-16
另外,有时我们还需要分析交互作用,使用逐步回归法step()等,之前推文均已讲过,此处不再赘述。
还要注意年龄等连续变量每增加一个变量对于二分类结果影响不大,经?;岱肿槲行蚨喾掷啾淞俊S行蚨喾掷啾淞堪凑崭鞲龇掷嘤胍虮淞渴欠裣咝员浠龆ㄊ欠裱票淞炕?。无序多分类自变量需要哑变量化。R中使用factor函数即可实现哑变量化。
好了,今天的R语言实现统计方法系列推文暂时告一段落,我们下次再见吧! 小伙伴们如果有什么统计上的问题,或者如果想要学习什么方面的生物信息内容,可以在微信群或者知识星球提问,没准哪天的推文就是专门解答你的问题哦!