Using PLINK for Genome-Wide Association Study
一、 基本用法
01 assoc
基本的case/control关联,质量性状数量性状均可,不能加协变量
数据过滤
plink --tfile /stor9000/apps/users/NWSUAF/2020060179/Sheep/GWAS/all_sample339.snp.filtered2.final.rename.modified --chr-set 26 --recode transpose --out all_sample339.snp.filtered2.final.rename.modified323.filter --maf 0.05 --geno 0.1 --mind 0.1 --hwe 0.001 --autosome-xy --allow-extra-chr
整理性状
cut -d " " -f 1 all_sample339.snp.filtered2.final.rename.modified323.filter.tfam > samples
for i in `cat samples`; do grep $i /stor9000/apps/users/NWSUAF/2020060179/Sheep/GWAS/trait_tail_type2 >> tail_wool;done
sed -i 's/1$/2/g' tail_wool
sed -i 's/0$/1/g' tail_wool
关联分析
# 质量性状,默认基于卡方检验,
plink --tfile ../all_sample339.snp.filtered2.final.rename.modified323.filter --pheno tail_wool --assoc --out tail_wool.assoc --chr-set 26 --allow-no-sex
Info
也可以加--fisher进行Fisher精确检验计算精确P值,否则--assoc默认根据chi square分布计算近似P值。
在大样本的情况下(总样本数>40),可以用chi square检验,若总样本数<40,单个最小频数<5,则用fisher检验较为合适。
结果文件格式:


Info
对于数量性状,基于t检验,对于数量性状来说--assoc结果和--linear以及--logistic中的ADD效应的P值一样:
#
plink --tfile ../all_sample339.snp.filtered2.final.rename.modified323.filter --pheno tail_length --assoc --out tail_length.assoc --chr-set 26 --allow-no-sex
# 结果文件格式

Warning
协变量为分类变量时,需要转换为虚拟变量/哑变量

plink --tfile all_sample339.snp.filtered2.final.rename.modified323.filter --covar ../covariate_for_plink --write-covar --dummy-coding --out ../dummy_covariate_for_plink --chr-set 26
02 logistic
**逻辑回归,用于分类变量,可以添加协变量,基于t检验
# PCA结果作为协变量
plink --tfile all_sample339.snp.filtered2.final.rename.modified323.filter --pca --chr-set 26 --maf 0.05 --out all_sample339.snp.filtered2.final.rename.modified323.filter.pca
# 关联
plink --tfile ../all_sample339.snp.filtered2.final.rename.modified323.filter --pheno tail_wool --logistic --covar ../covariate_for_plink --out tail_wool.logistic --chr-set 26 --allow-no-sex
03 linear
线性回归,用于连续性变量,可以添加协变量,基于t检验
不加协变量
plink --tfile ../all_sample339.snp.filtered2.final.rename.modified323.filter --pheno tail_length --linear --out tail_length.linear.nocov --chr-set 26 --allow-no-sex
加协变量
协变量格式:

# 关联
plink --tfile ../all_sample339.snp.filtered2.final.rename.modified323.filter --pheno tail_length --linear --covar ../covariate_for_plink --out tail_length.linear --chr-set 26 --allow-no-sex
# 可以加上--hide-covar参数,不显示协变量,只显示ADD加性效应。
plink --tfile ../all_sample339.snp.filtered2.final.rename.modified323.filter --chr-set 26 --pheno ../all_phenotypes --pheno-name Birth_weight --linear --allow-no-sex --out birth_weight.linear.cov1 --covar ../covariate_for_plink --hide-covar
二、进阶技巧
01 所有表型一起做
表型文件格式:

Warning
表型中如果用01编码二元分类性状,需要改成其它数字编码
# 关联分析
plink --tfile all_sample339.snp.filtered2.final.rename.modified323.filter --chr-set 26 --assoc --pheno all_phenotypes --all-pheno --allow-no-sex --out allpheno
# 所有表型一起做一般线性模型
plink --bfile /stor9000/apps/users/NWSUAF/2015010726/Sheep/GWAS_for_HuSheep/02.GWAS/by_GEMMA/all_samples317.filter.sort --chr-set 26 --linear hide-covar --pheno /stor9000/apps/users/NWSUAF/2015010726/Sheep/GWAS_for_HuSheep/02.GWAS/all_phenotypes.txt --all-pheno --allow-no-sex --out allpheno --covar covariate_for_plink_pca3_age_sib
02 当有P值为0时,设定最小P值
# --output-min-p 1e-99 # any p-value below 10^{-99} will be reported as 10^{-99}
plink --file /stor9000/apps/users/NWSUAF/2015010726/Sheep/GWAS_for_Black_Suffolk/01.data/plink/Black_Suffolk.snp.filtered --pheno /stor9000/apps/users/NWSUAF/2015010726/Sheep/GWAS_for_Black_Suffolk/02.GWAS/all_pheno_mod.grep --pheno-name Chest_width --linear hide-covar --covar ../../covariates_pca3 --chr-set 26 --allow-no-sex --out Chest_width --output-min-p 1e-99
03 加上多重检验的关联分析
Bonferroni校正
plink --tfile all_sample339.snp.filtered2.final.rename.modified323.filter --chr-set 26 --assoc --pheno all_phenotypes --pheno-name tail_type --allow-no-sex --out type --adjust
结果文件格式:

permutation校正
plink --tfile all_sample339.snp.filtered2.final.rename.modified323.filter --chr-set 26 --assoc --pheno all_phenotypes --pheno-name tail_type --allow-no-sex --out type --mperm 10000
# or
plink --tfile all_sample339.snp.filtered2.final.rename.modified323.filter --chr-set 26 --assoc --pheno all_phenotypes --pheno-name Tail_length --allow-no-sex --out Tail_length_msperm --mperm 10000
结果文件
sort -k4,4 -g Tail_length_msperm.qassoc.mperm | head

三、可视化与注释
01 可视化曼哈顿图
提取P值
awk '{OFS="\t"}{print $1,$3,$9}' tail_length.linear.nocov.assoc.linear | grep -v 'NA' > tail_length.linear.nocov.assoc.linear.pvalue.txt
可视化
##可视化
library(qqman)
data <- read.table("C:/Users/Administrator/Desktop/gwas.assoc.txt",header = T) # 读取数去存到data变量里面
colorset<-c("blue4", "orange3") # 创建一个颜色集合
manhattan(data, CHR = "CHR", BP = "bp", SNP = "SNP", p = "P-VALUE", col = colorset) # 画曼哈顿图
# 或者服务器上一键式
source /stor9000/apps/users/NWSUAF/2015010726/.bashrc
export R_LIBS_USER="/stor9000/apps/users/NWSUAF/2015010726/R/x86_64-conda_cos6-linux-gnu-library/3.5/"
R脚本内容
############################ /stor9000/apps/users/NWSUAF/2015010726/scripts/Manhattan_QQ_plot.R脚本内容#################################
# Object: Manhattan & QQ Plot for genome-wide association study
# Output: Single figure with PDF format
# Authors: Yingwei Guo
# Date: 8th, Jan, 2021
# Usage: Rscript commandArgs()[4] pvalue.txt sites_number traits_name
# 如果报没有qqman这个包的错误,在shell环境运行下面一行命令:
# export R_LIBS_USER="/stor9000/apps/users/NWSUAF/2015010726/R/x86_64-conda_cos6-linux-gnu-library/3.5/"
library(qqman)
args <- commandArgs(trailingOnly = TRUE)
data <- read.table(args[1],header = T) # 读取数去存到data变量里面
colorset <- c("#4C72B0", "#DD8452") # 创建一个颜色集合
pdf(paste0(args[3], ".pdf"), width = 18,height=4.5)
layout(matrix(c(1,2), 1, 2, byrow = TRUE),widths=c(3,1), heights=c(1,1))
par(mar = (c(3,4,2,2)+ 0.5), mgp=c(1.6,1,0))
par(bty="l", lwd=1.5) ## bty=l the plot is coordinate instead of box
manhattan(data, CHR = "CHR", BP = "BP", SNP="SNP", p = "P", suggestiveline = F, genomewideline = F, cex = 0.8) # 画曼哈顿图,默认是黑色和灰色,加color=colorset可自定义
sites_number <- as.numeric(args[2])
abline(h=-log10(0.05/sites_number), lty=5, col="grey40") # 自定义阈值线,h=-log10(0.05/sites_number)
qq(data$P, cex = 0.8)
dev.off()
#############################################################################################################################
用法
Rscript /stor9000/apps/users/NWSUAF/2015010726/scripts/Manhattan_QQ_plot.R P值列不包含NA的GWAS结果文件 位点数 性状名字
例如:
Rscript /stor9000/apps/users/NWSUAF/2015010726/scripts/Manhattan_QQ_plot.R Birth_weight.pvalue.txt 281613 Birth_weight.mht
02 位点过多时的可视化方式(感谢好朋友提醒)
可视化的本质是把每个点以pos为横坐标,以P值为纵坐标标注在坐标轴上。因此,当位点数目超过百万级别时,生成速度很慢,可以用下面的脚本进行可视化。原理大概是,把那些P值不显著有距离相近的位点去除一些,这样在图上显示看起来是一样的。
脚本中定义了一个名为 GAPIT.Pruning 的函数,该函数的主要目的是从大量数据点中筛选出一个子集,以便在绘图时只展示这些代表性点,从而减少绘图的复杂度和时间。
优化后的脚本
##########################################MANHATTAN_QQ.r#######################################
# Object: Manhattan & QQ Plot for genome-wide association study
# Output: Single figure with PDF format
# Authors: Zhiwu Zhang
# Modified by Zhengkui Zhou (zkzhou@126.com)
# Last update: Feb 8, 2014
`GAPIT.Pruning` <-
function(values,DPP=5000){
if(length(values)<=DPP)return(c(1:length(values)))
#values= log.P.values
values=sqrt(values) #This shift the weight a little bit to the low building.
#Handler of bias plot
rv=runif(length(values))
values=values+rv
values=values[order(values,decreasing = T)]
theMin=min(values)
theMax=max(values)
range=theMax-theMin
interval=range/DPP
ladder=round(values/interval)
ladder2=c(ladder[-1],0)
keep=ladder-ladder2
index=which(keep>0)
return(index)
}#end of GAPIT.Pruning
#=============================================================================================
#Prune most non important SNPs off the plots,
#Object: To get index of subset that evenly distribute
#`GAPIT.Pruning` <-
#function(values,DPP){
#if(length(values)<=DPP)return(c(1:length(values)))
#values=sqrt(values)
#theMin=min(values)
#theMax=max(values)
#range=theMax-theMin
#interval=range/DPP
#ladder=round(values/interval)
#ladder2=c(ladder[-1],0)
#keep=ladder-ladder2
#index=which(keep>0)
#return(index)
#}
`GAPIT.Manhattan` <-
function(GI.MP = NULL, name.of.trait = "Trait",plot.type = "Genomewise", plot.type2 = "Chromosomewise", DPP=50000, cutOff=0.01, band=5, seqQTN=NULL){
print(paste("Start to plot figure for trait: ", name.of.trait ,sep = ""))
if(is.null(GI.MP)) return
P.values <- as.numeric(GI.MP[,3])
borrowSlot=4
GI.MP[,borrowSlot]=0 #Inicial as 0
if(!is.null(seqQTN))GI.MP[seqQTN,borrowSlot]=1
index=which(GI.MP[,borrowSlot]==1 & is.na(GI.MP[,3]))
GI.MP[index,3]=1
GI.MP=matrix(as.numeric(as.matrix(GI.MP) ) ,nrow(GI.MP),ncol(GI.MP))
#Remove all SNPs that do not have a choromosome, bp position and p value(NA)
GI.MP <- GI.MP[!is.na(GI.MP[,1]),]
GI.MP <- GI.MP[!is.na(GI.MP[,2]),]
GI.MP <- GI.MP[!is.na(GI.MP[,3]),]
#Remove all SNPs that have P values between 0 and 1 (not na etc)
GI.MP <- GI.MP[GI.MP[,3]>0,]
GI.MP <- GI.MP[GI.MP[,3]<=1,]
GI.MP <- GI.MP[GI.MP[,1]!=0,]
numMarker=nrow(GI.MP)
bonferroniCutOff=-log10(cutOff/numMarker)
#Replace P the -log10 of the P-values
GI.MP[,3] <- -log10(GI.MP[,3])
y.lim <- as.integer(ceiling(max(GI.MP[,3])))
#y.lim = y.lim+1
print("The max -logP vlaue is")
print(y.lim)
chm.to.analyze <- unique(GI.MP[,1])
chm.to.analyze=chm.to.analyze[order(chm.to.analyze)]
numCHR= length(chm.to.analyze)
if(plot.type == "Genomewise")
{
nchr=max(chm.to.analyze)
ncycle=ceiling(nchr/band)
ncolor=band*ncycle
palette(rainbow(ncolor+1))
cycle1=seq(1,nchr,by= ncycle)
thecolor=cycle1
for(i in 2:ncycle){thecolor=c(thecolor,cycle1+(i-1))}
GI.MP <- GI.MP[order(GI.MP[,2]),]
GI.MP <- GI.MP[order(GI.MP[,1]),]
color.vector <- rep(c("orangered","navyblue"),numCHR)
ticks=NULL
lastbase=0
#change base position to accumulatives (ticks)
for (i in chm.to.analyze)
{
index=(GI.MP[,1]==i)
ticks <- c(ticks, lastbase+mean(GI.MP[index,2]))
GI.MP[index,2]=GI.MP[index,2]+lastbase
lastbase=max(GI.MP[index,2])
}
x0 <- as.numeric(GI.MP[,2]) ## BP
y0 <- as.numeric(GI.MP[,3]) ## P
z0 <- as.numeric(GI.MP[,1]) ## CHR
position=order(y0,decreasing = TRUE)
#print(length(position))
index0=GAPIT.Pruning(y0[position],DPP=DPP)
index=position[index0]
x=x0[index]
y=y0[index]
z=z0[index]
output=cbind(z,x,y)
#print(output)
#write.table(output, paste("gwasplot.","printy",".txt",sep=""), sep = '\t', row.names = F, col.names = T, quote = F)
#Extract QTN
QTN=GI.MP[which(GI.MP[,borrowSlot]==1),]
#Draw circles with same size and different thikness
size=1
ratio=5
base=1
themax=max(y)
themin=min(y)
wd=((y-themin+base)/(themax-themin+base))*size*ratio
s=size-wd/ratio/2
y.lim <- as.integer(ceiling(max(GI.MP[,3])))
#y.lim = 10
pdf(paste("GWAS.", name.of.trait,".Manhattan-QQ Plot.pdf" ,sep = ""), width = 18,height=4.5)
layout(matrix(c(1,2), 1, 2, byrow = TRUE),widths=c(3,1), heights=c(1,1)) ##
par(mar = (c(3,4,2,2)+ 0.5), mgp=c(1.6,1,0)) ##
par(bty="l", lwd=1.5) ## bty=l the plot is coordinate instead of box
mycols=rep(c("orangered","navyblue"),max(z)) ## doulb dolor loop by chromosome
plot(y~x,ylab=expression(-log[10](italic(p))) , ylim=c(0,y.lim), xaxs="i", yaxs="i" ,
cex.axis=0.5, cex.lab=1.0 ,col=mycols[z], axes=FALSE, type = "p",
pch=20,lwd=wd,cex=0.5, xlab="Chromosome")
if(!is.null(dim(QTN)))abline(v=QTN[,2], lty = 2, lwd=1.5, col = "grey")
abline(h=bonferroniCutOff,col="dimgray",lty=2, lwd=1)
title(xlab="Chromosome")
axis(1, at=ticks,tck=-0.01, cex.axis=1,labels=chm.to.analyze,tick=T, lwd=1.5, padj=-1) ## lwd: line width tick=T,
axis(2, tck=-0.01, cex.axis=1,lwd=1.5, padj=1) ## tck=-0.01 let the tck shor
box()
mtext(paste("GWAS on ",name.of.trait, sep=""), cex=1.5, font.main =1, side=3, outer=TRUE,line=-1.5)
print("Manhattan-Plot.Genomewise finished!")
############# QQ-plot#################
DPP=50000
if(length(P.values[P.values>0])<1) return(NULL)
DPP=round(DPP/4) #Reduce to 1/4 for QQ plot
P.values <- P.values[order(P.values)]
p_value_quantiles <- (1:length(P.values))/(length(P.values)+1)
log.P.values <- -log10(P.values)
log.Quantiles <- -log10(p_value_quantiles)
index=GAPIT.Pruning(log.P.values,DPP=DPP)
log.P.values=log.P.values[index ]
log.Quantiles=log.Quantiles[index]
qqplot(log.Quantiles, log.P.values, xlim = c(0,max(log.Quantiles)), ylim = c(0,max(log.P.values)),pch=20, mgp=c(1.6,0.5,0),
cex.axis=1, cex.lab=1, lty = 1, lwd = 1.2, col = "Blue" ,xlab =expression(Expected~~-log[10](italic(p))), tck=-0.015,
ylab = expression(Observed~~-log[10](italic(p))))
abline(a = 0, b = 1, col = "red")
dev.off()
print("QQ-Plot finished!")
print(paste("Manhattan & QQ Plot for trait: ", name.of.trait," accomplished successfully!" ,sep = ""))
}
}
用法
Rscript MANHATTAN_QQ.r gwas_results.txt "TraitName"
03 位点注释
for i in Birth_weight Ear_length Ear_width Nose_width tail_circumference Tail_length teat_number ; do sed -i 's/^27/X/' $i.top;done
for i in Birth_weight Ear_length Ear_width Nose_width tail_circumference Tail_length teat_number ; do cut -f 1-2 $i.top |bcftools view -R - /stor9000/apps/users/NWSUAF/2020055372/sheep_data/call_VCF/all_sample339.snp.filtered2.final.rename.vcf.gz -Oz -o $i.top.vcf.gz ;done
for i in Birth_weight Ear_length Ear_width Nose_width tail_circumference Tail_length teat_number ; do perl /stor9000/apps/users/NWSUAF/2015010726/software/annovar/table_annovar.pl -buildver OAR4.0 -protocol refGene -operation g -vcfinput $i.top.vcf.gz /stor9000/apps/users/NWSUAF/2015010726/Sheep/reference/Oar4.0/annovar_db/ -outfile $i.top; done
for i in Birth_weight Ear_length Ear_width Nose_width tail_circumference Tail_length teat_number ; do sed '1d' $i.top.OAR4.0_multianno.txt | cut -f 1-2,4-10 |csvtk join -t -f "1,2;1,2" -k $i.top - -T > $i.p_anno.txt;done