04.XP-EHH
原理
具体操作
计算XP-EHH
包括以下步骤:
- 分染色体提取map文件,并换算遗传距离
- 分染色体提取两个群体的vcf
- 分染色体计算xpehh,支持多线程
- 根据全局的value进行矫正,所以要一次性把所有的染色体都给到
# 分染色体提取map文件,并重命名snp name,换算遗传距离
grep -w '^1' /home/guoyingwei/project/LeiQiong_cattle2/00.data/from_raw_vcf/leiqiong.filtered.snp.maf.map |awk '{$3=$4/1000000;print $0}' |sed 's/ /\t/g' > /home/guoyingwei/project/LeiQiong_cattle2/03.selection/05.xpehh/leiqiong.filtered.snp.maf.chr1.map
# 分染色体分别提取两个群体的样本
for i in {2..29};do bcftools view -S /home/guoyingwei/project/LeiQiong_cattle2/01.diversity/01.pi/leiqiongguangdong /home/guoyingwei/project/LeiQiong_cattle2/00.data/from_raw_vcf/phased_vcf/chr${i}.leiqiong.filtered.snp.maf.phased.vcf.gz -Oz -o /home/guoyingwei/project/LeiQiong_cattle2/03.selection/05.xpehh/guangdong/chr${i}.leiqiong.filtered.snp.maf.phased.guangdong.vcf.gz;done
# 分染色体计算xpehh,支持多线程
for i in {11..20}; do selscan --xpehh --vcf guangdong/chr${i}.leiqiong.filtered.snp.maf.phased.guangdong.vcf.gz --vcf-ref hainan/chr${i}.leiqiong.filtered.snp.maf.phased.hainan.vcf.gz --map map/leiqiong.filtered.snp.maf.chr${i}.map --out guangdong_hainan.chr${i} --threads 10; done
#
norm --xpehh --files guangdong_hainan.chr*.xpehh.out --bp-win --winsize 50000 --log norm.log
矫正后的处理
文件格式
矫正后会有两种格式的文件:
xpehh.out.norm和xpehh.out.norm.50kb.windows
xpehh.out.norm.50kb.windows输出文件格式:
==> guangdong_hainan.chr10.xpehh.out.norm.50kb.windows <==
1 50001 70 0 0 5 100 1.93455 -1.22812
50001 100001 1444 0.066482 0 100 100 3.38201 -1.05993
100001 150001 1789 0.0134153 0 100 100 2.42656 -1.32729
150001 200001 748 0.092246 0 100 100 2.79572 -0.331222
200001 250001 1733 0.0536642 0 100 100 3.5527 -0.975462
250001 300001 1584 0.00568182 0 100 100 2.2763 -1.32988
300001 350001 972 0.00102881 0.00102881 100 100 2.01869 -2.07582
350001 400001 1107 0.00542005 0.00542005 100 100 2.3941 -2.41838
400001 450001 1273 0.0974077 0 100 100 3.23089 -0.94057
450001 500001 687 0 0 100 100 1.20025 -1.28854
标题行:
<win start> <win end> <scores in win> <gt the fraction of XP-EHH scores >2> <lt the fraction of XP-EHH scores < -2> <approx percentile for gt threshold wins> <approx percentile for lt threshold wins> <max score> <min score>
最开始是用这个窗口的结果,最后两列的最大和最小值画图,但感觉不科学,而且不支持overlapped滑动窗口,所以还是用所有位点的,然后网上找到的脚步自己算每个窗口的平均值。
xpehh.out.norm文件格式如下:
id pos gpos p1 ihh1 p2 ihh2 xpehh normxpehh crit
5_31647 31647 0.031647 0.902439 0.000956156 0.97 0.00101616 -0.0264322 0.310865 0
5_31701 31701 0.031701 0.231707 0.000760266 0.14 0.000923125 -0.0842947 -0.0178511 0
5_31718 31718 0.031718 0.219512 0.000739398 0.12 0.000924215 -0.0968946 -0.0894309 0
5_31802 31802 0.031802 0.256098 0.000636091 0.16 0.000840592 -0.121066 -0.226748 0
5_31805 31805 0.031805 0.292683 0.000636734 0.18 0.000833599 -0.116999 -0.203644 0
5_31821 31821 0.031821 0.585366 0.000651229 0.77 0.000985299 -0.179834 -0.560608 0
计算窗口平均值
直接把每条染色体的xpehh.out.norm文件用下面这个python脚本,生成处理后的文件。
python脚本如下:
# XPEHH_win_step.py
import pandas as pd
import click
def load_data(file):
data = pd.read_csv(file, delimiter="\t|\s+",
engine='python')
return data
def results(data, step_size, window_size):
result = []
chromosome_length = max(data['pos'])
for BIN_START in range(1, chromosome_length, step_size):
BIN_END = BIN_START - 1 + window_size
if BIN_START + window_size > chromosome_length:
break
normxpehh_vals = []
for _, row in data[(data['pos'] >= BIN_START) & (data['pos'] < BIN_END)].iterrows():
if not pd.isna(row['pos']):
normxpehh_vals.append(row['normxpehh'])
# 计算 normxpehh 的平均值并保留4位小数, 统计区间SNP数量
avg_normxpehh = 0
nvar = 0
if len(normxpehh_vals) > 0:
avg_normxpehh = round(sum(normxpehh_vals) / len(normxpehh_vals), 4)
nvar = len(normxpehh_vals)
result.append([BIN_START, BIN_END,
avg_normxpehh, nvar])
return result
@click.command()
@click.option('-f','--file', help='xpehh.out.norm文件,norm后的位点文件,不是区间文件!!', required=True)
@click.option('-c','--chromosome', help='染色体号,因为是分染色体做的', required=True)
@click.option('-w','--window', help='窗口大小', type=int, default=50000)
@click.option('-s','--step', help='步长大小', type=int, default=50000)
def main(file, chromosome, window, step):
data = load_data(file)
window_size = window
step_size = step
out = results(data, step_size, window_size)
# 创建一个 DataFrame 对象来保存结果,并使用 to_csv 方法将其写入文件中
result_df = pd.DataFrame(out, columns=["BIN_START", "BIN_END",
"avg_normxpehh", "nvar"])
result_df.loc[:, 'CHROM'] = chromosome
if chromosome == 1:
result_df[["CHROM", "BIN_START", "BIN_END",
"nvar", "avg_normxpehh"]].to_csv(f'chr{chromosome}.win.step.mean.XPEHH', sep='\t',
index=False)
else:
result_df[["CHROM", "BIN_START", "BIN_END",
"nvar", "avg_normxpehh"]].to_csv(f'chr{chromosome}.win.step.mean.XPEHH', sep='\t',
index=False, header=False)
if __name__ == '__main__':
main()
生成的文件格式如下:
12 1 50000 14 -0.0586
12 25001 75000 23 0.2398
12 50001 100000 206 0.008
12 75001 125000 536 -0.0816
12 100001 150000 604 -0.3024
12 125001 175000 410 -0.5833
12 150001 200000 298 -0.1776
12 175001 225000 301 0.348
12 200001 250000 292 0.4825
12 225001 275000 279 0.1246
把他们cat到一起就可以画图了。