04.XP-EHH

原理

具体操作

计算XP-EHH

包括以下步骤：

分染色体提取map文件，并换算遗传距离
分染色体提取两个群体的vcf
分染色体计算xpehh，支持多线程
根据全局的value进行矫正，所以要一次性把所有的染色体都给到

# 分染色体提取map文件，并重命名snp name，换算遗传距离
grep -w '^1' /home/guoyingwei/project/LeiQiong_cattle2/00.data/from_raw_vcf/leiqiong.filtered.snp.maf.map |awk '{$3=$4/1000000;print $0}' |sed 's/ /\t/g'  > /home/guoyingwei/project/LeiQiong_cattle2/03.selection/05.xpehh/leiqiong.filtered.snp.maf.chr1.map
# 分染色体分别提取两个群体的样本
for i in {2..29};do  bcftools view -S /home/guoyingwei/project/LeiQiong_cattle2/01.diversity/01.pi/leiqiongguangdong /home/guoyingwei/project/LeiQiong_cattle2/00.data/from_raw_vcf/phased_vcf/chr${i}.leiqiong.filtered.snp.maf.phased.vcf.gz -Oz -o /home/guoyingwei/project/LeiQiong_cattle2/03.selection/05.xpehh/guangdong/chr${i}.leiqiong.filtered.snp.maf.phased.guangdong.vcf.gz;done
# 分染色体计算xpehh，支持多线程
for i in {11..20}; do selscan --xpehh --vcf guangdong/chr${i}.leiqiong.filtered.snp.maf.phased.guangdong.vcf.gz --vcf-ref hainan/chr${i}.leiqiong.filtered.snp.maf.phased.hainan.vcf.gz --map map/leiqiong.filtered.snp.maf.chr${i}.map --out guangdong_hainan.chr${i} --threads 10; done
# 
norm --xpehh --files guangdong_hainan.chr*.xpehh.out --bp-win --winsize 50000 --log norm.log

矫正后的处理

文件格式

矫正后会有两种格式的文件：
xpehh.out.norm和xpehh.out.norm.50kb.windows

xpehh.out.norm.50kb.windows输出文件格式：

==> guangdong_hainan.chr10.xpehh.out.norm.50kb.windows <==
1       50001   70      0       0       5       100     1.93455 -1.22812
50001   100001  1444    0.066482        0       100     100     3.38201 -1.05993
100001  150001  1789    0.0134153       0       100     100     2.42656 -1.32729
150001  200001  748     0.092246        0       100     100     2.79572 -0.331222
200001  250001  1733    0.0536642       0       100     100     3.5527  -0.975462
250001  300001  1584    0.00568182      0       100     100     2.2763  -1.32988
300001  350001  972     0.00102881      0.00102881      100     100     2.01869 -2.07582
350001  400001  1107    0.00542005      0.00542005      100     100     2.3941  -2.41838
400001  450001  1273    0.0974077       0       100     100     3.23089 -0.94057
450001  500001  687     0       0       100     100     1.20025 -1.28854

标题行：

<win start> <win end> <scores in win> <gt the fraction of XP-EHH scores >2> <lt the fraction of XP-EHH scores < -2> <approx percentile for gt threshold wins> <approx percentile for lt threshold wins> <max score> <min score>

最开始是用这个窗口的结果，最后两列的最大和最小值画图，但感觉不科学，而且不支持overlapped滑动窗口，所以还是用所有位点的，然后网上找到的脚步自己算每个窗口的平均值。

xpehh.out.norm文件格式如下：

id      pos     gpos    p1      ihh1    p2      ihh2    xpehh   normxpehh       crit
5_31647 31647   0.031647        0.902439        0.000956156     0.97    0.00101616      -0.0264322      0.310865        0
5_31701 31701   0.031701        0.231707        0.000760266     0.14    0.000923125     -0.0842947      -0.0178511      0
5_31718 31718   0.031718        0.219512        0.000739398     0.12    0.000924215     -0.0968946      -0.0894309      0
5_31802 31802   0.031802        0.256098        0.000636091     0.16    0.000840592     -0.121066       -0.226748       0
5_31805 31805   0.031805        0.292683        0.000636734     0.18    0.000833599     -0.116999       -0.203644       0
5_31821 31821   0.031821        0.585366        0.000651229     0.77    0.000985299     -0.179834       -0.560608       0

计算窗口平均值

直接把每条染色体的xpehh.out.norm文件用下面这个python脚本，生成处理后的文件。

python脚本如下：

# XPEHH_win_step.py
import pandas as pd  
import click  
  
def load_data(file):  
    data = pd.read_csv(file, delimiter="\t|\s+",   
                       engine='python')  
    return data  
  
def results(data, step_size, window_size):  
    result = []  
    chromosome_length = max(data['pos'])  
  
    for BIN_START in range(1, chromosome_length, step_size):  
        BIN_END = BIN_START - 1 + window_size  
        if BIN_START + window_size > chromosome_length:  
            break  
  
        normxpehh_vals = []  
        for _, row in data[(data['pos'] >= BIN_START) & (data['pos'] < BIN_END)].iterrows():  
            if not pd.isna(row['pos']):  
                normxpehh_vals.append(row['normxpehh'])  
  
        # 计算 normxpehh 的平均值并保留4位小数, 统计区间SNP数量  
        avg_normxpehh = 0  
        nvar = 0  
        if len(normxpehh_vals) > 0:  
            avg_normxpehh = round(sum(normxpehh_vals) / len(normxpehh_vals), 4)  
            nvar = len(normxpehh_vals)  
        result.append([BIN_START, BIN_END,   
                       avg_normxpehh, nvar])  
    return result  
  
@click.command()  
@click.option('-f','--file', help='xpehh.out.norm文件，norm后的位点文件，不是区间文件！！', required=True)  
@click.option('-c','--chromosome', help='染色体号，因为是分染色体做的', required=True)  
@click.option('-w','--window', help='窗口大小', type=int, default=50000)  
@click.option('-s','--step', help='步长大小', type=int, default=50000)  
def main(file, chromosome, window, step):  
    data = load_data(file)  
    window_size = window  
    step_size = step  
    out = results(data, step_size, window_size)  
  
    # 创建一个 DataFrame 对象来保存结果，并使用 to_csv 方法将其写入文件中  
    result_df = pd.DataFrame(out, columns=["BIN_START", "BIN_END",   
                                           "avg_normxpehh", "nvar"])  
    result_df.loc[:, 'CHROM'] = chromosome  
  
    if chromosome == 1:  
        result_df[["CHROM", "BIN_START", "BIN_END",  
                   "nvar", "avg_normxpehh"]].to_csv(f'chr{chromosome}.win.step.mean.XPEHH', sep='\t',  
                                                    index=False)  
    else:  
        result_df[["CHROM", "BIN_START", "BIN_END",  
                   "nvar", "avg_normxpehh"]].to_csv(f'chr{chromosome}.win.step.mean.XPEHH', sep='\t',  
                                                    index=False, header=False)  
  
if __name__ == '__main__':  
    main()

生成的文件格式如下：

12      1       50000   14      -0.0586
12      25001   75000   23      0.2398
12      50001   100000  206     0.008
12      75001   125000  536     -0.0816
12      100001  150000  604     -0.3024
12      125001  175000  410     -0.5833
12      150001  200000  298     -0.1776
12      175001  225000  301     0.348
12      200001  250000  292     0.4825
12      225001  275000  279     0.1246

把他们cat到一起就可以画图了。