Collapsed Haplotype Pattern Method for Linkage Analysis of Next-Generation Sequencing Data
Pre-requisites
Make sure you install the pre-requisited before running seqlink:
#install cstatgen
conda install -c conda-forge xeus-cling
conda install -c anaconda swig
conda install -c conda-forge gsl
pip install egglib
git clone https://github.com/statgenetics/cstatgen.git
cd cstatgen
python setup.py install
#install paramlink2
R
install.packages("paramlink2")
pip install SEQLinkage
!seqlink --help
!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt --blueprint testdata/test_blueprint_ext.txt --included-vars testdata/test_chr1_included_vars.txt -o data/test_chp --run-linkage
!seqlink --single-marker --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt --blueprint testdata/test_blueprint_ext.txt -c 0.05 -o data/test_var --run-linkage
!seqlink --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --freq='AF' --blueprint testdata/test_blueprint_ext.txt -c 0.05 -o data/test_chp_na --run-linkage
!seqlink --single-marker --fam testdata/test_ped.fam --vcf testdata/test_snps.vcf.gz --anno testdata/test_chr1_anno.csv --pop testdata/test_fam_pop.txt -c 0.05 -o data/test_wg --run-linkage
--fam
, Fam file (required, format: "fid iid fathid mothid sex trait[1 control, 2 case, -9 or 0 missing]")
%%writefile testdata/test_ped.fam
1033 1033_2 0 0 2 -9
1033 1033_1 0 0 1 -9
1033 1033_99 1033_1 1033_2 2 1
1033 1033_9 1033_1 1033_2 2 1
1033 1033_3 1033_1 1033_2 2 2
1036 1036_99 1036_1 1036_2 2 2
1036 1036_6 0 0 1 2
1036 1036_1 0 0 1 -9
1036 1036_3 1036_6 1036_99 2 1
1036 1036_4 1036_6 1036_99 2 1
1036 1036_2 0 0 2 -9
1036 1036_5 1036_6 1036_99 1 1
10J_103 10J_103_10 0 0 1 -9
10J_103 10J_103_4 0 0 1 -9
10J_103 10J_103_3 0 0 2 -9
10J_103 10J_103_2 10J_103_4 10J_103_3 2 2
10J_103 10J_103_1 10J_103_10 10J_103_3 1 2
10J_109 10J_109_2 10J_109_6 10J_109_5 1 2
10J_109 10J_109_3 10J_109_6 10J_109_5 1 2
10J_109 10J_109_4 10J_109_6 10J_109_5 1 2
10J_109 10J_109_6 0 0 1 -9
10J_109 10J_109_1 10J_109_6 10J_109_5 1 2
10J_109 10J_109_5 0 0 2 2
10J_109 10J_109_7 10J_109_6 10J_109_5 1 1
10J_112 10J_112_3 0 0 1 1
10J_112 10J_112_5 10J_112_3 10J_112_2 1 2
10J_112 10J_112_1 10J_112_3 10J_112_2 2 1
10J_112 10J_112_7 10J_112_3 10J_112_2 1 1
10J_112 10J_112_2 0 0 2 2
10J_119 10J_119_2 0 0 1 1
10J_119 10J_119_5 0 0 2 1
10J_119 10J_119_4 0 0 1 1
10J_119 10J_119_6 10J_119_4 10J_119_5 1 2
10J_119 10J_119_7 10J_119_4 10J_119_5 2 2
10J_119 10J_119_1 10J_119_4 10J_119_5 2 2
10J_119 10J_119_3 10J_119_2 10J_119_1 1 1
--vcf
, VCF file (required, vcf.gz with vcf.gz.tbi)bgzip -c file.vcf > file.vcf.gz tabix -p vcf file.vcf.gz
--anno
, Annotation file fromANNOVAR
, It must contains the allele frequency for population in the file of family population information. For example in here, The annotation file must have AF_amr, AF_afr, AF_nfe columns.
anno=pd.read_csv('testdata/test_chr1_anno.csv')
anno
anno.loc[:,['AF_amr', 'AF_afr', 'AF_nfe']]
Or, create a self-defined annotation file like this:
Chr Start AF_amr AF AF_nfe AF_afr
chr1:10140:ACCCTAAC:A 1 10140 0.0006 0.0007 0.0007 0.0008
chr1:10146:AC:A 1 10146 0.638 0.6328 0.6413 0.63
chr1:10150:CT:C 1 10150 0.0357 0.0375 0.037 0.0426
chr1:10172:CCCTAA:C 1 10172 0.0086 0.0082 0.0084 0.0097
chr1:10178:CCTAA:C 1 10178 0.5 0.3333 0.2955 0.4375
chr1:10198:TAACCC:T 1 10198 0.0 0.0 0.0 0.0
chr1:10231:C:A 1 10231 0.2 0.0366 0.0 0.05
chr1:10236:AACCCT:A 1 10236 0.0 0.0 0.0 0.0
chr1:10247:TAAACCCTA:T 1 10247 0.2222 0.2089 0.1429 0.4211
The index must match with the ID in vcf file.
--pop
, The file of family population information
%%writefile testdata/test_fam_pop.txt
1033 AF_amr
1036 AF_amr
10J_103 AF_afr
10J_109 AF_nfe
10J_112 AF_nfe
10J_119 AF_nfe
--included-vars
, The file with one column of variants For example:chr1:10140:ACCCTAAC:A chr1:10172:CCCTAA:C chr1:10198:TAACCC:T chr1:10236:AACCCT:A chr1:10261:T:TA chr1:10262:AACCCT:A
--blueprint
, The blueprint file that defines regional marker (format: "chr startpos endpos name avg.distance male.distance female.distance"). The first four columns are required.
%%writefile testdata/test_blueprint_ext.txt
1 11868 14362 LOC102725121@1 9.177127474362311e-07 1.1657192989882668e-06 6.814189157634088e-07
1 11873 14409 DDX11L1 9.195320788455595e-07 1.1680302941673515e-06 6.82769803434766e-07
1 14361 29370 WASH7P 1.5299877409602128e-06 1.94345806118021e-06 1.136044574393209e-06
1 17368 17436 MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1 1.217692507120495e-06 1.5467668502473368e-06 9.041595098829462e-07
1 30365 30503 MIR1302-10@1,MIR1302-11@1,MIR1302-2@1,MIR1302-9@1 2.1295973889038703e-06 2.705108741548526e-06 1.5812659765416382e-06
1 34610 36081 FAM138A@1,FAM138C@1,FAM138F@1 2.4732411024120156e-06 3.1416201771056266e-06 1.8364278747737466e-06
1 69090 70008 OR4F5 4.866641545668504e-06 6.181823219621424e-06 3.6135725636621673e-06
1 134772 140566 LOC729737 9.633289838108921e-06 1.2236630588823159e-05 7.152898262617822e-06
1 490755 495445 LOC100132062@1,LOC100132287@1 2.2828130832833112e-05 2.8997300893994373e-05 1.6950315013571593e-05
1 450739 451678 OR4F16@1,OR4F29@1,OR4F3@1 2.575942360468604e-05 3.2720758549649544e-05 1.912685483821856e-05
1 627379 629009 LOC101928626 3.943568768003252e-05 5.009295373297854e-05 2.9281737249900675e-05
1 632614 632685 MIR12136 3.974742311959244e-05 5.048893386847169e-05 2.9513206656665908e-05
Or
%%writefile testdata/test_blueprint.txt
1 11868 14362 LOC102725121@1
1 11873 14409 DDX11L1
1 14361 29370 WASH7P
1 17368 17436 MIR6859-1@1,MIR6859-2@1,MIR6859-3@1,MIR6859-4@1
1 30365 30503 MIR1302-10@1,MIR1302-11@1,MIR1302-2@1,MIR1302-9@1
1 34610 36081 FAM138A@1,FAM138C@1,FAM138F@1
1 69090 70008 OR4F5
1 134772 140566 LOC729737
1 490755 495445 LOC100132062@1,LOC100132287@1
1 450739 451678 OR4F16@1,OR4F29@1,OR4F3@1
1 627379 629009 LOC101928626
1 632614 632685 MIR12136
- InfoFam: the number of families with the variant or the CHP marker.
LOD Score.
It is calculated from 0 to 0.5 with step 0.05 per family per gene. you can change them by --theta-inc
and --theta-max
.
- LOD0: the sum of LOD score at theta=0 among all families
- LODmax: the max of the sum of LOD score among all families between the range of thetas.
HLOD Score
- theta: the theta of best HLOD score.
- alpha: the alpha of best HLOD score.
- hlod: the max HLOD of these HLOD between the range of thetas.
result=pd.read_csv('data/test_chp/chr1result_lod_summary.csv',index_col=0)
result
result=pd.read_csv('data/test_var/chr1result_lod_summary.csv',index_col=0)
result