Genotype data preprocessing#
This mini-protocol outlines the workflow for processing genotype files, transitioning from VCF format to chromosome-specific PLINK files, running PLINK QC, preparing unrelated individuals for PCA, and conducting PCA.
Miniprotocol Timing#
This represents the total duration for all miniprotocol phases. While module-specific timings are provided separately on their respective pages, they are also included in this overall estimate.
Timing < X minutes
Overview#
This workflow is an application of the genotype related workflows from the xQTL project pipeline.
VCF_QC.ipynb
(step i): QC on VCF filesgenotype_formatting.ipynb
(step ii, iv): convert VCF to PLINK and merge the filesGWAS_QC.ipynb
(step iii, v-vii): QC on PLINK files, Kinship, and prepare unrelated individuals for PCAPCA.ipynb
(viii): Conduct PCAGRM.ipynb
(): Generates genomic relationship matrices (GRM) under the leave-one-chromosome-out (LOCO) theme
Steps#
i. QC for VCF files#
# We only do this for autosomal variants
echo ./ZOD14598_AD_GRM_WGS_2021-04-29_chr1.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr2.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr3.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr4.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr5.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr6.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr7.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr8.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr9.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr10.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr11.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr12.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr13.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr14.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr15.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr16.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr17.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr18.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr19.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr20.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr21.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr22.recalibrated_variants.vcf.gz \
| tr ' ' '\n' > vcf_qc/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt
sos run pipeline/VCF_QC.ipynb qc \
--genoFile vcf_qc/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt \
--dbsnp-variants /mnt/vast/hpc/csg/snuc_pseudo_bulk/data/reference_data/00-All.add_chr.variants.gz \
--reference-genome /mnt/vast/hpc/csg/snuc_pseudo_bulk/data/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--cwd vcf_qc/ \
-J 22 --mem 120G \
--container oras://ghcr.io/cumc/bioinfo_apptainer:latest
ii. Merge separated bed files into one#
Converting VCF to PLINK keeping only the ROSMAP samples.
sos run pipeline/genotype_formatting.ipynb vcf_to_plink
--genoFile `ls vcf_qc/*.leftnorm.bcftools_qc.vcf.gz` \
--cwd genotype/ \
--keep-samples ./ROSMAP_sample_list.txt \
--mem 120G \
--container oras://ghcr.io/cumc/bioinfo_apptainer:latest
sos run xqtl-protocol/pipeline/genotype_formatting.ipynb merge_plink \
--genoFile `ls *.leftnorm.bcftools_qc.bed` \
--name ROSMAP_NIA_WGS.leftnorm.bcftools_qc \
--cwd genotype/ \
-J 5 \
--container oras://ghcr.io/cumc/bioinfo_apptainer:latest
iii. QC for PLINK files#
Using PLINK-based workflows we:
Filter out those have more than 10% missing
Set HWE cutoff as 1E-8
No minor allele filter
sos run xqtl-protocol/pipeline/GWAS_QC.ipynb qc_no_prune \
--cwd genotype \
--genoFile xqtl_association/protocol_example.genotype.chr21_22.bed \
--geno-filter 0.1 \
--mind-filter 0.1 \
--hwe-filter 1e-08 \
--mac-filter 0 \
--container oras://ghcr.io/cumc/bioinfo_apptainer:latest
iv. Genotype data partition by chromosome#
This step is necessary for TensorQTL applications.
sos run xqtl-protocol/pipeline/genotype_formatting.ipynb genotype_by_chrom \
--genoFile xqtl_association/protocol_example.genotype.chr21_22.bed \
--cwd output \
--chrom `cut -f 1 xqtl_association/protocol_example.genotype.chr21_22.bim | uniq | sed "s/chr//g"` \
--container oras://ghcr.io/cumc/bioinfo_apptainer:latest
v. Sample match with genotype#
sos run xqtl-protocol/pipeline/GWAS_QC.ipynb genotype_phenotype_sample_overlap \
--cwd output/sample_meta \
--genoFile xqtl_association/protocol_example.genotype.chr21_22.fam \
--phenoFile xqtl_association/protocol_example.protein.csv \
--container oras://ghcr.io/cumc/bioinfo_apptainer:latest \
--mem 5G
vi. Kinship#
To accuratly estimate the PCs for the genotype. We split participants based on their kinship coefficients, estimated by KING
sos run xqtl-protocol/pipeline/GWAS_QC.ipynb king \
--cwd output/kinship \
--genoFile xqtl_association/protocol_example.genotype.chr21_22.bed \
--name pQTL \
--keep-samples output/sample_meta/protocol_example.protein.sample_genotypes.txt \
--container oras://ghcr.io/cumc/bioinfo_apptainer:latest \
--no-maximize-unrelated \
--mem 40G
viii. PCA on genotype#
sos run xqtl-protocol/pipeline/PCA.ipynb flashpca \
--cwd output/genotype_pca \
--genoFile output/cache/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.bed \
--container oras://ghcr.io/cumc/flashpcar_apptainer:latest \
--mem 16G
Anticipated Results#
Genotype preprocessing will produce cleaned genotype files, and genetic principal components.