Genotype data preprocessing#

This mini-protocol outlines the workflow for processing genotype files, transitioning from VCF format to chromosome-specific PLINK files, running PLINK QC, preparing unrelated individuals for PCA, and conducting PCA.

Miniprotocol Timing#

This represents the total duration for all miniprotocol phases. While module-specific timings are provided separately on their respective pages, they are also included in this overall estimate.

Timing < X minutes

Overview#

This workflow is an application of the genotype related workflows from the xQTL project pipeline.

  1. VCF_QC.ipynb (step i): QC on VCF files

  2. genotype_formatting.ipynb (step ii, iv): convert VCF to PLINK and merge the files

  3. GWAS_QC.ipynb (step iii, v-vii): QC on PLINK files, Kinship, and prepare unrelated individuals for PCA

  4. PCA.ipynb (viii): Conduct PCA

  5. GRM.ipynb (): Generates genomic relationship matrices (GRM) under the leave-one-chromosome-out (LOCO) theme

Steps#

i. QC for VCF files#

# We only do this for autosomal variants

echo ./ZOD14598_AD_GRM_WGS_2021-04-29_chr1.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr2.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr3.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr4.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr5.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr6.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr7.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr8.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr9.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr10.recalibrated_variants.vcf.gz  ./ZOD14598_AD_GRM_WGS_2021-04-29_chr11.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr12.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr13.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr14.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr15.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr16.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr17.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr18.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr19.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr20.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr21.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr22.recalibrated_variants.vcf.gz \
    | tr ' ' '\n' > vcf_qc/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt

sos run pipeline/VCF_QC.ipynb qc \
    --genoFile vcf_qc/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt \
    --dbsnp-variants /mnt/vast/hpc/csg/snuc_pseudo_bulk/data/reference_data/00-All.add_chr.variants.gz \
    --reference-genome /mnt/vast/hpc/csg/snuc_pseudo_bulk/data/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --cwd vcf_qc/ \
    -J 22 --mem 120G \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest

ii. Merge separated bed files into one#

Converting VCF to PLINK keeping only the ROSMAP samples.

sos run pipeline/genotype_formatting.ipynb vcf_to_plink
    --genoFile `ls vcf_qc/*.leftnorm.bcftools_qc.vcf.gz` \
    --cwd genotype/ \
    --keep-samples ./ROSMAP_sample_list.txt \
    --mem 120G \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest
sos run xqtl-protocol/pipeline/genotype_formatting.ipynb merge_plink \
    --genoFile `ls *.leftnorm.bcftools_qc.bed` \
    --name ROSMAP_NIA_WGS.leftnorm.bcftools_qc  \
    --cwd genotype/ \
    -J 5 \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest

iv. Genotype data partition by chromosome#

This step is necessary for TensorQTL applications.

sos run xqtl-protocol/pipeline/genotype_formatting.ipynb genotype_by_chrom \
    --genoFile xqtl_association/protocol_example.genotype.chr21_22.bed \
    --cwd output \
    --chrom `cut -f 1 xqtl_association/protocol_example.genotype.chr21_22.bim | uniq | sed "s/chr//g"` \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest

v. Sample match with genotype#

sos run xqtl-protocol/pipeline/GWAS_QC.ipynb genotype_phenotype_sample_overlap \
        --cwd output/sample_meta \
        --genoFile xqtl_association/protocol_example.genotype.chr21_22.fam  \
        --phenoFile xqtl_association/protocol_example.protein.csv \
        --container oras://ghcr.io/cumc/bioinfo_apptainer:latest \
        --mem 5G

vi. Kinship#

To accuratly estimate the PCs for the genotype. We split participants based on their kinship coefficients, estimated by KING

sos run xqtl-protocol/pipeline/GWAS_QC.ipynb king \
    --cwd output/kinship \
    --genoFile xqtl_association/protocol_example.genotype.chr21_22.bed \
    --name pQTL \
    --keep-samples output/sample_meta/protocol_example.protein.sample_genotypes.txt \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest \
    --no-maximize-unrelated \
    --mem 40G

vii. Prepare unrelated individuals data for PCA#

sos run xqtl-protocol/pipeline/GWAS_QC.ipynb qc \
   --cwd output/cache \
   --genoFile output/kinship/protocol_example.genotype.chr21_22.pQTL.unrelated.bed \
   --mac-filter 5 \
   --container oras://ghcr.io/cumc/bioinfo_apptainer:latest \
   --mem 16G

If No related individuals detected from *.kin0 occurs, there is no separate genotype data generated for unrelated individuals. In this case, we need to work from the original genotype data and must use --keep-samples to run qc to extract samples for PCA. For example:

sos run xqtl-protocol/pipeline/GWAS_QC.ipynb qc \
   --cwd output/cache \
   --genoFile xqtl_association/protocol_example.genotype.chr21_22.bed \
   --keep-samples output/sample_meta/protocol_example.protein.sample_genotypes.txt \
   --name pQTL \
   --mac-filter 5 \
   --container oras://ghcr.io/cumc/bioinfo_apptainer:latest \
   --mem 40G

viii. PCA on genotype#

sos run xqtl-protocol/pipeline/PCA.ipynb flashpca \
   --cwd output/genotype_pca \
   --genoFile output/cache/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.bed \
   --container oras://ghcr.io/cumc/flashpcar_apptainer:latest \
   --mem 16G

Anticipated Results#

Genotype preprocessing will produce cleaned genotype files, and genetic principal components.