Phenotype data preprocessing

This mini-protocol documents the shared post-processing steps and utilities for handling molecular phenotype files, including imputation.

Miniprotocol Timing

This is the total duration of all miniprotocol phases. Module-specific timings are listed separately on their respective pages and are already included in this overall estimate.

Timing < 12 minutes

Overview

This workflow applies the phenotype-related workflows from the xQTL project pipeline.

gene_annotation.ipynb (step i): Adds genomic coordinate annotation to gene-level molecular phenotype files and converts them to .bed format
phenotype_imputation.ipynb (step ii): Imputes missing entries in molecular phenotype data
phenotype_formatting.ipynb (step iii): Splits each phenotype file by chromosome

Steps

i. Phenotype Annotation

This step annotates each gene in the original phenotype matrix with its corresponding chr, start, end, and gene_id.

sos run xqtl-protocol/pipeline/gene_annotation.ipynb annotate_coord_protein \
    --cwd output/phenotype \
    --phenoFile xqtl_association/protocol_example.protein.csv \
    --annotation-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --phenotype-id-type gene_name \
    --sample-participant-lookup output/sample_meta/protocol_example.protein.sample_overlap.txt \
    --sep "," \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest
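Conceptually, the annotation step joins each phenotype row to its gene coordinates and emits BED-style columns. The sketch below illustrates that idea in plain Python and is not the pipeline's implementation: the `annotate` helper, the toy coordinate table, the gene names, and the sample names are all hypothetical, and a real run derives coordinates from the GTF file given by `--annotation-gtf`.

```python
import csv
import io

# Hypothetical gene coordinates keyed by gene name; the real step reads
# these from the annotation GTF rather than a hard-coded table.
COORDS = {
    "GENE_A": ("chr21", 100, 200, "ENSG_A"),
    "GENE_B": ("chr22", 300, 400, "ENSG_B"),
}

# Toy comma-separated phenotype matrix; the first column is the gene name.
PHENO_CSV = "gene_name,S1,S2\nGENE_B,0.5,1.2\nGENE_A,2.0,NA\nGENE_X,1.0,1.0\n"


def annotate(pheno_text, coords):
    """Return a BED-style header and rows sorted by chromosome name and start."""
    reader = csv.reader(io.StringIO(pheno_text))
    header = next(reader)
    rows = []
    for name, *values in reader:
        if name not in coords:
            continue  # assumption of this sketch: unannotated genes are dropped
        chrom, start, end, gene_id = coords[name]
        rows.append([chrom, start, end, gene_id] + values)
    # Lexicographic sort on the chromosome name, then numeric sort on start.
    rows.sort(key=lambda r: (r[0], r[1]))
    return ["#chr", "start", "end", "gene_id"] + header[1:], rows


bed_header, bed_rows = annotate(PHENO_CSV, COORDS)
for line in [bed_header] + bed_rows:
    print("\t".join(str(x) for x in line))
```

In this toy example `GENE_X` has no coordinates and is therefore absent from the output; how the actual pipeline treats such genes is not specified here.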

ii. Missing Value Imputation

This step imputes missing entries in the molecular phenotype data. It is optional for eQTL analysis but necessary for other QTL analyses. Missing entries are imputed with flashier, an Empirical Bayes Matrix Factorization (EBMF) model.

sos run xqtl-protocol/pipeline/phenotype_imputation.ipynb EBMF \
    --phenoFile output/phenotype/protocol_example.protein.bed.gz \
    --cwd output/phenotype \
    --prior ebnm_point_laplace --varType 1 \
    --container oras://ghcr.io/cumc/factor_analysis_apptainer:latest
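To give a feel for factorization-based imputation, the sketch below fills missing entries from a low-rank fit to the observed entries. It is deliberately not EBMF and not what flashier does: it uses plain rank-1 alternating least squares with no priors and no shrinkage, and the `impute_rank1` helper and toy matrix are hypothetical.

```python
def impute_rank1(matrix, n_iter=200):
    """Fill None entries with a rank-1 least-squares fit to observed entries."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows  # row factors
    v = [0.0] * n_cols  # column factors
    for _ in range(n_iter):
        # Update column factors using only the observed entries.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            denom = sum(u[i] ** 2 for i in obs)
            v[j] = sum(u[i] * matrix[i][j] for i in obs) / denom if denom else 0.0
        # Update row factors the same way.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            denom = sum(v[j] ** 2 for j in obs)
            u[i] = sum(v[j] * matrix[i][j] for j in obs) / denom if denom else 0.0
    # Keep observed values untouched; replace only the missing ones.
    return [
        [matrix[i][j] if matrix[i][j] is not None else u[i] * v[j]
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy matrix that is exactly rank 1; the missing entry should come out close to 4.
toy = [[1.0, 2.0, None], [2.0, 4.0, 8.0], [3.0, 6.0, 12.0]]
filled = impute_rank1(toy)
print(round(filled[0][2], 4))
```

An empirical Bayes approach additionally places priors on the factors (the `--prior ebnm_point_laplace` option above selects one), which this least-squares sketch omits entirely.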

iii. Partition by Chromosome

This step splits the phenotype file by chromosome.

sos run xqtl-protocol/pipeline/phenotype_formatting.ipynb phenotype_by_chrom \
    --cwd output/phenotype_by_chrom \
    --phenoFile output/phenotype/protocol_example.protein.bed.gz \
    --chrom `for i in {21..22}; do echo chr$i; done` \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest
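The backtick loop passed to `--chrom` expands to `chr21 chr22`. As a rough illustration of what partitioning by chromosome means, the sketch below groups the rows of a toy BED-style table by their first column and repeats the header in each group. The `split_by_chrom` helper and the table contents are hypothetical, and the real workflow's output layout may differ.

```python
from collections import defaultdict

# Toy BED-style phenotype table (hypothetical values), tab-separated.
BED_TEXT = (
    "#chr\tstart\tend\tgene_id\tS1\tS2\n"
    "chr21\t100\t200\tENSG_A\t2.0\t1.1\n"
    "chr22\t300\t400\tENSG_B\t0.5\t1.2\n"
    "chr21\t500\t600\tENSG_C\t0.3\t0.9\n"
)


def split_by_chrom(bed_text, chroms):
    """Group data lines by chromosome, repeating the header in each group."""
    header, *lines = bed_text.splitlines()
    groups = defaultdict(list)
    for line in lines:
        chrom = line.split("\t", 1)[0]
        if chrom in chroms:
            groups[chrom].append(line)
    # Only chromosomes that actually have rows get an entry.
    return {c: "\n".join([header] + groups[c]) + "\n" for c in chroms if groups[c]}


# Same chromosome list as the shell loop: chr21 and chr22.
chroms = ["chr%d" % i for i in range(21, 23)]
per_chrom = split_by_chrom(BED_TEXT, chroms)
for chrom, text in per_chrom.items():
    print(chrom, "->", len(text.splitlines()) - 1, "phenotype rows")
```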

Anticipated Results

Phenotype preprocessing should result in a phenotype file formatted and ready for use in TensorQTL.
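A quick sanity check on the resulting table can catch layout problems before association testing. The sketch below assumes the BED layout described in the tensorQTL documentation (a single header line starting with `#`, then chr, start, end, and a phenotype identifier, followed by sample columns). The `check_bed` helper and the example tables are hypothetical, and the checks are illustrative rather than exhaustive.

```python
def check_bed(text):
    """Return a list of human-readable problems found in a BED-style table."""
    problems = []
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#"):
        return ["header line should start with '#'"]
    n_cols = len(lines[0].split("\t"))
    if n_cols < 5:
        return ["expected 4 annotation columns plus at least one sample column"]
    for n, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != n_cols:
            problems.append(f"line {n}: expected {n_cols} fields, got {len(fields)}")
        elif not (fields[1].isdigit() and fields[2].isdigit()):
            problems.append(f"line {n}: start and end should be integers")
        elif int(fields[1]) >= int(fields[2]):
            problems.append(f"line {n}: start should be smaller than end")
    return problems


# Hypothetical examples: one well-formed table and one with start >= end.
GOOD = "#chr\tstart\tend\tphenotype_id\tS1\nchr21\t100\t101\tENSG_A\t0.5\n"
BAD = "#chr\tstart\tend\tphenotype_id\tS1\nchr21\t200\t100\tENSG_A\t0.5\n"

print(check_bed(GOOD))
print(check_bed(BAD))
```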