Covariate Data Preprocessing

This notebook contains the workflow for processing covariate files and computing PCA-derived covariates from phenotype data.

Miniprotocol Timing

The timing below is the total duration of all miniprotocol phases. Module-specific timings are listed separately on their respective pages and are already included in this overall estimate.

Timing: < 3 minutes

Overview

This workflow applies the covariate-related sections of the xQTL project pipeline.

  1. covariate_formatting.ipynb (step i): Merge covariates and genotype PCA

  2. covariate_hidden_factor.ipynb (step ii): Compute residuals on merged covariates and perform hidden factor analysis

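Steps

i. Merge Covariates and Genotype PCs

Use the --k parameter to change the total amount of variation that the genotype PCs explain. In this example we chose 80%: the command takes --k from the first column of the last row in the PCA scree file whose third column is below 0.8.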

sos run xqtl-protocol/pipeline/covariate_formatting.ipynb merge_genotype_pc \
    --cwd output/covariate \
    --pcaFile output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.rds \
    --covFile xqtl_association/protocol_example.samples.tsv \
    --tol-cov 0.4 \
    --k $(awk '$3 < 0.8' output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.scree.txt | tail -1 | cut -f 1) \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest
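To see how the value passed to --k is derived, the same awk, tail and cut pipeline can be tried on a small toy file. This is only a sketch: the toy file and its column meanings (PC index, proportion of variance, cumulative proportion) are assumptions made for illustration, and the real scree file may carry a header or a different layout.

```shell
# Toy scree table (hypothetical values), tab-separated with no header.
# Assumed columns: PC index, proportion of variance, cumulative proportion.
scree=$(mktemp)
printf '1\t0.50\t0.50\n2\t0.20\t0.70\n3\t0.15\t0.85\n4\t0.15\t1.00\n' > "$scree"

# Same selection as in the command above: keep rows whose third column is
# below 0.8, take the last such row, and read its first column.
k=$(awk '$3 < 0.8' "$scree" | tail -1 | cut -f 1)
echo "k=$k"   # prints "k=2" for this toy table

rm -f "$scree"
```

Changing the 0.8 threshold in the awk expression selects a different row and therefore a different --k. Because cut -f splits on tabs by default, this pattern relies on the file being tab-delimited.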

ii. Compute Residuals on Merged Covariates and Perform Hidden Factor Analysis

sos run xqtl-protocol/pipeline/covariate_hidden_factor.ipynb Marchenko_PC \
   --cwd output/covariate \
   --phenoFile output/phenotype/protocol_example.protein.bed.gz  \
   --covFile output/covariate/protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.gz \
   --mean-impute-missing \
   --container oras://ghcr.io/cumc/pcatools_apptainer:latest
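Step ii reads the merged covariate file written by step i, so it can help to confirm that both input files exist before launching the command above. The helper below is a minimal sketch; check_inputs is a hypothetical function name and is not part of the xQTL pipeline.

```shell
# Hypothetical helper (not part of the pipeline): report each input that is
# missing or empty, and return non-zero if any such file is found.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Paths copied from the --phenoFile and --covFile arguments of step ii.
if check_inputs \
    output/phenotype/protocol_example.protein.bed.gz \
    output/covariate/protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.gz
then
    echo "inputs found"
else
    echo "one or more inputs are missing; run the earlier steps first"
fi
```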

Anticipated Results

The processed covariate data include a file containing the covariates and hidden factors, for use with TensorQTL.