APEX QTL association testing#

This notebook implements a workflow for using APEX to conduct analysis. APEX can use either a linear model or a linear mixed model for association testing. The linear mix model is potentially useful for analysis with related individuals.

Cautions#

  • In our pilot cis-eQTL analysis with around ~500 samples we found that APEX is less robust compared to tensorQTL for association scans and for linear regression using OLS there is no speed advantage over tensorQTL. Gene level p-value from APEX are on average smaller than that from tensorQTL, resulting in more genes discovered. However, comparison with univariate fine-mapping seem to suggest that these p-values are a bit inflated. We therefore recommend using tensorQTL for at least the analysis of samples without related individuals.

  • Notice that the command options are different from those on the APEX website documentation. The commands on the documentation page does not work (last updated September 2021). The commands below were constructed and tested by our team based on our understanding of the program, without input from APEX authors.

  • --low-mem option does not work for APEX so we do not use it (corbinq/apex#7). As a result, we have to use a large memory eg 80GB to complete the analysis.

  • the --fit-null flag listed on apex documentaion no longer exists but replaced by get-theta . --save-resid --write-gvar were replaced by --get-resid and --get-gvar. The --prefix parameter listed on the websit was replaced by --out

  • On the apex documentation, it was suggest option --exclude-snps are avaible to control for the actual variants to be analyzed. But this flag in fact are not presented in the provided binary release.

Input#

  • List of molecular phenotype files: a list of bed.gz files containing the table for the molecular phenotype. It should have a companion index file in tbi format.

  • List of genotypes in VCF format for each chromosome, previously processed through our genotype QC pipelines.

  • List of GRM containing path to GRM matrices that generated by the GRM module.

  • Covariate file, a file with #id + samples name as colnames and each row a covariate: fixed and known covariates as well as hidden covariates recovered from factor analysis.

Output#

For each chromosome, several of summary statistics files are generated, including both nominal test statistics for each test, as well as region (gene) level association evidence.

The columns of nominal association result are as follows:

  • #chrom : Variant chromosome.

  • pos : Variant chromosomal position (basepairs).

  • ref : Variant reference allele (A, C, T, or G).

  • alt : Variant alternate allele.

  • gene : Molecular trait identifier (as specified in –bed {trait-file}).

  • beta : OLS regression slope for variant on trait.

  • se : Standard error of regression slope.

  • pval : Single-variant association nominal p-value.

  • variant_id: ID of the variant (rsid or chr:position:ref:alt)

The columns of region (gene) level association evidence are as follows:

  • #chrom : Molecular trait chromosome.

  • start : Molecular trait start position.

  • end : Molecular trait end position.

  • gene : Molecular trait identifier.

  • gene_pval : Trait-level p-value calculated across all variants in the cis region using the Cauchy combination test, comparable to beta-approximated permutation p-values.

  • n_samples : Number of samples included in analysis.

  • n_covar : Number of covariates included in analysis, including intercept.

  • resid_sd : Square root of regression mean squared error under the null model.

  • n_cis_variants : Number of variants in the cis region (which were used to calculate gene_pval).

Command interface#

sos run APEX.ipynb -h
usage: sos run APEX.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  LMM_null
  cis
  trans

Global Workflow Options:
  --phenotype-list VAL (as path, required)
                        Path to the input molecular phenotype file, per chrom,
                        in bed.gz format.
  --covariate-file VAL (as path, required)
                        Covariate file
  --genotype-list VAL (as path, required)
                        Genotype file in VCF format, per chrom
  --grm-list . (as path)
                        GRM file in plain text format, per chrom, for leave one
                        chrom out analysis
  --[no-]LMM (default to False)
                        Use LMM or not
  --[no-]rankNormal (default to False)
                        Whether or not to apply rank normalization
  --cwd output (as path)
                        Path to the work directory of the analysis.
  --container ''
                        Container option for software to run the analysis:
                        docker or singularity
  --entrypoint {('micromamba run -a "" -n' + ' ' + container.split('/')[-1][:-4]) if container.endswith('.sif') else f''}

  --name  f"{phenotype_list:bn}_{covariate_file:bn}"

                        Prefix for the analysis output
  --window 1000000 (as int)
                        Specify the scanning window for the up and downstream
                        radius to analyze around the region of interest, in
                        units of bp
  --numThreads 8 (as int)
                        Number of threads
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
  --mem 80G

Sections
  LMM_null:
  cis_1:
  trans_1:
    Workflow Options:
      --gene-list . (as path)
  *s_2:
  *s_3:

Minimal Working Example#

An MWE is uploaded to google drive. The singularity image (sif) for running this MWE is uploaded to google drive

sos run pipeline/APEX.ipynb cis \
    --genotype_list vcf_list.txt \
    --phenotype_list MWE.bed.recipe \
    --grm_list grm_list.txt \
    --covariate_file ALL.covariate.pca.BiCV.cov.gz \
    --cwd . \
    --container containers/apex.sif

Global parameter settings#

The section outlined the parameters that can be set in the command interface.

[global]
# Path to the input molecular phenotype file, per chrom, in bed.gz format.
parameter: phenotype_list = path
# Covariate file
parameter: covariate_file = path
# Genotype file in VCF format, per chrom
parameter: genotype_list = path
# GRM file in plain text format, per chrom, for leave one chrom out analysis
parameter: grm_list = path()
# Use LMM or not
parameter: LMM = False
# Whether or not to apply rank normalization
parameter: rankNormal = False
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# Prefix for the analysis output
parameter: name = f"{phenotype_list:bn}_{covariate_file:bn}"
# Specify the scanning window for the up and downstream radius to analyze around the region of interest, in units of bp
parameter: window = 1000000
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '5h'
parameter: mem = '80G'

import pandas as pd
molecular_pheno_chr_inv = pd.read_csv(phenotype_list,sep = "\t")
geno_chr_inv = pd.read_csv(genotype_list,sep = "\t")
input_inv = molecular_pheno_chr_inv.merge(geno_chr_inv, on = "#id")
if LMM:
    grm_chr_inv =  pd.read_csv(grm_list,sep = "\t")
    input_inv = input_inv.merge(grm_chr_inv,on = "#id")
input_inv = input_inv.values.tolist()
chr_inv = [x[0] for x in input_inv]
file_inv = [x[1:] for x in input_inv ]

LMM Regression#

This step are done to precompute and store:

  1. LMM null models and trait residuals

  2. spline terms for LMM genotypic variances to speed up downstream analysis

[LMM_null]
input: file_inv, group_by = len(file_inv[0]) ,group_with = "chr_inv"
output: f'{cwd:a}/{name}.theta.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint = entrypoint
    apex lmm $["--rankNormal" if rankNormal else ""] --vcf $[_input[1]] \
    --bed $[_input[0]] \
    --cov $[covariate_file] \
    --out $[_output[0]:nn] \
    --grm $[_input[2]] \
    --threads $[numThreads]  \
    --get-resid \
    --get-gvar \
    --get-theta

QTL asscoiation testing#

This step generate the cis-QTL and trans-QTL summary statistics and for downstream analysis from summary statistics. The analysis is done per chromosome to reduce running time.

[cis_1]
if LMM:
    sos_run("LMM_null")
input: file_inv,group_by = len(file_inv[0]), group_with = "chr_inv"
output:f'{cwd:a}/{name}.{_chr_inv}{".LMM" if LMM else ".OLS"}.cis_long_table.txt.gz',
       f'{cwd:a}/{name}.{_chr_inv}{".LMM" if LMM else ".OLS"}.cis_gene_table.txt.gz',
       f'{cwd:a}/{name}.{_chr_inv}{".LMM" if LMM else ".OLS"}.cis_sumstats.txt.gz'
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
    apex cis $["--rankNormal" if rankNormal else ""] --vcf $[_input[1]] \
    --bed $[_input[0]] \
    --cov $[covariate_file] \
    --out $[_output[0]:nnn] \
    --long $[f'--theta {cwd:a}/{name}.theta.gz --grm {_input[2]}' if LMM else ""] \
    --window $[window] \
    --threads $[numThreads]
[trans_1]
if LMM:
    sos_run("LMM_null")
parameter: gene_list = path()
if gene_list.is_file():
    genes = ",".join([x.strip() for x in open(gene_list).readlines()])
input: file_inv,group_by = len(file_inv[0]), group_with = "chr_inv"
output: f'{cwd:a}/{name}.{_chr_inv}{".LMM" if LMM else ".OLS"}.trans_long_table.txt.gz',
        f'{cwd:a}/{name}.{_chr_inv}{".LMM" if LMM else ".OLS"}.trans_gene_table.txt.gz'
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
    apex trans $["--rankNormal" if rankNormal else ""] --vcf $[_input[1]] \
    --bed $[_input[0]] \
    --cov $[covariate_file] \
    --out $[_output[0]:nnn] \
    --gene "$[genes]" \
    --long $[f'--theta {cwd:a}/{name}.theta.gz --grm {_input[2]}' if LMM else ""] \
    --threads $[numThreads]
[*s_2]
output: f'{_input[0]:nn}.reformated.txt'
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container, entrypoint = entrypoint
    library("dplyr")
    library("tibble")
    library("readr")
    library("purrr")
    data <- read_delim("$[_input[0]]",delim = "\t")
    data = data%>%mutate(variant_id = pmap_chr(list(a = `#chrom`,b = pos,c = ref, d = alt),function(a,b,c,d) paste(c(a,":",b,"_",c,"_",d),collapse = "")))%>%rename(chrom = `#chrom`)
    data %>%write_delim("$[_output]",delim = "\t")
[*s_3]
input: group_by = "all"
output: f'{_input:nn}.apex_meta_info', f'{_input:nn}.apex_column_info'
python: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container, entrypoint = entrypoint
    import pandas as pd 
    data_tempt = pd.DataFrame({
    "#chr" : [int(x.split(".")[-5].replace("chr","")) for x in  [$[_input:br,]]],
    "sumstat_dir" : [$[_input:r,]],\
    "column_info" : $[_output[1]:r]
    })
    column_info_df = pd.DataFrame( pd.Series( {"ID": "GENE,CHR,POS,A0,A1",
          "CHR": "chrom",
          "POS": "pos",
          "A0": "ref",
          "A1": "alt",
          "SNP": "variant_id",
          "STAT": "beta",
          "SE": "se",
          "P": "pval",
          "GENE": "gene"}), columns = ["APEX"] )
    data_tempt.to_csv("$[_output[0]]",index = False,sep = "\t" )
    column_info_df.to_csv("$[_output[1]]",index = True,sep = "\t" )