MASH analysis pipeline with data-driven prior matrices#

In this notebook, we use the MASH prior (the mixture_prior) generated in a previous step to perform a multivariate analysis under the MASH model. After fitting the model, we compute posteriors for the variables of interest.

Methods#

Multivariate adaptive shrinkage (MASH) analysis of eQTL data#

Since we published Urbut 2019, we have improved the implementation of the MASH algorithm and released a new R package, mashr. Major improvements over Urbut 2019 are:

  1. Faster computation of likelihood and posterior quantities via matrix algebra tricks and a C++ implementation.

  2. Faster estimation of the MASH mixture proportions via convex optimization.

  3. New ways to estimate the prior covariances in place of the SFA approach; see mixture_prior.ipynb for details.

  4. Improved estimation of the residual variance \(\hat{V}\).

At this point, the input data have already been converted from the original association summary statistics to a format convenient for analysis in MASH.
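Concretely, the converted input is an R list of paired effect-size and standard-error matrices. The following is a hypothetical sketch in plain Python with invented toy values; the field names random.b/random.s/strong.b/strong.s are inferred from the readRDS() calls in the workflow steps later in this notebook:

```python
# Hypothetical sketch (toy values) of the object stored in *.mashr_input.rds.
# The real file is an R list written by saveRDS(); field names below are
# inferred from the readRDS() calls in this notebook's workflow steps.
conditions = ["Ast", "End", "Exc"]  # toy condition (cell type) names
mash_input = {
    # effect estimates (Bhat) and standard errors (Shat) for a random set
    # of gene-SNP pairs, used to fit the mixture weights
    "random.b": [[0.10, -0.20,  0.05],
                 [0.30,  0.02, -0.15]],
    "random.s": [[0.05,  0.06,  0.04],
                 [0.07,  0.05,  0.06]],
    # the "strong" signals, for which posteriors are computed later
    "strong.b": [[0.80,  0.75,  0.90]],
    "strong.s": [[0.05,  0.04,  0.06]],
}
# Each Bhat matrix must align element-wise with its Shat matrix
for key in ("random", "strong"):
    b, s = mash_input[key + ".b"], mash_input[key + ".s"]
    assert len(b) == len(s)
    assert all(len(rb) == len(rs) == len(conditions) for rb, rs in zip(b, s))
```

The essential invariant is that every Bhat entry has a matching Shat entry, one column per condition.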

MWE Data#

Available on synapse.org.

Multivariate analysis with MASH model#

Using the MWE data with the prior matrices generated previously.

sos run pipeline/mash_fit.ipynb mash \
    --output-prefix protocol_example \
    --data output/protocol_example.mashr_input.rds \
    --vhat-data output/mashr/protocol_example.ed_mixture.EZ.V_simple.rds \
    --prior-data output/mashr/protocol_example.ed_mixture.EZ.prior.rds \
    --compute-posterior \
    --cwd output/mashr \
    --container oras://ghcr.io/cumc/stephenslab_apptainer:latest
INFO: Running mash_1: Fit MASH mixture model (time estimate: <15min for 70K by 49 matrix)
INFO: mash_1 (index=0) is ignored due to saved signature
INFO: mash_1 output:   /home/gw/Documents/xQTL/output/mashr/protocol_example.EZ.mash_model.rds
INFO: Running mash_2: Compute posterior for the "strong" set of data as in Urbut et al 2017. This is optional because most of the time we want to apply the MASH model learned on much larger data-set.
INFO: mash_2 is completed.
INFO: mash_2 output:   /home/gw/Documents/xQTL/output/mashr/protocol_example.EZ.posterior.rds
INFO: Workflow mash (ID=wdb1a3918af0fd3c3) is executed successfully with 1 completed step and 1 ignored step.
readRDS('output/mashr/protocol_example.EZ.posterior.rds')$lfsr
A matrix: 17 × 8 of type dbl
         ALL          Ast        End          Exc         Inh          Mic          OPC          Oli
1.000725e-06 1.206902e-01 0.38024501 2.341416e-05 0.021238026 2.438671e-02 2.822754e-01 4.950394e-02
6.195229e-01 6.725207e-01 0.65046008 6.835928e-01 0.582288619 6.287649e-01 6.903822e-01 4.929722e-01
5.502451e-08 6.952073e-05 0.42849153 3.848081e-09 0.001349139 2.974207e-01 2.267577e-01 2.193946e-01
1.141256e-02 3.673827e-01 0.29984283 8.920542e-03 0.062520281 1.557580e-01 1.438823e-01 3.652727e-01
4.803276e-01 1.793390e-01 0.01916176 1.752551e-01 0.384661176 3.138391e-01 4.888775e-01 3.793039e-01
3.240321e-01 1.006760e-04 0.28423516 2.692014e-10 0.464610846 8.359305e-04 8.654122e-21 1.358764e-09
4.744668e-01 3.330669e-16 0.35728521 2.052281e-05 0.437127648 3.887724e-02 1.843843e-05 1.374751e-06
3.166308e-03 0.000000e+00 0.34573821 2.111164e-04 0.424465824 1.686171e-02 9.992007e-16 2.119553e-07
4.826046e-04 3.121070e-02 0.31763916 3.181940e-01 0.000107790 6.340894e-19 2.832298e-02 3.138958e-02
2.426264e-01 3.950080e-01 0.16192905 1.023271e-01 0.284297524 3.167406e-04 1.449052e-02 4.348748e-01
3.999852e-09 8.353939e-02 0.47399869 2.161030e-06 0.017915140 1.450814e-01 2.240465e-01 4.253355e-01
8.830382e-06 5.623546e-03 0.23046666 7.952619e-08 0.465964274 9.821045e-02 6.871004e-02 2.039190e-01
4.616982e-05 2.434419e-02 0.32092168 2.148510e-03 0.001318834 4.224729e-04 1.671158e-01 2.667831e-01
1.929503e-01 2.808606e-01 0.00177875 2.362496e-02 0.327244770 3.820119e-01 3.156853e-01 1.851656e-01
2.744579e-02 5.054301e-02 0.15913565 1.581546e-04 0.086397136 3.507312e-01 9.995466e-02 1.706981e-01
2.349326e-01 2.796353e-01 0.24466478 2.804030e-01 0.251516032 3.344815e-01 1.402243e-02 3.982418e-01
1.711193e-09 6.569349e-03 0.03457580 4.631637e-01 0.422135645 3.970892e-01 2.515601e-01 5.054854e-14
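A common downstream step is to threshold the lfsr matrix to call per-condition significant effects. A minimal standalone illustration with invented toy values (not the actual output above):

```python
# Toy illustration: call significant condition-specific effects by
# thresholding lfsr, analogous to filtering the posterior matrix shown above.
# Values are invented; rows are gene-SNP pairs, columns are conditions.
lfsr = [
    [1.0e-06, 0.12, 0.38],
    [0.62, 0.67, 0.65],
    [5.5e-08, 7.0e-05, 0.43],
]
threshold = 0.05
# Collect (row, column) indices where the effect passes the lfsr cutoff
significant = [
    (i, j)
    for i, row in enumerate(lfsr)
    for j, value in enumerate(row)
    if value < threshold
]
print(significant)  # → [(0, 0), (2, 0), (2, 1)]
```

Here the second pair shows no significant effect in any condition, while the first and third pass the cutoff in one and two conditions respectively.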

Global parameter settings#

[global]
parameter: cwd = path('./mashr_workflow_output')
# Input summary statistics data
parameter: data = path
parameter: prior_data = path
parameter: vhat_data = path
# Prefix of output files. If not specified, it will derive it from data.
# If it is specified, for example, `--output-prefix AnalysisResults`
# It will save output files as `{cwd}/AnalysisResults*`.
parameter: output_prefix = ''
parameter: output_suffix = 'all'
# Exchangeable effects (EE) or exchangeable z-scores (EZ)
parameter: effect_model = 'EZ'
parameter: container = ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
data = data.absolute()
cwd = cwd.absolute()
prior_data = prior_data.absolute()
vhat_data = vhat_data.absolute()
if len(output_prefix) == 0:
    output_prefix = f"{data:bn}"

mash_model = file_target(f"{cwd:a}/{output_prefix}.{effect_model}.mash_model.rds")

def sort_uniq(seq):
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]
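Two pieces of the [global] section above are easy to sanity-check on their own: the entrypoint expression, which strips the container tag suffix to recover a micromamba environment name, and sort_uniq, which deduplicates while preserving first-occurrence order. A standalone sketch:

```python
import re

# Derive the micromamba environment name from a container URI, mirroring
# the `entrypoint` expression in the [global] section above
container = "oras://ghcr.io/cumc/stephenslab_apptainer:latest"
env = re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '',
             container.split('/')[-1])
print(env)  # → stephenslab

def sort_uniq(seq):
    # Deduplicate while keeping first-occurrence order
    # (unlike set(), which does not preserve order)
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]

print(sort_uniq(["Ast", "Exc", "Ast", "Mic", "Exc"]))  # → ['Ast', 'Exc', 'Mic']
```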

mashr mixture model fitting#

Our main references are the mashr vignettes, in particular the mashr eQTL outline and the vignette on using a FLASH prior.

The output of this workflow can be found under the ./mashr_workflow_output folder (configurable via --cwd). File names follow the pattern *.mash_model.rds. They can be used to compute posteriors for an input list of gene-SNP pairs (see next section).

# Fit MASH mixture model (time estimate: <15min for 70K by 49 matrix)
[mash_1]
parameter: output_level = 4
input: data, vhat_data, prior_data
output: mash_model

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container, entrypoint=entrypoint
    library(mashr)
    # Load input summary statistics, estimated residual correlation V-hat, and prior covariance matrices
    dat = readRDS(${_input[0]:r})
    vhat = readRDS(${_input[1]:r})
    U = readRDS(${_input[2]:r})$U
    # alpha = 1 gives the EZ model, alpha = 0 the EE model
    mash_data = mash_set_data(dat$random.b, Shat=dat$random.s, alpha=${1 if effect_model == 'EZ' else 0}, V=vhat, zero_Bhat_Shat_reset = 1E3)
    # Fit the mixture model on the "random" set and save it along with the input file provenance
    saveRDS(list(mash_model = mash(mash_data, Ulist = U, outputlevel = ${output_level}), vhat_file=${_input[1]:r}, prior_file=${_input[2]:r}), ${_output:r})
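The alpha argument in mash_set_data selects between the EE (alpha = 0) and EZ (alpha = 1) models; mashr analyzes effects on the Bhat / Shat^alpha scale with standard errors Shat^(1 - alpha). A toy standalone illustration of this rescaling (plain Python, invented numbers):

```python
# Toy illustration of the EE vs EZ rescaling controlled by `alpha` in
# mash_set_data: effects are analyzed as Bhat / Shat^alpha with standard
# errors Shat^(1 - alpha). alpha = 0 (EE) leaves the data unchanged;
# alpha = 1 (EZ) converts effects to z-scores with unit standard errors.
def rescale(bhat, shat, alpha):
    b = [bi / si**alpha for bi, si in zip(bhat, shat)]
    s = [si**(1 - alpha) for si in shat]
    return b, s

bhat, shat = [0.2, -0.1], [0.05, 0.02]
print(rescale(bhat, shat, 0))  # EE: ([0.2, -0.1], [0.05, 0.02])
print(rescale(bhat, shat, 1))  # EZ: ([4.0, -5.0], [1.0, 1.0])
```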

Optional posterior computations#

Additionally, compute posteriors for the “strong” set in the MASH input data.

# Compute posterior for the "strong" set of data as in Urbut et al 2017.
# This is optional because most of the time we want to apply the 
# MASH model learned on much larger data-set.
[mash_2]
# default to True; use --no-compute-posterior to disable this
parameter: compute_posterior = True
# input Vhat file for the batch of posterior data
skip_if(not compute_posterior)

input: data, vhat_data, mash_model
output: f"{cwd:a}/{output_prefix}.{effect_model}.posterior.rds"

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container, entrypoint=entrypoint
    library(mashr)
    # Load the "strong" set and the residual correlation V-hat used during model fitting
    dat = readRDS(${_input[0]:r})
    vhat = readRDS(${_input[1]:r})
    mash_data = mash_set_data(dat$strong.b, Shat=dat$strong.s, alpha=${1 if effect_model == 'EZ' else 0}, V=vhat, zero_Bhat_Shat_reset = 1E3)
    # Reuse the mixture model fitted in mash_1 to compute posterior summaries
    mash_model = readRDS(${_input[2]:ar})$mash_model
    saveRDS(mash_compute_posterior_matrices(mash_model, mash_data), ${_output:r})