Univariate Fine-Mapping and TWAS with SuSiE

Univariate Fine-Mapping and TWAS with SuSiE#

Description#

Our pipeline is capable of performing univariate fine-mapping with SuSiE with TWAS weights. The TWAS portion makes use of TWAS weights, linkage disequilibrium data and GWAS summary statistics. Preset variants used are taken from the linkage disequilibrium data and used only for TWAS. TWAS cross validation tell us which of the four methods (enet, lasso, mrash, SuSiE) are best to use. By default, we limit to under 5000 variants for cross validation. In cross validation, the data is split into five parts. Training is done on four parts, and prediction is done on the fifth. Linear regression is used to assess the results and get r squared and pvalues.

Fine mapping with SuSiE follows the formulat y=xb+e where x has many highly correlated variables due to linkage disequilibrium. Therefore, true effects (b), are very sparse. The SuSiE wrapper looks for five independent signals in each region to increase convergence speed. However, if five signals are found, then the the upper limit is increased. SuSiE does not allow for the inclusion of covariates. Therefore, covariates are regressed in.

Input:#

--genoFile: path to a plink bed file containing genotypes.
--phenoFile: at least one tab delimited file containing chr, start, end, ID and path for the regions. For example:

#chr    start   end     ID      path
chr12   389272  389273  ENSG00000120647 $PATH/Ast.log2cpm.bed.gz
chr12   389319  389320  ENSG00000073614 $PATH/Ast.log2cpm.bed.gz

--covFile: at least one tab delimited file containing covariates in the rows and samples in the columns.
--customized-association-windows: a tab delimited file containing chr, start, end, and gene_id. For example:

#chr    start   end     gene_id
chr1    0       6480000 ENSG00000008128
chr1    0       6480000 ENSG00000008130

----phenotype-names: the names of the phenotypes if multiple are included. There should be one for each phenotype file you include.
--max-cv-variants: maximum number of variants for cross-validation.
--ld_reference_meta_file: path to file containing chrom, start, end and path for linkage disequilibrium region information. For example:

#chrom  start   end     path
chr1    101384274       104443097       $PATH/chr1_101384274_104443097.cor.xz.bim
chr1    104443097       106225286       $PATH/chr1_104443097_106225286.cor.xz.bim

--region-name: if you only wish to analyze one region, then include the gene_id of a region found in the customized-association-windows file

Minimal Working Example Steps#

i. Run the Fine-Mapping and TWAS with SuSiE#

sos run $PATH/protocol/pipeline/mnm_regression.ipynb susie_twas \
    --name ROSMAP_mega_eQTL \
    --genoFile $PATH/genofile/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11.bed \
    --phenoFile $PATH/phenofile/Mic/analysis_ready/phenotype_preprocessing/snuc_pseudo_bulk.Mic.mega.normalized.log2cpm.region_list.txt \
                $PATH/phenofile/Ast/analysis_ready/phenotype_preprocessing/snuc_pseudo_bulk.Ast.mega.normalized.log2cpm.region_list.txt \
                $PATH/phenofile/Oli/analysis_ready/phenotype_preprocessing/snuc_pseudo_bulk.Oli.mega.normalized.log2cpm.region_list.txt \
                $PATH/phenofile/OPC/analysis_ready/phenotype_preprocessing/snuc_pseudo_bulk.OPC.mega.normalized.log2cpm.region_list.txt \
                $PATH/phenofile/Exc/analysis_ready/phenotype_preprocessing/snuc_pseudo_bulk.Exc.mega.normalized.log2cpm.region_list.txt \
                $PATH/phenofile/Inh/analysis_ready/phenotype_preprocessing/snuc_pseudo_bulk.Inh.mega.normalized.log2cpm.region_list.txt \
    --covFile $PATH/phenofile/Mic/analysis_ready/covariate_preprocessing/snuc_pseudo_bulk.Mic.mega.normalized.log2cpm.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk_mega.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
              $PATH/phenofile/Ast/analysis_ready/covariate_preprocessing/snuc_pseudo_bulk.Ast.mega.normalized.log2cpm.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk_mega.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
              $PATH/phenofile/Oli/analysis_ready/covariate_preprocessing/snuc_pseudo_bulk.Oli.mega.normalized.log2cpm.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk_mega.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
              $PATH/phenofile/OPC/analysis_ready/covariate_preprocessing/snuc_pseudo_bulk.OPC.mega.normalized.log2cpm.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk_mega.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
              $PATH/phenofile/Exc/analysis_ready/covariate_preprocessing/snuc_pseudo_bulk.Exc.mega.normalized.log2cpm.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk_mega.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
              $PATH/phenofile/Inh/analysis_ready/covariate_preprocessing/snuc_pseudo_bulk.Inh.mega.normalized.log2cpm.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk_mega.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
    --customized-association-windows $PATH/windows/TADB_enhanced_cis.coding.bed \
    --phenotype-names Mic_mega_eQTL Ast_mega_eQTL Oli_mega_eQTL OPC_mega_eQTL Exc_mega_eQTL Inh_mega_eQTL \
    --max-cv-variants 5000 --ld_reference_meta_file $PATH/ldref/ld_meta_file.tsv \
    --region-name ENSG00000073921 \
    --save-data \
    --cwd $PATH/output/ -s build

Anticipated Results#

Univariate finemapping will produce a file containing results for the top hits and a file containing twas weights produced by susie.

ROSMAP_mega_eQTL.chr11_ENSG00000073921.univariate_bvrs.rds:

For each gene of interest and phenotype, this file contains:
1. susie_fitted
2. variant_names
3. analysis_script
4. other quantities - information on dropped samples, if any
5. summary statistics (beta and se) for each variant
6. sample names
7. summary statistics for top loci
8. susie results trimmed - includes pip, sets, cs_corr, sets_secondary, alpha, lbf_variable, V, niter, X_column_scale_factors, mu, mu2
9. total time elapsed
10. region info
11. preset_variants_results

ROSMAP_mega_eQTL.chr11_ENSG00000073921.univariate_data.rds (from the –save-data argument):

For each gene of interest, contains residuals for each sample and phenotype
see pecotmr code for description at load_regional_association_data function

ROSMAP_mega_eQTL.chr11_ENSG00000073921.univariate_twas_weights.rds:

For each gene of interest and phenotype, this file contains:
1. twas_weights - weights for enet, lasso, bayes r, mrash and susie methods
2. twas_predictions - twas predictions for enet, lasso, bayes r, mrash and susie methods
3. twas_cv_result
4. total_time_elapsed
5. variant_names
6. region_info