High-dimensional regression with summary statistics
This notebook takes a list of LD reference files and a list of summary statistics files from various association studies, and performs:
Fine-mapping with the SuSiE RSS model
TWAS / PRS weights …
Input
I. GWAS Summary Statistics Files
Input: vector of files for one or more GWAS studies.
Format:
Tab-delimited files.
First 4 columns: chrom, pos, A1, A2.
Additional columns can be loaded using a column mapping file (see below); a quick sanity-check sketch follows.
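As an illustration only (not part of the pipeline; the file name data1.txt is hypothetical), you can verify that a summary statistics file starts with the four required columns:

library(data.table)
ss <- fread("data1.txt", sep = "\t")
# The default layout requires these four columns first; anything beyond them
# is picked up through the column mapping file described next.
stopifnot(identical(names(ss)[1:4], c("chrom", "pos", "A1", "A2")))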
Column mapping files (optional):
Optional YAML file for custom column mapping.
Required columns: chrom, pos, A1, A2, either z or (beta and se), n_sample, n_case, n_control. Note: you only need to provide n_sample or (n_case & n_control, together, for a case-control study); fill whichever you do not provide with 0. If none of these are known, fill n_sample, n_case, and n_control with 0.
Optional columns: var_y.
II. GWAS Summary Statistics Meta-File: this is optional and helpful when there are many GWAS summary statistics files to process with the same command. All studies will be run at the same time.
The meta-file should contain 7 columns:
study_id: can be any string / number.
chrom: the chromosome that the summary statistics cover. Chromosome number 0 indicates a genome-wide summary statistics file.
file_path: the path to the summary statistics file. The summary statistics file can be TSV or CSV.
column_mapping_file: serves to unify the column names in file_path by mapping them to standard column names.
n_sample: (effective) sample size of the study. If not known, fill with 0.
n_case: (effective) number of cases in the study, if it is a case-control study. If not known, fill with 0.
n_control: number of controls in the study, if it is a case-control study. If not known, fill with 0.
Note: we only need n_sample or (n_case & n_control). If it is a case-control study but n_case and n_control are not known, using n_sample also works (see the sketch below).
e.g. gwas_meta.tsv

study_id  chrom  file_path  column_mapping_file  n_sample  n_case  n_control
AD_1      0      data1.txt  mapping1.yml         0         111326  677663
AD_2      0      data2.txt  mapping2.yml         123455    0      0
AD_3      2      data3.txt  mapping3.yml         0         21982   41944
If both summary statistics files (I) and a meta-data file (II) are specified, we take the union of the two.
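For orientation, here is a small sketch (not part of the pipeline) that reads the meta-file above and derives an effective sample size for case-control rows. The 4 / (1/n_case + 1/n_control) formula is a common convention for case-control GWAS; whether the pipeline applies exactly this formula internally is an assumption here.

library(data.table)
meta <- fread("gwas_meta.tsv")
# Prefer n_sample when provided; otherwise fall back to the case-control
# effective sample size. Rows where all three counts are 0 come out as 0 ("unknown").
meta[, n_eff := ifelse(n_sample > 0, n_sample, 4 / (1 / n_case + 1 / n_control))]
meta[, .(study_id, n_eff)]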
e.g. column_mapping.yml
Left: standard name (keep these as-is). Right: original column name in your summary statistics file. (Do not add spaces before or after the ":".) A parsing sketch follows the example.
chrom:chromosome
pos:base_pair_location
A1:effect_allele
A2:other_allele
beta:beta
se:standard_error
pvalue:p_value
maf:maf
z:z
n_case:n_cases
n_control:n_controls
n_sample:n
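Because each line is standard:original with no space after the colon, a generic YAML parser will not read this as a key-value mapping. The sketch below (an illustration, not the pipeline's internal reader) splits each line on the first ":" instead, then renames the original columns to the standard names:

# Parse a "standard:original" mapping file into a named character vector.
read_column_mapping <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]                         # drop blank lines
  kv <- regmatches(lines, regexpr(":", lines), invert = TRUE)   # split at first ':'
  setNames(vapply(kv, `[`, character(1), 2),                    # values: original names
           vapply(kv, `[`, character(1), 1))                    # names: standard names
}
mapping <- read_column_mapping("column_mapping.yml")
ss <- data.table::fread("data1.txt")   # hypothetical summary statistics file
# Rename original columns (values) to standard names (names); entries in the
# mapping that are absent from the file are skipped.
data.table::setnames(ss, old = unname(mapping), new = names(mapping), skip_absent = TRUE)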
III. LD Reference Metadata File
Format: single TSV file.
Contents:
Columns: #chrom, start, end, path to the LD matrix, genomic build.
LD matrix path format: comma-separated; the first entry is the LD matrix, the second is the bim file (see the sketch below).
Documentation: refer to our LD reference preparation document for detailed information (Tosin pending update).
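For illustration, the metadata can be read and the comma-separated path column split into its two parts like so (the column name "path" is an assumption, consistent with the genotype branch of the pipeline code below):

ld_meta <- data.table::fread("ADSP_R4_EUR.LD.list", sep = "\t")
# Each row covers one LD block; its path field holds "matrix_file,bim_file".
entry <- strsplit(ld_meta$path[1], ",", fixed = TRUE)[[1]]
ld_matrix_file <- entry[1]
bim_file <- entry[2]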
IV. For analyzing specific genomic regions, you can specify them using the --region-names option in the 'chr:start-end' format, where multiple regions are accepted. Alternatively, you may provide a file containing a list of regions through the --region-list option, also adhering to the 'chr:start-end' format. When both --region-names and --region-list are provided, the union of the two is analyzed (see the sketch below). When neither option is specified, the analysis defaults to all regions specified in the LD reference metadata.
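The sketch below illustrates the 'chr:start-end' convention and the union behavior (illustrative only; the pipeline's own parser is the parse_region function in the get_analysis_regions step further down):

parse_region <- function(region) {
  # "1:1000-2000" (with or without a "chr" prefix) -> chrom/start/end
  chrom <- sub(":.*$", "", region)
  bounds <- strsplit(sub("^.*:", "", region), "-", fixed = TRUE)[[1]]
  list(chrom = as.integer(sub("^chr", "", chrom)),
       start = as.integer(bounds[1]),
       end = as.integer(bounds[2]))
}
region_names <- c("1:1000-2000")             # as passed to --region-names
region_list <- readLines("region_list.txt")  # hypothetical --region-list file
regions <- lapply(union(region_names, region_list), parse_region)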
Output
Each RDS file is the fine-mapping result of one LD block, containing 10 elements:
single_effect_regression: assumes only one causal variant; simple regression.
noqc: SuSiE fine-mapping; the summary statistics are only allele-QCed.
qc_impute: SuSiE fine-mapping; the summary statistics are QCed (suspicious z-score outliers removed) and imputed (all outliers, plus variants missing relative to the reference panel).
qc_only: SuSiE fine-mapping; the summary statistics are only QCed.
conditional_regression_noqc: Bayesian conditional regression, original summary statistics.
conditional_regression_qc_only: Bayesian conditional regression, summary statistics QCed.
conditional_regression_qc_impute: Bayesian conditional regression, summary statistics QCed and imputed.
sumstats_allele_qc_only: a table of summary statistics after allele QC (each variant is checked for flipped ref/alt alleles, and corrected).
sumstats_qc_impute: a table of summary statistics after QC and imputation.
sumstats_qc_impute: a table of summary statistics after QC and imputation, with poorly imputed variants removed by empirical thresholds (the actual input of all qc_impute results).
Each RDS file is accompanied by 2 TSV files reporting elements 8 and 9 in TSV format for convenience. A sketch for inspecting a result file follows.
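To get oriented with the output structure, a result file can be inspected as below (the file name is hypothetical; the top level is a list keyed by the LD-block region, each holding one entry per study, matching the saveRDS call at the end of the univariate_rss step):

res <- readRDS("noQC.chr1_1000_2000.univariate_susie_rss.rds")  # hypothetical file name
region <- names(res)[1]                  # e.g. "chr1_1000_2000"
names(res[[region]])                     # one entry per study_id
str(res[[region]][[1]], max.level = 1)   # the 10 elements described above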
Minimal working example
sos run xqtl-protocol/pipeline/rss_analysis.ipynb univariate_rss \
--ld-meta-data ADSP_R4_EUR.LD.list \
--gwas-meta-data AD_sumstat_list.txt \
--impute --qc-method dentist --finemapping-method susie_rss \
--container oras://ghcr.io/cumc/pecotmr_apptainer:latest
[global]
parameter: cwd = path("output/")
parameter: gwas_meta_data = path()
parameter: ld_meta_data = path()
parameter: gwas_name = []
parameter: gwas_data = []
parameter: column_mapping = []
parameter: region_list = path()
parameter: region_name = []
parameter: container = ''
parameter: skip_regions = []
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
parameter: job_size = 5
parameter: walltime = "10h"
parameter: mem = "16G"
parameter: numThreads = 1
def group_by_region(lst, partition):
    # The regions are already grouped upstream, so this is currently a
    # passthrough that returns the partition unchanged.
    return partition
import os
if (not os.path.isfile(region_list)) and len(region_name) == 0:
region_list = ld_meta_data
[get_analysis_regions: shared = "regional_data"]
import os
import pandas as pd
from collections import OrderedDict
def file_exists(file_path, relative_path=None):
"""Check if a file exists at the given path or relative to a specified path."""
if os.path.exists(file_path) and os.path.isfile(file_path):
return True
elif relative_path:
relative_file_path = os.path.join(relative_path, file_path)
return os.path.exists(relative_file_path) and os.path.isfile(relative_file_path)
return False
def check_required_columns(df, required_columns):
"""Check if the required columns are present in the dataframe."""
missing_columns = [col for col in required_columns if col not in df.columns]
if missing_columns:
raise ValueError(f"Missing required columns: {', '.join(missing_columns)}")
def parse_region(region):
"""Parse a region string in 'chr:start-end' format into a list [chr, start, end]."""
chrom, rest = region.split(':')
start, end = rest.split('-')
return [int(chrom), int(start), int(end)]
def extract_regional_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, region_name=None, region_list=None):
"""
Extracts data from GWAS metadata files and additional GWAS data provided.
Optionally filters data based on specified regions.
Args:
- gwas_meta_data (str): File path to the GWAS metadata file.
- gwas_name (list): Vector of GWAS study names.
- gwas_data (list): Vector of GWAS data.
- column_mapping (list, optional): Vector of column mapping files.
- region_name (list, optional): List of region names in 'chr:start-end' format.
- region_list (str, optional): File path to a file containing regions.
Returns:
- GWAS Dictionary: Maps study IDs to a list containing chromosome number,
GWAS file path, and optional column mapping file path.
- Region Dictionary: Maps region names to lists [chr, start, end].
Raises:
- FileNotFoundError: If any specified file path does not exist.
- ValueError: If required columns are missing in the input files or vector lengths mismatch.
"""
# Check vector lengths
if len(gwas_name) != len(gwas_data):
raise ValueError("gwas_name and gwas_data must be of equal length")
if len(column_mapping) > 0 and len(column_mapping) != len(gwas_name):
raise ValueError("If column_mapping is provided, it must be of the same length as gwas_name and gwas_data")
# Required columns for GWAS file type
required_gwas_columns = ['study_id', 'chrom', 'file_path']
# Base directory of the metadata files
gwas_base_dir = os.path.dirname(gwas_meta_data)
# Reading the GWAS metadata file
gwas_df = pd.read_csv(gwas_meta_data, sep="\t")
check_required_columns(gwas_df, required_gwas_columns)
gwas_dict = OrderedDict()
# Process additional GWAS data from vectors
for name, data, mapping in zip(gwas_name, gwas_data, column_mapping or [None]*len(gwas_name)):
gwas_dict[name] = {0: [data, mapping]}
for _, row in gwas_df.iterrows():
file_path = row['file_path']
mapping_file = row.get('column_mapping_file')
n_sample = row.get('n_sample')
n_case = row.get('n_case')
n_control = row.get('n_control')
        # Check that the summary statistics file and its optional mapping file exist
        if not file_exists(file_path, gwas_base_dir):
            raise FileNotFoundError(f"Summary statistics file {file_path} not found for {row['study_id']}")
        if mapping_file and not file_exists(mapping_file, gwas_base_dir):
            raise FileNotFoundError(f"Column mapping file {mapping_file} not found for {row['study_id']}")
# Adjust paths if necessary
file_path = file_path if file_exists(file_path) else os.path.join(gwas_base_dir, file_path)
if mapping_file:
mapping_file = mapping_file if file_exists(mapping_file) else os.path.join(gwas_base_dir, mapping_file)
# Create or update the entry for the study_id
if row['study_id'] not in gwas_dict:
gwas_dict[row['study_id']] = {}
# Expand chrom 0 to chrom 1-22 or use the specified chrom
chrom_range = range(1, 23) if row['chrom'] == 0 else [row['chrom']]
for chrom in chrom_range:
if chrom in gwas_dict[row['study_id']]:
existing_entry = gwas_dict[row['study_id']][chrom]
raise ValueError(f"Duplicate chromosome specification for study_id {row['study_id']}, chrom {chrom}. "
f"Conflicting entries: {existing_entry} and {[file_path, mapping_file]}")
gwas_dict[row['study_id']][chrom] = [file_path, mapping_file, n_sample, n_case, n_control]
# Process region_list and region_name
region_dict = dict()
if region_list and os.path.isfile(region_list):
with open(region_list, 'r') as file:
for line in file:
# Skip empty lines
if not line.strip():
continue
if line.startswith("#"):
continue
parts = line.strip().split()
if len(parts) == 1:
region = parse_region(parts[0])
elif len(parts) == 3:
region = [int(parts[0].replace("chr", "")), int(parts[1]), int(parts[2])]
                elif len(parts) >= 4 and region_list != ld_meta_data:  # eQTL-style rows, e.g. chr start end gene_id [gene_name path], unless the LD metadata file is being reused as the region list
                    region = [int(parts[0].replace("chr", "")), int(parts[1]), int(parts[2]), parts[3]]
else:
raise ValueError("Invalid region format in region_list")
region_dict[f"{region[0]}:{region[1]}_{region[2]}"] = region
if region_name:
for region in region_name:
parsed_region = parse_region(region)
region_key = f"{parsed_region[0]}:{parsed_region[1]}_{parsed_region[2]}"
if region_key not in region_dict:
region_dict[region_key] = parsed_region
return gwas_dict, region_dict
gwas_dict, region_dict = extract_regional_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, region_name, region_list)
regional_data = dict([("GWAS", gwas_dict), ("regions", region_dict)])
[univariate_rss]
parameter: L = 5
parameter: max_L = 10
# filtering thresholds for RAISS imputation
parameter: rcond = 0.01
parameter: lamb = 0.01
parameter: R2_threshold = 0.6
parameter: l_step = 5
parameter: pip_cutoff = 0.025
parameter: skip_analysis_pip_cutoff = 0.025
parameter: minimum_ld = 5
parameter: coverage = [0.95, 0.7, 0.5]
# Whether to impute the sumstat for all the snp in LD but not in sumstat.
parameter: impute = False
# summary stats QC methods: rss_qc, dentist, slalom
parameter: qc_method = ""
# analysis method: single_effect, susie_rss, bayesian_conditional_regression
parameter: finemapping_method = "single_effect"
parameter: target = ""
parameter: target_column_index = "3"
# compute LD from genotype
parameter: compute_LD_from_genotype = False
# remove a variant if it has more than imiss missing individual level data
parameter: imiss = 1.0
# MAF cutoff
parameter: maf = 0.0025
depends: sos_variable("regional_data")
regions = list(regional_data['regions'].keys())
input: for_each = "regions"
output: f'{cwd:a}/{step_name}/{"%s" % qc_method.upper() if qc_method else "noQC"}{"_RAISS_imputed" if impute else ""}.chr{_regions.replace(":", "_")}.univariate{"_%s" % finemapping_method if finemapping_method else ""}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
library(pecotmr)
library(dplyr)
library(data.table)
skip_region = c(${', '.join(['"{}"'.format(item) for item in skip_regions])})
studies = c(${', '.join(['"{}"'.format(item) for item in regional_data["GWAS"].keys()])})
sumstat_paths = c(${paths([regional_data['GWAS'][x][regional_data['regions'][_regions][0]][0] for x in regional_data["GWAS"].keys()]):r,})
column_file_paths = c(${paths([regional_data['GWAS'][x][regional_data['regions'][_regions][0]][1] for x in regional_data["GWAS"].keys()]):r,})
n_samples = c(${paths([regional_data['GWAS'][x][regional_data['regions'][_regions][0]][2] for x in regional_data["GWAS"].keys()]):r,})
n_cases = c(${paths([regional_data['GWAS'][x][regional_data['regions'][_regions][0]][3] for x in regional_data["GWAS"].keys()]):r,})
n_controls = c(${paths([regional_data['GWAS'][x][regional_data['regions'][_regions][0]][4] for x in regional_data["GWAS"].keys()]):r,})
if("${target}" == ""){
target = "${regional_data['regions'][_regions][3] if len(regional_data['regions'][_regions]) > 3 else None}"
    if(target == "None" || "${target_column_index}" == ""){
target = ""
}
}else{target = "${target}"}
if("${target_column_index}" == ""){
target_column_index = ""
}else{target_column_index = as.numeric("${target_column_index}")}
if( ${"TRUE" if compute_LD_from_genotype else "FALSE"}){
geno_path = readr::read_delim("${ld_meta_data}","\t")%>% filter(`#chr` == "${regional_data['regions'][_regions][0]}",
start == "${regional_data['regions'][_regions][1]}",
end == "${regional_data['regions'][_regions][2]}")%>%pull(path)%>%stringr::str_replace(".bed","")
LD_data = pecotmr:::filter_X(
load_genotype_region(geno_path,
'${"chr%s:%s-%s" % (regional_data['regions'][_regions][0], regional_data['regions'][_regions][1], regional_data['regions'][_regions][2])}'),
missing_rate_thresh = ${imiss}, maf_thresh=${maf})%>%
cor
    ## When we compute LD from genotype ad hoc, there can be SNPs with unconventional Ref/Alt alleles such as 4:185875956:C:<INS:ME:ALU>, which would break the allele_qc function in the following steps.
    ## This additional filter keeps only SNPs whose IDs contain exactly three ":" characters, i.e. the conventional chrom:pos:ref:alt form that allele_qc expects.
    correct_variants <- rownames(LD_data)[sapply(rownames(LD_data), function(x) sum(strsplit(x, "", fixed = TRUE)[[1]] == ":") == 3)]
LD_data = LD_data[correct_variants,correct_variants]
LD_data = list(combined_LD_matrix = LD_data, combined_LD_variants = rownames(LD_data) )
} else {
LD_data = load_LD_matrix("${ld_meta_data}", '${"chr%s:%s-%s" % (regional_data['regions'][_regions][0], regional_data['regions'][_regions][1], regional_data['regions'][_regions][2])}')
}
res = setNames(replicate(length(studies), list(), simplify = FALSE), studies)
for (r in 1:length(res)) {
tryCatch({
res[[r]] = rss_analysis_pipeline(sumstat_path = sumstat_paths[r], column_file_path = column_file_paths[r],
LD_data = LD_data, target = target, target_column_index = target_column_index, n_sample = as.numeric(n_samples[r]), n_case = as.numeric(n_cases[r]),
n_control = as.numeric(n_controls[r]), skip_region = skip_region,
qc_method = ${"NULL" if len(qc_method) == 0 else "'%s'" % qc_method},
impute = ${"TRUE" if impute else "FALSE"},
impute_opts = list(rcond = ${rcond}, R2_threshold = ${R2_threshold}, minimum_ld = ${minimum_ld}, lamb=${lamb}),
finemapping_method = ${"NULL" if len(finemapping_method) == 0 else "'%s'" % finemapping_method},
finemapping_opts = list(init_L = ${L}, max_L = ${max_L}, l_step = ${l_step}, coverage = c(${",".join([str(x) for x in coverage])}), signal_cutoff = ${pip_cutoff}),
pip_cutoff_to_skip = ${skip_analysis_pip_cutoff}
)
fwrite(res[[r]]$rss_data_analyzed, file = paste0("${_output:nn}.", target, studies[r], ".sumstats.tsv.gz"), sep = "\t", col.names = TRUE, row.names = FALSE, quote = FALSE, compress = "gzip")
if(is.null(res[[r]][[1]])){
res[[r]] = list()
}
    }, error = function(e) {
      # Use <<- so the assignment reaches the res list in the enclosing
      # environment; a plain = inside the handler would only modify a local copy.
      res[[r]] <<- list()
      cat(paste("Error processing file ", studies[r], ": ", conditionMessage(e), "\n"))
    }
    )}
region = "chr${_regions.replace(":", "_")}"
full_result = list()
full_result[[region]] = res
saveRDS(full_result, file = "${_output}")
[univariate_plot]
output: pip_plot = f"{cwd}/{_input:bn}.png"
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '20G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: container=container, expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint = entrypoint
res = readRDS(${_input:r})
png(${_output[0]:r}, width = 14, height=6, unit='in', res=300)
par(mfrow=c(1,2))
susieR::susie_plot(res, y= "PIP", pos=list(attr='pos',start=res$pos[1],end=res$pos[length(res$pos)]), add_legend=T, xlab="position")
susieR::susie_plot(res, y= "z", pos=list(attr='pos',start=res$pos[1],end=res$pos[length(res$pos)]), add_legend=T, xlab="position", ylab="-log10(p)")
dev.off()