
High-dimensional regression with summary statistics#

This notebook takes a list of LD reference files and a list of summary statistics files from various association studies, and performs:

  1. Fine-mapping with the SuSiE RSS model

  2. TWAS / PRS weights …

Input#

I. GWAS Summary Statistics Files

  • Input: Vector of files for one or more GWAS studies.

  • Format:

    • Tab-delimited files.

    • First 4 columns: chrom, pos, A1, A2

    • Additional columns can be loaded using a column mapping file (see below); a hypothetical input file is shown after this list.

  • Column Mapping files (optional):

    • Optional YAML file for custom column mapping.

    • Required columns: chrom, pos, A1, A2, either z or (beta and se), plus n_sample, n_case, n_control. Note: provide either n_sample or (n_case & n_control together, for case-control studies), and fill the fields you do not provide with 0. If none of these are known, fill n_sample, n_case, and n_control all with 0.

    • Optional columns: var_y.
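For example, a hypothetical data1.txt might begin as follows (the column names here are the "original" names that mapping1.yml, shown further below, maps to the standard names; all values are illustrative):

chromosome	base_pair_location	effect_allele	other_allele	beta	standard_error	p_value	n
1	715265	T	C	0.0103	0.0045	0.0220	111326
1	717485	A	G	-0.0054	0.0043	0.2090	111326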

II. GWAS Summary Statistics Meta-File: this is optional and helpful when there are many GWAS summary statistics files to process with the same command. All studies are processed in the same run.

The meta file should contain 7 columns.

  • Columns:

  1. study_id: can be any string / number.

  2. chrom: the chromosome that summary statistics cover. Chromosome number 0 indicates a genome-wide summary statistics file.

  3. file_path: the path to the summary statistics file, which may be in TSV or CSV format.

  4. column_mapping_file: used to unify the column names in file_path by mapping them to standard column names.

  5. n_sample: (effective) sample size of the study. If not known, fill with 0.

  6. n_case: (effective) number of cases in the study, if it is a case-control study. If not known, fill with 0.

  7. n_control: number of controls in the study, if it is a case-control study. If not known, fill with 0.

Note: only n_sample or (n_case & n_control) is needed. If the study is a case-control study but n_case and n_control are not known, providing n_sample also works; a common effective-sample-size convention is sketched after the example below.

eg: gwas_meta.tsv

study_id	chrom	file_path	column_mapping_file	n_sample	n_case	n_control
AD_1	0	data1.txt	mapping1.yml	0	111326	677663
AD_2	0	data2.txt	mapping2.yml	123455	0	0
AD_3	2	data3.txt	mapping3.yml	0	21982	41944
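For case-control studies, the effective sample size is often defined as 4 / (1/n_case + 1/n_control); below is a minimal R sketch using the AD_1 row above (verify that this convention matches what your analysis expects):

# A common convention for the effective sample size of a case-control study
# (verify against the convention your analysis expects)
n_case <- 111326
n_control <- 677663
n_eff <- 4 / (1 / n_case + 1 / n_control)  # ~382,472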

If both a summary statistics file (I) and a metadata file (II) are specified, we take the union of the two.

e.g. column_mapping.yml. Left: standard name (keep these exactly as shown). Right: original column name in your summary statistics file. (Do not add spaces before or after ":".) A sketch of how the mapping is applied follows the example.

chrom:chromosome
pos:base_pair_location
A1:effect_allele
A2:other_allele
beta:beta
se:standard_error
pvalue:p_value
maf:maf
z:z
n_case:n_cases
n_control:n_controls
n_sample:n
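Below is a minimal R sketch of how such a mapping could be applied to rename columns, for illustration only; the pipeline performs this step internally. The file is parsed line by line because the no-spaces-around-":" format is not standard YAML. File names follow the examples above.

# Minimal sketch: rename summary statistics columns using a mapping file.
# Illustration only -- the pipeline applies the mapping internally.
library(data.table)
lines <- readLines("mapping1.yml")                 # e.g. "chrom:chromosome"
kv <- strsplit(lines, ":", fixed = TRUE)
mapping <- setNames(sapply(kv, `[`, 2),            # original column names ...
                    sapply(kv, `[`, 1))            # ... keyed by standard names
sumstat <- fread("data1.txt")
setnames(sumstat, old = unname(mapping), new = names(mapping), skip_absent = TRUE)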

III. LD Reference Metadata File

  • Format: Single TSV file.

  • Contents:

    • Columns: #chr, start, end, path to the LD matrix, genomic build.

    • LD matrix path format: comma-separated, first entry is the LD matrix, second is the bim file.

  • Documentation: Refer to our LD reference preparation document for detailed information. A hypothetical example is shown below.
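eg: a hypothetical ADSP_R4_EUR.LD.list (paths and file names depend on how your LD reference was prepared):

#chr	start	end	path	build
chr1	16103	2888443	chr1/chr1_16103_2888443.cor.xz,chr1/chr1_16103_2888443.cor.xz.bim	b38
chr1	2888443	4320284	chr1/chr1_2888443_4320284.cor.xz,chr1/chr1_2888443_4320284.cor.xz.bim	b38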

IV. To analyze specific genomic regions, specify them with the --region-name option in 'chr:start-end' format (multiple regions are accepted). Alternatively, provide a file listing regions, one per line, through the --region-list option. When both --region-name and --region-list are given, the union of the two is analyzed. When neither is specified, the analysis defaults to all regions in the LD reference metadata. A hypothetical region list is shown below.
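eg: region_list.txt (each line is either a single 'chr:start-end' token or three whitespace-delimited columns):

chr1	16103	2888443
2:10000-500000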

Output#

Each RDS file contains the fine-mapping results for one LD block, comprising 10 elements:

  1. single_effect_regression: single-effect regression, assuming only one causal variant.

  2. noqc: SuSiE fine-mapping; the summary statistics are only allele-QCed.

  3. qc_impute: SuSiE fine-mapping; the summary statistics are QCed (suspicious z-score outliers removed) and imputed (covering outliers and variants present in the reference panel but missing from the summary statistics).

  4. qc_only: SuSiE fine-mapping; the summary statistics are only QCed.

  5. conditional_regression_noqc: Bayesian conditional regression on the original summary statistics.

  6. conditional_regression_qc_only: Bayesian conditional regression on QCed summary statistics.

  7. conditional_regression_qc_impute: Bayesian conditional regression on QCed and imputed summary statistics.

  8. sumstats_allele_qc_only: a table of summary statistics after allele QC (each variant is checked for flipped ref/alt alleles and corrected).

  9. sumstats_qc_impute: a table of summary statistics after QC and imputation.

  10. sumstats_qc_impute: a table of summary statistics after QC and imputation, with poorly imputed variants removed by empirical thresholds (this is the actual input to all qc_impute results).

Each RDS file is accompanied by two TSV files reporting elements 8 and 9 for convenience.
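A minimal R sketch for inspecting one output file (the file name below follows the output naming pattern in the code and is hypothetical):

# Results are keyed by region, then by study; each study entry holds the
# elements listed above. The file name is hypothetical.
res <- readRDS("DENTIST_RAISS_imputed.chr1_16103_2888443.univariate_susie_rss.rds")
region <- names(res)[1]               # e.g. "chr1_16103_2888443"
study  <- names(res[[region]])[1]     # e.g. "AD_1"
names(res[[region]][[study]])         # e.g. "noqc", "qc_impute", "sumstats_qc_impute", ...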

Minimal working example#

sos run xqtl-protocol/pipeline/rss_analysis.ipynb univariate_rss \
    --ld-meta-data ADSP_R4_EUR.LD.list \
    --gwas-meta-data AD_sumstat_list.txt \
    --impute --qc-method dentist --finemapping-method susie_rss \
    --container oras://ghcr.io/cumc/pecotmr_apptainer:latest
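To restrict the analysis to particular regions, add --region-name and/or --region-list (the region below is hypothetical):

sos run xqtl-protocol/pipeline/rss_analysis.ipynb univariate_rss \
    --ld-meta-data ADSP_R4_EUR.LD.list \
    --gwas-meta-data AD_sumstat_list.txt \
    --region-name 1:16103-2888443 \
    --impute --qc-method dentist --finemapping-method susie_rss \
    --container oras://ghcr.io/cumc/pecotmr_apptainer:latest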
[global]
parameter: cwd = path("output/")
parameter: gwas_meta_data = path()
parameter: ld_meta_data = path()
parameter: gwas_name = []
parameter: gwas_data = []
parameter: column_mapping = []
parameter: region_list = path()
parameter: region_name = []
parameter: container = ''
parameter: skip_regions = []
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
parameter: job_size = 5
parameter: walltime = "10h"
parameter: mem = "16G"
parameter: numThreads = 1
def group_by_region(lst, partition):
    # Pass-through: each element of `partition` already corresponds to one region's group
    return partition
import os
if (not os.path.isfile(region_list)) and len(region_name) == 0:
    region_list = ld_meta_data
[get_analysis_regions: shared = "regional_data"]
import os
import pandas as pd
from collections import OrderedDict

def file_exists(file_path, relative_path=None):
    """Check if a file exists at the given path or relative to a specified path."""
    if os.path.exists(file_path) and os.path.isfile(file_path):
        return True
    elif relative_path:
        relative_file_path = os.path.join(relative_path, file_path)
        return os.path.exists(relative_file_path) and os.path.isfile(relative_file_path)
    return False

def check_required_columns(df, required_columns):
    """Check if the required columns are present in the dataframe."""
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Missing required columns: {', '.join(missing_columns)}")

def parse_region(region):
    """Parse a region string in 'chr:start-end' format into a list [chr, start, end]."""
    chrom, rest = region.split(':')
    start, end = rest.split('-')
    # Tolerate an optional "chr" prefix, consistent with the region_list parsing below
    return [int(chrom.replace("chr", "")), int(start), int(end)]
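# e.g. parse_region("chr1:16103-2888443") -> [1, 16103, 2888443]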

def extract_regional_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, region_name=None, region_list=None):
    """
    Extracts data from GWAS metadata files and additional GWAS data provided. 
    Optionally filters data based on specified regions.

    Args:
    - gwas_meta_data (str): File path to the GWAS metadata file.
    - gwas_name (list): Vector of GWAS study names.
    - gwas_data (list): Vector of GWAS data.
    - column_mapping (list, optional): Vector of column mapping files.
    - region_name (list, optional): List of region names in 'chr:start-end' format.
    - region_list (str, optional): File path to a file containing regions.

    Returns:
    - GWAS Dictionary: Maps study IDs to a list containing chromosome number, 
      GWAS file path, and optional column mapping file path.
    - Region Dictionary: Maps region names to lists [chr, start, end].

    Raises:
    - FileNotFoundError: If any specified file path does not exist.
    - ValueError: If required columns are missing in the input files or vector lengths mismatch.
    """
    # Check vector lengths
    if len(gwas_name) != len(gwas_data):
        raise ValueError("gwas_name and gwas_data must be of equal length")
    
    if len(column_mapping) > 0 and len(column_mapping) != len(gwas_name):
        raise ValueError("If column_mapping is provided, it must be of the same length as gwas_name and gwas_data")

    # Required columns for GWAS file type
    required_gwas_columns = ['study_id', 'chrom', 'file_path']

    # Base directory of the metadata files
    gwas_base_dir = os.path.dirname(gwas_meta_data)
    
    # Reading the GWAS metadata file
    gwas_df = pd.read_csv(gwas_meta_data, sep="\t")
    check_required_columns(gwas_df, required_gwas_columns)
    gwas_dict = OrderedDict()

    # Process additional GWAS data passed via --gwas-name/--gwas-data. These entries
    # carry no sample-size information (filled with 0) and are registered for
    # chromosomes 1-22 so the per-chromosome lookups downstream can find them.
    for name, data, mapping in zip(gwas_name, gwas_data, column_mapping or [None]*len(gwas_name)):
        gwas_dict[name] = {chrom: [data, mapping, 0, 0, 0] for chrom in range(1, 23)}

    for _, row in gwas_df.iterrows():
        file_path = row['file_path']
        mapping_file = row.get('column_mapping_file')
        n_sample = row.get('n_sample')
        n_case = row.get('n_case')
        n_control = row.get('n_control')

        # Check if the file and optional mapping file exist
        if not file_exists(file_path, gwas_base_dir) or (mapping_file and not file_exists(mapping_file, gwas_base_dir)):
            raise FileNotFoundError(f"File {file_path} or its column mapping file not found for {row['study_id']}")
        
        # Adjust paths if necessary
        file_path = file_path if file_exists(file_path) else os.path.join(gwas_base_dir, file_path)
        if mapping_file:
            mapping_file = mapping_file if file_exists(mapping_file) else os.path.join(gwas_base_dir, mapping_file)
        
        # Create or update the entry for the study_id
        if row['study_id'] not in gwas_dict:
            gwas_dict[row['study_id']] = {}

        # Expand chrom 0 to chrom 1-22, or use the specified chrom
        chrom_range = range(1, 23) if row['chrom'] == 0 else [row['chrom']]
        for chrom in chrom_range:
            if chrom in gwas_dict[row['study_id']]:
                existing_entry = gwas_dict[row['study_id']][chrom]
                raise ValueError(f"Duplicate chromosome specification for study_id {row['study_id']}, chrom {chrom}. "
                                 f"Conflicting entries: {existing_entry} and {[file_path, mapping_file]}")
            gwas_dict[row['study_id']][chrom] = [file_path, mapping_file, n_sample, n_case, n_control]

    # Process region_list and region_name at function level, so region_dict is
    # built once (not once per study/chromosome) and is defined even if the meta file is empty
    region_dict = dict()
    if region_list and os.path.isfile(region_list):
        with open(region_list, 'r') as file:
            for line in file:
                # Skip empty lines and comment/header lines
                if not line.strip() or line.startswith("#"):
                    continue
                parts = line.strip().split()
                if len(parts) == 1:
                    region = parse_region(parts[0])
                elif len(parts) == 3:
                    region = [int(parts[0].replace("chr", "")), int(parts[1]), int(parts[2])]
                elif len(parts) >= 4 and region_list != ld_meta_data:
                    # eQTL-style region list: chr, start, end, gene_id (and optionally gene_name)
                    region = [int(parts[0].replace("chr", "")), int(parts[1]), int(parts[2]), parts[3]]
                elif len(parts) >= 4:
                    # region_list defaulted to the LD metadata file, whose 4th column is a
                    # path rather than a gene ID; keep only chr, start, end
                    region = [int(parts[0].replace("chr", "")), int(parts[1]), int(parts[2])]
                else:
                    raise ValueError("Invalid region format in region_list")
                region_dict[f"{region[0]}:{region[1]}_{region[2]}"] = region
                
    if region_name:
        for region in region_name:
            parsed_region = parse_region(region)
            region_key = f"{parsed_region[0]}:{parsed_region[1]}_{parsed_region[2]}"
            if region_key not in region_dict:
                region_dict[region_key] = parsed_region

    return gwas_dict, region_dict

gwas_dict, region_dict = extract_regional_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, region_name, region_list)
regional_data = dict([("GWAS", gwas_dict), ("regions", region_dict)])
[univariate_rss]
parameter: L = 5
parameter: max_L = 10
# Filtering thresholds for RAISS imputation (passed to impute_opts below)
parameter: rcond = 0.01
parameter: lamb = 0.01
parameter: R2_threshold = 0.6
# Step size by which L is increased during fine-mapping, from init_L up to max_L
parameter: l_step = 5
parameter: pip_cutoff = 0.025
parameter: skip_analysis_pip_cutoff = 0.025
parameter: minimum_ld = 5
parameter: coverage = [0.95, 0.7, 0.5]
# Whether to impute the sumstat for all the snp in LD but not in sumstat.
parameter: impute = False 
# summary stats QC methods: rss_qc, dentist, slalom
parameter: qc_method = ""
# analysis method: single_effect, susie_rss, bayesian_conditional_regression
parameter: finemapping_method = "single_effect"
parameter: target = ""
parameter: target_column_index = "3"
# compute LD from genotype
parameter: compute_LD_from_genotype = False
# remove a variant if it has more than imiss missing individual level data
parameter: imiss = 1.0
# MAF cutoff
parameter: maf = 0.0025

depends: sos_variable("regional_data")
regions = list(regional_data['regions'].keys())
input: for_each = "regions"
output: f'{cwd:a}/{step_name}/{"%s" % qc_method.upper() if qc_method else "noQC"}{"_RAISS_imputed" if impute else ""}.chr{_regions.replace(":", "_")}.univariate{"_%s" % finemapping_method if finemapping_method else ""}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    library(pecotmr)
    library(dplyr)
    library(data.table)
    skip_region = c(${', '.join(['"{}"'.format(item) for item in skip_regions])})
    studies = c(${', '.join(['"{}"'.format(item) for item in regional_data["GWAS"].keys()])})
    sumstat_paths = c(${paths([regional_data['GWAS'][x][regional_data['regions'][_regions][0]][0] for x in regional_data["GWAS"].keys()]):r,})
    column_file_paths = c(${paths([regional_data['GWAS'][x][regional_data['regions'][_regions][0]][1] for x in regional_data["GWAS"].keys()]):r,})
    n_samples = c(${paths([regional_data['GWAS'][x][regional_data['regions'][_regions][0]][2] for x in regional_data["GWAS"].keys()]):r,})
    n_cases = c(${paths([regional_data['GWAS'][x][regional_data['regions'][_regions][0]][3] for x in regional_data["GWAS"].keys()]):r,})
    n_controls = c(${paths([regional_data['GWAS'][x][regional_data['regions'][_regions][0]][4] for x in regional_data["GWAS"].keys()]):r,})
    if("${target}" == ""){
        target = "${regional_data['regions'][_regions][3] if len(regional_data['regions'][_regions]) > 3 else None}"
        if(target == "None" | "${target_column_index}" == ""){
            target = ""
        }
    }else{target = "${target}"}
    if("${target_column_index}" == ""){
      target_column_index = ""
    }else{target_column_index = as.numeric("${target_column_index}")}
    
    if( ${"TRUE" if compute_LD_from_genotype else "FALSE"}){
    geno_path = readr::read_delim("${ld_meta_data}","\t")%>% filter(`#chr` == "${regional_data['regions'][_regions][0]}", 
                                                                    start == "${regional_data['regions'][_regions][1]}", 
                                                                      end == "${regional_data['regions'][_regions][2]}")%>%pull(path)%>%stringr::str_replace(".bed","")
    LD_data = pecotmr:::filter_X(
                            load_genotype_region(geno_path, 
                                        '${"chr%s:%s-%s" % (regional_data['regions'][_regions][0], regional_data['regions'][_regions][1], regional_data['regions'][_regions][2])}'),
                            missing_rate_thresh = ${imiss}, maf_thresh=${maf})%>%
                            cor

    ## When LD is computed from genotype ad hoc, there can be variants with unconventional
    ## ref/alt alleles, e.g. 4:185875956:C:<INS:ME:ALU>, which would break the allele_qc
    ## function in the following steps. The filtering below removes them and keeps only
    ## variants whose IDs contain exactly three ":", which allele_qc relies on.
    correct_variants <- rownames(LD_data)[sapply(rownames(LD_data), function(x) sum(strsplit(x, "", fixed = TRUE)[[1]] == ":") == 3)]
    LD_data = LD_data[correct_variants,correct_variants]
    LD_data = list(combined_LD_matrix  = LD_data, combined_LD_variants  = rownames(LD_data) )
     } else { 
    LD_data = load_LD_matrix("${ld_meta_data}", '${"chr%s:%s-%s" % (regional_data['regions'][_regions][0], regional_data['regions'][_regions][1], regional_data['regions'][_regions][2])}')
    }    


    res = setNames(replicate(length(studies), list(), simplify = FALSE), studies)
    for (r in 1:length(res)) {
        tryCatch({
        res[[r]] = rss_analysis_pipeline(sumstat_path = sumstat_paths[r], column_file_path = column_file_paths[r], 
                                        LD_data = LD_data, target = target,  target_column_index = target_column_index, n_sample = as.numeric(n_samples[r]), n_case = as.numeric(n_cases[r]), 
                                        n_control = as.numeric(n_controls[r]), skip_region = skip_region,
                                        qc_method = ${"NULL" if len(qc_method) == 0 else "'%s'" % qc_method},
                                        impute = ${"TRUE" if impute else "FALSE"},
                                        impute_opts = list(rcond = ${rcond}, R2_threshold = ${R2_threshold}, minimum_ld = ${minimum_ld}, lamb=${lamb}),
                                        finemapping_method = ${"NULL" if len(finemapping_method) == 0 else "'%s'" % finemapping_method}, 
                                        finemapping_opts = list(init_L = ${L}, max_L = ${max_L}, l_step = ${l_step}, coverage = c(${",".join([str(x) for x in coverage])}), signal_cutoff = ${pip_cutoff}),
                                        pip_cutoff_to_skip = ${skip_analysis_pip_cutoff} 
                                        )
        fwrite(res[[r]]$rss_data_analyzed, file = paste0("${_output:nn}.", target, studies[r], ".sumstats.tsv.gz"), sep = "\t", col.names = TRUE, row.names = FALSE, quote = FALSE, compress = "gzip")
        if(is.null(res[[r]][[1]])){
              res[[r]] = list()
        }
      }, error = function(e) {
        # use <<- so the assignment reaches `res` in the enclosing environment
        # rather than creating a copy local to the handler
        res[[r]] <<- list()
        cat(paste("Error processing file ", studies[r], ": ", conditionMessage(e), "\n"))
      }
      )}
    region = "chr${_regions.replace(":", "_")}"
    full_result = list()
    full_result[[region]] = res
    saveRDS(full_result, file = "${_output}")
[univariate_plot]
output: pip_plot = f"{cwd}/{_input:bn}.png"
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '20G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: container=container, expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint = entrypoint
    res = readRDS(${_input:r})
    png(${_output[0]:r}, width = 14, height=6, unit='in', res=300)
    par(mfrow=c(1,2))
    susieR::susie_plot(res, y= "PIP", pos=list(attr='pos',start=res$pos[1],end=res$pos[length(res$pos)]), add_legend=T, xlab="position")
    susieR::susie_plot(res, y= "z", pos=list(attr='pos',start=res$pos[1],end=res$pos[length(res$pos)]), add_legend=T, xlab="position", ylab="-log10(p)")
    dev.off()