xQTL-GWAS pairwise enrichment and colocalization

xQTL-GWAS pairwise enrichment and colocalization#

This workflow processes fine-mapping results for xQTL, generated by susie_twas in the mnm_regression.ipynb notebook for cis xQTL, and GWAS fine-mapping results produced by susie_rss in the rss_analysis.ipynb notebook. It is designed to perform enrichment and colocalization analysis, particularly when fine-mapping results originate from different regions in the case of cis-xQTL and GWAS. The pipeline is capable to integrate and analyze data across these distinct regions. Originally tailored for cis-xQTL and GWAS integration, this pipeline can be applied to other pairwise integrations. An example of such application is in trans analysis, where the fine-mapped regions might be identical between trans-xQTL and GWAS, representing a special case of this broader implementation.

Input#

Lists of SuSiE fine-mapping output objects, in RDS format, of class(susie) in R.

For xQTL the list is meta-data of format: chr, start, end, original_data, conditions_top_loci, block_top_loci where original_data is an RDS file, conditions_top_loci is showing which contexts have top loci table (potential signals) block_top_loci is the blocks have overlapped top loci variant with xQTL data. That file could be output from fine_mapping_post_processing/overlap_qtl_gwas.
For GWAS the list is meta-data of format: chr, start, end, original_data, block_top_loci... where original_data is an RDS file. That file could be output from fine_mapping_post_processing/gwas_results_export.
Context meta is a metafile that shows the analysis_names and the contained contexts. It can be used to identify the corresponding raw data for each context.

Analytical Logic#

Enrichment

Identify GWAS Blocks: Select GWAS blocks with top loci from Bellenguez data and using a single variant regression method. Locate the original filenames (original_data) for these blocks and map the analysis regions to identify overlapping gene analysis regions and corresponding QTL files with top loci table (condition_top_loci).
Contexts in xQTL Meta-data: Find contexts within the xQTL meta-data that include top loci results for each gene. Retrieve the corresponding original xQTL files (original_data) by referencing the context meta-data.
Execute xQTL GWAS Enrichment: Perform xqtl_gwas_enrichment analysis to generate enrichment parameters (a0, a1) and calculate priors (p1, p2, p12).

Coloc

xQTL Meta-data Contexts: Within the xQTL meta-data, identify contexts that include top loci results for each gene, indicated by condition_top_loci. Retrieve the corresponding original xQTL files (original_data) by referring back to the context meta-data. (Optional) Subset the QTL table with specifies gene list.
Original GWAS Files: Identify GWAS blocks that contain overlapping top loci variants for each gene (block_top_loci) in above file. Utilize the GWAS list to locate the original filenames (original_data) for the identified GWAS block candidates.
Enrichment and Coloc Analysis: Automatically execute xqtl_gwas_enrichment to enable enrichment analysis, generating priors for subsequent coloc analysis through logic defined in enrichment. Then, apply susie_coloc for coloc analysis on the selected xQTL and GWAS original files, analyzing each gene under each identified condition.

Output#

Enrichment analysis results — this is a global enrichment estimate that combines all input data for each context.
Colocalization results for regions of interest — if ld_meta_file_path is provided, would also report coloc cs by coloc postprocessing.

Example#

enrichment

sos run ~/codes/xqtl-pipeline/pipeline/SuSiE_enloc.ipynb xqtl_gwas_enrichment   \
    --gwas_meta_data  /mnt/vast/hpc/csg/rf2872/Work/pecotmr/encoloc_test/gwas.block_results_db.tsv     \
    --xqtl_meta_data  /mnt/vast/hpc/csg/rf2872/Work/Multivariate/gwas/overlap_test/ROSMAP_eQTL.overlapped.gwas.tsv         \
    --xqtl_finemapping_obj preset_variants_result susie_result_trimmed              \
    --xqtl_varname_obj preset_variants_result variant_names             \
    --gwas_finemapping_obj AD_Bellenguez_2022 single_effect_regression susie_result_trimmed          \
    --gwas_varname_obj  AD_Bellenguez_2022 single_effect_regression variant_names         \
    --xqtl_region_obj  region_info   grange   \
    --qtl-path /mnt/vast/hpc/csg/rf2872/Work/Multivariate/susie_2024_new/rds_files         \
    --gwas_path /mnt/vast/hpc/csg/hs3393/RSS_QC/GWAS_finemapping_Feb22_7dataset/SuSiE_RSS \
    --context_meta /mnt/vast/hpc/homes/rf2872/codes/fungen-xqtl-analysis/resource/context_meta.tsv 

coloc (would trigger enrichment automatically)

 sos run ~/codes/xqtl-pipeline/pipeline/SuSiE_enloc.ipynb susie_coloc    \
    --gwas_meta_data  /mnt/vast/hpc/csg/rf2872/Work/Multivariate/susie_2024_new/demo_gwas/demo_gwas.block_results_db.tsv    \
    --xqtl_meta_data  /mnt/vast/hpc/csg/rf2872/Work/Multivariate/susie_2024_new/demo_overlap/demo_overlap.overlapped.gwas.tsv   \
    --xqtl_finemapping_obj preset_variants_result susie_result_trimmed              \
    --xqtl_varname_obj preset_variants_result variant_names             \
    --gwas_finemapping_obj AD_Bellenguez_2022 RSS_QC_RAISS_imputed susie_result_trimmed              \
    --gwas_varname_obj  AD_Bellenguez_2022 RSS_QC_RAISS_imputed variant_names             \
    --xqtl_region_obj  region_info   grange       \
    --qtl-path /mnt/vast/hpc/homes/rf2872/aws/rds_files    \
    --gwas_path /home/hs3393/RSS_QC/GWAS_finemapping_Apr9/univariate_rss    \
    --context_meta /mnt/vast/hpc/homes/rf2872/codes/xqtl-analysis/resource/context_meta.tsv      \
    --ld_meta_file_path /mnt/vast/hpc/csg/data_public/20240120_ADSP_LD_matrix/ld_meta_file.tsv \
    --cwd demo_coloc

eg: gwas_meta_data
output from fine_mapping_post_processing/gwas_results_export

#chr	start	end	region_id	TSS	original_data	combined_data	combined_data_sumstats	conditions	conditions_top_loci
chr1	101384274	104443097	1_101384274-104443097	NA	1_101384274-104443097.susie_rss.rds	gwas.1_101384274-104443097.cis_results_db.export.rds	gwas.1_101384274-104443097.cis_results_db.export_sumstats.rds	NA	NA
chr19	44935906	46842901	19_44935906-46842901	NA	19_44935906-46842901.susie_rss.rds	gwas.19_44935906-46842901.cis_results_db.export.rds	gwas.19_44935906-46842901.cis_results_db.export_sumstats.rds	AD_Bellenguez_2022,AD_Jansen_2021,AD_Kunkle_Stage1_2019,AD_Wightman_Full_2021,AD_Wightman_Excluding23andMe_2021,AD_Wightman_ExcludingUKBand23andME_2021	AD_Wightman_ExcludingUKBand23andME_2021.qc_impute,AD_Wightman_ExcludingUKBand23andME_2021.qc_only	AD_Wightman_ExcludingUKBand23andME_2021.qc_impute

eg: xqtl_meta_data

for enrichment: output from fine_mapping_post_processing/cis_results_export; for coloc: output from fine_mapping_post_processing/overlap_qtl_gwas, the difference between them is without or with last 2 columns: block_top_loci, final_combined_data to document overlapped block info at variant level

#chr	start	end	region_id	TSS	original_data	combined_data	combined_data_sumstats	conditions	conditions_top_loci	block_top_loci	final_combined_data
chr1	167600000	171480000	ENSG00000000457	169894266	MSBB_eQTL.ENSG00000000457.univariate_susie_twas_weights.rds, ROSMAP_Kellis_eQTL.ENSG00000000457.univariate_susie_twas_weights.rds	Fungen_xQTL.ENSG00000000457.cis_results_db.export.rds	Fungen_xQTL.ENSG00000000457.cis_results_db.export_sumstats.rds	BM_10_MSBB_eQTL,BM_22_MSBB_eQTL,BM_36_MSBB_eQTL,BM_44_MSBB_eQTL,Ast_Kellis_eQTL	BM_22_MSBB_eQTL,BM_44_MSBB_eQTL		
chr19	41840000	47960000	ENSG00000130203	44905790	ROSMAP_Kellis_eQTL.ENSG00000130203.univariate_susie_twas_weights.rds, ROSMAP_Bennett_Klein_pQTL.ENSG00000130203.univariate_susie_twas_weights.rds	Fungen_xQTL.ENSG00000130203.cis_results_db.export.rds	Fungen_xQTL.ENSG00000130203.cis_results_db.export_sumstats.rds	Mic_Kellis_eQTL,DLPFC_Bennett_pQTL	Mic_Kellis_eQTL	19_44935906-46842901	Fungen_xQTL.ENSG00000130203.cis_results_db.export.overlapped.gwas.rds
chr20	49934867	53560000	ENSG00000000419	50958554	MSBB_eQTL.ENSG00000000419.univariate_susie_twas_weights.rds	Fungen_xQTL.ENSG00000000419.cis_results_db.export.rds	Fungen_xQTL.ENSG00000000419.cis_results_db.export_sumstats.rds	BM_10_MSBB_eQTL,BM_22_MSBB_eQTL	BM_22_MSBB_eQTL	Fungen_xQTL.ENSG00000000419.cis_results_db.export.overlapped.gwas.rds

eg: context_meta

should be updated as more analysis is done, we used analysis_name and context in this analysis

analysis_name	cohort	context
KNIGHT_pQTL	KNIGHT	Knight_pQTL_brain,Knight_eQTL_brain
MSBB_eQTL	MSBB	BM_10_MSBB_eQTL,BM_22_MSBB_eQTL,BM_36_MSBB_eQTL,BM_44_MSBB_eQTL
MIGA_eQTL	MIGA	MiGA_GTS_eQTL,MiGA_SVZ_eQTL,MiGA_THA_eQTL

[global]
# Workdir
parameter: cwd = path("output")
# A list of file paths for fine-mapped GWAS results. 
parameter: gwas_meta_data = path
# A list of file paths for fine-mapped xQTL results. 
parameter: xqtl_meta_data = path
# Optional: if a region list is provide the enrichment analysis will be focused on provided region. 
# The LAST column of this list will contain the ID of regions to focus on
parameter: region_list = path()
# Optional: if a region name is provided 
# the analysis would be focused on the union of provides region list and region names
parameter: region_name = []
# It is required to input the name of the analysis
parameter: name = f"{xqtl_meta_data:bnn}.{gwas_meta_data:bnn}"
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "5m"
# Memory expected: quite large for enrichment analysis but small for xQTL colocalization
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
#Optional table name in xQTL RDS files to get fine mapping results (eg 'susie_result_trimmed ').
parameter: xqtl_finemapping_obj = []
#Optional table name in xQTL RDS files to get variable names (eg ' variant_names ').
parameter: xqtl_varname_obj = []
#Optional table name in GWAS RDS files to get fine mapping results (eg 'AD_Bellenguez_2022 single_effect_regression susie_result_trimmed ').
parameter: gwas_finemapping_obj = []
#Optional table name in GWAS RDS files to get variable names(eg 'AD_Bellenguez_2022 single_effect_regression variant_names ').
parameter: gwas_varname_obj = []
#Optional table name in xQTL RDS files to get region info (eg 'susie_result_trimmed region_info grange ').
parameter: xqtl_region_obj = []
#Optional table name in GWAS RDS files to get region info (eg 'AD_Bellenguez_2022 single_effect_regression region_info grange ').
parameter: gwas_region_obj = []
#Directory path for GWAS orignal finemapping results 
parameter: gwas_path = ''
#Directory path for xQTL orignal finemapping results 
parameter: qtl_path = ''
# a meta file showing the context and corresponding analysis_name
parameter: context_meta = path()
# Optional: if a region list is provide the analysis will be focused on provided region. 
# The LAST column of this list will contain the ID of regions to focus on
# Otherwise, all regions with both genotype and phenotype files will be analyzed
parameter: region_list = path()
# Optional: if a region name is provided 
# the analysis would be focused on the union of provides region list and region names
parameter: region_name = []
# use conditions_top_loci_minp column or not, which can reduce a lot computing resource by pick top 1 isoform
parameter: minp = False
import os
import pandas as pd
import re
def group_by_region(lst, partition):
    # from itertools import accumulate
    # partition = [len(x) for x in partition]
    # Compute the cumulative sums once
    # cumsum_vector = list(accumulate(partition))
    # Use slicing based on the cumulative sums
    # return [lst[(cumsum_vector[i-1] if i > 0 else 0):cumsum_vector[i]] for i in range(len(partition))]
    return partition

def make_unique_data(data):
    return ','.join(set(data.split(',')))

def generate_meta_dataframe(meta_data_path):
    """Generate a new long metadata by converting to context:analysis_name (1:1)."""
    meta = pd.read_csv(meta_data_path, sep='\t')
    new_meta = pd.DataFrame()
    contexts = pd.unique(meta['context'].str.split(',', expand=True).stack().str.strip())

    for context in contexts:
        mask = meta['context'].str.contains(context)
        tmp = pd.DataFrame({
            'context': [context],
            'analysis_name': [','.join(meta.loc[mask, 'analysis_name'])]
        })
        new_meta = pd.concat([new_meta, tmp], ignore_index=True)

    return new_meta

def generate_condition_based_dataframe(xqtl_df, minp):
    """Generate a new long table by converting to condition and region:original_data (1:1)."""
    # only consider the QTL contexts that have top loci data
    condition_column = 'conditions_top_loci_minp' if minp else 'conditions_top_loci'
    conditions = pd.unique(xqtl_df[condition_column].str.split(',', expand=True).stack().str.strip())
    new_df = pd.DataFrame()

    for condition in conditions:
        escaped_condition = re.escape(condition)
        mask = xqtl_df[condition_column].str.contains(escaped_condition, case=False, regex=True)
        
        # Iterate through each unique region in the filtered data
        unique_regions = pd.unique(xqtl_df.loc[mask, 'region_id'])
        for region_id in unique_regions:
            region_mask = (xqtl_df['region_id'] == region_id) & mask
            tmp = pd.DataFrame({
                'condition': [condition],
                'region_id': [region_id],
                'QTL_original_data': [','.join(xqtl_df.loc[region_mask, 'original_data'])],
                'GWAS_original_data': [','.join(xqtl_df.loc[region_mask, 'block_data'])]
            })
            new_df = pd.concat([new_df, tmp], ignore_index=True)

    return new_df


def merge_and_filter_dfs(new_df, new_meta):
    """Merge and filter the dataframes based on condition/context where context is a substring of condition, and analysis_name to pick the corresponding original files."""
    # Explode the QTL data into separate rows
    new_df = new_df.set_index(['condition', 'region_id', 'GWAS_original_data'])['QTL_original_data'].str.split(',', expand=True).stack().reset_index(name='QTL_original_data').drop('level_3', axis=1)
    new_df['analysis_name_prefix'] = new_df['QTL_original_data'].apply(lambda x: x.split('.')[0])

    # Create a custom merge logic to match context as a substring of condition
    def custom_merge(row, df_meta):
        # Filter meta dataframe to find any context that is part of the condition string
        sub_df = df_meta[df_meta['context'].apply(lambda x: x in row['condition'])]
        if not sub_df.empty:
            return sub_df
        else:
            return pd.DataFrame(columns=df_meta.columns)  # Return an empty DataFrame with the same columns if no match found

    # Apply custom merge function row-wise and concatenate results
    results = [custom_merge(row, new_meta) for index, row in new_df.iterrows()]
    merged_df = pd.concat(results, keys=new_df.index).reset_index(level=1, drop=True).join(new_df, how='outer')

    # Filter rows where analysis_name_prefix matches analysis_name exactly
    filtered_df = merged_df[merged_df['analysis_name_prefix'].str.strip() == merged_df['analysis_name'].str.strip()]

    # Drop unnecessary columns and duplicates
    filtered_df = filtered_df.drop(columns=['analysis_name_prefix', 'context']).drop_duplicates()
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)

    return filtered_df

def prepare_final_paths(filtered_df, qtl_path, gwas_path):
    """Prepare final paths for QTL and GWAS original data."""
    filtered_df['QTL_original_data'] = filtered_df['QTL_original_data'].apply(
        lambda x: ','.join(f"{qtl_path}/{file_name.strip()}" for file_name in x.split(','))
    )
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(
        lambda x: ','.join(f"{gwas_path}/{file_name.strip()}" for file_name in x.split(','))
    )
    return filtered_df

# get overlapped regions by gene and block region for enrichment analysis
[get_analysis_regions: shared = "regional_data"]
import pandas as pd

def process_dataframes_with_toploci(xqtl_meta_data, gwas_meta_data):
    """Process the XQTL and GWAS dataframes."""
    xqtl_df = pd.read_csv(xqtl_meta_data, sep="\t")
    # filter xQTL data with the ones have overlapped top loci variants with GWAS data
    xqtl_df = xqtl_df[xqtl_df['conditions_top_loci_minp' if minp else 'conditions_top_loci'].notna()]

    gwas_df = pd.read_csv(gwas_meta_data, sep="\t")
    gwas_df = gwas_df[gwas_df['conditions_top_loci'].notna()]
    # e.g. gwas_finemapping_obj = ['AD_Bellenguez_2022', 'single_effect_regression', 'susie_result_trimmed']
    # filter gwas data with find variants in specified cohort 'AD_Bellenguez_2022' and method 'single_effect_regression'
    gwas_finemapping_obj_filtered = [obj for obj in gwas_finemapping_obj if obj != 'susie_result_trimmed']
    if len(gwas_finemapping_obj_filtered) > 0:
        pattern = '|'.join(gwas_finemapping_obj_filtered)
        gwas_df = gwas_df[gwas_df['conditions_top_loci'].str.contains(pattern)]

    xqtl_df = overlapped_analysis_region(xqtl_df, gwas_df)
    return xqtl_df, gwas_df



def overlapped_analysis_region(xqtl_df, gwas_df):
    # Create an empty dataframe to store overlapping xQTL records
    overlapped_xqtl_df = pd.DataFrame()
    
    # Iterate over each row in the GWAS dataframe
    for index, gwas_row in gwas_df.iterrows():
        gwas_chr = gwas_row['#chr']
        gwas_start = gwas_row['start']
        gwas_end = gwas_row['end']
        gwas_block_id = gwas_row['original_data']  # Assuming each GWAS row has a unique block ID
        
        # Filter xQTL dataframe for the same chromosome
        same_chr_xqtl_df = xqtl_df[xqtl_df['#chr'] == gwas_chr]
        
        # Find overlapping xQTL regions with the current GWAS region
        overlapping_xqtl = same_chr_xqtl_df[((same_chr_xqtl_df['start'] <= gwas_end) & 
                                             (same_chr_xqtl_df['end'] >= gwas_start))].copy() # Add .copy() to avoid SettingWithCopyWarning
        
        # Check if there are any overlapping xQTLs
        if not overlapping_xqtl.empty:
            # Use .loc to safely assign 'block_data' to avoid the SettingWithCopyWarning
            overlapping_xqtl.loc[:, 'block_data'] = gwas_block_id
            
            # Append the overlapping xQTLs to the accumulated dataframe
            overlapped_xqtl_df = pd.concat([overlapped_xqtl_df, overlapping_xqtl], ignore_index=True)
        
    # Group by xQTL region and concatenate 'block_data'
    overlapped_xqtl_df['block_data'] = overlapped_xqtl_df.groupby(['#chr', 'start', 'end'])['block_data'].transform(lambda x: ','.join(x))
    overlapped_xqtl_df = overlapped_xqtl_df.drop_duplicates().reset_index(drop=True)

    return overlapped_xqtl_df


xqtl_df, gwas_df = process_dataframes_with_toploci(xqtl_meta_data, gwas_meta_data)

region_ids=[]
# If region_list is provided, read the file and extract IDs
if not region_list.is_dir():
    if region_list.is_file():
        region_list_df = pd.read_csv(region_list, sep='\t', header=None, comment = "#")
        region_ids = region_list_df.iloc[:, -1].unique()  # Extracting the last column for IDs
    else:
        raise ValueError("The region_list path provided is not a file.")
        
if len(region_name) > 0:
    region_ids = list(set(region_ids).union(set(region_name)))
    
if len(region_ids) > 0:
    xqtl_df = xqtl_df[xqtl_df['region_id'].isin(region_ids)]
    
new_df = generate_condition_based_dataframe(xqtl_df, minp)

# if there is only one original data file, we don't have to map to context meta
if (new_df['QTL_original_data'].str.split(',').apply(lambda x: len(x) in [0, 1]).all() and not context_meta.is_file()):
    filtered_df = new_df.copy()
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)
else:
    filtered_df = merge_and_filter_dfs(new_df, generate_meta_dataframe(context_meta))

filtered_df = prepare_final_paths(filtered_df, qtl_path, gwas_path)
filtered_df = filtered_df.groupby('condition').agg({
    'GWAS_original_data': lambda x: ','.join(x),
    'QTL_original_data': lambda x: ','.join(x)
}).reset_index()
regional_data = {
    'data': [row['QTL_original_data'].split(',') for _, row in filtered_df.iterrows()],
    'conditions': [(f"{row['condition']}", *row['GWAS_original_data'].split(',')) for _, row in filtered_df.iterrows()]
}

# get overlapped regions from flatten table, to get the regions with overlapped variants and select those regions for coloc analysis
[get_overlapped_analysis_regions: shared = "overlapped_regional_data"]
import pandas as pd
def map_blocks_to_data(blocks, mapping_dict):
    """Map blocks to data using a provided mapping dictionary."""
    mapped_data = ','.join(mapping_dict.get(block.strip(), 'NA') for block in blocks.split(','))
    return mapped_data

def process_dataframes(xqtl_meta_data, gwas_meta_data):
    """Process the XQTL and GWAS dataframes."""
    xqtl_df = pd.read_csv(xqtl_meta_data, sep="\t")
    # filter xQTL data with the ones have overlapped top loci variants with GWAS data
    xqtl_df = xqtl_df[xqtl_df['block_top_loci'].notna()]

    gwas_df = pd.read_csv(gwas_meta_data, sep="\t")    
    gwas_df = gwas_df[gwas_df['conditions_top_loci'].notna()]
    # e.g. gwas_finemapping_obj = ['AD_Bellenguez_2022', 'single_effect_regression', 'RSS_QC_RAISS_imputed']
    # filter gwas data with find variants in specified cohort 'AD_Bellenguez_2022' and method 'RSS_QC_RAISS_imputed'
    gwas_finemapping_obj_filtered = [obj for obj in gwas_finemapping_obj if obj != 'susie_result_trimmed' and obj != 'RSS_QC_RAISS_imputed']
    if len(gwas_finemapping_obj_filtered) > 0:
        pattern = '|'.join(gwas_finemapping_obj_filtered)
        gwas_df = gwas_df[gwas_df['conditions_top_loci'].str.contains(pattern)]

    gwas_df = gwas_df[gwas_df['region_id'].isin(xqtl_df['block_top_loci'].str.split(',').explode().unique())]

    region_to_combined_data = dict(zip(gwas_df['region_id'], gwas_df['original_data']))
    # and only consider the overlapped GWAS blocks
    xqtl_df['block_data'] = xqtl_df['block_top_loci'].apply(map_blocks_to_data, args=(region_to_combined_data,))
    xqtl_df['block_data'] = xqtl_df['block_data'].str.replace(',NA', '').str.replace('NA,', '')

    return xqtl_df, gwas_df

xqtl_df, gwas_df = process_dataframes(xqtl_meta_data, gwas_meta_data)

region_ids=[]
# If region_list is provided, read the file and extract IDs
if not region_list.is_dir():
    if region_list.is_file():
        region_list_df = pd.read_csv(region_list, sep='\t', header=None, comment = "#")
        region_ids = region_list_df.iloc[:, -1].unique()  # Extracting the last column for IDs
    else:
        raise ValueError("The region_list path provided is not a file.")
        
if len(region_name) > 0:
    region_ids = list(set(region_ids).union(set(region_name)))
    
if len(region_ids) > 0:
    xqtl_df = xqtl_df[xqtl_df['region_id'].isin(region_ids)]
    
    
new_df = generate_condition_based_dataframe(xqtl_df, minp)

# if there is only one original data file, we don't have to map to context meta
if (new_df['QTL_original_data'].str.split(',').apply(lambda x: len(x) in [0, 1]).all() and not context_meta.is_file()):
    filtered_df = new_df.copy()
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)
else:
    filtered_df = merge_and_filter_dfs(new_df, generate_meta_dataframe(context_meta))

filtered_df = prepare_final_paths(filtered_df, qtl_path, gwas_path)

overlapped_regional_data = {
    'data': [row['QTL_original_data'].split(',') for _, row in filtered_df.iterrows()],
    'conditions': [(f"{row['condition']}@{row['region_id']}", *row['GWAS_original_data'].split(',')) for _, row in filtered_df.iterrows()]
}

[xqtl_gwas_enrichment]
depends: sos_variable("regional_data")
stop_if(len(regional_data['data']) == 0, f'No files left for analysis')

meta = regional_data['conditions']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta"
output: f'{cwd:a}/{name}.{_meta[0]}.enrichment.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    library(tidyverse)
    library(pecotmr)
    # RDS files for GWAS data
    gwas_finemapped_data = c(${paths([x for x in _meta[1:len(_meta)]]):r,})
    # RDS files for xQTL data
    xqtl_finemapped_data = c(${paths([x for x in _input]):r,})
    enrich_result = xqtl_enrichment_wrapper(gwas_files = gwas_finemapped_data, xqtl_files = xqtl_finemapped_data, 
                                                xqtl_finemapping_obj =  c("${_meta[0]}",${",".join(['"%s"' % x  for x in xqtl_finemapping_obj]) if len(xqtl_finemapping_obj) != 0 else "NULL"}), 
                                                xqtl_varname_obj =   c("${_meta[0]}",${",".join(['"%s"' % x  for x in xqtl_varname_obj]) if len(xqtl_varname_obj) != 0 else "NULL"}), 
                                                gwas_finemapping_obj =  c(${",".join(['"%s"' % x for x in gwas_finemapping_obj]) if len(gwas_finemapping_obj) != 0 else "NULL"}), 
                                                gwas_varname_obj =  c(${",".join(['"%s"' % x for x in gwas_varname_obj]) if len(gwas_varname_obj) != 0 else "NULL"}))
    saveRDS(enrich_result, ${_output:ar})

[susie_coloc]
depends: sos_variable("overlapped_regional_data")
# depends: sos_step("xqtl_gwas_enrichment") #changed
stop_if(len(overlapped_regional_data['data']) == 0, f'No files left for analysis')
# skip enrichment or not
parameter: skip_enrich=False
# filter lbf by cs as default in susie coloc. default is False means filtering by V > prior_tol
parameter: filter_lbf_cs = False
# ld reference path for postprocessing 
parameter: ld_meta_file_path = path()
meta = overlapped_regional_data['conditions']
input: overlapped_regional_data["data"], group_by = lambda x: group_by_region(x, overlapped_regional_data["data"]), group_with = "meta"
# output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta[0]}.coloc.rds'
output: f'{cwd:a}/{step_name}/{name}.{_meta[0]}.coloc.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    library(tidyverse)
    library(pecotmr)
    library(coloc) 
    # RDS files for xQTL data
     # RDS files for xQTL data
    xqtl_finemapped_datas = c(${paths([x for x in _input]):r,})
    coloc_res <- list()
    # are we doing something like this? that means results are by each context, and each one has 
    for(xqtl_finemapped_data in xqtl_finemapped_datas){
      qtl_dat <- readRDS(xqtl_finemapped_data)
      gene = names(qtl_dat)
      context = "${_meta[0]}" %>% str_split(.,"@",simplify = T) %>% .[,1] 
      gene_region = pecotmr:::get_nested_element(qtl_dat[[1]],  c(context,"region_info", "grange"))
  
      # Step 1: find relevant GWAS regions that overlap each the xQTL region of interest
      gwas_finemapped_datas <- c(${ ",".join(['"%s"' % x for x in _meta[1:]])}) 
      gwas_finemapped_datas <- gwas_finemapped_datas[grep('rds$',gwas_finemapped_datas)]
 
      gwas_regions <- gwas_finemapped_datas %>% basename %>% str_extract(., "chr\\d+_\\d+_\\d+")
      overlap_index <- NULL
      for (i in 1:length(gwas_regions)) {
        region <- gwas_regions[i]
        split_region <- unlist(strsplit(region, "_"))
        block_chrom <- as.numeric(split_region[1] %>% gsub("chr","",.))
        block_start <- as.numeric(split_region[2] %>% strsplit(., "-") %>% unlist %>% .[1])
        block_end <- as.numeric(split_region[2] %>% strsplit(., "-") %>% unlist %>% .[2])
        if (gene_region$chrom == block_chrom && (gene_region$start <= block_end |  gene_region$end >= block_start)) {
          overlap_index <- c(overlap_index, i)
        }
      }
  
     if (!is.null(overlap_index)) {
       gwas_finemapped_data <- gwas_finemapped_datas[overlap_index]
  
       # Step 2: load enrichment analysis results
       # Extract values for p1, p2, and p12
     if(${"FALSE" if skip_enrich else "TRUE"}){
       enrich_file = paste0('${cwd:a}','/', '${name}', '.' ,context,'.enrichment.rds')
       p1 <-  readRDS(enrich_file)[[1]]$`Alternative (coloc) p1`
       p2 <-  readRDS(enrich_file)[[1]]$`Alternative (coloc) p2`
       p12 <-  readRDS(enrich_file)[[1]]$`Alternative (coloc) p12`
      
       message("Priors are P1:", p1, "; p2: ", p2, "; p12: ", p12)
     } else {
       p1 = 1e-4
       p2 = 1e-4
       p12 = 5e-6
       message("Priors are using default, P1: 1e-4, P2: 1e-4, P12: 5e-6" )
     }
       # Step 3: Apply colocalization analysis between each condition and GWAS
       coloc_res[[gene]] <- coloc_wrapper(xqtl_file = xqtl_finemapped_data, gwas_files = gwas_finemapped_data, 
                                          xqtl_finemapping_obj =  c(context,${",".join(['"%s"' % x  for x in xqtl_finemapping_obj]) if len(xqtl_finemapping_obj) != 0 else "NULL"}), 
                                          xqtl_varname_obj =   c(context,${",".join(['"%s"' % x  for x in xqtl_varname_obj]) if len(xqtl_varname_obj) != 0 else "NULL"}), 
                                          gwas_finemapping_obj =  c(${",".join(['"%s"' % x for x in gwas_finemapping_obj]) if len(gwas_finemapping_obj) != 0 else "NULL"}), 
                                          gwas_varname_obj =  c(${",".join(['"%s"' % x for x in gwas_varname_obj]) if len(gwas_varname_obj) != 0 else "NULL"}),
                                          xqtl_region_obj =   c(context,${",".join(['"%s"' % x  for x in xqtl_region_obj]) if len(xqtl_region_obj) != 0 else "NULL"}), 
                                          gwas_region_obj =  c(${",".join(['"%s"' % x for x in gwas_region_obj]) if len(gwas_region_obj) != 0 else "NULL"}),
                                          p1 = p1, p2 = p2, p12 = p12, filter_lbf_cs = ${"FALSE" if skip_enrich else "TRUE"})

        if (${"TRUE" if ld_meta_file_path.is_file() else "FALSE"}) {
          if(length(coloc_res[[gene]]) > 2) {
              coloc_res[[gene]] <- coloc_post_processor(coloc_res[[gene]], LD_meta_file_path = ${ld_meta_file_path:r}, analysis_region = coloc_res[[gene]]$analysis_region)
            if(!is.null(coloc_res[[gene]]$sets$cs))  writeLines(coloc_res[[gene]]$sets$cs %>% unlist, gsub("rds$","coloc_res","${_output}"))
          } else { message("No coloc results were generated.") }
        }
      
      } else {
        message("No overlap was found between GWAS blocks and QTL analysis region.")
        coloc_res <-  "No overlap was found between GWAS blocks and QTL analysis region."
      }
    }
    saveRDS(coloc_res, ${_output:r})