QTL Association Testing

QTL Association Testing#

Description#

We perform QTL association testing using TensorQTL [cf. Taylor-Weiner et al (2019)]. An additional protocol was added to test for quantile QTL associations.

Input#

List of molecular phenotype files: a list of bed.gz files containing the table for the molecular phenotype. It should have a companion index file in tbi format. It is the output of gene_annotation or phenotype_by_chorm

Example phenotype list#

#chr    start   end ID  path
chr12   752578  752579  ENSG00000060237  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   990508  990509  ENSG00000082805  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   2794969 2794970 ENSG00000004478  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   4649113 4649114 ENSG00000139180  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   6124769 6124770 ENSG00000110799  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   6534516 6534517 ENSG00000111640  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz

The header of the bed.gz is per the [TensorQTL](https://github.com/broadinstitute/tensorqtl) convention:

>    Phenotypes must be provided in BED format, with a single header line starting with # and the first four columns corresponding to: chr, start, end, phenotype_id, with the remaining columns corresponding to samples (the identifiers must match those in the genotype input). The BED file should specify the center of the cis-window (usually the TSS), with start == end-1.

List of genotypes in PLINK binary format (bed/bim/fam) for each chromosome, previously processed through our genotype QC pipelines.
Covariate file, a file with #id + samples name as colnames and each row a covariate: fixed and known covariates as well as hidden covariates recovered from factor analysis.
Optionally, a list of traits (genes, regions of molecular features etc) to analyze.

For cis-analysis:

Optionally, a list of genomic regions associate with each molecular features to analyze. The default cis-analysis will use a window around TSS. This can be customized to take given start and end genomic coordinates.

Output#

For each chromosome, several of summary statistics files are generated, including both nominal test statistics for each test, as well as region (gene) level association evidence.

The columns of nominal association result are as follows:

phenotype_id: Molecular trait identifier.(gene)
variant_id: ID of the variant (rsid or chr:position:ref:alt)
tss_distance: Distance of the SNP to the gene transcription start site (TSS)
af: The allele frequency of this SNPs
ma_samples: Number of samples carrying the minor allele
ma_count: Total number of minor alleles across individuals
pval: Nominal P-value from linear regression
beta: Slope of the linear regression
se: Standard error of beta
chr : Variant chromosome.
pos : Variant chromosomal position (basepairs).
ref : Variant reference allele (A, C, T, or G).
alt : Variant alternate allele.

The columns of interaction association results are as follows:

Model: phenotype ~ \(\beta\)*snp + \(\beta_{msex}\)*msex + \(\beta_{interaction}\)*snp *msex (taking msex as the interaction factor)

chrom: chromosome number
pos : Variant chromosomal position (basepairs)
a1: Variant reference allele (A, C, T, or G)
a2: Variant alternate allele
molecular_trait_id: Molecular trait identifier, varies from phenotypes to phenotypes.
variant_id - ID of the top variant (rsid or chr:position:ref:alt)
af - Alternative allele frequency in MiGA cohort
ma_samples - Number of samples carrying the minor allele
ma_count: Total number of minor alleles across individuals
pvalue: P-value of the main effect from the nonlinear regression
beta: Slope of the main effect from the nonlinear regression
se: Standard error of beta
pvalue_msex: P-value of the msex term from the nonlinear regression
beta_msex: Slope of the msex term from the nonlinear regression
se_msex: Standard error of beta_msex
pvalue_msex_interaction: P-value of the interaction term from the nonlinear regression
beta_msex_interaction: Slope of the interaction term from the nonlinear regression
se_msex_interaction: Standard error of beta_msex_interaction
molecular_trait_object_id: an intermediate id, you can ignore
n: number of phenotype
qvalue_main: the false discovery rate of the main effect(pvalue)
qvalue_interaction: the false discovery rate of the interaction effect(pvalue_msex_interaction)

The column specification of region (gene) level association evidence are as follows:

phenotype_id - Molecular trait identifier. (gene)
num_var - Total number of variants tested in cis
beta_shape1 - First parameter value of the fitted beta distribution
beta_shape2 - Second parameter value of the fitted beta distribution
true_df - Effective degrees of freedom the beta distribution approximation
pval_true_df - Empirical P-value for the beta distribution approximation
variant_id - ID of the top variant (rsid or chr:position:ref:alt)
tss_distance - Distance of the SNP to the gene transcription start site (TSS)
ma_samples - Number of samples carrying the minor allele
ma_count - Total number of minor alleles across individuals
af - Alternative allele frequency in MiGA cohort
ref_factor - Flag indicating if the alternative allele is the minor allele in the cohort (1 if AF <= 0.5, -1 if not)
pval_nominal - Nominal P-value from linear regression
slope - Slope of the linear regression
slope_se - Standard error of the slope
pval_perm - First permutation P-value directly obtained from the permutations with the direct method
pval_beta - Second permutation P-value obtained via beta approximation. This is the one to use for downstream analysis

The columns of nominal association results for quantile regression are as follows:

phenotype_id: Molecular trait identifier.(gene)
variant_id: ID of the variant (rsid or chr:position:ref:alt)
chr : Variant chromosome.
pos : Variant chromosomal position (basepairs).
ref : Variant reference allele (A, C, T, or G).
alt : Variant alternate allele.
af: Alternative allele frequency in MiGA cohort
combined_pval(composite-p value using cauchy combination method): the integrated QR p-value across multiple quantile levels.
qr_0.1_pval to qr_0.9_pval: quantile-specific QR p-values for the quantile levels 0.1, 0.2, …, 0.9.
qr_0.1_slope to qr_0.9_slope: quantile-specific QR coefficients for the quantile levels 0.1, 0.2, …, 0.9.
qr_0.1_tval to qr_0.9_tval: quantile-specific QR t-values for the quantile levels 0.1, 0.2, …, 0.9.

Minimal Working Example Steps#

The data can be found on Synapse.

i. Cis TensorQTL Command#

sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file output/genotype_by_chrom/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt \
    --phenotype-file  output/phenotype_by_chrom/protocol_example.protein.bed.phenotype_by_chrom_files.region_list.txt \
    --covariate-file output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
    --customized_cis_windows prototype_example/protocol_example/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --cwd output/cis_association/ \
    --MAC 5 \
    --container containers/TensorQTL.sif 

ii. Trans TensorQTL Command#

Note: Some protein is not in the customized cis windows list. There we will need to remove them from the analysis by create a region_list. Noted that the region list need to be a actual file. So <() file is not acceptable.

zcat output/protocol_example.protein.bed.gz | cut -f 1,2,3,4 | grep -v -e ENSG00000163554 \
    -e ENSG00000171564 -e ENSG00000171560 -e ENSG00000171557 > output/protocol_example.protein.region_list

The following will take more than 180G of memory to run.

sos run xqtl-protocol/pipeline/TensorQTL.ipynb trans \
    --genotype-file output/genotype_by_chrom/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt \
    --phenotype-file  output/phenotype_by_chrom/protocol_example.protein.bed.phenotype_by_chrom_files.region_list.txt \
    --region-list output/protocol_example.protein.region_list \
    --covariate-file output/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.pca.Marchenko_PC.gz \
    --customized-cis-windows input/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --cwd output/association/trans/ \
    --MAC 5 --numThreads 8 -J 1 -q csg --mem 240G -c /mnt/vast/hpc/csg/molecular_phenotype_calling/csg.yml \
    --container containers/TensorQTL.sif 

Troubleshooting#

Step	Substep	Problem	Possible Reason	Solution

Command Interface#

sos run TensorQTL.ipynb -h

usage: sos run TensorQTL.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  cis
  trans

Global Workflow Options:
  --cwd output (as path)
                        Path to the work directory of the analysis.
  --phenotype-file VAL (as path, required)
                        Phenotype file, or a list of phenotype per region.
  --genotype-file VAL (as path, required)
                        A genotype file in PLINK binary format (bed/bam/fam)
                        format, or a list of genotype per chrom
  --covariate-file VAL (as path, required)
                        Covariate file
  --name  f"{phenotype_file:bn}_{covariate_file:bn}"

                        Prefix for the analysis output
  --region-list . (as path)
                        An optional subset of regions of molecular features to
                        analyze. The last column is the gene names
  --region-list-phenotype-column 4 (as int)
  --keep-sample . (as path)
                        Set list of sample to be keep
  --interaction ''
                        FIXME: please document
  --customized-cis-windows . (as path)
                        An optional list documenting the custom cis window for
                        each region to analyze, with four column, chr, start,
                        end, region ID (eg gene ID). If this list is not
                        provided, the default `window` parameter (see below)
                        will be used.
  --phenotype-group . (as path)
                        The phenotype group file to group molecule_trait into
                        molecule_trait_object This applies to multiple molecular
                        events in the same region, such as sQTL analysis.
  --chromosome  (as list)
                        The name of phenotype corresponding to gene_id or
                        gene_name in the region
  --MAC 0 (as int)
                        Minor allele count cutoff
  --window 1000000 (as int)
                        Specify the cis window for the up and downstream radius
                        to analyze around the region of interest in units of bp
                        This parameter will be set to zero if
                        `customized_cis_windows` is provided.
  --numThreads 8 (as int)
                        Number of threads
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 12h
  --mem 16G
  --container ''
                        Container option for software to run the analysis:
                        docker or singularity
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""

  --maf-threshold  MAC/(2.0*N) 

                        Minor allele frequency cutoff. It will overwrite minor
                        allele cutoff. You may consider setting it to higher for
                        interaction analysis if you have statistical power
                        concerns

Sections
  cis_1:
    Workflow Options:
      --[no-]skip-nominal-if-exist (default to False)
                        parse input file lists skip nominal association results
                        if the files exists already This is false by default
                        which means to recompute everything This is only
                        relevant when the `parquet` files for nominal results
                        exist but not the other files and you want to avoid
                        computing the nominal results again
      --[no-]permutation (default to True)
  cis_2:
  trans_1:
    Workflow Options:
      --batch-size 50000 (as int)
      --pval-threshold 1.0 (as float)
      --[no-]permutation (default to False)
                        Permutation testing is incorrect when the analysis is
                        done by chrom

Old Minimal working example#

An MWE is uploaded to google drive. The singularity image (sif) for running this MWE is uploaded to google drive

FIXME: need to update these links.

FIXME: Also need to update the example commands below using our new example dataset.

sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file plink_files_list.txt \
    --phenotype-file MWE.bed.recipe \
    --covariate-file ALL.covariate.pca.BiCV.cov.gz \
    --cwd ./output/ \
    --container containers/TensorQTL.sif --MAC 5

sos run pipeline/TensorQTL.ipynb trans \
    --genotype-file plink_files_list.txt \
    --phenotype-file MWE.bed.recipe \
    --covariate-file ALL.covariate.pca.BiCV.cov.gz \
    --cwd ./output/ \
    --container containers/TensorQTL.sif --MAC 5 --region-name  gene_name

sos run pipeline/TensorQTL.ipynb trans \
    --genotype-file plink_files_list.txt \
    --phenotype-file MWE.bed.recipe \
    --covariate-file /mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_tissue_analysis/output/data_preprocessing/ALL/covariates/ALL.log2cpm.ALL.covariate.pca.resid.PEER.cov.gz \
    --cwd ./output/trans_tensorQTL/ \
    --region-list /mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_tissue_analysis/reference_data/AD_genes.region_list \
    --customized-cis-windows TADB_enhanced_cis.bed \
    --container containers/TensorQTL.sif --MAC 5 --region-name ID

Setup and global parameters#

[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Phenotype file, or a list of phenotype per region.
parameter: phenotype_file = path
# A genotype file in PLINK binary format (bed/bam/fam) format, or a list of genotype per chrom
parameter: genotype_file = path
# Covariate file
parameter: covariate_file = path
# Prefix for the analysis output
parameter: name = f"{phenotype_file:bn}_{covariate_file:bn}"
# An optional subset of regions of molecular features to analyze. The last column is the gene names
parameter: region_list = path()
parameter: region_list_phenotype_column = 4
# Set list of sample to be keep
parameter: keep_sample = path()
# FIXME: please document
parameter: interaction = ""

# An optional list documenting the custom cis window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_cis_windows = path()

# The phenotype group file to group molecule_trait into molecule_trait_object
# This applies to multiple molecular events in the same region, such as sQTL analysis.
parameter: phenotype_group = path() 

# The name of phenotype corresponding to gene_id or gene_name in the region
parameter: chromosome = []
# Minor allele count cutoff
parameter: MAC = 0

# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# This parameter will be set to zero if `customized_cis_windows` is provided.
parameter: window = 1000000

# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
import re

# Use the header of the covariate file to decide the sample size
import pandas as pd
N = len(pd.read_csv(covariate_file, sep = "\t",nrows = 1).columns) - 1

# Minor allele frequency cutoff. It will overwrite minor allele cutoff.
# You may consider setting it to higher for interaction analysis if you have statistical power concerns
parameter: maf_threshold = MAC/(2.0*N) 

import os
import pandas as pd

def adapt_file_path(file_path, reference_file):
    """
    Adapt a single file path based on its existence and a reference file's path.

    Args:
    - file_path (str): The file path to adapt.
    - reference_file (str): File path to use as a reference for adaptation.

    Returns:
    - str: Adapted file path.

    Raises:
    - FileNotFoundError: If no valid file path is found.
    """
    reference_path = os.path.dirname(reference_file)

    # Check if the file exists
    if os.path.isfile(file_path):
        return file_path

    # Check file name without path
    file_name = os.path.basename(file_path)
    if os.path.isfile(file_name):
        return file_name

    # Check file name in reference file's directory
    file_in_ref_dir = os.path.join(reference_path, file_name)
    if os.path.isfile(file_in_ref_dir):
        return file_in_ref_dir

    # Check original file path prefixed with reference file's directory
    file_prefixed = os.path.join(reference_path, file_path)
    if os.path.isfile(file_prefixed):
        return file_prefixed

    # If all checks fail, raise an error
    raise FileNotFoundError(f"No valid path found for file: {file_path}")

def adapt_file_path_all(df, column_name, reference_file):
    return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))


if str(genotype_file).endswith("bed") and str(phenotype_file).endswith("bed.gz"):
    input_files = [[phenotype_file, genotype_file]]
    if len(chromosome) > 0:
        input_chroms = [int(x) for x in chromosome]
    else:
        input_chroms = [0]
else:
    import pandas as pd
    import os
    molecular_pheno_files = pd.read_csv(phenotype_file, sep = "\t")
    if "#chr" in molecular_pheno_files.columns:
        molecular_pheno_files = molecular_pheno_files.groupby(['#chr','path']).size().reset_index(name='count').drop("count",axis = 1).rename(columns = {"#chr":"#id"})
    genotype_files = pd.read_csv(genotype_file,sep = "\t")
    genotype_files["#id"] = [x.replace("chr","") for x in genotype_files["#id"].astype(str)] # e.g. remove chr1 to 1
    genotype_files["#path"] = genotype_files["#path"].apply(lambda x: adapt_file_path(x, genotype_file))
    molecular_pheno_files["#id"] = [x.replace("chr","") for x in molecular_pheno_files["#id"].astype(str)]
    input_files = molecular_pheno_files.merge(genotype_files, on = "#id")
    # Only keep chromosome specified in --chromosome
    if len(chromosome) > 0:
        input_files = input_files[input_files['#id'].isin(chromosome)]
    input_files = input_files.values.tolist()
    input_chroms = [x[0] for x in input_files]
    input_files = [x[1:] for x in input_files]

cis-xQTL association testing#

[cis_1]
# parse input file lists
# skip nominal association results if the files exists already
# This is false by default which means to recompute everything
# This is only relevant when the `parquet` files for nominal results exist but not the other files
# and you want to avoid computing the nominal results again
parameter: skip_nominal_if_exist = False
parameter: permutation = True
parameter: quantile_regression = False
parameter: qr_vcov= 'robust'
parameter: qr_kernel= 'epa'
parameter: qr_bandwidth= 'hsheather'
parameter: qr_max_iter=200
parameter: qr_p_tol=0.000001
parameter: qr_regress_covariates = True
parameter: qr_add_intercept = False
parameter: qr_scale = False
# use -1 to set the maximum number of cores available
parameter: qr_threads = -1
if quantile_regression:
    permutation = False

# Extract interaction name
var_interaction = interaction
if os.path.isfile(interaction):
    interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
    var_interaction = interaction_s.columns[0] # interaction name
test_regional_association = permutation and len(var_interaction) == 0

input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output_files = dict([("parquet", f'{cwd:a}/{_input[0]:bnn}{"_%s" % var_interaction if interaction else ""}.cis_qtl_pairs.{"" if input_chroms[_index] == 0 else input_chroms[_index]}.parquet'), # This convention is necessary to match the pattern of map_norminal output
                     ("nominal", f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_{"quantile_" if quantile_regression else ""}qtl.pairs.tsv.gz')])
if test_regional_association:
    output_files["regional"] = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.regional.tsv.gz'
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output["nominal"]:bnnn}'
python: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout' , container = container
    import pandas as pd
    import numpy as np
    import os
    import tensorqtl
    from tensorqtl import genotypeio, cis
    from scipy.stats import chi2
    import multiprocessing as mp
    from tqdm import tqdm

    ## Define paths
    plink_prefix_path = $[_input[1]:nar]
    expression_bed = $[_input[0]:ar]
    covariates_file = "$[covariate_file:a]"
    window = $[window]
    interaction = "$[interaction]"
    ## Load Data
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
    phenotype_id = phenotype_pos_df.index.name

    ## Analyze only the regions listed
    if $[region_list.is_file()]:
        region = pd.read_csv("$[region_list:a]", comment="#", header=None, sep="\t" )
        phenotype_column = 1 if len(region.columns) == 1 else  $[region_list_phenotype_column]
        keep_region = region.iloc[:,phenotype_column-1].to_list()
        phenotype_df = phenotype_df[phenotype_df.index.isin(keep_region)]
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_region)]
    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
    pr = genotypeio.PlinkReader(plink_prefix_path)
    genotype_df = pr.load_genotypes()
    variant_df = pr.bim.set_index('snp')[['chrom', 'pos','a0','a1']]
    
    ## use custom sample list to subset the covariates data
    if $[keep_sample.is_file()]:
        sample_list = pd.read_csv("$[keep_sample:a]", comment="#", header=None, names=["sample_id"], sep="\t")
        covariates_df.loc[sample_list.sample_id]

    # Read interaction files or extract from covariates file.
    var_interaction = interaction
    interaction_s = []
    if os.path.isfile(interaction):
        # update var_interaction and interaction_s
        interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
        interaction_s = interaction_s[interaction_s.index.isin(covariates_df.index)] 
        var_interaction = interaction_s.columns[0] # interaction name
    # check if the interaction term in interaction table is in covariates file, if yes and interaction_s not yet loaded then, extract it out from covariates file
    if var_interaction in covariates_df.columns:
        # only load from covariate if it has not been loaded yet
        if len(interaction_s) == 0:
            interaction_s = covariates_df[var_interaction].to_frame()
        covariates_df = covariates_df.drop(columns=[var_interaction])
    if len(interaction) and len(interaction_s) == 0:
        raise ValueError(f"Cannot find interaction variable or file {interaction}")

    # drop samples that with missing value in iteraction
    if len(interaction_s):
        interaction_s = interaction_s.dropna() 

    ## Retaining only common samples
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)] 
    if len(interaction_s):
        phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, interaction_s.index)]
        interaction_s = interaction_s[interaction_s.index.isin(phenotype_df.columns)]    
        interaction_s = interaction_s.loc[phenotype_df.columns]

    covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
    
    ## To simplify things, there should really not be "chr" prefix
    phenotype_pos_df.chr = phenotype_pos_df.chr.astype(str).str.replace("chr", "")
    variant_df.chrom =  variant_df.chrom.astype("str").str.replace("chr", "") 

    ## use custom cis windows list
    if $[customized_cis_windows.is_file()]:
        cis_list = pd.read_csv("$[customized_cis_windows:a]", comment="#", header=None, names=["chr","start","end",phenotype_id], sep="\t")
        cis_list.chr = cis_list.chr.astype(str).str.replace("chr", "")  ## Again to simplify things for chr format concordance.
        if cis_list[['chr', 'ID']].duplicated().sum() != 0: # if cis_list is not unique using identifier ['#chr', 'ID']
                cis_list = cis_list.groupby(['ID', 'chr']).agg({ # use union start-end position and make cis_list unique
                    'start': 'min',
                    'end': 'max'
                }).reset_index()

                cis_list = cis_list[['chr', 'start', 'end', 'ID']]
                #cis_list = cis_list.set_index('chr')
        
        phenotype_pos_df = phenotype_pos_df.reset_index() #move the phenotype id index to a new column of the dataframe
        phenotype_df = phenotype_df.reset_index()
        # Ensure phenotype_pos_df is a subset of cis_list based on ['#chr', 'ID']
        original_count = len(phenotype_pos_df)
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.set_index(['chr', 'ID']).index.isin(cis_list.set_index([cis_list.columns[0],cis_list.columns[3]]).index)] 
        phenotype_df = phenotype_df[phenotype_df.set_index('ID').index.isin(cis_list.set_index(cis_list.columns[3]).index)] 
        phenotype_df = phenotype_df.set_index('ID')
        removed_count = original_count - len(phenotype_pos_df)
        print(f"{removed_count} rows were removed from phenotype_pos_df")

        # Merge the dataframes on 'chr' and 'ID', including 'start' and 'end'
        phenotype_pos_df = phenotype_pos_df.merge(cis_list[['chr', 'ID', 'start', 'end']], 
                                        left_on = ["chr",phenotype_id],
                                        right_on = [cis_list.columns[0],cis_list.columns[3]],
                                        suffixes=('_pheno', ''))

        # Function to decide whether to keep or rename columns: 
        # If 'start' and 'end' values are the same in both dataframes, we keep the original columns without suffixes; 
        # If they're different, we name columns from cis_list with '_cis' suffixes and keep the name still for columns from phentype_pos_df.
        # #in some cases (gene expression for eQTLs) the phenotype_id may be in the cis_list file
        def rename_if_different(row, col):
            if row[f'{col}'] == row[f'{col}_pheno']:
                return row[col]
            else:
                return pd.Series({f'{col}_pheno': row[f'{col}_pheno'], f'{col}_cis': row[col]})
        
        # Apply the renaming logic
        for col in ['start', 'end']:
            if f'{col}_pheno' in phenotype_pos_df.columns:# this condition is to not execute when the original phenotype_pos_df don't have start, end but pos column.
                temp = phenotype_pos_df.apply(lambda row: rename_if_different(row, col), axis=1)
                phenotype_pos_df = phenotype_pos_df.drop(columns=[f'{col}_pheno']).join(temp)
                print(f"Dropped columns due to value mismatch: {col}_pheno")
        
        phenotype_pos_df = phenotype_pos_df.set_index(phenotype_id)[["chr","start","end"]] # The final phenotype_pos_df will have three columns(chr, start, end) and index is the phenotype ID
        
        if len(phenotype_df.index) != len(phenotype_pos_df.index):
            raise ValueError("cannot uniquely match all the phentoype data in the input to the customized cis windows provided")
        window = 0 # In the updated tensorQTL, by default if there is a customized cis window, the actual cis window will be start - window & end + window, so it is necessary to change the window parameter to 0

    ## Read phenotype group if availble
    if $[phenotype_group.is_file()]:
        group_s = pd.read_csv($[phenotype_group:r], sep='\t', header=None, index_col=0).squeeze()
    else:
        group_s = None

    if $["True" if quantile_regression else "False"]:
        import statsmodels.api as sm
        import scipy.stats as ss
        from scipy.stats import cauchy
        import time
        import warnings
        import torch
        # Perform quantile regression
        start_time = time.time()
        def filter_maf_with_numpy(genotypes, variant_ids, maf_threshold, alleles=2):
            af = genotypes.sum(axis=1) / (alleles * genotypes.shape[1])
            maf = np.where(af > 0.5, 1 - af, af)
            if maf_threshold > 0:
                mask = maf >= maf_threshold
                #mask = mask.astype(bool)
                genotypes = genotypes[mask]
                variant_ids = np.array(variant_ids)[mask]
                af = af[mask]
                maf = maf[mask]
            return genotypes, variant_ids, af, maf

        def remove_covariate_effects_np(X, Z, y, scale=False):
            # for X,Z,y, rows are samples
            if not np.all(Z[:, 0] == 1):
                Z = np.hstack([np.ones((Z.shape[0], 1)), Z])
            
            A = np.linalg.inv(Z.T @ Z)
            SZy = A @ Z.T @ y
            SZX = A @ Z.T @ X
            
            # Adjust y and X based on calculated effects
            y_adjusted = y - Z @ SZy
            X_adjusted = X - Z @ SZX
            
            if scale:
                # Standardize y and X
                y_scaled = (y_adjusted - y_adjusted.mean()) / y_adjusted.std()
                X_scaled = (X_adjusted - X_adjusted.mean(axis=0)) / X_adjusted.std(axis=0)
                return X_scaled, y_scaled
            else:
                return X_adjusted, y_adjusted

        def impute_mean(genotypes_t, missing=-1):#for torch tensor genotypes_t,cols are samples
            m = genotypes_t == missing
            ix = torch.nonzero(m, as_tuple=True)[0]
            if len(ix) > 0:
                a = genotypes_t.sum(1)
                b = m.sum(1).float()
                mu = (a - missing*b) / (genotypes_t.shape[1] - b)
                genotypes_t[m] = mu[ix]

        def drop_zero_variance_rows(genotypes, variant_ids):
            # Calculate the variance of each row
            variances = np.var(genotypes, axis=1)
            # Find the indices of rows with non-zero variance
            non_zero_variance_indices = np.where(variances != 0)[0]
            # Select rows with non-zero variance in genotypes and corresponding variant_ids
            genotypes_updated = genotypes[non_zero_variance_indices, :]
            variant_ids_updated = [variant_ids[i] for i in non_zero_variance_indices.tolist()]
            return genotypes_updated, variant_ids_updated

        def cauchy_meta(pvals):
            # Check input
            pvals = np.array(pvals)
            pvals = pvals[~np.isnan(pvals)]
            if len(pvals) == 0:
                return np.nan
            # pvals[pvals == 0] = 2.2e-308
            # Convert to Cauchy
            cauchy_vals = np.zeros(pvals.shape)
            valid_indices = pvals >= 1e-15
            cauchy_vals[valid_indices] = np.tan(np.pi * (0.5 - pvals[valid_indices]))
            stats = np.mean(cauchy_vals)
            p = cauchy.sf(stats)
            return p

        def fit_quantreg(phenotype_id, phenotype_array, genotype_matrix, variant_ids, af, covar_df, taus=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], vcov='$[qr_vcov]', kernel='$[qr_kernel]', bandwidth='$[qr_bandwidth]', max_iter=$[qr_max_iter], p_tol=$[qr_p_tol]):
            results = []
            variant_pvals = {vid: [] for vid in variant_ids}
            
            if covar_df is not None:
                if genotype_matrix.shape[1] != covar_df.shape[0]:
                    raise ValueError("Mismatch in the number of samples between genotype matrix and covariates DataFrame")
            
            for tau in taus:
                for idx, genotype_row in enumerate(genotype_matrix):
                    current_af = af[idx]
                    if covar_df is not None:
                        X = np.c_[covar_df.values, genotype_row]

                    else:
                        # If covar_df is None, just use the genotype_row
                        X = genotype_row.reshape(-1, 1)
                        
                    if $["True" if qr_add_intercept else "False"]:
                        X = sm.add_constant(X).astype('float')
                    
                    model = sm.QuantReg(phenotype_array, X)
                    try:
                        res = model.fit(q=tau, vcov=vcov, kernel=kernel, bandwidth=bandwidth, max_iter=max_iter, p_tol=p_tol)
                        pval = res.pvalues[-1]
                        variant_pvals[variant_ids[idx]].append(pval)
                        results.append({
                            'phenotype_id': phenotype_id,
                            'variant_id': variant_ids[idx],
                            'af': current_af, 
                            'tau': tau,
                            'pval': pval,
                            'slope': res.params[-1],
                            'tval': res.tvalues[-1]
        
                        })
                    except Exception as e:
                        raise RuntimeError(f"Error fitting model for variant {variant_ids[idx]} at tau {tau}: {e}")

            # Calculate combined p-values for each variant
            start_time_for_combined_p = time.time()
            for vid in variant_ids:
                combined_pval = cauchy_meta(variant_pvals[vid])
                for result in filter(lambda r: r['variant_id'] == vid, results):
                    result['combined_pval'] = combined_pval
            
            results_df = pd.DataFrame(results)
            end_time_for_combined_p = time.time()
            wide_df = pd.pivot_table(results_df, values=['pval', 'slope', 'tval'], index=['phenotype_id', 'variant_id', 'af','combined_pval'], columns='tau')
            wide_df.columns = ['qr_{}_{}'.format(col[1], col[0]) for col in wide_df.columns]
            wide_df.reset_index(inplace=True)

            variant_split = wide_df['variant_id'].str.extract(r'chr(\d+):(\d+)_([A-Z*]+)_([A-Z*]+)')
            variant_split.columns = ['chr', 'pos', 'ref', 'alt']
            final_df = pd.concat([variant_split, wide_df], axis=1)
            return final_df

        # get same order and common samples of phenotype_df, genotype_df, and covariates_df
        covariates_df.index = covariates_df.index.astype(str)
        common_samples = np.intersect1d(phenotype_df.columns, genotype_df.columns)
        genotype_df = genotype_df[common_samples]          
        sample_order = covariates_df.index
        phenotype_df = phenotype_df[sample_order]
        genotype_df = genotype_df[sample_order]

        # quantile qtl mapping:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        chrom = None
        igc = genotypeio.InputGeneratorCis(genotype_df, variant_df, phenotype_df, phenotype_pos_df, group_s=group_s, window=window)
        results_list = [] 

        def quantile_regression_wrapper(args):
            phenotype, genotypes, genotype_range, phenotype_id = args
            genotypes = torch.tensor(genotypes, dtype=torch.float).to(device)
            impute_mean(genotypes)
            if genotypes.is_cuda:
                genotypes = genotypes.cpu()
            genotypes = genotypes.numpy()
            # drop_zero_variance_rows
            variant_ids = variant_df.index[genotype_range[0]:genotype_range[-1]+1].tolist()
            genotypes, variant_ids = drop_zero_variance_rows(genotypes, variant_ids)   
            # filter_maf_with_numpy  
            genotypes, variant_ids, af, maf = filter_maf_with_numpy(genotypes, variant_ids, $[maf_threshold])
        
            # regress out covariates from y and X
            if $["True" if qr_regress_covariates else "False"]:
                covar_np = covariates_df.to_numpy()
                X_adj, phenotype = remove_covariate_effects_np(genotypes.T, covar_np, phenotype, scale = '$[qr_scale]')
                genotypes = X_adj.T    
                quantqtl_start_time = time.time()
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    results_df = fit_quantreg(phenotype_id=phenotype_id,phenotype_array=phenotype,genotype_matrix=genotypes,covar_df=None,variant_ids=variant_ids,af=af
                    )                
            else:
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    results_df = fit_quantreg(phenotype_id=phenotype_id,phenotype_array=phenotype,genotype_matrix=genotypes,covar_df=covariates_df,variant_ids=variant_ids,af=af
                    ) 
        
            quantqtl_end_time = time.time()
            runtime = quantqtl_end_time - quantqtl_start_time
            return results_df, runtime

        # quantile regression main function
        qr_threads = $[qr_threads]
        max_cores = mp.cpu_count()
        
        if qr_threads <= 0:
            qr_threads = max_cores
        else:
            qr_threads = min(qr_threads, max_cores)

        results_list = []
        runtime_list = []
        if qr_threads == 1:
            for phenotype, genotypes, genotype_range, phenotype_id in igc.generate_data(chrom=chrom, verbose=True):
                results_df, runtime = quantile_regression_wrapper((phenotype, genotypes, genotype_range, phenotype_id))
                results_list.append(results_df)
                runtime_list.append(runtime)
        else:
            pool = mp.Pool(processes=qr_threads)       
            phenotype_args = [(phenotype, genotypes, genotype_range, phenotype_id)
                              for phenotype, genotypes, genotype_range, phenotype_id in igc.generate_data(chrom=chrom, verbose=True)]
        
            with tqdm(total=len(phenotype_args), desc="Quantile Regression") as progress_bar:
                for results_df, runtime in pool.imap_unordered(quantile_regression_wrapper, phenotype_args):
                    results_list.append(results_df)
                    runtime_list.append(runtime)
                    progress_bar.update(1)
        
            pool.close()
            pool.join()       

        quantile_reg_df = pd.concat(results_list, ignore_index=True)
        # sort the table if chrom and pos is not in ascending order
        quantile_reg_df['pos'] = pd.to_numeric(quantile_reg_df['pos'])
        if not all(quantile_reg_df['pos'].iloc[i] <= quantile_reg_df['pos'].iloc[i+1] for i in range(len(quantile_reg_df)-1)):
            quantile_reg_df = quantile_reg_df.sort_values(by=['chr', 'pos'])        
        end_time = time.time()  
        print(f"Total execution time from the beginning to the quantile regression: {end_time - start_time} seconds")

        # save file
        quantile_reg_df.to_csv($[_output["nominal"]:nr], sep='\t', index = None)
        # print general information of quantile_reg_df
        print('Output Information:')
        print("Output Rows:", len(quantile_reg_df))
        print("Output Columns:", quantile_reg_df.columns.tolist())
        print("Output Preview:", quantile_reg_df.iloc[1:5, 1:10])
        pass
    else:
        ## cis-QTL mapping: nominal associations for all variant-phenotype pairs
        if not ($[skip_nominal_if_exist] and $[_output["parquet"].is_file()]):
            if len(interaction_s):
                cis.map_nominal(genotype_df, variant_df, 
                        phenotype_df, 
                        phenotype_pos_df, 
                        $[_output["parquet"]:nnnr],
                        covariates_df=covariates_df,
                        interaction_df=interaction_s, 
                        maf_threshold_interaction=$[maf_threshold],
                        window=window,
                        group_s=group_s,
                        run_eigenmt=True, write_top=True, write_stats=True)
            else:
                cis.map_nominal(genotype_df, variant_df,
                    phenotype_df,
                    phenotype_pos_df,
                    $[_output["parquet"]:nnnr],
                    covariates_df=covariates_df, 
                    window=window, 
                    maf_threshold = $[maf_threshold],
                    group_s=group_s)

        ## Load the parquet and save it as txt
        pairs_df = pd.read_parquet($[_output["parquet"]:r])
        ## Remove rows whose 'pval_gi' is null for following t_pval_conversion
        if len(interaction_s):
            pairs_df = pairs_df.dropna(subset=['pval_gi'])
        # print general information of parquet
        print('Output Information:')
        print("This is the file containing the immediate output of TensorQTL's map_nominal function ")
        print(os.path.getsize($[_output["parquet"]:r]))

        ## Adds the group columns to pairs_df, if there is group_s use group_s, else use phenotype_id
        if group_s is not None:
            pairs_df = pairs_df.merge(pd.DataFrame( {"molecular_trait_object_id": group_s}),left_on = "phenotype_id", right_index = True)
        else:
            pairs_df["molecular_trait_object_id"] = pairs_df.phenotype_id
        ## if pos in phenotype_pos_df(start distance and end distance are the same), 
        ## add the column 'end_distance' with the same value as 'start_distance' to avoid mismatch of column and names
        if 'end_distance' not in pairs_df.columns:
            # Get the position of 'start_distance' column
            start_pos = pairs_df.columns.get_loc('start_distance')
            # Create a new DataFrame with the same values as start_distance
            new_df = pairs_df.copy()
            # Insert end_distance column after start_distance
            new_df.insert(start_pos + 1, 'end_distance', pairs_df['start_distance'])
            pairs_df = new_df

        # rename columns
        pairs_df.columns.values[0]  = "molecular_trait_id"
        pairs_df.columns.values[7]  = "pvalue"
        pairs_df.columns.values[8]  = "beta"
        pairs_df.columns.values[9]  = "se"
        pairs_df.rename(columns={'TSS_D': 'tss_distance'}, inplace=True)
        if len(interaction_s):
            # calculate genomic inflation factor lambda on interaction 
            lambda_col_interaction = pairs_df.groupby("molecular_trait_object_id").apply(lambda x: chi2.ppf(1. - np.median(x.pval_gi), 1)/chi2.ppf(0.5,1))
            pairs_df.columns.values[10]  = "pvalue" + '_' + var_interaction
            pairs_df.columns.values[11]  = "beta" + '_' + var_interaction
            pairs_df.columns.values[12]  = "se" + '_' + var_interaction
            pairs_df.columns.values[13]  = "pvalue" + '_' + var_interaction + '_' + 'interaction'
            pairs_df.columns.values[14]  = "beta" + '_' + var_interaction + '_' + 'interaction'
            pairs_df.columns.values[15]  = "se" + '_' + var_interaction + '_' + 'interaction' 

        pairs_df["n"] = len(phenotype_df.columns.values)
        pairs_df = variant_df.merge(pairs_df, right_on='variant_id', left_index=True)
        pairs_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
        # sort the table if chrom and pos is not in ascending order
        if not all(pairs_df['pos'].iloc[i] <= pairs_df['pos'].iloc[i+1] for i in range(len(pairs_df)-1)):
            pairs_df = pairs_df.sort_values(by=['chrom', 'pos'])
        # save file
        pairs_df.to_csv($[_output["nominal"]:nr], sep='\t', index = None)
        # print general information of pairs_df
        print('Output Information:')
        print("Output Rows:", len(pairs_df))
        print("Output Columns:", pairs_df.columns.tolist())
        print("Output Preview:", pairs_df.iloc[1:5, 1:10])

        if $[test_regional_association]:
            # calculate genomic inflation factor lambda for main variant effect 
            lambda_col = pairs_df.groupby("molecular_trait_object_id").apply(lambda x: chi2.ppf(1. - np.median(x.pvalue), 1)/chi2.ppf(0.5,1))
            cis_df = cis.map_cis(genotype_df, 
                                variant_df, 
                                phenotype_df,
                                phenotype_pos_df,
                                covariates_df=covariates_df, 
                                seed=999, 
                                window=window, 
                                maf_threshold = $[maf_threshold],
                                group_s=group_s)
            cis_df.index.name = "molecular_trait_id"
            ## Add groups columns for eQTL analysis
            if "group_id" not in cis_df.columns:
                cis_df["group_id"] = cis_df.index
                cis_df["group_size"] = 1
            cis_df = cis_df.rename({"group_id":"molecular_trait_object_id","group_size":"n_traits","num_var" : "n_variants","pval_perm":"p_perm", "pval_beta":"p_beta" },axis = 1)
            cis_df = cis_df.assign(genomic_inflation = lambda dataframe : dataframe["molecular_trait_object_id"].map(lambda molecular_trait_object_id:lambda_col[molecular_trait_object_id]))
            # merge cis_df with variant_df
            cis_df = variant_df.merge(cis_df, right_on='variant_id', left_index=True)
            cis_df.rename(columns={'a1': 'a2', 'a0': 'a1', 'variant_id': 'variant'}, inplace=True)
            # sort the table if chrom and pos is not in ascending order
            if not all(cis_df['pos'].iloc[i] <= cis_df['pos'].iloc[i+1] for i in range(len(cis_df)-1)):
                cis_df = cis_df.sort_values(by=['chrom', 'pos'])
            # save file
            cis_df.to_csv(str($[_output["nominal"]:nnnr])+str('.regional.tsv'), sep='\t', index = None)
            # print general information of cis_df
            print('Output Information:')
            print("Output Rows:", len(cis_df))
            print("Output Columns:", cis_df.columns.tolist())
            print("Output Preview:", cis_df.iloc[0:5, 0:10])

R: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
    library(purrr)
    library(tidyr)
    library(readr)
    library(dplyr)
    library(qvalue)
  
    pairs_df = read_delim($[_output["nominal"]:nr,], delim = '\t')
    compute_qvalues <- function(pvalues) {
        tryCatch({
            if(length(pvalues) < 2) {
                return(pvalues)
            } else {
                return(qvalue(pvalues)$qvalues)
            }
        }, error = function(e) {
            message("Too few p-values to calculate qvalue, fall back to BH")
            qvalue(pvalues, pi0 = 1)$qvalues
        })
    }
    
    quantile_regression <- "$[quantile_regression]"
    if (quantile_regression) {
        pairs_df <- pairs_df %>%
            group_by(phenotype_id) %>%
            mutate(
                across(
                    .cols = ends_with("_pval"), 
                    .fns = ~ compute_qvalues(.x),
                    .names = "{sub('_pval$', '_qval', .col)}"
                )
            )
    } else {
        var_interaction <- "$[interaction]"
        # Check if 'interaction' is a file
        if (file.exists(var_interaction)) {
            # Read the file into 'interaction_s' dataframe
            interaction_s <- read.delim(var_interaction, row.names = 1)
            # Update 'var_interaction' to the first column name of 'interaction_s'
            var_interaction <- names(interaction_s)[1]
        }

        if (is.null(var_interaction) || var_interaction == "") {
            pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue = compute_qvalues(pvalue))
        } else {
            pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue_main = compute_qvalues(pvalue), qvalue_interaction = compute_qvalues($["pvalue_%s_interaction" % var_interaction]))         
        }
    }
    pairs_df %>% write_delim($[_output["nominal"]:nr],"\t")
  
bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
        bgzip --compress-level 9 $[_output["nominal"]:n] 
        tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]]

done_if(not test_regional_association)

bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
        bgzip --compress-level 9 $[_output["nominal"]:nnn].regional.tsv
        tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]:nnn].regional.tsv.gz

[cis_2]
done_if("regional" not in _input.labels)
input: group_by = "all"
output: f'{cwd}/{_input["nominal"][0]:bnnnn}.cis_qtl_regional.fdr.gz',
        f'{cwd}/{_input["nominal"][0]:bnnnn}.cis_qtl_regional.summary.txt'
input_files = [str(x) for x in _input["regional"]]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    library("purrr")
    library("tidyr")
    library("dplyr")
    library("readr")
    library("qvalue")
    emprical_pd = tibble(map(c($[_input["regional"]:r,]), ~read_delim(.x,"\t")))%>%unnest()
    emprical_pd["q_beta"] = tryCatch(qvalue(emprical_pd$p_beta)$qvalue, error = function(e){print("Too few pvalue to calculate qvalue, fall back to BH") 
                                                                                              qvalue(emprical_pd$p_beta,pi0 = 1 )$qvalue})  

    emprical_pd["q_perm"] = tryCatch(qvalue(emprical_pd$p_perm)$qvalue, error = function(e){print("Too few pvalue to calculate qvalue, fall back to BH") 
                                                                                              qvalue(emprical_pd$p_perm,pi0 = 1 )$qvalue})
    emprical_pd["fdr_beta"] = p.adjust(emprical_pd$p_beta,"fdr")    
    emprical_pd["fdr_perm"] = p.adjust(emprical_pd$p_perm,"fdr")    
    summary = tibble("fdr_perm_0.05" =  sum(emprical_pd["fdr_perm"] < 0.05) , 
                      "fdr_beta_0.05" = sum(emprical_pd["fdr_beta"] < 0.05),
                      "q_perm_0.05" = sum(emprical_pd["q_perm"] < 0.05) ,
                      "q_beta_0.05" = sum(emprical_pd["q_beta"] < 0.05) ,
                       "q_perm_0.01" = sum(emprical_pd["q_perm"] < 0.01) ,
                      "q_beta_0.01" = sum(emprical_pd["q_beta"] < 0.01)  )
    emprical_pd%>%write_delim("$[_output[0]]","\t")
    summary%>%write_delim("$[_output[1]]","\t")

Trans-xQTL association testing#

For transQTL analysis, if you output all the p-values for many genes (default setting) it is suggested to provide the largest memory and CPU threads available on a compute node. eg 250G and >32 threads.

[trans_1]
input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output: nominal = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_%s" % input_chroms[_index]}.trans_qtl.pairs.tsv.gz'

parameter: batch_size = 10000
parameter: pval_threshold = 1.0
# Permutation testing is incorrect when the analysis is done by chrom
parameter: permutation = False

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container =container
    import pandas as pd
    import numpy as np
    import tensorqtl
    from tensorqtl import genotypeio, trans
    from scipy.stats import chi2

    ## Define paths
    plink_prefix_path = $[_input[1]:nar]
    expression_bed = $[_input[0]:ar]
    covariates_file = "$[covariate_file:a]"
    window = $[window]
    ## Loading Data
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
    phenotype_id = phenotype_pos_df.index.name

    ## Analyze only the regions listed
    if $[region_list.is_file()]:
        region = pd.read_csv("$[region_list:a]", comment="#", header=None,sep="\t")
        phenotype_column = 1 if len(region.columns) == 1 else  $[region_list_phenotype_column]
        keep_region = region.iloc[:,phenotype_column-1].to_list()
        phenotype_df = phenotype_df[phenotype_df.index.isin(keep_region)]
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_region)]

    ## use custom cis windows
    if $[customized_cis_windows.is_file()]:
        cis_list = pd.read_csv("$[customized_cis_windows:a]", comment="#", header=None, names=["chr","start","end",phenotype_id], sep="\t")
        phenotype_pos_df_reset = phenotype_pos_df.reset_index() #move the phenotype id index to a new column of the dataframe
        phenotype_pos_df = phenotype_pos_df_reset.merge(cis_list, left_on = ["chr",phenotype_id],right_on = [cis_list.columns[0],cis_list.columns[3]])#in some cases (gene expression for eQTLs) the phenotype_id may be in the cis_list file
        if len(phenotype_df.index) - phenotype_pos_df_reset[~phenotype_pos_df_reset[phenotype_id].isin(cis_list[phenotype_id])].shape[0]!= len(phenotype_pos_df.index):
            raise ValueError("cannot uniquely match all the phentoype data in the input to the customized cis windows provided")
        phenotype_pos_df = phenotype_pos_df.set_index(phenotype_id)[["chr","start","end"]] # The final phenotype_pos_df will have three columns(chr, start, end) and index is the phenotype ID
        window = 0 # In the updated tensorQTL, by default if there is a customized cis window, the actual cis window will be start - window & end + window, so it is necessary to change the window parameter to 0
        ## Retaining only traits in cis_list
        if phenotype_pos_df_reset[~phenotype_pos_df_reset[phenotype_id].isin(cis_list[phenotype_id])].shape[0] != 0:
            phenotype_df = phenotype_df.loc[phenotype_df.index.isin(cis_list[phenotype_id])]

    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
    ## use custom sample list to subset the covariates data
    if $[keep_sample.is_file()]:
        sample_list = pd.read_csv("$[keep_sample:a]", comment="#", header=None, names=["sample_id"], sep="\t")
        covariates_df.loc[sample_list.sample_id]
    pr = genotypeio.PlinkReader(plink_prefix_path)
    genotype_df = pr.load_genotypes()
    variant_df = pr.bim.set_index('snp')[['chrom', 'pos','a0','a1']]

    ## Retaining only common samples
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)]
    covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()

    ## To simplify things, there should really not be "chr" prefix
    phenotype_pos_df.chr = [x.replace("chr","") for x in phenotype_pos_df.chr]
    variant_df.chrom = [x.replace("chr","") for x in variant_df.chrom]

    ## Trans analysis
    trans_df = trans.map_trans(genotype_df, 
                            phenotype_df,
                            covariates_df, 
                            batch_size=$[batch_size],
                            return_sparse=True, 
                            return_r2 = True,
                            pval_threshold=$[pval_threshold], 
                            maf_threshold=$[maf_threshold])

    ## Filter out cis signal, again if customized cis windows are used, the windows is [start-win,end + win] where win = 0, else it is [start - win, start + win]
    trans_df = trans.filter_cis(trans_df, phenotype_pos_df, variant_df, window=window)   

    ## Permutation
    if $['True' if permutation else 'False']:
        perm_df = trans.map_permutations(genotype_df, covariates_df, batch_size=$[batch_size],
                             maf_threshold=$[maf_threshold])
        trans.apply_permutations(perm_df,trans_df)

    ## Output
    trans_df.columns.values[1]  = "molecular_trait_id"
    trans_df.columns.values[2]  = "pvalue"
    trans_df.columns.values[3]  = "beta"
    trans_df.columns.values[4]  = "se"
    trans_df["n"] = len(phenotype_df.columns.values)
    trans_df = variant_df.merge(trans_df, right_on='variant_id', left_index=True)
    trans_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
    # genomic inflation factor
    lambda_col = trans_df.groupby("molecular_trait_id").apply(lambda x: chi2.ppf(1. - np.median(x.pvalue), 1)/chi2.ppf(0.5,1))
    # Convert the Series to a DataFrame
    lambda_col = lambda_col.reset_index()
    lambda_col.columns = ['molecular_trait_id', 'genomic_inflation_lambda']
    lambda_col.to_csv("$[_output:nnn].genomic_inflation.tsv.gz", sep='\t', index = None, compression={'method': 'gzip', 'compresslevel': 9})

    # sort the table if chrom and pos is not in ascending order
    if not all(trans_df['pos'].iloc[i] <= trans_df['pos'].iloc[i+1] for i in range(len(trans_df)-1)):
        trans_df = trans_df.sort_values(by=['chrom', 'pos'])
 
    # print general information of trans_df
    print('Output Information:')
    print("Output Rows:", len(trans_df))
    print("Output Columns:", trans_df.columns.tolist())
    print("Output Preview:", trans_df.iloc[0:5, 0:10])
    
    # save output
    trans_df.to_csv($[_output["nominal"]:nr], sep='\t', index = None)


R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container =container
    if ($['FALSE' if permutation and pval_threshold < 1.0 else 'TRUE']){ 
      library(purrr)
      library(tidyr)
      library(readr)
      library(dplyr)
      library(qvalue)
      # calculate qvalue
      pairs_df = read_delim($[_output["nominal"]:nr,], delim = '\t')  
      pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue = qvalue(pvalue)$qvalues)
      pairs_df %>% write_delim($[_output["nominal"]:nr],"\t")
    } 

bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
        bgzip --compress-level 9 $[_output["nominal"]:n] 
        tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]]