QTL Association Testing#
Description#
We perform QTL association testing using TensorQTL [cf. Taylor-Weiner et al (2019)]. An additional protocol was added to test for quantile QTL associations.
Input#
List of molecular phenotype files: a list of
bed.gz
files containing the table for the molecular phenotype. It should have a companion index file intbi
format. It is the output of gene_annotation or phenotype_by_chorm
Example phenotype list#
#chr start end ID path
chr12 752578 752579 ENSG00000060237 /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12 990508 990509 ENSG00000082805 /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12 2794969 2794970 ENSG00000004478 /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12 4649113 4649114 ENSG00000139180 /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12 6124769 6124770 ENSG00000110799 /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12 6534516 6534517 ENSG00000111640 /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
The header of the bed.gz is per the [TensorQTL](https://github.com/broadinstitute/tensorqtl) convention:
> Phenotypes must be provided in BED format, with a single header line starting with # and the first four columns corresponding to: chr, start, end, phenotype_id, with the remaining columns corresponding to samples (the identifiers must match those in the genotype input). The BED file should specify the center of the cis-window (usually the TSS), with start == end-1.
List of genotypes in PLINK binary format (
bed
/bim
/fam
) for each chromosome, previously processed through our genotype QC pipelines.Covariate file, a file with #id + samples name as colnames and each row a covariate: fixed and known covariates as well as hidden covariates recovered from factor analysis.
Optionally, a list of traits (genes, regions of molecular features etc) to analyze.
For cis-analysis:
Optionally, a list of genomic regions associate with each molecular features to analyze. The default cis-analysis will use a window around TSS. This can be customized to take given start and end genomic coordinates.
Output#
For each chromosome, several of summary statistics files are generated, including both nominal test statistics for each test, as well as region (gene) level association evidence.
The columns of nominal association result are as follows:
phenotype_id: Molecular trait identifier.(gene)
variant_id: ID of the variant (rsid or chr:position:ref:alt)
tss_distance: Distance of the SNP to the gene transcription start site (TSS)
af: The allele frequency of this SNPs
ma_samples: Number of samples carrying the minor allele
ma_count: Total number of minor alleles across individuals
pval: Nominal P-value from linear regression
beta: Slope of the linear regression
se: Standard error of beta
chr : Variant chromosome.
pos : Variant chromosomal position (basepairs).
ref : Variant reference allele (A, C, T, or G).
alt : Variant alternate allele.
The column specification of region (gene) level association evidence are as follows:
phenotype_id - Molecular trait identifier. (gene)
num_var - Total number of variants tested in cis
beta_shape1 - First parameter value of the fitted beta distribution
beta_shape2 - Second parameter value of the fitted beta distribution
true_df - Effective degrees of freedom the beta distribution approximation
pval_true_df - Empirical P-value for the beta distribution approximation
variant_id - ID of the top variant (rsid or chr:position:ref:alt)
tss_distance - Distance of the SNP to the gene transcription start site (TSS)
ma_samples - Number of samples carrying the minor allele
ma_count - Total number of minor alleles across individuals
af - Alternative allele frequency in MiGA cohort
ref_factor - Flag indicating if the alternative allele is the minor allele in the cohort (1 if AF <= 0.5, -1 if not)
pval_nominal - Nominal P-value from linear regression
slope - Slope of the linear regression
slope_se - Standard error of the slope
pval_perm - First permutation P-value directly obtained from the permutations with the direct method
pval_beta - Second permutation P-value obtained via beta approximation. This is the one to use for downstream analysis
The columns of nominal association results for quantile regression are as follows:
phenotype_id: Molecular trait identifier.(gene)
variant_id: ID of the variant (rsid or chr:position:ref:alt)
chr : Variant chromosome.
pos : Variant chromosomal position (basepairs).
ref : Variant reference allele (A, C, T, or G).
alt : Variant alternate allele.
af: Alternative allele frequency in MiGA cohort
combined_pval(composite-p value using cauchy combination method): the integrated QR p-value across multiple quantile levels.
qr_0.1_pval to qr_0.9_pval: quantile-specific QR p-values for the quantile levels 0.1, 0.2, …, 0.9.
qr_0.1_slope to qr_0.9_slope: quantile-specific QR coefficients for the quantile levels 0.1, 0.2, …, 0.9.
qr_0.1_tval to qr_0.9_tval: quantile-specific QR t-values for the quantile levels 0.1, 0.2, …, 0.9.
Minimal Working Example Steps#
The data can be found on Synapse.
i. Cis TensorQTL Command#
sos run pipeline/TensorQTL.ipynb cis \
--genotype-file output/genotype_by_chrom/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt \
--phenotype-file output/phenotype_by_chrom/protocol_example.protein.bed.phenotype_by_chrom_files.region_list.txt \
--covariate-file output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
--customized_cis_windows prototype_example/protocol_example/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--cwd output/cis_association/ \
--MAC 5 \
--container containers/TensorQTL.sif
ii. Trans TensorQTL Command#
Note: Some protein is not in the customized cis windows list. There we will need to remove them from the analysis by create a region_list. Noted that the region list need to be a actual file. So <()
file is not acceptable.
zcat output/protocol_example.protein.bed.gz | cut -f 1,2,3,4 | grep -v -e ENSG00000163554 \
-e ENSG00000171564 -e ENSG00000171560 -e ENSG00000171557 > output/protocol_example.protein.region_list
The following will take more than 180G of memory to run.
sos run xqtl-protocol/pipeline/TensorQTL.ipynb trans \
--genotype-file output/genotype_by_chrom/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt \
--phenotype-file output/phenotype_by_chrom/protocol_example.protein.bed.phenotype_by_chrom_files.region_list.txt \
--region-list output/protocol_example.protein.region_list \
--covariate-file output/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-cis-windows input/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--cwd output/association/trans/ \
--MAC 5 --numThreads 8 -J 1 -q csg --mem 240G -c /mnt/vast/hpc/csg/molecular_phenotype_calling/csg.yml \
--container containers/TensorQTL.sif
Troubleshooting#
Step |
Substep |
Problem |
Possible Reason |
Solution |
---|---|---|---|---|
Command Interface#
sos run TensorQTL.ipynb -h
usage: sos run TensorQTL.ipynb [workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
cis
trans
Global Workflow Options:
--cwd output (as path)
Path to the work directory of the analysis.
--phenotype-file VAL (as path, required)
Phenotype file, or a list of phenotype per region.
--genotype-file VAL (as path, required)
A genotype file in PLINK binary format (bed/bam/fam)
format, or a list of genotype per chrom
--covariate-file VAL (as path, required)
Covariate file
--name f"{phenotype_file:bn}_{covariate_file:bn}"
Prefix for the analysis output
--region-list . (as path)
An optional subset of regions of molecular features to
analyze. The last column is the gene names
--region-list-phenotype-column 4 (as int)
--keep-sample . (as path)
Set list of sample to be keep
--interaction ''
FIXME: please document
--customized-cis-windows . (as path)
An optional list documenting the custom cis window for
each region to analyze, with four column, chr, start,
end, region ID (eg gene ID). If this list is not
provided, the default `window` parameter (see below)
will be used.
--phenotype-group . (as path)
The phenotype group file to group molecule_trait into
molecule_trait_object This applies to multiple molecular
events in the same region, such as sQTL analysis.
--chromosome (as list)
The name of phenotype corresponding to gene_id or
gene_name in the region
--MAC 0 (as int)
Minor allele count cutoff
--window 1000000 (as int)
Specify the cis window for the up and downstream radius
to analyze around the region of interest in units of bp
This parameter will be set to zero if
`customized_cis_windows` is provided.
--numThreads 8 (as int)
Number of threads
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 12h
--mem 16G
--container ''
Container option for software to run the analysis:
docker or singularity
--entrypoint ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
--maf-threshold MAC/(2.0*N)
Minor allele frequency cutoff. It will overwrite minor
allele cutoff. You may consider setting it to higher for
interaction analysis if you have statistical power
concerns
Sections
cis_1:
Workflow Options:
--[no-]skip-nominal-if-exist (default to False)
parse input file lists skip nominal association results
if the files exists already This is false by default
which means to recompute everything This is only
relevant when the `parquet` files for nominal results
exist but not the other files and you want to avoid
computing the nominal results again
--[no-]permutation (default to True)
cis_2:
trans_1:
Workflow Options:
--batch-size 50000 (as int)
--pval-threshold 1.0 (as float)
--[no-]permutation (default to False)
Permutation testing is incorrect when the analysis is
done by chrom
Old Minimal working example#
An MWE is uploaded to google drive. The singularity image (sif) for running this MWE is uploaded to google drive
FIXME: need to update these links.
FIXME: Also need to update the example commands below using our new example dataset.
sos run pipeline/TensorQTL.ipynb cis \
--genotype-file plink_files_list.txt \
--phenotype-file MWE.bed.recipe \
--covariate-file ALL.covariate.pca.BiCV.cov.gz \
--cwd ./output/ \
--container containers/TensorQTL.sif --MAC 5
sos run pipeline/TensorQTL.ipynb trans \
--genotype-file plink_files_list.txt \
--phenotype-file MWE.bed.recipe \
--covariate-file ALL.covariate.pca.BiCV.cov.gz \
--cwd ./output/ \
--container containers/TensorQTL.sif --MAC 5 --region-name gene_name
sos run pipeline/TensorQTL.ipynb trans \
--genotype-file plink_files_list.txt \
--phenotype-file MWE.bed.recipe \
--covariate-file /mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_tissue_analysis/output/data_preprocessing/ALL/covariates/ALL.log2cpm.ALL.covariate.pca.resid.PEER.cov.gz \
--cwd ./output/trans_tensorQTL/ \
--region-list /mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_tissue_analysis/reference_data/AD_genes.region_list \
--customized-cis-windows TADB_enhanced_cis.bed \
--container containers/TensorQTL.sif --MAC 5 --region-name ID
Setup and global parameters#
[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Phenotype file, or a list of phenotype per region.
parameter: phenotype_file = path
# A genotype file in PLINK binary format (bed/bam/fam) format, or a list of genotype per chrom
parameter: genotype_file = path
# Covariate file
parameter: covariate_file = path
# Prefix for the analysis output
parameter: name = f"{phenotype_file:bn}_{covariate_file:bn}"
# An optional subset of regions of molecular features to analyze. The last column is the gene names
parameter: region_list = path()
parameter: region_list_phenotype_column = 4
# Set list of sample to be keep
parameter: keep_sample = path()
# FIXME: please document
parameter: interaction = ""
# An optional list documenting the custom cis window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_cis_windows = path()
# The phenotype group file to group molecule_trait into molecule_trait_object
# This applies to multiple molecular events in the same region, such as sQTL analysis.
parameter: phenotype_group = path()
# The name of phenotype corresponding to gene_id or gene_name in the region
parameter: chromosome = []
# Minor allele count cutoff
parameter: MAC = 0
# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# This parameter will be set to zero if `customized_cis_windows` is provided.
parameter: window = 1000000
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# Use the header of the covariate file to decide the sample size
import pandas as pd
N = len(pd.read_csv(covariate_file, sep = "\t",nrows = 1).columns) - 1
# Minor allele frequency cutoff. It will overwrite minor allele cutoff.
# You may consider setting it to higher for interaction analysis if you have statistical power concerns
parameter: maf_threshold = MAC/(2.0*N)
import os
import pandas as pd
def adapt_file_path(file_path, reference_file):
"""
Adapt a single file path based on its existence and a reference file's path.
Args:
- file_path (str): The file path to adapt.
- reference_file (str): File path to use as a reference for adaptation.
Returns:
- str: Adapted file path.
Raises:
- FileNotFoundError: If no valid file path is found.
"""
reference_path = os.path.dirname(reference_file)
# Check if the file exists
if os.path.isfile(file_path):
return file_path
# Check file name without path
file_name = os.path.basename(file_path)
if os.path.isfile(file_name):
return file_name
# Check file name in reference file's directory
file_in_ref_dir = os.path.join(reference_path, file_name)
if os.path.isfile(file_in_ref_dir):
return file_in_ref_dir
# Check original file path prefixed with reference file's directory
file_prefixed = os.path.join(reference_path, file_path)
if os.path.isfile(file_prefixed):
return file_prefixed
# If all checks fail, raise an error
raise FileNotFoundError(f"No valid path found for file: {file_path}")
def adapt_file_path_all(df, column_name, reference_file):
return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))
if str(genotype_file).endswith("bed") and str(phenotype_file).endswith("bed.gz"):
input_files = [[phenotype_file, genotype_file]]
if len(chromosome) > 0:
input_chroms = [int(x) for x in chromosome]
else:
input_chroms = [0]
else:
import pandas as pd
import os
molecular_pheno_files = pd.read_csv(phenotype_file, sep = "\t")
if "#chr" in molecular_pheno_files.columns:
molecular_pheno_files = molecular_pheno_files.groupby(['#chr', '#dir']).size().reset_index(name='count').drop("count",axis = 1).rename(columns = {"#chr":"#id"})
genotype_files = pd.read_csv(genotype_file,sep = "\t")
genotype_files["#id"] = [x.replace("chr","") for x in genotype_files["#id"].astype(str)] # e.g. remove chr1 to 1
genotype_files["#path"] = genotype_files["#path"].apply(lambda x: adapt_file_path(x, genotype_file))
molecular_pheno_files["#id"] = [x.replace("chr","") for x in molecular_pheno_files["#id"].astype(str)]
input_files = molecular_pheno_files.merge(genotype_files, on = "#id")
# Only keep chromosome specified in --chromosome
if len(chromosome) > 0:
input_files = input_files[input_files['#id'].isin(chromosome)]
input_files = input_files.values.tolist()
input_chroms = [x[0] for x in input_files]
input_files = [x[1:] for x in input_files]
cis-xQTL association testing#
[cis_1]
# parse input file lists
# skip nominal association results if the files exists already
# This is false by default which means to recompute everything
# This is only relevant when the `parquet` files for nominal results exist but not the other files
# and you want to avoid computing the nominal results again
parameter: skip_nominal_if_exist = False
parameter: permutation = True
parameter: quantile_regression = False
parameter: qr_vcov= 'robust'
parameter: qr_kernel= 'epa'
parameter: qr_bandwidth= 'hsheather'
parameter: qr_max_iter=200
parameter: qr_p_tol=0.000001
parameter: qr_regress_covariates = True
parameter: qr_add_intercept = False
parameter: qr_scale = False
# use -1 to set the maximum number of cores available
parameter: qr_threads = -1
if quantile_regression:
permutation = False
# Extract interaction name
var_interaction = interaction
if os.path.isfile(interaction):
interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
var_interaction = interaction_s.columns[0] # interaction name
test_regional_association = permutation and len(var_interaction) == 0
input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output_files = dict([("parquet", f'{cwd:a}/{_input[0]:bnn}{"_%s" % var_interaction if interaction else ""}.cis_qtl_pairs.{"" if input_chroms[_index] == 0 else input_chroms[_index]}.parquet'), # This convention is necessary to match the pattern of map_norminal output
("nominal", f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_{"quantile_" if quantile_regression else ""}qtl.pairs.tsv.gz')])
if test_regional_association:
output_files["regional"] = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.regional.tsv.gz'
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output["nominal"]:bnnn}'
python: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout' , container = container, entrypoint = entrypoint
import pandas as pd
import numpy as np
import os
import tensorqtl
from tensorqtl import genotypeio, cis
from scipy.stats import chi2
import multiprocessing as mp
from tqdm import tqdm
## Define paths
plink_prefix_path = $[_input[1]:nar]
expression_bed = $[_input[0]:ar]
covariates_file = "$[covariate_file:a]"
window = $[window]
interaction = "$[interaction]"
## Load Data
phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
phenotype_id = phenotype_pos_df.index.name
## Analyze only the regions listed
if $[region_list.is_file()]:
region = pd.read_csv("$[region_list:a]", comment="#", header=None, sep="\t" )
phenotype_column = 1 if len(region.columns) == 1 else $[region_list_phenotype_column]
keep_region = region.iloc[:,phenotype_column-1].to_list()
phenotype_df = phenotype_df[phenotype_df.index.isin(keep_region)]
phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_region)]
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
pr = genotypeio.PlinkReader(plink_prefix_path)
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos','a0','a1']]
## use custom sample list to subset the covariates data
if $[keep_sample.is_file()]:
sample_list = pd.read_csv("$[keep_sample:a]", comment="#", header=None, names=["sample_id"], sep="\t")
covariates_df.loc[sample_list.sample_id]
# Read interaction files or extract from covariates file.
var_interaction = interaction
interaction_s = []
if os.path.isfile(interaction):
# update var_interaction and interaction_s
interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
interaction_s = interaction_s[interaction_s.index.isin(covariates_df.index)]
var_interaction = interaction_s.columns[0] # interaction name
# check if the interaction term in interaction table is in covariates file, if yes and interaction_s not yet loaded then, extract it out from covariates file
if var_interaction in covariates_df.columns:
# only load from covariate if it has not been loaded yet
if len(interaction_s) == 0:
interaction_s = covariates_df[var_interaction].to_frame()
covariates_df = covariates_df.drop(columns=[var_interaction])
if len(interaction) and len(interaction_s) == 0:
raise ValueError(f"Cannot find interaction variable or file {interaction}")
# drop samples that with missing value in iteraction
if len(interaction_s):
interaction_s = interaction_s.dropna()
## Retaining only common samples
phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)]
if len(interaction_s):
phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, interaction_s.index)]
interaction_s = interaction_s[interaction_s.index.isin(phenotype_df.columns)]
interaction_s = interaction_s.loc[phenotype_df.columns]
covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
## To simplify things, there should really not be "chr" prefix
phenotype_pos_df.chr = phenotype_pos_df.chr.astype(str).str.replace("chr", "")
variant_df.chrom = variant_df.chrom.astype("str").str.replace("chr", "")
## use custom cis windows list
if $[customized_cis_windows.is_file()]:
cis_list = pd.read_csv("$[customized_cis_windows:a]", comment="#", header=None, names=["chr","start","end",phenotype_id], sep="\t")
cis_list.chr = cis_list.chr.astype(str).str.replace("chr", "") ## Again to simplify things for chr format concordance.
phenotype_pos_df = phenotype_pos_df.reset_index() #move the phenotype id index to a new column of the dataframe
phenotype_pos_df = phenotype_pos_df.merge(cis_list, left_on = ["chr",phenotype_id],right_on = [cis_list.columns[0],cis_list.columns[3]])#in some cases (gene expression for eQTLs) the phenotype_id may be in the cis_list file
phenotype_pos_df = phenotype_pos_df.set_index(phenotype_id)[["chr","start","end"]] # The final phenotype_pos_df will have three columns(chr, start, end) and index is the phenotype ID
if len(phenotype_df.index) != len(phenotype_pos_df.index):
raise ValueError("cannot uniquely match all the phentoype data in the input to the customized cis windows provided")
window = 0 # In the updated tensorQTL, by default if there is a customized cis window, the actual cis window will be start - window & end + window, so it is necessary to change the window parameter to 0
## Read phenotype group if availble
if $[phenotype_group.is_file()]:
group_s = pd.read_csv($[phenotype_group:r], sep='\t', header=None, index_col=0, squeeze=True)
else:
group_s = None
if $["True" if quantile_regression else "False"]:
import statsmodels.api as sm
import scipy.stats as ss
from scipy.stats import cauchy
import time
import warnings
import torch
# Perform quantile regression
start_time = time.time()
def filter_maf_with_numpy(genotypes, variant_ids, maf_threshold, alleles=2):
af = genotypes.sum(axis=1) / (alleles * genotypes.shape[1])
maf = np.where(af > 0.5, 1 - af, af)
if maf_threshold > 0:
mask = maf >= maf_threshold
#mask = mask.astype(bool)
genotypes = genotypes[mask]
variant_ids = np.array(variant_ids)[mask]
af = af[mask]
maf = maf[mask]
return genotypes, variant_ids, af, maf
def remove_covariate_effects_np(X, Z, y, scale=False):
# for X,Z,y, rows are samples
if not np.all(Z[:, 0] == 1):
Z = np.hstack([np.ones((Z.shape[0], 1)), Z])
A = np.linalg.inv(Z.T @ Z)
SZy = A @ Z.T @ y
SZX = A @ Z.T @ X
# Adjust y and X based on calculated effects
y_adjusted = y - Z @ SZy
X_adjusted = X - Z @ SZX
if scale:
# Standardize y and X
y_scaled = (y_adjusted - y_adjusted.mean()) / y_adjusted.std()
X_scaled = (X_adjusted - X_adjusted.mean(axis=0)) / X_adjusted.std(axis=0)
return X_scaled, y_scaled
else:
return X_adjusted, y_adjusted
def impute_mean(genotypes_t, missing=-1):#for torch tensor genotypes_t,cols are samples
m = genotypes_t == missing
ix = torch.nonzero(m, as_tuple=True)[0]
if len(ix) > 0:
a = genotypes_t.sum(1)
b = m.sum(1).float()
mu = (a - missing*b) / (genotypes_t.shape[1] - b)
genotypes_t[m] = mu[ix]
def drop_zero_variance_rows(genotypes, variant_ids):
# Calculate the variance of each row
variances = np.var(genotypes, axis=1)
# Find the indices of rows with non-zero variance
non_zero_variance_indices = np.where(variances != 0)[0]
# Select rows with non-zero variance in genotypes and corresponding variant_ids
genotypes_updated = genotypes[non_zero_variance_indices, :]
variant_ids_updated = [variant_ids[i] for i in non_zero_variance_indices.tolist()]
return genotypes_updated, variant_ids_updated
def cauchy_meta(pvals):
# Check input
pvals = np.array(pvals)
pvals = pvals[~np.isnan(pvals)]
if len(pvals) == 0:
return np.nan
# pvals[pvals == 0] = 2.2e-308
# Convert to Cauchy
cauchy_vals = np.zeros(pvals.shape)
valid_indices = pvals >= 1e-15
cauchy_vals[valid_indices] = np.tan(np.pi * (0.5 - pvals[valid_indices]))
stats = np.mean(cauchy_vals)
p = cauchy.sf(stats)
return p
def fit_quantreg(phenotype_id, phenotype_array, genotype_matrix, variant_ids, af, covar_df, taus=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], vcov='$[qr_vcov]', kernel='$[qr_kernel]', bandwidth='$[qr_bandwidth]', max_iter=$[qr_max_iter], p_tol=$[qr_p_tol]):
results = []
variant_pvals = {vid: [] for vid in variant_ids}
if covar_df is not None:
if genotype_matrix.shape[1] != covar_df.shape[0]:
raise ValueError("Mismatch in the number of samples between genotype matrix and covariates DataFrame")
for tau in taus:
for idx, genotype_row in enumerate(genotype_matrix):
current_af = af[idx]
if covar_df is not None:
X = np.c_[covar_df.values, genotype_row]
else:
# If covar_df is None, just use the genotype_row
X = genotype_row.reshape(-1, 1)
if $["True" if qr_add_intercept else "False"]:
X = sm.add_constant(X).astype('float')
model = sm.QuantReg(phenotype_array, X)
try:
res = model.fit(q=tau, vcov=vcov, kernel=kernel, bandwidth=bandwidth, max_iter=max_iter, p_tol=p_tol)
pval = res.pvalues[-1]
variant_pvals[variant_ids[idx]].append(pval)
results.append({
'phenotype_id': phenotype_id,
'variant_id': variant_ids[idx],
'af': current_af,
'tau': tau,
'pval': pval,
'slope': res.params[-1],
'tval': res.tvalues[-1]
})
except Exception as e:
raise RuntimeError(f"Error fitting model for variant {variant_ids[idx]} at tau {tau}: {e}")
# Calculate combined p-values for each variant
start_time_for_combined_p = time.time()
for vid in variant_ids:
combined_pval = cauchy_meta(variant_pvals[vid])
for result in filter(lambda r: r['variant_id'] == vid, results):
result['combined_pval'] = combined_pval
results_df = pd.DataFrame(results)
end_time_for_combined_p = time.time()
wide_df = pd.pivot_table(results_df, values=['pval', 'slope', 'tval'], index=['phenotype_id', 'variant_id', 'af','combined_pval'], columns='tau')
wide_df.columns = ['qr_{}_{}'.format(col[1], col[0]) for col in wide_df.columns]
wide_df.reset_index(inplace=True)
variant_split = wide_df['variant_id'].str.extract(r'chr(\d+):(\d+)_([A-Z*]+)_([A-Z*]+)')
variant_split.columns = ['chr', 'pos', 'ref', 'alt']
final_df = pd.concat([variant_split, wide_df], axis=1)
return final_df
# get same order and common samples of phenotype_df, genotype_df, and covariates_df
covariates_df.index = covariates_df.index.astype(str)
common_samples = np.intersect1d(phenotype_df.columns, genotype_df.columns)
genotype_df = genotype_df[common_samples]
sample_order = covariates_df.index
phenotype_df = phenotype_df[sample_order]
genotype_df = genotype_df[sample_order]
# quantile qtl mapping:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
chrom = None
igc = genotypeio.InputGeneratorCis(genotype_df, variant_df, phenotype_df, phenotype_pos_df, group_s=group_s, window=window)
results_list = []
def quantile_regression_wrapper(args):
phenotype, genotypes, genotype_range, phenotype_id = args
genotypes = torch.tensor(genotypes, dtype=torch.float).to(device)
impute_mean(genotypes)
if genotypes.is_cuda:
genotypes = genotypes.cpu()
genotypes = genotypes.numpy()
# drop_zero_variance_rows
variant_ids = variant_df.index[genotype_range[0]:genotype_range[-1]+1].tolist()
genotypes, variant_ids = drop_zero_variance_rows(genotypes, variant_ids)
# filter_maf_with_numpy
genotypes, variant_ids, af, maf = filter_maf_with_numpy(genotypes, variant_ids, $[maf_threshold])
# regress out covariates from y and X
if $["True" if qr_regress_covariates else "False"]:
covar_np = covariates_df.to_numpy()
X_adj, phenotype = remove_covariate_effects_np(genotypes.T, covar_np, phenotype, scale = '$[qr_scale]')
genotypes = X_adj.T
quantqtl_start_time = time.time()
with warnings.catch_warnings():
warnings.simplefilter("ignore")
results_df = fit_quantreg(phenotype_id=phenotype_id,phenotype_array=phenotype,genotype_matrix=genotypes,covar_df=None,variant_ids=variant_ids,af=af
)
else:
with warnings.catch_warnings():
warnings.simplefilter("ignore")
results_df = fit_quantreg(phenotype_id=phenotype_id,phenotype_array=phenotype,genotype_matrix=genotypes,covar_df=covariates_df,variant_ids=variant_ids,af=af
)
quantqtl_end_time = time.time()
runtime = quantqtl_end_time - quantqtl_start_time
return results_df, runtime
# quantile regression main function
qr_threads = $[qr_threads]
max_cores = mp.cpu_count()
if qr_threads <= 0:
qr_threads = max_cores
else:
qr_threads = min(qr_threads, max_cores)
results_list = []
runtime_list = []
if qr_threads == 1:
for phenotype, genotypes, genotype_range, phenotype_id in igc.generate_data(chrom=chrom, verbose=True):
results_df, runtime = quantile_regression_wrapper((phenotype, genotypes, genotype_range, phenotype_id))
results_list.append(results_df)
runtime_list.append(runtime)
else:
pool = mp.Pool(processes=qr_threads)
phenotype_args = [(phenotype, genotypes, genotype_range, phenotype_id)
for phenotype, genotypes, genotype_range, phenotype_id in igc.generate_data(chrom=chrom, verbose=True)]
with tqdm(total=len(phenotype_args), desc="Quantile Regression") as progress_bar:
for results_df, runtime in pool.imap_unordered(quantile_regression_wrapper, phenotype_args):
results_list.append(results_df)
runtime_list.append(runtime)
progress_bar.update(1)
pool.close()
pool.join()
quantile_reg_df = pd.concat(results_list, ignore_index=True)
# sort the table if chrom and pos is not in ascending order
quantile_reg_df['pos'] = pd.to_numeric(quantile_reg_df['pos'])
if not all(quantile_reg_df['pos'].iloc[i] <= quantile_reg_df['pos'].iloc[i+1] for i in range(len(quantile_reg_df)-1)):
quantile_reg_df = quantile_reg_df.sort_values(by=['chr', 'pos'])
end_time = time.time()
print(f"Total execution time from the beginning to the quantile regression: {end_time - start_time} seconds")
# save file
quantile_reg_df.to_csv($[_output["nominal"]:nr], sep='\t', index = None)
# print general information of quantile_reg_df
print('Output Information:')
print("Output Rows:", len(quantile_reg_df))
print("Output Columns:", quantile_reg_df.columns.tolist())
print("Output Preview:", quantile_reg_df.iloc[1:5, 1:10])
pass
else:
## cis-QTL mapping: nominal associations for all variant-phenotype pairs
if not ($[skip_nominal_if_exist] and $[_output["parquet"].is_file()]):
if len(interaction_s):
cis.map_nominal(genotype_df, variant_df,
phenotype_df,
phenotype_pos_df,
$[_output["parquet"]:nnnr],
covariates_df=covariates_df,
interaction_df=interaction_s,
maf_threshold_interaction=$[maf_threshold],
window=window,
group_s=group_s,
run_eigenmt=True, write_top=True, write_stats=True)
else:
cis.map_nominal(genotype_df, variant_df,
phenotype_df,
phenotype_pos_df,
$[_output["parquet"]:nnnr],
covariates_df=covariates_df,
window=window,
maf_threshold = $[maf_threshold],
group_s=group_s)
## Load the parquet and save it as txt
pairs_df = pd.read_parquet($[_output["parquet"]:r])
# print general information of parquet
print('Output Information:')
print("This is the file containing the immediate output of TensorQTL's map_nominal function ")
print(os.path.getsize($[_output["parquet"]:r]))
## Adds the group columns to pairs_df, if there is group_s use group_s, else use phenotype_id
if group_s is not None:
pairs_df = pairs_df.merge(pd.DataFrame( {"molecular_trait_object_id": group_s}),left_on = "phenotype_id", right_index = True)
else:
pairs_df["molecular_trait_object_id"] = pairs_df.phenotype_id
# rename columns
pairs_df.columns.values[0] = "molecular_trait_id"
pairs_df.columns.values[7] = "pvalue"
pairs_df.columns.values[8] = "beta"
pairs_df.columns.values[9] = "se"
pairs_df.rename(columns={'TSS_D': 'tss_distance'}, inplace=True)
if len(interaction_s):
# calculate genomic inflation factor lambda on interaction
# lambda_col_interaction = pairs_df.groupby("molecular_trait_object_id").apply(lambda x: chi2.ppf(1. - np.median(x.pval_gi), 1)/chi2.ppf(0.5,1))
pairs_df.columns.values[10] = "pvalue" + '_' + var_interaction
pairs_df.columns.values[11] = "beta" + '_' + var_interaction
pairs_df.columns.values[12] = "se" + '_' + var_interaction
pairs_df.columns.values[13] = "pvalue" + '_' + var_interaction + '_' + 'interaction'
pairs_df.columns.values[14] = "beta" + '_' + var_interaction + '_' + 'interaction'
pairs_df.columns.values[15] = "se" + '_' + var_interaction + '_' + 'interaction'
pairs_df["n"] = len(phenotype_df.columns.values)
pairs_df = variant_df.merge(pairs_df, right_on='variant_id', left_index=True)
pairs_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
# sort the table if chrom and pos is not in ascending order
if not all(pairs_df['pos'].iloc[i] <= pairs_df['pos'].iloc[i+1] for i in range(len(pairs_df)-1)):
pairs_df = pairs_df.sort_values(by=['chrom', 'pos'])
# save file
pairs_df.to_csv($[_output["nominal"]:nr], sep='\t', index = None)
# print general information of pairs_df
print('Output Information:')
print("Output Rows:", len(pairs_df))
print("Output Columns:", pairs_df.columns.tolist())
print("Output Preview:", pairs_df.iloc[1:5, 1:10])
if $[test_regional_association]:
# calculate genomic inflation factor lambda for main variant effect
lambda_col = pairs_df.groupby("molecular_trait_object_id").apply(lambda x: chi2.ppf(1. - np.median(x.pvalue), 1)/chi2.ppf(0.5,1))
cis_df = cis.map_cis(genotype_df,
variant_df,
phenotype_df,
phenotype_pos_df,
covariates_df=covariates_df,
seed=999,
window=window,
maf_threshold = $[maf_threshold],
group_s=group_s)
cis_df.index.name = "molecular_trait_id"
## Add groups columns for eQTL analysis
if "group_id" not in cis_df.columns:
cis_df["group_id"] = cis_df.index
cis_df["group_size"] = 1
cis_df = cis_df.rename({"group_id":"molecular_trait_object_id","group_size":"n_traits","num_var" : "n_variants","pval_perm":"p_perm", "pval_beta":"p_beta" },axis = 1)
cis_df = cis_df.assign(genomic_inflation = lambda dataframe : dataframe["molecular_trait_object_id"].map(lambda molecular_trait_object_id:lambda_col[molecular_trait_object_id]))
# merge cis_df with variant_df
cis_df = variant_df.merge(cis_df, right_on='variant_id', left_index=True)
cis_df.rename(columns={'a1': 'a2', 'a0': 'a1', 'variant_id': 'variant'}, inplace=True)
# sort the table if chrom and pos is not in ascending order
if not all(cis_df['pos'].iloc[i] <= cis_df['pos'].iloc[i+1] for i in range(len(cis_df)-1)):
cis_df = cis_df.sort_values(by=['chrom', 'pos'])
# save file
cis_df.to_csv(str($[_output["nominal"]:nnnr])+str('.regional.tsv'), sep='\t', index = None)
# print general information of cis_df
print('Output Information:')
print("Output Rows:", len(cis_df))
print("Output Columns:", cis_df.columns.tolist())
print("Output Preview:", cis_df.iloc[0:5, 0:10])
R: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container, entrypoint = entrypoint
library(purrr)
library(tidyr)
library(readr)
library(dplyr)
library(qvalue)
pairs_df = read_delim($[_output["nominal"]:nr,], delim = '\t')
compute_qvalues <- function(pvalues) {
tryCatch({
if(length(pvalues) < 2) {
return(pvalues)
} else {
return(qvalue(pvalues)$qvalues)
}
}, error = function(e) {
message("Too few p-values to calculate qvalue, fall back to BH")
qvalue(pvalues, pi0 = 1)$qvalues
})
}
quantile_regression <- "$[quantile_regression]"
if (quantile_regression) {
pairs_df <- pairs_df %>%
group_by(phenotype_id) %>%
mutate(
across(
.cols = ends_with("_pval"),
.fns = ~ compute_qvalues(.x),
.names = "{sub('_pval$', '_qval', .col)}"
)
)
} else {
var_interaction <- "$[interaction]"
# Check if 'interaction' is a file
if (file.exists(var_interaction)) {
# Read the file into 'interaction_s' dataframe
interaction_s <- read.delim(var_interaction, row.names = 1)
# Update 'var_interaction' to the first column name of 'interaction_s'
var_interaction <- names(interaction_s)[1]
}
if (is.null(var_interaction) || var_interaction == "") {
pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue = compute_qvalues(pvalue))
} else {
pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue_main = compute_qvalues(pvalue), qvalue_interaction = compute_qvalues($["pvalue_%s_interaction" % var_interaction]))
}
}
pairs_df %>% write_delim($[_output["nominal"]:nr],"\t")
bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container, entrypoint = entrypoint
bgzip --compress-level 9 $[_output["nominal"]:n]
tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]]
done_if(not test_regional_association)
bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container, entrypoint = entrypoint
bgzip --compress-level 9 $[_output["nominal"]:nnn].regional.tsv
tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]:nnn].regional.tsv.gz
[cis_2]
done_if("regional" not in _input.labels)
input: group_by = "all"
output: f'{cwd}/{_input["nominal"][0]:bnnnn}.cis_qtl_regional.fdr.gz',
f'{cwd}/{_input["nominal"][0]:bnnnn}.cis_qtl_regional.summary.txt'
input_files = [str(x) for x in _input["regional"]]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container, entrypoint = entrypoint
library("purrr")
library("tidyr")
library("dplyr")
library("readr")
library("qvalue")
emprical_pd = tibble(map(c($[_input["regional"]:r,]), ~read_delim(.x,"\t")))%>%unnest()
emprical_pd["q_beta"] = tryCatch(qvalue(emprical_pd$p_beta)$qvalue, error = function(e){print("Too few pvalue to calculate qvalue, fall back to BH")
qvalue(emprical_pd$p_beta,pi0 = 1 )$qvalue})
emprical_pd["q_perm"] = tryCatch(qvalue(emprical_pd$p_perm)$qvalue, error = function(e){print("Too few pvalue to calculate qvalue, fall back to BH")
qvalue(emprical_pd$p_perm,pi0 = 1 )$qvalue})
emprical_pd["fdr_beta"] = p.adjust(emprical_pd$p_beta,"fdr")
emprical_pd["fdr_perm"] = p.adjust(emprical_pd$p_perm,"fdr")
summary = tibble("fdr_perm_0.05" = sum(emprical_pd["fdr_perm"] < 0.05) ,
"fdr_beta_0.05" = sum(emprical_pd["fdr_beta"] < 0.05),
"q_perm_0.05" = sum(emprical_pd["q_perm"] < 0.05) ,
"q_beta_0.05" = sum(emprical_pd["q_beta"] < 0.05) ,
"q_perm_0.01" = sum(emprical_pd["q_perm"] < 0.01) ,
"q_beta_0.01" = sum(emprical_pd["q_beta"] < 0.01) )
emprical_pd%>%write_delim("$[_output[0]]","\t")
summary%>%write_delim("$[_output[1]]","\t")
Trans-xQTL association testing#
For transQTL analysis, if you output all the p-values for many genes (default setting) it is suggested to provide the largest memory and CPU threads available on a compute node. eg 250G and >32 threads.
[trans_1]
input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output: nominal = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_%s" % input_chroms[_index]}.trans_qtl.pairs.tsv.gz'
parameter: batch_size = 10000
parameter: pval_threshold = 1.0
# Permutation testing is incorrect when the analysis is done by chrom
parameter: permutation = False
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container =container, entrypoint = entrypoint
import pandas as pd
import numpy as np
import tensorqtl
from tensorqtl import genotypeio, trans
from scipy.stats import chi2
## Define paths
plink_prefix_path = $[_input[1]:nar]
expression_bed = $[_input[0]:ar]
covariates_file = "$[covariate_file:a]"
window = $[window]
## Loading Data
phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
phenotype_id = phenotype_pos_df.index.name
## Analyze only the regions listed
if $[region_list.is_file()]:
region = pd.read_csv("$[region_list:a]", comment="#", header=None,sep="\t")
phenotype_column = 1 if len(region.columns) == 1 else $[region_list_phenotype_column]
keep_region = region.iloc[:,phenotype_column-1].to_list()
phenotype_df = phenotype_df[phenotype_df.index.isin(keep_region)]
phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_region)]
## use custom cis windows
if $[customized_cis_windows.is_file()]:
cis_list = pd.read_csv("$[customized_cis_windows:a]", comment="#", header=None, names=["chr","start","end",phenotype_id], sep="\t")
phenotype_pos_df_reset = phenotype_pos_df.reset_index() #move the phenotype id index to a new column of the dataframe
phenotype_pos_df = phenotype_pos_df_reset.merge(cis_list, left_on = ["chr",phenotype_id],right_on = [cis_list.columns[0],cis_list.columns[3]])#in some cases (gene expression for eQTLs) the phenotype_id may be in the cis_list file
if len(phenotype_df.index) - phenotype_pos_df_reset[~phenotype_pos_df_reset[phenotype_id].isin(cis_list[phenotype_id])].shape[0]!= len(phenotype_pos_df.index):
raise ValueError("cannot uniquely match all the phentoype data in the input to the customized cis windows provided")
phenotype_pos_df = phenotype_pos_df.set_index(phenotype_id)[["chr","start","end"]] # The final phenotype_pos_df will have three columns(chr, start, end) and index is the phenotype ID
window = 0 # In the updated tensorQTL, by default if there is a customized cis window, the actual cis window will be start - window & end + window, so it is necessary to change the window parameter to 0
## Retaining only traits in cis_list
if phenotype_pos_df_reset[~phenotype_pos_df_reset[phenotype_id].isin(cis_list[phenotype_id])].shape[0] != 0:
phenotype_df = phenotype_df.loc[phenotype_df.index.isin(cis_list[phenotype_id])]
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
## use custom sample list to subset the covariates data
if $[keep_sample.is_file()]:
sample_list = pd.read_csv("$[keep_sample:a]", comment="#", header=None, names=["sample_id"], sep="\t")
covariates_df.loc[sample_list.sample_id]
pr = genotypeio.PlinkReader(plink_prefix_path)
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos','a0','a1']]
## Retaining only common samples
phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)]
covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
## To simplify things, there should really not be "chr" prefix
phenotype_pos_df.chr = [x.replace("chr","") for x in phenotype_pos_df.chr]
variant_df.chrom = [x.replace("chr","") for x in variant_df.chrom]
## Trans analysis
trans_df = trans.map_trans(genotype_df,
phenotype_df,
covariates_df,
batch_size=$[batch_size],
return_sparse=True,
return_r2 = True,
pval_threshold=$[pval_threshold],
maf_threshold=$[maf_threshold])
## Filter out cis signal, again if customized cis windows are used, the windows is [start-win,end + win] where win = 0, else it is [start - win, start + win]
trans_df = trans.filter_cis(trans_df, phenotype_pos_df, variant_df, window=window)
## Permutation
if $['True' if permutation else 'False']:
perm_df = trans.map_permutations(genotype_df, covariates_df, batch_size=$[batch_size],
maf_threshold=$[maf_threshold])
trans.apply_permutations(perm_df,trans_df)
## Output
trans_df.columns.values[1] = "molecular_trait_id"
trans_df.columns.values[2] = "pvalue"
trans_df.columns.values[3] = "beta"
trans_df.columns.values[4] = "se"
trans_df["n"] = len(phenotype_df.columns.values)
trans_df = variant_df.merge(trans_df, right_on='variant_id', left_index=True)
trans_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
# genomic inflation factor
lambda_col = trans_df.groupby("molecular_trait_id").apply(lambda x: chi2.ppf(1. - np.median(x.pvalue), 1)/chi2.ppf(0.5,1))
# Convert the Series to a DataFrame
lambda_col = lambda_col.reset_index()
lambda_col.columns = ['molecular_trait_id', 'genomic_inflation_lambda']
lambda_col.to_csv("$[_output:nnn].genomic_inflation.tsv.gz", sep='\t', index = None, compression={'method': 'gzip', 'compresslevel': 9})
# sort the table if chrom and pos is not in ascending order
if not all(trans_df['pos'].iloc[i] <= trans_df['pos'].iloc[i+1] for i in range(len(trans_df)-1)):
trans_df = trans_df.sort_values(by=['chrom', 'pos'])
# print general information of trans_df
print('Output Information:')
print("Output Rows:", len(trans_df))
print("Output Columns:", trans_df.columns.tolist())
print("Output Preview:", trans_df.iloc[0:5, 0:10])
# save output
trans_df.to_csv($[_output["nominal"]:nr], sep='\t', index = None)
R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container =container, entrypoint = entrypoint
if ($['FALSE' if permutation and pval_threshold < 1.0 else 'TRUE']){
library(purrr)
library(tidyr)
library(readr)
library(dplyr)
library(qvalue)
# calculate qvalue
pairs_df = read_delim($[_output["nominal"]:nr,], delim = '\t')
pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue = qvalue(pvalue)$qvalues)
pairs_df %>% write_delim($[_output["nominal"]:nr],"\t")
}
bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container, entrypoint = entrypoint
bgzip --compress-level 9 $[_output["nominal"]:n]
tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]]