Normalization and phenotype table generation for splicing QTL analysis#
Description#
Quality control and normalization are performed on the output of the leafcutter and psichomics tools. The raw output is first converted to BED format. Quality control removes features with high missingness across samples (default 40%); the remaining missing values are retained at this stage. Introns with less than a minimal variation (default 0.001) are then removed. Quantile-quantile normalization is performed on the quality-controlled data; imputation of missing values is done separately afterwards.
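For intuition, the two QC filters can be sketched in a few lines of pandas. This is an illustrative toy example, not part of the pipeline; the actual thresholds are the na_rate and min_variance parameters documented below.
import numpy as np
import pandas as pd

# toy phenotype table: 4 annotation columns followed by one column per sample
df = pd.DataFrame({
    "#Chr": ["chr1", "chr1"], "start": [100, 200], "end": [150, 250],
    "ID": ["chr1:100:150:clu_1_+", "chr1:200:250:clu_1_+"],
    "samp1": [0.5, np.nan], "samp2": [0.5, 0.7], "samp3": [0.6, np.nan]})
values = df.iloc[:, 4:]
# keep rows with at most 40% missing values and at least the minimal variation
keep = (values.isna().mean(axis=1) <= 0.4) & (values.std(axis=1, ddof=0) >= 0.001)
df_qc = df[keep]  # second row is dropped for excess missingness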
Input#
leafcutter#
The sample_list_intron_usage_perind.counts.gz file generated by the previous splicing_calling.ipynb step.
Minimal working example input may be found on Google Drive, specifically the sample_fastq_bam_list_intron_usage_perind.counts.gz file.
psichomics#
The psichomics_raw_data.tsv file generated by the previous splicing_calling.ipynb step.
Minimal working example input may be found on Google Drive, specifically the psi_raw_data.tsv file.
Output#
leafcutter#
{sample_list} below refers to the name of the metadata file input to the previous step.
Main outputs include:
{sample_list}_intron_usage_perind.counts.gz_raw_data.qqnorm.txt
a merged table with normalized intron usage ratios for each sample, ready to be the phenotype input for tensorQTL, in the following format:
#Chr start end ID samp1 samp2 samp3 ...
chromosome intron_start intron_end {chr}:{intron_start}:{intron_end}:{cluster_id}_{strand} data data data ...
(the strand in the “ID” column is the strandedness of the junction computed via regtools in the previous workflow)
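As an illustration of this ID layout, here is a small hypothetical parser (assuming the four colon-separated fields shown above; leafcutter2 IDs carry a fifth, functional-category field, as handled in the workflow code further below):
def parse_leafcutter_id(feature_id):
    # e.g. "chr1:825552:829002:clu_1_+"
    chrom, start, end, cluster = feature_id.split(":")
    cluster_id, strand = cluster.rsplit("_", 1)
    return {"chrom": chrom, "start": int(start), "end": int(end),
            "cluster_id": cluster_id, "strand": strand}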
psichomics#
Main outputs include:
psichomics_raw_data_bedded.qqnorm.txt
a merged table with normalized percent spliced-in (PSI) values for each sample, ready to be the phenotype input for tensorQTL, in the following format:
#Chr start end ID samp1 samp2 samp3 ...
chromosome intron_start intron_end {event_type}_{chr}_{strand}_{coordinates}_{gene} data data data ...
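Similarly, a hypothetical parser for the psichomics ID layout (assuming a single-token event type and gene symbol, with a variable number of coordinates in between):
def parse_psichomics_id(feature_id):
    # e.g. "SE_1_+_100_200_300_400_GENE1" (illustrative values)
    parts = feature_id.split("_")
    return {"event_type": parts[0], "chrom": parts[1], "strand": parts[2],
            "coordinates": [int(c) for c in parts[3:-1]], "gene": parts[-1]}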
Minimal Working Example Steps#
ii. Splicing QC and Normalization#
a. Leafcutter#
Timing: ~30min
!sos run splicing_normalization.ipynb leafcutter_norm \
--cwd ../../output_test/leafcutter/normalize \
--ratios ../../output_test/leafcutter/PCC_sample_list_subset_leafcutter_intron_usage_perind.counts.gz \
--container oras://ghcr.io/cumc/leafcutter_apptainer:latest \
-s force -c ../csg.yml -q neurology
INFO: Running leafcutter_norm_1:
INFO: t7dd380a023cbb83f re-execute completed
INFO: t7dd380a023cbb83f submitted to neurology with job id Your job 3983798 ("job_t7dd380a023cbb83f") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: leafcutter_norm_1 output: ../../output_test/leafcutter/PCC_sample_list_subset_leafcutter_intron_usage_perind.counts.gz_phenotype_file_list.txt
INFO: Running leafcutter_norm_2:
INFO: td8bff19d02211b90 submitted to neurology with job id Your job 3983807 ("job_td8bff19d02211b90") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: leafcutter_norm_2 output: ../../output_test/leafcutter/PCC_sample_list_subset_leafcutter_intron_usage_perind.counts.gz_raw_data.txt
INFO: Running leafcutter_norm_3:
INFO: t2b8b4754c0be4cbc submitted to neurology with job id Your job 3983815 ("job_t2b8b4754c0be4cbc") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: leafcutter_norm_3 output: ../../output_test/leafcutter/PCC_sample_list_subset_leafcutter_intron_usage_perind.counts.gz_raw_data.qqnorm.txt
INFO: Workflow leafcutter_norm (ID=w9a20332cfb71c800) is executed successfully with 3 completed steps and 3 completed tasks.
To perform the leafcutter analysis with QC (setting cluster counts of 0 to NA), imputation (flashr), and normalization, use the following commands (RECOMMENDED):
# Step 1: QC
sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
--cwd ../../output_test/leafcutter/normalize \
--ratios ../../output_test/leafcutter/PCC_sample_list_subset_leafcutter_intron_usage_perind.counts.gz \
--container oras://ghcr.io/cumc/leafcutter_apptainer:latest \
--no_norm # skip the final qqnorm step of leafcutter_norm
# Step 2: Imputation
sos run pipeline/phenotype_imputation.ipynb EBMF \
--phenoFile <output_from_step1> \
--cwd ../../output_test/leafcutter/imputation \
--prior ebnm_point_laplace --varType 1 \
--container oras://ghcr.io/cumc/factor_analysis_apptainer:latest \
--mem 40G --numThreads 20 --walltime 100h
# Step 3: Normalization
sos run pipeline/splicing_normalization.ipynb leafcutter_qqnorm \
--qced-data <output_from_step2> \
--container oras://ghcr.io/cumc/leafcutter_apptainer:latest
Or, if you are going to use the default mean imputation method, you can run everything with a one-step command (remove --no_norm from the first step):
sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
--cwd ../../output_test/leafcutter/normalize \
--ratios ../../output_test/leafcutter/PCC_sample_list_subset_leafcutter_intron_usage_perind.counts.gz \
--container oras://ghcr.io/cumc/leafcutter_apptainer:latest \
--mean_impute
b. Psichomics#
Timing: ~20min
!sos run splicing_normalization.ipynb psichomics_norm \
--cwd ../../output_test/psichomics/normalize \
--ratios ../../output_test/psichomics/psi_raw_data.tsv \
--container oras://ghcr.io/cumc/psichomics_apptainer:latest \
-c ../csg.yml -q neurology
INFO: Running psichomics_norm_1:
INFO: psichomics_norm_1 (index=0) is ignored due to saved signature
INFO: psichomics_norm_1 (index=0) is ignored due to saved signature
INFO: psichomics_norm_1 output: /restricted/projectnb/xqtl/xqtl_protocol/output_test/psichomics/normalize/psichomics_raw_data_bedded.txt
INFO: Running psichomics_norm_2:
INFO: t261c5bcc674fd156 submitted to neurology with job id Your job 4284374 ("job_t261c5bcc674fd156") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: psichomics_norm_2 output: /restricted/projectnb/xqtl/xqtl_protocol/output_test/psichomics/normalize/psichomics_raw_data_bedded.qqnorm.txt
INFO: Workflow psichomics_norm (ID=w414de35dd6b2a7af) is executed successfully with 1 completed step, 1 ignored step and 1 completed task.
Troubleshooting#
Step | Substep | Problem | Possible Reason | Solution
---|---|---|---|---
Command interface#
sos run splicing_normalization.ipynb -h
usage: sos run splicing_normalization.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
leafcutter_norm
Global Workflow Options:
--cwd output (as path)
The output directory for generated files.
--ratios VAL (as path, required)
intron usage ratio file with samples after QC
--job-size 1 (as int)
For cluster jobs, number of commands to run per job
--walltime 5h
Wall clock time expected
--mem 16G
Memory expected
--numThreads 8 (as int)
Number of threads
--container ''
Software container option
Sections
leafcutter_norm_1:
Workflow Options:
--chr-blacklist f'{ratios:dd}/black_list.txt'
leafcutter_norm_2:
Setup and global parameters#
[global]
# The output directory for generated files.
parameter: cwd = path("output")
# intron usage ratio file with samples after QC
# optional blacklist file if the user wants to exclude some chromosomes from the analysis
parameter: chr_blacklist = path(".")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')
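To see what the entrypoint parameter evaluates to, here is a standalone sketch applying the same regex as the [global] section above to the containers used in this protocol:
import re

for container in ["oras://ghcr.io/cumc/leafcutter_apptainer:latest",
                  "oras://ghcr.io/cumc/psichomics_apptainer:latest"]:
    # strip the registry path and the apptainer/docker suffix to get the env name
    env = re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '',
                 container.split('/')[-1])
    print(f'micromamba run -a "" -n {env}')
# prints: micromamba run -a "" -n leafcutter
#         micromamba run -a "" -n psichomics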
0. Generate sample list#
The RNA-seq sample IDs differ from the WGS sample IDs. Hao provided a lookup table mapping the two; here it is subset to match the samples used in this sQTL analysis.
Step Inputs:#
ROSMAP_JointCall_sample_participant_lookup_fixed
: A sample ID lookup table for the ROSMAP datasets.
Step Outputs:#
ROSMAP_JointCall_sample_participant_lookup_fixed.rnaseq
: The sample ID lookup table for the ROSMAP datasets, subset to the samples used in this analysis.
[Junc_list]
parameter: junc_path = path
parameter: file_suffix = "junc"
parameter: leafcutter_version = "leafcutter2"
parameter: dataset = "ROSMAP_DLPFC"
output: f'{cwd}/{leafcutter_version}/{dataset}_intron_usage_perind.junc'
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
cd ${cwd}
realpath ${junc_path}/*${file_suffix} > ${_output}.tmp
grep -v "all" ${_output}.tmp > ${_output}
rm ${_output}.tmp
[Jointcall_samples]
parameter: sample_table = path
parameter: junc_list = path
parameter: leafcutter_version = "leafcutter2"
input: sample_table, junc_list
output: f'{cwd}/{leafcutter_version}/{sample_table:b}.rnaseq'
R: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
library(data.table)
library(tidyverse)
sample_lookup_hao = read_delim(${_input[0]:r}, "\t" ,col_names = T)
sample_lookup_hao$sample_id<-sample_lookup_hao$sample_id%>%gsub(".final","",.)
junc_list<-read.table(${_input[1]:r})
# derive sample IDs from the basenames of the junction files
sample_ids<-junc_list$V1%>%basename%>%stringr::str_split(.,"[.]",simplify=T)%>%.[,1]
sample_lookup_hao<-sample_lookup_hao[sample_lookup_hao$sample_id%in%sample_ids,]
write.table(sample_lookup_hao,${_output:r},quote = F,row.names = F,sep='\t')
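A hypothetical invocation of these two steps, following the same sos run convention used elsewhere in this notebook (parameter names come from the step definitions above; paths are placeholders):
sos run splicing_normalization.ipynb Junc_list \
    --cwd output \
    --junc-path /path/to/junc_files \
    --dataset ROSMAP_DLPFC
sos run splicing_normalization.ipynb Jointcall_samples \
    --cwd output \
    --sample-table /path/to/ROSMAP_JointCall_sample_participant_lookup_fixed \
    --junc-list output/leafcutter2/ROSMAP_DLPFC_intron_usage_perind.junc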
leafcutter_norm#
Documentation: leafcutter. The choice of regtools parameters is discussed here.
Parameter Annotations#
chr_blacklist: a file of blacklisted chromosomes to exclude from the analysis, one per line. If none is provided, no chromosomes are blacklisted.
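For example, a black_list.txt excluding the sex and mitochondrial chromosomes might contain the following (illustrative; the names must match the chromosome prefixes of the intron IDs in your counts file):
chrX
chrY
chrM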
Things to keep in mind#
leafcutter_norm_1 requires roughly 10G of memory (more for large inputs); with less, it may fail with a segmentation fault.
[leafcutter_norm_1]
parameter: ratios = path
parameter: pseudo_ratio = False
import os
if os.path.isfile(f'{ratios:dd}/black_list.txt'):
chr_blacklist = f'{ratios:dd}/black_list.txt'
input: ratios, group_by = 'all'
output: f'{ratios}_phenotype_file_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint = entrypoint
# code in [leafcutter_norm_1] and [leafcutter_norm_3] is modified from
# https://github.com/davidaknowles/leafcutter/blob/master/scripts/prepare_phenotype_table.py
import sys
import gzip
import os
import numpy as np
import pandas as pd
import scipy as sc
import pickle
from sklearn import linear_model
def stream_table(f, ss = ''):
fc = '#'
while fc[0] == "#":
fc = f.readline().strip()
head = fc.split(ss)
for ln in f:
ln = ln.strip().split(ss)
attr = {}
for i in range(len(head)):
try: attr[head[i]] = ln[i]
except: break
yield attr
def get_chromosomes(ratio_file):
"""Get chromosomes from table. Returns set of chromosome names"""
try: open(ratio_file)
except:
sys.stderr.write("Can't find %s..exiting\n"%(ratio_file))
return
sys.stderr.write("Parsing chromosome names...\n")
chromosomes = set()
with gzip.open(ratio_file, 'rt') as f:
f.readline()
for line in f:
chromosomes.add(line.split(":")[0])
return(chromosomes)
def get_blacklist_chromosomes(chromosome_blacklist_file):
"""
Get list of chromosomes to ignore from a file with one blacklisted
chromosome per line. Returns list. eg. ['X', 'Y', 'MT']
"""
if os.path.isfile(chromosome_blacklist_file):
with open(chromosome_blacklist_file, 'r') as f:
return(f.read().splitlines())
else:
return([])
def create_phenotype_table(ratio_file, chroms, blacklist_chroms):
dic_pop, fout = {}, {}
try: open(ratio_file)
except:
sys.stderr.write("Can't find %s..exiting\n"%(ratio_file))
return
sys.stderr.write("Starting...\n")
for i in chroms:
fout[i] = open(ratio_file+".phen_"+i, 'w')
fout_ave = open(ratio_file+".ave", 'w')
valRows, valRowsnn, geneRows = [], [], []
finished = False
header = gzip.open(ratio_file, 'rt').readline().split()[1:]
for i in fout:
fout[i].write("\t".join(["#chr","start", "end", "ID"]+header)+'\n')
for dic in stream_table(gzip.open(ratio_file, 'rt'),' '):
chrom = dic['chrom']
chr_ = chrom.split(":")[0]
if chr_ in blacklist_chroms: continue
NA_indices, aveReads = [], []
tmpvalRow = []
i = 0
for sample in header:
try: count = dic[sample]
except: print([chrom, len(dic)])
num, denom = count.split('/')
if float(denom) < 1:
count = "NA"
tmpvalRow.append("NA")
NA_indices.append(i)
else:
# add a pseudocount: 1% of the denominator when pseudo_ratio is set, otherwise a constant 0.5
if ${pseudo_ratio}:
pseudocount = 0.01 * float(denom)
else:
pseudocount = 0.5
count = (float(num)+pseudocount)/((float(denom))+pseudocount)
tmpvalRow.append(count)
aveReads.append(count)
parts = chrom.split(":")
# Check the number of parts to determine how to unpack
if len(parts) == 5:
chr_, s, e, clu, category = parts #leafcutter2 would have 5 columns, the last column is function category
elif len(parts) == 4:
chr_, s, e, clu = parts #leafcutter would have 4 columns
category = None # or some default value, if necessary
else:
raise ValueError("Unexpected format of 'chrom' string")
if len(tmpvalRow) > 0:
fout[chr_].write("\t".join([chr_,s,e,chrom]+[str(x) for x in tmpvalRow])+'\n')
fout_ave.write(" ".join(["%s"%chrom]+[str(min(aveReads)), str(max(aveReads)), str(np.mean(aveReads))])+'\n')
valRows.append(tmpvalRow)
geneRows.append("\t".join([chr_,s,e,chrom]))
if len(geneRows) % 1000 == 0:
sys.stderr.write("Parsed %s introns...\n"%len(geneRows))
for i in fout:
fout[i].close()
matrix = np.array(valRows)
# write the corrected tables
sample_names = []
for name in header:
sample_names.append(name.replace('.Aligned.sortedByCoord.out.md', ''))
fout = {}
for i in chroms:
fn="%s.qqnorm_%s"%(ratio_file,i)
print("Outputting: " + fn)
fout[i] = open(fn, 'w')
fout[i].write("\t".join(['#Chr','start','end','ID'] + sample_names)+'\n')
lst = []
for i in range(len(matrix)):
chrom, s = geneRows[i].split()[:2]
lst.append((chrom, int(s), "\t".join([geneRows[i]] + [str(x) for x in matrix[i]])+'\n'))
lst.sort()
for ln in lst:
fout[ln[0]].write(ln[2])
fout_run = open("%s_phenotype_file_list.txt"%ratio_file, 'w')
fout_run.write("#chr\t#dir\n")
for i in fout:
fout[i].close()
fout_run.write("%s\t"%(i))
fout_run.write("%s.qqnorm_%s\n"%(ratio_file, i))
fout_run.close()
ratio_file = f'${_input}'
chroms = get_chromosomes(f'${_input}')
blacklist_chroms = get_blacklist_chromosomes(f'${chr_blacklist}')
create_phenotype_table(ratio_file, chroms, blacklist_chroms)
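The {ratios}_phenotype_file_list.txt written by this step maps each chromosome to its per-chromosome table, one row per chromosome (paths illustrative):
#chr	#dir
chr1	/path/to/{sample_list}_intron_usage_perind.counts.gz.qqnorm_chr1
chr2	/path/to/{sample_list}_intron_usage_perind.counts.gz.qqnorm_chr2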
[leafcutter_norm_2]
parameter: autosomes = True
import pandas as pd
molecular_pheno_chr_inv = pd.read_csv(f'{_input[0]}',sep = "\t")
# filter the autosomes or not
if autosomes:
print("Analyzing autosomes 1 to 22...")
mask = molecular_pheno_chr_inv['#chr'].str.match(r'chr(\d+)$')
chr_numbers = molecular_pheno_chr_inv['#chr'].str.extract(r'chr(\d+)', expand=False).dropna().astype(int)
molecular_pheno_chr_inv = molecular_pheno_chr_inv[mask & chr_numbers.between(1, 22)]
molecular_pheno_chr_inv = molecular_pheno_chr_inv.values.tolist()
file_inv = [x[1] for x in molecular_pheno_chr_inv]
input: file_inv # This design avoids for_each, since SoS cannot take a chromosome number as an input.
output: f'{_input[0]:n}_raw_data.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint = entrypoint
head -1 ${_input[0]:r} > ${_output}
cat ${_input:r} | grep -v "#Chr" >> ${_output}
[leafcutter_norm_3,leafcutter_qqnorm]
parameter: no_norm = False
stop_if(no_norm)
parameter: mean_impute = False
# maximum NA rate across samples for an alternative splicing event to be kept (default 0.4, matching the leafcutter default NA rate)
parameter: na_rate = 0.4
# minimum variation (standard deviation) across samples for an alternative splicing event to be kept (default 0.001, following the psichomics suggested minimal variance)
parameter: min_variance = 0.001
# qced_data refers to data that has gone through leafcutter_norm_1/2 or that has already been imputed.
# When such data is provided, mean imputation is enabled automatically (mean_impute set to True), making mean imputation the default method.
# However, if the provided data is already imputed (i.e., contains no NA values), no further imputation is applied.
parameter: qced_data = path(".")
parameter: bgzip=False
import os
if qced_data.is_dir():
print('Targeting QCed files with suffix "_raw_data.txt"')
input_file = paths([os.path.join(qced_data, x) for x in os.listdir(qced_data) if x.endswith('_raw_data.txt')])
bgzip=True
else:
print('Loading QCed data from the provided file')
input_file = paths(qced_data)
print('Setting mean_impute to True. If the input has already been imputed (i.e., no NA values are present), nothing will change; otherwise, mean imputation will be applied to the NA values.')
mean_impute = True
input: input_file
output: f'{_input:n}.qqnorm.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint = entrypoint
import numpy as np
import pandas as pd
from sklearn import preprocessing
from scipy.stats import norm
from scipy.stats import rankdata
def qqnorm(x):
n=len(x)
a=3.0/8.0 if n<=10 else 0.5
return(norm.ppf( (rankdata(x,nan_policy="omit")-a)/(n+1.0-2.0*a) ))
#return(norm.ppf( (rankdata(x)-a)/(n+1.0-2.0*a) ))
raw_df = pd.read_csv(f'${_input}',sep = "\t")
valRows = raw_df.iloc[:,4:]
headers = list(raw_df)
drop_list = []
na_limit = len(valRows.columns)*${na_rate}
for index, row in valRows.iterrows():
# drop rows whose ratio is missing in more than na_rate of the samples
if (row.isna().sum()) > na_limit:
drop_list.append(index)
# impute missing values with the mean of the observed values in the row
if ${mean_impute}:
valRows.iloc[index, ] = row.fillna(row.mean())
# drop introns whose variation (standard deviation) across samples is below min_variance
if np.nanstd(row) < ${min_variance}:
drop_list.append(index)
# save the intron information and sample values for remaining introns/rows
newtable = raw_df.drop(drop_list).iloc[:,0:4]
valRows = valRows.drop(drop_list)
# scale normalize on each row
valRows_matrix = []
for c in (valRows.values.tolist()):
c = preprocessing.scale(c)
valRows_matrix.append(c)
# qqnorms on the columns
matrix = np.array(valRows_matrix)
for i in range(len(matrix[0,:])):
matrix[:,i] = qqnorm(matrix[:,i])
normalized_table = pd.DataFrame(matrix)
# reset row index for the saved intron information so the index will match the sample values
newtable = newtable.reset_index(drop=True)
# merge the two parts of table
output = pd.concat([newtable, normalized_table], axis=1)
output.columns = headers
# write normalized table
output.to_csv(f'${_input:n}.qqnorm.txt', sep="\t", index=None)
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint = entrypoint
if ${'true' if bgzip else 'false'}; then
# Compress the sorted BED file
bgzip -c "${_output}" > "${_output:n}.bed.gz"
# Index the compressed BED file
tabix -p bed "${_output:n}.bed.gz"
fi
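For intuition, the rank-based transform implemented by qqnorm() above can be exercised on a toy vector. This standalone sketch mirrors the pipeline's function (scipy >= 1.10 is assumed for the nan_policy argument):
import numpy as np
from scipy.stats import norm, rankdata

def qqnorm(x):
    # map ranks to standard-normal quantiles, leaving NAs in place
    n = len(x)
    a = 3.0/8.0 if n <= 10 else 0.5
    return norm.ppf((rankdata(x, nan_policy="omit") - a) / (n + 1.0 - 2.0*a))

print(qqnorm(np.array([0.2, 0.9, 0.5, np.nan, 0.7])))
# the output is monotone in the input; the NaN position remains NaN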
psichomics_norm#
Documentation: psichomics
Consider retaining more information: the only QC applied to the PSI values here is NA removal and a minimal-variance filter, but the psichomics team suggests further QC, which can be checked here.
For reference, the default minimal variance in the leafcutter QC is 0.005.
Because psichomics outputs mix different alternative splicing event types, normalization is not performed on each sample's data (columns) but only on each event (rows).
TODO: rename the .qqnorm. output and test downstream steps.
[psichomics_norm_1]
parameter: ratios = path
input: ratios, group_by = 'all'
output: f'{cwd}/psichomics_raw_data_bedded.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint = entrypoint
library(psichomics)
library(data.table)
psi_data <- as.matrix(fread("${_input}"),rownames=1)
psi_data = as.data.frame(psi_data)
# Process PSI df into bed file for tensorQTL. (This part of code is modified from Ryan Yordanoff's work)
parsed_events <- parseSplicingEvent(row.names(psi_data))
# Create bedfile df and fill values with parsed values
bed_file <- data.frame("chr"=parsed_events$chrom,"start"=parsed_events$start,"end"=parsed_events$end,"ID"=row.names(parsed_events),psi_data,check.names = FALSE)
names(bed_file)[1] <- "#Chr"
bed_file$'#Chr' <- sub("^", "chr", bed_file$'#Chr')
row.names(bed_file) <- NULL
bed_file <- bed_file[order(bed_file$'#Chr',bed_file$start,bed_file$end),]
# Create BED file output
write.table(x=bed_file, file = "${cwd}/psichomics_raw_data_bedded.txt", quote = FALSE, row.names = FALSE, sep = "\t")
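The resulting psichomics_raw_data_bedded.txt is a tab-delimited, BED-like table in the same layout as the leafcutter output; schematically (values illustrative):
#Chr	start	end	ID	samp1	samp2
chr1	100	200	SE_1_+_100_150_180_200_GENE1	0.85	0.91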
[psichomics_norm_2]
# maximum NA rate across samples for an alternative splicing event to be kept (default 0.4, matching the leafcutter default NA rate)
parameter: na_rate = 0.4
# minimum variation (standard deviation) across samples for an alternative splicing event to be kept (default 0.001, following the psichomics suggested minimal variance)
parameter: min_variance = 0.001
output: f'{_input:n}.qqnorm.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint = entrypoint
import numpy as np
import pandas as pd
from sklearn import preprocessing
from scipy.stats import norm
from scipy.stats import rankdata
def qqnorm(x):
n=len(x)
a=3.0/8.0 if n<=10 else 0.5
return(norm.ppf( (rankdata(x,nan_policy="omit")-a)/(n+1.0-2.0*a) ))
#return(norm.ppf( (rankdata(x)-a)/(n+1.0-2.0*a) ))
raw_df = pd.read_csv(f'${_input}',sep = "\t")
valRows = raw_df.iloc[:,4:]
headers = list(raw_df)
drop_list = []
na_limit = len(valRows.columns)*${na_rate}
for index, row in valRows.iterrows():
# drop rows whose PSI is missing in more than na_rate of the samples
if (row.isna().sum()) > na_limit:
drop_list.append(index)
# drop events whose variation (standard deviation) across samples is below min_variance (nanstd ignores remaining NAs)
elif np.nanstd(row) < ${min_variance}:
drop_list.append(index)
# save the event information and sample values for the remaining rows
newtable = raw_df.drop(drop_list).iloc[:,0:4]
valRows = valRows.drop(drop_list)
# scale normalize on each row
valRows_matrix = []
for c in (valRows.values.tolist()):
c = preprocessing.scale(c)
valRows_matrix.append(c)
# qqnorms on the columns
matrix = np.array(valRows_matrix)
for i in range(len(matrix[0,:])):
matrix[:,i] = qqnorm(matrix[:,i])
normalized_table = pd.DataFrame(matrix)
# reset row index for the saved event information so the index will match the sample values
newtable = newtable.reset_index(drop=True)
# merge the two parts of table
output = pd.concat([newtable, normalized_table], axis=1)
output.columns = headers
# write normalized table, avoid scientific notation
output.to_csv(f'${_input:n}.qqnorm.txt', sep="\t", index=None, float_format='%.16f')