Phenotype data imputation#
Description#
Empirical Bayes Matrix Factorization (EBMF) is the primary imputation method we use to impute molecular phenotype data [cf. Qi et al., medRxiv, 2023]. We also provide a collection of other methods to impute missing omics data values including missing forest, XGboost, k-nearest neighbors, soft impute, mean imputation, and limit of detection.
Input#
A molecular phenotype data with missing where first four columns are chr, start, end, and ID. The rest columns are samples.
Output#
A complete molecular phenotype data where first four columns are chr, start, end, and ID. The rest columns are samples.
Minimal Working Example Steps#
ii. Imputation#
Timing < 10 min
!sos run phenotype_imputation.ipynb EBMF \
--phenoFile ../../../output_test/phenotype_by_chrom/protocol_example.protein.bed.gz \
--cwd ../../../output_test/impute \
--prior ebnm_point_laplace --varType 1 \
--container oras://ghcr.io/cumc/factor_analysis_apptainer:latest \
--mem 40G \
-c ../../csg.yml -q neurology
INFO: Running EBMF:
INFO: teca723bbf6582b89 restart from status failed
INFO: teca723bbf6582b89 submitted to neurology with job id Your job 5934007 ("job_teca723bbf6582b89") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: EBMF output: /restricted/projectnb/xqtl/xqtl_protocol/output_test/impute/protocol_example.protein.bed.imputed.bed.gz
INFO: Workflow EBMF (ID=wbd1e90d8b53b039b) is executed successfully with 1 completed step and 1 completed task.
Troubleshooting#
Step |
Substep |
Problem |
Possible Reason |
Solution |
---|---|---|---|---|
Command Interface#
!sos run phenotype_imputation.ipynb -h
usage: sos run phenotype_imputation.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
EBMF
gEBMF
missforest
missxgboost
knn
soft
mean
lod
bed_filter_na
Global Workflow Options:
--cwd output (as path)
Work directory & output directory
--phenoFile VAL (as path, required)
Molecular phenotype matrix
--[no-]qc-prior-to-impute (default to True)
QC before imputation to remove unqualified features
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 72h
Wall clock time expected
--mem 16G
Memory expected
--numThreads 20 (as int)
Number of threads
--container ''
--entrypoint ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
Sections
EBMF:
Workflow Options:
--prior 'ebnm_point_laplace'
prior distribution of loadings and factors
--varType 1
--num-factor 60
gEBMF:
Workflow Options:
--nCores 1
--num-factor 60
--backfit-iter 3
--[no-]save-flash (default to True)
--[no-]null-check (default to False)
missforest:
missxgboost:
knn:
soft:
mean:
lod:
bed_filter_na:
Workflow Options:
--rank-max 50 (as int)
--lambda-hyp 30 (as int)
--impute-method soft
--tol-missing 0.05 (as float)
Tolerance of missingness rows with missing rate larger
than tol_missing will be removed, with missing rate
smaller than tol_missing will be mean_imputed. Say if we
want to keep rows with less than 5% missing, then we use
0.05 as tol_missing.
Setup and global parameters#
[global]
# Work directory & output directory
parameter: cwd = path("output")
# Molecular phenotype matrix
parameter: phenoFile = path
# QC before imputation to remove unqualified features
parameter: qc_prior_to_impute = True
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "72h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
[EBMF]
# prior distribution of loadings and factors
parameter: prior = "ebnm_point_laplace"
parameter: varType = '1'
parameter: num_factor = '60'
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}', cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
library("flashier")
library("tibble")
library("readr")
library("dplyr")
#library("irlba")
library("softImpute")
pheno <- read_delim(${_input:ar}, delim = "\t")
# Get index of features that have > 40% missing
miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4)
# Get index of features where more than 95% values are zero.
value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
miss.ind <- c(miss.ind, value0.ind)
# remove these rows if qc_prior_to_impute = True
if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
if (length(miss.ind) > 0) {
pheno_qc <- pheno[-miss.ind, ]
} else {
pheno_qc <- pheno
}
pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
} else {
pheno_NAs <- pheno[, 5:ncol(pheno)]
}
# Tunning the parameter
lambda <- 30
rank <- 50
if (nrow(pheno_NAs) < 1500) {
lambda <- 5
} else {
lambda <- 30
}
# Impute
X_mis_C <- as(as.matrix(pheno_NAs), "Incomplete")
###uses "svd" algorithm
fit1 <- softImpute::softImpute(X_mis_C,rank = rank,lambda = lambda,type = "svd")
pheno_soft <- softImpute::complete(as.matrix(pheno_NAs),fit1)
min_sd <- min(apply(pheno_soft, 1, sd))
pca_res <- irlba::irlba(pheno_soft, nv = ${num_factor})
pca_res <- list(pca_res$d, pca_res$u, pca_res$v)
names(pca_res) <- c('d', 'u', 'v')
fl_pca <- flash_init(as.matrix(pheno_NAs), S = min_sd, var_type = ${varType}) |>
flash_factors_init(pca_res, ebnm_fn = ${prior}) |>
flash_backfit(maxiter = 300)
pheno.imp <- ifelse(is.na(as.matrix(pheno_NAs)), fitted(fl_pca), as.matrix(pheno_NAs))
if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
write_delim(cbind(pheno_qc[, 1:4], pheno.imp), "${_output:n}", delim = "\t" )
} else {
if (length(miss.ind) > 0) {
pheno.imp.qc <- pheno.imp[-miss.ind, ]
write_delim(cbind(pheno[-miss.ind, 1:4], pheno.imp.qc), "${_output:n}", delim = "\t" )
} else {
pheno.imp.qc <- pheno.imp
write_delim(cbind(pheno[, 1:4], pheno.imp.qc), "${_output:n}", delim = "\t" )
}
}
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
bgzip -f ${_output:n}
tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output:n].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_column:" `cat $i | head -1 | wc -w ` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
[gEBMF]
parameter: nCores = '1'
parameter: num_factor = '60'
parameter: backfit_iter = '3'
parameter: save_flash = True
parameter: null_check = False
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}', cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
library("flashier")
library("tibble")
library("readr")
library("dplyr")
library("snow")
library("softImpute")
pheno <- read_delim(${_input:ar}, delim = "\t")
# Get index of features that have > 40% missing
miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4)
# Get index of features where more than 95% values are zero.
value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
miss.ind <- c(miss.ind, value0.ind)
# remove these rows if qc_prior_to_impute = True
if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
if (length(miss.ind) > 0) {
pheno_qc <- pheno[-miss.ind, ]
} else {
pheno_qc <- pheno
}
pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
} else {
pheno_NAs <- pheno[, 5:ncol(pheno)]
}
pheno_range <- range(unlist(as.list(as.data.frame(pheno_NAs))), na.rm = TRUE)
if (pheno_range[1] >= 0 && pheno_range[2] <= 1) {
dat <- qnorm(as.matrix(pheno_NAs))
}else{
dat <- as.matrix(pheno_NAs)
}
rm(pheno_NAs)
groups <- pheno$`#chr`
null_check <- ${"TRUE" if null_check else "FALSE"}
cl <- flash_init_cluster_for_grouped_data(dat, groups, ${nCores} , K = ${num_factor})
for (i in seq_len(${backfit_iter})) {
flash_backfit_grouped_data(cl, maxiter = 10)
flash_impute_grouped_data(cl)
}
fl <- flash_recover_fl_object_from_cluster(cl)
snow::stopCluster(cl)
rm(cl)
if(${"TRUE" if null_check else "FALSE"}){
fl <- flash_nullcheck(fl)
}
if(${"TRUE" if save_flash else "FALSE"}){
saveRDS(fl,paste0('${_output:nnn}', '.gEBMF_factors.rds'))
}
x_imp <- ifelse(is.na(dat), fitted(fl), dat)
#message('done.')
if (pheno_range[1] >= 0 && pheno_range[2] <= 1) {
pheno_imp <- pnorm(x_imp)
}else{
pheno_imp <- x_imp
}
write_delim(cbind(pheno_qc[, 1:4], pheno_imp), "${_output:n}", delim = "\t")
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
bgzip -f ${_output:n}
tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output:n].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_column:" `cat $i | head -1 | wc -w ` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
[missforest]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
library(flashier)
library("tibble")
library("readr")
library("dplyr")
library("missForest")
pheno <- read_delim(${_input:ar}, delim = "\t")
# Get index of features that have > 40% missing
miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4)
# Get index of features where more than 95% values are zero.
value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
miss.ind <- c(miss.ind, value0.ind)
# remove these rows if qc_prior_to_impute = True
if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
if (length(miss.ind) > 0) {
pheno_qc <- pheno[-miss.ind, ]
} else {
pheno_qc <- pheno
}
pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
} else {
pheno_NAs <- pheno[, 5:ncol(pheno)]
}
pheno_imp <- missForest(as.matrix(pheno_NAs), parallelize = 'no')$ximp
# If ’variables’ the data is split into pieces of the size equal to the number of cores registered in the parallel backend.
write_delim(cbind(pheno_qc[, 1:4], pheno_imp), "${_output:n}", delim = "\t" )
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
bgzip -f ${_output:n}
tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output:n].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_column:" `cat $i | head -1 | wc -w ` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
[missxgboost]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
source('./xgb_imp.R')
library("tibble")
library("readr")
library("dplyr")
library("missForest")
pheno <- read_delim(${_input:ar}, delim = "\t")
# Get index of features that have > 40% missing
miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4)
# Get index of features where more than 95% values are zero.
value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
miss.ind <- c(miss.ind, value0.ind)
# remove these rows if qc_prior_to_impute = True
if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
if (length(miss.ind) > 0) {
pheno_qc <- pheno[-miss.ind, ]
} else {
pheno_qc <- pheno
}
pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
} else {
pheno_NAs <- pheno[, 5:ncol(pheno)]
}
pheno_imp <- xgboost_imputation(as.matrix(pheno_NAs))
write_delim(cbind(pheno_qc[, 1:4], pheno_imp), "${_output:n}", delim = "\t" )
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
bgzip -f ${_output:n}
tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output:n].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_column:" `cat $i | head -1 | wc -w ` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
[knn]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
library(impute)
library("tibble")
library("readr")
library("dplyr")
pheno <- read_delim(${_input:ar}, delim = "\t")
# Get index of features that have > 40% missing
miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4)
# Get index of features where more than 95% values are zero.
value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
miss.ind <- c(miss.ind, value0.ind)
# remove these rows if qc_prior_to_impute = True
if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
if (length(miss.ind) > 0) {
pheno_qc <- pheno[-miss.ind, ]
} else {
pheno_qc <- pheno
}
pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
} else {
pheno_NAs <- pheno[, 5:ncol(pheno)]
}
pheno_NAs = pheno_NAs %>% select(where(~mean(is.na(.)) < 0.8))
# I removed columns that have more than 80 % missing values, cuz knn cannot impute them.
pheno_imp <- impute.knn(as.matrix(pheno_NAs), rowmax = 1)$data
write_delim(cbind(pheno_qc[, 1:4], pheno_imp), "${_output:n}", delim = "\t" )
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
bgzip -f ${_output:n}
tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output:n].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_column:" `cat $i | head -1 | wc -w ` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
[soft]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
library(softImpute)
library("tibble")
library("readr")
library("dplyr")
pheno <- read_delim(${_input:ar}, delim = "\t")
# Get index of features that have > 40% missing
miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4)
# Get index of features where more than 95% values are zero.
value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
miss.ind <- c(miss.ind, value0.ind)
# remove these rows if qc_prior_to_impute = True
if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
if (length(miss.ind) > 0) {
pheno_qc <- pheno[-miss.ind, ]
} else {
pheno_qc <- pheno
}
pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
} else {
pheno_NAs <- pheno[, 5:ncol(pheno)]
}
X_mis_C <- as(as.matrix(pheno_NAs), "Incomplete")
###uses "svd" algorithm
fit <- softImpute(X_mis_C,rank = 50,lambda = 30,type = "svd")
pheno_imp <- complete(as.matrix(pheno_NAs),fit)
write_delim(cbind(pheno_qc[, 1:4], pheno_imp), "${_output:n}", delim = "\t" )
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
bgzip -f ${_output:n}
tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output:n].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_column:" `cat $i | head -1 | wc -w ` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
[mean]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
library("tibble")
library("readr")
library("dplyr")
pheno <- read_delim(${_input:ar}, delim = "\t")
# Get index of features that have > 40% missing
miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4)
# Get index of features where more than 95% values are zero.
value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
miss.ind <- c(miss.ind, value0.ind)
# remove these rows if qc_prior_to_impute = True
if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
if (length(miss.ind) > 0) {
pheno_qc <- pheno[-miss.ind, ]
} else {
pheno_qc <- pheno
}
pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
} else {
pheno_NAs <- pheno[, 5:ncol(pheno)]
}
Yfill <- as.matrix(pheno_NAs)
for (t.row in 1:nrow(pheno_NAs)) {
Yfill[t.row, is.na(Yfill[t.row,])] <- rowMeans(Yfill, na.rm = TRUE)[t.row]
}
write_delim(cbind(pheno_qc[, 1:4], Yfill), "${_output:n}", delim = "\t" )
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
bgzip -f ${_output:n}
tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output:n].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_column:" `cat $i | head -1 | wc -w ` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
[lod]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
library("tibble")
library("readr")
library("dplyr")
pheno <- read_delim(${_input:ar}, delim = "\t")
# Get index of features that have > 40% missing
miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4)
# Get index of features where more than 95% values are zero.
value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
miss.ind <- c(miss.ind, value0.ind)
# remove these rows if qc_prior_to_impute = True
if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
if (length(miss.ind) > 0) {
pheno_qc <- pheno[-miss.ind, ]
} else {
pheno_qc <- pheno
}
pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
} else {
pheno_NAs <- pheno[, 5:ncol(pheno)]
}
Yfill <- as.matrix(pheno_NAs)
for (t.row in 1:nrow(pheno_NAs)) {
Yfill[t.row, is.na(Yfill[t.row,])] <- min(Yfill[t.row, ], na.rm = TRUE)
}
write_delim(cbind(pheno_qc[, 1:4], Yfill), "${_output:n}", delim = "\t" )
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
bgzip -f ${_output:n}
tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output:n].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_column:" `cat $i | head -1 | wc -w ` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
[bed_filter_na]
parameter: rank_max = 50 # max rank estimated in the per-chr methyl matrix
parameter: lambda_hyp = 30 # hyper par, indicating the importance of the nuclear norm
parameter: impute_method = "soft"
# Tolerance of missingness rows with missing rate larger than tol_missing will be removed,
# with missing rate smaller than tol_missing will be mean_imputed. Say if we want to keep rows with less than 5% missing, then we use 0.05 as tol_missing.
parameter: tol_missing = 0.05
input: phenoFile
output: f'{cwd:a}/{_input:nn}.filter_na.{impute_method}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
library("dplyr")
library("tibble")
library("readr")
library(softImpute)
compute_missing <- function(mtx){
miss <- sum(is.na(mtx))/length(mtx)
return(miss)
}
mean_impute <- function(mtx){
f <- apply(mtx, 2, function(x) mean(x,na.rm = TRUE))
for (i in 1:length(f)) mtx[,i][which(is.na(mtx[,i]))] <- f[i]
return(mtx)
}
soft_impute <- function(){
f <- apply(mtx, 2, function(x) mean(x,na.rm = TRUE))
for (i in 1:length(f)) mtx[,i][which(is.na(mtx[,i]))] <- f[i]
return(mtx)
}
filter_mtx <- function(X, missing_rate_thresh) {
rm_col <- which(apply(X, 2, compute_missing) > missing_rate_thresh)
if (length(rm_col)) X <- X[, -rm_col]
return((X))
}
bed = read_delim("${_input}")
mtx = bed[,5:ncol(bed)]%>%as.matrix
rownames(mtx) = bed[,4]%>%unlist()
tbl_filtered = filter_mtx(mtx%>%t(),${tol_missing})
if ( "${impute_method}" == "mean" ){
tbl_filtered = tbl_filtered%>%mean_impute()%>%t()
} else if ("${impute_method}" == "soft"){
tbl_filtered_C= as(t(tbl_filtered),"Incomplete")
fit=softImpute(tbl_filtered_C,rank=${rank_max},lambda=${lambda_hyp},type="svd")
tbl_filtered = complete(t(tbl_filtered),fit)
}
tbl_filtered = tbl_filtered%>%as_tibble(rownames = colnames(bed)[4])
bed_filtered = inner_join(bed[,1:4],tbl_filtered)
bed_filtered%>%write_delim("${_output:n}", "\t" )
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
bgzip -f ${_output:n}
tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_column:" `zcat $i | grep -v "##" | head -1 | wc -w ` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
The resource usage for softimputing 450K methylation data are as followed:
```
time elapsed: 880.90s
peak first occurred: 152.11s
peak last occurred: 175.41s
max vms_memory: 38.95GB
max rss_memory: 34.35GB
memory check interval: 1s
return code: 0
```