Phenotype data imputation

Contents

Phenotype data imputation#

Description#

Empirical Bayes Matrix Factorization (EBMF) is the primary imputation method we use to impute molecular phenotype data [cf. Qi et al., medRxiv, 2023]. We also provide a collection of other methods to impute missing omics data values including missing forest, XGboost, k-nearest neighbors, soft impute, mean imputation, and limit of detection.

Input#

A molecular phenotype data with missing where first four columns are chr, start, end, and ID. The rest columns are samples.

Output#

A complete molecular phenotype data where first four columns are chr, start, end, and ID. The rest columns are samples.

Minimal Working Example Steps#

ii. Imputation#

Timing < 10 min

!sos run phenotype_imputation.ipynb EBMF \
    --phenoFile ../../../output_test/phenotype_by_chrom/protocol_example.protein.bed.gz \
    --cwd ../../../output_test/impute \
    --prior ebnm_point_laplace --varType 1 \
    --container oras://ghcr.io/cumc/factor_analysis_apptainer:latest \
    --mem 40G \
    -c ../../csg.yml  -q neurology

INFO: Running EBMF: 
INFO: teca723bbf6582b89 restart from status failed
INFO: teca723bbf6582b89 submitted to neurology with job id Your job 5934007 ("job_teca723bbf6582b89") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: EBMF output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/impute/protocol_example.protein.bed.imputed.bed.gz
INFO: Workflow EBMF (ID=wbd1e90d8b53b039b) is executed successfully with 1 completed step and 1 completed task.

Troubleshooting#

Step	Substep	Problem	Possible Reason	Solution

Command Interface#

!sos run phenotype_imputation.ipynb -h

usage: sos run phenotype_imputation.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  EBMF
  gEBMF
  missforest
  missxgboost
  knn
  soft
  mean
  lod
  bed_filter_na

Global Workflow Options:
  --cwd output (as path)
                        Work directory & output directory
  --phenoFile VAL (as path, required)
                        Molecular phenotype matrix
  --[no-]qc-prior-to-impute (default to True)
                        QC before imputation to remove unqualified features
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 72h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 20 (as int)
                        Number of threads
  --container ''
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""


Sections
  EBMF:
    Workflow Options:
      --prior 'ebnm_point_laplace'
                        prior distribution of loadings and factors
      --varType 1
      --num-factor 60
  gEBMF:
    Workflow Options:
      --nCores 1
      --num-factor 60
      --backfit-iter 3
      --[no-]save-flash (default to True)
      --[no-]null-check (default to False)
  missforest:
  missxgboost:
  knn:
  soft:
  mean:
  lod:
  bed_filter_na:
    Workflow Options:
      --rank-max 50 (as int)
      --lambda-hyp 30 (as int)
      --impute-method soft
      --tol-missing 0.05 (as float)
                        Tolerance of missingness rows with missing rate larger
                        than tol_missing will be removed, with missing rate
                        smaller than tol_missing will be mean_imputed. Say if we
                        want to keep rows with less than 5% missing, then we use
                        0.05 as tol_missing.

Setup and global parameters#

[global]
# Work directory & output directory
parameter: cwd = path("output")
# Molecular phenotype matrix
parameter: phenoFile = path
# QC before imputation to remove unqualified features 
parameter: qc_prior_to_impute = True
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "72h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""

[EBMF]
# prior distribution of loadings and factors
parameter: prior = "ebnm_point_laplace"
parameter: varType = '1'
parameter: num_factor = '60'
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}', cores = numThreads  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library("flashier")
   library("tibble")
   library("readr")
   library("dplyr")
   #library("irlba")
   library("softImpute")
  
    pheno <- read_delim(${_input:ar}, delim = "\t")
    # Get index of features that have > 40% missing 
    miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4) 
    # Get index of features where more than 95% values are zero.
    value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
    miss.ind <- c(miss.ind, value0.ind)
    # remove these rows if qc_prior_to_impute = True
    if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
      if (length(miss.ind) > 0) {
         pheno_qc <- pheno[-miss.ind, ]
       } else {
         pheno_qc <- pheno
       }
      pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
    } else {
      pheno_NAs <- pheno[, 5:ncol(pheno)]
    }
    # Tunning the parameter
    lambda <- 30
    rank <- 50
    if (nrow(pheno_NAs) < 1500) {
      lambda <- 5
    } else {
      lambda <- 30
    }  
    # Impute
    X_mis_C <- as(as.matrix(pheno_NAs), "Incomplete")
    ###uses "svd" algorithm
    fit1 <- softImpute::softImpute(X_mis_C,rank = rank,lambda = lambda,type = "svd")
    pheno_soft <- softImpute::complete(as.matrix(pheno_NAs),fit1)
    min_sd <- min(apply(pheno_soft, 1, sd))
    pca_res <- irlba::irlba(pheno_soft, nv = ${num_factor})
    pca_res <- list(pca_res$d, pca_res$u, pca_res$v)
    names(pca_res) <- c('d', 'u', 'v')
    fl_pca <- flash_init(as.matrix(pheno_NAs), S = min_sd, var_type = ${varType}) |>
      flash_factors_init(pca_res, ebnm_fn = ${prior}) |>
      flash_backfit(maxiter = 300)
    pheno.imp <- ifelse(is.na(as.matrix(pheno_NAs)), fitted(fl_pca), as.matrix(pheno_NAs))
    if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
      write_delim(cbind(pheno_qc[, 1:4], pheno.imp), "${_output:n}", delim = "\t" )
    } else {
      if (length(miss.ind) > 0) {
         pheno.imp.qc <- pheno.imp[-miss.ind, ]
         write_delim(cbind(pheno[-miss.ind, 1:4], pheno.imp.qc), "${_output:n}", delim = "\t" )
      } else {
         pheno.imp.qc <- pheno.imp
         write_delim(cbind(pheno[, 1:4], pheno.imp.qc), "${_output:n}", delim = "\t" )
      }
    } 

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}

bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `cat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[gEBMF]
parameter: nCores = '1'
parameter: num_factor = '60'
parameter: backfit_iter = '3'
parameter: save_flash = True
parameter: null_check = False
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}', cores = numThreads  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library("flashier")
   library("tibble")
   library("readr")
   library("dplyr")
   library("snow")
   library("softImpute")
  
    pheno <- read_delim(${_input:ar}, delim = "\t")
    # Get index of features that have > 40% missing 
    miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4) 
    # Get index of features where more than 95% values are zero.
    value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
    miss.ind <- c(miss.ind, value0.ind)
    # remove these rows if qc_prior_to_impute = True
    if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
      if (length(miss.ind) > 0) {
         pheno_qc <- pheno[-miss.ind, ]
       } else {
         pheno_qc <- pheno
       }
      pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
    } else {
      pheno_NAs <- pheno[, 5:ncol(pheno)]
    }
    pheno_range <- range(unlist(as.list(as.data.frame(pheno_NAs))), na.rm = TRUE) 
    if (pheno_range[1] >= 0 && pheno_range[2] <= 1) {
        dat <- qnorm(as.matrix(pheno_NAs))
    }else{
        dat <- as.matrix(pheno_NAs)
    }
    rm(pheno_NAs)
    groups <- pheno$`#chr`
    null_check <- ${"TRUE" if null_check else "FALSE"}
    cl <- flash_init_cluster_for_grouped_data(dat, groups, ${nCores} , K = ${num_factor})
    for (i in seq_len(${backfit_iter})) {
        flash_backfit_grouped_data(cl, maxiter = 10)
        flash_impute_grouped_data(cl)
      }
    fl <- flash_recover_fl_object_from_cluster(cl)
    snow::stopCluster(cl)
    rm(cl)

    if(${"TRUE" if null_check else "FALSE"}){
        fl <- flash_nullcheck(fl)
      }

    if(${"TRUE" if save_flash else "FALSE"}){
        saveRDS(fl,paste0('${_output:nnn}', '.gEBMF_factors.rds'))
      }

    x_imp <- ifelse(is.na(dat), fitted(fl), dat)
    #message('done.')
    if (pheno_range[1] >= 0 && pheno_range[2] <= 1) {
        pheno_imp <- pnorm(x_imp)
    }else{
        pheno_imp <- x_imp
    }
  
    write_delim(cbind(pheno_qc[, 1:4], pheno_imp), "${_output:n}", delim = "\t")

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}

bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `cat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[missforest]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library(flashier)
   library("tibble")
   library("readr")
   library("dplyr")
   library("missForest")

    pheno <- read_delim(${_input:ar}, delim = "\t")
    # Get index of features that have > 40% missing 
    miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4) 
    # Get index of features where more than 95% values are zero.
    value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
    miss.ind <- c(miss.ind, value0.ind)
    # remove these rows if qc_prior_to_impute = True
    if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
      if (length(miss.ind) > 0) {
         pheno_qc <- pheno[-miss.ind, ]
       } else {
         pheno_qc <- pheno
       }
      pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
    } else {
      pheno_NAs <- pheno[, 5:ncol(pheno)]
    }
    pheno_imp <- missForest(as.matrix(pheno_NAs), parallelize = 'no')$ximp
    # If ’variables’ the data is split into pieces of the size equal to the number of cores registered in the parallel backend.
    write_delim(cbind(pheno_qc[, 1:4], pheno_imp), "${_output:n}", delim = "\t" )

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}

bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `cat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[missxgboost]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   source('./xgb_imp.R')
   library("tibble")
   library("readr")
   library("dplyr")
   library("missForest")

    pheno <- read_delim(${_input:ar}, delim = "\t")
    # Get index of features that have > 40% missing 
    miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4) 
    # Get index of features where more than 95% values are zero.
    value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
    miss.ind <- c(miss.ind, value0.ind)
    # remove these rows if qc_prior_to_impute = True
    if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
      if (length(miss.ind) > 0) {
         pheno_qc <- pheno[-miss.ind, ]
       } else {
         pheno_qc <- pheno
       }
      pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
    } else {
      pheno_NAs <- pheno[, 5:ncol(pheno)]
    }
    pheno_imp <- xgboost_imputation(as.matrix(pheno_NAs))
    write_delim(cbind(pheno_qc[, 1:4], pheno_imp), "${_output:n}", delim = "\t" )

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `cat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[knn]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library(impute)
   library("tibble")
   library("readr")
   library("dplyr")
    
    pheno <- read_delim(${_input:ar}, delim = "\t")
    # Get index of features that have > 40% missing 
    miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4) 
    # Get index of features where more than 95% values are zero.
    value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
    miss.ind <- c(miss.ind, value0.ind)
    # remove these rows if qc_prior_to_impute = True
    if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
      if (length(miss.ind) > 0) {
         pheno_qc <- pheno[-miss.ind, ]
       } else {
         pheno_qc <- pheno
       }
      pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
    } else {
      pheno_NAs <- pheno[, 5:ncol(pheno)]
    }
    pheno_NAs = pheno_NAs %>% select(where(~mean(is.na(.)) < 0.8))
    # I removed columns that have more than 80 % missing values, cuz knn cannot impute them. 
    pheno_imp <- impute.knn(as.matrix(pheno_NAs), rowmax = 1)$data
    write_delim(cbind(pheno_qc[, 1:4], pheno_imp), "${_output:n}", delim = "\t" )

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `cat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[soft]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library(softImpute)
   library("tibble")
   library("readr")
   library("dplyr")
    
    pheno <- read_delim(${_input:ar}, delim = "\t")
    # Get index of features that have > 40% missing 
    miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4) 
    # Get index of features where more than 95% values are zero.
    value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
    miss.ind <- c(miss.ind, value0.ind)
    # remove these rows if qc_prior_to_impute = True
    if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
      if (length(miss.ind) > 0) {
         pheno_qc <- pheno[-miss.ind, ]
       } else {
         pheno_qc <- pheno
       }
      pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
    } else {
      pheno_NAs <- pheno[, 5:ncol(pheno)]
    }
    X_mis_C <- as(as.matrix(pheno_NAs), "Incomplete")
    ###uses "svd" algorithm
    fit <- softImpute(X_mis_C,rank = 50,lambda = 30,type = "svd")
    pheno_imp <- complete(as.matrix(pheno_NAs),fit)
    write_delim(cbind(pheno_qc[, 1:4], pheno_imp), "${_output:n}", delim = "\t" )

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `cat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[mean]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library("tibble")
   library("readr")
   library("dplyr")
    
    pheno <- read_delim(${_input:ar}, delim = "\t")
    # Get index of features that have > 40% missing 
    miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4) 
    # Get index of features where more than 95% values are zero.
    value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
    miss.ind <- c(miss.ind, value0.ind)
    # remove these rows if qc_prior_to_impute = True
    if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
      if (length(miss.ind) > 0) {
         pheno_qc <- pheno[-miss.ind, ]
       } else {
         pheno_qc <- pheno
       }
      pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
    } else {
      pheno_NAs <- pheno[, 5:ncol(pheno)]
    }
    Yfill <- as.matrix(pheno_NAs)
    for (t.row in 1:nrow(pheno_NAs)) {
        Yfill[t.row, is.na(Yfill[t.row,])] <- rowMeans(Yfill, na.rm = TRUE)[t.row] 
    }    
    write_delim(cbind(pheno_qc[, 1:4], Yfill), "${_output:n}", delim = "\t" )

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `cat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[lod]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library("tibble")
   library("readr")
   library("dplyr")
    
    pheno <- read_delim(${_input:ar}, delim = "\t")
    # Get index of features that have > 40% missing 
    miss.ind <- which(rowSums(is.na(pheno[, 5:ncol(pheno)]))/(ncol(pheno) -4) > 0.4) 
    # Get index of features where more than 95% values are zero.
    value0.ind <- which(rowSums((pheno[, 5:ncol(pheno)] == 0 | is.na(pheno[, 5:ncol(pheno)])))/(ncol(pheno) -4) > 0.95)
    miss.ind <- c(miss.ind, value0.ind)
    # remove these rows if qc_prior_to_impute = True
    if (${"TRUE" if qc_prior_to_impute else "FALSE"}) {
      if (length(miss.ind) > 0) {
         pheno_qc <- pheno[-miss.ind, ]
       } else {
         pheno_qc <- pheno
       }
      pheno_NAs <- pheno_qc[, 5:ncol(pheno_qc)]
    } else {
      pheno_NAs <- pheno[, 5:ncol(pheno)]
    }
    Yfill <- as.matrix(pheno_NAs)
    for (t.row in 1:nrow(pheno_NAs)) {
        Yfill[t.row, is.na(Yfill[t.row,])] <- min(Yfill[t.row, ], na.rm = TRUE)
    }
    write_delim(cbind(pheno_qc[, 1:4], Yfill), "${_output:n}", delim = "\t" )

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `cat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[bed_filter_na]
parameter: rank_max = 50 # max rank estimated in the per-chr methyl matrix
parameter: lambda_hyp = 30 # hyper par, indicating the importance of the nuclear norm
parameter: impute_method = "soft"
# Tolerance of missingness rows with missing rate larger than tol_missing will be removed,
# with missing rate smaller than tol_missing will be mean_imputed. Say if we want to keep rows with less than 5% missing, then we use 0.05 as tol_missing.
parameter: tol_missing = 0.05
input: phenoFile
output: f'{cwd:a}/{_input:nn}.filter_na.{impute_method}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
   library("dplyr")
   library("tibble")
   library("readr")
   library(softImpute)
   compute_missing <- function(mtx){
          miss <- sum(is.na(mtx))/length(mtx)
          return(miss)
        }

        mean_impute <- function(mtx){
          f <- apply(mtx, 2, function(x) mean(x,na.rm = TRUE))
          for (i in 1:length(f)) mtx[,i][which(is.na(mtx[,i]))] <- f[i]
          return(mtx)
        }
    
        soft_impute <- function(){
          f <- apply(mtx, 2, function(x) mean(x,na.rm = TRUE))
          for (i in 1:length(f)) mtx[,i][which(is.na(mtx[,i]))] <- f[i]
          return(mtx)
        }
  
  
        filter_mtx <- function(X, missing_rate_thresh) {
            rm_col <- which(apply(X, 2, compute_missing) > missing_rate_thresh)
            if (length(rm_col)) X <- X[, -rm_col]
            return((X))
        }  
  
    bed = read_delim("${_input}")
    mtx = bed[,5:ncol(bed)]%>%as.matrix
    rownames(mtx) = bed[,4]%>%unlist()
    tbl_filtered = filter_mtx(mtx%>%t(),${tol_missing})
    if ( "${impute_method}" == "mean" ){
    tbl_filtered = tbl_filtered%>%mean_impute()%>%t()
     } else if ("${impute_method}" == "soft"){ 
      tbl_filtered_C= as(t(tbl_filtered),"Incomplete")
      fit=softImpute(tbl_filtered_C,rank=${rank_max},lambda=${lambda_hyp},type="svd")
      tbl_filtered = complete(t(tbl_filtered),fit)
    }
    tbl_filtered = tbl_filtered%>%as_tibble(rownames = colnames(bed)[4])  
    bed_filtered = inner_join(bed[,1:4],tbl_filtered)
    bed_filtered%>%write_delim("${_output:n}", "\t" )
  
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | grep -v "##"   | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

The resource usage for softimputing 450K methylation data are as followed:

``` 
time elapsed: 880.90s
peak first occurred: 152.11s
peak last occurred: 175.41s
max vms_memory: 38.95GB
max rss_memory: 34.35GB
memory check interval: 1s
return code: 0
```