Covariate Data Formatting

Contents

Covariate Data Formatting#

Description#

Our covariate preprocessing steps merge genotypic principal components and fixed covariate files into one file for downstream QTL analysis.

Input#

PCA file as output from the PCA module
Fixed covariate files

Output#

PCA + Covariate file

Minimal Working Example Steps#

The data and singularity used in this minimal working example can be found on Synapse.

i. Merge Covariates and Genotype PCs#

Timing: <1 min

You can edit the total amount of variation you want your PCs to explain by editing the --k parameter. In this example, we chose 80%.

sos run pipeline/covariate_formatting.ipynb merge_genotype_pc \
    --cwd output/covariate \
    --pcaFile output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.rds \
    --covFile  input/protocol_example.samples.tsv \
    --tol_cov 0.4  \
    --k `awk '$3 < 0.8' output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.scree.txt | tail -1 | cut -f 1 ` \
    --container containers/bioinfo.sif

INFO: Running merge_genotype_pc:
INFO: merge_genotype_pc is completed.
INFO: merge_genotype_pc output:   /Users/alexmccreight/xqtl-protocol/output/covariate/protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.gz
INFO: Workflow merge_genotype_pc (ID=weaef0ce301340b3d) is executed successfully with 1 completed step.

Troubleshooting#

Step	Substep	Problem	Possible Reason	Solution

Command Interface#

!sos run covariate_formatting.ipynb -h

usage: sos run covariate_formatting.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  merge_genotype_pc

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files.
  --covFile VAL (as path, required)
                        The covariate file
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 2G
                        Memory expected
  --numThreads 8 (as int)
                        Number of threads
  --container ''
                        Software container option
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""


Sections
  merge_genotype_pc:
    Workflow Options:
      --pcaFile VAL (as path, required)
                        An RDS file as the output of the genotype PCA module
      --k 20 (as int)
                        The number of PCs to retain, by default is 20, in
                        practice can be the number of PC that captured more than
                        70% PVE
      --name  f'{covFile:bn}.{pcaFile:bn}'

      --outliersFile . (as path)
                        Outliers
      --[no-]remove-outliers (default to False)
      --tol-cov -1.0 (as float)
                        Tolerance of missingness in covariates, -1 means do
                        nothing, otherwise for samples with covariates missing
                        rate larger than tol_cov will be removed, with missing
                        rate smaller than tol_cov will be kept.
      --[no-]mean-impute (default to True)

Setup and global parameters#

[global]
# The output directory for generated files. 
parameter: cwd = path("output")
# The covariate file
parameter: covFile = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "2G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
cwd = path(f"{cwd:a}")

Step 0: Merge Covariates and Genotype PCs#

[merge_genotype_pc]
# An RDS file as the output of the genotype PCA module
parameter: pcaFile = path
# The number of PCs to retain, by default is 20, in practice can be the number of PC that captured more than 70% PVE
parameter: k = 20
parameter: name = f'{covFile:bn}.{pcaFile:bn}'
# Outliers
parameter: outliersFile = path(".") 
parameter: remove_outliers = False
# Tolerance of missingness in covariates, -1 means do nothing, otherwise for samples with covariates missing rate larger than tol_cov will be removed,
# with missing rate smaller than tol_cov will be kept.
parameter: tol_cov = -1.0 
parameter: mean_impute = True
stop_if(remove_outliers and not outliersFile.is_file(), msg = "No outliers file specified, please add outliers file or remove the remove-outliers flag")
input: pcaFile, covFile
output:  f'{cwd:a}/{name}.gz'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
        library("dplyr")
        library("readr")
        library("data.table")

        compute_missing <- function(mtx){
          miss <- sum(is.na(mtx))/length(mtx)
          return(miss)
        }

        mean_impute <- function(mtx){
          f <- apply(mtx, 2, function(x) mean(x,na.rm = TRUE))
          for (i in 1:length(f)) mtx[,i][which(is.na(mtx[,i]))] <- f[i]
          return(mtx)
        }

        filter_mtx <- function(X, missing_rate_thresh) {
            rm_col <- which(apply(X, 2, compute_missing) > missing_rate_thresh)
            if (length(rm_col)) X <- X[, -rm_col]
            return($['mean_impute(X)' if mean_impute else 'X'])
        }  

        pca_output = readRDS("$[_input[0]]")$pc_scores
        mtx = pca_output%>%select(contains("PC"))%>%t()
        colnames(mtx) <- pca_output$IID
        ## Keep only the number of PCs specified
        mtx = mtx[1:$[k],]
        mtx = mtx%>%as_tibble(rownames = "#id")
        ## Load covariates
        covt = transpose(fread("$[_input[1]]", head=T), keep.names="#id", make.names=1) %>% as_tibble()

        ## Retaining only the overlapped samples with genotype
        overlap = intersect(colnames(covt),colnames(mtx))
        
        ## Report sample counts:
        print(paste(ncol(covt) - 1, "samples are in the covariate file", sep = " "))
        print(paste(ncol(mtx), "samples are in the PCA file", sep = " "))

        ## Report identical samples:
        print(paste(length(overlap) - 1, "samples overlap between covariate & PCA files and are included in the analysis:", sep = " "))
        print(overlap[!overlap == '#id'])

        ## Report non-overlapping samples:
        cov_missing = covt %>% select(-all_of(overlap))
        print(paste(ncol(cov_missing), "samples in the covariate file are missing from the PCA file:", sep = " "))
        print(colnames(cov_missing))

        pca_missing = mtx %>% select(-all_of(overlap))
        print(paste(ncol(pca_missing), "samples in the PCA file are missing from the covariate file:", sep = " "))
        print(colnames(pca_missing))
        
        ## Removal of outlier if needed
        if ($["TRUE" if remove_outliers else "FALSE"]){
              outlier = fread("$[outliersFile]", head = FALSE)$V2 %>% as_tibble()
              overlap = setdiff(overlap,outlier)
              }
        covt = covt%>%select(all_of(overlap))

        mtx = mtx%>%select(all_of(overlap))
        output = bind_rows(covt,mtx)

        ## Handle missingess in covariates
        if($[tol_cov] == -1){if(sum(is.na(output)) > 0 ){ stop("NA in covariates input: Check input file or set parameter tol_cov to allow for removing missing values; mean_impute to allow for imputing missing values")}}
        output = output%>%as.data.frame
        rownames(output) = output$`#id`
        output = filter_mtx(output[,2:ncol(output)],$[tol_cov])%>%as_tibble(rownames = "#id")
        output%>%write_delim("$[_output]","\t")

bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
        stdout=$[_output:n].stdout 
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done