Sample level RNA-seq quality control

Sample level RNA-seq quality control#

Description#

Quality control of the TPM matrices follows methods outlined by GTEx V8. First, genes are removed from the matrices if over 20% of samples have have a TPM expression level of 10% or less. Sample level filtering includes three checks to detect sample outliers. Samples are only removed if they are marked as outliers in all three checks.

The first looks at Relative Log Expression (RLE). It is assumed that most gene expression values in a sample should be near a mean value and only a few genes are differentially expressed. To calculate RLE, for each gene i, calculate its median in N samples as Medi. Then for each sample j and expression value eij, count the difference between eij and Medi (deij = eij-Medi). Then create boxplots for each sample based on deij and sort by interquartile range. Samples with larger interquartile ranges are more likely to be outliers.

The second quality control check looks at heirarchical clustering of samples. Samples are expected to have short distances between others and therefore should cluster homogeneously, so distant samples are expected to be outliers. Distance is calculated as 1-spearman correlation for heirarchical clustering. The top 100 genes sorted by variance are used to calculate Mahalanobis distance. Chi2 p-values are then calculated based on Mahalanobis distance. Clusters with 60% or more samples with Bonferroni corrected p-values less than 0.05 are marked as outliers.

The third and last quality control check looks at D-statistics which represent the average correlation between a sample’s expression and other sampels. Samples with low D-statistics are likely to be outliers.

Input#

Gene expression in TPM. A data table with gene ID as first column and each sample ID as a subsquent columns.

Output#

A set of three diagnostic plots from each of the outlier detection method
A TPM file with low expression genes, samples with missing data, and outliers removed
Raw gene count file will also be processed to remove the same individuals and genes, as determined by the QC on TPM data.

Minimal Working Example Steps#

vi. Multi-sample RNA-seq QC#

Timing: <15min

Implement RNA-seq QC that identifies and removes genes and samples from the TPM matrix

Note: We recommend using DSFilterPercent default value of 0.05. DSFilterPercent changed from default (0.05) to 0.1 here for testing with few samples.

!sos run bulk_expression_QC.ipynb qc \
    --cwd ../../output_test/rnaseqc_qc \
    --tpm-gct ../../output_test/star_output/PCC_sample_list_subset.rnaseqc.gene_tpm.gct.gz \
    --counts-gct ../../output_test/star_output/PCC_sample_list_subset.rnaseqc.gene_readsCount.gct.gz \
    --DSFilterPercent 0.1 \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    -s force -c ../csg.yml  -q neurology 

INFO: Running basic check and low expression filtering: 
INFO: tafc955937c3625d3 re-execute completed
INFO: tafc955937c3625d3 submitted to neurology with job id Your job 2487320 ("job_tafc955937c3625d3") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: basic check and low expression filtering output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/rnaseqc_qc/PCC_sample_list_subset.rnaseqc.low_expression_filtered.tpm.gct.gz
INFO: Running remove outliers: 
INFO: tc7361a814fec7f21 submitted to neurology with job id Your job 2487339 ("job_tc7361a814fec7f21") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: remove outliers output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/rnaseqc_qc/PCC_sample_list_subset.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz
INFO: Running remove gene and samples from raw counts: 
INFO: tab40adeac3335a70 submitted to neurology with job id Your job 2487364 ("job_tab40adeac3335a70") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: remove gene and samples from raw counts output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/rnaseqc_qc/PCC_sample_list_subset.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz
INFO: Workflow qc (ID=w1ced59ea596cdce1) is executed successfully with 3 completed steps and 3 completed tasks.

Troubleshooting#

Step	Substep	Problem	Possible Reason	Solution

Command interface#

sos run bulk_expression_QC.ipynb -h

usage: sos run bulk_expression_QC.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  qc

Global Workflow Options:
  --tpm-gct VAL (as path, required)
                        Required input is TPM file
  --counts-gct . (as path)
                        Raw counts file is optional and if available, it will be
                        filtered to match with the TPM file sample and genes
  --cwd output (as path)
  --container ''

Sections
  qc_1:
    Workflow Options:
      --low-expr-TPM 0.1 (as float)
      --low-expr-TPM-percent 0.2 (as float)
  qc_2:
    Workflow Options:
      --RLEFilterPercent 0.05 (as float)
      --DSFilterPercent 0.05 (as float)
      --topk-genes 100 (as int)
      --cluster-percent 0.6 (as float)
      --pvalue-cutoff 0.05 (as float)
      --cluster-level 5 (as int)
  qc_3:

Workflow implementation#

[global]
# Required input is TPM file
parameter: tpm_gct = path
# Raw counts file is optional and if available, it will be filtered to match with the TPM file sample and genes
parameter: counts_gct = path()
parameter: cwd = path("output")
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
cwd = path(f'{cwd:a}')
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"

# Number of threads
parameter: numThreads = 8

[qc_1 (basic check and low expression filtering)]
parameter: low_expr_TPM = 0.1
parameter: low_expr_TPM_percent = 0.2
input: tpm_gct
output: f'{cwd}/{_input:bnnn}.low_expression_filtered.tpm.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand = "$[ ]", stderr = f'{_output:nnn}.stderr', stdout = f'{_output:nnn}.log',container = container, entrypoint = entrypoint
    # Load packages
    library(dplyr)
    library(readr)
    library(tibble)
    library(purrr)
    ## Setting inputs
    low_expr_TPM <- $[low_expr_TPM]
    low_expr_TPM_percent <- $[low_expr_TPM_percent]
    ## QC
    ### Data integrity check. The input file should has sample ID as column name; the first column is gene ID
    TPM_data <- read_tsv($[_input:ar], col_names = TRUE, comment = "#")%>%as.data.frame()
    names(TPM_data)[1] <- "feature"
    if( sum(duplicated(TPM_data$feature)) > 0){
      message("feature (e.g. gene names) should be in the first column. Please remove duplicated feature IDs, Exit!")
      quit(save = "no", status = 1, runLast = FALSE)
    }else{
      rownames(TPM_data) <- TPM_data$feature
      TPM_data = TPM_data[,-1]
      loaded_sample_count <- ncol(TPM_data)
    }
    #### NA check
    if(sum(is.na(TPM_data)) > 0 ){      
      message(paste0("NA is not allowed in the data, there are ",sum(is.na(TPM_data))," NAs, Exit!"))
      quit(save = "no", status = 1, runLast = FALSE)}
    #### make sure every expr column is in numeric type
    matrix_check <- map(TPM_data, is.numeric) %>% unlist
    if(sum(!matrix_check) > 0){
      message("The following column(s) in expression matrix is/are NOT in numeric type. Plese check, Proceed by excluding those samples")
      message(paste(names(matrix_check)[!matrix_check], collapse = "; "))
      TPM_data = TPM_data[,matrix_check]
    }
    message("Gene expression profiles loaded successfully!")
    message(paste(nrow(TPM_data), "genes and", ncol(TPM_data), "samples are loaded from", $[_input:r], sep = " "))
    
    #### Filter out low exp genes
    keep_genes_idx <- (rowMeans(TPM_data>low_expr_TPM)>low_expr_TPM_percent) 
    TPM_data = TPM_data[keep_genes_idx,]
    message(paste(sum(1 - keep_genes_idx), "genes are filtered by filter: >", low_expr_TPM_percent*100, "% samples have expression values <", low_expr_TPM))
    message(paste(sum(keep_genes_idx), "genes left, saving output to $[_output]."))
    TPM_data%>%as_tibble(rownames = "gene_ID")%>%write_delim("$[_output]","\t")

CAUTION: I (GW) am not sure what to do with the offset for log-transformation on TPM. GTEx suggests using offset = 1. Here I keep the recommendation from these authors where both 0.0001 and 1 were used in different steps.

[qc_2 (remove outliers)]
parameter: RLEFilterPercent = 0.05
parameter: DSFilterPercent = 0.05
parameter: topk_genes = 100
parameter: cluster_percent = 0.6
parameter: pvalue_cutoff = 0.05
parameter: cluster_level = 5
output: f'{cwd}/{_input:bnnn}.outlier_removed.tpm.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand = "$[ ]", stderr = f'{_output:nnn}.stderr', stdout = f'{_output:nnn}.log', container = container, entrypoint = entrypoint
    library(RColorBrewer)
    library(ape)
    library(reshape2)
    library(dplyr)
    library(readr)
    ## Setting parameters
    RLEFilterPercent <-$[RLEFilterPercent]
    DSFilterPercent <- $[DSFilterPercent]
    ### Hcluster parameter
    pvalues.cut <- $[pvalue_cutoff]
    topk_genes <- $[topk_genes]
    cluster_percent <- $[cluster_percent]
    treesNum <- $[cluster_level]
    ## Define functions to accomodate rank deficient covariance matrices in https://github.com/cumc/xqtl-protocol/issues/307
    mahalanobis = function (x, center, cov, inverted = FALSE, ...) 
        {
            x <- if (is.vector(x)) 
                matrix(x, ncol = length(x))
            else as.matrix(x)
            if (!isFALSE(center)) 
                x <- sweep(x, 2L, center)
            if (!inverted) 
                cov <- MASS::ginv(cov, ...)
            setNames(rowSums(x %*% cov * x), rownames(x))
        }
          
    ## laod Data
    TPM_data <- read_tsv($[_input:ar], col_names = TRUE, comment = "#")%>%as.data.frame()
    rownames(TPM_data) <- TPM_data$gene_ID
    TPM_data = TPM_data[,-1]
    RLEFilterLength <- RLEFilterPercent*ncol(TPM_data)
    DSFilter <- DSFilterPercent*ncol(TPM_data)
  
    ## Outlier detection
  
    ### RLE
    # https://github.com/stormlovetao/eQTLQC/blob/86dcc388c8da7f1bd5b223f4b9b26f09c907eb15/Sample/src/report.Rmd#L71
    log_offset <- 0.0001
    logtpm = log10(TPM_data%>%as.matrix + log_offset)
    rle=logtpm-apply(logtpm, 1, median) # change "/" to "-" so that we got log(fold-change) which centered on 0 on the RLE plot.
    iqr = apply(rle,2,IQR)
    rle=melt( rle , variable.name = "Sample",value.name ="TPM", id="feature")
    names(rle) <- c("feature","Sample","TPM")
    rle_IQR <- rle %>% group_by(Sample) %>% summarise(IQR = IQR(TPM))
    rle_IQR_range <- rle_IQR$IQR %>% range %>% abs() %>% max()
    rle_IQR_range <- 2*rle_IQR_range %>% ceiling()
    bymedian <- with(rle, reorder(Sample, TPM, IQR))  # sort by IQR
    par(mar=c(3,3,3,3))
    pdf(file = "$[_output:n].RLEplot.pdf")
    boxplot(TPM ~ bymedian, data=rle, outline=F, ylim = c(-rle_IQR_range, rle_IQR_range), las=2, boxwex=1, col='gray', cex.axis=0.3, main="RLE plot before QC", xlab="", ylab="Residual expression levels", frame=F)
    dev.off()
    ExpPerSample <- nrow(TPM_data)
    RLEFilterList <- unique(levels(bymedian)[((length(levels(bymedian))-(RLEFilterLength))+1):(length(levels(bymedian))+1)])
    RLEFilterList <- as.character(RLEFilterList)
    message(paste0("The right most ", RLEFilterPercent*100, "% samples (N = ", length(RLEFilterList), ") are marked as candidate outliers in this step:") )
    for (x in 1:length(RLEFilterList)){
      message(RLEFilterList[x])
    }

    ### hcluster  
    sampleDists <- 1 - cor(logtpm, method='spearman')
    hc <- hclust(as.dist(sampleDists), method = "complete")
    hcphy <- as.phylo(hc)
  
    pdf(file = "$[_output:n].preQC_cluster.pdf")
    plot(hcphy, type = "unrooted", cex=.2, lab4ut='axial',underscore = T, main="Sample clustering before QC (Spearman - Cor.)")
    dev.off()
    # https://github.com/stormlovetao/eQTLQC/blob/86dcc388c8da7f1bd5b223f4b9b26f09c907eb15/Sample/src/report.Rmd#L102
    log_offset <- 1
    logtpm = log10(TPM_data%>%as.matrix + log_offset)
    ntop <- topk_genes
    Pvars <- apply(logtpm, 1, var)
    select <- order(Pvars, decreasing =TRUE)[seq_len(min(ntop, length(Pvars)))]
    MD_matrix <- logtpm[select, ]
    MahalanobisDistance = mahalanobis(t(MD_matrix), colMeans(t(MD_matrix)), cov(t(MD_matrix))) 
    # Note: t(MD_matrix)=sample_row*gene_column, Manhalanobis() returns one vector with length=row number
    pvalues = pchisq(MahalanobisDistance, df=nrow(MD_matrix), lower.tail=F)
    pvalues.adjust = p.adjust(pvalues, method ="bonferroni") # adjusted pvalues for each sample
    pvalues.low <- pvalues.adjust[pvalues.adjust<pvalues.cut]
    
    HCoutliers <- character()
    for(x in c(1:treesNum)){
      trees <- cutree(hc,k=x)
      idx <- c(1:x)#which tree is checking
      for(i in idx)
      {
        group <- hc$labels[which(trees == i)]
        if(sum(group %in% names(pvalues.low))/length(group) >= cluster_percent)
        {
          HCoutliers <- union(HCoutliers,group)
        }
      }
    }
    
    message(paste(length(HCoutliers), "samples are marked as candidate outlier(s) in this step.", sep = " "))
    if(length(HCoutliers)>0){
      message("Sample outliers are marked in red as follows:")
      for (x in 1:length(HCoutliers)){
          message(HCoutliers[x])
      }
      co1 = hc$labels%in%HCoutliers
      co1[which(co1 == "FALSE")]<-"gray0"
      co1[which(co1 == "TRUE")]<-"red"
      par(mar=c(3,3,3,3))
  
    pdf(file = "$[_output:n].cluster.pdf")
    plot(hcphy, tip.col = co1, type = "unrooted", cex=.2, lab4ut='axial',underscore = T, main="Label Outliers in Red")
    Xcol = c("gray0", "red")
    Xtext = c("Normal Sample", "Outliers")
    legend('bottomleft',pch=21,Xtext, col='white',pt.bg=Xcol, cex=1)
    dev.off()
    }else{
      message("No outlier detected.")
    }
    
    ### D-s
    D = apply(1-sampleDists, 1, median)
    pdf(file = "$[_output:n].D_stat_hist.pdf")
    hist(D, breaks=100, ylab="Number of samples", xlab="D-statistic", main="Histogram of Sample D-statistics before data QC")
    dev.off()
    
    DSFilter <- sort(D)[DSFilter]
    D<-as.data.frame(D)
    D<-data.frame(Sample = rownames(D),D = D$D)
    D_filterList = D%>%filter(D <= DSFilter)
    D_filterList <- D_filterList$Sample
    D_filterList<-as.character(D_filterList)
    message(paste0("The right most ", DSFilterPercent*100, "% samples (N=", length(D_filterList), ") are marked as candidate outliers in this step:") )
    for (x in 1:length(D_filterList)){
      message(D_filterList[x])
    }
    
    ## Outliers are the intersect of three candidates list
    outliersList <- intersect(RLEFilterList,intersect(HCoutliers,D_filterList))
    message("Outliers:")
    for (x in 1:length(outliersList)){
      message(outliersList[x])
    }
    outliersIndex <- which(colnames(logtpm) %in% outliersList)
    if(!length(outliersIndex) == 0){
        TPM_data <- TPM_data[,-outliersIndex]
    }
    ## Add 2 header lines, https://github.com/getzlab/rnaseqc/blob/286f99dfd4164d33014241dd4f3149da0cddf5bf/src/RNASeQC.cpp#L426
    cat(paste("#1.2\n#", nrow(TPM_data), ncol(TPM_data), "\n"), file=$[_output:nr], append=FALSE)
    TPM_data%>%as_tibble(rownames = "gene_ID")%>%write_delim($[_output:nr],delim = "\t",col_names = T, append = T)
  
bash: expand = "$[ ]", stderr = f'{_output:nnn}.stderr', stdout = f'{_output:nnn}.log', container = container, entrypoint = entrypoint
    gzip -f --best $[_output:n]

Additional formatting:

Filter out the geneCount table based on TPM table.
Adds two comment lines above the header of TPM and geneCount table to mimick the original output from RNASeQC.

[qc_3 (remove gene and samples from raw counts)]
stop_if(not counts_gct.is_file())
output: f'{cwd}/{_input:bnnn}.geneCount.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand = "$[ ]", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', container = container, entrypoint = entrypoint
    library("dplyr")
    library("readr")
    # Reason to use read.table: 1.accomodate both " " and "\t"
    tpm = read_delim($[_input:ar], "\t", col_names = T, comment = "#")
    geneCount = read_delim($[counts_gct:ar], "\t" ,col_names = T, comment = "#")
    ## Make geneCount consistant with tpm, remove gene by filter() and remove samples by select()
    geneCount = geneCount%>%filter(gene_ID %in% tpm$gene_ID)%>%select(colnames(tpm))
    ## Add 2 header lines, https://github.com/getzlab/rnaseqc/blob/286f99dfd4164d33014241dd4f3149da0cddf5bf/src/RNASeQC.cpp#L426
    cat(paste("#1.2\n#", nrow(geneCount), ncol(geneCount) - 1, "\n"), file=$[_output:nr], append=FALSE)
    ## Save each file with 3 header line
    geneCount%>%write_delim($[_output:nr],delim = "\t",col_names = T, append = T)

bash: expand = "$[ ]", stderr = f'{_output:nnn}.stderr', stdout = f'{_output:nnn}.log', container = container, entrypoint = entrypoint
    gzip -f --best $[_output:n]