Quantification of methylation data#

Methods overview#

This notebook implements two methods to quantify methylation data, using sesame and minfi. We recommend sesame over minfi.

Procedure

minfi

sesame

SNP/Cross reaction removal

dropLociWithSnps + manual removal

Q (qualityMask)

sample quality

detectionP + mean

sesameQC_calcStats + “detection” + frac_dt

Bias correction

preprocessQuantile

D ( dyeBiasNL)

Probe quality

detectionP

“P (pOOBAH Detection p-value masking using oob)”

Background substraction

NA

B (noob)

Input#

  1. sample_sheet: path to csv/tsv file that documenting all the meta-information of the bisulfite sequencing. The user need to manually ensure/rename the column names corresponding to the first and second half of the idat file names are “Sentrix_ID” and “Sentrix_Position”

  2. [optional] idat_folder: path to the folder containing all the IDAT files to generate methylation data matrices from. Default is set to using the same folder where sample_sheet locates.

  3. [optional] cross_reactive_probes: A list of CpG probes that are reported to map to multiple regions in the genome.

Output#

  • A pair bed.gz file for beta and M value.

  • Probe to gene annotation.

Minimal working example#

sos run pipeline/methylation_calling.ipynb sesame \
    --sample-sheet data/MWE/MWE_Sample_sheet.csv \
    --container containers/methylation.sif

sos run pipeline/methylation_calling.ipynb sesame \
    --sample-sheet data/MWE/MWE_Sample_sheet_int.csv \
    --container containers/methylation.sif --sample_sheet_header_rows 0
sos run pipeline/methylation_calling.ipynb minfi \
    --sample-sheet data/MWE/MWE_Sample_sheet.csv \
    --container containers/methylation.sif

Command interface#

sos run methylation_calling.ipynb -h
usage: sos run methylation_calling.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  sesame
  minfi

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files.
  --sample-sheet VAL (as path, required)
                        The companion sample sheet csv file as outlined in the
                        input section.
  --idat-folder  path(f"{sample_sheet:d}")

                        Raw data folder
  --[no-]keep-only-cpg-probes (default to False)
                        Remove probes that are SNPs
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 8 (as int)
                        Number of threads
  --container ''
                        Software container option

Sections
  sesame_1:
    Workflow Options:
      --samples-frac-dt-cutoff 0.8 (as float)
                        threshold to filter out samples based on frac_dt
                        (Percentage of probe Detection Success)  percentage
      --sample-sheet-header-rows 7 (as int)
                        The header rows in the sample sheet csv
  minfi_1:
    Workflow Options:
      --samples-pval-cutoff 0.05 (as float)
                        threshold to filter out samples based on detection P
                        value
      --probe-pval-cutoff 0.01 (as float)
                        threshold to filter out probes based on detection P
                        value
      --cross-reactive-probes /opt/cross_reactive_probe_Hop2020.txt (as path)
                        FIXME: document here where this list is obtained from.
                        Also the documentation below doesn't sound right. Please
                        fix. Use the default list in our docker, if want to skip
                        methylation, specify it as "."
      --GRCh-build 38 (as int)
                        38 (hg38) or 37 (hg19) for epic data, by default 38.
                        Noted for 450K data only GRCh37 is availble
  *_methylation_2:

Global parameters#

keep_only_cpg_probes option dictate whether only cpg probes should be kept:

  • On an Illumina methylation bead chip, there are three types of probes,whose nature were indicated by their names. - cg: cpg probe; - rs: explict snp probe; - ch: non-CpG targeting probes; reported to be more prone to cross-hybirdization

    Following the guideline of Zhou W 2016, by default we do not remove all the rs and ch probes. However, for research that are focusing on the CpG sites, like mQTL discovery, we should use keep_only_cpg_probes parameter to filter out other types of probes.

[global]
# The output directory for generated files.
parameter: cwd = path("output")
# The companion sample sheet csv file as outlined in the input section.
parameter: sample_sheet = path
# Raw data folder
parameter: idat_folder = path(f"{sample_sheet:d}")
# Remove probes that are SNPs
parameter: keep_only_cpg_probes = False
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"

# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
cwd = path(f'{cwd:a}')

Sesame#

Getting the beta value from EPIC450 IDAT for 750 samples from 3000 wells take ~40 mins.

Based on sesame documentation, the processing procedure suitable for human on EPIC 450 and 850 platform is “QCDPB”

The code for each processing procedure are as followed:


Code

Name

Detail

Q

qualityMask

Mask probes of poor design

C

inferInfiniumIChannel

Infer channel for Infinium-I probes

D

dyeBiasNL

Dye bias correction (non-linear)

P

pOOBAH

Detection p-value masking using oob

B

noob

Background subtraction using oob

Other potential procedures are

Code

Name

Detail

0

resetMask

Reset mask to all FALSE

G

prefixMaskButCG

Mask all but cg- probes

H

prefixMaskButC

Mask all but cg- and ch-probes

E

dyeBiasL

Dye bias correction (linear)

I

detectionIB

Mask detection by intermediate beta values

M value is calculated as M = log2(beta/(1-beta)) The way we handle beta == 0 or beta == 1 is by replacing them with the next min/max value among the beta matrix, which is based on here

[sesame_1]
processing_option = "QCDGPB" if keep_only_cpg_probes else "QCDPB"
# threshold to filter out samples based on frac_dt (Percentage of probe Detection Success)  percentage
parameter: samples_frac_dt_cutoff = 0.8
# The header rows in the sample sheet csv. Use 0 for no headers. Typically it should be 7. 
parameter: sample_sheet_header_rows = float

# The number of cores to use. if set to 0 (default), the effective number of cores to used is automatically determined by BiocParallel::multicoreWorkers() function (i.e. n_available_cores-2)
parameter: n_cores = 1

input: sample_sheet
output: f'{cwd}/{_input:bn}.sesame.rds',f'{cwd}/{_input:bn}.sesame.beta.tsv',f'{cwd}/{_input:bn}.sesame.M.tsv',f'{cwd}/{_input:bn}.sample_qcs.sesame.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
    library(sesame)
    library(data.table)
    setDTthreads(threads = 0)
    require(parallel)
    require(BiocParallel)
    
    n_cores=${n_cores}
    if(n_cores==0)n_cores=BiocParallel::multicoreWorkers() 
    # Cache the sesameData, it will not actually download anything if the data is here alerady
    sesameData::sesameDataCache()
    # Define function 
    B2M<-function (x)
    {
    x[x == 0] <- min(x[x != 0])
    x[x == 1] <- max(x[x != 1])
    log2(x) - log2(1 - x)
    }
    # Load the sample sheet to get the sample names
    sample_sheet = fread("${_input}" , skip = ${sample_sheet_header_rows} )
    if("Sentrix_Row_Column" %in% colnames(sample_sheet)){
    sample_sheet[,well_name:=paste(Sentrix_ID,Sentrix_Row_Column,sep='_')]
      } else { sample_sheet[,well_name:=paste(Sentrix_ID,Sentrix_Position,sep='_')]
      }
    sample_sheet[,Sample_Name:=.SD,.SDcols=colnames(sample_sheet)[1]]
    sample_sheet$Sample_Name = as.character(sample_sheet$Sample_Name)

    # scan and load the data
    sdfs <- openSesame(${idat_folder:r}, prep = "", func = NULL,BPPARAM = BiocParallel::MulticoreParam(n_cores))
    # Keep only the samples in the sample_sheet to save memory
    sdfs <- sdfs[which(names(sdfs)%in%sample_sheet$well_name)]

    message("IDAT files loaded.")
    # Get Sample level QC:
    qcs <- openSesame(sdfs, prep="", func=sesameQC_calcStats,BPPARAM = BiocParallel::MulticoreParam(n_cores))
    message("Sample-level QC completed.")
    # Get poor samples base on percentage of probe fail
    qcs_dt<-data.table(do.call(rbind, lapply(qcs, as.data.frame)),keep.rownames='id')

    poor_samples<-qcs_dt[frac_dt<${samples_frac_dt_cutoff}]$id #all samples with < 80% probes detection are deened as poor

    ## This handles the case where poor_sample is empty
    sdfs <- sdfs[which(!(names(sdfs)%in%poor_samples))]
    message(paste0(ifelse(length(poor_samples) > 0, poor_samples , "No sample" ) , " removed due to low quality based on frac_dt < ${samples_frac_dt_cutoff} "))
    
    # Preprocess the data via QCDPB procedure
    ## The masking procedure of sesame is not removing the probes, but instead introduce NA in the masked probes. Therefore it makesense to na.rm it.
    beta <- openSesame(sdfs,prep = "${processing_option}",BPPARAM = BiocParallel::MulticoreParam(n_cores))
    message("Beta calculated!")
    beta <- beta[rowSums(is.na(beta)) != ncol(beta), ] # Use this instead of na.omit because if not the full rows are NA, then the probe is not masked.
    
    #replace well_name by sample name in the matrix column names
    colnames(beta)<-sample_sheet[colnames(beta),on='well_name']$Sample_Name
    
    #calculate the M value
    M <- B2M(beta)
    message("M values calculated.")
    
    #Save the outputs
    fwrite(data.table(beta,keep.rownames = 'ID'),"${_output[1]}",sep="\t")
    fwrite(data.table(M,keep.rownames = 'ID'),"${_output[2]}",sep="\t")
    fwrite(qcs_dt,"${_output[3]}",sep="\t")
    saveRDS(list("sdfs" = sdfs,"qcs" = qcs),"${_output[0]}")

    message("sesame analysis completed!")

bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout
        for i in $[_output[0]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "This is the file containing the intermediate QC table of sesame"
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        done
        for i in $[_output[1]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6,7,8,9,10   >> $stdout ; done
        for i in $[_output[2]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6,7,8,9,10   >> $stdout ; done
    

minfi#

By default, for Infinium MethylationEPIC the data will be annotated based on hg38 using this annotation, alternatively user can set the --hg-build parameter back to 19 to use the hg19 annotation.

For 450K data however, only hg19 annotation is availble, which is what we would use would use for minfi to work. However, we will reannotate everything to hg38 anyways in the next step.

  1. All the IDAT file in the specified folder and sub-folder will be loaded for samples in input sample CSV file

  2. The methylation data samples will first be filtered based on bisulphite conversation rate. This operation is done using the bscon function from watermelon package

  3. samples will then be filtered based on a detection pvalue, which indicates the quality of the signal at each genomics position

  4. Stratified Quantile Normalization will then be applied.

  5. features will be filtered if they are on sex chr, known to be cross-reactive,maping to multiple regions in the genome., overlapping with snps, or having too low a detection P. The list of cross-reactive probe can be found as /opt/cross_reactive_probe_Hop2020.txt in our docker and here.

  6. Beta and M value will for all the probes/samples will then each be saved to a indexed bed.gz file.

As documented here when the batch of IDAT data are different, there will be a problem reading the IDAT file without specifing the force = TRUE option in the read.metharray.exp(targets = targets,force = TRUE)

[minfi_1]
# threshold to filter out samples based on detection P value
parameter: samples_pval_cutoff = 0.05
# threshold to filter out probes based on detection P value
parameter: probe_pval_cutoff = 0.01
# FIXME: document here where this list is obtained from. Also the documentation below doesn't sound right. Please fix.
## Use the default list in our docker, if want to skip methylation, specify it as "."
parameter: cross_reactive_probes = path("/opt/cross_reactive_probe_Hop2020.txt")
# 38 (hg38) or 19 (hg19) for epic data, by default 38. Noted for 450K data only GRCh37 is availble
parameter: hg_build = 38 
input: sample_sheet
output: f'{cwd}/{_input:bn}.minfi.rds',f'{cwd}/{_input:bn}.minfi.beta.tsv',f'{cwd}/{_input:bn}.minfi.M.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
    ## load libraries
    library(dplyr)
    library(readr)
    library(tibble)
    library(minfi)
    sessionInfo()
    cross_reactive = readr::read_delim("${cross_reactive_probes}","\t")$probe
    ## 1. read idat files
    targets <- read.metharray.sheet(${_input:adr})
    colnames(targets)[1] = "Sample_Name"
    ## Only read samples with data
    Missing_sample = targets%>%filter(!stringr::str_detect(targets$Basename ,"/") )%>%pull(Sample_Name)
    if (length(Missing_sample)) message(paste0("Samples ",paste0(Missing_sample,collapse = ", "), " do not have IDAT data" ))

    targets = targets%>%filter(stringr::str_detect(targets$Basename ,"/") )     
    rgSet <- read.metharray.exp(targets = targets)
    if(${hg_build} == 38 && rgSet@annotation["array"] == 'IlluminaHumanMethylationEPIC' ){rgSet@annotation['annotation'] = "ilm10b5.hg38"}
    message("RGSet object created.")
    
    # Quality Control and Normalization

    ## 2. QC based on p-value, remove samples with average p value less than 0.05
    
    detP <- detectionP(rgSet)
    keep <- colMeans(detP) < ${samples_pval_cutoff}
    rgSet <- rgSet[,keep]
    targets <- targets[keep,]
    message("Samples with average detection p-value < ${samples_pval_cutoff} removed.")

    ## 3. Normalize the data - Quantile
    mSetSq <- preprocessQuantile(rgSet)
    message("RGSet data quantile normalized")
    
    ## 4. Remove cross-reactive probes
    no_cross_reactive <- !(featureNames(mSetSq) %in% cross_reactive)
    mSetSq <- mSetSq[no_cross_reactive, ]
    message("Cross-reactive probes removed")
    
    ## 5. Drop probes that are also SNPs
    if (${"T" if keep_only_cpg_probes else "F"} ){
        mSetSq <- dropLociWithSnps(mSetSq)
        message("probes overlapping with SNPs removed")
    }
  
    ## 6. Remove probes with < ${probe_pval_cutoff} detection p-values
    detP <- detP[match(featureNames(mSetSq),rownames(detP)),]
    keep <- rowSums(detP < ${probe_pval_cutoff}) == ncol(mSetSq)
    mSetSq <- mSetSq[keep,]
  
    ## 7. get Beta and M values
    mSetSqbval <- getBeta(mSetSq)%>%as_tibble(rownames = "ID")
    mSetSqMval <- getM(mSetSq)%>%as_tibble(rownames = "ID")

    message("Beta-value and M value obtained")
    
    ## 8. output data
    mSetSqbval = mSetSqbval%>%rename_at(vars(rgSet@colData%>%rownames()), function(x) rgSet@colData[x,]%>%as_tibble%>%pull(Sample_Name) )
    mSetSqMval = mSetSqMval%>%rename_at(vars(rgSet@colData%>%rownames()), function(x) rgSet@colData[x,]%>%as_tibble%>%pull(Sample_Name) )
    mSetSqbval%>%readr::write_delim("${_output[1]}","\t")
    mSetSqMval%>%readr::write_delim("${_output[2]}","\t")
    output = list("rgSet" = rgSet,mSetSq = "mSetSq", mSetSqbval = "mSetSqbval")
    output%>%saveRDS(${_output[0]:r})

Annotate probes#

The probes are annotated via sesameData package and formatted as bgzipped bed files, regardless of method used to process the IDAT.

[*_2]
output: f'{_input[1]:n}.bed.gz', f'{_input[2]:n}.bed.gz', f'{_input[0]:n}.gene_id.annot.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
    library(sesame)
    library(tibble)
    library(dplyr)
    library(readr) 
    sesameData::sesameDataCache()
    betas = read_delim("${_input[1]}","\t")
        M = read_delim("${_input[2]}","\t")
    probe_annot = sesameData::sesameData_annoProbes(betas$ID,column = "gene_id")
    probe_annot = cbind("ID" = probe_annot%>%names,probe_annot%>%as_tibble)%>%as_tibble
  
    betas = inner_join(probe_annot%>%dplyr::select("#chr" = seqnames, start , end , ID ),betas , by = "ID" )%>%
          mutate(end = start +1 ,chr_num = stringr::str_remove(`#chr`,"chr")%>%as.numeric)%>%arrange(chr_num,`#chr`,start)%>%select(-chr_num) 
    M = inner_join(probe_annot%>%dplyr::select("#chr" = seqnames, start , end , ID ), M , by = "ID" )%>%
          mutate(end = start +1 , chr_num = stringr::str_remove(`#chr`,"chr")%>%as.numeric)%>%arrange(chr_num,`#chr`,start)%>%select(-chr_num) 
  
    betas%>%readr::write_delim("${_output[0]:n}","\t")
    M%>%readr::write_delim("${_output[1]:n}","\t")
    probe_annot%>%write_delim("${_output[2]}","\t")
  
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output[0]:n} 
    tabix ${_output[0]}
    bgzip -f ${_output[1]:n} 
    tabix ${_output[1]}
    rm -f ${_output[0]:n} ${_output[1]:n}

bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout
        for i in $[_output[0]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6,7,8,9,10   >> $stdout ; done
        for i in $[_output[1]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6,7,8,9,10   >> $stdout ; done
        for i in $[_output[2]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6,7,8,9,10   >> $stdout ; done