Quantification of methylation data#
Methods overview#
This notebook implements two methods to quantify methylation data, using sesame
and minfi
. We recommend sesame
over minfi
.
Procedure |
|
|
---|---|---|
SNP/Cross reaction removal |
dropLociWithSnps + manual removal |
Q (qualityMask) |
sample quality |
detectionP + mean |
sesameQC_calcStats + “detection” + frac_dt |
Bias correction |
preprocessQuantile |
D ( dyeBiasNL) |
Probe quality |
detectionP |
“P (pOOBAH Detection p-value masking using oob)” |
Background substraction |
NA |
B (noob) |
Input#
sample_sheet
: path to csv/tsv file that documenting all the meta-information of the bisulfite sequencing. The user need to manually ensure/rename the column names corresponding to the first and second half of the idat file names are “Sentrix_ID” and “Sentrix_Position”[optional]
idat_folder
: path to the folder containing all the IDAT files to generate methylation data matrices from. Default is set to using the same folder wheresample_sheet
locates.[optional]
cross_reactive_probes
: A list of CpG probes that are reported to map to multiple regions in the genome.
Output#
A pair bed.gz file for
beta
andM
value.Probe to gene annotation.
Minimal working example#
sos run pipeline/methylation_calling.ipynb sesame \
--sample-sheet data/MWE/MWE_Sample_sheet.csv \
--container containers/methylation.sif
sos run pipeline/methylation_calling.ipynb sesame \
--sample-sheet data/MWE/MWE_Sample_sheet_int.csv \
--container containers/methylation.sif --sample_sheet_header_rows 0
sos run pipeline/methylation_calling.ipynb minfi \
--sample-sheet data/MWE/MWE_Sample_sheet.csv \
--container containers/methylation.sif
Command interface#
sos run methylation_calling.ipynb -h
usage: sos run methylation_calling.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
sesame
minfi
Global Workflow Options:
--cwd output (as path)
The output directory for generated files.
--sample-sheet VAL (as path, required)
The companion sample sheet csv file as outlined in the
input section.
--idat-folder path(f"{sample_sheet:d}")
Raw data folder
--[no-]keep-only-cpg-probes (default to False)
Remove probes that are SNPs
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 5h
Wall clock time expected
--mem 16G
Memory expected
--numThreads 8 (as int)
Number of threads
--container ''
Software container option
Sections
sesame_1:
Workflow Options:
--samples-frac-dt-cutoff 0.8 (as float)
threshold to filter out samples based on frac_dt
(Percentage of probe Detection Success) percentage
--sample-sheet-header-rows 7 (as int)
The header rows in the sample sheet csv
minfi_1:
Workflow Options:
--samples-pval-cutoff 0.05 (as float)
threshold to filter out samples based on detection P
value
--probe-pval-cutoff 0.01 (as float)
threshold to filter out probes based on detection P
value
--cross-reactive-probes /opt/cross_reactive_probe_Hop2020.txt (as path)
FIXME: document here where this list is obtained from.
Also the documentation below doesn't sound right. Please
fix. Use the default list in our docker, if want to skip
methylation, specify it as "."
--GRCh-build 38 (as int)
38 (hg38) or 37 (hg19) for epic data, by default 38.
Noted for 450K data only GRCh37 is availble
*_methylation_2:
Global parameters#
keep_only_cpg_probes
option dictate whether only cpg probes should be kept:
On an Illumina methylation bead chip, there are three types of probes,whose nature were indicated by their names. - cg: cpg probe; - rs: explict snp probe; - ch: non-CpG targeting probes; reported to be more prone to cross-hybirdization
Following the guideline of Zhou W 2016, by default we do not remove all the rs and ch probes. However, for research that are focusing on the CpG sites, like mQTL discovery, we should use
keep_only_cpg_probes
parameter to filter out other types of probes.
[global]
# The output directory for generated files.
parameter: cwd = path("output")
# The companion sample sheet csv file as outlined in the input section.
parameter: sample_sheet = path
# Raw data folder
parameter: idat_folder = path(f"{sample_sheet:d}")
# Remove probes that are SNPs
parameter: keep_only_cpg_probes = False
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
cwd = path(f'{cwd:a}')
Sesame
#
Getting the beta value from EPIC450 IDAT for 750 samples from 3000 wells take ~40 mins.
Based on sesame documentation, the processing procedure suitable for human on EPIC 450 and 850 platform is “QCDPB”
The code for each processing procedure are as followed:
Code |
Name |
Detail |
---|---|---|
Q |
qualityMask |
Mask probes of poor design |
C |
inferInfiniumIChannel |
Infer channel for Infinium-I probes |
D |
dyeBiasNL |
Dye bias correction (non-linear) |
P |
pOOBAH |
Detection p-value masking using oob |
B |
noob |
Background subtraction using oob |
Other potential procedures are
Code |
Name |
Detail |
---|---|---|
0 |
resetMask |
Reset mask to all FALSE |
G |
prefixMaskButCG |
Mask all but cg- probes |
H |
prefixMaskButC |
Mask all but cg- and ch-probes |
E |
dyeBiasL |
Dye bias correction (linear) |
I |
detectionIB |
Mask detection by intermediate beta values |
M value is calculated as M = log2(beta/(1-beta)) The way we handle beta == 0 or beta == 1 is by replacing them with the next min/max value among the beta matrix, which is based on here
[sesame_1]
processing_option = "QCDGPB" if keep_only_cpg_probes else "QCDPB"
# threshold to filter out samples based on frac_dt (Percentage of probe Detection Success) percentage
parameter: samples_frac_dt_cutoff = 0.8
# The header rows in the sample sheet csv. Use 0 for no headers. Typically it should be 7.
parameter: sample_sheet_header_rows = float
# The number of cores to use. if set to 0 (default), the effective number of cores to used is automatically determined by BiocParallel::multicoreWorkers() function (i.e. n_available_cores-2)
parameter: n_cores = 1
input: sample_sheet
output: f'{cwd}/{_input:bn}.sesame.rds',f'{cwd}/{_input:bn}.sesame.beta.tsv',f'{cwd}/{_input:bn}.sesame.M.tsv',f'{cwd}/{_input:bn}.sample_qcs.sesame.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
library(sesame)
library(data.table)
setDTthreads(threads = 0)
require(parallel)
require(BiocParallel)
n_cores=${n_cores}
if(n_cores==0)n_cores=BiocParallel::multicoreWorkers()
# Cache the sesameData, it will not actually download anything if the data is here alerady
sesameData::sesameDataCache()
# Define function
B2M<-function (x)
{
x[x == 0] <- min(x[x != 0])
x[x == 1] <- max(x[x != 1])
log2(x) - log2(1 - x)
}
# Load the sample sheet to get the sample names
sample_sheet = fread("${_input}" , skip = ${sample_sheet_header_rows} )
if("Sentrix_Row_Column" %in% colnames(sample_sheet)){
sample_sheet[,well_name:=paste(Sentrix_ID,Sentrix_Row_Column,sep='_')]
} else { sample_sheet[,well_name:=paste(Sentrix_ID,Sentrix_Position,sep='_')]
}
sample_sheet[,Sample_Name:=.SD,.SDcols=colnames(sample_sheet)[1]]
sample_sheet$Sample_Name = as.character(sample_sheet$Sample_Name)
# scan and load the data
sdfs <- openSesame(${idat_folder:r}, prep = "", func = NULL,BPPARAM = BiocParallel::MulticoreParam(n_cores))
# Keep only the samples in the sample_sheet to save memory
sdfs <- sdfs[which(names(sdfs)%in%sample_sheet$well_name)]
message("IDAT files loaded.")
# Get Sample level QC:
qcs <- openSesame(sdfs, prep="", func=sesameQC_calcStats,BPPARAM = BiocParallel::MulticoreParam(n_cores))
message("Sample-level QC completed.")
# Get poor samples base on percentage of probe fail
qcs_dt<-data.table(do.call(rbind, lapply(qcs, as.data.frame)),keep.rownames='id')
poor_samples<-qcs_dt[frac_dt<${samples_frac_dt_cutoff}]$id #all samples with < 80% probes detection are deened as poor
## This handles the case where poor_sample is empty
sdfs <- sdfs[which(!(names(sdfs)%in%poor_samples))]
message(paste0(ifelse(length(poor_samples) > 0, poor_samples , "No sample" ) , " removed due to low quality based on frac_dt < ${samples_frac_dt_cutoff} "))
# Preprocess the data via QCDPB procedure
## The masking procedure of sesame is not removing the probes, but instead introduce NA in the masked probes. Therefore it makesense to na.rm it.
beta <- openSesame(sdfs,prep = "${processing_option}",BPPARAM = BiocParallel::MulticoreParam(n_cores))
message("Beta calculated!")
beta <- beta[rowSums(is.na(beta)) != ncol(beta), ] # Use this instead of na.omit because if not the full rows are NA, then the probe is not masked.
#replace well_name by sample name in the matrix column names
colnames(beta)<-sample_sheet[colnames(beta),on='well_name']$Sample_Name
#calculate the M value
M <- B2M(beta)
message("M values calculated.")
#Save the outputs
fwrite(data.table(beta,keep.rownames = 'ID'),"${_output[1]}",sep="\t")
fwrite(data.table(M,keep.rownames = 'ID'),"${_output[2]}",sep="\t")
fwrite(qcs_dt,"${_output[3]}",sep="\t")
saveRDS(list("sdfs" = sdfs,"qcs" = qcs),"${_output[0]}")
message("sesame analysis completed!")
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
for i in $[_output[0]] ; do
echo "output_info: $i " >> $stdout;
echo "This is the file containing the intermediate QC table of sesame"
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
done
for i in $[_output[1]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6,7,8,9,10 >> $stdout ; done
for i in $[_output[2]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6,7,8,9,10 >> $stdout ; done
minfi
#
By default, for Infinium MethylationEPIC the data will be annotated based on hg38 using this annotation, alternatively user can set the --hg-build
parameter back to 19 to use the hg19 annotation.
For 450K data however, only hg19 annotation is availble, which is what we would use would use for minfi to work. However, we will reannotate everything to hg38 anyways in the next step.
All the IDAT file in the specified folder and sub-folder will be loaded for samples in input sample CSV file
The methylation data samples will first be filtered based on bisulphite conversation rate. This operation is done using the bscon function from watermelon package
samples will then be filtered based on a detection pvalue, which indicates the quality of the signal at each genomics position
Stratified Quantile Normalization will then be applied.
features will be filtered if they are on sex chr, known to be cross-reactive,maping to multiple regions in the genome., overlapping with snps, or having too low a detection P. The list of cross-reactive probe can be found as
/opt/cross_reactive_probe_Hop2020.txt
in our docker and here.Beta and M value will for all the probes/samples will then each be saved to a indexed bed.gz file.
As documented here when the batch of IDAT data are different, there will be a problem reading the IDAT file without specifing the force = TRUE option in the read.metharray.exp(targets = targets,force = TRUE)
[minfi_1]
# threshold to filter out samples based on detection P value
parameter: samples_pval_cutoff = 0.05
# threshold to filter out probes based on detection P value
parameter: probe_pval_cutoff = 0.01
# FIXME: document here where this list is obtained from. Also the documentation below doesn't sound right. Please fix.
## Use the default list in our docker, if want to skip methylation, specify it as "."
parameter: cross_reactive_probes = path("/opt/cross_reactive_probe_Hop2020.txt")
# 38 (hg38) or 19 (hg19) for epic data, by default 38. Noted for 450K data only GRCh37 is availble
parameter: hg_build = 38
input: sample_sheet
output: f'{cwd}/{_input:bn}.minfi.rds',f'{cwd}/{_input:bn}.minfi.beta.tsv',f'{cwd}/{_input:bn}.minfi.M.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
## load libraries
library(dplyr)
library(readr)
library(tibble)
library(minfi)
sessionInfo()
cross_reactive = readr::read_delim("${cross_reactive_probes}","\t")$probe
## 1. read idat files
targets <- read.metharray.sheet(${_input:adr})
colnames(targets)[1] = "Sample_Name"
## Only read samples with data
Missing_sample = targets%>%filter(!stringr::str_detect(targets$Basename ,"/") )%>%pull(Sample_Name)
if (length(Missing_sample)) message(paste0("Samples ",paste0(Missing_sample,collapse = ", "), " do not have IDAT data" ))
targets = targets%>%filter(stringr::str_detect(targets$Basename ,"/") )
rgSet <- read.metharray.exp(targets = targets)
if(${hg_build} == 38 && rgSet@annotation["array"] == 'IlluminaHumanMethylationEPIC' ){rgSet@annotation['annotation'] = "ilm10b5.hg38"}
message("RGSet object created.")
# Quality Control and Normalization
## 2. QC based on p-value, remove samples with average p value less than 0.05
detP <- detectionP(rgSet)
keep <- colMeans(detP) < ${samples_pval_cutoff}
rgSet <- rgSet[,keep]
targets <- targets[keep,]
message("Samples with average detection p-value < ${samples_pval_cutoff} removed.")
## 3. Normalize the data - Quantile
mSetSq <- preprocessQuantile(rgSet)
message("RGSet data quantile normalized")
## 4. Remove cross-reactive probes
no_cross_reactive <- !(featureNames(mSetSq) %in% cross_reactive)
mSetSq <- mSetSq[no_cross_reactive, ]
message("Cross-reactive probes removed")
## 5. Drop probes that are also SNPs
if (${"T" if keep_only_cpg_probes else "F"} ){
mSetSq <- dropLociWithSnps(mSetSq)
message("probes overlapping with SNPs removed")
}
## 6. Remove probes with < ${probe_pval_cutoff} detection p-values
detP <- detP[match(featureNames(mSetSq),rownames(detP)),]
keep <- rowSums(detP < ${probe_pval_cutoff}) == ncol(mSetSq)
mSetSq <- mSetSq[keep,]
## 7. get Beta and M values
mSetSqbval <- getBeta(mSetSq)%>%as_tibble(rownames = "ID")
mSetSqMval <- getM(mSetSq)%>%as_tibble(rownames = "ID")
message("Beta-value and M value obtained")
## 8. output data
mSetSqbval = mSetSqbval%>%rename_at(vars(rgSet@colData%>%rownames()), function(x) rgSet@colData[x,]%>%as_tibble%>%pull(Sample_Name) )
mSetSqMval = mSetSqMval%>%rename_at(vars(rgSet@colData%>%rownames()), function(x) rgSet@colData[x,]%>%as_tibble%>%pull(Sample_Name) )
mSetSqbval%>%readr::write_delim("${_output[1]}","\t")
mSetSqMval%>%readr::write_delim("${_output[2]}","\t")
output = list("rgSet" = rgSet,mSetSq = "mSetSq", mSetSqbval = "mSetSqbval")
output%>%saveRDS(${_output[0]:r})
Annotate probes#
The probes are annotated via sesameData
package and formatted as bgzipped bed files, regardless of method used to process the IDAT.
[*_2]
output: f'{_input[1]:n}.bed.gz', f'{_input[2]:n}.bed.gz', f'{_input[0]:n}.gene_id.annot.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
library(sesame)
library(tibble)
library(dplyr)
library(readr)
sesameData::sesameDataCache()
betas = read_delim("${_input[1]}","\t")
M = read_delim("${_input[2]}","\t")
probe_annot = sesameData::sesameData_annoProbes(betas$ID,column = "gene_id")
probe_annot = cbind("ID" = probe_annot%>%names,probe_annot%>%as_tibble)%>%as_tibble
betas = inner_join(probe_annot%>%dplyr::select("#chr" = seqnames, start , end , ID ),betas , by = "ID" )%>%
mutate(end = start +1 ,chr_num = stringr::str_remove(`#chr`,"chr")%>%as.numeric)%>%arrange(chr_num,`#chr`,start)%>%select(-chr_num)
M = inner_join(probe_annot%>%dplyr::select("#chr" = seqnames, start , end , ID ), M , by = "ID" )%>%
mutate(end = start +1 , chr_num = stringr::str_remove(`#chr`,"chr")%>%as.numeric)%>%arrange(chr_num,`#chr`,start)%>%select(-chr_num)
betas%>%readr::write_delim("${_output[0]:n}","\t")
M%>%readr::write_delim("${_output[1]:n}","\t")
probe_annot%>%write_delim("${_output[2]}","\t")
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
bgzip -f ${_output[0]:n}
tabix ${_output[0]}
bgzip -f ${_output[1]:n}
tabix ${_output[1]}
rm -f ${_output[0]:n} ${_output[1]:n}
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
for i in $[_output[0]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6,7,8,9,10 >> $stdout ; done
for i in $[_output[1]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6,7,8,9,10 >> $stdout ; done
for i in $[_output[2]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6,7,8,9,10 >> $stdout ; done