RNA-seq calling and QC#
This document shows the use of various modules to prepare reference data, perform RNA-seq calling, expression level quantification and quality control. In particular,
A minimal working example is available on Google Drive.
This is a master control notebook for the molecular_phenotype_calling section of our pipeline. When following example commands was excuted, it will checked if the reference data was presented and generate the reference data if not. After that, molecular phenotype calling will be done based on the recipe that were fed into it. For each rows in the recipe, It will output a bed.gz file that are ready for downstream analysis.
By default, the reference data is assumed to be stored in the reference_data folder under the cwd. However, user can specify whether the
Input:
A recipe file, with columns:
"Theme", "Sample_fastq_list", "Data_dir"
Output:
A bed.gz file for each row in the Input
parameters:
flag
--stranded
or--no-stranded
: whether the rna seq data intended to be called is stranded or not.flag
--pair-ended
or--no-pair-ended
: whether the rna seq data intended to be called is paired-ended or not.--exe_dir
: diretory to the workflow notebook--sample-participant-lookup
: a sample participant lookup table with the first column being sample ID and second column being the cooresponding sample ids in the genotype file
sos run bulk_expression_commands.ipynb -h
usage: sos run bulk_expression_commands.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
ref_data
gene_annotation
indexing
fastqc
rnaseqc
RSEM
QC
normalization
Global Workflow Options:
--recipe VAL (as path, required)
The aforementioned input recipe
--cwd . (as path)
Overall wd, the file structure of analysis is
wd/[steps]/[sub_dir for each steps]
--refd path(f'{cwd}/reference_data/')
dir to store the reference data
--exe-dir /home/gw/GIT/xqtl-protocol/pipeline (as path)
Diretory to the excutable
--container-rnaquant '/mnt/mfs/statgen/containers/rna_quantification.sif'
--yml csg.yml (as path)
--jobs 50 (as int)
--queue 'csg,q'
Sections
ref_data_1:
ref_data_2:
gene_annotation:
Workflow Options:
--[no-]stranded (required)
indexing:
fastqc:
rnaseqc:
Workflow Options:
--fasta-with-adapters-etc 'TruSeq3-PE.fa'
RSEM:
Workflow Options:
--fasta-with-adapters-etc 'TruSeq3-PE.fa'
--[no-]is-paired-end (required)
QC:
Workflow Options:
--low-expr-TPM 0.1 (as float)
--low-expr-TPM-percent 0.2 (as float)
--RLEFilterPercent 0.05 (as float)
--DSFilterPercent 0.05 (as float)
--topk-genes 100 (as int)
--cluster-percent 0.6 (as float)
--pvalues-cut 0.05 (as float)
--cluster-level 5 (as int)
normalization:
Workflow Options:
--sample-participant-lookup VAL (as path, required)
--tpm-threshold 0.1 (as float)
--count-threshold 6 (as int)
--sample-frac-threshold 0.2 (as float)
--normalization-method tmm
[global]
## The aforementioned input recipe
parameter: recipe = path
## Overall wd, the file structure of analysis is wd/[steps]/[sub_dir for each steps]
parameter: cwd = path("output")
## dir to store the reference data
parameter: refd = path(f'{cwd}/reference_data/')
## Diretory to the excutable
parameter: exe_dir = path("~/GIT/xqtl-protocol/pipeline/")
parameter: container_rnaquant = '/mnt/mfs/statgen/containers/rna_quantification.sif'
## Cloud computing options:
parameter: yml = path("./csg.yml")
parameter: jobs = 50 # Number of jobs that are submitted to the cluster
parameter: queue = "csg" # The queue that jobs are submitted to
submission = f'-J {jobs} -c {yml} -q {queue}'
parameter: yml = path("csg.yml")
import pandas as pd
input_inv = pd.read_csv(recipe, sep = "\t").to_dict("records")
import os
import pandas as pd
recipe = pd.DataFrame( {"Theme" : ["AC","DLPFC","PCC"] ,
"Sample_fastq_list" : ["AC_sample_fastq.list",
"DLPFC_sample_fastq.list",
"PCC_sample_fastq"],
"Data_dir" : ["/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/",
"/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/",
"/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/"]})
recipe.to_csv("/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/mwe_recipe.txt","\t",index = 0)
recipe
Theme | Sample_fastq_list | Data_dir | |
---|---|---|---|
0 | AC | AC_sample_fastq.list | /mnt/mfs/statgen/xqtl_workflow_testing/rna_qua... |
1 | DLPFC | DLPFC_sample_fastq.list | /mnt/mfs/statgen/xqtl_workflow_testing/rna_qua... |
2 | PCC | PCC_sample_fastq | /mnt/mfs/statgen/xqtl_workflow_testing/rna_qua... |
sos dryrun ./bulk_expression_master.ipynb fastqc \
--recipe "/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/mwe_recipe.txt" \
--cwd ./
sos dryrun ./bulk_expression_master.ipynb normalization \
--recipe "/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/mwe_recipe.txt" \
--cwd ./ \
--sample-participant-lookup /mnt/mfs/statgen/snuc_pseudo_bulk/data/reference_data/sampleSheetAfterQC.txt \
--fasta-with-adapters-etc "TruSeq3-PE.fa" \
--stranded \
--is-paired-end > example_script
cd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control
cat example_script | grep -v HINT
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/pipeline/reference_data.ipynb download_hg_reference --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/pipeline/reference_data.ipynb download_gene_annotation --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/pipeline/reference_data.ipynb download_ercc_reference --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/reference_data.ipynb hg_reference \
--cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/GRCh38_full_analysis_set_plus_decoy_hla /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data \
--ercc-reference /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92.fa \
--hg-reference /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
--container /mnt/mfs/statgen/containers/rna_quantification.sif
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/reference_data.ipynb gene_annotation \
--cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data \
--ercc-gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92.gtf \
--hg-gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
--hg-reference /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/GRCh38_full_analysis_set_plus_decoy_hla /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92.noALT_noHLA_noDecoy_ERCC.fasta \
--container /mnt/mfs/statgen/containers/rna_quantification.sif \
--stranded -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/reference_data.ipynb STAR_index RSEM_index \
--cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data \
--hg-reference /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/GRCh38_full_analysis_set_plus_decoy_hla /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92.noALT_noHLA_noDecoy_ERCC.fasta \
--hg-gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf \
--container /mnt/mfs/statgen/containers/rna_quantification.sif \
--mem 40G -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
--cwd ./ \
--samples AC_sample_fastq.list \
--data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
--fasta_with_adapters_etc TruSeq3-PE.fa \
--STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
--gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf \
--container /mnt/mfs/statgen/containers/rna_quantification.sif \
--mem 40G -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
--cwd ./ \
--samples DLPFC_sample_fastq.list \
--data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
--fasta_with_adapters_etc TruSeq3-PE.fa \
--STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
--gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf \
--container /mnt/mfs/statgen/containers/rna_quantification.sif \
--mem 40G -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
--cwd ./ \
--samples PCC_sample_fastq \
--data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
--fasta_with_adapters_etc TruSeq3-PE.fa \
--STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
--gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf \
--container /mnt/mfs/statgen/containers/rna_quantification.sif \
--mem 40G -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
--cwd ./ \
--samples AC_sample_fastq.list \
--data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
--fasta_with_adapters_etc TruSeq3-PE.fa \
--STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
--RSEM-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/RSEM_Index \
--container /mnt/mfs/statgen/containers/rna_quantification.sif \
--mem 40G \
--is_paired_end -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
--cwd ./ \
--samples DLPFC_sample_fastq.list \
--data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
--fasta_with_adapters_etc TruSeq3-PE.fa \
--STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
--RSEM-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/RSEM_Index \
--container /mnt/mfs/statgen/containers/rna_quantification.sif \
--mem 40G \
--is_paired_end -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
--cwd ./ \
--samples PCC_sample_fastq \
--data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
--fasta_with_adapters_etc TruSeq3-PE.fa \
--STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
--RSEM-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/RSEM_Index \
--container /mnt/mfs/statgen/containers/rna_quantification.sif \
--mem 40G \
--is_paired_end -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_QC.ipynb qc \
--cwd . \
--tpm-gct AC.rnaseqc.gene_tpm.gct.gz \
--counts-gct AC.rnaseqc.gene_readsCount.gct.gz \
--low_expr_TPM 0.1 \
--low_expr_TPM_percent 0.2 \
--RLEFilterPercent 0.05 \
--DSFilterPercent 0.05 \
--topk_genes 100 \
--cluster_percent 0.6 \
--pvalues_cut 0.05 \
--cluster_level 5 \
--container /mnt/mfs/statgen/containers/rna_quantification.sif -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_QC.ipynb qc \
--cwd . \
--tpm-gct DLPFC.rnaseqc.gene_tpm.gct.gz \
--counts-gct DLPFC.rnaseqc.gene_readsCount.gct.gz \
--low_expr_TPM 0.1 \
--low_expr_TPM_percent 0.2 \
--RLEFilterPercent 0.05 \
--DSFilterPercent 0.05 \
--topk_genes 100 \
--cluster_percent 0.6 \
--pvalues_cut 0.05 \
--cluster_level 5 \
--container /mnt/mfs/statgen/containers/rna_quantification.sif -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_QC.ipynb qc \
--cwd . \
--tpm-gct PCC.rnaseqc.gene_tpm.gct.gz \
--counts-gct PCC.rnaseqc.gene_readsCount.gct.gz \
--low_expr_TPM 0.1 \
--low_expr_TPM_percent 0.2 \
--RLEFilterPercent 0.05 \
--DSFilterPercent 0.05 \
--topk_genes 100 \
--cluster_percent 0.6 \
--pvalues_cut 0.05 \
--cluster_level 5 \
--container /mnt/mfs/statgen/containers/rna_quantification.sif -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_normalization.ipynb normalize \
--cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control \
--tpm-gct AC.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
--counts-gct AC.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
--annotation-gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf \
--sample-participant-lookup /mnt/mfs/statgen/snuc_pseudo_bulk/data/reference_data/sampleSheetAfterQC.txt \
--container /mnt/mfs/statgen/containers/rna_quantification.sif \
--tpm_threshold 0.1 \
--count_threshold 6 \
--sample_frac_threshold 0.2 \
--normalization_method tmm -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_normalization.ipynb normalize \
--cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control \
--tpm-gct DLPFC.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
--counts-gct DLPFC.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
--annotation-gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf \
--sample-participant-lookup /mnt/mfs/statgen/snuc_pseudo_bulk/data/reference_data/sampleSheetAfterQC.txt \
--container /mnt/mfs/statgen/containers/rna_quantification.sif \
--tpm_threshold 0.1 \
--count_threshold 6 \
--sample_frac_threshold 0.2 \
--normalization_method tmm -J 50 -c csg.yml -q csg,q
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_normalization.ipynb normalize \
--cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control \
--tpm-gct PCC.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
--counts-gct PCC.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
--annotation-gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf \
--sample-participant-lookup /mnt/mfs/statgen/snuc_pseudo_bulk/data/reference_data/sampleSheetAfterQC.txt \
--container /mnt/mfs/statgen/containers/rna_quantification.sif \
--tpm_threshold 0.1 \
--count_threshold 6 \
--sample_frac_threshold 0.2 \
--normalization_method tmm -J 50 -c csg.yml -q csg,q
Download, Process, and index reference data (first time use)#
Download reference and annotation data (first time use)#
[ref_data_1]
output: f'{refd:a}/GRCh38_full_analysis_set_plus_decoy_hla.fa',
f"{refd:a}/Homo_sapiens.GRCh38.103.chr.gtf",
f"{refd:a}/ERCC92.gtf", f"{refd:a}/ERCC92.fa"
report: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
sos run $[exe_dir]/pipeline/reference_data.ipynb download_hg_reference --cwd $[_output[0]:d]
sos run $[exe_dir]/pipeline/reference_data.ipynb download_gene_annotation --cwd $[_output[0]:d]
sos run $[exe_dir]/pipeline/reference_data.ipynb download_ercc_reference --cwd $[_output[0]:d]
Reformat reference and annotation data (first time use)#
[ref_data_2]
output: f'{_input:n}.noALT_noHLA_noDecoy_ERCC.fasta',f'{_input:bn}.noALT_noHLA_noDecoy_ERCC.dict'
report: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
sos run $[exe_dir]/reference_data.ipynb hg_reference \
--cwd $[_output[0]:d] \
--ercc-reference $[_input[3]] \
--hg-reference $[_input[1]] \
--container $[container_rnaquant]
**Notice: you need to know if your RNA-seq data is stranded or unstranded. If it is stranded, you should add `--stranded` to the command below.**
[gene_annotation]
parameter: stranded = bool
input: output_from("ref_data_1"),output_from("ref_data_2")[0]
output: f'{_input[1]:n}.reformatted.ERCC.gtf', f'{_input[1]:n}.reformatted{".collapse_only" if stranded else ""}.gene.ERCC.gtf'
report: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
sos run $[exe_dir]/reference_data.ipynb gene_annotation \
--cwd $[_output[0]:d] \
--ercc-gtf $[_input[2]] \
--hg-gtf $[_input[1]] \
--hg-reference $[_input[4]] \
--container $[container_rnaquant] \
$["--stranded" if stranded else "--no-stranded"] $[submission if yml.is_file() else "" ]
Generation of Index (First time only)#
[indexing]
input: output_from("gene_annotation")[1],output_from("ref_data_2")[0]
output: f"{refd:a}/RSEM_Index/rsem_reference.n2g.idx.fa",
f"{refd:a}/RSEM_Index/rsem_reference.grp",
f"{refd:a}/RSEM_Index/rsem_reference.idx.fa",
f"{refd:a}/RSEM_Index/rsem_reference.ti",
f"{refd:a}/RSEM_Index/rsem_reference.chrlist",
f"{refd:a}/RSEM_Index/rsem_reference.seq",
f"{refd:a}/RSEM_Index/rsem_reference.transcripts.fa",
f"{refd:a}/STAR_Index/chrName.txt",
f"{refd:a}/STAR_Index/Log.out",
f"{refd:a}/STAR_Index/transcriptInfo.tab",
f"{refd:a}/STAR_Index/exonGeTrInfo.tab",
f"{refd:a}/STAR_Index/SAindex",
f"{refd:a}/STAR_Index/SA",
f"{refd:a}/STAR_Index/genomeParameters.txt",
f"{refd:a}/STAR_Index/chrStart.txt",
f"{refd:a}/STAR_Index/sjdbList.out.tab",
f"{refd:a}/STAR_Index/exonInfo.tab",
f"{refd:a}/STAR_Index/sjdbList.fromGTF.out.tab",
f"{refd:a}/STAR_Index/chrLength.txt",
f"{refd:a}/STAR_Index/sjdbInfo.txt",
f"{refd:a}/STAR_Index/Genome",
f"{refd:a}/STAR_Index/chrNameLength.txt",
f"{refd:a}/STAR_Index/geneInfo.tab"
report: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
sos run $[exe_dir]/reference_data.ipynb STAR_index RSEM_index \
--cwd $[refd:a] \
--hg-reference $[_input[1]] \
--hg-gtf $[_input[0]] \
--container $[container_rnaquant] \
--mem 40G $[submission if yml.is_file() else "" ]
Perform data quality summary via fastqc
#
You need to figure out from fastqc results what adapter reference sequence to use for --fasta_with_adapters_etc
. See section Step 1 below for details. We chose --fasta_with_adapters_etc TruSeq3-PE.fa
for our toy example.
fasta_with_adapters_etc
: filename for the adapter reference file. According toTrimmomatic
documention,
As a rule of thumb newer libraries will use
TruSeq3
, but this really depends on your service provider. If you use FASTQC, the “Overrepresented Sequences” report can help indicate which adapter file is best suited for your data. “Illumina Single End” or “Illumina Paired End” sequences indicate single-end or paired-endTruSeq2
libraries, and the appropriate adapter files areTruSeq2-SE.fa
andTruSeq2-PE.fa
respectively. “TruSeq Universal Adapter” or “TruSeq Adapter, Index …” indicatesTruSeq-3
libraries, and the appropriate adapter files areTruSeq3-SE.fa
orTruSeq3-PE.fa
, for single-end and paired-end data respectively. Adapter sequences forTruSeq2
multiplexed libraries, indicated by “Illumina Multiplexing …”, and the various RNA library preparations are not currently included.
We have fastqc
workflow defined as followed. Users should decide what fasta adapter reference to use based on fastqc
results (or their own knowledge).
It is assumed that all studies that were ran in one run are using the same adapter file
[fastqc]
input: for_each = "input_inv"
report: expand = "$[ ]"
sos run $[exe_dir]/RNA_calling.ipynb fastqc \
--cwd $[cwd:a]/output/rnaseq/fastqc/ \
--samples $[_input_inv['Sample_fastq_list']] \
--data-dir $[_input_inv['Data_dir']] \
--container $[container_rnaquant] $[submission if yml.is_file() else "" ]
Multi-sample RNA-seq Calling, QC, and normalizations (Automatically for every studies using the same set of ref/index/gtf)#
Call gene-level RNA expression via rnaseqc
#
[rnaseqc]
parameter:fasta_with_adapters_etc = "TruSeq3-PE.fa"
input: output_from("gene_annotation")[-1],output_from("indexing")[-1],for_each = "input_inv"
output: f'{cwd}/{_input_inv["Theme"]}.rnaseqc.gene_tpm.gct.gz',
f'{cwd}/{_input_inv["Theme"]}.rnaseqc.gene_readsCount.gct.gz',
f'{cwd}/{_input_inv["Theme"]}.rnaseqc.exon_readsCount.gct.gz',
f'{cwd}/{_input_inv["Theme"]}.rnaseqc.metrics.tsv'
report: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
sos run $[exe_dir]/RNA_calling.ipynb rnaseqc_call \
--cwd $[_output[0]:d]/ \
--samples $[_input_inv['Sample_fastq_list']] \
--data-dir $[_input_inv['Data_dir']] \
--fasta_with_adapters_etc $[fasta_with_adapters_etc] \
--STAR-index $[_input[1]:d] \
--gtf $[_input[0]] \
--container $[container_rnaquant] \
--mem 40G $[submission if yml.is_file() else "" ]
Call transcript level RNA expression via RSEM
#
[RSEM]
parameter:fasta_with_adapters_etc = "TruSeq3-PE.fa"
parameter: paired_end = bool
input: output_from("indexing")[0],output_from("indexing")[-1],for_each = "input_inv"
output: f'{cwd}/{_input_inv["Theme"]}.rsem_transcripts_expected_count.txt.gz',
f'{cwd}/{_input_inv["Theme"]}.rsem_transcripts_tpm.txt.gz',
f'{cwd}/{_input_inv["Theme"]}.rsem_transcripts_fpkm.txt.gz',
f'{cwd}/{_input_inv["Theme"]}.rsem_transcripts_isopct.txt.gz',
f'{cwd}/{_input_inv["Theme"]}.rsem_genes_expected_count.txt.gz',
f'{cwd}/{_input_inv["Theme"]}.rsem_genes_tpm.txt.gz',
f'{cwd}/{_input_inv["Theme"]}.rsem_genes_fpkm.txt.gz'
report: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
sos run $[exe_dir]/RNA_calling.ipynb rnaseqc_call \
--cwd $[_output[0]:d]/ \
--samples $[_input_inv['Sample_fastq_list']] \
--data-dir $[_input_inv['Data_dir']] \
--fasta_with_adapters_etc $[fasta_with_adapters_etc] \
--STAR-index $[_input[1]:d] \
--RSEM-index $[_input[0]:d] \
--container $[container_rnaquant] \
--mem 40G \
$["--paired_end" if paired_end else " "] $[submission if yml.is_file() else "" ]
Multi-sample RNA-seq QC (For every set of samples)#
We need to use a different MWE data-set that contains multiple samples – here is the Google Drive link.
[QC]
parameter: low_expr_TPM = 0.1
parameter: low_expr_TPM_percent = 0.2
parameter: RLEFilterPercent = 0.05
parameter: DSFilterPercent = 0.05
parameter: topk_genes = 100
parameter: cluster_percent = 0.6
parameter: pvalues_cut = 0.05
parameter: cluster_level = 5
depends: output_from("RSEM")
input: output_from("rnaseqc")
output: f'{cwd}/{_input[0]:bnnn}.low_expression_filtered.outlier_removed.tpm.gct.gz', f'{cwd}/{_input[0]:bnnn}.low_expression_filtered.outlier_removed.geneCount.gct.gz'
report: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
sos run $[exe_dir]/bulk_expression_QC.ipynb qc \
--cwd $[_output[0]:d] \
--tpm-gct $[_input[0]] \
--counts-gct $[_input[1]] \
--low_expr_TPM $[low_expr_TPM] \
--low_expr_TPM_percent $[low_expr_TPM_percent] \
--RLEFilterPercent $[RLEFilterPercent] \
--DSFilterPercent $[DSFilterPercent] \
--topk_genes $[topk_genes] \
--cluster_percent $[cluster_percent] \
--pvalues_cut $[pvalues_cut] \
--cluster_level $[cluster_level] \
--container $[container_rnaquant] $[submission if yml.is_file() else "" ]
Multi-sample read count normalization (For every set of samples)#
[normalization]
parameter: sample_participant_lookup = path
parameter: tpm_threshold = 0.1
parameter: count_threshold = 6
parameter: sample_frac_threshold = 0.2
parameter: normalization_method = 'tmm'
input: output_from("QC"),output_from("gene_annotation")
output: f'{cwd:a}/{_input[0]:bnnn}.{normalization_method}.expression.bed.gz'
report: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
sos run $[exe_dir]/bulk_expression_normalization.ipynb normalize \
--cwd $[_output:d] \
--tpm-gct $[_input[0]] \
--counts-gct $[_input[1]] \
--annotation-gtf $[_input[-1]] \
--sample-participant-lookup $[sample_participant_lookup] \
--container $[container_rnaquant] \
--tpm_threshold $[tpm_threshold] \
--count_threshold $[count_threshold] \
--sample_frac_threshold $[sample_frac_threshold] \
--normalization_method $[normalization_method] $[submission if yml.is_file() else "" ]