RNA-seq calling and QC

RNA-seq calling and QC#

This document shows the use of various modules to prepare reference data, perform RNA-seq calling, expression level quantification and quality control. In particular,

A minimal working example is available on Google Drive.

This is a master control notebook for the molecular_phenotype_calling section of our pipeline. When following example commands was excuted, it will checked if the reference data was presented and generate the reference data if not. After that, molecular phenotype calling will be done based on the recipe that were fed into it. For each rows in the recipe, It will output a bed.gz file that are ready for downstream analysis.

By default, the reference data is assumed to be stored in the reference_data folder under the cwd. However, user can specify whether the

Input:

A recipe file, with columns: "Theme", "Sample_fastq_list", "Data_dir"
Output:

A bed.gz file for each row in the Input
parameters:
- flag --stranded or --no-stranded : whether the rna seq data intended to be called is stranded or not.
- flag --pair-ended or --no-pair-ended : whether the rna seq data intended to be called is paired-ended or not.
- --exe_dir : diretory to the workflow notebook
- --sample-participant-lookup : a sample participant lookup table with the first column being sample ID and second column being the cooresponding sample ids in the genotype file

sos run bulk_expression_commands.ipynb -h

usage: sos run bulk_expression_commands.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  ref_data
  gene_annotation
  indexing
  fastqc
  rnaseqc
  RSEM
  QC
  normalization

Global Workflow Options:
  --recipe VAL (as path, required)
                        The aforementioned input recipe
  --cwd . (as path)
                        Overall wd, the file structure of analysis is
                        wd/[steps]/[sub_dir for each steps]
  --refd  path(f'{cwd}/reference_data/')

                        dir to store the reference data
  --exe-dir /home/gw/GIT/xqtl-protocol/pipeline (as path)
                        Diretory to the excutable
  --container-rnaquant '/mnt/mfs/statgen/containers/rna_quantification.sif'
  --yml csg.yml (as path)
  --jobs 50 (as int)
  --queue 'csg,q'

Sections
  ref_data_1:
  ref_data_2:
  gene_annotation:
    Workflow Options:
      --[no-]stranded (required)
  indexing:
  fastqc:
  rnaseqc:
    Workflow Options:
      --fasta-with-adapters-etc 'TruSeq3-PE.fa'
  RSEM:
    Workflow Options:
      --fasta-with-adapters-etc 'TruSeq3-PE.fa'
      --[no-]is-paired-end (required)
  QC:
    Workflow Options:
      --low-expr-TPM 0.1 (as float)
      --low-expr-TPM-percent 0.2 (as float)
      --RLEFilterPercent 0.05 (as float)
      --DSFilterPercent 0.05 (as float)
      --topk-genes 100 (as int)
      --cluster-percent 0.6 (as float)
      --pvalues-cut 0.05 (as float)
      --cluster-level 5 (as int)
  normalization:
    Workflow Options:
      --sample-participant-lookup VAL (as path, required)
      --tpm-threshold 0.1 (as float)
      --count-threshold 6 (as int)
      --sample-frac-threshold 0.2 (as float)
      --normalization-method tmm

[global]
## The aforementioned input recipe
parameter: recipe = path
## Overall wd, the file structure of analysis is wd/[steps]/[sub_dir for each steps]
parameter:  cwd = path("output")
## dir to store the reference data
parameter: refd = path(f'{cwd}/reference_data/')
## Diretory to the excutable
parameter: exe_dir = path("~/GIT/xqtl-protocol/pipeline/")
parameter: container_rnaquant = '/mnt/mfs/statgen/containers/rna_quantification.sif'
## Cloud computing options:
parameter: yml = path("./csg.yml")
parameter: jobs = 50 # Number of jobs that are submitted to the cluster
parameter: queue = "csg" # The queue that jobs are submitted to
submission = f'-J {jobs} -c {yml} -q {queue}'
parameter: yml = path("csg.yml") 
import pandas as pd
input_inv = pd.read_csv(recipe, sep = "\t").to_dict("records")
import os

import pandas as pd
recipe = pd.DataFrame( {"Theme" : ["AC","DLPFC","PCC"] ,
                "Sample_fastq_list" : ["AC_sample_fastq.list",
                   "DLPFC_sample_fastq.list",
                   "PCC_sample_fastq"],
                "Data_dir" : ["/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/",
                     "/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/",
                     "/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/"]})
recipe.to_csv("/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/mwe_recipe.txt","\t",index = 0)
recipe

	Theme	Sample_fastq_list	Data_dir
0	AC	AC_sample_fastq.list	/mnt/mfs/statgen/xqtl_workflow_testing/rna_qua...
1	DLPFC	DLPFC_sample_fastq.list	/mnt/mfs/statgen/xqtl_workflow_testing/rna_qua...
2	PCC	PCC_sample_fastq	/mnt/mfs/statgen/xqtl_workflow_testing/rna_qua...

sos dryrun ./bulk_expression_master.ipynb fastqc \
    --recipe "/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/mwe_recipe.txt" \
    --cwd ./
    

sos dryrun ./bulk_expression_master.ipynb normalization \
    --recipe "/mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/mwe_recipe.txt"  \
    --cwd ./ \
    --sample-participant-lookup /mnt/mfs/statgen/snuc_pseudo_bulk/data/reference_data/sampleSheetAfterQC.txt  \
    --fasta-with-adapters-etc  "TruSeq3-PE.fa" \
    --stranded \
    --is-paired-end > example_script

cd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control
cat example_script | grep -v HINT

sos run /home/hs3163/GIT/xqtl-protocol/pipeline/pipeline/reference_data.ipynb download_hg_reference --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/pipeline/reference_data.ipynb download_gene_annotation --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data
sos run /home/hs3163/GIT/xqtl-protocol/pipeline/pipeline/reference_data.ipynb download_ercc_reference --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/reference_data.ipynb hg_reference \
    --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/GRCh38_full_analysis_set_plus_decoy_hla /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data \
    --ercc-reference /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92.fa \
    --hg-reference /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/reference_data.ipynb gene_annotation \
        --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data \
        --ercc-gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92.gtf \
        --hg-gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
        --hg-reference /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/GRCh38_full_analysis_set_plus_decoy_hla /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92.noALT_noHLA_noDecoy_ERCC.fasta \
        --container /mnt/mfs/statgen/containers/rna_quantification.sif \
        --stranded  -J 50 -c csg.yml -q csg,q



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/reference_data.ipynb STAR_index RSEM_index \
    --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data \
    --hg-reference /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/GRCh38_full_analysis_set_plus_decoy_hla /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92 /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/ERCC92.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif \
    --mem 40G -J 50 -c csg.yml -q csg,q



sos run  /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd ./ \
    --samples  AC_sample_fastq.list \
    --data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
    --fasta_with_adapters_etc TruSeq3-PE.fa  \
    --STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
    --gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif \
    --mem 40G -J 50 -c csg.yml -q csg,q



sos run  /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd ./ \
    --samples  DLPFC_sample_fastq.list \
    --data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
    --fasta_with_adapters_etc TruSeq3-PE.fa  \
    --STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
    --gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif \
    --mem 40G -J 50 -c csg.yml -q csg,q



sos run  /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd ./ \
    --samples  PCC_sample_fastq \
    --data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
    --fasta_with_adapters_etc TruSeq3-PE.fa  \
    --STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
    --gtf /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif \
    --mem 40G -J 50 -c csg.yml -q csg,q



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd ./ \
    --samples  AC_sample_fastq.list \
    --data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
    --fasta_with_adapters_etc TruSeq3-PE.fa  \
    --STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
    --RSEM-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/RSEM_Index \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif \ 
    --mem 40G \
    --is_paired_end -J 50 -c csg.yml -q csg,q



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd ./ \
    --samples  DLPFC_sample_fastq.list \
    --data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
    --fasta_with_adapters_etc TruSeq3-PE.fa  \
    --STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
    --RSEM-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/RSEM_Index \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif \ 
    --mem 40G \
    --is_paired_end -J 50 -c csg.yml -q csg,q



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd ./ \
    --samples  PCC_sample_fastq \
    --data-dir /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/rna_expression/ \
    --fasta_with_adapters_etc TruSeq3-PE.fa  \
    --STAR-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/STAR_Index \
    --RSEM-index /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/RSEM_Index \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif \ 
    --mem 40G \
    --is_paired_end -J 50 -c csg.yml -q csg,q



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_QC.ipynb qc \
    --cwd . \
    --tpm-gct AC.rnaseqc.gene_tpm.gct.gz \
    --counts-gct  AC.rnaseqc.gene_readsCount.gct.gz \
    --low_expr_TPM 0.1  \
    --low_expr_TPM_percent 0.2 \
    --RLEFilterPercent 0.05 \
    --DSFilterPercent 0.05 \
    --topk_genes 100 \
    --cluster_percent 0.6 \
    --pvalues_cut 0.05 \
    --cluster_level 5 \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif -J 50 -c csg.yml -q csg,q



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_QC.ipynb qc \
    --cwd . \
    --tpm-gct DLPFC.rnaseqc.gene_tpm.gct.gz \
    --counts-gct  DLPFC.rnaseqc.gene_readsCount.gct.gz \
    --low_expr_TPM 0.1  \
    --low_expr_TPM_percent 0.2 \
    --RLEFilterPercent 0.05 \
    --DSFilterPercent 0.05 \
    --topk_genes 100 \
    --cluster_percent 0.6 \
    --pvalues_cut 0.05 \
    --cluster_level 5 \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif -J 50 -c csg.yml -q csg,q



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_QC.ipynb qc \
    --cwd . \
    --tpm-gct PCC.rnaseqc.gene_tpm.gct.gz \
    --counts-gct  PCC.rnaseqc.gene_readsCount.gct.gz \
    --low_expr_TPM 0.1  \
    --low_expr_TPM_percent 0.2 \
    --RLEFilterPercent 0.05 \
    --DSFilterPercent 0.05 \
    --topk_genes 100 \
    --cluster_percent 0.6 \
    --pvalues_cut 0.05 \
    --cluster_level 5 \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif -J 50 -c csg.yml -q csg,q



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_normalization.ipynb normalize \
    --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control \
    --tpm-gct AC.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct AC.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf  /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf  \
    --sample-participant-lookup /mnt/mfs/statgen/snuc_pseudo_bulk/data/reference_data/sampleSheetAfterQC.txt \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif  \
    --tpm_threshold 0.1 \
    --count_threshold 6  \
    --sample_frac_threshold 0.2 \
    --normalization_method tmm  -J 50 -c csg.yml -q csg,q



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_normalization.ipynb normalize \
    --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control \
    --tpm-gct DLPFC.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct DLPFC.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf  /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf  \
    --sample-participant-lookup /mnt/mfs/statgen/snuc_pseudo_bulk/data/reference_data/sampleSheetAfterQC.txt \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif  \
    --tpm_threshold 0.1 \
    --count_threshold 6  \
    --sample_frac_threshold 0.2 \
    --normalization_method tmm  -J 50 -c csg.yml -q csg,q



sos run /home/hs3163/GIT/xqtl-protocol/pipeline/bulk_expression_normalization.ipynb normalize \
    --cwd /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control \
    --tpm-gct PCC.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct PCC.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf  /mnt/mfs/statgen/xqtl_workflow_testing/phenotype_master_control/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted..collapse_only.gene.ERCC.gtf  \
    --sample-participant-lookup /mnt/mfs/statgen/snuc_pseudo_bulk/data/reference_data/sampleSheetAfterQC.txt \
    --container /mnt/mfs/statgen/containers/rna_quantification.sif  \
    --tpm_threshold 0.1 \
    --count_threshold 6  \
    --sample_frac_threshold 0.2 \
    --normalization_method tmm  -J 50 -c csg.yml -q csg,q

Download, Process, and index reference data (first time use)#

Download reference and annotation data (first time use)#

[ref_data_1]
output: f'{refd:a}/GRCh38_full_analysis_set_plus_decoy_hla.fa',
        f"{refd:a}/Homo_sapiens.GRCh38.103.chr.gtf",
        f"{refd:a}/ERCC92.gtf", f"{refd:a}/ERCC92.fa"
report:  expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
        sos run $[exe_dir]/pipeline/reference_data.ipynb download_hg_reference --cwd $[_output[0]:d]
        sos run $[exe_dir]/pipeline/reference_data.ipynb download_gene_annotation --cwd $[_output[0]:d]
        sos run $[exe_dir]/pipeline/reference_data.ipynb download_ercc_reference --cwd $[_output[0]:d]

Reformat reference and annotation data (first time use)#

[ref_data_2]
output:  f'{_input:n}.noALT_noHLA_noDecoy_ERCC.fasta',f'{_input:bn}.noALT_noHLA_noDecoy_ERCC.dict'
report:  expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
        sos run $[exe_dir]/reference_data.ipynb hg_reference \
            --cwd $[_output[0]:d] \
            --ercc-reference $[_input[3]] \
            --hg-reference $[_input[1]] \
            --container $[container_rnaquant]

**Notice: you need to know if your RNA-seq data is stranded or unstranded. If it is stranded, you should add `--stranded` to the command below.**

[gene_annotation]
parameter: stranded = bool
input: output_from("ref_data_1"),output_from("ref_data_2")[0]
output: f'{_input[1]:n}.reformatted.ERCC.gtf', f'{_input[1]:n}.reformatted{".collapse_only" if stranded else ""}.gene.ERCC.gtf'
report:  expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    sos run $[exe_dir]/reference_data.ipynb gene_annotation \
            --cwd $[_output[0]:d] \
            --ercc-gtf $[_input[2]] \
            --hg-gtf $[_input[1]] \
            --hg-reference $[_input[4]] \
            --container $[container_rnaquant] \
            $["--stranded" if stranded else "--no-stranded"]  $[submission if yml.is_file() else "" ]

Generation of Index (First time only)#

[indexing]
input: output_from("gene_annotation")[1],output_from("ref_data_2")[0]
output: f"{refd:a}/RSEM_Index/rsem_reference.n2g.idx.fa", 
        f"{refd:a}/RSEM_Index/rsem_reference.grp", 
        f"{refd:a}/RSEM_Index/rsem_reference.idx.fa",
        f"{refd:a}/RSEM_Index/rsem_reference.ti", 
        f"{refd:a}/RSEM_Index/rsem_reference.chrlist",
        f"{refd:a}/RSEM_Index/rsem_reference.seq", 
        f"{refd:a}/RSEM_Index/rsem_reference.transcripts.fa",
        f"{refd:a}/STAR_Index/chrName.txt",
        f"{refd:a}/STAR_Index/Log.out", 
        f"{refd:a}/STAR_Index/transcriptInfo.tab",
        f"{refd:a}/STAR_Index/exonGeTrInfo.tab", 
        f"{refd:a}/STAR_Index/SAindex",
        f"{refd:a}/STAR_Index/SA",
        f"{refd:a}/STAR_Index/genomeParameters.txt", 
        f"{refd:a}/STAR_Index/chrStart.txt",
        f"{refd:a}/STAR_Index/sjdbList.out.tab", 
        f"{refd:a}/STAR_Index/exonInfo.tab",
        f"{refd:a}/STAR_Index/sjdbList.fromGTF.out.tab", 
        f"{refd:a}/STAR_Index/chrLength.txt",
        f"{refd:a}/STAR_Index/sjdbInfo.txt", 
        f"{refd:a}/STAR_Index/Genome",
        f"{refd:a}/STAR_Index/chrNameLength.txt", 
        f"{refd:a}/STAR_Index/geneInfo.tab"
report:  expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
        sos run $[exe_dir]/reference_data.ipynb STAR_index RSEM_index \
            --cwd $[refd:a] \
            --hg-reference $[_input[1]] \
            --hg-gtf $[_input[0]] \
            --container $[container_rnaquant] \
            --mem 40G $[submission if yml.is_file() else "" ]

Perform data quality summary via `fastqc`#

You need to figure out from fastqc results what adapter reference sequence to use for --fasta_with_adapters_etc. See section Step 1 below for details. We chose --fasta_with_adapters_etc TruSeq3-PE.fa for our toy example.

fasta_with_adapters_etc: filename for the adapter reference file. According to Trimmomatic documention,

As a rule of thumb newer libraries will use TruSeq3, but this really depends on your service provider. If you use FASTQC, the “Overrepresented Sequences” report can help indicate which adapter file is best suited for your data. “Illumina Single End” or “Illumina Paired End” sequences indicate single-end or paired-end TruSeq2 libraries, and the appropriate adapter files are TruSeq2-SE.fa and TruSeq2-PE.fa respectively. “TruSeq Universal Adapter” or “TruSeq Adapter, Index …” indicates TruSeq-3 libraries, and the appropriate adapter files are TruSeq3-SE.fa or TruSeq3-PE.fa, for single-end and paired-end data respectively. Adapter sequences for TruSeq2 multiplexed libraries, indicated by “Illumina Multiplexing …”, and the various RNA library preparations are not currently included.

We have fastqc workflow defined as followed. Users should decide what fasta adapter reference to use based on fastqc results (or their own knowledge). It is assumed that all studies that were ran in one run are using the same adapter file

[fastqc]
input: for_each = "input_inv"
report:  expand = "$[ ]"
    sos run $[exe_dir]/RNA_calling.ipynb fastqc \
        --cwd $[cwd:a]/output/rnaseq/fastqc/ \
        --samples $[_input_inv['Sample_fastq_list']] \
        --data-dir $[_input_inv['Data_dir']]   \
        --container $[container_rnaquant] $[submission if yml.is_file() else "" ]

Multi-sample RNA-seq Calling, QC, and normalizations (Automatically for every studies using the same set of ref/index/gtf)#

Call gene-level RNA expression via `rnaseqc`#

[rnaseqc]
parameter:fasta_with_adapters_etc = "TruSeq3-PE.fa"
input:  output_from("gene_annotation")[-1],output_from("indexing")[-1],for_each = "input_inv"
output: f'{cwd}/{_input_inv["Theme"]}.rnaseqc.gene_tpm.gct.gz',
        f'{cwd}/{_input_inv["Theme"]}.rnaseqc.gene_readsCount.gct.gz',
        f'{cwd}/{_input_inv["Theme"]}.rnaseqc.exon_readsCount.gct.gz',
        f'{cwd}/{_input_inv["Theme"]}.rnaseqc.metrics.tsv'
report:  expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
        sos run  $[exe_dir]/RNA_calling.ipynb rnaseqc_call \
            --cwd $[_output[0]:d]/ \
            --samples  $[_input_inv['Sample_fastq_list']] \
            --data-dir $[_input_inv['Data_dir']] \
            --fasta_with_adapters_etc $[fasta_with_adapters_etc]  \
            --STAR-index $[_input[1]:d] \
            --gtf $[_input[0]] \
            --container $[container_rnaquant] \
            --mem 40G $[submission if yml.is_file() else "" ]

Call transcript level RNA expression via `RSEM`#

[RSEM]
parameter:fasta_with_adapters_etc = "TruSeq3-PE.fa"
parameter: paired_end = bool
input:  output_from("indexing")[0],output_from("indexing")[-1],for_each = "input_inv"
output: f'{cwd}/{_input_inv["Theme"]}.rsem_transcripts_expected_count.txt.gz',
        f'{cwd}/{_input_inv["Theme"]}.rsem_transcripts_tpm.txt.gz',
        f'{cwd}/{_input_inv["Theme"]}.rsem_transcripts_fpkm.txt.gz',
        f'{cwd}/{_input_inv["Theme"]}.rsem_transcripts_isopct.txt.gz',
        f'{cwd}/{_input_inv["Theme"]}.rsem_genes_expected_count.txt.gz',
        f'{cwd}/{_input_inv["Theme"]}.rsem_genes_tpm.txt.gz',
        f'{cwd}/{_input_inv["Theme"]}.rsem_genes_fpkm.txt.gz'
report:  expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
        sos run $[exe_dir]/RNA_calling.ipynb rnaseqc_call \
            --cwd $[_output[0]:d]/ \
            --samples  $[_input_inv['Sample_fastq_list']] \
            --data-dir $[_input_inv['Data_dir']] \
            --fasta_with_adapters_etc $[fasta_with_adapters_etc]  \
            --STAR-index $[_input[1]:d] \
            --RSEM-index $[_input[0]:d] \
            --container $[container_rnaquant] \ 
            --mem 40G \
            $["--paired_end" if paired_end else " "] $[submission if yml.is_file() else "" ]

Multi-sample RNA-seq QC (For every set of samples)#

We need to use a different MWE data-set that contains multiple samples – here is the Google Drive link.

[QC]
parameter: low_expr_TPM = 0.1
parameter: low_expr_TPM_percent = 0.2
parameter: RLEFilterPercent = 0.05
parameter: DSFilterPercent = 0.05
parameter: topk_genes = 100
parameter: cluster_percent = 0.6
parameter: pvalues_cut = 0.05
parameter: cluster_level = 5
depends: output_from("RSEM")
input: output_from("rnaseqc")
output: f'{cwd}/{_input[0]:bnnn}.low_expression_filtered.outlier_removed.tpm.gct.gz', f'{cwd}/{_input[0]:bnnn}.low_expression_filtered.outlier_removed.geneCount.gct.gz'
report:  expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    sos run $[exe_dir]/bulk_expression_QC.ipynb qc \
        --cwd $[_output[0]:d] \
        --tpm-gct $[_input[0]] \
        --counts-gct  $[_input[1]] \
        --low_expr_TPM $[low_expr_TPM]  \
        --low_expr_TPM_percent $[low_expr_TPM_percent] \
        --RLEFilterPercent $[RLEFilterPercent] \
        --DSFilterPercent $[DSFilterPercent] \
        --topk_genes $[topk_genes] \
        --cluster_percent $[cluster_percent] \
        --pvalues_cut $[pvalues_cut] \
        --cluster_level $[cluster_level] \
        --container $[container_rnaquant] $[submission if yml.is_file() else "" ]

Multi-sample read count normalization (For every set of samples)#

[normalization]
parameter: sample_participant_lookup = path
parameter: tpm_threshold = 0.1
parameter: count_threshold = 6
parameter: sample_frac_threshold = 0.2
parameter: normalization_method = 'tmm'
input: output_from("QC"),output_from("gene_annotation")
output: f'{cwd:a}/{_input[0]:bnnn}.{normalization_method}.expression.bed.gz'
report:  expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    sos run $[exe_dir]/bulk_expression_normalization.ipynb normalize \
        --cwd $[_output:d] \
        --tpm-gct $[_input[0]] \
        --counts-gct $[_input[1]] \
        --annotation-gtf  $[_input[-1]]  \
        --sample-participant-lookup $[sample_participant_lookup] \
        --container $[container_rnaquant]  \
        --tpm_threshold $[tpm_threshold] \
        --count_threshold $[count_threshold]  \
        --sample_frac_threshold $[sample_frac_threshold] \
        --normalization_method $[normalization_method]  $[submission if yml.is_file() else "" ]