Quantifying expression from RNA-seq data

Description

Our pipeline follows the GTEx pipeline from the GTEx/TOPMed project for the quantification of expression from RNA-seq data. Either paired-end or single-end fastq.gz files may be used as the initial input. Several read-strandedness options are supported, namely rf, fr, unstranded or strand_missing, based on the library types from Signal et al [cf. Signal et al (2022)]. Strand-detection steps are included in this pipeline to automatically detect the strand of the input fastq files by leveraging the gene count table output by STAR. Read-length information is required; samples whose read length is given as zero fall back to a default of 100.

We recommend using fastqc to quality control reads, followed by the removal of adaptors with fastp if necessary. After quality control has been conducted, STAR may be used to align reads to the reference genome. We combine this mapping step with an additional quality control step using Picard to mark duplicate reads and collect other metrics.

Gene-level RNA expression is called with RNA-SeQC v2.4.2 using a reference gtf that has been collapsed to contain genes instead of gene transcripts [cf. DeLuca et al., Bioinformatics, 2012]. Reads are included only if they are uniquely mapped (mapping quality of 255 for STAR BAMs), aligned in proper pairs, have an alignment distance of at most six (i.e. alignments must not contain more than six non-reference bases), and are fully contained within exon boundaries. Reads overlapping introns are not included. Exon-level read counts are also called by RNA-SeQC. If a read overlaps multiple exons, then a fractional value equal to the portion of the read contained within each exon is allotted. We call transcript-level RNA expression using RSEM v1.3.0 with a transcript-level reference gtf.
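
To make these inclusion criteria concrete, here is a minimal pysam sketch of the read-level filters described above (an illustration, not RNA-SeQC's implementation; the BAM path and contig are placeholders, and the exon-containment test is omitted):

import pysam

def passes_expression_filters(read, max_nm=6, unique_mapq=255):
    # Uniquely mapped reads carry a mapping quality of 255 in STAR BAMs
    if read.is_unmapped or read.is_secondary or read.is_supplementary or read.is_duplicate:
        return False
    if read.mapping_quality != unique_mapq:
        return False
    # Reads must be aligned in proper pairs
    if not read.is_proper_pair:
        return False
    # Alignment distance (NM tag): at most six non-reference bases
    if read.has_tag("NM") and read.get_tag("NM") > max_nm:
        return False
    return True

with pysam.AlignmentFile("sample.Aligned.sortedByCoord.out.bam", "rb") as bam:  # placeholder path
    kept = sum(passes_expression_filters(r) for r in bam.fetch("chr1"))
    print(f"reads passing filters on chr1: {kept}")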

Input

A tab-delimited file containing sample read information, with one line per sample. This file should include a header for the following columns (an example follows the list):

  1. ID - sample ID (required)

  2. fq1 - name of the fastq file (required)

  3. fq2 - name of the second fastq file of the sample if using paired end data (required only if using paired end data)

  4. strand - the strandedness of the sample (optional)
    Options include rf, fr, unstranded or strand_missing, based on library types from Signal et al [cf. Signal et al. 2022]. Strand detection steps are included in this pipeline to automatically detect the strand of the input fastq file by leveraging the gene count table output from STAR.

  5. read_length - the read length of the sample’s reads (optional)
    If this is unknown, enter 0 and the pipeline will use a default read length of 100. This default can be changed using the --sjdbOverhang parameter. If none of the samples have read-length information, leave out the read_length column entirely so that the pipeline uses the default length for each sample.
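
For illustration, a hypothetical paired-end sample list (tab-delimited; all file names below are placeholders) might look like:

ID    fq1    fq2    strand    read_length
sample_A    sample_A.r1.fastq.gz    sample_A.r2.fastq.gz    rf    100
sample_B    sample_B.r1.fastq.gz    sample_B.r2.fastq.gz    strand_missing    0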

Output

Gene and transcript expression matrices, in addition to other intermediate files (see the Command interface section below).

Minimal Working Example Steps

i. Perform data quality summary via fastqc

Timing: <4 min

Generate fastqc report.

FIXME


!sos run RNA_calling.ipynb fastqc \
    --cwd ../../output_test \
    --sample-list ../../PCC_sample_list_subset \
    --data-dir /restricted/projectnb/amp-ad/ROSMAP_PCC_AC/PCC/ \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    -c ../csg.yml  -q neurology
INFO: Input samples are paired-end sequences.
INFO: Running fastqc: 
INFO: t6c9b3702399ca271 submitted to neurology with job id Your job 2473763 ("job_t6c9b3702399ca271") has been submitted
INFO: t4e4ffcdfae79f275 submitted to neurology with job id Your job 2473764 ("job_t4e4ffcdfae79f275") has been submitted
INFO: t002fa5ea2fc74efe submitted to neurology with job id Your job 2473765 ("job_t002fa5ea2fc74efe") has been submitted
INFO: tf93ce24f1580b2b3 submitted to neurology with job id Your job 2473766 ("job_tf93ce24f1580b2b3") has been submitted
INFO: t386bb7e2e8436820 submitted to neurology with job id Your job 2473767 ("job_t386bb7e2e8436820") has been submitted
INFO: t54ae889b31a122b1 submitted to neurology with job id Your job 2473768 ("job_t54ae889b31a122b1") has been submitted
INFO: t3f1ff409f6f0c26c submitted to neurology with job id Your job 2473769 ("job_t3f1ff409f6f0c26c") has been submitted
INFO: t7750753938a38086 submitted to neurology with job id Your job 2473770 ("job_t7750753938a38086") has been submitted
INFO: t1a1d8d75185936a3 submitted to neurology with job id Your job 2473771 ("job_t1a1d8d75185936a3") has been submitted
INFO: t7d74fb9d4370b2f8 submitted to neurology with job id Your job 2473772 ("job_t7d74fb9d4370b2f8") has been submitted
INFO: tf25a5c96c0afea02 submitted to neurology with job id Your job 2473773 ("job_tf25a5c96c0afea02") has been submitted
INFO: t50f350a2d0a9b770 submitted to neurology with job id Your job 2473774 ("job_t50f350a2d0a9b770") has been submitted
INFO: tcc8278b405e89bca submitted to neurology with job id Your job 2473775 ("job_tcc8278b405e89bca") has been submitted
INFO: t62c5fa6ee2f7c7dc submitted to neurology with job id Your job 2473776 ("job_t62c5fa6ee2f7c7dc") has been submitted
INFO: t8ad9981a328c22ce submitted to neurology with job id Your job 2473777 ("job_t8ad9981a328c22ce") has been submitted
INFO: t365ddb198dd08add submitted to neurology with job id Your job 2473778 ("job_t365ddb198dd08add") has been submitted
INFO: t1d88d659d66b24f7 submitted to neurology with job id Your job 2473779 ("job_t1d88d659d66b24f7") has been submitted
INFO: t2d89152690c6fdd0 submitted to neurology with job id Your job 2473780 ("job_t2d89152690c6fdd0") has been submitted
INFO: t03056c7c071f8774 submitted to neurology with job id Your job 2473781 ("job_t03056c7c071f8774") has been submitted
INFO: t3be3c3d82d746be4 submitted to neurology with job id Your job 2473782 ("job_t3be3c3d82d746be4") has been submitted
INFO: Waiting for the completion of 20 tasks.
INFO: Waiting for the completion of 10 tasks.
INFO: Waiting for the completion of 1 task.
ERROR: [fastqc]: [fastqc]: Output target /restricted/projectnb/xqtl/xqtl_protocol/output_test/1000-PCC.bam.1.fq_fastqc.html does not exist after the completion of step fastqc

ii. Cut adaptor (Optional)

Timing: ~10 min

Trim adaptors with fastp. A new sample list will be generated with the trimmed fastq files.

!sos run RNA_calling.ipynb fastp_trim_adaptor \
    --cwd ../../output_test \
    --sample-list ../../PCC_sample_list_subset \
    --data-dir /restricted/projectnb/amp-ad/ROSMAP_PCC_AC/PCC/ \
    --STAR-index ../../reference_data/STAR_Index/ \
    --gtf ../../reference_data/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta ../../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    -c ../csg.yml  -q neurology
INFO: Input samples are paired-end sequences.
INFO: Running fastp_trim_adaptor_1: 
INFO: tf7f9ca7a0415220a submitted to neurology with job id Your job 2473815 ("job_tf7f9ca7a0415220a") has been submitted
INFO: t3a0f6bb6054347f9 submitted to neurology with job id Your job 2473816 ("job_t3a0f6bb6054347f9") has been submitted
INFO: t11993d4365affa80 submitted to neurology with job id Your job 2473817 ("job_t11993d4365affa80") has been submitted
INFO: t81717ecba3dd47ef submitted to neurology with job id Your job 2473818 ("job_t81717ecba3dd47ef") has been submitted
INFO: t67c489f5d6c4c96d submitted to neurology with job id Your job 2473820 ("job_t67c489f5d6c4c96d") has been submitted
INFO: tbd40a790cf6fff58 submitted to neurology with job id Your job 2473821 ("job_tbd40a790cf6fff58") has been submitted
INFO: t2f7a8204d07a3580 submitted to neurology with job id Your job 2473822 ("job_t2f7a8204d07a3580") has been submitted
INFO: t5053d7957ff62683 submitted to neurology with job id Your job 2473823 ("job_t5053d7957ff62683") has been submitted
INFO: t3ca261894a797116 submitted to neurology with job id Your job 2473824 ("job_t3ca261894a797116") has been submitted
INFO: tedb41e7f30829651 submitted to neurology with job id Your job 2473825 ("job_tedb41e7f30829651") has been submitted
INFO: Waiting for the completion of 10 tasks.
INFO: Waiting for the completion of 9 tasks.
INFO: Waiting for the completion of 1 task.
INFO: fastp_trim_adaptor_1 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/1000-PCC.bam.1.fq.trimmed.fq.gz /restricted/projectnb/xqtl/xqtl_protocol/output_test/1000-PCC.bam.2.fq.trimmed.fq.gz... (20 items in 10 groups)
INFO: Running fastp_trim_adaptor_2: 
INFO: fastp_trim_adaptor_2 is completed.
INFO: fastp_trim_adaptor_2 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/../../PCC_sample_list_subset.trimmed.txt
INFO: Workflow fastp_trim_adaptor (ID=w22250ab85e96dedb) is executed successfully with 2 completed steps, 11 completed substeps and 10 completed tasks.

iii. Read alignment via STAR and QC via Picard

Timing <2 hours

Align the reads with STAR and generate the bam_list recipe for the downstream molecular phenotype count matrices. This step also checks the strandedness of the data and removes duplicates with Picard. If --varVCFfile is provided, vW tags will be added to the alignments, and filtering based on these tags can then be performed: the --wasp option applies WASP correction, while --unique enables unique-read filtering. Using both options applies both filters, while omitting them means no filtering is applied. Note that filtering is not required when quantifying expression for eQTLs. NB: if you provide --varVCFfile but neither --unique nor --wasp, no filtering is performed, although the names of the output bam files will still contain _wasp_qc. This tagging is required if aligning reads to generate splicing data later in the pipeline.

# STAR alignment without adding tags for filtering (and hence without any filters); execution will also be faster than STAR alignment with filters
!sos run RNA_calling.ipynb STAR_align \
    --cwd ../../output_test/star_output \
    --sample_list ../../PCC_sample_list_subset \
    --data-dir /restricted/projectnb/amp-ad/ROSMAP_PCC_AC/PCC/ \
    --STAR-index ../../reference_data/STAR_Index/ \
    --gtf ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    --reference-fasta ../../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --sample-name 1000-PCC.bam \
    -s build -c ../csg.yml  -q neurology 
INFO: Input samples are paired-end sequences.
INFO: Running STAR_align_1: 
Insufficient memory for STAR, changing to 40G
Using read length specified in the sample list
INFO: tff65c4e72cd9bf6b restart from status submitted
INFO: t4b927e18be9c4b88 restart from status submitted
INFO: tff65c4e72cd9bf6b submitted to neurology with job id Your job 2399515 ("job_tff65c4e72cd9bf6b") has been submitted
INFO: tcaa28c0bb65f9149 restart from status submitted
INFO: tfda1f7e13e034e17 restart from status submitted
INFO: t4b927e18be9c4b88 submitted to neurology with job id Your job 2399516 ("job_t4b927e18be9c4b88") has been submitted
INFO: te8abdb425a2403aa restart from status submitted
INFO: tec167ba0767b70e7 restart from status submitted
INFO: tcaa28c0bb65f9149 submitted to neurology with job id Your job 2399517 ("job_tcaa28c0bb65f9149") has been submitted
INFO: te2655672abbc9b17 restart from status submitted
INFO: t87915c95c4041d60 restart from status submitted
INFO: tfda1f7e13e034e17 submitted to neurology with job id Your job 2399518 ("job_tfda1f7e13e034e17") has been submitted
INFO: te07ad3a636b7b1dd restart from status submitted
INFO: te8abdb425a2403aa submitted to neurology with job id Your job 2399519 ("job_te8abdb425a2403aa") has been submitted
INFO: tdf4f824b17a8a5dc restart from status submitted
INFO: tec167ba0767b70e7 submitted to neurology with job id Your job 2399520 ("job_tec167ba0767b70e7") has been submitted
INFO: te2655672abbc9b17 submitted to neurology with job id Your job 2399521 ("job_te2655672abbc9b17") has been submitted
INFO: t87915c95c4041d60 submitted to neurology with job id Your job 2399522 ("job_t87915c95c4041d60") has been submitted
INFO: te07ad3a636b7b1dd submitted to neurology with job id Your job 2399523 ("job_te07ad3a636b7b1dd") has been submitted
INFO: tdf4f824b17a8a5dc submitted to neurology with job id Your job 2399524 ("job_tdf4f824b17a8a5dc") has been submitted
INFO: Waiting for the completion of 10 tasks.
INFO: Waiting for the completion of 9 tasks.
INFO: Waiting for the completion of 8 tasks.
INFO: Waiting for the completion of 7 tasks.
INFO: Waiting for the completion of 6 tasks.
INFO: Waiting for the completion of 4 tasks.
INFO: Waiting for the completion of 3 tasks.
INFO: STAR_align_1 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/1000-PCC.bam.Aligned.sortedByCoord.out.bam /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/1000-PCC.bam.Aligned.toTranscriptome.out.bam... (20 items in 10 groups)
INFO: Running strand_detected_1: 
for sample 1000-PCC.bam
Counts for the 2nd read strand aligned with RNA is 0.9945125919850982, > 90% of aligned count
Data is likely RF/fr-firststrand
INFO: strand_detected_1 (index=0) is completed.
for sample 1001-PCC.bam
Counts for the 2nd read strand aligned with RNA is 0.9946500588593552, > 90% of aligned count
Data is likely RF/fr-firststrand
INFO: strand_detected_1 (index=1) is completed.
for sample 1002-PCC.bam
Counts for the 2nd read strand aligned with RNA is 0.9943981411103944, > 90% of aligned count
Data is likely RF/fr-firststrand
INFO: strand_detected_1 (index=2) is completed.
for sample 1003-PCC.bam
Counts for the 2nd read strand aligned with RNA is 0.99371513806055, > 90% of aligned count
Data is likely RF/fr-firststrand
INFO: strand_detected_1 (index=3) is completed.
for sample 1004-PCC.bam
Counts for the 2nd read strand aligned with RNA is 0.9947175340513231, > 90% of aligned count
Data is likely RF/fr-firststrand
INFO: strand_detected_1 (index=4) is completed.
for sample 1005-PCC.bam
Counts for the 2nd read strand aligned with RNA is 0.9943227514543671, > 90% of aligned count
Data is likely RF/fr-firststrand
INFO: strand_detected_1 (index=5) is completed.
for sample 1006-PCC.bam
Counts for the 2nd read strand aligned with RNA is 0.9951195009722332, > 90% of aligned count
Data is likely RF/fr-firststrand
INFO: strand_detected_1 (index=6) is completed.
for sample 1009-PCC.bam
Counts for the 2nd read strand aligned with RNA is 0.9941384572915384, > 90% of aligned count
Data is likely RF/fr-firststrand
INFO: strand_detected_1 (index=7) is completed.
for sample 1010-PCC.bam
Counts for the 2nd read strand aligned with RNA is 0.9905496531374057, > 90% of aligned count
Data is likely RF/fr-firststrand
INFO: strand_detected_1 (index=8) is completed.
for sample 1011-PCC.bam
Counts for the 2nd read strand aligned with RNA is 0.994062350839579, > 90% of aligned count
Data is likely RF/fr-firststrand
INFO: strand_detected_1 (index=9) is completed.
INFO: Running strand_detected_2: 
Using strand specified in the input samples list ['rf', 'rf', 'rf', 'rf', 'rf', 'rf', 'rf', 'rf', 'rf', 'rf'], replacing strand_missing with detected strand
INFO: strand_detected_2 is completed.
INFO: Running STAR_align_2: 
INFO: t6dcd2e18a3a45ff6 re-execute completed
INFO: t64cea24f76f181ec re-execute completed
INFO: t6dcd2e18a3a45ff6 submitted to neurology with job id Your job 2399713 ("job_t6dcd2e18a3a45ff6") has been submitted
INFO: td34a8dded96d0613 re-execute completed
INFO: t2b1cc10183d526af re-execute completed
INFO: t64cea24f76f181ec submitted to neurology with job id Your job 2399714 ("job_t64cea24f76f181ec") has been submitted
INFO: t448a4eafbcb0db0e re-execute completed
INFO: td34a8dded96d0613 submitted to neurology with job id Your job 2399715 ("job_td34a8dded96d0613") has been submitted
INFO: tfeedd0c0bb469db9 re-execute completed
INFO: td2521716031d133c re-execute completed
INFO: t2b1cc10183d526af submitted to neurology with job id Your job 2399716 ("job_t2b1cc10183d526af") has been submitted
INFO: t052b51dab89cb9f6 re-execute completed
INFO: taa6cb5680f559bc8 re-execute completed
INFO: t448a4eafbcb0db0e submitted to neurology with job id Your job 2399717 ("job_t448a4eafbcb0db0e") has been submitted
INFO: tc238906499d07145 re-execute completed
INFO: tfeedd0c0bb469db9 submitted to neurology with job id Your job 2399718 ("job_tfeedd0c0bb469db9") has been submitted
INFO: td2521716031d133c submitted to neurology with job id Your job 2399719 ("job_td2521716031d133c") has been submitted
INFO: t052b51dab89cb9f6 submitted to neurology with job id Your job 2399720 ("job_t052b51dab89cb9f6") has been submitted
INFO: taa6cb5680f559bc8 submitted to neurology with job id Your job 2399721 ("job_taa6cb5680f559bc8") has been submitted
INFO: tc238906499d07145 submitted to neurology with job id Your job 2399722 ("job_tc238906499d07145") has been submitted
INFO: Waiting for the completion of 10 tasks.
INFO: Waiting for the completion of 9 tasks.
INFO: Waiting for the completion of 8 tasks.
INFO: Waiting for the completion of 6 tasks.
INFO: Waiting for the completion of 4 tasks.
INFO: Waiting for the completion of 3 tasks.
INFO: Waiting for the completion of 1 task.
INFO: STAR_align_2 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/1000-PCC.bam.alignment_summary_metrics /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/1000-PCC.bam.rna_metrics... (40 items in 10 groups)
INFO: Running STAR_align_3: 
INFO: t4431c5c078751a48 submitted to neurology with job id Your job 2400115 ("job_t4431c5c078751a48") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: STAR_align_3 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/PCC_sample_list_subset_bam_list
INFO: Workflow STAR_align (ID=w9e2ac856c879aa3f) is executed successfully with 5 completed steps, 32 completed substeps and 21 completed tasks.
# STAR alignment with only WASP correction to reduce allele-specific mapping bias:
!sos run RNA_calling.ipynb STAR_align \
    --cwd ../../output_test/star_output/wasp \
    --sample_list ../../PCC_sample_list_subset \
    --data-dir /restricted/projectnb/amp-ad/ROSMAP_PCC_AC/PCC/ \
    --STAR-index ../../reference_data/STAR_Index/ \
    --gtf ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    --reference-fasta ../../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --varVCFfile ../../reference_data/ZOD14598_AD_GRM_WGS_2021-04-29_all.recalibrated_variants.leftnorm.filtered.AF.WASP.vcf \
    --sample-name 1000-PCC.bam \
    --wasp yes \
    -s build -c ../csg.yml  -q neurology 
# STAR alignment with only unique-read filtering to keep uniquely mapped reads:
!sos run RNA_calling.ipynb STAR_align \
    --cwd ../../output_test/star_output/unique \
    --sample_list ../../PCC_sample_list_subset \
    --data-dir /restricted/projectnb/amp-ad/ROSMAP_PCC_AC/PCC/ \
    --STAR-index ../../reference_data/STAR_Index/ \
    --gtf ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    --reference-fasta ../../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --varVCFfile ../../reference_data/ZOD14598_AD_GRM_WGS_2021-04-29_all.recalibrated_variants.leftnorm.filtered.AF.WASP.vcf \
    --sample-name 1000-PCC.bam \
    --unique yes \
    -s build -c ../csg.yml  -q neurology 
# STAR alignment with both WASP correction and unique filtering:
!sos run RNA_calling.ipynb STAR_align \
    --cwd ../../output_test/star_output/unique_wasp \
    --sample_list ../../PCC_sample_list_subset \
    --data-dir /restricted/projectnb/amp-ad/ROSMAP_PCC_AC/PCC/ \
    --STAR-index ../../reference_data/STAR_Index/ \
    --gtf ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    --reference-fasta ../../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --varVCFfile ../../reference_data/ZOD14598_AD_GRM_WGS_2021-04-29_all.recalibrated_variants.leftnorm.filtered.AF.WASP.vcf \
    --sample-name 1000-PCC.bam \
    --unique yes \
    --wasp yes \
    -s build -c ../csg.yml  -q neurology 

iv. Call gene-level RNA expression via rnaseqc

Timing <30 min

Call gene-level RNA expression using rnaseqc and run multiqc. The gtf file must include gene-level data instead of transcript-level data. Add --varVCFfile if tags were added during STAR_align. Change the --cwd option to match the filter choice used during alignment.

!sos run RNA_calling.ipynb rnaseqc_call \
    --cwd ../../output_test/star_output \
    --sample-list ../../PCC_sample_list_subset \
    --data-dir /restricted/projectnb/amp-ad/ROSMAP_PCC_AC/PCC/ \
    --STAR-index ../../reference_data/STAR_Index/ \
    --gtf ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    --reference-fasta ../../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --bam-list ../../output_test/star_output/PCC_sample_list_subset_bam_list \
    -s build  -c ../csg.yml  -q neurology 
INFO: Input samples are paired-end sequences.
INFO: Running rnaseqc_call_1: 
INFO: Step rnaseqc_call_1 (index=0) is ignored with signature constructed
INFO: Step rnaseqc_call_1 (index=1) is ignored with signature constructed
INFO: Step rnaseqc_call_1 (index=2) is ignored with signature constructed
INFO: Step rnaseqc_call_1 (index=3) is ignored with signature constructed
INFO: Step rnaseqc_call_1 (index=4) is ignored with signature constructed
INFO: Step rnaseqc_call_1 (index=5) is ignored with signature constructed
INFO: Step rnaseqc_call_1 (index=6) is ignored with signature constructed
INFO: Step rnaseqc_call_1 (index=7) is ignored with signature constructed
INFO: Step rnaseqc_call_1 (index=8) is ignored with signature constructed
INFO: Step rnaseqc_call_1 (index=9) is ignored with signature constructed
INFO: rnaseqc_call_1 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/1000-PCC.bam.rnaseqc.gene_tpm.gct.gz /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/1000-PCC.bam.rnaseqc.gene_reads.gct.gz... (40 items in 10 groups)
INFO: Running rnaseqc_call_2: 
INFO: Step rnaseqc_call_2 (index=0) is ignored with signature constructed
INFO: Step rnaseqc_call_2 (index=0) is ignored with signature constructed
INFO: rnaseqc_call_2 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/PCC_sample_list_subset.rnaseqc.gene_tpm.gct.gz /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/PCC_sample_list_subset.rnaseqc.gene_readsCount.gct.gz... (4 items)
INFO: Running rnaseqc_call_3: 
INFO: Step rnaseqc_call_3 (index=0) is ignored with signature constructed
INFO: Step rnaseqc_call_3 (index=0) is ignored with signature constructed
INFO: rnaseqc_call_3 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/PCC_sample_list_subset.multiqc_report.html
INFO: Running rnaseqc_call_4: 
INFO: t1555865b4d80b5cd submitted to neurology with job id Your job 2474362 ("job_t1555865b4d80b5cd") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: rnaseqc_call_4 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/PCC_sample_list_subset.picard.aggregated_quality.metrics.tsv
INFO: Workflow rnaseqc_call (ID=w79a27a4da28c0267) is executed successfully with 1 completed step, 3 ignored steps, 12 ignored substeps and 11 completed tasks.
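
The aggregated matrices above are written as GCT files, i.e. a two-line header (version and dimensions) followed by a tab-delimited table. A minimal Python sketch for inspecting the TPM matrix (the file name is a placeholder) would be:

import pandas as pd

# GCT layout: line 1 is the version, line 2 the matrix dimensions, then a
# tab-delimited table with Name, Description and one column per sample
tpm = pd.read_csv("PCC_sample_list_subset.rnaseqc.gene_tpm.gct.gz",
                  sep="\t", skiprows=2, index_col="Name")
print(tpm.shape)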

v. Call transcript-level RNA expression via RSEM

Timing <X hours

Call transcript-level RNA expression using RSEM and run multiqc. The gtf file used as input should match the one used to generate the RSEM index and therefore contain transcript-level, not gene-level, information. Add --varVCFfile if tags were added during STAR_align. Change the --cwd option to match the filter choice used during alignment.

FIXME

!sos run RNA_calling.ipynb rsem_call \
    --cwd ../../output_test/star_output \
    --sample-list ../../PCC_sample_list_subset \
    --data-dir /restricted/projectnb/amp-ad/ROSMAP_PCC_AC/PCC/ \
    --STAR-index ../../reference_data/STAR_Index/ \
    --gtf ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    --reference-fasta ../../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --bam-list ../../output_test/star_output/PCC_sample_list_subset_bam_list \
    --RSEM-index ../../reference_data/RSEM_Index \
    -s build  -c ../csg.yml  -q neurology 
INFO: Input samples are paired-end sequences.
INFO: Running rsem_call_1: 
INFO: t17683f24f8513b9f submitted to neurology with job id Your job 2474418 ("job_t17683f24f8513b9f") has been submitted
INFO: tfd9bbafcf2595672 submitted to neurology with job id Your job 2474419 ("job_tfd9bbafcf2595672") has been submitted
INFO: t660ceef6ffd66beb submitted to neurology with job id Your job 2474420 ("job_t660ceef6ffd66beb") has been submitted
INFO: t86853a6bef66a16f submitted to neurology with job id Your job 2474421 ("job_t86853a6bef66a16f") has been submitted
INFO: t964856c7b9681030 submitted to neurology with job id Your job 2474422 ("job_t964856c7b9681030") has been submitted
INFO: t2445a0809b0320b4 submitted to neurology with job id Your job 2474423 ("job_t2445a0809b0320b4") has been submitted
INFO: td531b7f2ccc37b56 submitted to neurology with job id Your job 2474424 ("job_td531b7f2ccc37b56") has been submitted
INFO: t5826bd542c40b4aa submitted to neurology with job id Your job 2474425 ("job_t5826bd542c40b4aa") has been submitted
INFO: t5a87f52c0ac43d60 submitted to neurology with job id Your job 2474426 ("job_t5a87f52c0ac43d60") has been submitted
INFO: t1c886f27aef992cb submitted to neurology with job id Your job 2474427 ("job_t1c886f27aef992cb") has been submitted
INFO: Waiting for the completion of 10 tasks.
INFO: Waiting for the completion of 9 tasks.
INFO: Waiting for the completion of 8 tasks.
INFO: Waiting for the completion of 7 tasks.
INFO: Waiting for the completion of 6 tasks.
INFO: Waiting for the completion of 5 tasks.
INFO: Waiting for the completion of 4 tasks.
INFO: Waiting for the completion of 3 tasks.
INFO: Waiting for the completion of 2 tasks.
INFO: Waiting for the completion of 1 task.
INFO: rsem_call_1 output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/1000-PCC.bam.rsem.isoforms.results /restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/1000-PCC.bam.rsem.genes.results... (30 items in 10 groups)
INFO: Running rsem_call_2: 
INFO: t8726dc727d5bd7b2 submitted to neurology with job id Your job 2474550 ("job_t8726dc727d5bd7b2") has been submitted
INFO: Waiting for the completion of 1 task.
ERROR: [rsem_call_2]: [t8726dc727d5bd7b2]: Executing script in Singularity returns an error (exitcode=1, stderr=/restricted/projectnb/xqtl/xqtl_protocol/output_test/star_output/PCC_sample_list_subset.rsem.aggregated_quality.metrics.stderr).
The script has been saved to /usr3/graduate/fgrennjr/.sos/bcd8530c53fd4b87//usr3/graduate/fgrennjr/.sos/bcd8530c53fd4b87.To reproduce the error please run:
singularity exec  /usr3/graduate/fgrennjr/.sos/singularity/library/ghcr.io-cumc-rna_quantification_apptainer-latest.sif micromamba run -a "" -n rna_quantification Rscript /usr3/graduate/fgrennjr/.sos/bcd8530c53fd4b87/singularity_run_768303.R
[rsem_call]: Exits with 2 pending steps (rsem_call_3, rsem_call_4)
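
The per-sample tables written by rsem_call_1 are plain tab-delimited files; for example, the isoform-level results can be inspected with a minimal sketch like the following (the file name is a placeholder):

import pandas as pd

# RSEM isoform tables include transcript_id, gene_id, length, effective_length,
# expected_count, TPM, FPKM and IsoPct columns
iso = pd.read_csv("1000-PCC.bam.rsem.isoforms.results", sep="\t")
print(iso[["transcript_id", "gene_id", "expected_count", "TPM"]].head())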

Troubleshooting

Step | Substep | Problem | Possible Reason | Solution

Command interface

!sos run RNA_calling.ipynb -h
usage: sos run RNA_calling.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  bam_to_fastq
  fastqc
  fastp_trim_adaptor
  trimmomatic_trim_adaptor
  STAR_align
  strand_detected
  picard_qc
  rnaseqc_call
  rsem_call
  filter_reads

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files.
  --sample-list VAL (as path, required)
                        Sample meta data list
  --sample-name  (as list)
                        Sample names to analyze
  --data-dir  path(f"{sample_list:d}")

                        Raw data directory, default to the same directory as
                        sample list
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --java-mem 6G
                        Memory for Java virtual machine (`picard`)
  --numThreads 8 (as int)
                        Number of threads
  --container ''
                        Software container option
  --varVCFfile ''
                        VarVCFfile for data preparation for wasp_filtering
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""

  --[no-]uncompressed (default to False)
                        Whether the fasta/fastq file is compressed or not.

Sections
  bam_to_fastq:
  fastqc:
  fastp_trim_adaptor_1:
    Workflow Options:
      --window-size 4 (as int)
                        sliding window setting
      --required-quality 20 (as int)
      --leading 20 (as int)
                        the mean quality requirement option for cut_front
      --trailing 20 (as int)
                        the mean quality requirement option for cut_tail
      --min-len 15 (as int)
                        reads shorter than length_required will be discarded
      --fasta-with-adapters-etc . (as path)
                        Path to the reference adaptors
  fastp_trim_adaptor_2:
  trimmomatic_trim_adaptor:
    Workflow Options:
      --fasta-with-adapters-etc . (as path)
                        Illumina clip setting Path to the reference adaptors
      --seed-mismatches 2 (as int)
      --palindrome-clip-threshold 30 (as int)
      --simple-clip-threshold 10 (as int)
      --window-size 4 (as int)
                        sliding window setting
      --required-quality 20 (as int)
      --leading 3 (as int)
                        Other settings
      --trailing 3 (as int)
      --min-len 50 (as int)
  STAR_align_1:
    Workflow Options:
      --gtf VAL (as path, required)
                        Reference gene model
      --STAR-index VAL (as path, required)
                        STAR indexing file
      --outFilterMultimapNmax 20 (as int)
      --alignSJoverhangMin 8 (as int)
      --alignSJDBoverhangMin 1 (as int)
      --outFilterMismatchNmax 999 (as int)
      --outFilterMismatchNoverLmax 0.1 (as float)
      --alignIntronMin 20 (as int)
      --alignIntronMax 1000000 (as int)
      --alignMatesGapMax 1000000 (as int)
      --outFilterType BySJout
      --outFilterScoreMinOverLread 0.33 (as float)
      --outFilterMatchNminOverLread 0.33 (as float)
      --limitSjdbInsertNsj 1200000 (as int)
      --outSAMstrandField intronMotif
      --outFilterIntronMotifs None
      --alignSoftClipAtReferenceEnds Yes
      --quantMode TranscriptomeSAM GeneCounts (as list)
      --outSAMattrRGline ID:rg1 SM:sm1 (as list)
      --outSAMattributes NH HI AS nM NM ch (as list)
      --chimSegmentMin 15 (as int)
      --chimJunctionOverhangMin 15 (as int)
      --chimOutType Junctions WithinBAM SoftClip (as list)
      --chimMainSegmentMultNmax 1 (as int)
      --sjdbOverhang 100 (as int)
      --mapping-quality 255 (as int)
  strand_detected_1:
  strand_detected_2:
    Workflow Options:
      --strand ''
  picard_qc, STAR_align_2:
    Workflow Options:
      --gtf VAL (as path, required)
                        Reference gene model
      --ref-flat VAL (as path, required)
                        Path to flat reference file, for computing QC metric
      --reference-fasta VAL (as path, required)
                        The fasta reference file used to generate star index
      --optical-distance 100 (as int)
                        For the patterned flowcell models (HiSeq X), change to
                        2500
      --[no-]zap-raw-bam (default to False)
  STAR_align_3:
  rnaseqc_call_1:
    Workflow Options:
      --bam-list VAL (as path, required)
      --gtf VAL (as path, required)
                        Reference gene model
      --detection-threshold 5 (as int)
      --mapping-quality 255 (as int)
  rnaseqc_call_2:
    Workflow Options:
      --bam-list VAL (as path, required)
  rsem_call_1:
    Workflow Options:
      --bam-list VAL (as path, required)
      --RSEM-index VAL (as path, required)
      --max-frag-len 1000 (as int)
      --[no-]estimate-rspd (default to True)
  rsem_call_2:
    Workflow Options:
      --bam-list VAL (as path, required)
  rsem_call_3, rnaseqc_call_3:
  rsem_call_4, rnaseqc_call_4:
  filter_reads:
    Workflow Options:
      --unique ''
      --wasp ''
      --mapping-quality 255 (as int)

Setup and global parameters

[global]
# The output directory for generated files.
parameter: cwd = path("output")
# Sample meta data list
parameter: sample_list = path
# Sample names to analyze
parameter: sample_name = list() #input should be --sample_name sample1 sample2, if multiple samples
# Raw data directory, default to the same directory as sample list
parameter: data_dir = path(f"{sample_list:d}")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Memory for Java virtual machine (`picard`)
parameter: java_mem = "6G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
# VarVCFfile for data preparation for wasp_filtering
parameter: varVCFfile = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')
## Whether the fasta/fastq file is compressed or not.
parameter: uncompressed = False
import os
import pandas as pd
## FIX: The way to get samples needs to be revamped to 1. accommodate rf/fr as a column 2. accommodate single-end reads (only 2 fq/samples)
sample_inv = pd.read_csv(sample_list,sep = "\t")

# Align specific samples only. To ensure correct grouping in subsequent steps, perform this filtering prior to extracting strand, read_length, and sample_id information
if len(sample_name)>0:
    print("Aligning samples", sample_name, 'only...')
    sample_inv = sample_inv[sample_inv['ID'].isin(sample_name)] 
    
## Extract strand information if the user has specified the strand
strand_inv = []

if "strand" in sample_inv.columns:
    strand_inv = sample_inv.strand.values.tolist()
    sample_inv = sample_inv.drop("strand" , axis = 1)
    stop_if(not all([x in ["fr", "rf", "unstranded","strand_missing"] for x in strand_inv ]), msg = "strand columns should only include ``fr``, ``rf``, ``strand_missing`` or ``unstranded``")
            
## Extract read_length if the user has specified read_length
read_length = [0] * len(sample_inv.index)
if "read_length" in sample_inv.columns:
    read_length = sample_inv.read_length.values.tolist()
    sample_inv = sample_inv.drop("read_length" , axis = 1)
    

## Extract sample_id
sample_inv_list = sample_inv.values.tolist()
sample_id = [x[0] for x in sample_inv_list]

    
## Get the file name for single/paired end data
file_inv = [x[1:] for x in sample_inv_list]
file_inv = [item for sublist in file_inv for item in sublist]

raw_reads = [f'{data_dir}/{x}' for x in file_inv]


for y in raw_reads:
        if not os.path.isfile(y):
            raise ValueError(f"File {y} does not exist")

if len(raw_reads) != len(set(raw_reads)):
        raise ValueError("Duplicated files are found (but should not be allowed) in sample file list")

# Is the RNA-seq data paired-end?
is_paired_end = 0 if len(raw_reads) == len(sample_id) else 1 
from sos.utils import env
env.logger.info(f'Input samples are {"paired-end" if is_paired_end else "single-end"} sequences.')

Step 0: Convert BAM back to fastq if necessary

[bam_to_fastq]
input: raw_reads, group_by = 1
output: f'{cwd}/{_input:bn}.1.fastq',f'{cwd}/{_input:bn}.2.fastq'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
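    # Sort by read name first: bedtools bamtofastq expects name-sorted input
    # so that read mates are adjacent when writing paired fastq files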
    samtools sort -n ${_input} -o ${_output[0]:nn}.sorted.bam
    bedtools bamtofastq -i ${_output[0]:nn}.sorted.bam -fq ${_output[0]} -fq2 ${_output[1]}

Step 1: QC before alignment

This step utilizes fastqc and generates QC reports in HTML format. Note that for paired-end data, the two reads are QCed separately.

Step Inputs

  • fastq1 and fastq2: paths to the original fastq.gz files.

Step Outputs

  • Two HTML files containing the QC reports (fastqc also writes the matching zip archives)

[fastqc]
input: raw_reads, group_by =  1
output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc.zip' 
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
    fastqc ${_input} -o ${_output[0]:d}

Step 2: Remove adaptor through fastp

Documentation: fastp

We use fastp in place of Trimmomatic because of fastp's ability to detect adapters directly from the reads. fastp is a C++ command-line tool published in September 2018. It uses the following algorithm to detect the adaptors:

The adapter-sequence detection algorithm is based on two assumptions: the first is that only one adapter exists in the data; the second is that adapter sequences exist only in the read tails. These two assumptions are valid for major next-generation sequencers like Illumina HiSeq series, NextSeq series and NovaSeq series. We compute the k-mer (k = 10) of first N reads (N = 1 M). From this k-mer, the sequences with high occurrence frequencies (>0.0001) are considered as adapter seeds. Low-complexity sequences are removed because they are usually caused by sequencing artifacts. The adapter seeds are sorted by its occurrence frequencies. A tree-based algorithm is applied to extend the adapter seeds to find the real complete adapter
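
As a rough illustration of this seeding idea (this is not fastp's implementation), counting high-frequency k-mers from read tails in Python might look like:

from collections import Counter

def adapter_seeds(reads, k=10, min_freq=1e-4):
    # Count k-mers in read tails; high-frequency k-mers become adapter seeds
    counts = Counter()
    total = 0
    for seq in reads:            # `reads`: any iterable of read sequences
        tail = seq[-2 * k:]      # adapters are assumed to sit in read tails
        for i in range(len(tail) - k + 1):
            counts[tail[i:i + k]] += 1
            total += 1
    return [kmer for kmer, c in counts.items() if c / total > min_freq]

print(adapter_seeds(["ACGTACGTACGTAGATCGGAAGAGC"] * 100))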

It has been demonstrated that fastp can remove adaptors automatically and completely, and faster than Trimmomatic and cutadapt.

Step Inputs

  • fastq: one set of fq.gz files for each sample (2 files if paired-end, 1 file if single-end)

Step Outputs

  • One set of trimmed fastq.gz files for alignment (2 files if paired-end, 1 file if single-end)

  • Unpaired reads (i.e. where a read survived but its partner did not) are discarded by default, as they are not used in the following steps. This behavior can be changed where needed.

  • One set of reports documenting the quality of the input reads (1 HTML file and 1 JSON file)

Step options

A few options were made customizable so that the fastp step has the same level of flexibility as the Trimmomatic step. They are:

  • min_len: length_required, reads shorter than this will be discarded, default is 15. (int [=15])

  • window_size: cut_window_size, the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])

  • leading/trailing: cut_front_mean_quality/cut_tail_mean_quality, the mean quality requirement options for cut_front/cut_tail, which move a sliding window from the front/tail and drop the bases in the window if its mean quality < threshold. Notice that the quality threshold in fastp (N=20) is considerably higher than that of Trimmomatic (N=3). The fastp default is in line with that of cutadapt.

The default values are set to match the defaults of the fastp software.
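
To make the sliding-window behavior concrete, a simplified sketch of the cut_tail idea (not fastp's actual code) is:

def cut_tail(quals, window_size=4, mean_quality=20):
    # Slide a window in from the 3' end, dropping the last base while the
    # window's mean quality is below the threshold
    end = len(quals)
    while end >= window_size:
        window = quals[end - window_size:end]
        if sum(window) / window_size >= mean_quality:
            break
        end -= 1
    return end  # number of bases kept

print(cut_tail([30, 30, 30, 30, 12, 10, 9, 8]))  # -> 6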

[fastp_trim_adaptor_1]
# sliding window setting
parameter: window_size = 4
parameter: required_quality = 20
# the mean quality requirement option for cut_front
parameter: leading = 20
# the mean quality requirement option for cut_tail
parameter: trailing = 20
# reads shorter than length_required will be discarded
parameter: min_len = 15
# Path to the reference adaptors
parameter: fasta_with_adapters_etc = path(".")
warn_if(fasta_with_adapters_etc.is_file(), msg = "Using the provided adapter fasta; automatic adapter detection for paired-end reads is disabled")

input: raw_reads, group_by = is_paired_end + 1 , group_with = "sample_id"
output: [f'{cwd}/{path(x):bn}.trimmed.fq.gz' for x in _input]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash:  container=container,expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
        fastp -i ${f'{_input[0]} -I {_input[1]}' if is_paired_end else _input} -o ${ f'{_output[0]} -O {_output[1]}' if is_paired_end else _output } \
            ${f'--adapter_fasta {fasta_with_adapters_etc}' if fasta_with_adapters_etc.is_file() else "--detect_adapter_for_pe"}  -V -h ${_output[0]:n}.html -j ${_output[0]:n}.json -w ${numThreads} \
            --length_required ${min_len}  -W ${window_size} -M ${required_quality} -5 -3 --cut_front_mean_quality ${leading} --cut_tail_mean_quality ${trailing}
[fastp_trim_adaptor_2]
input: group_by = "all"
output: f'{cwd}/{sample_list:n}.trimmed.txt'
sample_inv_tmp = pd.read_csv(sample_list,sep = "\t")
import csv
if is_paired_end:
    sample_inv_tmp.fq1 = [f'{x:r}'.replace("'","") for x in _input][::2]
    sample_inv_tmp.fq2 = [f'{x:r}'.replace("'","") for x in _input][1::2]
else:
    sample_inv_tmp.fq1 = [f'{x:r}'.replace("'","") for x in _input]
sample_inv_tmp.to_csv(_output,sep = "\t",index = 0)

Step 2 Alternative: Remove adaptor through Trimmomatic

Documentation: Trimmomatic

We have replaced this step with fastp (see above), which performs better than Trimmomatic at removing adaptors that Trimmomatic cannot detect. fastp can also automatically infer the adapter sequences from the data, and by default no adapter sequence needs to be supplied as input to fastp.

Step Inputs

  • software_dir: directory for the software

  • fasta_with_adapters_etc: filename for the adapter reference file. According to the Trimmomatic documentation,

As a rule of thumb newer libraries will use TruSeq3, but this really depends on your service provider. If you use FASTQC, the “Overrepresented Sequences” report can help indicate which adapter file is best suited for your data. “Illumina Single End” or “Illumina Paired End” sequences indicate single-end or paired-end TruSeq2 libraries, and the appropriate adapter files are TruSeq2-SE.fa and TruSeq2-PE.fa respectively. “TruSeq Universal Adapter” or “TruSeq Adapter, Index …” indicates TruSeq-3 libraries, and the appropriate adapter files are TruSeq3-SE.fa or TruSeq3-PE.fa, for single-end and paired-end data respectively. Adapter sequences for TruSeq2 multiplexed libraries, indicated by “Illumina Multiplexing …”, and the various RNA library preparations are not currently included.

We have a fastqc workflow defined and executed earlier. Users should decide which fasta adapter reference to use based on the fastqc results (or their own knowledge).

Step Outputs

  • Two paired fastq.gz files for alignment

  • Two unpaired fastq.gz files

For single-end data, one input and one output file are specified, plus the processing steps. For paired-end data, two input files are specified, and 4 output files: 2 for the 'paired' output where both reads survived the processing, and 2 for the corresponding 'unpaired' output where a read survived but the partner read did not.

You need to determine from the fastqc results which adapter reference sequence to use, e.g. --fasta_with_adapters_etc TruSeq3-PE.fa. These files can be downloaded from the Trimmomatic GitHub repo.

[trimmomatic_trim_adaptor]
# Illumina clip setting
# Path to the reference adaptors
parameter: fasta_with_adapters_etc = path(".")
parameter: seed_mismatches = 2
parameter: palindrome_clip_threshold = 30
parameter: simple_clip_threshold = 10
# sliding window setting
parameter: window_size = 4
parameter: required_quality = 20
# Other settings
parameter: leading = 3
parameter: trailing = 3
parameter: min_len = 50
input: raw_reads, group_by = is_paired_end + 1 , group_with = "sample_id"
output: ([ f'{cwd}/{_sample_id}_paired_{_input[0]:bn}.gz', f'{cwd}/{_sample_id}_unpaired_{_input[0]:bn}.gz',  f'{cwd}/{_sample_id}_paired_{_input[1]:bn}.gz',f'{cwd}/{_sample_id}_unpaired_{_input[1]:bn}.gz' ] if is_paired_end else f'{cwd}/{_sample_id}_trimmed_{_input:bn}.gz')
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    trimmomatic -Xmx${java_mem} ${"PE" if is_paired_end else "SE"}  -threads ${numThreads} \
        ${_input:r}  ${_output:r} \
        ILLUMINACLIP:${fasta_with_adapters_etc}:${seed_mismatches}:${palindrome_clip_threshold}:${simple_clip_threshold} \
        LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}

Step 3: Alignment through STAR

Documentation: STAR and Script in docker

This step is the main step for STAR alignment.

Originally, STAR alignment was implemented using the run_STAR.py command from GTEx. However, in order to accommodate different read lengths among samples, we reimplemented what was included in that wrapper, including sorting of the STAR unsorted bam using samtools to avoid high memory consumption.

Step Inputs

  • paths to the trimmed fastq.gz files from Step 2, as documented in the new sample list.

  • STAR_index: directory for the STAR alignment index

Step Outputs

  • bam file output ${cwd}/{sample_id}.Aligned.sortedByCoord.out.bam, which will be used in steps 3 and 4

  • bam file output ${cwd}/{sample_id}.Aligned.toTranscriptome.out.bam, which will be used in step 5

NB: "${ }" and "$[ ]" are interchangeable expansion sigils; we just keep each one consistent throughout a given bash block.
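
For example, the following two actions are equivalent, differing only in the expansion sigil:

bash: expand = "${ }"
    echo ${cwd}

bash: expand = "$[ ]"
    echo $[cwd]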

[STAR_align_1]
# Reference gene model
parameter: gtf = path
# STAR indexing file
parameter: STAR_index = path
parameter: outFilterMultimapNmax = 20 
parameter: alignSJoverhangMin = 8 
parameter: alignSJDBoverhangMin = 1 
parameter: outFilterMismatchNmax = 999 
parameter: outFilterMismatchNoverLmax = 0.1 #larger than Yang's group (outFilterMismatchNoverReadLmax 0.04)
parameter: alignIntronMin = 20 
parameter: alignIntronMax = 1000000 
parameter: alignMatesGapMax = 1000000 
parameter: outFilterType =  "BySJout" 
parameter: outFilterScoreMinOverLread = 0.33 
parameter: outFilterMatchNminOverLread = 0.33 
parameter: limitSjdbInsertNsj = 1200000 
parameter: outSAMstrandField = "intronMotif" 
parameter: outFilterIntronMotifs = "None" 
parameter: alignSoftClipAtReferenceEnds = "Yes" 
parameter: quantMode = ["TranscriptomeSAM", "GeneCounts"]
parameter: outSAMattrRGline = ["ID:rg1", "SM:sm1"]
parameter: outSAMattributes = ["NH", "HI", "AS", "nM", "NM", "ch"] #(Yang's group: outSAMattributes NH HI AS nM XS vW). vW is added if varVCFfile is set
parameter: chimSegmentMin = 15 
parameter: chimJunctionOverhangMin = 15 
parameter: chimOutType = ["Junctions", "WithinBAM", "SoftClip"]
parameter: chimMainSegmentMultNmax = 1 
parameter: sjdbOverhang = 100
parameter: mapping_quality = 255
if int(mem.replace("G","")) <  40:
    print("Insufficent memory for STAR, changing to 40G")
    star_mem = '40G'
else:
    star_mem = mem
if varVCFfile:
    if not path(varVCFfile).is_file():
        raise FileNotFoundError(f"Cannot find varVCFfile ``{varVCFfile}``")
    print("Adding wasp filter in STAR alignment")
# This option is commented out because it would force the downstream analysis to use 40G, which significantly slows down the process.
input: raw_reads, group_by = is_paired_end + 1, group_with = {"sample_id","read_length"}
output: cord_bam = f'{cwd}/{_sample_id}.Aligned.sortedByCoord.out.bam',
        trans_bam = f'{cwd}/{_sample_id}.Aligned.toTranscriptome.out.bam'
if _read_length == 0:
    print("Using specified --sjdbOverhang as read length")
else:
    print("Using read length specified in the sample list")
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = star_mem, cores = numThreads
bash: container=container, expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    
    # Remove previous output files
    rm -rf $[cwd]/$[_sample_id].*.out.*.gz
    rm -rf $[cwd]/$[_sample_id]._STARpass1

    set -e

    # Record start timestamp for alignment
    align_start_time=$(date +%s)

    # Run STAR alignment
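    # --sjdbOverhang below is set to the per-sample read length when the sample
    # list provides one (read_length > 0); otherwise the --sjdbOverhang
    # parameter (default 100) is used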
    star --runMode alignReads \
         --runThreadN $[numThreads] \
         --genomeDir $[STAR_index] \
         --readFilesIn $[_input:r] \
         --readFilesCommand $["cat" if uncompressed else "zcat"] \
         --outFileNamePrefix $[_output[0]:nnnn]. \
         --outSAMstrandField $[outSAMstrandField] \
         --twopassMode Basic \
         --outFilterMultimapNmax $[outFilterMultimapNmax] \
         --alignSJoverhangMin $[alignSJoverhangMin] \
         --alignSJDBoverhangMin $[alignSJDBoverhangMin] \
         --outFilterMismatchNmax $[outFilterMismatchNmax] \
         --outFilterMismatchNoverLmax $[outFilterMismatchNoverLmax] \
         --alignIntronMin $[alignIntronMin] \
         --alignIntronMax $[alignIntronMax] \
         --alignMatesGapMax $[alignMatesGapMax] \
         --outFilterType $[outFilterType] \
         --outFilterScoreMinOverLread $[outFilterScoreMinOverLread] \
         --outFilterMatchNminOverLread $[outFilterMatchNminOverLread] \
         --limitSjdbInsertNsj $[limitSjdbInsertNsj] \
         --outFilterIntronMotifs $[outFilterIntronMotifs] \
         --alignSoftClipAtReferenceEnds $[alignSoftClipAtReferenceEnds] \
         --quantMode $[" ".join(quantMode)] \
         --outSAMtype BAM Unsorted \
         --outSAMunmapped Within \
         --genomeLoad NoSharedMemory \
         --chimSegmentMin $[chimSegmentMin] \
         --chimJunctionOverhangMin $[chimJunctionOverhangMin] \
         --chimOutType $[" ".join(chimOutType)] \
         --chimMainSegmentMultNmax $[chimMainSegmentMultNmax] \
         --chimOutJunctionFormat 0 \
         --outSAMattributes $[" ".join(outSAMattributes)] $["vW" if varVCFfile else ""] \
         --outSAMattrRGline $[" ".join(outSAMattrRGline)] \
         --sjdbOverhang $[sjdbOverhang if _read_length == 0 else _read_length ] \
         --sjdbGTFfile $[gtf] $[("--varVCFfile %s --waspOutputMode SAMtag" % varVCFfile) if varVCFfile else ""]

    # Record end timestamp for alignment and calculate runtime
    align_end_time=$(date +%s)
    align_runtime=$((align_end_time - align_start_time))

    # Remove temporary files
    rm -r $[_output[0]:nnnn]._STARgenome
    $[f'rm -rf {_output[0]:nnnn}._STARtmp' if is_paired_end else ""]

    # Record start timestamp for sorting
    sort_start_time=$(date +%s)

    # Sort the aligned BAM file
    samtools sort --threads $[numThreads] -o $[_output[0]:nnnn].Aligned.sortedByCoord.out.bam $[_output[0]:nnnn].Aligned.out.bam
    rm $[_output[0]:nnnn].Aligned.out.bam

    # Record end timestamp for sorting and calculate runtime
    sort_end_time=$(date +%s)
    sort_runtime=$((sort_end_time - sort_start_time))

    # Index the final BAM file
    samtools index $[_output[0]]

    # Print runtime information
    echo "STAR alignment started at: $(date -d @$align_start_time)"
    echo "STAR alignment ended at: $(date -d @$align_end_time)"
    echo "STAR alignment runtime: $align_runtime seconds"
    echo ""
    echo "Sorting started at: $(date -d @$sort_start_time)"
    echo "Sorting ended at: $(date -d @$sort_end_time)"
    echo "Sorting runtime: $sort_runtime seconds"
[strand_detected_1: shared="step_strand_detected"]
input: output_from("STAR_align_1")["trans_bam"], group_with = "sample_id"
import pandas as pd
ReadsPerGene = pd.read_csv(f'{cwd}/{_sample_id}.ReadsPerGene.out.tab', sep = "\t", header = None)
# Skip the four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous),
# then sum the three count columns: unstranded, 1st read strand, 2nd read strand
ReadsPerGene_list = ReadsPerGene.loc[4:, :].sum(axis = 0, numeric_only = True).values.tolist()
strand_percentage = [x/ReadsPerGene_list[0] for x in ReadsPerGene_list]
print(f'For sample {_sample_id}:')
if strand_percentage[1] > 0.9:
    print(f'Counts for the 1st read strand aligned with RNA account for {strand_percentage[1]} of aligned counts, > 90%')
    print('Data is likely FR/fr-secondstrand')
    strand_detected = "fr"
elif strand_percentage[2] > 0.9:
    print(f'Counts for the 2nd read strand aligned with RNA account for {strand_percentage[2]} of aligned counts, > 90%')
    print('Data is likely RF/fr-firststrand')
    strand_detected = "rf"
elif max(strand_percentage[1], strand_percentage[2]) < 0.6:
    print(f'Both {strand_percentage[1]} and {strand_percentage[2]} are under 60%, so no single direction explains most reads')
    print('Data is likely unstranded')
    strand_detected = "unstranded"
else:
    strand_detected = 'strand_missing'
    print(f'Data does not fall into a likely stranded layout (max fraction explained {max(strand_percentage[1], strand_percentage[2])} is not > 0.9) or unstranded layout (not < 0.6); please check your data and manually specify the ``--strand`` option as ``fr``, ``rf`` or ``unstranded``')
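
As a worked toy example of the detection rule above (hypothetical totals; after dropping the summary rows, the three numeric columns of ReadsPerGene.out.tab are the unstranded, 1st-read-strand and 2nd-read-strand counts):

# Hypothetical column sums: [unstranded, strand 1, strand 2]
totals = [1_000_000, 30_000, 950_000]
pct = [x / totals[0] for x in totals]  # [1.0, 0.03, 0.95]
# pct[2] > 0.9, so this library would be detected as "rf" (fr-firststrand)
assert pct[2] > 0.9 and pct[1] <= 0.9
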
[strand_detected_2: shared="strand"]
input: group_by = "all"
parameter: strand = ""
import pandas as pd
if not strand:
    if len(strand_inv) > 0:
        strand = strand_inv
        for i in range(0,len(strand)):
            if strand[i] == "strand_missing":
                strand[i] = step_strand_detected[i]
        print(f'Using strand specified in the input samples list {strand}, replacing strand_missing with detected strand')
    else:
        warn_if(not all(x == step_strand_detected[0] for x in step_strand_detected), msg = "Detected strands differ among samples; please check your protocol. The detected strand will be used for each sample")    
        strand = step_strand_detected
        print(f'Using detected strand for each sample {strand}')
else:
    stop_if(strand not in ["fr", "rf", "unstranded"], msg = "``--strand`` option should be ``fr``, ``rf`` or ``unstranded``")
    print(f'Using ``--strand`` overwrite option for all samples: {strand}')
    strand = [strand] * len(step_strand_detected)

Step 4: Mark duplicate reads & QC through Picard#

This is the first QC step after STAR alignment. It runs Picard to collect multiple metrics on the RNA-seq alignment, then generates a new .bam file in which duplicate reads are marked with the SAM flag 0x400 (decimal 1024).

Step Inputs:#

  • STAR_bam: path to the output in Step 2.

  • reference_fasta: the fasta reference file used to generate the STAR index. It is critical that this fasta is exactly the same as the one used to generate the STAR index; otherwise this error will occur: cumc/xqtl-protocol#357

  • RefFlat file

This file is needed by the Picard CollectRnaSeqMetrics module, which produces metrics describing the distribution of bases within transcripts. It calculates the total numbers and fractions of nucleotides within specific genomic regions, including untranslated regions (UTRs), introns, intergenic sequences (between discrete genes), and peptide-coding sequences (exons). This tool also determines the number of bases that pass quality filters specific to Illumina data (PF_BASES).

The refFlat file can be generated by the reference_data module, RefFlat_generation step.
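
For reference, refFlat is a tab-delimited, 11-column UCSC format (gene name, transcript name, chromosome, strand, transcript start/end, CDS start/end, exon count, and comma-separated exon starts/ends). A single made-up record looks like:

GENE1	TX1.1	chr1	+	11868	14409	11868	11868	3	11868,12612,13220,	12227,12721,14409,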

Step Outputs:#

  • A collection of metrics file for each of the samples

  • A new .bam file in which duplicate reads are marked with the SAM flag 0x400 (decimal 1024)
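
To sanity-check the duplicate marking, reads carrying the 0x400 flag can be counted directly; the shell equivalent is samtools view -c -f 1024 sample.md.bam. A minimal pysam sketch (hypothetical file name; pysam is assumed to be available):

import pysam

# Count reads flagged as duplicates (SAM flag 0x400 = 1024) in a marked BAM.
n_dup = n_total = 0
with pysam.AlignmentFile("sample.Aligned.sortedByCoord.out.md.bam", "rb") as bam:
    for read in bam:
        n_total += 1
        if read.is_duplicate:  # tests SAM flag 0x400
            n_dup += 1
print(f"{n_dup}/{n_total} reads flagged as duplicates")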

[picard_qc, STAR_align_2]
# Reference gene model
parameter: gtf = path
depends: sos_variable('strand')
# Path to flat reference file, for computing QC metric
parameter: ref_flat = path
# The fasta reference file used to generate star index
parameter: reference_fasta = path
# For the patterned flowcell models (HiSeq X), change to 2500
parameter: optical_distance = 100
parameter: zap_raw_bam = False
picard_strand_dict = {"rf":"SECOND_READ_TRANSCRIPTION_STRAND","fr": "FIRST_READ_TRANSCRIPTION_STRAND","unstranded":"NONE" }
input: output_from('filter_reads'), group_by = 2, group_with = {"sample_id","strand"}
output: picard_metrics = f'{cwd}/{_sample_id}.alignment_summary_metrics',
        picard_rna_metrics = f'{cwd}/{_sample_id}.rna_metrics',
        md_bam = f'{_input[0]:n}.md.bam',
        md_metrics = f'{_input[0]:n}.md.metrics',
        bigwig = f'{_input[0]:n}.md.bw',
        output_summary = f'{cwd}/{_sample_id}.bam_file_meta'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    # Record start timestamp for CollectMultipleMetrics
    start_time=$(date +%s)
    echo "CollectMultipleMetrics started at: $(date)"
    set -e
    picard -Xmx${java_mem} CollectMultipleMetrics \
        -REFERENCE_SEQUENCE ${reference_fasta} \
        -PROGRAM CollectAlignmentSummaryMetrics \
        -PROGRAM CollectInsertSizeMetrics \
        -PROGRAM QualityScoreDistribution \
        -PROGRAM MeanQualityByCycle \
        -PROGRAM CollectBaseDistributionByCycle \
        -PROGRAM CollectGcBiasMetrics \
        -VALIDATION_STRINGENCY STRICT \
        -INPUT  ${_input["cord_bam_wasp_qc"]} \
        -OUTPUT  ${_output[0]:n}
    
    # Record end timestamp and calculate runtime for CollectMultipleMetrics
    end_time=$(date +%s)
    runtime=$((end_time - start_time))
    echo "CollectMultipleMetrics ended at: $(date)"
    echo "CollectMultipleMetrics runtime: $runtime seconds"

bash: container=container, expand= "${ }", stderr = f'{_output[-1]:n}.stderr', stdout = f'{_output[-1]:n}.stdout', entrypoint=entrypoint
    start_time=$(date +%s)
    echo "MarkDuplicates started at: $(date)"
    set -e
    picard -Xmx${java_mem}  MarkDuplicates \
        -I ${_input["cord_bam_wasp_qc"]}  \
        -O ${_output[2]} \
        -PROGRAM_RECORD_ID null \
        -M ${_output[3]} \
        -TMP_DIR ${cwd} \
        -MAX_RECORDS_IN_RAM 500000 -SORTING_COLLECTION_SIZE_RATIO 0.25 \
        -ASSUME_SORT_ORDER coordinate \
        -TAGGING_POLICY DontTag \
        -OPTICAL_DUPLICATE_PIXEL_DISTANCE ${optical_distance} \
        -CREATE_INDEX true \
        -CREATE_MD5_FILE true \
        -VALIDATION_STRINGENCY STRICT \
        -REMOVE_SEQUENCING_DUPLICATES false \
        -REMOVE_DUPLICATES false
    samtools index ${_output[2]}
    # Record end timestamp and calculate runtime for MarkDuplicates
    end_time=$(date +%s)
    runtime=$((end_time - start_time))
    echo "MarkDuplicates ended at: $(date)"
    echo "MarkDuplicates runtime: $runtime seconds"

bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.rna_metrics.stderr', stdout = f'{_output[0]:n}.rna_metrics.stdout', entrypoint=entrypoint
    set -e
    # Get only lines with rRNA and transcript_id from the GTF file
    cat ${gtf} | grep rRNA | grep transcript_id > ${cwd}/$(basename ${gtf}).tmp.${_input["cord_bam_wasp_qc"]:bnnnn}  # written under the output path to avoid data races between samples
    
    # Extract header from the BAM file
    samtools view -H ${_input["cord_bam_wasp_qc"]}  > ${_input["cord_bam_wasp_qc"]}.RI

python: container=container, expand= "${ }", stderr = f'{_output[0]:n}.rna_metrics.stderr', stdout = f'{_output[0]:n}.rna_metrics.stdout', entrypoint=entrypoint
        import os
        import pandas as pd
        from collections import defaultdict
        chrom = []
        start = []
        end = []
        strand = []
        tag = []
        filename = os.path.basename("${gtf}.tmp.${_input["cord_bam_wasp_qc"]:bnnnn}")
        annotation_gtf = os.path.join("${cwd}",filename) # read the rRNA/transcript_id subset of the GTF written above

        with open(annotation_gtf, 'r') as gtf:
            for row in gtf:
                row = row.strip().split('\t')
                if row[0][0]=='#' or row[2]!="transcript": continue # skip header and non-transcript rows
                chrom.append(row[0])
                start.append(row[3])
                end.append(row[4])
                strand.append(row[6])
                attributes = defaultdict()
                for a in row[8].replace('"', '').split(';')[:-1]:
                    kv = a.strip().split(' ')
                    if kv[0]!='tag':
                        attributes[kv[0]] = kv[1]
                    else:
                        attributes.setdefault('tags', []).append(kv[1])
                tag.append(attributes)
        transcript_id = [x["transcript_id"] for x in tag]
        RI = pd.DataFrame(data={'chr':chrom, 'start':start, 'end':end, 'strand':strand,'transcript_id' : transcript_id })
        RI.to_csv("${_input["cord_bam_wasp_qc"]}.RI", index = False, header = False, mode = "a", sep = "\t")

bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.rna_metrics.stderr', stdout = f'{_output[0]:n}.rna_metrics.stdout', entrypoint=entrypoint
    # Record start timestamp for CollectRnaSeqMetrics
    start_time=$(date +%s)
    echo "CollectRnaSeqMetrics started at: $(date)"
    set -e
    picard -Xmx${java_mem} CollectRnaSeqMetrics \
        -REF_FLAT ${ref_flat} \
        -RIBOSOMAL_INTERVALS ${_input["cord_bam_wasp_qc"]}.RI \
        -STRAND_SPECIFICITY ${picard_strand_dict[_strand]} \
        -CHART_OUTPUT ${_output[0]:n}.rna_metrics.pdf \
        -VALIDATION_STRINGENCY STRICT \
        -INPUT ${_input["cord_bam_wasp_qc"]}  \
        -OUTPUT  ${_output[0]:n}.rna_metrics
    
    rm ${cwd}/$(basename ${gtf}).tmp.${_input["cord_bam_wasp_qc"]:bnnnn} # remove the temporary rRNA/transcript_id subset of the GTF
    
    # Record end timestamp and calculate runtime for CollectRnaSeqMetrics
    end_time=$(date +%s)
    runtime=$((end_time - start_time))
    echo "CollectRnaSeqMetrics ended at: $(date)"
    echo "CollectRnaSeqMetrics runtime: $runtime seconds"

bash: container=container, expand= "$[ ]", stderr = f'{_output[0]:n}.bw.stderr', stdout = f'{_output[0]:n}.bw.stdout', entrypoint=entrypoint
    generate_bigwig() {
        # Check if enough arguments are provided
        if [[ $# -ne 1 ]]; then
            echo "Usage: generate_bigwig <file.bam>"
            return 1
        fi

        local input_file="$1"
        
        # Derive file.bw from file.bam
        local bigwig_file="${input_file%.*}.bw"

        # Execute the commands
        samtools index "$input_file"
        bamCoverage -b "$input_file" -o "$bigwig_file"
    }

    export -f generate_bigwig  # To make it available in subshells, if needed.
    generate_bigwig $[_output["md_bam"]]

import pandas as pd
out = pd.DataFrame({
    "sample_id": _sample_id,
    "strand": _strand,
    "coord_bam_list": f'{_output["md_bam"]:b}',
    "BW_list": f'{_output["bigwig"]:b}',
    "SJ_list": f'{_output["md_bam"]:bnnnnn}.SJ.out.tab',
    "trans_bam_list": f'{_output["md_bam"]:bnnnn}.toTranscriptome.out{"_wasp_qc" if varVCFfile else ""}.bam'},
    index=[0])
out.to_csv(_output["output_summary"], sep="\t", index=False)

if zap_raw_bam:
    _input["cord_bam_wasp_qc"].zap()
[STAR_align_3]
input: group_by = "all"
depends: sos_variable("strand")
output: f'{cwd}/{sample_list:bn}.bam_file_list'
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = 1
python: container=container, expand= "${ }", entrypoint=entrypoint

    import pandas as pd

    coord_bam_list = [${_input:br,}][2::6]
    sorted_bigwig_list = [${_input:br,}][4::6]
    SJ_list = [f'{x}.SJ.out.tab' for x in [${_input:bnnnnnr,}][2::6]]
    trans_bam_list = [f'{x}.toTranscriptome.out{"_wasp_qc" if "${varVCFfile}" else ""}.bam' for x in [${_input:bnnnnr,}][2::6]]

    out = pd.DataFrame({
        "sample_id": ${sample_id},
        "strand": ${strand},
        "coord_bam_list": coord_bam_list,
        "BW_list": sorted_bigwig_list,
        "SJ_list": SJ_list,
        "trans_bam_list": trans_bam_list
    })

    out.to_csv("${_output}", sep="\t", index=False)

Step 5: Post-alignment QC through RNA-SeQC#

Documentation : RNA-SeQC and Script in docker

This is the second QC step after STAR alignment; it also performs RNA-seq quantification.

Step Inputs#

  • bam_list: path to the bam_file_list produced by the STAR alignment steps, which lists the strand and BAM files used for each sample.

  • gtf: reference annotation .gtf file; it must use the same chromosome name format as the index used to generate the BAM file, and must be a collapsed gene GTF

Step Outputs#

The following output files are generated in the output directory you provide:

  • {sample}.metrics.tsv : A tab-delimited list of (Statistic, Value) pairs of all statistics and metrics recorded.

  • {sample}.exon_reads.gct : A tab-delimited GCT file with (Exon ID, Gene Name, coverage) tuples for all exons which had at least part of one read mapped.

  • {sample}.gene_reads.gct : A tab-delimited GCT file with (Gene ID, Gene Name, coverage) tuples for all genes which had at least one read map to at least one of its exons.

  • {sample}.gene_tpm.gct : A tab-delimited GCT file with (Gene ID, Gene Name, TPM) tuples for all genes reported in the gene_reads.gct file. Note: this file is renamed to .gene_rpkm.gct if the --rpkm flag is present.

  • {sample}.fragmentSizes.txt : A list of fragment sizes recorded, if a BED file was provided

  • {sample}.coverage.tsv : A tab-delimited list of (Gene ID, Transcript ID, Mean Coverage, Coverage Std, Coverage CV) tuples for all transcripts encountered in the GTF.
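
GCT files carry two header lines (a version line and a dimensions line) before the tab-delimited table, whose first two columns are the feature ID ("Name") and "Description". They can be loaded with pandas by skipping those two rows, mirroring what the merge step below does (hypothetical file name):

import pandas as pd

# Load a per-sample gene TPM GCT; pandas decompresses the .gz transparently.
tpm = pd.read_csv("sample.rnaseqc.gene_tpm.gct.gz", sep="\t", skiprows=2,
                  index_col="Name").drop("Description", axis=1)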

[rnaseqc_call_1]
import os
import pandas as pd
parameter: bam_list = path
# Reference gene model
parameter: gtf = path
sample_inv = pd.read_csv(bam_list,sep = "\t")
parameter: detection_threshold = 5
parameter: mapping_quality = 255
## Extract strand information if the user has specified the strand
strand_list = sample_inv.strand.values.tolist()
stop_if(not all([x in ["fr", "rf", "unstranded"] for x in strand_list]), msg = "The strand column should only include ``fr``, ``rf`` or ``unstranded``; please check the bam_list")     
## Extract sample_id
sample_id = sample_inv.sample_id.values.tolist()
    
## Get the file name for coord_bam data
coord_bam_list_inv = sample_inv.coord_bam_list.values.tolist()
coord_bam_list = [f'{cwd}/{x}' for x in coord_bam_list_inv ]
input:  coord_bam_list, group_by = 1, group_with = {"sample_id","strand_list"}
output: f'{cwd}/{_sample_id}.rnaseqc.gene_tpm.gct.gz',
        f'{cwd}/{_sample_id}.rnaseqc.gene_reads.gct.gz',
        f'{cwd}/{_sample_id}.rnaseqc.exon_reads.gct.gz',
        f'{cwd}/{_sample_id}.rnaseqc.metrics.tsv'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    cd ${cwd} && \
    rnaseqc \
        ${gtf:a} \
        ${_input:a} \
        ${_output[0]:d}  \
        ${("--stranded " + _strand_list) if _strand_list != "unstranded" else ""} \
        ${f'--detection-threshold {detection_threshold}'} ${f'--mapping-quality {mapping_quality}'} ${ '' if is_paired_end else '-u'}

bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    mv ${_output[0]:d}/${_input:b}.gene_tpm.gct ${_output[0]:n}
    mv ${_output[0]:d}/${_input:b}.gene_reads.gct ${_output[1]:n}
    mv ${_output[0]:d}/${_input:b}.exon_reads.gct ${_output[2]:n}
    mv ${_output[0]:d}/${_input:b}.metrics.tsv ${_output[3]}
    gzip ${_output[0]:n} ${_output[1]:n} ${_output[2]:n}

The RNA-SeQC results are merged in the following step.

[rnaseqc_call_2]
parameter: bam_list = path
input: group_by = "all"
output: f'{cwd}/{bam_list:bn}.rnaseqc.gene_tpm.gct.gz',
        f'{cwd}/{bam_list:bn}.rnaseqc.gene_readsCount.gct.gz',
        f'{cwd}/{bam_list:bn}.rnaseqc.exon_readsCount.gct.gz',
        f'{cwd}/{bam_list:bn}.rnaseqc.metrics.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    import pandas as pd
    import os
    def make_gct(gct_path):
        # sample_name
        sample_name = ".".join(os.path.basename(gct_path).split(".")[:-4])
        # read_input
        pre_gct = pd.read_csv(gct_path,sep = "\t",
                              skiprows= 2,index_col="Name").drop("Description",axis = 1)
        pre_gct.index.name = "gene_ID"
        pre_gct.columns = [sample_name]
        return(pre_gct)

    def merge_gct(gct_path_list):
        gct = pd.DataFrame()
        for gct_path in gct_path_list:
            # outer-join each sample's column on the shared gene index
            gct_col = make_gct(gct_path)
            gct = gct.merge(gct_col,right_index=True,left_index = True,how = "outer")
        return gct

    input_list = [${_input:r,}]
    tpm_list = input_list[0::4]
    gc_list = input_list[1::4]
    ec_list = input_list[2::4]
    gct_path_list_list = [tpm_list,gc_list,ec_list]
    output_path = [${_output:r,}][0:3]
    for i in range(0,len(output_path)):
        output = merge_gct(gct_path_list_list[i])
        output.to_csv(output_path[i], sep = "\t")
    metrics_list = input_list[3::4]
    with open("${cwd}/${bam_list:bn}.rnaseqc.metrics_output_list", "w") as f:
        f.write('\n'.join(metrics_list))

bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    aggregate_rnaseqc_metrics  ${_output[3]:n}_output_list ${_output[3]:nn}

Step 6: Quantify expression through RSEM#

Documentation : RSEM and Script in docker

This step generates expression matrices from the STAR output: gene- and isoform-level expression estimates from the RNA-seq data.

Step Input#

  • bam_list: path to the bam_file_list produced by the STAR alignment steps, which lists the strand and BAM files used for each sample.

  • transcript-level BAM file: path to the output of Step 3.

  • RSEM_index: path to RSEM index

Step Outputs#

Please see the output section of https://deweylab.github.io/RSEM/rsem-calculate-expression.html
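
As a quick orientation, each per-sample .genes.results file is a tab-delimited table with columns gene_id, transcript_id(s), length, effective_length, expected_count, TPM and FPKM; a minimal reading sketch (hypothetical file name):

import pandas as pd

# Load a per-sample RSEM gene-level table and extract the TPM column.
genes = pd.read_csv("sample.rsem.genes.results", sep="\t")
tpm = genes.set_index("gene_id")["TPM"]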

[rsem_call_1]
parameter: bam_list = path
parameter: RSEM_index = path
parameter: max_frag_len = 1000
parameter: estimate_rspd = True

import pandas as pd
sample_inv = pd.read_csv(bam_list, sep = "\t")

## Extract strand information if the user has specified the strand
strand_list = sample_inv.strand.values.tolist()
stop_if(not all([x in ["fr", "rf", "unstranded"] for x in strand_list]), msg = "The strand column should only include ``fr``, ``rf`` or ``unstranded``; please check the bam_list")     
## Extract sample_id
sample_id = sample_inv.sample_id.values.tolist()
    
## Get the file name for trans_bam_list data
trans_bam_list_inv = sample_inv.trans_bam_list.values.tolist()
trans_bam_list = [f'{cwd}/{x}' for x in trans_bam_list_inv]
input: trans_bam_list, group_by = 1, group_with = {"sample_id", "strand_list"} 
output: f'{cwd}/{_sample_id}.rsem.isoforms.results', f'{cwd}/{_sample_id}.rsem.genes.results',f'{cwd}/{_sample_id}.rsem.stat/{_sample_id}.rsem.cnt'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    run_rsem ${RSEM_index:a} ${_input:a} ${_sample_id} \
        -o ${_output[0]:d} \
        --max_frag_len ${max_frag_len} \
        --estimate_rspd ${'true' if estimate_rspd else 'false'} \
        --paired_end ${"true" if is_paired_end else "false"} \
        --is_stranded ${"true" if _strand_list != "unstranded" else "false"} \
        --threads ${numThreads}

The RSEM results are merged in the following step. Seven matrices will be generated (four for the isoform-level columns: expected_count, TPM, FPKM and IsoPct; three for the gene-level columns: expected_count, TPM and FPKM), plus an aggregated quality metrics table.
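
Assuming the merged matrices are feature-by-sample TSVs with the feature ID in the first column, they can be loaded directly (a sketch; the file name is illustrative):

import pandas as pd

# Merged gene-level TPM matrix: rows are genes, columns are samples.
tpm = pd.read_csv("bam_file_list.rsem_genes_tpm.txt.gz", sep="\t", index_col=0)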

[rsem_call_2]
parameter: bam_list = path
input: group_by = "all"
output: f'{cwd}/{bam_list:bn}.rsem_transcripts_expected_count.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_transcripts_tpm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_transcripts_fpkm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_transcripts_isopct.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_genes_expected_count.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_genes_tpm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_genes_fpkm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem.aggregated_quality.metrics.tsv'

task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
python: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    input_list = [${_input:r,}]
    with open('${cwd}/${bam_list:bn}.rsem.isoforms_output_list', "w") as f:
        f.write('\n'.join(input_list[0::3]))
    with open('${cwd}/${bam_list:bn}.rsem.genes_output_list', "w") as f:
        f.write('\n'.join(input_list[1::3]))

bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    aggregate_rsem_results ${cwd}/${bam_list:bn}.rsem.isoforms_output_list {expected_count,TPM,FPKM,IsoPct} ${_output[0]:nnn} -o ${cwd}
    aggregate_rsem_results ${cwd}/${bam_list:bn}.rsem.genes_output_list {expected_count,TPM,FPKM} ${_output[1]:nnn} -o ${cwd}
# -o specifies a writable output directory; the default is the cwd of the sos command (e.g. /home/username/data, which may be read-only)

R:  container=container, expand= "${ }", stderr = f'{_output[-1]:n}.stderr', stdout = f'{_output[-1]:n}.stdout', entrypoint=entrypoint
     readRSEM.cnt <- function (source) {
            # RSEM .cnt files gives statistics about the (transcriptome) alignment passed to RSEM:
            # Row 1: N0 (# unalignable reads);
            #        N1 (# alignable reads);
            #        N2 (# filtered reads due to too many alignments);
            #        N_tot (N0+N1+N2)
            # Row 2: nUnique (# reads aligned uniquely to a gene);
            #        nMulti (# reads aligned to multiple genes);
            #        nUncertain (# reads aligned to multiple locations in the given reference sequences, which include isoform-level multi-mapping reads)
            # Row 3: nHits (# total alignments);
            #        read_type (0: single-end read, no quality; 1: single-end read, with quality score; 2: paired-end read, no quality score; 3: paired-end read, with quality score)
            # Source: https://groups.google.com/forum/#!topic/rsem-users/usmPKgsC5LU
            # Note: N1 = nUnique + nMulti

            stopifnot(file.exists(source[1]))
            isDir <- file.info(source)$isdir
            if (isDir) {
                files <- system(paste("find", source, "-name \"*.rsem.cnt\""), intern=TRUE)
                stopifnot(length(files) > 0)
                samples <- gsub("_rsem.cnt", "", basename(files), fixed=TRUE)
            } else {
                files <- source
                samples <- gsub(".rsem.cnt", "", basename(files), fixed=TRUE)
            }
            metrics <- list()
            for (i in 1:length(files)) {
                m <- read.table(files[i], header=FALSE, sep=" ", comment.char="#", stringsAsFactors=FALSE, nrows=3, fill=TRUE)
                metrics[[i]] <- data.frame(Sample=samples[i],
                File=files[i],
                TotalReads=m[1, 4],
                AlignedReads=m[1, 2],
                UniquelyAlignedReads=m[2, 1],
                stringsAsFactors=FALSE)
            }
            metrics <- do.call(rbind, metrics)
            row.names(metrics) <- metrics$Sample

            return(metrics)
        }
        sourceRSEM = c(${_input:r,})
        sourceRSEM = sourceRSEM[seq(3,length(sourceRSEM),3)]
        metrics.RSEM = readRSEM.cnt(sourceRSEM)
        write.table(metrics.RSEM, file="${_output[-1]}",col.names=TRUE, row.names=FALSE, quote=FALSE)

Step 7: summarize with MultiQC#

MultiQC

MultiQC is a reporting tool that parses summary statistics from results and log files generated by other bioinformatics tools. MultiQC doesn’t run other tools for you - it’s designed to be placed at the end of analysis pipelines or to be run manually when you’ve finished running your tools. When you launch MultiQC, it recursively searches through any provided file paths and finds files that it recognises. It parses relevant information from these and generates a single stand-alone HTML report file. It also saves a directory of data files with all parsed data for further downstream use.

MultiQC will automatically generate a QC report for anything found within the given directory, so providing the directory containing all of the output will suffice.

The output of MultiQC is a multi-module report, with each module corresponding to the quality report of one of the analysis steps performed previously.

[rsem_call_3,rnaseqc_call_3]
output: f'{_input[0]:nnn}.multiqc_report.html'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
report: output = f"{_output:n}.multiqc_config.yml"
  extra_fn_clean_exts:
      - '_rsem'
  fn_ignore_dirs:
      - '*_STARpass1'
bash:  container=container,expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    multiqc ${_input:d} -v -n ${_output:b} -o ${_output:d} -c ${_output:n}.multiqc_config.yml
[rsem_call_4, rnaseqc_call_4]
# Path to flat reference file, for computing QC metric
output: f'{_input[0]:nn}.picard.aggregated_quality.metrics.tsv' # note: the output lands in a deeply nested directory under {cwd}/
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
R: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    ## Define Function
    readPicard.alignment_summary_metrics <- function (source) {
      stopifnot(length(source) == 1)
      stopifnot(file.exists(source))
      isDir <- file.info(source)$isdir
      if (isDir) {
        files <- system(paste("find -L", source, "-name \"*.alignment_summary_metrics\""), intern=TRUE)
        stopifnot(length(files) > 0)
        samples <- gsub(".alignment_summary_metrics", "", basename(files), fixed=TRUE)
      } else {
        files <- source
        samples <- gsub(".alignment_summary_metrics", "", basename(files), fixed=TRUE)
      }
    
      metrics <- list()
      for (i in 1:length(files)) {
        m <- read.table(files[i], header=TRUE, sep="\t", comment.char="#", stringsAsFactors=FALSE, nrows=${int(is_paired_end)}+1)
        metrics[[i]] <- data.frame(Sample=samples[i],
                                   File=files[i],
                                   PF_READS=sum(m$PF_READS[1:2]),
                                   PF_READS_ALIGNED=sum(m$PF_READS_ALIGNED[1:2]),
                                   PCT_PF_READS_ALIGNED=sum(m$PF_READS_ALIGNED[1:2])/sum(m$PF_READS[1:2]),
                                   stringsAsFactors=FALSE)
      }
      metrics <- do.call(rbind, metrics)
      row.names(metrics) <- metrics$Sample
    
      return(metrics)
    }

    readPicard.rna_metrics <- function(source) {

      stopifnot(length(source) == 1)
      stopifnot(file.exists(source))
      isDir <- file.info(source)$isdir
      if (isDir) {
        files <- system(paste("find -L", source, "-name \"*.rna_metrics\""), intern=TRUE)
        stopifnot(length(files) > 0)
        samples <- gsub(".rna_metrics", "", basename(files), fixed=TRUE)
      } else {
        files <- source
        samples <- gsub(".rna_metrics", "", basename(files), fixed=TRUE)
      }

      metrics <- list()
      for (i in 1:length(files)) {
        m <- read.table(files[i], header=TRUE, sep="\t", comment.char="#", stringsAsFactors=FALSE, nrows=1)
        metrics[[i]] <- data.frame(Sample=samples[i],
                                   File=files[i],
                                   PCT_RIBOSOMAL_BASES=m$PCT_RIBOSOMAL_BASES,
                                   PCT_CODING_BASES=m$PCT_CODING_BASES,
                                   PCT_UTR_BASES=m$PCT_UTR_BASES,
                                   PCT_INTRONIC_BASES=m$PCT_INTRONIC_BASES,
                                   PCT_INTERGENIC_BASES=m$PCT_INTERGENIC_BASES,
                                   PCT_MRNA_BASES=m$PCT_MRNA_BASES,
                                   PCT_USABLE_BASES=m$PCT_USABLE_BASES,
                                   MEDIAN_CV_COVERAGE=m$MEDIAN_CV_COVERAGE,
                                   MEDIAN_5PRIME_BIAS=m$MEDIAN_5PRIME_BIAS,
                                   MEDIAN_3PRIME_BIAS=m$MEDIAN_3PRIME_BIAS,
                                   MEDIAN_5PRIME_TO_3PRIME_BIAS=m$MEDIAN_5PRIME_TO_3PRIME_BIAS,
                                   stringsAsFactors=FALSE)
      }
      metrics <- do.call(rbind, metrics)
      row.names(metrics) <- metrics$Sample
    
      return(metrics)
    }


    readPicard.duplicate_metrics <- function(source) {

      stopifnot(length(source) == 1)
      stopifnot(file.exists(source))
      pattern_suffix <- ifelse("${varVCFfile}"!="","_wasp_qc","")
      pattern <- paste0('*.Aligned.sortedByCoord.out',pattern_suffix,'.md.metrics')
      substitute <- paste0('.Aligned.sortedByCoord.out',pattern_suffix,'.md.metrics')
      isDir <- file.info(source)$isdir
      if (isDir) {
        files <- system(paste("find -L", source, "-name",pattern), intern=TRUE)
        stopifnot(length(files) > 0)
        samples <- gsub(substitute, "", basename(files), fixed=TRUE)
      } else {
        files <- source
        samples <- gsub(substitute, "", basename(files), fixed=TRUE)
      }

      metrics <- list()
      for (i in 1:length(files)) {
        m <- read.table(files[i], header=TRUE, sep="\t", comment.char="#", stringsAsFactors=FALSE, nrows=1)
        metrics[[i]] <- data.frame(Sample=samples[i],
                                   File=files[i],
                                   PERCENT_DUPLICATION=m$PERCENT_DUPLICATION,
                                   ESTIMATED_LIBRARY_SIZE=m$ESTIMATED_LIBRARY_SIZE,
                                   stringsAsFactors=FALSE)
      }
      metrics <- do.call(rbind, metrics)
      row.names(metrics) <- metrics$Sample
    
      return(metrics)
    }


    readPicard <- function(source) {
      metrics.aln <- readPicard.alignment_summary_metrics(source)
      metrics.rna <- readPicard.rna_metrics(source)
      metrics.dup <- readPicard.duplicate_metrics(source)

      stopifnot(all(row.names(metrics.aln) %in% row.names(metrics.rna)) &
                all(row.names(metrics.rna) %in% row.names(metrics.dup)) &
                all(row.names(metrics.dup) %in% row.names(metrics.aln)))

      metrics.aln$File <- NULL
      metrics.rna$File <- NULL
      metrics.dup$File <- NULL
      metrics.rna$Sample <- NULL
      metrics.dup$Sample <- NULL

      metrics <- cbind(metrics.aln, metrics.rna[row.names(metrics.aln), ])
      metrics <- cbind(metrics, metrics.dup[row.names(metrics.aln), ])

      return(metrics)
    }

    ## Execution  
    sourcePicard = ${_input[-1]:dr}

    Picard_qualityMetrics <- readPicard(sourcePicard)
    write.table(Picard_qualityMetrics, file="${_output}",col.names=TRUE, row.names=FALSE, quote=FALSE)


# Note: account for the hidden .checkpoint directory when an interactive run is interrupted by a previous error in [STAR_align]; it can cause inconsistent lengths among the metrics in metrics.aln.

Step 8: Filter Reads with WASP#

This optional step in the STAR alignment process offers filtering based on WASP correction and unique-read identification. WASP correction mitigates reference allele bias in RNA-seq data by eliminating reads with specific WASP flags. Unique-read filtering retains only uniquely mapped reads, identified by a mapping quality (MAPQ) value of 255, as per GTEx Consortium standards. These filters can be applied individually or in combination to enhance alignment quality and prepare BAM files for different analyses; a conceptual sketch follows the outputs list below.

Step Inputs:#

  • No inputs need to be specified manually; all files come from the output of STAR_align_1

Step Outputs:#

  • A collection of (optionally filtered) .bam files
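
The step below implements the filtering with samtools and grep; conceptually it amounts to the following (a minimal pysam sketch with hypothetical file names; STAR writes vW:i:1 for reads that pass the WASP test and vW:i:2-7 for failures):

import pysam

# Keep only uniquely mapped reads (MAPQ 255 for STAR) that pass WASP.
with pysam.AlignmentFile("in.bam", "rb") as src, \
     pysam.AlignmentFile("out.bam", "wb", template=src) as dst:
    for read in src:
        if read.mapping_quality < 255:
            continue  # unique-read filtering
        if read.has_tag("vW") and read.get_tag("vW") != 1:
            continue  # WASP filtering
        dst.write(read)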

[filter_reads]
parameter: unique=""
parameter: wasp=""
parameter: mapping_quality = 255
depends: sos_variable('strand')
input: output_from("STAR_align_1"), group_by = 2, group_with = {"sample_id","strand"}
output: cord_bam_wasp_qc = f'{cwd}/{_sample_id}.Aligned.sortedByCoord.out_wasp_qc.bam',
        trans_bam_wasp_qc = f'{cwd}/{_sample_id}.Aligned.toTranscriptome.out_wasp_qc.bam'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: container=container, expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint

    # WASP-based QC filtering function
    wasp_filter() {
        # Check if an argument is provided
        if [[ -z "$1" ]]; then
            echo "Usage: wasp_filter <file.something.ext>"
            return 1
        fi
        local input_file="$1"
        local base_name="${input_file%.*}"  # Removes the last extension
        local derived_file="${base_name%.*}.int.bam"  # Removes the second-to-last extension and appends .int.bam

        if [ $[unique] ]; then # pay attention to the spaces around the variable inside [ ]
            # wasp_filtering & unique_filtering
            if [ $[wasp] ]; then
                echo "Applying unique_WASP filtering"
                samtools view -h -q $[mapping_quality] "$input_file" | grep -v "vW:i:[2-7]" | samtools view -b > "$derived_file"
            # unique_filtering
            else
                echo "Applying unique filtering"
                samtools view -h -q $[mapping_quality] "$input_file" | samtools view -b > "$derived_file" # remember to convert SAM files back to BAM files
            fi
        else
            # only wasp_filtering: Potential Allele-Specific Alignment Bias
            if [ $[wasp] ]; then
                echo "Applying WASP filtering"
                samtools view -h "$input_file" | grep -v "vW:i:[2-7]" | samtools view -b > "$derived_file"
            # no filter
            else
                echo "Applying no filtering"
                cp "$input_file" "$derived_file"
            fi
        fi
        return 0
    }

    # Do not overwrite the original bam file in place with the filtered one: the file risks being read and overwritten simultaneously (observed as an "Exec format error").

    export -f wasp_filter  # Make it available in subshells, if needed

    # Perform WASP-based QC filtering
    if [ $[varVCFfile] ]; then
        wasp_filter $[_input[0]]
        wasp_filter $[_input[1]] 
        mv $[_output[0]:nnnn].Aligned.sortedByCoord.int.bam  $[_output[0]]
        mv $[_output[1]:nnnn].Aligned.toTranscriptome.int.bam  $[_output[1]]
    fi