Reference Data Standardization#

Description#

We follow the TOPMed workflow from the Broad Institute for the acquisition and preparation of reference files used throughout our pipeline. Our processed reference data may be found in this folder on Synapse. We have also included the PDF document compiled by the Data Standardization Working Group on Synapse as well as on the ADSP Dashboard; it lists the reference data to use for the project.

Input and Output#

Download Reference Data#

Input: None
Output:

  • 00-All.vcf.gz

  • 00-All.vcf.gz.tbi

  • ERCC92.fa

  • ERCC92.gtf

  • ERCC92.zip

  • GRCh38_full_analysis_set_plus_decoy_hla.fa

  • Homo_sapiens.GRCh38.103.chr.gtf

  • Homo_sapiens.GRCh38.103.chr.gtf.gz

GFF3 To GTF#

Input: an uncompressed gff3 file (i.e., one that can be viewed with cat).

Output: a gtf file

Format Reference Data#

Input: a reference fasta file.

Output: a reference fasta file without ALT, HLA, or decoy contigs, but with ERCC sequences added:

  • ERCC92.patched.fa

  • GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta

  • GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.dict

  • GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta

  • GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta.fai

  • Homo_sapiens.GRCh38.103.chr.reformatted.gtf

Format Gene Feature Data#

Input: a gtf file.

Output: a gtf file with collapsed gene model.

Generate STAR Index#

Input: a gtf file and an accompanying fasta file.

Output: A folder containing STAR index files.

Generate RSEM Index#

Input: a gtf file and an accompanying fasta file.

Output: A folder containing RSEM index files.

Generate RefFlat Annotation for Picard#

Input: a gtf file

Output: a ref.flat file

Generate SUPPA Annotation for Psichomics#

Input: a gtf file

Output: a .SUPPA_annotation.rds file

Extract rsIDs for known variants#

Input: a .vcf.gz file

Output: a .variants.gz file

TAD Annotation#

Input: a file containing a list of TADs and a region list

Output: a .annotation file

Minimal Working Example Steps#

i. Download Reference Data#

Timing: ~2 hours

!sos run reference_data.ipynb download_hg_reference --cwd ../reference_data
!sos run reference_data.ipynb download_gene_annotation --cwd ../reference_data
!sos run reference_data.ipynb download_ercc_reference --cwd ../reference_data
!sos run reference_data.ipynb download_dbsnp --cwd ../reference_data
INFO: Running download_hg_reference: 
GRCh38_full_analysis_set_plus_decoy_hla.fa: downloaded
INFO: download_hg_reference is completed.
INFO: download_hg_reference output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa
INFO: Workflow download_hg_reference (ID=w5027ac774a9eadc1) is executed successfully with 1 completed step.
INFO: Running download_gene_annotation: 
Homo_sapiens.GRCh38.103.chr.gtf.gz: downloaded
INFO: download_gene_annotation is completed.
INFO: download_gene_annotation output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/Homo_sapiens.GRCh38.103.chr.gtf
INFO: Workflow download_gene_annotation (ID=w0bded47053a09483) is executed successfully with 1 completed step.
INFO: Running download_ercc_reference: 
ERCC92.zip: downloaded
INFO: download_ercc_reference is completed.
INFO: download_ercc_reference output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/ERCC92.gtf /restricted/projectnb/xqtl/xqtl_protocol/reference_data/ERCC92.fa
INFO: Workflow download_ercc_reference (ID=w1ed1866b727d8c9e) is executed successfully with 1 completed step.
INFO: Running download_dbsnp: 
00-All.vcf.gz: downloaded
00-All.vcf.gz.tbi: downloaded
INFO: download_dbsnp is completed.
INFO: download_dbsnp output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/00-All.vcf.gz /restricted/projectnb/xqtl/xqtl_protocol/reference_data/00-All.vcf.gz.tbi
INFO: Workflow download_dbsnp (ID=wd4d1efccb2946bf1) is executed successfully with 1 completed step.

ii. Format Reference Data#

Timing: ~40 min

!sos run reference_data.ipynb hg_reference \
    --cwd ../reference_data \
    --ercc-reference ../reference_data/ERCC92.fa \
    --hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    -c csg.yml -q neurology
INFO: Running HLA ALT Decoy removal: 
INFO: t52741959485e9056 submitted to neurology with job id Your job 2339992 ("job_t52741959485e9056") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: HLA ALT Decoy removal output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta
INFO: Running merge with ERCC reference: 
INFO: t84b67f567fd35728 submitted to neurology with job id Your job 2339996 ("job_t84b67f567fd35728") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: merge with ERCC reference output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta
INFO: Running index the fasta file: 
INFO: tf5b9c47f5c546b07 submitted to neurology with job id Your job 2340092 ("job_tf5b9c47f5c546b07") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: index the fasta file output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.dict
INFO: Workflow hg_reference (ID=wc092cd2056576e99) is executed successfully with 3 completed steps and 3 completed tasks.

Change the --stranded option to --no-stranded if using an un-stranded RNA-seq protocol.

!sos run reference_data.ipynb hg_gtf \
    --cwd ../reference_data \
    --hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --stranded \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    -c csg.yml -q neurology
INFO: Running add chr prefix to gtf file: 
INFO: tf004f605413b7240 re-execute completed
INFO: tf004f605413b7240 submitted to neurology with job id Your job 2473403 ("job_tf004f605413b7240") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: add chr prefix to gtf file output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.gtf
INFO: Running collapsed gene model: 
INFO: t764808e29ab699b0 re-execute completed
INFO: t764808e29ab699b0 submitted to neurology with job id Your job 2473428 ("job_t764808e29ab699b0") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: collapsed gene model output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf
INFO: Workflow hg_gtf (ID=wafa522e33646f5f8) is executed successfully with 2 completed steps and 2 completed tasks.

iii. Format Gene Feature Data#

Timing: ~1min

Change the --stranded option to --no-stranded if using an un-stranded RNA-seq protocol.

!sos run reference_data.ipynb gene_annotation \
    --cwd ../reference_data \
    --ercc-gtf ../reference_data/ERCC92.gtf \
    --hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --stranded \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    -s build -c csg.yml -q neurology
INFO: Running add chr prefix to gtf file: 
INFO: Running Preprocess ERCC gtf file: 
INFO: Step ercc_gtf (index=0) is ignored with signature constructed
INFO: Step ercc_gtf (index=0) is ignored with signature constructed
INFO: Preprocess ERCC gtf file output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/ERCC92.genes.patched.gtf
INFO: Step hg_gtf_1 (index=0) is ignored with signature constructed
INFO: Step hg_gtf_1 (index=0) is ignored with signature constructed
INFO: add chr prefix to gtf file output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.gtf
INFO: Running collapsed gene model: 
INFO: Step hg_gtf_2 (index=0) is ignored with signature constructed
INFO: Step hg_gtf_2 (index=0) is ignored with signature constructed
INFO: collapsed gene model output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf
INFO: Running gene_annotation: 
WARNING: Invalid tag "gene_annotation_Homo_sapiens.GRCh38.103.chr.reformatted.ERCC Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC" is added as "gene_annotation_Homo_sapiens.GRCh38.103.chr.reformatted.ERCCHomo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC"
INFO: tf66344318fec0849 submitted to neurology with job id Your job 2473505 ("job_tf66344318fec0849") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: gene_annotation output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf /restricted/projectnb/xqtl/xqtl_protocol/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf
INFO: Workflow gene_annotation (ID=wd680c7eefcd76b7e) is executed successfully with 1 completed step, 3 ignored steps and 1 completed task.

iv. Generate STAR Index#

Timing: ~30 min

Generate the STAR index without the GTF annotation file to allow for customized read length later in STAR alignment.

!sos run reference_data.ipynb STAR_index \
    --cwd ../reference_data \
    --hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    -c csg.yml -q neurology \
    --numThreads 10 \
    --mem 40G
INFO: Running STAR_index: 
INFO: t41bda200d129e222 re-execute completed
INFO: t41bda200d129e222 submitted to neurology with job id Your job 2380397 ("job_t41bda200d129e222") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: STAR_index output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/STAR_Index/chrName.txt /restricted/projectnb/xqtl/xqtl_protocol/reference_data/STAR_Index/SAindex... (8 items)
INFO: Workflow STAR_index (ID=wc6567c60a9fc9f11) is executed successfully with 1 completed step and 1 completed task.
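
Because the index above is built without a GTF, the annotation and a read-length-specific overhang are supplied later, at alignment time, via STAR's on-the-fly junction insertion. A minimal sketch of such a call is shown below; the FASTQ names are placeholders, and the actual alignment command lives in the protocol's RNA-seq alignment workflow, not in this notebook.

# sjdbOverhang should ideally be ReadLength - 1; 100 follows the TOPMed recommendation noted in the STAR index section further down
STAR --runMode alignReads \
    --genomeDir ../reference_data/STAR_Index \
    --sjdbGTFfile ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --sjdbOverhang 100 \
    --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
    --readFilesCommand zcat \
    --runThreadN 8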

v. Generate RSEM Index#

Timing: ~1min

Generate the RSEM index with the uncollapsed gtf file (i.e., the one without gene in its file name).

!sos run reference_data.ipynb RSEM_index \
    --cwd ../reference_data \
    --hg-reference ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    -c csg.yml -q neurology 
INFO: Running RSEM_index: 
INFO: t3cd7c2d1d2347634 submitted to neurology with job id Your job 2473576 ("job_t3cd7c2d1d2347634") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: RSEM_index output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/RSEM_Index/rsem_reference.n2g.idx.fa /restricted/projectnb/xqtl/xqtl_protocol/reference_data/RSEM_Index/rsem_reference.grp... (7 items)
INFO: Workflow RSEM_index (ID=wee55a025a3891eb8) is executed successfully with 1 completed step and 1 completed task.

vi. Generate RefFlat Annotation for Picard#

Timing: ~1min

!sos run reference_data.ipynb RefFlat_generation \
    --cwd ../reference_data \
    --hg-gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    -c csg.yml -q neurology 
INFO: Running RefFlat_generation: 
INFO: t9e2919744de892d8 submitted to neurology with job id Your job 2473638 ("job_t9e2919744de892d8") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: RefFlat_generation output:   ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat
INFO: Workflow RefFlat_generation (ID=w3f901e2f82fc33e0) is executed successfully with 1 completed step and 1 completed task.

vii. Generate SUPPA Annotation for Psichomics#

Timing: ~1min

!sos run reference_data.ipynb SUPPA_annotation \
    --cwd ../reference_data \
    --hg_gtf ../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container oras://ghcr.io/cumc/psichomics_apptainer:latest \
    -c csg.yml -q neurology 
INFO: Running SUPPA_annotation_1: 
INFO: t043bdd93ad2e832e submitted to neurology with job id Your job 2473686 ("job_t043bdd93ad2e832e") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: SUPPA_annotation_1 output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/hg38.Homo_sapiens.GRCh38.103.chr.reformatted.ERCC_SE_strict.ioe
INFO: Running SUPPA_annotation_2: 
INFO: tde3dad57fa2b8b7e submitted to neurology with job id Your job 2473726 ("job_tde3dad57fa2b8b7e") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: SUPPA_annotation_2 output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.SUPPA_annotation.rds
INFO: Workflow SUPPA_annotation (ID=wc2895daedb91047b) is executed successfully with 2 completed steps and 2 completed tasks.

viii. Extract rsIDs for known variants#

Timing: <30min

Extract the rsIDs of known variants so that we can distinguish them from the novel variants in our study. The procedure and rationale are explained in this post.

!sos run VCF_QC.ipynb dbsnp_annotate \
    --genoFile ../reference_data/00-All.vcf.gz \
    --cwd ../reference_data \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest \
    --numThreads 10 \
    -J 10 -c csg.yml -q neurology
INFO: Running dbsnp_annotate: 
INFO: tb731222e5323dc2d restart from status submitted
INFO: tb731222e5323dc2d submitted to neurology with job id Your job 2546318 ("job_tb731222e5323dc2d") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: dbsnp_annotate output:   /restricted/projectnb/xqtl/xqtl_protocol/reference_data/00-All.variants.gz
INFO: Workflow dbsnp_annotate (ID=w6bb14ba9b9686b80) is executed successfully with 1 completed step and 1 completed task.
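
Conceptually, this extraction pulls the position, alleles, and rsID of every known variant out of the dbSNP VCF. A minimal sketch of the idea using bcftools is shown below; the actual dbsnp_annotate step lives in VCF_QC.ipynb and may use a different tool and column layout.

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%ID\n' ../reference_data/00-All.vcf.gz \
    | bgzip > ../reference_data/00-All.variants.gz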

Troubleshooting#

Step | Substep | Problem | Possible Reason | Solution

Command interface#

sos run reference_data.ipynb -h
usage: sos run reference_data.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  download_hg_reference
  download_gene_annotation
  download_ercc_reference
  download_dbsnp
  gff3_to_gtf
  hg_reference
  hg_gtf
  ercc_gtf
  gene_annotation
  STAR_index
  RSEM_index
  RefFlat_generation
  SUPPA_annotation
  psi_hg38_annotation_modification

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files.
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 8 (as int)
                        Number of threads
  --container ''
                        Software container option

Sections
  download_hg_reference:
  download_gene_annotation:
  download_ercc_reference:
  download_dbsnp:
  gff3_to_gtf:
    Workflow Options:
      --gff3-file VAL (as path, required)
  hg_reference_1:
    Workflow Options:
      --hg-reference VAL (as path, required)
                        Path to HG reference file
  hg_reference_2:
    Workflow Options:
      --ercc-reference VAL (as path, required)
  hg_reference_3:
  hg_gtf_1:
    Workflow Options:
      --hg-reference VAL (as path, required)
      --hg-gtf VAL (as path, required)
  hg_gtf_2:
    Workflow Options:
      --[no-]stranded (required)
  ercc_gtf:
    Workflow Options:
      --ercc-gtf VAL (as path, required)
  gene_annotation:
  STAR_index:
    Workflow Options:
      --hg-reference VAL (as path, required)
  RSEM_index:
    Workflow Options:
      --hg-gtf VAL (as path, required)
      --hg-reference VAL (as path, required)
  RefFlat_generation:
    Workflow Options:
      --hg-gtf VAL (as path, required)
  SUPPA_annotation_1:
    Workflow Options:
      --hg-gtf VAL (as path, required)
  SUPPA_annotation_2:
    Workflow Options:
      --hg-gtf VAL (as path, required)
  psi_hg38_annotation_modification_1:
    Workflow Options:
      --hg-gtf VAL (as path, required)
      --hgrc-db VAL (as path, required)
[global]
# The output directory for generated files.
parameter: cwd = path("output")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
cwd = path(f'{cwd:a}')
from sos.utils import expand_size

Data download#

[download_hg_reference]
output: f"{cwd:a}/GRCh38_full_analysis_set_plus_decoy_hla.fa"
download: dest_dir = cwd
    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
[download_gene_annotation]
output: f"{cwd:a}/Homo_sapiens.GRCh38.103.chr.gtf"
download: dest_dir = cwd, decompress=True
    http://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.chr.gtf.gz
[download_ercc_reference]
output: f"{cwd:a}/ERCC92.gtf", f"{cwd:a}/ERCC92.fa"
download: dest_dir = cwd, decompress=True
    https://tools.thermofisher.com/content/sfs/manuals/ERCC92.zip
[download_dbsnp]
output: f"{cwd:a}/00-All.vcf.gz", f"{cwd:a}/00-All.vcf.gz.tbi"
download: dest_dir = cwd
    ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
    ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz.tbi

GFF3 to GTF formatting#

[gff3_to_gtf]
parameter: gff3_file = path
input: gff3_file
output: f'{cwd}/{_input:bn}.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint = entrypoint
        gffread ${_input} -T -o ${_output}
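
This workflow can be invoked on its own. A minimal sketch, where the GFF3 path is a placeholder and the container shown is simply the one used for the other reference-data steps above:

sos run pipeline/reference_data.ipynb gff3_to_gtf \
    --gff3-file reference_data/example_annotation.gff3 \
    --cwd reference_data \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest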

HG reference file preprocessing#

  1. Remove the HLA/ALT/decoy records from the fasta, because none of the downstream RNA-seq calling pipeline components can handle them properly (a quick sanity check is sketched after this list).

  2. Add ERCC sequences to the fasta file; even if ERCC spike-ins are not included in the RNA-seq library, adding them does no harm.

  3. Generate the index for the fasta file.
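
As a quick sanity check, the outputs of steps 1 and 2 can be inspected for leftover contigs and for the expected ERCC records. A sketch, assuming the file names from the minimal working example above:

# should print 0: no ALT/HLA/decoy contigs remain after step 1
grep '^>' ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta | grep -cE '_alt|^>HLA|_decoy'
# should print 92: the ERCC contigs added in step 2
grep -c '^>ERCC_' ../reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta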

[hg_reference_1 (HLA ALT Decoy removal)]
# Path to HG reference file
parameter: hg_reference = path
input: hg_reference
output:  f'{cwd}/{_input:bn}.noALT_noHLA_noDecoy.fasta'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
python: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint = entrypoint
    with open('${_input}', 'r') as fasta:
        contigs = fasta.read()
        contigs = contigs.split('>')
        contig_ids = [i.split(' ', 1)[0] for i in contigs]

        # exclude ALT, HLA and decoy contigs
        filtered_fasta = '>'.join([c for i,c in zip(contig_ids, contigs)
        if not (i[-4:]=='_alt' or i[:3]=='HLA' or i[-6:]=='_decoy')])
    
    with open('${_output}', 'w') as fasta:
        fasta.write(filtered_fasta)
[hg_reference_2 (merge with ERCC reference)]
parameter: ercc_reference = path
output: f'{cwd}/{_input:bn}_ERCC.fasta'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint = entrypoint
    sed 's/ERCC-/ERCC_/g' ${ercc_reference} > ${ercc_reference:n}.patched.fa
    cat ${_input} ${ercc_reference:n}.patched.fa > ${_output}
[hg_reference_3 (index the fasta file)]
output: f'{cwd}/{_input:bn}.dict'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint = entrypoint
    samtools faidx ${_input}
    picard CreateSequenceDictionary \
        R=${_input} \
        O=${_output}

Transcript and gene model reference processing#

This step modifies the gtf file for the following reasons:

  1. RSEM requires the GTF input to have the same chromosome name format (with the chr prefix) as the fasta file. Although for STAR this can be handled with the --sjdbGTFchrPrefix "chr" option, we still have to add the chr prefix for RSEM, so we do it here.

  2. The gene model collapsing script collapse_annotation.py from GTEx requires the gtf to use transcript_type instead of transcript_biotype in its annotation. We rename it here; although this could also be handled by modifying collapse_annotation.py while building the docker image, since we are already customizing the reference file for reason 1 above we add this customization as well.

  3. Add ERCC information to the gtf reference.

We may reimplement 1 and 2 with the alternative approaches discussed above if the problem with RSEM is solved, or when RSEM is no longer needed.
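
Conceptually, modifications 1 and 2 amount to the shell one-liner sketched below; it is shown only to illustrate the transformation, since the workflow itself implements this in R in the next cell, including a check of whether the fasta and gtf chromosome styles already agree.

awk 'BEGIN{FS=OFS="\t"} /^#/ {next} {$1 = "chr" $1; gsub("transcript_biotype", "transcript_type", $9); print}' \
    Homo_sapiens.GRCh38.103.chr.gtf > Homo_sapiens.GRCh38.103.chr.reformatted.gtf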

[hg_gtf_1 (add chr prefix to gtf file)]
parameter: hg_reference = path
parameter: hg_gtf = path
input: hg_reference, hg_gtf
output: f'{cwd}/{_input[1]:bn}.reformatted.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint = entrypoint
    library("readr")
    library("stringr")
    library("dplyr")
    options(scipen = 999)
    con <- file("${_input[0]}","r")
    fasta <- readLines(con,n=1)
    close(con)
    gtf = read_delim("${_input[1]}", "\t",  col_names  = F, comment = "#", col_types="ccccccccc")
    if(!str_detect(fasta,">chr")) {
        gtf_mod = gtf%>%mutate(X1 = str_remove_all(X1,"chr"))
    } else if (!any(str_detect(gtf$X1[1],"chr"))) {
        gtf_mod = gtf%>%mutate(X1 = paste0("chr",X1))
    } else { gtf_mod = gtf }
    if(any(str_detect(gtf_mod$X9, "transcript_biotype"))) {
      gtf_mod = gtf_mod%>%mutate(X9 = str_replace_all(X9,"transcript_biotype","transcript_type"))
    }
    gtf_mod%>%write.table("${_output}",sep = "\t",quote = FALSE,col.names = F,row.names = F)

Text below is taken from broadinstitute/gtex-pipeline

Gene-level expression and eQTLs from the GTEx project are calculated based on a collapsed gene model (i.e., combining all isoforms of a gene into a single transcript), according to the following rules:

  1. Transcripts annotated as “retained_intron” or “read_through” are excluded. Additionally, transcripts that overlap with annotated read-through transcripts may be blacklisted (blacklists for GENCODE v19, 24 & 25 are provided in this repository; no transcripts were blacklisted for v26).

  2. The union of all exon intervals of each gene is calculated.

  3. Overlapping intervals between genes are excluded from all genes.

The purpose of step 3 is primarily to exclude overlapping regions from genes annotated on both strands, which can’t be unambiguously quantified from unstranded RNA-seq (GTEx samples were sequenced using an unstranded protocol). For stranded protocols, this step can be skipped by adding the --collapse_only flag.

Further documentation is available on the GTEx Portal.

[hg_gtf_2 (collapsed gene model)]
parameter: stranded = bool
output: f'{_input:n}{".collapse_only" if stranded else ""}.gene.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint = entrypoint
    collapse_annotation.py ${"--collapse_only" if stranded else ""} ${_input} ${_output}
[ercc_gtf (Preprocess ERCC gtf file)]
parameter: ercc_gtf = path
input: ercc_gtf
output: f'{cwd}/{_input:bn}.genes.patched.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
python: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint = entrypoint
    with open('${_input}') as exon_gtf, open('${_output}', 'w') as gene_gtf:
        for line in exon_gtf:
            f = line.strip().split('\t')
            f[0] = f[0].replace('-','_')  # required for RNA-SeQC/GATK (no '-' in contig name)
        
            attr = f[8]
            if attr[-1]==';':
                attr = attr[:-1]
            attr = dict([i.split(' ') for i in attr.replace('"','').split('; ')])
            # add gene_name, gene_type
            attr['gene_name'] = attr['gene_id']
            attr['gene_type'] = 'ercc_control'
            attr['gene_status'] = 'KNOWN'
            attr['level'] = 2
            for k in ['id', 'type', 'name', 'status']:
                attr['transcript_'+k] = attr['gene_'+k]
        
            attr_str = []
            for k in ['gene_id', 'transcript_id', 'gene_type', 'gene_status', 'gene_name',
                'transcript_type', 'transcript_status', 'transcript_name']:
                attr_str.append('{0:s} "{1:s}";'.format(k, attr[k]))
            attr_str.append('{0:s} {1:d};'.format('level', attr['level']))
            f[8] = ' '.join(attr_str)
        
            # write gene, transcript, exon
            gene_gtf.write('\t'.join(f[:2]+['gene']+f[3:])+'\n')
            gene_gtf.write('\t'.join(f[:2]+['transcript']+f[3:])+'\n')
            f[8] = ' '.join(attr_str[:2])
            gene_gtf.write('\t'.join(f[:2]+['exon']+f[3:])+'\n')
[gene_annotation]
input: output_from("hg_gtf_1"), output_from("hg_gtf_2"), output_from("ercc_gtf")
output: f'{cwd}/{_input[0]:bn}.ERCC.gtf', f'{cwd}/{_input[1]:bn}.ERCC.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
    cat ${_input[0]} ${_input[2]} > ${_output[0]}
    cat ${_input[1]} ${_input[2]} > ${_output[1]}

Generating index file for STAR#

This step generates the index files for STAR alignment. They only need to be generated once and can be re-used.

At least 40GB of memory is needed.

Step Inputs#

  • gtf and fasta: paths to the reference annotation and sequence. Both need to be unzipped. The gtf should be the one prior to collapsing by gene.

  • sjdbOverhang: specifies the length of the genomic sequence around the annotated junctions used in constructing the splice junction database. Ideally, this length should equal ReadLength-1, where ReadLength is the length of the reads. We use 100 here as recommended by the TOPMed pipeline. See here for some additional discussion.

Step Output#

  • Index files stored in {cwd}/STAR_Index, which will be used by STAR

[STAR_index]
parameter: hg_reference = path
# Specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads.
# Default choice follows from TOPMed pipeline recommendation.
if expand_size(mem) < expand_size('40G'):
    print("Insufficent memory for STAR, changing to 40G")
    star_mem = '40G'
else:
    star_mem = mem
input: hg_reference
output: f"{cwd}/STAR_Index/chrName.txt", 
        f"{cwd}/STAR_Index/SAindex", f"{cwd}/STAR_Index/SA", f"{cwd}/STAR_Index/genomeParameters.txt", 
        f"{cwd}/STAR_Index/chrStart.txt",
        f"{cwd}/STAR_Index/chrLength.txt", 
        f"{cwd}/STAR_Index/Genome", f"{cwd}/STAR_Index/chrNameLength.txt"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bd}', cores = numThreads
bash: container=container, expand= "${ }", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]:n}.stdout', entrypoint = entrypoint
    STAR --runMode genomeGenerate \
         --genomeDir ${_output[0]:d} \
         --genomeFastaFiles ${_input[0]} \
         --runThreadN ${numThreads}

Generating index file for RSEM#

This step generates the index files for RSEM. They only need to be generated once.

Step Inputs#

  • gtf and fasta: paths to the reference annotation and sequence. The gtf should be the one prior to collapsing by gene.

  • sjdbOverhang: specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads.

Step Outputs#

  • Index files stored in RSEM_Index, which will be used by RSEM

[RSEM_index]
parameter: hg_gtf = path
parameter: hg_reference = path
input: hg_reference, hg_gtf
output: f"{cwd}/RSEM_Index/rsem_reference.n2g.idx.fa", f"{cwd}/RSEM_Index/rsem_reference.grp", 
        f"{cwd}/RSEM_Index/rsem_reference.idx.fa", f"{cwd}/RSEM_Index/rsem_reference.ti", 
        f"{cwd}/RSEM_Index/rsem_reference.chrlist", f"{cwd}/RSEM_Index/rsem_reference.seq", 
        f"{cwd}/RSEM_Index/rsem_reference.transcripts.fa"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bd}'
bash: container=container, expand= "${ }", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]:n}.stdout', entrypoint = entrypoint
    rsem-prepare-reference \
            ${_input[0]} \
            ${_output[1]:n} \
            --gtf ${_input[1]} \
            --num-threads ${numThreads}

Generation of RefFlat file#

This file is needed by the picard CollectRnaSeqMetrics module, which in turn

produces metrics describing the distribution of the bases within the transcripts. It calculates the total numbers and the fractions of nucleotides within specific genomic regions including untranslated regions (UTRs), introns, intergenic sequences (between discrete genes), and peptide-coding sequences (exons). This tool also determines the numbers of bases that pass quality filters that are specific to Illumina data (PF_BASES).
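
For context, the resulting ref.flat file is consumed downstream roughly as sketched below; the BAM path is a placeholder and the actual call is made in the RNA-seq QC workflow, not in this notebook.

# STRAND should match the library protocol (NONE, FIRST_READ_TRANSCRIPTION_STRAND, or SECOND_READ_TRANSCRIPTION_STRAND)
picard CollectRnaSeqMetrics \
    I=sample.bam \
    O=sample.rnaseq_metrics.txt \
    REF_FLAT=../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    STRAND=NONE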

[RefFlat_generation]
parameter: hg_gtf = path
input: hg_gtf
output: f'{_input:n}.ref.flat'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bd}'
bash: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', entrypoint = entrypoint
    gtfToGenePred ${_input}  ${_output}.tmp -genePredExt -geneNameAsName2
    awk -F'\t' -v OFS="\t" '{$1=$12 OFS $1;}7' ${_output}.tmp | cut -f 1-11 > ${_output}
    rm ${_output}.tmp

Generation of SUPPA annotation for psichomics.#

The generation of the custom alternative splicing annotation is based on this tutorial. The way to generate the local alternative splicing SUPPA output is documented on the SUPPA GitHub page.

sos run pipeline/reference_data.ipynb SUPPA_annotation \
    --hg_gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container containers/psichomics.sif
[SUPPA_annotation_1]
parameter: hg_gtf = path
input: hg_gtf
output: f'{cwd}/hg38.{_input:bn}_SE_strict.ioe' # The stderr file must not share the same prefix as the output file
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{cwd}/{_input:bn}.stderr', stdout = f'{cwd}/{_input:bn}.stdout', entrypoint = entrypoint
    suppa.py generateEvents -i ${_input} -o ${cwd}/hg38.${_input:bn} -f ioe -e SE SS MX RI FL
[SUPPA_annotation_2]
parameter: hg_gtf = path
output: f'{cwd}/{hg_gtf:bn}.SUPPA_annotation.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', entrypoint = entrypoint
    library("psichomics")
    suppa <- parseSuppaAnnotation("${_input:d}", genome="hg38") 
    annot <- prepareAnnotationFromEvents(suppa)
    saveRDS(annot, file=${_output:r})

Generate psichomics Hg38 splicing annotation#

Since the original annotation provided by the psichomics package uses gene symbols, we modified it to use Ensembl IDs. The modified annotation will be used for detecting RNA alternative splicing with psichomics.

sos run pipeline/reference_data.ipynb psi_hg38_annotation \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.gtf \
    --hgrc-db reference_data/hgnc_database.txt \
    --container container/psichomics.sif
[psi_hg38_annotation]
parameter: hg_gtf = path
# FIXME: Please document what this file is and where do we get it @xuanhe.
parameter: hgrc_db = path
input: hg_gtf, hgrc_db
output: f'{cwd}/psichomics_hg38_annotation.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', entrypoint = entrypoint
    library("psichomics")
    library("purrr")
    library("dplyr")
    library("tidyr")
    library("data.table")
  
    # load the psichomics default hg38 annotation, an option from listSplicingAnnotations()
    annotation <- loadAnnotation("AH63657")

    
    # reduce the dimension of the annotation file
    annotation <- 
      map(annotation, ~.x%>%
                       tidyr::unnest(cols = `Gene`))
  
    # Create empty columns for each event type for easier mapping
    annotation[["Tandem UTR"]][["SUPPA.Event.ID"]] <- NA
    annotation[["Tandem UTR"]][["VAST-TOOLS.Event.ID"]] <- NA
    annotation[["Alternative first exon"]][["VAST-TOOLS.Event.ID"]] <- NA
    annotation[["Alternative last exon"]][["VAST-TOOLS.Event.ID"]] <- NA
    annotation[["Mutually exclusive exon"]][["VAST-TOOLS.Event.ID"]] <- NA
  
    # extract Ensembl ID substring from original SUPPA.ID and VASTTOOL.ID
    annotation <- 
      map(annotation, ~.x%>%
                       mutate(ENSG.SUPPA = substr(`SUPPA.Event.ID`, 1, 15))%>%
                       mutate(ENSG.VAST = substr(`VAST-TOOLS.Event.ID`, 1, 15)))
  
    # Load gtf file
    gtf_sample <- read.table('${_input[0]}',header = FALSE, sep = '\t')
  
    # from the gtf file, separate gene names and the corresponding Ensembl IDs
    gtf_sample <- separate(gtf_sample, V9, sep = ";",into = c("gene_id", "transcript_id", "exon_number", "gene_name"))
    gtf_sample <- separate(gtf_sample, gene_id, sep = " ",into = c("gene_id", "gene_id_val"))
    gtf_sample <- separate(gtf_sample, gene_name, sep = "e ",into = c("gene_name", "gene_name_val"))
  
    gtf_name_id_match <- gtf_sample[,c("gene_id_val","gene_name_val")]
    gtf_name_id_match <- gtf_name_id_match[!duplicated(gtf_name_id_match), ]
  
    # For any matched approved id in the psi hg38 annotation and gtf file, record the corresponding Ensembl ID
    annotation <-
      map(annotation, ~.x%>%
                       mutate(`ENSG.GTF` = gtf_name_id_match$gene_id_val[match(`Gene`, gtf_name_id_match$gene_name_val)]))
  
    # load hgnc database
    hgnc_db <- fread('${_input[1]}', fill = TRUE, header = TRUE, sep = '\t', quote="")
  
    # Combine the `Ensembl ID(supplied by Ensembl)` and `Ensembl gene ID` columns; if there is any conflict use the former
    # For conflicting entries (15 total) both records point to the same gene name on the Ensembl website, so the order should not matter
    hgnc_db <- hgnc_db %>%
    mutate(ENSG.ID = ifelse(`Ensembl ID(supplied by Ensembl)` == "", `Ensembl gene ID`, `Ensembl ID(supplied by Ensembl)`))
  
    # Create a one to one reference list for approved names, previous names and aliases
    # There are no duplicate symbols or Ensembl IDs for approved symbols, so no chromosome verification is needed
    hgnc_name_id_match <- hgnc_db[,c("Approved symbol","ENSG.ID")]
    hgnc_name_prev_check <- hgnc_db[,c("Previous symbols","Chromosome","ENSG.ID")]
    hgnc_name_alias_check <- hgnc_db[,c("Alias symbols","Chromosome","ENSG.ID")]
  
    # Remove NAs
    hgnc_name_prev_check <- hgnc_name_prev_check[hgnc_name_prev_check$ENSG.ID != "",]
    hgnc_name_alias_check <- hgnc_name_alias_check[hgnc_name_alias_check$ENSG.ID != "",]

    hgnc_name_prev_check <- hgnc_name_prev_check[hgnc_name_prev_check$"Previous symbols" != "",] 
    hgnc_name_alias_check <- hgnc_name_alias_check[hgnc_name_alias_check$"Alias symbols" != "",]

    # Separate symbol column values from lists of symbols into individual rows, one symbol each
    hgnc_name_prev_check <- separate_rows(hgnc_name_prev_check, "Previous symbols", convert = FALSE)
    hgnc_name_alias_check <- separate_rows(hgnc_name_alias_check, "Alias symbols", convert = FALSE)
  
    # Convert chromosome info in the hgnc database to a number for matching with the other databases
    hgnc_name_prev_check <- separate(hgnc_name_prev_check, "Chromosome", sep = 'p', into = "Chrp", remove = FALSE)
    hgnc_name_prev_check <- separate(hgnc_name_prev_check, "Chromosome", sep = 'q', into = "Chrq", remove = FALSE)
    hgnc_name_prev_check <- hgnc_name_prev_check%>%
                                mutate(Chr = ifelse(nchar(hgnc_name_prev_check$Chrp) <= 2, Chrp, Chrq))
    
    hgnc_name_alias_check <- separate(hgnc_name_alias_check, "Chromosome", sep = 'p', into = "Chrp", remove = FALSE)
    hgnc_name_alias_check <- separate(hgnc_name_alias_check, "Chromosome", sep = 'q', into = "Chrq", remove = FALSE)
    hgnc_name_alias_check <- hgnc_name_alias_check%>%
                                mutate(Chr = ifelse(nchar(hgnc_name_alias_check$Chrp) <= 2, Chrp, Chrq))
  
    # For any matched approved id in the psi hg38 annotation and hgnc database, record the corresponding Ensembl ID
    annotation <-
      map(annotation, ~.x%>%
                       mutate(`ENSG.HGNC` = hgnc_name_id_match$`ENSG.ID`[match(`Gene`, hgnc_name_id_match$"Approved symbol")]))
  
    # Drop hypothetical genes
    annotation<-
      map(annotation, ~.x%>%
            subset(`Gene` != 'Hypothetical'))
  
    # In the remaining NAs, for any matched alias/previous name and chromosome between the psi hg38 annotation and the hgnc database, record the corresponding Ensembl ID
    annotation <-
      map(annotation, ~.x%>%
                      mutate(ENSG.HGNC = ifelse(is.na(`ENSG.HGNC`) | `ENSG.HGNC` == "",
                                                hgnc_name_alias_check$`ENSG.ID`[match(`Gene`, hgnc_name_alias_check$"Alias symbols") & match(`Chromosome`, hgnc_name_alias_check$Chr)],
                                                `ENSG.HGNC`))%>%
                      mutate(ENSG.HGNC = ifelse(is.na(`ENSG.HGNC`) | `ENSG.HGNC` == "",
                                                hgnc_name_prev_check$`ENSG.ID`[match(`Gene`, hgnc_name_prev_check$"Previous symbols") & match(`Chromosome`, hgnc_name_prev_check$Chr)],
                                                `ENSG.HGNC`)))
  
    # Build the final Ensembl ID column based on the gtf file first, then for remaining NAs check the HGNC database record, then the SUPPA and VAST-TOOLS records
    # Drop special cases where the Ensembl ID is not recorded in VAST-TOOLS or SUPPA
    # Finally, for the remaining NA Ensembl IDs, if the gene name in the original annotation is an NCBI (LOC) ID just use it
    annotation <-
      map(annotation, ~.x%>%
                       mutate(`ENSG.ID` = `ENSG.GTF`)%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`),
                                               `ENSG.HGNC`,
                                                `ENSG.ID`))%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`),
                                               `ENSG.VAST`,
                                                `ENSG.ID`))%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`),
                                               `ENSG.SUPPA`,
                                                `ENSG.ID`))%>%
                       mutate(`ENSG.ID` = ifelse(substr(`ENSG.ID`, 1, 4) == 'ENSG',
                                                 `ENSG.ID`,
                                                 NA))%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`) & substr(`Gene`, 1, 3) == 'LOC',
                                               `Gene`,
                                                `ENSG.ID`)))

    # Use the Ensembl IDs to replace gene names, drop remaining NAs
    annotation <-
    map(annotation, ~.x%>%
                       mutate(`Gene` = `ENSG.ID`)%>%
            drop_na(`Gene`)
          )

    # save modified annotation
    saveRDS(annotation, file = "${_output}")

Generate TAD associated index#

We use https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7895846/ to define the genotype cis region for fine-mapping with different phenotypes. The actual TAD data frames for each cell type and tissue are downloaded from the hg38 TAD link at http://3dgenome.fsm.northwestern.edu/publications.html. The generation of this TAD file follows three steps (an example invocation is sketched after the list):

  1. Union of TADs across tissues

  2. Extend with TAD boundaries

  3. Extend with cis windows
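
A hedged example invocation is sketched below; the file names and container are placeholders, and the workflow name matches the section defined next.

sos run pipeline/reference_data.ipynb Tad_annotateion \
    --TAD_list reference_data/TAD_list.tsv \
    --region_list reference_data/region_list.tsv \
    --TAD_genotype reference_data/TAD_genotype_list.tsv \
    --cwd reference_data \
    --container containers/bioinfo.sif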

[Tad_annotateion]
## The TAD file downloaded from the hg38 TAD link above
parameter: TAD_list = path
parameter: region_list = path
parameter: TAD_genotype = path
input: TAD_list, region_list
output: f'{cwd:a}/{region_list:bn}.annotation'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
python: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', entrypoint = entrypoint
    import pandas as pd
    import numpy as np
    #tad = pd.read_csv("${_input[0]}","\t",header = None)
    #tad.columns = ["#chr","start","end"]
    #tad["tad_index"] = [f'tad{x}' for x in np.arange(1,len(tad)+1)]
    tad = pd.read_csv("${_input[0]}","\t")
    region_list =  pd.read_csv("${_input[1]}","\t")
    tad = tad.merge(region_list, on = "#chr").query("start_y > start_x &  end_y < end_x").drop(["start_y","end_y"] , axis = 1).rename({"start_x":"start","end_x":"end"})
    tad.to_csv("${_output}","\t") 
    tad_geno_list = pd.read_csv("${TAD_genotype}","\t")
    tad_gene_id = tad.merge(tad_geno_list,left_on = "tad_index",right_on = "#id").iloc[:,[4,6]]
    tad_gene_id.columns = ["ID","dir"]
    tad_gene_id.to_csv("${_output}.genotype_files_list.tsv","\t") 