APA Calling#

Aim#

The purpose of this notebook is to call APA-based information (PDUI) based on DAPARS2 method.

Methods#

%preview ../../images/apa_calling.png
> ../../images/apa_calling.png (45.2 KiB):
../../../_images/6131e86ecc58a7cd2854b09a42a3b7f2266d002ef920445f28b515b24ac04a64.png

3’UTR Reference#

Call WIG data from transcriptome BAM files#

Using bedtools or rsem-bam2wig, for RSEM based alignment

Config files Generation#

  • Python 3 loops to read line by line the sum of reads coverage of all chromosome.

Dapars2 Main Function#

  • Dapars2_Multi_Sample.py: use the least sqaures methods to calculate the usage of long isoforms (3UTR/DaPars2)

    Note: this part of code have been modified from source to deal with some formatting discrepancy in wig file

Impute missing values in Dapars result#

KNN using impute R package.

Input#

  • A list of transcriptome level BAM files, eg generated by RSEM

  • The 3’UTR annotation reference file

If you do not have 3’UTR annotation file, please generate it first. Input to this step is the transcriptome level gene feature file in GTF format that we previously prepared.

Output#

  • Dapars config files

  • PUDI (Raw) information saved in txt

  • PDUI (Imputed) information saved in txt. This is recommended for further analysis.

Minimal working example#

To generate 3’UTR reference data,

sos run apa_calling.ipynb UTR_reference \
    --cwd output/apa \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container /mnt/mfs/statgen/ls3751/container/dapars2_final.sif

Command interface#

sos run apa_calling.ipynb -h
usage: sos run apa_calling.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  UTR_reference
  bam2tools
  bam2toolsv1
  bam2toolsv2
  bam2toolstmp
  APAconfig
  APAmain

Global Workflow Options:
  --walltime 400h
  --mem 200G
  --ncore 16 (as int)
  --cwd output (as path)
                        the output directory for generated files
  --numThreads 8 (as int)
                        Number of threads
  --job-size 1 (as int)
  --container ''

Sections
  UTR_reference:        Generate the 3UTR region according to the gtf file
    Workflow Options:
      --hg-gtf VAL (as path, required)
                        gtf file
  bam2tools:
    Workflow Options:
      --n 9 (as int)
  bam2toolsv1:
    Workflow Options:
      --sample VAL (as path, required)
      --tissue VAL (as path, required)
  bam2toolsv2:
    Workflow Options:
      --sample VAL (as path, required)
      --tissue VAL (as path, required)
  bam2toolstmp:
    Workflow Options:
      --sample VAL (as path, required)
      --tissue VAL (as path, required)
  APAconfig:            Generate configuration file
    Workflow Options:
      --bfile VAL (as path, required)
      --annotation VAL (as path, required)
      --least-pass-coverage-percentage 0.3 (as float)
                        Default parameters for Dapars2:
      --coverage-threshold 10 (as int)
  APAmain:              Call Dapars2 multi_chromosome
    Workflow Options:
      --[no-]chr-prefix (default to False)
      --chrlist VAL VAL ... (as type, required)

Workflow implementation#

[global]
parameter: walltime = '400h'
parameter: mem = '200G'
parameter: ncore = 16
# the output directory for generated files
parameter: cwd = path("output")
# Number of threads
parameter: numThreads = 8
parameter: job_size = 1
parameter: container = ''
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""

Step 0: Generate 3UTR regions based on GTF#

The 3UTR regions (saved in bed format) could be use repeatly for different samples. It only served as the reference region. You may not need to run it if given generated hg19/hg38 3UTR regions.

# Generate the 3UTR region according to the gtf file
[UTR_reference]
# gtf file
parameter: hg_gtf = path
input: hg_gtf
output: f'{cwd}/{_input:bn}.bed', 
        f'{cwd}/{_input:bn}.transcript_to_geneName.txt',
        f'{cwd}/{_input:bn}_3UTR.bed'
bash: expand = '${ }', container = container
    gtf2bed12.py --gtf ${_input} --out ${cwd}
    mv ${cwd}/gene_annotation.bed ${_output[0]}
    mv ${cwd}/transcript_to_geneName.txt ${_output[1]}
    DaPars_Extract_Anno.py -b ${_output[0]} -s ${_output[1]} -o ${_output[2]}

Step 1: Generate WIG calls and flagstat files from BAM files#

Generating WIG from BAM data via bedtools is recommended by Dapars authors. However, our transcriptome level calls are made using RSEM, which in fact contains a program called rsem-bam2wig with this one additional feature:

–no-fractional-weight : If this is set, RSEM will not look for “ZW” tag and each alignment appeared in the BAM file has weight 1. Set this if your BAM file is not generated by RSEM. Please note that this option must be at the end of the command line

Here we stick to bedtools because of its popularity and most generic BAM files may not have the ZW tag anyways.

[bam2tools]
parameter: n = 9
n  = [x for x in range(n)]
input: for_each = 'n'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand = True, container = container
    import glob
    import os
    import subprocess
    path = "/mnt/mfs/ctcn/datasets/rosmap/rnaseq/dlpfcTissue/batch{_n}/STAR_aligned"
    name = glob.glob(path + "/**/*Aligned.sortedByCoord.out.bam", recursive = True)
    wigpath = "/home/ls3751/project/ls3751/wig/batch{_n}/"
    if not os.path.exists(wigpath):
        os.makedirs(os.path.dirname(wigpath))
    for i in name:
        id = i.split("/")[-2]
        filedir = path + "/" + id + "/" + id + ".bam"
        out = wigpath + id + ".wig"
        new_cmd = "bedtools genomecov -ibam " + filedir + " -bga -split -trackline" + " > " + out
        os.system(new_cmd)
        out_2 = wigpath + id + ".flagstat"
        new_cmd_2 = f"samtools flagstat --thread {numThreads} " + filedir + " > " + out_2
        os.system(new_cmd_2)
[bam2toolsv1]
parameter: sample = path
parameter: tissue = path
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, concurrent = True
python: expand = "${ }", container = container
    import os
    import multiprocessing
    import re
    sample_list = [line.strip('\n') for line in open("${sample}")]

    def call_wig_flagstat(subject):
        id = re.findall(r"\D(\d{8})\D",subject)[1]
        prefix = "/mnt/mfs/statgen/ls3751/aqtl_analysis/wig/" + "${tissue}" + "/"
        out = prefix + id + ".wig"
        out_2 = prefix + id + ".flagstat"
        new_cmd = "bedtools genomecov -ibam " + subject + " -bga -split -trackline" + " > " + out
        os.system(new_cmd)
        new_cmd_2 = f"samtools flagstat --thread {numThreads} " + subject + " > " + out_2
        os.system(new_cmd_2)
    
    for sample in sample_list:
        call_wig_flagstat(sample)
    ### process = multiprocessing.Process(target = call_wig_flagstat, args =(sample_list[i]))
    ### process.start()
    ### for p in processes:
    ###     p.join()
[bam2toolsv2]
parameter: sample = path
parameter: tissue = path
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand = "${ }", container = container
    input="${sample}"
    while IFS=',' read -r col1 col2
    do
      bedtools genomecov -ibam $col1  -bga -split -trackline  > /mnt/mfs/statgen/ls3751/aqtl_analysis/wig/${tissue}/$col2.wig &
      samtools flagstat --thread ${numThreads} $col1  > /mnt/mfs/statgen/ls3751/aqtl_analysis/wig/${tissue}/$col2.flagstat &
    done < "$input"
[bam2toolstmp]
parameter: sample = path
parameter: tissue = path
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = ncore
python: expand = "${ }", container = container
    import os
    import re
    sample_list = [line.strip('\n') for line in open("${sample}")]

    def copy_wigfile(subject):
        id = re.findall(r"\D(\d{8})\D",subject)[0]
        prefix = "/mnt/mfs/statgen/ls3751/aqtl_analysis/wig/" + "${tissue}" + "/"
        file1 = os.path.splitext(subject)[0] + ".flagstat"
        out = prefix + id + ".wig"
        out_2 = prefix + id + ".flagstat"
        new_cmd = "cp " + subject + " " + out
        os.system(new_cmd)
        new_cmd_2 = "cp " + file1 + " " + out_2
        os.system(new_cmd_2)
    
    for sample in sample_list:
        copy_wigfile(sample)
sos run /mnt/mfs/statgen/ls3751/github/xqtl-protocol/code/molecular_phenotypes/calling/apa_calling.ipynb bam2toolsv2 --cwd /mnt/mfs/statgen/ls3751/aqtl_analysis/wig/scripts --sample /mnt/mfs/statgen/ls3751/aqtl_analysis/wig/scripts/AC_1.txt --tissue AC --container  /mnt/mfs/statgen/ls3751/container/dapars2_final.sif -c /home/ls3751/project/ls3751/csgg.yml
INFO: Running bam2toolsv2: 
INFO: bam2toolsv2 is completed.
INFO: Workflow bam2toolsv2 (ID=wdde1442ae33498bb) is executed successfully with 1 completed step.

Step 2: Generating config files and calculating sample depth#

Notes on input file format#

For the input file, it has the following format. Additional notes are:

  • The first line is the information of file. If you do not have them, please add any content on first line

  • The file must end with “.wig”. It will not cause any problem if you directly change from “.bedgraph”

  • If your input wig file did not have the characters “chr” in the first column, please set no_chr_prefix = T

head -n 10 /mnt/mfs/statgen/ls3751/MWE_dapars2/sample1.wig
track type=bedGraph
1	0	10099	0
1	10099	10113	1
1	10113	10146	0
1	10146	10157	2
1	10157	10230	0
1	10230	10241	2
1	10241	10279	0
1	10279	10301	1
1	10301	10329	0
# Generate configuration file
[APAconfig]
parameter: bfile = path
parameter: annotation = path
parameter: job_size = 1
# Default parameters for Dapars2:
parameter: least_pass_coverage_percentage = 0.3
parameter: coverage_threshold = 10
output: [f'{cwd}/sample_mapping_files.txt',f'{cwd}/sample_configuration_file.txt']
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = ncore
python3: expand = "${ }", container = container
    import re
    import os
    target_all_sample = os.listdir("${bfile}")
    target_all_sample = list(filter(lambda v: re.match('.*wig$', v), target_all_sample))
    target_all_sample = ["${bfile}" + "/" + w for w in target_all_sample]
    

    def extract_total_reads(input_flagstat_file):
        num_line = 0
        total_reads = '-1'
        #print input_flagstat_file
        for line in open(input_flagstat_file,'r'):
            num_line += 1
            if num_line == 5:
                total_reads = line.strip().split(' ')[0]
                break
        return total_reads

    #print(target_all_sample)
    print("INFO: Total",len(target_all_sample),"samples found in provided dirctory!")
    # Total depth file:
    mapping_file = open("${_output[0]}", "w")
    for current_sample in target_all_sample:
        flag = current_sample.split(".")[0] + ".flagstat"
        current_sample_total_depth = extract_total_reads(flag)
        field_out = [current_sample, str(current_sample_total_depth)]
        mapping_file.writelines('\t'.join(field_out) + '\n')

        print("Coverage of sample ", current_sample, ": ", current_sample_total_depth)
    mapping_file.close()

    # Configuration file:

    config_file = open(${_output[1]:r},"w")
    config_file.writelines(f"Annotated_3UTR=${annotation}\n")
    config_file.writelines( "Aligned_Wig_files=%s\n" % ",".join(target_all_sample))
    config_file.writelines(f"Output_directory=${cwd}/apa \n")
    config_file.writelines(f"Output_result_file=Dapars_result\n")
    config_file.writelines(f"Least_pass_coverage_percentage=${least_pass_coverage_percentage}\n")
    config_file.writelines( "Coverage_threshold=${coverage_threshold}\n")
    config_file.writelines( "Num_Threads=${numThreads}\n")
    config_file.writelines(f"sequencing_depth_file=${_output[0]}") 
    config_file.close()    

Step 3: Run Dapars2 main to calculate PDUIs#

Default input of Dapars2_Multi_Sample.py did not consider the situation that first column did not contain “chr” (shown in Step 2). We add a new argument no_chr_prefix (default is FALSE)

# Call Dapars2 multi_chromosome
[APAmain]
parameter: chr_prefix = False
parameter: chrlist = list
input: for_each = 'chrlist'
output: [f'{cwd}/apa_{x}/Dapars_result_result_temp.{x}.txt' for x in chrlist], group_by = 1
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = ncore
bash: expand = True, container = container
    python2 /mnt/mfs/statgen/ls3751/github/xqtl-protocol/code/Dapars2_Multi_Sample.py {cwd}/sample_configuration_file.txt {_chrlist} {"F" if chr_prefix else "T"}

Analysis demo#

Step 0: 3UTR generation#

sos run /mnt/mfs/statgen/ls3751/github/xqtl-protocol/code/molecular_phenotypes/calling/apa_calling.ipynb UTR_reference \
    --cwd /mnt/mfs/statgen/ls3751/MWE_dapars2/Output \
    --hg_gtf /mnt/mfs/statgen/ls3751/MWE_dapars2/gencode.v39.annotation.gtf \
    --container /mnt/mfs/statgen/ls3751/container/dapars2_final.sif
INFO: Running UTR_reference: Generate the 3UTR region according to the gtf file
Done!
Generating regions ...
Total extracted 3' UTR: 185865
Finished
INFO: UTR_reference is completed.
INFO: UTR_reference output:   /mnt/mfs/statgen/ls3751/MWE_dapars2/Output/gencode.v39.annotation.bed /mnt/mfs/statgen/ls3751/MWE_dapars2/Output/gencode.v39.annotation.transcript_to_geneName.txt... (3 items)
INFO: Workflow UTR_reference (ID=w907045e2667aefaa) is executed successfully with 1 completed step.
tree /mnt/mfs/statgen/ls3751/MWE_dapars2/Output
/mnt/mfs/statgen/ls3751/MWE_dapars2/Output
├── gencode.v19.annotation_3UTR.bed
├── gene_annotation.bed
└── transcript_to_geneName.txt

0 directories, 3 files

Step 1: Bam to wig process and extracting reads depth file#

sos run /mnt/mfs/statgen/ls3751/github/xqtl-protocol/code/molecular_phenotypes/calling/apa_calling.ipynb bam2tools \
    --n 0 1 2 3 4 5 6 7 8 \
    --container /mnt/mfs/statgen/ls3751/container/dapars2_final.sif
sos run /mnt/mfs/statgen/ls3751/github/xqtl-protocol/code/molecular_phenotypes/calling/apa_calling.ipynb bam2toolsv2 \
    --cwd /mnt/mfs/statgen/ls3751/aqtl_analysis \
    --sample /mnt/mfs/statgen/ls3751/aqtl_analysis/wig/scripts/AC_1.txt \
    --tissue AC \
    --container /mnt/mfs/statgen/ls3751/container/dapars2_final.sif

Step 2: Generating config files#

sos  run /mnt/mfs/statgen/ls3751/github/xqtl-protocol/code/molecular_phenotypes/calling/apa_calling.ipynb APAconfig \
    --cwd /mnt/mfs/statgen/ls3751/rosmap/dlpfcTissue/batch0 \
    --bfile /mnt/mfs/statgen/ls3751/rosmap/dlpfcTissue/batch0 \
    --annotation /mnt/mfs/statgen/ls3751/MWE_dapars2/Output/gencode.v39.annotation_3UTR.bed \
    --container /mnt/mfs/statgen/ls3751/container/dapars2_final.sif 
INFO: Running APAconfig: Calculcate total depth and configuration file
INFO: Total 4 samples found in provided dirctory!
Coverage of sample  /mnt/mfs/statgen/ls3751/MWE_dapars2/sample4.wig :  5943757290
Coverage of sample  /mnt/mfs/statgen/ls3751/MWE_dapars2/sample3.wig :  6060347021
Coverage of sample  /mnt/mfs/statgen/ls3751/MWE_dapars2/sample2.wig :  3617065465
Coverage of sample  /mnt/mfs/statgen/ls3751/MWE_dapars2/sample1.wig :  2573149742
INFO: APAconfig is completed.
INFO: APAconfig output:   /mnt/mfs/statgen/ls3751/MWE_dapars2/Output/sample_mapping_files.txt /mnt/mfs/statgen/ls3751/MWE_dapars2/Output/sample_configuration_file.txt
INFO: Workflow APAconfig (ID=wdc3865265bf562ea) is executed successfully with 1 completed step.
tree /mnt/mfs/statgen/ls3751/MWE_dapars2/Output
/mnt/mfs/statgen/ls3751/MWE_dapars2/Output
├── gencode.v19.annotation_3UTR.bed
├── gene_annotation.bed
├── sample_configuration_file.txt
├── sample_mapping_files.txt
└── transcript_to_geneName.txt

0 directories, 5 files

Step 3: Dapars2 Main#

Note: the example is a truncated version, which just have coverage in chr1,chr11 and chr12

sos run /mnt/mfs/statgen/ls3751/github/xqtl-protocol/code/molecular_phenotypes/calling/apa_calling.ipynb APAmain \
    --cwd /mnt/mfs/statgen/ls3751/rosmap/dlpfcTissue/batch0 \
    --chrlist chr21 chr14 chr1 \
    --container /mnt/mfs/statgen/ls3751/container/dapars2_final.sif
INFO: Running APAmain: Call Dapars2 multi_chromosome
[Sun Jan 30 04:15:34 2022] Start Analysis ...
All samples Joint Processing chr1 ...
[Sun Jan 30 04:15:34 2022] Loading Coverage ...
/mnt/mfs/statgen/ls3751/MWE_dapars2/sample4.wig
/mnt/mfs/statgen/ls3751/MWE_dapars2/sample3.wig
/mnt/mfs/statgen/ls3751/MWE_dapars2/sample2.wig
/mnt/mfs/statgen/ls3751/MWE_dapars2/sample1.wig
[5943757290 6060347021 3617065465 2573149742]
[Sun Jan 30 04:15:46 2022] Loading Coverage Finished ...
#assigned events:
1379
1379
1379
1379
1379
1378
1378
1378
[Sun Jan 30 04:15:51 2022] Finished!
INFO: APAmain is completed.
INFO: APAmain output:   /mnt/mfs/statgen/ls3751/MWE_dapars2/Output/Wigfiles_chr1/Dapars_result_result_temp.chr1.txt
INFO: Workflow APAmain (ID=w28477dad0713d81d) is executed successfully with 1 completed step.
tree /mnt/mfs/statgen/ls3751/MWE_dapars2/Output
/mnt/mfs/statgen/ls3751/MWE_dapars2/Output
├── gencode.v19.annotation_3UTR.bed
├── gene_annotation.bed
├── sample_configuration_file.txt
├── sample_mapping_files.txt
├── transcript_to_geneName.txt
└── Wigfiles_chr1
    ├── Dapars_result_result_temp.chr1.txt
    └── tmp
        ├── Each_processor_3UTR_Result_1.txt
        ├── Each_processor_3UTR_Result_2.txt
        ├── Each_processor_3UTR_Result_3.txt
        ├── Each_processor_3UTR_Result_4.txt
        ├── Each_processor_3UTR_Result_5.txt
        ├── Each_processor_3UTR_Result_6.txt
        ├── Each_processor_3UTR_Result_7.txt
        └── Each_processor_3UTR_Result_8.txt

2 directories, 14 files