Phenotype Data Formatting

Phenotype Data Formatting#

FIXME: this entire pipeline needs to be improved

Description#

We include a collection of workflows to format molecular phenotype data. These include workflows to separate phenotypes by chromosome, by user-provided regions, a workflow to subset bam files and a workflow to extract samples from phenotype files.

Input#

The input for this workflow is the collection of data for 1 molecular phenotype as described in the format of:

a complete residualized (covariates regressed out) molecular phenotype data
a region list

These input are outputs from previous pipelines such as covariate_preprocessing and gene_annotation.

Output#

A list of phenotype file (bed+index) for each chrom, annotated with genomic coordiates, suitable for TensorQTL analysis.
A lists of phenotype file (bed+index) for each gene, annotated with genomic coordiates, suitable for fine-mapping.

Minimal Working Example Steps#

The data and singularity image used are available on Synapse.

iii. Partition by chromosome#

This is necessary for cis TensorQTL analysis.

Timing < 1 min

!sos run phenotype_formatting.ipynb phenotype_by_chrom \
    --cwd ../../../output_test/phenotype_by_chrom \
    --phenoFile ../../../output_test/phenotype_by_chrom/protocol_example.protein.bed.gz \
    --chrom `for i in {21..22}; do echo chr$i; done` \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest \
    -c ../../csg.yml  -q neurology

INFO: Running phenotype_by_chrom_1:
INFO: phenotype_by_chrom_1 (index=0) is completed.
INFO: phenotype_by_chrom_1 (index=1) is completed.
INFO: phenotype_by_chrom_1 output:   output/phenotype_by_chrom/protocol_example.protein.bed.chr22.bed.gz output/phenotype_by_chrom/protocol_example.protein.bed.chr21.bed.gz in 2 groups
INFO: Running phenotype_by_chrom_2:
INFO: phenotype_by_chrom_2 is completed.
INFO: phenotype_by_chrom_2 output:   output/phenotype_by_chrom/protocol_example.protein.bed.phenotype_by_chrom_files.txt
INFO: Workflow phenotype_by_chrom (ID=w83c4137c28693564) is executed successfully with 2 completed steps and 3 completed substeps.

Troubleshooting#

Step	Substep	Problem	Possible Reason	Solution

Command Interface#

!sos run phenotype_formatting.ipynb -h

usage: sos run phenotype_formatting.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  phenotype_by_chrom
  phenotype_by_region
  bam_subsetting
  gct_extract_samples

Global Workflow Options:
  --cwd output (as path)
                        Work directory & output directory
  --container ''
                        The filename namefor output data
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""

  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 20 (as int)
                        Number of threads
  --phenoFile VAL (as path, required)
                        Path to the input molecular phenotype data.
  --name  f'{phenoFile:bn}'

                        name for the analysis output

Sections
  phenotype_by_chrom_1:
    Workflow Options:
      --chrom VAL VAL ... (as type, required)
                        list of chroms to extract
  phenotype_by_chrom_2:
  phenotype_by_region_1:
    Workflow Options:
      --region-list VAL (as path, required)
                        An index text file with 4 columns specifying the chr,
                        start, end and name of regions to analyze
  phenotype_by_region_2:
  bam_subsetting:
    Workflow Options:
      --region VAL VAL ... (as type, required)
                        Input to `samtools view` coordinates, for example,
                        --region chr21 chr22
  gct_extract_samples:  Extract samples from expression data generated by
                        RNASeQC
    Workflow Options:
      --keep-samples VAL (as path, required)

Setup and global parameters#

[global]
import os
# Work directory & output directory
parameter: cwd = path("output")
# The filename namefor output data
parameter: container = ''
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Path to the input molecular phenotype data.
parameter: phenoFile = path
# name for the analysis output
parameter: name= f'{phenoFile:bn}'

Process of molecular phenotype file#

This workflow produce a bed+tabix file for all the molecular pheno data that are included in the region list to feed into downstream analysis

[phenotype_by_chrom_1]
# list of chroms to extract 
parameter: chrom = list
chrom = list(set(chrom))
# Path to the input molecular phenotype data.
input: phenoFile, for_each = "chrom"
output: f'{cwd}/{name}.{_chrom}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
bash: expand = "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout',container = container, entrypoint=entrypoint
    zcat $[_input] | head -1 > $[_output:n]
    tabix $[_input] $[_chrom] >> $[_output:n] 
    bgzip -f $[_output:n]
    tabix -p bed $[_output] -f
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | grep -v "##"   | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[phenotype_by_chrom_2]
# Path to the input molecular phenotype data.
input: group_by = "all"
output: f'{cwd}/{name}.{step_name[:-2]}_files.txt',f'{cwd}/{name}.{step_name[:-2]}_files.region_list.txt'
import pandas as pd
chrom = [str(x).split(".")[-3].replace("chr","") for x in _input]
chrom_df = pd.DataFrame({"#id" : chrom ,"#dir" : _input})
chrom_df.to_csv(_output[0],index = 0,sep = "\t")
chrom_df["#chr"] = [f'chr{x}' for x in chrom]
phenoFile = pd.read_csv(phenoFile,sep = "\t", usecols = [0,1,2,3]).merge(chrom_df[["#chr","#dir"]],left_on = "#chr",right_on = "#chr").rename(columns={"#dir": "path"})
phenoFile.to_csv(_output[1], index = 0, sep = "\t")

[phenotype_annotate_by_tad]
parameter: TAD_list = path
parameter: phenotype_per_tad = 2 # This is the minimum number of epigenomics marker for a tadb to be considered having a functions.
input: phenoFile,TAD_list
output: f'{cwd}/{_input[0]:b}.{_input[1]:b}.{phenotype_per_tad}_pheno_per_region.region_list'
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout',container = container, entrypoint=entrypoint
    library(tidyverse)

    tabix_region <- function(file, region){
    data.table::fread(cmd = paste0("tabix -h ", file, " ", region)) %>%
     as_tibble() %>%
     mutate(
        !!names(.)[1] := as.character(.[[1]]),
        !!names(.)[2] := as.numeric(.[[2]])
    ) 
    }
    TAD_list = read_delim(${_input[1]:ar})%>%mutate(region  = paste0(`#chr`,":",start,"-",end),
                                        path = ${_input[0]:ar},  
                                        keep = map_lgl( region,~tabix_region(${_input[0]:ar}, .x)%>%nrow >= ${phenotype_per_tad} )) 
    TAD_list%>%filter(keep)%>%select(`#chr`,start,end,ID = index, path)%>%write_delim(${_output:ar},"\t")
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | grep -v "##"   | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[phenotype_by_chrom_gct_1]
# list of chroms to extract 
parameter: chrom = list
chrom = list(set(chrom))
# Path to the input molecular phenotype data.
input: phenoFile, for_each = "chrom"
output: f'{cwd:a}/{name}.{_chrom}.gct'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
bash: expand = "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout',container = container, entrypoint=entrypoint
    zcat $[_input] | head -1 > $[_output:n]
    tabix $[_input] $[_chrom] >> $[_output:n] 
    cat   $[_output:n] |  awk '{$1=$2=$3=""; print $0}' >> $[_output]
    rm $[_output:n]

bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | grep -v "##"   | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[phenotype_by_chrom_gct_2]
# Path to the input molecular phenotype data.
input: group_by = "all"
output: f'{cwd}/{name}.{step_name[:-2]}_files.txt',f'{cwd}/{name}.{step_name[:-2]}_files.region_list.txt'
import pandas as pd
chrom = [str(x).split(".")[-2].replace("chr","") for x in _input]
chrom_df = pd.DataFrame({"#id" : chrom ,"#dir" : _input})
chrom_df.to_csv(_output[0],index = 0,sep = "\t")
chrom_df["#chr"] = [f'chr{x}' for x in chrom]
phenoFile = pd.read_csv(phenoFile,sep = "\t", usecols = [0,1,2,3]).merge(chrom_df[["#chr","#dir"]],left_on = "#chr",right_on = "#chr").rename({"#dir":"path"})
phenoFile.to_csv(_output[1], index = 0, sep = "\t" )

[phenotype_by_region_1]
# An index text file with 4 columns specifying the chr, start, end and name of regions to analyze
parameter: region_list = path
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
# Get the unique chormosome that have regions to be analyzed.
def extract_chrom(lst):
    return list(set([item[0] for item in lst]))
chrom = extract_chrom(regions)
# Path to the input molecular phenotype data.
input: phenoFile, for_each = "regions"
output: f'{cwd}/{region_list:bn}_phenotype_by_region/{name}.{_regions[3]}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
bash: expand = "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout',container = container, entrypoint=entrypoint
    tabix -h  $[_input] $[_regions[0]]:$[_regions[1]]-$[_regions[2]] > $[_output:n]
    bgzip -f $[_output:n]

bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        for i in $[_output] ; do 
        echo "output_info: $i " 
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "` 
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "` 
        echo "output_column:" `zcat $i | grep -v "##"   | head -1 | wc -w `
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `
        echo "output_preview:"
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6 ; done

[phenotype_by_region_2]
input: group_by = "all"
output: f'{cwd}/{name}.{step_name[:-2]}_files.txt'
import pandas as pd
region_df = pd.DataFrame({"#id" :  [str(x).split(".")[-3] for x in _input]  ,"dir" : _input})
region_df.to_csv(_output,index = 0,sep = "\t")

[bam_subsetting]
# Input to `samtools view` coordinates, for example, --region chr21 chr22
parameter: region = list
# Path to the input molecular phenotype data.
parameter: phenoFile = paths
input: phenoFile , group_by = 1
output: f'{cwd}/{_input:bn}.subsetted.bam'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    samtools view -b ${_input} ${region} > ${_output}

# Extract samples from expression data generated by RNASeQC
[gct_extract_samples]
parameter: keep_samples = path
input: phenoFile
output: f'{_input[0]:nn}.sample_matched.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = "$[ ]", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', container = container
    library("dplyr")
    library("readr")
    phenoFile = read_delim($[_input[0]:ar], "\t", col_names = T, comment = "#")
    sample_lookup = read_delim($[keep_samples:ar], "\t" ,col_names = T, comment = "#")
    ## Make phenoFile consistant with sampleLookup, remove samples by select()
    int = intersect(colnames(phenoFile),unlist(sample_lookup[,1]))
    phenoFile_tmp = phenoFile%>%select(c(colnames(phenoFile)[1],all_of(int)))
    ## Add 2 header lines, https://github.com/getzlab/rnaseqc/blob/286f99dfd4164d33014241dd4f3149da0cddf5bf/src/RNASeQC.cpp#L426
    cat(paste("#1.2\n#", nrow(phenoFile_tmp), ncol(phenoFile_tmp) - 2, "\n"), file=$[_output:nr], append=FALSE)
    phenoFile_tmp%>%write_delim($[_output:nr],delim = "\t",col_names = T, append = T)