Gene Coordinate Annotation#

Description#

We use a gene coordinate annotation pipeline based on pyqtl, as demonstrated here. This adds genomic coordinate annotations to gene-level molecular phenotype files generated in gct format and converts them to bed format for downstream analysis.
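
Conceptually, the conversion merges the phenotype matrix with TSS coordinates extracted from the GTF and writes the result as an indexed bed file. Below is a minimal sketch of this idea using pyqtl (file names are placeholders; the matrix is assumed to be tab-delimited with ENSEMBL IDs in the first column):

    import pandas as pd
    import qtl.io

    # TSS template from the collapsed GTF: chr/start/end encode each gene's TSS
    tss_df = qtl.io.gtf_to_tss_bed("genes.collapsed.gtf", feature="transcript",
                                   phenotype_id="gene_id")
    # phenotype matrix: first column is the ENSEMBL ID, remaining columns are samples
    pheno_df = pd.read_csv("phenotype.tsv.gz", sep="\t", index_col=0)
    # keep genes present in both, then write a bgzipped, tabix-indexed bed file
    bed_df = tss_df.merge(pheno_df, left_index=True, right_index=True)
    bed_df = bed_df.rename(columns={"chr": "#chr", "gene_id": "ID"})
    qtl.io.write_bed(bed_df, "phenotype.bed.gz")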

Alternative implementation#

Previously we used the biomaRt package in R instead of code from pyqtl. The core function calls are:

    ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
    ensembl_df <- getBM(attributes=c("ensembl_gene_id","chromosome_name", "start_position", "end_position"),mart=ensembl)

We require the ENSEMBL version to be specified explicitly in this pipeline. As of 2021, for the Brain xQTL project, we use ENSEMBL version 103.

Input#

  1. Molecular phenotype data in gct format, with the first column being the ENSEMBL ID and the remaining columns being sample names.

  2. GTF file for the collapsed gene model

    • the gene names must be consistent with the molecular phenotype data matrices (e.g., ENSG00000000003 vs. ENSG00000000003.1 will not work)

  3. (Optional) Meta-data to match sample names between the expression data and the genotype files

    • Tab delimited with header

    • Only 2 columns: the first column is the sample name in the expression data, the second is the sample name in the genotype data

    • must contain all the sample names in the expression matrices even if they don't exist in the genotype data
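
For example, a minimal lookup file might look like this (IDs are hypothetical; columns are tab-delimited and named sample_id and participant_id, matching the command interface below):

    sample_id    participant_id
    Sample_1     GENO_1
    Sample_2     GENO_2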

Output#

Molecular phenotype data in bed format.
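
For reference, the first few columns of the output look like this (values are illustrative; start/end encode the TSS as a 1-bp interval):

    #chr    start    end      ID                 sample_1    sample_2
    chr1    29553    29554    ENSG00000243485    0.12        -0.53
    chr1    65418    65419    ENSG00000186092    1.07        0.88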

Minimal Working Example Steps#

The MWE is uploaded to Synapse.

i. Coordinate Annotation#

a. Gene Expression#

Timing: <1 min

!sos run gene_annotation.ipynb annotate_coord_gene \
    --cwd ../../../output_test/phenotype_by_chrom \
    --phenoFile ../../../mwe_data/xQTL_discovery/MWE.log2cpm.mol_phe.bed.gz \
    --annotation-gtf ../../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --container  oras://ghcr.io/cumc/rna_quantification_apptainer:latest --phenotype-id-type gene_name \
    -c ../../csg.yml  -q neurology
INFO: Running annotate_coord_gene: 
INFO: tc987661f8bb5f0b1 submitted to neurology with job id Your job 3690363 ("job_tc987661f8bb5f0b1") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: annotate_coord_gene output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/MWE.log2cpm.mol_phe.bed.bed.gz /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/MWE.log2cpm.mol_phe.bed.region_list.txt
INFO: Workflow annotate_coord_gene (ID=w90bd1b9b7fb38fca) is executed successfully with 1 completed step and 1 completed task.

b. Proteins#

Timing: <1 min

!sos run gene_annotation.ipynb annotate_coord_protein \
    --cwd ../../../output_test/phenotype_by_chrom \
    --phenoFile ../../../mwe_data/protocol_data/input/xqtl_association/protocol_example.protein.csv \
    --annotation-gtf ../../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --sample-participant-lookup ../../../mwe_data/protocol_data/output/protocol_example.protein.sample_overlap.txt \
    --phenotype-id-type gene_name \
    --container  oras://ghcr.io/cumc/rna_quantification_apptainer:latest --sep ","  \
    -c ../../csg.yml  -q neurology
INFO: Running annotate_coord_protein: 
INFO: tbecf36ea3e901ea1 re-execute completed
INFO: tbecf36ea3e901ea1 submitted to neurology with job id Your job 3961996 ("job_tbecf36ea3e901ea1") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: annotate_coord_protein output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/protocol_example.protein.bed.gz /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/protocol_example.protein.region_list.txt
INFO: Workflow annotate_coord_protein (ID=w75d21004c6718efd) is executed successfully with 1 completed step and 1 completed task.

Troubleshooting#

| Step | Substep | Problem | Possible Reason | Solution |
| ---- | ------- | ------- | --------------- | -------- |

Command Interface#

!sos run gene_annotation.ipynb -h
usage: sos run gene_annotation.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  region_list_generation
  annotate_coord_gene
  annotate_coord_protein
  annotate_coord_biomart
  map_leafcutter_cluster_to_gene
  annotate_leafcutter_isoforms
  annotate_psichomics_isoforms

Global Workflow Options:
  --cwd output (as path)
                        Work directory & output directory
  --annotation-gtf VAL (as path, required)
                        gene gtf annotation table
  --phenoFile VAL (as path, required)
                        Molecular phenotype matrix
  --phenotype-id-type 'gene_id'
                        Whether the input data is named by gene_id or gene_name.
                        By default it is gene_id, if not, please change it to
                        gene_name
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 1 (as int)
                        Number of threads
  --container ''
  --sep '\t'
                        delimiter of phenoFile
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""


Sections
  region_list_generation:
  annotate_coord_gene:
    Workflow Options:
      --sample-participant-lookup . (as path)
                        A file to map sample ID from expression to genotype,
                        must contain two columns, sample_id and participant_id,
                        mapping IDs in the expression files to IDs in the
                        genotype (these can be the same).
  annotate_coord_protein:
    Workflow Options:
      --sample-participant-lookup . (as path)
                        A file to map sample ID from expression to genotype,
                        must contain two columns, sample_id and participant_id,
                        mapping IDs in the expression files to IDs in the
                        genotype (these can be the same).
      --protein-name-index . (as path)
      --protein-ID-type SOMAseqID
  annotate_coord_biomart:
    Workflow Options:
      --ensembl-version VAL (as int, required)
  map_leafcutter_cluster_to_gene:
    Workflow Options:
  --intron-count VAL (as path, required)
                        Extract the code in case psichomics needs to be
                        processed the same way. The PhenoFile in this step is
                        the intron_count file
  annotate_leafcutter_isoforms:
    Workflow Options:
      --sample-participant-lookup . (as path)
  annotate_psichomics_isoforms:
    Workflow Options:
      --sample-participant-lookup . (as path)

Setup and global parameters#

[global]
# Work directory & output directory
parameter: cwd = path("output")
#  gene gtf annotation table
parameter: annotation_gtf = path
# Molecular phenotype matrix
parameter: phenoFile = path
# Whether the input data is named by gene_id or gene_name. By default it is gene_id, if not, please change it to gene_name
parameter: phenotype_id_type = 'gene_id'
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
parameter: container = ""
## delimiter of phenoFile
parameter: sep = "\t"
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
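# If True, strip phenotype IDs of the form "A | B" down to "B" (used by annotate_coord_gene)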
parameter: strip_id = False

Region List generation#

Partitioning the data by gene requires a region list file which:

  1. has 5 columns: chr, start, end, gene_id, gene_name

  2. contains the same genes as, or fewer genes than, the bed file
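
For example (illustrative values; coordinates encode the TSS):

    #chr    start    end      gene_id            gene_name
    chr1    29553    29554    ENSG00000243485    MIR1302-2HG
    chr1    65418    65419    ENSG00000186092    OR4F5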

Input:

  1. A gtf file used to generate the bed file

  2. A phenotype bed file, which must have a gene_id column indicating the gene names.
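
An illustrative invocation, following the conventions of the MWE commands above (paths are placeholders):

    sos run gene_annotation.ipynb region_list_generation \
        --cwd output \
        --phenoFile phenotype.bed.gz \
        --annotation-gtf Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
        --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest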

[region_list_generation]
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bnn}.region_list'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
python: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
    import pandas as pd
    import qtl.io
    # get the five column data
    bed_template_df_id = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = "gene_id" )
    bed_template_df_name = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = "gene_name" )
    bed_template_df = bed_template_df_id.merge(bed_template_df_name, on = ["chr","start","end"])
    bed_template_df.columns = ["#chr","start","end","gene_id","gene_name"]
    pheno = pd.read_csv(${_input[0]:r}, sep = "\t")
    # Retaining only the genes in the data
    region_list = bed_template_df[bed_template_df.${phenotype_id_type}.isin(pheno.gene_id)]
    region_list.to_csv("${_output}", sep = "\t",index = False)

bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `cat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

Implementation using pyqtl#

Implementation based on the GTEx pipeline.

The following step annotates coordinates for the gene expression file.

[annotate_coord_gene]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.bed.gz', f'{cwd:a}/{_input[0]:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint

    import pandas as pd
    import qtl.io
    from pathlib import Path

    def prepare_bed(df, bed_template_df, chr_subset=None):
        bed_df = pd.merge(bed_template_df, df, left_index=True, right_index=True)
        bed_df = bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
        if chr_subset is not None:
            bed_df = bed_df[bed_df['#chr'].isin(chr_subset)]
        return bed_df

    def load_and_preprocess_data(input_path, drop_columns):
        df = pd.read_csv(input_path, sep="${sep}", skiprows=0)
        dc = [col for col in df.columns if col in drop_columns] # take the intersection of df.columns and drop_columns
        df = df.drop(dc, axis=1) # drop the intersecting columns
        if len(df.columns) < 2:
            raise ValueError("There are too few columns in the loaded dataframe, please check the delimiter of the input file. The default delimiter is tab")
        return df

    # convert first column to second column:ProjID to SampleID
    def rename_samples_using_lookup(df, lookup_path):
        # only validate and apply the lookup when a file was actually provided
        if Path(lookup_path).is_file():
            if Path(lookup_path).suffix == '.csv':
                sep = ','
            elif Path(lookup_path).suffix == '.tsv':
                sep = '\t'
            else:
                raise ValueError("Unsupported file format. Please use either .csv or .tsv files.")
            sample_participant_lookup = pd.read_csv(lookup_path, sep=sep, header=None, dtype={0: str, 1: str})
            rename_dict = dict(zip(sample_participant_lookup[0], sample_participant_lookup[1]))
            df.rename(columns=rename_dict, inplace=True)
        return df

    def load_bed_template(input_path, phenotype_id_type):
        if sum(qtl.io.gtf_to_tss_bed(input_path, feature='gene', phenotype_id="gene_id").index.duplicated()) > 0:
            raise ValueError(f"GTF file {input_path} needs to be collapsed into a gene model by the reference data processing module")

        bed_template_df_id = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_id")
        bed_template_df_name = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_name")
        bed_template_df = bed_template_df_id.merge(bed_template_df_name, on=["chr", "start", "end"])
        bed_template_df.columns = ["#chr", "start", "end", "gene_id", "gene_name"]
        bed_template_df = bed_template_df.set_index(phenotype_id_type, drop=False)

        return bed_template_df

    df = load_and_preprocess_data(${_input[0]:ar}, ["#chr","chr", "start", "end","stop","annot.seqnames","annot.start","annot.end"]) # This function drops any columns in the drop_columns list, so keep the list comprehensive to accommodate different sources of inputs.
    df.set_index(df.columns[0], inplace=True)
    df = rename_samples_using_lookup(df, "${sample_participant_lookup:a}")
    if ${strip_id}:  
        # Update the index by stripping the pattern "A | B" to "B"
        # and removing any leading or trailing spaces around "B"
        df.index = df.index.map(lambda x: x.split('|')[-1].strip() if '|' in x else x)
    bed_template_df = load_bed_template(${_input[1]:ar}, "${phenotype_id_type}")
    bed_df = prepare_bed(df, bed_template_df).drop("gene_name", axis=1)
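    # keep=False drops every occurrence of a duplicated gene_id, so ambiguously mapped genes are removed entirely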
    bed_df = bed_df.drop_duplicates("gene_id", keep=False)
    bed_df = bed_df.rename(columns={'gene_id': 'ID'})
    qtl.io.write_bed(bed_df, ${_output[0]:r})
    bed_df[["#chr","start","end","ID"]].assign(path = ${_output[0]:r}).to_csv(${_output[1]:r},"\t",index = False) 

bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout
        for i in $[_output[0]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
        for i in $[_output[1]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

The following step annotates coordinates for proteomics data.
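
When no protein_name_index file is provided, the step below assumes the first column encodes the gene name and UniProt ID together as gene_name|UniProt and splits it on the first |. A minimal illustration (the ID is hypothetical):

    # hypothetical phenotype ID in "gene_name|UniProt" form
    gene_name, uniprot = "MAPT|P10636".split("|", 1)  # -> ("MAPT", "P10636")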

[annotate_coord_protein]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
parameter: protein_name_index = path()
parameter: protein_ID_type = "SOMAseqID"
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.bed.gz', f'{cwd:a}/{_input[0]:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint

    import pandas as pd
    import qtl.io
    from pathlib import Path

    def prepare_bed(df, bed_template_df, chr_subset=None):
        bed_df = pd.merge(bed_template_df, df, left_index=True, right_index=True)
        bed_df = bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
        if chr_subset is not None:
            bed_df = bed_df[bed_df['#chr'].isin(chr_subset)]
        return bed_df

    def load_and_preprocess_data(input_path, drop_columns):
        df = pd.read_csv(input_path, sep="${sep}", skiprows=0)
        dc = [col for col in df.columns if col in drop_columns] # take the intersection of df.columns and drop_columns
        df = df.drop(dc, axis=1) # drop the intersecting columns
        if len(df.columns) < 2:
            raise ValueError("There are too few columns in the loaded dataframe, please check the delimiter of the input file. The default delimiter is tab")
        return df

    def rename_samples_using_lookup(df, lookup_path):
        sample_participant_lookup = Path(lookup_path)
        if sample_participant_lookup.is_file():
            sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=1, dtype={0:str,1:str})
            df.rename(columns=sample_participant_lookup_s.to_dict()["genotype_id"], inplace=True)
        return df

    def load_bed_template(input_path, phenotype_id_type):
        if sum(qtl.io.gtf_to_tss_bed(input_path, feature='gene', phenotype_id="gene_id").index.duplicated()) > 0:
            raise ValueError(f"GTF file {input_path} needs to be collapsed into a gene model by the reference data processing module")

        bed_template_df_id = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_id")
        bed_template_df_name = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_name")
        bed_template_df = bed_template_df_id.merge(bed_template_df_name, on=["chr", "start", "end"])
        bed_template_df.columns = ["#chr", "start", "end", "gene_id", "gene_name"]
        bed_template_df = bed_template_df.set_index(phenotype_id_type, drop=False)

        return bed_template_df

    df = load_and_preprocess_data(${_input[0]:ar}, ["#chr","chr", "start", "end","stop","annot.seqnames","annot.start","annot.end"]) # This function drops any columns in the drop_columns list, so keep the list comprehensive to accommodate different sources of inputs.
    protein_ID = df.columns.values[0]
    protein_name_index = Path("${protein_name_index:a}")
    if protein_name_index.is_file():
        df_info = pd.read_csv(protein_name_index).rename(columns={'${protein_ID_type}': protein_ID, 'EntrezGeneSymbol':'gene_name'})[['gene_name',protein_ID,'UniProt']]
        df = df_info.merge(df, on=protein_ID).drop(protein_ID,axis=1)
    else:
        df[[protein_ID, 'UniProt']] = df[protein_ID].astype(str).str.split('|', n=1, expand=True)
    df.set_index(df.columns[0], inplace=True)
    df = rename_samples_using_lookup(df, "${sample_participant_lookup:a}")
    bed_template_df = load_bed_template(${_input[1]:ar}, "${phenotype_id_type}")
    bed_df = prepare_bed(df, bed_template_df)
    bed_df["ID"] = bed_df["gene_id"] + "_" + bed_df["UniProt"]
    bed_df = bed_df.drop_duplicates("ID", keep=False)[["#chr","start","end","ID"] + df.drop(["UniProt"],axis=1).columns.values.tolist()]
    qtl.io.write_bed(bed_df, ${_output[0]:r})
    bed_df[["#chr","start","end","ID"]].assign(path = ${_output[0]:r}).to_csv(${_output[1]:r},"\t",index = False) 

bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout
        for i in $[_output[0]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
        for i in $[_output[1]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

The following step annotates ATAC-seq peaks.
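
The peak_annotation input is expected to be a tab-delimited table with #chr, start, end, and an ID column whose values match the peak IDs in the phenotype file; a hypothetical example:

    #chr    start     end       ID
    chr1    100000    100500    peak_1
    chr1    200300    200800    peak_2
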
[annotate_coord_ATAC]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
parameter: peak_annotation = path
input: phenoFile, peak_annotation
output: f'{cwd:a}/{_input[0]:bn}.bed.gz', f'{cwd:a}/{_input[0]:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint

    import pandas as pd
    import qtl.io
    from pathlib import Path

    def prepare_bed(df, bed_template_df, chr_subset=None):
        bed_df = pd.merge(bed_template_df, df, left_index=True, right_index=True)
        bed_df = bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
        if chr_subset is not None:
            bed_df = bed_df[bed_df['#chr'].isin(chr_subset)]
        return bed_df

    def load_and_preprocess_data(input_path, drop_columns):
        df = pd.read_csv(input_path, sep="${sep}", skiprows=0)
        dc = [col for col in df.columns if col in drop_columns] # take the intersection of df.columns and drop_columns
        df = df.drop(dc, axis=1) # drop the intersecting columns
        if len(df.columns) < 2:
            raise ValueError("There are too few columns in the loaded dataframe, please check the delimiter of the input file. The default delimiter is tab")
        return df

    def rename_samples_using_lookup(df, lookup_path):
        sample_participant_lookup = Path(lookup_path)
        if sample_participant_lookup.is_file():
            sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=1, dtype={0:str,1:str})
            df.rename(columns=sample_participant_lookup_s.to_dict()["genotype_id"], inplace=True)
        return df

    def load_bed_template(input_path, phenotype_id_type):
        if sum(qtl.io.gtf_to_tss_bed(input_path, feature='gene', phenotype_id="gene_id").index.duplicated()) > 0:
            raise ValueError(f"GTF file {input_path} needs to be collapsed into a gene model by the reference data processing module")

        bed_template_df_id = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_id")
        bed_template_df_name = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_name")
        bed_template_df = bed_template_df_id.merge(bed_template_df_name, on=["chr", "start", "end"])
        bed_template_df.columns = ["#chr", "start", "end", "gene_id", "gene_name"]
        bed_template_df = bed_template_df.set_index(phenotype_id_type, drop=False)

        return bed_template_df

    df = load_and_preprocess_data(${_input[0]:ar}, ["#chr","chr", "start", "end","stop","annot.seqnames","annot.start","annot.end"]) # This function drops any columns in the drop_columns list, so keep the list comprehensive to accommodate different sources of inputs.
    df.set_index(df.columns[0], inplace=True)
    df = rename_samples_using_lookup(df, "${sample_participant_lookup:a}")
    bed_template_df = pd.read_csv("${_input[1]}",sep = "\t").set_index("ID")
    bed_template_df = bed_template_df.assign(ID = bed_template_df.index )
    bed_df = prepare_bed(df, bed_template_df)
    bed_df = bed_df.drop_duplicates("ID", keep=False)
    qtl.io.write_bed(bed_df, ${_output[0]:r})
    bed_df[["#chr","start","end","ID"]].assign(path = ${_output[0]:r}).to_csv(${_output[1]:r},"\t",index = False) 


bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout
        for i in $[_output[0]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
        for i in $[_output[1]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

Implementation using biomaRt#

This workflow adds chr and pos annotations (the TSS, where start = end - 1) and a gene_ID column to the bed file. This workflow is obsolete.

[annotate_coord_biomart]
parameter: ensembl_version=int
input: phenoFile
output: f'{cwd:a}/{_input:bn}.bed.gz',
        f'{cwd:a}/{_input:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
R:  expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout' ,container = container, entrypoint=entrypoint

    library(biomaRt)
    library(dplyr)
    library(readr)

    # Clear biomart cache
    biomartCacheClear()

    # Read gene expression data
    gene_exp <- read_delim("$[_input[0]]", delim = "\t")

    # If column "#chr" exists, remove the first 3 columns (chr, start, and end)
    if("#chr" %in% colnames(gene_exp)){
      gene_exp <- gene_exp[, 4:ncol(gene_exp)]
    }

    # Connect to Ensembl biomart
    ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")

    # Get gene annotations from Ensembl
    ensembl_df <- getBM(attributes = c("ensembl_gene_id", "chromosome_name", "start_position", "end_position"), mart = ensembl)

    # Filter and rename columns
    my_genes_ann <- ensembl_df %>%
      filter(ensembl_gene_id %in% gene_exp$gene_ID, chromosome_name %in% 1:23) %>%
      rename("#chr" = chromosome_name, 
            "start" = start_position, 
            "end" = end_position, 
            "gene_ID" = ensembl_gene_id) %>%
      filter(gene_ID != "NA")

    # Save the annotations
    my_genes_ann %>%
      select(`#chr`, start, end, gene_ID) %>%
      write_delim("$[_output[1]]", delim = "\t")

    # Merge the annotations with the gene expression data, after modifying 'end' to be 'start + 1'
    my_gene_bed <- my_genes_ann %>%
      mutate(end = start + 1) %>%
      inner_join(gene_exp, by = "gene_ID") %>%
      arrange(`#chr`, start)

    # Save the final merged data
    write_tsv(my_gene_bed, "$[_output[0]:n]", na = "NA", append = FALSE, col_names = TRUE)


bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container, entrypoint=entrypoint
    bgzip -f $[_output[0]:n]
    tabix -p bed $[_output[0]] -f

Annotation of leafcutter isoforms#

The following steps process the output files of leafcutter so that they are TensorQTL-ready. Shown below are three intermediate files:

Exon list

| chr  | start | end   | strand | gene_id         | gene_name   |
|------|-------|-------|--------|-----------------|-------------|
| chr1 | 29554 | 30039 | +      | ENSG00000243485 | MIR1302-2HG |
| chr1 | 30564 | 30667 | +      | ENSG00000243485 | MIR1302-2HG |
| chr1 | 30976 | 31097 | +      | ENSG00000243485 | MIR1302-2HG |
| chr1 | 35721 | 36081 | -      | ENSG00000237613 | FAM138A     |
| chr1 | 35277 | 35481 | -      | ENSG00000237613 | FAM138A     |
| chr1 | 34554 | 35174 | -      | ENSG00000237613 | FAM138A     |
| chr1 | 65419 | 65433 | +      | ENSG00000186092 | OR4F5       |
| chr1 | 65520 | 65573 | +      | ENSG00000186092 | OR4F5       |
| chr1 | 69037 | 71585 | +      | ENSG00000186092 | OR4F5       |

clusters_to_genes

| clu        | genes           |
|------------|-----------------|
| 1:clu_1_+  | ENSG00000116288 |
| 1:clu_10_+ | ENSG00000143774 |
| 1:clu_11_+ | ENSG00000143774 |
| 1:clu_12_+ | ENSG00000143774 |
| 1:clu_14_- | ENSG00000126709 |
| 1:clu_15_- | ENSG00000121753 |
| 1:clu_16_- | ENSG00000121753 |
| 1:clu_17_- | ENSG00000116560 |
| 1:clu_18_- | ENSG00000143549 |

phenotype_group

| X1                                              | X2              |
|-------------------------------------------------|-----------------|
| 7:102476270:102478811:clu_309_-:ENSG00000005075 | ENSG00000005075 |
| 7:102476270:102478808:clu_309_-:ENSG00000005075 | ENSG00000005075 |
| X:47572961:47574002:clu_349_-:ENSG00000008056   | ENSG00000008056 |
| X:47572999:47574002:clu_349_-:ENSG00000008056   | ENSG00000008056 |
| 8:27236905:27239971:clu_322_-:ENSG00000015592   | ENSG00000015592 |
| 8:27239279:27239971:clu_322_-:ENSG00000015592   | ENSG00000015592 |
| 8:27241262:27241677:clu_323_-:ENSG00000015592   | ENSG00000015592 |
| 8:27241262:27242397:clu_323_-:ENSG00000015592   | ENSG00000015592 |
| 8:27241757:27242397:clu_323_-:ENSG00000015592   | ENSG00000015592 |
| 1:35558223:35559107:clu_4_+:ENSG00000020129     | ENSG00000020129 |

The gtf used here should be the collapsed gtf, i.e. the final output of the reference_data gtf processing module and the same one used to call RNA-seq quantification.

[map_leafcutter_cluster_to_gene]
## Extract the code in case psichomics needs to be processed the same way
## The PhenoFile in this step is the intron_count file
parameter: intron_count = path
# Defines the mapping strategy options: 'site' or 'region', with 'site' as the default. 
# The 'site' strategy maps introns to the start and end of each exon. 
# The 'region' strategy, recommended for leafcutter2, maps each intron to a gene when more than overlap_ratio of the intron overlaps that gene's region.
parameter: map_stra = "site" 
# Define the overlap ratio as the proportion of the cluster length that intersects with a gene, used to determine mapping to the gene.
parameter: overlap_ratio = 0.8
input: intron_count, annotation_gtf
output: f'{cwd}/{_input[0]:b}.exon_list', f'{cwd}/{_input[0]:b}.leafcutter.clusters_to_genes.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    import pandas as pd
    import qtl.annotation

    gtf = ${_input[1]:r}
    #there is no "gene_type" when using the uncollapsed gtf. Replace "gene_biotype" with "gene_type" in the gtf and temporarily save for use in the annotation
    has_gene_type = True
    with open(gtf, 'r') as file:
        for line in file:
            if line[0] == '#':
                continue

            row = line.strip().split('\t')
            chrom = row[0]
            # source = row[1]
            annot_type = row[2]

            if annot_type == "gene":
                if "gene_type" not in line:
                    has_gene_type = False
                break

    if has_gene_type == False:
        
        with open(gtf, 'r') as file:
            file_contents = file.read()
        updated_contents = file_contents.replace("gene_biotype", "gene_type")
        gtf = ${_input[1]:rn}.tmp${_input[1]:rx}
        print(gtf)
        with open(gtf, 'w') as file:
            file.write(updated_contents)


    # Load data
    #annot = qtl.annotation.Annotation(${_input[1]:r})
    annot = qtl.annotation.Annotation(gtf)
    exon_df = pd.DataFrame([[g.chr, e.start_pos, e.end_pos, g.strand, g.id, g.name]
                        for g in annot.genes for e in g.transcripts[0].exons],
                       columns=['chr', 'start', 'end', 'strand', 'gene_id', 'gene_name'])
    exon_df.to_csv(${_output[0]:r}, sep='\t', index=False)
    
    
R:expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    suppressMessages(library(dplyr, quietly=TRUE))
    suppressMessages(library(stringr, quietly=TRUE))
    suppressMessages(library(foreach, quietly=TRUE))
    # leafcutter functions:
    
    #' Make a data.frame of meta data about the introns
    #' @param introns Names of the introns
    #' @return Data.frame with chr, start, end, cluster id
    #' @export
    get_intron_meta <- function(introns) {
      intron_meta <- do.call(rbind, strsplit(introns, ":"))  
      if (ncol(intron_meta) == 5) {
        colnames(intron_meta) <- c("chr", "start", "end", "clu", "category")
      } else {
        colnames(intron_meta) <- c("chr", "start", "end", "clu")
      }

      intron_meta <- as.data.frame(intron_meta, stringsAsFactors = FALSE)
      intron_meta$start <- as.numeric(intron_meta$start)
      intron_meta$end <- as.numeric(intron_meta$end)

      return(intron_meta)
    }
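    # e.g. get_intron_meta("chr1:100:200:clu_1_+") returns a one-row data.frame
    # with chr = "chr1", start = 100, end = 200, clu = "clu_1_+"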
    #' Work out which gene each cluster belongs to. Note the chromosome names used in the two inputs must match.
    #' @param intron_meta Data frame describing the introns, usually from get_intron_meta
    #' @param exons_table Table of exons, see e.g. /data/gencode19_exons.txt.gz
    #' @return Data.frame with cluster ids and genes separated by commas
    #' @import dplyr
    #' @export
    map_clusters_to_genes_site <- function(intron_meta, exons_table) {
      suppressMessages(library(foreach, quietly=TRUE))
      gene_df <- foreach (chr=sort(unique(intron_meta$chr)), .combine=rbind) %dopar% {
    
        intron_chr <- intron_meta[ intron_meta$chr==chr, ]
        exons_chr <- exons_table[exons_table$chr==chr, ]
    
        exons_chr$temp <- exons_chr$start
        intron_chr$temp <- intron_chr$end
        three_prime_matches <- inner_join( intron_chr, exons_chr, by="temp")
    
        exons_chr$temp <- exons_chr$end
        intron_chr$temp <- intron_chr$start
        five_prime_matches <- inner_join( intron_chr, exons_chr, by="temp")
    
        all_matches <- rbind(three_prime_matches, five_prime_matches)[ , c("clu", "gene_name")]
    
        all_matches <- all_matches[!duplicated(all_matches),]
    
        if (nrow(all_matches)==0) return(NULL)
        all_matches$clu <- paste(chr,all_matches$clu,sep=':')
        all_matches
      }
    
      clu_df <- gene_df %>% group_by(clu) %>% summarize(genes=paste(gene_name, collapse = ","))
      class(clu_df) <- "data.frame"
      clu_df
    }
    
    map_clusters_to_genes_region <- function(intron_meta, exons_table, f) {
        #combine the exon position to get gene region
        merged_df <- exons_table %>%
            group_by(chr, gene_id) %>%
            summarize(start = min(start), end = max(end), .groups = 'drop') %>% as.data.frame()%>%
            .[,c("chr","start","end","gene_id")]
        # use bedtools to map the intron region to gene region 
         options(bedtools.path = "/home/rf2872/software/bedtools2/bin/") #Fixme: build bedtools in leafcutter sif
                map_res <- bedtoolsr::bt.intersect(a = intron_meta, b = merged_df,f = f, wa = TRUE, wb=TRUE)
        #organize output of mapped file
        exon_map <- map_res %>% 
            mutate(clu = paste(V1, V4, sep=":"), 
                   genes = .[[paste0("V", 4 + ncol(intron_meta))]]) %>%
                select(clu,genes)
          return(exon_map)
    }

    cat("LeafCutter: mapping clusters to genes\n")
    intron_counts <- read.table(${_input[0]:r}, header=TRUE, check.names=FALSE, row.names=4)
    intron_meta <- get_intron_meta(rownames(intron_counts))
    exon_table <- read.table(${_output[0]:r}, header=TRUE, stringsAsFactors=FALSE)
    if(!str_detect(intron_meta$chr[1],"chr")) {
        exon_table = exon_table %>% mutate(chr = str_remove_all(chr,"chr"))
    } else if (!any(str_detect(exon_table$chr[1],"chr"))) {
        exon_table = exon_table %>% mutate(chr = paste0("chr",chr))
    }
    stopifnot(is.element('gene_id', colnames(exon_table)))
    exon_table[, 'gene_name'] <- exon_table[, 'gene_id']
    if("${map_stra}" == "site"){
      cat("mapping clusters to genes by splicing site\n")
      m <- map_clusters_to_genes_site(intron_meta, exon_table)
      } else if("${map_stra}" == "region"){
      cat("mapping clusters to genes by overlapping gene region\n")
      m <- map_clusters_to_genes_region(intron_meta, exon_table, f = ${overlap_ratio})
    } else {
      stop("Map strategy should only be 'site' or 'region'.")
    }
    write.table(m, ${_output[1]:r}, sep = "\t", quote=FALSE, row.names=FALSE)
  
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -v "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[annotate_leafcutter_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf,output_from("map_leafcutter_cluster_to_gene")
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    import pandas as pd
    import numpy as np
    import qtl.io
    import os
    from pathlib import Path

    # Load data
    if os.path.exists(${_input[1]:rn}.tmp${_input[1]:rx}):
        gtf = ${_input[1]:rn}.tmp${_input[1]:rx}
    else:
        gtf = ${_input[1]:r}
    #tss_df = qtl.io.gtf_to_tss_bed(${_input[1]:r})
    tss_df = qtl.io.gtf_to_tss_bed(gtf)
    bed_df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0)
    bed_df.columns.values[0] = "#chr" # Temporary
    sample_participant_lookup = Path("${sample_participant_lookup:a}")
    cluster2gene_dict = pd.read_csv(${_input[3]:r}, sep='\t', index_col=0).to_dict()
    cluster2gene_dict = cluster2gene_dict['genes']
    print('    ** assigning introns to gene mapping(s)')
    n = 0
    gene_bed_df = []
    group_s = {}
    for _,r in bed_df.iterrows():
        s = r['ID'].split(':')
        cluster_id = s[0]+':'+s[3]
        if cluster_id in cluster2gene_dict:
            gene_ids = cluster2gene_dict[cluster_id].split(',')
            for g in gene_ids:
                gi = r['ID']+':'+g
                gene_bed_df.append(tss_df.loc[g, ['chr', 'start', 'end']].tolist() + [gi] + r.iloc[4:].tolist())
                group_s[gi] = g
        else:
            n += 1
    if n > 0:
        print(f'    ** discarded {n} introns without a gene mapping')

    print('  * writing BED files for QTL mapping')
    gene_bed_df = pd.DataFrame(gene_bed_df, columns=bed_df.columns)
    # sort by TSS
    gene_bed_df = gene_bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
    # rename samples that carry a junc file name (take the part before the first '.')
    gene_bed_df.columns = list(gene_bed_df.columns[:4]) + [name.split('.')[0] if 'junc' in name else name for name in gene_bed_df.columns[4:]]
    # change sample IDs to participant IDs
    if sample_participant_lookup.is_file():
        sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str})
        #gene_bed_df.rename(columns=sample_participant_lookup_s.to_dict(), inplace=True)
        # Create a dictionary mapping from sample_id to participant_id
        column_mapping = dict(zip(sample_participant_lookup_s.index, sample_participant_lookup_s['participant_id']))
        # Get the column names to be replaced
        column_names = gene_bed_df.columns[4:]
        # Replace the column names using the mapping dictionary
        #new_column_names = [column_mapping.get(col, 'missing_data') for col in column_names]
        #it should overlap with genotype in downstream anyways
        new_column_names = [column_mapping.get(col, col) for col in column_names]
        gene_bed_df.rename(columns=dict(zip(column_names, new_column_names)), inplace=True)

    gene_bed_df = gene_bed_df.drop_duplicates()
    qtl.io.write_bed(gene_bed_df, ${_output[0]:r})
    gene_bed_df[['start', 'end']] = gene_bed_df[['start', 'end']].astype(np.int32)
    gene_bed_df[gene_bed_df.columns[4:]] = gene_bed_df[gene_bed_df.columns[4:]].astype(np.float32)
    group_s_df =  pd.Series(group_s).sort_values().reset_index()
    group_s_df.columns = ['ID', 'gene']
    group_s_df.to_csv(${_output[1]:r}, sep='\t', index=False, header=True)

bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout

        rm -f $[_input[1]:rn].tmp$[_input[1]:rx]

        for i in $[_output[0]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
        for i in $[_output[1]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

Processing of psichomics output#

It turns out that psichomics by default groups the isoforms by gene name, so the only thing that needs to be done is to extract this information and potentially rename the gene symbol to an ENSG ID.
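
The step below takes the last underscore-delimited field of each event ID as the gene_id and builds the phenotype group as gene_id plus the event type (the first field). A minimal sketch of that parsing, with a hypothetical event ID:

    # hypothetical psichomics event ID: event type first, ENSG ID last
    event_id = "SE_1_+_155238296_155238356_155238477_ENSG00000160752"
    parts = event_id.split("_")
    gene_id = parts[-1]                         # "ENSG00000160752"
    gene_type_id = parts[-1] + "_" + parts[0]   # "ENSG00000160752_SE"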

[annotate_psichomics_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    import pandas as pd
    import numpy as np
    import qtl.io
    from pathlib import Path
    tss_df = qtl.io.gtf_to_tss_bed(${_input[1]:r}, feature='gene',phenotype_id = "gene_id" )
    bed_df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0)
    bed_df["gene_id"]  = [x[-1] for x in bed_df.ID.str.split("_")]
    sample_participant_lookup = Path("${sample_participant_lookup:a}")
    if "start" in  bed_df.columns:
        bed_df = bed_df.drop(["#Chr","start","end"],axis = 1)
    output = tss_df.merge(bed_df, left_on = ["gene_id"], right_on = ["gene_id"],how = "right").sort_values(["chr","start"])
    # change sample IDs to participant IDs
    if sample_participant_lookup.is_file():
        sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str})

        column_mapping = dict(zip(sample_participant_lookup_s.index, sample_participant_lookup_s['participant_id']))

        column_names = output.columns[4:]
        print(column_names)
        new_column_names = [column_mapping.get(col, col) for col in column_names]
        output.rename(columns=dict(zip(column_names, new_column_names)), inplace=True)

        #output.rename(columns=sample_participant_lookup_s.to_dict(), inplace=True)

    # Old code grouping by each gene
    #bed_output = output.drop("gene_id" , axis = 1)
    #phenotype_group = output[["ID","gene_id"]]

    bed_output = output.drop("gene_id" , axis = 1)

    # R version 
    #output = output%>%mutate(gene_type_id = map_chr(str_split( ID ,"_"),~paste0(.x[length(.x)],"_",.x[1] )) )
    
    # corrected in python
    output = output.assign(gene_type_id = output['ID'].str.split('_').map(lambda x: x[-1] + '_' + x[0]))
    phenotype_group = output[["ID","gene_type_id"]]

    bed_output.to_csv(${_output[0]:nr}, sep="\t", index = False)
    phenotype_group.to_csv(${_output[1]:r}, sep="\t", index = False, header=False)

bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    bgzip -f ${_output[0]:n}
    tabix ${_output[0]}

bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout
        for i in $[_output[0]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
        for i in $[_output[1]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done