Gene Coordinate Annotation#
Description#
We use a gene coordinate annotation pipeline based on pyqtl, as demonstrated here. This adds genomic coordinate annotations to gene-level molecular phenotype files generated in gct format and converts them to bed format for downstream analysis.
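To show the gist of the approach, here is a minimal sketch of the core pyqtl calls this pipeline builds on; the file paths are hypothetical, and the expression matrix is assumed to be a tab-delimited table indexed by gene_id.
import pandas as pd
import qtl.io
# TSS coordinates for each gene, parsed from a collapsed GTF (hypothetical path)
tss_df = qtl.io.gtf_to_tss_bed("collapsed.gtf", feature="transcript", phenotype_id="gene_id")
tss_df.columns = ["#chr", "start", "end", "ID"]
# expression matrix: rows indexed by gene_id, columns are samples (hypothetical path)
expr = pd.read_csv("expression.tsv.gz", sep="\t", index_col=0)
# attach coordinates (genes absent from the GTF are dropped) and write the bed file
bed_df = tss_df.merge(expr, left_index=True, right_index=True)
qtl.io.write_bed(bed_df, "expression.bed.gz")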
Alternative implementation#
Previously we used the biomaRt package in R instead of code from pyqtl. The core function calls are:
ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
ensembl_df <- getBM(attributes=c("ensembl_gene_id","chromosome_name", "start_position", "end_position"),mart=ensembl)
We require the ENSEMBL version to be specified explicitly in this pipeline. As of 2021, for the Brain xQTL project, we use ENSEMBL version 103.
Input#
- Molecular phenotype data in gct format, with the first column being the ENSEMBL ID and the other columns being sample names.
- GTF for the collapsed gene model
  - The gene names must be consistent with the molecular phenotype data matrices (e.g. ENSG00000000003 vs. ENSG00000000003.1 will not work).
- (Optional) Meta-data to match sample names between the expression data and genotype files
  - Tab delimited with header
  - Only 2 columns: the first column is the sample name in the expression data, the second column is the sample name in the genotype data
  - Must contain all sample names in the expression matrices, even if they do not exist in the genotype data
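For illustration, a hypothetical meta-data (lookup) file; the IDs below are made up:
sample_id	participant_id
Sample_1	GENO_1
Sample_2	GENO_2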
Output#
Molecular phenotype data in bed format.
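For illustration, the first lines of a resulting bed file might look like this (the expression values are made up; start and end encode the TSS):
#chr	start	end	ID	sample_1	sample_2
chr1	29553	29554	ENSG00000243485	0.12	-0.33
chr1	65418	65419	ENSG00000186092	1.05	0.47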
Minimal Working Example Steps#
The MWE is uploaded to Synapse
i. Coordinate Annotation#
a. Gene Expression#
Timing: <1 min
!sos run gene_annotation.ipynb annotate_coord_gene \
--cwd ../../../output_test/phenotype_by_chrom \
--phenoFile ../../../mwe_data/xQTL_discovery/MWE.log2cpm.mol_phe.bed.gz \
--annotation-gtf ../../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--container oras://ghcr.io/cumc/rna_quantification_apptainer:latest --phenotype-id-type gene_name \
-c ../../csg.yml -q neurology
INFO: Running annotate_coord_gene:
INFO: tc987661f8bb5f0b1 submitted to neurology with job id Your job 3690363 ("job_tc987661f8bb5f0b1") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: annotate_coord_gene output: /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/MWE.log2cpm.mol_phe.bed.bed.gz /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/MWE.log2cpm.mol_phe.bed.region_list.txt
INFO: Workflow annotate_coord_gene (ID=w90bd1b9b7fb38fca) is executed successfully with 1 completed step and 1 completed task.
b. Proteins#
Timing: <1 min
!sos run gene_annotation.ipynb annotate_coord_protein \
--cwd ../../../output_test/phenotype_by_chrom \
--phenoFile ../../../mwe_data/protocol_data/input/xqtl_association/protocol_example.protein.csv \
--annotation-gtf ../../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--sample-participant-lookup ../../../mwe_data/protocol_data/output/protocol_example.protein.sample_overlap.txt \
--phenotype-id-type gene_name \
--container oras://ghcr.io/cumc/rna_quantification_apptainer:latest --sep "," \
-c ../../csg.yml -q neurology
INFO: Running annotate_coord_protein:
INFO: tbecf36ea3e901ea1 re-execute completed
INFO: tbecf36ea3e901ea1 submitted to neurology with job id Your job 3961996 ("job_tbecf36ea3e901ea1") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: annotate_coord_protein output: /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/protocol_example.protein.bed.gz /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/protocol_example.protein.region_list.txt
INFO: Workflow annotate_coord_protein (ID=w75d21004c6718efd) is executed successfully with 1 completed step and 1 completed task.
Troubleshooting#
| Step | Substep | Problem | Possible Reason | Solution |
|---|---|---|---|---|
Command Interface#
!sos run gene_annotation.ipynb -h
usage: sos run gene_annotation.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
region_list_generation
annotate_coord_gene
annotate_coord_protein
annotate_coord_biomart
map_leafcutter_cluster_to_gene
annotate_leafcutter_isoforms
annotate_psichomics_isoforms
Global Workflow Options:
--cwd output (as path)
Work directory & output directory
--annotation-gtf VAL (as path, required)
gene gtf annotation table
--phenoFile VAL (as path, required)
Molecular phenotype matrix
--phenotype-id-type 'gene_id'
Whether the input data is named by gene_id or gene_name.
The default is gene_id; if not, change it to gene_name.
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 5h
Wall clock time expected
--mem 16G
Memory expected
--numThreads 1 (as int)
Number of threads
--container ''
--sep '\t'
delimiter of phenoFile
--entrypoint ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
Sections
region_list_generation:
annotate_coord_gene:
Workflow Options:
--sample-participant-lookup . (as path)
A file to map sample ID from expression to genotype,
must contain two columns, sample_id and participant_id,
mapping IDs in the expression files to IDs in the
genotype (these can be the same).
annotate_coord_protein:
Workflow Options:
--sample-participant-lookup . (as path)
A file to map sample ID from expression to genotype,
must contain two columns, sample_id and participant_id,
mapping IDs in the expression files to IDs in the
genotype (these can be the same).
--protein-name-index . (as path)
--protein-ID-type SOMAseqID
annotate_coord_biomart:
Workflow Options:
--ensembl-version VAL (as int, required)
map_leafcutter_cluster_to_gene:
Workflow Options:
--intron-count VAL (as path, required)
Extract the code in case psichomics output needs to be
processed the same way. The phenoFile in this step is
the intron_count file.
annotate_leafcutter_isoforms:
Workflow Options:
--sample-participant-lookup . (as path)
annotate_psichomics_isoforms:
Workflow Options:
--sample-participant-lookup . (as path)
Setup and global parameters#
[global]
# Work directory & output directory
parameter: cwd = path("output")
# gene gtf annotation table
parameter: annotation_gtf = path
# Molecular phenotype matrix
parameter: phenoFile = path
# Whether the input data is named by gene_id or gene_name. The default is gene_id; if not, change it to gene_name
parameter: phenotype_id_type = 'gene_id'
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
parameter: container = ""
## delimiter of phenoFile
parameter: sep = "\t"
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
parameter: strip_id = False
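The entrypoint parameter derives a micromamba environment name from the container image name. As a quick check of what it evaluates to for the MWE container:
import re
container = "oras://ghcr.io/cumc/rna_quantification_apptainer:latest"
env = re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])
print(f'micromamba run -a "" -n {env}')  # micromamba run -a "" -n rna_quantification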
Region List generation#
Partitioning the data by genes requires a region list file which:
- has 5 columns: chr, start, end, gene_id, gene_name
- contains the same genes as, or fewer genes than, the bed file
Input:
- A gtf file used to generate the bed
- A phenotype bed file, which must have a gene_id column indicating the gene names.
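For illustration, a hypothetical region list; start and end are TSS positions derived from the GTF:
#chr	start	end	gene_id	gene_name
chr1	29553	29554	ENSG00000243485	MIR1302-2HG
chr1	65418	65419	ENSG00000186092	OR4F5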
[region_list_generation]
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bnn}.region_list'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
python: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
import pandas as pd
import qtl.io
# get the five column data
bed_template_df_id = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = "gene_id" )
bed_template_df_name = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = "gene_name" )
bed_template_df = bed_template_df_id.merge(bed_template_df_name, on = ["chr","start","end"])
bed_template_df.columns = ["#chr","start","end","gene_id","gene_name"]
pheno = pd.read_csv(${_input[0]:r}, sep = "\t")
# Retaining only the genes in the data
region_list = bed_template_df[bed_template_df.${phenotype_id_type}.isin(pheno.gene_id)]
region_list.to_csv("${_output}", sep = "\t",index = False)
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output:n].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_column:" `cat $i | head -1 | wc -w ` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
Implementation using pyqtl#
Implementation based on the GTEx pipeline. The following step annotates coordinates for the gene expression file.
[annotate_coord_gene]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.bed.gz', f'{cwd:a}/{_input[0]:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
import pandas as pd
import qtl.io
from pathlib import Path
def prepare_bed(df, bed_template_df, chr_subset=None):
bed_df = pd.merge(bed_template_df, df, left_index=True, right_index=True)
bed_df = bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
if chr_subset is not None:
bed_df = bed_df[bed_df.chr.isin(chr_subset)]
return bed_df
def load_and_preprocess_data(input_path, drop_columns):
df = pd.read_csv(input_path, sep="${sep}", skiprows=0)
dc = [col for col in df.columns if col in drop_columns] # Take the intersection between df.columns and drop_columns
df = df.drop(dc, axis=1) # drop the intersection
if len(df.columns) < 2:
raise ValueError("There are too few columns in the loaded dataframe, please check the delimiter of the input file. The default delimiter is tab")
return df
# rename sample columns from the first lookup column to the second (e.g. ProjID to SampleID)
def rename_samples_using_lookup(df, lookup_path):
if Path(lookup_path).suffix == '.csv':
sep = ','
elif Path(lookup_path).suffix == '.tsv':
sep = '\t'
else:
raise ValueError("Unsupported file format. Please use either .csv or .tsv files.")
if Path(lookup_path).is_file():
sample_participant_lookup = pd.read_csv(lookup_path, sep=sep, header=None, dtype={0: str, 1: str})
rename_dict = dict(zip(sample_participant_lookup[0], sample_participant_lookup[1]))
df.rename(columns=rename_dict, inplace=True)
return df
def load_bed_template(input_path, phenotype_id_type):
if sum(qtl.io.gtf_to_tss_bed(input_path, feature='gene', phenotype_id = "gene_id").index.duplicated()) > 0:
raise ValueError(f"GTF file {input_path} needs to be collapsed into gene model by the reference data processing module")
bed_template_df_id = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_id")
bed_template_df_name = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_name")
bed_template_df = bed_template_df_id.merge(bed_template_df_name, on=["chr", "start", "end"])
bed_template_df.columns = ["#chr", "start", "end", "gene_id", "gene_name"]
bed_template_df = bed_template_df.set_index(phenotype_id_type, drop=False)
return bed_template_df
df = load_and_preprocess_data(${_input[0]:ar}, ["#chr","chr", "start", "end","stop","annot.seqnames","annot.start","annot.end"]) # This function drops any columns in the DF that appear in the drop_columns list, so it is good to make the list comprehensive to accommodate different sources of inputs.
df.set_index(df.columns[0], inplace=True)
df = rename_samples_using_lookup(df, "${sample_participant_lookup:a}")
if ${strip_id}:
# Update the index by stripping the pattern "A | B" to "B"
# and removing any leading or trailing spaces around "B"
df.index = df.index.map(lambda x: x.split('|')[-1].strip() if '|' in x else x)
bed_template_df = load_bed_template(${_input[1]:ar}, "${phenotype_id_type}")
bed_df = prepare_bed(df, bed_template_df).drop("gene_name", axis=1)
bed_df = bed_df.drop_duplicates("gene_id", keep=False)
bed_df = bed_df.rename(columns={'gene_id': 'ID'})
qtl.io.write_bed(bed_df, ${_output[0]:r})
bed_df[["#chr","start","end","ID"]].assign(path = ${_output[0]:r}).to_csv(${_output[1]:r},"\t",index = False)
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
for i in $[_output[0]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
for i in $[_output[1]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
The following step annotates coordinates for proteomics data.
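When no protein name index is supplied, the first column of the phenotype file is expected to encode the gene symbol and UniProt ID separated by |. A minimal sketch with a hypothetical identifier, mirroring the split performed in the step below:
protein_id = "TSPAN6|O43657"  # hypothetical "gene_name|UniProt" identifier
gene_name, uniprot = protein_id.split("|", 1)  # -> "TSPAN6", "O43657"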
[annotate_coord_protein]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
parameter: protein_name_index = path()
parameter: protein_ID_type = "SOMAseqID"
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.bed.gz', f'{cwd:a}/{_input[0]:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
import pandas as pd
import qtl.io
from pathlib import Path
def prepare_bed(df, bed_template_df, chr_subset=None):
bed_df = pd.merge(bed_template_df, df, left_index=True, right_index=True)
bed_df = bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
if chr_subset is not None:
bed_df = bed_df[bed_df.chr.isin(chr_subset)]
return bed_df
def load_and_preprocess_data(input_path, drop_columns):
df = pd.read_csv(input_path, sep="${sep}", skiprows=0)
dc = [col for col in df.columns if col in drop_columns] # Take the intersection between df.columns and drop_columns
df = df.drop(dc, axis=1) # drop the intersection
if len(df.columns) < 2:
raise ValueError("There are too few columns in the loaded dataframe, please check the delimiter of the input file. The default delimiter is tab")
return df
def rename_samples_using_lookup(df, lookup_path):
sample_participant_lookup = Path(lookup_path)
if sample_participant_lookup.is_file():
sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=1, dtype={0:str,1:str})
df.rename(columns=sample_participant_lookup_s.to_dict()["genotype_id"], inplace=True)
return df
def load_bed_template(input_path, phenotype_id_type):
if sum(qtl.io.gtf_to_tss_bed(input_path, feature='gene',phenotype_id = "gene_id").index.duplicated()) > 0:
raise ValueError(f"GTF file {input_path} needs to be collapsed into gene model by the reference data processing module")
bed_template_df_id = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_id")
bed_template_df_name = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_name")
bed_template_df = bed_template_df_id.merge(bed_template_df_name, on=["chr", "start", "end"])
bed_template_df.columns = ["#chr", "start", "end", "gene_id", "gene_name"]
bed_template_df = bed_template_df.set_index(phenotype_id_type, drop=False)
return bed_template_df
df = load_and_preprocess_data(${_input[0]:ar}, ["#chr","chr", "start", "end","stop","annot.seqnames","annot.start","annot.end"]) # This function drops any columns in the DF that appear in the drop_columns list, so it is good to make the list comprehensive to accommodate different sources of inputs.
protein_ID = df.columns.values[0]
protein_name_index = Path("${protein_name_index:a}")
if protein_name_index.is_file():
df_info = pd.read_csv(protein_name_index).rename(columns={'${protein_ID_type}': protein_ID, 'EntrezGeneSymbol':'gene_name'})[['gene_name',protein_ID,'UniProt']]
df = df_info.merge(df, on=protein_ID).drop(protein_ID,axis=1)
else:
df[[protein_ID, 'UniProt']] = df[protein_ID].astype(str).str.split('|', n=1, expand=True)
df.set_index(df.columns[0], inplace=True)
df = rename_samples_using_lookup(df, "${sample_participant_lookup:a}")
bed_template_df = load_bed_template(${_input[1]:ar}, "${phenotype_id_type}")
bed_df = prepare_bed(df, bed_template_df)
bed_df["ID"] = bed_df["gene_id"] + "_" + bed_df["UniProt"]
bed_df = bed_df.drop_duplicates("ID", keep=False)[["#chr","start","end","ID"] + df.drop(["UniProt"],axis=1).columns.values.tolist()]
qtl.io.write_bed(bed_df, ${_output[0]:r})
bed_df[["#chr","start","end","ID"]].assign(path = ${_output[0]:r}).to_csv(${_output[1]:r},"\t",index = False)
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
for i in $[_output[0]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
for i in $[_output[1]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
The following step annotates ATAC-seq peaks.
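The peak annotation file is expected to be tab-delimited with #chr, start, end and an ID column whose values match the first column of the phenotype file. A hypothetical example (peak coordinates and IDs are made up):
#chr	start	end	ID
chr1	10000	10500	chr1:10000:10500
chr1	24000	24800	chr1:24000:24800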
[annotate_coord_ATAC]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
parameter: peak_annotation = path
input: phenoFile, peak_annotation
output: f'{cwd:a}/{_input[0]:bn}.bed.gz', f'{cwd:a}/{_input[0]:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
import pandas as pd
import qtl.io
from pathlib import Path
def prepare_bed(df, bed_template_df, chr_subset=None):
bed_df = pd.merge(bed_template_df, df, left_index=True, right_index=True)
bed_df = bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
if chr_subset is not None:
bed_df = bed_df[bed_df.chr.isin(chr_subset)]
return bed_df
def load_and_preprocess_data(input_path, drop_columns):
df = pd.read_csv(input_path, sep="${sep}", skiprows=0)
dc = [col for col in df.columns if col in drop_columns] # Take the intersection between df.columns and drop_columns
df = df.drop(dc, axis=1) # drop the intersection
if len(df.columns) < 2:
raise ValueError("There are too few columns in the loaded dataframe, please check the delimiter of the input file. The default delimiter is tab")
return df
def rename_samples_using_lookup(df, lookup_path):
sample_participant_lookup = Path(lookup_path)
if sample_participant_lookup.is_file():
sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=1, dtype={0:str,1:str})
df.rename(columns=sample_participant_lookup_s.to_dict()["genotype_id"], inplace=True)
return df
def load_bed_template(input_path, phenotype_id_type):
if sum(qtl.io.gtf_to_tss_bed(input_path, feature='gene',phenotype_id = "gene_id").index.duplicated()) > 0:
raise ValueError(f"GTF file {input_path} needs to be collapsed into gene model by the reference data processing module")
bed_template_df_id = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_id")
bed_template_df_name = qtl.io.gtf_to_tss_bed(input_path, feature='transcript', phenotype_id="gene_name")
bed_template_df = bed_template_df_id.merge(bed_template_df_name, on=["chr", "start", "end"])
bed_template_df.columns = ["#chr", "start", "end", "gene_id", "gene_name"]
bed_template_df = bed_template_df.set_index(phenotype_id_type, drop=False)
return bed_template_df
df = load_and_preprocess_data(${_input[0]:ar}, ["#chr","chr", "start", "end","stop","annot.seqnames","annot.start","annot.end"]) # This function drops any columns in the DF that appear in the drop_columns list, so it is good to make the list comprehensive to accommodate different sources of inputs.
df.set_index(df.columns[0], inplace=True)
df = rename_samples_using_lookup(df, "${sample_participant_lookup:a}")
bed_template_df = pd.read_csv("${_input[1]}",sep = "\t").set_index("ID")
bed_template_df = bed_template_df.assign(ID = bed_template_df.index )
bed_df = prepare_bed(df, bed_template_df)
bed_df = bed_df.drop_duplicates("ID", keep=False)
qtl.io.write_bed(bed_df, ${_output[0]:r})
bed_df[["#chr","start","end","ID"]].assign(path = ${_output[0]:r}).to_csv(${_output[1]:r},"\t",index = False)
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
for i in $[_output[0]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
for i in $[_output[1]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
Implementation using biomaRt#
This workflow adds chr and pos annotations (the TSS, where start = end - 1) and gene_ID to the bed file. This workflow is obsolete.
[annotate_coord_biomart]
parameter: ensembl_version=int
input: phenoFile
output: f'{cwd:a}/{_input:bn}.bed.gz',
f'{cwd:a}/{_input:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout' ,container = container, entrypoint=entrypoint
library(biomaRt)
library(dplyr)
library(readr)
# Clear biomart cache
biomartCacheClear()
# Read gene expression data
gene_exp <- read_delim("$[_input[0]]", delim = "\t")
# If column "#chr" exists, remove the first 3 columns (chr, start, and end)
if("#chr" %in% colnames(gene_exp)){
gene_exp <- gene_exp[, 4:ncol(gene_exp)]
}
# Connect to Ensembl biomart
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
# Get gene annotations from Ensembl
ensembl_df <- getBM(attributes = c("ensembl_gene_id", "chromosome_name", "start_position", "end_position"), mart = ensembl)
# Filter and rename columns
my_genes_ann <- ensembl_df %>%
filter(ensembl_gene_id %in% gene_exp$gene_ID, chromosome_name %in% 1:23) %>%
rename("#chr" = chromosome_name,
"start" = start_position,
"end" = end_position,
"gene_ID" = ensembl_gene_id) %>%
filter(gene_ID != "NA")
# Save the annotations
my_genes_ann %>%
select(`#chr`, start, end, gene_ID) %>%
write_delim(path = "$[_output[1]]", "\t")
# Merge the annotations with the gene expression data, after modifying 'end' to be 'start + 1'
my_gene_bed <- my_genes_ann %>%
mutate(end = start + 1) %>%
inner_join(gene_exp, by = "gene_ID") %>%
arrange(`#chr`, start)
# Save the final merged data
write_tsv(my_gene_bed, path = "$[_output[0]:n]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")
bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container, entrypoint=entrypoint
bgzip -f $[_output[0]:n]
tabix -p bed $[_output[0]] -f
Annotation of leafcutter isoforms#
The following steps process the output files of leafcutter so that they are TensorQTL-ready. Shown below are three intermediate files:
Exon list
| chr | start | end | strand | gene_id | gene_name |
|---|---|---|---|---|---|
| chr1 | 29554 | 30039 | + | ENSG00000243485 | MIR1302-2HG |
| chr1 | 30564 | 30667 | + | ENSG00000243485 | MIR1302-2HG |
| chr1 | 30976 | 31097 | + | ENSG00000243485 | MIR1302-2HG |
| chr1 | 35721 | 36081 | - | ENSG00000237613 | FAM138A |
| chr1 | 35277 | 35481 | - | ENSG00000237613 | FAM138A |
| chr1 | 34554 | 35174 | - | ENSG00000237613 | FAM138A |
| chr1 | 65419 | 65433 | + | ENSG00000186092 | OR4F5 |
| chr1 | 65520 | 65573 | + | ENSG00000186092 | OR4F5 |
| chr1 | 69037 | 71585 | + | ENSG00000186092 | OR4F5 |
clusters_to_genes
| clu | genes |
|---|---|
| 1:clu_1_+ | ENSG00000116288 |
| 1:clu_10_+ | ENSG00000143774 |
| 1:clu_11_+ | ENSG00000143774 |
| 1:clu_12_+ | ENSG00000143774 |
| 1:clu_14_- | ENSG00000126709 |
| 1:clu_15_- | ENSG00000121753 |
| 1:clu_16_- | ENSG00000121753 |
| 1:clu_17_- | ENSG00000116560 |
| 1:clu_18_- | ENSG00000143549 |
phenotype_group
| X1 | X2 |
|---|---|
| 7:102476270:102478811:clu_309_-:ENSG00000005075 | ENSG00000005075 |
| 7:102476270:102478808:clu_309_-:ENSG00000005075 | ENSG00000005075 |
| X:47572961:47574002:clu_349_-:ENSG00000008056 | ENSG00000008056 |
| X:47572999:47574002:clu_349_-:ENSG00000008056 | ENSG00000008056 |
| 8:27236905:27239971:clu_322_-:ENSG00000015592 | ENSG00000015592 |
| 8:27239279:27239971:clu_322_-:ENSG00000015592 | ENSG00000015592 |
| 8:27241262:27241677:clu_323_-:ENSG00000015592 | ENSG00000015592 |
| 8:27241262:27242397:clu_323_-:ENSG00000015592 | ENSG00000015592 |
| 8:27241757:27242397:clu_323_-:ENSG00000015592 | ENSG00000015592 |
| 1:35558223:35559107:clu_4_+:ENSG00000020129 | ENSG00000020129 |
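The IDs above encode chr:start:end:cluster (plus the gene once annotated); the annotation step recovers the cluster ID to look it up in clusters_to_genes. A minimal sketch:
# mirrors the ID parsing performed in annotate_leafcutter_isoforms below
phenotype_id = "7:102476270:102478811:clu_309_-"
s = phenotype_id.split(":")
cluster_id = s[0] + ":" + s[3]  # -> "7:clu_309_-", the key in clusters_to_genes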
The gtf used here should be the collapsed gtf, i.e. the final output of the reference_data gtf processing module and the one used to call RNA-seq quantification.
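The workflow below supports two mapping strategies via map_stra: 'site' matches intron boundaries to exon boundaries, while 'region' requires that more than overlap_ratio of the intron overlaps a gene's region. A minimal sketch of the 'region' criterion with made-up coordinates:
# 'region' criterion with made-up intron and gene coordinates
intron = ("chr1", 35558223, 35559107)  # from an ID like "1:35558223:35559107:clu_4_+"
gene = ("chr1", 35000000, 36000000)    # min(start)..max(end) over the gene's exons
overlap = max(0, min(intron[2], gene[2]) - max(intron[1], gene[1]))
maps_to_gene = overlap / (intron[2] - intron[1]) > 0.8  # overlap_ratio default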
[map_leafcutter_cluster_to_gene]
## Extract the code in case psichomics output needs to be processed the same way
## The phenoFile in this step is the intron_count file
parameter: intron_count = path
# Defines the mapping strategy options: 'site' or 'region', with 'site' as the default.
# The 'site' strategy maps introns to the start and end of each exon.
# The 'region' strategy, recommended for leafcutter2, maps an intron to a gene when more than overlap_ratio of the intron overlaps that gene's region.
parameter: map_stra = "site"
# Define the overlap ratio as the proportion of the cluster length that intersects with a gene, used to determine mapping to the gene.
parameter: overlap_ratio = 0.8
input: intron_count, annotation_gtf
output: f'{cwd}/{_input[0]:b}.exon_list', f'{cwd}/{_input[0]:b}.leafcutter.clusters_to_genes.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
import pandas as pd
import qtl.annotation
gtf = ${_input[1]:r}
#there is no "gene_type" when using the uncollapsed gtf. Replace "gene_biotype" with "gene_type" in the gtf and temporarily save for use in the annotation
has_gene_type = True
with open(gtf, 'r') as file:
for line in file:
if line[0] == '#':
continue
row = line.strip().split('\t')
chrom = row[0]
# source = row[1]
annot_type = row[2]
if annot_type == "gene":
if "gene_type" not in line:
has_gene_type = False
break
if has_gene_type == False:
with open(gtf, 'r') as file:
file_contents = file.read()
updated_contents = file_contents.replace("gene_biotype", "gene_type")
gtf = ${_input[1]:rn}.tmp${_input[1]:rx}
print(gtf)
with open(gtf, 'w') as file:
file.write(updated_contents)
# Load data
#annot = qtl.annotation.Annotation(${_input[1]:r})
annot = qtl.annotation.Annotation(gtf)
exon_df = pd.DataFrame([[g.chr, e.start_pos, e.end_pos, g.strand, g.id, g.name]
for g in annot.genes for e in g.transcripts[0].exons],
columns=['chr', 'start', 'end', 'strand', 'gene_id', 'gene_name'])
exon_df.to_csv(${_output[0]:r}, sep='\t', index=False)
R: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
suppressMessages(library(dplyr, quietly=TRUE))
suppressMessages(library(stringr, quietly=TRUE))
suppressMessages(library(foreach, quietly=TRUE))
# leafcutter functions:
#' Make a data.frame of meta data about the introns
#' @param introns Names of the introns
#' @return Data.frame with chr, start, end, cluster id
#' @export
get_intron_meta <- function(introns) {
intron_meta <- do.call(rbind, strsplit(introns, ":"))
if (ncol(intron_meta) == 5) {
colnames(intron_meta) <- c("chr", "start", "end", "clu", "category")
} else {
colnames(intron_meta) <- c("chr", "start", "end", "clu")
}
intron_meta <- as.data.frame(intron_meta, stringsAsFactors = FALSE)
intron_meta$start <- as.numeric(intron_meta$start)
intron_meta$end <- as.numeric(intron_meta$end)
return(intron_meta)
}
#' Work out which gene each cluster belongs to. Note the chromosome names used in the two inputs must match.
#' @param intron_meta Data frame describing the introns, usually from get_intron_meta
#' @param exons_table Table of exons, see e.g. /data/gencode19_exons.txt.gz
#' @return Data.frame with cluster ids and genes separated by commas
#' @import dplyr
#' @export
map_clusters_to_genes_site <- function(intron_meta, exons_table) {
suppressMessages(library(foreach, quietly=TRUE))
gene_df <- foreach (chr=sort(unique(intron_meta$chr)), .combine=rbind) %dopar% {
intron_chr <- intron_meta[ intron_meta$chr==chr, ]
exons_chr <- exons_table[exons_table$chr==chr, ]
exons_chr$temp <- exons_chr$start
intron_chr$temp <- intron_chr$end
three_prime_matches <- inner_join( intron_chr, exons_chr, by="temp")
exons_chr$temp <- exons_chr$end
intron_chr$temp <- intron_chr$start
five_prime_matches <- inner_join( intron_chr, exons_chr, by="temp")
all_matches <- rbind(three_prime_matches, five_prime_matches)[ , c("clu", "gene_name")]
all_matches <- all_matches[!duplicated(all_matches),]
if (nrow(all_matches)==0) return(NULL)
all_matches$clu <- paste(chr,all_matches$clu,sep=':')
all_matches
}
clu_df <- gene_df %>% group_by(clu) %>% summarize(genes=paste(gene_name, collapse = ","))
class(clu_df) <- "data.frame"
clu_df
}
map_clusters_to_genes_region <- function(intron_meta, exons_table, f) {
#combine the exon position to get gene region
merged_df <- exons_table %>%
group_by(chr, gene_id) %>%
summarize(start = min(start), end = max(end), .groups = 'drop') %>% as.data.frame()%>%
.[,c("chr","start","end","gene_id")]
# use bedtools to map the intron region to gene region
options(bedtools.path = "/home/rf2872/software/bedtools2/bin/") #Fixme: build bedtools in leafcutter sif
map_res <- bedtoolsr::bt.intersect(a = intron_meta, b = merged_df,f = f, wa = TRUE, wb=TRUE)
#organize output of mapped file
exon_map <- map_res %>%
mutate(clu = paste(V1, V4, sep=":"),
genes = .[[paste0("V", 4 + ncol(intron_meta))]]) %>%
select(clu,genes)
return(exon_map)
}
cat("LeafCutter: mapping clusters to genes\n")
intron_counts <- read.table(${_input[0]:r}, header=TRUE, check.names=FALSE, row.names=4)
intron_meta <- get_intron_meta(rownames(intron_counts))
exon_table <- read.table(${_output[0]:r}, header=TRUE, stringsAsFactors=FALSE)
if(!str_detect(intron_meta$chr[1],"chr")) {
exon_table = exon_table %>% mutate(chr = str_remove_all(chr,"chr"))
} else if (!any(str_detect(exon_table$chr[1],"chr"))) {
exon_table = exon_table %>% mutate(chr = paste0("chr",chr))
} else {exon_table = exon_table}
stopifnot(is.element('gene_id', colnames(exon_table)))
exon_table[, 'gene_name'] <- exon_table[, 'gene_id']
if("${map_stra}" == "site"){
cat("mapping clusters to genes by splicing site\n")
m <- map_clusters_to_genes_site(intron_meta, exon_table)
} else if("${map_stra}" == "region"){
cat("mapping clusters to genes by overlapping gene region\n")
m <- map_clusters_to_genes_region(intron_meta, exon_table, f = ${overlap_ratio})
} else {
stop("Map strategy should only be 'site' or 'region'.")
}
write.table(m, ${_output[1]:r}, sep = "\t", quote=FALSE, row.names=FALSE)
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
[annotate_leafcutter_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf,output_from("map_leafcutter_cluster_to_gene")
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
import pandas as pd
import numpy as np
import qtl.io
import os
from pathlib import Path
# Load data
if os.path.exists(${_input[1]:rn}.tmp${_input[1]:rx}):
gtf = ${_input[1]:rn}.tmp${_input[1]:rx}
else:
gtf = ${_input[1]:r}
#tss_df = qtl.io.gtf_to_tss_bed(${_input[1]:r})
tss_df = qtl.io.gtf_to_tss_bed(gtf)
bed_df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0)
bed_df.columns.values[0] = "#chr" # Temporary
sample_participant_lookup = Path("${sample_participant_lookup:a}")
cluster2gene_dict = pd.read_csv(${_input[3]:r}, sep='\t', index_col=0).to_dict()
cluster2gene_dict = cluster2gene_dict['genes']
print(' ** assigning introns to gene mapping(s)')
n = 0
gene_bed_df = []
group_s = {}
for _,r in bed_df.iterrows():
s = r['ID'].split(':')
cluster_id = s[0]+':'+s[3]
if cluster_id in cluster2gene_dict:
gene_ids = cluster2gene_dict[cluster_id].split(',')
for g in gene_ids:
gi = r['ID']+':'+g
gene_bed_df.append(tss_df.loc[g, ['chr', 'start', 'end']].tolist() + [gi] + r.iloc[4:].tolist())
group_s[gi] = g
else:
n += 1
if n > 0:
print(f' ** discarded {n} introns without a gene mapping')
print(' * writing BED files for QTL mapping')
gene_bed_df = pd.DataFrame(gene_bed_df, columns=bed_df.columns)
# sort by TSS
gene_bed_df = gene_bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
#rename the samples if they are named by file name (take the part before the first '.')
gene_bed_df.columns = list(gene_bed_df.columns[:4]) + [name.split('.')[0] if 'junc' in name else name for name in gene_bed_df.columns[4:]]
# change sample IDs to participant IDs
if sample_participant_lookup.is_file():
sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str})
#gene_bed_df.rename(columns=sample_participant_lookup_s.to_dict(), inplace=True)
# Create a dictionary mapping from sample_id to participant_id
column_mapping = dict(zip(sample_participant_lookup_s.index, sample_participant_lookup_s['participant_id']))
# Get the column names to be replaced
column_names = gene_bed_df.columns[4:]
# Replace the column names using the mapping dictionary
#new_column_names = [column_mapping.get(col, 'missing_data') for col in column_names]
#it should overlap with genotype in downstream anyways
new_column_names = [column_mapping.get(col, col) for col in column_names]
gene_bed_df.rename(columns=dict(zip(column_names, new_column_names)), inplace=True)
gene_bed_df = gene_bed_df.drop_duplicates()
qtl.io.write_bed(gene_bed_df, ${_output[0]:r})
gene_bed_df[['start', 'end']] = gene_bed_df[['start', 'end']].astype(np.int32)
gene_bed_df[gene_bed_df.columns[4:]] = gene_bed_df[gene_bed_df.columns[4:]].astype(np.float32)
group_s_df = pd.Series(group_s).sort_values().reset_index()
group_s_df.columns = ['ID', 'gene']
group_s_df.to_csv(${_output[1]:r}, sep='\t', index=False, header=True)
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
rm -f $[_input[1]:rn].tmp$[_input[1]:rx]
for i in $[_output[0]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
for i in $[_output[1]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
Processing of psichomics output#
It turns out that psichomics by default groups the isoforms by gene name, so the only thing that needs to be done is to extract this information and potentially rename the gene symbol to an ENSG ID.
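The gene ID is taken from the last _-separated field of each psichomics event ID, and the phenotype group key is built as gene_eventType. A minimal sketch with a hypothetical event ID:
event_id = "SE_1_+_201868_201999_202024_ENSG00000000003"  # hypothetical psichomics ID
gene_id = event_id.split("_")[-1]                      # -> "ENSG00000000003"
gene_type_id = gene_id + "_" + event_id.split("_")[0]  # -> "ENSG00000000003_SE"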
[annotate_psichomics_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
import pandas as pd
import numpy as np
import qtl.io
from pathlib import Path
tss_df = qtl.io.gtf_to_tss_bed(${_input[1]:r}, feature='gene',phenotype_id = "gene_id" )
bed_df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0)
bed_df["gene_id"] = [x[-1] for x in bed_df.ID.str.split("_")]
sample_participant_lookup = Path("${sample_participant_lookup:a}")
if "start" in bed_df.columns:
bed_df = bed_df.drop(["#Chr","start","end"],axis = 1)
output = tss_df.merge(bed_df, left_on = ["gene_id"], right_on = ["gene_id"],how = "right").sort_values(["chr","start"])
# change sample IDs to participant IDs
if sample_participant_lookup.is_file():
sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str})
column_mapping = dict(zip(sample_participant_lookup_s.index, sample_participant_lookup_s['participant_id']))
column_names = output.columns[4:]
print(column_names)
new_column_names = [column_mapping.get(col, col) for col in column_names]
output.rename(columns=dict(zip(column_names, new_column_names)), inplace=True)
#output.rename(columns=sample_participant_lookup_s.to_dict(), inplace=True)
# Old code grouping by each gene
#bed_output = output.drop("gene_id" , axis = 1)
#phenotype_group = output[["ID","gene_id"]]
bed_output = output.drop("gene_id" , axis = 1)
# R version
#output = output%>%mutate(gene_type_id = map_chr(str_split( ID ,"_"),~paste0(.x[length(.x)],"_",.x[1] )) )
# corrected in python
output = output.assign(gene_type_id = output['ID'].str.split('_').map(lambda x: x[-1] + '_' + x[0]))
phenotype_group = output[["ID","gene_type_id"]]
bed_output.to_csv(${_output[0]:nr}, sep="\t", index=False)
phenotype_group.to_csv(${_output[1]:r}, sep="\t", index=False, header=False)
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
bgzip -f ${_output[0]:n}
tabix ${_output[0]}
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
for i in $[_output[0]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
for i in $[_output[1]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done