Genotype VCF File Quality Control

Genotype VCF File Quality Control#

This implements some recommendations from UK Biobank on sequence data quality control.

Description#

A major challenge in biomedical research is the quality control (QC) of sequencing data. False positive variant calls can hinder the ability to detect disease associated variants or introduce spurious associations, therefore the need for a rigorous QC. Our pipeline focuses on QC after the variant calling stage and requires project Variant Calling Format (pVCF) as input files. We have defined default thresholds for genotype and variant-level hard filtering based on recommendations from the UK Biobank team and a thorough review of the literature [cf. Carson et al. BMC Bioinformatics (2014),cf. Lek et al. Nature (2016),cf. Szustakowski et al. Nature Genetics (2021)]. Bcftools is used in our QC steps. We first handle multi-allelic sites by splitting them into bi-allelic records. We include an optional workflow to keep only bi-allelic sites in the data. Variants are then annotated based on dbSNP data. Genotypes are kept if they have a Genotype Depth (DP) >= 10 and a Genotype Quality (GQ) >= 20. Variants are included if at least one sample has an allelic balance (AB) >= 0.15 for Single Nucleotide Variants (SNVs) and AB>=0.2 for indels, variant missigness is below 20% and Hardy-Weinberg Equilibrium p-value is > 1e-08. Allele balance is calculated for heterozygotes as the number of bases supporting the least-represented allele over the total number of base observations. Output summary statistics, such as transistion/transversion ratios (TS/TV ratio) are calculated to determine the effectiveness of QC.

Default Parameters: VCF QC Filters#

Genotype depth filters: For WES data, UK Biobank recommends SNPs DP>10 and Indels DP>10 for indels. However we think for WGS we can be less stringent, or simply rely on GQ. Users can set it to 1 eg, --DP 1 --DP-indel 1
Genotype quality GQ>20.
At least one sample per site passed the allele balance threshold >= 0.15 for SNPs and >=0.20 for indels (heterozygous variants).
- Allele balance is calculated for heterozygotes as the number of bases supporting the least-represented allele over the total number of base observations.
This module also allows for filtering by HWE and missingness although by default we don’t filter on them.

Filtering are done with bcftools. Here is a useful cheatsheet from github user @elowy01.

A note on TS/TV summary from VCF genotype data#

bcftools stats command provides useful summary statistics including TS/TV ratio, which is routinely used as a quality measure of variant calls. With dbSNP based annotation of novel and known variants, bcftools can compute TS/TV for novel and known variants at variant level, and at sample level. It should be noted that variant level TS/TV does not take sample genotype into consideration – it simply counts the TS and TV event for observed SNPs in the data. Other tools, such as snpsift, implements variant level TS/TV by counting TS and TV events in sample genotypes and compute the ratio after summing up TS and TV across all samples. See here some discussions on this issue. We provide these TS/TV calculations before and after QC but users should be aware of the difference when interpreting the results.

Default Parameters: Resource Usage#

Memory: Usually a whole genome VCF.gz file has the size of 200+GB, after testing, a minimum of 60GB of mem is requried.
Walltimes: For every hour qc_1 or qc_2 can process ~14G of data. The default is set to be 24h, corresponding to ~300GB of input. Please set the –walltime parameter according to the size of your input files.

Input#

The target vcf.gz file
- If its chromosome name does not have the chr prefix and you need it to match with reference fasta file, please run rename_chrs workflow to add chr.
- The vcf.gz file needs to be compressed by bgzip, instead of simple gzip
- It should have a index file accompanying it. The index file can be generated by tabix
- It must be a valid vcf.gz file that can pass bcftools sanity check: i.e. all tags are defined properly
- It must contains following fields:
  1. ##FORMAT=<ID=DP,Number=1,Type=Integer,Description=”Approximate read depth (reads with MQ=255 or with bad mates are filtered)”>
  2. ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=”Genotype Quality”>
  3. ##FORMAT=<ID=AB,Number=1,Type=Float,Description=”Allele balance for each het genotype”>
dbSNP database in VCF format
A reference sequence fasta file

Output#

QC-ed genotype data in VCF format. You can use vcf_to_plink implemented in genotype_formatting.ipynb to further convert it to PLINK format.
A set of sumstats to help evaluate quality of genotype before and after QC
- Particularly useful is the TS/TV ratio

Minimal Working Example#

The MWE is generated via

bcftools query -l get-dosage.ALL.vcf.gz | head -40 > MWE_sample_list
bcftools view -S MWE_sample_list  get-dosage.ALL.vcf.gz > sample_filtered.vcf &
bgzip -c sample_filtered.vcf >  sample_filtered.vcf.gz
tabix -p vcf sample_filtered.vcf.gz
bcftools view --regions chr1 sample_filtered.vcf.gz > chr1_sample_filtered.vcf &
cat chr1_sample_filtered.vcf | head -20000 > MWE_genotype.vcf

and was stored here: https://drive.google.com/file/d/1sxxPdPIyKma0mAl8TKwhgyRHlOh0Oyrc/view?usp=sharing

FIXME: point this to the synapse folder.

(Optional) Rename Chromosomes#

Timing: 40 min

sos run VCF_QC.ipynb rename_chrs \
    --genoFile /restricted/projectnb/xqtl/xqtl_protocol/reference_data/00-All.vcf.gz \
    --cwd /restricted/projectnb/xqtl/xqtl_protocol/reference_data \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest \
    -c /restricted/projectnb/xqtl/xqtl_protocol/scripts/csg.yml -q neurology 

INFO: Running [32mrename_chrs[0m: 
INFO: t0bb13a5cc2cae479 [32msubmitted[0m to neurology with job id Your job 8313589 ("job_t0bb13a5cc2cae479") has been submitted
INFO: Waiting for the completion of [32m1[0m task.
INFO: [32mrename_chrs[0m output:   [32m/restricted/projectnb/xqtl/xqtl_protocol/reference_data/00-All.add_chr.vcf.gz[0m
INFO: Workflow rename_chrs (ID=w4f0deb3b5b4ae8ce) is executed successfully with 1 completed step and 1 completed task.

(Optional) dbSNP Annotation#

Timing: 4 min

sos run VCF_QC.ipynb dbsnp_annotate \
    --genoFile /restricted/projectnb/xqtl/xqtl_protocol/reference_data/00-All.add_chr.vcf.gz \
    --cwd /restricted/projectnb/xqtl/xqtl_protocol/reference_data \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest \
    -c /restricted/projectnb/xqtl/xqtl_protocol/scripts/csg.yml -q neurology 

INFO: Running [32mdbsnp_annotate[0m: 
INFO: t6a0e22783b2666a3 [32msubmitted[0m to neurology with job id Your job 8314421 ("job_t6a0e22783b2666a3") has been submitted
INFO: Waiting for the completion of [32m1[0m task.
INFO: [32mdbsnp_annotate[0m output:   [32m/restricted/projectnb/xqtl/xqtl_protocol/reference_data/00-All.add_chr.variants.gz[0m
INFO: Workflow dbsnp_annotate (ID=wf0ebf0af2d3d81ad) is executed successfully with 1 completed step and 1 completed task.

i. Quality Control#

Timing: X min

sos run VCF_QC.ipynb qc    \
    --genoFile data/MWE/MWE_genotype.vcf     \
    --dbsnp-variants data/reference_data/00-All.add_chr.variants.gz  \
    --reference-genome data/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta   \
    --cwd MWE/output/genotype_1 \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest

To run in parallel for all genotype data listed in mwe_genotype_list:

sos run VCF_QC.ipynb qc    \
    --genoFile data/mwe/mwe_genotype_list    \
    --dbsnp-variants data/reference_data/00-All.add_chr.variants.gz  \
    --reference-genome data/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta   \
    --cwd MWE/output/genotype_4 \
    --container oras://ghcr.io/cumc/bioinfo_apptainer:latest

Producing the following results:

Total TS/TV for 19639 known variants before QC: 2.599
Total TS/TV for 19573 known variants after QC: 2.600
There is no novel variants included in the MWE.

The Total TS/TV is extracted from the last step of QC. For known variant before QC:

grep Ts/Tv MWE_genotype.leftnorm.known_variant.snipsift_tstv | rev | cut -d',' -f1 | rev

2.599

For known variant after QC:

grep Ts/Tv MWE_genotype.leftnorm.filtered.*_variant.snipsift_tstv | rev | cut -d',' -f1 | rev

2.600

For novel variant before/after QC, TS/TV is not avaible since no novel_variants presented in the MWE

grep Ts/Tv MWE_genotype.leftnorm.novel_variant.snipsift_tstv | rev | cut -d',' -f1 | rev
grep Ts/Tv MWE_genotype.leftnorm.filtered.novel_variant.snipsift_tstv | rev | cut -d',' -f1 | rev

Command Interface#

sos run VCF_QC.ipynb -h

usage: sos run VCF_QC.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  rename_chrs
  dbsnp_annotate
  qc

Global Workflow Options:
  --genoFile  paths

                        input can either be 1 vcf genoFile, or a list of vcf
                        genoFile.
  --remove-samples . (as path)
                        The path to the file that contains the list of samples
                        to remove (format FID, IID)
  --keep-samples . (as path)
                        The path to the file that contains the list of samples
                        to keep (format FID, IID)
  --cwd output (as path)
                        Workdir
  --numThreads 1 (as int)
                        Number of threads
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 24h
                        Walltime
  --mem 60G
                        Usually a whole genome VCF.gz file has the size of
                        200+GB, after testing, a minimum of 60GB of mem is
                        requried.
  --container ''
                        Software container option
  --entrypoint ('micromamba run -n' + ' ' + container.split('/')[-1][:-4] + " --no-capture-output") if container.endswith('.sif') else ""

  --[no-]add-chr (default to False)
                        use this function to edit memory string for PLINK input

Sections
  rename_chrs:
  dbsnp_annotate:
  qc_1:                 Handel multi-allelic sites, left normalization of indels
                        and add variant ID
    Workflow Options:
      --dbsnp-variants VAL (as path, required)
                        Path to dbSNP variants generated previously
      --reference-genome VAL (as path, required)
                        Path to fasta file for HG reference genome, eg
                        GRCh38_full_analysis_set_plus_decoy_hla.fa
      --[no-]bi-allelic (default to False)
      --[no-]snp-only (default to False)
  qc_2:                 genotype QC
    Workflow Options:
      --geno-filter 0.2 (as float)
                        Maximum missingess per-variant, default to 0.2
      --DP-snp 10 (as int)
                        Sample level QC - read depth (DP) to filter out SNPs
                        below this value Default to 10, with WES data in mind
                        But for WGS, setting it to 2 may be fine considering the
                        WGS may have low DP but the GQ filter should be good
                        enough
      --GQ 20 (as int)
                        Sample level QC - genotype quality (GQ) of specific
                        sample. This measure tells you how confident we are that
                        the genotype we assigned to a particular sample is
                        correct
      --DP-indel 10 (as int)
                        Sample level QC - read depth (DP) to filter out indels
                        below this value
      --AB-snp 0.15 (as float)
                        Allele balance for snps
      --AB-indel 0.2 (as float)
                        Allele balance for indels
      --hwe-filter 0.0 (as float)
                        HWE filter, default to 0.0 which means no HWE filter is
                        applied
  qc_3:

Global parameters#

[global]
# input can either be 1 vcf genoFile, or a list of vcf genoFile.
parameter: genoFile = paths
# The path to the file that contains the list of samples to remove (format FID, IID)
parameter: remove_samples = path('.')
# The path to the file that contains the list of samples to keep (format FID, IID)
parameter: keep_samples = path('.')
# Workdir
parameter: cwd = path("output")
# Number of threads
parameter: numThreads = 1
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Walltime 
parameter: walltime = '24h'
# Usually a whole genome VCF.gz file has the size of 200+GB, after testing, a minimum of 60GB of mem is requried.
parameter: mem = '60G'
# Software container option
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# use this function to edit memory string for PLINK input
from sos.utils import expand_size
cwd = path(f"{cwd:a}")

import os
def get_genotype_file(geno_file_paths):
    #
    def valid_geno_file(x):
        suffixes = path(x).suffixes
        if suffixes[-1] == '.bed':
            return True
        elif suffixes[-1] == '.vcf':
            return True
        elif len(suffixes)>1 and ''.join(suffixes[-2:]) == ".vcf.gz" :
            return True
        return False
    #
    def complete_geno_path(x, geno_file):
        if not valid_geno_file(x):
            raise ValueError(f"Genotype file {x} should be VCF (end with .vcf.gz) or PLINK bed file (end with .bed)")
        if not os.path.isfile(x):
            # relative path
            if not os.path.isfile(f'{geno_file:ad}/' + x):
                raise ValueError(f"Cannot find genotype file {x}")
            else:
                x = f'{geno_file:ad}/' + x
        return x
    # 
    def format_chrom(chrom):
        if chrom.startswith('chr'):
            chrom = chrom[3:]
        return chrom
    # Inputs are either VCF or bed, or a vector of them 
    if len(geno_file_paths) > 1:
        if all([valid_geno_file(x) for x in geno_file_paths]):
            return paths(geno_file_paths)
        else: 
            raise ValueError(f"Invalid input {geno_file_paths}")
    # Input is one genotype file or text list of genotype files
    geno_file = geno_file_paths[0]
    if valid_geno_file(geno_file):
        return paths(geno_file)
    else: 
        units = [x.strip().split() for x in open(geno_file).readlines() if x.strip() and not x.strip().startswith('#')]
        if all([len(x) == 1 for x in units]):
            return paths([complete_geno_path(x[0], geno_file) for x in units])
        elif all([len(x) == 2 for x in units]):
            genos = dict([(format_chrom(x[0]), path(complete_geno_path(x[1], geno_file))) for x in units])
        else:
            raise ValueError(f"{geno_file} should contain one column of file names, or two columns of chrom number and corresponding file name")
        return genos
    
genoFile = get_genotype_file(genoFile)

Annotate known and novel variants#

You can download the known variant reference from this link.

For a detailed explanation of the procedure and its rationale, please refer to this post.

[rename_chrs: provides = '{genoFile:nn}.add_chr.vcf.gz']
# This file can be downloaded from https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/00-All.vcf.gz.
input: genoFile
output: f'{_input:nn}.add_chr.vcf.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', entrypoint=entrypoint
    for i in {1..22} X Y MT; do echo "$i chr$i"; done > ${_output:nn}.chr_name_conv.txt
    bcftools annotate --rename-chrs ${_output:nn}.chr_name_conv.txt ${_input} -Oz -o ${_output}
    tabix -p vcf ${_output}
    rm -f ${_output:nn}.chr_name_conv.txt

[dbsnp_annotate]
input: genoFile
output: f"{cwd}/{_input:bnn}.variants.gz"
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    # Extract specific fields from the VCF file using bcftools
    bcftools query -f'%CHROM\t%POS\t%ID\t%REF\t%ALT\n' ${_input} | \
    awk 'BEGIN { 
            OFS="\t" 
         } 
         {
            # Calculate end position based on the length of REF or ALT
            if (length($4) > length($5)) {
                end_pos = $2 + (length($4) - 1)
            } else {
                end_pos = $2 + (length($5) - 1)
            }
            print $1, $2, end_pos, $3
         }' | \
    # Compress the output using bgzip
    bgzip -c > ${_output}

Genotype QC#

This step handles multi-allelic sites and annotate variants to known and novel. We add an RS ID to variants in dbSNP. Variants without rsID are considered novel variants. For every hour it can produce ~14Gb of data, please set the –walltime parameter according to the size of your input files.

# Handel multi-allelic sites, left normalization of indels and add variant ID
[qc_1 (variant preprocessing)]
# Path to dbSNP variants generated previously
parameter: dbsnp_variants = path
# Path to fasta file for HG reference genome, eg GRCh38_full_analysis_set_plus_decoy_hla.fa
parameter: reference_genome = path
parameter: bi_allelic = False
parameter: snp_only = False
input: genoFile, group_by = 1
output: f'{cwd}/{_input:bnn}.{"leftnorm" if not bi_allelic else "biallelic"}{".snp" if snp_only else ""}.vcf.gz'
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    # split multiallelic sites into biallelic records 
    ${'bcftools norm -m-any' if not bi_allelic else 'bcftools view -m2 -M2'} ${'-v snps' if snp_only else ""} ${_input} |\
    # Fix incorrect or missing REF alleles and warn about them
    bcftools norm -d exact -N --check-ref ws -f ${reference_genome} --threads ${numThreads} |\
    # Fill missing tags
    bcftools +fill-tags -- -t all,F_MISSING,'VD=sum(FMT/DP)' | \
    # Remove ID and replace with CHROM:POS:REF:ALT format
    bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' | \
    # Annotate with dbSNP rsID
    bcftools annotate -a ${dbsnp_variants}  -h <(echo '##INFO=<ID=RSID,Number=1,Type=String,Description="dbSNP rsID">') \
        -c CHROM,FROM,TO,INFO/RSID -Oz --threads ${numThreads} -o ${_output}

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
    for i in ${_output} ; do 
        # Capture file metadata
        output_info="$i"
        output_size=$(ls -lh "$i" | awk '{print $5}')
        output_rows=$(zcat "$i" | wc -l)
        output_column=$(zcat "$i" | grep -v "##" | head -1 | wc -w)
        output_header_row=$(zcat "$i" | grep "##" | wc -l)
        output_preview=$(zcat "$i" | grep -v "##" | head | cut -f 1-11)

        # Write captured information to the stdout file
        printf "output_info: %s\noutput_size: %s\noutput_rows: %d\noutput_column: %d\noutput_header_row: %d\noutput_preview:\n%s\n" \
            "$output_info" "$output_size" "$output_rows" "$output_column" "$output_header_row" "$output_preview" >> ${_output:n}.stdout
    done

This step filter variants based on FILTER PASS, DP and QC, fraction of missing genotypes (all samples), and on HWE, for snps and indels. It will also remove monomorphic sites – using bcftools view -c1.

# genotype QC
[qc_2 (variant level QC)]
# Maximum missingess per-variant, default to 0.2
parameter: geno_filter = 0.2
# Sample level QC - read depth (DP) to filter out SNPs below this value
# Default to 10, with WES data in mind 
# But for WGS, setting it to 2 may be fine considering the WGS may have low DP but the GQ filter should be good enough
parameter: DP_snp = 10
# Sample level QC - genotype quality (GQ) of specific sample. This measure tells you how confident we are that the genotype we assigned to a particular sample is correct
parameter: GQ = 20
# Sample level QC - read depth (DP) to filter out indels below this value
parameter: DP_indel = 10
# Allele balance for snps
parameter: AB_snp = 0.15
# Allele balance for indels
parameter: AB_indel = 0.2
# HWE filter, default to 0.0 which means no HWE filter is applied
parameter: hwe_filter = 0.0
output: f"{_input:nn}.bcftools_qc.vcf.gz"
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    # Initial filtering based on depth and genotype quality for SNPs and INDELs. Will set to missing genotypes that do not meet both conditions (DP>=DP_spns & GQ>=GQ), similarly for indels
    bcftools filter -S . -i \
    '(TYPE="SNP" & (FMT/DP)>=${DP_snp} & (FMT/GQ)>=${GQ}) | 
     (TYPE="INDEL" & (FMT/DP)>=${DP_indel} & (FMT/GQ)>=${GQ})' ${_input} | \
     
    # Further filtering to retain only variants that are PASS and have at least one non-reference allele 
    bcftools view -c1 | \
    bcftools view -f PASS | \

    # Filter based on genotype (hom/het) and allelic balance for SNPs and INDELs
    bcftools filter -i \
    'GT="hom" | 
     TYPE="snp" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= ${AB_snp} | 
     TYPE="indel" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= ${AB_indel}' | \

    # Filter based on missingness and Hardy-Weinberg equilibrium 
    bcftools filter -i 'F_MISSING<${geno_filter} & HWE>${hwe_filter}' \
    -Oz --threads ${numThreads} -o ${_output}


bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
    for i in ${_output} ; do 
        # Capture file metadata
        output_info="$i"
        output_size=$(ls -lh "$i" | awk '{print $5}')
        output_rows=$(zcat "$i" | wc -l)
        output_column=$(zcat "$i" | grep -v "##" | head -1 | wc -w)
        output_header_row=$(zcat "$i" | grep "##" | wc -l)
        output_preview=$(zcat "$i" | grep -v "##" | head | cut -f 1-11)

        # Write captured information to the stdout file
        printf "output_info: %s\noutput_size: %s\noutput_rows: %d\noutput_column: %d\noutput_header_row: %d\noutput_preview:\n%s\n" \
            "$output_info" "$output_size" "$output_rows" "$output_column" "$output_header_row" "$output_preview" >> ${_output:n}.stdout
    done

[qc_3 (genotype data summary statistics)]
input: output_from('qc_1'), output_from('qc_2'), group_by = 1
output: f"{cwd}/{_input:bnn}.novel_variant_sumstats", 
        f"{cwd}/{_input:bnn}.known_variant_sumstats", 
        f"{cwd}/{_input:bnn}.novel_variant.snipsift_tstv",
        f"{cwd}/{_input:bnn}.known_variant.snipsift_tstv"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint

    # Compute statistics for novel variants (RSID is missing)
    bcftools stats -i 'RSID="."' -v ${_input} > ${_output[0]}

    # Compute statistics for known variants (RSID is present)
    bcftools stats -i 'RSID!="."' -v ${_input} > ${_output[1]}

    # Compute TS/TV for novel variants
    bcftools filter -i 'RSID="."' ${_input} | \
        SnpSift tstv - > ${_output[2]}

    # Compute TS/TV for known variants
    bcftools filter -i 'RSID!="."' ${_input} | \
        SnpSift tstv - > ${_output[3]}

bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    for i in ${_output} ; do 
        # Capture file metadata
        output_info="$i"
        output_size=$(ls -lh "$i" | awk '{print $5}')
        output_rows=$(zcat "$i" | wc -l)
        output_column=$(zcat "$i" | grep -v "##" | head -1 | wc -w)
        output_header_row=$(zcat "$i" | grep "##" | wc -l)
        output_preview=$(zcat "$i" | grep -v "##" | head | cut -f 1-11)

        # Write captured information to the stdout file
        printf "output_info: %s\noutput_size: %s\noutput_rows: %d\noutput_column: %d\noutput_header_row: %d\noutput_preview:\n%s\n" \
            "$output_info" "$output_size" "$output_rows" "$output_column" "$output_header_row" "$output_preview" >> ${_output[0]:n}.stdout
    done