Output file formats
Although most output files include headers that describe the data, a brief explanation of the output files is provided below.
Read to isoform assignment
Tab-separated values, the columns are:
read_id
- read id;chr
- chromosome id;strand
- strand of the assigned isoform (not to be confused with read mapping strand);isoform_id
- isoform id to which the read was assigned;gene_id
- gene id to which the read was assigned;assignment_type
- assignment type, can be:unique
- reads was unambiguously assigned to a single known isoform;unique_minor_difference
- read was assigned uniquely but has alignment artifacts;inconsistent
- read was matched with inconsistencies, closest match(es) are reported;inconsistent_non_intronic
- read was matched with inconsistencies, which do not affect intron chain (e.g. olly TSS/TES);inconsistent_ambiguous
- read was matched with inconsistencies equally well to two or more isoforms;ambiguous
- read was assigned to multiple isoforms equally well;noninfomative
- reads is intronic or has an insignificant overlap with a known gene;intergenic
- read is intergenic.
assignment_events
- list of detected inconsistencies; for each assigned isoform a list of detected inconsistencies relative to the respective isoform is stored; values in each list are separated by+
symbol, lists are separated by comma, the number of lists equals to the number of assigned isoforms; possible events are (see graphical representation below):- consistent events:
none
/.
/undefined
- no special event detected;mono_exon_match
mono-exonic read matched to mono-exonic transcript;fsm
- full splice match;ism_5/3
- incomplete splice match, truncated on 5'/3' side;ism_internal
- incomplete splice match, truncated on both sides;mono_exonic
- mono-exonic read matching spliced isoform;tss_match
/tss_match_precise
- 5' read is located less than 50 /delta
bases from the TSS of the assigned isoformtes_match
/tes_match_precise
- 3' read is located less than 50 /delta
bases from the TES of the assigned isoform (can be reported without detecting polyA sites)
- alignment artifacts:
intron_shift
- intron that seems to be shifted due to misalignment (typical for Nanopores);exon_misalignment
- short exon that seems to be missed due to misalignment (typical for Nanopores);fake_terminal_exon_5/3
- short terminal exon at 5'/3' end that looks like an alignment artifact (typical for Nanopores);terminal_exon_misalignment_5/3
- missed reference short terminal exon;exon_elongation_5/3
- minor exon extension at 5'/3' end (not exceeding 30bp);fake_micro_intron_retention
- short annotated introns are often missed by the aligners and thus are not considered as intron retention;
- intron retentions:
intron_retention
- intron retention;unspliced_intron_retention
- intron retention by mono-exonic read;incomplete_intron_retention_5/3
- terminal exon at 5'/3' end partially covers adjacent intron;
- significant inconsistencies (each type end with
_known
if all resulting read introns are annotated and_novel
otherwise):major_exon_elongation_5/3
- significant exon extension at 5'/3' end (exceeding 30bp);extra_intron_5/3
- additional intron on the 5'/3' end of the isoform;extra_intron
- read contains additional intron in the middle of exon;alt_donor_site
- read contains alternative donor site;alt_acceptor_site
- read contains alternative annotated acceptor site;intron_migration
- read contains alternative annotated intron of approximately the same length as in the isoform;intron_alternation
- read contains alternative intron, which doesn't fall intro any of the categories above;mutually_exclusive_exons
- read contains different exon(s) of the same total length comparing to the isoform;exon_skipping
- read skips exon(s) comparing to the isoform;exon_merge
- read skips exon(s) comparing to the isoform, but a sequence of a similar length is attached to a neighboring exon;exon_gain
- read contains additional exon(s) comparing to the isoform;exon_detach
- read contains additional exon(s) comparing to the isoform, but a neighboring exon looses a sequnce of a similar length;terminal_exon_shift
- read has alternative terminal exon;alternative_structure
- reads has different intron chain that does not fall into any of categories above;
- alternative transcription start / end (reported when poly-A tails are present):
alternative_polya_site
- read has alternative polyadenylation site;internal_polya_site
- poly-A tail detected but seems to be originated from A-rich intronic region;correct_polya_site
- poly-A site matches reference transcript end;aligned_polya_tail
- poly-A tail aligns to the reference;alternative_tss
- alternative transcription start site.
- consistent events:
exons
- list of coordinates for normalized read exons (1-based, indels and polyA exons are excluded);additional
- field for supplementary information, which may include:gene_assignment
- Gene assignment classification; possible values are the same as for transcript classification.PolyA
- True if poly-A tail is detected;Canonical
- True if all read introns are canonical, Unspliced is used for mono-exon reads; (use--check_canonical
);Classification
- SQANTI-like assignment classification.
Note, that a single read may occur more than once if assigned ambiguously.
Expression table format
Tab-separated values, the columns are:
feature_id
- genomic feature ID;TPM
orcount
- expression value (float).
For grouped counts, each column contains expression values of a respective group (matrix representation).
Beside count matrix, transcript and gene grouped counts are also printed in a linear format, in which each line contains 3 tab-separated values:
feature_id
- genomic feature ID;group_id
- group name;count
- read count of the feature in this group.
Exon and intron count format
Tab-separated values, the columns are:
chr
- chromosome ID;start
- feature leftmost 1-based positions;end
- feature rightmost 1-based positions;strand
- feature strand;flags
- symbolic feature flags, can contain the following characters:X
- terminal feature;I
- internal feature;T
- feature appears as both terminal and internal in different isoforms;S
- feature has similar positions to some other feature;C
- feature is contained in another feature;U
- unique feature, appears only in a single known isoform;M
- feature appears in multiple different genes.
gene_ids
- list if gene ids feature belong to;group_id
- read group if provided (NA by default);include_counts
- number of reads that include this feature;exclude_counts
- number of reads that span, but do not include this feature;
Transcript models format
Constructed transcript models are stored in usual GTF format.
Contains exon
, transcript
and gene
features.
Known genes and transcripts are reposted with their reference IDs.
Novel genes IDs have format novel_gene_XXX_###
and novel transcript IDs are formatted as transcript###.XXX.TYPE
,
where ###
is the unique number (not necessarily consecutive), XXX
is the chromosome name and TYPE can be one of the following:
- nic - novel in catalog, new transcript that contains only annotated introns;
- nnic - novel not in catalog, new transcript that contains unannotated introns.
Each exon also has a unique ID stored in exon_id
attribute.
In addition, each transcript contains canonical
property if --check_canonical
is set.
If --sqanti_output
option is set, each novel transcript also has a similar_reference_id
field containing ID of
a most similar reference isoform and alternatives
attribute, which indicates the exact differences between
this novel transcript and the similar reference transcript.