What is Bellerophon?
Bellerophon is a program for detecting chimeric sequences in a
multiple sequence dataset by comparative analysis.
Bellerophon was specifically developed to detect 16S rRNA gene chimeras
in PCR-clone libraries but can be applied to other gene datasets.
What is a chimeric sequence?
A chimeric sequence, or chimera for short, is a sequence composed of
two or more phylogenetically distinct parent sequences. Chimeras are
usually PCR artifacts thought to occur when a prematurely terminated
amplicon reanneals to a foreign DNA strand and is copied to completion
in the following PCR cycles. The point at which the chimeric sequence
changes from one parent to the next is called the breakpoint or
conversion point.
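As a toy illustration of this definition, the following Python sketch builds an artificial chimera from two aligned parent sequences. The function name `make_chimera` and the example sequences are hypothetical and serve only to show what a breakpoint is; they are not part of Bellerophon.

```python
def make_chimera(parent1: str, parent2: str, break_col: int) -> str:
    """Join the start of parent1 to the end of parent2 at an alignment column."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must be aligned to the same length")
    if not 0 < break_col < len(parent1):
        raise ValueError("breakpoint must fall inside the alignment")
    # Columns before the breakpoint come from parent 1, the rest from parent 2.
    return parent1[:break_col] + parent2[break_col:]


if __name__ == "__main__":
    print(make_chimera("AAAAAAAAAA", "CCCCCCCCCC", 4))
```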
How does Bellerophon work?
Bellerophon detects chimeras based on a partial treeing approach (see
Ref. 1 below); that is, phylogenetic trees are inferred from
independent regions (fragments) of a sequence alignment and the
topologies are compared. Incongruencies in branching patterns may be
indicative of a chimeric sequence. No trees are actually required to
be built during the procedure and the only calculations are distance
(similarity) calculations of sequences (see below for a discussion of
distance calculations). A full matrix of distances (dm) between all
pairs of sequences is calculated for the fragments left and right of an
assumed conversion point. The total absolute deviation of the two
distance matrices (the distance matrix error, dme) of n sequences is
then the sum, over all pairs of sequences i and j, of the absolute
difference between the left-fragment and right-fragment distances,
where dm[i][j] denotes the distance between two sequences i and j.
To find the sequence with the highest contribution to the dme value,
we calculate the ratio of the dme value from all sequences in the
dataset over the dme value (dme[i]) when the sequence i that is under
consideration is omitted, and call this ratio, dme / dme[i], the
preference score of the sequence (equation 1).
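The definitions above can be sketched directly in Python. This is a naive illustration under stated assumptions, not Bellerophon's implementation: the helper names are hypothetical, the distance is the uncorrected fraction of mismatching columns, gaps are not treated specially, and the dme is summed over the full matrix (both i,j and j,i).

```python
def p_distance(a: str, b: str) -> float:
    """Uncorrected distance: fraction of aligned columns that differ (assumption)."""
    if len(a) != len(b) or not a:
        raise ValueError("fragments must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(fragments):
    """Full matrix of pairwise distances between fragments."""
    return [[p_distance(a, b) for b in fragments] for a in fragments]


def dme(left, right):
    """Total absolute deviation between the left and right distance matrices."""
    dl, dr = distance_matrix(left), distance_matrix(right)
    n = len(left)
    return sum(abs(dl[i][j] - dr[i][j]) for i in range(n) for j in range(n))


def preference_scores(seqs, break_col, window):
    """Ratio dme / dme[i] for every sequence i at one assumed breakpoint."""
    if break_col - window < 0 or break_col + window > len(seqs[0]):
        raise ValueError("window does not fit on both sides of the breakpoint")
    left = [s[break_col - window:break_col] for s in seqs]
    right = [s[break_col:break_col + window] for s in seqs]
    total = dme(left, right)
    scores = []
    for i in range(len(seqs)):
        # Recompute the error with sequence i left out (the expensive, naive way).
        reduced = dme(left[:i] + left[i + 1:], right[:i] + right[i + 1:])
        scores.append(total / reduced if reduced > 0 else float("inf"))
    return scores


if __name__ == "__main__":
    # Hypothetical toy alignment; the last sequence is an artificial chimera.
    seqs = ["AAAAAAAA", "CCCCCCCC", "AAAAAAAT", "CCCCCCCG", "AAAACCCC"]
    print(preference_scores(seqs, break_col=4, window=4))
```

In this toy dataset the artificial chimera (the last sequence) receives the highest score. With so few sequences the other scores are not close to one, which is an effect of the tiny example rather than a property of the method.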
These preference scores have to be calculated for all
sequences. Naively, the calculation would require two computationally
expensive distance matrix calculations. It can, however, be
implemented much more efficiently by reusing previously performed
calculations. Because the dme is a sum over all pairs of sequences,
and the distances between identical sequences dm[i][i] are by
definition zero, it is straightforward to rewrite equation (1) in a
form that requires only a single distance matrix calculation and some
intermediate storage of the column sums.
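One way to read this shortcut is sketched below in Python. It is a reconstruction under assumptions, since the rewritten equation is not reproduced in this text: the dme is taken as the sum over the full matrix of absolute differences, so omitting sequence k removes row k and column k, which together contribute twice the column sum of k because the diagonal is zero. The function names are hypothetical, and the naive version is included only to check the shortcut.

```python
def preference_scores_fast(dm_left, dm_right):
    """Preference scores from one pass over the matrices, using column sums."""
    n = len(dm_left)
    diff = [[abs(dm_left[i][j] - dm_right[i][j]) for j in range(n)]
            for i in range(n)]
    col_sums = [sum(diff[i][j] for i in range(n)) for j in range(n)]
    total = sum(col_sums)  # dme of the full dataset
    scores = []
    for k in range(n):
        # Row k and column k drop out; the diagonal entry is zero by definition.
        reduced = total - 2 * col_sums[k]
        scores.append(total / reduced if reduced > 0 else float("inf"))
    return scores


def preference_scores_naive(dm_left, dm_right):
    """Same scores, recomputing the error for every reduced dataset."""
    n = len(dm_left)

    def err(keep):
        return sum(abs(dm_left[i][j] - dm_right[i][j])
                   for i in keep for j in keep)

    total = err(range(n))
    scores = []
    for k in range(n):
        reduced = err([i for i in range(n) if i != k])
        scores.append(total / reduced if reduced > 0 else float("inf"))
    return scores


if __name__ == "__main__":
    # Hypothetical symmetric distance matrices with a zero diagonal.
    left = [[0, 0.125, 0.5], [0.125, 0, 0.5], [0.5, 0.5, 0]]
    right = [[0, 0.25, 0.125], [0.25, 0, 0.25], [0.125, 0.25, 0]]
    print(preference_scores_fast(left, right))
    print(preference_scores_naive(left, right))
```

With real distances the two versions may differ by floating-point rounding, so a comparison with a small tolerance is safer than exact equality.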
From equation 1 the meaning of the preference score becomes clear. One
compares two datasets: the full dataset and, as a reference, the
dataset without the potentially chimeric sequence. The biggest
contribution to the distance matrix error is expected to come from
chimeric sequences in the dataset, since fragments from these
sequences have distinctly different locations relative to all other
sequences in the dataset, and therefore distinctly different distance
matrices. Consequently, chimeric sequences have a preference score
greater than one, while the preference score of normal sequences is
expected to be (close to) one. With this as the basis for calculating
the preference score, it is then only a matter of scanning the
sequences for the break point with the highest score, sorting the list
of sequences according to the highest recorded score, and reporting
the potentially chimeric sequences. Mutually incompatible chimeras are
screened from the output. That is, once a sequence (A) has been
identified as chimeric, subsequent putative chimeras with lower
preference scores that identify sequence A as one of the parents are
removed from the output list.
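The screening of mutually incompatible chimeras can be sketched as follows. This is a minimal Python illustration of the rule as described above; the `Candidate` record, its field names, and the example values are hypothetical and do not reflect Bellerophon's internal data structures.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    name: str                 # putative chimera
    score: float              # highest preference score recorded for it
    parents: Tuple[str, str]  # putative parent sequences


def screen_incompatible(candidates: List[Candidate]) -> List[Candidate]:
    """Drop lower-scoring candidates whose parent was already called chimeric."""
    accepted = []
    chimeric = set()
    # Walk through the candidates from highest to lowest preference score.
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if any(parent in chimeric for parent in cand.parents):
            continue  # a parent is itself a reported chimera: incompatible
        accepted.append(cand)
        chimeric.add(cand.name)
    return accepted


if __name__ == "__main__":
    candidates = [
        Candidate("Y", 1.10, ("X", "C")),  # names X as a parent
        Candidate("X", 1.23, ("A", "B")),
        Candidate("Z", 1.05, ("C", "D")),
    ]
    for cand in screen_incompatible(candidates):
        print(cand.name, cand.score)
```

Here Y is removed because its putative parent X has a higher preference score and has already been reported as chimeric.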
What are the Correction to use options?
These are the algorithms used to calculate corrected pairwise
distances between aligned sequences, which are then used for matrix
comparisons and identification of putative chimeras (see above).
Correction refers to the concept that the algorithm corrects for
unseen changes in sequences over time and more accurately reflects
how sequences evolve than simply using the observable differences
between sequences (uncorrected). Within the framework of detecting
chimeric sequences, however, not all commonly used correction
functions may be appropriate. For example, 16S rDNA chimeras most
frequently occur between highly related sequences since the
likelihood of a prematurely terminated amplicon (parent 1)
reannealing to a foreign template (parent 2) is higher. In other
words, the "distance of chimeric likelihood" between two sequences
increases more rapidly at high sequence identities than it does at
lower identities. This trend, however, is opposite to the trend
described by commonly used correction functions such as the
Jukes-Cantor correction, which led us to develop our own empirical
distance function, modestly called the Huber-Hugenholtz correction:
where d is the corrected distance between two sequences with a
fraction id of identically aligned bases. This correction strongly
differentiates between highly similar sequences and puts less emphasis
on the difference between remote homologs.
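To make the contrast with commonly used corrections concrete, the Python sketch below evaluates the standard Jukes-Cantor formula, d = -3/4 ln(1 - 4/3 p) with p = 1 - id, and compares how much the distance grows for the same drop in identity at high and at lower identities. The function name and the chosen identity values are illustrative assumptions. The Huber-Hugenholtz formula itself is not reproduced in this text, so it is deliberately not implemented here.

```python
import math


def jukes_cantor(identity: float) -> float:
    """Jukes-Cantor corrected distance for a fraction of identical bases."""
    p = 1.0 - identity  # observed fraction of differing positions
    if not 0.0 <= p < 0.75:
        raise ValueError("Jukes-Cantor is only defined for identities above 0.25")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Same 5% drop in identity, once among close relatives, once among distant ones.
    for high, low in [(1.00, 0.95), (0.75, 0.70)]:
        step = jukes_cantor(low) - jukes_cantor(high)
        print(f"identity {high:.2f} -> {low:.2f}: distance grows by {step:.4f}")
```

The increment is larger at the lower identities, which is the trend the text describes as opposite to what chimera detection needs.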
Hypervariable regions are routinely removed from sequence alignments
prior to phylogenetic analyses because they cannot be unambiguously
aligned and therefore introduce noise into the phylogenetic analysis.
We have implemented an automatic filter to remove variable regions
from the alignment prior to calculating pairwise distances. Sequence
positions (columns) that have nucleotides in fewer than 50% of all
sequences are removed, and of the resulting consensus sequences
only those with more residues than two times the window size are
used in the analysis. Because many truncated 16S rDNA sequences in
the dataset will heavily influence the filter mask, we advise
removing all sequences shorter than 2*window size + 100. Note
that the estimated breakpoint will likely not correspond directly to
the absolute nucleotide position in the chimeric sequence because
variable regions have been removed.
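The filter described above could look roughly like the following Python sketch. It is an approximation under assumptions: gaps are taken to be `-` or `.`, a column with exactly 50% occupancy is kept, and the function name and return shape are hypothetical. The list of kept columns is returned as well, because it is what one would need to map a breakpoint in the filtered alignment back to the original columns.

```python
GAP_CHARS = frozenset("-.")  # assumed gap symbols


def filter_alignment(alignment, window, min_fraction=0.5):
    """Remove sparse columns, then drop sequences with too few residues.

    alignment maps sequence names to aligned strings of equal length.
    Returns (filtered alignment, indices of the original columns kept).
    """
    if not alignment:
        return {}, []
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")

    # Keep columns in which at least min_fraction of sequences have a nucleotide.
    keep = []
    for col in range(length):
        filled = sum(alignment[name][col] not in GAP_CHARS for name in names)
        if filled >= min_fraction * len(names):
            keep.append(col)

    # Keep only sequences with more residues than two times the window size.
    filtered = {}
    for name in names:
        seq = "".join(alignment[name][col] for col in keep)
        residues = sum(ch not in GAP_CHARS for ch in seq)
        if residues > 2 * window:
            filtered[name] = seq
    return filtered, keep


if __name__ == "__main__":
    toy = {
        "s1": "ACGT-ACGTA",
        "s2": "ACGT-ACGTA",
        "s3": "ACGTTACG--",
        "s4": "AC-------A",  # truncated sequence
    }
    kept_seqs, kept_cols = filter_alignment(toy, window=2)
    print(kept_cols)
    print(kept_seqs)
```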
What is the window size for?
The window size is the number of nucleotides on either side of the
breakpoint used to calculate distance matrices. This is necessary
since the alignment is scanned along its length for optimal break
points (see How does Bellerophon work) and the fragments on either
side of the breakpoint should be the same size to normalise
calculations.
Ideally, the more characters (nucleotides) used, the more accurate
the distance calculations, but the number of characters also dictates
how close to either end of the alignment breakpoints can be detected.
For instance, if the highest window size of 400 is selected,
breakpoints can only be detected 400 characters or more from either
end of the alignment.
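The relationship between window size and detectable breakpoints can be written down in a few lines of Python. In this sketch a breakpoint is taken to be the column index where the right-hand fragment starts, which is an assumed convention, and the alignment length of 1500 columns is a hypothetical example.

```python
def breakpoint_range(alignment_length: int, window: int) -> range:
    """Columns at which a full window fits on both sides of the breakpoint."""
    if window <= 0:
        raise ValueError("window size must be positive")
    if alignment_length < 2 * window:
        return range(0)  # alignment too short for this window size
    # Left fragment is [b - window, b), right fragment is [b, b + window).
    return range(window, alignment_length - window + 1)


if __name__ == "__main__":
    candidates = breakpoint_range(1500, 400)
    print(candidates[0], candidates[-1], len(candidates))
```

A larger window therefore shrinks the scanned region from both ends, which is the trade-off against more accurate distances.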
The automatic filter (see What are the Correction to use options)
detects if a sequence is too short for a given window size and removes
it from the dataset prior to analysis.
What is the Align sequences option?
This option, when set to yes, will automatically align 100 or fewer
unaligned sequences using ClustalW in preparation for running
Bellerophon. Since multiple alignments are quite computationally
intensive, we have limited datasets to 100 unaligned sequences.
How to read the results!!!
An example of a Bellerophon output (explanations are inserted in
square brackets after the output lines they describe):

job title: micronodule clones
[job title as submitted]
*** sequences are re-aligned using clustalw
[sequences were aligned with ClustalW prior to being processed by
Bellerophon. If this action was not activated, the output would read:
*** sequences are assumed to be aligned]
*** Huber-Hugenholtz correction used
[the correction selected]
*** The chosen Window size is 300
[the window size selected]
all chimeras reported below are worth checking manually even if their
preference scores are close to 1.0
***** possible chimera 3
preference score: 1.23436
[scores >1 suggest chimeras; results are ranked by this score.
Preference scores are dataset dependent and can only be compared
within a dataset. Scores even marginally greater than 1 may indicate
real chimeras and should be checked.]
E.coli (chimera) break point : 1163 (1131)
[most likely point of conversion, numbered according to the E.coli
sequence. In parentheses: the break point in the (filtered!)
sequence, in this case MNA5]
% identities of parent sequences to chimera left and right of the break
point
           frag. 1   frag. 2
parent 1:  100.0     90.7
parent 2:  89.3      98.7
[sequence identities (%) between the chimeric fragments and the parent
sequences before and after the breakpoint]
Sequence ID of chimera with parent sequences
chimera =>gi|11093933|_MNA5
parent 1 =>gi|11093931|_MNE12
parent 2 =>gi|11093935|_MNC9
[names of the putative chimeric sequence and parent sequences]
Advantages of Bellerophon:
1) You can process all sequences from a PCR-clone library in a
single analysis and don't have to inspect outputs for every sequence in
the dataset.
2) The approximate putative breakpoint is calculated using a
sliding window (see above) and will help with manual verification of
the chimera.
3) A chimeric sequence is not only tested against two (putative)
parent sequences but rather is assessed by how well it fits into the
complete phylogenetic environment of a multiple sequence
alignment. Hence sequences do not become invisible to the program, as
is the case with CHIMERA_CHECK (see Ref. 1 below).
4) The calculations Bellerophon uses to detect chimeric sequences
are computationally relatively cheap and results are quickly
calculated for datasets with up to 50 sequences (~1 min). Larger
datasets take longer: 100 sequences ~30 min, 300 sequences ~8 hours.
Tips for using Bellerophon:
1) Bellerophon works most effectively if the parent sequences or
non-chimeric sequences closely related to the parent sequences are
present in the dataset analyzed. Therefore, as many sequences as
possible from a single PCR-clone library should be included in the
analysis since the parent sequences of any chimera are most likely to
be in that dataset. Addition of non-chimeric outgroup sequences
(e.g. from isolates) may help refine an analysis by
providing reference points (and a broader phylogenetic context) in the
analysis (see Ref. 1), but be aware that analysis time increases with
bigger datasets.
2) Bellerophon is compromised by using sequences of different
lengths as this can produce artificial skews in distance matrices of
fragments of the alignment. Datasets containing sequences of the same
length and covering the same portion of the gene should be used
(usually not an issue with sequences from a PCR-clone library).
The filter will automatically remove sequences too short for the
window size (see What is the window size for?), i.e. <600 bp for a
window size of 300.
3) If possible, multiple window sizes should be
used as the number of identified chimeras can vary with the choice of
the window size.
4) Re-running the dataset without the first reported chimeras may
identify additional putative chimeras by reducing noise in the
analysis. Ideally, the dataset should continue to be re-run, removing
previously reported chimeras, until no chimeras are identified.
5) Bellerophon should be used in concert with other detection
methods such as CHIMERA_CHECK, and putatively
identified chimeras should always be confirmed by manual inspection of
the sequences for signature shifts (see Ref. 1 below).
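The iterative re-running suggested in tip 4 amounts to a simple loop, sketched here in Python. The `detect` argument is a hypothetical stand-in for one complete chimera analysis that returns the names of reported chimeras; Bellerophon is not known to expose such a function, and the toy detector exists only to make the sketch runnable.

```python
from typing import Callable, Dict, List


def rerun_until_clean(
    sequences: Dict[str, str],
    detect: Callable[[Dict[str, str]], List[str]],
    max_rounds: int = 20,
) -> List[str]:
    """Repeat the analysis, removing reported chimeras, until none are found."""
    remaining = dict(sequences)
    removed: List[str] = []
    for _ in range(max_rounds):  # guard against a detector that never settles
        reported = detect(remaining)
        new = [name for name in dict.fromkeys(reported) if name in remaining]
        if not new:
            break
        for name in new:
            del remaining[name]
            removed.append(name)
    return removed


def toy_detect(seqs: Dict[str, str]) -> List[str]:
    """Hypothetical detector: reports one preset clone per run, if present."""
    for name in ("clone_X", "clone_Y"):
        if name in seqs:
            return [name]
    return []


if __name__ == "__main__":
    data = {"clone_A": "ACGT", "clone_X": "ACGA", "clone_Y": "ACGG"}
    print(rerun_until_clean(data, toy_detect))
```

Each round is a full analysis, so the total run time grows with the number of rounds; chimeras reported this way still need manual confirmation.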
Intended upgrades to Bellerophon:
1) Orientation checker - to confirm that all 16S rRNA
sequences are submitted in the right orientation and to reverse
complement any sequences that fail this check prior to running
Bellerophon.
2) Automatic rerunning to extinction (0 detected
chimeras) option - to give the user the option of implementing tip 4
automatically, keeping in mind that this may significantly increase the
job processing time!
If you have any suggestions for extensions, we would
love to hear about them; please contact us.
Reference to cite:
T. Huber, G. Faulkner and P. Hugenholtz. Bellerophon: a program to
detect chimeric sequences in multiple sequence alignments.
Bioinformatics (2004) 20, 2317-2319.
Additional information:
Manually verified putative 16S rDNA chimeras as of March 2003
P. Hugenholtz and T. Huber. Chimeric 16S rDNA sequences of
diverse origin are accumulating in the public databases. Int J Syst
Evol Microbiol (2003) 53, 289-293.
Hot off the press: Microbiology Today Editor Meriel Jones selected our
chimeric sequence detection paper as one of the "papers in current
issues of the Society's journals which highlight new and exciting
developments in microbiological research".
Bellerophon library: A selection of published work that has used Bellerophon