What is Bellerophon?

Bellerophon is a program for detecting chimeric sequences in a multiple sequence dataset by comparative analysis. Bellerophon was specifically developed to detect 16S rRNA gene chimeras in PCR-clone libraries but can be applied to other gene datasets.


What is a chimeric sequence?

A chimeric sequence, or chimera for short, is a sequence composed of fragments from two or more phylogenetically distinct parent sequences. Chimeras are usually PCR artifacts, thought to occur when a prematurely terminated amplicon reanneals to a foreign DNA strand and is copied to completion in the following PCR cycles. The point at which the chimeric sequence changes from one parent to the next is called the breakpoint or conversion point.



How does Bellerophon work?

Bellerophon detects chimeras using a partial treeing approach (see Ref. 1 below): phylogenetic trees are inferred from independent regions (fragments) of a sequence alignment and their topologies are compared. Incongruencies in branching patterns may indicate a chimeric sequence. No trees actually need to be built during the procedure; the only calculations are distance (similarity) calculations between sequences (see below for a discussion of distance calculations). A full matrix of distances (dm) between all pairs of sequences is calculated for the fragments left and right of an assumed conversion point. The total absolute deviation of the two distance matrices (distance matrix error, dme) of n sequences is then

$$ dme = \sum_{i=1}^{n} \sum_{j=1}^{n} \left| dm_{left}[i][j] - dm_{right}[i][j] \right| $$

where dm_left[i][j] and dm_right[i][j] denote the distance between two sequences i and j in the left and right fragment, respectively.
To find the sequence with the highest contribution to the dme value, we calculate the ratio of the dme value from all sequences in the dataset over the dme value (dme[i]) obtained when the sequence i under consideration is omitted, and call this ratio the preference score of the sequence:

$$ \text{preference score}(i) = \frac{dme}{dme[i]} \qquad (1) $$

These preference scores have to be calculated for all sequences. Naively, this would require two computationally expensive distance matrix calculations. It can, however, be implemented much more efficiently by taking advantage of previously performed calculations. Because the calculation of the dme involves sums of the form

$$ \sum_{i} \sum_{j} \left| dm_{left}[i][j] - dm_{right}[i][j] \right| $$

and the distances between identical sequences dm[i][i] are by definition zero (and the distance matrices are symmetric), it is straightforward to rewrite equation (1) as

$$ \text{preference score}(i) = \frac{dme}{dme - 2 \sum_{j=1}^{n} \left| dm_{left}[i][j] - dm_{right}[i][j] \right|} $$

which only involves a single distance matrix calculation and some intermediate storage of the column sums.
From equation 1 the meaning of the preference score becomes clear. Two datasets are compared: the full dataset and, as a reference, the dataset without the potentially chimeric sequence. The biggest contribution to the distance matrix error is expected to come from chimeric sequences in the dataset, since fragments from these sequences have distinctly different locations relative to all other sequences in the dataset, and therefore distinctly different distance matrices. Consequently, chimeric sequences have a preference score greater than one, while the preference score of normal sequences is expected to be (close to) one. With this as the basis for calculating the preference score, it is then only a matter of scanning each sequence for the break point with the highest score, sorting the list of sequences by the highest recorded score and reporting the potentially chimeric sequences. Mutually incompatible chimeras are screened from the output. That is, once a sequence (A) has been identified as chimeric, subsequent putative chimeras with lower preference scores that identify sequence A as one of their parents are removed from the output list.
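The preference score calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Bellerophon's actual code: the function names are hypothetical, and it assumes the two distance matrices are symmetric with zero diagonals, so that omitting sequence i removes one row and one column of deviations.

```python
def preference_scores(dm_left, dm_right):
    """Return (dme, scores) for two n x n distance matrices (lists of lists).

    Assumption: both matrices are symmetric with zero diagonals.
    """
    n = len(dm_left)
    # Per-sequence sums of absolute deviations (the stored row/column sums).
    row_sums = [
        sum(abs(dm_left[i][j] - dm_right[i][j]) for j in range(n))
        for i in range(n)
    ]
    dme = sum(row_sums)
    scores = []
    for i in range(n):
        # Omitting sequence i drops row i and column i of the deviations.
        dme_without_i = dme - 2.0 * row_sums[i]
        scores.append(dme / dme_without_i if dme_without_i > 0 else float("inf"))
    return dme, scores


def dme_naive(dm_left, dm_right, skip=None):
    """Recompute dme from scratch, optionally leaving out one sequence."""
    keep = [k for k in range(len(dm_left)) if k != skip]
    return sum(abs(dm_left[i][j] - dm_right[i][j]) for i in keep for j in keep)


# Toy example: sequence 3 sits close to sequence 0 on the left of the
# breakpoint and close to sequence 2 on the right, as a chimera would.
dm_left = [
    [0.00, 0.10, 0.20, 0.10],
    [0.10, 0.00, 0.20, 0.20],
    [0.20, 0.20, 0.00, 0.30],
    [0.10, 0.20, 0.30, 0.00],
]
dm_right = [
    [0.00, 0.15, 0.20, 0.30],
    [0.15, 0.00, 0.20, 0.20],
    [0.20, 0.20, 0.00, 0.10],
    [0.30, 0.20, 0.10, 0.00],
]
dme, scores = preference_scores(dm_left, dm_right)
suspect = scores.index(max(scores))
```

In this toy dataset the highest preference score belongs to sequence 3, and the shortcut gives the same dme[i] values as recomputing the deviation from scratch with `dme_naive`.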



What are the "Correction to use" options?

These are the algorithms used to calculate corrected pairwise distances between aligned sequences, which are then used for matrix comparisons and identification of putative chimeras (see above). Correction refers to the concept that the algorithm corrects for unseen changes in sequences over time and so reflects how sequences evolve more accurately than the observable differences between sequences alone (uncorrected). For detecting chimeric sequences, however, not all commonly used correction functions are appropriate. For example, 16S rDNA chimeras most frequently occur between highly related sequences, since a prematurely terminated amplicon (parent 1) is more likely to reanneal to a foreign template (parent 2) when the two are similar. In other words, the "distance of chimeric likelihood" between two sequences increases more rapidly at high sequence identities than it does at lower identities. This trend, however, is opposite to the trend described by commonly used correction functions such as the Jukes-Cantor correction, which led us to develop our own empirical distance function, modestly called the Huber-Hugenholtz correction:
[equation not recoverable from this text version: the Huber-Hugenholtz correction, giving d as a function of id]
where d is the corrected distance between two sequences with a fraction id of identically aligned bases. This correction strongly differentiates between highly similar sequences and puts less emphasis on the difference between remote homologs.
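To illustrate the trend that the text contrasts with, the sketch below compares the uncorrected distance with the standard Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p), where p = 1 - id. The function names are hypothetical, and the Huber-Hugenholtz correction itself is deliberately not implemented here because its formula is not given in this text.

```python
import math


def uncorrected_distance(identity):
    """Observed fraction of differing aligned positions."""
    return 1.0 - identity


def jukes_cantor_distance(identity):
    """Standard Jukes-Cantor corrected distance for nucleotide sequences."""
    p = 1.0 - identity
    if p <= 0.0:
        return 0.0
    if p >= 0.75:
        # The correction is undefined once sequences look random.
        return float("inf")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# How much the correction adds at high, medium and low identity.
extra = {
    identity: jukes_cantor_distance(identity) - uncorrected_distance(identity)
    for identity in (0.99, 0.90, 0.70)
}
```

The added distance grows as identity falls, so Jukes-Cantor stretches differences between remote sequences far more than differences between close relatives, which is the opposite of the emphasis described for chimera detection.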
Hypervariable regions are routinely removed from sequence alignments prior to phylogenetic analyses because they cannot be unambiguously aligned and therefore introduce noise into the analysis. We have implemented an automatic filter to remove variable regions from the alignment prior to calculating pairwise distances. Sequence positions (columns) that have nucleotides in less than 50% of all sequences are removed, and of the resulting filtered sequences only those with more residues than two times the window size are used in the analysis. Because many truncated 16S rDNA sequences in a dataset will heavily influence the filter mask, we advise removing all sequences shorter than 2*window size + 100. Note that the estimated breakpoint will likely not correspond directly to the absolute nucleotide position in the chimeric sequence because variable regions have been removed.
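The filter described above could look roughly like the following sketch. It is an illustration under stated assumptions rather than Bellerophon's implementation: the function name, the dictionary input format and the gap characters ("-" and ".") are all hypothetical choices.

```python
def filter_alignment(alignment, window_size, min_occupancy=0.5, gap_chars="-."):
    """Mask sparse columns, then drop sequences that are too short.

    alignment: dict mapping sequence name -> aligned sequence string
               (all strings must have the same length).
    Returns (kept_columns, filtered_alignment).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    # Keep columns with a nucleotide in at least 50% of the sequences.
    kept_columns = [
        c for c in range(length)
        if sum(s[c] not in gap_chars for s in seqs) >= min_occupancy * len(seqs)
    ]

    filtered = {}
    for name, s in zip(names, seqs):
        masked = "".join(s[c] for c in kept_columns)
        residues = sum(ch not in gap_chars for ch in masked)
        # Only sequences with more residues than 2 * window size are used.
        if residues > 2 * window_size:
            filtered[name] = masked
    return kept_columns, filtered


toy = {
    "a": "ACGT-ACGT",
    "b": "ACGT-ACGT",
    "c": "ACGT-ACGT",
    "d": "----TACGT",
}
kept_columns, filtered = filter_alignment(toy, window_size=3)
```

In the toy alignment, column 4 is dropped because only one of four sequences has a nucleotide there, and sequence "d" is dropped because it has only four residues left, fewer than the seven required for a window size of 3. Because columns are removed, positions in the filtered sequences no longer match positions in the original alignment, as noted in the text.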



What is the window size for?

The window size is the number of nucleotides on either side of the breakpoint used to calculate distance matrices. This is necessary because the alignment is scanned along its length for optimal break points (see How does Bellerophon work) and the fragments on either side of the breakpoint should be the same size to normalise calculations.
Ideally, the more characters (nucleotides) used, the more accurate the distance calculations, but the number of characters also dictates how close to either end of the alignment breakpoints can be detected. For instance, if the highest window size of 400 is selected, breakpoints can only be detected 400 characters or more from either end of the alignment.
The automatic filter (see What are the correction to use options) detects if a sequence is too short for a given window size and removes it from the dataset prior to analysis.
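The relationship between window size and detectable breakpoints can be sketched as follows. The function names and the convention that a breakpoint is expressed as the number of alignment columns to its left are assumptions made for illustration.

```python
def candidate_breakpoints(alignment_length, window_size):
    """Breakpoints with a full window of columns on both sides."""
    if alignment_length < 2 * window_size:
        return range(0)  # alignment too short for this window size
    return range(window_size, alignment_length - window_size + 1)


def flanking_fragments(sequence, breakpoint, window_size):
    """Equal-sized fragments immediately left and right of a breakpoint."""
    left = sequence[breakpoint - window_size:breakpoint]
    right = sequence[breakpoint:breakpoint + window_size]
    return left, right


# With a (filtered) alignment of 1500 columns and a window size of 400,
# breakpoints can only be tested between columns 400 and 1100.
points = candidate_breakpoints(1500, 400)
first, last = points[0], points[-1]
```

A smaller window widens the range of testable breakpoints but leaves fewer characters per fragment for each distance calculation.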



What is the Align sequences option?

This option (set to yes) will automatically align 100 or fewer unaligned sequences using ClustalW in preparation for running Bellerophon. Since multiple alignments are quite computationally intensive, we have limited datasets to 100 unaligned sequences.



How to read the results

An example of a Bellerophon output (explanations follow each output line in square brackets):

job title: micronodule clones
    [job title as submitted]
*** sequences are re-aligned using clustalw
    [sequences were ClustalW-aligned prior to being processed by Bellerophon; if this option was not activated the output would read: *** sequences are assumed to be aligned]
*** Huber-Hugenholtz correction used
    [the correction selected]
*** The chosen Window size is 300
    [the window size selected]
all chimeras reported below are worth checking manually even if their preference scores are close to 1.0
***** possible chimera 3
preference score: 1.23436
    [scores >1 suggest chimeras; results are ranked by this score. Preference scores are dataset dependent and can only be compared within a dataset. Scores even marginally greater than 1 may indicate real chimeras and should be checked.]
E.coli (chimera) break point : 1163 (1131)
    [most likely point of conversion, numbered according to the E. coli sequence; in brackets, the break point in the (filtered!) sequence, in this case MNA5]
% identities of parent sequences to chimera left and right of the break point
             frag. 1    frag. 2
parent 1:    100.0      90.7
parent 2:     89.3      98.7
    [sequence identities (%) between the chimeric fragments and the parent sequences before and after the breakpoint]
Sequence ID of chimera with parent sequences
chimera  =>gi|11093933|_MNA5
parent 1 =>gi|11093931|_MNE12
parent 2 =>gi|11093935|_MNC9
    [names of the putative chimeric sequence and parent sequences]




Advantages of Bellerophon:

1) You can process all sequences from a PCR-clone library in a single analysis and don't have to inspect outputs for every sequence in the dataset.

2) The approximate putative breakpoint is calculated using a sliding window (see above), which helps with manual verification of the chimera.

3) A chimeric sequence is not only tested against two (putative) parent sequences but rather is assessed by how well it fits into the complete phylogenetic environment of a multiple sequence alignment. Hence sequences do not become invisible to the program as is the case with CHIMERA_CHECK (see Ref 1 below).

4) The calculations Bellerophon uses to detect chimeric sequences are computationally relatively cheap, and results are calculated quickly for datasets with up to 50 sequences (~1 min). Larger datasets take longer: 100 sequences ~30 min, 300 sequences ~8 hours.



Tips for using Bellerophon:

1) Bellerophon works most efficiently if the parent sequences, or non-chimeric sequences closely related to the parent sequences, are present in the dataset analyzed. Therefore, as many sequences as possible from a single PCR-clone library should be included in the analysis, since the parent sequences of any chimera are most likely to be in that dataset. Adding non-chimeric outgroup sequences (e.g. from isolates) may help refine an analysis by providing reference points (and a broader phylogenetic context) (see Ref. 1), but be aware that analysis time increases with bigger datasets.

2) Bellerophon is compromised by using sequences of different lengths as this can produce artificial skews in distance matrices of fragments of the alignment. Datasets containing sequences of the same length and covering the same portion of the gene should be used (usually not an issue with sequences from a PCR-clone library). The filter will automatically remove sequences too short for the window size (see window size), i.e. <600 bp for a window size of 300.

3) If possible, multiple window sizes should be used, as the number of identified chimeras can vary with the choice of window size.

4) Re-running the dataset without the first reported chimeras may identify additional putative chimeras by reducing noise in the analysis. Ideally, the dataset should continue to be re-run removing previously reported chimeras until no chimeras are identified.

5) Bellerophon should be used in concert with other detection methods such as CHIMERA_CHECK, and putatively identified chimeras should always be confirmed by manual inspection of the sequences for signature shifts (see Ref. 1 below).
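The re-running strategy in tip 4 can be sketched as a simple loop. This is an illustrative sketch only: `detect` stands for any hypothetical function that takes a dataset and returns the names of the putative chimeras it reports (for example, a wrapper around one Bellerophon run); no such function is provided by Bellerophon itself.

```python
def rerun_to_extinction(sequences, detect, max_rounds=50):
    """Repeat detection, removing reported chimeras, until none are found.

    sequences: dict mapping sequence name -> sequence
    detect:    hypothetical callable returning names of putative chimeras
    Returns (remaining_sequences, removed_names_in_order).
    """
    remaining = dict(sequences)
    removed = []
    for _ in range(max_rounds):
        reported = detect(remaining)
        if not reported:
            break  # no chimeras reported: stop re-running
        for name in reported:
            if name in remaining:
                del remaining[name]
                removed.append(name)
    return remaining, removed


def toy_detect(seqs):
    """Stand-in detector: reports one 'chim*' sequence per run."""
    hits = sorted(name for name in seqs if name.startswith("chim"))
    return hits[:1]


dataset = {"chimA": "ACGT", "s1": "ACGT", "chimB": "ACGT", "s2": "ACGT"}
clean, removed = rerun_to_extinction(dataset, toy_detect)
```

The `max_rounds` limit is a safeguard added for the sketch; each round corresponds to one full analysis, so total run time grows with the number of rounds. Removed sequences should still be confirmed manually, as recommended in tip 5.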


Intended upgrades to Bellerophon:

1) Orientation checker: to confirm that all 16S rRNA sequences are submitted in the right orientation and to reverse complement any sequences that fail this check prior to running Bellerophon.
2) Automatic rerunning to extinction (0 detected chimeras) option: to give the user the option of implementing tip 4 automatically, keeping in mind that this may significantly increase the job processing time!

If you have any suggestions for extensions, we would love to hear about them. Please contact us.




Reference to cite:

T. Huber, G. Faulkner and P. Hugenholtz. Bellerophon: a program to detect chimeric sequences in multiple sequence alignments. Bioinformatics (2004) 20, 2317-2319.



Additional information:

Manually verified putative 16S rDNA chimeras as of March 2003

P. Hugenholtz and T. Huber. Chimeric 16S rDNA sequences of diverse origin are accumulating in the public databases. Int J Syst Evol Microbiol (2003) 53 289-293.

Hot off the press: Microbiology Today Editor Meriel Jones selected our Chimeric sequence detection paper as one of the "papers in current issues of the Society's journals which highlight new and exciting developments in microbiological research".

Bellerophon library: A selection of published work that has used Bellerophon