Date of Original Version
Abstract or Description
Finding motifs is an important problem in computational biology. Our paper makes two major contributions to this problem. Firstly, we better characterize the types of problem instances that cannot be solved by most existing methods of finding motifs. Secondly, we introduce a different method, which is shown to succeed for various problem instances for which popular existing methods fail.
Most existing computational methods for finding motifs are based on the strong-signal model, in which only strong-signal sequences (i.e. those known to contain binding sites very similar to the motif) are considered as input, and weak-signal sequences (i.e. those that do not contain any substring similar to the motif) are disregarded.
Buhler and Tompa have studied the limitations of methods based on the strong-signal model. They characterized the problem instances for which the motif is unlikely to be found in terms of the number of input (strong-signal) sequences needed under the assumption that each input sequence contains exactly one binding site. They further gave a method to calculate the minimum number of input sequences required.
We re-characterize the limitations of the strong-signal model in terms of the minimum total number of binding sites, rather than the minimum number of strong-signal sequences, required in the input data set. To calculate this minimum, we represent a motif by a probability matrix instead of a string pattern. This new characterization is shown to be more general and realistic.
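To illustrate the difference between a string pattern and a probability matrix, the following minimal Python sketch scores candidate sites against a matrix in which each column is a distribution over the four bases. The matrix values, the uniform background, and the function names (`site_probability`, `log_odds`, `best_site`) are assumptions made for illustration; they are not taken from the paper, and the sketch does not reproduce the paper's calculation of the minimum number of binding sites.

```python
import math

# Hypothetical 4-position motif; each column is a distribution over A, C, G, T.
# These numbers are illustrative only and do not come from the paper.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Assumed uniform background distribution over bases.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def site_probability(site, matrix):
    """Probability that the matrix emits `site`, assuming independent positions."""
    if len(site) != len(matrix):
        raise ValueError("site length must equal motif width")
    prob = 1.0
    for base, column in zip(site, matrix):
        prob *= column[base]
    return prob


def log_odds(site, matrix, background):
    """Log2 ratio of motif probability to background probability for `site`."""
    bg = 1.0
    for base in site:
        bg *= background[base]
    return math.log2(site_probability(site, matrix) / bg)


def best_site(sequence, matrix, background):
    """Return (start, score) of the highest-scoring window in `sequence`."""
    width = len(matrix)
    scores = [
        (log_odds(sequence[i:i + width], matrix, background), i)
        for i in range(len(sequence) - width + 1)
    ]
    score, start = max(scores)
    return start, score
```

Unlike a string pattern, which either matches or does not, the matrix assigns a graded score to every window, so a site that deviates from the consensus at one position still receives a nonzero probability.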
Next, we introduce an energy-based model that is more general and realistic: it considers all available sequences (including weak-signal sequences), each with its own degree of binding strength to the transcription factors (as measured experimentally by observed color intensity). Because binding strength varies, our model can handle sequences ranging from those that contain more than one binding site to weak-signal sequences. By treating sequences with different degrees of binding strength differently, we develop a heuristic algorithm called EBMF (Energy-Based Motif Finding algorithm), which uses an EM-like approach to find motifs under our model. EBMF can find motifs in data sets that do not even contain the minimum number of binding sites derived above for the strong-signal model. Our algorithm compares favorably with the common motif-finding programs AlignACE and MEME, which are based on the strong-signal model. In particular, for some simulated and real data sets, our algorithm finds the motif when both AlignACE and MEME fail to do so.
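The abstract does not specify the details of EBMF, so the following Python sketch is not the authors' algorithm. It only illustrates the general idea of an EM-like update in which every sequence contributes in proportion to a weight standing in for its measured binding strength. The assumptions are ours: one candidate site per sequence, weights between 0 and 1, a strictly positive starting matrix, and the hypothetical names `em_step`, `window_likelihood`, and `consensus`.

```python
BASES = "ACGT"


def window_likelihood(window, matrix):
    """Probability of `window` under the matrix, assuming independent positions."""
    prob = 1.0
    for base, column in zip(window, matrix):
        prob *= column[base]
    return prob


def em_step(sequences, weights, matrix, pseudocount=0.1):
    """One weighted EM-like update of the motif matrix.

    E-step: for each sequence, spread responsibility over all start positions
    in proportion to the window likelihood under the current matrix.
    M-step: re-estimate each column from the responsibilities, scaled by the
    per-sequence weight (an assumed stand-in for binding strength).
    """
    width = len(matrix)
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for seq, weight in zip(sequences, weights):
        starts = range(len(seq) - width + 1)
        likes = [window_likelihood(seq[i:i + width], matrix) for i in starts]
        total = sum(likes)
        if total == 0:
            continue  # sequence too short or impossible under the matrix
        for i, like in zip(starts, likes):
            resp = weight * like / total
            for j in range(width):
                counts[j][seq[i + j]] += resp
    return [
        {b: col[b] / sum(col.values()) for b in BASES} for col in counts
    ]


def consensus(matrix):
    """Most probable base at each position of the matrix."""
    return "".join(max(col, key=col.get) for col in matrix)
```

A short usage example with made-up sequences and weights, in which the last sequence plays the role of a weak-signal sequence that is down-weighted rather than discarded:

```python
seqs = ["TTACGTT", "GACGTTT", "TTTTACG", "TTTTTTT"]
weights = [1.0, 1.0, 0.9, 0.1]
matrix = [
    {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
]
for _ in range(10):
    matrix = em_step(seqs, weights, matrix)
print(consensus(matrix))
```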