Abstract or Description
A molecule called a transcription factor usually binds to the promoter sequences of a set of coexpressed genes. As a result, these promoter sequences contain short substrings, called binding sites, with similar patterns. The motif discovery problem is to find these similar patterns, or motifs, in a set of sequences. Most existing algorithms find motifs from strong-signal sequences only (i.e., those containing binding sites very similar to the motif). In this paper, we represent a motif by a probability matrix and calculate the minimum total number of binding sites that the input dataset must contain in order to confirm that the discovered motifs are not artifacts. Next, we introduce a more general and realistic energy-based model, which considers all sequences with varying degrees of binding strength to the transcription factor (as measured experimentally). Based on this model, we develop a heuristic algorithm called EBMF (Energy-Based Motif Finding Algorithm) to find the motif; it can handle sequences ranging from those that contain more than one binding site to those that contain none. EBMF can find motifs even in datasets that do not have the minimum number of binding sites derived above. EBMF compares favorably with the common motif-finding programs AlignACE and MEME. In particular, for some simulated and real datasets, EBMF finds the motif when both AlignACE and MEME fail to do so.
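
As an illustration of what representing a motif by a probability matrix can mean, the following minimal sketch builds column-wise base frequencies from a few aligned binding sites and uses them to pick the most motif-like window in a sequence. It is not the EBMF algorithm and does not implement the energy-based model; the function names, the `pseudocount` parameter, and the example sites are all hypothetical and chosen only for this example.

```python
from collections import Counter

ALPHABET = "ACGT"


def probability_matrix(sites, pseudocount=0.0):
    """Return one {base: probability} dict per column of the aligned sites.

    `sites` is a list of equal-length DNA strings (hypothetical input).
    `pseudocount` is an assumed smoothing term added to every base count.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for j in range(length):
        counts = Counter(site[j] for site in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return matrix


def site_probability(matrix, site):
    """Probability of `site` under the matrix, assuming independent columns."""
    p = 1.0
    for column, base in zip(matrix, site):
        p *= column[base]
    return p


def best_window(matrix, sequence):
    """Return the window of `sequence` with the highest matrix probability."""
    width = len(matrix)
    windows = (sequence[i:i + width] for i in range(len(sequence) - width + 1))
    return max(windows, key=lambda window: site_probability(matrix, window))


# Hypothetical aligned binding sites, for illustration only.
sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
matrix = probability_matrix(sites)
print(site_probability(matrix, "ACGT"))
print(best_window(matrix, "GGACGTGG"))
```

Scoring by a plain product of column probabilities treats positions as independent, which is the usual simplification for such matrices; a non-zero `pseudocount` keeps an unseen base from forcing a window's probability to zero.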