Christina Boucher
**University of Florida**

E301 CSB

PO Box 116120

Gainesville, FL 32611

E301 CSB

PO Box 116120

Gainesville, FL 32611

Christina Boucher and James King.

Background: Improving the accuracy and efficiency of motif recognition is an important computational challenge that has application to detecting transcription factor binding sites in genomic data. Closely related to motif recognition is the Consensus String decision problem that asks, given a parameter d and a set of l-length strings S = {s1,...,sn}, whether there exists a consensus string that has Hamming distance at most d from any string in S. A set of strings S is pairwise bounded if the Hamming distance between any pair of strings in S is at most 2d. It is trivial to determine whether a set is pairwise bounded, and a set cannot have a consensus string unless it is pairwise bounded. We use Consensus String to determine whether or not a pairwise bounded set has a consensus. Unfortunately, Consensus String is NP-complete. The lack of an efficient method to solve the Consensus String problem has caused it to become a computational bottleneck in MCL-WMR, a motif recognition program capable of solving difficult motif recognition problem instances.

Results: We focus on the development of a method for solving Consensus String quickly with a small probability of error. We apply this heuristic to develop a new motif recognition program, sMCL-WMR, which has impressive accuracy and efficiency. We demonstrate the performance of sMCL-WMR in detecting weak motifs in large data sets and in real genomic data sets, and compare the performance to other leading motif recognition programs. In our preliminary discussion of our Consensus String algorithm we give insight into the issue of sampling pairwise bounded sets, and discuss its relevance to motif recognition.

Conclusions: Our novel heuristic gives birth to a state of the art program, sMCL-WMR, that is capable of detecting weak motifs in data sets with a large number of strings. sMCL-WMR is orders of magnitude faster than its predecessor MCL-WMR and is capable of solving previously unsolved synthetic motif recognition problems. Lastly, sMCL-WMR shows impressive accuracy in detecting transcription factor binding sites in the genomic data and used in the assessment of Tompa et al.

Christina Boucher and James King. Fast motif recognition via application of statistical thresholds. BMC Bioinformatics, 11(Suppl 1):S11, 2010.