Background: Given n strings s1,...,sn each of length l and a nonnegative integer d, the Closest String problem asks to find a center string s such that none of the input strings has Hamming distance greater than d from s. Finding a common pattern in many -- but not necessarily all -- input strings is an important task that plays a role in many applications in bioinformatics.
Results: Although the Closest String model is robust to the oversampling of strings in the input, it is severely affected by the existence of outliers. We propose a refined model, the Closest String with Outliers (CSWO) problem, to overcome this limitation. This new model asks for a center string s that is within Hamming distance d of at least n-k of the n input strings, where k is a parameter describing the maximum number of outliers. A CSWO solution not only provides the center string as a representative for the set of strings but also reveals the outliers of the set.
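To make the CSWO definition above concrete, the following minimal Python sketch checks whether a candidate center satisfies it and reports the outliers it implies. The function names (hamming, outliers, is_cswo_solution, brute_force_cswo) are illustrative assumptions and do not come from the paper; the exhaustive search is exponential in the string length and is not one of the fixed-parameter algorithms reported here.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def outliers(center, strings, d):
    """Indices of input strings farther than d from the candidate center."""
    return [i for i, s in enumerate(strings) if hamming(center, s) > d]


def is_cswo_solution(center, strings, d, k):
    """True if center is within distance d of at least n-k input strings."""
    return len(outliers(center, strings, d)) <= k


def brute_force_cswo(strings, d, k):
    """Exhaustive search over all candidate centers (illustration only).

    Candidates are built from symbols that occur in the input; a symbol
    that occurs nowhere in the input can never reduce a Hamming distance.
    Returns (center, outlier_indices), or None if no solution exists.
    """
    length = len(strings[0])
    alphabet = sorted(set("".join(strings)))
    for candidate in product(alphabet, repeat=length):
        center = "".join(candidate)
        out = outliers(center, strings, d)
        if len(out) <= k:
            return center, out
    return None
```

For instance, with the hypothetical input ["ACGT", "ACGA", "ACCT", "TTTT"] and d = 1, the string "ACGT" is a valid center once k = 1 outlier is allowed (the outlier being "TTTT"), whereas no center exists for k = 0.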
We provide fixed-parameter algorithms for CSWO when d and k are parameters, for both bounded and unbounded alphabets. We also show that when the alphabet is unbounded the problem is W[1]-hard with respect to n-k, l, and d.
Conclusions: Our refined model abstractly models finding common patterns in several but not all input strings. We initiate the study of the computability of this model and show that it is sensitive to different parameterizations. Lastly, we conclude by suggesting several open problems which warrant further investigation.
Christina Boucher and Bin Ma. Closest String with Outliers. BMC Bioinformatics, 12(Suppl 1):S55, 2011.