Current Research Projects

The total sequencing output of the research community was about 200 million base pairs in 1998. In January 2003, the United States Department of Energy Joint Genome Institute alone sequenced 1.5 billion bases in one month. In 2011, it was reported that the amount of sequence data produced by the Beijing Genomics Institute (BGI) was equivalent to over 2,000 human genomes a day. It is a critical time in scientific history; better computational tools are desperately needed to handle the massive amount of biological data that is currently being generated. The New York Times describes the urgency of tool development in bioinformatics: “The field of genomics is caught in a data deluge. DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore’s law… The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data”.

This rapid increase in the ability to generate, sequence, and store vast amounts of data has given rise to many new computational problems. One of the goals of my research is to design algorithmic solutions and software tools for the assembly and analysis of genomic sequence data. I frequently seek opportunities to work closely with biological and medical researchers at each stage of the tool development process, from problem formulation to the end use of the tools.

Tools for Genome Assembly and Analysis

Since the discovery of DNA as the basic unit of heredity, significant effort has been focused on the automated determination of the sequence of nucleotides corresponding to a sample of DNA, a process referred to as genome sequencing. The technology this process relies on is the sequencing platform, which accepts a collection of biological (DNA) samples and produces a set of reads from the samples. A read is a string over the alphabet {A, C, G, T} that represents the sequence of nucleotides in a sample. The general approach to genome sequencing is as follows: multiple copies of the DNA are extracted from the cell, each copy of the DNA is cut into smaller fragments, each fragment is sequenced by a sequencing platform to produce a read, and the reads are assembled into large segments of the genome. The final step is a computational problem referred to as fragment assembly. The output of fragment assembly is a set of contiguous sequences called contigs.
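As a rough illustration of fragment assembly, the sketch below greedily merges the pair of reads with the longest suffix-prefix overlap until no sufficiently long overlap remains. This is a textbook toy heuristic, not the method used in the tools described on this page; the function names `overlap` and `greedy_assemble` and the `min_len` threshold are assumptions made for the example, and the sketch ignores sequencing errors, reverse complements, repeats, and reads contained entirely within other reads.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if that length is below `min_len` (hypothetical threshold)."""
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Toy greedy assembler: repeatedly merge the two sequences with the
    longest overlap; stop when no overlap of at least `min_len` is left."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_pair = 0, None
        for i, j in permutations(range(len(contigs)), 2):
            k = overlap(contigs[i], contigs[j], min_len)
            if k > best_len:
                best_len, best_pair = k, (i, j)
        if best_pair is None:
            return contigs  # whatever remains are the contigs
        i, j = best_pair
        merged = contigs[i] + contigs[j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Three overlapping reads drawn from one short made-up sequence.
    reads = ["ACGTTGCA", "TTGCAAGG", "AAGGCTA"]
    print(greedy_assemble(reads))
```

Reads that share no sufficiently long overlap are left as separate contigs, which is why real assemblies typically yield many contigs rather than a single sequence.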

For more information please visit the following project pages:

 

Algorithms for Biological Sequence Analysis

Biological sequence data are commonly treated as strings over finite alphabets, which raises fundamental questions about searching and extracting information from string data. This abstraction is particularly prevalent for DNA and RNA molecules. I study sequence problems in theoretical computer science that abstractly model biological processes or analyses.
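To make the string abstraction concrete, here is a minimal sketch that checks whether a string is over the DNA alphabet and reports every position, including overlapping ones, at which a pattern occurs. The names `DNA_ALPHABET`, `is_dna`, and `find_occurrences` are illustrative assumptions and do not come from any particular project; the search simply relies on Python's built-in `str.find`.

```python
DNA_ALPHABET = frozenset("ACGT")


def is_dna(s: str) -> bool:
    """True if every character of `s` belongs to the alphabet {A, C, G, T}."""
    return set(s) <= DNA_ALPHABET


def find_occurrences(text: str, pattern: str) -> list:
    """Return the 0-based start positions of all (possibly overlapping)
    occurrences of `pattern` in `text`."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping matches are also found.
        start = text.find(pattern, start + 1)
    return positions


if __name__ == "__main__":
    sequence = "ATATAT"
    print(is_dna(sequence), find_occurrences(sequence, "ATA"))
```

Index structures such as suffix arrays answer the same kind of query far more efficiently on genome-scale strings; the linear scan here is only meant to show the problem statement.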

Please visit the project page for more information.