CSB2010 Grammar String: a Novel ncRNA Secondary Structure Representation

Grammar String: a Novel ncRNA Secondary Structure Representation

Rujira Achawanantakun, Seyedeh Shohreh Takyar, Yanni Sun*

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA. yannisun@cse.msu.edu

Proc LSS Comput Syst Bioinform Conf. August, 2010. Vol. 9, p. 2-13. Full-Text PDF

*To whom correspondence should be addressed.

Multiple ncRNA alignment has important applications in homologous ncRNA consensus structure derivation, novel ncRNA identification, and known ncRNA classification. As many ncRNAs' functions are determined by both their sequences and secondary structures, accurate ncRNA alignment algorithms must maximize both sequence and structural similarity simultaneously, incurring high computational cost. Faster secondary structure modeling and alignment methods using trees, graphs, probability matrices have thus been developed. Despite promising results from existing ncRNA alignment tools, there is a need for more efficient and accurate ncRNA secondary structure modeling and alignment methods. In this work, we introduce grammar string, a novel ncRNA secondary structure representation that encodes an ncRNA's sequence and secondary structure in the parameter space of a context-free grammar (CFG). Being a string defined on a special alphabet constructed from a CFG, it converts ncRNA alignment into sequence alignment with O(n ² ) complexity. We align hundreds of ncRNA families from BraliBase 2.1 using grammar strings and compare their consensus structure with Murlet using the structures extracted from Rfam as reference. Our experiments have shown that grammar string based multiple sequence alignment competes favorably in consensus structure quality with Murlet. Source codes and experimental data are available at http://www.cse.msu.edu/~yannisun/grammar-string.

[ CSB2010 Conference Home Page ] .... [ CSB2010 Online Proceedings ] .... [ Life Sciences Society Home Page ]