One of the fundamental problems in computational structural biology is the prediction of RNA secondary structures from a single sequence. To solve this problem, mainly two different approaches have been used over the past decades: the free energy minimization (MFE) approach which is still considered the most popular and successful method and the competing stochastic context-free grammar (SCFG) approach. While the accuracy of the MFE based algorithms is limited by the quality of underlying thermodynamic models, the SCFG method abstracts from free energies and instead tries to learn about the structural behavior of the molecules by training the grammars on known real RNA structures, making it highly dependent on the availability of a rich high quality training set. However, due to the respective problems associated with both methods, new statistics based approaches towards RNA structure prediction have become increasingly appreciated. For instance, over the last years, several statistical sampling methods and clustering techniques have been invented that are based on the computation of partition functions (PFs) and base pair probabilities according to thermodynamic models. A corresponding SCFG based statistical sampling algorithm for RNA secondary structures has been studied just recently. Notably, this probabilistic method is capable of producing accurate (prediction) results, where its worst-case time and space requirements are equal to those of common RNA folding algorithms for single sequences.
The aim of this work is to present a comprehensive study on how enriching the underlying SCFG by additional information on the lengths of generated substructures (i.e. by incorporating length-dependencies into the SCFG based sampling algorithm, which is actually possible without significant losses in performance) affects the reliability of the induced RNA model and the accuracy of sampled secondary structures. As we will see, significant differences with respect to the overall quality of generated sample sets and the resulting predictive accuracy are typically implied. In principle, when considering the more specialized length-dependent SCFG model as basis for statistical sampling, a higher accuracy of predicted foldings can be reached at the price of a lower diversity of generated candidate structures (compared to the more general traditional SCFG variant or sampling based on PFs that rely on free energies).
This article focuses on the analytical analysis of the free energy in a realistic model for RNA secondary structures. In fact, the free energy in a stochastic model derived from a database of small and large subunit ribosomal RNA (SSU and LSU rRNA) data is studied. A common thermody-namic model for computing the free energy of a given RNA secondary structure, as well as stochastic context-free grammars and generating functions are used to derive the desired results. These results include asymptotics for the expected free energy and for the corresponding variance of a random RNA secondary structure. The quality of our model is judged by comparing the derived results to the used database of SSU and LSU rRNA data. At the end of this article, it is discussed how our results could be used to help on identifying good predictions of RNA secondary structure.
As any RNA sequence can be folded in many different ways, there are lots of different possible secondary structures for a given sequence. Most computational prediction methods based on free energy minimization compute a number of suboptimal foldings and we have to identify the native structures among all these possible secondary structures. For this reason, much effort has been made to develop approaches for identifying good predictions of RNA secondary structure. Using the abstract shapes approach as introduced by Giegerich et al., each class of similar secondary structures is represented by one shape and the native structures can be found among the top shape representatives. In this article, we derive some interesting results answering enumeration problems for abstract shapes and secondary structures of RNA. We start by computing symptotical representations for the number of shape representations of length n. Our main goal is to find out how much the search space can be reduced by using the concept of abstract shapes. To reach this goal, we analyze the number of secondary structures and shapes compatible with an RNA sequence of length n under the assumption that base pairing is allowed between arbitrary pairs of bases analytically and compare their exponential growths. Additionally, we analyze the number of secondary structures compatible with an RNA sequence of length n under the assumptions that base pairing is allowed only between certain pairs of bases and that the structures meet some appropriate conditions. The exponential growth factors of the resulting asymptotics are compared to the corresponding experimentally obtained value as given by Giegerich et al.