Statistical RNA Secondary Structure Sampling Based on a Length-Dependent SCFG Model
One of the fundamental problems in computational structural biology is the prediction of RNA secondary structures from a single sequence. Over the past decades, two main approaches have been used to solve this problem: the minimum free energy (MFE) approach, which is still considered the most popular and successful method, and the competing stochastic context-free grammar (SCFG) approach. While the accuracy of MFE-based algorithms is limited by the quality of the underlying thermodynamic models, the SCFG method abstracts from free energies and instead tries to learn the structural behavior of the molecules by training the grammars on known real RNA structures, which makes it highly dependent on the availability of a rich, high-quality training set. Owing to the problems associated with each of these methods, new statistics-based approaches to RNA structure prediction have become increasingly appreciated. For instance, in recent years several statistical sampling methods and clustering techniques have been devised that are based on the computation of partition functions (PFs) and base pair probabilities according to thermodynamic models. A corresponding SCFG-based statistical sampling algorithm for RNA secondary structures has been studied only recently. Notably, this probabilistic method is capable of producing accurate predictions, while its worst-case time and space requirements are equal to those of common RNA folding algorithms for single sequences.
The aim of this work is to present a comprehensive study of how enriching the underlying SCFG with additional information on the lengths of generated substructures affects the reliability of the induced RNA model and the accuracy of sampled secondary structures. Such length-dependencies can be incorporated into the SCFG-based sampling algorithm without significant losses in performance. As we will see, doing so typically leads to significant differences in the overall quality of generated sample sets and in the resulting predictive accuracy. In principle, when the more specialized length-dependent SCFG model is used as the basis for statistical sampling, a higher accuracy of predicted foldings can be reached at the price of a lower diversity of generated candidate structures (compared to the more general traditional SCFG variant or to sampling based on PFs that rely on free energies).