- Sampling and Approximation in the Context of RNA Secondary Structure Prediction - Algorithms and Studies Based on Stochastic Context-Free Modeling (2012)
- Predicting secondary structures of RNA molecules is one of the fundamental problems of and thus a challenging task in computational structural biology. Existing prediction methods basically use the dynamic programming principle and are either based on a general thermodynamic model or on a specific probabilistic model, traditionally realized by a stochastic context-free grammar. To date, the applied grammars were rather simple and small and despite the fact that statistical approaches have become increasingly appreciated over the past years, a corresponding sampling algorithm based on a stochastic RNA structure model has not yet been devised. In addition, basically all popular state-of-the-art tools for computational structure prediction have the same worst-case time and space requirements of O(n^3) and O(n^2) for sequence length n, limiting their applicability for practical purposes due to the often quite large sizes of native RNA molecules. Accordingly, the prime demand imposed by biologists on computational prediction procedures is to reach a reduced waiting time for results that are not significantly less accurate. We here deal with all of these issues, by describing algorithms and performing comprehensive studies that are based on sophisticated stochastic context-free grammars of similar complexity as those underlying thermodynamic prediction approaches, where all of our methods indeed make use of the concept of sampling. We also employ the approximation technique known from theoretical computer science in order to reach a heuristic worst-case speedup for RNA folding. Particularly, we start by describing a way for deriving a sequence-independent random sampler for an arbitrary class of RNAs by means of (weighted) unranking. The resulting algorithm may generate any secondary structure of a given fixed size n in only O(n·log(n)) time, where the results are observed to be accurate, validating its practical applicability. With respect to RNA folding, we present a novel probabilistic sampling algorithm that generates statistically representative and reproducible samples of the entire ensemble of feasible structures for a particular input sequence. This method actually samples the possible foldings from a distribution implied by a suitable (traditional or length-dependent) grammar. Notably, we also propose several (new) ways for obtaining predictions from generated samples. Both variants have the same worst-case time and space complexities of O(n^3) and O(n^2) for sequence length n. Nevertheless, evaluations of our sampling methods show that they are actually capable of producing accurate (prediction) results. In an attempt to resolve the long-standing problem of reducing the time complexity of RNA folding algorithms without sacrificing much of the accuracy of the results, we invented an innovative heuristic statistical sampling method that can be implemented to require only O(n^2) time for generating a fixed-size sample of candidate structures for a given sequence of length n. Since a reasonable prediction can still efficiently be obtained from the generated sample set, this approach finally reduces the worst-case time complexity by a liner factor compared to all existing precise methods. Notably, we also propose a novel (heuristic) sampling strategy as opposed to the common one typically applied for statistical sampling, which may produce more accurate results for particular settings. A validation of our heuristic sampling approach by comparison to several leading RNA secondary structure prediction tools indicates that it is capable of producing competitive predictions, but may require the consideration of large sample sizes.