Sampling and Approximation in the Context of RNA Secondary Structure Prediction - Algorithms and Studies Based on Stochastic Context-Free Modeling

Schulz, Anika

Sampling and Approximation in the Context of RNA Secondary Structure Prediction - Algorithms and Studies Based on Stochastic Context-Free Modeling

Predicting secondary structures of RNA molecules is one of the fundamental problems of and thus a challenging task in computational structural biology. Existing prediction methods basically use the dynamic programming principle and are either based on a general thermodynamic model or on a specific probabilistic model, traditionally realized by a stochastic context-free grammar. To date, the applied grammars were rather simple and small and despite the fact that statistical approaches have become increasingly appreciated over the past years, a corresponding sampling algorithm based on a stochastic RNA structure model has not yet been devised. In addition, basically all popular state-of-the-art tools for computational structure prediction have the same worst-case time and space requirements of O(n^3) and O(n^2) for sequence length n, limiting their applicability for practical purposes due to the often quite large sizes of native RNA molecules. Accordingly, the prime demand imposed by biologists on computational prediction procedures is to reach a reduced waiting time for results that are not significantly less accurate. We here deal with all of these issues, by describing algorithms and performing comprehensive studies that are based on sophisticated stochastic context-free grammars of similar complexity as those underlying thermodynamic prediction approaches, where all of our methods indeed make use of the concept of sampling. We also employ the approximation technique known from theoretical computer science in order to reach a heuristic worst-case speedup for RNA folding. Particularly, we start by describing a way for deriving a sequence-independent random sampler for an arbitrary class of RNAs by means of (weighted) unranking. The resulting algorithm may generate any secondary structure of a given fixed size n in only O(n·log(n)) time, where the results are observed to be accurate, validating its practical applicability. With respect to RNA folding, we present a novel probabilistic sampling algorithm that generates statistically representative and reproducible samples of the entire ensemble of feasible structures for a particular input sequence. This method actually samples the possible foldings from a distribution implied by a suitable (traditional or length-dependent) grammar. Notably, we also propose several (new) ways for obtaining predictions from generated samples. Both variants have the same worst-case time and space complexities of O(n^3) and O(n^2) for sequence length n. Nevertheless, evaluations of our sampling methods show that they are actually capable of producing accurate (prediction) results. In an attempt to resolve the long-standing problem of reducing the time complexity of RNA folding algorithms without sacrificing much of the accuracy of the results, we invented an innovative heuristic statistical sampling method that can be implemented to require only O(n^2) time for generating a fixed-size sample of candidate structures for a given sequence of length n. Since a reasonable prediction can still efficiently be obtained from the generated sample set, this approach finally reduces the worst-case time complexity by a liner factor compared to all existing precise methods. Notably, we also propose a novel (heuristic) sampling strategy as opposed to the common one typically applied for statistical sampling, which may produce more accurate results for particular settings. A validation of our heuristic sampling approach by comparison to several leading RNA secondary structure prediction tools indicates that it is capable of producing competitive predictions, but may require the consideration of large sample sizes.
Die Entwicklung effizienter Algorithmen zur Vorhersage der Sekundärstruktur von RNAMolekülen ist eine der grundlegenden Problemstellungen aus dem Bereich der strukturellen Biologie und stellt eine anspruchsvolle Aufgabe dar. Aktuell existierende Vorhersage-Algorithmen nutzen die Methode der dynamischen Programmierung und basieren entweder auf einem allgemein gültigen thermodynamischen Modell oder auf einem speziellen probabilistischen Modell. Letzteres wird traditionellerweise über eine stochastische kontext-freie Grammatik realisiert. Bisher wurden zu diesem Zweck meist relativ einfache und übersichtliche Grammatiken betrachtet. Zudem wurde trotz der über die vergangenen Jahren stetig gewachsenen Akzeptanz und Wertschätzung für statistische Ansätze bislang noch kein Sampling-Algorithmus basierend auf einem stochastischen Modell der RNA-Struktur formuliert. Des Weiteren besitzen alle bekannten Vorhersage-Werkzeuge für eine beliebige Eingabe-Sequenz der Länge n die gleichen Zeitund Speicher-Komplexitäten von O(n^3) und O(n^2) im Worst-Case, was deren Anwendung in der Praxis aufgrund der oftmals großen Länge der zu betrachtenden RNA-Moleküle in einem gewissen Maße einschränkt. Entsprechend steht für Biologen eine geringere Wartezeit für die Berechnung der Ergebnisse im Vordergrund, wobei aber auch die Qualität nicht allzu sehr leiden sollte. All diese Aspekte werden hier adressiert, und zwar durch die Beschreibung von Algorithmen und die Durchführung umfangreicher Studien basierend auf ausgeklügelten stochastisch kontext-freien Grammatiken deren Komplexität in etwa der von thermodynamischen Ansätzen entspricht. Dabei wird in allen Fällen das Konzept des Sampling angewandt. Zusätzlich kommt die aus der Theoretischen Informatik bekannte Technik der Approximation zum Einsatz, um eine Beschleunigung der Worst-Case-Laufzeit von RNA-Strukturvorhersagen zu erreichen. Zu Beginn wird ein Sequenz-unabhängiger Random Sampler für beliebige RNA-Klassen basierend auf (gewichtetem) Unranking entwickelt. Der resultierende Algorithmus benötigt zum Erzeugen jeder beliebigen möglichen Sekundärstruktur einer gegebenen festen Größe n lediglich O(n·log(n)) Zeit. Die beobachteten Ergebnisse sind von hoher Qualität, was die praktische Anwendbarkeit bestätigt. Hinsichtlich der Vorhersage der RNA-Struktur wird zunächst ein neuer probabilistischer Sampling-Algorithmus präsentiert, der statistisch repräsentative und reproduzierbare Samples der Menge aller zulässigen Sekundärstrukturen konform zu einer gegebenen Eingabe-Sequenz erzeugt. Diese Methode zieht die möglichen Faltungen anhand einer durch eine geeignete Grammatik induzierten Verteilung. Insbesondere werden auch einige (neuartige) Verfahren zur Berechnung von Vorhersagen aus zuvor erzeugten Samples vorgestellt. Beide Varianten benötigen O(n^3) Zeit und O(n^2) Speicher für Sequenz-Länge n und können qualitativ hochwertige (Vorhersage-)Ergebnisse liefern. Dies wird belegt durch umfangreiche Evaluationen. Zur Reduzierung der Laufzeit von Algorithmen zur Vorhersage der RNA-Struktur ohne übermäßigen Qualitätsverlust wird eine innovative heuristische Methode für statistisches Sampling entwickelt, welche lediglich O(n^2) Zeit zum Sampeln einer festen Anzahl möglicher Strukturen für eine gegebene Sequenz der Länge n benötigt. Da eine sinnvolle Vorhersage effizient aus einer beliebigen Sample-Menge berechnet werden kann, wird so im Vergleich zu allen existierenden präzisen Methoden die Worst-Case-Laufzeit um einen linearen Faktor reduziert. Insbesondere wird auch eine (heuristische) Alternative zu der üblicherweise für statistisches Sampling verwendeten Sampling-Strategie eingeführt. Eine Validierung des entwickelten heuristischen Sampling-Ansatzes durch Vergleich mit mehreren gängigen Vorhersage-Werkzeugen deutet darauf hin, dass die Vorhersagen konkurrenzfähig sind, aber die Betrachtung größerer Sample-Mengen nötig sein kann.

Metadaten
Author:	Anika Schulz
URN:	urn:nbn:de:hbz:386-kluedo-33250
Advisor:	Markus Nebel, Rolf Backofen
Document Type:	Doctoral Thesis
Language of publication:	English
Date of Publication (online):	2012/10/12
Year of first Publication:	2012
Publishing Institution:	Technische Universität Kaiserslautern
Granting Institution:	Technische Universität Kaiserslautern
Acceptance Date of the Thesis:	2012/10/12
Date of the Publication (Server):	2012/10/16
Tag:	computational biology; secondary structure prediction
GND Keyword:	Bioinformatik; Molekulare Bioinformatik
Page Number:	410
Faculties / Organisational entities:	Kaiserslautern - Fachbereich Informatik
CCS-Classification (computer science):	J. Computer Applications / J.3 LIFE AND MEDICAL SCIENCES / Biology and genetics (REVISED)
DDC-Cassification:	0 Allgemeines, Informatik, Informationswissenschaft / 004 Informatik
MSC-Classification (mathematics):	68-XX COMPUTER SCIENCE (For papers involving machine computations and programs in a specific mathematical area, see Section {04 in that areag 68-00 General reference works (handbooks, dictionaries, bibliographies, etc.) / 68Uxx Computing methodologies and applications / 68U99 None of the above, but in this section
	68-XX COMPUTER SCIENCE (For papers involving machine computations and programs in a specific mathematical area, see Section {04 in that areag 68-00 General reference works (handbooks, dictionaries, bibliographies, etc.) / 68Wxx Algorithms (For numerical algorithms, see 65-XX; for combinatorics and graph theory, see 05C85, 68Rxx) / 68W20 Randomized algorithms
	68-XX COMPUTER SCIENCE (For papers involving machine computations and programs in a specific mathematical area, see Section {04 in that areag 68-00 General reference works (handbooks, dictionaries, bibliographies, etc.) / 68Wxx Algorithms (For numerical algorithms, see 65-XX; for combinatorics and graph theory, see 05C85, 68Rxx) / 68W25 Approximation algorithms
Licence (German):	Standard gemäß KLUEDO-Leitlinien vom 10.09.2012

Sampling and Approximation in the Context of RNA Secondary Structure Prediction - Algorithms and Studies Based on Stochastic Context-Free Modeling

Download full text files

Export metadata

Additional Services