dGnabasik's Assignment 14

Thesis Abstract for David Gnabasik
“Computational Proteomics”

The experimental and statistical analysis of proteomic data is beset by significant challenges. The process of defining any proteome, the expressed protein complement of a biological compartment including all isoforms, requires careful rethinking and management of hypothesis generation, specific study design, the modeling of noise, the extreme dimensionality of data, sample selection and handling, the costs of experimental verification and validation, the nature of statistical inference, and computational integration.

The crucial question is how to detect group differences optimally given experimental heterogeneity and large measurement errors. What is needed is a decision-support system that generates the most statistically likely scientific hypotheses and subsequent research directions for proteomic experiments, particularly for discovery studies. In other words, how does one design a proteomics study that verifies and validates a “scientific hunch”?

Proteomics is only as good as the quality of its biological samples and its experimental study design, which is the key to reducing systematic bias. The starting hypotheses and study design constraints must accommodate small sample sizes with large variability and noise in hierarchically structured datasets, such as mass-to-charge ratios within averaged spectra. The standard feature extraction/selection approach and subsequent sensitivity analysis of sample group differences tend to produce overfitted models with more parameters than data points.

An intelligent, semi-automated assistant is proposed that performs information-theoretic analysis of the experimental process as a set of time-dependent, noisy communication channels. This assistant:
• uses a Bayesian statistical strategy to define a biomarker discovery model with unknown parameters;
• provides hypothesis building using inductive logic programming (ILP);
• leverages and integrates lab-specific, historical, experimental results;
• models data with multilevel techniques to reduce error probabilities;
• maintains a persistent Bayesian inference chain that conditionally models the data collection process itself and adjusts hypothetical probability distributions given new evidence by a sequential filtering approach;
• manages stratified sampling strategies of subgroups and hierarchical models;
• visually performs minimal adequate modeling through Formal Concept Analysis and the Bayesian Information Criterion (BIC);
• manages both statistical significance and confounding factors in terms of precision, accuracy, sensitivity, specificity, noise model, and controls.
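
The sequential filtering idea in the list above, adjusting hypothesis probabilities as each new piece of evidence arrives, can be illustrated with a minimal sketch. It is not the proposed assistant: it assumes just two competing hypotheses about a measured group difference (H0: mean 0, H1: mean 1), Gaussian measurement noise with a known standard deviation, and made-up observations. The function names and all numeric values are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def sequential_update(prior, observations, means, sd):
    """Apply Bayes' rule once per observation.

    prior        -- starting probability of each hypothesis
    observations -- measurements, processed in arrival order
    means        -- expected measurement under each hypothesis (assumed)
    sd           -- measurement noise standard deviation (assumed known)

    Returns the posterior over hypotheses after each observation.
    """
    posterior = list(prior)
    history = []
    for x in observations:
        # Weight each hypothesis by how well it explains the new measurement.
        unnormalized = [p * normal_pdf(x, m, sd)
                        for p, m in zip(posterior, means)]
        total = sum(unnormalized)
        # Renormalize; this posterior becomes the prior for the next step.
        posterior = [u / total for u in unnormalized]
        history.append(posterior)
    return history


# Illustrative values only: H0 "no group difference" vs H1 "difference of 1".
history = sequential_update(
    prior=[0.5, 0.5],
    observations=[0.9, 1.2, 0.7, 1.1],
    means=[0.0, 1.0],
    sd=1.0,
)
for step, (p_h0, p_h1) in enumerate(history, start=1):
    print(f"after observation {step}: P(H0)={p_h0:.3f}  P(H1)={p_h1:.3f}")
```

Each posterior serves as the prior for the next measurement, so evidence accumulates without reprocessing earlier data. A realistic version would replace the two fixed hypotheses with a hierarchical model and unknown parameters, as the list describes.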

Last modified 12 December 2007 at 11:23 am by dgnabasik