Review of Karin Verspoor’s talk at the Rocky Mountain Bioinformatics Conference, Sunday, Dec. 2, 2007
“Integrating Semantic Data Sources for Disease-implicated Gene Discovery”

Abstract: Recent analysis of breast and colorectal cancer tumors through joint genomic and informatic filtering of 816,986 putative nucleotide changes across 13,023 genes yielded a surprising 198 allelic variants directly implicated in the cancers (Sjoblom et al, 2006). This greatly expanded previous estimates on the scope and complexity of the cancer genome, even though the candidate list was generated primarily by sequence-based analyses. We suggest that the integration of semantically-rich metadata from biomedical ontologies and the biological literature can be used to discover other candidate disease-implicated genes, to organize and better understand the commonalities and differences among identified disease-implicated genes, and to rank genes for further investigation. We have developed a methodology based on the mathematical technique of Formal Concept Analysis (FCA) that enables integration of semantic and empirical data, and exposes the structure of that data. FCA takes data objects and their properties and builds a concept lattice based on shared properties, and vice versa, naturally integrating across empirical data and a priori ontological constructs. We report on some preliminary analysis of the Sjoblom dataset using FCA and find that simply organizing the data into a concept lattice provides some important insight into properties implicit in that data that suggest a role in disease. The resulting FCA concept lattices represent ontological subsumption hierarchies derived from data.

Review: Verspoor et al. use Formal Concept Analysis (FCA), a technique from lattice theory, together with biomedical ontological metadata to discover disease-implicated genes, starting from the allelic variants directly implicated in breast and colorectal cancers, and to rank specific candidate genes for further investigation. The Gene Ontology (GO) supplies the vocabulary for the ontological metadata, which can then be represented graphically as partially ordered sets. FCA takes data objects and their properties and builds a concept lattice based on shared properties; i.e., FCA produces a semantic hierarchy derived from relational data that clusters nodes in ontological space. Their computational method integrates a priori ontological (semantic) constructs with empirical genomic sequencing data in order to expose the structure of that data. The resulting concept lattices represent ontological hierarchies and point to properties in the data that suggest a role in disease.
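To make the lattice construction concrete, here is a minimal sketch of FCA on a toy context. It is not the authors' implementation: the gene names, the annotation labels, and the helper names (shared_properties, objects_having, formal_concepts, is_subconcept) are all hypothetical, and the brute-force closure is only practical for very small inputs.

```python
# Toy formal context (hypothetical): objects are genes, properties are
# GO-style annotations. These are placeholders, not data from the talk.
context = {
    "geneA": {"kinase activity", "cell cycle"},
    "geneB": {"kinase activity", "apoptosis"},
    "geneC": {"cell cycle", "apoptosis"},
    "geneD": {"kinase activity", "cell cycle", "apoptosis"},
}


def shared_properties(objects, context, all_properties):
    """Properties common to every object (all properties if none are given)."""
    shared = set(all_properties)
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)


def objects_having(properties, context):
    """Objects that carry every property in `properties`."""
    return frozenset(o for o, props in context.items() if properties <= props)


def formal_concepts(context):
    """Return all formal concepts as (extent, intent) pairs of frozensets."""
    all_properties = set().union(*context.values())
    # The intents are exactly the intersections of object property sets,
    # plus the full property set (the intersection over no objects).
    intents = {frozenset(all_properties)}
    for props in context.values():
        intents |= {frozenset(props) & intent for intent in intents}
    return {(objects_having(intent, context), intent) for intent in intents}


def is_subconcept(lower, upper):
    """Lattice order: a concept is below another if its extent is a subset."""
    return lower[0] <= upper[0]


concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(extent), "<->", sorted(intent))
```

Each printed pair is one node of the concept lattice: a set of genes together with exactly the properties they all share. The subset order on extents gives the subsumption hierarchy described above.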

In effect, this practical methodology serves as a decision-support system for justifying subsequent (and costly!) genomic assays and experiments. The authors successfully applied their method to one cancer dataset, and the result was displayed in three dimensions using the SpindelViz software. I am curious what the proper dimension is for representing the entire GO network.

The presentation proposed the idea of lattice attribute metrics as simple Euclidean distances, but I think there are much more interesting ways to measure node distances within the lattice network. One idea would be to relate lattice nodes as producers and consumers within the dynamic network, or as information (entropy) repositories.
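As one possible reading of the Euclidean proposal (an assumption on my part, since the talk's exact definition is not recorded here), each lattice node's property set can be encoded as a 0/1 vector and two nodes compared by ordinary Euclidean distance. The property list and the function names below are hypothetical.

```python
import math

# Hypothetical, fixed ordering of the properties used to build the vectors.
attributes = ["apoptosis", "cell cycle", "kinase activity"]


def intent_vector(intent, attributes):
    """Encode a set of properties as a 0/1 vector over `attributes`."""
    return [1.0 if a in intent else 0.0 for a in attributes]


def euclidean_distance(intent_a, intent_b, attributes):
    """Euclidean distance between the 0/1 encodings of two property sets."""
    va = intent_vector(intent_a, attributes)
    vb = intent_vector(intent_b, attributes)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))


d = euclidean_distance(
    {"kinase activity", "cell cycle"},
    {"kinase activity", "apoptosis"},
    attributes,
)
print(d)
```

For 0/1 vectors this distance is just the square root of the number of properties on which the two nodes differ, so it ignores the lattice order entirely. That is part of why alternative measures seem worth exploring.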

FCA lattices do not appear to be related to Bayesian networks, although they are visually similar. FCA components are discrete elements that are related ontologically, whether through text analysis or some other form of statistical data mining.

Last modified 4 December 2007 at 9:44 am by dgnabasik