In cancer research, gene expression data from patients is often analyzed together with associated pathological, outcome, or other clinical data. Integrating this data creates a powerful and valuable source of information.
While many expression datasets are publicly available, the corresponding clinical outcome data is often hard to find or cannot easily be linked to the deposited microarray experiments.
Looking in detail at specific cancer datasets in the Gene Expression Omnibus (GEO), it becomes clear that the organization of the supplementary data is not standardized.
In fact, the main information supplied by the repository is the gene expression, while all other data is attached in separate files or in the metadata provided with the authors' details.
In the latter case, because clinical information is annotated in so many different ways, a purely systematic approach is difficult to apply. Nevertheless, we can construct a database that associates each custom clinical term with the corresponding standard term, so that the right information can be looked up for each specific dataset.
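As an illustration of this term-association idea, the following minimal Python sketch maps dataset-specific annotation keys onto a shared vocabulary. The dataset identifiers, custom keys, and standard term names are all hypothetical placeholders chosen for the example; they are not taken from GEO or from the actual database schema.

```python
# Hypothetical lookup table: dataset id -> {custom term -> standard term}.
# All identifiers and term names below are illustrative assumptions.
TERM_MAP = {
    "DATASET_A": {
        "os_months": "overall_survival_time",
        "death": "overall_survival_event",
    },
    "DATASET_B": {
        "t.rfs": "relapse_free_survival_time",
        "e.rfs": "relapse_free_survival_event",
    },
}


def standardize(dataset_id, sample_annotations):
    """Translate one sample's custom annotation keys to standard terms.

    Keys without a registered mapping for this dataset are dropped.
    """
    mapping = TERM_MAP[dataset_id]
    return {
        mapping[key]: value
        for key, value in sample_annotations.items()
        if key in mapping
    }


sample = {"os_months": "34", "death": "1", "age": "50"}
standardized = standardize("DATASET_A", sample)
```

With such a table, only the per-dataset mapping has to be curated by hand; the lookup itself is the same for every dataset.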
Using this simple concept, we searched the GEO database for cancer datasets and found over 20 experiments to which this approach is applicable.
The discovery of this availability led us to develop an open-source, web-oriented platform that gathers cancer datasets and registers the corresponding information in a standardized database.
The Ocelot project (On-line Cancer Expression Tool) applies these ideas and provides clinical information for various cancer datasets, making on-the-fly analysis possible. This includes Kaplan-Meier estimation and ROC curves, which integrate follow-up and response data with the corresponding gene expression.
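To make the Kaplan-Meier step concrete, here is a minimal Python sketch of the standard product-limit estimator applied to standardized follow-up data. It is not the Ocelot implementation; the function name and the toy follow-up values are assumptions made purely for illustration.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up duration for each patient
    events: 1 if the event was observed, 0 if the patient was censored
    Returns a list of (time, survival probability) at each event time.
    """
    observed = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # events and censorings at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Toy follow-up data (illustrative values only).
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

To relate survival to gene expression, one option would be to split patients into groups, for example above and below the median expression of a gene, and estimate one curve per group.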