|
Text mining MEDLINE abstracts
Description
The program parses MEDLINE abstracts for "informative" words that co-occur with each other.
Informative words exclude common words such as "a", "the", and "an". Informative words that are found
to be adjacent in the text should be recognized as conserved phrases and therefore also define
a single term. For example, "p53" and "cancer" will likely show up in the same abstract a statistically enriched
number of times but probably not next to each other. "Protein kinase" would be an example of a conserved term composed
of adjacent informative words. Given the co-occurrence information, particular key words or phrases (terms)
can be queried for their significantly associated terms.
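The idea of dropping uninformative words and spotting adjacent informative ones can be sketched as follows. This is a minimal Python illustration, not one of the scripts named in this description. The stop-word list, the tokenizing pattern, and the function names informative_tokens and adjacent_pairs are all assumptions made for the example. Sentence boundaries are ignored here, so a real implementation would probably also break pairs at punctuation.

```python
import re

# Assumed, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "to", "with"}

def informative_tokens(text):
    """Lowercase the text and return its word tokens in order.

    Stop words are replaced by None so that adjacency is not
    bridged across a removed word.
    """
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [None if w in STOP_WORDS else w for w in words]

def adjacent_pairs(tokens):
    """Return pairs of informative words that are directly adjacent."""
    return [(a, b) for a, b in zip(tokens, tokens[1:]) if a and b]

tokens = informative_tokens("The protein kinase is a target in cancer")
pairs = adjacent_pairs(tokens)
print(pairs)
```

Pairs that recur across many abstracts would then be candidates for conserved phrases such as "protein kinase".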
Input and output
The program may be given a file containing MEDLINE accession numbers or multiple MEDLINE abstract files as input on the command line.
Your script(s) could read or write files like the following:
- MEDLINE accessions
- MEDLINE abstracts
- term occurrence table
- statistical results for a specific term
and parameters like the following, used to find associated, co-occurring terms:
--term <word or term>
There is a gzipped text file downloaded from PubMed here
that can be used as input. You can also create or download your own data from PubMed.
Output should be summarized in a table containing term pairs, their frequencies, and their over- or under-representation statistics (see below).
Examples of program(s) execution:
build_table.pl medline_acc.txt occurrence_table.tab
analyze_table.pl occurrence_table.tab --term "prostate cancer"
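The command-line shape of the second example can be sketched as below. This is a hedged Python illustration of the interface only: the positional argument name table and the help texts are assumptions, and only the --term parameter comes from this description.

```python
import argparse

def build_parser():
    """Build a parser mirroring the analyze_table example (hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Query a term occurrence table for co-occurring terms."
    )
    # Assumed positional argument: the table written by the build step.
    parser.add_argument("table", help="tab-delimited term occurrence table")
    # The --term parameter is the one named in this description.
    parser.add_argument("--term", required=True, help="word or phrase to query")
    return parser

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["occurrence_table.tab", "--term", "prostate cancer"])
print(args.table, args.term)
```

Quoting the term on the command line keeps a multi-word phrase together as a single argument.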
Statistical analysis
The program will perform some statistical analysis on the input in order to both discover informative terms and find
the frequently co-occurring terms. An informative term is a word or phrase that does not occur too frequently per
abstract. You should set a frequency threshold, based on average occurrence per abstract (or possibly
average occurrence per term), to determine which terms are informative and will be analyzed for co-occurrences.
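One way to apply such a threshold is sketched below in Python. The function names, the cutoff value max_avg=1.0, and the toy abstracts are assumptions chosen for illustration. A suitable cutoff would have to be tuned on real data.

```python
from collections import Counter

def average_occurrence(abstracts):
    """Map each word to its mean number of occurrences per abstract."""
    totals = Counter()
    for words in abstracts:
        totals.update(words)
    n = len(abstracts)
    return {word: count / n for word, count in totals.items()}

def informative_terms(abstracts, max_avg=1.0):
    """Keep words whose average occurrence per abstract is at most max_avg.

    max_avg is an assumed cutoff, not a value given in this description.
    """
    averages = average_occurrence(abstracts)
    return {word for word, avg in averages.items() if avg <= max_avg}

# Toy input: each abstract is already split into words.
abstracts = [
    ["the", "p53", "gene", "the", "cancer"],
    ["the", "kinase", "the", "the", "cancer"],
]
terms = informative_terms(abstracts, max_avg=1.0)
print(sorted(terms))
```

In this toy input "the" averages 2.5 occurrences per abstract and is dropped, while the other words stay at or below the cutoff.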
Given the frequency of 2 terms and the number of terms in an abstract, assuming independence, you can calculate an expectation
for the number of times these 2 terms will co-occur in an abstract. If you take the log of the ratio of observed
co-occurrence to expected co-occurrence, you have a log-likelihood (LLH) score for the term pair. An LLH > 0 means
the term pair is over-represented (observed exceeds expected); an LLH < 0 means it is under-represented.
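The score can be illustrated with a small Python sketch. It uses a simplified, abstract-level independence model, which is an assumption and not necessarily the model intended above: each term is counted once per abstract, and the expected number of abstracts containing both terms is n_a * n_b / n_abstracts. The natural logarithm and the function name llh_score are also assumptions.

```python
import math

def llh_score(n_abstracts, n_a, n_b, n_ab):
    """Log of observed over expected co-occurrence for one term pair.

    n_abstracts: total number of abstracts
    n_a, n_b:    abstracts containing term A and term B respectively
    n_ab:        abstracts containing both terms (observed co-occurrence)
    """
    # Expected co-occurrence if the two terms were independent.
    expected = n_a * n_b / n_abstracts
    if n_ab == 0:
        # The terms never co-occur, so the log ratio is unbounded below.
        return float("-inf")
    return math.log(n_ab / expected)

# Hypothetical counts: 1000 abstracts, 50 with A, 40 with B, 10 with both.
score = llh_score(1000, 50, 40, 10)
print(round(score, 3))
```

With these made-up counts the expected co-occurrence is 2 abstracts and the observed is 10, so the score is positive and the pair would count as over-represented.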
You will probably want to write more than one script depending on the task as illustrated above.
|