Sequential Result Refinement for Searching the Biomedical Literature

Author: Len Y. Tanaka, MD

Primary Advisor: Jorge R Herskovic, MD, MS (co-author)

Committee Members: M. S. Iyengar, PhD (co-author); Elmer V Bernstam, MD, MSE, MS (co-author)

Master's thesis, The University of Texas School of Health Information Sciences at Houston.


Information overload is a problem for users of MEDLINE, the premier database of biomedical literature, which indexes over 16 million articles. A variety of techniques have been developed to retrieve high-quality or important articles, and many of the effective ones rely on citation information. Unfortunately, citation information is proprietary and expensive, and it suffers from “citation lag”: recently published articles have not yet had time to accumulate citations. MEDLINE users have a variety of information needs. Although some users require high recall, many are looking for a “few good articles” on a topic; for these users, precision is much more important than recall. We present and evaluate a method for identifying highly cited articles using information available at, or close to, the time of publication. The method uses an indexing term score based on Medical Subject Headings (MeSH) terms, journal impact factor, and number of authors. It can filter large MEDLINE result sets (>1,000 articles) returned by actual user queries to produce small result sets rich in highly cited articles. We compare the prevalence of highly cited articles in the result sets before and after filtering and show that the method significantly enriches result sets with highly cited articles.
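The filtering and evaluation described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say how the three features are combined, so the sequential thresholds, their values, the direction of the author-count cutoff, the field names, and the toy records below are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Article:
    pmid: str
    mesh_score: float     # hypothetical indexing-term score derived from MeSH terms
    impact_factor: float  # journal impact factor
    n_authors: int        # number of authors
    highly_cited: bool    # evaluation label; not available at publication time


def refine(articles, min_mesh=0.5, min_if=3.0, min_authors=3):
    """Apply three filters in sequence; all thresholds are assumed values."""
    stage = [a for a in articles if a.mesh_score >= min_mesh]
    stage = [a for a in stage if a.impact_factor >= min_if]
    stage = [a for a in stage if a.n_authors >= min_authors]
    return stage


def prevalence(articles):
    """Fraction of articles in the set that are labeled highly cited."""
    if not articles:
        return 0.0
    return sum(a.highly_cited for a in articles) / len(articles)


# Toy result set (fabricated for illustration, not data from the thesis).
sample = [
    Article("1", 0.9, 10.0, 6, True),
    Article("2", 0.8, 5.0, 4, True),
    Article("3", 0.7, 1.0, 5, False),
    Article("4", 0.2, 8.0, 2, False),
    Article("5", 0.6, 4.0, 1, False),
    Article("6", 0.6, 3.5, 3, False),
]

refined = refine(sample)
print(f"before: {prevalence(sample):.2f}, after: {prevalence(refined):.2f}")
```

On this toy set the filters keep three of the six articles, and the share of highly cited articles rises from one third to two thirds. A real evaluation would use result sets from actual user queries and citation counts obtained after the fact.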