Ranking the whole MEDLINE database according to a large training set using text indexing
- Publication type:
- Journal article
- Metadata:
- Authors
- BP Suomela
- MA Andrade
- Author URL
- https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=fis-test-1&SrcAuth=WosAPI&KeyUT=WOS:000228242600001&DestLinkType=FullRecord&DestApp=WOS_CPL
- DOI
- 10.1186/1471-2105-6-75
- External identifiers
- Clarivate Analytics Document Solution ID: 914QM
- PubMed Identifier: 15790421
- ISSN
- 1471-2105
- Journal
- BMC BIOINFORMATICS
- Article number
- ARTN 75
- Publication date
- 2005
- Status
- Published
- Title
- Ranking the whole MEDLINE database according to a large training set using text indexing
- Sub types
- Article
- Journal volume
- 6
Data source: Web of Science (Lite)
- Other metadata sources:
- Authors
- Brian P Suomela
- Miguel A Andrade
- DOI
- 10.1186/1471-2105-6-75
- eISSN
- 1471-2105
- Issue
- 1
- Journal
- BMC Bioinformatics
- Language
- en
- Article number
- 75
- Online publication date
- 2005
- Status
- Published online
- Publisher
- Springer Science and Business Media LLC
- Publisher URL
- http://dx.doi.org/10.1186/1471-2105-6-75
- Date of data capture
- 2024
- Title
- Ranking the whole MEDLINE database according to a large training set using text indexing
- Journal volume
- 6
Data source: Crossref
- Abstract
- Background: The MEDLINE database contains over 12 million references to scientific literature, with about 3/4 of recent articles including an abstract of the publication. Retrieval of entries using keyword queries is useful for human users who need to obtain small selections. However, particular analyses of the literature or database developments may need the complete ranking of all the references in the MEDLINE database by their relevance to a topic of interest. This report describes a method that does this ranking using the differences in word content between MEDLINE entries related to a topic and the whole of MEDLINE, in a computational time appropriate for an article search query engine.
- Results: We tested the capabilities of our system to retrieve MEDLINE references which are relevant to the subject of stem cells. We took advantage of the existing annotation of references with terms from the MeSH hierarchical vocabulary (Medical Subject Headings, developed at the National Library of Medicine). A training set of 81,416 references was constructed by selecting entries annotated with the MeSH term stem cells or some child in its subtree. Frequencies of all nouns, verbs, and adjectives in the training set were computed, and the ratios of word frequencies in the training set to those in the entire MEDLINE were used to score references. Self-consistency of the algorithm, benchmarked with a test set containing the training set and an equal number of references randomly selected from MEDLINE, was better using nouns (79%) than adjectives (73%) or verbs (70%). The evaluation of the system with 6,923 references not used for training, containing 204 articles relevant to stem cells according to a human expert, indicated a recall of 65% for a precision of 65%.
- Conclusion: This strategy appears to be useful for predicting the relevance of MEDLINE references to a given concept. The method is simple and can be used with any user-defined training set. Choice of the part of speech of the words used for classification has important effects on performance. Lists of words, scripts, and additional information are available from the web address http://www.ogic.ca/projects/ks2004/.
- Addresses
- Ontario Genomics Innovation Centre, Ottawa Health Research Institute, 501 Smyth Rd, Ottawa, Ontario K1H 8L6, Canada. bsuomela@ohri.ca
- Authors
- Brian P Suomela
- Miguel A Andrade
- DOI
- 10.1186/1471-2105-6-75
- eISSN
- 1471-2105
- External identifiers
- PubMed Identifier: 15790421
- PubMed Central ID: PMC1274266
- Open access
- true
- ISSN
- 1471-2105
- Journal
- BMC bioinformatics
- Keywords
- Stem Cells
- Humans
- False Positive Reactions
- Bayes Theorem
- Language
- Computational Biology
- Algorithms
- Natural Language Processing
- Database Management Systems
- User-Computer Interface
- Databases, Bibliographic
- Abstracting and Indexing
- Vocabulary, Controlled
- Subject Headings
- Information Storage and Retrieval
- MEDLINE
- Databases, Protein
- Information Systems
- Databases as Topic
- Language
- eng
- Medium
- Electronic
- Online publication date
- 2005
- Open access status
- Open Access
- Pagination
- 75
- Publication date
- 2005
- Status
- Published
- Publisher licence
- CC BY
- Date of data capture
- 2005
- Title
- Ranking the whole MEDLINE database according to a large training set using text indexing.
- Sub types
- Research Support, Non-U.S. Gov't
- research-article
- Journal Article
- Journal volume
- 6
Files
https://bmcbioinformatics.biomedcentral.com/counter/pdf/10.1186/1471-2105-6-75
https://europepmc.org/articles/PMC1274266?pdf=render
Data source: Europe PubMed Central
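The scoring idea described in the abstract above (ratios of word frequencies in a topic training set to those in the whole of MEDLINE) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the regular-expression tokenizer (used here in place of part-of-speech tagging), the use of document frequencies, and the averaging of log ratios are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase word splitting stands in for the
    # noun/verb/adjective selection mentioned in the abstract.
    return re.findall(r"[a-z]+", text.lower())


def doc_frequencies(docs):
    """Fraction of documents in which each word occurs at least once."""
    counts = Counter()
    for text in docs:
        counts.update(set(tokenize(text)))
    total = len(docs)
    return {word: count / total for word, count in counts.items()}


def word_weights(training_docs, background_docs):
    """Log ratio of training-set frequency to background frequency.

    Only words seen in the training set receive a weight; the background
    is assumed to contain the training documents, as MEDLINE contains
    the topic subset.
    """
    train = doc_frequencies(training_docs)
    background = doc_frequencies(background_docs)
    weights = {}
    for word, f_train in train.items():
        f_bg = background.get(word)
        if f_bg:
            weights[word] = math.log(f_train / f_bg)
    return weights


def score(text, weights):
    """Average weight of the distinct weighted words in a reference."""
    known = [weights[w] for w in set(tokenize(text)) if w in weights]
    return sum(known) / len(known) if known else 0.0


def rank(docs, weights):
    """Order references from most to least topic-like."""
    return sorted(docs, key=lambda d: score(d, weights), reverse=True)
```

Under these assumptions, a reference whose words are over-represented in the training set gets a positive score, and a reference with no weighted words scores zero. Words that occur only outside the training set are simply ignored here; a fuller treatment would need some smoothing for them.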
- Date of acceptance
- 2005
- Authors
- Brian P Suomela
- Miguel A Andrade
- Author URL
- https://www.ncbi.nlm.nih.gov/pubmed/15790421
- DOI
- 10.1186/1471-2105-6-75
- eISSN
- 1471-2105
- External identifiers
- PubMed Central ID: PMC1274266
- Journal
- BMC Bioinformatics
- Keywords
- Abstracting and Indexing
- Algorithms
- Bayes Theorem
- Computational Biology
- Database Management Systems
- Databases as Topic
- Databases, Bibliographic
- Databases, Protein
- False Positive Reactions
- Humans
- Information Storage and Retrieval
- Information Systems
- Language
- MEDLINE
- Natural Language Processing
- Stem Cells
- Subject Headings
- User-Computer Interface
- Vocabulary, Controlled
- Language
- eng
- Country
- England
- Pagination
- 75
- PII
- 1471-2105-6-75
- Publication date
- 2005
- Status
- Published online
- Date the record was made public
- 2006
- Title
- Ranking the whole MEDLINE database according to a large training set using text indexing.
- Sub types
- Journal Article
- Research Support, Non-U.S. Gov't
- Journal volume
- 6
Data source: PubMed
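The abstract in this record reports a recall of 65% at a precision of 65% on the 6,923-reference evaluation set. The two measures coincide whenever the number of retrieved references equals the number of relevant ones; whether the authors chose their cutoff this way is not stated in the record, so the sketch below treats it as an assumption. The function names are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def break_even(ranked_ids, relevant):
    """Evaluate a ranking at a cutoff equal to the number of relevant items.

    Assumption: ranked_ids is ordered from highest to lowest score.
    At this cutoff both denominators are equal, so precision == recall.
    """
    cutoff = len(set(relevant))
    return precision_recall(ranked_ids[:cutoff], relevant)
```

For instance, with 204 relevant references, a recall of 65% would correspond to roughly 133 relevant references retrieved; this count is arithmetic from the reported percentages, not a number given in the record.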
- Relationships:
- Owned by