NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

Wang, Kanix; Stevens, Robert; Alachram, Halima; Li, Yu; Soldatova, Larisa N; King, Ross; Ananiadou, Sophia; Schoene, Annika M.; Li, Maolin; Christopoulou, Fenia; Ambite, José Luis; Matthew, Joel; Garg, Sahil; Hermjakob, Ulf; Marcu, Daniel; Sheng, Emily; Beißbarth, Tim; Wingender, Edgar; Galstyan, Aram; Gao, Xin; Chambers, Brendan; Pan, Weidi; Khomtchouk, Bohdan B.; Evans, James A. and Rzhetsky, Andrey. 2021. NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding. npj Systems Biology and Applications, 7(1), 38. ISSN 2056-7189 [Article]

s41540-021-00200-x.pdf - Published Version
Available under License Creative Commons Attribution.

Abstract or Description

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades, the most dramatic advances in MR have followed in the wake of critical corpus development. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new, Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction, and; (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Item Type:

Article

Identification Number (DOI):

https://doi.org/10.1038/s41540-021-00200-x

Additional Information:

This work was funded by the DARPA Big Mechanism program under ARO contract W911NF1410333, by National Institutes of Health grants R01HL122712, 1P50MH094267, K12HL143959 (BBK), and U01HL108634-01, and by a gift from Liz and Kent Dauten. Additional support came from King Abdullah University of Science and Technology (KAUST), award numbers FCS/1/4102-02-01, FCC/1/1976-26-01, REI/1/0018-01-01, and REI/1/4473-01-01.

The datasets generated and/or analyzed during the current study are available in the GitHub repository at https://github.com/arzhetsky/Chicago_corpus. NERO in OWL format is available at https://bioportal.bioontology.org/ontologies/NERO.

We also deployed a package called NERO-nlp for researchers interested in diving deeper into our annotated corpus; the installation guides and scripts are available online at https://pypi.org/project/NERO-nlp and https://github.com/Bohdan-Khomtchouk/NERO-nlp, respectively.

Keywords:

Diseases, Software

Departments, Centres and Research Units:

Computing

Dates:

Accepted: 17 September 2021
Published: 20 October 2021

Item ID:

30934

Date Deposited:

20 Dec 2021 16:30

Last Modified:

20 Dec 2021 16:30

Peer Reviewed:

Yes, this version has been peer-reviewed.

URI:

https://research.gold.ac.uk/id/eprint/30934
