66
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Background

          The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field.

          Methods

          In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present, (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations.

          Results

          This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (applying strict match) to 0.92 (considering a relaxed match) while accepting partial overlaps and hierarchically related semantic types. The extrinsic evaluation, when applying the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of annotations, with the systems achieving results that were consistent with the agreement scores.

          Conclusion

          The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.

          Related collections

          Most cited references54

          • Record: found
          • Abstract: not found
          • Article: not found

          The Measurement of Observer Agreement for Categorical Data

            Bookmark
            • Record: found
            • Abstract: not found
            • Article: not found

            A few useful things to know about machine learning

              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              NCBI disease corpus: a resource for disease name recognition and concept normalization.

              Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information, however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment where we compared three different knowledge-based disease normalization methods with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/. Published by Elsevier Inc.
                Bookmark

                Author and article information

                Contributors
                kunkaweb@gmail.com
                anacpeters@gmail.com
                adalnizapucca@gmail.com
                carolgebeluca21@gmail.com
                yohan.bonescki@pucpr.edu.br
                miemukai@hotmail.com
                ribeiro.carvalho@pucpr.br
                sadidhasan@gmail.com
                c.moro@pucpr.br
                Journal
                J Biomed Semantics
                J Biomed Semantics
                Journal of Biomedical Semantics
                BioMed Central (London )
                2041-1480
                8 May 2022
                8 May 2022
                2022
                : 13
                : 13
                Affiliations
                [1 ]GRID grid.412522.2, ISNI 0000 0000 8601 0541, Health Technology Program, , Pontifical Catholic University of Paraná, ; Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901 Brazil
                [2 ]GRID grid.417285.d, AI Lab, Philips Research North America, ; Cambridge, MA USA
                Article
                269
                10.1186/s13326-022-00269-1
                9080187
                35527259
                4626728a-3985-460f-9aea-35a2c0325273
                © The Author(s) 2022

                Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

                History
                : 18 March 2020
                : 12 April 2022
                Funding
                Funded by: Philips Research North America
                Award ID: NLP for Portuguese clinical texts
                Award Recipient :
                Funded by: FundRef http://dx.doi.org/10.13039/501100002322, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior;
                Award ID: Finance Code 001
                Award Recipient :
                Categories
                Research
                Custom metadata
                © The Author(s) 2022

                Bioinformatics & Computational biology
                natural language processing,semantic annotation,clinical narratives,corpora,gold standard

                Comments

                Comment on this article