Overview of DisTEMIST at BioASQ: Automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources


There is a pressing need for advanced semantic annotation technologies of medical content, in particular medical publications, clinical trials and clinical records. Search engines and information retrieval systems require semantic annotation and indexing systems to support more advanced user search queries. Considering the relevance of disease concepts for clinical coding, automated processing of clinical trials and even patents, it is critical to provide access to high quality manually annotated documents labelled by clinicians for the development and benchmarking of disease mention recognition and grounding tools. This is particularly important for medical content beyond English, where even fewer annotated corpora have been released. To address these issues, we have organized the DisTEMIST (DISease TExt MIning Shared Task) track at BioASQ 2022. It represents the first community effort to evaluate and promote the development of resources for automatic detection and normalization of disease mentions from clinical case documents in Spanish. For this track we have released the DisTEMIST corpus, a collection of 1000 clinical case documents carefully selected by clinicians and annotated manually by a team of healthcare professionals following annotation guidelines and quality control analysis for consistency. Disease mentions were exhaustively mapped by these experts to their corresponding SNOMED CT concept identifiers. Moreover, we have created additional multilingual Silver Standard versions of the corpus for 7 languages (English, Portuguese, French, Italian, Romanian, Catalan and Galician), as well as mention normalization cross-mappings to 4 additional highly used terminologies. We received 38 systems or runs from 9 teams, obtaining very competitive results. 
Most participants implemented sophisticated AI approaches, mainly deep learning algorithms based on pretrained transformer-like language models (BERT, BETO, RoBERTa, etc.), with a classifier layer for named entity recognition and embedding distance metrics for entity linking. Finally, some initial explorations of applicability and adaptation of disease taggers trained on the DisTEMIST corpus to different clinical records (discharge summaries, radiology reports and emergency records) were performed. DisTEMIST corpus: https://doi.org/10.5281/zenodo.6408476
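
As a rough illustration of the "embedding distance" idea mentioned for entity linking, the sketch below picks the candidate concept whose vector is closest to a mention vector by cosine similarity. It is not any participant's actual system: the function names, the placeholder concept identifiers (`CONCEPT_A`, `CONCEPT_B`) and the three-dimensional toy vectors are all assumptions made for illustration. In a real pipeline the vectors would come from a pretrained language model and the identifiers would be SNOMED CT codes.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_mention(mention_vec, concept_vecs):
    """Return (concept_id, similarity) for the concept closest to the mention."""
    best_id, best_sim = None, -1.0
    for concept_id, vec in concept_vecs.items():
        sim = cosine_similarity(mention_vec, vec)
        if sim > best_sim:
            best_id, best_sim = concept_id, sim
    return best_id, best_sim


# Hypothetical toy data: placeholder IDs and hand-written vectors,
# standing in for terminology concepts and model-produced embeddings.
concept_vecs = {
    "CONCEPT_A": [1.0, 0.0, 0.0],
    "CONCEPT_B": [0.0, 1.0, 0.0],
}
mention_vec = [0.9, 0.1, 0.0]

best_id, best_sim = link_mention(mention_vec, concept_vecs)
print(best_id)  # the mention vector lies closest to CONCEPT_A
```

With a full terminology, the exhaustive loop would typically be replaced by an approximate nearest-neighbour index, but the ranking criterion stays the same.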

Working Notes of Conference and Labs of the Evaluation (CLEF) Forum
Luis Gasco
NLP Research Engineer