Annotated corpus for traditional formula-disease relationships in biomedical articles


Lexical resources

In order to construct a corpus that captures the relationships between TF and diseases described in biomedical publications, it is crucial to have access to comprehensive vocabularies for both TF and diseases. However, descriptions of TF in the scientific literature vary substantially across authors, and no unified TF vocabulary currently accommodates this diversity. To address this challenge, we utilized a Traditional Korean Medicine (TKM) ontology21 covering a wide range of TKM terms, such as medicinal herbs, formulas, symptoms, and meridians. This ontology also establishes the interrelationships among these terms by drawing on authoritative TKM textbooks and classical medical texts. The TF vocabulary comprises 446 representative names and 922 synonyms. To gather a wide range of English transliterations for the TF, several databases were consulted, including OASIS, CNKI, Kampo DB, the Chinese Medicine Formulae Image Database, and TCM Wiki; these served as valuable resources for collecting diverse English expressions associated with the TF. Three TCM doctors conducted independent searches within these databases and cross-verified the results during weekly meetings. When they could not reach a consensus during these meetings, a TKM doctor made the final decision. Through this approach, we constructed a comprehensive TF vocabulary that encompasses the heterogeneous English expressions of TF encountered in the scholarly literature. As an example, the traditional formula 五積散 has been assigned several English transliterations, including ojeok-san, wuji-san, goseki-san, and goshaku-san. By consolidating these transliterations, we performed a comprehensive search on PubMed to identify relevant articles associated with TF.
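As a minimal sketch of this search step, the snippet below combines the 五積散 transliterations into a Boolean OR query and submits it to NCBI's public E-utilities `esearch` endpoint. The variable names and the `retmax` value are illustrative assumptions, not the project's actual retrieval code.

```python
import requests

# English transliterations of 五積散 collected in the TF vocabulary.
synonyms = ["ojeok-san", "wuji-san", "goseki-san", "goshaku-san"]

# Combine the transliterations into a single Boolean OR query.
query = " OR ".join(f'"{s}"' for s in synonyms)

# NCBI E-utilities esearch returns PMIDs matching the query.
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "pubmed", "term": query, "retmode": "json", "retmax": 100},
    timeout=30,
)
pmids = resp.json()["esearchresult"]["idlist"]
print(f"{len(pmids)} PMIDs retrieved, e.g. {pmids[:5]}")
```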

Next, for the disease vocabulary, we adopted the Comparative Toxicogenomics Database’s (CTD)22 MEDIC23 resource. MEDIC is constructed by combining Online Mendelian Inheritance in Man (OMIM), which provides comprehensive information on human genetic diseases, with the disease-category terms of the Medical Subject Headings (MeSH), which offer an efficient way to curate disease-related PubMed articles. As of March 2024, MEDIC encompasses over 13,000 concepts and 78,000 synonyms, and its vocabulary is updated monthly. The extensive MEDIC vocabulary is widely used in biomedical algorithm research. For example, DNorm24, a tool specifically designed for disease normalization in clinical notes, utilizes MEDIC and demonstrated exceptional performance in the 2013 ShARe/CLEF shared task. MEDIC is also employed for disease Named Entity Recognition (NER) within PubTator Central25, which provides automatic annotation tools for biomedical concepts, including genes and diseases, in both PubMed abstracts and PubMed Central full-text articles. Furthermore, MEDIC has been widely adopted in studies that construct corpora from PubMed abstracts22,26. Additionally, every disease term in MEDIC is assigned a corresponding MeSH Unique ID, which makes MEDIC an ideal disease vocabulary for creating a TFDR corpus: it enables seamless integration with articles indexed in PubMed and ensures precise, comprehensive disease term mapping. Therefore, in this study, CTD’s MEDIC was chosen as the disease vocabulary. However, MEDIC has certain limitations. Its polyhierarchical structure, in which a disease may appear in multiple branches with varying descendants, can complicate disease categorization. Additionally, while MEDIC’s extensive coverage is beneficial, it may lack representation for rare or emerging diseases, and variations in terminology across medical fields or regions can present further challenges.
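As a small illustration of how such a vocabulary is typically used downstream, the sketch below builds a case-insensitive synonym-to-identifier lookup from a simplified two-column (name, MeSH Unique ID) tab-separated file. The file name and two-column layout are assumptions for illustration; the actual MEDIC export from CTD carries additional fields such as definitions, synonyms, and tree numbers.

```python
import csv

def load_disease_lookup(path: str) -> dict[str, str]:
    """Build a case-insensitive disease name -> MeSH Unique ID lookup.

    Assumes a simplified tab-separated file with two columns per row:
    a disease name (or synonym) and its MeSH Unique ID.
    """
    lookup: dict[str, str] = {}
    with open(path, newline="", encoding="utf-8") as f:
        for name, mesh_id in csv.reader(f, delimiter="\t"):
            lookup[name.strip().lower()] = mesh_id.strip()
    return lookup

# Hypothetical usage with a file derived from CTD's MEDIC export.
diseases = load_disease_lookup("medic_names.tsv")
print(diseases.get("insulin-dependent diabetes mellitus"))
```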

Annotation workflow

The workflow for constructing the TFDR corpus is depicted in Fig. 1. Initially, search queries were generated by combining the English expressions of TF listed in the traditional formula vocabulary, as explained in the preceding section. Subsequently, a total of 5,763 abstracts were downloaded from the PubMed database. For the downloaded abstracts, TF mentions were automatically pre-annotated using a dictionary-based approach, while disease mentions were pre-annotated using the TaggerOne algorithm27, which employs semi-Markov models and relies on the comprehensive MEDIC vocabulary, provided through the PubTator API. Of the 5,763 abstracts, 4,095 contained both TF mentions and disease mentions. A subset of 740 abstracts was then randomly chosen for the subsequent annotation tasks. Although TaggerOne generally performs well, some limitations remain: its reliance on a predefined lexicon can lead to missed terms, especially rare or newly emerging terminology, and boundary inconsistencies within complex entity structures may affect normalization accuracy. As a result, some terms may have been overlooked during the initial identification. It should be noted that no manual validation was performed on the abstracts without identified mentions, leaving the possibility of undetected or misidentified entities. However, given TaggerOne’s robustness, the chance of significant omissions is considered low.
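For illustration, a minimal sketch of the two pre-annotation routes follows: a dictionary-based matcher for TF mentions and a call to the PubTator Central export service for the TaggerOne disease annotations. The toy dictionary, function names, and exact endpoint URL are assumptions based on the publicly documented API, not the project's actual code.

```python
import re
import requests

# Toy TF dictionary: surface form -> canonical name (assumed structure).
tf_dict = {"ojeok-san": "Ojeok-san", "wuji-san": "Ojeok-san"}

# Longest-first alternation so the maximum span wins, per the guidelines.
tf_pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(tf_dict, key=len, reverse=True)),
    flags=re.IGNORECASE)

def pre_annotate_tf(text):
    """Return (start, end, canonical name) spans for TF mentions."""
    return [(m.start(), m.end(), tf_dict[m.group().lower()])
            for m in tf_pattern.finditer(text)]

def fetch_pubtator(pmid):
    """Fetch pre-computed PubTator annotations (incl. TaggerOne diseases)."""
    url = ("https://www.ncbi.nlm.nih.gov/research/pubtator-api/"
           "publications/export/biocjson")
    return requests.get(url, params={"pmids": pmid}, timeout=30).json()

print(pre_annotate_tf("Effects of Ojeok-san on chronic low back pain."))
```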

Fig. 1. The workflow for TFDR corpus construction.

To construct the TFDR corpus, annotators independently annotated TF mentions, disease mentions, and key-sentences within PubMed abstracts. They also extracted relationships between TF and disease mentions from the key-sentences, because key-sentences provide direct evidence supporting the claimed relationships16. A total of six annotators participated in the construction of the TFDR corpus, all of whom were proficient in English and were experts in traditional medicine, holding medical degrees in the field. First, the annotators received training on the objectives of corpus construction, the annotation guidelines, and the annotation tools. To ensure a comprehensive understanding of the guidelines, a two-week training program was conducted using a subset of 40 pre-annotated abstracts carefully selected from the initial pool of 740. These 40 abstracts were specifically chosen to cover a wide range of TF and disease terms, key-sentence selections, and relation types, allowing the guidelines to be applied during training. Through this training period, the annotators familiarized themselves with the guidelines, ensuring a consistent and accurate annotation process. The team comprised graduate students and clinical practitioners with specialized expertise in traditional medicine and diseases, supporting the development of a high-quality corpus. The annotators were divided into two groups, with three annotators independently annotating each abstract. The annotation and curation process was conducted in four phases, as outlined below:

Phase 1. Initial annotation

In this phase, each group of three annotators engaged in independent annotation tasks on approximately 40 pre-annotated abstracts per week. The annotation process for each group encompassed around 350 abstracts and spanned approximately 10 weeks. The assignment of annotators to their respective groups remained unchanged throughout the corpus construction process.

Phase 2. Review meeting

Following the completion of annotation tasks for the assigned abstracts each week, the annotators within each group convened regular meetings to discuss and address any discrepancies observed in the annotation results. These meetings aimed to reach a consensus and derive solutions for resolving any discrepancies or ambiguities identified during the annotation process.

Phase 3. Annotation revision

In this phase, the annotators conducted an independent round of annotation based on the solutions derived from the review meetings. However, the acceptance of these proposed solutions was determined by each annotator independently, taking into account their own judgment and discretion.

Phase 4. Final curation

The final annotation outcome for each abstract was derived by integrating the independent annotation results from the three annotators. In cases where discrepancies occurred between the two rounds of annotation, the majority vote principle was applied to establish the final annotation decision.
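As a concrete illustration of Phase 4, the sketch below merges three annotators' decisions for the same item by majority vote, assuming each decision has already been reduced to a single categorical label. The representation is illustrative, not the project's actual curation script.

```python
from collections import Counter

def majority_vote(labels: list[str], quorum: int = 2) -> str | None:
    """Return the label chosen by at least `quorum` annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= quorum else None

# Three annotators label the relation in the same key-sentence.
votes = ["Treatment of Disease", "Treatment of Disease", "Association"]
print(majority_vote(votes))  # -> "Treatment of Disease"
```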

Annotation guidelines

The guidelines were developed to facilitate annotation workflows and ensure high inter-rater agreement for the creation of a high-quality TFDR corpus. They were drafted by analyzing the guidelines used in previous studies15,16,28 and adapting them to the objectives of this study, and were then refined through a series of tests in which they were applied to the annotation of article abstracts. After the two-week training program, a meeting was held with the annotators to make final revisions, addressing any gaps or inconsistencies in the guidelines. During the annotation and curation process, the guidelines were not modified further, in order to maintain consistency throughout the corpus. The TFDR corpus workflow included four annotation steps: TF annotation, disease annotation, key-sentence annotation, and relation annotation within the key-sentence. To support an efficient annotation process, we employed WebAnno29, a web-based tool designed for a variety of linguistic annotation tasks. The guidelines that annotators were required to follow are thoroughly documented in the supplementary materials; the main points are summarized as follows:

Annotation of traditional formula mention

  • Annotate the maximum span of traditional formula mentions. (e.g., “yukmi-jihwang-tang” rather than “jihwang-tang”)

  • Annotate all synonymous mentions. (e.g., Abbreviation definitions such as “Xiaochaihutang (XCHT)” are separated into two annotated mentions.)

  • Do not annotate product names or product numbers. (e.g., “TJ-41” is not annotated in the mention “Hocku-ekki-to (TJ-41)”.)

  • Do not annotate substances or medicinal herbs comprising the traditional formula.

Annotation of disease mention

  • Annotate the most specific disease mentions with maximum span. (e.g., “Insulin-dependent diabetes mellitus” rather than “diabetes mellitus”.)

  • Annotate all synonymous mentions, including abbreviations that can be inferred even when they are not explicitly defined in the abstract.

  • Annotate disease and symptom mentions separately when a symptom is induced by a disease. (e.g., “diabetes” and “cardiomyopathy” are annotated separately for the mention “diabetes-induced cardiomyopathy”.)

  • Annotate the mention representing the condition of the disease. (e.g., “severe dyspnea” rather than “dyspnea”, and “dry cough” rather than “cough”.)

  • Do not annotate mentions prefixed with “anti-”.

  • Do not annotate mentions of traditional medical diseases, except “Yang Deficiency” and “Yin Deficiency”. (e.g., liver-wind stirring syndrome, fluid-retention syndrome, and similar terms are not annotated.)

Annotation of key-sentence

  • A key-sentence is a condensed representation of the result or conclusion of the article. It must contain both traditional formula and disease mentions, and it may contain multiple relations between them.

  • Annotate the title as the key-sentence when no proper key-sentence exists in the abstract.

  • Do not annotate a sentence as a key-sentence if it merely refers to findings from previous studies.

Annotation of relations in the key-sentence

  • The Treatment of Disease relation captures the “treatment”, “alleviation”, or “prevention” effect of the traditional formula on the disease. (e.g., “Using this model, Shakuyaku-kanzo-to was shown to relieve paclitaxel-induced painful peripheral neuropathy.” [PMID: 18472288])

  • The Cause of Side-effect relation captures the “occurrence” or “exacerbation” effect of the traditional formula on the side-effect. (e.g., “A case of pneumonitis induced by Bofu-tsusho-san” [PMID: 12692947])

  • The Association relation is annotated when, despite the co-occurrence of traditional formula and disease mentions in the key-sentence, the “description” or “correlation” between them is either unclear or not explicitly stated. (e.g., “Immunologic examination of Juzentaiho-to (TJ-48) in postoperative gastric cancer” [PMID: 2730049])

  • The Negative relation captures the “ineffectiveness” of the traditional formula against the disease. (e.g., “However, given the high risk of bias among the trials, we could not conclude that YCWLP was beneficial to patients with hyperlipidemia.” [PMID: 27400466])

Based on the aforementioned guidelines, we present an annotated example abstract in Fig. 2. In this abstract, “Sishen Pill” and “SSP” are annotated as TF mentions, and the disease mentions “colitis”, “inflammatory bowel disease”, “IBD”, and “pathological damage to the colon” are annotated. The concluding statement, which concisely summarizes the abstract, is designated and annotated as the key-sentence. Within this sentence, the TF mention “SSP” has a Treatment of Disease relationship with the disease mention “colitis.”

Fig. 2. Example of an annotated abstract (PMID: 32754049) using WebAnno.
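To make the structure of such an annotated record concrete, the sketch below models the Fig. 2 example with plain Python dataclasses. All class and field names are illustrative assumptions; the released corpus format may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Mention:
    text: str
    label: str  # "TF" or "Disease"

@dataclass
class AnnotatedAbstract:
    pmid: str
    mentions: list[Mention] = field(default_factory=list)
    key_sentence: str = ""
    # (head mention, relation type, tail mention) triples.
    relations: list[tuple[str, str, str]] = field(default_factory=list)

example = AnnotatedAbstract(
    pmid="32754049",
    mentions=[Mention("Sishen Pill", "TF"), Mention("SSP", "TF"),
              Mention("colitis", "Disease"), Mention("IBD", "Disease")],
    relations=[("SSP", "Treatment of Disease", "colitis")],
)
print(example.relations)
```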

Disagreements

In the annotation results, we identified several instances of disagreement among annotators. According to our analysis, most disagreements arose in the following situations:

  • Missed identification of TF abbreviations (e.g., “HLJDD” for “Huang-lian-Jie-du decoction”).

  • Differences in the scope of disease identification (e.g., “Cholestasis” vs. “Variable Cholestasis”).

  • Recurring discrepancies between annotators when identifying key sentences (e.g., “Our study showed YGW administration effectively alleviated BCAA metabolic disorder and improved gut dysbiosis” vs. “This study provides support for YGW administration with benefits for allergic asthma”).

  • Frequent disagreement in distinguishing between Treatment of Disease and Association when establishing the relationship between TF and diseases (e.g., in “Dachengqi decoction may promote the recovery of intestinal mucosal permeability and decrease the incidence of MODS and pancreatic infection in patients with severe acute pancreatitis,” annotators differed in identifying the relationship between Dachengqi decoction and pancreatic infection, as well as Dachengqi decoction and severe acute pancreatitis).

Annotation quality assessment

Because the corpus was constructed manually by annotators applying domain expertise under well-defined guidelines, evaluating corpus quality plays a vital role in assessing its comprehensiveness and utility. As mentioned previously, the annotation workflow of the TFDR corpus involved four steps (TF, disease, key-sentence, and relation annotation). The inter-annotator agreement (IAA) was computed for each step to provide insight into the reliability and consistency of the TFDR corpus. As three annotators independently conducted the annotation tasks, Fleiss’s kappa30 was employed to compute the IAA scores. Fleiss’s kappa is a statistical measure of the reliability of agreement among a fixed number of raters assigning categorical ratings or classifying items into multiple categories. In contrast to Cohen’s kappa, which applies to exactly two raters, Fleiss’s kappa accommodates an arbitrary number of raters. It is computed using the following formula:

$$\kappa =\frac{\bar{P}-\bar{P}_e}{1-\bar{P}_e},$$

where the factor \(1-\bar{P}_e\) represents the degree of agreement attainable above chance, and \(\bar{P}-\bar{P}_e\) gives the degree of agreement actually achieved above chance. If all annotators agree completely, then Fleiss’s kappa κ = 1. According to Viera et al.31, kappa values between 0.6 and 0.8 represent “substantial” agreement, while values above 0.8 can be interpreted as indicating “almost perfect” agreement. In this study, both strict and relaxed constraints were applied when calculating the IAA. The strict constraint required the annotations from the three annotators to have the same entity type and identical offsets, whereas the relaxed constraint required the same entity type but only partially overlapping spans.
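For reference, the formula above can be computed directly from an items-by-categories count matrix; a minimal NumPy sketch follows, with toy ratings that are purely illustrative. Equivalent functionality is also available in libraries such as statsmodels.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss's kappa for an (n_items x n_categories) matrix of counts.

    counts[i, j] = number of raters assigning item i to category j.
    Every row must sum to the same number of raters.
    """
    n_raters = counts.sum(axis=1)[0]
    # Per-item agreement: fraction of agreeing rater pairs.
    p_i = np.sum(counts * (counts - 1), axis=1) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                        # observed agreement, P-bar
    p_j = counts.sum(axis=0) / counts.sum()   # category proportions
    p_e = np.sum(p_j ** 2)                    # chance agreement, P-bar_e
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 5 items, 3 annotators, 2 categories.
ratings = np.array([[3, 0], [2, 1], [3, 0], [0, 3], [1, 2]])
print(round(fleiss_kappa(ratings), 3))
```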

Corpus evaluation

To illustrate the validity of the TFDR corpus, we conducted experiments on three key biomedical NLP tasks: Named Entity Recognition (NER), Key-Sentence Recognition (KSR), and Relation Extraction (RE). NER involves identifying words within unstructured text that correspond to predefined entities. KSR evaluates whether a given sentence serves as a condensed representation of the result or conclusion of the article. Meanwhile, RE concerns the detection and classification of mentions representing semantic relationships within unstructured documents. Recent advancements in NLP, particularly contextual language models based on the Transformer encoder architecture, have outperformed conventional machine learning methods, largely because the Transformer encoder captures contextual information bidirectionally and thus better models relationships between words and sentences. These advancements have translated into substantial performance gains on NER, KSR, and RE tasks as well. First, BERT32 is composed solely of the encoder part of the Transformer architecture and employs Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) during pre-training, which enable it to capture bidirectional context effectively. BERT was pre-trained on a large corpus of general English text, including Wikipedia and BooksCorpus, and has demonstrated outstanding performance across various NLP tasks. Furthermore, domain-specific BERT-based models have been developed, such as SciBERT33, trained on computer science and biomedical papers. Another noteworthy model is BioBERT34, which initializes from BERT’s weights and undergoes further pre-training on additional corpora, specifically PubMed abstracts and PMC full-text articles.

While SciBERT and BioBERT are domain-specific adaptations of BERT, ELECTRA35 and DeBERTa36 are language models that enhance BERT’s architecture and training methods to improve performance. ELECTRA has a structure similar to BERT’s but introduces an innovative training method: it consists of two networks, a generator and a discriminator, and instead of predicting masked tokens, it learns by distinguishing original tokens from replaced ones, leading to more efficient learning. As a result, ELECTRA can match or exceed BERT’s performance with fewer resources. DeBERTa enhances the BERT and RoBERTa37 models through two techniques. The first is the disentangled attention mechanism, in which each word is represented by two separate vectors, one for its content and one for its position. The second is an enhanced mask decoder, which replaces the standard softmax output layer for predicting masked tokens during pre-training. These advancements yield outstanding performance across a range of benchmarks.
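As a sketch of how such an encoder can be applied to the corpus’s NER task, the snippet below runs a token-classification head on top of BioBERT using the Hugging Face transformers library. The checkpoint name and BIO label set are assumptions for illustration; note that the classification head is randomly initialized here and would need fine-tuning on the TFDR corpus before producing meaningful labels.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed BIO label scheme for TF and disease mentions.
labels = ["O", "B-TF", "I-TF", "B-Disease", "I-Disease"]

# Public BioBERT checkpoint (any BERT-style encoder would work).
name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(
    name, num_labels=len(labels))  # head is untrained at this point

sentence = "Sishen Pill alleviated colitis in mice."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(token, labels[pred])             # random until fine-tuned
```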

Performance evaluation for the NER, KSR, and RE downstream tasks on the TFDR corpus used three metrics: micro-F1, macro-F1, and weighted-F1. The F1 score is the harmonic mean of precision and recall. When all labels are of similar importance, the macro-F1 score is employed; when importance should be weighted toward labels with more samples, the weighted-F1 score is consulted; and when evaluating overall model performance regardless of label, the micro-F1 score is used.
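The three averaging schemes map directly onto scikit-learn’s `f1_score` parameter `average`; a small illustrative example follows (the toy labels are not corpus results).

```python
from sklearn.metrics import f1_score

# Toy gold and predicted relation labels (illustrative only).
y_true = ["Treatment", "Treatment", "Association", "Negative", "Treatment"]
y_pred = ["Treatment", "Association", "Association", "Negative", "Treatment"]

for avg in ("micro", "macro", "weighted"):
    print(avg, round(f1_score(y_true, y_pred, average=avg), 3))
```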
