Objective To explore the construction of a real-world data annotation platform, and to compare the real-world data extraction performance of retrieval-augmented generation (RAG) combined with large language models against the pre-training and fine-tuning approach based on pre-trained language models.
Methods Taking bladder cancer pathology records from real-world electronic medical record data as an example, a real-world data annotation platform was built. Based on the data annotated on the platform, the performance of automatically extracting bladder cancer typing and staging was compared between RAG combined with GPT-3.5 and the pre-training and fine-tuning approach based on the BERT and RoBERTa models.
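As an illustration only (not the authors' implementation), the sketch below shows how a RAG-style extraction call with GPT-3.5 might be organized: a pathology report is matched against a small set of reference snippets, and the retrieved context is injected into the extraction prompt. The model name, snippet contents, retrieval method (TF-IDF) and prompt wording are all assumptions made for demonstration.

```python
# Minimal RAG extraction sketch (assumptions throughout, not the paper's code):
# retrieve reference snippets for a pathology report, then ask GPT-3.5 to
# extract histological type, T stage and N stage grounded on that context.
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical knowledge snippets (e.g. typing and TNM staging rules).
snippets = [
    "Urothelial carcinoma is the most common histological type of bladder cancer.",
    "T staging: Ta non-invasive papillary, T1 invades lamina propria, T2 invades muscle.",
    "N staging: N0 no regional lymph node metastasis, N1 single regional node.",
]

def retrieve(report: str, k: int = 2) -> list[str]:
    """Return the k snippets most similar to the report (simple TF-IDF retrieval)."""
    vec = TfidfVectorizer().fit(snippets + [report])
    sims = cosine_similarity(vec.transform([report]), vec.transform(snippets))[0]
    return [snippets[i] for i in sims.argsort()[::-1][:k]]

def extract(report: str) -> str:
    """Ask GPT-3.5 to extract typing / T stage / N stage using retrieved context."""
    context = "\n".join(retrieve(report))
    prompt = (
        "Using the reference context, extract the histological type, T stage and "
        "N stage from the pathology report. Answer as JSON with keys type, T, N.\n"
        f"Context:\n{context}\n\nReport:\n{report}"
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content
```

In such a pipeline the retrieval step and prompt format largely determine extraction quality; the structured JSON output can then be compared directly against the platform's gold annotations.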
Results The pre-trained models fine-tuned on the full training set outperformed both RAG combined with the large language model and the pre-trained models fine-tuned with few-shot data, and the RoBERTa model generally outperformed the BERT model; however, the extraction performance of all these methods still needs improvement. Using the RoBERTa model fine-tuned on the entire training set, the F1 scores for extracting bladder cancer typing, T staging, and N staging on the test set were 71.06%, 50.18%, and 73.65%, respectively.
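To illustrate how such field-level F1 scores can be produced, the sketch below frames one field (T staging) as text classification with a Chinese RoBERTa checkpoint and scores predictions against gold annotations using scikit-learn. The checkpoint name, label set, example texts and averaging scheme are assumptions; the paper's actual training and evaluation configuration is not reproduced here.

```python
# Minimal scoring sketch (assumptions throughout): classify T stage with a
# Chinese RoBERTa checkpoint and report F1. In practice the classification
# head would first be fine-tuned on the platform's annotated training set.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import f1_score

MODEL = "hfl/chinese-roberta-wwm-ext"    # assumed base checkpoint
LABELS = ["Ta", "T1", "T2", "T3", "T4"]  # hypothetical T-stage label set

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=len(LABELS))
model.eval()

def predict(texts: list[str]) -> list[str]:
    """Predict a T-stage label for each pathology report snippet."""
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return [LABELS[i] for i in logits.argmax(dim=-1).tolist()]

# Toy evaluation against gold labels from the annotation platform.
gold = ["T2", "T1"]
pred = predict(["膀胱肿瘤侵及肌层……", "肿瘤侵及固有层……"])
print("macro F1:", f1_score(gold, pred, labels=LABELS, average="macro", zero_division=0))
```

The same scoring routine can be applied per field (typing, T staging, N staging), which is how per-field F1 values such as those reported above are typically obtained.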
Conclusion Pre-trained language models show potential for processing unstructured clinical data, but the information extraction performance of existing methods still has room for improvement. Future work should further optimize the models or training strategies to accelerate data empowerment.