Pre-trained Model for Chinese Medical Text Processing—PCL-MedBERT
Date: 2020-08-25 Source: PCL
The BERT model, which stands for Bidirectional Encoder Representations from Transformers, is a pre-trained language representation model released by Google. Unlike earlier language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. BERT is conceptually simple and empirically powerful: it achieved state-of-the-art results on eleven natural language processing (NLP) tasks and brought a new wave of momentum to NLP technology. However, the training corpus of Google's Chinese BERT comes from Chinese Wikipedia (25M sentences), which leaves room for improvement on domain-specific text processing tasks, especially Chinese medical text processing.
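The masked-prediction idea behind BERT's bidirectional pre-training can be illustrated with a toy masking routine. This is a simplified sketch, not the actual BERT implementation: real BERT sometimes keeps or randomly replaces a selected token rather than always masking it, and the function and example sentence here are illustrative only.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace tokens with [MASK], returning the corrupted
    sequence plus the (position, original token) prediction targets.
    Simplified: real BERT's 80/10/10 keep/replace scheme is omitted."""
    rng = rng or random.Random()
    corrupted, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append((i, tok))
        else:
            corrupted.append(tok)
    return corrupted, targets

# Toy English stand-in for a medical sentence.
tokens = "the patient was given 500 mg of amoxicillin".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(0))
print(corrupted)
print(targets)
```

During pre-training, the model must recover each masked token using both its left and right context simultaneously, which is what makes the learned representations bidirectional rather than left-to-right.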
To further promote research on and applications of Chinese medical text processing, the Intelligent Medical Research Group at the Peng Cheng Laboratory has collected 1.2G of professional medical text and 1.5G of high-quality medical question-and-answer data from multiple sources to address these deficiencies of Google BERT. These datasets were used to build a BERT pre-trained model for medical text. In addition, random initialization and fine-tuning were used to optimize the BERT model, yielding the “Peng Cheng Medical BERT” (PCL-MedBERT) pre-trained model, which supports a range of downstream tasks in the medical field. At present, the model has surpassed Google BERT on the two downstream tasks of question matching and medical named entity recognition.
The following two tables compare the results of PCL-MedBERT and Google BERT on different medical tasks.
Table 1 Question Matching Task
Table 2 Medical Named Entity Recognition
The PCL-MedBERT research team includes Prof. Ting Liu, Prof. Bing Qin, Prof. Ming Liu, and Prof. Ruifeng Xu. Teams led by Associate Prof. Buzhou Tang and Prof. Qingcai Chen, respectively, were responsible for evaluating the model on question matching and medical named entity recognition, and for providing a wide range of professional medical data. The large-scale training of all models was completed on the “Peng Cheng Cloud Brain” platform.
The pre-trained language model is available to the worldwide scientific community free of charge on PCL’s AI code management platform, ihub.org.cn. For the download address and configuration file, please visit https://code.ihub.org.cn/projects/1775.
We welcome comments and suggestions from professionals engaged in Chinese medical text processing.