¹School of Biomedical Informatics
The University of Texas Health Science Center at Houston, Houston, TX, USA
²Department of Radiation Oncology
The University of Texas MD Anderson Cancer Center, Houston, TX, USA
³Department of Biostatistics
The University of Texas MD Anderson Cancer Center, Houston, TX, USA

Text Classification of Cancer Clinical Trial Eligibility Criteria

Yumeng Yang MS¹ Soumya Jayaraj BA¹ Ethan Ludmir MD² ³ Kirk Roberts PhD¹

Abstract

Our study aims to develop automated classification models for the eligibility criteria of cancer trials on ClinicalTrials.gov, in order to facilitate the patient-trial matching process. We started from 764 annotated trials and used keyword matching to extract the sentences that convey specific criteria requirements. Along with training five publicly available domain-specific models on our data, we also pretrained our own model using all available documents from ClinicalTrials.gov as of February 15th, and then tested it on our classification task. The best F1 score for all criteria, namely prior malignancy, HIV, HBV, HCV, psychiatric illness, the combination of alcohol, drug, and substance abuse (SubDrugAlc), and autoimmune, is over 0.95. Our results show that it is feasible to build a classification model that automatically identifies high-level and frequently asked eligibility criteria for cancer clinical trials. Furthermore, our pretrained model outperforms the other models on some criteria. Future work on eligibility criteria for clinical trials can benefit greatly from our model.

1 Introduction

Cancer has high morbidity and mortality rates and threatens millions of lives. According to the American Cancer Society, there will be a total of 1.9 million new cancer cases along with more than 600,000 deaths in the US in 2022^{acs2022}. Clinical trials have long been recognized as essential for cancer treatment and anticancer drug development^{cox2003patients}, but trial recruitment remains problematic^{cox2003patients, kadam2016challenges, ross1999barriers}. Barriers to recruitment include patients’ fear of the perceived risks of novel treatments^{jones2007identifying} and the lack of disclosure of more detailed trial information^{jenkins2010attitudes, mills2006barriers}. In addition to these concerns, identifying potentially eligible patients is also a barrier. It is likewise challenging for trial seekers to find suitable trials: eligibility criteria are written in jargon, and professional terminology is difficult for lay audiences to understand. A simplified version of clinical trial eligibility criteria can improve accessibility for wider audiences^{kang2015initial}.

Clinical trial protocols outline the overall objective, design, methods, criteria, and other important details of a trial. The eligibility criteria are a key component of the protocol, as they define the requirements that participants must meet to take part in the trial. They usually consist of inclusion criteria and, often, additional exclusion criteria that further rule out certain populations^{usnll2022clinicaltrials}. ClinicalTrials.gov is a public database that provides information on registered human trials. It is managed by the United States National Library of Medicine and the National Institutes of Health^{nlm2013clinicaltrials}, and since 2007 human trials have been legally required to register trial information^{clinicaltrials_gov}. As of February 2023, about half a million studies were registered. Such a comprehensive database holds enormous potential for research, including identifying future trends, uncovering disparities by age, gender, and other population characteristics^{grant2020racial, corrigan2022inclusion}, and developing tools that facilitate trial recruitment.

Identifying eligible patients for these trials can be a time-consuming and challenging task, partly because of the non-standardized format of eligibility criteria in clinical trial protocols. As a result, patients and clinicians may struggle to identify relevant trials, leading to potential delays in the recruitment process^{bhattacharya2013analysis}. This underscores the urgent need for automated tools, such as text classification, to streamline and enhance the recruitment process. By developing an automatic classifier that identifies key eligibility criteria from free-text records, we can potentially address the unmet needs of patients and clinicians by enabling faster and more accurate identification of relevant clinical trials. Natural Language Processing (NLP) is a branch of artificial intelligence that aims to enable machines to understand human language. In the clinical field, NLP has a wide range of applications, including named entity recognition^{zhao2004named, ji2019hybrid, bhatia2019comprehend}, text mining^{eom2004pubminer, huang2008genclip, bucur2014supporting}, and text classification^{uzuner2008identifying, nii2017nursing, yao2019clinical}. NLP models can help extract and structure information from text data, such as eligibility criteria descriptions and patients’ medical notes, for secondary use. Our project aims to develop classifiers that automatically identify key exclusion criteria from the eligibility descriptions listed on ClinicalTrials.gov. Such classification tools have great potential to streamline the recruitment process for both clinicians and patients. For our study, we chose seven criteria and trained five state-of-the-art domain-specific language models on our data. Additionally, we pretrained our own model based on the ClinicalBERT framework, using over half a million eligibility criteria sections derived from ClinicalTrials.gov. In the results section, we present a cross-comparison between our pretrained model and the other models.

2 Related work

Numerous prior projects have focused on text mining to facilitate the identification and organization of eligibility criteria for clinical trials, with the ultimate goal of streamlining the recruitment process for both patients and clinicians. Criteria2Query is an NLP tool introduced in 2019 that converts the free-text eligibility criteria of a clinical trial into a structured query for identifying eligible patients from clinical data^{yuan2019criteria2query}. DQueST is a dynamic questionnaire, also introduced in 2019, that helps people find eligible clinical trials by asking questions related to trial criteria^{liu2019dquest}. Many models also exist to automatically structure eligibility criteria for clinical trials^{luo2011extracting} and electronic health record (EHR) data^{jonnalagadda2017text}. RuleEd is a web-based tool for revising and refining free-text eligibility criteria^{olasov2006ruleed}. eTACTS is a search tool that allows users to customize criteria and weight each criterion when searching for potential trials^{miotto2013etacts}.

Another line of work focuses on information extraction to match terms between eligibility criteria and patient records. Some earlier classification work built tools based on regular expressions and machine learning models to identify certain criteria for cancer trials^{zhang2017automated} and for the needs of specific departments^{ni2015automated}. Another tool was developed that relies on key-term and pattern matching^{petkov2013automated}. Many other tools were built on EHR data to identify eligible patients from their medical records and match them with clinical trial criteria for specific disease types, such as cancer and Alzheimer’s disease^{kirshner2021automated, tissot2020natural, cai2021improving}.

Along with text mining and information extraction, some work aims to build annotated corpora and knowledge bases of common eligibility criteria. EliIE^{kang2017eliie}, Chia^{kury2020chia}, and the Leaf Clinical Trials (LCT) Corpus^{dobbins2022leaf} were published in 2017, 2020, and 2022, respectively. EliIE contains 230 trials from varying phases with a specific focus on Alzheimer’s disease, Chia contains 1000 phase IV trials covering all diseases, and LCT contains 1006 trials across all phases and diseases. A lexicon for breast cancer clinical trial eligibility criteria was created in 2021 to identify concepts related to eligibility^{jung2021building}. That study shows that a specialized lexicon can improve the accuracy of clinical trial eligibility criteria analysis. Another knowledge base uses a hierarchical taxonomy to classify criteria into multiple categories, including disease, intervention, and condition^{liu2021knowledge}.

3 Method

We collected 764 phase III cancer clinical trials registered on ClinicalTrials.gov from 2000 to 2017. Each trial was initially annotated by clinicians using a two-person blinded annotation paradigm. In this study, we focus on automatically classifying seven key criteria for cancer trials. The seven criteria were chosen based on their frequency of occurrence as well as their clinical significance for cancer trials, and they were selected in collaboration with both informatics and clinical specialists. The criteria are prior malignancy, HIV, HCV, HBV, psychiatric illness, autoimmune, and the combination of alcohol, drug, and substance abuse.

We collected the text of the eligibility criteria descriptions from ClinicalTrials.gov (CTG) and split it into individual sentences for two main reasons. First, most BERT-based models accept at most 512 input tokens. Second, splitting removes much of the noise in the data. The original length of criteria description from CTG ranges from to(need to check). The whole eligibility section contains multiple conditions covering varying aspects, and too much unrelated information confuses the model and makes it challenging to identify the specific requirements for the criteria we are interested in. Therefore, we process each sentence in the eligibility criteria separately for text classification. Often only one or two sentences convey the necessary information for a specific criterion. As a pre-processing step, we apply a simple text cleaning process, including the removal of extra spaces and new lines. These pre-processing steps help standardize the text data and remove unnecessary or irrelevant information that could interfere with the text classification process.
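The cleaning and sentence-splitting step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the assumption that each criterion starts on a new line with a "-" or "*" bullet, and the regex-based splitting rule are all hypothetical, and the sample text is made up.

```python
import re


def clean_text(text):
    """Collapse extra spaces and new lines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_criteria(text):
    """Split a criteria block into cleaned sentences.

    Assumed layout (hypothetical): each criterion starts on a new line
    with a "-" or "*" bullet; wrapped continuation lines have no bullet.
    """
    items = re.split(r"\n\s*[-*]\s+", "\n" + text)
    sentences = []
    for item in items:
        # Within one bullet, split after a period that is followed by a capital letter.
        for sentence in re.split(r"(?<=\.)\s+(?=[A-Z])", clean_text(item)):
            if sentence:
                sentences.append(sentence)
    return sentences


# Made-up sample text for illustration only.
raw = (
    "- No other malignancy within the past 3 years\n"
    "   except nonmelanoma skin cancer\n"
    "- No known HIV infection.  No active hepatitis B"
)
sentences = split_criteria(raw)
for s in sentences:
    print(s)
```

Each resulting sentence is short enough to fit comfortably within the 512-token limit mentioned above, which a full eligibility section may exceed.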

The eligibility criteria descriptions can be divided into three categories: inclusion, exclusion, and eligibility. These sections use different expressions or reversed logic but convey the same criteria. For example,

• Eligibility criteria in trial NCT00048997: “No other malignancy within the past 3 years except nonmelanoma skin cancer”
• Exclusion criteria in trial NCT00057876: “Malignancy within the past 5 years except nonmelanoma skin cancer, carcinoma in situ of the cervix, or organ-confined prostate cancer (Gleason score no greater than 7)”
• Inclusion criteria in trial NCT00095875: “No other malignancy within the past 5 years except adequately treated carcinoma in situ of the cervix, basal cell or squamous cell skin cancer, or other cancer curatively treated by surgery alone”

All of these trials exclude patients with prior malignancy within a specific time frame, but they use different wording under different subdivisions. When extracting sentences by keyword matching, we also marked each sentence with the category it was extracted from (inclusion, exclusion, or eligibility). This categorization helps the model better understand the underlying logic of the criteria. Table 4 shows sample text inputs for the classification model.
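To make the category marking concrete, the sketch below prefixes each sentence with its section, mirroring the "inclusion:..." format of the inputs in Table 4. The header pattern ("Inclusion Criteria:" / "Exclusion Criteria:"), the default "eligibility" tag for text outside any header, and the function name are assumptions for illustration; the sample text is made up.

```python
import re

# Assumed header format (hypothetical): a line reading
# "Inclusion Criteria:" or "Exclusion Criteria:".
HEADER = re.compile(r"^\s*(inclusion|exclusion)\s+criteria\s*:?\s*$", re.IGNORECASE)


def tag_sentences(criteria_text):
    """Prefix every non-header line with the section it appears under."""
    section = "eligibility"  # used when no inclusion/exclusion header has been seen
    tagged = []
    for line in criteria_text.splitlines():
        line = line.strip().lstrip("-* ").strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            section = header.group(1).lower()
            continue
        tagged.append(f"{section}:{line}")
    return tagged


# Made-up sample text for illustration only.
sample = (
    "No other malignancy within the past 3 years\n"
    "Inclusion Criteria:\n"
    "- No other malignancy within the past 5 years\n"
    "Exclusion Criteria:\n"
    "- Malignancy within the past 5 years\n"
)
tagged = tag_sentences(sample)
for t in tagged:
    print(t)
```

Keeping the section as a plain-text prefix lets the classifier see whether a sentence such as "Malignancy within the past 5 years" is stated as a requirement or as a reason for exclusion.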

3.1 Keywords summary

To accurately locate the sentences that express a specific criterion, we created a keyword list for each of the seven criteria. Matching sentences that contain these keywords helps extract the most relevant information for each criterion from lengthy eligibility descriptions. Table 2 shows the performance of the keyword-based extraction. To capture as much information as possible for all required trials, we had to make a trade-off between recall and precision. It may seem counterintuitive that we also kept wrongly captured sentences, but correctly identifying unrelated sentences for a specific criterion can be just as important for the model to classify sentences accurately for that criterion.

The original annotation of each trial across all criteria was based on information from multiple sources, including CTG, trial protocols, and publications. However, since we only use data available from CTG in this study, we excluded from our analysis trials that did not disclose criteria-related descriptions on CTG. After matching criteria-related sentences with the corresponding keywords, we manually examined the remaining trials that did not match any keyword. We wanted to determine whether the miss was due to the absence of a relevant sentence on CTG or to our keyword list lacking relevant words. This manual examination helped us refine the keyword lists and ensure that we did not miss relevant eligibility criteria.

Table 1: Keywords for Each Criteria

Criteria	Keywords
Prior Malignancy	prior malignancy, concurrent malignancy, 5 years, five years, prior invasive malignancy, 3 years, other malignancy, known additional malignancy, squamous cell carcinoma, in-situ, cancer
HIV	human immunodeficiency virus, acquired immunodeficiency syndrome, AIDS-defining malignancy, hiv, AIDS-related illness
HBV	hbv, hepatitis
HCV	hcv, hepatitis
Psychiatric Illness	psychosis, depression, psychiatric, psychological, psychologic, nervous, mental illness, mental disease
SubDrugAlc	ethanol, abuse, alcohol, alcoholism, illicit substance, drug, drugs, medical marijuana, inadequate liver, illicit substance, addictive, substance misuse, cannabinoids, chronic alcoholism
Autoimmune	uncontrolled systemic, autoimmune
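As an illustration of how keyword lists like those in Table 1 can drive sentence extraction, the sketch below flags every criterion whose keywords occur in a sentence. It uses only a subset of the keywords from Table 1, and it assumes case-insensitive substring matching, which the text does not specify; the function name and the example sentences are hypothetical.

```python
# Subset of the keyword lists from Table 1.
KEYWORDS = {
    "HIV": ["human immunodeficiency virus", "acquired immunodeficiency syndrome", "hiv"],
    "HBV": ["hbv", "hepatitis"],
    "HCV": ["hcv", "hepatitis"],
    "Autoimmune": ["uncontrolled systemic", "autoimmune"],
}


def match_criteria(sentence, keywords=KEYWORDS):
    """Return the sorted criteria whose keywords appear in the sentence.

    Assumption: case-insensitive substring matching. This favors recall
    over precision, since short keywords can match unrelated words.
    """
    lowered = sentence.lower()
    return sorted(
        criterion
        for criterion, terms in keywords.items()
        if any(term in lowered for term in terms)
    )


# Made-up sentences for illustration only.
print(match_criteria("exclusion:Known HIV infection or active hepatitis B"))
print(match_criteria("inclusion:Adequate renal function"))
```

Because HBV and HCV share the keyword "hepatitis", a sentence that mentions only hepatitis B is extracted for both criteria, and it is then up to the classifier to reject it for HCV.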

Table 2 summarizes the performance metrics for all criteria. We conducted an error analysis for Psychiatric Illness, SubDrugAlc, and Autoimmune, which have lower overall precision. In the Psychiatric Illness trials, we noticed a high frequency of the keywords “psychiatric” and “nervous”, with 124 and 96 appearances, respectively. We then tested the precision of these two keywords: 0.92 for “psychiatric” and 0.35 for “nervous”. However, removing “nervous” caused the recall to drop from 0.99 to 0.90. In the trials on the combination of drug, alcohol, and substance abuse, the keywords “drug” and “drugs” appear most often, 269 and 143 times, with a precision of 0.22 and 0.23, respectively. Removing them would cause the overall recall to drop from 1 to 0.92. Only 54 trials mention the Autoimmune criterion, and the keywords captured 50 of them. Although the recall is below 0.90 and the precision is around 0.60, we believe that sacrificing precision or recall to improve the other is unnecessary given the relatively small number of missed trials. In summary, we believe that our keyword lists capture the needed information with a reasonable balance between recall and precision. Therefore, we consider our approach effective in identifying the relevant data.

Table 2: Performance Metrics of Keywords for Each Criteria

	Prior Malignancy	HIV	HBV	HCV	Psychiatric Illness	SubDrugAlc	Autoimmune
Precision	0.87	0.90	0.98	0.96	0.68	0.27	0.62
Accuracy	0.82	0.88	0.74	0.95	0.67	0.27	0.57
Recall	0.98	0.97	0.98	1	0.99	1	0.89
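The per-keyword error analysis above can be reproduced in spirit with a few lines of code. The sketch below computes the precision of a single keyword and the recall of a keyword list before and after removing a noisy keyword. The sentences and labels are invented toy data, not the study's corpus, and the helper names are hypothetical; the study's own figures in Table 2 may have been computed at a different granularity.

```python
def keyword_precision(samples, keyword):
    """Fraction of sentences containing the keyword that are truly relevant."""
    hits = [relevant for text, relevant in samples if keyword in text.lower()]
    return sum(hits) / len(hits) if hits else 0.0


def keyword_recall(samples, keywords):
    """Fraction of relevant sentences captured by at least one keyword."""
    relevant = [text for text, is_relevant in samples if is_relevant]
    caught = [t for t in relevant if any(k in t.lower() for k in keywords)]
    return len(caught) / len(relevant) if relevant else 0.0


# Toy data (hypothetical): (sentence, is it about psychiatric illness?)
samples = [
    ("No psychiatric illness that would limit compliance", True),
    ("No central nervous system metastases", False),
    ("No history of nervous disorder affecting consent", True),
    ("No psychiatric disorder", True),
    ("No nervous system tumor", False),
]

print(keyword_precision(samples, "psychiatric"))            # precision of one keyword
print(keyword_precision(samples, "nervous"))                # noisier keyword
print(keyword_recall(samples, ["psychiatric", "nervous"]))  # full list
print(keyword_recall(samples, ["psychiatric"]))             # after removing "nervous"
```

In this toy setting, dropping the low-precision keyword removes two false matches but also loses one relevant sentence, which is the same trade-off described for “nervous” and “drug” above.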

3.2 Annotation process

After extracting the sentences for all seven criteria, each sentence was annotated independently by two annotators (the first author and a graduate student with a medical background) using a double-blinded paradigm, and any discrepancies were resolved through discussion and consensus. In case of remaining disagreement, we checked with the data owner to finalize the label. The annotation rules for each criterion followed the same standard as the original trial-level annotation. A descriptive summary of the annotated corpus is shown in Table 3.

Table 3: Annotation Agreement

	Prior Malignancy	HIV	HBV	HCV	Psychiatric Illness	SubDrugAlc	Autoimmune
Sample Size	529	200	130	282	281	523	54
Cohen’s Kappa	0.95	0.74	0.16	0.89	0.93	0.98	-0.10
Agreement	0.99	0.96	0.85	0.95	0.97	0.99	0.85
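Table 3 reports both raw agreement and Cohen's kappa, and the two can diverge sharply when one label dominates. The sketch below computes both from two annotators' labels using the standard definition of kappa (observed agreement corrected for chance agreement). The label vectors are invented to illustrate the effect and are not the study's annotations.

```python
from collections import Counter


def raw_agreement(a, b):
    """Share of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    n = len(a)
    observed = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels: almost everything is class 1, and the few 0s never coincide.
annotator_a = [1] * 17 + [0, 1, 1]
annotator_b = [1] * 17 + [1, 0, 0]

print(raw_agreement(annotator_a, annotator_b))
print(cohen_kappa(annotator_a, annotator_b))
```

Here the annotators agree on 17 of 20 items, yet kappa is slightly negative because nearly all of that agreement is expected by chance under such skewed labels. This may help in reading rows of Table 3 where agreement is high but kappa is low, although the text does not state the cause for those rows.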

Table 4 shows examples of annotated sentence-level text for one criterion. Prior malignancy was the most confusing criterion. Only prior malignancy counts toward it, which means that concurrent malignancy is not counted and active secondary cancer is excluded. Prior systemic chemotherapy or radiation may imply a prior malignancy, but it is hard to tell whether these treatments were given for the current or a prior malignancy, so sentences that only mention prior treatment are not counted as prior malignancy.

Table 4: Examples of criteria-related sentence from study eligibility criteria

CTG ID	Text	Classification
NCT00005047	eligibility:At least 5 years since other prior systemic chemotherapy	0: PM not excluded
NCT00006011	inclusion:No prior radiotherapy for prior malignancy	0: PM not excluded
NCT00011986	inclusion:No other invasive malignancy within the past 5 years except nonmelanoma skin cancer	1: PM excluded
NCT00216060	exclusion:No prior history of malignancy in the past 5 years with the exception of basal cell and squamous cell carcinoma of the skin	1: PM excluded

3.3 Model Implementation

At a high level, this study follows this pipeline: creating keyword lists for all criteria, extracting sentences by keyword matching, sentence-level annotation, model implementation, and evaluation. We used five-fold cross-validation to overcome the challenge of the small sample size, and we evaluated the models at both the sentence level and the trial level.

The dataset for each criterion was divided into a training set and a testing set with a 70/30 ratio. To reduce the risk of overfitting, 5-fold cross-validation was applied to the training data. This approach divides the training data into five equally sized parts and trains and tests the model five times, using a different part for testing each time. This method helps avoid issues caused by a small sample size. To maintain consistency, the data was split at the trial level, meaning that all sentences belonging to the same trial were kept together in either the training or the testing set, even if multiple sentences were captured by keywords for a single trial.
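A trial-level split of this kind can be sketched with the standard library alone. The code below assigns whole trials to the training or testing side with a 70/30 ratio and then partitions the training trials into five folds. The function names, the fixed seed, and the synthetic trial IDs are assumptions for illustration; the study's actual splitting code is not described in the text.

```python
import random


def split_by_trial(trial_ids, test_fraction=0.3, seed=0):
    """Split unique trial IDs into train/test sets so no trial spans both."""
    unique = sorted(set(trial_ids))
    random.Random(seed).shuffle(unique)
    n_test = round(len(unique) * test_fraction)
    return set(unique[n_test:]), set(unique[:n_test])


def trial_folds(train_trials, k=5, seed=0):
    """Partition training trials into k near-equal, disjoint folds."""
    trials = sorted(train_trials)
    random.Random(seed).shuffle(trials)
    return [set(trials[i::k]) for i in range(k)]


# Synthetic example: 20 trials with two extracted sentences each.
sentence_trial_ids = [f"TRIAL{i:03d}" for i in range(20) for _ in range(2)]

train_trials, test_trials = split_by_trial(sentence_trial_ids)
folds = trial_folds(train_trials)

# Sentences follow their trial, so both sentences of a trial land on the same side.
train_sentences = [i for i, t in enumerate(sentence_trial_ids) if t in train_trials]
print(len(train_trials), len(test_trials), [len(f) for f in folds])
```

Splitting on trial IDs rather than on sentences means the sentence-level ratio is only approximately 70/30, but it prevents near-duplicate sentences from the same trial from leaking between training and testing data.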

We applied six BERT-based models to the datasets of all criteria. Five of them are state-of-the-art BERT-based models pretrained on domain-specific corpora: BioBERT, ClinicalBERT, BlueBERT, PubMedBERT, and SciBERT. In addition, we trained our own model on over half a million clinical trial eligibility criteria sections from ClinicalTrials.gov, building on the ClinicalBERT framework. BioBERT is the first domain-specific BERT-based model trained on biomedical corpora, including PubMed abstracts and PMC full-text articles^{lee2020biobert}. ClinicalBERT was trained on top of BioBERT by adding clinical notes and EHRs to the pre-training process^{huang2019clinicalbert}. BlueBERT was pre-trained on top of the BERT model with PubMed abstracts and MIMIC-III^{peng2019transfer}. It was further tested on five different tasks and evaluated with the Biomedical Language Understanding Evaluation (BLUE) benchmark. The results showed that BlueBERT outperformed many other state-of-the-art domain models. PubMedBERT was pretrained from scratch on PubMed abstracts and titles^{gu2021domain}. SciBERT was pretrained based on BERT with 1.14M full papers from Semantic Scholar^{beltagy2019scibert}; its pretraining corpus mainly covers the computer science and biomedical domains.

3.4 Model Evaluation

We evaluated all classification models using precision, recall, and F1 score. These metrics were calculated at both the trial and sentence levels for all criteria to ensure a comprehensive evaluation as well as practical, explainable results. For the trial-level evaluation, if any extracted sentence of a trial was annotated as 1, the trial-level true label was marked as 1, and if any extracted sentence was predicted as 1, the trial-level predicted label was marked as 1. The final trial-level metrics were calculated after removing duplicates.
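The trial-level aggregation rule above amounts to taking the maximum label over a trial's sentences. The sketch below applies that rule and computes precision, recall, and F1 for the positive class at both levels. The trial IDs, labels, and predictions are toy values, and the choice of positive-class (rather than averaged) metrics is an assumption, since the text does not state the averaging scheme used.

```python
def to_trial_level(trial_ids, labels):
    """A trial is labeled 1 if any of its sentences is labeled 1."""
    trial_labels = {}
    for trial_id, label in zip(trial_ids, labels):
        trial_labels[trial_id] = max(trial_labels.get(trial_id, 0), label)
    return trial_labels


def precision_recall_f1(gold, pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (hypothetical): five sentences from three trials.
trial_ids = ["T1", "T1", "T2", "T3", "T3"]
gold = [1, 0, 0, 1, 0]
pred = [0, 1, 0, 0, 0]

gold_trial = to_trial_level(trial_ids, gold)
pred_trial = to_trial_level(trial_ids, pred)
trials = sorted(gold_trial)  # one entry per trial, duplicates removed

sentence_scores = precision_recall_f1(gold, pred)
trial_scores = precision_recall_f1(
    [gold_trial[t] for t in trials], [pred_trial[t] for t in trials]
)
print(sentence_scores)
print(trial_scores)
```

In this toy case the model picks the wrong sentence within trial T1 but still gets the trial right, so the trial-level scores are higher than the sentence-level ones. The two levels can therefore differ even for the same set of predictions.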

4 Result

The tables present metrics for six BERT-based models evaluated on seven criteria, using both sentence-level and trial-level assessment. Across all models, the overall F1 score for all criteria ranged from 0.83 to 1.00 at both evaluation levels. For the prior malignancy criterion, PubMedBERT and our pretrained model achieved the highest sentence-level F1 score of 0.91, while BioBERT outperformed the other models with the highest F1 score of 0.93 in the trial-level evaluation. BlueBERT achieved the highest F1 score of 0.98 at both evaluation levels for the HIV criterion. ClinicalBERT and SciBERT both achieved the highest F1 score for the psychiatric illness criterion, with 0.96 and 0.97 in the sentence-level and trial-level evaluation, respectively. For the HBV criterion, all models achieved 0.97 and 1.00 in the sentence-level and trial-level evaluation, respectively, except for ClinicalBERT, which had a slightly lower score of 0.96. At the sentence level, BioBERT and our pretrained model reached an F1 score of 0.89 for the HCV criterion, while our pretrained model outperformed the other models with an F1 score of 0.91 at the trial level. There was no difference across models for the autoimmune criterion. Lastly, at both evaluation levels, our pretrained model outperformed the other models for the SubDrugAlc criterion, with ClinicalBERT and PubMedBERT also achieving an F1 score of 0.99 in the trial-level evaluation.

Our results indicate that, in the sentence-level evaluation, there are no differences among the models for HBV and autoimmune. Our pretrained model performed as well as or better than the other models on three of the five remaining criteria: prior malignancy, HCV, and SubDrugAlc. In comparison, each of the other models reached the highest F1 score on one criterion. In the trial-level evaluation, our pretrained model performed as well as or better than the other models on two of the five criteria, HCV and SubDrugAlc. ClinicalBERT also reached the highest F1 score on two criteria, while the other models reached the highest F1 score on one of the five criteria. BlueBERT did not reach the highest F1 score on any criterion.

The metrics for the autoimmune criterion are all 1. Such perfect performance is likely due to the nature of the task and the evaluation metrics used. Specifically, the task may be relatively simple, with clear labels and well-defined criteria for what constitutes a true positive, false positive, true negative, and false negative. Additionally, the evaluation metrics used (precision, recall, and F1 score) may not be as sensitive to performance differences as other metrics would be. Another potential reason is the relatively small sample size for autoimmune: the total sample size is 54, with only seven samples in the 0 class. The text for autoimmune is also highly patterned and simple: the exclusion section simply states “autoimmune disease”, and the inclusion section states “no autoimmune disease”. This is fairly simple compared with the descriptions for prior malignancy.

Of all criteria, HCV yields the poorest performance across models. We conducted an error analysis of SciBERT on HCV, since it performed the worst compared with the other models. We found that the model confused HCV with HBV. The two often appear together in the same sentence, but some sentences mention only HBV without HCV, and the model wrongly classified them as HCV.

Criteria	PM						HIV
	Sentence Level			Trial Level			Sentence Level			Trial Level
Models	P	R	F1	P	R	F1	P	R	F1	P	R	F1
BioBERT	0.89	0.89	0.89	0.93	0.93	0.93	0.97	0.97	0.96	0.97	0.97	0.96
ClinicalBERT	0.90	0.90	0.90	0.89	0.89	0.89	0.97	0.97	0.96	0.97	0.97	0.96
PubMedBERT	0.91	0.91	0.91	0.92	0.91	0.91	0.97	0.97	0.96	0.97	0.97	0.96
BlueBERT	0.87	0.87	0.87	0.87	0.86	0.87	0.98	0.98	0.98	0.98	0.98	0.98
SciBERT	0.86	0.86	0.86	0.89	0.89	0.89	0.97	0.97	0.96	0.97	0.97	0.96
Pretrain Model	0.91	0.91	0.91	0.91	0.91	0.91	0.97	0.97	0.96	0.97	0.97	0.96

Criteria	Psychiatric Illness						HBV
	Sentence Level			Trial Level			Sentence Level			Trial Level
Models	P	R	F1	P	R	F1	P	R	F1	P	R	F1
BioBERT	0.95	0.94	0.94	0.96	0.96	0.96	0.86	0.93	0.89	0.90	0.95	0.92
ClinicalBERT	0.97	0.96	0.96	0.97	0.97	0.97	0.86	0.93	0.89	0.90	0.95	0.92
PubMedBERT	0.97	0.96	0.96	0.97	0.97	0.97	0.86	0.93	0.89	0.90	0.95	0.92
BlueBERT	0.95	0.94	0.94	0.96	0.96	0.96	0.86	0.93	0.89	0.90	0.95	0.92
SciBERT	0.97	0.96	0.96	0.97	0.97	0.97	0.86	0.93	0.89	0.90	0.95	0.92
Pretrain Model	0.96	0.95	0.95	0.96	0.96	0.96	0.86	0.93	0.89	0.90	0.95	0.92

Criteria	HCV						Autoimmune
	Sentence Level			Trial Level			Sentence Level			Trial Level
Models	P	R	F1	P	R	F1	P	R	F1	P	R	F1
BioBERT	0.89	0.89	0.89	0.90	0.90	0.89	1.00	1.00	1.00	1.00	1.00	1.00
ClinicalBERT	0.88	0.88	0.88	0.90	0.90	0.89	1.00	1.00	1.00	1.00	1.00	1.00
PubMedBERT	0.86	0.86	0.85	0.87	0.87	0.86	1.00	1.00	1.00	1.00	1.00	1.00
BlueBERT	0.84	0.84	0.84	0.86	0.85	0.85	1.00	1.00	1.00	1.00	1.00	1.00
SciBERT	0.84	0.83	0.83	0.85	0.84	0.83	1.00	1.00	1.00	1.00	1.00	1.00
Pretrain Model	0.89	0.89	0.89	0.92	0.91	0.91	1.00	1.00	1.00	1.00	1.00	1.00

Criteria	SubDrugAlc
	Sentence Level			Trial Level
Models	P	R	F1	P	R	F1
BioBERT	0.98	0.98	0.98	0.98	0.98	0.98
ClinicalBERT	0.98	0.98	0.98	0.99	0.99	0.99
PubMedBERT	0.98	0.98	0.98	0.99	0.99	0.99
BlueBERT	0.97	0.97	0.97	0.98	0.98	0.98
SciBERT	0.92	0.92	0.92	0.95	1.00	0.98
Pretrain Model	0.99	0.99	0.99	0.99	0.99	0.99

5 Discussion

Although sentence-level evaluation provides insight into our models’ overall performance, trial-level evaluation is more practical and informative, as it evaluates performance in a more realistic scenario. In our study, the original trial-level annotation was conducted by checking various sources, including ClinicalTrials.gov, clinicians’ screening of the protocol, and available publications. Another interesting finding is that many trials did not disclose certain criteria on ClinicalTrials.gov and only mentioned them in their protocols. Future work includes extending the current text with more content from protocols or publications.

Although our pretrained model did not show an overwhelming advantage over the other domain-specific models, it still achieved the highest F1 score on several criteria. This indicates that our model can effectively identify and classify relevant information. However, it is important to note that the descriptions of each criterion follow highly similar patterns, which makes the results more predictable. Future studies could incorporate more diverse patterns to better evaluate the model’s performance in a wider range of scenarios.

6 Conclusion

In conclusion, we have successfully trained automatic classifiers using domain-specific BERT-based models to identify each of the selected criteria. We conducted evaluations at both the trial and sentence levels to assess the performance of all models. Currently, we have developed seven classifiers, one for each criterion, but we plan to create a more comprehensive and mature model that can identify all desired criteria simultaneously. In addition, we believe that a language model trained on clinical trial eligibility criteria will be beneficial for a wider range of diseases and trial phases.