COVID-19 Vaccine: Predicting Vaccine Types and Assessing Mortality Risk Through Ensemble Learning Algorithms (2024)


Version 1. F1000Res. 2023; 12: 1200.

Published online 2023 Sep 25. doi:10.12688/f1000research.140395.1

PMCID: PMC11128056

PMID: 38799245

Hind Monadhel, Formal Analysis, Funding Acquisition, Methodology, Resources, Software, Visualization, Writing – Original Draft Preparation,^a,¹ Ayad R. Abbas, Supervision, Writing – Review & Editing,² and Athraa Jasim Mohammed, Supervision, Writing – Review & Editing³


Abstract

Background: There is no doubt that vaccination is crucial for preventing the spread of diseases; however, not every vaccine is perfect or will work for everyone. The main objective of this work is to predict which vaccine will be most effective for a candidate without causing severe adverse reactions and to categorize a patient as potentially at high risk of death from the COVID-19 vaccine.

Methods: A comprehensive analysis was conducted using a dataset on COVID-19 vaccine adverse reactions, exploring binary and multiclass classification scenarios. Ensemble models, including Random Forest, Decision Tree, Light Gradient Boosting, and extreme gradient boosting algorithm, were utilized to achieve accurate predictions. Class balancing techniques like SMOTE, TOMEK_LINK, and SMOTETOMEK were incorporated to enhance model performance.

Results: The study revealed that pre-existing conditions such as diabetes, hypertension, heart disease, history of allergies, prior vaccinations, other medications, age, and gender were crucial factors associated with poor outcomes. Moreover, using medical history, the ensemble learning classifiers achieved accuracy scores ranging from 75% to 87% in predicting the vaccine type and mortality possibility. The Random Forest model emerged as the best prediction model, while the implementation of the SMOTE and SMOTETOMEK methods generally improved model performance.

Conclusion: The random forest model emerges as the top recommendation for machine learning tasks that require high accuracy and resilience. Moreover, the findings highlight the critical role of medical history in optimizing vaccine outcomes and minimizing adverse reactions.

Keywords: Classification algorithm, COVID-19 Vaccine, ensemble learning, machine learning, Sampling methods, Side effects.

Abbreviations

COVID-19: Coronavirus Disease 2019
DT: Decision Trees
LGBM: Light Gradient Boosting Machine
ML: Machine Learning
mRNA: messenger ribonucleic acid
RF: Random Forests
SARS-CoV: Severe Acute Respiratory Syndrome-associated coronavirus
SMOTE: Synthetic Minority Oversampling Technique
VAERS: Vaccine Adverse Event Reporting System
XGB: extreme Gradient Boosting Machine

Introduction

From seven to 13 years of research and development (R&D) and 1.8 million clinical trials to develop a vaccine in the past, we have transitioned to 10 to 18 months of R&D and tens of thousands of clinical trials to start vaccinating against COVID-19 in 2021.¹

Vaccines are biologics that provide active adaptive immunity against particular diseases. A vaccine usually contains an agent that resembles the disease-causing microorganism; it is generally made from a killed or attenuated form of the microorganism, its toxins, or its surface proteins. Administered by injection, nasal spray, or orally, a vaccine stimulates the immune system to recognize and destroy foreign bodies.²

As a result of the novel coronavirus's rapid spread and disease burden, pharmaceutical companies and researchers were forced to create vaccines quickly using either novel or preexisting technologies.³ There are several different types of vaccines, and the purpose of each type is to boost the immune system and prevent serious, life-threatening diseases from occurring.⁴ The approved COVID-19 vaccines employ a variety of mechanisms of action, including mRNA, DNA, viral vector, protein subunit, and inactivated-virus techniques.⁵ Three vaccines have been widely administered: the Pfizer and Moderna (mRNA) vaccines targeting the SARS-CoV-2 surface protein, and the Janssen (viral vector) vaccine, which employs pre-existing technology with an adenovirus vector to trigger an immune response and provide protection against further infection. As these vaccines were developed using various approaches, they differ in efficacy and storage conditions.⁶

However, no vaccine is entirely free from complications or adverse reactions. Any vaccination can have early adverse reactions, including local ones like pain, swelling, and redness, as well as systemic ones like headache, chills, nausea, fatigue, myalgia, and fever.⁷ Also, several existing health conditions or symptoms the candidate already has can lead to severe adverse reactions after taking the COVID-19 vaccine. The candidate's death could be the worst-case scenario. As a result, it's critical to know about the candidate's previous medical history.⁸

This work's main contribution can be summarized as follows:

1. Identify the most important features of an individual's medical history that could contribute to adverse reactions to vaccination.
2. Identify the most important features that contributed to the death of the candidate based on his or her medical history.
3. Address the challenge of the imbalanced dataset by employing sampling methods to effectively handle the imbalance and improve the reliability of the analysis.
4. Develop a machine learning (ML) model capable of predicting and classifying the most suitable vaccine types for each candidate, thus helping to prevent severe consequences and ensure optimal vaccination outcomes.

The rest of this paper is organized as follows: The next section briefly reviews related work. Section 3 provides a detailed explanation of our methodology and dataset. Section 4 discusses the study findings, while Section 5 discusses conclusions and potential future research.

Literature review

Due to the rapid advancement of technology, there are numerous opportunities and possibilities for ML in healthcare.⁹ Classification is the most well-known machine-learning technique in medical applications because it is similar to everyday problems. A classification algorithm builds a model based on training data and then applies it to test data to obtain a prediction.¹⁰

Interestingly, some studies have utilized machine learning applications to predict side effects, reactogenicity, and morbidity incidence following COVID-19 vaccinations. In research by Sujatha et al.,¹¹ the authors developed a model to predict whether a candidate is suitable for COVID-19 vaccination. Four machine learning approaches, namely Logistic Regression, AdaBoost, Random Forest, and Decision Tree, were employed for the prediction task. The authors found that AdaBoost was the classifier with the best performance, with an accuracy of 0.98.

In research by Hatmal et al.,¹² the authors used machine learning and ensemble methods to predict the severity of side effects, defined as no, mild, moderate, or severe side effects. The analysis revealed that random forest and XGBoost achieved the highest accuracy (0.80 and 0.79, respectively) and Cohen’s κ values (0.71 and 0.70, respectively). Statistical data analysis revealed that the side effects differed significantly based on vaccine type. According to this study, the COVID-19 vaccines that the Centers for Disease Control and Prevention (CDC) has approved are safe, and vaccination provides people with a sense of safety. However, a severe case may need additional medical care or even hospitalization.

In research by Lian et al.,¹³ the goal was to collect and analyze tweets about the COVID-19 vaccination to find posts about personal experiences with COVID-19 vaccine adverse events. The authors found that the RF-based ensemble model achieved the best performance, with an F1 score of 0.926, an accuracy of 0.908, and a recall of 0.946. The named entity recognition (NER) model achieved an F1 score of 0.770 for detecting adverse events using the conditional random fields (CRF) algorithm. The results also show that the most common side effects of the three COVID-19 vaccines (Pfizer, Moderna, and Johnson & Johnson) are soreness to touch, fatigue, and headache.

Methods

An overview of the general methodology for developing the machine learning models is shown in Figure 1. In this study, we focus on predicting which vaccine will be most effective for a candidate without causing severe adverse reactions (output) based on several factors (input), and on handling the imbalanced data, which falls under the pre-processing step where data preparation takes place.

Figure 1.

Prediction methodology architecture.

Dataset

The raw data of individuals who received vaccinations and reported adverse reactions was obtained from the VAERS.¹⁴ This dataset contains vaccination information for individuals vaccinated against a variety of diseases including COVID-19, Polio, Tetanus, and Influenza. However, our current study omitted any non-COVID-19 vaccination information. Therefore, the dataset being used consists of 49,810 individuals. This dataset has various attributes of individuals’ information such as age, gender, current illness, medical history, allergic history, type of vaccine, life-threatening illness, symptoms after vaccinations, etc. Some of these attributes are textual (e.g., medical history, symptoms text, etc.), while others are numerical (such as age, number of doses, etc.). Some of the attributes in the VAERS dataset are described in Table 1.

Table 1.

Description of some attributes in the VAERS dataset.

Number	Features	Description	Range	Mean	Standard Deviation
1	AGE_YRS (AS)	Age in years	16-109	57.13	18.43
2	SEX (S)	Sex information: (0: Female, 1: Male)	0-1	56.12	229.37
3	OTHER_MEDS (OM)	Other medications currently being taken	0-1	0.46	0.49
4	CUR_ILL(CL)	Illnesses at the time of vaccination	0-1	0.26	0.43
5	PRIOR_VAX(PV)	Any prior vaccination information	0-1	0.02	0.14
6	VAX_NAME (VN)	Vaccination name	0-2	0.46	0.49
7	Medical History (MH)	Pre-existing chronic or long-standing health conditions	0-1	0.46	0.49
8	ALLERGIES (A)	Any allergy history	0-1	0.25	0.43
9	Died (D)	Died	0-1	0.14	0.35

Preprocessing

The quality of the raw data used in any analysis heavily influences its outcome. Therefore, preprocessing and exploratory data analysis (EDA) are the most important parts of any data-driven investigation. Preprocessing involves examining the data for missing values, irrelevant values, duplicates, etc., whereas EDA assists in understanding the data by visualizing it. The dataset was found to contain many missing and irrelevant values.

Any COVID-19 vaccine types that were not specified were removed, and only two types of values in the sex field were considered: “M” as male and “F” as female. Unknown values were excluded. In the died field, ‘Y’ was considered yes, and the rest were considered ‘no’; in the ‘prior vaccine’ field, ‘yes’ was considered yes, and the rest were considered ‘no’. The analysis of allergic history included considering mentioned allergic effects as positive cases and considering ‘null’, ‘none’, ‘NA’, and other negatively mentioned text as negative cases. The History column in the dataset contained written records of coexisting conditions, requiring the extraction of all of the patient's medical history separately. To better understand the patient's medical history, information about pre-existing chronic and non-chronic diseases, such as chronic obstructive pulmonary disease, hypertension, diabetes, and kidney disease, was extracted. All missing values (i.e., empty, null) were excluded from this field, and spelling/grammar mistakes were fixed.
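The field-level rules described above can be sketched as small encoding functions. This is a minimal illustration, not the authors' code: the helper names, the sample rows, and the exact list of negatively worded allergy entries are assumptions made for the example.

```python
# Hypothetical list of "negative" allergy entries; the study's full list is not given.
NEGATIVE_ALLERGY_TEXT = {"", "null", "none", "na", "n/a", "no", "nka", "unknown"}

def encode_died(value):
    """Return 1 only for an explicit 'Y'; everything else (including missing) is 0."""
    return 1 if value is not None and value.strip().upper() == "Y" else 0

def encode_sex(value):
    """Map 'F' to 0 and 'M' to 1; return None for unknown values so the row can be dropped."""
    mapping = {"F": 0, "M": 1}
    return mapping.get(value.strip().upper()) if value else None

def encode_allergy(text):
    """Return 0 for missing or negatively worded entries, 1 when an allergy is mentioned."""
    if text is None:
        return 0
    cleaned = text.strip().lower().rstrip(".")
    return 0 if cleaned in NEGATIVE_ALLERGY_TEXT else 1

# Illustrative rows only; they are not taken from VAERS.
rows = [
    {"SEX": "F", "DIED": "Y", "ALLERGIES": "Penicillin"},
    {"SEX": "M", "DIED": None, "ALLERGIES": "None."},
    {"SEX": "U", "DIED": "", "ALLERGIES": "NA"},
]
encoded = [
    {
        "sex": encode_sex(r["SEX"]),
        "died": encode_died(r["DIED"]),
        "allergy": encode_allergy(r["ALLERGIES"]),
    }
    for r in rows
]
# Rows with an unknown sex value are excluded, as described in the text.
kept = [e for e in encoded if e["sex"] is not None]
```

Treating everything other than an explicit 'Y' as 'no' keeps missing values from being dropped, at the cost of possibly mislabelling unreported deaths.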

In the feature extraction step, most of the important features in the acquired dataset are presented as textual data. To analyze them, they must be separated into individual entities, so string matching was used to convert all text data into attributes. The correlation plot (Figure 2) did not demonstrate a significant relationship between the various attributes and vaccine types. Yet, previous studies revealed a direct correlation between vaccine adverse reactions and medical and allergic histories. Therefore, the number of entries for each disease in the patients' medical histories was counted. Diseases with more than 300 counts were considered attributes, while the rest were ignored because of the size of the dataset and the computational burden associated with each individual disease. This study, therefore, considered 21 diseases from the patient’s medical history as attributes: diabetes mellitus, thyroid, various types of pain, obesity, migraine, kidney disease, hypertension, hyperlipidemia, high cholesterol, heart disease, Gastroesophageal Reflux Disease (GERD), depression, dementia, positive history of COVID-19, Chronic Obstructive Pulmonary Disease (COPD), cancer, atrial fibrillation, asthma, arthritis, anxiety, and anemia. After the features were identified and extracted, these files were merged into one file using the VAERS ID. The analyzed dataset has 28 different features and 49,810 samples. The data was encoded using a one-hot encoding technique.
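A minimal sketch of string matching followed by a frequency threshold is shown here. The keyword map, function names, and toy histories are hypothetical; the study's actual matching terms are not given in the text, and the threshold of 300 is replaced by 1 so that the toy data produces a visible result.

```python
from collections import Counter

# Hypothetical keyword map used only for illustration.
DISEASE_KEYWORDS = {
    "hypertension": ["hypertension", "high blood pressure", "htn"],
    "diabetes": ["diabetes", "dm2", "t2dm"],
    "asthma": ["asthma"],
}

def extract_conditions(history_text):
    """Return the set of condition names whose keywords appear in the free text."""
    text = (history_text or "").lower()
    return {
        name
        for name, words in DISEASE_KEYWORDS.items()
        if any(word in text for word in words)
    }

def build_features(histories, min_count):
    """Keep conditions seen more than min_count times; emit one 0/1 column per condition."""
    found = [extract_conditions(h) for h in histories]
    counts = Counter(c for conditions in found for c in conditions)
    kept = sorted(c for c, n in counts.items() if n > min_count)
    rows = [{c: int(c in conditions) for c in kept} for conditions in found]
    return kept, rows

# Toy free-text histories; None stands for a missing entry.
histories = ["Hypertension; type 2 diabetes", "HTN, asthma", "high blood pressure", None]
columns, features = build_features(histories, min_count=1)
```

Plain substring matching is simple but can over-match short abbreviations, so a real keyword list would need manual review.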

Figure 2.

Correlation plot between different features of the VAERS dataset.

Data-Sampling Algorithms

In this study, three methods of handling imbalanced data are compared against a baseline. In the baseline, no changes are made to the data: it is simply divided into training and testing data at a ratio of 8 to 2. This baseline is referred to as “Normal” in this study. Next, experiments are conducted using three well-known techniques for balancing the dataset: SMOTE, Tomek links, and SMOTETOMEK, which combines SMOTE and Tomek links.¹⁵ As with the baseline experiment, the dataset is divided into training and testing data at a ratio of 8 to 2. These experiments aim to handle the imbalanced data and further improve the performance of the machine learning classification models, especially in the multiclass classification scenario.
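To make the sampling ideas concrete, the sketch below implements an 80/20 split, a simplified SMOTE-style oversampler, and a Tomek-link finder in plain Python. It is an illustration under stated assumptions, not the study's implementation: all function names are hypothetical, the oversampler interpolates only toward the single nearest minority neighbour (standard SMOTE picks among k nearest neighbours), and the text does not say whether resampling was applied before or after the split.

```python
import math
import random

def train_test_split_80_20(samples, seed=0):
    """Shuffle a copy of the samples and split it at the 80% mark."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def nearest(point, candidates):
    """Return the candidate closest to point (Euclidean), excluding point itself."""
    others = [c for c in candidates if c is not point]
    return min(others, key=lambda c: math.dist(point, c))

def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic points between a minority point and its nearest minority neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = nearest(base, minority)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    def nn(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
    links = []
    for i in range(len(points)):
        j = nn(i)
        if i < j and nn(j) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

# Toy data: two minority points are oversampled with four synthetic points.
minority = [(3.0, 3.0), (3.0, 4.0)]
new_points = smote_like(minority, n_new=4)

# Toy data: points 1 and 2 are close together but belong to different classes.
links = tomek_links([(0.0,), (1.0,), (1.2,), (3.0,)], [0, 0, 1, 1])
```

SMOTE adds synthetic minority samples, Tomek-link removal deletes borderline pairs (usually the majority member), and SMOTETOMEK applies the first and then the second.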

Description of Ensemble Methods

To predict which vaccine will be most effective for a candidate without causing severe adverse reactions (output) based on several factors (input), different machine-learning algorithms were used to build the proposed model.

Random Forest (RF)

RF is a multipurpose data mining approach for classification. It is based on decision trees that operate as an ensemble, an approach that combines multiple classifiers to enhance accuracy. Each tree independently predicts a class and votes for it, and the majority of votes decides the model’s prediction. RF can handle large datasets with high dimensionality; it also improves the accuracy of the model and reduces the overfitting problem.¹⁶
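The voting step can be illustrated with a few lines of Python. This is a sketch of majority voting only; the function name and the votes are hypothetical, and training of the individual trees is omitted.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for one candidate
# (0 = Moderna, 1 = Pfizer, 2 = Janssen, following the paper's encoding).
votes = [1, 0, 1, 2, 1]
prediction = majority_vote(votes)
```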

Decision Tree (DT)

A DT is a supervised learning technique that can be used for classification and regression problems; however, it is most commonly used for classification. In this tree-structured classifier, the internal nodes represent features of the dataset, branches represent decision rules, and each leaf node represents an outcome. A DT has two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have several branches, whereas leaf nodes are the results of those decisions and have no further branches. Decisions or tests are made based on the features of the dataset.¹⁷
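The distinction between decision nodes and leaf nodes can be shown with a hand-written tree. The tree below is purely hypothetical: the features, thresholds, and outcomes are chosen for illustration and are not learned from the VAERS data.

```python
# Hypothetical tree: decision nodes hold a feature test, leaf nodes hold an outcome.
tree = {
    "feature": "AGE_YRS", "threshold": 65,
    "left": {"leaf": "lower risk"},
    "right": {
        "feature": "heart_disease", "threshold": 0.5,
        "left": {"leaf": "lower risk"},
        "right": {"leaf": "higher risk"},
    },
}

def predict(node, sample):
    """Walk from the root, following one branch per decision node, until a leaf is reached."""
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

result = predict(tree, {"AGE_YRS": 72, "heart_disease": 1})
```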

Extreme Gradient Boosting (XGB)

XGBoost is an ensemble learning method combining multiple weak models' predictions to generate a stronger prediction. In the beginning, XGB fits the data to a weak classifier. Afterward, the data is fitted to another weak classifier to increase accuracy without affecting the current model. In the same way, the process continues until the best accuracy is achieved.¹⁸ Furthermore, XGBoost supports parallel processing, making it possible to train models on large datasets in a reasonable period of time.
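The stage-wise procedure described above is commonly written as an additive update. The notation here ($F_m$, $h_m$, $\eta$, $M$) is introduced for illustration and is the generic gradient-boosting form rather than a formula taken from the study:

$F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x), \quad m = 1, \dots, M$

Here $F_{m-1}$ is the ensemble after $m - 1$ rounds, $h_m$ is the weak learner fitted in round $m$ to the errors of $F_{m-1}$, $\eta$ is the learning rate, and $M$ is the number of boosting rounds.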

Light Gradient Boosting Machine (LGBM)

LGBM is an open-source gradient boosting decision tree (GBDT) algorithm designed by Microsoft Research Asia. The framework grows trees vertically (leaf-wise) rather than horizontally (level-wise), as other tree-based frameworks do. Therefore, it can reduce the loss more efficiently and handle huge datasets with less computational complexity owing to its lightweight design.¹⁹

Model performance evaluation

Macro average and Weighted Average are used to calculate the performance of the four classifiers used for learning.
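The difference between the two averaging schemes can be seen in a short sketch. The per-class scores and class sizes below are hypothetical and are not results from the study.

```python
def macro_average(scores):
    """Unweighted mean: every class counts equally, regardless of its size."""
    return sum(scores) / len(scores)

def weighted_average(scores, supports):
    """Mean weighted by the number of samples in each class."""
    total = sum(supports)
    return sum(score * n for score, n in zip(scores, supports)) / total

# Hypothetical per-class recall and class sizes for three vaccine classes.
per_class_recall = [0.90, 0.80, 0.40]
class_sizes = [500, 400, 100]

macro = macro_average(per_class_recall)
weighted = weighted_average(per_class_recall, class_sizes)
```

With these numbers the macro average is about 0.70 and the weighted average about 0.81, so on an imbalanced dataset the macro average is the one that exposes weak performance on a small class.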

• Accuracy: The proportion of all classifications that were correct.
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
Here TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.
• Precision: The proportion of predicted positives that are truly positive, which indicates how reliable the model's positive predictions are.
$Precision = \frac{TP}{TP + FP}$
• Recall: The proportion of actual positives that the model detects.
$Recall = \frac{TP}{TP + FN}$
• F-score: The harmonic mean of precision and recall.
$F\text{-}score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$
• ROC Curve & AUC

ROC curves show the performance of the classification model across all classification thresholds. In a ROC curve, the true positive (TP) rate is plotted against the false positive (FP) rate at each classification threshold. “AUC” stands for “Area Under the ROC Curve” and summarizes how well a classifier distinguishes between classes. In general, the higher the AUC value, the better the classifier is at separating the positive from the negative class.²⁰,²¹
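The measures above can be computed directly from confusion-matrix counts and predicted scores. The sketch below is illustrative: the function names and the numbers are hypothetical, and AUC is computed through its pairwise-ranking interpretation (the fraction of positive-negative pairs in which the positive sample receives the higher score) rather than by integrating a plotted curve.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

def auc_by_ranking(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts and scores, chosen only to make the arithmetic easy to follow.
acc, prec, rec, f1 = binary_metrics(tp=40, tn=45, fp=5, fn=10)
auc = auc_by_ranking([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In the toy AUC example, three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75.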

Results and Discussion

The majority of the individuals, 74% in total, were identified as female. The average age of the individuals was about 53 years, and the average age of those who died was about 72 years; thus, there is a noticeable age difference between the two groups. Regarding reported chronic diseases, hypertension had the highest prevalence (13%), followed by asthma (12%), and kidney disease and anemia (2%) (Figure 4). Among those who experienced adverse effects from vaccination, 10.7% died. Figure 3 shows that, in both sexes, the majority of fatalities occurred in individuals between the ages of 70 and 89. The mortality rate of males was also noticeably higher than that of females from the age of 60 to 99 years, which suggests that men experienced more severe adverse effects leading to death.

Figure 3.

Case Fatality Number by Age Band and sex.

Figure 4.

Reported chronic diseases.

Extensive experiments were conducted to predict three significant events in COVID-19 vaccination according to different scenarios. The ML models used in each scenario are RF, DT, XGB, and LGBM. We used 80% of the data for training and 20% for testing to evaluate the effectiveness of the different ML-based approaches. As previously mentioned, this dataset was unbalanced; therefore, we employed sampling strategies to address this problem. A number of well-known performance measures were used to assess the results of classification, including accuracy, precision, recall, F1 score, and ROC-AUC.

Our results are presented as follows: (a) multiclass classification with sampling, (b) binary classification with sampling, and (c) a comparison of the best model for each part.

Multiclass classification results: based upon both medical history and vaccine type

This section presents the results of the multiclass classification for the COVID-19 vaccine prediction problem, along with the analysis and discussion. First, we considered the patient’s medical history as the independent features and the vaccine type (0 means Moderna, 1 means Pfizer, and 2 means Janssen) as the dependent variable. Then each of the three data-sampling procedures (SMOTE, TOMEK-LINKS, and SMOTETOMEK) was applied separately. Figure 5 illustrates the effects of applying the various data-balancing techniques.

Figure 5.

Class label counts before and after applying the various data-sampling techniques for the vaccine type dataset.

The performance parameters for each model on the test dataset are presented in Table 2. The following observations have been noted:

• The testing accuracy values range from approximately 75% to 81% across different models and methods. The Random Forest (RF) and Decision Tree (DT) models with the Normal and TOMEK-LINKS methods achieved the highest testing accuracy of around 80.8%, while the XGBoost (XGB) and LightGBM (LGBM) models achieved lower testing accuracy across all methods, ranging from 75.0% to 76.2%.
• The training accuracy values are relatively close to the testing accuracy values, indicating that the models are not overfitting to the training data. The training accuracy values range from approximately 75.3% to 81.2%.
• Macro Precision, Recall, and F1 Scores: These metrics provide insights into the models' performance for each class, and the macro averaging considers all classes equally. The RF and DT models consistently show similar precision, recall, and F1 scores across different methods, ranging from around 79.0% to 81.6%. The XGB and LGBM models tend to have lower scores, ranging from approximately 70.5% to 78.6%. The RF and DT models generally achieve the highest scores, while the XGB and LGBM models have the lowest scores.
• The AUC (Area Under the Curve) values represent the performance of the models in terms of their ability to rank samples correctly across all classes. The AUC values range from approximately 78% to 85%. The RF models with the SMOTE and SMOTETOMEK methods achieved the highest AUC values of around 85%, indicating better overall performance in distinguishing between different vaccine types.
• Overall, the RF models consistently perform well across different methods, with relatively higher accuracy, precision, recall, F1 scores, and AUC values. The XGB and LGBM models have lower performance compared to the RF and DT models. The SMOTE and SMOTETOMEK methods generally improve the performance of the models, as seen in higher AUC values compared to the Normal and TOMEK-LINKS methods. The RF models achieve relatively high testing accuracy, balanced precision, recall, and F1 scores, as well as high AUC values.

Table 2.

Performance measures of multiclass classification.

Method	Model	Testing Accuracy	Training Accuracy	Macro Precision	Macro Recall	Macro F1 scores	AUC
Normal	RF	0.80823	0.81208	0.81569	0.78993	0.79653	0.84
	DT	0.80823	0.81210	0.81569	0.78993	0.79653	0.84
	XGB	0.76174	0.76480	0.78625	0.72301	0.74137	0.80
	LGBM	0.75218	0.75926	0.77682	0.70460	0.72229	0.78
SMOTE	RF	0.80317	0.80393	0.80226	0.80057	0.79300	0.85
	DT	0.80374	0.80397	0.80180	0.80102	0.79356	0.85
	XGB	0.75740	0.75748	0.74593	0.75048	0.74586	0.81
	LGBM	0.74953	0.75284	0.73641	0.74317	0.73613	0.80
TOMEK-LINKS	RF	0.80823	0.81208	0.81569	0.78993	0.79653	0.84
	DT	0.80823	0.81210	0.81569	0.78993	0.79653	0.84
	XGB	0.76174	0.76480	0.78625	0.72301	0.74137	0.80
	LGBM	0.75218	0.75926	0.77682	0.70460	0.72229	0.78
SMOTETOMEK	RF	0.80358	0.80451	0.80189	0.80089	0.79341	0.85
	DT	0.80374	0.80451	0.80180	0.80102	0.79356	0.85
	XGB	0.75901	0.75789	0.74595	0.75350	0.74740	0.82
	LGBM	0.75708	0.76176	0.74552	0.75530	0.74652	0.82

ROC curves, shown in Figure 6, have been used to further analyze the predictive capability of the developed models. The RF and DT models prove their effectiveness. Taking AUC into account, all developed models perform satisfactorily.

Figure 6.

ROC curves for COVID-19 vaccine type multiclass classification.

Binary classification results

In the binary analysis, we considered the patient’s medical history as the independent features, and the vaccine type (0 means Moderna and 1 means Pfizer) and patient death (0 means alive and 1 means died) as the dependent variables. We trained the models and evaluated them on test data by measuring accuracy, precision, recall, and AUC.

Scenario 1: Based upon both medical history and vaccine type

The performance parameters for each model on the test dataset are presented in Table 3. The following observations have been noted:

1. RF achieved high testing accuracy (0.87091) and training accuracy (0.87439), indicating good generalization and low overfitting. It demonstrated high precision (0.87974), recall (0.87091), and F1 score (0.87424), suggesting a balanced performance between identifying positive and negative instances. The AUC (0.93) indicates a high discriminatory power of the model. Precision was approximately 0.88 for both RF and DT, while XGB and LGBM show comparable precision values of 0.86 and 0.85, respectively.
2. DT achieved similar testing accuracy (0.86975) and the same training accuracy (0.87439) as RF. It showed slightly lower precision (0.8779), recall (0.86975), and F1 score (0.8728) compared to RF. The AUC (0.93) suggests a good ability to distinguish between positive and negative instances.
3. XGB achieved a slightly lower testing accuracy (0.85905) and training accuracy (0.86122) compared to RF and DT. Its precision (0.86031), recall (0.85905), and F1 score (0.8596) are comparable to its testing accuracy, indicating a balanced performance. The AUC (0.91) suggests a reasonably good ability to discriminate between positive and negative instances.
4. LGBM showed the lowest testing accuracy (0.84953) and training accuracy (0.85038) among the models. It had slightly lower precision (0.84771), recall (0.84953), and F1 score (0.84857) compared to the other models. The AUC (0.89) suggests a good ability to distinguish between positive and negative instances, although it is lower than that of RF and DT.
5. The RF and DT models with the vaccine-type target consistently achieved the highest accuracy, recall, precision, F1 score, and AUC, with RF outperforming all others. The XGB and LGBM models had slightly lower performance metrics but still maintained reasonable accuracy and AUC.
6. Thus, the experimental analysis indicates that the RF model is the most suitable for predicting vaccine type compared to the other models.

Table 3.

Experimental performance of the models in Scenario 1 with the binary vaccine type dataset.

Method	Model	Testing Accuracy	Training Accuracy	Precision	Recall	F1 scores	AUC
Normal	RF	0.87091	0.87439	0.87974	0.87091	0.87424	0.93
	DT	0.86975	0.87439	0.8779	0.86975	0.8728	0.93
	XGB	0.85905	0.86122	0.86031	0.85905	0.8596	0.91
	LGBM	0.84953	0.85038	0.84771	0.84953	0.84857	0.89

ROC curves, shown in Figure 7, have been used to further analyze the predictive capability of these models. The RF and DT models prove their effectiveness. Taking AUC into account, all developed models perform satisfactorily.

Figure 7.

Scenario 1: ROC Curve for binary vaccine type dataset.

Scenario 2: based upon both medical history and death

The same experiments were run on the patient death dataset as on the vaccine-type dataset. Figure 8 demonstrates the effect of applying the various data-sampling methods. The performance parameters for each model on the test dataset are presented in Table 4. The following observations have been noted:

1. The testing accuracy values range from approximately 78.4% to 85.7%, depending on the model and method used. The XGB and LGBM models achieve the highest testing accuracy with the Normal and TOMEK-LINKS methods, while RF is highest with SMOTE and SMOTETOMEK. Among the methods, SMOTE and SMOTETOMEK show lower testing accuracy than Normal and TOMEK-LINKS.
2. The training accuracy values are relatively high, ranging from approximately 87% to 95.3%. However, there is a notable difference between the training accuracy and testing accuracy values, suggesting potential overfitting issues, especially for the RF and DT models.
3. Precision, Recall, and F1 scores: The precision, recall, and F1 scores provide insights into the models' performance for predicting the positive class (death possibility). The scores are close across models, with no single model leading on all three. Among the methods, SMOTE and SMOTETOMEK show higher precision but lower recall than Normal and TOMEK-LINKS.
4. The AUC (Area Under the Curve) values represent the models' ability to rank samples correctly and discriminate between positive and negative classes. The AUC values range from approximately 65% to 86%. The RF, XGB, and LGBM models achieve higher AUC values (0.81 to 0.86) than the DT models (0.65 to 0.70), indicating better overall performance in distinguishing between death outcomes.
5. The models trained on the normal data generally performed better in terms of accuracy and AUC compared to the models trained on the modified datasets (SMOTE, TOMEK-LINKS, SMOTETOMEK). The Random Forest, XGBoost, and LGBM models consistently showed good performance across the metrics in all datasets, indicating their robustness and effectiveness in classification tasks. The Decision Tree model had relatively lower performance, especially in terms of AUC, in all methods.
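The overfitting observation can be checked with simple arithmetic on the "Normal" rows of Table 4. The snippet below only subtracts the reported testing accuracy from the reported training accuracy; the variable names are illustrative.

```python
# (testing accuracy, training accuracy) for the "Normal" rows of Table 4.
death_normal = {
    "RF": (0.84261, 0.95272),
    "DT": (0.82917, 0.95272),
    "XGB": (0.85700, 0.93064),
    "LGBM": (0.85700, 0.91312),
}

# Gap between training and testing accuracy; a larger gap hints at more overfitting.
gaps = {model: round(train - test, 5) for model, (test, train) in death_normal.items()}
largest = max(gaps, key=gaps.get)
```

The gaps are roughly 0.11 for RF, 0.12 for DT, 0.07 for XGB, and 0.06 for LGBM, compared with less than 0.01 for every "Normal" row of Table 2.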

Figure 8.

Class label counts before and after applying the various data-sampling techniques for the death dataset.

Table 4.

Performance measures of different methods with death dataset.

Method	Model	Testing Accuracy	Training Accuracy	Precision	Recall	F1 scores	AUC
Normal	RF	0.84261	0.95272	0.82632	0.84261	0.83277	0.83
	DT	0.82917	0.95272	0.81553	0.82917	0.82151	0.66
	XGB	0.85700	0.93064	0.83523	0.85700	0.84072	0.86
	LGBM	0.85700	0.91312	0.83938	0.85700	0.84503	0.86
SMOTE	RF	0.81861	0.92647	0.84585	0.81861	0.82961	0.83
	DT	0.81285	0.92647	0.83576	0.81285	0.82256	0.70
	XGB	0.79942	0.88992	0.84183	0.79942	0.81556	0.81
	LGBM	0.80614	0.87468	0.85443	0.80614	0.82339	0.82
TOMEK-LINKS	RF	0.84261	0.95240	0.82632	0.84261	0.83277	0.83
	DT	0.83301	0.95240	0.82065	0.83301	0.82608	0.67
	XGB	0.84932	0.92920	0.83149	0.84932	0.83791	0.85
	LGBM	0.85508	0.92002	0.83641	0.85508	0.84236	0.86
SMOTETOMEK	RF	0.80326	0.9198	0.84520	0.80326	0.81909	0.82
	DT	0.78406	0.91983	0.82101	0.78406	0.79924	0.65
	XGB	0.79846	0.88383	0.84462	0.79846	0.81563	0.81
	LGBM	0.79846	0.86958	0.84880	0.79846	0.81665	0.82

ROC curves, shown in Figure 9, have been used to further analyze the predictive capability of these models. Taking AUC into account, the RF, XGB, and LGBM models perform satisfactorily, while the DT model lags behind.

Figure 9.

Scenario 2: ROC curve for the death dataset.

The importance of all the features in the COVID-19 vaccine adverse reactions dataset is calculated using the feature importance functionality of the Scikit-learn Python library. A visual representation of the calculated feature importance values is displayed in Figure 10. The features are arranged based on their respective importance scores.

Figure 10.

Ranking of features based on the patients' medical history coefficient values.
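Tree-based estimators in scikit-learn expose impurity-based scores through the `feature_importances_` attribute, so the ranking in Figure 10 was plausibly obtained that way; this is an assumption, since the exact call is not given. The sketch below shows the underlying idea for a single split on toy data: a feature matters more when splitting on it reduces Gini impurity more. The feature names and the six rows are hypothetical, not values from the dataset.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def impurity_decrease(rows, labels, feature):
    """Reduction in Gini impurity from splitting on one binary feature."""
    left = [y for r, y in zip(rows, labels) if r[feature] == 0]
    right = [y for r, y in zip(rows, labels) if r[feature] == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Hypothetical toy records; real importances average over many splits and trees.
rows = [
    {"heart_disease": 1, "asthma": 0},
    {"heart_disease": 1, "asthma": 1},
    {"heart_disease": 1, "asthma": 0},
    {"heart_disease": 0, "asthma": 1},
    {"heart_disease": 0, "asthma": 0},
    {"heart_disease": 0, "asthma": 1},
]
died = [1, 1, 1, 0, 0, 0]

scores = {f: impurity_decrease(rows, died, f) for f in rows[0]}
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

In a fitted forest, these decreases are accumulated over every split that uses a feature and then normalized, which yields the kind of ranked bar chart shown in Figure 10.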

Figure 10 shows that patients' age, gender, and use of other medicines were significant medical-history factors for all target variables. For the target variable "vaccine type," the analysis identified a set of attributes within the patient's medical history that strongly influence the selection of the administered vaccine: previous vaccine history, allergic history, diabetes, arthritis, hypertension, and asthma. For the target variable "death status," the most significant factors were heart disease, allergic history, dementia, hypertension, diabetes, kidney disease, and chronic obstructive pulmonary disease (COPD). These attributes have a noteworthy impact on the outcome, indicating their importance in predicting the death status of patients.

The patient's age and gender provide essential demographic information that may impact the choice of vaccine, as certain vaccines have age or gender-specific recommendations. Additionally, considering the patient's current medication usage is crucial to ensure compatibility and potential interactions with the chosen vaccine. Previous vaccine history helps determine if the patient requires a booster or a specific type of vaccine.

The presence of underlying conditions such as diabetes, arthritis, allergic history, hypertension, and asthma is highly influential in the decision-making process. These conditions may affect the patient's immune response or make them more susceptible to certain vaccine side effects. By considering these attributes, healthcare professionals can tailor the vaccine type to maximize efficacy and minimize risks for each patient.

Strengths and limitations

As far as the authors are aware, this is the first study that attempts to predict the type of COVID-19 vaccine appropriate for a candidate, along with the risk of death. Additionally, we suggest approaches to address the issue of imbalanced data concerning adverse reactions to COVID-19 vaccines.

This study has some limitations. Because these data were collected online, we cannot rule out information-gathering bias in the study. Moreover, this data set contained a significant amount of missing data, which may lead to a misrepresentation of patient populations.

Conclusion and future works

In this work, four ML models were evaluated: DT, RF, XGBoost, and LGBM. Three sampling techniques were executed for each model to handle imbalanced data. Below are some of the key findings of the study, which shed light on crucial insights and implications:

1. The tree-based model RF presented the best overall results with multiclass classification.
2. The SMOTE and SMOTETOMEK methods generally improve the performance of the models, as seen in higher AUC values compared to the Normal and TOMEK-LINKS methods.
3. For binary classification in scenario 1, the experimental analysis recommends the RF model as the most suitable for detecting vaccine type compared to the other models.
4. In scenario 2, the RF, XGBoost, and LGBM models consistently showed good performance across the metrics in all methods, indicating their robustness and effectiveness in classification tasks.
5. The Decision Tree model had relatively lower performance, especially in terms of AUC, in all methods.
6. The results revealed that patient age, gender, allergic history, prior vaccine, other medicines, diabetes, hypertension, and heart disease are significant pre-existing factors that strongly influence the selection of the administered vaccine.

According to the study's results, the RF model is recommended for machine learning tasks that demand high accuracy and robustness. While both the XGBoost and LGBM models are also viable options, the RF model could be preferable when dealing with imbalanced data.

This work can also be applied to other vaccination-related datasets. We limited the number of medical history features because of the large dataset size and the computational burden of processing each disease. However, further advancements can be made by automating the system to analyze predictions based on more medical history features. As new data is entered into the dataset, the automation can generate new predictions based on the prevailing factors at that particular moment. Furthermore, deep learning models can also be used to extract more hidden patterns to improve COVID-19 vaccine acceptability by better understanding its dynamics.

Notes

[version 1; peer review: 1 approved

Funding Statement

The author(s) declared that no grants were involved in supporting this work.

Data availability

The dataset used to support the findings of this study is available at: https://vaers.hhs.gov/data/datasets.html.

The dataset comprises three CSV files: VAERSDATA, VAERSVAX, and VAERSSYMPTOMS. VAERSDATA provides comprehensive information about individuals; VAERSVAX offers vaccine details, including vaccination type, manufacturer, dosage count, and vaccination location; and VAERSSYMPTOMS catalogs the symptoms reported following vaccination.

[VAERS Data]: https://vaers.hhs.gov/eSubDownload/index.jsp?fn=2021VAERSDATA.csv.

[VAERS Vaccine]: https://vaers.hhs.gov/eSubDownload/index.jsp?fn=2021VAERSVAX.csv.

[VAERS Symptoms]: https://vaers.hhs.gov/eSubDownload/index.jsp?fn=2021VAERSSYMPTOMS.csv.
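The three files share the `VAERS_ID` key, so they can be joined into one record per report. The following is a minimal sketch using only the Python standard library and tiny in-memory stand-ins for the files; the sample rows are invented for illustration, only a few of the real columns are shown, and the join logic is an assumption rather than the authors' preprocessing pipeline.

```python
import csv
import io
from collections import defaultdict

# Invented miniature stand-ins for VAERSDATA, VAERSVAX and VAERSSYMPTOMS.
data_csv = """VAERS_ID,AGE_YRS,SEX,DIED
1,70,F,Y
2,34,M,
3,52,F,
"""
vax_csv = """VAERS_ID,VAX_TYPE,VAX_MANU
1,COVID19,MODERNA
2,COVID19,JANSSEN
3,FLU3,UNKNOWN MANUFACTURER
"""
symptoms_csv = """VAERS_ID,SYMPTOM1
1,Dyspnoea
2,Headache
2,Pyrexia
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# A report can span several symptom rows, so collect them per VAERS_ID.
symptoms = defaultdict(list)
for row in read_rows(symptoms_csv):
    symptoms[row["VAERS_ID"]].append(row["SYMPTOM1"])

# Keep only COVID-19 vaccine rows (simplification: last row wins per VAERS_ID).
covid_vax = {r["VAERS_ID"]: r for r in read_rows(vax_csv) if r["VAX_TYPE"] == "COVID19"}

merged = []
for person in read_rows(data_csv):
    vax = covid_vax.get(person["VAERS_ID"])
    if vax is None:
        continue  # not a COVID-19 vaccine report
    merged.append({
        "VAERS_ID": person["VAERS_ID"],
        "AGE_YRS": float(person["AGE_YRS"]),
        "SEX": person["SEX"],
        "DIED": 1 if person["DIED"] == "Y" else 0,  # binary target for scenario 2
        "VAX_MANU": vax["VAX_MANU"],                # candidate target for scenario 1
        "SYMPTOMS": symptoms[person["VAERS_ID"]],
    })

print(len(merged), merged[0]["DIED"], merged[1]["SYMPTOMS"])
```

With the real files, the same pattern applies after replacing the in-memory strings with file handles; missing values such as empty ages would need explicit handling before the numeric conversion.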

References

1. Velásquez G: Vaccines, Medicines and COVID-19: How Can WHO Be Given a Stronger Voice? Springer Nature; 2022; 117.

2. Dai X, Xiong Y, Li N, et al.: Vaccine types. Vaccines-the History and Future. IntechOpen; 2019; (pp. 1–18).

3. Eroglu B, Nuwarda RF, Ramzan I, et al.: A Narrative Review of COVID-19 Vaccines. Vaccines. 2021;10(1):62. 10.3390/vaccines10010062

4. Monadhel H, Abbas A, Mohammed A: COVID-19 vaccinations and their side effects: a scoping systematic review [version 1; peer review: awaiting peer review]. F1000Res. 2023;12:604. 10.12688/f1000research.134171.1

5. Vitiello A, Ferrara F: Brief review of the mRNA vaccines COVID-19. Inflammopharmacology. 2021;29(3):645–649. 10.1007/s10787-021-00863-6

6. Patel R, Kaki M, Potluri VS, et al.: A comprehensive review of SARS-CoV-2 vaccines: Pfizer, Moderna & Johnson & Johnson. Hum. Vaccin. Immunother. 2022;18(1):2002083. 10.1080/21645515.2021.2002083

7. Al Khames Aga QA, Alkhaffaf WH, Hatem TH, et al.: Safety of COVID-19 vaccines. J. Med. Virol. 2021;93(12):6588–6594. 10.1002/jmv.27304

8. Sujatha R, Venkata Siva Krishna B, Chatterjee JM, et al.: Prediction of suitable candidates for COVID-19 vaccination. Intell. Autom. Soft Comput. 2022;32(1):525–541. 10.3233/JIFS-202714

9. Javaid M, Haleem A, Singh RP, et al.: Significance of machine learning in healthcare: Features, pillars and applications. Int. J. Intell. Networks. 2022;3:58–73. 10.1016/j.ijin.2022.05.002

10. Keita Z: "Classification in Machine Learning: An Introduction". DataCamp. Sep 2022.

11. Hatmal MMM, Al-Hatamleh MA, Olaimat AN, et al.: Side effects and perceptions following COVID-19 vaccination in Jordan: a randomized, cross-sectional study implementing machine learning for predicting severity of side effects. Vaccines. 2021;9(6):556. 10.3390/vaccines9060556

12. Lian AT, Du J, Tang L: Using a machine learning approach to monitor COVID-19 vaccine adverse events (VAE) from twitter data. Vaccines. 2022;10(1):103. 10.3390/vaccines10010103

13. VAERS Data Sets.

14. Henry M: Imbalanced Classification in Python: SMOTE-Tomek Links Method.

15. Random Forest Algorithm. JavaTpoint. 2018.

16. Decision Tree Classification Algorithm. GeeksforGeeks. 08 May, 2023.

17. Abbas AR, Farooq AO: Skin Detection Using Improved ID3 Algorithm. Iraqi J. Sci. 2019;402–410.

18. XGBoost ML Model in Python. JavaTpoint.

19. Banerjee P: LightGBM Classifier in Python. Kaggle. 2021.

20. Narkhede S: Understanding AUC - ROC Curve. Medium. Jun 26, 2018.

21. Abbas AR, Kareem AR: Age estimation using support vector machine. Iraqi J. Sci. 2018;1746–1756.

Reviewer response for version 1

Published online 2023 Sep 25. doi:10.5256/f1000research.153740.r257039

Dhamodharavadhani S, Referee¹

Overall, this manuscript makes a significant contribution to the field of healthcare analytics by leveraging machine learning techniques to predict vaccine effectiveness and mortality risk associated with COVID-19 vaccination. The findings have important implications for improving vaccination strategies and patient care. However, I recommend some minor revisions for clarity and precision in language.

Here are some potential review questions for the manuscript:

1. What are the main objectives of the study, and why are they important in the context of COVID-19 vaccination?

2. How does the study contribute to the existing literature on vaccine effectiveness and adverse reactions?

3. Can you explain the rationale behind selecting ensemble learning algorithms for this study?

4. How were class balancing techniques like SMOTE, TOMEK_LINK, and SMOTETOMEK incorporated into the analysis, and why were they deemed necessary?

5. What were the key factors identified as crucial for predicting vaccine effectiveness and mortality risk?

6. Can you elaborate on the performance metrics used to evaluate the ensemble learning classifiers, and how do they reflect the predictive accuracy of the models?

7. How do the performance outcomes of different ensemble learning algorithms compare in predicting vaccine types and mortality risk?

8. What insights do the findings provide regarding the relationship between pre-existing conditions, medical history, and vaccine outcomes?

9. What practical implications do the study findings have for healthcare practitioners and policymakers?

10. Based on the results obtained, what recommendations would you provide for optimizing COVID-19 vaccination strategies and minimizing adverse reactions?

11. Are there any limitations of the current study that should be addressed in future research?

12. What additional research avenues do you suggest for further advancing our understanding of vaccine effectiveness and safety prediction?

13. In your opinion, how does this study contribute to advancing knowledge in the field of healthcare analytics and COVID-19 vaccination?

14. What are the strengths and weaknesses of the manuscript, and how could they be addressed to enhance its impact and credibility?

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Partly

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Partly

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

Machine learning, Predictive Analytics, Data Science, optimization, Big data Analytics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

References

1. Nonlinear Neural Network Based Forecasting Model for Predicting COVID-19 Cases. Neural Process Lett. 2023;55(1):171-191. 10.1007/s11063-021-10495-w

2. COVID-19 Mortality Rate Prediction for India Using Statistical Neural Network Models. Front Public Health. 2020;8:441. 10.3389/fpubh.2020.00441

3. COVID-19 mortality rate prediction for India using statistical neural networks and gaussian process regression model. Afr Health Sci. 2021;21(1):194-206. 10.4314/ahs.v21i1.26

4. SEIR Model for COVID-19 Epidemic Using Delay Differential Equation. Journal of Physics: Conference Series. 2021;1767(1). 10.1088/1742-6596/1767/1/012005

5. Novel COVID-19 Mortality Rate Prediction (MRP) Model for India Using Regression Model With Optimized Hyperparameter. Journal of Cases on Information Technology. 2021;23(4):1-12. 10.4018/JCIT.20211001.oa1

6. Coffee leaf valorisation into functional wheat flour rusk: their nutritional, physicochemical, and sensory properties. J Food Sci Technol. 2024;61(6):1117-1125. 10.1007/s13197-024-05927-z

7. Vaccine rate forecast for COVID-19 in Africa using hybrid forecasting models. Afr Health Sci. 2023;23(1):93-103. 10.4314/ahs.v23i1.11

Reviewer response for version 1

Published online 2023 Sep 25. doi:10.5256/f1000research.153740.r245449

Jinran Wu, Referee¹

The authors proposed predicting vaccine types and assessing mortality risk through some ensemble learning approaches. The research topic is interesting, and some comments are given for further consideration.

1. Based on Google Scholar, authors missed citing many recent references that shouldn't be ignored. Here, I suggest using a table to list and compare their main points to highlight your contributions to the area.

2. For the experiment part, all parameter settings are missing, so that the results declared cannot be repeated. Please release your codes or detailed settings in the appendix.

3. For your results, I suggest authors use a cross-validation approach to evaluate the uncertainty of the predictions.

4. Also, considering the imbalance issue, the authors shall consider different penalties for different prediction errors. In other words, we cannot regard "dies" as the same as "non-dies". Authors shall distinguish different losses.

5. The discussion could have been improved. The authors shall further explore the underlying implications in this part. Otherwise, the work looks like a mathematical game. In particular, the authors shall connect results to some findings from some top medicine journals.

6. The authors shall use a professional writing service to make the content clear.

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Partly

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Is the study design appropriate and is the work technically sound?

Partly

Are the conclusions drawn adequately supported by the results?

Partly

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

Machine learning, Forecasting, Applied Statistics

Reviewer response for version 1

Published online 2023 Sep 25. doi:10.5256/f1000research.153740.r257042

Aritra Ghosh, Referee¹

Full Report:

Introduction: The introduction provides a comprehensive overview of the transition in vaccine development timelines and the urgent need for effective COVID-19 vaccines. It highlights the significance of understanding vaccine mechanisms and adverse reactions, setting the stage for the proposed machine-learning framework. However, it could benefit from a succinct statement of the study's objectives to guide readers through the subsequent sections more effectively.

Literature Review: The literature review effectively contextualizes the study within existing research on machine learning applications in healthcare, specifically focusing on COVID-19 vaccine prediction. It provides insights into relevant studies while emphasizing the novelty and contributions of the current work. However, the review could be strengthened by discussing potential limitations or gaps in previous research, thereby justifying the need for the proposed study more explicitly.

Methods: The methods section is detailed and well-structured, outlining the data preprocessing, feature extraction, and modeling techniques employed. It effectively communicates the rationale behind each step and provides clarity on the experimental design. The inclusion of figures and tables enhances the understanding of complex methodologies. However, providing more information on the rationale behind the selection of specific sampling techniques and model evaluation metrics would strengthen the methodology further.

Results and Discussion: The results and discussion section presents comprehensive findings from the study, including performance metrics and an analysis of key features. The results are effectively communicated through tables, figures, and textual descriptions, facilitating interpretation. The discussion contextualizes the findings within the broader literature and highlights implications for vaccine selection and adverse reaction prediction. However, a more structured approach to discussing limitations and future research directions would enhance the clarity of the discussion.

Conclusion and Future Works: The conclusion summarizes the key findings and implications of the study while outlining potential avenues for future research. It effectively emphasizes the significance of the study's contributions and underscores the importance of continued research in this area. However, providing more specific recommendations for addressing identified limitations and mitigating potential biases would enhance the conclusion's comprehensiveness.

Overall Assessment: The abstract and full report provide a detailed and insightful analysis of the proposed machine-learning framework for COVID-19 vaccine prediction. The study demonstrates a rigorous approach to data analysis and model evaluation, yielding valuable insights into vaccine efficacy and adverse reactions. Addressing the following minor issues would further enhance the scientific soundness and readability of the report:

Clarity and Readability:
- Ensure consistent terminology throughout the report. For example, use either "COVID-19 vaccine" or "SARS-CoV-2 vaccine" consistently instead of switching between them.
- Consider breaking down lengthy paragraphs into shorter ones for easier readability and comprehension, especially in sections like "Results and Discussion" and "Conclusion and Future Works."
- Provide clear transitions between sections to guide the reader through the report more effectively. Each section should flow logically from one to the next.
Justification:
- Provide more justification for the choice of machine learning algorithms. Explain why Random Forest (RF), Decision Tree (DT), Extreme Gradient Boosting (XGB), and Light Gradient Boosting Machine (LGBM) were selected over other algorithms. Justify why these algorithms are suitable for the task at hand.
- Clarify the reasoning behind choosing specific data sampling techniques (e.g., SMOTE, Tomek-links, SMOTETOMEK) to handle imbalanced data. Explain why these techniques were deemed appropriate and how they contribute to improving model performance.
Structure:
- Consider refining the structure of the report to make it more cohesive and organized. For instance, ensure that each section has a clear and specific focus, with subheadings to delineate different topics within the section.
- Provide a brief overview or summary at the beginning of each section to outline the main points that will be discussed. This will help readers understand the purpose and scope of each section more clearly.
- In the "Conclusion and Future Works" section, provide a concise summary of the key findings and implications of the study. Additionally, offer specific suggestions for future research directions based on the limitations or areas for improvement identified in the study.

By addressing these minor issues, the report will become more scientifically sound and easier to follow for readers, thereby enhancing its overall quality and impact.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Data Analysis, AI and ML, HCI, Computational Modeling and Big Data, and Web Development. Currently working on the application of Machine Learning for COVID-19 vaccine development.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Articles from F1000Research are provided here courtesy of F1000 Research Ltd
