Logistic regression for cardiovascular diseases prediction by integrating PCA and K-means ++

This research introduces a method for forecasting cardiovascular diseases (CVDs) that combines K-means++ clustering, Principal Component Analysis (PCA), and Logistic Regression. Given the global impact of cardiovascular diseases as a primary cause of death, this research uses a comprehensive dataset to tackle the prediction challenges associated with CVDs. The study first applies K-means++ to improve data quality, then PCA for dimensionality reduction, and finally Logistic Regression for outcome prediction, achieving an accuracy of 98.31%, a specificity of 98.00%, and a sensitivity of 98.49%. This approach offers a promising avenue for early and accurate CVD detection and outperforms the other predictive models it is compared against. By refining the data through these steps, the study aims to build the predictive model on a solid foundation, enhancing the reliability and generalizability of the predictions. The integration of these analytical techniques marks a step forward in the pursuit of effective cardiovascular disease management and highlights the importance of data preprocessing in predictive modeling.


Introduction
Worldwide, cardiovascular diseases persist as the leading cause of mortality, accounting for over 30% of all deaths. In 2019 alone, these conditions resulted in the deaths of 17.9 million individuals. Furthermore, there is a trend towards younger people being affected by cardiovascular diseases, especially rheumatic heart disease [1]. The complexity of cardiovascular diseases necessitates a multifaceted approach to understanding the interplay of various risk factors, including but not limited to age, blood pressure, cholesterol levels, and lifestyle choices such as smoking and physical activity. Traditional statistical methods have provided insights into the relationships between individual risk factors and heart disease. The advent of machine learning and data-driven methodologies offers a promising avenue to explore these associations further. This paper uses logistic regression, a binary classifier, to predict CVDs. However, datasets with many dimensions can be challenging to work with because of their significant memory requirements and the risk of overfitting associated with analyzing numerous features [2]. Applying feature weighting can reduce redundancy in the data and cut down on processing time, which enhances the efficiency of the algorithm [3].
Principal component analysis (PCA), the technique employed in this article, is categorized as feature extraction. Its fundamental aim is to streamline the analysis by converting a broad array of variables into a more compact set that preserves the majority of the information in the original data. This method is particularly useful in dealing with multicollinearity among variables, a common issue in epidemiological datasets where risk factors are often intercorrelated.
In recent years, many researchers have incorporated PCA into prediction models for heart disease. Gárate-Escamila et al. used chi-square (CHI) analysis and PCA in combination with machine learning techniques to determine whether patients have heart disease. The study reported an accuracy of 98.7% by integrating CHI, PCA, and the random forest (RF) classifier [4]. Zhu et al. combined PCA and K-means for diabetes prediction and then used logistic regression for classification, attaining an accuracy of 97% [5]. Similarly, Rathore et al. developed a combined clustering and PCA framework for forecasting heart disease through logistic regression, attaining an accuracy of 98.82% [6]. Jhaldiyal et al. used PCA with support vector machines (SVM) to build a prediction model for diabetes, which achieved an accuracy of 93.66% [7].
Before applying PCA to reduce the dimensionality of the dataset, this article uses a clustering method, K-means++, for data cleaning, since original datasets often contain noise, missing values, errors, or inconsistent records. Such inaccurate or low-quality data can lead to unreliable analytical results [8]. Through data cleaning, these issues can be identified and corrected, thereby enhancing the overall quality of the dataset and reducing misleading analyses and erroneous decisions [8].
Several researchers have used clustering methods for data cleaning. Loureiro et al. employed hierarchical clustering techniques to identify outliers, using the sizes of the resulting clusters as signals of outlier existence [9]. To address scalability challenges related to data observation and cleaning, Hu et al. employed a special cut-clustering technique to categorize keys into various groups. This classification is grounded on specific attributes, including age, cell line, and disease type. Organizing keys into these clusters makes it more straightforward to identify duplicates and errors within each group, facilitating efficient data management and quality improvement [10]. Moreover, Guo et al. proposed a data-cleaning method based on improved K-means clustering and error feedback [11].
Nevertheless, K-means chooses the initial centroids randomly, which can make a significant difference to the result. Therefore, a more accurate version of K-means, K-means++, is employed in this article. K-means++ improves upon K-means by choosing initial centroids using a weighted probability distribution, which is more likely to spread out the centroids and avoid poor clustering. Because of this smarter initialization, K-means++ has a higher chance of converging to a final solution closer to the global optimum. Moreover, with a better starting point, K-means++ often requires fewer iterations to converge, saving computational resources, especially in cases where convergence is slow. The K-means++ algorithm demonstrated a minimum improvement in accuracy of 10% over the traditional K-means method, frequently achieving significantly superior performance [12].
Therefore, this article first employs K-means++ to enhance data quality, then uses PCA for dimensionality reduction, and finally applies logistic regression (LR) for outcome prediction. By refining the dataset through these steps, this study aims to address potential biases and ensure that the predictive model is built on a solid foundation, thus enhancing the reliability and accuracy of the predictions.

Data source and description
The "Cardiovascular Disease dataset" by Svetlana Ulianova, obtained from Kaggle and summarized in Table 1, was used for this project. The dataset contains 70,000 observations whose input features fall into three distinct groups: objective features (O), based on factual information; examination features (E), derived from the results of medical examinations; and subjective features (S), information provided by the patients. The dataset also includes a target variable: the presence or absence of cardiovascular disease. The features were standardized as

$$Z = \frac{X - \mu}{\sigma},$$

where σ and μ represent the standard deviation and mean of each variable, and X is the original dataset. This step mitigates the sensitivity of the clustering and PCA algorithms to feature scale.
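As an illustration of the standardization step, the following minimal sketch computes z-scores for a single feature column. The function name `standardize`, the use of the population standard deviation, and the sample values are assumptions made for this example; they are not taken from the study.

```python
from statistics import mean, pstdev

def standardize(column):
    """Return z-scores (x - mu) / sigma for one feature column."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Made-up ages, used only to show the call.
ages = [50.0, 55.0, 60.0, 45.0]
z = standardize(ages)
```

After this transformation each column has mean 0 and standard deviation 1, so no single feature dominates the distance computations used by clustering or the variance computations used by PCA.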

K-means ++.
K-means++ enhances the initialization step of K-means clustering to improve cluster quality. Following the random selection of the initial cluster center, each subsequent center is picked with a probability proportional to its squared distance from the closest center already chosen. This continues until K centers have been established, after which the process proceeds with regular K-means clustering until it stabilizes.
For this model, the K-means++ clustering algorithm, set with K = 2, segmented Z, the standardized dataset, into two distinct clusters. Cluster 0 comprised observations indicative of a healthy classification within the dataset, whereas Cluster 1 included those associated with heart disease. Subsequently, data points that were incorrectly clustered were identified and excluded from both clusters. Following the exclusion of these points, the data from both clusters were merged and shuffled to eliminate any potential sequence dependency present in the original dataset. In this way, subsequent processes such as Principal Component Analysis (PCA) would not be influenced by the original data ordering.
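The seeding procedure can be sketched in a few lines of Python. This is a minimal illustration of K-means++ initialization only (the subsequent regular K-means iterations are omitted); the helper names and the toy points are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, rng):
    """Pick k initial centers using K-means++ seeding."""
    # The first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        # The next center is sampled with probability proportional to d2,
        # so already-chosen points (weight 0) cannot be picked again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Toy data: two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers = kmeans_pp_init(points, k=2, rng=random.Random(0))
```

Because distant points receive much larger weights, the second center in this toy example will usually fall in the group opposite the first, which is the spreading effect that the seeding is designed to produce.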
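The cleaning step might look like the sketch below. It assumes that "incorrectly clustered" means a point whose cluster index disagrees with its target label (cluster 0 for healthy, cluster 1 for disease); the text implies this reading but does not spell out the rule. The function name, the seed, and the toy data are hypothetical.

```python
import random

def remove_mismatched(rows, labels, clusters, seed=0):
    """Keep rows whose cluster index matches the label, then shuffle them."""
    # Assumption: cluster 0 corresponds to label 0 and cluster 1 to label 1.
    kept = [(row, y) for row, y, c in zip(rows, labels, clusters) if c == y]
    # Shuffle so later steps do not depend on the original ordering.
    random.Random(seed).shuffle(kept)
    return kept

# Toy data: the last row has label 0 but was assigned to cluster 1.
rows = [(0.1, 0.2), (0.0, 0.3), (2.1, 1.9), (2.0, 2.2), (1.9, 0.1)]
labels = [0, 0, 1, 1, 0]
clusters = [0, 0, 1, 1, 1]
cleaned = remove_mismatched(rows, labels, clusters)
```

Note that under this reading the filter uses the target label, so the retained subset is, by construction, easier to separate than the raw data.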

PCA.
After shuffling and merging, PCA was used for dimensionality reduction, selecting principal components that together accounted for at least 90% of the variance. The data were then partitioned into training and test sets after the PCA application, with the aim of preventing information leakage and validating model efficacy.
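The 90% variance criterion can be illustrated with a short sketch that takes a list of eigenvalues and returns the smallest number of components whose cumulative share of the total variance reaches the threshold. The function name and the example eigenvalues are made up for illustration.

```python
def components_for_variance(eigenvalues, threshold=0.90):
    """Smallest k whose top-k eigenvalues explain at least `threshold` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)

# Hypothetical eigenvalues: cumulative shares are 0.50, 0.80, 0.95, 1.00.
k = components_for_variance([5.0, 3.0, 1.5, 0.5])
```

With these made-up eigenvalues, three components are needed, because the first two explain only 80% of the variance.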
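PCA begins with the computation of the covariance matrix:

$$\Sigma = \frac{1}{n-1} A^{\top} A,$$

where A is the shuffled and merged dataset derived from Z and n is the number of observations. Eigenvalues and eigenvectors are then computed from Σ, and the eigenvectors are sorted by descending eigenvalue to capture the principal components. The dataset is then transformed into a lower-dimensional space using the top k eigenvectors, resulting in a transformed dataset.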
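As a rough illustration of the covariance step, the sketch below computes Σ = AᵀA / (n − 1) for a small matrix whose columns are assumed to be already centered, as they would be after standardization. The function name and the toy matrix are hypothetical, and the eigendecomposition itself is left to a numerical library.

```python
def covariance_matrix(rows):
    """Sigma = A^T A / (n - 1) for column-centered data A (rows are observations)."""
    n = len(rows)
    d = len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

# Toy centered data: two perfectly correlated features.
A = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)]
sigma = covariance_matrix(A)
```

In this toy case every entry of Σ equals 1.0, reflecting two unit-variance features that move together; a single principal component would then capture all of the variance.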

Logistic regression.
The selected principal components served as features for training the logistic regression model, with 0.5 applied as the decision threshold. The logistic regression model predicts the probability that a given input belongs to the class labeled "1" (as opposed to class "0") using the logistic function. For a set of features X (the principal components), the probability that Y = 1 is expressed as:

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}},$$

where $P(Y = 1 \mid X)$ is the probability that the outcome Y is 1 given the predictors X; $\beta_0, \beta_1, \dots, \beta_n$ are the coefficients of the model, comprising the intercept $\beta_0$ and the slope coefficients $\beta_1, \dots, \beta_n$ for the predictors $X_1, \dots, X_n$; and e is the base of the natural logarithm.
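The prediction rule can be sketched as follows. The coefficient values are invented for the example and are not fitted to the study's data; treating a probability exactly equal to the threshold as class 1 is also an assumption.

```python
import math

def predict_proba(x, intercept, coefs):
    """Logistic function applied to the linear combination of the features."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, intercept, coefs, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_proba(x, intercept, coefs) >= threshold else 0

# Hypothetical coefficients for two principal components.
p = predict_proba([2.0, 1.0], intercept=-1.0, coefs=[1.0, 0.5])
label = predict([2.0, 1.0], intercept=-1.0, coefs=[1.0, 0.5])
```

Here the linear term is −1.0 + 2.0 + 0.5 = 1.5, which maps to a probability above 0.5, so the observation is assigned to class 1.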
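Finally, the performance of the logistic regression model was evaluated on the test set, with accuracy, recall, and specificity as the metrics for assessment (Table 2). The confusion matrix provides the counts of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP) from which these metrics are calculated. Accuracy is the proportion of predictions that were correct:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Specificity is the proportion of actual negatives that were correctly identified:

$$\text{Specificity} = \frac{TN}{TN + FP}.$$

Recall (or sensitivity) is the proportion of actual positives that were correctly identified:

$$\text{Recall} = \frac{TP}{TP + FN}.$$

Higher values of accuracy, recall, and specificity generally indicate better performance of a logistic regression model.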
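For concreteness, the sketch below derives the confusion-matrix counts from true and predicted labels and then computes the three metrics using their standard definitions. The function names and the label vectors are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def evaluate(tp, tn, fp, fn):
    """Accuracy, recall (sensitivity), and specificity from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Made-up labels: 3 TP, 3 TN, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = evaluate(*confusion_counts(y_true, y_pred))
```

In this toy case all three metrics equal 0.75, since one positive and one negative out of four each are misclassified.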

Feature contribution
The PCA biplot in Figure 1 shows the variables from the dataset projected onto the first two principal components, which form the axes of the graph. The orientation and magnitude of the arrows show the extent of each variable's contribution to these two components. Notably, cholesterol and glucose levels align closely with the second principal component, suggesting that these factors share a similar pattern of variance. Additionally, gender, height, smoking, and alcohol intake appear interrelated, exerting a substantial influence on the first principal component, while age makes a moderate contribution to both components.
This alignment indicates a linkage between these lifestyle factors and the underlying principal components. The proximity of cholesterol to the axis of the second principal component signifies its strong association with this axis, paralleled by gender's alignment with the first principal component. These observations emphasize the relative importance of these variables in the dataset and their potential impact on cardiovascular health outcomes (Figure 2).

Performance improvement
Table 3 shows the changes in accuracy, specificity, and sensitivity for different datasets. With K-means++, PCA, and logistic regression applied, the model distinguished the presence and absence of CVDs with an accuracy of 98.31%, a specificity of 98.00%, and a sensitivity of 98.49%. These metrics collectively indicate that the K-means++ clustering and PCA preprocessing significantly enhance the model's ability to predict cardiovascular diseases. This suggests that the preprocessing steps contribute to the elimination of noise and the reduction of irrelevant information, leading to more accurate predictions by the logistic regression model. The striking improvement from the raw data to the K-means++ and PCA processed data underlines the importance of proper data preprocessing in predictive modeling. The Receiver Operating Characteristic (ROC) curve supports these assertions, offering a detailed depiction of the predictive model's efficacy, as illustrated in Figure 2. An AUC of 0.99736 signifies the model's strong ability to differentiate between individuals with and without cardiovascular conditions. This almost flawless AUC value reflects both a high true positive rate (sensitivity) and a low false positive rate (1 - specificity), highlighting the accuracy of the model in forecasting cardiovascular events.
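To make the AUC figure easier to interpret, the following sketch computes AUC through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name, scores, and labels are invented for the example; this pairwise form is quadratic in the sample size and is meant for illustration rather than for large datasets.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up predicted probabilities and true labels.
value = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

In this example three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75; a value close to 1, as reported above, means nearly every such pair is ordered correctly.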

Comparison with other studies
To further assess the model, this study compares its accuracy with that of other recent studies using the same dataset by Svetlana Ulianova, as shown in Table 4. Rana et al. used weight and height to compute the body mass index (BMI) as a new feature. They then identified and removed outliers using the Interquartile Range (IQR). Finally, they used logistic regression as the classifier and obtained an average accuracy of 72.18% [13]. Comlan et al. used the CRISP-DM framework to develop the prediction model, starting with data selection and preparation. They applied several algorithms as classifiers, and the Decision Tree Classifier gave the highest accuracy of 85% [14]. Shorewala applied ensemble methods such as bagging, boosting, and stacking to enhance the efficacy of classic algorithms. By stacking K-Nearest Neighbors, the random forest classifier, and the support vector machine on top of logistic regression, they achieved an accuracy of 75.1% [15]. Additionally, Theerthagiri and Vidya devised a Recursive Feature Elimination-Gradient Boosting (RFE-GB) strategy, beginning with the dataset's complete feature set and gradually removing the less significant features to isolate a set number of crucial ones [16]. They determined that blood pressure, cholesterol, and physical activity are the key predictors, and then used GB as a classifier to obtain an accuracy of 89.78%. By comparison, the model in this study achieves the highest accuracy among these approaches; all of these studies relied on data preprocessing techniques to improve model performance.

Conclusion
This study took an integrative approach combining K-means++, PCA, and logistic regression to predict the presence of CVDs, using a dataset that covers key objective, examination, and subjective features. The findings underscore the efficacy of K-means++ in refining dataset quality, which, when coupled with PCA for dimensionality reduction, significantly enhances the performance of the logistic regression model. The analysis yielded high accuracy, specificity, and sensitivity, demonstrating the model's capability to distinguish effectively between patients with and without CVDs.
The use of PCA facilitated the identification and retention of the most informative components, thereby mitigating the risks of multicollinearity and overfitting, which are common challenges in high-dimensional datasets. This step was crucial for the subsequent logistic regression phase, ensuring that the model was underpinned by data of high fidelity.
The combination of these methods produced a predictive performance that not only aligns with but also extends the current literature on CVD risk prediction, signifying a step forward in the pursuit of early and accurate disease detection. The study's strength lies in its methodological rigor and the combined application of analytical techniques that together enhance the model's reliability and generalizability.
Future work may explore the integration of additional data sources and the deployment of the model in clinical settings to further validate its practical utility. In striving for a model that performs with high accuracy across diverse populations, we acknowledge the continuous evolution of predictive analytics and its role in transforming cardiovascular disease management.

Proceedings of the 2nd International Conference on Mathematical Physics and Computational Simulation
DOI: 10.54254/2753-8818/38/20240569

Figure 1. Features contribution to the first and second principal components.

Table 1. Features of the Cardiovascular Disease dataset.

Table 3. Performances of each dataset.

Table 4. Accuracy of different prediction models.