Introduction

Cardiovascular disease is the leading cause of death in the United States. One person dies every 36 seconds in the United Statates from cardiovascular disease (https://www.cdc.gov/heartdisease/). In order to study what contributes to disease progression, analyzing risk factors in our daily life could provide important sights on the diagnosis and prognosis. The following analysis will shed light on cardiovascular risk factors and potentially guide us to a healthier way to lead our lives.

Data Source

The following analysis is performed on a dataset with various cardiovascular disease risk factors and patient outcomes from: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

heart_data <- read.csv("cardio_train.csv", sep=";")

Data Exploration

Initially, we want to understand the dimensions of the data and what information is encoded in the categories.

Data Dimension

The dimension of the data is:

dim(heart_data)
## [1] 70000    13

We can see that there is a total of 70,000 entries of patient outcome. According to the explanation on the dataset, the variables are encoded in the following way:

  1. Age | Objective Feature | age | int (days)
  2. Height | Objective Feature | height | int (cm) |
  3. Weight | Objective Feature | weight | float (kg) |
  4. Gender | Objective Feature | gender | categorical code |
  5. Systolic blood pressure | Examination Feature | ap_hi | int |
  6. Diastolic blood pressure | Examination Feature | ap_lo | int |
  7. Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
  8. Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
  9. Smoking | Subjective Feature | smoke | binary |
  10. Alcohol intake | Subjective Feature | alco | binary |
  11. Physical activity | Subjective Feature | active | binary |
  12. Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

Data Variables

The variables in our dataset are:

glimpse(heart_data)
## Rows: 70,000
## Columns: 13
## $ id          <int> 0, 1, 2, 3, 4, 8, 9, 12, 13, 14, 15, 16, 18, 21, 23, 24, …
## $ age         <int> 18393, 20228, 18857, 17623, 17474, 21914, 22113, 22584, 1…
## $ gender      <int> 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, …
## $ height      <int> 168, 156, 165, 169, 156, 151, 157, 178, 158, 164, 169, 17…
## $ weight      <dbl> 62, 85, 64, 82, 56, 67, 93, 95, 71, 68, 80, 60, 60, 78, 9…
## $ ap_hi       <int> 110, 140, 130, 150, 100, 120, 130, 130, 110, 110, 120, 12…
## $ ap_lo       <int> 80, 90, 70, 100, 60, 80, 80, 90, 70, 60, 80, 80, 80, 70, …
## $ cholesterol <int> 1, 3, 3, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ gluc        <int> 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, …
## $ smoke       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ alco        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ active      <int> 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, …
## $ cardio      <int> 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …

Data Preprocessing

Summary

A summary of the data and removing the ID’s

summary(heart_data)
##        id             age            gender         height     
##  Min.   :    0   Min.   :10798   Min.   :1.00   Min.   : 55.0  
##  1st Qu.:25007   1st Qu.:17664   1st Qu.:1.00   1st Qu.:159.0  
##  Median :50002   Median :19703   Median :1.00   Median :165.0  
##  Mean   :49972   Mean   :19469   Mean   :1.35   Mean   :164.4  
##  3rd Qu.:74889   3rd Qu.:21327   3rd Qu.:2.00   3rd Qu.:170.0  
##  Max.   :99999   Max.   :23713   Max.   :2.00   Max.   :250.0  
##      weight           ap_hi             ap_lo           cholesterol   
##  Min.   : 10.00   Min.   : -150.0   Min.   :  -70.00   Min.   :1.000  
##  1st Qu.: 65.00   1st Qu.:  120.0   1st Qu.:   80.00   1st Qu.:1.000  
##  Median : 72.00   Median :  120.0   Median :   80.00   Median :1.000  
##  Mean   : 74.21   Mean   :  128.8   Mean   :   96.63   Mean   :1.367  
##  3rd Qu.: 82.00   3rd Qu.:  140.0   3rd Qu.:   90.00   3rd Qu.:2.000  
##  Max.   :200.00   Max.   :16020.0   Max.   :11000.00   Max.   :3.000  
##       gluc           smoke              alco             active      
##  Min.   :1.000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:1.0000  
##  Median :1.000   Median :0.00000   Median :0.00000   Median :1.0000  
##  Mean   :1.226   Mean   :0.08813   Mean   :0.05377   Mean   :0.8037  
##  3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##      cardio      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4997  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
heart_data$id <- NULL

Outliers

Find potential outliers

stacked_heart_data <- stack(heart_data)
ggplot(stacked_heart_data, aes(x = ind, y = values)) +
  geom_boxplot() + 
  facet_wrap(~ind , scales = "free")

There are outliers in height, weight, ap_hi, and ap_lo.

height_outlier_idx <- which(heart_data$height %in% boxplot(heart_data$height)$out)

heart_data <- heart_data[-height_outlier_idx, ]

weight_outlier_idx <- which(heart_data$weight %in% boxplot(heart_data$weight)$out)

heart_data <- heart_data[-weight_outlier_idx, ]

ap_hi_outlier_idx <- which(heart_data$ap_hi %in% boxplot(heart_data$ap_hi)$out)

heart_data <- heart_data[-ap_hi_outlier_idx, ]

ap_lo_outlier_idx <- which(heart_data$ap_lo %in% boxplot(heart_data$ap_lo)$out)

heart_data <- heart_data[-ap_lo_outlier_idx, ]

nrow(heart_data)
## [1] 62505

After filtering out outliers in those four categories, there are still 62,505 entries left.

Duplicates

Find duplicate entries

print(duplicated(heart_data$id))
## logical(0)

There are no duplicate values.

Data Analysis

Overall Correlation

cormat <- round(cor(heart_data), 3)
melted_cormat <- melt(cormat)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() +
  ggtitle("Correlation Between Variables") +
  xlab("Variables") + 
  ylab("Variables") +
  theme(axis.text=element_text(size=7))

Logistic Regression

Based on variables with high correlation shown from the overall correlation heatmap, we can tell that glucose level, cholesterol, diastolic and systolic blood pressure, weight, and age correlate with cardiovascular disease. Hence, we construct a logistic regression model to predict the disease outcome.

logit_model <- glm(cardio ~ gluc + cholesterol + ap_lo + ap_hi + weight + age, data = heart_data)
summary(logit_model)
## 
## Call:
## glm(formula = cardio ~ gluc + cholesterol + ap_lo + ap_hi + weight + 
##     age, data = heart_data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.33340  -0.36728  -0.07491   0.39317   1.12732  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.064e+00  2.297e-02 -89.856  < 2e-16 ***
## gluc        -1.786e-02  3.470e-03  -5.147 2.65e-07 ***
## cholesterol  9.427e-02  2.978e-03  31.656  < 2e-16 ***
## ap_lo        3.352e-03  3.240e-04  10.343  < 2e-16 ***
## ap_hi        1.163e-02  1.772e-04  65.638  < 2e-16 ***
## weight       2.074e-03  1.482e-04  13.996  < 2e-16 ***
## age          2.849e-05  7.356e-07  38.727  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1928831)
## 
##     Null deviance: 15624  on 62504  degrees of freedom
## Residual deviance: 12055  on 62498  degrees of freedom
## AIC: 74528
## 
## Number of Fisher Scoring iterations: 2

The odds ratio and the 95% confidence interval for each variable is

exp(cbind(odds_ratio = coef(logit_model), confint(logit_model)))
## Waiting for profiling to be done...
##             odds_ratio     2.5 %    97.5 %
## (Intercept)  0.1269571 0.1213684 0.1328031
## gluc         0.9822977 0.9756395 0.9890013
## cholesterol  1.0988605 1.0924651 1.1052932
## ap_lo        1.0033573 1.0027203 1.0039948
## ap_hi        1.0117008 1.0113494 1.0120523
## weight       1.0020763 1.0017852 1.0023674
## age          1.0000285 1.0000270 1.0000299

We could see that a high cholesterols level meant higher odds of getting cardiovascular disease.

PCA

We can also perform PCA to see if the variables have a linear relationship and are separable based on disease-status.

heart_data.pca <- prcomp(heart_data[,-12], center = TRUE, scale. = TRUE)
autoplot(heart_data.pca, data=heart_data, colour="cardio")

We can see that disease status could be separated based on the variables in the first principle component.

Gender Disparities

Let’s investigate if there are any gender disparities for cardiovascular disease!

gender_table <- table(gender=heart_data$gender, CAD=heart_data$cardio)
colnames(gender_table) <- c("Have", "Don't Have")
rownames(gender_table) <- c("Male", "Female")
mosaic(gender_table, shade=TRUE, legend=TRUE)

There are no statistically signifcant relationship between gender and cardiovascular disease.

ANOVA

Lastly, we will perform ANOVA on all of the variables to see if there is a statistically significant relationship between these factors and cardiovascular disease.

summary(aov(cardio ~ age, data = heart_data))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## age             1    874   874.5    3706 <2e-16 ***
## Residuals   62503  14749     0.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ gender, data = heart_data))
##                Df Sum Sq Mean Sq F value Pr(>F)
## gender          1      0  0.3569   1.428  0.232
## Residuals   62503  15624  0.2500
summary(aov(cardio ~ height, data = heart_data))
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## height          1     12   11.90   47.63 5.19e-12 ***
## Residuals   62503  15612    0.25                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ weight, data = heart_data))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## weight          1    409   408.9    1680 <2e-16 ***
## Residuals   62503  15215     0.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ ap_hi, data = heart_data))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## ap_hi           1   2919  2918.9   14360 <2e-16 ***
## Residuals   62503  12705     0.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ ap_lo, data = heart_data))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## ap_lo           1   1767  1767.0    7970 <2e-16 ***
## Residuals   62503  13857     0.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ cholesterol, data = heart_data))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## cholesterol     1    744   744.2    3126 <2e-16 ***
## Residuals   62503  14880     0.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ gluc, data = heart_data))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## gluc            1    115  114.87     463 <2e-16 ***
## Residuals   62503  15509    0.25                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ smoke, data = heart_data))
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## smoke           1      8    7.67    30.7 3.03e-08 ***
## Residuals   62503  15616    0.25                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ alco, data = heart_data))
##                Df Sum Sq Mean Sq F value  Pr(>F)   
## alco            1      2  2.3676   9.473 0.00209 **
## Residuals   62503  15622  0.2499                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ active, data = heart_data))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## active          1     22   22.20   88.94 <2e-16 ***
## Residuals   62503  15602    0.25                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that there is a significant relationship for all variables except gender.

Conclusion

Through analyzing this dataset, we have investigated the risk factors for cardiovascular disease. By recognizing these risk factors, we can use the knowledge to guide us in living a healthier life.