Cardiovascular disease is the leading cause of death in the United States. One person dies every 36 seconds in the United Statates from cardiovascular disease (https://www.cdc.gov/heartdisease/). In order to study what contributes to disease progression, analyzing risk factors in our daily life could provide important sights on the diagnosis and prognosis. The following analysis will shed light on cardiovascular risk factors and potentially guide us to a healthier way to lead our lives.
The following analysis is performed on a dataset with various cardiovascular disease risk factors and patient outcomes from: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
heart_data <- read.csv("cardio_train.csv", sep=";")
Initially, we want to understand the dimensions of the data and what information is encoded in the categories.
The dimension of the data is:
dim(heart_data)
## [1] 70000 13
We can see that there is a total of 70,000 entries of patient outcome. According to the explanation on the dataset, the variables are encoded in the following way:
The variables in our dataset are:
glimpse(heart_data)
## Rows: 70,000
## Columns: 13
## $ id <int> 0, 1, 2, 3, 4, 8, 9, 12, 13, 14, 15, 16, 18, 21, 23, 24, …
## $ age <int> 18393, 20228, 18857, 17623, 17474, 21914, 22113, 22584, 1…
## $ gender <int> 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, …
## $ height <int> 168, 156, 165, 169, 156, 151, 157, 178, 158, 164, 169, 17…
## $ weight <dbl> 62, 85, 64, 82, 56, 67, 93, 95, 71, 68, 80, 60, 60, 78, 9…
## $ ap_hi <int> 110, 140, 130, 150, 100, 120, 130, 130, 110, 110, 120, 12…
## $ ap_lo <int> 80, 90, 70, 100, 60, 80, 80, 90, 70, 60, 80, 80, 80, 70, …
## $ cholesterol <int> 1, 3, 3, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ gluc <int> 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, …
## $ smoke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ alco <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ active <int> 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, …
## $ cardio <int> 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
A summary of the data and removing the ID’s
summary(heart_data)
## id age gender height
## Min. : 0 Min. :10798 Min. :1.00 Min. : 55.0
## 1st Qu.:25007 1st Qu.:17664 1st Qu.:1.00 1st Qu.:159.0
## Median :50002 Median :19703 Median :1.00 Median :165.0
## Mean :49972 Mean :19469 Mean :1.35 Mean :164.4
## 3rd Qu.:74889 3rd Qu.:21327 3rd Qu.:2.00 3rd Qu.:170.0
## Max. :99999 Max. :23713 Max. :2.00 Max. :250.0
## weight ap_hi ap_lo cholesterol
## Min. : 10.00 Min. : -150.0 Min. : -70.00 Min. :1.000
## 1st Qu.: 65.00 1st Qu.: 120.0 1st Qu.: 80.00 1st Qu.:1.000
## Median : 72.00 Median : 120.0 Median : 80.00 Median :1.000
## Mean : 74.21 Mean : 128.8 Mean : 96.63 Mean :1.367
## 3rd Qu.: 82.00 3rd Qu.: 140.0 3rd Qu.: 90.00 3rd Qu.:2.000
## Max. :200.00 Max. :16020.0 Max. :11000.00 Max. :3.000
## gluc smoke alco active
## Min. :1.000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:1.0000
## Median :1.000 Median :0.00000 Median :0.00000 Median :1.0000
## Mean :1.226 Mean :0.08813 Mean :0.05377 Mean :0.8037
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## cardio
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4997
## 3rd Qu.:1.0000
## Max. :1.0000
heart_data$id <- NULL
Find potential outliers
stacked_heart_data <- stack(heart_data)
ggplot(stacked_heart_data, aes(x = ind, y = values)) +
geom_boxplot() +
facet_wrap(~ind , scales = "free")
There are outliers in height, weight, ap_hi, and ap_lo.
height_outlier_idx <- which(heart_data$height %in% boxplot(heart_data$height)$out)
heart_data <- heart_data[-height_outlier_idx, ]
weight_outlier_idx <- which(heart_data$weight %in% boxplot(heart_data$weight)$out)
heart_data <- heart_data[-weight_outlier_idx, ]
ap_hi_outlier_idx <- which(heart_data$ap_hi %in% boxplot(heart_data$ap_hi)$out)
heart_data <- heart_data[-ap_hi_outlier_idx, ]
ap_lo_outlier_idx <- which(heart_data$ap_lo %in% boxplot(heart_data$ap_lo)$out)
heart_data <- heart_data[-ap_lo_outlier_idx, ]
nrow(heart_data)
## [1] 62505
After filtering out outliers in those four categories, there are still 62,505 entries left.
Find duplicate entries
print(duplicated(heart_data$id))
## logical(0)
There are no duplicate values.
cormat <- round(cor(heart_data), 3)
melted_cormat <- melt(cormat)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
ggtitle("Correlation Between Variables") +
xlab("Variables") +
ylab("Variables") +
theme(axis.text=element_text(size=7))
Based on variables with high correlation shown from the overall correlation heatmap, we can tell that glucose level, cholesterol, diastolic and systolic blood pressure, weight, and age correlate with cardiovascular disease. Hence, we construct a logistic regression model to predict the disease outcome.
logit_model <- glm(cardio ~ gluc + cholesterol + ap_lo + ap_hi + weight + age, data = heart_data)
summary(logit_model)
##
## Call:
## glm(formula = cardio ~ gluc + cholesterol + ap_lo + ap_hi + weight +
## age, data = heart_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.33340 -0.36728 -0.07491 0.39317 1.12732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.064e+00 2.297e-02 -89.856 < 2e-16 ***
## gluc -1.786e-02 3.470e-03 -5.147 2.65e-07 ***
## cholesterol 9.427e-02 2.978e-03 31.656 < 2e-16 ***
## ap_lo 3.352e-03 3.240e-04 10.343 < 2e-16 ***
## ap_hi 1.163e-02 1.772e-04 65.638 < 2e-16 ***
## weight 2.074e-03 1.482e-04 13.996 < 2e-16 ***
## age 2.849e-05 7.356e-07 38.727 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1928831)
##
## Null deviance: 15624 on 62504 degrees of freedom
## Residual deviance: 12055 on 62498 degrees of freedom
## AIC: 74528
##
## Number of Fisher Scoring iterations: 2
The odds ratio and the 95% confidence interval for each variable is
exp(cbind(odds_ratio = coef(logit_model), confint(logit_model)))
## Waiting for profiling to be done...
## odds_ratio 2.5 % 97.5 %
## (Intercept) 0.1269571 0.1213684 0.1328031
## gluc 0.9822977 0.9756395 0.9890013
## cholesterol 1.0988605 1.0924651 1.1052932
## ap_lo 1.0033573 1.0027203 1.0039948
## ap_hi 1.0117008 1.0113494 1.0120523
## weight 1.0020763 1.0017852 1.0023674
## age 1.0000285 1.0000270 1.0000299
We could see that a high cholesterols level meant higher odds of getting cardiovascular disease.
We can also perform PCA to see if the variables have a linear relationship and are separable based on disease-status.
heart_data.pca <- prcomp(heart_data[,-12], center = TRUE, scale. = TRUE)
autoplot(heart_data.pca, data=heart_data, colour="cardio")
We can see that disease status could be separated based on the variables in the first principle component.
Let’s investigate if there are any gender disparities for cardiovascular disease!
gender_table <- table(gender=heart_data$gender, CAD=heart_data$cardio)
colnames(gender_table) <- c("Have", "Don't Have")
rownames(gender_table) <- c("Male", "Female")
mosaic(gender_table, shade=TRUE, legend=TRUE)
There are no statistically signifcant relationship between gender and cardiovascular disease.
Lastly, we will perform ANOVA on all of the variables to see if there is a statistically significant relationship between these factors and cardiovascular disease.
summary(aov(cardio ~ age, data = heart_data))
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 874 874.5 3706 <2e-16 ***
## Residuals 62503 14749 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ gender, data = heart_data))
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 0 0.3569 1.428 0.232
## Residuals 62503 15624 0.2500
summary(aov(cardio ~ height, data = heart_data))
## Df Sum Sq Mean Sq F value Pr(>F)
## height 1 12 11.90 47.63 5.19e-12 ***
## Residuals 62503 15612 0.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ weight, data = heart_data))
## Df Sum Sq Mean Sq F value Pr(>F)
## weight 1 409 408.9 1680 <2e-16 ***
## Residuals 62503 15215 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ ap_hi, data = heart_data))
## Df Sum Sq Mean Sq F value Pr(>F)
## ap_hi 1 2919 2918.9 14360 <2e-16 ***
## Residuals 62503 12705 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ ap_lo, data = heart_data))
## Df Sum Sq Mean Sq F value Pr(>F)
## ap_lo 1 1767 1767.0 7970 <2e-16 ***
## Residuals 62503 13857 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ cholesterol, data = heart_data))
## Df Sum Sq Mean Sq F value Pr(>F)
## cholesterol 1 744 744.2 3126 <2e-16 ***
## Residuals 62503 14880 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ gluc, data = heart_data))
## Df Sum Sq Mean Sq F value Pr(>F)
## gluc 1 115 114.87 463 <2e-16 ***
## Residuals 62503 15509 0.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ smoke, data = heart_data))
## Df Sum Sq Mean Sq F value Pr(>F)
## smoke 1 8 7.67 30.7 3.03e-08 ***
## Residuals 62503 15616 0.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ alco, data = heart_data))
## Df Sum Sq Mean Sq F value Pr(>F)
## alco 1 2 2.3676 9.473 0.00209 **
## Residuals 62503 15622 0.2499
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(cardio ~ active, data = heart_data))
## Df Sum Sq Mean Sq F value Pr(>F)
## active 1 22 22.20 88.94 <2e-16 ***
## Residuals 62503 15602 0.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that there is a significant relationship for all variables except gender.
Through analyzing this dataset, we have investigated the risk factors for cardiovascular disease. By recognizing these risk factors, we can use the knowledge to guide us in living a healthier life.