This was an extra-credit project for one of my earlier classes. I performed linear regression on the diabetes toy dataset. Over several steps, I added interaction variables, removed the correlated variables, and dropped all of the insignificant variables. I also standardized the variables first so that the coefficients can be compared directly as a measure of each variable's strength.
                            OLS Regression Results
==============================================================================
Dep. Variable:                disease   R-squared:                       0.521
Model:                            OLS   Adj. R-squared:                  0.514
Method:                 Least Squares   F-statistic:                     78.85
Date:                Wed, 07 May 2025   Prob (F-statistic):           1.96e-66
Time:                        07:09:07   Log-Likelihood:                -2384.5
No. Observations:                 442   AIC:                             4783.
Df Residuals:                     435   BIC:                             4812.
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        148.9157      2.730     54.551      0.000     143.550     154.281
sex         -235.1787     59.773     -3.935      0.000    -352.658    -117.700
bmi          514.4725     64.598      7.964      0.000     387.509     641.436
bp           302.6922     62.751      4.824      0.000     179.359     426.025
hdl         -296.3677     64.925     -4.565      0.000    -423.974    -168.762
ltg          485.0868     65.006      7.462      0.000     357.321     612.853
bmi*bp      3596.9551   1073.567      3.350      0.001    1486.931    5706.979
==============================================================================
Omnibus:                        3.917   Durbin-Watson:                   1.992
Prob(Omnibus):                  0.141   Jarque-Bera (JB):                3.837
Skew:                           0.188   Prob(JB):                        0.147
Kurtosis:                       2.743   Cond. No.                         420.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
These results imply that good cholesterol (hdl) is associated with a lower disease score, while being male (under this dataset's sex coding), high bmi, high bp, and high ltg are each associated with a higher one, and having both high bmi and high bp raises it even more through the positive bmi*bp interaction. From these results, I suggested lowering calorie intake and getting plenty of exercise to help prevent diabetes.
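To make the interaction reading concrete, the fitted equation can be written out from the coefficient table (rounded to two decimals). With an interaction term, the effect of bmi is no longer a single number: it depends on the value of bp. The variables here are assumed to be on the same scaled units used in the fit.

```latex
\hat{y} = 148.92 - 235.18\,\mathrm{sex} + 514.47\,\mathrm{bmi} + 302.69\,\mathrm{bp}
          - 296.37\,\mathrm{hdl} + 485.09\,\mathrm{ltg}
          + 3596.96\,(\mathrm{bmi} \times \mathrm{bp})

\frac{\partial \hat{y}}{\partial\,\mathrm{bmi}} = 514.47 + 3596.96\,\mathrm{bp},
\qquad
\frac{\partial \hat{y}}{\partial\,\mathrm{bp}} = 302.69 + 3596.96\,\mathrm{bmi}
```

So when bp is above its mean (positive after centering), each unit of bmi adds more to the predicted disease score than it does at average bp, and the same holds for bp when bmi is above its mean.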
Now that I am further along in the program, what I would do differently is first split the data into training and testing sets so that I can measure how well the model performs on data it has not seen. I would also compare elastic net and adaptive LASSO models, and try group LASSO if it makes sense to group some of the correlated variables together.
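A rough sketch of what that comparison could look like, assuming scikit-learn and its copy of the dataset. The 75/25 split, the random seed, and the l1_ratio grid are arbitrary illustrative choices. Adaptive LASSO and group LASSO are not part of scikit-learn, so only plain OLS and a cross-validated elastic net are compared here.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNetCV, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out a test set before any model fitting or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline: ordinary least squares on all ten predictors.
ols = LinearRegression().fit(X_train, y_train)

# Elastic net with the penalty strength and L1/L2 mix chosen by
# 5-fold cross-validation on the training set only.
enet = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=10000
).fit(X_train, y_train)

# Compare both models on the held-out data.
scores = {
    "ols": r2_score(y_test, ols.predict(X_test)),
    "elastic_net": r2_score(y_test, enet.predict(X_test)),
}
for name, score in scores.items():
    print(f"{name}: held-out R^2 = {score:.3f}")
```

Because the test set is never touched during fitting or tuning, the held-out R^2 values give a fairer comparison than the in-sample R-squared reported in the summary table.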
Additionally, if I wanted to take a more in-depth look at diabetes, I could find and include more variables, such as family history, drug/tobacco/alcohol use, caloric intake, amount of weekly exercise, and occupation.