Diabetes

This was an extra-credit project for one of my earlier classes. I performed linear regression on the diabetes toy dataset. I first standardized the variables so that the coefficients indicate the relative strength of each variable. Then, in separate steps, I added interaction variables, removed the correlated variables, and dropped all of the insignificant variables.
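As a rough illustration of those steps, here is a minimal sketch that standardizes two predictors, adds their interaction, and fits ordinary least squares. It is not the code from the project: the data are synthetic stand-ins, the helper names `standardize` and `fit_ols` are hypothetical, and it uses plain NumPy rather than a statistics package.

```python
import numpy as np

def standardize(X):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, R^2)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, r2

# Synthetic stand-in data (an assumption for illustration, NOT the diabetes data).
rng = np.random.default_rng(0)
n = 442
raw = rng.normal(loc=[26.0, 94.0], scale=[4.0, 13.0], size=(n, 2))

# Step 1: standardize so coefficient sizes are comparable across predictors.
Z = standardize(raw)
bmi, bp = Z[:, 0], Z[:, 1]

# Step 2: add the interaction term as an extra column.
X = np.column_stack([bmi, bp, bmi * bp])

# Made-up response with a positive interaction effect plus noise.
y = 150 + 25 * bmi + 15 * bp + 10 * bmi * bp + rng.normal(scale=50, size=n)

# Step 3: fit and inspect the coefficients (const, bmi, bp, bmi*bp).
coef, r2 = fit_ols(X, y)
print(dict(zip(["const", "bmi", "bp", "bmi*bp"], coef.round(2))), round(r2, 3))
```

The regression output reported below comes from the original project, not from this sketch.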

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                disease   R-squared:                       0.521
Model:                            OLS   Adj. R-squared:                  0.514
Method:                 Least Squares   F-statistic:                     78.85
Date:                Wed, 07 May 2025   Prob (F-statistic):           1.96e-66
Time:                        07:09:07   Log-Likelihood:                -2384.5
No. Observations:                 442   AIC:                             4783.
Df Residuals:                     435   BIC:                             4812.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        148.9157      2.730     54.551      0.000     143.550     154.281
sex         -235.1787     59.773     -3.935      0.000    -352.658    -117.700
bmi          514.4725     64.598      7.964      0.000     387.509     641.436
bp           302.6922     62.751      4.824      0.000     179.359     426.025
hdl         -296.3677     64.925     -4.565      0.000    -423.974    -168.762
ltg          485.0868     65.006      7.462      0.000     357.321     612.853
bmi*bp      3596.9551   1073.567      3.350      0.001    1486.931    5706.979
==============================================================================
Omnibus:                        3.917   Durbin-Watson:                   1.992
Prob(Omnibus):                  0.141   Jarque-Bera (JB):                3.837
Skew:                           0.188   Prob(JB):                        0.147
Kurtosis:                       2.743   Cond. No.                         420.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

These results suggest that good cholesterol (hdl) is associated with a lower disease measure, while being male and having high bmi, bp, or ltg are each associated with a higher one. The positive bmi*bp interaction means that having both high bmi and high bp raises it even more. Based on these results, I suggested lowering calorie intake and getting plenty of exercise to help prevent diabetes.
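To make the interaction concrete, the fitted model from the table can be written out and differentiated with respect to bmi. This only restates the reported coefficients, rounded to two decimals, with the hat denoting the fitted value:

```latex
\begin{align*}
\widehat{\text{disease}} &= 148.92 - 235.18\,\text{sex} + 514.47\,\text{bmi} + 302.69\,\text{bp} \\
&\quad - 296.37\,\text{hdl} + 485.09\,\text{ltg} + 3596.96\,(\text{bmi}\times\text{bp}), \\
\frac{\partial\,\widehat{\text{disease}}}{\partial\,\text{bmi}} &= 514.47 + 3596.96\,\text{bp}.
\end{align*}
```

So the slope on bmi grows as bp increases, and by symmetry the slope on bp is 302.69 + 3596.96 bmi.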

Now that I am further along in the program, what I would do differently is first split the data into training and testing sets so that I can check how well the model performs on data it has not seen. I would also compare elastic net and adaptive LASSO models, and try group LASSO if it makes sense to group some of the correlated variables together.
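A minimal sketch of that workflow, covering only the train/test split and an elastic net fit (not adaptive or group LASSO), might look like the following. Everything here is an assumption for illustration: the data are synthetic, the function names are hypothetical, and the penalty is parameterized the way scikit-learn's `ElasticNet` does it. In practice a library implementation with cross-validated penalties would be the natural choice.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink a scalar toward zero by `threshold` (the L1 proximal step)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def elastic_net(X, y, alpha=1.0, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    Assumes X columns and y are already centered, so no intercept is fitted."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            partial_resid = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_resid / n
            w[j] = soft_threshold(rho, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1 - l1_ratio)
            )
    return w

def r_squared(y_true, y_pred):
    """Coefficient of determination on the supplied data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (NOT the diabetes data): 3 informative predictors, 7 noise.
rng = np.random.default_rng(1)
n, p = 442, 10
X = rng.normal(size=(n, p))
true_w = np.array([30.0, 20.0, -15.0] + [0.0] * (p - 3))
y = X @ true_w + rng.normal(scale=20.0, size=n)

# Hold out roughly 25% of the rows for testing.
idx = rng.permutation(n)
n_test = n // 4
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Standardize with training statistics only, to avoid leaking test information.
mu, sd = X[train_idx].mean(axis=0), X[train_idx].std(axis=0)
X_train, X_test = (X[train_idx] - mu) / sd, (X[test_idx] - mu) / sd
y_mean = y[train_idx].mean()

w = elastic_net(X_train, y[train_idx] - y_mean, alpha=1.0, l1_ratio=0.9)
test_r2 = r_squared(y[test_idx], X_test @ w + y_mean)
print(w.round(2), round(test_r2, 3))
```

The test-set R-squared is an estimate of out-of-sample performance; the split itself does not improve the fit, it only makes the evaluation honest.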

Additionally, if I wanted to take a more in-depth look at diabetes, I could find and include more variables such as family history, drug/tobacco/alcohol use, caloric intake, amount of weekly exercise, and occupation.
