This project was conducted by the USC Public Health Data Science Department with the generous help of City of Hope. In this project, I use the CTS data to build a model that predicts whether a patient will die within a given time window. I also analyze several variables to determine whether they are significantly related to the risk of death in that time window.
First, I remove all predictors with more than 1/4 of their values missing (NA). Such predictors lack too many values and are probably biased. There is one exception: date_of_death_dt. Every individual who did not die before the end of the study is recorded as NA in “date_of_death_dt”, so I keep it and set “date_of_death_dt” to 4000-11-11 if and only if the individual’s “decease” is 0.
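The missing-value filter and the sentinel date can be sketched in base R as follows. This is only an illustration on toy data, not the code used for the report: the columns `mostly_missing` and the toy values are made up, and only `date_of_death_dt`, `decease`, and `age_at_baseline` are names taken from the text.

```r
# Toy data; values are invented for illustration.
df <- data.frame(
  decease          = c(0, 1, 0, 1),
  date_of_death_dt = as.Date(c(NA, "2010-05-01", NA, "2012-03-15")),
  mostly_missing   = c(NA, NA, 3.2, NA),  # 3/4 missing, should be dropped
  age_at_baseline  = c(61, 72, NA, 55)    # 1/4 missing, should be kept
)

# Share of missing values in each column.
na_share <- colMeans(is.na(df))

# Keep columns with at most 1/4 missing, plus the date-of-death exception.
keep <- na_share <= 0.25 | names(df) == "date_of_death_dt"
df <- df[, keep, drop = FALSE]

# Individuals who did not die (decease == 0) get the sentinel date.
df$date_of_death_dt[df$decease == 0] <- as.Date("4000-11-11")
```

Using a far-future sentinel keeps the column free of NA values, so later date arithmetic yields a very large survival time for survivors instead of NA.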
Second, I remove some variables that are too closely related to other variables, which causes collinearity and leads to redundancy. Examples are smoke_statcat and cig_day_avg, or diag_icd1 and diag_icd_dsc1. I remove one variable from each such pair.
Third, I also remove some irrelevant variables, judged on conceptual rather than statistical grounds.
Fourth, I create several new variables. I create “month”, which records the month of “discharge_dt”, and then “season” based on “month”. I also create “survive time” to record how long each individual survives after discharge. Finally, I create “die30”, “die180”, and “die1800” to record whether the individual dies within 30, 180, or 1800 days after discharge.
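A minimal sketch of these derived variables in base R is shown below. The toy dates, the column name `survive_time`, the month-to-season mapping, and the use of `<=` for the window boundary are all assumptions, since the text does not specify them.

```r
# Toy data; the dates are made up for illustration.
df <- data.frame(
  discharge_dt     = as.Date(c("2005-01-10", "2005-07-04", "2005-10-20")),
  date_of_death_dt = as.Date(c("2005-01-25", "2006-06-01", "4000-11-11"))
)

# Month of discharge as an integer from 1 to 12.
df$month <- as.integer(format(df$discharge_dt, "%m"))

# Assumed mapping: Dec-Feb winter, Mar-May spring, Jun-Aug summer, Sep-Nov fall.
season_of <- c("winter", "winter", "spring", "spring", "spring", "summer",
               "summer", "summer", "fall", "fall", "fall", "winter")
df$season <- factor(season_of[df$month])

# Days survived after discharge (hypothetical column name).
df$survive_time <- as.numeric(df$date_of_death_dt - df$discharge_dt)

# Death indicators for each time window.
df$die30   <- as.integer(df$survive_time <= 30)
df$die180  <- as.integer(df$survive_time <= 180)
df$die1800 <- as.integer(df$survive_time <= 1800)
```

Because the windows are nested, anyone with die30 equal to 1 also has die180 and die1800 equal to 1, and survivors carrying the sentinel date get 0 in all three.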
I found that each ICD code has more than 2500 categories. If I used all the ICD codes as predictors, they would produce more than 10000 dummy variables in the LASSO logistic regression and make the computation too complex and time consuming. Considering that the processed dataset has fewer than 40000 rows, such a large number of dummy variables would lead to overfitting in the LASSO logistic regression model. So I decided to keep only the CCS codes, which are collapsed from the ICD codes.
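To illustrate why the number of categories matters, the sketch below counts the dummy columns produced by a fine-grained factor and by a collapsed one. The data are simulated and the helper `n_dummies` is hypothetical; the point is only that a factor with k levels expands to k - 1 dummy columns.

```r
set.seed(1)
# Two toy factors standing in for a fine-grained code and its collapsed grouping.
fine    <- factor(sample(sprintf("icd_%03d", 1:200), 1000, replace = TRUE))
grouped <- factor(sample(sprintf("ccs_%02d", 1:20), 1000, replace = TRUE))

# A factor with k levels expands to k - 1 dummy columns (plus the intercept).
n_dummies <- function(f) ncol(model.matrix(~ f)) - 1

c(fine = n_dummies(fine), grouped = n_dummies(grouped))
```

With four diagnosis fields of roughly 2500 or more categories each, the same expansion gives the figure of more than 10000 dummy variables mentioned above.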
From the plots, we can see that a limited number of CCS codes cover most of the individuals. This holds for each of CCS codes 1-4. We can also see that different CCS codes have clearly different death time window distributions.
## [1] 239
## [1] 2597
## [1] 241
## [1] 2709
## [1] 239
## [1] 2725
## [1] 238
## [1] 2653
It seems that season does not have a significant influence on the death time window distribution.
We can see that the death time window distributions differ considerably across zip code areas.
It seems that birthplace has a significant influence on the death time window.
It seems that age has a significant influence on the death time window: the older the individual, the higher the risk of death.
It seems that participant race has a significant influence on the death time window distribution. African Americans have the highest risk of death.
(die30 indicates whether the individual dies within 30 days after discharge)
I will use four models: random forest, LASSO logistic regression, boosting, and SVM, and then select the best one. I will also use that model to predict die1800. For variables with more than 50 categories, I keep only the 50 most frequent categories and mark the remaining categories as “other”.
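The category lumping described above could look like the following base R sketch. The helper name `lump_top_n` and the toy zip codes are assumptions for illustration; with real data, `n` would be 50.

```r
# Keep the n most frequent levels of x and recode everything else as "other".
lump_top_n <- function(x, n = 50, other = "other") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)
  top <- names(counts)[seq_len(min(n, length(counts)))]
  x[!is.na(x) & !(x %in% top)] <- other
  factor(x)
}

zip <- c("90001", "90001", "90001", "90210", "90210", "91105", "94596")
lumped <- lump_top_n(zip, n = 2)
table(lumped)
```

Missing values are left as NA rather than being folded into “other”, so they can still be handled separately.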
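In this model, the most important predictors are patient_disposition_cde, CCS codes 1-4, and zip code. The AUC is 0.7356 (printed under a “LASSO” label in the output below), which is not very good. There are two likely reasons. First, a random forest without further tuning is not necessarily a strong model for this data. Second, the predictions used here are hard class labels (0 or 1) rather than probabilities, which is a disadvantage when computing AUC; randomForest can return class probabilities through predict(..., type = "prob").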
## MeanDecreaseGini
## ses_quartile_ind 16.7414633
## blockgroup90_urban_cat 21.3877469
## hysterectomy_ind 5.6186164
## bilateral_mastectomy_ind 0.5694043
## bilateral_oophorectomy_ind 5.1737893
## age_at_baseline 58.0258426
## adopted 0.9883099
## twin 1.0059664
## birthplace 12.2884274
## participant_race 7.5773925
## menarche_age 43.9027853
## oralcntr_ever_q1 8.1147141
## preg_ever_q1 5.0839590
## height_q1 33.3848338
## weight_q1 44.8022885
## bmi_q1 46.4593427
## allex_life_hrs 47.6316548
## alchl_analyscat 11.2206874
## brca 4.9621503
## mammo_ever_q1 2.8490895
## hbpmed_totyrs 23.1278744
## nsaid_totyrs 31.7869027
## sleep_hrs 12.0764941
## diet_plant 48.8326855
## diet_highprotfat 49.1342059
## diet_highcarb 54.1840777
## dief_ethnic 51.4565758
## diet_saladwine 52.9618185
## smoke_statcat 9.7935775
## asthma_q3 3.9398022
## insulin_daily 2.1858167
## aceinhb_daily 4.0282047
## othhbp_daily 5.4063048
## tamox_daily 4.2321221
## steroid_daily 3.5832532
## brondil_daily 4.1292450
## cholmed_daily 4.8079272
## antidep_daily 4.0523751
## admission_typ 11.7792268
## length_of_stay_day_cnt 47.6371265
## dnr_flg 85.1105531
## major_diag_cat_cde 93.7307453
## patient_care_typ 4.7807270
## patient_disposition_cde 908.4115643
## total_charges_amt 46.8404234
## diag_poa1 2.2138213
## diag_poa2 6.8399736
## diag_poa3 6.3980379
## diag_poa4 5.4213020
## hospital_time 49.3259845
## season 23.8409083
## ccs1_ 199.1781398
## ccs2_ 191.4893347
## ccs3_ 205.1382918
## ccs4_ 201.3648095
## zipcode_ 158.0996676
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## [1] "AUC for LASSO in the test set:"
## Area under the curve: 0.7356
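The effect of scoring hard 0/1 predictions instead of probabilities can be illustrated with a small base R example. This is a sketch with made-up scores, not the report's data, and the helper `auc` is a hand-written rank-based (Mann-Whitney) AUC rather than the pROC function used for the output above.

```r
# Rank-based AUC (Mann-Whitney form); tied scores receive average ranks.
auc <- function(score, y) {
  r <- rank(score)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

y     <- c(0, 0, 0, 0, 1, 1, 1, 0)
prob  <- c(0.05, 0.10, 0.20, 0.60, 0.40, 0.70, 0.90, 0.30)  # made-up scores
label <- as.integer(prob > 0.5)                              # hard 0/1 predictions

auc(prob, y)   # uses the full ranking of the scores
auc(label, y)  # only a single operating point is available
```

In this toy case the AUC drops from 14/15 with probabilities to 11/15 with hard labels, because thresholding discards the ordering within each predicted class.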
The optimal lambda is 0.0009002293, so the fit is quite similar to a logistic regression without any penalty. The most important predictors are the dummy variables from patient_disposition_cde. LASSO logistic regression has a big advantage: it provides a Beta for every category of each factor variable. However, it does not provide p-values for the Betas. The AUC of this model is 0.9212, which is quite good.
## names Beta
## 1 patient_disposition_cde.X11 6.4072442
## 2 patient_disposition_cde.X20 4.9337482
## 3 major_diag_cat_cde.X25 4.1710452
## 4 patient_disposition_cde.X51 3.1185925
## 5 patient_disposition_cde.X50 2.1982525
## 6 patient_disposition_cde.X13 1.8960046
## 7 ccs1_.Secondary.malignancies 1.4033413
## 8 patient_disposition_cde.X2 1.0710238
## 9 dnr_flg.Y 1.0356694
## 10 ccs4_.Secondary.malignancies 0.9863471
## 11 major_diag_cat_cde.X17 0.8886343
## 12 patient_disposition_cde.X63 0.8790254
## 13 ccs1_.Acute.and.unspecified.renal.failure 0.8633028
## 14 ccs4_.Acute.and.unspecified.renal.failure 0.8431186
## 15 ccs2_.Secondary.malignancies 0.8329342
## 16 patient_disposition_cde.X1 -0.8042216
## 17 ccs2_.Nutritional.deficiencies 0.7558373
## 18 ccs1_.Respiratory.failure..insufficiency..arrest..adult. 0.7458320
## 19 admission_typ.X1 -0.7306056
## 20 ccs3_.Secondary.malignancies 0.6507283
## [1] "optimal lambda:"
## [1] 0.0009002293
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## [1] "AUC for LASSO in the test set:"
## Area under the curve: 0.9212
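For reference, a standard form of the LASSO-penalized logistic regression objective is given below. The report does not state the exact parameterization used, so the scaling by 1/n and the unpenalized intercept are assumptions.

$$
(\hat\beta_0, \hat\beta) = \arg\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta) - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big] \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
$$

As lambda approaches 0 the penalty term vanishes and the estimate approaches the ordinary maximum likelihood fit, which is why a very small optimal lambda suggests that the unpenalized logistic regression behaves similarly.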
The optimal number of trees is 1734, and the most important predictors are patient_disposition_cde, CCS codes 1-4, and zip code, quite similar to random forest. But this model is much more accurate than random forest: its AUC is 0.9264 (printed under a “LASSO” label in the output below).
## var rel.inf
## patient_disposition_cde patient_disposition_cde 61.103784024
## ccs1_ ccs1_ 9.594899812
## ccs3_ ccs3_ 6.799722234
## ccs4_ ccs4_ 5.929404956
## ccs2_ ccs2_ 5.591591298
## zipcode_ zipcode_ 5.079880705
## dnr_flg dnr_flg 2.670469284
## major_diag_cat_cde major_diag_cat_cde 0.966645063
## length_of_stay_day_cnt length_of_stay_day_cnt 0.717910883
## age_at_baseline age_at_baseline 0.714453113
## diet_highcarb diet_highcarb 0.150305244
## diet_saladwine diet_saladwine 0.098389547
## sleep_hrs sleep_hrs 0.097658224
## admission_typ admission_typ 0.078648741
## bmi_q1 bmi_q1 0.076504809
## total_charges_amt total_charges_amt 0.050723350
## nsaid_totyrs nsaid_totyrs 0.047936626
## diag_poa2 diag_poa2 0.037275287
## weight_q1 weight_q1 0.035253287
## allex_life_hrs allex_life_hrs 0.033388595
## diet_highprotfat diet_highprotfat 0.026692564
## birthplace birthplace 0.023478619
## menarche_age menarche_age 0.020719332
## dief_ethnic dief_ethnic 0.019460037
## tamox_daily tamox_daily 0.005372312
## diet_plant diet_plant 0.005272608
## alchl_analyscat alchl_analyscat 0.004956879
## ses_quartile_ind ses_quartile_ind 0.004685884
## smoke_statcat smoke_statcat 0.002912967
## hbpmed_totyrs hbpmed_totyrs 0.002760106
## brondil_daily brondil_daily 0.002388429
## brca brca 0.002316795
## cholmed_daily cholmed_daily 0.001604349
## participant_race participant_race 0.001487529
## patient_care_typ patient_care_typ 0.001046506
## blockgroup90_urban_cat blockgroup90_urban_cat 0.000000000
## hysterectomy_ind hysterectomy_ind 0.000000000
## bilateral_mastectomy_ind bilateral_mastectomy_ind 0.000000000
## bilateral_oophorectomy_ind bilateral_oophorectomy_ind 0.000000000
## adopted adopted 0.000000000
## twin twin 0.000000000
## oralcntr_ever_q1 oralcntr_ever_q1 0.000000000
## preg_ever_q1 preg_ever_q1 0.000000000
## height_q1 height_q1 0.000000000
## mammo_ever_q1 mammo_ever_q1 0.000000000
## asthma_q3 asthma_q3 0.000000000
## insulin_daily insulin_daily 0.000000000
## aceinhb_daily aceinhb_daily 0.000000000
## othhbp_daily othhbp_daily 0.000000000
## steroid_daily steroid_daily 0.000000000
## antidep_daily antidep_daily 0.000000000
## diag_poa1 diag_poa1 0.000000000
## diag_poa3 diag_poa3 0.000000000
## diag_poa4 diag_poa4 0.000000000
## hospital_time hospital_time 0.000000000
## season season 0.000000000
## [1] 1734
## Using 1734 trees...
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## [1] "AUC for LASSO in the test set:"
## Area under the curve: 0.9264
The optimal C value is 3. The AUC is only 0.7238, which is not very good.
## [Tune] Started tuning learner classif.ksvm for parameter set:
## Type len Def Constr Req Tunable Trafo
## C discrete - - 3,4,5,6 - TRUE -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: C=3
## [Tune-y] 1: mmce.test.mean=0.0385006; time: 8.1 min
## [Tune-x] 2: C=4
## [Tune-y] 2: mmce.test.mean=0.0387034; time: 8.0 min
## [Tune-x] 3: C=5
## [Tune-y] 3: mmce.test.mean=0.0387846; time: 8.1 min
## [Tune-x] 4: C=6
## [Tune-y] 4: mmce.test.mean=0.0388657; time: 8.5 min
## [Tune] Result: C=3 : mmce.test.mean=0.0385006
## [1] 3
## Setting default kernel parameters
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## [1] "AUC for SVM in the test set:"
## Area under the curve: 0.7238
Boosting has the best AUC. AUC is a good measure for comparing the models because it takes both sensitivity and specificity into account.
(die1800 indicates whether the individual dies within 1800 days after discharge)
I use boosting because it was the best method for predicting die30 and the model here uses the same candidate predictors. In addition, in my experience boosting is usually the best of the four models above, perhaps because each new tree is fit to correct the errors left by the previous ones. The optimal number of trees is 7183, and the most important predictors are CCS codes 1-4, age_at_baseline, and zip code. The AUC of this model is 0.875, which is quite good.
## var rel.inf
## ccs1_ ccs1_ 19.185497876
## ccs3_ ccs3_ 15.751861861
## ccs2_ ccs2_ 15.163238825
## age_at_baseline age_at_baseline 12.236165820
## ccs4_ ccs4_ 11.467086200
## zipcode_ zipcode_ 8.841909603
## patient_disposition_cde patient_disposition_cde 7.192631006
## dnr_flg dnr_flg 3.864884021
## major_diag_cat_cde major_diag_cat_cde 3.320398701
## hbpmed_totyrs hbpmed_totyrs 0.357107085
## diet_highcarb diet_highcarb 0.300122182
## steroid_daily steroid_daily 0.230229567
## length_of_stay_day_cnt length_of_stay_day_cnt 0.204767133
## smoke_statcat smoke_statcat 0.197486421
## diet_highprotfat diet_highprotfat 0.161533873
## insulin_daily insulin_daily 0.157786946
## weight_q1 weight_q1 0.150815326
## participant_race participant_race 0.137079890
## allex_life_hrs allex_life_hrs 0.131342005
## brca brca 0.111961191
## bmi_q1 bmi_q1 0.107129812
## diet_saladwine diet_saladwine 0.101440162
## admission_typ admission_typ 0.093558972
## blockgroup90_urban_cat blockgroup90_urban_cat 0.069813331
## nsaid_totyrs nsaid_totyrs 0.065332789
## sleep_hrs sleep_hrs 0.063960929
## diet_plant diet_plant 0.057238518
## othhbp_daily othhbp_daily 0.047620069
## menarche_age menarche_age 0.043762707
## tamox_daily tamox_daily 0.040640254
## patient_care_typ patient_care_typ 0.033492885
## height_q1 height_q1 0.029141833
## birthplace birthplace 0.027383921
## total_charges_amt total_charges_amt 0.020289031
## dief_ethnic dief_ethnic 0.019722639
## diag_poa1 diag_poa1 0.005126787
## diag_poa3 diag_poa3 0.003244455
## ses_quartile_ind ses_quartile_ind 0.003030034
## bilateral_mastectomy_ind bilateral_mastectomy_ind 0.002869872
## antidep_daily antidep_daily 0.001295469
## hysterectomy_ind hysterectomy_ind 0.000000000
## bilateral_oophorectomy_ind bilateral_oophorectomy_ind 0.000000000
## adopted adopted 0.000000000
## twin twin 0.000000000
## oralcntr_ever_q1 oralcntr_ever_q1 0.000000000
## preg_ever_q1 preg_ever_q1 0.000000000
## alchl_analyscat alchl_analyscat 0.000000000
## mammo_ever_q1 mammo_ever_q1 0.000000000
## asthma_q3 asthma_q3 0.000000000
## aceinhb_daily aceinhb_daily 0.000000000
## brondil_daily brondil_daily 0.000000000
## cholmed_daily cholmed_daily 0.000000000
## diag_poa2 diag_poa2 0.000000000
## diag_poa4 diag_poa4 0.000000000
## hospital_time hospital_time 0.000000000
## season season 0.000000000
## [1] 7183
## Using 7183 trees...
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## [1] "AUC for boosting in the test set:"
## Area under the curve: 0.875
I build classic logistic regression models for predicting die30 and die1800, so that we can obtain the p-value of the Beta of each variable, adjusted for the other variables. The classic logistic regression should be reasonably accurate because the optimal lambda in LASSO is extremely small.
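A minimal sketch of this step with base R's glm is shown below. The data are simulated, the 0.05 cutoff is an assumption consistent with the p-values listed in the tables, and only the variable names mirror the text.

```r
set.seed(42)
n <- 2000
# Simulated stand-in data; only the variable names mirror the text.
season <- factor(sample(c("fall", "spring", "summer", "winter"), n, replace = TRUE))
age_at_baseline <- rnorm(n, mean = 60, sd = 10)
# In this simulation only age truly affects the outcome.
die1800 <- rbinom(n, 1, plogis(-4 + 0.06 * age_at_baseline))

fit <- glm(die1800 ~ season + age_at_baseline, family = binomial)

# Coefficient table: estimate, standard error, z value, p-value.
tab <- coef(summary(fit))
signif_terms <- data.frame(
  names     = rownames(tab),
  P_value   = tab[, "Pr(>|z|)"],
  row.names = NULL
)
# Keep terms below the assumed 0.05 threshold, excluding the intercept.
signif_terms <- signif_terms[signif_terms$P_value < 0.05 &
                             signif_terms$names != "(Intercept)", ]
signif_terms
```

Each level of a factor gets its own row, named by the variable followed by the level (for example seasonwinter), and its p-value is relative to the reference level with the other variables held fixed.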
It is obvious that the longer the time window is, the more people die, so the cumulative risk of death increases monotonically as the time window becomes longer. Generally, the farther a given day is from discharge, the lower the risk of death on that specific day.
We find that CCS code 1 is significantly associated with die30. The CCS code 1 descriptions with a significant association are as follows:
## names1 P_value
## 1 ccs1_Biliary tract disease 0.007032301
## 2 ccs1_Other nutritional; endocrine; and metabolic disorders 0.029816690
## 3 ccs1_Pancreatic disorders (not diabetes) 0.018741722
## 4 ccs1_Secondary malignancies 0.025720911
We find that CCS code 1 is significantly associated with die1800. The CCS code 1 descriptions with a significant association are as follows:
## names2 P_value
## 1 ccs1_Biliary tract disease 1.037160e-08
## 2 ccs1_Congestive heart failure; nonhypertensive 2.116884e-06
## 3 ccs1_Coronary atherosclerosis and other heart disease 1.234356e-02
## 4 ccs1_Intestinal obstruction without hernia 1.159785e-02
## 5 ccs1_Nonspecific chest pain 2.444400e-02
## 6 ccs1_Osteoarthritis 1.711116e-04
## 7 ccs1_Other aftercare 1.041535e-02
## 8 ccs1_Pancreatic disorders (not diabetes) 1.943107e-05
## 9 ccs1_Prolapse of female genital organs 2.067756e-02
## 10 ccs1_Rehabilitation care; fitting of prostheses; and adjustment of devices 2.212265e-02
## 11 ccs1_Secondary malignancies 5.060120e-22
We find that CCS code 2 is significantly associated with die30. The CCS code 2 descriptions with a significant association are as follows:
## names1 P_value
## 1 ccs2_Hypertension with complications and secondary hypertension 0.0351849126
## 2 ccs2_Nutritional deficiencies 0.0001661674
## 3 ccs2_Secondary malignancies 0.0010898445
We find that CCS code 2 is significantly associated with die1800. The CCS code 2 descriptions with a significant association are as follows:
## names2 P_value
## 1 ccs2_Acute and unspecified renal failure 1.752441e-03
## 2 ccs2_Acute posthemorrhagic anemia 1.513566e-08
## 3 ccs2_Asthma 1.578661e-04
## 4 ccs2_Benign neoplasm of uterus 1.575927e-03
## 5 ccs2_Biliary tract disease 4.359832e-03
## 6 ccs2_Cardiac dysrhythmias 8.222056e-10
## 7 ccs2_Chronic obstructive pulmonary disease and bronchiectasis 4.669983e-03
## 8 ccs2_Chronic ulcer of skin 2.223726e-02
## 9 ccs2_Coagulation and hemorrhagic disorders 5.569710e-04
## 10 ccs2_Complications of surgical procedures or medical care 1.554740e-06
## 11 ccs2_Congestive heart failure; nonhypertensive 7.244718e-03
## 12 ccs2_Coronary atherosclerosis and other heart disease 4.984558e-12
## 13 ccs2_Deficiency and other anemia 2.778181e-05
## 14 ccs2_Diabetes mellitus with complications 4.195102e-05
## 15 ccs2_Diabetes mellitus without complication 9.847087e-07
## 16 ccs2_Disorders of lipid metabolism 2.622160e-10
## 17 ccs2_Esophageal disorders 8.180138e-06
## 18 ccs2_Essential hypertension 6.587279e-16
## 19 ccs2_Fluid and electrolyte disorders 1.250525e-10
## 20 ccs2_Heart valve disorders 1.277704e-10
## 21 ccs2_Hypertension with complications and secondary hypertension 4.637835e-03
## 22 ccs2_Intestinal obstruction without hernia 3.725153e-04
## 23 ccs2_Late effects of cerebrovascular disease 1.613240e-04
## 24 ccs2_Mood disorders 8.429583e-09
## 25 ccs2_Non-Hodgkin`s lymphoma 6.382446e-04
## 26 ccs2_other 2.899346e-08
## 27 ccs2_Other aftercare 4.716034e-08
## 28 ccs2_Other connective tissue disease 3.321828e-09
## 29 ccs2_Other female genital disorders 7.944658e-04
## 30 ccs2_Other fractures 6.639961e-07
## 31 ccs2_Other gastrointestinal disorders 4.155105e-06
## 32 ccs2_Other nervous system disorders 4.029990e-03
## 33 ccs2_Other nutritional; endocrine; and metabolic disorders 4.655866e-07
## 34 ccs2_Paralysis 3.322677e-02
## 35 ccs2_Peri-; endo-; and myocarditis; cardiomyopathy (except that caused by tuberculosis or sexually transm 3.252801e-06
## 36 ccs2_Pleurisy; pneumothorax; pulmonary collapse 1.297928e-03
## 37 ccs2_Pneumonia (except that caused by tuberculosis or sexually transmitted disease) 7.226365e-04
## 38 ccs2_Residual codes; unclassified 2.273012e-02
## 39 ccs2_Secondary malignancies 8.204378e-07
## 40 ccs2_Septicemia (except in labor) 1.179119e-03
## 41 ccs2_Skin and subcutaneous tissue infections 6.364232e-08
## 42 ccs2_Spondylosis; intervertebral disc disorders; other back problems 1.242465e-06
## 43 ccs2_Thyroid disorders 1.785795e-11
## 44 ccs2_Undefined 4.309968e-06
## 45 ccs2_Urinary tract infections 3.251598e-07
We find that CCS code 3 is significantly associated with die30. The CCS code 3 descriptions with a significant association are as follows:
## names1 P_value
## 1 ccs3_Acute and unspecified renal failure 0.002350780
## 2 ccs3_Chronic ulcer of skin 0.029750532
## 3 ccs3_Coagulation and hemorrhagic disorders 0.032344260
## 4 ccs3_Nutritional deficiencies 0.020673844
## 5 ccs3_Other gastrointestinal disorders 0.007578809
## 6 ccs3_Other liver diseases 0.032103345
## 7 ccs3_Pleurisy; pneumothorax; pulmonary collapse 0.005358708
## 8 ccs3_Respiratory failure; insufficiency; arrest (adult) 0.015590451
## 9 ccs3_Secondary malignancies 0.002902010
## 10 ccs3_Undefined 0.034930974
We find that CCS code 3 is significantly associated with die1800. The CCS code 3 descriptions with a significant association are as follows:
## names2 P_value
## 1 ccs3_Acute posthemorrhagic anemia 3.290765e-03
## 2 ccs3_Asthma 1.063156e-03
## 3 ccs3_Bacterial infection; unspecified site 1.502840e-02
## 4 ccs3_Cancer of breast 6.902292e-04
## 5 ccs3_Cardiac dysrhythmias 2.625321e-05
## 6 ccs3_Chronic obstructive pulmonary disease and bronchiectasis 2.709620e-02
## 7 ccs3_Complications of surgical procedures or medical care 3.297022e-04
## 8 ccs3_Coronary atherosclerosis and other heart disease 3.511623e-06
## 9 ccs3_Deficiency and other anemia 1.845108e-03
## 10 ccs3_Diabetes mellitus without complication 3.280501e-09
## 11 ccs3_Disorders of lipid metabolism 1.315149e-10
## 12 ccs3_Esophageal disorders 1.774492e-05
## 13 ccs3_Essential hypertension 1.536615e-11
## 14 ccs3_Fluid and electrolyte disorders 1.206397e-05
## 15 ccs3_Heart valve disorders 1.238057e-04
## 16 ccs3_Hypertension with complications and secondary hypertension 3.576012e-02
## 17 ccs3_Intestinal obstruction without hernia 6.862369e-03
## 18 ccs3_Mood disorders 1.096083e-02
## 19 ccs3_Osteoarthritis 8.664758e-06
## 20 ccs3_Osteoporosis 2.002532e-07
## 21 ccs3_other 1.086248e-05
## 22 ccs3_Other aftercare 1.157507e-04
## 23 ccs3_Other circulatory disease 4.460936e-04
## 24 ccs3_Other connective tissue disease 2.047532e-08
## 25 ccs3_Other gastrointestinal disorders 1.707583e-03
## 26 ccs3_Other lower respiratory disease 2.193979e-02
## 27 ccs3_Other nervous system disorders 1.219960e-02
## 28 ccs3_Other nutritional; endocrine; and metabolic disorders 1.684444e-04
## 29 ccs3_Peri-; endo-; and myocarditis; cardiomyopathy (except that caused by tuberculosis or sexually transm 1.250034e-02
## 30 ccs3_Pleurisy; pneumothorax; pulmonary collapse 1.529402e-02
## 31 ccs3_Pneumonia (except that caused by tuberculosis or sexually transmitted disease) 6.826005e-03
## 32 ccs3_Residual codes; unclassified 4.544860e-04
## 33 ccs3_Secondary malignancies 4.318108e-13
## 34 ccs3_Spondylosis; intervertebral disc disorders; other back problems 5.488283e-06
## 35 ccs3_Thyroid disorders 1.954196e-08
## 36 ccs3_Urinary tract infections 3.645168e-04
We find that CCS code 4 is significantly associated with die30. The CCS code 4 descriptions with a significant association are as follows:
## names1 P_value
## 1 ccs4_Acute and unspecified renal failure 0.0308402438
## 2 ccs4_Conduction disorders 0.0459226927
## 3 ccs4_Coronary atherosclerosis and other heart disease 0.0226127697
## 4 ccs4_Diabetes mellitus without complication 0.0175385141
## 5 ccs4_Disorders of lipid metabolism 0.0035886258
## 6 ccs4_Esophageal disorders 0.0256542379
## 7 ccs4_Essential hypertension 0.0005367635
## 8 ccs4_Osteoarthritis 0.0316035664
## 9 ccs4_Osteoporosis 0.0134596195
## 10 ccs4_Other connective tissue disease 0.0028429636
## 11 ccs4_Secondary malignancies 0.0297261224
## 12 ccs4_Thyroid disorders 0.0083869195
We find that CCS code 4 is significantly associated with die1800. The CCS code 4 descriptions with a significant association are as follows:
## names2 P_value
## 1 ccs4_Acute and unspecified renal failure 4.989725e-02
## 2 ccs4_Acute posthemorrhagic anemia 1.978235e-03
## 3 ccs4_Allergic reactions 2.553783e-08
## 4 ccs4_Anxiety disorders 4.972784e-05
## 5 ccs4_Asthma 6.962489e-05
## 6 ccs4_Bacterial infection; unspecified site 2.706857e-03
## 7 ccs4_Cancer of breast 5.984177e-07
## 8 ccs4_Cardiac dysrhythmias 1.013857e-05
## 9 ccs4_Chronic kidney disease 3.528087e-03
## 10 ccs4_Chronic obstructive pulmonary disease and bronchiectasis 1.224477e-02
## 11 ccs4_Coagulation and hemorrhagic disorders 3.638384e-02
## 12 ccs4_Complications of surgical procedures or medical care 2.536749e-03
## 13 ccs4_Conduction disorders 1.836035e-06
## 14 ccs4_Congestive heart failure; nonhypertensive 1.708664e-02
## 15 ccs4_Coronary atherosclerosis and other heart disease 8.670652e-08
## 16 ccs4_Deficiency and other anemia 4.718959e-05
## 17 ccs4_Diabetes mellitus with complications 1.395134e-04
## 18 ccs4_Diabetes mellitus without complication 7.949749e-10
## 19 ccs4_Disorders of lipid metabolism 1.006207e-13
## 20 ccs4_Esophageal disorders 5.304703e-10
## 21 ccs4_Essential hypertension 3.996115e-15
## 22 ccs4_Fluid and electrolyte disorders 2.923289e-05
## 23 ccs4_Heart valve disorders 8.812460e-06
## 24 ccs4_Hypertension with complications and secondary hypertension 1.369354e-03
## 25 ccs4_Mood disorders 7.075018e-06
## 26 ccs4_Osteoarthritis 7.754330e-09
## 27 ccs4_Osteoporosis 1.044065e-09
## 28 ccs4_other 1.793640e-07
## 29 ccs4_Other aftercare 3.066605e-05
## 30 ccs4_Other bone disease and musculoskeletal deformities 9.613791e-04
## 31 ccs4_Other circulatory disease 9.236255e-09
## 32 ccs4_Other connective tissue disease 1.558220e-11
## 33 ccs4_Other female genital disorders 8.343201e-04
## 34 ccs4_Other gastrointestinal disorders 2.136675e-04
## 35 ccs4_Other lower respiratory disease 1.526646e-02
## 36 ccs4_Other nervous system disorders 8.453787e-04
## 37 ccs4_Other nutritional; endocrine; and metabolic disorders 2.220105e-05
## 38 ccs4_Pleurisy; pneumothorax; pulmonary collapse 6.154932e-04
## 39 ccs4_Residual codes; unclassified 9.394551e-05
## 40 ccs4_Respiratory failure; insufficiency; arrest (adult) 2.799613e-02
## 41 ccs4_Screening and history of mental health and substance abuse codes 7.730256e-04
## 42 ccs4_Secondary malignancies 1.118399e-13
## 43 ccs4_Spondylosis; intervertebral disc disorders; other back problems 6.761417e-05
## 44 ccs4_Thyroid disorders 4.046449e-11
## 45 ccs4_Undefined 7.405927e-03
## 46 ccs4_Urinary tract infections 2.531452e-05
The seasons with a significant association are as follows. Winter is significantly related to die30, which indicates whether the individual dies within 30 days after discharge.
## names1 P_value
## 1 seasonwinter 0.04721893
The seasons with a significant association are as follows. Season is not significantly related to die1800, which indicates whether the individual dies within 1800 days after discharge.
## [1] names2 P_value
## <0 rows> (or 0-length row.names)
The zip codes with a significant association are as follows. Zip code is significantly related to die30, which indicates whether the individual dies within 30 days after discharge.
## names1 P_value
## 1 zipcode_90048 0.048536423
## 2 zipcode_91105 0.022100119
## 3 zipcode_91367 0.037728114
## 4 zipcode_92037 0.012019488
## 5 zipcode_92123 0.012081555
## 6 zipcode_92653 0.018628371
## 7 zipcode_94115 0.014193484
## 8 zipcode_94596 0.001057508
## 9 zipcode_94705 0.026920659
## 10 zipcode_95119 0.037935675
The zip codes with a significant association are as follows. Zip code is significantly related to die1800, which indicates whether the individual dies within 1800 days after discharge.
## names2 P_value
## 1 zipcode_90505 0.01375748
## 2 zipcode_91360 0.01262534
## 3 zipcode_92103 0.03231716
## 4 zipcode_92373 0.04868250
## 5 zipcode_92835 0.03079069
## 6 zipcode_93720 0.00101477
In this project, I built four different models to predict die30. In the end, I found that boosting is the best, with an AUC of 0.9264. I also used it to predict die1800, where the AUC is 0.875.
It is obvious that the longer the time window is, the more people die, so the cumulative risk of death increases monotonically as the time window becomes longer. Generally, the farther a given day is from discharge, the lower the risk of death on that specific day.
Then I fit the classic logistic regression (no penalty for additional predictors) and find that some categories of CCS code 1, CCS code 2, CCS code 3, CCS code 4, and zip code are significantly associated with die30 and die1800, adjusted for other variables. Season is significantly associated only with die30, adjusted for other variables.