Abstract
Background: This study aimed to identify risk factors associated with time to type II diabetes events using artificial intelligence (AI) survival models (SM) in a population cohort from East Azerbaijan, Iran.
Methods: Data from the Azar-Cohort, spanning 2014 to 2020, were analyzed using random forest (RF) variable selection together with Cox regression to identify the risk factors most relevant to diabetes. We then developed prediction models using RF survival analysis. LASSO and RF variable selection were used to select the most important variables. The concordance index (C-index) was used to evaluate the discriminative performance of the prediction models.
Results: The LASSO-Cox regression identified six factors significantly associated with diabetes: age, mean corpuscular hemoglobin concentration (MCHC), waist circumference (WC), body mass index (BMI), use of sleep medication, and hypertension (stage 1 and stage 2). The model that included all variables had a C-index of 76.3%. In contrast, the RF analysis identified 21 important variables predicting a higher probability of developing diabetes. Of those, WC, MCHC, triglyceride, and age were the most important predictors. The RF model converged after 500 trees, with an out-of-bag (OOB) error of 0.28 and a C-index of 79.5%.
Conclusion: The RF machine learning algorithm and the LASSO-Cox regression analysis consistently identified WC, hypertension, and MCHC as the main risk factors for developing diabetes. The RF approach showed slightly better discrimination (C-index 79.5% vs. 76.3%) in predicting the likelihood of diabetes at different time points.
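
For readers unfamiliar with the C-index used above to compare the two models, the following is a minimal sketch of Harrell's concordance index in plain Python. It is an illustration only: the function name `harrell_c_index` and the toy data are hypothetical, the handling of tied event times is simplified (such pairs are skipped), and it is not claimed to be the software or exact estimator used in the study.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Simplified Harrell's C-index for right-censored data.

    times       -- follow-up time for each subject
    events      -- 1 if the event (e.g. diabetes onset) was observed, 0 if censored
    risk_scores -- model output; a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption of this sketch: pairs with tied times are skipped
        # 'a' is the subject with the shorter follow-up time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # subject with earlier event has higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data (not from the Azar-Cohort): four subjects, one censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.1, 0.4, 0.7]
toy_c = harrell_c_index(toy_times, toy_events, toy_risk)
print(f"C-index on toy data: {toy_c:.2f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported values of 76.3% and 79.5% both indicate that the models order subjects by risk considerably better than chance.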