Predicting graduation time using EDA and ML : a case study at the University of Oulu
Mahjouyanmoghaddam, Fatemeh (2025-06-09)
© 2025 Fatemeh Mahjouyanmoghaddam. Unless otherwise stated, reuse is permitted under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license (https://creativecommons.org/licenses/by/4.0/). Reuse is permitted provided that the source is appropriately cited and any changes are indicated. The use or reproduction of parts that are not the property of the author or authors may require permission directly from the respective rights holders.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202506094258
Abstract
Predicting the time to graduation is a critical challenge for higher education institutions, as delays in degree completion can have significant financial and academic implications. This study analyzes academic data collected at the University of Oulu from students registered from 2004 to 2023 and develops a machine learning model to forecast the remaining time to graduation for students who are still enrolled in a combined Bachelor’s and Master’s degree program. Using a dataset of student academic records, we apply data preprocessing techniques, feature engineering, and filtering criteria to extract key variables influencing graduation timelines. The selected features cover study interruptions (Total Gap), credit accumulation (Total Credits, Average Credits per Term), and study pace (Rolling Average of Credits, Credit Growth Rate, Study Pace).
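The abstract names the engineered features but does not define them. The following is a minimal sketch of how such features might be derived from one student's per-term credit history; it is not the thesis implementation. The function name `engineer_features`, the rolling `window`, the nominal load of 30 credits per term, and every formula below are assumptions made for illustration.

```python
from statistics import mean


def engineer_features(credits_per_term, window=3, nominal_credits_per_term=30.0):
    """Derive illustrative progression features from per-term credit counts.

    All definitions here are assumptions; the abstract does not specify them.
    """
    if not credits_per_term:
        raise ValueError("credits_per_term must not be empty")

    n_terms = len(credits_per_term)
    total_credits = float(sum(credits_per_term))

    # Assumed definition: a "gap" is a term in which no credits were earned.
    total_gap = sum(1 for c in credits_per_term if c == 0)

    avg_credits_per_term = total_credits / n_terms

    # Mean of the most recent `window` terms (or fewer, if the history is short).
    rolling_avg_credits = float(mean(credits_per_term[-window:]))

    # Assumed definition: average term-over-term change between first and last term.
    if n_terms > 1:
        credit_growth_rate = (credits_per_term[-1] - credits_per_term[0]) / (n_terms - 1)
    else:
        credit_growth_rate = 0.0

    # Assumed definition: credits earned relative to a nominal full-time load.
    study_pace = total_credits / (nominal_credits_per_term * n_terms)

    return {
        "total_gap": total_gap,
        "total_credits": total_credits,
        "avg_credits_per_term": avg_credits_per_term,
        "rolling_avg_credits": rolling_avg_credits,
        "credit_growth_rate": credit_growth_rate,
        "study_pace": study_pace,
    }


# Made-up example history: four terms, one of them with no credits.
features = engineer_features([30, 0, 20, 10])
```

In practice the resulting dictionaries, one per student, would be stacked into a feature table before model training.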
The predictive model is built using the Extreme Gradient Boosting (XGBoost) algorithm, optimized through hyperparameter tuning. The model achieves high predictive accuracy, with an R-squared (R²) score of 0.9596 on the test set. As a benchmark, a Random Forest model was also developed; it performed slightly worse but confirmed the robustness of the selected features across different tree-based methods. Feature importance analysis shows that Total Gap and Average Credits per Term are the most influential predictors of graduation time, reinforcing the role of credit accumulation and study continuity in student progression.
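For readers unfamiliar with the metric, R² is the coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed values. The sketch below computes it in plain Python; the helper name `r_squared` and the sample numbers are illustrative only and are not taken from the thesis.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of the observed values.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all observed values are equal")

    return 1.0 - ss_res / ss_tot


# Made-up values: perfect predictions versus always predicting the mean.
perfect = r_squared([1, 2, 3], [1, 2, 3])
baseline = r_squared([1, 2, 3], [2, 2, 2])
```

A score of 1 corresponds to perfect predictions and a score of 0 to a model that does no better than predicting the mean, so a test-set value near 0.96 indicates that most of the variance in the target is explained.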
Case study comparisons provide deeper insight into the behavioural patterns of students who are still studying relative to those who have graduated. The findings reveal that students with a slower study pace and lower credit accumulation rates tend to have prolonged graduation timelines, while those with consistent credit accumulation and fewer interruptions are more likely to complete their studies on time.
The results can help academic policymakers and advisors identify students at risk of delayed graduation and implement data-driven interventions. Future research could explore additional demographic and behavioural factors, as well as external commitments, to further improve prediction accuracy and policy recommendations.
Collections
- Open access [38618]