Effectiveness of resampling methods in coping with imbalanced crash data: crash type analysis and predictive modeling

Morris, Clint; Yang, Jidong J.

doi:10.1016/j.aap.2021.106240

Journal Article

Citation	Morris C, Yang JJ. Accid. Anal. Prev. 2021; 159: 106240.
Copyright	(Copyright © 2021, Elsevier Publishing)
DOI	10.1016/j.aap.2021.106240
PMID	unavailable
Abstract	Crash data are commonly imbalanced. Depending on facility and control type, some crash types are more frequent than others. However, uncommon crash types are often more severe and associated with higher economic and societal costs, and are thus crucial to prevent. It is therefore important to develop inferential models that can reliably predict crash types and identify contributing factors, especially for the severe types. Current practice in modeling infrequent events generally disregards the disparity in data representation, which can lead to biased models. Mitigating and managing imbalanced data is thus essential to developing meaningful and robust models that help reveal effective countermeasures. This study compares the effects of resampling techniques on the performance of both machine learning and classical statistical models for classifying and predicting different crash types on freeways. Specifically, a mixed sampling approach combining cluster-based under-sampling with each of three popular over-sampling methods (random over-sampling, synthetic minority over-sampling, and adaptive synthetic sampling) was investigated with respect to four crash classification models: three ensemble machine learning models (CatBoost, XGBoost, and Random Forests) and one classical statistical model (Nested Logit). The study concluded that all three resampling methods consistently enhanced the performance of all models. Among the three over-sampling methods, adaptive synthetic sampling performed best and substantially improved the prediction of minority crash types without impeding the prediction of the majority crash type. This is likely because the density-based approach of adaptive synthetic sampling creates synthetic instances that are more congruent with the underlying manifold structure of the high-dimensional feature space.
Language	en
Keywords	Machine learning; Traffic crash; Data imbalance; Gradient boosting; Nested logit; Over-sampling; Resampling; Tree ensemble
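
Illustration	The three over-sampling ideas named in the abstract can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy two-feature data, and the class labels "common" and "rare" are all hypothetical, and the cluster-based under-sampling step is not shown. The last function only computes the density-based weights that an adaptive synthetic sampling scheme would use to decide how many synthetic rows to generate around each minority row.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature rows."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def random_oversample(X, y, rng):
    """Duplicate rows of smaller classes at random until all classes match the largest."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def smote_like(X, y, label, n_new, k, rng):
    """Interpolate between a minority row and one of its k nearest minority neighbours."""
    rows = [x for x, lab in zip(X, y) if lab == label]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        others = [r for r in rows if r is not base]
        neighbours = sorted(others, key=lambda r: sq_dist(base, r))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic


def adasyn_weights(X, y, label, k):
    """Weight each minority row by the share of other-class rows among its k nearest neighbours."""
    minority = [x for x, lab in zip(X, y) if lab == label]
    ratios = []
    for m in minority:
        ranked = sorted(
            ((sq_dist(m, x), lab) for x, lab in zip(X, y) if x is not m),
            key=lambda t: t[0],
        )[:k]
        ratios.append(sum(1 for _, lab in ranked if lab != label) / k)
    total = sum(ratios)
    if total == 0:  # no other-class neighbours anywhere: fall back to uniform weights
        return [1 / len(minority)] * len(minority)
    return [r / total for r in ratios]


# Hypothetical toy data: six "common" rows and three "rare" rows.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [2, 2], [5, 5], [6, 5], [5, 6]]
y = ["common"] * 6 + ["rare"] * 3
rng = random.Random(0)

X_bal, y_bal = random_oversample(X, y, rng)
synthetic = smote_like(X, y, "rare", n_new=3, k=2, rng=rng)
weights = adasyn_weights(X, y, "rare", k=3)
```

Minority rows that sit closer to the majority class receive larger weights, so more synthetic rows would be generated near the class boundary. For real analyses, the imbalanced-learn package offers maintained versions of these samplers (`RandomOverSampler`, `SMOTE`, and `ADASYN` in `imblearn.over_sampling`); whether the study used that package is not stated in the abstract.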