Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods

In the realm of traffic safety analysis, the inherent imbalance in crash datasets, particularly in terms of injury severity, poses a significant challenge for machine learning-based classification models. This study delves into the efficacy of Generative Adversarial Networks (GANs), with a specific...

Bibliographic Details
Main Authors:	Bei Zhou, Qianxi Zhou, Zongzhi Li
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Crash injury severity; cost-sensitive learning; data imbalance; generative adversarial network
Online Access:	https://ieeexplore.ieee.org/document/10819443/

author	Bei Zhou, Qianxi Zhou, Zongzhi Li
collection	DOAJ
description	In the realm of traffic safety analysis, the inherent imbalance in crash datasets, particularly in terms of injury severity, poses a significant challenge for machine learning-based classification models. This study delves into the efficacy of Generative Adversarial Networks (GANs), with a specific focus on Conditional Tabular GAN (CTGAN), for synthesizing minority crash data to address this imbalance. Utilizing traffic crash data from Chicago spanning 2020 to 2022, the research evaluates the capabilities of CTGAN against three traditional data resampling methods, as well as an additional cost-sensitive learning approach. These methods are evaluated across various injury severity classification scenarios (2-class, 3-class, and 4-class) using five commonly applied injury severity classification models. The study’s dual evaluation approach encompasses both the quality of synthetic data and the enhancement of classification model performance. The pivotal findings reveal that: 1) CTGAN markedly outperforms other data resampling techniques in generating superior quality synthetic data, particularly for the least represented injury severity category; 2) While CTGAN demonstrates substantial improvements over traditional data resampling methods in classification model performance, this advantage diminishes as the number of injury categories increases; 3) Surprisingly, CTGAN’s superior data quality does not result in better classification performance compared to cost-sensitive learning, especially in more complex classification scenarios. Cost-sensitive learning combined with LightGBM achieves the best classification performance across all scenarios. Given the significantly lower computational resources required by cost-sensitive learning, this approach is recommended for handling imbalanced injury severity data.
format	Article
id	doaj-art-7b4683edd90c40bbb4e6d279a10e4d18
institution	Kabale University
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-7b4683edd90c40bbb4e6d279a10e4d18; 2025-01-16T00:02:05Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13; pp. 2929-2944; DOI 10.1109/ACCESS.2024.3524620; article no. 10819443; Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods; Bei Zhou (https://orcid.org/0000-0001-9639-2560), School of Transportation Engineering, Chang’an University, Xi’an, Shaanxi, China; Qianxi Zhou (https://orcid.org/0009-0000-9062-8413), School of Transportation Engineering, Chang’an University, Xi’an, Shaanxi, China; Zongzhi Li (https://orcid.org/0000-0002-6500-7460), Department of Civil, Architectural, and Environmental Engineering, Illinois Institute of Technology, Chicago, IL, USA; https://ieeexplore.ieee.org/document/10819443/; Crash injury severity; cost-sensitive learning; data imbalance; generative adversarial network
title	Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods
topic	Crash injury severity; cost-sensitive learning; data imbalance; generative adversarial network
url	https://ieeexplore.ieee.org/document/10819443/

Similar Items

Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods