Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods
In the realm of traffic safety analysis, the inherent imbalance in crash datasets, particularly in terms of injury severity, poses a significant challenge for machine learning-based classification models. This study delves into the efficacy of Generative Adversarial Networks (GANs), with a specific...
Main Authors: | Bei Zhou, Qianxi Zhou, Zongzhi Li |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2025-01-01 |
Series: | IEEE Access |
Subjects: | Crash injury severity; cost-sensitive learning; data imbalance; generative adversarial network |
Online Access: | https://ieeexplore.ieee.org/document/10819443/ |
_version_ | 1841533417060040704 |
---|---|
author | Bei Zhou Qianxi Zhou Zongzhi Li |
author_facet | Bei Zhou Qianxi Zhou Zongzhi Li |
author_sort | Bei Zhou |
collection | DOAJ |
description | In the realm of traffic safety analysis, the inherent imbalance in crash datasets, particularly in terms of injury severity, poses a significant challenge for machine learning-based classification models. This study delves into the efficacy of Generative Adversarial Networks (GANs), with a specific focus on Conditional Tabular GAN (CTGAN), for synthesizing minority crash data to address this imbalance. Utilizing traffic crash data from Chicago spanning 2020 to 2022, the research evaluates the capabilities of CTGAN against three traditional data resampling methods, as well as an additional cost-sensitive learning approach. These methods are evaluated across various injury severity classification scenarios (2-class, 3-class, and 4-class) using five commonly applied injury severity classification models. The study’s dual evaluation approach encompasses both the quality of synthetic data and the enhancement of classification model performance. The pivotal findings reveal that: 1) CTGAN markedly outperforms other data resampling techniques in generating superior quality synthetic data, particularly for the least represented injury severity category; 2) While CTGAN demonstrates substantial improvements over traditional data resampling methods in classification model performance, this advantage diminishes as the number of injury categories increases; 3) Surprisingly, CTGAN’s superior data quality does not result in better classification performance compared to cost-sensitive learning, especially in more complex classification scenarios. Cost-sensitive learning combined with LightGBM achieves the best classification performance across all scenarios. Given the significantly lower computational resources required by cost-sensitive learning, this approach is recommended for handling imbalanced injury severity data. |
format | Article |
id | doaj-art-7b4683edd90c40bbb4e6d279a10e4d18 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-7b4683edd90c40bbb4e6d279a10e4d18; 2025-01-16T00:02:05Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; 13; 2929; 2944; 10.1109/ACCESS.2024.3524620; 10819443; Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods; Bei Zhou 0 https://orcid.org/0000-0001-9639-2560; Qianxi Zhou 1 https://orcid.org/0009-0000-9062-8413; Zongzhi Li 2 https://orcid.org/0000-0002-6500-7460; School of Transportation Engineering, Chang’an University, Xi’an, Shaanxi, China; School of Transportation Engineering, Chang’an University, Xi’an, Shaanxi, China; Department of Civil, Architectural, and Environmental Engineering, Illinois Institute of Technology, Chicago, IL, USA; In the realm of traffic safety analysis, the inherent imbalance in crash datasets, particularly in terms of injury severity, poses a significant challenge for machine learning-based classification models. This study delves into the efficacy of Generative Adversarial Networks (GANs), with a specific focus on Conditional Tabular GAN (CTGAN), for synthesizing minority crash data to address this imbalance. Utilizing traffic crash data from Chicago spanning 2020 to 2022, the research evaluates the capabilities of CTGAN against three traditional data resampling methods, as well as an additional cost-sensitive learning approach. These methods are evaluated across various injury severity classification scenarios (2-class, 3-class, and 4-class) using five commonly applied injury severity classification models. The study’s dual evaluation approach encompasses both the quality of synthetic data and the enhancement of classification model performance. The pivotal findings reveal that: 1) CTGAN markedly outperforms other data resampling techniques in generating superior quality synthetic data, particularly for the least represented injury severity category; 2) While CTGAN demonstrates substantial improvements over traditional data resampling methods in classification model performance, this advantage diminishes as the number of injury categories increases; 3) Surprisingly, CTGAN’s superior data quality does not result in better classification performance compared to cost-sensitive learning, especially in more complex classification scenarios. Cost-sensitive learning combined with LightGBM achieves the best classification performance across all scenarios. Given the significantly lower computational resources required by cost-sensitive learning, this approach is recommended for handling imbalanced injury severity data.; https://ieeexplore.ieee.org/document/10819443/; Crash injury severity; cost-sensitive learning; data imbalance; generative adversarial network |
spellingShingle | Bei Zhou Qianxi Zhou Zongzhi Li Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods IEEE Access Crash injury severity cost-sensitive learning data imbalance generative adversarial network |
title | Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods |
title_full | Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods |
title_fullStr | Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods |
title_full_unstemmed | Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods |
title_short | Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods |
title_sort | addressing data imbalance in crash data evaluating generative adversarial network x2019 s efficacy against conventional methods |
topic | Crash injury severity cost-sensitive learning data imbalance generative adversarial network |
url | https://ieeexplore.ieee.org/document/10819443/ |
work_keys_str_mv | AT beizhou addressingdataimbalanceincrashdataevaluatinggenerativeadversarialnetworkx2019sefficacyagainstconventionalmethods AT qianxizhou addressingdataimbalanceincrashdataevaluatinggenerativeadversarialnetworkx2019sefficacyagainstconventionalmethods AT zongzhili addressingdataimbalanceincrashdataevaluatinggenerativeadversarialnetworkx2019sefficacyagainstconventionalmethods |
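
The description above recommends cost-sensitive learning for imbalanced injury severity data, but this record does not say how the authors set the misclassification costs. As a rough illustration only, the sketch below assumes one common scheme, inverse-frequency ("balanced") class weights, where each class gets weight N / (K * n_c) for N samples, K classes, and n_c samples in class c. The label names, the toy class counts, and the helper functions `balanced_class_weights` and `weighted_error` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using weight = N / (K * n_c).

    Assumed weighting scheme for illustration; the paper's actual
    cost assignment is not given in this record.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Misclassification rate where each sample counts by the weight of its true class."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical, heavily imbalanced 3-class toy labels (not real crash data).
labels = ["no_injury"] * 90 + ["injury"] * 9 + ["fatal"] * 1
weights = balanced_class_weights(labels)

# A degenerate classifier that always predicts the majority class.
majority_only = ["no_injury"] * len(labels)

# Unweighted error looks small because the minority classes are rare.
plain_error = sum(t != p for t, p in zip(labels, majority_only)) / len(labels)

# Weighted error exposes that two of the three classes are always missed.
cost_error = weighted_error(labels, majority_only, weights)

print(weights)
print(plain_error, cost_error)
```

With these toy counts the always-majority predictor has an unweighted error of 0.1 but a weighted error of about 0.67, which is the kind of penalty a cost-sensitive objective places on ignoring rare severity classes. In LightGBM's scikit-learn interface, a dictionary like `weights` could plausibly be supplied through the `class_weight` parameter of `LGBMClassifier`; whether the authors did so is not stated here.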