An Ante Hoc Enhancement Method for Image-Based Complex Financial Table Extraction

Bibliographic Details
Main Authors: Weiyu Peng, Xuhui Li
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Applied Sciences
Subjects: table extraction; information extraction; text matching; artificial intelligence
Online Access: https://www.mdpi.com/2076-3417/15/1/370
collection DOAJ
description Tables are a common way of organizing data in finance, and extracting data from them at scale is a fundamental task for researchers. The task is difficult because many tables exist only in unstructured forms, such as scanned images in PDFs, rather than in easily processed forms such as Excel spreadsheets. In recent years, many table extraction methods based on heuristic algorithms or deep learning models have been proposed to replace time-consuming manual processing. Although existing methods are highly accurate on some kinds of tables, they often fall short on complex financial tables with multi-line text and missing demarcation lines. This article proposes an enhancement method for image-based complex table extraction that consists of two modules: a split module and a filter module. The split module uses an OCR (optical character recognition) model to locate text regions and a heuristic algorithm to obtain candidate demarcation lines. The filter module is based on a text semantic matching model and another heuristic algorithm. Experimental results show that the proposed method significantly improves the performance of different table extraction methods, increasing F1 scores by between 5.10 and 14.36 points.
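The split step described in the abstract can be pictured with a minimal sketch. The record does not give the authors' heuristic, so everything below is an assumption made for illustration: the box format `(x0, y0, x1, y1)`, the function name `candidate_row_lines`, and the `min_gap` threshold are all hypothetical. The sketch proposes a candidate horizontal demarcation line wherever OCR text boxes leave a vertical gap.

```python
from typing import List, Tuple

# Hypothetical OCR output format: (x0, y0, x1, y1) per detected text region.
Box = Tuple[float, float, float, float]


def candidate_row_lines(boxes: List[Box], min_gap: float = 2.0) -> List[float]:
    """Return y-coordinates of candidate horizontal demarcation lines.

    Assumption: a candidate line is placed midway through every vertical gap
    of at least `min_gap` pixels that no text box overlaps.
    """
    # Project every box onto the y-axis and sort the intervals by their top edge.
    intervals = sorted((y0, y1) for _, y0, _, y1 in boxes)
    lines: List[float] = []
    if not intervals:
        return lines

    current_end = intervals[0][1]
    for start, end in intervals[1:]:
        # A gap exists when the next interval starts below everything seen so far.
        if start - current_end >= min_gap:
            lines.append((current_end + start) / 2)
        current_end = max(current_end, end)
    return lines


boxes = [(0, 0, 10, 10), (20, 1, 30, 9), (0, 20, 10, 30), (0, 31, 10, 40)]
print(candidate_row_lines(boxes))  # [15.0]
```

In this example the two lower boxes are only one pixel apart, so no line is proposed between them. A gap-based heuristic like this tends to over-split cells that contain multi-line text, which is presumably why the described method follows it with a filter module based on text semantic matching; that filter is not sketched here.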
id doaj-art-19bf5530b20242d68478647ed06bbf97
institution Kabale University
issn 2076-3417
doi 10.3390/app15010370
volume 15
issue 1
article_number 370
affiliation School of Information Management, Wuhan University, Wuhan 430072, China (both authors)
topic table extraction
information extraction
text matching
artificial intelligence