Assessing Similarity Between Datasets Using Vector Representations

Bibliographic Details
Main Authors: A. A. Usatoff, A. M. Nedzved, Guo Jiran
Format: Article
Language: Russian
Published: Educational institution «Belarusian State University of Informatics and Radioelectronics» 2025-07-01
Series: Doklady Belorusskogo gosudarstvennogo universiteta informatiki i radioèlektroniki
Online Access: https://doklady.bsuir.by/jour/article/view/4164
Description
Summary: The article presents an approach to measuring the similarity of datasets used for training algorithms, with datasets of human face images as an example. The approach makes it possible to find similar datasets from different sources, which can extend the range of features and classes available for detection and significantly affect dataset balance. A vector representation (embedding) was obtained for each object in a dataset using a pretrained ResNet network, and the embeddings of the two datasets were then compared. In the experiments, one dataset was split into two parts, which served as a pair of similar datasets, and each part was then compared with a different dataset. A new similarity metric is proposed that has several advantages and makes it possible to find the most similar datasets.
ISSN: 1729-7648
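
The comparison described in the summary (embed every object, then compare the embeddings of two datasets) can be illustrated with a minimal Python sketch. This record does not define the article's new metric, so the score below is a stand-in chosen for illustration only: a symmetric mean nearest-neighbour cosine similarity. The function names, the embedding dimension, and the synthetic random vectors used in place of ResNet embeddings are all assumptions, not taken from the article.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def dataset_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Stand-in score (NOT the article's metric): for every embedding, take the
    cosine similarity to its nearest neighbour in the other dataset, then
    average over both directions."""
    a, b = l2_normalize(emb_a), l2_normalize(emb_b)
    cos = a @ b.T                      # pairwise cosine similarities, shape (len(a), len(b))
    a_to_b = cos.max(axis=1).mean()    # best match in B for each item of A
    b_to_a = cos.max(axis=0).mean()    # best match in A for each item of B
    return float((a_to_b + b_to_a) / 2.0)


# Synthetic stand-ins for embeddings (assumption: real ones would come from a
# pretrained ResNet, as in the summary). Each dataset is a noisy cluster.
rng = np.random.default_rng(0)
dim = 64                               # hypothetical embedding size
centre_1 = rng.normal(size=dim)
centre_2 = rng.normal(size=dim)
dataset_1 = centre_1 + 0.3 * rng.normal(size=(200, dim))
dataset_2 = centre_2 + 0.3 * rng.normal(size=(100, dim))

# Mirror the experiment: split one dataset into two similar parts, then
# compare each part with a different dataset.
part_a, part_b = dataset_1[:100], dataset_1[100:]

score_parts = dataset_similarity(part_a, part_b)
score_a_other = dataset_similarity(part_a, dataset_2)
score_b_other = dataset_similarity(part_b, dataset_2)

print(f"part A vs part B:    {score_parts:.3f}")
print(f"part A vs dataset 2: {score_a_other:.3f}")
print(f"part B vs dataset 2: {score_b_other:.3f}")
```

Under these assumptions the two halves of one dataset should score higher against each other than either does against the unrelated dataset, which is the behaviour the split experiment in the summary is designed to check.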