A geolocated dataset of German news articles

Bibliographic Details
Main Authors: Lukas Kriesch, Sebastian Losacker
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-025-05422-w
Description: The emergence of large language models and the exponential growth of digitized text data have revolutionized research methodologies across a broad range of social sciences. News data is crucial for the social sciences as it provides real-time insights into public discourse and societal trends. In this paper, we provide insights into how news articles can be geolocated and how the texts can then be further analyzed. We collect data from the CommonCrawl News dataset and clean the text data. We then use a named-entity recognition model for geocoding. Finally, we transform the news articles into text embeddings using SBERT, enabling semantic searches within the news data corpus. In the paper, we apply this process to all German news articles and make the German location data, as well as the embeddings, available for download. We compile a dataset containing text embeddings for about 50 million German news articles, of which about 70% include geographic locations. The process can be replicated for news data from other countries.
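
The semantic search step mentioned in the description can be illustrated with a minimal sketch. This is not the authors' code: the `embed` function is a hypothetical bag-of-words stand-in for an SBERT encoder, and the article texts and `location` values are invented for illustration. A real pipeline would replace `embed` with a sentence-embedding model and take locations from the named-entity recognition step.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an SBERT encoder (assumption): a word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_search(query, articles, top_k=2):
    """Rank articles by similarity to the query and drop non-matches."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(a["text"])), a) for a in articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article for score, article in scored[:top_k] if score > 0]

# Invented example records; "location" mimics the output of a geocoding step.
ARTICLES = [
    {"text": "Neue Anlage für Windkraft in Giessen eröffnet", "location": "Giessen"},
    {"text": "Bundesliga Spieltag endet mit Überraschung", "location": None},
    {"text": "Stadtrat Hamburg beschließt Ausbau der Windkraft", "location": "Hamburg"},
]

results = semantic_search("Windkraft Ausbau", ARTICLES)
for article in results:
    print(article["location"], "|", article["text"])
```

With word counts, only exact token overlap contributes to the score; a sentence-embedding model would also match paraphrases, which is the point of using embeddings for search over a news corpus.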
ISSN: 2052-4463
Author Affiliations: Lukas Kriesch, Department of Geography, Justus Liebig University Giessen; Sebastian Losacker, Department of Geography, Justus Liebig University Giessen