Methodology for Developing a Fact-Checked News Dataset in Norwegian Bokmål for Fake News Detection (The Fakespeak-NOR Corpus)

Abstract

This work presents the methodology for constructing a novel dataset of fact-checked news articles in Norwegian Bokmål, a language with relatively limited publicly available resources for natural language processing. To the best of our knowledge, this is the first dataset of its kind that combines the text of a news article with its veracity label. The source of the data is Faktisk.no, the only Norwegian fact-checking organization. Each of their fact-checks is published with a detailed assessment of a claim, including a link to the original article in which the claim first appeared, a verdict (one of five categories: completely true, partially true, not sure, partially false, or completely false), and a justification based on factual evidence. The dataset creation process involves several filtering steps. First, all links to the articles containing the original claim were validated. Articles that had been deleted, often because the claim was flagged as false, were excluded. Non-textual content, such as video and audio, was identified using keywords in the URL and removed. Articles behind hard paywalls were also removed. From the initial pool of 423 articles, approximately 200 valid instances were retained. Each article was manually reviewed to ensure that the claim being assessed was still present in the current version of the source article. A key challenge in compiling such datasets is that false claims are frequently deleted or edited after being fact-checked, which leaves many articles unusable. The final dataset includes, for each instance, the claim under evaluation, the corresponding article text, its title, and its veracity label. This collection is intended to support future research on the language of fake news as well as mis- and disinformation detection in low-resource languages.
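
The automatic filtering steps described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `FactCheck` record, its field names, the `MEDIA_KEYWORDS` list, and the function names are all hypothetical, since the abstract does not specify which keywords or tools were used. The link and paywall checks are assumed to have been recorded beforehand, and the manual review of whether the claim is still present in the article is not covered.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical keyword list; the abstract does not state which keywords were used.
MEDIA_KEYWORDS = ("video", "audio", "podcast", "tv", "radio")


@dataclass
class FactCheck:
    """One fact-check entry (field names are illustrative assumptions)."""
    claim: str
    source_url: str
    verdict: str
    link_alive: bool   # assumed result of an earlier link validation step
    paywalled: bool    # assumed result of an earlier hard-paywall check


def is_non_textual(url: str) -> bool:
    """Return True if a media keyword appears as a token in the URL host or path."""
    parsed = urlparse(url.lower())
    tokens = re.split(r"[^a-z0-9]+", parsed.netloc + parsed.path)
    return any(token in MEDIA_KEYWORDS for token in tokens)


def filter_candidates(items: list[FactCheck]) -> list[FactCheck]:
    """Keep entries whose source article is reachable, textual, and not paywalled."""
    kept = []
    for item in items:
        if not item.link_alive:
            continue  # article deleted or link broken
        if is_non_textual(item.source_url):
            continue  # video or audio content, detected via URL keywords
        if item.paywalled:
            continue  # behind a hard paywall
        kept.append(item)
    return kept
```

Matching on whole URL tokens rather than substrings is a design choice made here to avoid accidental hits inside longer words; the original keyword matching may have worked differently.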

Category

Conference abstract

Language

English

Author(s)

Affiliation

  • SINTEF Digital / Sustainable Communication Technologies
  • University of Oslo

Year

2025

Published in

Impulses and Approaches to Computer-Mediated Communication: Proceedings of the 12th International Conference on Computer-Mediated Communication and Social Media Corpora for the Humanities

Page(s)

122