Tabular Data Anomaly Patterns

Abstract

One essential and challenging task in data science is data cleaning, the process of identifying and eliminating data anomalies. Differences in data types, data domains, data acquisition methods, and the final purposes of data cleaning have led to different definitions of data anomalies in the literature. This paper proposes and describes a set of basic data anomalies, expressed as anomaly patterns commonly encountered in tabular data, independent of the data domain, data acquisition technique, or purpose of data cleaning. This set of anomalies can serve as a basis for developing and enhancing software products that provide general-purpose data cleaning facilities, and as a basis for comparing tools that support tabular data cleaning. Furthermore, this paper introduces a set of corresponding data operations suitable for addressing the identified anomaly patterns and introduces Grafterizer, a software framework that implements those data operations.

Category

Academic chapter/article/conference paper

Client

EC/H2020 / 732590
EC/H2020 / 732003
EC/H2020 / 644497

Language

English

Author(s)

Institution(s)

SINTEF Digital / Sustainable Communication Technologies

Year

2017

Publisher

IEEE (Institute of Electrical and Electronics Engineers)

Book

2017 International Conference on Big Data Innovations and Applications (Innovate-Data), Prague, Czech Republic, 21-23 Aug. 2017

ISBN

978-1-5386-0960-6

Page(s)

25 - 34

DOI

https://doi.org/10.1109/innovate-data.2017.10

Full text

https://hdl.handle.net/11250/2491583

