ML model for data type identification in structured data

The objective of this work is to propose a model that uses machine learning (ML) techniques to identify the data type of each column in a tabular data source. This requires establishing a set of a priori categories: PERSON, LOCATION, OTHER, NUMBER, DATE, STREET_ADDRESS, TELEPHONE_NUMBER, EMAIL, DESCRIPTION_TEXT, URL.

Keywords: Data linking, Data enrichment, Tabular data, Machine Learning

Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches are often not robust to dirty data and detect only a limited number of types. Detecting column types is important for obtaining a preliminary schema annotation of a tabular data source, which in turn assists Entity Linking tasks [Hulsebos et al 2019, Zhang et al 2019].
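To make the limitation of matching-based detection concrete, the following is a minimal sketch of a regular-expression baseline. The patterns, the `regex_column_type` function, and the 0.9 threshold are assumptions chosen for illustration; they are not taken from any existing system.

```python
import re

# Hypothetical patterns, for illustration only; real systems use far richer rules.
PATTERNS = {
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "URL": re.compile(r"https?://\S+"),
    "NUMBER": re.compile(r"[+-]?\d+(\.\d+)?"),
}


def regex_column_type(values, threshold=0.9):
    """Return the first type whose pattern fully matches at least
    `threshold` of the non-empty cells, or OTHER if none does."""
    cells = [v.strip() for v in values if v.strip()]
    if not cells:
        return "OTHER"
    for label, pattern in PATTERNS.items():
        hits = sum(1 for c in cells if pattern.fullmatch(c))
        if hits / len(cells) >= threshold:
            return label
    return "OTHER"


# A clean column and a dirty version of the same kind of column.
clean = ["ana@example.org", "bob@example.com", "eve@example.net"]
dirty = ["ana@example.org", "bob at example.com", "n/a"]

print(regex_column_type(clean))
print(regex_column_type(dirty))
```

In this sketch the dirty column falls below the match threshold and is no longer recognized as an email column, which is the kind of brittleness a learned model is expected to reduce.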

Work to be done:

  • Define the a priori categories.
  • Build a training dataset for the machine learning model.
  • Define the Machine Learning model.
  • Validate the proposed model over the proposed dataset.
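The steps above can be sketched end to end on toy data. The example below is only an illustration under stated assumptions: the four hand-picked features, the nearest-centroid classifier, and the tiny training set are all hypothetical stand-ins, not the model this work will propose and not the method of the cited papers.

```python
import math


def column_features(values):
    """Map a column (list of strings) to a small fixed-length feature vector."""
    cells = [v for v in values if v]
    n = max(len(cells), 1)
    total_chars = max(sum(len(c) for c in cells), 1)
    mean_tokens = sum(len(c.split()) for c in cells) / n
    return [
        sum(ch.isdigit() for c in cells for ch in c) / total_chars,  # digit ratio
        sum(ch.isalpha() for c in cells for ch in c) / total_chars,  # letter ratio
        sum("@" in c for c in cells) / n,                            # share of cells with '@'
        min(mean_tokens / 10.0, 1.0),                                # scaled mean tokens per cell
    ]


def fit_centroids(labelled_columns):
    """Average the feature vectors of the training columns for each label."""
    grouped = {}
    for label, values in labelled_columns:
        grouped.setdefault(label, []).append(column_features(values))
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, values):
    """Return the label whose centroid is closest to the column's features."""
    feats = column_features(values)
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))


# Toy training set (assumption): one labelled column per category.
train = [
    ("EMAIL", ["ana@example.org", "bob@example.com"]),
    ("NUMBER", ["12", "3.5", "1000"]),
    ("TELEPHONE_NUMBER", ["+1 555 0100", "555-0199"]),
    ("DESCRIPTION_TEXT", ["A short free text note about the item",
                          "Another longer description of a product"]),
]
centroids = fit_centroids(train)

print(predict(centroids, ["eve@example.net", "joe@example.org"]))
print(predict(centroids, ["42", "7.25"]))
```

A real model would need many labelled columns per category, a held-out validation split, and a much richer feature set or a learned representation. The sketch only shows how the categories, the dataset, the model, and the validation step fit together.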

References:

[Hulsebos et al 2019] Hulsebos, M., Hu, K., Bakker, M., Zgraggen, E., Satyanarayan, A., Kraska, T., ... & Hidalgo, C. (2019, July). Sherlock: A deep learning approach to semantic data type detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1500-1508).

[Zhang et al 2019] Zhang, D., Suhara, Y., Li, J., Hulsebos, M., Demiralp, Ç., & Tan, W. C. (2019). Sato: Contextual semantic type detection in tables. arXiv preprint arXiv:1911.06311.