ML model for data type identification in structured data
Keywords: Data linking, Data enrichment, Tabular data, Machine Learning
Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches are often not robust to dirty data and detect only a limited number of types. Robust semantic type detection is important for obtaining a preliminary schema annotation of a tabular data source, which in turn assists Entity Linking tasks [Hulsebos et al 2019, Zhang et al 2019].
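To illustrate why matching-based detection is brittle, the following is a minimal sketch of a regular-expression detector. The pattern set, the function name `detect_type_by_regex`, and the 0.9 match threshold are assumptions made for illustration and are not taken from any particular system.

```python
import re

# Hypothetical patterns for illustration; real systems ship far larger rule sets.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]{2,}$", re.IGNORECASE),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "zip_code": re.compile(r"^\d{5}$"),
}


def detect_type_by_regex(values, threshold=0.9):
    """Return the first type whose pattern matches at least `threshold` of the cells."""
    if not values:
        return None
    for name, pattern in PATTERNS.items():
        matched = sum(1 for value in values if pattern.match(value))
        if matched / len(values) >= threshold:
            return name
    return None


clean_dates = ["2019-07-01", "2019-08-15", "2020-01-31"]
# Same kind of column, but with mixed formats and a missing-value marker.
dirty_dates = ["2019-07-01", "15/08/2019", "Jan 31, 2020", "n/a"]

print(detect_type_by_regex(clean_dates))
print(detect_type_by_regex(dirty_dates))
```

In this sketch the clean column is recognized as a date, while the dirty column matches no pattern and is left untyped. Types without a hand-written pattern are never detected at all.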
Work to be done:
- Define the a priori categories (the set of semantic types the model should detect).
- Build a training dataset for the machine learning model.
- Define the machine learning model.
- Validate the proposed model on the proposed dataset.
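The four steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the three categories, the hand-made columns, the four column-level features, and the nearest-centroid classifier are all hypothetical stand-ins chosen to keep the example dependency-free. They are not the deep learning models of the cited works.

```python
from collections import defaultdict
from math import dist

# Step 1: a priori categories (hypothetical; the project would choose its own).
CATEGORIES = ("year", "email", "name")

# Step 2: a toy labeled dataset of (column values, semantic type) pairs.
TRAIN = [
    (["1999", "2004", "2019"], "year"),
    (["1987", "2021", "1960"], "year"),
    (["ann@example.com", "bob@example.org"], "email"),
    (["carol@example.net", "dave@example.com"], "email"),
    (["Ann Lee", "Bob Stone"], "name"),
    (["Carol Diaz", "Dave Kim"], "name"),
]
TEST = [
    (["2001", "1975"], "year"),
    (["eve@example.com", "frank@example.org"], "email"),
    (["Eve Park", "Frank Moore"], "name"),
]


def column_features(values):
    """Map a column (list of strings) to a small fixed-length feature vector."""
    text = "".join(values)
    n_chars = max(len(text), 1)
    n_cells = max(len(values), 1)
    mean_length = sum(len(v) for v in values) / n_cells
    return (
        sum(c.isdigit() for c in text) / n_chars,  # fraction of digit characters
        sum(c.isalpha() for c in text) / n_chars,  # fraction of letters
        sum("@" in v for v in values) / n_cells,   # share of cells containing "@"
        min(mean_length / 30.0, 1.0),              # mean cell length, scaled and capped
    )


# Step 3: a deliberately simple model (assumption: nearest centroid in feature space).
class NearestCentroid:
    def fit(self, columns, labels):
        grouped = defaultdict(list)
        for column, label in zip(columns, labels):
            grouped[label].append(column_features(column))
        # One centroid per label: the per-dimension mean of its feature vectors.
        self.centroids = {
            label: tuple(sum(dim) / len(vectors) for dim in zip(*vectors))
            for label, vectors in grouped.items()
        }
        return self

    def predict(self, column):
        features = column_features(column)
        return min(self.centroids, key=lambda label: dist(features, self.centroids[label]))


# Step 4: validation as accuracy on held-out columns.
def accuracy(model, examples):
    correct = sum(model.predict(column) == label for column, label in examples)
    return correct / len(examples)


model = NearestCentroid().fit([c for c, _ in TRAIN], [l for _, l in TRAIN])
print("held-out accuracy:", accuracy(model, TEST))
```

Because the features summarize the whole column rather than requiring every cell to match a pattern, a column with a few dirty cells still lands near the right centroid in this toy setting. A real study would need a much larger labeled corpus, richer features, and a proper train/validation/test split.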
[Hulsebos et al 2019] Hulsebos, M., Hu, K., Bakker, M., Zgraggen, E., Satyanarayan, A., Kraska, T., ... & Hidalgo, C. (2019, July). Sherlock: A deep learning approach to semantic data type detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1500-1508).
[Zhang et al 2019] Zhang, D., Suhara, Y., Li, J., Hulsebos, M., Demiralp, Ç., & Tan, W. C. (2019). Sato: Contextual semantic type detection in tables. arXiv preprint arXiv:1911.06311.