Enhancing entity linking over tabular data: An extension of current solutions and implementation of a generative tabular data lifecycle
Keywords: Entity linking, Data enrichment, Tabular data, Machine Learning
Entity linking is a core task in Natural Language Processing (NLP) and Information Retrieval (IR) [Shen et al 2021]. It associates spans of text that refer to entities (known as mentions) with the corresponding entries in a knowledge graph or database. This lets a system determine exactly which entity a mention refers to in a given context, especially when the mention is ambiguous. For instance, the mention "Paris" may refer to either "Paris, France" or "Paris, Texas," and entity linking resolves that ambiguity [Yin et al 2019].
For structured data such as tables, entity linking means linking individual cells (mentions) to the corresponding entries in a knowledge graph. This involves identifying which cells mention entities and matching each of them to a specific entry in an external knowledge graph. The process enriches the table with external information and context, making it more meaningful and comprehensive. For instance, in a table of movies, the cell "The Matrix" would be linked to its entry in the knowledge graph, giving access to additional information such as the film's director and release date.
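To make the cell-to-entity step concrete, the following is a minimal sketch of linking one table cell against a tiny in-memory knowledge graph. It is not the method used by the existing solution: the knowledge graph `TOY_KG`, the identifiers `E1` to `E3`, the function `link_cell`, and the type bonus of 0.1 are all hypothetical and chosen only for illustration. Candidates are scored by plain string similarity, with a small bonus when the candidate's type matches the type expected for the column.

```python
from difflib import SequenceMatcher

# Hypothetical toy knowledge graph: entity id -> label and type.
TOY_KG = {
    "E1": {"label": "The Matrix", "type": "film"},
    "E2": {"label": "The Matrix Reloaded", "type": "film"},
    "E3": {"label": "Matrix", "type": "mathematical object"},
}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_cell(mention, expected_type=None, kg=TOY_KG):
    """Return (entity_id, score) for the best-scoring candidate.

    The score is the label similarity plus an assumed bonus of 0.1
    when the candidate's type equals the expected column type.
    """
    best_id, best_score = None, 0.0
    for entity_id, entity in kg.items():
        score = similarity(mention, entity["label"])
        if expected_type is not None and entity["type"] == expected_type:
            score += 0.1
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id, best_score


best_id, score = link_cell("The Matrix", expected_type="film")
print(best_id, TOY_KG[best_id]["label"])
```

For the cell "The Matrix" in a film column, the exact label match wins over the sequel and the mathematical object. A real system would retrieve candidates from the knowledge graph itself and use richer features than string similarity.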
An existing solution is available at https://github.com/roby-avo/alligator, and the aim of this work is to extend its functionality. The current solution relies on a machine learning model trained on predefined datasets, which leaves room for improvement. The proposal is to enhance it by generating training tables directly from the knowledge graph and introducing controlled noise into them, so that the model becomes more robust.
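One possible reading of "generating tabular data with controlled noise" is sketched below. This is an assumption about how such a generator could look, not a description of the existing solution: the entity list `TOY_ENTITIES`, the identifiers, the functions `add_noise` and `generate_table`, and the three perturbations (character deletion, adjacent swap, lowercasing) are all hypothetical. Each generated row keeps its gold entity id, so the noisy table can serve directly as labelled training data.

```python
import random

# Hypothetical entities, as they might be sampled from a knowledge graph.
TOY_ENTITIES = [
    {"id": "E1", "label": "The Matrix", "year": "1999"},
    {"id": "E2", "label": "Inception", "year": "2010"},
    {"id": "E3", "label": "Alien", "year": "1979"},
]


def add_noise(text, rng, noise_rate):
    """With probability noise_rate, apply one random perturbation."""
    if len(text) < 2 or rng.random() >= noise_rate:
        return text
    i = rng.randrange(len(text) - 1)
    op = rng.choice(["delete", "swap", "lowercase"])
    if op == "delete":
        # Drop the character at position i.
        return text[:i] + text[i + 1:]
    if op == "swap":
        # Swap the characters at positions i and i + 1.
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]
    return text.lower()


def generate_table(entities, noise_rate=0.3, seed=0):
    """Build rows [label, year] plus gold annotations (row, col, entity_id)."""
    rng = random.Random(seed)  # seeded so that tables are reproducible
    rows, gold = [], []
    for row_idx, entity in enumerate(entities):
        noisy_label = add_noise(entity["label"], rng, noise_rate)
        rows.append([noisy_label, entity["year"]])
        gold.append((row_idx, 0, entity["id"]))
    return rows, gold


rows, gold = generate_table(TOY_ENTITIES, noise_rate=0.5, seed=42)
for row, annotation in zip(rows, gold):
    print(row, "->", annotation[2])
```

The `noise_rate` parameter is what makes the noise controlled: at 0.0 the table reproduces the knowledge graph labels exactly, and raising it increases the share of perturbed cells the model has to cope with.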
Work to be done:
- Generative Source Creation: Develop a generative source of tabular data that can simulate real-world complexities and nuances.
- Model Training: Train a machine learning model on the newly created generative source of tabular data. Compared with training on predefined datasets, this should allow the model to handle more diverse data scenarios.
- Model Validation: Validate the enhanced machine learning model. The validation could be conducted on datasets from the SemTab challenge, to check that the model performs well on known benchmarks.
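The validation step could be scored along the following lines. This is a sketch under the assumption that cell-level links are evaluated with precision, recall, and F1 over the annotated cells, as is commonly reported for cell-entity annotation benchmarks; the function `evaluate_links`, the cell keys, and the entity ids are hypothetical, and the exact scoring rules of a given SemTab round should be taken from its official evaluator.

```python
def evaluate_links(predictions, gold):
    """Compute precision, recall, and F1 for cell-level entity links.

    Both arguments map a cell key (table_id, row, col) to an entity id.
    Cells the system did not annotate are simply absent from predictions.
    """
    correct = sum(
        1 for cell, entity_id in predictions.items()
        if gold.get(cell) == entity_id
    )
    precision = correct / len(predictions) if predictions else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations and system output for one table.
gold = {
    ("t1", 0, 0): "E1",
    ("t1", 1, 0): "E2",
    ("t1", 2, 0): "E3",
    ("t1", 3, 0): "E4",
}
predictions = {
    ("t1", 0, 0): "E1",  # correct
    ("t1", 1, 0): "E2",  # correct
    ("t1", 2, 0): "E9",  # wrong entity
}

precision, recall, f1 = evaluate_links(predictions, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three submitted links are correct and two of the four gold cells are recovered, so precision is 2/3, recall is 1/2, and F1 is 4/7. Comparing these figures for a model trained on predefined datasets and one trained on generated tables would show whether the generative source helps.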
[Shen et al 2021] Shen, W., Li, Y., Liu, Y., Han, J., Wang, J., & Yuan, X. (2021). Entity linking meets deep learning: Techniques and solutions. IEEE Transactions on Knowledge and Data Engineering.
[Yin et al 2019] Yin, X., Huang, Y., Zhou, B., Li, A., Lan, L., & Jia, Y. (2019). Deep entity linking via eliminating semantic ambiguity with BERT. IEEE Access, 7, 169434-169445.