
Data pipelines for scalable entity linking over large volumes of tabular data



Keywords: Data pipelines, Scalability, Data linking, Tabular data

Entity linking is a crucial task in Natural Language Processing (NLP) and Information Retrieval (IR) [Shen et al 2021]. It involves associating specific strings of text (known as mentions) with the corresponding entities in a knowledge graph or database. This helps systems determine exactly which entity a mention refers to in a given context, especially when the mention is ambiguous. For instance, a mention of "Paris" may refer to either "Paris, France" or "Paris, Texas," and entity linking resolves such ambiguity [Yin et al 2019].

In the context of structured data such as tables, entity linking involves identifying the cells that mention entities and associating each of them with the appropriate entry in a knowledge graph or other external structured database. This process makes structured data more meaningful and comprehensible by enriching it with external information and context. For instance, in a table of movies, a cell containing "The Matrix" would be linked to the corresponding entry in a knowledge graph, which provides additional details about the film, such as its director and release date.
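The idea of linking table cells to knowledge graph entries can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy knowledge graph `TOY_KG`, its `kg:` identifiers, the helper names `link_cell` and `link_column`, and the 0.8 similarity threshold are hypothetical and are not taken from the existing solution. Real systems retrieve candidates from a large knowledge graph and use the row and column context to disambiguate.

```python
from difflib import SequenceMatcher

# Hypothetical toy knowledge graph: entity ID -> label, type and attributes.
TOY_KG = {
    "kg:1": {"label": "The Matrix", "type": "film", "year": 1999},
    "kg:2": {"label": "Paris", "type": "city", "country": "France"},
    "kg:3": {"label": "Paris", "type": "city", "state": "Texas"},
}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_cell(mention, expected_type=None, threshold=0.8):
    """Return the ID of the best-matching entity, or None if no match is good enough.

    expected_type stands in for the column type (e.g. a column of films);
    it filters out candidates of the wrong type before scoring.
    """
    best_id, best_score = None, 0.0
    for entity_id, entity in TOY_KG.items():
        if expected_type is not None and entity["type"] != expected_type:
            continue
        score = similarity(mention, entity["label"])
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None


def link_column(rows, column, expected_type=None):
    """Link every cell of one column; rows is a list of dictionaries."""
    return [link_cell(row[column], expected_type) for row in rows]


movies = [{"title": "The Matrix"}, {"title": "Inception"}]
print(link_column(movies, "title", expected_type="film"))
```

Note that this sketch cannot separate the two "Paris" entries, because both have the same label and type; resolving such cases requires the other cells in the row as context.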

The current solution, available at https://github.com/roby-avo/alligator, achieves scalability through a fairly intricate bash script. The goal of this project is to develop a scalable solution built on standard workflow orchestration technologies such as Argo Workflows or Apache Airflow. This will improve the efficiency and robustness of the process while adhering to widely accepted technology standards.

Work to be done:

  • Investigate existing solutions to identify their strengths and weaknesses. This will inform the design of the new solution.
  • Define the architecture of the solution, including which technologies will be used.
  • Validate the proposed pipeline, also reporting computation time measurements.
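As a rough illustration of the validation step, the following sketch splits a column of mentions into chunks, processes the chunks in parallel, and reports the elapsed time. It is a sketch under stated assumptions, not the design of the project: `chunked`, `link_chunk` and `run_pipeline` are hypothetical names, the linking step is a placeholder that only normalises the text, and a thread pool from the Python standard library stands in for a workflow engine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def chunked(rows, size):
    """Yield consecutive slices of rows with at most size elements each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def link_chunk(chunk):
    """Placeholder linking step (assumption): pair each mention with a normalised form."""
    return [(mention, mention.strip().lower()) for mention in chunk]


def run_pipeline(rows, chunk_size=2, workers=4):
    """Process all chunks in parallel and return (results, elapsed seconds)."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns the chunk results in the same order as the input chunks.
        per_chunk = list(pool.map(link_chunk, chunked(rows, chunk_size)))
    elapsed = time.perf_counter() - started
    linked = [pair for chunk in per_chunk for pair in chunk]
    return linked, elapsed


linked, elapsed = run_pipeline(["The Matrix ", "Paris", "Inception"])
print(f"linked {len(linked)} cells in {elapsed:.4f} s")
```

Repeating the measurement for different values of `chunk_size` and `workers`, and for tables of increasing size, gives the kind of computation time information that the validation calls for.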

References:

[Shen et al 2021] Shen, W., Li, Y., Liu, Y., Han, J., Wang, J., & Yuan, X. (2021). Entity linking meets deep learning: Techniques and solutions. IEEE Transactions on Knowledge and Data Engineering.

[Yin et al 2019] Yin, X., Huang, Y., Zhou, B., Li, A., Lan, L., & Jia, Y. (2019). Deep entity linking via eliminating semantic ambiguity with BERT. IEEE Access, 7, 169434-169445.