Entity linking optimization and analysis for public bodies data: A study on Spend Network data

This thesis aims to analyze in depth, and improve, Entity Linking (EL) on public bodies' data sourced from Spend Network.

Keywords: Data linking, Data enrichment, Procurement data

Entity linking is a core task in Natural Language Processing (NLP) and Information Retrieval (IR) [Shen et al 2021]. It associates strings of text (known as mentions) with the corresponding entries (entities) in a knowledge graph or database. This lets a system determine which real-world entity a mention refers to in a given context, which matters most when the mention is ambiguous. For instance, the mention "Paris" may refer to either "Paris, France" or "Paris, Texas," and entity linking resolves this ambiguity using the surrounding context [Yin et al 2019].
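The "Paris" example can be sketched in a few lines of Python. This is only a toy illustration, not the approach of the cited papers: the knowledge base, the entity identifiers, and the functions `tokenize` and `link_mention` are all hypothetical, and candidates are ranked simply by how many words their description shares with the context.

```python
# Toy knowledge base. The identifiers and descriptions are invented for
# illustration and do not come from any real knowledge graph.
KNOWLEDGE_BASE = {
    "paris_france": {
        "label": "Paris",
        "description": "capital city of France on the Seine in Europe",
    },
    "paris_texas": {
        "label": "Paris",
        "description": "city in Lamar County Texas United States",
    },
}


def tokenize(text):
    """Lowercase the text and split it into a set of words without punctuation."""
    tokens = {tok.strip(".,;:!?\"'()").lower() for tok in text.split()}
    return tokens - {""}


def link_mention(mention, context):
    """Return the id of the candidate whose description best overlaps the context.

    Step 1 (candidate generation): keep entries whose label equals the mention.
    Step 2 (disambiguation): rank candidates by word overlap with the context.
    Returns None when the knowledge base has no candidate for the mention.
    """
    candidates = [
        entity_id
        for entity_id, entry in KNOWLEDGE_BASE.items()
        if entry["label"].lower() == mention.lower()
    ]
    if not candidates:
        return None
    context_tokens = tokenize(context)
    return max(
        candidates,
        key=lambda eid: len(
            context_tokens & tokenize(KNOWLEDGE_BASE[eid]["description"])
        ),
    )
```

Calling `link_mention("Paris", "Paris is the capital of France.")` selects the French entry, while a context that mentions Texas selects the other one. Real systems replace the word-overlap score with learned representations, but the two steps (candidate generation, then disambiguation) stay the same.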

For structured data such as tables, entity linking associates individual cells (the mentions) with the appropriate entries in a knowledge graph or another external structured database. This makes the table more meaningful and easier to interpret, because each linked cell is enriched with external information and context. For instance, in a table of movies, a cell containing "The Matrix" would be linked to the corresponding entry in a knowledge graph, which provides additional details about the film such as its director and release date.
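As a minimal sketch of cell-level linking, the snippet below matches each cell of a table column against the labels of a small knowledge base using string similarity from Python's standard `difflib` module. The knowledge base, its identifiers, the 0.8 threshold, and the helper names `link_cell` and `link_column` are assumptions made for illustration; this is not how the alligator tool works internally.

```python
from difflib import SequenceMatcher

# Toy knowledge base with invented identifiers.
KB = {
    "film:the_matrix": {"label": "The Matrix", "year": 1999},
    "film:the_matrix_reloaded": {"label": "The Matrix Reloaded", "year": 2003},
    "film:inception": {"label": "Inception", "year": 2010},
}


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def link_cell(cell, threshold=0.8):
    """Return the id of the most similar KB label, or None if below threshold."""
    best_id, best_score = None, 0.0
    for entity_id, entry in KB.items():
        score = similarity(cell, entry["label"])
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None


def link_column(rows, column):
    """Return copies of the rows with an added 'entity_id' key for one column."""
    return [{**row, "entity_id": link_cell(row[column])} for row in rows]


table = [{"title": "The Matrix"}, {"title": "the matrx"}, {"title": "Titanic"}]
linked_table = link_column(table, "title")
```

The threshold lets the linker leave a cell unlinked (here "Titanic", which has no close label in the toy knowledge base) instead of forcing a wrong match, while still tolerating small typos such as "the matrx".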

The data studied in this thesis comes from Spend Network (https://spendnetwork.com/). Established in 2007, Spend Network aims to compile every public tender and contract published worldwide. The goal of the project is to adapt the existing EL approach, available at https://github.com/roby-avo/alligator, to public bodies' data and to evaluate the resulting performance.

Work to be done:

  • Establish a dataset of public bodies, including its ground truth.
  • Refine the existing solution method if necessary.
  • Validate the refined approach using the proposed dataset.
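The validation step could, for example, compare the linker's output with the ground truth using precision, recall, and F1. The proposal does not prescribe these metrics, so the sketch below is only one possible choice; the function name `evaluate`, the row identifiers, and the entity identifiers are hypothetical.

```python
def evaluate(predictions, ground_truth):
    """Compare predicted links with gold links.

    Both arguments map a mention id to an entity id, or to None when the
    mention is left unlinked (predictions) or has no entity (ground truth).
    """
    linked = {m: e for m, e in predictions.items() if e is not None}
    gold = {m: e for m, e in ground_truth.items() if e is not None}
    correct = sum(1 for m, e in linked.items() if gold.get(m) == e)

    # Precision: share of emitted links that are correct.
    precision = correct / len(linked) if linked else 0.0
    # Recall: share of gold links that were recovered.
    recall = correct / len(gold) if gold else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example: four table rows naming public bodies.
gold_links = {"row1": "body:A", "row2": "body:B", "row3": "body:C", "row4": None}
predicted_links = {"row1": "body:A", "row2": "body:B", "row3": None, "row4": "body:D"}
scores = evaluate(predicted_links, gold_links)
```

In this example two of the three emitted links are correct and two of the three gold links are recovered, so precision and recall are both 2/3.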

References:

[Shen et al 2021] Shen, W., Li, Y., Liu, Y., Han, J., Wang, J., & Yuan, X. (2021). Entity linking meets deep learning: Techniques and solutions. IEEE Transactions on Knowledge and Data Engineering.

[Yin et al 2019] Yin, X., Huang, Y., Zhou, B., Li, A., Lan, L., & Jia, Y. (2019). Deep entity linking via eliminating semantic ambiguity with BERT. IEEE Access, 7, 169434-169445.