Intelligent data preparation and support for data pipelines

Data science MSc topics related to intelligent data preparation and support for data pipelines.

Description

Problem description

Data science projects rely on data management processes such as acquiring, validating, storing, protecting, and processing the required data and then making it available to data science processes and applications. Many of these processes are time-intensive and require the orchestration of complex data processing steps. We are therefore open to exploring potential MSc thesis topics that aim to devise supportive mechanisms for data scientists and domain experts in two areas: data preparation (also known as data prep, wrangling, or transformation) and data pipelines (composite pipelines for processing data with non-trivial properties and characteristics, commonly referred to as the Vs of Big Data, e.g. volume, velocity, and variety).

Qualifications - Requirements

Programming knowledge and skills for data management.

Goal

The goal is to explore and find intelligent mechanisms that support data scientists in the data preparation phase, as well as mechanisms that support both data scientists and domain experts throughout the complete lifecycle of managing data pipelines.

The above topics will be implemented in exciting domains and various contexts such as Digital Twins, Smart Cities, and Industry 4.0. The topics will be connected to ongoing EU and national projects within SINTEF.

References

Mutaz Barika, Saurabh Garg, Albert Y. Zomaya, Lizhe Wang, Aad Van Moorsel, and Rajiv Ranjan. 2019. Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions. ACM Comput. Surv. 52, 5, Article 95 (October 2019), 41 pages. DOI: https://doi.org/10.1145/3332301

Saliha Sajid, Bjørn Marius von Zernichow, Ahmet Soylu, and Dumitru Roman. 2019. Predictive Data Transformation Suggestions in Grafterizer Using Machine Learning. MTSR 2019, 137-149.

Tasks/Structure and Learning outcome

Data preparation: use of AI for intelligently suggesting data transformations; automated data quality assessment and recommendations for improving data quality; application of semantics and formal reasoning in the data preparation phase; support for data extension, enrichment, and interlinking; and use of (knowledge) graph representation and analytics techniques in data preparation.
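As a rough illustration of what automated data quality assessment with transformation suggestions could look like at its simplest, the sketch below profiles a single column and derives rule-based suggestions from the profile. It is not the approach of any project mentioned here: the function names (`profile_column`, `suggest_transformations`), the indicators, and the thresholds are all assumptions chosen for the example, and a thesis would typically replace the hand-written rules with learned or semantic methods.

```python
from typing import Any


def _is_missing(value: Any) -> bool:
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or str(value).strip() == ""


def _is_number(value: Any) -> bool:
    """Return True if the value can be parsed as a float."""
    try:
        float(str(value).strip())
        return True
    except ValueError:
        return False


def profile_column(values: list[Any]) -> dict[str, float]:
    """Compute simple quality indicators for one column."""
    present = [v for v in values if not _is_missing(v)]
    numeric = sum(1 for v in present if _is_number(v))
    return {
        "missing_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
        "numeric_ratio": numeric / len(present) if present else 0.0,
        "distinct": len({str(v).strip() for v in present}),
    }


def suggest_transformations(
    profile: dict[str, float],
    missing_threshold: float = 0.2,   # hypothetical cut-off
    numeric_threshold: float = 0.9,   # hypothetical cut-off
) -> list[str]:
    """Map a column profile to human-readable suggestions using fixed rules."""
    suggestions = []
    if profile["missing_ratio"] > missing_threshold:
        suggestions.append("impute or drop missing values")
    # Mostly numeric, but not entirely: a few stray entries block a clean cast.
    if numeric_threshold <= profile["numeric_ratio"] < 1.0:
        suggestions.append("clean non-numeric entries, then cast to number")
    return suggestions


if __name__ == "__main__":
    ages = ["34", " 41", "", None, "29", "abc"]
    profile = profile_column(ages)
    print(profile)
    print(suggest_transformations(profile))
```

The fixed thresholds keep the example easy to follow; choosing or learning such thresholds per dataset is exactly the kind of question the topic leaves open.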

Data pipelines: AI-driven techniques for data pipeline discovery; languages for modelling data pipelines; and techniques for the simulation, deployment, and adaptation of data pipelines. Of particular interest could be support for data pipelines on the Computing Continuum (how heterogeneous infrastructures such as cloud, fog, and edge could be used to support the complete data pipeline lifecycle).
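To make the idea of modelling a data pipeline concrete, one common starting point is to describe it as a directed acyclic graph of named steps and execute the steps in dependency order. The sketch below does this with Python's standard `graphlib` module (Python 3.9+). The step names, the `run_pipeline` helper, and the dictionary-based pipeline description are assumptions made for this example; they do not correspond to a specific pipeline language from the projects mentioned here, and deployment across cloud, fog, and edge is out of scope for the sketch.

```python
from graphlib import TopologicalSorter
from typing import Callable

# A step is described by the names of its upstream steps and a function
# that receives the upstream outputs as positional arguments, in that order.
Step = tuple[list[str], Callable[..., object]]


def run_pipeline(steps: dict[str, Step]) -> dict[str, object]:
    """Run every step once, after all of its upstream steps have run."""
    # TopologicalSorter expects a mapping from node to its predecessors.
    graph = {name: set(deps) for name, (deps, _) in steps.items()}
    results: dict[str, object] = {}
    for name in TopologicalSorter(graph).static_order():
        deps, func = steps[name]
        results[name] = func(*(results[d] for d in deps))
    return results


if __name__ == "__main__":
    # Hypothetical three-step pipeline: acquire -> validate -> process.
    pipeline: dict[str, Step] = {
        "process": (["validate"], lambda rows: sum(rows) / len(rows)),
        "acquire": ([], lambda: [4, None, 6, 8]),
        "validate": (["acquire"], lambda rows: [r for r in rows if r is not None]),
    }
    print(run_pipeline(pipeline)["process"])
```

Because the pipeline is plain data, the same description could in principle be inspected, simulated, or assigned to different infrastructures before it is run, which is the kind of lifecycle support the topic is concerned with. A cyclic description makes `TopologicalSorter` raise `graphlib.CycleError`.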

Knowledge: Upon successful completion of the thesis, the student should have specialized knowledge of the data management aspects of data science.

Skills: Upon successful completion of the thesis, the student should be able to clearly define and delimit a problem area; connect their own project to relevant literature; plan and carry out limited research or development projects; identify the types and scope of results required to ensure that the claims and conclusions are scientifically valid; and reflect on the decisions made and their consequences for the project.

Supervisor

Main supervisor: Dumitru Roman
Assisting supervisor: Nikolay Nikolov
Internal supervisor (UiO): Arne Jørgen Berre
Other supervisors: Ahmet Soylu, Mihhail Matskin