Controlled Self-Recovery of the Aggregator in Federated Learning Using RAFT Protocol

Abstract

Federated Learning (FL) has emerged as a decentralised machine learning paradigm for distributed systems, particularly in edge and IoT environments. However, achieving fault tolerance and self-recovery in such scenarios is challenging due to the centralised model aggregation, which poses a single point of failure. This paper focuses on the self-recovery of the aggregator, specifically the controlled re-assignment of the aggregator role to the most suitable node. Our proposed solution leverages the RAFT consensus algorithm to facilitate consistent state replication and leader election within the FL system. This is complemented by controlled aggregator re-assignment, which considers various contextual properties to select the optimal node, enhancing the system's robustness, especially in dynamic and unreliable cyber-physical environments. We implement a proof of concept using the Flower FL framework and conduct experiments to evaluate aggregator recovery time and the traffic overhead associated with state replication. While the traffic overhead scales with the number of FL nodes, our results demonstrate a resilient, self-recovering system capable of handling node failures while maintaining model consistency.
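The controlled re-assignment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the property names (`cpu_free`, `battery`, `link_quality`), the weights, and the `log_index` eligibility rule are assumptions chosen for illustration only. The idea shown is that, among reachable nodes holding the most recent replicated state, the one with the best weighted contextual score takes over the aggregator role.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class NodeContext:
    """Hypothetical contextual properties of one FL node."""
    node_id: str
    alive: bool          # node is currently reachable
    log_index: int       # index of the last replicated state entry
    cpu_free: float      # fraction of free CPU, 0..1
    battery: float       # remaining battery, 0..1
    link_quality: float  # normalised link quality, 0..1


# Illustrative weights; a real deployment would tune or learn these.
WEIGHTS = {"cpu_free": 0.4, "battery": 0.3, "link_quality": 0.3}


def score(node: NodeContext) -> float:
    """Weighted sum of the contextual properties."""
    return sum(w * getattr(node, name) for name, w in WEIGHTS.items())


def select_aggregator(nodes: Iterable[NodeContext]) -> Optional[NodeContext]:
    """Pick the best-scoring live node among those with the newest state."""
    alive = [n for n in nodes if n.alive]
    if not alive:
        return None
    # Only nodes holding the most recent replicated state are eligible,
    # so the new aggregator can resume without losing model state.
    newest = max(n.log_index for n in alive)
    eligible = [n for n in alive if n.log_index == newest]
    # Break score ties deterministically by node_id.
    return max(eligible, key=lambda n: (score(n), n.node_id))


nodes = [
    NodeContext("a", False, 10, 1.0, 1.0, 1.0),  # failed aggregator
    NodeContext("b", True, 10, 0.5, 0.5, 0.5),
    NodeContext("c", True, 10, 0.9, 0.2, 0.8),
    NodeContext("d", True, 9, 1.0, 1.0, 1.0),    # stale state, not eligible
]
chosen = select_aggregator(nodes)
```

In this sketch node `d` has the best raw properties but is skipped because its replicated state lags behind, which mirrors the role that consistent state replication plays in the recovery.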

Language

English

Author(s)

Affiliation

SINTEF Digital / Sustainable Communication Technologies

Date

18.12.2025

Year

2025

Published in

ACM Transactions on Autonomous and Adaptive Systems

ISSN

1556-4665

DOI

https://doi.org/10.1145/3785470
