Abstract
Federated Learning (FL) has emerged as a decentralised machine learning paradigm for distributed systems, particularly in edge and IoT environments. However, achieving fault tolerance and self-recovery in such settings is challenging because model aggregation remains centralised, which makes the aggregator a single point of failure. This paper focuses on the self-recovery of the aggregator, specifically the controlled re-assignment of the aggregator role to the most suitable node. Our proposed solution uses the RAFT consensus algorithm to provide consistent state replication and leader election within the FL system. This is complemented by controlled aggregator re-assignment, which considers contextual properties of the candidate nodes to select the most suitable one, enhancing the system's robustness, especially in dynamic and unreliable cyber-physical environments. We implement a proof of concept using the Flower FL framework and conduct experiments to evaluate aggregator recovery time and the traffic overhead associated with state replication. Although the traffic overhead grows with the number of FL nodes, our results demonstrate a resilient, self-recovering system that handles node failures while maintaining model consistency.
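To make the idea of controlled aggregator re-assignment concrete, the following is a minimal sketch of how a suitable node might be chosen from replicated context information. It is not the implementation evaluated in this paper: the property names (`cpu_free`, `bandwidth`, `battery`), the weights, and the helper names `suitability` and `select_aggregator` are hypothetical, and the weighted sum is only one possible selection rule. The sketch also assumes that every node sees the same replicated context, so that a deterministic rule yields the same choice everywhere.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class NodeContext:
    """Hypothetical per-node context; all metrics are normalised to [0, 1]."""
    node_id: str
    alive: bool
    cpu_free: float
    bandwidth: float
    battery: float


# Illustrative weights only; a real deployment would tune or replace these.
WEIGHTS: Dict[str, float] = {"cpu_free": 0.4, "bandwidth": 0.4, "battery": 0.2}


def suitability(node: NodeContext, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the node's contextual properties."""
    return sum(w * getattr(node, name) for name, w in weights.items())


def select_aggregator(
    nodes: Iterable[NodeContext], weights: Dict[str, float] = WEIGHTS
) -> Optional[str]:
    """Return the id of the highest-scoring live node, or None if none is alive.

    Ties are broken by the lexicographically largest node_id, so the result
    is deterministic for a given set of node contexts.
    """
    candidates = [n for n in nodes if n.alive]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: (suitability(n, weights), n.node_id))
    return best.node_id
```

Under these assumptions, a failed node is never selected regardless of its last reported metrics, and the aggregator role moves to the live node with the highest score.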