Federated learning for data streams
Keywords: Federated Learning, Data streaming, Big data pipelines
Choosing between a centralized and a federated architecture is challenging and often depends on the priorities of the specific application. A centralized architecture appears to be the stronger option for guaranteeing accountability and control over data flows, while a federated one makes it possible to exploit economies of scale for high scalability and gives each participant independence.
As data volumes continue to grow, federated models are emerging as the best fit for most applications, because accountability for data products is distributed. With a federated model, there is no need to wait for a centralized team to complete various processes before making a minor change, since accountability for every data product rests clearly with its own independent team.
Choosing a data streaming architecture can also be challenging. Each tool covers different aspects of architecting a generic streaming system, such as: scalability (the system can scale up or down as needed to handle large amounts of data, as well as unpredictable spikes in streaming); reliability (the system can cope with data loss, duplicates, or incorrect data processing); real-time processing (there is no significant latency between data being generated and data being processed, and data is processed without errors); data integration (data potentially streamed from a variety of sources and/or in different formats can be integrated); security (e.g., controlling access to data streams and preventing data loss); and fault tolerance (the streaming system can continue to operate even if some of its components fail).
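As an illustration of the reliability aspect above, the following minimal Python sketch drops duplicate events that a source might redeliver. The `Event` type, its `event_id` field, and the `deduplicate` helper are hypothetical names introduced only for this example; they do not correspond to the API of any particular streaming tool.

```python
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    event_id: str  # hypothetical unique identifier assigned by the producer
    value: float


def deduplicate(events: Iterable[Event]) -> Iterator[Event]:
    """Yield each event once, skipping redeliveries that share an event_id."""
    seen = set()  # unbounded in this sketch; a real system would expire old ids
    for event in events:
        if event.event_id in seen:
            continue  # duplicate delivery: ignore it
        seen.add(event.event_id)
        yield event


# The event with id "a" is delivered twice but counted once.
stream = [Event("a", 1.0), Event("b", 2.0), Event("a", 1.0), Event("c", 3.0)]
total = sum(e.value for e in deduplicate(stream))
print(total)  # 6.0
```

Keeping every identifier in memory is only workable for a toy stream; bounding that state (for example with a time window) is one of the trade-offs a tool comparison would need to examine.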
Several approaches to managing data streaming pipelines exist in the literature, for example the Lambda architecture (e.g., Apache Spark Streaming and Apache Storm), the Kappa architecture (e.g., Apache Kafka and Apache Samza), event-driven architectures (e.g., Google Cloud Pub/Sub and AWS Kinesis), and cloud-native architectures (e.g., Google Cloud Dataflow and AWS SQS).
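To make the difference between the first two approaches concrete, here is a minimal, tool-independent Python sketch. It assumes a hypothetical in-memory event log of `(sensor_id, reading)` pairs and computes per-sensor totals in two ways: in the Lambda style, a batch view over older data is merged with a speed-layer view over recent data at query time; in the Kappa style, a single streaming code path processes the whole log, and reprocessing means replaying it. The names `lambda_query`, `kappa_query`, and `batch_cutoff` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical append-only event log: (sensor_id, reading) pairs.
log = [("s1", 2), ("s2", 5), ("s1", 3), ("s2", 1), ("s1", 4)]


def lambda_query(log, batch_cutoff):
    """Lambda style: merge a batch view with a speed-layer view."""
    batch_view = Counter()
    for key, value in log[:batch_cutoff]:  # older data, recomputed in bulk
        batch_view[key] += value
    speed_view = Counter()
    for key, value in log[batch_cutoff:]:  # recent data, updated incrementally
        speed_view[key] += value
    # Counter addition keeps only positive totals, which holds for this data.
    return batch_view + speed_view


def kappa_query(log):
    """Kappa style: one streaming code path; reprocessing replays the log."""
    view = Counter()
    for key, value in log:
        view[key] += value
    return view


# Both styles give the same answer; they differ in how many code paths
# have to be maintained to produce it.
assert lambda_query(log, batch_cutoff=3) == kappa_query(log)
print(dict(kappa_query(log)))  # {'s1': 9, 's2': 6}
```

The sketch only captures the structural contrast (two code paths merged at query time versus one replayable path); it says nothing about the performance or operational cost of the tools listed above.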
Work to be done:
- State-of-the-art comparison of available federated streaming tools for data pipelines. Research in the field is very active, and with so many frameworks available it is difficult to understand the pros and cons of each. Key factors such as exposed APIs, use cases, and integrations can be used as criteria for a fair comparison.