Fully Asynchronous Federated Learning with Faster Convergence for LLM Reasoning
Abstract
Federated Learning (FL) enables collaborative model training across decentralized devices while preserving data privacy. Concurrently, Large Language Models (LLMs) such as GPT, Claude, and Qwen have redefined natural language understanding and generation. Integrating LLMs into FL frameworks remains challenging: conventional synchronous FL mechanisms frequently suffer from significant communication overhead and from idle time spent waiting for stragglers, both exacerbated by the vast parameter space of LLMs and the inherent hardware heterogeneity of edge devices. To mitigate these inefficiencies, we propose a fully asynchronous FL framework optimized for LLM fine-tuning. Our core contribution is a systematic exploration of matrix decomposition and approximation techniques to identify the most effective linear algebraic methods for distributed optimization in asynchronous settings. We evaluate three approaches, Principal Component Analysis (PCA), QR Decomposition with Column Pivoting (QRCP), and CUR Decomposition, through extensive experiments on GPT-2 fine-tuning using the WikiText dataset. Empirical results show that PCA-based approximation converges fastest of the three while achieving competitive accuracy, significantly reducing wall-clock training time and maintaining performance comparable to synchronous baselines.
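To make the three approximation families concrete, the following is a minimal sketch of how each one produces a rank-k approximation of a single matrix. It is not the framework described above: the abstract does not specify how the decompositions are applied to model updates, so the synthetic matrix `delta`, the helper names (`pca_approx`, `qrcp_approx`, `cur_approx`, `rel_error`), the norm-based CUR sampling, and the rank `k` are all illustrative assumptions.

```python
import numpy as np
from scipy.linalg import qr


def rel_error(M, approx):
    """Relative Frobenius-norm error of an approximation."""
    return np.linalg.norm(M - approx) / np.linalg.norm(M)


def pca_approx(M, k):
    """Rank-k PCA: center the columns, truncate the SVD, add the mean back."""
    mean = M.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(M - mean, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k] + mean


def qrcp_approx(M, k):
    """Rank-k approximation from QR with column pivoting.

    scipy returns Q, R, piv with M[:, piv] == Q @ R; keeping the first k
    columns of Q and rows of R gives the truncated factorization.
    """
    Q, R, piv = qr(M, mode="economic", pivoting=True)
    approx = np.empty_like(M)
    approx[:, piv] = Q[:, :k] @ R[:k, :]
    return approx


def cur_approx(M, k, rng):
    """CUR with k actual columns and k actual rows of M.

    Assumption: columns and rows are sampled with probability proportional
    to their squared norms; other sampling schemes are equally valid.
    """
    sq = M ** 2
    col_p = sq.sum(axis=0) / sq.sum()
    row_p = sq.sum(axis=1) / sq.sum()
    cols = rng.choice(M.shape[1], size=k, replace=False, p=col_p)
    rows = rng.choice(M.shape[0], size=k, replace=False, p=row_p)
    C, R = M[:, cols], M[rows, :]
    U = np.linalg.pinv(C) @ M @ np.linalg.pinv(R)  # core matrix linking C and R
    return C @ U @ R


# Hypothetical stand-in for a weight update: low-rank signal plus small noise.
rng = np.random.default_rng(0)
delta = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 32))
delta += 0.01 * rng.standard_normal(delta.shape)

k = 8
for name, approx in [
    ("PCA", pca_approx(delta, k)),
    ("QRCP", qrcp_approx(delta, k)),
    ("CUR", cur_approx(delta, k, rng)),
]:
    print(f"{name:5s} relative error: {rel_error(delta, approx):.4f}")
```

The three methods trade off differently: PCA builds its factors from singular vectors, QRCP orders columns by a greedy pivoting rule, and CUR keeps actual rows and columns of the original matrix, which can make the factors easier to interpret. Which of these properties matters in the asynchronous setting is what the experiments are meant to establish.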