LoRA-One
One-Step Full Gradient Could Suffice for Fine-Tuning Large Language Models, Provably and Efficiently
ICML 2025 Oral
Abstract
This paper explores how theory can guide and enhance practical algorithms, using Low-Rank Adaptation (LoRA) in large language models as a case study. We rigorously prove that, under gradient descent, LoRA adapters align with specific singular subspaces of the one-step full fine-tuning gradient. This result suggests that, by properly initializing the adapters using the one-step full gradient, subspace alignment can be achieved immediately, applicable to both linear and nonlinear models. Building on our theory, we propose a theory-driven algorithm, LoRA-One, for which we establish linear convergence (as well as generalization bounds) and show that incorporating preconditioners provably mitigates the effects of ill-conditioning. Moreover, our theory reveals connections between LoRA-One and other gradient-alignment-based methods, helping to clarify misconceptions in the design of such algorithms. LoRA-One achieves significant empirical improvements over LoRA and its variants across benchmarks in natural language understanding, mathematical reasoning, and code generation.
The Core Insight: Subspace Alignment
This is the central theoretical motivation of LoRA-One. We prove that vanilla LoRA aligns with the top singular subspaces of the first-step gradient from full fine-tuning. Under mild conditions, in the linear setting, LoRA trained via gradient descent (GD) yields adapters whose subspaces align with those of the one-step full gradient: the column space of B_t converges to the top-r left singular subspace of the negative one-step gradient -∇L(W_0), and the row space of A_t converges to the corresponding top-r right singular subspace.
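This alignment can be observed numerically in a toy setting. Below is a minimal sketch (not the paper's code): a synthetic linear regression task where vanilla LoRA (B_0 = 0, A_0 Gaussian) is trained by GD, and the angle between Col(B_t) and the top-r left singular subspace of the negative one-step gradient is tracked. The task, step size, and the subspace_distance measure are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 32, 512, 4

# Toy linear task: y = x @ W_star; fine-tune from W0 with a rank-r adapter.
W_star = rng.normal(size=(d, d)) / np.sqrt(d)
W0 = rng.normal(size=(d, d)) / np.sqrt(d)
X = rng.normal(size=(n, d))
Y = X @ W_star

def grad_full(W):
    # Gradient of 0.5/n * ||X W - Y||_F^2 with respect to W.
    return X.T @ (X @ W - Y) / n

# Top-r left singular subspace of the negative one-step full gradient.
U, _, _ = np.linalg.svd(-grad_full(W0))
U_r = U[:, :r]

def subspace_distance(B):
    # Sine of the largest principal angle between Col(B) and Col(U_r).
    Q, _ = np.linalg.qr(B)
    return np.linalg.norm(Q - U_r @ (U_r.T @ Q), 2)

# Vanilla LoRA: W = W0 + B @ A with B0 = 0 and A0 Gaussian, trained by GD.
B = np.zeros((d, r))
A = rng.normal(size=(r, d)) / np.sqrt(d)
eta = 0.1
for t in range(1, 201):
    G = grad_full(W0 + B @ A)  # gradient w.r.t. the merged weight
    B, A = B - eta * G @ A.T, A - eta * B.T @ G
    if t % 50 == 0:
        print(f"step {t}: subspace distance = {subspace_distance(B):.3f}")
# The printed distance shrinks as Col(B_t) drifts toward Col(U_r).
```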

Methodology
The subspace alignment motivates us to initialize LoRA using the best rank-r approximation of the first-step gradient from full fine-tuning. This "Spectral Initialization" is the core of the algorithm, as sketched below.
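Concretely, the initialization runs one forward/backward pass of full fine-tuning (on a sub-sampled batch), takes the SVD of the resulting gradient, and splits its best rank-r approximation between the two adapters. A minimal PyTorch sketch, assuming the one-step gradient has already been computed; the scaling factor alpha is an illustrative assumption rather than the paper's exact scaling:

```python
import torch

def spectral_init(grad_W0: torch.Tensor, r: int, alpha: float = 1.0):
    """Initialize LoRA adapters from the one-step full gradient at W0.

    Returns (A, B) such that B @ A equals alpha times the best rank-r
    approximation of -grad_W0, so fine-tuning starts already aligned
    with the top singular subspaces of the full gradient.
    """
    U, S, Vh = torch.linalg.svd(-grad_W0, full_matrices=False)
    sqrt_S = torch.sqrt(alpha * S[:r])
    B = U[:, :r] * sqrt_S          # (d_out, r), carries the left subspace
    A = sqrt_S[:, None] * Vh[:r]   # (r, d_in), carries the right subspace
    return A, B
```

Splitting the square root of the singular values between A and B keeps the two factors balanced at initialization; training then proceeds exactly as in standard LoRA, with the base weight W0 frozen.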

Key Theoretical Underpinnings
1. Subspace Alignment Dynamics
The theory first establishes that standard LoRA adapters, when trained with gradient descent, gradually align with the top-r singular subspaces of the one-step full fine-tuning gradient. This gradual alignment is crucial for effective adaptation.
2. Immediate Alignment via Spectral Initialization
Building on the first point, LoRA-One's "Spectral Initialization" is proven to achieve this optimal subspace alignment instantaneously at t=0. This gives LoRA-One a significant advantage by starting the fine-tuning process in the most promising directions.
3. Provable Linear Convergence
LoRA-One comes with theoretical guarantees of linear convergence: the optimization error provably contracts by a constant factor at each step, i.e., it decays geometrically with the number of iterations, ensuring efficient and reliable training. This applies both to linear models and, under certain conditions, to nonlinear models such as ReLU MLPs.
4. Generalization Bounds
Beyond convergence, the theory provides generalization bounds for LoRA-One. These bounds help explain why LoRA-One not only trains well but also performs robustly on unseen data, which is critical for real-world applications.
5. Mitigating Ill-Conditioning
The paper theoretically analyzes how ill-conditioning in downstream tasks can slow down LoRA-based methods. It shows that incorporating preconditioners provably mitigates these effects, leading to more stable and faster convergence for LoRA-One; one such preconditioned update is sketched below.
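For intuition, a standard way to precondition such a bilinear factorization is a scaled-GD-style update that rescales each adapter's gradient by the (damped) Gram matrix of the other factor. The following is a hedged sketch of this idea, not necessarily the paper's exact preconditioner; the damping term is an assumed stabilizer.

```python
import torch

def preconditioned_lora_step(A, B, G, lr=1e-2, damping=1e-6):
    """One preconditioned GD step for W = W0 + B @ A.

    G is the gradient of the loss w.r.t. the merged weight W. Rescaling
    dL/dB by (A A^T)^{-1} and dL/dA by (B^T B)^{-1} makes the step
    insensitive to how ill-conditioned the adapters are.
    """
    r = A.shape[0]
    eye = torch.eye(r, dtype=A.dtype, device=A.device)
    # dL/dB = G @ A^T, preconditioned on the right by (A A^T + damping I)^{-1}
    B_new = B - lr * (G @ A.T) @ torch.linalg.inv(A @ A.T + damping * eye)
    # dL/dA = B^T @ G, preconditioned on the left by (B^T B + damping I)^{-1}
    A_new = A - lr * torch.linalg.inv(B.T @ B + damping * eye) @ (B.T @ G)
    return A_new, B_new
```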
6. Unifying Perspectives
The theoretical framework developed for LoRA-One also helps to unify and clarify the connections between LoRA-One and other gradient-alignment-based PEFT methods. This provides a clearer understanding of the landscape of efficient fine-tuning techniques.
Experimental Results
Empirically, LoRA-One consistently outperforms LoRA and other variants across a wide range of tasks and models. The results below summarize the performance data from the paper.
Comparison with LoRA variants on GLUE benchmarks (T5-base): performance of LoRA, LoRA-One, and other PEFT baselines across five natural language understanding tasks (MNLI, SST-2, CoLA, QNLI, and MRPC).
Scalability and Efficiency
The experiments show that LoRA-One's advantages hold up during longer training, and it achieves this superior performance with virtually no extra time or memory cost during the fine-tuning phase.
Scaling with More Data and Epochs
Accuracy on GSM8K when fine-tuning on the full MetaMathQA dataset. LoRA-One consistently leads over multiple epochs.
Time & Memory Efficiency
| Dataset | LoRA: time (peak memory) | LoRA-One: time (peak memory) |
|---|---|---|
| MetaMathQA 100K | 6h 20m (21.6 GB) | 6h 23m (21.7 GB) |
| Code-Feedback 100K | 6h 24m (22.6 GB) | 6h 26m (22.9 GB) |
| Alpaca | 3h 22m (23.4 GB) | 3h 25m (23.4 GB) |
The time cost of spectral initialization is negligible due to sub-sampling, making LoRA-One's runtime and memory footprint nearly identical to standard LoRA's; the sketch below shows how the required one-step gradient can be obtained in a single pass.
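A minimal sketch of collecting that gradient from one forward/backward pass over a small calibration batch; compute_loss and the calibration batch are assumed inputs, not the paper's exact code:

```python
import torch

def one_step_gradients(model, calib_batch, compute_loss):
    """Single forward/backward pass to collect full gradients at W0.

    compute_loss: callable mapping (model, batch) to a scalar loss.
    Returns a dict of parameter name -> gradient; each adapted weight's
    gradient can then be passed to the spectral initialization.
    """
    model.zero_grad()
    compute_loss(model, calib_batch).backward()
    grads = {name: p.grad.detach().clone()
             for name, p in model.named_parameters() if p.grad is not None}
    model.zero_grad()
    return grads
```

Because the batch is a small sub-sample of the training set, this pass adds only a negligible amount of time, consistent with the table above.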