LoRA-One
One-Step Full Gradient Could Suffice for Fine-Tuning Large Language Models, Provably and Efficiently
ICML 2025 Oral
Abstract
This paper explores how theory can guide and enhance practical algorithms, using Low-Rank Adaptation (LoRA) in large language models as a case study. We rigorously prove that, under gradient descent, LoRA adapters align with specific singular subspaces of the one-step full fine-tuning gradient. This result suggests that, by properly initializing the adapters using the one-step full gradient, subspace alignment can be achieved immediately, applicable to both linear and nonlinear models. Building on our theory, we propose a theory-driven algorithm, LoRA-One, for which we establish linear convergence (as well as generalization bounds) and show that incorporating preconditioners provably mitigates the effects of ill-conditioning. Moreover, our theory reveals connections between LoRA-One and other gradient-alignment-based methods, helping to clarify misconceptions in the design of such algorithms. LoRA-One achieves significant empirical improvements over LoRA and its variants across benchmarks in natural language understanding, mathematical reasoning, and code generation.
The Core Insight: Subspace Alignment
This is the central theoretical motivation of LoRA-One. We prove that vanilla LoRA aligns with the top singular subspaces of the first-step gradient from full fine-tuning. Under mild conditions, in the linear setting, LoRA trained via gradient descent (GD) yields adapters whose subspaces align with those of the one-step full gradient: the column space of B_t converges to the top-r left singular subspace of the negative one-step gradient -∇L(W_0), and the row space of A_t converges to the corresponding top-r right singular subspace.
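This alignment can be observed numerically in a toy setting. Below is a minimal sketch (not the paper's code): a synthetic linear regression task where vanilla LoRA (B_0 = 0, A_0 Gaussian) is trained by GD, and the angle between Col(B_t) and the top-r left singular subspace of the negative one-step gradient is tracked. The task, step size, and the subspace_distance measure are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 32, 512, 4

# Toy linear task: y = x @ W_star; fine-tune from W0 with a rank-r adapter.
W_star = rng.normal(size=(d, d)) / np.sqrt(d)
W0 = rng.normal(size=(d, d)) / np.sqrt(d)
X = rng.normal(size=(n, d))
Y = X @ W_star

def grad_full(W):
    # Gradient of 0.5/n * ||X W - Y||_F^2 with respect to W.
    return X.T @ (X @ W - Y) / n

# Top-r left singular subspace of the negative one-step full gradient.
U, _, _ = np.linalg.svd(-grad_full(W0))
U_r = U[:, :r]

def subspace_distance(B):
    # Sine of the largest principal angle between Col(B) and Col(U_r).
    Q, _ = np.linalg.qr(B)
    return np.linalg.norm(Q - U_r @ (U_r.T @ Q), 2)

# Vanilla LoRA: W = W0 + B @ A with B0 = 0 and A0 Gaussian, trained by GD.
B = np.zeros((d, r))
A = rng.normal(size=(r, d)) / np.sqrt(d)
eta = 0.1
for t in range(1, 201):
    G = grad_full(W0 + B @ A)  # gradient w.r.t. the merged weight
    B, A = B - eta * G @ A.T, A - eta * B.T @ G
    if t % 50 == 0:
        print(f"step {t}: subspace distance = {subspace_distance(B):.3f}")
# The printed distance shrinks as Col(B_t) drifts toward Col(U_r).
```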

Methodology
The subspace alignment motivates us to initialize LoRA using the best rank-r approximation of the first-step gradient from full fine-tuning. This "Spectral Initialization" is the core of the algorithm, as sketched below.
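Concretely, the initialization runs one forward/backward pass of full fine-tuning (on a sub-sampled batch), takes the SVD of the resulting gradient, and splits its best rank-r approximation between the two adapters. A minimal PyTorch sketch, assuming the one-step gradient has already been computed; the scaling factor alpha is an illustrative assumption rather than the paper's exact scaling:

```python
import torch

def spectral_init(grad_W0: torch.Tensor, r: int, alpha: float = 1.0):
    """Initialize LoRA adapters from the one-step full gradient at W0.

    Returns (A, B) such that B @ A equals alpha times the best rank-r
    approximation of -grad_W0, so fine-tuning starts already aligned
    with the top singular subspaces of the full gradient.
    """
    U, S, Vh = torch.linalg.svd(-grad_W0, full_matrices=False)
    sqrt_S = torch.sqrt(alpha * S[:r])
    B = U[:, :r] * sqrt_S          # (d_out, r), carries the left subspace
    A = sqrt_S[:, None] * Vh[:r]   # (r, d_in), carries the right subspace
    return A, B
```

Splitting the square root of the singular values between A and B keeps the two factors balanced at initialization; training then proceeds exactly as in standard LoRA, with the base weight W0 frozen.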

Key Theoretical Underpinnings
1. Subspace Alignment Dynamics
The theory first establishes that standard LoRA adapters, when trained with gradient descent, gradually align with the top-r singular subspaces of the one-step full fine-tuning gradient. This gradual alignment is crucial for effective adaptation.
2. Immediate Alignment via Spectral Initialization
Building on the first point, LoRA-One's "Spectral Initialization" is proven to achieve this optimal subspace alignment instantaneously at t=0. This gives LoRA-One a significant advantage by starting the fine-tuning process in the most promising directions.
3. Provable Linear Convergence
LoRA-One comes with theoretical guarantees of linear convergence: the optimization error provably contracts by a constant factor at each step, i.e., it decays geometrically with the number of iterations, ensuring efficient and reliable training. This applies both to linear models and, under certain conditions, to nonlinear models such as ReLU MLPs.
4. Generalization Bounds
Beyond convergence, the theory provides generalization bounds for LoRA-One. These bounds help explain why LoRA-One not only trains well but also performs robustly on unseen data, which is critical for real-world applications.
5. Mitigating Ill-Conditioning
The paper theoretically analyzes how ill-conditioning in downstream tasks can slow down LoRA-based methods. It shows that incorporating preconditioners provably mitigates these effects, leading to more stable and faster convergence for LoRA-One; one such preconditioned update is sketched below.
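For intuition, a standard way to precondition such a bilinear factorization is a scaled-GD-style update that rescales each adapter's gradient by the (damped) Gram matrix of the other factor. The following is a hedged sketch of this idea, not necessarily the paper's exact preconditioner; the damping term is an assumed stabilizer.

```python
import torch

def preconditioned_lora_step(A, B, G, lr=1e-2, damping=1e-6):
    """One preconditioned GD step for W = W0 + B @ A.

    G is the gradient of the loss w.r.t. the merged weight W. Rescaling
    dL/dB by (A A^T)^{-1} and dL/dA by (B^T B)^{-1} makes the step
    insensitive to how ill-conditioned the adapters are.
    """
    r = A.shape[0]
    eye = torch.eye(r, dtype=A.dtype, device=A.device)
    # dL/dB = G @ A^T, preconditioned on the right by (A A^T + damping I)^{-1}
    B_new = B - lr * (G @ A.T) @ torch.linalg.inv(A @ A.T + damping * eye)
    # dL/dA = B^T @ G, preconditioned on the left by (B^T B + damping I)^{-1}
    A_new = A - lr * torch.linalg.inv(B.T @ B + damping * eye) @ (B.T @ G)
    return A_new, B_new
```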
6. Unifying Perspectives
The theoretical framework developed for LoRA-One also helps to unify and clarify the connections between LoRA-One and other gradient-alignment-based PEFT methods. This provides a clearer understanding of the landscape of efficient fine-tuning techniques.
Experimental Results
Empirically, LoRA-One consistently outperforms LoRA and other variants across a wide range of tasks and models. The results below summarize the performance data from the paper.
Comparison with LoRA variants on GLUE benchmarks (T5-base): performance of LoRA, LoRA-One, and other PEFT baselines across five natural language understanding tasks (MNLI, SST-2, CoLA, QNLI, and MRPC).
Scalability and Efficiency
The experiments show that LoRA-One's advantages hold up during longer training, and it achieves this superior performance with virtually no extra time or memory cost during the fine-tuning phase.
Scaling with More Data and Epochs
Accuracy on GSM8K when fine-tuning on the full MetaMathQA dataset. LoRA-One consistently leads over multiple epochs.
Time & Memory Efficiency
| Dataset | LoRA: time (peak memory) | LoRA-One: time (peak memory) |
|---|---|---|
| MetaMathQA 100K | 6h 20m (21.6 GB) | 6h 23m (21.7 GB) |
| Code-Feedback 100K | 6h 24m (22.6 GB) | 6h 26m (22.9 GB) |
| Alpaca | 3h 22m (23.4 GB) | 3h 25m (23.4 GB) |
The time cost of spectral initialization is negligible due to sub-sampling, making LoRA-One's runtime and memory footprint nearly identical to standard LoRA's; the sketch below shows how the required one-step gradient can be obtained in a single pass.
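A minimal sketch of collecting that gradient from one forward/backward pass over a small calibration batch; compute_loss and the calibration batch are assumed inputs, not the paper's exact code:

```python
import torch

def one_step_gradients(model, calib_batch, compute_loss):
    """Single forward/backward pass to collect full gradients at W0.

    compute_loss: callable mapping (model, batch) to a scalar loss.
    Returns a dict of parameter name -> gradient; each adapted weight's
    gradient can then be passed to the spectral initialization.
    """
    model.zero_grad()
    compute_loss(model, calib_batch).backward()
    grads = {name: p.grad.detach().clone()
             for name, p in model.named_parameters() if p.grad is not None}
    model.zero_grad()
    return grads
```

Because the batch is a small sub-sample of the training set, this pass adds only a negligible amount of time, consistent with the table above.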