In this work, we consider a novel setting where the global task distribution can be partitioned into a union of conditional task distributions. We then examine the use of task-specific prompts and prediction heads for learning the prior information associated with the conditional task distribution using a one-layer attention model. Our results on the loss landscape show that task-specific prompts facilitate a covariance-mean decoupling, where prompt-tuning explains the conditional mean of the distribution whereas the variance is learned/explained through in-context learning. Incorporating a task-specific head further aids this process by entirely decoupling the estimation of the mean and variance components. This covariance-mean perspective similarly explains how jointly training the prompt and attention weights can provably help over fine-tuning after pretraining.
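For concreteness, a schematic way to picture this decoupling under a linear task model (the symbols below are illustrative and are not the paper's notation) is:
\[
\beta \;=\; \mu_k + \delta, \qquad \delta \sim \mathcal{N}(0, \Sigma_k),
\]
where the prompt tuned for conditional distribution $k$ accounts for the conditional mean $\mu_k$, while the in-context examples are left to estimate only the zero-mean component $\delta$ governed by the covariance $\Sigma_k$; a task-specific head removes the remaining interaction so that the mean and covariance components are estimated separately.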
2024
NeurIPS
Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond
Yingcong Li, Ankit Singh Rawat, and Samet Oymak
Advances in Neural Information Processing Systems, 2024
In this work, we develop a stronger characterization of the optimization landscape of ICL through contributions on architectures, low-rank parameterization, and correlated designs: (1) We study the landscape of 1-layer linear attention/H3 models and prove that both implement 1-step preconditioned gradient descent (PGD). Additionally, thanks to its native convolution filters, H3 can implement sample weighting and outperform linear attention. (2) We provide new risk bounds for RAG and task-feature alignment which reveal how ICL sample complexity benefits from distributional alignment. (3) We derive the optimal risk for low-rank parameterized attention weights in terms of the covariance spectrum. Through this, we also shed light on how LoRA can adapt to a new distribution by capturing the shift between task covariances.
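As a sanity check of the PGD correspondence, here is a minimal numpy sketch (our own construction, not code from the paper) in which a single linear-attention head with block-structured weights reproduces one step of preconditioned gradient descent on an in-context least-squares problem:

    # Verify: linear attention prediction == one preconditioned GD step from w = 0.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 5, 20                          # feature dimension, in-context samples
    X = rng.normal(size=(n, d))           # in-context inputs
    y = X @ rng.normal(size=d)            # in-context labels
    x_q = rng.normal(size=d)              # query input
    A = rng.normal(size=(d, d))           # arbitrary preconditioner (assumed)

    # One PGD step from w = 0 on 0.5 * sum_i (y_i - w^T x_i)^2, step size absorbed into A.
    pred_pgd = x_q @ (A @ X.T @ y)

    # Linear attention: tokens z_i = [x_i; y_i], query token z_q = [x_q; 0].
    Z = np.hstack([X, y[:, None]])        # (n, d+1)
    z_q = np.concatenate([x_q, [0.0]])
    W_kq = np.zeros((d + 1, d + 1)); W_kq[:d, :d] = A   # key-query product holds A
    W_v = np.zeros((d + 1, d + 1)); W_v[d, d] = 1.0     # value picks the label coordinate

    scores = Z @ W_kq.T @ z_q             # linear attention scores (no softmax)
    out_q = (W_v @ Z.T) @ scores          # attention output for the query token
    pred_attn = out_q[d]                  # read prediction from the label coordinate

    print(np.allclose(pred_pgd, pred_attn))   # True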
ICML
From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers
M Emrullah Ildiz, Yixiao Huang, Yingcong Li, and 2 more authors
International Conference on Machine Learning, 2024
In this work, we study learning a 1-layer self-attention model from a set of prompts and associated output data sampled from the model. We first establish a precise mapping between the self-attention mechanism and Markov models. Building on this formalism, we develop identifiability/coverage conditions for the prompt distribution and establish sample complexity guarantees under i.i.d. samples. Finally, we study the problem of learning from a single output trajectory generated from an initial prompt. This provides a mathematical explanation for the tendency of modern LLMs to generate repetitive text.
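To make the mapping concrete, the following numpy sketch (an illustrative parameterization, not the paper's exact model) rolls out a 1-layer self-attention generative model whose next-token distribution is determined by the current prompt, which is the Markov-model view:

    # Schematic 1-layer self-attention generator: the last token attends over the
    # prompt and the next token is sampled from a softmax over the vocabulary.
    import numpy as np

    rng = np.random.default_rng(1)
    K, d = 6, 8                               # vocabulary size, embedding dimension
    E = rng.normal(size=(K, d))               # token embeddings
    W = rng.normal(size=(d, d))               # combined key-query weights

    def next_token_probs(prompt):
        Z = E[prompt]                         # (t, d) embedded prompt
        attn = np.exp(Z @ W @ Z[-1])          # last token queries the prompt
        attn /= attn.sum()                    # softmax attention over prompt tokens
        out = attn @ Z                        # attention output for the last token
        logits = E @ out                      # score every vocabulary token
        p = np.exp(logits - logits.max())
        return p / p.sum()

    # Rolling out a single output trajectory from an initial prompt; the induced
    # transition kernel depends on the prompt only through the attention weights.
    prompt = [0, 3, 2]
    for _ in range(5):
        prompt.append(rng.choice(K, p=next_token_probs(prompt)))
    print(prompt)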
ICML workshop
Can Mamba In-Context Learn Task Mixtures?
Yingcong Li, Xupeng Wei, Haonan Zhao, and 1 more author
In ICML 2024 Workshop on In-Context Learning, 2024
In this work, we explore Mamba's performance on mixed ICL tasks, varying the degree of mixture from low to high and the supervision from labeled to unlabeled. We show that Mamba is capable of learning ICL task mixtures, matching the performance of single-task ICL and of Transformer baselines. Moreover, Mamba converges faster and exhibits more stable performance than Transformers, which allows it to handle longer context lengths and more complicated prompt structures. We also observe different learning dynamics across the different ICL tasks.
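A minimal numpy sketch of what a task-mixture ICL prompt can look like is below (the task families are hypothetical choices for illustration, not the exact mixtures studied in the workshop paper):

    # Each prompt carries (x, y) pairs from one of several task families; the
    # degree of mixture grows with the number of families in the pool.
    import numpy as np

    rng = np.random.default_rng(2)
    d, n = 4, 16                                  # feature dimension, examples per prompt

    def dense_linear_task(X):
        return X @ rng.normal(size=d)             # y = <w, x> with a dense w

    def sparse_linear_task(X):
        w = rng.normal(size=d)
        w[2:] = 0.0                               # y = <w, x> with a sparse w
        return X @ w

    def sample_prompt(task_pool):
        X = rng.normal(size=(n, d))
        task = task_pool[rng.integers(len(task_pool))]
        return np.hstack([X, task(X)[:, None]])   # (n, d+1) prompt fed to the model

    single_task = [dense_linear_task]                   # low degree: one ICL task
    mixture = [dense_linear_task, sparse_linear_task]   # higher degree: a task mixture
    print(sample_prompt(mixture).shape)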
AISTATS
Mechanics of next token prediction with self-attention
Yingcong Li*, Yixiao Huang*, Muhammed E Ildiz, and 2 more authors
In International Conference on Artificial Intelligence and Statistics, 2024
In this work, we ask: What does a single self-attention layer learn from next-token prediction? We show that training self-attention with GD learns an automaton which generates the next token in two distinct steps: (1) Hard retrieval: Given an input sequence, self-attention precisely selects the high-priority input tokens associated with the last input token. (2) Soft composition: It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Our theory relies on decomposing the model weights into a directional component and a finite component that correspond to the hard retrieval and soft composition steps, respectively.
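The following toy numpy sketch (our illustration, not the paper's construction) shows the two steps: as the weights grow along a fixed direction, softmax attention concentrates on the highest-priority tokens, and the output is their convex combination:

    # Scaling the weights along a fixed direction drives softmax toward a hard
    # selection of the top-scoring tokens; the output stays a convex combination.
    import numpy as np

    rng = np.random.default_rng(3)
    T, d = 6, 4
    Z = rng.normal(size=(T, d))              # input sequence; the last row is the query
    W_dir = rng.normal(size=(d, d))          # directional component of the weights

    scores = Z @ W_dir @ Z[-1]               # priority scores w.r.t. the last token
    for R in [1.0, 10.0, 100.0]:             # grow the weights along W_dir
        attn = np.exp(R * scores - (R * scores).max())
        attn /= attn.sum()
        print(R, np.round(attn, 3))          # mass concentrates on the argmax tokens

    out = attn @ Z                           # soft composition of the retrieved tokens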
NeurIPS
Dissecting chain-of-thought: Compositionality through in-context filtering and learning
Yingcong Li, Kartik Sreenivasan, Angeliki Giannou, and 2 more authors
Advances in Neural Information Processing Systems, 2024
Our study finds that the success of CoT can be attributed to breaking down in-context learning of a compositional function into two distinct phases: (1) focusing on and filtering the data related to each step of the composition, and (2) in-context learning the single-step composition function. Through both experimental and theoretical evidence, we demonstrate how CoT significantly reduces the sample complexity of ICL and facilitates the learning of complex functions that non-CoT methods struggle with.
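A schematic numpy sketch of the contrast (a hypothetical two-step linear composition, not the paper's exact task) is:

    # The target is a composition h(x) = f2(f1(x)); a CoT prompt exposes the
    # intermediate output f1(x), so each phase only needs a single-step function.
    import numpy as np

    rng = np.random.default_rng(4)
    d, n = 3, 8
    A1, A2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # two-step composition

    X = rng.normal(size=(n, d))
    Z = X @ A1.T                 # intermediate step f1(x) = A1 x
    Y = Z @ A2.T                 # final output    f2(z) = A2 z

    cot_prompt   = [(x, z, y) for x, z, y in zip(X, Z, Y)]  # (input, step, output)
    plain_prompt = [(x, y) for x, y in zip(X, Y)]           # (input, output) only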
2023
in submission
Transformers as support vector machines
Davoud Ataee Tarzanagh*, Yingcong Li*, Christos Thrampoulidis, and 1 more author
In this work, we establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens using linear constraints on the outer-products of token pairs. Our findings are applicable to arbitrary datasets and their validity is verified via experiments.
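In simplified notation (a schematic of the separating program, not its exact statement in the paper), the equivalence has the following flavor for a single query token $z$ whose optimal token has index $\mathrm{opt}$:
\[
\min_{W}\ \|W\|_F \quad \text{s.t.} \quad (x_{\mathrm{opt}} - x_t)^\top W z \;\ge\; 1 \ \ \text{for every non-optimal token } t,
\]
so each constraint is linear in $W$ through the outer-product $(x_{\mathrm{opt}} - x_t)\, z^\top$, and the attention weights act as a hard-margin separator between the optimal token and the rest.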
NeurIPS spotlight
Max-margin token selection in attention mechanism
Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, and 1 more author
Advances in Neural Information Processing Systems, 2023
In this work, we explore the seminal softmax-attention model and prove that running gradient descent converges in direction to a max-margin solution that separates locally-optimal tokens from non-optimal ones. This clearly formalizes attention as an optimal token selection mechanism.
ICML
Transformers as algorithms: Generalization and stability in in-context learning
Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and 1 more author
In International Conference on Machine Learning, 2023
In this work, we formalize ICL as an algorithm learning problem where a transformer model implicitly constructs a hypothesis function at inference time. We first obtain generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d. (input, label) pairs or (2) a trajectory arising from a dynamical system. We characterize when the transformer/attention architecture provably obeys the stability condition and also provide empirical verification. For generalization on unseen tasks, we identify an inductive bias phenomenon in which the transfer learning risk is governed by the task complexity and the number of MTL tasks in a highly predictable manner.
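Schematically (in our notation, not necessarily the paper's), the formalization treats the transformer as an algorithm applied to the prompt:
\[
\hat{y}_{m+1} \;=\; \mathrm{TF}_{\theta}\big(x_1, y_1, \dots, x_m, y_m, x_{m+1}\big) \;=\; \Big(\mathcal{A}_{\theta}\big(\{(x_i, y_i)\}_{i=1}^{m}\big)\Big)(x_{m+1}),
\]
i.e., $\mathrm{TF}_\theta$ implicitly maps the in-context examples to a hypothesis via the algorithm $\mathcal{A}_\theta$ and evaluates it at the query, so generalization can be analyzed at the level of the learned algorithm rather than individual predictions.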
ICASSP
On the fairness of multitask representation learning
Yingcong Li, and Samet Oymak
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023
This work explores the pathways proposal through the lens of statistical learning: We first develop novel generalization bounds for ERM problems learning multiple tasks over multiple paths. In conjunction, we formalize the benefits of the resulting multipath representation when adapting to new downstream tasks. Our bounds are expressed in terms of Gaussian complexity, lead to tangible guarantees for the class of linear representations, and provide novel insights into the quality and benefits of a multipath representation.
AAAI
Stochastic contextual bandits with long horizon rewards
Yuzhen Qin, Yingcong Li, Fabio Pasqualetti, and 2 more authors
In Proceedings of the AAAI Conference on Artificial Intelligence, 2023
In this work, we study the problem of continual learning (CL) and focus on zero-forgetting methods. We first provide experiments demonstrating that CL can significantly boost sample efficiency when learning new tasks, and establish theoretical guarantees by providing sample complexity and generalization error bounds for new tasks. Our analysis and experiments also highlight the importance of the task order. Specifically, we show that CL benefits if the initial tasks have large sample sizes and high "representation diversity". Finally, we propose an inference-efficient variation of PackNet called Efficient Sparse PackNet (ESPN) which employs joint channel & weight pruning. ESPN embeds tasks in channel-sparse subnets requiring up to 80% fewer FLOPs to compute while approximately retaining accuracy, and is very competitive with a variety of baselines.
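A schematic numpy sketch of joint channel & weight pruning (not the actual ESPN implementation; the keep ratios are placeholders) is:

    # Score channels by aggregate weight magnitude, keep the strongest channels,
    # then keep only the largest weights inside the surviving channels, yielding
    # a channel-sparse, weight-sparse subnet mask for a task.
    import numpy as np

    rng = np.random.default_rng(5)
    W = rng.normal(size=(32, 64))             # one layer: 32 output channels x 64 inputs

    channel_keep = 0.5                        # keep 50% of channels (assumed ratio)
    weight_keep = 0.4                         # keep 40% of weights within kept channels

    ch_scores = np.abs(W).sum(axis=1)         # channel importance = L1 norm of its row
    kept = ch_scores >= np.quantile(ch_scores, 1 - channel_keep)

    mask = np.zeros_like(W, dtype=bool)
    thr = np.quantile(np.abs(W[kept]), 1 - weight_keep)
    mask[kept] = np.abs(W[kept]) >= thr       # magnitude pruning inside kept channels

    print(mask.mean())                        # overall density of the task subnet
    W_task = W * mask                         # the task is embedded in this subnet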
2021
AAAI
Provable benefits of overparameterization in model compression: From double descent to pruning neural networks
Xiangyu Chang*, Yingcong Li*, Samet Oymak, and 1 more author
In Proceedings of the AAAI Conference on Artificial Intelligence, 2021
Recent empirical evidence indicates that the practice of overparameterization not only benefits training large models, but also assists building lightweight models. This paper sheds light on these empirical findings by theoretically characterizing the high-dimensional asymptotics of model pruning in the overparameterized regime. We analytically identify regimes in which, even if the location of the most informative features is known, we are better off fitting a large model and then pruning rather than simply training with the known informative features. This leads to a new double descent in the training of sparse models: growing the original model, while preserving the target sparsity, improves the test accuracy as one moves beyond the overparameterization threshold. Our analysis further reveals the benefit of retraining by relating it to feature correlations.
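A minimal numpy sketch of the train-large-then-prune-then-retrain pipeline for linear regression (illustrative only, not the paper's experiments) is:

    # Fit the min-norm overparameterized solution, magnitude-prune to the target
    # sparsity, then retrain on the surviving features and compare test risks.
    import numpy as np

    rng = np.random.default_rng(6)
    n, p, s = 40, 100, 5                       # samples, total features, target sparsity
    beta = np.zeros(p); beta[:s] = rng.normal(size=s)
    X = rng.normal(size=(n, p))
    y = X @ beta + 0.1 * rng.normal(size=n)

    w_big = np.linalg.pinv(X) @ y              # min-norm fit of the large model
    keep = np.argsort(np.abs(w_big))[-s:]      # magnitude pruning to sparsity s

    w_retrain = np.zeros(p)                    # retrain using only the kept features
    w_retrain[keep] = np.linalg.lstsq(X[:, keep], y, rcond=None)[0]

    X_test = rng.normal(size=(1000, p))
    for w in (w_big, w_retrain):
        print(np.mean((X_test @ w - X_test @ beta) ** 2))   # test risk of each model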