Uncovering mesa-optimization algorithms in Transformers

Arxiv Papers

The paper proposes that the strong performance of Transformers stems from an architectural bias towards mesa-optimization: a learned optimization algorithm that runs within the forward pass. The authors reverse-engineer Transformers trained on sequence prediction and find gradient-based optimizers operating on in-context data, showing that this learned optimization algorithm can be repurposed to solve few-shot tasks. They also propose the mesa-layer, a new self-attention layer that explicitly solves an internal optimization problem and improves performance.
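
As a rough illustration of the core construction, here is a minimal NumPy sketch (a toy illustration, not code from the paper): one step of gradient descent on an in-context linear-regression loss, started from zero weights, produces exactly the same prediction as an unnormalized linear self-attention readout whose keys are the context inputs, whose values are the scaled context targets, and whose query is the test input.

import numpy as np

rng = np.random.default_rng(0)

# Toy in-context regression task: the context holds (x_i, y_i) pairs
# generated by a hidden linear map W_true; the model must predict y for x_q.
d_in, d_out, n_ctx = 4, 2, 32
W_true = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(n_ctx, d_in))   # context inputs  -> attention keys
Y = X @ W_true.T                     # context targets -> attention values
x_q = rng.normal(size=d_in)          # test input      -> attention query
lr = 1.0 / n_ctx

# (a) One gradient-descent step on L(W) = 0.5 * sum_i ||W x_i - y_i||^2,
#     starting from W = 0; the gradient there is -sum_i y_i x_i^T.
W_gd = lr * Y.T @ X
pred_gd = W_gd @ x_q

# (b) The identical prediction written as unnormalized linear self-attention:
#     output = sum_i value_i * <key_i, query>.
pred_attn = (lr * Y).T @ (X @ x_q)

assert np.allclose(pred_gd, pred_attn)
print(pred_gd, pred_attn)

The two expressions agree by associativity: (lr * Y.T @ X) @ x_q equals lr * Y.T @ (X @ x_q). This is the sense in which a linear self-attention layer can implement one step of gradient descent (chapter at 03:35).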

00:00 Section: 1 Introduction
03:35 Section: Linear self-attention can implement one step of gradient descent.
07:42 Section: Multi-layer mesa-optimizers (see the sketch after this list).
13:38 Section: 5.1 Prediction of linear dynamics by in-context learning
17:31 Section: Multiple self-attention layers.
22:22 Section: 5.2 Simple autoregressive models become few-shot learners
25:32 Section: A toy model for in-context learning.
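
The multi-layer chapters (07:42 and 17:31) extend the one-step construction: stacking such layers corresponds to taking several gradient steps. Below is a minimal NumPy sketch of that correspondence, under the same toy linear-regression assumptions as above (an illustration, not the authors' code): each layer attends with the context inputs as keys and the current prediction residuals as values, and the stack reproduces multi-step gradient descent exactly.

import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n_ctx, n_layers = 4, 2, 32, 5
W_true = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(n_ctx, d_in))   # context inputs
Y = X @ W_true.T                     # context targets
x_q = rng.normal(size=d_in)          # test input
lr = 0.5 / n_ctx

# Reference: n_layers explicit gradient-descent steps on
# L(W) = 0.5 * sum_i ||W x_i - y_i||^2, starting from W = 0.
W = np.zeros((d_out, d_in))
for _ in range(n_layers):
    W += lr * (Y - X @ W.T).T @ X
pred_gd = W @ x_q

# The same computation, layer by layer, as linear attention: every layer
# attends with keys = x_i and values = the current residuals y_i - y_hat_i,
# then adds lr times the attention output to each token's running prediction.
preds_ctx = np.zeros((n_ctx, d_out))  # running predictions for context tokens
pred_q = np.zeros(d_out)              # running prediction for the test token
for _ in range(n_layers):
    resid = Y - preds_ctx             # this layer's values
    pred_q = pred_q + lr * (X @ x_q) @ resid
    preds_ctx = preds_ctx + lr * (X @ X.T) @ resid

assert np.allclose(pred_gd, pred_q)
print(pred_gd)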

https://arxiv.org/abs/2309.05858

YouTube: https://www.youtube.com/@ArxivPapers

PODCASTS:
Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016
Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers