The post A Dynamical Model of Neural Scaling Laws appeared first on Kempner Institute.

$$\mathcal L = C_t \ t^{-\alpha_t} + C_N N^{-\alpha_N} + \mathcal L_{\infty}.$$

The constants $C_t, C_N$, exponents $\alpha_t, \alpha_N$, and asymptote $\mathcal L_{\infty}$ depend on the learning task and neural network architecture. Below, we show two simple examples of such phenomena in a vision task (CIFAR-5M with convolutional networks) and in a language modeling task (Wikitext-103 language modeling with a transformer) where we change the parameter count of a model by increasing the width $N$.
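This additive form is easy to explore numerically. Below is a minimal sketch of the ansatz above, with purely illustrative values for the constants and exponents (none are fit to real data):

```python
def scaling_law(t, N, C_t=1.0, C_N=1.0, alpha_t=0.5, alpha_N=1.0, L_inf=0.1):
    """Joint power-law ansatz for the test loss as a function of training
    steps t and model size N. All constant values here are illustrative."""
    return C_t * t ** (-alpha_t) + C_N * N ** (-alpha_N) + L_inf

# The loss improves monotonically in both t and N and saturates at L_inf.
losses = [scaling_law(t, N=256) for t in (10, 100, 1000)]
```

At fixed model size the time term eventually stops mattering and the $N$-dependent term dominates; as both $t$ and $N$ grow, the loss approaches the asymptote $\mathcal L_\infty$.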

We see that performance improves regularly with both model size and training steps. While our previous blog post discussed infinite parameter limits of neural networks, understanding scaling laws requires characterizing the finite model-size effects that limit performance after enough training. In this post, we summarize our recent paper^{[7]}, which will appear at ICML 2024 and analyzes a simple theoretical model that can reproduce these types of scaling behaviors. A very similar model was also studied in the recent work of Paquette et al.^{[8]}

In addition to the simple power-law neural scaling laws in training time and model size that are observed in many scaling law studies, we would also like to explain some other observed effects in network training. Key empirical observations include:

- **Asymmetric exponents**: the model-size and training-time exponents $\alpha_N, \alpha_t$ are often different^{[1-6]}.
- **Data reuse effects**: reusing data leads to a gradual buildup of a gap between train loss and test loss, compared to the online training regime where data is never repeated and train and test losses are identical^{[9]}^{[10]}.
- **Change in rate of convergence to the large-width limit**: very early in training, networks converge at a rate $1/\text{width}$ to their infinite-width limit^{[11]}^{[12]}. Late in training, the network has a loss that scales as $\text{width}^{-c}$, where $c$ is task and architecture dependent^{[11]}^{[12]}^{[13]}.
- **Compute suboptimality of ensembling**: on large datasets or in online training, ensembling over many randomly initialized models often fails to match the performance of training a single larger model^{[12]}.

Below we describe a very simple model of network training dynamics that can reproduce these effects.

We seek the simplest possible model that captures all of these observed phenomena. In the previous blog post, we introduced the kernel limit of neural networks which arises from randomly initializing large width networks in a certain parameterization. This model has serious deficiencies as a model of neural network dynamics since the internal representations in the network are static throughout learning, however it is much more analytically tractable since it is essentially a linear model. Despite the deficiencies of this kernel (linear model) regime of neural network training, we will show that all of the above neural scaling law effects are already observable in the learning dynamics of a linear model. We therefore aim to characterize the test and train loss dynamics of this kind of model.

We consider a network with $N$ trainable parameters, $P$ data points, and $t$ timesteps of training. Our goal is to characterize the expected or typical test error as a function of these quantities over random draws of datasets and initial network features.

Neural networks in certain limits can operate as linear models. In this regime, the output prediction $f$ of the neural network is a linear combination of its $N$ features $\left\{\tilde{\psi}_k(x) \right\}_{k=1}^N$, which arise from a rank-$N$ kernel associated with an $N$ parameter model. The target function, $y$, on the other hand, is a linear combination of a complete set of features $\left\{ \psi_k(x) \right\}_{k=1}^\infty$, corresponding to a complete set of square integrable functions. These expansions take the form $$f(x) = \sum_{k} \tilde{\psi}_k(x) w_k \ , \ y(x) = \sum_k \psi_k(x) w^\star_k.$$ We take the basis of features $\psi_k(x)$ to be the infinite width kernel eigenfunctions.^{[8]}^{[9]} The finite model's $N$ features $\{ \tilde{\psi}_k \}_{k=1}^N$ can be expanded in the basis of the original features with coefficients $A_{k\ell}$, $$\tilde{\psi}_k(x) = \sum_{\ell = 1}^\infty A_{k\ell} \ \psi_\ell(x) .$$

We will model the matrix $A_{k\ell}$ as random, which reflects the fact that the empirical kernel in a finite parameter model depends on the random initialization of the network weights. The statics of this model were analyzed in prior works ^{[14]}^{[15]}, but in this work we focus on the dynamics of training.

To train the model parameters $w_k$ with gradient based training, we randomly sample a training set with $P$ data points $\{ x_\mu \}_{\mu=1}^P$ drawn from the population distribution and train the model with gradient descent/gradient flow on the training loss $\hat{\mathcal{L}} = \frac{1}{P} \sum_{\mu=1}^P [f(x_\mu) - y(x_\mu)]^2$. For gradient flow, we have

$$\frac{d}{dt} \mathbf w(t) = – \eta \nabla \hat{\mathcal L}(\mathbf w(t)) . $$

For simplicity, in this post we focus on gradient flow, but discrete time algorithms such as gradient descent, momentum, and one-pass SGD can also be handled in our framework; see our paper^{[7]} for details.

Our goal is to track the test error $\mathcal L =\mathbb{E}_{x} [f(x) – y(x)]^2$ over training time. Since $f(x,t)$ depends on the random dataset and random projection, we have to develop a method to average over these sources of disorder.
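Before turning to the theory, the model is simple enough to simulate directly. The sketch below (all sizes, exponents, and the $1/\sqrt{M}$ scaling of the projection are illustrative choices of ours, not taken from the paper) draws power-law features, projects them through a random matrix $A$, and tracks the test loss under gradient descent on a fixed dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P = 512, 64, 256     # truncated base features, model features, data points
a, b = 2.0, 1.0            # illustrative spectral exponents

k = np.arange(1, M + 1)
lam = k ** (-b)                  # eigenvalues lambda_k ~ k^{-b}
wstar = k ** (b - a / 2)         # target chosen so (lambda_k w*_k)^2 = k^{-a}

def sample(n):
    """Draw n points in the eigenbasis: psi_k ~ N(0, lambda_k), y = psi @ wstar."""
    psi = rng.standard_normal((n, M)) * np.sqrt(lam)
    return psi, psi @ wstar

A = rng.standard_normal((N, M)) / np.sqrt(M)   # random projection (illustrative scaling)
Xtr, ytr = sample(P)
Xte, yte = sample(4000)
Ftr, Fte = Xtr @ A.T, Xte @ A.T                # model features tilde{psi} = A psi

w, eta, losses = np.zeros(N), 1.0, []
for step in range(2000):
    resid = Ftr @ w - ytr
    w -= eta * (2.0 / P) * Ftr.T @ resid       # gradient descent on the train loss
    losses.append(np.mean((Fte @ w - yte) ** 2))
```

Re-running with larger $N$ or $P$ pushes the late-time plateau of the test loss down, which is exactly the bottleneck behavior discussed in the following sections.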

We develop a theory to track the test and train loss dynamics in this random feature model for $N$, $P$ large. To analytically calculate these losses, we utilize ideas from statistical physics, specifically dynamical mean field theory (DMFT). This method tracks all relevant summary statistics of the network in terms of correlation and response functions.

Below, we plot an example of our theoretical predictions of test loss (dashed black lines) against experimental training curves (solid) for feature maps of varying dimension $N$ with large dataset size $P=1000$. Standard deviations over random realizations of the dataset and projection matrix $A$ are plotted as bands of shaded color. We see that the theory (dashed black lines) accurately captures the deviation of finite models from the $N,P \to \infty$ limiting dynamics (blue). Further, increasing training time $t$ and increasing model size $N$ leads to consistent reductions in test loss.

However, if the dataset size is small, the returns to increasing model size eventually diminish as the test loss is bottlenecked by the amount of available data. Below we plot varying model sizes $N$ as we train on a dataset of size $P=128$.

From the last section, we saw that the performance of the model can be bottlenecked by one of three computational/statistical resources: training time $t$, model size $N$, and total available data $P$. By this we mean that even if the other two resources were effectively infinite, the loss can still be nonzero because of the finite value of the third quantity. In this section, we show that the dependence of the loss on these resources can obey power laws when the features themselves have power-law structure. It has been observed that the spectra of neural network kernels on real datasets often follow power laws^{[13]}^{[16]}

$$\lambda_k = \mathbb{E}_{x} \psi_k(x)^2 \sim k^{-b} \ , \ [\mathbb{E}_x y(x) \psi_k(x) ]^2 \sim k^{-a} .$$

For this kind of feature structure, our theory gives the following approximate scaling laws when bottlenecked by one of the three resources (time, model size, and dataset size)

\begin{align}
\mathcal L(t,P,N) \approx
\begin{cases}
t^{-(a-1)/b} \ , & N,P \to \infty \\
N^{-\min\{a-1,2b\}} \ , & t,P \to \infty \\
P^{-\min\{a-1,2b\}} \ , & t, N \to \infty
\end{cases}
\end{align}

For most cases of interest, the exponents satisfy $a-1 < 2b$^{[13]}^{[16]}, leading to $\sim N^{-(a-1)}, P^{-(a-1)}$ model and data bottleneck scaling laws. In these cases, our result predicts that the training time exponent $(a-1)/b$ differs from the model size and data exponents $a-1$ by a factor of $1/b$, set by the rate of decay of the eigenvalues.
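These exponent formulas can be packaged into a small helper; a sketch (the function name and interface are ours, not from the paper):

```python
def bottleneck_exponents(a, b):
    """Predicted power-law exponents when the loss is bottlenecked by
    training time, model size, or dataset size, per the cases above."""
    return {
        "time": (a - 1) / b,             # L ~ t^{-(a-1)/b}
        "model": min(a - 1, 2 * b),      # L ~ N^{-min(a-1, 2b)}
        "data": min(a - 1, 2 * b),       # L ~ P^{-min(a-1, 2b)}
    }
```

For $(a,b) = (2,1)$ all three exponents equal 1; faster eigenvalue decay (larger $b$) shrinks the time exponent relative to the model and data exponents.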

An intuitive way to interpret this result in the case of interest ($\min\{a-1,2b\} = a-1$) is that $t$ steps of gradient descent on $N$ features and $P$ data can capture at most

$$k_{\star} \approx \min\{ t^{1/b}, N, P \} . $$

spectral components of the target function. The loss is determined by the remaining variance that is not captured in these top $k_\star$ components, $\mathcal L \approx \sum_{k > k_\star} \mathbb{E}_{x}[y(x) \psi_k(x)]^2$. Thus these bottleneck scaling laws can be viewed as low-rank effects in the empirical kernel that limit the performance of the model.
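This intuition translates directly into a back-of-the-envelope estimate. The sketch below (our own illustration; the spectral truncation $M$ and default exponents are arbitrary) computes $k_\star$ and sums the uncaptured tail:

```python
import numpy as np

def bottleneck_loss(t, N, P, a=2.0, b=1.0, M=10**6):
    """Approximate the loss as the spectral tail not yet captured:
    k_star = min(t^{1/b}, N, P), loss ~ sum over k > k_star of k^{-a}."""
    k_star = int(min(t ** (1.0 / b), N, P))
    k = np.arange(k_star + 1, M + 1)
    return float(np.sum(k ** (-a)))
```

With $a = 2$, making $N$ the binding resource gives a loss of roughly $1/N$, matching the $N^{-(a-1)}$ bottleneck scaling above.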

In this section we consider a regime of training where there is sufficient data, such as the online training regime of large language models. By approximating the test loss as a linear combination of the model size and time bottleneck scalings, we can derive the compute optimal scaling of training time and model size with respect to total compute $C=N t$. This compute budget $C$ is the total number of floating point operations required to train the model. For the optimal choice of training time and model size, we find the loss depends on compute as

$$\mathcal L_\star(C) \sim C^{-\min\{a-1,2b\}(a-1) /( b \min\{a-1,2b\} + a-1)}$$

which in most cases of interest will simply be $\mathcal{L}_\star(C) \sim C^{- (a-1)/(b+1)}$. We show an example of this for $(a,b) = (2,1)$ below. Our theoretical scaling law is compared to the experimental loss curves from training models of varying size $N$ for multiple timesteps.

This model shows how the data structure and architecture influence the compute costs of training a highly performant model. Specifically, the decay rate of target coefficients and eigenvalues controls the compute optimal scaling law of the model. For models with fast eigenvalue decay rates, it is preferable to scale up training time much faster than scaling up model size as the optimal scaling rule is $t \sim C^{\frac{b}{1+b}}$ and $N \sim C^{\frac{1}{1+b}}$. As $b \to 1$ the optimal scaling is symmetric.
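We can sanity-check this tradeoff numerically with the two-term loss approximation $\mathcal L \approx t^{-(a-1)/b} + N^{-(a-1)}$ under the constraint $C = Nt$ (a sketch with illustrative $(a,b)=(2,1)$, for which the prediction is $\mathcal L_\star \sim C^{-1/2}$ and $t \sim C^{1/2}$):

```python
import numpy as np

a, b = 2.0, 1.0  # illustrative exponents

def best_loss(C, grid=np.logspace(0, 12, 4001)):
    """Minimize t^{-(a-1)/b} + N^{-(a-1)} over t on a log grid, with N = C/t."""
    t = grid[grid < C]
    N = C / t
    L = t ** (-(a - 1) / b) + N ** (-(a - 1))
    i = int(np.argmin(L))
    return L[i], t[i]

Cs = (1e6, 1e10)
Ls = [best_loss(C)[0] for C in Cs]
slope = -np.log(Ls[1] / Ls[0]) / np.log(Cs[1] / Cs[0])
# slope should approach (a-1)/(b+1) = 0.5, with optimal t ~ sqrt(C)
```

For $(a,b)=(2,1)$ the loss is $1/t + t/C$, minimized at $t = \sqrt{C}$ with $\mathcal L_\star = 2/\sqrt{C}$, so the fitted slope recovers the predicted symmetric allocation.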

Many works have observed that the early training-time dynamics of networks with width $N$ deviate from the infinite $N$ limit at a rate of $1/N$^{[11-13]}, but that after sufficient training on large quantities of data the convergence rate exhibits a task-dependent scaling law $N^{-\alpha_N}$^{[12]}^{[13]}. Our model also exhibits a transition in the convergence rates as training takes place. Below we show the early time loss of our model at width $N$ compared to our model in the $N \to \infty$ limit, seeing a $1/N$ convergence rate.

However, after significant training time, the model will eventually depend on the model size $N$ with a scaling exponent that is task-dependent (the bottleneck scaling) as we show below.

We see that this scaling law can significantly differ from the $1/N$ rate and indeed becomes task dependent.

Many works have also observed that early time training with a finite dataset is well approximated by training with infinite data^{[9]}^{[10]}; however, over time a gap develops between training and test losses. This is also a naturally occurring feature in our model, and the DMFT equations exactly describe how the test and train losses diverge over time. Below we plot dynamics for $N=512$ with varying dataset size $P$.

We note that the test and train losses are close initially but accumulate finite $P$ corrections that drive the separation of test and train. These corrections are larger for small $P$ and vanish as $P \to \infty$.

Finite sized models with random initial weights can be thought of as noisy approximations of infinitely sized neural networks. This extra noise can lead to worse performance and can be eliminated by training multiple models with independent initializations in parallel and averaging their outputs, a procedure known as ensembling. However, recent experiments have demonstrated that the benefits of ensembling, while non-negligible, are not as significant as the benefit of increasing model size^{[11]}^{[12]}.

In our toy model, we can analyze the effect of ensembling on the test loss and ask whether ensembling is compute optimal. Training an ensemble of $E$ networks and averaging their outputs would incur a compute cost of $C = E N t$. Below we plot loss as a function of compute for $E=1$ and $E=4$ ensembles for varying width $N$.

At each value of compute $C$, it is preferable to choose the larger model with $E=1$ than to use a smaller model with $E=4$. We argue the reason for this is that doubling $N$ has a similar effect on the variance as doubling $E$. However, doubling $N$ also reduces the *bias*.
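A cartoon calculation makes this concrete. Suppose (purely illustratively) that both the bias and the initialization variance scale as $1/N$, and that ensembling over $E$ models divides only the variance by $E$, at compute cost $C \propto EN$:

```python
def toy_loss(N, E, bias0=1.0, var0=1.0):
    """Caricature of the finite-width loss: ensembling averages away
    initialization variance (~1/E) but leaves the width-dependent bias untouched."""
    bias = bias0 / N        # reduced by width, unaffected by ensembling
    var = var0 / N          # initialization noise, divided by E when ensembling
    return bias + var / E

C = 8192                            # fixed compute budget ~ E * N (t held fixed)
single = toy_loss(N=C, E=1)         # one large model
ensemble = toy_loss(N=C // 4, E=4)  # four smaller models at the same compute
```

At matched compute the single wide model wins: it reduces both terms, while the ensemble only suppresses the variance term.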

To give a flavor of how this theory works, we show how DMFT recovers the bias along the $k$th feature for all $k$. This error is given by: $$ H_k(t) = \frac{\mathbb{E}_{x} [(y(x) – f(x,t)) \psi_k(x)] }{\mathbb{E}_{x} [y(x) \psi_k(x)]}$$

Our theory explicitly calculates the Fourier transform $\mathcal H_k(\omega)$ in closed form. This is given in terms of the eigenvalues $\lambda_k$, the dataset size $P$ and the model size $N$. An example of the closed form solution for the $H_k(t)$ is plotted below with $N = 128$ and varying values for $P$.

The error along the $k$-th eigendirection deviates from the infinite data and infinite model limit (gray lines) and eventually saturates as $t \to \infty$, giving a final loss which depends on $N$ and $P$. Even if $P \to \infty$, the $H_k$ curves saturate in this plot due to the finite value of $N = 128$. We show the losses for $k=1$ (solid) and $k=10$ (dashed). We find that the bias, which is set by $H_k$, decreases as $N$, $P$ and $t$ increase.

We can also use our methods to analyze stochastic gradient descent (SGD) in discrete time without data reuse. In this setting, the finite model size and finite training time can still limit performance, but the finite batch size $B$ only introduces additional *variance* in the dynamics as we illustrate below. On the left, we vary the model size $N$ with batch size set to $B=32$ and see that we still obtain model size bottlenecks which are qualitatively similar to before. We also see additional small fluctuations in the loss from batch to batch. On the right, we show $N=256$ with varying batch size, showing that the expected loss and scale of fluctuations are higher for smaller batches.

In this setting, a test-train gap is not possible since every fresh batch of data gives an unbiased estimate of the population loss. As a consequence, online learning does not experience a data bottleneck in the bias, but only additional variance from the fluctuations in the SGD updates. These fluctuations disappear in the continuous time (infinitesimal learning rate) limit, which recovers the infinite data $P \to \infty$ limit of the previously discussed gradient flow equations.
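The online setting can also be simulated directly in the same random-feature toy model. In the sketch below (sizes, exponents, learning rate, and the $1/\sqrt{M}$ projection scaling are all illustrative choices of ours), every batch is drawn fresh, so there is no train-test gap, only SGD noise on top of the model-size bottleneck:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 512, 64             # truncated base features, model features
a, b = 3.0, 1.0            # illustrative spectral exponents
k = np.arange(1, M + 1)
lam = k ** (-b)
wstar = k ** (b - a / 2)   # target chosen so (lambda_k w*_k)^2 = k^{-a}
A = rng.standard_normal((N, M)) / np.sqrt(M)   # random projection (illustrative)

def fresh_batch(B):
    """Each batch is drawn fresh from the population -- data is never reused."""
    psi = rng.standard_normal((B, M)) * np.sqrt(lam)
    return psi @ A.T, psi @ wstar

psi_te = rng.standard_normal((4000, M)) * np.sqrt(lam)
Xte, yte = psi_te @ A.T, psi_te @ wstar

def run(B, steps=1500, eta=0.5):
    w = np.zeros(N)
    for _ in range(steps):
        F, y = fresh_batch(B)
        w -= eta * (2.0 / B) * F.T @ (F @ w - y)   # unbiased SGD update
    return float(np.mean((Xte @ w - yte) ** 2))

# smaller batches settle into a noisier, typically higher, loss floor
loss_small, loss_big = run(B=8), run(B=128)
```

Both runs converge toward the same model-size-limited floor; the residual gradient noise scales with $\eta/B$ and vanishes in the small-learning-rate limit, matching the discussion above.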

Our model is based on a kernel approximation of neural network training which fails to capture the benefits to performance due to feature learning. Below we plot neural networks trained in the kernel regime (solid) and the predicted compute scaling exponent (blue), obtained from fitting the exponents $a$ and $b$ to the measured initial kernel spectra. We also plot the loss curves for networks in the feature learning regime (dotted lines).

While the networks operating in the kernel regime (solid) are well described by our theoretical prediction for the compute scaling law, the networks in the rich, feature learning regime have a much better dependence on compute $C$. This illustrates that quantitatively capturing the compute optimal scaling exponents observed in practice will require a theory of how feature learning accelerates convergence during training.

We proposed a simple linear model to analyze dynamical neural scaling laws. This model captures many of the observed phenomena related to network training and test loss dynamics. Looking forward, theories which incorporate feature learning into the network training dynamics will improve our understanding of scaling laws. The fact that infinitely sized models perform best suggests that theories of feature learning at infinite width are a natural starting point.

- Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Gemini Team Google. Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805, 2023.
- Tamay Besiroglu, Ege Erdil, Matthew Barnett and Josh You. Chinchilla Scaling: A replication attempt. arXiv preprint arXiv:2404.10102, 2024.
- Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Blake Bordelon, Alex Atanasov, Cengiz Pehlevan. A Dynamical Model of Neural Scaling Laws. To appear in International Conference on Machine Learning, 2024.
- Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington. 4+3 Phases of Compute-Optimal Neural Scaling Laws. arXiv preprint arXiv:2405.15074, 2024.
- Preetum Nakkiran, Behnam Neyshabur, and Hanie Sedghi. The deep bootstrap framework: Good online learners are good offline generalizers. International Conference on Learning Representations, 2021
- Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. Advances in Neural Information Processing systems, 2023
- Blake Bordelon and Cengiz Pehlevan. Dynamics of finite width kernel and prediction fluctuations in mean field neural networks. Advances in Neural Information Processing Systems, 2023.
- Nikhil Vyas, Alexander Atanasov, Blake Bordelon, Depen Morwani, Sabarish Sainathan, and Cengiz Pehlevan. Feature-learning networks are consistent across widths at realistic scales, Advances in Neural Information Processing Systems 2023.
- Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws. arXiv preprint arXiv:2102.06701, 2021.
- Alexander Maloney, Daniel Roberts, James Sully. A solvable model of neural scaling laws. arXiv preprint arXiv:2210.16859. 2022.
- Alexander Atanasov, Blake Bordelon, Sabarish Sainathan, Cengiz Pehlevan. Onset of Variance-limited Behavior for networks in the lazy and rich regimes. arXiv preprint arXiv:2212.12147. 2022.
- Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, pp. 1024–1034. PMLR, 2020.


The post Infinite Limits of Neural Networks appeared first on Kempner Institute.

The performance of deep learning models improves with model size and dataset size in remarkably regular and predictable ways^{[1]}^{[2]}. However, less is known about what kind of limiting behavior these models approach as their model size and dataset size approach infinity. In this blog post, we aim to give the reader a relatively accessible introduction to various infinite parameter limits of neural networks. Beyond this, we aim to answer a pressing question of whether these theoretical limits actually translate to anything practically meaningful. Are large-scale language and vision models anywhere near these infinite limits? If not, in what ways do they differ?

In studying very wide and deep networks, a crucial role will be played by the *parameterization* of the network. A parameterization is a rule for going from a given size network to a wider (or deeper) one. More technically, it is defined by how width and depth enter when one defines the initialization, forward pass, and gradient update step of the network. Different parameterizations can lead to very different behavior as networks are scaled up.

**TL;DR:** Conventional parameterizations lead to neural networks performing worse with increasing depth or width. However, there are special parameterizations where performance monotonically improves as width and depth increase. These special parameterizations also keep the optimal hyperparameters relatively stable as networks are scaled up. The key property of these parameterizations is that the network continues to learn features from the data as width and depth increase. In conventional parameterizations, very wide and deep networks are limited in their ability to learn features.

Below, we train CNNs of varying channel count for a few epochs on the CIFAR-10 image classification task. We will call the channel count the *width* and denote it by $N$. We start with a width 64 network (solid), and increase the width according to the commonly used *NTK parameterization* (dashed). This parameterization will be precisely defined in the next section. We keep all other hyperparameters fixed. As the model size increases, the test loss gets worse.

Do wider models inherently train slower and generalize worse? Such a finding would be quite inconsistent with the recent results of larger language and vision models achieving top performance.

Actually, this effect is entirely an artifact of the parameterization. When using the same architecture in another parameterization known as the *maximal update parameterization* or $\mu$P for short (solid lines below), we find that increasing the width $N$ leads to improved performance. We will precisely define this parameterization later on.

Note that we have set things up so that the width 64 network has identical dynamics in both parameterizations, and the other networks are scaled up from it using either the NTK or $\mu$P parameterization.

What is different between these two parameterizations? Why does one give similar training dynamics across model sizes while the other does not? To understand the difference, we need to investigate their limiting behaviors as width becomes larger. Mathematically, this means studying the limit $N \to \infty$.

Depending on the architecture of a given network, the *width* of a given layer is either: the feedforward dimension in a dense layer, the number of channels in a convolutional layer, the size of the residual stream, or the dimension of the keys and queries for each attention head. Infinite width limits usually mean that we take this hidden dimension to infinity in each layer of the network.

For simplicity, we focus on a dense feedforward neural network, though we stress that these ideas can straightforwardly be extended to other architectures (convnets, resnets, transformers, state space models).

We initialize all weights to be drawn from the unit Gaussian $\mathcal N(0, 1)$, although any well-behaved^{1} distribution with mean 0 and variance 1 will lead to the same results. We take the depth to be $L$. For each layer $\ell \in \{1, \dots, L \}$, the entries of the hidden layer “preactivation” vector $\mathbf h^\ell$ are given by

$$h^{\ell+1}_i(\mathbf x) = \frac{1}{\sqrt{n_{\ell}}} \sum_{j} W_{ij}^{\ell+1} \varphi(h^{\ell}_j(\mathbf x)), \quad h^1_i(\mathbf x) = \frac{1}{\sqrt{n_0}} \sum_j W_{ij}^1 x_j,$$

where $\varphi$ is a nonlinearity acting element-wise, and $n_{\ell}$ is the width of layer $\ell$ and $n_0$ is the dimension of the inputs. The output of the network is then given by the last layer:

$$f(\mathbf x; \mathbf W) = h^{L}(\mathbf x).$$

Going forward, we will take all widths $n_\ell = N$. The factors of $1/\sqrt{n_\ell}$ at each layer are there so that (by the central limit theorem) each $h_{i}^\ell$ is $\Theta_N(1)$ in size^{2}. This allows for the network output to be finite in the limit of $N \to \infty,$ and for the network to be trainable.
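To make this concrete, here is a small numpy sketch of this forward pass (the depth, widths, and tanh nonlinearity are illustrative choices, not from the post's experiments):

```python
import numpy as np

def init_weights(widths, rng):
    """widths = [n_0, n_1, ..., n_L]; unit-Gaussian weight entries, as in the text."""
    return [rng.standard_normal((widths[l + 1], widths[l]))
            for l in range(len(widths) - 1)]

def forward(Ws, x, phi=np.tanh):
    """Forward pass with a 1/sqrt(fan_in) factor at every layer, so that
    preactivation entries stay Theta(1) as the width grows."""
    h = Ws[0] @ x / np.sqrt(x.shape[0])
    for W in Ws[1:]:
        h = W @ phi(h) / np.sqrt(W.shape[1])
    return h

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
stds = {}
for N in (128, 1024):
    Ws = init_weights([32, N, N, 1], rng)
    stds[N] = np.std(Ws[0] @ x / np.sqrt(32))  # first-layer preactivation scale
```

The typical preactivation size is of order 1 at both widths, which is exactly what the $1/\sqrt{n_\ell}$ factors guarantee via the central limit theorem.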

In this parameterization, the weights and preactivations move during gradient descent by:

$$\Delta W_{ij}^\ell \sim \Theta(N^{-1}) \ , \ \Delta h_i^\ell \sim \Theta(N^{-1/2}).$$

We note that in the $N \to \infty$ limit, because the preactivations do not change, the hidden representations of the network are static during training. The network will not learn internal representations of the data.

Because the weights move only infinitesimally as $N\to \infty$, the model behaves like its *linearization* in weight space. This gives a linear model (also called a kernel method). This is the *neural tangent kernel* (NTK)^{3}.

One can successfully study the training and generalization of sufficiently wide neural networks by studying this linear model. Our group’s prior work ^{[3]}^{[4]} gives exact analytic equations for the learning curves in this infinite width limit, using methods in statistical physics and random matrix theory. See also our group’s recent work ^{[5]} for an accessible overview of these ideas. One can also study the training dynamics of linear models, and recover many of the empirical phenomena observed in large scale language and vision models – see our next blog post!

Unfortunately, by their very definition, linear models do not learn features. Feature learning is widely believed to be responsible for the impressive performance of deep learning. Further, in networks that learn features, wider models empirically tend to perform better. The NTK parameterization does not capture these phenomena – luckily, there is an alternative parameterization that does.

The aim is to parameterize the network so that it can learn features even at infinite width. This has been studied under the name *mean field* parameterization for two layer networks ^{[6]}^{[7]} and generalized to arbitrary architectures by Yang et al.^{[8]} under the name maximal update parameterization, commonly abbreviated as $\mu$P. A characterization of the infinite width limit using dynamical mean field theory was done by our group in ^{[9]}.

A simple way to derive this parameterization is to slightly modify the above feedforward network by adding a single additional (non-trainable) parameter $\alpha$ to the very last stage of the forward pass, as done in ^{[10]}:

$$f_\alpha(\mathbf x; \, \{\mathbf W^\ell\}_{\ell}) = \alpha \, h^{L}(\mathbf x).$$

Although this modification appears relatively innocuous, it substantially impacts the amount by which the network deviates from being a linear model. When $\alpha$ is very large, relatively small changes in the weights can lead to large changes in the output $f_\alpha$. The model is then well-approximated by its weight space linearization. This is called the **lazy limit**, **kernel limit**, or **lazy training**. Therefore at large $\alpha$, the network does not learn features.

By contrast, small $\alpha$ networks need to significantly change their weights to induce the necessary change in the output ^{[10]}^{[11]}^{[12]}. This is called the **rich regime** and allows for the network to learn features.

In order to keep the gradients from exploding or vanishing, one needs to scale the learning rate^{4} as $\eta \sim \alpha^{-2}$. Then, in each step of gradient descent one can show that the changes of the preactivations go as:

$$\Delta h_i^\ell \sim \frac{\eta\, \alpha}{\sqrt{N}} \sim \frac{1}{\alpha \sqrt{N}}.$$

We see that choosing $\alpha = 1/\sqrt{N}$ causes $\Delta h_i^\ell$ to remain order $1$ even as $N \to \infty$. This allows for feature learning at infinite width while keeping the network trainable. This is exactly the rescaling required to achieve $\mu$P.
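A quick numerical check of this scaling (a two-layer tanh toy model of our own construction, not code from the paper): measure the typical preactivation change per unit of loss residual after one gradient step, in the NTK scaling ($\alpha = 1$, $\eta = 1$) versus the $\mu$P scaling ($\alpha = 1/\sqrt{N}$, $\eta = \alpha^{-2} = N$):

```python
import numpy as np

def dh_per_residual(N, alpha, eta, d=64):
    """Typical |Delta h_i| per unit of loss residual after one gradient step
    on (f - y)^2 for the toy model f = (alpha/sqrt(N)) v . tanh(W x / sqrt(d))."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal(d)
    W = rng.standard_normal((N, d))
    v = rng.standard_normal(N)
    h = W @ x / np.sqrt(d)
    # dL/dW_ij = resid * alpha * v_i * tanh'(h_i) * x_j / sqrt(N d); take resid = 1
    grad_W = alpha * np.outer(v * (1 - np.tanh(h) ** 2), x) / np.sqrt(N * d)
    dh = (eta * grad_W) @ x / np.sqrt(d)   # induced preactivation change
    return float(np.sqrt(np.mean(dh ** 2)))

ntk = [dh_per_residual(N, alpha=1.0, eta=1.0) for N in (256, 4096)]
mup = [dh_per_residual(N, alpha=N ** -0.5, eta=float(N)) for N in (256, 4096)]
# NTK scaling: Delta h shrinks like 1/sqrt(N); muP scaling: Delta h stays O(1)
```

As the formula above predicts, $\Delta h \sim \eta\alpha/\sqrt{N}$: the NTK choice sends it to zero as $N$ grows, while the $\mu$P choice keeps it width-independent.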

The $N \to \infty$ limit gives a dynamical system for the preactivations $\mathbf h^\ell$ which can be characterized using a method from statistical physics known as *dynamical mean field theory* (DMFT) ^{[13]}. Below we show a figure from that paper: the correlation between preactivations, $H^{\ell}_{\mu\nu} = \frac{1}{N} \mathbf h^\ell(\mathbf x_\mu) \cdot \mathbf{h}^\ell(\mathbf x_\nu)$, for pairs of samples $\mathbf x_\mu, \mathbf x_\nu$ across different layers $\ell \in \{1,\dots,5\}$ after training.

The architecture is a simple linear neural network and the task is learning the first two classes of CIFAR. Despite this simple setting, we already see that the DMFT solution captures the network’s ability to learn features while the NTK limit cannot.

We have described two different infinite width limits of neural networks. From the point of view of a practitioner, what really matters is whether such limits are descriptive of realistic large models. We investigate this question empirically in our recent paper ^{[14]}. We find that $\mu$P networks approach their infinite width behavior at the scales used in practice, by contrast to NTK-parameterized networks.

We look at CNNs trained on vision tasks with millions of training examples (CIFAR-5M and ImageNet) as well as language modeling on Wikitext-103. We vary the width of models parametrized in $\mu$P so that they approach a well-defined feature learning limit.

Below, we show that the loss dynamics for networks of varying width $N$ in $\mu$P all begin to coincide at sufficiently large $N$.

We also note that, unlike in NTK parameterization, the wider models tend to outperform the narrower models, similar to observations in ^{[15]}.

Not only do the loss curves begin to coincide past a given width, but even the individual logit outputs for any fixed held-out test point begin to agree at large $N$. Plotted below is the output of the network on the correct logit for a vision and language modeling task in $\mu$P. We use a single (randomly selected) test point in each case.

Another prediction of the DMFT treatment is that for wide enough networks, the learned internal representations will converge to the same limiting values. We indeed observe this in practice. Below we plot the preactivations and final-layer feature kernels for a ResNet on CIFAR-5M, and observe that the network learns the exact same representations across different widths. We also plot the feedforward preactivation histograms, as well as attention weights for a transformer trained on language modeling of Wikitext-103. For more quantitative measures of this convergence, see our paper.

One can ask what role finite width plays in deviating a network from its infinite-width NTK or $\mu$P limit. There are two primary effects on the dynamics of finite width networks compared to infinite width networks. First, finite networks exhibit dependence on their precise initialization of parameters which infinite networks do not. This can be thought of as additional *variance* in the model. This variance can be eliminated by averaging network outputs over different random initializations. This is known as *ensembling*. However, even after ensembling, finite models are still *biased* compared to infinite width models.

To measure how large the variance correction and bias correction of finite width models are, we ensemble several randomly initialized networks. Below, we plot the average loss of single models of varying width in solid lines and their ensemble averaged loss in dashed lines. Although ensembling is helpful, going wider still is better than ensembling.

These results indicate that bias corrections in the dynamics tend to dominate the gap between finite and infinite models.

For a deeper theoretical study of these finite-width effects, see our prior works ^{[16]}^{[17]}. See also our follow-up blog post and recent paper ^{[18]} for a solvable model of this and its implications for the compute-optimal allocation of resources.

Because the training dynamics of $\mu$P become consistent across widths, one can perform hyperparameter search on smaller models to find the optimal hyperparameters (e.g. learning rate, momentum, etc). One then scales the width up and sees that those hyperparameters remain nearly optimal for the wider networks. See ^{[15]} for details. This offers large potential savings of compute time when training large-scale models.

Optimal hyperparameters are not stable across widths in most parameterizations. For example, in standard (PyTorch default) parameterization, learning rates do not transfer over either width or depth – as we show below.

Recently, our paper ^{[19]} as well as concurrent work ^{[20]} has extended this to allow for hyperparameter transfer across both widths *and* depths. Below we show an example of the loss as a function of learning rate for various sized models, showing that the optimum is essentially constant across different sizes.

The key philosophy to obtain this hyperparameter transfer is to design a parameterization where the model converges to a well-defined limit and maintains the same rate of feature learning across all network sizes.

We have introduced two distinct parameterizations of neural networks that allow for infinite width limits to be taken. The NTK parameterization gives rise to a linear model at infinite width. This linear model does not learn features and therefore fails to capture crucial aspects of deep learning. The maximal update parameterization $\mu$P on the other hand gives a feature-learning limit that performs well and is representative of realistic wide networks. We observed consistent dynamics even at modest and realistically-accessible widths, indicating that networks were approaching that limit. We’ve commented on the applications of this for studying learned representations and transferring hyperparameters across widths and depths. We believe that infinite-width feature learning networks have much to offer both theorists and practitioners.

While the infinite width and depth models tend to perform best, it is important to know the scaling behavior of the loss as a function of training time and model size in order to optimally allocate compute resources. Check out our follow-up blog post that gives a solvable model of these scaling laws.

^{1} That is, any distribution with finite moments, e.g. a uniform distribution. [Return to text ]

^{2} Here we will use $O_N(1)$ to denote that the quantity remains constant as $N$ is varied. We will also use $\Theta(f(N))$ to denote that a given quantity as a function of $N$ can be bounded from *both above and below* by (appropriate constants) times the function $f(N)$ as $N \to \infty$. That is, it has the same asymptotic behavior as $f(N)$. [Return to text ]

^{3} In the infinite-width limit, the kernel turns out to be independent of the initialization $\mathbf W_0$. This is an important fact, as the leading finite-width correction introduces a dependence on initialization into the kernel, which can be viewed as a source of noise or “variance” that hurts performance. [Return to text ]

^{4} This learning rate scaling follows from the fact that the NTK goes as $\partial_\theta f \cdot \partial_\theta f \sim O(\alpha^2)$ and the learning rate should be scaled inversely with the max eigenvalue of the kernel. [Return to text ]

- Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks, L.A., Welbl, J., Clark, A. and Hennigan, T., 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
- Bordelon, B., Canatar, A. and Pehlevan, C., 2020, November. Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning (pp. 1024-1034). PMLR.
- Canatar, A., Bordelon, B. and Pehlevan, C., 2021. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature communications, 12(1), p.2914.
- Atanasov, A., Zavatone-Veth, J. and Pehlevan, C., 2024. Scaling and renormalization in high-dimensional regression. arXiv preprint. https://arxiv.org/abs/2405.00592.
- Mei, S., Montanari, A. and Nguyen, P.M., 2018. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33), pp.E7665-E7671.
- Rotskoff, G.M. and Vanden-Eijnden, E., 2018. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. stat, 1050, p.22.
- Yang, G. and Hu, E.J., 2021, July. Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning (pp. 11727-11737). PMLR.
- Bordelon, B. and Pehlevan, C., 2022. Self-consistent dynamical field theory of kernel evolution in wide neural networks. Advances in Neural Information Processing Systems, 35, pp.32240-32256.
- Chizat, L., Oyallon, E. and Bach, F., 2019. On lazy training in differentiable programming. Advances in neural information processing systems, 32.
- Geiger, M., Spigler, S., Jacot, A. and Wyart, M., 2020. Disentangling feature and lazy training in deep neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2020(11), p.113301.
- Woodworth, B., Gunasekar, S., Lee, J.D., Moroshko, E., Savarese, P., Golan, I., Soudry, D. and Srebro, N., 2020, July. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory (pp. 3635-3673). PMLR.
- Bordelon, B. and Pehlevan, C., 2022. Self-consistent dynamical field theory of kernel evolution in wide neural networks. Advances in Neural Information Processing Systems, 35, pp.32240-32256.
- Vyas, N., Atanasov, A., Bordelon, B., Morwani, D., Sainathan, S. and Pehlevan, C., 2024. Feature-learning networks are consistent across widths at realistic scales. Advances in Neural Information Processing Systems, 36.
- Yang, G., Hu, E.J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W. and Gao, J., 2022. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466.
- Atanasov, A., Bordelon, B., Sainathan, S. and Pehlevan, C., 2022, September. The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich Regimes. In The Eleventh International Conference on Learning Representations.
- Bordelon, B. and Pehlevan, C., 2024. Dynamics of finite width kernel and prediction fluctuations in mean field neural networks. Advances in Neural Information Processing Systems, 36.
- Bordelon, B., Atanasov, A. and Pehlevan, C., 2024. A Dynamical Model of Neural Scaling Laws. arXiv preprint arXiv:2402.01092.
- Bordelon, B., Noci, L., Li, M.B., Hanin, B. and Pehlevan, C., 2023, October. Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit. In The Twelfth International Conference on Learning Representations.
- Yang, G., Yu, D., Zhu, C. and Hayou, S., 2023. Tensor programs vi: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244.

The post Infinite Limits of Neural Networks appeared first on Kempner Institute.


How can we endow LLMs with this metacognitive capability? A starting point is to use the probability values LLMs place on all the tokens they produce. In a pre-trained model, these reflect the model’s uncertainty about whether any given token would be the correct one if it encountered the preceding text during training. Empirical studies have found that (at least before the model is fine-tuned) these probabilities are well-calibrated in multiple choice question-answering settings: among cases where the model places a 30% probability that the answer is A, the answer is indeed A around 30% of the time.

This suggests a band-aid solution for hallucinations: simply display to the user the probabilities of every token in the model’s output, or highlight the low-probability ones. When the probability is low, this means the model has low confidence in the token. Suppose a model says “Vänern is the largest lake in the country of Finland”, but you see that it only places a 5% probability on “Finland”. This is a red flag that the model is hallucinating.

So is the band-aid solution sufficient? To see why not, suppose the model says “Sweden has beautiful lakes” and the probability on “lakes” is 5%. This does *not* mean the model “believes” there is only a 5% chance Sweden’s lakes are beautiful. Indeed, the remaining 95% of the model’s probability mass might be placed on completions like “landscapes”, “cities”, “people”, etc., which are consistent with the lakes being beautiful. There are simply many valid words that could fit in this spot.

The key observation is that when a language model is uncertain about the next token, its uncertainty is a mix of *epistemic* uncertainty (reflecting the model’s ignorance) and *aleatoric* uncertainty (reflecting the inherent unpredictability of text). When our model was uncertain about “Finland”, this was primarily epistemic uncertainty: there was a **knowable** correct answer. When it was uncertain about “lakes”, its uncertainty was primarily aleatoric. A viable solution for mitigating hallucinations cannot rely simply on the model’s *total* uncertainty; it is *epistemic* uncertainty which matters. In our new work, we thus study the problem of distinguishing epistemic from aleatoric uncertainty (i.e., “**knowing what is knowable**”) in the outputs of language models. While prior works on estimating epistemic uncertainty have primarily limited their scope to small models and/or structured question-answering settings, we set out to identify epistemic uncertainty in larger models on free-form unconstrained text.

**We show that supervised *linear* probes trained on a language model’s internal activations can achieve high accuracy at classifying epistemic versus aleatoric uncertainties, even when the probes are evaluated on unseen text domains (probes trained on Wikipedia text, for example, generalize well to code data). Going further, we provide a completely unsupervised method for the same tasks that outperforms naive baselines.**

One way to define the difference between epistemic and aleatoric uncertainty is that aleatoric uncertainty is inherent to the randomness in language, while epistemic uncertainty is “in the eyes of the beholder.” As the quantity of training data and computation time increases, models will learn more of what is knowable, and epistemic uncertainty will recede. In other words, as language models become bigger and are trained for longer, they become better proxies for the “true” distribution. Specifically, we posit a setting where we have access to a relatively small language model of interest (say, LLaMA 7B), as well as a more knowledgeable reference language model (say, LLaMA 65B, or, better, a model at the level of GPT-4). We will refer to the first model as the **small model** and the second model as the **large model**.

The goal is to be able to identify tokens where the *small model* has high epistemic uncertainty. The large model serves as a proxy for the ground truth, providing a way to obtain (imperfect, but still meaningful) “epistemic uncertainty labels” that can be used to evaluate different approaches. Our supervised approach will use the large model to train simple probes on the activations of the small model, but will not have access to it at inference time. Our unsupervised approach will not have access to the large model at all.

To generate uncertainty labels for a set of tokens, we use both models to generate conditional predictions at each token and flag cases where the two models significantly disagree on the entropy of the next token.

**Tagging Method:**

**Tagging Example:**

We’ve highlighted this passage from Wikipedia using LLaMA 7B and 30B according to the difference between their predictive entropies at each token. **Black** denotes no meaningful difference (i.e., < 0.2), and **red** denotes the largest difference (i.e., > 2.0). In cases where the two models disagree the most, we’ve also inserted in parentheses the top three token predictions from the small model. Qualitatively, we can see that tokens highlighted in red above (large disagreements) tend to include factual or otherwise deterministic information (the name of Oppenheimer’s advisor, the content of his paper with Born, tokens in a quotation the large model seems to be familiar with).

In general, if the large model is capable enough, we expect tokens where the small model is uncertain while the large model has near-zero entropy to be instances where the small model’s uncertainty is purely epistemic. Other tokens with a large entropy difference would be mixtures of epistemic and aleatoric uncertainty. To simplify the setup, we focus on the former kind: among all the tokens that the small model is uncertain about, we create a binary classification task, labeling tokens where the large model has low entropy with “0” and all other tokens with “1”.
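A minimal sketch of this labeling rule (the entropy thresholds here are illustrative placeholders, not the values used in the paper):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_label(small_probs, large_probs,
                      small_threshold=1.0, large_threshold=0.2):
    """Return 0 ('mostly epistemic': the large model is confident) or
    1 (otherwise) for tokens the small model is uncertain about;
    None if the small model is not uncertain in the first place."""
    if entropy(small_probs) < small_threshold:
        return None  # token is not part of the classification task
    return 0 if entropy(large_probs) < large_threshold else 1
```

For example, a token where the small model is split three ways but the large model places 98% on one answer gets label 0; a token where both models are split gets label 1.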

There are obvious limitations to this contrastive setup, which stem from the fact that large models are only proxies for the true language distribution. The quality of our labels depends on the large model used for tagging. If the large model exhibits epistemic uncertainty in its own right, “epistemic” tokens could slip through the cracks. This issue can be addressed by simply scaling up the large model further. And in practice, we find that our probes make accurate predictions on the classification task anyway.

**Supervised probes**

In the supervised setting, we train linear classifiers on a small model’s “next-token” activations to predict uncertainty labels. Empirically, we find that activations from the middle layers work the best (sometimes by large margins). Our probes are mostly trained on a set of Wikipedia articles that are new enough that they aren’t present in the models’ training data. Our models are selected from the LLaMA, Llama 2, and Pythia model families.

To make sure that our probes are not just learning trivial heuristics, we perform a few interventions on our training and evaluation sets.

First, we note that small model entropy and large model entropy tend to be very heavily correlated (see log-colored heatmaps above). The high correlation means that a small-model probe could predict the large model’s entropy at any given token with high accuracy simply by outputting the entropy of the small model at that token. Second, the entropy at a token is often heavily influenced by the previous token; tokens following a period, for example, tend to have higher entropy according to both models. To prevent the probe from learning these two trivial heuristics, we 1) train individual probes on a narrow band of the small model’s entropy (sample band depicted in green) and 2) balance the two classes for each “previous” token.
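To make the probe setup concrete, here is a minimal logistic-regression probe trained with plain gradient descent on (synthetic) activation vectors; the entropy-band filter mirrors the first intervention above. This is an illustrative sketch, not the paper's training code:

```python
import numpy as np

def entropy_band_indices(entropies, low, high):
    """Keep only tokens whose small-model entropy lies in a narrow band,
    so the probe cannot succeed by predicting entropy itself."""
    return [i for i, h in enumerate(entropies) if low <= h <= high]

def train_linear_probe(activations, labels, lr=0.5, epochs=300):
    """Logistic-regression probe: activations is (n, d), labels is (n,)."""
    n, d = activations.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(activations @ w + b)))
        grad = p - labels  # gradient of the cross-entropy loss w.r.t. logits
        w -= lr * (activations.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, activations, labels):
    """Fraction of tokens whose predicted label matches."""
    return ((activations @ w + b > 0).astype(int) == labels).mean()
```

On activations whose labels are a linear function of a couple of coordinates, such a probe recovers the decision rule; the interesting empirical finding is that real LLM activations also admit such a linear readout.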

We run some simple baselines as sanity checks. BET (best entropy threshold) is the best classifier that simply thresholds the small model’s entropy on a given evaluation set. BET-FT uses the same entropy after the small model’s language modeling head is fine-tuned on distributions output by the large model (to control for the fact that the probes are being trained on data slightly outside the small model’s training distribution).
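The BET baseline amounts to a one-parameter classifier; a simple search over candidate thresholds can be sketched as follows (illustrative code, assuming label 1 means the large model's entropy is not low):

```python
def best_entropy_threshold(entropies, labels):
    """BET baseline: find the threshold on small-model entropy that
    maximizes accuracy when predicting label 1 above the threshold."""
    best_acc, best_t = 0.0, None
    for t in sorted(set(entropies)):
        preds = [1 if h > t else 0 for h in entropies]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t, best_acc
```

Because the threshold is chosen on the evaluation set itself, BET is a generous baseline; the probes outperform it anyway.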

Overall, we find that the probes (denoted “MLP” and “Linear classifier” in the above plots) are extremely effective, and greatly outperform both baselines in all cases. We provide ROC curves for a handful of classifiers above. Probes trained on Wikipedia data are almost perfect classifiers of Wikipedia tokens (top plot), and they are also surprisingly effective when evaluated on radically different text distributions, such as code, foreign language text (Europarl), and Stack Exchange forum posts, without any additional training (bottom plot).

The fact that the linear probe in particular performs so well means that notions of epistemic uncertainty are *linearly represented* in the activations of LLMs. At first we thought that this finding might simply reflect the fact that LLMs form rich representations of language that are useful in general; perhaps this richness is sufficient to encapsulate aspects of language that are correlated with the epistemic/aleatoric distinction. But the impressive transfer performance of the probes when they are trained on one domain and evaluated on a completely different domain suggests that they might not be merely learning correlations: the LLM representations may “natively encode” epistemic uncertainty in some mysterious way. We’re not sure what’s going on here!

The supervised probing results are promising, but to implement them in practice, one needs access, when training the probe, to a more powerful model than the one actually being deployed. This does reflect some real-life situations, e.g., deploying a small model for efficiency reasons, or because the most powerful models are proprietary. We now consider whether it is possible to perform uncertainty disentanglement on the most powerful models available. To study this, we will still consider the small model to be the one we will deploy, but now we will not allow access to the large model at all, except for evaluation.

In particular, we propose an unsupervised method inspired by the phenomenon of *in-context learning*. Large language models are adept at in-context learning: while they do store information directly in their parameters, they can also work with new facts and even new tasks presented to them entirely in-context.

Copying information from the context makes more sense in some cases than others. Suppose the model needs to fill in the blank in “Vänern is the largest lake in the country of __”, and suppose the full sentence “Vänern is the largest lake in the country of Sweden” occurred earlier in the context. Then, even if the model is initially uncertain about the location of Vänern, we expect it to predict “Sweden” for the second occurrence as well, as long as it knows there is only one correct answer. Alternatively, suppose the model needs to complete “Sweden has beautiful __”, given that “Sweden has beautiful landscapes” appeared earlier in the context. In this case, the model may still place significant probability on completions besides “landscapes”, because they are not incompatible with the first occurrence. Our unsupervised method relies on the hypothesis that language models are capable of this sort of *selective* in-context learning, and lean on in-context information more heavily when their uncertainty is primarily epistemic.

**Unsupervised uncertainty classification (synthetic example)**

To show that language models can develop that capability, we first consider language modeling in a synthetic setting. We construct binary question/answer pairs that consist of the following:

- An “epistemic/aleatoric” bit, indicating whether the question is aleatoric or not.
- A question ID, of some length *k*
- A single answer bit

Crucially, we generate the answer bit differently for “epistemic” vs. “aleatoric” questions. For epistemic questions, we sample an answer bit uniformly at random in advance and then fix it: every time that question appears, it has the same answer. Epistemic questions can be memorized. Aleatoric questions, on the other hand, have newly sampled random answers at every occurrence.

We train a nanoGPT model (~100M parameters) to answer questions of this form in a few-shot setting and then evaluate it on held-out data. Our intuition is that the model should always start with a 50/50 prediction for questions of both types: this is the correct prediction for aleatoric questions, and the model has no way to know the answers to the unseen epistemic questions in the evaluation set. After we provide it with hints in-context, however, we expect it to update its answers for epistemic questions, but not aleatoric questions. That is indeed what we see:
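A sketch of the synthetic data-generating process described above (parameter names and defaults are ours, not the paper's):

```python
import random

def make_synthetic_qa(n_questions, id_length=8, n_occurrences=4, seed=0):
    """Binary QA pairs: epistemic questions keep one fixed answer bit
    across all occurrences; aleatoric questions re-sample it each time."""
    rng = random.Random(seed)
    fixed_answers = {}
    examples = []  # (aleatoric_bit, question_id, answer_bit) triples
    for _ in range(n_questions):
        aleatoric = rng.random() < 0.5
        qid = "".join(rng.choice("01") for _ in range(id_length))
        for _ in range(n_occurrences):
            if aleatoric:
                answer = rng.choice("01")  # fresh random answer each time
            else:
                answer = fixed_answers.setdefault(qid, rng.choice("01"))
            examples.append((int(aleatoric), qid, answer))
    return examples
```

Epistemic answers are memorizable because they are a deterministic function of the question ID; aleatoric answers carry no learnable signal.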

**Unsupervised uncertainty classification (real example)**

In the synthetic setting, it is simple to determine whether a token is “aleatoric” or “epistemic”; the model only needs to attend to the first bit of the question, and can change its behavior accordingly. Is this possible in natural language, where this information (insofar as tokens are even strictly “aleatoric” or “epistemic” at all) is much more diffuse? If so, do large language models learn to respect the distinction?

To some extent, the answer to both questions seems to be “yes”! We test them with the following technique, which we term the “In-Context Learning Test,” or ICLT for short. Here’s a walkthrough of how it works:

**In-Context Learning Test Procedure using the example “Vänern is the largest lake in __”:**

Suppose “Vänern is the largest lake in __” is the original prompt.

**Step 1**: We first use the small model to generate candidate next-token predictions from the original prompt. The small model outputs the following three predictions, with probabilities in descending order: Sweden (p=0.4), Norway (p=0.4), Europe (p=0.2).

**Step 2: **

- We add each of the three tokens (Sweden, Norway, Europe) to the original prompt to form three completed prompts.
- Completed prompt 1: “Vänern is the largest lake in Sweden.”
- Completed prompt 2: “Vänern is the largest lake in Norway.”
- Completed prompt 3: “Vänern is the largest lake in Europe.”

- For each of the three completed prompts, we prepend it to the original prompt, to form a new repeated prompt with in-context information. We insert a separator token in between repetitions to mimic how independent documents are packed into a single context during the model’s pre-training.
- Repeated prompt 1: “<BOS>Vänern is the largest lake in Sweden.<BOS>Vänern is the largest lake in __”
- Repeated prompt 2: “<BOS>Vänern is the largest lake in Norway.<BOS>Vänern is the largest lake in __”
- Repeated prompt 3: “<BOS>Vänern is the largest lake in Europe.<BOS>Vänern is the largest lake in __”

- We use the small model to separately generate next-token predictions using the three repeated prompts. The extra in-context information may alter the model’s prediction. The small model’s outputs from the three repeated prompts might be:
- Repeated prompt 1 generation: Sweden (p=0.9), Norway (p=0.05), Europe (p=0.05)
- Repeated prompt 2 generation: Sweden (p=0.05), Norway (p=0.9), Europe (p=0.05)
- Repeated prompt 3 generation: Sweden (p=0.30), Norway (p=0.40), Europe (p=0.30)

- In other words, after seeing the information provided, the model becomes more confident in predicting certain tokens, and less confident in others.

**Step 3:** We compute the entropy for all the repeated prompt generations, and use the minimum of these entropies as a scalar value for solving the classification task (i.e., predicting the large model’s uncertainty). In particular, we learn a simple threshold classifier on this scalar.
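The three steps above can be sketched as follows, treating the language model as a black-box function from a prompt to a next-token distribution (the `model` interface and separator handling are our simplification of the procedure):

```python
import math

def dist_entropy(dist):
    """Entropy (in nats) of a {token: probability} distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def iclt_score(model, prompt, top_k=3, sep="<BOS>"):
    """In-Context Learning Test: for each top-k candidate completion,
    prepend 'prompt + candidate' as context and re-query the model.
    Returns the minimum entropy over the repeated prompts; a low score
    suggests the model's original uncertainty was epistemic."""
    base = model(prompt)
    candidates = sorted(base, key=base.get, reverse=True)[:top_k]
    repeated_entropies = []
    for token in candidates:
        repeated = f"{sep}{prompt}{token}.{sep}{prompt}"
        repeated_entropies.append(dist_entropy(model(repeated)))
    return min(repeated_entropies)
```

A simple threshold classifier on this scalar then predicts the large model's uncertainty label without ever querying the large model.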

Let’s visualize the results of the ICLT on two hand-picked prompts.

The first (“epistemic-like”) prompt is: “36-42 Coney Street is a historic terrace in the city centre of __”, for which there is a single correct answer: if the model has a lot of uncertainty, it is mostly epistemic.

For our second (“aleatoric-like”) prompt we will use “Bulgurluk is a village in Genç District, Bingöl Province, Turkey. The village is __”, for which there is plenty of aleatoric uncertainty.

In the plot, we first show the original predicted probabilities in gray bars for the top 10 predictions. We then show the probability of the model repeating the same token when prompted with “<BOS>original prompt + token.<BOS>original prompt”. On the “epistemic-like” prompt, probabilities rise significantly, signifying that the model is “copying” from its input. In the aleatoric case, the model is more reluctant to modify its predictions based on the new information provided in the context.


Applying ICLT to LLaMA models as a classifier for the same uncertainty labels we used in the supervised case, we find that it outperforms naive entropy baselines on the Wikipedia test set (AUC ~ 0.70 compared to ~ 0.55). This is a proof-of-concept result that unsupervised uncertainty disentanglement might be possible for real LLMs.

However, it fails on Pythia models, which seem to be more prone to repeating information no matter what. In the future, we plan to test the method on more diverse models and determine more precisely when it fails.

In summary, we’ve introduced 1) a new way to label different types of uncertainty in unconstrained text and 2) simple methods to predict those labels, including a completely unsupervised approach. More work needs to be done before they can be transformed into practical tools to detect uncertainty in the wild; specifically, the recall of our classifiers on realistic, unbalanced data is still too low. If that point can be reached, however, they could become part of a solution for making language models more transparent about what they know and less susceptible to hallucinations. For more about our setup, methods, and findings, check out our paper!

The post Distinguishing the Knowable from the Unknowable with Language Models appeared first on Kempner Institute.

The post Repeat After Me: Transformers are Better than State Space Models at Copying appeared first on Kempner Institute.

Recently, State Space Models (SSMs) have emerged as a challenger to the Transformer architecture. These models can be interpreted as a type of recurrent neural network (RNN), which uses a fixed-size memory that does not grow with the sequence length. This makes training and inference on long sequences much more efficient, opening up the possibility of feeding extremely long inputs, such as entire libraries, audio samples or DNA sequences, directly into the model.

Mamba, which is the leading SSM architecture, has demonstrated very impressive performance in language modeling. Remarkably, the paper that introduced the Mamba model demonstrated that it achieves *better* performance than competing Transformer models in many settings. Should we therefore abandon transformers in favor of a more efficient and better performing architecture?

In a new preprint, we (Samy Jelassi, David Brandfonbrener, Sham Kakade, Eran Malach) show that the improved efficiency of SSMs inevitably sacrifices some core capabilities that are crucial for modern language modeling. Specifically, we identify one particular capability that is sacrificed: the ability to retrieve and repeat parts of the input context. This capability plays a key part in few-shot learning and retrieval, two tasks that are ubiquitous in foundation models. Using theory and experiments, we show that Mamba models are inferior to Transformer-based language models on a variety of tasks that involve copying and retrieval. Importantly, we argue that this is not due to design flaws in the Mamba model; in fact, any model with a fixed memory size will suffer from the same issues.

Before introducing our results, we begin with a quick review of the memory considerations of Transformers, and how they compare to SSMs. Recall that the Transformer architecture takes a sequence of tokens as input and maps each token to a vector representation with some hidden dimension $d$. The model then alternates between token-level operations (represented with an MLP) and token-mixing operations (the attention layers). Therefore, for an input of length $n$, the output of each block is of size $d \times n$. In particular, if we generate text auto-regressively token-by-token, then the size of the memory for storing the activations grows linearly with the number of generated tokens.

State-space models operate differently. Instead of performing operations over all previously observed tokens, SSMs effectively âcompressâ their inputs into a fixed-size latent state. This latent state is passed from one iteration to the next, but importantly does not grow in size when generating longer sequences. Therefore, SSMs are much more efficient when processing long inputs.
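The contrast can be made concrete with a back-of-the-envelope count of stored activations per layer (a simplification that ignores attention heads, numeric precision, and any convolutional state):

```python
def transformer_kv_cache_floats(n_tokens, d_model, n_layers):
    """Transformer KV cache: one key and one value vector of size d_model
    per token per layer, so memory grows linearly in sequence length."""
    return 2 * n_tokens * d_model * n_layers

def ssm_state_floats(d_state, d_model, n_layers):
    """SSM latent state: a fixed (d_state x d_model) state per layer,
    independent of how many tokens have been processed."""
    return d_state * d_model * n_layers
```

Doubling the sequence length doubles the Transformer's cache but leaves the SSM's state untouched, which is exactly the efficiency advantage, and, as argued next, the capacity limitation.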

In our theoretical results, we focus on the very simple task of *copying* the input text: we give the model an arbitrary sequence of tokens as input, and ask it to repeat the sequence verbatim. We prove two results. First, we show that a small Transformer can be used to copy extremely long sequences. Second, we show that any language model with fixed-size memory (i.e., any SSM) fails to copy long random strings.

Let us consider how a small Transformer can potentially copy very long input sequences. The idea is a generalization of the *induction head* mechanism, described by Olsson et al. (2022) as “a circuit whose function is to look back over the sequence for previous instances of the current token, find the token that came after it last time, and then predict that the same completion will occur again”. More generally, we show that Transformers can look back for occurrences of patterns of n tokens (n-grams), and complete the pattern based on the token that appears after the same n-gram. By repeating this process, Transformers are able to correctly copy very long input sequences, each time matching a small pattern in order to find the next token.
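The generalized induction-head mechanism can be written as an explicit algorithm (an algorithmic sketch of the circuit's behavior, not a Transformer implementation): seed the output with the first n tokens, then repeatedly look up the most recent earlier occurrence of the last n generated tokens and emit the token that followed it.

```python
def copy_via_ngram_lookup(sequence, n=2):
    """Copy `sequence` by n-gram matching: at each step, find the latest
    occurrence of the last n generated tokens inside the input and emit
    the token that followed that occurrence."""
    output = list(sequence[:n])  # the first n tokens seed the copy
    while len(output) < len(sequence):
        context = tuple(output[-n:])
        next_token = None
        for i in range(len(sequence) - n - 1, -1, -1):
            if tuple(sequence[i:i + n]) == context:
                next_token = sequence[i + n]
                break
        if next_token is None:
            break  # n-gram never seen: the mechanism stalls
        output.append(next_token)
    return output
```

When every n-gram in the input is unique, the copy is exact; repeated n-grams can send the lookup to the wrong continuation, which is why matching longer patterns (larger n) makes the mechanism more reliable on long random strings.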

SSMs, on the other hand, have a fixed-size memory, so intuitively they cannot store (and copy) inputs that are too long to fit in their memory. If the model doesn’t have enough capacity to store the input, then its copy will likely contain errors.

These two observations demonstrate the theoretical gap between Transformers and SSMs: while the former can easily copy very long input sequences, the latter struggle to copy any sequence that does not fit in their memory.

We showed that in theory, small Transformers can copy long inputs, while SSMs cannot. We now turn to training actual models on the copy task, testing their ability to perfectly copy their input. That is, we randomly sample a sequence of up to 300 letters, and train Transformer and SSM-based causal language models to repeat the sequence. We observe that a small Transformer quickly learns to perfectly repeat the input string, while a Mamba model of a similar size fails. Even when we increase the size of the hidden state of the Mamba model so that it can store the input sequence, Mamba takes much longer to train, requiring 100x more data to learn copying compared to a Transformer based model.

We find that Transformer models are also much better than SSMs at generalizing to inputs that are longer than the ones seen during training on the copy task. We compare Transformers and SSMs trained to copy strings of length < 50, and show that while SSMs fail to copy long strings that were not present during training, Transformers can accurately copy strings of length 100 and more. When we equip the Transformer with an improved positional embedding (Hard-ALiBi) motivated by our theoretical results (see more details in the paper), we observe that it maintains accurate performance when copying strings up to length 1000!

We now turn to study Transformer and Mamba-based models that are pre-trained on natural language datasets. Specifically, we compare a suite of Mamba and Pythia (Transformer-based, Biderman et al. (2023)) models, both of which are trained on The Pile (Gao et al. (2020)). We compare models of varying parameter count on a variety of tasks that involve copying and retrieval from the input context. We find that overall, pre-trained Transformers can outperform Mamba models with 10x more parameters on tasks that require information retrieval.

**Phonebook retrieval:** We test the ability of the pre-trained models to perform retrieval from the input context by presenting each model with a “phonebook”, asking it to retrieve the phone number of a particular individual. Namely, the model gets a list of random phonebook entries: “John Powel: 609-323-7777”, and is then asked for the phone number of a random person from the phonebook. Below we show that Transformer-based models perform this task much better than Mamba models of much larger size.
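Concretely, a prompt for this task might be assembled as follows (a sketch: the “John Powel” entry is from the example above, while the other entry, the function name, and the question phrasing are invented for illustration):

```python
import random

def make_phonebook_prompt(entries, rng):
    """List every 'Name: number' entry, then ask for one random person's number."""
    lines = [f"{name}: {number}" for name, number in entries]
    target_name, target_number = rng.choice(entries)
    prompt = "\n".join(lines) + f"\nWhat is {target_name}'s phone number?"
    return prompt, target_number

book = [("John Powel", "609-323-7777"), ("Jane Doe", "215-555-0199")]
prompt, answer = make_phonebook_prompt(book, random.Random(0))
```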

**Copying natural language:** In this experiment, we provide the models with chunks of text sampled from the C4 dataset, a large corpus of natural language data. We provide the models with two repetitions of the same text chunk, followed by the first word from the text, and expect the model to generate an additional copy of the input text. We report string-level accuracy, measuring the probability of perfectly copying the input string. Indeed, we observe that Transformer-based models reliably copy their input text, and that the accuracy of larger models remains high even for longer inputs. Mamba models, however, quickly degrade in performance when asked to copy long strings.

**Question answering with long context:** In our final experiment, we compare the 2.8B-parameter Mamba and Transformer models on the SQuAD question-answering dataset. This dataset provides text paragraphs of varying lengths together with a few questions regarding the text. We test both the Mamba and Transformer model on questions from this dataset, plotting the F1 score of their answers as a function of the paragraph length. We observe that while for short paragraphs, both the Pythia Transformer and Mamba achieve comparable performance, the performance of Mamba degrades with the paragraph length, while the transformer-based model maintains a similar accuracy even for longer texts.

Our paper demonstrates, through theory and experiments, that Transformers are better than state space models at copying from their input context. However, we emphasize that SSMs have many advantages over Transformers. The memory and computational complexity of SSMs does not increase with the input length, which is ideal for training and inference on long inputs. Additionally, state space models such as RNNs are better at tracking state variables across long sequences, which may be useful for generating long consistent text. Importantly, language processing in the human brain appears to be much more similar to how state space models process language. We therefore believe that future work should focus on building hybrid architectures that endow state space models with an attention-like mechanism, allowing them to retrieve relevant pieces of text from their input. Indeed, humans have an incredibly limited capacity for memorizing sequences, but can translate entire novels if we allow them to look back at the text.

The post Repeat After Me: Transformers are Better than State Space Models at Copying appeared first on Kempner Institute.

The post Where Do Features Come From? appeared first on Kempner Institute.

Suppose it is 9am. What will the time be 5 hours from now?

There are many valid ways to solve this problem. For instance:

- **Counting up by one hour five times:** 10am, 11am, 12pm, 1pm, 2pm. (At the fourth step, you needed to use the memorized fact that 1, not 13, follows 12 on a clock.)
- **Addition followed by subtraction:** 9 + 5 = 14. 14 − 12 = 2. So: 2pm.
- **Memorization:** You conveniently remember off the top of your head that 5 hours after 9am is 2pm.
- **Clock visualization:** You envision an analog clock set to 9 o’clock in your mind’s eye, and mentally rotate the hour hand five hours forward. At its new angle, you observe the hand is pointing at 2 o’clock.

It would be straightforward to write code implementing any of the above algorithms for modular addition (in this case, adding two integers modulo 12). But this is going to be a story about deep learning, and the fundamental principle of deep learning is laziness: why intelligently design an algorithm and hand-code it yourself, when you could just feed a bunch of data into an off-the-shelf neural network, train it with an all-purpose optimizer, and watch it *learn* its *own* computational strategy?
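For instance, the first two strategies above could be coded as follows (function names are ours; both display 12 rather than 0, following clock convention):

```python
def add_hours_counting(start, hours):
    # Count up one hour at a time, using the memorized fact that 1 follows 12.
    t = start
    for _ in range(hours):
        t = 1 if t == 12 else t + 1
    return t

def add_hours_arithmetic(start, hours):
    # Addition followed by reduction mod 12.
    t = (start + hours) % 12
    return 12 if t == 0 else t
```

Both agree that 5 hours after 9 is 2.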

Admittedly, the machinery of deep learning isn’t particularly practically useful for clean, synthetic tasks like this; it’s meant for messy tasks like predicting the next word in Internet text or classifying cats and dogs. But our goal will be *understanding* deep learning, and scientific understanding is sometimes easiest to arrive at by studying toy cases.

In any case, training a neural network to perform modular addition turns out to be an interesting exercise. In 2022, Power et al. trained transformers on modular arithmetic tasks and observed that, surprisingly, “long after severely overfitting, validation accuracy sometimes suddenly begins to increase from chance level toward perfect generalization.” Various works since then have tried to understand this befuddling generalization behavior, dubbed “grokking.” But instead of focusing on the question of why grokking occurs, we will focus on a different question: Which modular addition algorithm does the trained network implement, *and why*?

Earlier this year, Nanda et al. empirically investigated the first half of this question. They found that, remarkably, small transformers consistently learn to implement a version of the **clock visualization algorithm**: converting the inputs into cosines and sines of the corresponding angles, and then using trigonometry to add the angles!

The training dataset consists of inputs of the form \((a,b)\), paired with the corresponding target outputs \(c = a+b \bmod p\), where \(a,b,c \in \mathbb{Z}_p\) with \(\mathbb{Z}_p = \{ 0,1,…,p-1 \}\). The algorithm identified by Nanda et al. can be seen as a real-valued implementation of the following procedure:
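As a minimal sketch, this dataset can be enumerated directly (the function name is ours):

```python
def modular_addition_dataset(p):
    """All p^2 input pairs (a, b) with target c = (a + b) mod p."""
    return [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
```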

1. Choose a fixed \(k\). Embed \(a \mapsto e^{2\pi i k a}\), \(b \mapsto e^{2 \pi i k b}\), representing rotations by \(ka\) and \(kb\).

2. Multiply these (i.e. compose the rotations) to obtain \(e^{2 \pi i k(a+b)}\).

3. Then, for each \(c\) in the output, multiply by \(e^{-2\pi i k c}\) and take the real part to obtain the logit for \(c\).

The algorithm fundamentally relies on the following identity: for any \(a, b \in \mathbb{Z}_p\) and \(k \in \mathbb{Z}_p \setminus \{0\}\),

$$(a+b) \textrm{ mod } p = \text{argmax}_{c\in \mathbb{Z}_p} \left\{\cos\left(\frac{2\pi k(a+b-c)}{p}\right)\right\}$$

Moreover, averaging the result over neurons with different frequencies \(k\) results in destructive interference when \(c \neq a + b\), accentuating the correct answer.
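This identity is easy to check numerically (a sketch; the frequencies below are an arbitrary nonzero choice, and summing over several of them illustrates the destructive interference):

```python
import math

def clock_logits(a, b, p, freqs):
    # Logit for each candidate c, summed over frequencies:
    # Re[e^{2pi*ika/p} * e^{2pi*ikb/p} * e^{-2pi*ikc/p}] = cos(2pi*k(a+b-c)/p).
    return [sum(math.cos(2 * math.pi * k * (a + b - c) / p) for k in freqs)
            for c in range(p)]

p = 71
logits = clock_logits(9, 68, p, freqs=[1, 3, 7])
prediction = max(range(p), key=logits.__getitem__)  # (9 + 68) % 71 == 6
```

The logit is maximized exactly at c = a + b, where every cosine equals 1; for p prime and nonzero k, no other c achieves this.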

In a twist on the story, Zhong et al. found that not all neural network architectures use the same procedure: some modified networks learn to implement a related but distinct “pizza algorithm”. But, notably, the pizza algorithm also starts out by calculating sinusoidal functions of the input, which we will refer to as *Fourier features*. This, then, is our primary mystery:

**Why do neural networks have a bias towards using Fourier features?**

To be more mathematical about it, modular addition is a finite group operation, with the group in question being the cyclic group. And Fourier analysis on the cyclic group is a special case of representation theory for general groups. Chughtai et al. presented suggestive evidence that neural networks trained on another group, the symmetric group, learn to convert the inputs into features corresponding to the irreducible representations of the group, which are analogous to Fourier features!

To be precise, [Chughtai et al.](https://arxiv.org/abs/2302.03025) show that one layer ReLU MLPs and transformers learn the task by taking representation matrices $R(a)$, $R(b)$ of group elements $a, b$ and performing matrix multiplication with $R(c^{-1})$ such that the logit at output $c$ is proportional to the *character* $\chi_R(abc^{-1}) = \mathrm{tr}(R(a)R(b)R(c^{-1}))$ which is just the trace of the resulting matrix product. The output logits happen to be maximized precisely when $c = ab$ for all irreducible representations $R$.

So the mystery has deepened.

*Why do neural networks have a bias towards solving finite group tasks using irreducible representations?*

Before we try to solve the mysteries, let’s take a step back. What’s the point?

Trained deep neural networks are famously black boxes. Or are they? Over the past several years, various researchers have peered into large real-world AI models and tried, sometimes with a modicum of success, to explain *how* they compute the things they compute, at a comprehensible level of abstraction. What features and circuits do neural networks learn to employ when they are trained to solve a given task?

This pursuit has gone under various names: BERTology in the NLP community, mechanistic interpretability in the machine learning community. Whenever such an investigation is successful, it raises a further question: *why*? What was it about the architecture and training process that biased the network towards a particular computational strategy?

If we can answer the “why” question, we can gain more leverage on various other questions: Can we predict which mechanisms a network will learn? Can we understand why different mechanisms are favored at different stages of training? Can we intervene on the learning process to modify the mechanisms, to make them more robust, safe, fair, etc.?

The modular addition task has served as a relatively tractable case study for mechanistic interpretability. It is thus a natural choice of case study for that *why* question. So letâs begin our investigation.

First, we will simplify the setting down to the essentials: just an MLP with no biases, consisting of an embedding layer, an activation function, and an unembedding layer.

For simplicity, we train these networks using population gradient descent on the full distribution. As was demonstrated by Gromov, the Fourier feature emergence still happens!

Below, we visualize how the embedding weights and their Fourier power spectrum evolve throughout training on the mod-71 task (with $L_2$ regularization), when ReLU activations are used:

We can see that the embedding weights for each neuron become periodic, with almost all of the Fourier spectrum concentrated on a single frequency! But the vectors aren’t quite pure sinusoids: they aren’t smooth (because, we suspect, of the ReLU activations).

Let’s replace the ReLU activations with quadratic activations $x^2$. This makes the phenomenon much cleaner (and easier to analyze!) ^{[1]}

What seems to be happening is that as training progresses, the network approaches a limit, and in this limit each neuronâs embedding vector is a pure sinusoid. This is also true for the unembedding vectors. In fact, for each neuron, the frequency of its embedding and unembedding vectors is the same.

Given that the phenomenon is exhibited in such a pure form in MLPs with quadratic activations, one may hope for an elegant mathematical explanation.

This is where we’re going to bring in an important insight from deep learning theory: the inductive bias of neural networks toward *maximum margin* solutions. A maximum margin solution is a setting of the network weights that minimizes the network’s total weight norm, subject to classifying every data point correctly with a given confidence (or “margin”).

Consider a neural network \(f(\theta; x)\), where \(\theta\) and \(x\) represent its parameters and input respectively. For a given norm \(|| \cdot ||\), let \(\Theta = \{ \theta: || \theta || \leq 1 \}\). The maximum normalized margin of the network with respect to the given norm, when trained on a multi-class classification task with dataset \(D\) is defined as

$$\max_{\theta \in \Theta} \min_{(x,y) \sim D} f(\theta; x)[y] - \max_{y' \in \mathcal{Y}\backslash y} f(\theta; x)[y']$$

where \(\mathcal{Y}\) represents the set of classes.
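As a sketch, the inner minimum of this definition (the margin of one fixed, already-normalized network on the dataset) can be computed as follows, with `logits_fn` standing in for \(f(\theta; \cdot)\):

```python
def dataset_margin(logits_fn, data, num_classes):
    # min over (x, y) of  f(x)[y] - max_{y' != y} f(x)[y']
    margins = []
    for x, y in data:
        out = logits_fn(x)
        runner_up = max(out[c] for c in range(num_classes) if c != y)
        margins.append(out[y] - runner_up)
    return min(margins)

# Toy check: a two-class "network" with fixed outputs per input.
toy_outputs = {0: [1.0, 0.2], 1: [0.1, 0.9]}
margin = dataset_margin(lambda x: toy_outputs[x], [(0, 0), (1, 1)], num_classes=2)
```

The maximum margin is then the supremum of this quantity over the unit ball of weights, which is the part our paper computes analytically.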

In particular, a result of Wei et al. implies that standard training with sufficiently small regularization tends towards the maximum margin solution.^{[3]}

In our paper, we present a suite of theoretical techniques for deriving the *precise value* of the maximum margin. In the below plots, we show that, empirically, the margin of the network indeed approaches the derived value over the course of training!

So we can predict the value of the margin… but does that actually imply anything about the learned circuit?

Yes. We are able to prove that for the task of addition mod *p*, if the network has width at least 4(*p* − 1) and achieves the maximum margin, then all of the weight vectors must be sinusoids, precisely of the following form:

$$u(a) = \lambda \cos(\theta_u^* + 2 \pi ka/p), \quad v(b) = \lambda \cos(\theta_v^* + 2 \pi kb/p), \quad w(c) = \lambda \cos(\theta_w^* + 2 \pi kc/p),$$

where $\lambda \in \mathbb{R}$ is some constant, $k \in \left\{1, \dots, \frac{p-1}{2}\right\}$ is the frequency of the neuron, and $\theta_u^*,\theta_v^*,\theta_w^*$ are phase offsets satisfying $\theta_u^* + \theta_v^* = \theta_w^*$.

Moreover, we prove that *every* frequency is used by *some* neuron.
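The characterization can be verified numerically: build a quadratic-activation network whose weights are exactly these sinusoids and check that its argmax computes addition mod p. (The specific phase offsets below, with \(\theta_u^* = \theta_v^*\) and \(\theta_w^* = 2\theta_u^*\) spread over several values, are one choice satisfying the phase constraint; the theorem allows others.)

```python
import math

def clock_network_logits(p, a, b, phases=8):
    # Each neuron computes w(c) * (u(a) + v(b))^2; we sum over every
    # frequency k and a spread of phase offsets theta.
    logits = [0.0] * p
    for k in range(1, (p - 1) // 2 + 1):
        for j in range(phases):
            theta = math.pi * j / phases          # theta_u = theta_v = theta
            u = math.cos(theta + 2 * math.pi * k * a / p)
            v = math.cos(theta + 2 * math.pi * k * b / p)
            act = (u + v) ** 2                    # quadratic activation
            for c in range(p):
                w = math.cos(2 * theta + 2 * math.pi * k * c / p)  # theta_w = 2*theta
                logits[c] += w * act
    return logits

logits = clock_network_logits(13, 5, 11)
prediction = max(range(13), key=logits.__getitem__)  # (5 + 11) % 13 == 3
```

Averaging over the phases cancels every term except ones peaked at c = a + b, so the argmax is correct for all input pairs.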

How did we calculate the max-margin value and characterize the maximum margin solutions? The central tool is the max-min inequality. Consider the definition of normalized maximum margin:

$$\max_{\theta \in \Theta} \min_{(x,y) \sim D} f(\theta; x)[y] - \max_{y' \in \mathcal{Y}\backslash y} f(\theta; x)[y']$$

Letting \(Q\) be the set of distributions defined over \((x,y) \in D\), we can rewrite the definition above as

$$\max_{\theta \in \Theta} \min_{q \in Q} \mathbb{E}_{(x,y) \sim q} \left[f(\theta; x)[y] - \max_{y' \in \mathcal{Y}\backslash y} f(\theta; x)[y']\right]$$

Let \(\gamma_\theta^q = \mathbb{E}_{(x,y) \sim q} \left[f(\theta; x)[y] - \max_{y' \in \mathcal{Y}\backslash y} f(\theta; x)[y']\right]\). Then, the max-min inequality implies that

$$ \max_{\theta \in \Theta} \min_{q \in Q} \gamma_\theta^q \leq \min_{q \in Q} \max_{\theta \in \Theta} \gamma_\theta^q.$$

Our technique aims at finding a certificate pair \((\theta^*, q^*)\) such that

$$ q^* \in \text{argmin}_{q \in Q} \gamma_{\theta^*}^q \text{ and } \theta^* \in \text{argmax}_{\theta \in \Theta} \gamma_\theta^{q^*}.$$

If such a pair exists, the “max-min property” holds: the above inequality becomes an equality, with the optimal value given by \(\gamma_{\theta^*}^{q^*}\).

In order to find a certificate pair, we reduce the problem from an optimization of the *full network* to an optimization over a single neuron considered in isolation. For details, refer to Section 3 of our paper.

Thus, we have a resolution to the mystery in our setting:

**Neural networks have a tendency to approach maximum margin solutions, and every maximum margin solution uses Fourier features.**

Furthermore, we were able to extend our max-margin analysis from modular addition (the cyclic group) to other finite groups, explaining the empirical results of Chughtai et al.! What is the “analogous” result here? Basically, instead of all *frequencies* being used, all group *representations* are used. Moreover, each neuron only uses a *single* representation. For more details, see Section 6 of our paper.

We also derived results of a similar flavour for the **sparse parity setting** studied in works such as Daniely et al., Barak et al., and Edelman et al.

We have shown that at least for simple algebraic tasks and simple neural networks, we can actually explain *where features come from* as a consequence of a known inductive bias of deep learning.

What are the prospects for understanding where features come from in general? *If* we can explain why neural networks prefer certain circuits over others, this can have significant implications:

- To what extent are learned circuits *universal*, and to what extent are they sensitive to the architecture and learning algorithm?
- Can we modify any aspects of the learning process to favor circuits that are more interpretable, robust, fair, or have other desired properties?
- In some cases, like training transformers on standard arithmetic, the “right” algorithm (i.e., one that generalizes off the training distribution) isn’t learned by default. If we can explain success stories of algorithm learning, can we also explain the failure cases?

We are hopeful that better understanding the inductive biases of neural networks will lead to a better understanding of feature learning.

- Note that the clock visualization algorithm can still be expressed using quadratic activations. One could further ask if changing to this architecture makes studying the solutions uninteresting, in the sense that this is the only solution it can express. In the Appendix of our paper, we show that the network can still express a memorizing solution even with quadratic activations.
- $L_{2,3}$ norm: Consider a neural network of width $m$, and let the parameters associated with the $i^{th}$ neuron be represented by $u_i, v_i$ and $w_i$. Then the $L_{2,3}$ norm of the network is defined as $\sum_{i=1}^m (||u_i||^2 + ||v_i||^2 + ||w_i||^2)^{3/2}$. In a technical sense, this norm is the “natural” norm for quadratic activations, in the same sense that the $L_2$ norm is natural for ReLUs.
- Let $\gamma^*$ represent the maximum normalized margin of the network with respect to a norm $|| \cdot ||$. Under mild assumptions on $f$, the normalized margin of the global minimizer of the loss given by $\mathbb{E}_{(x,y) \sim D} \ell(f(\theta; x), y) + \lambda \|\theta\|^r$ approaches $\gamma^*$ as $\lambda \to 0$. Here $\ell$ represents the standard cross-entropy loss and $r > 0$.


The post Watermarking in the Sand appeared first on Kempner Institute.

Before we can ask the question of whether watermarking is possible, we need to define what watermarking is. Given a generative model for text, code, images, etc., a watermarking scheme consists of a pair of algorithms. The **watermarking embedding** algorithm modifies the model to “plant” a statistical signal in the output. The **watermarking detection** algorithm gets an output and detects whether or not it contains a watermark. A watermarking detection algorithm can be either *public* (everyone can run it) or *private* (requiring a secret key to run it).

To understand the security of watermarks, we need to consider the point of view of the *attacker*. Think of a student trying to plagiarize an essay or an agent trying to generate misinformation without the output being detected. The attacker provides a prompt X and gets a watermarked output Y. The attacker’s goal is to leverage Y to find a different output Y′ that has the **same quality** as Y (in terms of how well it answers the prompt) but is **not watermarked**.

A **strong watermarking scheme** is one that can resist all attacks by a computationally bounded attacker, and in particular, one that only has *black-box access* to the watermarked model and no white-box access to any other model of comparable capabilities. We can contrast this with **weak** watermarking schemes that restrict not only the capabilities of the attacker but also the *set of transformations* that it is allowed to make to Y. For example, we might restrict the attacker to change at most a certain fraction of the words in a piece of text or bound the changes it can make to each pixel in an image. Several proposed watermarking schemes satisfy the requirements of weak watermarking for transformations such as these. While weak watermarking schemes can have some uses (e.g., for preventing accidental training on AI-generated data), most adversarial settings require the security of *strong* watermarking schemes, and this is what we focus on in our work.

Our main result is that under natural assumptions, **strong watermarking schemes are impossible to achieve**. That is, there is a *general attack* for any such scheme that can be implemented by an adversary with only black-box access to the watermarked model and white-box access to weaker models. This holds for schemes with both private and public detection algorithms. In the rest of this blog post, we describe our assumptions and why we believe they already hold in many settings and will only become more likely as model capabilities and modalities increase. We then describe our generic attack framework. Finally, we describe an implementation of our attack for several recently proposed watermarking schemes for language models. We demonstrate that the attack can reliably remove watermarks with only a minor degradation in quality.

In our paper, we describe a generic attack on any watermarking scheme. For the attack to work, we need the following two assumptions to hold:

- **Verification is easier than generation:** Even if an attacker is not able to generate high-quality outputs on its own, they are able to *verify* the quality of outputs. There are several ways to instantiate this assumption. One is to use weaker open-source models for verifying quality. The other is to simply ask the watermarked language model itself whether the output has high quality.
- **The space of high-quality outputs is rich:** The second assumption is that, rather than consisting only of a single high-quality output, the set of potential outputs is *rich* (in a technical sense, described in the paper). While this assumption does not always hold (for example, if a prompt is a math question, there may well be only a single correct answer), it is *necessary* for watermarking in the first place. After all, in order for a watermarking scheme to have a low false-positive rate, it needs to be the case that the vast majority of possible high-quality outputs wouldn’t be detected as watermarked.

Technically, the two assumptions boil down to assuming that the attacker has access to two “black boxes” or “oracles”: a **quality oracle** that evaluates the quality of a response Y to a prompt X, and a **perturbation oracle** that enables the attacker to perform a “random walk” on the space of outputs.

Our main theoretical result is that under these assumptions, there is a *generic attack* that can break any watermarking scheme. We also show a “proof of concept” by implementing the attack to break several recent watermarking schemes for large language models. While our implementation is not the most efficient way to attack these models, its generality demonstrates the feasibility of the framework. Moreover, perhaps most concerning, our assumptions will only **become more likely to hold** as AI model capabilities increase.

The idea behind our proof is simple: we use the combination of the quality and perturbation oracles to implement a **quality-preserving random walk** on the space of all outputs. That is, using “rejection sampling” we can attempt to perturb the output and reject the perturbation if it decreases quality. The “richness” of the space of high-quality outputs corresponds to a *mixing* property of the random walk. Hence the attacker is guaranteed that the output will eventually be *random* over this rich space. As such, even if the attacker cannot run the (private) detection algorithm, they are guaranteed that the probability that the output is watermarked will eventually converge to the false positive rate.
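Schematically, the attack loop is just rejection sampling (a toy sketch; in the real attack, `perturb` and `quality` are the model-based oracles described above, and the outputs are texts rather than integers):

```python
import random

def remove_watermark(y0, perturb, quality, q_min, steps, rng):
    # Quality-preserving random walk: propose a perturbation, keep it only
    # if the quality oracle says the result is still a good answer.
    y = y0
    for _ in range(steps):
        candidate = perturb(y, rng)
        if quality(candidate) >= q_min:
            y = candidate
    return y

# Toy instantiation: "outputs" are integers and quality is membership in a set.
high_quality = set(range(10))
result = remove_watermark(
    y0=5,
    perturb=lambda y, rng: y + rng.choice([-1, 1]),
    quality=lambda y: 1.0 if y in high_quality else 0.0,
    q_min=1.0,
    steps=200,
    rng=random.Random(0),
)
```

If the walk mixes, the final output is close to uniform over the high-quality set, so the chance it carries the watermark falls to the detector’s false-positive rate.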

To demonstrate the feasibility of our attack, we implement it for several recently proposed watermarking schemes applied to Llama 2: 1) Kirchenbauer, Geiping, Wen, Katz, Miers, and Goldstein (2023) (called “UMD” henceforth), 2) Zhao, Ananth, Li, and Wang (2023) (henceforth “Unigram”), 3) Kuditipudi, Thickstun, Hashimoto, and Liang (2023) (henceforth “EXP”). We consider settings in which the watermark can be reliably inserted (with p-value < 0.05: less than 5% false positive) and show that our attack can reliably remove the watermark.

We measure quality degradation by asking GPT4 to compare the original and the final output. We score the output +1 if GPT4 strongly prefers it to the original, -1 if GPT4 strongly prefers the original, and zero otherwise. We measure the average value across multiple examples (and so +1 means that the judge strongly prefers the adversary’s output to the original 100% of the time, and -1 means that the judge strongly prefers the original output to the adversary’s 100% of the time). The expected value is listed in the table above, demonstrating a mild degradation in quality. We stress that we do *not* use GPT4 in the course of the attack.

As we perform the random walk, the z-score (# of standard deviations away from the expectation in non-watermarked text) steadily decreases. The quality is generally stable, though with some amount of noise in both directions. We expect that better implementations of the quality oracle, as well as early-stopping heuristics, can ensure more stability and less degradation. (Figure above is for the UMD scheme, averaged over 12 samples.)

We instantiate our **perturbation oracle** with T5-XL v1.1, which we use to mask sequences of text and propose alternative completions. We use a combination of a reward model (RoBERTa-v3 large fine-tuned on the OpenAssistant dataset) and calls to the GPT 3.5 API to implement our **quality oracle**. There are many other possible choices for implementations of the perturbation and quality oracles, and future attacks will likely only get better. Our experiments are meant as a general âproof of conceptâ rather than providing the most efficient way to attack any particular scheme.

Cryptographers often say that “attacks only get better,” and this saying is likely to hold for watermarking as well. In fact, there is an asymmetry that favors the attacker: as models become stronger and more flexible, this does not make it easier to embed watermarks, but it does enable new ways to implement the perturbation and quality oracles. For example, new multi-modal APIs could be used to implement quality and perturbation oracles in more flexible ways and for more modalities such as images. In particular, because a quality oracle can be implemented by API calls that have yes/no or single-number responses, while a watermarking scheme requires a high-entropy output, an adversary can use an API for such an oracle even if the underlying model is watermarked.

The bottom line is that watermarking schemes will likely not be able to resist attacks from a determined adversary. Moreover, we expect that this balance will only shift in the attackerâs favor as model capabilities increase. Future regulations should be based on realistic assessments of what watermarking schemes are and are not able to achieve.

**Note:** We believe that investigating the possibilities of watermarking schemes at this stage can help to provide a better understanding of the inherent tradeoffs, and give policymakers realistic expectations of what watermarking can and cannot provide. While our techniques can be used to remove watermarks from existing schemes, they are not the most efficient way to do so, with the benefit being generality rather than efficiency. Moreover, our implementation is for text generation models, while currently widely deployed watermarks are for image generation models. While it is possible to adapt our ideas to attack deployed image generative models, we do not provide a recipe for doing so in this paper. Thus, our work isn’t likely to be used by malicious actors. Rather, we see exposing fundamental weaknesses in the watermarking paradigm as a contribution to the ongoing discussion on how to mitigate the misuse of generative models. We hope our findings will be taken into account by organizations building generative models and policymakers regulating them.


The post A Next-Generation Architecture for Elastic and Conditional Computation appeared first on Kempner Institute.

We illustrate the status quo of model deployment using the example of the Llama series of language models, ranging from 7B to 65B parameters. Consider a scenario where the deployment constraints allow for a 50B-parameter Llama model, but only a 33B variant is available. This forces users to settle for a less accurate model despite having a larger latency budget. While model compression can alleviate this issue, it often requires additional training for each extracted model. Moreover, inference-time techniques like speculative decoding (Leviathan et al., 2023) and model cascades require co-locating models on the same device. Models trained independently incur significant overhead for this colocation during inference and are not behaviorally consistent with one another, which is detrimental to quality.

*To make the deployment of these large foundation models amenable to resource constraints, from on-device to the cloud, we desire a universal model that can be used to extract smaller yet accurate models for free, depending on the accuracy-vs-compute trade-off of each deployment constraint.*

We introduce MatFormer, a next-generation model architecture that scales as reliably as vanilla Transformers while satisfying these desiderata. We do this by building upon **Matryoshka Representation Learning (MRL)** (Kusupati et al., 2022), a simple nested training technique that induces flexibility in the learned representations. More specifically, MRL optimizes a single learning objective across *g* nested subvectors of a dense representation produced by the model. These subvectors are formed from the first k dimensions of the model representation, and each is as accurate as a baseline model trained specifically for that granularity. MRL subvectors that are not explicitly optimized for can also be extracted by picking the necessary number of dimensions, enabling flexible dense representations tailored to specific deployment constraints.
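The nested objective can be sketched as follows (a simplified stand-in: one tiny linear head per granularity and plain cross-entropy; the actual MRL setup differs in details such as weight sharing and per-granularity loss weights):

```python
import math

def cross_entropy(logits, label):
    # Numerically stable negative log-softmax at the true label.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[label]

def mrl_loss(z, label, heads):
    # Sum one classification loss per nested prefix z[:k] of the representation.
    total = 0.0
    for k, weight_matrix in heads.items():
        logits = [sum(w * x for w, x in zip(row, z[:k])) for row in weight_matrix]
        total += cross_entropy(logits, label)
    return total

# Two classes, granularities k = 1 and k = 2 (toy weights for illustration).
heads = {1: [[1.0], [-1.0]], 2: [[1.0, 0.5], [-1.0, -0.5]]}
loss = mrl_loss([2.0, 1.0], label=0, heads=heads)
```

Because every prefix is trained against the same objective, any first-k slice of the representation remains usable on its own.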

With MatFormer, we bring the nested structure of MRL from the activation/representation space to the parameter/weight space. For simplicity, we focus on the Feed-Forward Network (FFN) component of the Transformer block, given that, as we scale Transformers up, the FFN can account for upwards of 60% of the inference cost.

In the figure below, we illustrate training four different granularities within the Transformer block: XL, L, M, and S. We jointly optimize for these different granularities with exponentially spaced hidden dimension sizes ranging from h to h/8 dimensions. This allows us to reduce the cost of the FFN block by up to 87.5% at inference time. While jointly optimizing multiple granularities results in training overhead, when compared to training these 4 models S, M, L, and XL independently (as is the current norm), MatFormer can be up to 20% faster to train. This can be further improved by employing training optimizations.

While we explicitly optimize for only 4 granularities, we can further extract hundreds of subnetworks that scale linearly with size by using a method called **Mix’n’Match**. Mix’n’Match involves varying the capacities/granularities of MatFormer blocks across different layers during inference. For instance, the first four layers could be XL, while the next three could be S. This simple approach enables the extraction of a vast array of smaller models tailored to specific deployment constraints, all while following accuracy trends without explicit training.
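A toy sketch of the idea (the greedy budget-splitting rule here is our own illustration; the paper selects Mix’n’Match configurations along the accuracy-vs-compute curve):

```python
def mix_n_match(num_layers, hidden, budget, granularities=(1, 2, 4, 8)):
    # Pick a per-layer FFN hidden size from {h, h/2, h/4, h/8} so that the
    # total hidden width across layers fits the given budget.
    config = []
    remaining = budget
    for layer in range(num_layers):
        share = remaining / (num_layers - layer)  # even share of what's left
        fits = [hidden // g for g in granularities if hidden // g <= share]
        size = max(fits) if fits else hidden // granularities[-1]
        config.append(size)
        remaining -= size
    return config

# e.g. a 7-layer model with h = 8192 and a budget of 7 * 4096 total width
config = mix_n_match(7, 8192, 7 * 4096)
```

Each entry of `config` is a per-layer granularity choice; any such assignment yields a valid subnetwork without retraining.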

In our experiments, we show that MatFormer can create scalable decoder-only language models (MatLMs) and vision encoders (MatViT) that can enable accurate and flexible deployment across language and vision domains for generative and discriminative tasks.

For a 2.6B decoder-only MatLM, we find that the optimized smaller models are as accurate as baselines, both on perplexity and downstream evaluations, with even more “free” models obtained using Mix'n'Match that improve predictably with scale.

These smaller MatFormer submodels are significantly more consistent (by 5–10% across model sizes) with the largest model than independently trained models. That is, the smaller subnetworks make predictions that are closer to those of the largest (XL) model. This (1) enables consistent deployment across scales and (2) boosts inference optimization techniques like speculative decoding (a 6% speedup over the baseline) while keeping accuracy the same as the XL model.
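One simple way to make "consistency" concrete is the fraction of positions where a submodel's greedy prediction agrees with the XL model's. The logits below are synthetic, generated purely to illustrate the metric:

```python
import numpy as np

rng = np.random.default_rng(0)
n_positions, vocab = 500, 1000

# synthetic next-token logits: the XL model's, and a submodel's that
# mostly agrees with it up to a small perturbation
xl_logits = rng.standard_normal((n_positions, vocab))
sub_logits = xl_logits + 0.1 * rng.standard_normal((n_positions, vocab))

# consistency = fraction of positions with the same greedy (argmax) prediction
consistency = np.mean(xl_logits.argmax(-1) == sub_logits.argmax(-1))
```

Higher consistency directly helps speculative decoding, since draft tokens proposed by the submodel are more often accepted by the XL verifier.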

Moreover, we train models at sizes ranging from 70M to 2.6B for up to 160B tokens and find that MatFormer scales as reliably as vanilla Transformers and that we can fit a single scaling law for all MatFormer submodels agnostic to the granularity.
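Fitting a single power law of the form $L = C\,N^{-\alpha}$ across submodels amounts to a linear fit in log-log space. The parameter counts and losses below are synthetic, constructed only to show the fitting procedure:

```python
import numpy as np

# hypothetical (parameter count, validation loss) pairs pooled across granularities
N = np.array([7e7, 1.6e8, 4.6e8, 1.1e9, 2.6e9])
loss = 3.2 * N ** -0.08  # synthetic data following L = C * N^{-alpha}

# fit log L = log C - alpha * log N by least squares
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
alpha, C = -slope, np.exp(intercept)  # recovers alpha = 0.08, C = 3.2
```

If all submodels, regardless of granularity, fall on one such line, a single (C, alpha) pair describes the whole family.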

We show that MatFormer works well for modalities other than text – we extend MatFormer to Vision Transformer-based encoders (MatViT), where we also see that we can use Mix'n'Match to extract models that span the accuracy-vs-compute curve.

Finally, MatFormer enables truly elastic adaptive retrieval for the first time. MatFormer allows for elastic query-side encoders based on the resource constraints and the query hardness, potentially impacting large-scale web search. This is possible because MatFormer preserves the metric space, unlike independently trained models, which would additionally need consistency-aware distillation to achieve the same result.

Using MatFormer, we introduce an algorithmic method to elastically deploy large models while being cheaper to train than a series of independent vanilla Transformer models. This method not only allows developers to offer end-users the most accurate models possible but also opens up the possibility of query-dependent conditional computation. In the future, we plan to create elastic methods that can compress the depth of large models, and to design more sophisticated routing algorithms for conditional computation in MatFormer-powered large foundation models. Overall, MatFormer is an efficient, elastic next-generation architecture that enables web-scale intelligent systems.

We have open-sourced the code for MatViT training along with checkpoints.

Finally, we have utilized the Kempner Institute's research cluster for a public reproduction of the MatLM results on OLMo models of up to 1.3B parameters using the Pile corpus; code and models are available below.

The post A Next-Generation Architecture for Elastic and Conditional Computation appeared first on Kempner Institute.
