Understanding Frequency Bias in SGD and Adam's Adaptive Solution

In natural language processing, token distributions are highly skewed: common words like "the" appear frequently, while rare tokens like "thalweg" appear only occasionally. This imbalance creates a problem for standard Stochastic Gradient Descent (SGD), which applies the same learning rate to all parameters. Parameters tied to frequent tokens get updated often and converge quickly, but rare-token parameters receive sparse updates and remain poorly trained. Adam, an adaptive optimizer, addresses this by dividing each parameter's step by a running estimate of that parameter's gradient magnitude (the square root of its averaged squared gradients), so parameters with small or infrequent gradients take relatively larger steps. Below, we explore this frequency bias in detail and how Adam mitigates it.

What is frequency bias in Stochastic Gradient Descent (SGD)?

Frequency bias refers to the tendency of SGD to learn common features rapidly while underfitting rare ones. When training language models, token frequencies vary by orders of magnitude. Under standard SGD, every parameter uses the same learning rate. Parameters for frequent tokens receive gradient updates in almost every batch, so their values move quickly toward the optimum. In contrast, parameters for rare tokens may go hundreds or even thousands of steps without any gradient signal. As a result, those weights stay close to their random initialization, and the model fails to capture the importance of rare but meaningful tokens. This imbalance leads to suboptimal performance, especially in tasks where rare tokens carry critical information.
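To make the sparsity concrete, the following sketch counts how many training steps deliver a non-zero gradient to each token's parameter. The three frequencies, the step count, and the helper name count_updates are illustrative assumptions, not values taken from a real corpus.

```python
import numpy as np

def count_updates(freqs, steps, seed=0):
    """Count the steps in which each token is present (and so gets a gradient)."""
    rng = np.random.default_rng(seed)
    # present[t, i] is True when token i occurs at step t (independent draws).
    present = rng.random((steps, len(freqs))) < np.asarray(freqs)
    return present.sum(axis=0)

# Hypothetical frequencies: a very common token down to a very rare one.
freqs = [0.9, 0.1, 0.001]
counts = count_updates(freqs, steps=10_000)
for f, c in zip(freqs, counts):
    print(f"frequency {f:<6} -> {c} non-zero gradient steps out of 10000")
```

With a shared learning rate, the rarest parameter simply gets far fewer chances to move.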

[Figure: Understanding Frequency Bias in SGD and Adam's Adaptive Solution. Source: www.marktechpost.com]

How does uneven token distribution create optimization challenges?

During training, each parameter is updated only when its corresponding token appears in the current batch. With a heavy-tailed token distribution, common tokens appear in nearly every batch, giving their parameters constant learning signals. Rare tokens, however, appear extremely infrequently, perhaps once in a thousand batches. This creates a hidden challenge: SGD has no way to distinguish a parameter that has converged from one that has simply not received updates. Both get the same learning rate, so rare-token parameters remain underfitted. This is especially problematic for models that must generalize to low-frequency patterns, such as scientific terms or proper nouns. The optimization landscape becomes highly anisotropic, demanding adaptive step sizes to compensate.
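A minimal sketch of a single SGD step shows why absent tokens receive no signal. It assumes a linear model with squared-error loss and a binary presence vector; the function name sgd_step and the example values are hypothetical.

```python
import numpy as np

def sgd_step(w, x, y, lr=0.1):
    """One SGD step on the squared error 0.5 * (w @ x - y) ** 2."""
    error = w @ x - y
    grad = error * x          # zero wherever x is zero (token absent)
    return w - lr * grad, grad

w = np.zeros(3)
x = np.array([1.0, 0.0, 0.0])   # only token 0 is present in this sample
w_new, grad = sgd_step(w, x, y=1.0)
print("gradient:", grad)
print("weights: ", w_new)
```

Only the weight of the present token changes; the other two stay exactly where they were, and the fixed learning rate does nothing to make up for the missed steps.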

What is Adam's variance normalization and how does it work?

Adam's variance normalization is what it adds on top of momentum. It maintains per-parameter moving averages of the gradient (the mean, or first moment) and of the squared gradient (the uncentered variance, or second moment). For each parameter, the update step is the averaged gradient divided by the square root of the averaged squared gradient. If a parameter's gradients are consistently small or sparse, its second-moment estimate will be low, leading to a larger effective learning rate. Conversely, if gradients are large and consistent, the estimate rises, dampening the update. This normalization automatically adjusts step sizes to each parameter's gradient history. It makes Adam particularly effective for problems with sparse features or uneven gradient scales, as it can give rare tokens proportionally larger updates without manual tuning.
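The update described above can be sketched in a few lines of NumPy. This follows the standard Adam formulas with the commonly used default hyperparameters; the function name adam_step and the example gradients are assumptions made for illustration.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
grad = np.array([100.0, 0.01])   # two parameters with very different gradient scales
w, m, v = adam_step(w, grad, m, v, t=1)
print(w)
```

On the first step both parameters move by roughly lr even though their gradients differ by four orders of magnitude, which is the normalization at work.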

How does Adam compensate for rare tokens compared to SGD?

Adam compensates by decoupling the learning rate from the frequency of updates. While SGD applies a fixed learning rate to all parameters, Adam adjusts each parameter's step size based on its gradient history. For a rare token, the moving average of squared gradients stays small because gradients appear sparsely. This results in a larger normalized update each time the token does appear. In contrast, frequent tokens see their variance estimate grow, reducing the effective step size to prevent overshooting. This adaptive behavior allows rare tokens to learn much faster relative to common ones. In practice, Adam often achieves better performance on long-tail distributions because underrepresented features can catch up without disrupting the convergence of frequently updated weights. The optimizer effectively balances the learning rates across parameters with vastly different update frequencies.
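One way to see this compensation is to measure how far Adam moves a parameter per non-zero gradient. The sketch below is a toy setting, not the article's experiment: it assumes a scalar parameter whose gradient is exactly 1.0 on a fixed schedule and 0.0 otherwise, and the helper name adam_movement_per_event is hypothetical.

```python
import numpy as np

def adam_movement_per_event(period, steps=20_000, lr=1e-3,
                            beta1=0.9, beta2=0.999, eps=1e-8):
    """Average distance Adam moves a scalar parameter per non-zero gradient.

    The gradient is 1.0 every `period` steps and 0.0 otherwise.
    """
    m = v = moved = 0.0
    events = 0
    for t in range(1, steps + 1):
        g = 1.0 if t % period == 0 else 0.0
        events += int(g != 0.0)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        moved += lr * m_hat / (np.sqrt(v_hat) + eps)
    return moved / events

frequent = adam_movement_per_event(period=1)      # gradient on every step
rare = adam_movement_per_event(period=1000)       # gradient once per 1000 steps
print(f"frequent: {frequent:.5f} per event, rare: {rare:.5f} per event")
```

Plain SGD with the same unit gradients would move both parameters by exactly lr per event, regardless of how often they fire. Under Adam, the rarely updated parameter travels noticeably farther per event, partly because its second-moment estimate stays small and partly because momentum keeps it moving after the gradient disappears.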

Can you describe a controlled experiment that demonstrates this behavior?

Yes. A simple NumPy experiment can isolate the effect. We define a vocabulary of six tokens with appearance probabilities spanning nearly three orders of magnitude, from 50% of batches down to only 0.1% of batches. Each token is assigned the same ground-truth weight of 1.0, removing semantic complexity. Training samples are sparse binary vectors indicating token presence, and the target is the sum of active weights plus noise. A small linear model is trained using either SGD or Adam. By comparing the final parameter values, the number of non-zero gradient updates each parameter received, and Adam's effective learning rates, we can directly observe how adaptive optimization compensates for frequency imbalance. Under SGD, rare-token weights remain far from 1.0, while Adam brings them much closer, despite identical overall training conditions.

What is the setup of the synthetic experiment?

The experiment uses a six-token vocabulary with frequencies: token A (e.g., "the") appears in 50% of batches, token B in 20%, token C in 10%, token D in 5%, token E in 1%, and token F (e.g., "thalweg") in 0.1%. Each batch is a sparse binary vector of length 6, where a 1 indicates the token is present. The ground-truth target is the sum of the weights of the present tokens (each assigned weight 1.0) plus small Gaussian noise. The linear model has six weights, all initialized to zero. We train for a fixed number of steps using SGD (learning rate 0.1) and Adam (default parameters). We record the final weights, the count of non-zero gradient updates per token, and, for Adam, the effective learning rate (the actual update divided by the gradient). This setup cleanly isolates the effect of token frequency on learning dynamics.
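The setup can be sketched as follows. The token frequencies, zero initialization, and SGD learning rate come from the description above; the step count, noise level, seed, one-sample-per-step training, and the reading of "default parameters" as lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8 are all assumptions, as is the function name run. The printed numbers depend on those choices and should not be expected to match the reported results exactly.

```python
import numpy as np

FREQS = np.array([0.5, 0.2, 0.1, 0.05, 0.01, 0.001])  # tokens A..F, from the text
TRUE_W = np.ones(6)                                    # every token has weight 1.0

def run(optimizer, steps=20_000, noise_std=0.01, seed=0,
        sgd_lr=0.1, adam_lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Train a 6-weight linear model one sample at a time.

    Returns the final weights and the count of non-zero gradients per token.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(6)
    m = np.zeros(6)
    v = np.zeros(6)
    updates = np.zeros(6, dtype=int)
    for t in range(1, steps + 1):
        x = (rng.random(6) < FREQS).astype(float)      # sparse presence vector
        y = TRUE_W @ x + rng.normal(0.0, noise_std)    # noisy target
        grad = (w @ x - y) * x                         # squared-error gradient
        updates += grad != 0.0
        if optimizer == "sgd":
            w -= sgd_lr * grad
        elif optimizer == "adam":
            m = beta1 * m + (1 - beta1) * grad
            v = beta2 * v + (1 - beta2) * grad ** 2
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            w -= adam_lr * m_hat / (np.sqrt(v_hat) + eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
    return w, updates

for name in ("sgd", "adam"):
    w, updates = run(name)
    print(name, "weights:", np.round(w, 3), "updates:", updates)
```

Because both runs share a seed, they see the same sequence of samples, so any difference in the final weights comes from the optimizer alone. Varying steps, the learning rates, or the batching is a quick way to check how sensitive the comparison is to these assumed settings.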

What results does the experiment show for SGD versus Adam?

The experiment reveals stark differences. With SGD, the final weights for frequent tokens (A and B) converge close to the correct value of 1.0, but those for rare tokens (E and F) remain near zero, even though every token has the same ground-truth weight. The rare-token parameters receive far fewer gradient updates, and SGD's constant learning rate cannot compensate. In contrast, Adam's final weights for all tokens, including the rarest, end up much closer to 1.0. Adam's effective learning rates are scaled automatically: the rarest token receives an effective learning rate about 20 times larger than that of the most common token. This demonstrates how Adam's variance normalization boosts updates for underrepresented features, enabling the model to learn all tokens more uniformly despite extreme frequency disparities.
