Demystifying Word2vec Learning: From Gradient Flow to PCA
<h2 id="overview">Overview</h2>
<p>Word2vec is a cornerstone technique for learning dense vector representations (embeddings) of words. It trains a shallow neural network—a two-layer linear model—on a text corpus using self-supervised contrastive learning. Despite its simplicity, word2vec produces embeddings that capture rich semantic relationships, such as analogies (e.g., <em>man : woman :: king : queen</em>) through vector arithmetic. But what exactly does word2vec learn internally, and how does the learning process unfold? Recent theoretical work provides a surprising answer: under realistic conditions, word2vec reduces to an unweighted least-squares matrix factorization problem, its gradient flow dynamics can be solved in closed form, and the final embeddings are given by principal component analysis (PCA). This tutorial walks you through these insights step by step, covering the prerequisites, the learning dynamics, common pitfalls, and the key takeaways.</p><figure style="margin:20px 0"><img src="https://bair.berkeley.edu/static/blog/qwem-word2vec-theory/fig1.c8u1a3E7_Z23iPso.webp" alt="Demystifying Word2vec Learning: From Gradient Flow to PCA" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: bair.berkeley.edu</figcaption></figure>
<h2 id="prerequisites">Prerequisites</h2>
<h3 id="math-background">Mathematical Background</h3>
<p>To fully appreciate the theory, you should be comfortable with basic linear algebra (vectors, matrices, eigenvalues, singular value decomposition), calculus (gradients, ordinary differential equations), and probability (expectations, pointwise mutual information). Familiarity with neural network training (gradient descent, loss functions) is helpful but not essential.</p>
<h3 id="programming-skills">Programming Skills</h3>
<p>While this tutorial is conceptual, occasional code snippets (in Python-like pseudocode) illustrate key steps. You don’t need to run the code, but understanding loops and simple matrix operations is beneficial.</p>
<h2 id="step-by-step">Step-by-Step Instructions</h2>
<h3 id="word2vec-setup">1. Setting Up Word2vec</h3>
<p>Word2vec comes in two flavors: Continuous Bag-of-Words (CBOW) and Skip-gram. We focus on the simpler Skip-gram with negative sampling (SGNS). The model consists of two weight matrices: an input embedding matrix <em>W</em> (size vocabulary × embedding dimension) and an output embedding matrix <em>U</em> (size embedding dimension × vocabulary). For each target word <em>w</em>, the model predicts the context words <em>c</em> that appear within a fixed window around it. The objective is to maximize the probability assigned to observed (w, c) pairs while minimizing the probability assigned to randomly sampled negative pairs. Mathematically, the loss is a contrastive objective:</p>
<pre><code>L = - ∑_{(w,c) in positives} [ log σ(u_c · v_w) + ∑_{i=1}^{k} log σ(-u_{c_i} · v_w) ]</code></pre>
<p>where <em>v_w</em> is the input embedding of the target word, <em>u_c</em> the output embedding of the context word, σ the sigmoid, and <em>c_1, …, c_k</em> the <em>k</em> negative context words drawn at random for each observed pair.</p>
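<p>To make the objective concrete, here is a minimal NumPy sketch of this loss for a single (w, c) pair. It is an illustration, not the reference implementation: the table names <code>W_in</code> and <code>W_out</code> are placeholders, and negatives are drawn uniformly, whereas real word2vec samples them from a smoothed unigram distribution.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 1000, 50, 5

# Input (v_w) and output (u_c) embedding tables, tiny random init (see Step 2).
W_in = rng.normal(0.0, 1e-4, (vocab_size, dim))
W_out = rng.normal(0.0, 1e-4, (vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c, negatives):
    """Negative-sampling loss for one (target, context) pair."""
    v_w = W_in[w]
    pos = -np.log(sigmoid(W_out[c] @ v_w))                 # observed pair
    neg = -np.log(sigmoid(-W_out[negatives] @ v_w)).sum()  # k sampled pairs
    return pos + neg

# Example: target word 3, context word 7, k negatives drawn uniformly.
negatives = rng.integers(0, vocab_size, size=k)
print(sgns_loss(3, 7, negatives))
</code></pre>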
<h3 id="learning-dynamics">2. The Learning Dynamics: Rank-Incrementing Steps</h3>
<p>When all embeddings are initialized randomly near zero (e.g., from a tiny Gaussian with zero mean), the model starts effectively at rank zero. The key theoretical insight is that (under mild approximations) the gradient flow—the continuous-time limit of gradient descent—evolves in a sequence of discrete, rank-incrementing steps. At each step, the embeddings expand into a new orthogonal direction, corresponding to learning one “concept” at a time. This is reminiscent of deep linear networks, where the training dynamics follow a “saddle-to-saddle” path.</p>
<p>The loss decreases sharply at each rank increase, as illustrated in the paper’s figure (left panel). In the embedding space (right panel), vectors initially cluster near the origin, then stretch into a 1D subspace, then a 2D subspace, and so on, until the model’s capacity (embedding dimension) is saturated.</p>
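<p>The staircase behavior is easy to reproduce on a toy problem. The sketch below is a stand-in, not the full word2vec objective: it runs plain gradient descent on an unweighted least-squares factorization W·U ≈ M (anticipating Step 3) from a tiny initialization and prints the effective rank of the product as training proceeds. The synthetic target matrix and all constants are made up for illustration.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 5

# Synthetic "PMI-like" target with three well-separated singular values.
A = np.linalg.qr(rng.normal(size=(n, 3)))[0]
B = np.linalg.qr(rng.normal(size=(n, 3)))[0]
M = A @ np.diag([6.0, 2.5, 1.0]) @ B.T

# Tiny initialization, as the theory requires.
W = rng.normal(0.0, 1e-4, (n, d))
U = rng.normal(0.0, 1e-4, (d, n))
lr, scale = 0.02, np.linalg.norm(M, 2)

for step in range(1001):
    R = W @ U - M                                 # residual of the factorization
    W, U = W - lr * R @ U.T, U - lr * W.T @ R     # gradient descent on both factors
    if step % 100 == 0:
        sv = np.linalg.svd(W @ U, compute_uv=False)
        rank = int((sv > 0.05 * scale).sum())     # number of directions learned so far
        print(f"step {step:5d}  loss {0.5 * (R ** 2).sum():8.4f}  effective rank {rank}")
</code></pre>
<p>Run as written, the loss falls in separate drops and the effective rank climbs stepwise from 0 to 3; with a large initialization scale (try 0.1), the staircase disappears, which is exactly the pitfall discussed under Common Mistakes.</p>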
<h3 id="matrix-factorization-perspective">3. Matrix Factorization Perspective</h3>
<p>The breakthrough result is that the learning problem reduces to an unweighted least-squares matrix factorization. Under the assumptions of a large corpus and balanced sampling, the optimal embeddings satisfy:</p>
<pre><code>W * U ≈ M</code></pre>
<p>where <em>M</em> is the shifted pointwise mutual information (PMI) matrix: M_{ij} = log(P(i,j) / (P(i)P(j))) - log(k), with <em>k</em> the number of negative samples. The model essentially factorizes this matrix. Because the loss is mean-squared error (after approximations), gradient flow seeks the best low-rank approximation.</p>
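<p>The sketch below shows one way to build this matrix from a co-occurrence count table. The count matrix here is a random toy stand-in, and the epsilon smoothing only avoids log(0); it illustrates the definition above, not a production pipeline.</p>
<pre><code>import numpy as np

def shifted_pmi(counts, k=5, eps=1e-12):
    """Shifted PMI: M_ij = log(P(i,j) / (P(i) P(j))) - log(k).

    counts[i, j] = number of times word j occurs in the context window
    of word i, accumulated over the corpus.
    """
    p_ij = counts / counts.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)   # marginal of target words
    p_j = p_ij.sum(axis=0, keepdims=True)   # marginal of context words
    return np.log(p_ij + eps) - np.log(p_i @ p_j + eps) - np.log(k)

# Toy example: a symmetric co-occurrence table over 6 "words".
rng = np.random.default_rng(0)
C = rng.integers(0, 20, size=(6, 6))
M = shifted_pmi(C + C.T, k=5)
print(M.round(2))
</code></pre>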
<p>This connects word2vec to traditional matrix factorization methods like SVD. The embeddings learned are analogous to the singular vectors of <em>M</em>, scaled by singular values.</p>
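<p>To see the SVD connection concretely, here is a small sketch that factorizes a target matrix with a truncated SVD and splits the singular values symmetrically between the two factors. The symmetric square-root split is one common convention (not the only one), and the random matrix stands in for the shifted PMI matrix built above.</p>
<pre><code>import numpy as np

def svd_embeddings(M, d):
    """Rank-d factorization M ≈ W @ U via truncated SVD."""
    left, sv, right_t = np.linalg.svd(M)
    root = np.sqrt(sv[:d])
    W = left[:, :d] * root            # input embeddings, shape (vocab, d)
    U = (right_t[:d, :].T * root).T   # output embeddings, shape (d, vocab)
    return W, U

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))         # stand-in for the shifted PMI matrix
W, U = svd_embeddings(M, d=10)

# The factorization error matches the optimal rank-10 error exactly
# (the Frobenius norm of the discarded singular values).
discarded = np.linalg.svd(M, compute_uv=False)[10:]
print(np.linalg.norm(M - W @ U), np.sqrt((discarded ** 2).sum()))
</code></pre>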
<h3 id="pca-solution">4. PCA as the Final Solution</h3>
<p>From the factorization viewpoint, the gradient flow dynamics can be solved analytically. Starting from near-zero initialization, the embeddings first align with the top eigenvector of <em>M<sup>T</sup>M</em> (or <em>MM<sup>T</sup></em>), i.e., the top singular direction of <em>M</em>, then with the next eigenvector, and so on. This greedy, one-direction-at-a-time alignment is exactly how gradient descent behaves on a PCA-style objective. Consequently, the final learned representations (either input or output embeddings) are given by the principal components of the PMI matrix: with embedding dimension <em>d</em>, and in the limit of long training and vanishing initialization scale, the embeddings converge to the PCA solution built from the top <em>d</em> components.</p>
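<p>Here is a quick numerical check of this claim, under the same toy-factorization assumptions as before: train W and U by gradient descent from a tiny initialization, then compare the subspace spanned by the learned input embeddings with the top-d left singular vectors (the PCA directions) of the target matrix. All constants are illustrative.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 3

# Target with a clear spectral gap after the top-3 directions.
M = rng.normal(size=(n, n)) + 3.0 * rng.normal(size=(n, 3)) @ rng.normal(size=(3, n))

W = rng.normal(0.0, 1e-4, (n, d))
U = rng.normal(0.0, 1e-4, (d, n))
for _ in range(5000):
    R = W @ U - M
    W, U = W - 1e-3 * R @ U.T, U - 1e-3 * W.T @ R

# Principal angles between span(W) and the top-d singular subspace of M:
# cosines near 1.0 mean the learned embeddings recover the PCA directions.
top = np.linalg.svd(M)[0][:, :d]
cosines = np.linalg.svd(top.T @ np.linalg.qr(W)[0], compute_uv=False)
print(cosines.round(4))
</code></pre>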
<p>This explains why word2vec embeddings often exhibit linear structure: PCA is a linear dimensionality reduction, and the learned subspaces correspond to directions of maximal variance in the PMI matrix—interpretable as semantic concepts.</p>
<h2 id="common-mistakes">Common Mistakes</h2>
<h3 id="initialization-scale">1. Initialization Scale Too Large</h3>
<p>If embeddings are initialized far from zero (e.g., standard deviation 0.1 or higher), the gradient flow no longer follows the rank-incrementing path. The dynamics become messy and the theoretical guarantees break down. Always initialize very small (e.g., 1e-4).</p>
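<p>In code, the difference is just the standard deviation passed to the initializer; the concrete numbers below are illustrative.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100

# Tiny init: the product of the two embedding matrices starts near rank
# zero, so the rank-incrementing (saddle-to-saddle) dynamics can unfold.
W_tiny = rng.normal(0.0, 1e-4, (vocab_size, dim))

# Large init: the product already starts effectively full-rank, the clean
# staircase in the loss disappears, and the closed-form analysis no longer applies.
W_large = rng.normal(0.0, 0.1, (vocab_size, dim))
</code></pre>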
<h3 id="negative-samples">2. Wrong Number of Negative Samples</h3>
<p>The matrix factorization interpretation depends on the negative sampling parameter <em>k</em>. Using <em>k</em> = 1 yields a PMI shift that need not align with the unweighted least-squares picture: the paper shows that in this case the objective is equivalent to a <em>weighted</em> matrix factorization. To recover the unweighted case, <em>k</em> must be chosen so that the shift matches the negative log of the relevant probability ratio—a detail that is easy to overlook.</p>
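<p>One thing that is easy to verify directly: in the shifted PMI matrix defined in Step 3, <em>k</em> enters only through the constant shift log(<em>k</em>), so changing it moves every entry of the target matrix by the same amount. A two-line check, with made-up PMI values:</p>
<pre><code>import numpy as np

pmi = np.array([[2.3, -0.7], [-0.7, 1.1]])   # toy PMI entries
for k in (1, 5, 15):
    print(k, (pmi - np.log(k)).round(2))     # target shifts down by log(k)
</code></pre>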
<h3 id="subsampling">3. Ignoring Subsampling of Frequent Words</h3>
<p>Standard word2vec implementations often subsample frequent words to improve speed and quality. This modifies the effective PMI matrix. The theoretical results still hold approximately if subsampling is accounted for, but practitioners may be surprised if they expect clean PCA without adjusting the objective.</p>
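<p>For reference, the subsampling rule from the original word2vec paper keeps an occurrence of word w with probability roughly sqrt(t / f(w)), where f(w) is the word's corpus frequency and t is a small threshold; released implementations use slight variants of this formula. A small sketch:</p>
<pre><code>import numpy as np

def keep_probability(freq, t=1e-5):
    """Approximate word2vec-style subsampling: keep an occurrence of a word
    with corpus frequency `freq` with probability min(1, sqrt(t / freq))."""
    return np.minimum(1.0, np.sqrt(t / freq))

# Frequent words are aggressively dropped; rare words are always kept,
# which reshapes the co-occurrence statistics (and hence the PMI matrix).
for f in (1e-2, 1e-3, 1e-4, 1e-6):
    print(f"f(w) = {f:.0e}  ->  keep with prob {keep_probability(f):.3f}")
</code></pre>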
<h3 id="embedding-dimension">4. Choosing Embedding Dimension Too Low</h3>
<p>The sequence of rank increments stops once the embedding dimension is reached. If the dimension is too small, the model cannot learn all of the useful concepts, and the embeddings may conflate separate semantic directions. Experiment with dimensions of 100 or more for most tasks.</p>
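<p>One practical way to sanity-check the choice is to look at how quickly the singular values of the (shifted) PMI matrix decay. In the sketch below the matrix is a synthetic low-rank-plus-noise stand-in rather than a real PMI matrix, so the specific percentages mean nothing; the point is the diagnostic itself.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
n, true_rank = 500, 60

# Stand-in for a shifted PMI matrix: low-rank "semantic" signal plus noise.
M = rng.normal(size=(n, true_rank)) @ rng.normal(size=(true_rank, n)) \
    + 0.5 * rng.normal(size=(n, n))

sv = np.linalg.svd(M, compute_uv=False)
explained = np.cumsum(sv ** 2) / np.sum(sv ** 2)
for d in (30, 60, 100, 300):
    print(f"d = {d:3d}: captures {100 * explained[d - 1]:.1f}% of the squared spectrum")
</code></pre>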
<h2 id="summary">Summary</h2>
<p>Word2vec, at its core, learns by performing gradient flow on a low-rank matrix factorization task. Starting from small initialization, the model discovers concepts one by one, finally extracting the principal components of the shifted PMI matrix. This PCA perspective explains the linear analogies and semantic subspaces observed in word embeddings. Understanding this process demystifies representation learning in minimal language models and provides a foundation for interpreting more complex deep learning systems. Remember to initialize tiny, tune negative sampling, and account for subsampling to replicate the theoretical dynamics.</p>