Word2Vec (Part 1): Skip‑gram — NLP with Deep Learning

Skip‑gram learns word embeddings by predicting context words.

Introduction to Word2Vec

Word2Vec is a family of techniques for learning vector representations (embeddings) of words from large, unlabeled text corpora. The key idea is that words used in similar contexts should end up close to each other in the learned vector space. Once trained, the vectors enable simple but powerful algebra like king − man + woman ≈ queen.
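The analogy arithmetic can be illustrated with a minimal sketch. The 2-D vectors below are hand-made for illustration (one axis loosely standing for "royalty", the other for "gender"); they are not trained embeddings, and the helper names `cosine` and `analogy` are assumptions of this example.

```python
import math

# Hand-made 2-D toy vectors (axes: "royalty", "gender"); NOT trained embeddings.
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

With real embeddings the match is approximate rather than exact, which is why the query words themselves are usually excluded from the candidates.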

Example embedding space projected with t‑SNE — related words cluster together.

Skip‑gram Model: an Approach to Learning Embeddings

Overview

Skip-gram trains a shallow neural model to predict surrounding words given the current word. Repeating this task across a corpus forces the network to learn a representation that captures word semantics and usage.

Intuition

Consider the sentence “the dog barked at the mailman”. If the input is dog, the model is optimized to give high probability to words that appear nearby, e.g., barked, the, at, mailman. Learning this across many contexts puts “dog” near “puppy” and far from unrelated words like “galaxy”: because “dog” and “puppy” tend to occur with the same neighbors, the same prediction task pulls their vectors together.

The Model

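The classic implementation uses a V×D embedding matrix (V is the vocabulary size, D the embedding dimension) to map input word IDs to dense vectors, and a softmax layer with its own set of V output vectors to score every vocabulary item as a potential context word. A full softmax costs O(V) work per training example, so in practice it is replaced by sampled softmax or negative sampling, which update only the true context word and a handful of sampled negatives per step.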
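For reference, the full softmax that these approximations replace can be written as follows. The notation is an assumption of this note: v_w is the input vector of the center word w, u_j is the output vector of vocabulary item j, and V is the vocabulary size.

```latex
p(c \mid w) = \frac{\exp\left(u_c \cdot v_w\right)}{\sum_{j=1}^{V} \exp\left(u_j \cdot v_w\right)}
```

The sum in the denominator runs over the whole vocabulary, which is the cost that sampling-based objectives avoid.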

# pseudo-code: negative sampling objective
loss = -log(σ(u_c · v_w)) - Σ_{k=1..K} log(σ(-u_{n_k} · v_w))
# v_w: embedding of input word, u_c: output vector of a true context word,
# u_{n_k}: output vectors of K sampled negatives
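The objective above can be turned into a small plain-Python function. This is a minimal sketch for a single (center, context) pair; the function name `neg_sampling_loss` and the example vectors are illustrative choices, not trained values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(v_w, u_c, u_negs):
    """Negative-sampling loss for one (center, context) pair.

    v_w:    input vector of the center word
    u_c:    output vector of the true context word
    u_negs: list of output vectors for the K sampled negatives
    """
    loss = -math.log(sigmoid(dot(u_c, v_w)))
    for u_n in u_negs:
        loss -= math.log(sigmoid(-dot(u_n, v_w)))
    return loss

# Illustrative vectors chosen by hand, not trained values.
v_w = [1.0, 0.0]
aligned = neg_sampling_loss(v_w, u_c=[2.0, 0.0], u_negs=[[-2.0, 0.0]])
misaligned = neg_sampling_loss(v_w, u_c=[-2.0, 0.0], u_negs=[[2.0, 0.0]])
print(aligned < misaligned)  # True
```

The loss is low when the true context vector points the same way as the center vector and the negatives point away, and high in the opposite case.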

Intuition (Data Generation)

Training examples are created by sliding a window over the text. Each center word is paired with a fixed number of words sampled from inside its window. For example, with a window size of 2 (two words on each side) and 2 pairs per word, the candidates for “dog” are the, barked, and at, so “dog” might generate the pairs (dog → barked) and (dog → the).
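The sliding-window procedure can be sketched in a few lines of Python. The function name `skipgram_pairs` and its parameters are assumptions of this example; which neighbors get picked depends on the random seed.

```python
import random

def skipgram_pairs(tokens, window=2, pairs_per_word=2, seed=0):
    """Return (center, context) pairs sampled from a sliding window over tokens."""
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Every word within `window` positions of the center, except the center itself.
        neighbours = [tokens[j] for j in range(lo, hi) if j != i]
        k = min(pairs_per_word, len(neighbours))
        for context in rng.sample(neighbours, k):
            pairs.append((center, context))
    return pairs

sentence = "the dog barked at the mailman".split()
pairs = skipgram_pairs(sentence)
print([p for p in pairs if p[0] == "dog"])
```

Sampling a fixed number of pairs per center word keeps the number of training examples per word constant regardless of window size.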

Training the Model

Optimization typically uses SGD or Adagrad. After training, the embedding matrix stores the word vectors and can be exported for downstream tasks or projected to 2‑D for visualization.

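# very small PyTorch-style sketch; vocab, dim, dataset and neg_sampling_loss are assumed to be defined elsewhere
import torch
emb = torch.nn.Embedding(vocab, dim)   # input (center-word) vectors
out = torch.nn.Embedding(vocab, dim)   # output (context-word) vectors
optimizer = torch.optim.Adagrad(list(emb.parameters()) + list(out.parameters()), lr=1.0)
for batch in dataset:
    optimizer.zero_grad()              # clear gradients left over from the previous step
    loss = neg_sampling_loss(emb(batch.center), out(batch.context), out(batch.negatives))
    loss.backward()
    optimizer.step()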
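The training procedure described in this section can also be sketched end to end in plain Python, without any framework. This is a toy illustration under stated assumptions: plain SGD instead of Adagrad, negatives drawn uniformly from the vocabulary, a window of 1, and hypothetical names (`train_skipgram`, `log_sigmoid`). It is not how production implementations are written.

```python
import math
import random

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid(x):
    return math.exp(log_sigmoid(x))

def train_skipgram(pairs, vocab_size, dim=8, num_neg=3, lr=0.1, epochs=100, seed=0):
    """Plain-SGD skip-gram with negative sampling on (center_id, context_id) pairs.

    Assumes vocab_size >= 3 so that a negative distinct from center and context exists.
    """
    rng = random.Random(seed)
    v = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]  # input vectors
    u = [[0.0] * dim for _ in range(vocab_size)]                                  # output vectors
    losses = []
    for _ in range(epochs):
        total = 0.0
        for center, context in pairs:
            # One positive target plus uniformly sampled negatives (a simplification).
            targets = [(context, 1.0)]
            while len(targets) < 1 + num_neg:
                n = rng.randrange(vocab_size)
                if n != context and n != center:
                    targets.append((n, 0.0))
            grad_v = [0.0] * dim
            for t, label in targets:
                s = sum(a * b for a, b in zip(u[t], v[center]))
                total -= log_sigmoid(s if label == 1.0 else -s)
                g = sigmoid(s) - label  # derivative of the loss with respect to the score s
                for d in range(dim):
                    grad_v[d] += g * u[t][d]          # accumulate gradient for the input vector
                    u[t][d] -= lr * g * v[center][d]  # update the output vector
            for d in range(dim):
                v[center][d] -= lr * grad_v[d]
        losses.append(total / len(pairs))
    return v, losses

# Toy corpus: word IDs for "the dog barked at the mailman", window of 1.
vocab = ["the", "dog", "barked", "at", "mailman"]
ids = [0, 1, 2, 3, 0, 4]
pairs = [(ids[i], ids[j]) for i in range(len(ids)) for j in (i - 1, i + 1) if 0 <= j < len(ids)]
vectors, losses = train_skipgram(pairs, vocab_size=len(vocab))
print(f"first-epoch loss {losses[0]:.3f}, last-epoch loss {losses[-1]:.3f}")
```

On a corpus this small the vectors mean little; the point is only to show the update rule. The returned `vectors` list plays the role of the embedding matrix that would be exported after training.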

Results

As training progresses, nearest‑neighbor queries in the embedding space begin to surface meaningful relationships (e.g., “american” ↔ “british”, “german”, …). Simple analogies also emerge.

New Word2Vec‑style Technique: GloVe

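GloVe, from Stanford, learns word vectors from global co-occurrence statistics: it counts how often each pair of words appears together across the whole corpus, then fits vectors whose dot products approximate the logarithm of those counts, which amounts to a weighted factorization of the co-occurrence matrix. Skip-gram, by contrast, learns from one local context window at a time. In practice, both approaches produce strong embeddings.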
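To make the idea of a global co-occurrence matrix concrete, here is a minimal sketch that counts co-occurrences within a fixed window. It is an unweighted simplification with a hypothetical function name (`cooccurrence_counts`); it shows only the counting step, not the vector fitting that GloVe performs on top of the counts.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered (word, neighbour) pair occurs within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "the dog barked at the mailman".split()
counts = cooccurrence_counts(tokens)
print(counts[("dog", "the")])     # 1
print(counts[("the", "barked")])  # 2
```

Stored sparsely like this, the counts are gathered in one pass over the corpus and can then be reused for every training step.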