LLM Post-training
Prerequisite:
Introduction
Definition: Post-training refines a pre-trained base model using specialized techniques and data to improve behaviors such as instruction-following, reasoning, and alignment with user needs. It follows pre-training, which builds general language knowledge from vast raw text, and uses far less data for targeted enhancements.
Key Techniques: Supervised Fine-Tuning (SFT), where models imitate ideal prompt-response pairs; Direct Preference Optimization (DPO), which trains on pairs of preferred and dispreferred outputs; and Reinforcement Learning (RL) variants like PPO or GRPO for reward-based optimization. These adapt models for tasks such as math reasoning or changes to the chat identity.
Data Role: Post-training relies on high-quality, curated datasets like dialogue pairs or feedback examples, often mixed from sources such as Tulu-3-SFT-Mix. Research emphasizes data quality metrics—turn structure, input/response quality—to avoid issues like catastrophic forgetting or overfitting.
Key Challenges: Work on post-training data revolves around ensuring quality, scaling curation, and balancing improvements against degradation of other skills. These issues stem from the need for curated, high-quality datasets that refine behaviors like reasoning and alignment without compromising base capabilities.
Data Quality Issues: Poor data can introduce biases or inconsistencies, as datasets often reflect societal skews or lack diversity in representation. Curation demands rigorous validation for ethical soundness, accurate labels, and balanced task types, yet most leading datasets remain proprietary, complicating reproducible research.
Catastrophic Forgetting: Post-training on specialized data risks eroding pre-trained knowledge, with losses up to 20-30% in unrelated tasks unless mitigated by techniques like replay or continual learning. This trade-off demands careful mixing of old and new data to preserve broad capabilities.
Scalability Hurdles: Exhaustion of high-quality sources pushes reliance on synthetic data, which risks artifacts like reduced output diversity or “typicality bias” from methods like RLHF. Compute-intensive evaluation and annotation further limit accessibility, favoring large players.
When to use vs not use post-training
When to use
- The use case requires a model that consistently adheres to a comprehensive set of instructions or excels in a specific capability.
- You want to avoid writing the same instructions into every prompt.
- You want to avoid repeatedly correcting the model's outputs in the same way.
- You want to combine Retrieval-Augmented Generation (RAG) with post-training.
When NOT to use:
- Simple policy or style changes, i.e., following a few straightforward instructions. This can be achieved with good prompting; see prompt engineering.
- Up-to-date facts: post-training is not the most effective way to impart new factual knowledge to an LLM, especially when the knowledge base is large or rapidly changing.
- For fresh or extensive information, such as new organizational policies or a large proprietary document corpus, a retrieval-augmented generation (RAG) approach is often more effective: maintain a knowledge database and have the model retrieve relevant information at query time.
Supervised Fine-Tuning (SFT)
The model is provided with example prompts and ideal responses, and it is trained to imitate them. The loss function is the cross-entropy loss (negative log-likelihood). By minimizing this loss over a dataset of (prompt, desired response) pairs, the model learns to produce the desired response when it sees a similar prompt.
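The cross-entropy objective can be sketched in a few lines. The probabilities below are toy numbers standing in for the model's per-token probabilities on the desired response, not real model outputs:

```python
import math

def sft_loss(token_probs):
    """Cross-entropy (negative log-likelihood) over the desired response:
    the average of -log p(token) for each token of the target."""
    return sum(-math.log(p) for p in token_probs) / len(token_probs)

# Toy probabilities the model assigns to the tokens of the ideal response.
confident = [0.9, 0.8, 0.95]   # model already imitates the target closely
uncertain = [0.2, 0.1, 0.3]    # model is far from the target

print(sft_loss(confident) < sft_loss(uncertain))  # → True
```

Minimizing this loss pushes the per-token probabilities of the desired response toward 1, which is exactly the "imitate the example" behavior described above.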
Common use cases for SFT include:
- Converting a foundational model into an instruction or chat model like ChatGPT or Gemini.
- Overhauling the assistant’s style or format. For example, if you need the assistant to adopt a specific voice (say, a friendly customer support agent). It is important to note that simple tone or phrase tweaking, such as asking the model to be more polite, can be handled through prompt engineering with a few-shot examples.
- Distilling a larger model’s abilities into a smaller model. For instance, you can generate a large set of Q&A or dialog responses from a powerful teacher model, then fine-tune a smaller model to reproduce these responses.
Your training data should be a collection of high-quality prompt–response pairs that exemplify exactly the behavior you want the model to learn. Quality is far more important than quantity here. The reason is simple: the model tries to imitate whatever you give it, so bad examples will teach it bad habits. It’s better to have a small dataset of all great responses than to include inconsistent or poor ones.
Some strategies to get good SFT data:
- Manual writing or human labeling: If you have domain experts or annotators, they can craft prompt–response pairs demonstrating the desired answers. This ensures quality but can be slow and expensive.
- Distillation from a stronger model: Use a more capable model to generate responses to a list of prompts, and use those generations as your fine-tuning targets. You might still need to review/edit them, but this can quickly produce a large aligned dataset.
- Best-of-N sampling: Take your current model (or another model) and have it produce N different responses to each prompt, then pick the best one (using either a reward function or human judgment) as the target. This “rejection sampling” approach can yield higher-quality responses from the model itself.
- Filtering a large dataset: If you have a large pool of candidate Q&A pairs from somewhere (open-source datasets, or logs, etc.), apply filters to select only the highest-quality and most diverse examples.
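The best-of-N ("rejection sampling") strategy above can be sketched as follows. The `generate` and `reward` functions here are hypothetical toy stand-ins for a real model and a real scoring function:

```python
def best_of_n(prompt, generate, reward, n=4):
    """Rejection sampling: draw n candidate responses for a prompt and keep
    the highest-reward one as the SFT target."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Hypothetical stand-ins for a real model and reward function.
samples = iter(["???", "short answer", "a detailed, well-structured answer", "ok"])
generate = lambda prompt: next(samples)
reward = len  # toy reward: prefer longer responses

target = best_of_n("Explain SFT.", generate, reward, n=4)
print(target)  # → a detailed, well-structured answer
```

In a real pipeline, `reward` would be a reward model or human judgment, and the selected responses become the (prompt, response) pairs for fine-tuning.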
Direct Preference Optimization (DPO)
The model is trained on comparisons between a preferred response and a dispreferred (rejected) response for the same prompt: “Response A is better than response B.”
This approach works best when you first apply Supervised Fine-Tuning (SFT) and you have an already decent model, and then apply DPO to refine specific aspects.
For example, if users prefer the assistant to say “I am your AI assistant” instead of “I am your assistant,” you can create a comparison where the former is the preferred response and the latter is the dispreferred one. DPO then teaches the model that preference.
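The preference objective can be sketched per pair. The log-probabilities below are toy numbers, not real model outputs; the standard DPO loss compares the policy's log-ratios against a frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair:
    -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)),
    where each log-ratio is policy log-prob minus reference log-prob."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy equals the reference, the loss is log 2; it shrinks as the
# policy raises the preferred response and lowers the dispreferred one.
before = dpo_loss(-5.0, -5.0, -5.0, -5.0)  # policy == reference
after  = dpo_loss(-3.0, -6.0, -5.0, -5.0)  # preferred up, dispreferred down
print(before > after)  # → True
```

Minimizing this loss widens the gap between the two responses without needing an explicit reward model, which is DPO's main appeal over classic RLHF.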
DPO use cases:
- Tuning identity, voice, or policy compliance: for instance, updating the model’s persona or ensuring it follows certain guidelines.
- Nudging the model away from bad habits: correcting a tendency that is inappropriate or disliked.
Online Reinforcement Learning (ORL)
The idea is to create a feedback loop:
- The model generates responses
- Evaluate each response with a reward function
- Update the model to increase the chance of high-reward responses
Repeat until the model behaves as desired.
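The loop above can be sketched with a toy two-response "policy" whose preference weights are nudged toward rewarded outputs. This is a minimal bandit-style illustration, not a real policy-gradient update:

```python
import random

random.seed(0)
weights = {"good": 1.0, "bad": 1.0}  # toy policy: preference weights per response

def policy(prompt):
    """Sample a response in proportion to its current weight."""
    total = sum(weights.values())
    return random.choices(list(weights), [w / total for w in weights.values()])[0]

def reward_fn(prompt, response):
    """Toy verifiable reward: 1 for the desired response, 0 otherwise."""
    return 1.0 if response == "good" else 0.0

def update(response, r, lr=0.5):
    """Raise the weight of responses that earned reward."""
    weights[response] += lr * r

for _ in range(200):  # the feedback loop: generate, score, update, repeat
    resp = policy("some prompt")
    update(resp, reward_fn("some prompt", resp))

print(weights["good"] > weights["bad"])  # → True
```

Real algorithms (PPO, GRPO) replace the weight bump with a gradient step on the model's parameters, but the generate/score/update cycle is the same.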
In online RL, unlike supervised methods, the model is not told the exact correct answer. Instead it explores various outputs and receives feedback as a scalar reward.
One advantage is that you can reward a chain of thought that leads to a correct math answer, or penalize behavior like refusal to follow instructions.
This makes RL particularly powerful for improving specific capabilities, such as multi-step reasoning, coding (where passing unit tests is the reward), or factual accuracy (where a reward model might judge correctness).
In practice, all LLM reinforcement learning is online: the model generates fresh responses during training to learn from; it’s a live loop where each update is based on newly generated examples.
ORL use cases: Online RL is best suited for situations where you have a clear way to evaluate success for a given output. For example:
- Training a model to solve math problems by rewarding it when it gets the correct answer
- Improving a coding assistant by using unit tests as a reward signal. The model gets a positive reward when the code it writes passes the tests.
- Enhancing factual accuracy by training with a reward model that scores outputs higher when they are truthful or match a knowledge source.
Reinforcement Learning is also useful for long-horizon tasks and agent behaviors (AI Agents).
Reward Function: there are two broad categories of reward:
- Learned rewards: usually based on human-labeled data.
- Verifiable rewards: these work only for tasks with a clear correctness measure. For example, comparing a generated answer to a known correct answer from Q&A or math problems, or running unit tests on generated code.
In practice, you might combine both.
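Both verifiable-reward flavors above can be sketched directly. The `solve` function name is an assumption for illustration, as is the test format; a real harness would sandbox the execution:

```python
def math_reward(generated_answer, reference_answer):
    """Verifiable reward: exact match against a known correct answer."""
    return 1.0 if generated_answer.strip() == reference_answer.strip() else 0.0

def code_reward(generated_code, tests):
    """Verifiable reward: run toy unit tests against the generated code.
    Assumes the code defines a function named `solve` (hypothetical convention)."""
    namespace = {}
    try:
        exec(generated_code, namespace)  # define the candidate function
        for args, expected in tests:
            if namespace["solve"](*args) != expected:
                return 0.0
        return 1.0
    except Exception:
        return 0.0  # crashes or missing definitions earn no reward

print(math_reward(" 42 ", "42"))                               # → 1.0
print(code_reward("def solve(x): return x * 2", [((3,), 6)]))  # → 1.0
```

A learned reward would replace these exact checks with a model trained on human preference labels, and the two signals can be mixed in one training run.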
The two common algorithms for LLM reinforcement learning are PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization).
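GRPO's core idea can be sketched in isolation: instead of a learned value function (as in PPO), it scores each response in a group of samples for the same prompt relative to the group's mean reward, normalized by the group's standard deviation. This is only the advantage computation, not the full algorithm:

```python
import statistics

def group_relative_advantages(rewards):
    """Group-relative advantage: (reward - group mean) / group std.
    Above-average responses get positive advantage, below-average negative."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled responses to one prompt, scored by a verifiable reward.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # → [1.0, -1.0, -1.0, 1.0]
```

These advantages then weight the policy-gradient update, so the model shifts probability mass toward the responses that beat their own group's average.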
Resources
- www.llmdata.com -> a Y Combinator-affiliated company that studies post-training data