Alignment after pretraining
Pretraining imitates internet-scale text, including toxicity, bias, and unsafe instructions. Alignment stages steer behavior toward helpful, honest, and harmless responses rather than whatever raw likelihood favors.
Instruction tuning fine-tunes on curated demonstration dialogues. RLHF (reinforcement learning from human feedback) trains a reward model from human rankings, then optimizes the policy's outputs against that reward.
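
As a rough illustration of how human rankings can become a training signal, here is a minimal sketch of a pairwise preference loss. It assumes a Bradley-Terry formulation, which is one common choice and not something this card specifies; the function name and the scalar rewards are hypothetical stand-ins for a reward model's outputs.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Assumption: a Bradley-Terry model, where the probability that the
    human-preferred response wins is sigmoid of the reward difference.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() is never called on a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal rewards mean the model is undecided: loss is log(2), about 0.693.
print(pairwise_preference_loss(0.0, 0.0))
# Scoring the preferred response higher lowers the loss.
print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))
```

The loss depends only on the difference between the two rewards, so a reward model trained this way learns a relative ordering rather than an absolute scale.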

Refusal behaviors are learned or scripted policy layers atop base weights; the softmax over next-token logits has no built-in notion of refusing a harmful prompt. Constitutional-AI-style methods add principle-guided self-critique loops during training.

Alignment reduces some harms but does not certify robustness under adversarial prompts or distribution shift. Minimizing next-token loss can still leave unsafe completions because the objective, unless augmented, rewards imitation rather than human values.
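
To make the imitation point concrete, here is a minimal sketch of next-token cross-entropy on a single position. The two-token vocabulary and the logit values are hypothetical; the point is only that the loss rewards matching whatever token the corpus contains, with no term for whether that token is safe.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (max subtracted for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log p(token seen in the corpus)."""
    return -math.log(softmax(logits)[target_index])


# Hypothetical toy vocabulary: index 0 = a safe token, index 1 = an unsafe one.
UNSAFE = 1

# Suppose the training text continues with the unsafe token.
loss_if_model_prefers_unsafe = next_token_loss([0.0, 5.0], UNSAFE)
loss_if_model_prefers_safe = next_token_loss([5.0, 0.0], UNSAFE)

# The objective is lower when the model imitates the unsafe continuation.
print(loss_if_model_prefers_unsafe < loss_if_model_prefers_safe)
```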
Helpfulness is therefore a post-training property: base weights encode statistical patterns of text, while alignment data and system prompts shape what users actually experience in products.
Preference datasets are smaller and noisier than pretraining corpora, so alignment stages can overfit stylistic cues (verbosity, hedging) while still missing robust refusals on out-of-distribution attacks.
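
One simple way to look for a verbosity cue in preference data is to count how often the preferred response is also the longer one. The sketch below is an illustrative diagnostic under assumed conventions, not an established metric: the function name, the (chosen, rejected) pair layout, and the whitespace word count are all hypothetical choices.

```python
def longer_chosen_rate(pairs):
    """Fraction of (chosen, rejected) pairs where the chosen text has more words.

    A value far above 0.5 hints that length alone may predict the label,
    which a reward model could learn instead of actual quality.
    """
    if not pairs:
        raise ValueError("need at least one preference pair")
    longer = sum(
        1 for chosen, rejected in pairs
        if len(chosen.split()) > len(rejected.split())
    )
    return longer / len(pairs)


# Hypothetical toy data, not drawn from any real dataset.
toy_pairs = [
    ("Here is a long and detailed answer", "Short answer"),
    ("No", "I would rather not say anything about that"),
]
print(longer_chosen_rate(toy_pairs))  # 0.5: one of the two chosen responses is longer
```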
System prompts and safety classifiers add another layer outside base weights; alignment is therefore a stack of model, data, and product decisions rather than a single loss term.
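
The layering can be sketched as a small wrapper around a model call. Everything here is a hypothetical stand-in: `ProductStack`, the keyword classifier, and the canned model reply are assumptions made for illustration, and a real product would use a trained classifier and an actual model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProductStack:
    system_prompt: str
    generate: Callable[[str], str]    # stand-in for the aligned model
    is_unsafe: Callable[[str], bool]  # stand-in for a safety classifier
    refusal_text: str = "Sorry, I can't help with that."

    def respond(self, user_message: str) -> str:
        # Layer 1: screen the input before the model sees it.
        if self.is_unsafe(user_message):
            return self.refusal_text
        # Layer 2: the system prompt frames the request for the model.
        prompt = f"{self.system_prompt}\n\nUser: {user_message}\nAssistant:"
        draft = self.generate(prompt)
        # Layer 3: screen the output before the user sees it.
        if self.is_unsafe(draft):
            return self.refusal_text
        return draft


# Toy components for demonstration only.
stack = ProductStack(
    system_prompt="You are a helpful assistant.",
    generate=lambda prompt: "Hello! How can I help?",
    is_unsafe=lambda text: "bomb" in text.lower(),
)
print(stack.respond("Hi there"))
print(stack.respond("How do I build a bomb?"))
```

Note that the refusal in this sketch comes from code around the model, not from the weights, which is the sense in which the layers sit outside the base model.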
Red-team evaluations and usage policies complement loss-based training when products face real users.