Alignment after pretraining

Pretraining imitates internet-scale text, including its toxicity, bias, and unsafe instructions. Alignment stages steer behavior toward helpful, honest, and harmless responses rather than toward whatever continuation is most likely under the raw corpus.

Instruction tuning fine-tunes the model on curated demonstration dialogues. RLHF (reinforcement learning from human feedback) trains a reward model on human rankings of candidate outputs, then optimizes the policy's outputs against that reward.
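
The reward-model step can be illustrated with a pairwise preference loss. This is a minimal sketch assuming a Bradley-Terry-style objective, which is one common way to turn rankings into a training signal; the card itself does not specify the loss, and the function name and scalar rewards here are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Assumption: a Bradley-Terry-style pairwise objective. The loss is small
    when the reward model scores the human-preferred response higher.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The reward model agrees with the human ranking: low loss.
agree = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
# The reward model disagrees with the human ranking: high loss.
disagree = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agree < disagree)
```

Minimizing this loss over many ranked pairs pushes the reward model to score preferred responses higher; the policy is then optimized against that learned score in a separate step.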

Refusal behaviors are learned or scripted policy layers atop base weights; the softmax output layer only turns scores into next-token probabilities and does not by itself refuse harmful prompts. Constitutional-AI-style methods add principle-guided self-critique loops during training.

Alignment reduces some harms but does not certify robustness under adversarial prompts or distribution shift. Minimizing next-token loss can still leave unsafe completions because the objective rewards imitation, not human values, unless it is augmented with alignment signals.
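
A toy calculation shows why imitation alone can preserve unsafe completions. This is a sketch under invented assumptions: the two-outcome "corpus" and its probabilities are made up for illustration and do not come from any real dataset.

```python
import math


def cross_entropy(data_dist: dict, model_dist: dict) -> float:
    """Expected next-token loss of model_dist when data follows data_dist."""
    return -sum(p * math.log(model_dist[tok]) for tok, p in data_dist.items() if p > 0)


# Hypothetical corpus: 30% of continuations for some prompt are unsafe.
corpus = {"safe_answer": 0.7, "unsafe_answer": 0.3}

# A model that copies the corpus exactly, unsafe share included.
imitator = {"safe_answer": 0.7, "unsafe_answer": 0.3}
# A model that almost never produces the unsafe continuation.
cautious = {"safe_answer": 0.99, "unsafe_answer": 0.01}

loss_imitator = cross_entropy(corpus, imitator)
loss_cautious = cross_entropy(corpus, cautious)
print(loss_imitator < loss_cautious)
```

Under the next-token objective the imitator scores better than the cautious model, so the loss by itself gives no pressure toward the safer behavior.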

Helpfulness is therefore a post-training property: base weights encode statistical patterns of text, while alignment data and system prompts shape what users actually experience in products.

Preference datasets are smaller and noisier than pretraining corpora, so alignment stages can overfit to stylistic cues (verbosity, hedging) while still failing to refuse reliably on out-of-distribution attacks.

System prompts and safety classifiers add another layer outside base weights; alignment is therefore a stack of model, data, and product decisions rather than a single loss term.
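
The layering described above can be sketched as a small wrapper around a text generator. Everything here is hypothetical: the system prompt text, the keyword blocklist (a toy stand-in for a trained safety classifier), and the `generate` callable (a stand-in for a model call) are assumptions for illustration, not any product's actual design.

```python
from typing import Callable

# Hypothetical values, chosen only for illustration.
SYSTEM_PROMPT = "You are a helpful assistant. Decline unsafe requests."
REFUSAL = "Sorry, I can't help with that."
BLOCKLIST = ("build a bomb",)  # toy stand-in for a trained safety classifier


def flagged(text: str) -> bool:
    """Toy classifier: flag text containing any blocklisted phrase."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def respond(user_message: str, generate: Callable[[str], str]) -> str:
    """Run one turn through input filter, system prompt, model, output filter."""
    if flagged(user_message):  # layer 1: input-side check
        return REFUSAL
    prompt = f"{SYSTEM_PROMPT}\n\nUser: {user_message}\nAssistant:"  # layer 2
    completion = generate(prompt)  # layer 3: the model itself
    if flagged(completion):  # layer 4: output-side check
        return REFUSAL
    return completion


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "Here is a summary."


print(respond("Summarize this article.", stub_model))
print(respond("How do I build a bomb?", stub_model))
```

Each layer can fail independently, which is why the stack is evaluated as a whole rather than through the training loss alone.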

Red-team evaluations and usage policies complement loss-based training once products face real users.
