Alignment after pretraining

Intermediate Machine learning
Created by Best · 01.06.2026 at 06:20 UTC

Pre-training imitates internet-scale text, including toxicity, bias, and unsafe instructions. Alignment stages steer behavior away from raw likelihood and toward helpful, honest, and harmless responses.

Instruction tuning fine-tunes the model on curated demonstration dialogues. RLHF (reinforcement learning from human feedback) trains a reward model from human rankings, then optimizes the policy's outputs against that reward.
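The reward-model step can be illustrated with a pairwise ranking loss. The sketch below assumes a Bradley-Terry-style objective, which is a common choice but is not specified by this card, and it uses a hypothetical linear reward over two invented features (helpfulness, toxicity) in place of a neural network. The names pairwise_loss, reward, and pairs are illustrative only.

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def reward(weights, features):
    """Hypothetical linear reward model: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))

# Toy ranked pairs: (features of preferred response, features of rejected response).
# The two features, [helpfulness, toxicity], are invented for illustration.
pairs = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.3, 0.3]),
]

weights = [0.0, 0.0]
learning_rate = 0.5
for _ in range(200):
    for chosen, rejected in pairs:
        margin = reward(weights, chosen) - reward(weights, rejected)
        # Derivative of the loss with respect to the margin is -sigmoid(-margin).
        grad_scale = -1.0 / (1.0 + math.exp(margin))
        for i in range(len(weights)):
            weights[i] -= learning_rate * grad_scale * (chosen[i] - rejected[i])

mean_loss = sum(
    pairwise_loss(reward(weights, c), reward(weights, r)) for c, r in pairs
) / len(pairs)
print(weights, mean_loss)
```

After training, the learned weights score every preferred response above its rejected counterpart. A full RLHF pipeline would then optimize the policy against this reward, typically with a reinforcement learning algorithm, which is beyond this sketch.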

Refusal behaviors are learned or scripted policy layers atop base weights; a softmax over next tokens does not by itself refuse harmful prompts. Constitutional-AI-style methods add principle-guided self-critique loops during training.
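A principle-guided self-critique loop can be sketched as a generate, critique, revise cycle whose outputs become fine-tuning data. This is a minimal illustration under stated assumptions: the callables generate, critique, and revise stand in for model calls, the single principle is invented, and the toy stubs only mimic the control flow, not any real system.

```python
from typing import Callable, List, Tuple

# Illustrative principle; real principle lists are not given in this card.
PRINCIPLES = ["Do not provide instructions that facilitate harm."]

def self_critique_round(
    prompt: str,
    draft: str,
    principles: List[str],
    critique: Callable[[str, str, str], str],
    revise: Callable[[str, str, str], str],
) -> str:
    """Check the draft against each principle and revise when a critique is raised."""
    for principle in principles:
        feedback = critique(prompt, draft, principle)
        if feedback:  # an empty string means no violation was found
            draft = revise(prompt, draft, feedback)
    return draft

def build_revision_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    critique: Callable[[str, str, str], str],
    revise: Callable[[str, str, str], str],
    principles: List[str],
) -> List[Tuple[str, str]]:
    """Collect (prompt, revised response) pairs that could serve as training targets."""
    return [
        (p, self_critique_round(p, generate(p), principles, critique, revise))
        for p in prompts
    ]

# Toy stand-ins for model calls (hypothetical, for control flow only).
def toy_generate(prompt: str) -> str:
    return ("UNSAFE: " if "harmful" in prompt else "Answer to: ") + prompt

def toy_critique(prompt: str, draft: str, principle: str) -> str:
    return "Violates: " + principle if draft.startswith("UNSAFE") else ""

def toy_revise(prompt: str, draft: str, feedback: str) -> str:
    return "I can't help with that."

dataset = build_revision_dataset(
    ["a harmful request", "a benign question"],
    toy_generate, toy_critique, toy_revise, PRINCIPLES,
)
print(dataset)
```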

Alignment reduces some harms but does not certify robustness under adversarial prompts or distribution shift. Minimizing next-token loss can still leave unsafe completions because the objective rewards imitation of the training text, not adherence to human values, unless it is augmented with additional training signals.
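A small numeric example can make the imitation point concrete. The sketch assumes a toy corpus in which 3 of 10 continuations of some prompt are unsafe (invented numbers) and compares the average next-token cross-entropy of two candidate predictive distributions.

```python
import math
from collections import Counter

# Toy corpus of continuations for one prompt; the 7:3 split is invented.
corpus_continuations = ["safe_answer"] * 7 + ["unsafe_answer"] * 3

def cross_entropy(probs, samples):
    """Average negative log-likelihood of the samples under probs."""
    return -sum(math.log(probs[s]) for s in samples) / len(samples)

counts = Counter(corpus_continuations)
total = len(corpus_continuations)

# Distribution that imitates the corpus exactly.
empirical = {token: count / total for token, count in counts.items()}

# Distribution that almost never produces the unsafe continuation.
safer = {"safe_answer": 0.99, "unsafe_answer": 0.01}

loss_empirical = cross_entropy(empirical, corpus_continuations)
loss_safer = cross_entropy(safer, corpus_continuations)
print(loss_empirical, loss_safer)
```

The imitating distribution achieves the lower loss, so the pure likelihood objective prefers a model that still emits the unsafe continuation about 30% of the time in this toy setting.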

Helpfulness is therefore a post-training property: base weights encode statistical patterns of text, while alignment data and system prompts shape what users actually experience in products.

Preference datasets are smaller and noisier than pretraining corpora, so alignment stages can overfit stylistic cues (verbosity, hedging) while still missing robust refusals on out-of-distribution attacks.
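One simple way to probe for a verbosity cue is to measure how often the preferred response in a pair is also the longer one. The sketch below uses an invented four-pair dataset and a hypothetical helper, longer_chosen_rate; a rate far above 0.5 would suggest that a reward model could learn length as a shortcut.

```python
def longer_chosen_rate(pairs):
    """Fraction of pairs where the preferred response has more words than the rejected one."""
    longer = sum(
        1 for chosen, rejected in pairs
        if len(chosen.split()) > len(rejected.split())
    )
    return longer / len(pairs)

# Toy (preferred, rejected) pairs, invented for illustration.
toy_pairs = [
    ("It depends on several factors, which I will list in detail below.", "It depends."),
    ("Yes.", "Yes, although there are some caveats worth mentioning here."),
    ("Generally this is considered safe, but please check the label first.", "Probably fine."),
    ("I would suggest reading the official guide before you start.", "Read the guide."),
]

rate = longer_chosen_rate(toy_pairs)
print(rate)
```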

System prompts and safety classifiers add another layer outside base weights; alignment is therefore a stack of model, data, and product decisions rather than a single loss term.
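The layering can be sketched as a wrapper around the model call. Everything here is hypothetical: guarded_reply, the system prompt text, the keyword-based toy_is_unsafe classifier, and the echoing toy_model are placeholders chosen to show the control flow, not a real product design.

```python
from typing import Callable

SYSTEM_PROMPT = "You are a helpful assistant. Decline unsafe requests."  # illustrative
REFUSAL = "Sorry, I can't help with that."

def guarded_reply(
    user_message: str,
    model: Callable[[str, str], str],
    is_unsafe: Callable[[str], bool],
    system_prompt: str = SYSTEM_PROMPT,
) -> str:
    # Layer 1: an input classifier screens the request before the model sees it.
    if is_unsafe(user_message):
        return REFUSAL
    # Layer 2: the (aligned) model answers under a system prompt.
    draft = model(system_prompt, user_message)
    # Layer 3: an output classifier screens the draft before it reaches the user.
    if is_unsafe(draft):
        return REFUSAL
    return draft

# Toy stand-ins (hypothetical): an echoing model and a keyword classifier.
def toy_model(system_prompt: str, user_message: str) -> str:
    return "Echo: " + user_message

def toy_is_unsafe(text: str) -> bool:
    return "blocked-term" in text.lower()

print(guarded_reply("hello", toy_model, toy_is_unsafe))
print(guarded_reply("please BLOCKED-TERM", toy_model, toy_is_unsafe))
```

Because the classifiers and the prompt live outside the weights, each layer can be changed without retraining the model, which is one reason products treat alignment as a stack.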

Red-team evaluations and usage policies complement loss-based training when products face real users.

Tasks
Question 1

RLHF-style alignment pipelines rely on:

Hint

Skim the paragraph on RLHF-style alignment pipelines in "Alignment after pretraining" before choosing. Eliminate options that contradict a definition stated in the card.

Question 2

Refusal behaviors in deployed models are:

Hint

Skim the paragraph on refusal behaviors in deployed models in "Alignment after pretraining" before choosing. Eliminate options that contradict a definition stated in the card.

Question 3

Constitutional-AI-style methods add:

Hint

Skim the paragraph on Constitutional-AI-style methods in "Alignment after pretraining" before choosing. Eliminate options that contradict a definition stated in the card.

Question 4

Why can minimizing next-token loss still leave unsafe completions?

Hint

Skim the paragraph on why minimizing next-token loss can still leave unsafe completions in "Alignment after pretraining" before choosing. Eliminate options that contradict a definition stated in the card.
