[Road to AI 09] Pre-training, Fine-tuning, and RLHF: How Conversational LLMs Are Built
If the Transformer is the engine, pre-training, fine-tuning, and RLHF are the training process that makes it usable. A practical guide to how conversational AI systems like ChatGPT are actually built.
AI-assisted draft · Editorially reviewed. This blog content may use AI tools for drafting and structuring, and is published after editorial review by the RanketAI Editorial Team.
Series overview (9 of 10)
- 1. Road to AI 01: How Computers Were Born
- 2. Road to AI 02: Transistors and ICs, the Origin of AI Cost Curves
- 3. Road to AI 03: Why Operating Systems and Networks Still Decide AI Service Quality
- 4. The Path to AI 04: World Wide Web and the Democratization of Information, from Collective Intelligence to Artificial Intelligence
- 5. [Road to AI 05] The Infrastructure Revolution: How Distributed Computing Scaled the AI Brain
- 6. [AI to the Future 06] The GPU Revolution: How NVIDIA's CUDA Made AI 1,000x Faster
- 7. [AI Evolution Chronicle #07] How Deep Learning Actually Works: Backpropagation, Gradient Descent, and How Neural Networks Learn
- 8. [Road to AI 08] The Transformer Revolution: "Attention Is All You Need"
- 9. [Road to AI 09] Pre-training, Fine-tuning, and RLHF: How Conversational LLMs Are Built
- 10. [Road to AI 10 · Finale] Scaling Laws and Context Window: Why Bigger Models Improve Quality and Raise Cost
Summary: In episode 08, we looked at the Transformer as an architecture. In episode 09, we look at the training flow. Pre-training builds language priors, fine-tuning shapes role behavior, and RLHF aligns outputs toward useful human preferences.
Questions This Episode Answers
Understanding the Transformer does not automatically explain ChatGPT-like systems. Training data and training order fundamentally change behavior.
This episode focuses on three questions:
- What does pre-training actually teach?
- Why is fine-tuning necessary?
- How did RLHF make models feel "more conversationally useful"?
1. Pre-training: Draw the Language Map First
What is pre-training?
Pre-training teaches an LLM basic language statistics from large corpora: web text, books, docs, and code. The core task is simple: keep predicting the next token.
Example:
"Artificial intelligence will ___"
Repeat this trillions of times, and the model learns syntax, discourse patterns, topical associations, and broad world regularities.
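A minimal sketch of this objective in PyTorch looks like the following. The tiny model, vocabulary size, and random token batch are illustrative assumptions standing in for a real Transformer and a real corpus, not a production recipe:

```python
# Toy sketch of the pre-training objective: next-token prediction.
# The tiny model and random "corpus" below are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 100, 32
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),  # token ids -> vectors
    nn.Linear(d_model, vocab_size),     # vectors -> next-token logits
)

tokens = torch.randint(0, vocab_size, (4, 16))   # a batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # each position predicts the next

logits = model(inputs)                           # (batch, seq-1, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # repeat over trillions of tokens and the "language map" emerges
```

Real pre-training replaces the toy stack with a full Transformer and streams web-scale data, but the loss being minimized is essentially this one.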
Why is it critical?
Pre-training gives the model linguistic intuition. Without this stage, later alignment or formatting steps have no stable foundation to shape.
How did GPT and BERT diverge here?
In 2018, pre-training split into two major directions:
- GPT: generative autoregressive modeling
- BERT: masked-token understanding
Both are pre-training families, but they optimized for different downstream strengths.
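As a rough illustration of the split (the token ids and mask id below are hypothetical placeholders, not a real tokenizer): GPT-style training shifts the sequence by one position, while BERT-style training hides some positions and predicts only those.

```python
# Toy contrast between the two 2018 pre-training objectives.
import torch

tokens = torch.tensor([12, 7, 55, 23, 8])
MASK_ID = 99

# GPT-style (autoregressive): every position predicts the token to its right.
gpt_inputs, gpt_targets = tokens[:-1], tokens[1:]

# BERT-style (masked): hide a position and predict only the hidden token.
bert_inputs = tokens.clone()
bert_inputs[2] = MASK_ID      # the model sees [12, 7, MASK, 23, 8]
bert_target = tokens[2]       # ...and must recover 55 from both sides
```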
2. Fine-tuning: Adapt a General Brain to a Role
Why do we need fine-tuning?
A pre-trained model is broad but not role-specific. Without adaptation, it can be fluent yet misaligned with user intent or product tone.
Common issues after pre-training only:
- follows intent inconsistently
- responds verbosely without structure
- misses domain-specific response format
- may produce unsafe or unhelpful framing
How does fine-tuning work?
The standard entry point is SFT (Supervised Fine-Tuning):
Question: "What is RAG?" -> Answer: "RAG combines retrieval with generation to improve factual accuracy."
Question: "Give me 3 benefits of RAG." -> Answer: "Freshness, citation grounding, and domain adaptation."
Question: "How can our team adopt it?" -> Answer: "Index internal docs, then connect retrieval to your LLM workflow."
The model learns target response style and role-specific behavior.
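A minimal sketch of how such pairs become a supervised loss is shown below. The token ids and the IGNORE convention are illustrative assumptions; real pipelines use a tokenizer and chat templates:

```python
# Toy sketch of SFT data handling: the loss is computed only on answer
# tokens, so the model learns the target response style, not the prompt.
import torch

IGNORE = -100  # positions with this label are skipped by cross_entropy

def build_sft_example(prompt_ids, answer_ids):
    """Concatenate prompt + answer; mask the prompt out of the labels."""
    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = torch.tensor([IGNORE] * len(prompt_ids) + answer_ids)
    return input_ids, labels

# e.g. "What is RAG?" -> "RAG combines retrieval with generation ..."
input_ids, labels = build_sft_example([11, 42, 7], [93, 15, 88, 2])

# With model logits of shape (seq_len, vocab), the usual shifted loss is:
#   loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE)
```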
Why is fine-tuning still not enough?
SFT teaches format, but not the subtler preference trade-offs, for example between:
- a correct but cold answer
- a correct, well-structured answer that discloses uncertainty
To model that preference layer, another stage is needed: RLHF.
3. RLHF: From "Correct" to "Helpful"
What is RLHF?
RLHF means Reinforcement Learning from Human Feedback. Humans compare candidate responses; the model is optimized to align with preferred behavior.
Typical RLHF flow
- Start from a pre-trained base model
- Apply SFT for baseline instruction behavior
- Train preference alignment (historically with reward modeling + policy optimization)
RLHF targets "preferred and useful" behavior, not only raw correctness.
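The preference step classically starts from a reward model trained on human comparisons. A minimal sketch of that pairwise loss is below; the toy scores stand in for a real reward model reading prompt + response:

```python
# Toy sketch of reward-model training in classic RLHF: humans pick the
# better of two answers, and the model learns to score "chosen" above
# "rejected" (a Bradley-Terry style pairwise loss).
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Lower when the chosen answer scores higher than the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative scores for three human comparisons.
chosen = torch.tensor([1.2, 0.4, 0.9])
rejected = torch.tensor([0.3, 0.5, -0.2])
loss = preference_loss(chosen, rejected)
```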
Why did ChatGPT feel different from older models?
The perceived difference was not only knowledge volume. Users noticed better tone, clearer structure, more grounded uncertainty handling, and stronger instruction tracking.
That shift is largely alignment, not pre-training scale alone.
4. Why PPO Entered the Picture
PPO (Proximal Policy Optimization) is widely associated with early RLHF pipelines. Intuitively, it acts as a stability mechanism: move the policy toward the preferred behavior without taking destructive jumps.
RLHF is not just "reward good answers." It is controlled policy adjustment toward a preference direction.
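A minimal sketch of the clipped objective from Schulman et al. 2017, which is where that "no destructive jumps" behavior comes from (the log-probabilities and advantages below are toy values):

```python
# Toy sketch of the PPO-clip objective: the policy update is clipped so the
# new policy cannot move too far from the old one in a single step.
import torch

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)       # how far the policy moved
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return torch.min(ratio * advantage, clipped * advantage).mean()

# Illustrative values; in RLHF the advantage comes from reward-model scores,
# usually with an extra KL penalty toward the SFT reference model.
objective = ppo_clip_objective(
    logp_new=torch.tensor([-1.0, -0.7]),
    logp_old=torch.tensor([-1.1, -0.9]),
    advantage=torch.tensor([0.5, -0.2]),
)  # maximize this (or minimize its negative)
```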
Since 2023, DPO (Direct Preference Optimization) has spread quickly due to its implementation simplicity:
- no separate reward-model training
- direct optimization from preference pairs
However, recent studies report that PPO can still outperform DPO in some reasoning and coding settings. The choice depends on task profile, data quality, and operational constraints.
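A minimal sketch of the DPO loss from Rafailov et al. 2023 follows. The per-response log-probabilities are toy values; in practice they are sums over answer tokens from the policy and a frozen reference model:

```python
# Toy sketch of the DPO loss: preference pairs are optimized directly,
# with a frozen reference model taking the place of a trained reward model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Implicit reward = beta * (policy logprob - reference logprob)."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(
    policy_chosen=torch.tensor([-5.0]), policy_rejected=torch.tensor([-7.5]),
    ref_chosen=torch.tensor([-5.5]), ref_rejected=torch.tensor([-7.0]),
)
```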
5. What Each Stage Owns
| Stage | Responsibility | Analogy |
|---|---|---|
| Pre-training | Broad language/world pattern learning | Reading the world widely |
| Fine-tuning | Task and response-format adaptation | Job-specific training |
| RLHF | Preference and safety alignment | Service-level behavior training |
Understanding all three explains why two Transformer-based models can feel very different in practice.
6. How Modern LLMs Expanded This Pipeline
Modern GPT, Claude, and Gemini families extend the same backbone with:
- larger pre-training mixtures
- longer context windows
- better adaptation data
- stronger preference-learning stacks
- expanded AI feedback loops (RLAIF)
- reasoning-focused RL methods (e.g., GRPO in open literature)
- added safety/tool/memory layers
Examples:
- Claude lineage publicly discusses Constitutional AI and RLAIF framing.
- DeepSeekMath and DeepSeek-R1 explicitly describe GRPO usage.
- OpenAI o1 describes large-scale RL benefits but does not publicly name a specific optimizer.
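For the GRPO methods mentioned above, here is a rough sketch of the group-relative idea described in DeepSeekMath. The rewards are toy values for sampled answers to one prompt; the verifier or reward source that produces them is assumed, not shown:

```python
# Rough sketch of the group-relative advantage behind GRPO: instead of a
# learned value function, each sampled answer is scored against the mean
# of its own group of samples for the same prompt.
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: scores for G answers sampled from the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative rewards, e.g. 1.0 = correct, 0.0 = wrong on a math prompt.
adv = group_relative_advantages(torch.tensor([1.0, 0.0, 0.5, 1.0]))
# Answers above the group average get positive advantage and are reinforced.
```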
The high-level order remains:
learn language -> adapt to role -> align to preference
Next Episode Preview
Episode 10 will cover scaling laws and context expansion: why bigger models and longer context improved quality, and what trade-offs they introduced.
Key Takeaways
| Concept | Practical takeaway |
|---|---|
| Pre-training | Builds the foundational language prior |
| Fine-tuning | Adapts behavior to target use case |
| RLHF | Aligns outputs with human preference |
| PPO | Stability-oriented policy optimization in RLHF pipelines |
| DPO | Simpler preference optimization without separate reward model |
| RLAIF | AI-generated feedback expands alignment scalability |
| GRPO | Group-relative optimization for reasoning-oriented RL |
| Modern LLMs | Refined and scaled versions of the same 3-stage backbone |
FAQ
Q1. If pre-training is strong enough, do we still need chat alignment stages?
Yes. Pre-training gives capability, not product behavior. Conversational quality depends heavily on adaptation and alignment.
Q2. What happens without RLHF-like alignment?
Outputs can be accurate but unhelpful, brittle to intent, or unsafe in framing. Good knowledge alone does not make a usable assistant.
Q3. Do all LLMs follow the exact same training recipe?
No. Details vary by lab and use case. But the broad structure is usually similar: broad learning, task adaptation, preference alignment.
Further Reading
Execution Summary
| Item | Practical guideline |
|---|---|
| Core topic | [Road to AI 09] Pre-training, Fine-tuning, and RLHF: How Conversational LLMs Are Built |
| Best fit | Prioritize for AI Infrastructure workflows |
| Primary action | Profile GPU utilization and memory bottlenecks before scaling horizontally |
| Risk check | Confirm cold-start latency, failover behavior, and cost-per-request at target scale |
| Next step | Set auto-scaling thresholds and prepare a runbook for capacity spikes |
Data Basis
- Series baseline: Core papers and public materials from GPT (2018), BERT (2018), InstructGPT (2022), and later alignment work
- Validation set: Original papers for pre-training, supervised fine-tuning, and preference alignment
- Interpretation principle: Prioritized pipeline role clarity over mathematical depth
Key Claims and Sources
This section maps key claims to their supporting sources one by one for fast verification. Review each claim together with its original reference link below.
| Claim | Source |
|---|---|
| Pre-training learns general language patterns by optimizing next-token prediction over large-scale text | Radford et al. 2018 |
| InstructGPT demonstrates that human-feedback alignment significantly improves instruction-following quality | Ouyang et al. 2022 |
| DPO is introduced as direct preference optimization without separate reward-model training | Rafailov et al. 2023 |
| Recent comparative studies report settings where PPO outperforms DPO on reasoning and coding evaluations | Ivison et al. 2024 |
| Constitutional AI explicitly uses AI feedback (RLAIF) for alignment stages | Bai et al. 2022 |
| GRPO was introduced in DeepSeekMath and used in DeepSeek-R1 as a reasoning-RL framework | Shao et al. 2024; DeepSeek-AI et al. 2025 |
| OpenAI o1 publicly describes large-scale reinforcement learning, while specific optimizer naming is not disclosed in public docs | OpenAI: Learning to reason with LLMs |
External References
The links below are original sources directly used for the claims and numbers in this post. Checking source context reduces interpretation gaps and speeds up re-validation.
- Radford et al.: Improving Language Understanding by Generative Pre-Training (2018)
- Devlin et al.: BERT (2018)
- Ouyang et al.: Training language models to follow instructions with human feedback (InstructGPT, 2022)
- Schulman et al.: Proximal Policy Optimization Algorithms (2017)
- Rafailov et al.: Direct Preference Optimization (2023)
- Ivison et al.: Unpacking DPO and PPO (NeurIPS 2024)
- Bai et al.: Constitutional AI: Harmlessness from AI Feedback (2022)
- Lee et al.: RLAIF vs. RLHF (2023)
- Shao et al.: DeepSeekMath (2024)
- DeepSeek-AI et al.: DeepSeek-R1 (2025)
- OpenAI: Learning to reason with LLMs (o1, 2024)
Related Posts
These related posts are selected to help validate the same decision criteria in different contexts. Read them in order below to broaden comparison perspectives.
[Series][Road to AI 10 · Finale] Scaling Laws and Context Window: Why Bigger Models Improve Quality and Raise Cost
Final episode of the 10-part series. A practical guide to why scaling laws and longer context windows improve LLM quality, and why latency, complexity, and cost rise at the same time.
[Series][Road to AI 08] The Transformer Revolution: "Attention Is All You Need"
A single paper from Google in 2017 changed AI history. The transformer architecture that overcame the limits of RNN and LSTM, and its self-attention mechanism — an intuitive explanation of why ChatGPT, Claude, and Gemini exist today.
[Series][AI Evolution Chronicle #07] How Deep Learning Actually Works: Backpropagation, Gradient Descent, and How Neural Networks Learn
Now that AI has an engine (the GPU), how does it actually learn? This episode breaks down backpropagation, gradient descent, and loss functions with zero math — just clear intuition.
[Series][AI to the Future 06] The GPU Revolution: How NVIDIA's CUDA Made AI 1,000x Faster
Tracing how a gaming graphics chip became the backbone of modern AI — from the birth of CUDA in 2007 to the AlexNet moment in 2012 and today's GPU clusters powering billion-parameter LLMs.
[Series][Road to AI 05] The Infrastructure Revolution: How Distributed Computing Scaled the AI Brain
Data is only useful if you can process it. Discover the history of distributed computing and the cloud revolution that laid the foundation for modern AI models.