[Road to AI 09] Pre-training, Fine-tuning, and RLHF: How Conversational LLMs Are Built
If the Transformer is the engine, pre-training, fine-tuning, and RLHF are the training process that makes it usable. A practical guide to how conversational AI systems like ChatGPT are actually built.
AI-assisted draft · Editorially reviewed. This blog content may use AI tools for drafting and structuring, and is published after editorial review by the RanketAI Editorial Team.
Series overview (9 of 10)
- 1. Road to AI 01: How Computers Were Born
- 2. Road to AI 02: Transistors and ICs, the Origin of AI Cost Curves
- 3. Road to AI 03: Why Operating Systems and Networks Still Decide AI Service Quality
- 4. The Path to AI 04: World Wide Web and the Democratization of Information, from Collective Intelligence to Artificial Intelligence
- 5. [Road to AI 05] The Infrastructure Revolution: How Distributed Computing Scaled the AI Brain
- 6. [AI to the Future 06] The GPU Revolution: How NVIDIA's CUDA Made AI 1,000x Faster
- 7. [AI Evolution Chronicle #07] How Deep Learning Actually Works: Backpropagation, Gradient Descent, and How Neural Networks Learn
- 8. [Road to AI 08] The Transformer Revolution: "Attention Is All You Need"
- 9. [Road to AI 09] Pre-training, Fine-tuning, and RLHF: How Conversational LLMs Are Built
- 10. [Road to AI 10 · Finale] Scaling Laws and Context Window: Why Bigger Models Improve Quality and Raise Cost
Summary: In episode 08, we looked at the Transformer as an architecture. In episode 09, we look at the training flow. Pre-training builds language priors, fine-tuning shapes role behavior, and RLHF aligns outputs toward useful human preferences.
Questions This Episode Answers
Understanding the Transformer does not automatically explain ChatGPT-like systems. Training data and training order fundamentally change behavior.
This episode focuses on three questions:
- What does pre-training actually teach?
- Why is fine-tuning necessary?
- How did RLHF make models feel "more conversationally useful"?
1. Pre-training: Draw the Language Map First
What is pre-training?
Pre-training teaches an LLM basic language statistics from large corpora: web text, books, docs, and code. The core task is simple: keep predicting the next token.
Example:
"Artificial intelligence will ___"
Repeat this trillions of times, and the model learns syntax, discourse patterns, topical associations, and broad world regularities.
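A minimal sketch of this objective in PyTorch looks like the following. The tiny model, vocabulary size, and random token batch are illustrative assumptions standing in for a real Transformer and a real corpus, not a production recipe:

```python
# Toy sketch of the pre-training objective: next-token prediction.
# The tiny model and random "corpus" below are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 100, 32
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),  # token ids -> vectors
    nn.Linear(d_model, vocab_size),     # vectors -> next-token logits
)

tokens = torch.randint(0, vocab_size, (4, 16))   # a batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # each position predicts the next

logits = model(inputs)                           # (batch, seq-1, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # repeat over trillions of tokens and the "language map" emerges
```

Real pre-training replaces the toy stack with a full Transformer and streams web-scale data, but the loss being minimized is essentially this one.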
Why is it critical?
Pre-training gives the model linguistic intuition. Without this stage, later alignment or formatting steps have no stable foundation to shape.
How did GPT and BERT diverge here?
In 2018, pre-training split into two major directions:
- GPT: generative autoregressive modeling
- BERT: masked-token understanding
Both are pre-training families, but they optimized for different downstream strengths.
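As a rough illustration of the split (the token ids and mask id below are hypothetical placeholders, not a real tokenizer): GPT-style training shifts the sequence by one position, while BERT-style training hides some positions and predicts only those.

```python
# Toy contrast between the two 2018 pre-training objectives.
import torch

tokens = torch.tensor([12, 7, 55, 23, 8])
MASK_ID = 99

# GPT-style (autoregressive): every position predicts the token to its right.
gpt_inputs, gpt_targets = tokens[:-1], tokens[1:]

# BERT-style (masked): hide a position and predict only the hidden token.
bert_inputs = tokens.clone()
bert_inputs[2] = MASK_ID      # the model sees [12, 7, MASK, 23, 8]
bert_target = tokens[2]       # ...and must recover 55 from both sides
```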
2. Fine-tuning: Adapt a General Brain to a Role
Why do we need fine-tuning?
A pre-trained model is broad but not role-specific. Without adaptation, it can be fluent yet misaligned with user intent or product tone.
Common issues after pre-training only:
- follows intent inconsistently
- responds verbosely without structure
- misses domain-specific response format
- may produce unsafe or unhelpful framing
How does fine-tuning work?
The standard entry point is SFT (Supervised Fine-Tuning):
Question: "What is RAG?" -> Answer: "RAG combines retrieval with generation to improve factual accuracy."
Question: "Give me 3 benefits of RAG." -> Answer: "Freshness, citation grounding, and domain adaptation."
Question: "How can our team adopt it?" -> Answer: "Index internal docs, then connect retrieval to your LLM workflow."
The model learns target response style and role-specific behavior.
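A minimal sketch of how such pairs become a supervised loss is shown below. The token ids and the IGNORE convention are illustrative assumptions; real pipelines use a tokenizer and chat templates:

```python
# Toy sketch of SFT data handling: the loss is computed only on answer
# tokens, so the model learns the target response style, not the prompt.
import torch

IGNORE = -100  # positions with this label are skipped by cross_entropy

def build_sft_example(prompt_ids, answer_ids):
    """Concatenate prompt + answer; mask the prompt out of the labels."""
    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = torch.tensor([IGNORE] * len(prompt_ids) + answer_ids)
    return input_ids, labels

# e.g. "What is RAG?" -> "RAG combines retrieval with generation ..."
input_ids, labels = build_sft_example([11, 42, 7], [93, 15, 88, 2])

# With model logits of shape (seq_len, vocab), the usual shifted loss is:
#   loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE)
```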
Why is fine-tuning still not enough?
SFT teaches format, but not the subtler preference trade-offs, for example between:
- a correct but cold answer
- a correct, well-structured answer that discloses uncertainty
To model that preference layer, another stage is needed: RLHF.
3. RLHF: From "Correct" to "Helpful"
What is RLHF?
RLHF means Reinforcement Learning from Human Feedback. Humans compare candidate responses; the model is optimized to align with preferred behavior.
Typical RLHF flow
- Start from a pre-trained base model
- Apply SFT for baseline instruction behavior
- Train preference alignment (historically with reward modeling + policy optimization)
RLHF targets "preferred and useful" behavior, not only raw correctness.
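The preference step classically starts from a reward model trained on human comparisons. A minimal sketch of that pairwise loss is below; the toy scores stand in for a real reward model reading prompt + response:

```python
# Toy sketch of reward-model training in classic RLHF: humans pick the
# better of two answers, and the model learns to score "chosen" above
# "rejected" (a Bradley-Terry style pairwise loss).
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Lower when the chosen answer scores higher than the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative scores for three human comparisons.
chosen = torch.tensor([1.2, 0.4, 0.9])
rejected = torch.tensor([0.3, 0.5, -0.2])
loss = preference_loss(chosen, rejected)
```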
Why did ChatGPT feel different from older models?
The perceived difference was not only knowledge volume. Users noticed better tone, clearer structure, more grounded uncertainty handling, and stronger instruction tracking.
That shift is largely alignment, not pre-training scale alone.
4. Why PPO Entered the Picture
PPO (Proximal Policy Optimization) is widely associated with early RLHF pipelines. Intuitively, it acts as a stability mechanism: move the policy toward the preferred behavior without taking destructive jumps.
RLHF is not just "reward good answers." It is controlled policy adjustment toward a preference direction.
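A minimal sketch of the clipped objective from Schulman et al. 2017, which is where that "no destructive jumps" behavior comes from (the log-probabilities and advantages below are toy values):

```python
# Toy sketch of the PPO-clip objective: the policy update is clipped so the
# new policy cannot move too far from the old one in a single step.
import torch

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)       # how far the policy moved
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return torch.min(ratio * advantage, clipped * advantage).mean()

# Illustrative values; in RLHF the advantage comes from reward-model scores,
# usually with an extra KL penalty toward the SFT reference model.
objective = ppo_clip_objective(
    logp_new=torch.tensor([-1.0, -0.7]),
    logp_old=torch.tensor([-1.1, -0.9]),
    advantage=torch.tensor([0.5, -0.2]),
)  # maximize this (or minimize its negative)
```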
Since 2023, DPO (Direct Preference Optimization) has spread quickly due to its implementation simplicity:
- no separate reward-model training
- direct optimization from preference pairs
However, recent studies report that PPO can still outperform DPO in some reasoning and coding settings. The choice depends on task profile, data quality, and operational constraints.
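A minimal sketch of the DPO loss from Rafailov et al. 2023 follows. The per-response log-probabilities are toy values; in practice they are sums over answer tokens from the policy and a frozen reference model:

```python
# Toy sketch of the DPO loss: preference pairs are optimized directly,
# with a frozen reference model taking the place of a trained reward model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Implicit reward = beta * (policy logprob - reference logprob)."""
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(
    policy_chosen=torch.tensor([-5.0]), policy_rejected=torch.tensor([-7.5]),
    ref_chosen=torch.tensor([-5.5]), ref_rejected=torch.tensor([-7.0]),
)
```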
5. What Each Stage Owns
| Stage | Responsibility | Analogy |
|---|---|---|
| Pre-training | Broad language/world pattern learning | Reading the world widely |
| Fine-tuning | Task and response-format adaptation | Job-specific training |
| RLHF | Preference and safety alignment | Service-level behavior training |
Understanding all three explains why two Transformer-based models can feel very different in practice.
6. How Modern LLMs Expanded This Pipeline
Modern GPT, Claude, and Gemini families extend the same backbone with:
- larger pre-training mixtures
- longer context windows
- better adaptation data
- stronger preference-learning stacks
- expanded AI feedback loops (RLAIF)
- reasoning-focused RL methods (e.g., GRPO in open literature)
- added safety/tool/memory layers
Examples:
- Claude lineage publicly discusses Constitutional AI and RLAIF framing.
- DeepSeekMath and DeepSeek-R1 explicitly describe GRPO usage.
- OpenAI o1 describes large-scale RL benefits but does not publicly name a specific optimizer.
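For the GRPO methods mentioned above, here is a rough sketch of the group-relative idea described in DeepSeekMath. The rewards are toy values for sampled answers to one prompt; the verifier or reward source that produces them is assumed, not shown:

```python
# Rough sketch of the group-relative advantage behind GRPO: instead of a
# learned value function, each sampled answer is scored against the mean
# of its own group of samples for the same prompt.
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: scores for G answers sampled from the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative rewards, e.g. 1.0 = correct, 0.0 = wrong on a math prompt.
adv = group_relative_advantages(torch.tensor([1.0, 0.0, 0.5, 1.0]))
# Answers above the group average get positive advantage and are reinforced.
```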
The high-level order remains:
learn language -> adapt to role -> align to preference
Next Episode Preview
Episode 10 will cover scaling laws and context expansion: why bigger models and longer context improved quality, and what trade-offs they introduced.
Key Takeaways
| Concept | Practical takeaway |
|---|---|
| Pre-training | Builds the foundational language prior |
| Fine-tuning | Adapts behavior to target use case |
| RLHF | Aligns outputs with human preference |
| PPO | Stability-oriented policy optimization in RLHF pipelines |
| DPO | Simpler preference optimization without separate reward model |
| RLAIF | AI-generated feedback expands alignment scalability |
| GRPO | Group-relative optimization for reasoning-oriented RL |
| Modern LLMs | Refined and scaled versions of the same 3-stage backbone |
FAQ
Q1. If pre-training is strong enough, do we still need chat alignment stages?
Yes. Pre-training gives capability, not product behavior. Conversational quality depends heavily on adaptation and alignment.
Q2. What happens without RLHF-like alignment?
Outputs can be accurate but unhelpful, brittle to intent, or unsafe in framing. Good knowledge alone does not make a usable assistant.
Q3. Do all LLMs follow the exact same training recipe?
No. Details vary by lab and use case. But the broad structure is usually similar: broad learning, task adaptation, preference alignment.
Further Reading
Execution Summary
| Item | Practical guideline |
|---|---|
| Core topic | [Road to AI 09] Pre-training, Fine-tuning, and RLHF: How Conversational LLMs Are Built |
| Best fit | Prioritize for AI Infrastructure workflows |
| Primary action | Profile GPU utilization and memory bottlenecks before scaling horizontally |
| Risk check | Confirm cold-start latency, failover behavior, and cost-per-request at target scale |
| Next step | Set auto-scaling thresholds and prepare a runbook for capacity spikes |
Data Basis
- Series baseline: Core papers and public materials from GPT (2018), BERT (2018), InstructGPT (2022), and later alignment work
- Validation set: Original papers for pre-training, supervised fine-tuning, and preference alignment
- Interpretation principle: Prioritized pipeline role clarity over mathematical depth
Key Claims and Sources
This section maps key claims to their supporting sources one by one for fast verification. Review each claim together with its original reference link below.
| Claim | Source |
|---|---|
| Pre-training learns general language patterns by optimizing next-token prediction over large-scale text | Radford et al. 2018 |
| InstructGPT demonstrates that human-feedback alignment significantly improves instruction-following quality | Ouyang et al. 2022 |
| DPO is introduced as direct preference optimization without separate reward-model training | Rafailov et al. 2023 |
| Recent comparative studies report settings where PPO outperforms DPO on reasoning and coding evaluations | Ivison et al. 2024 |
| Constitutional AI explicitly uses AI feedback (RLAIF) for alignment stages | Bai et al. 2022 |
| GRPO was introduced in DeepSeekMath and used in DeepSeek-R1 as a reasoning-RL framework | Shao et al. 2024; DeepSeek-AI et al. 2025 |
| OpenAI o1 publicly describes large-scale reinforcement learning, while specific optimizer naming is not disclosed in public docs | OpenAI: Learning to reason with LLMs |
External References
The links below are original sources directly used for the claims and numbers in this post. Checking source context reduces interpretation gaps and speeds up re-validation.
- Radford et al.: Improving Language Understanding by Generative Pre-Training (2018)
- Devlin et al.: BERT (2018)
- Ouyang et al.: Training language models to follow instructions with human feedback (InstructGPT, 2022)
- Schulman et al.: Proximal Policy Optimization Algorithms (2017)
- Rafailov et al.: Direct Preference Optimization (2023)
- Ivison et al.: Unpacking DPO and PPO (NeurIPS 2024)
- Bai et al.: Constitutional AI: Harmlessness from AI Feedback (2022)
- Lee et al.: RLAIF vs. RLHF (2023)
- Shao et al.: DeepSeekMath (2024)
- DeepSeek-AI et al.: DeepSeek-R1 (2025)
- OpenAI: Learning to reason with LLMs (o1, 2024)
Related Posts
These related posts are selected to help validate the same decision criteria in different contexts. Read them in order below to broaden comparison perspectives.
[Series][Road to AI 10 · Finale] Scaling Laws and Context Window: Why Bigger Models Improve Quality and Raise Cost
Final episode of the 10-part series. A practical guide to why scaling laws and longer context windows improve LLM quality, and why latency, complexity, and cost rise at the same time.
[Series][Road to AI 08] The Transformer Revolution: "Attention Is All You Need"
A single paper from Google in 2017 changed AI history. The transformer architecture that overcame the limits of RNN and LSTM, and its self-attention mechanism — an intuitive explanation of why ChatGPT, Claude, and Gemini exist today.
[Series][AI Evolution Chronicle #07] How Deep Learning Actually Works: Backpropagation, Gradient Descent, and How Neural Networks Learn
Now that AI has an engine (the GPU), how does it actually learn? This episode breaks down backpropagation, gradient descent, and loss functions with zero math — just clear intuition.
[Series][AI to the Future 06] The GPU Revolution: How NVIDIA's CUDA Made AI 1,000x Faster
Tracing how a gaming graphics chip became the backbone of modern AI — from the birth of CUDA in 2007 to the AlexNet moment in 2012 and today's GPU clusters powering billion-parameter LLMs.
[Series][Road to AI 05] The Infrastructure Revolution: How Distributed Computing Scaled the AI Brain
Data is only useful if you can process it. Discover the history of distributed computing and the cloud revolution that laid the foundation for modern AI models.