AI Infrastructure·Author: Trensee Editorial·Updated: 2026-04-02

[Road to AI 09] Pre-training, Fine-tuning, and RLHF: How Conversational LLMs Are Built

If the Transformer is the engine, pre-training, fine-tuning, and RLHF are the training process that makes it usable. A practical guide to how conversational AI systems like ChatGPT are actually built.

AI-assisted draft · Editorially reviewed

This blog content may use AI tools for drafting and structuring, and is published after editorial review by the RanketAI Editorial Team.

Summary: In episode 08, we looked at the Transformer as an architecture. In episode 09, we look at the training flow: pre-training builds language priors, fine-tuning shapes role behavior, and RLHF aligns outputs with human preference.

Questions This Episode Answers

Understanding the Transformer does not automatically explain how ChatGPT-like systems behave. Training data and training order fundamentally change behavior.

This episode focuses on three questions:

  1. What does pre-training actually teach?
  2. Why is fine-tuning necessary?
  3. How did RLHF make models feel "more conversationally useful"?

1. Pre-training: Draw the Language Map First

What is pre-training?

Pre-training teaches an LLM basic language statistics from large corpora: web text, books, docs, and code. The core task is simple: keep predicting the next token.

Example:

"Artificial intelligence will ___"

Repeat this trillions of times, and the model learns syntax, discourse patterns, topical associations, and broad world regularities.
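The next-token objective can be sketched with a toy count-based model. This is a minimal illustration only: the corpus is invented, and real LLMs learn these statistics with a neural network over trillions of subword tokens rather than word-level counts.

```python
from collections import Counter, defaultdict

# Toy corpus; real pre-training uses web text, books, docs, and code.
corpus = "artificial intelligence will change how we work and how we learn".split()

# Estimate P(next | current) by counting bigram transitions.
transitions = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    transitions[cur][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequently observed continuation."""
    return transitions[token].most_common(1)[0][0]

print(predict_next("artificial"))  # "intelligence"
print(predict_next("how"))         # "we" (seen twice in the corpus)
```

Scaled up by many orders of magnitude, this same "predict the continuation" pressure is what yields syntax, discourse patterns, and topical associations.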

Why is it critical?

Pre-training gives the model linguistic intuition. Without this stage, later alignment or formatting steps have no stable foundation to shape.

How did GPT and BERT diverge here?

In 2018, the field split into two major directions:

  • GPT: generative autoregressive modeling
  • BERT: masked-token understanding

Both are pre-training families, but they optimized for different downstream strengths.
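The difference in objectives can be sketched on a toy token sequence. This is illustrative only: real pipelines operate on subword tokens, and BERT-style training masks a fraction of positions rather than one fixed slot.

```python
tokens = ["the", "model", "predicts", "tokens"]

# GPT-style (autoregressive): each position predicts the NEXT token,
# conditioning only on the prefix to its left.
gpt_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["the"], "model"), (["the", "model"], "predicts"), ...

# BERT-style (masked): hide a position and predict it from BOTH sides.
masked_input = tokens[:2] + ["[MASK]"] + tokens[3:]
bert_target = tokens[2]  # "predicts"
```

The left-to-right constraint is what makes GPT-style models natural generators, while bidirectional masking favors understanding-style downstream tasks.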


2. Fine-tuning: Adapt a General Brain to a Role

Why do we need fine-tuning?

A pre-trained model is broad but not role-specific. Without adaptation, it can be fluent yet misaligned with user intent or product tone.

Common issues after pre-training only:

  • follows intent inconsistently
  • responds verbosely without structure
  • misses domain-specific response format
  • may produce unsafe or unhelpful framing

How does fine-tuning work?

The standard entry point is SFT (Supervised Fine-Tuning):

Question: "What is RAG?" -> Answer: "RAG combines retrieval with generation to improve factual accuracy."
Question: "Give me 3 benefits of RAG." -> Answer: "Freshness, citation grounding, and domain adaptation."
Question: "How can our team adopt it?" -> Answer: "Index internal docs, then connect retrieval to your LLM workflow."

The model learns target response style and role-specific behavior.
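A minimal sketch of how such pairs become SFT training text. The `<|user|>`/`<|assistant|>` markers here are hypothetical placeholders; every model family defines its own chat template.

```python
# SFT pairs, reusing the examples from this section.
sft_examples = [
    {"prompt": "What is RAG?",
     "response": "RAG combines retrieval with generation to improve factual accuracy."},
    {"prompt": "Give me 3 benefits of RAG.",
     "response": "Freshness, citation grounding, and domain adaptation."},
]

def format_example(ex: dict) -> str:
    # Loss is typically computed only on the response tokens,
    # so the model learns to produce answers, not to echo questions.
    return f"<|user|>{ex['prompt']}<|assistant|>{ex['response']}<|end|>"

print(format_example(sft_examples[0]))
```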

Why is fine-tuning still not enough?

SFT teaches response format, but it does not capture every subtle preference trade-off. Consider two candidate answers:

  • a correct but cold answer
  • a correct, well-structured answer that discloses uncertainty

SFT alone gives the model no signal about which of these users actually prefer.

To model that preference layer, another stage is needed: RLHF.


3. RLHF: From "Correct" to "Helpful"

What is RLHF?

RLHF means Reinforcement Learning from Human Feedback. Humans compare candidate responses; the model is optimized to align with preferred behavior.

Typical RLHF flow

  1. Start from a pre-trained base model
  2. Apply SFT for baseline instruction behavior
  3. Train preference alignment (historically with reward modeling + policy optimization)

RLHF targets "preferred and useful" behavior, not only raw correctness.
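Step 3 historically starts by training a reward model on human comparisons. A common formulation is the Bradley-Terry pairwise loss, sketched here with scalar rewards; real reward models score full token sequences.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the reward model to score the human-preferred
    response above the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider reward margin on the preferred answer means a smaller loss.
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

The policy is then optimized against this learned reward, which is where PPO enters in the next section.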

Why did ChatGPT feel different from older models?

The perceived difference was not only knowledge volume. Users noticed better tone, clearer structure, more grounded uncertainty handling, and stronger instruction tracking.

That shift is largely alignment, not pre-training scale alone.


4. Why PPO Entered the Picture

PPO (Proximal Policy Optimization) is widely associated with early RLHF pipelines. Intuitively, it acts like a stability mechanism: move policy toward preference without destructive jumps.

RLHF is not just "reward good answers." It is controlled policy adjustment toward a preference direction.
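PPO's stability mechanism is its clipped surrogate objective: for each action, the objective takes the minimum of the raw and the clipped probability-ratio term, so a large policy jump earns no extra gain. A minimal sketch for a single (ratio, advantage) pair:

```python
def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate for one action: min of the raw and clipped
    ratio * advantage. Moving the policy far from the old policy yields
    no additional objective value, which discourages destructive updates."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)

# A ratio jump beyond 1 + eps is capped: no incentive for a huge step.
print(ppo_clipped_term(1.5, 1.0))  # 1.2, not 1.5
```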

Since 2023, DPO (Direct Preference Optimization) has spread quickly due to its implementation simplicity:

  • no separate reward-model training
  • direct optimization from preference pairs

However, recent studies report that PPO can still outperform DPO in some reasoning/coding settings. Choice depends on task profile, data quality, and operational constraints.
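The DPO idea can be sketched as a single loss over one preference pair: the policy's log-probability margin on the chosen vs. rejected response, measured relative to a frozen reference model. The scalar log-probs below are illustrative, and beta=0.1 is a typical but arbitrary choice.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on one preference pair: -log sigmoid of the policy's
    log-prob margin over the reference model, scaled by beta.
    No separate reward model is trained."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Raising the chosen answer's log-prob relative to the reference lowers the loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < dpo_loss(-2.0, -2.0, -2.0, -2.0))  # True
```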


5. What Each Stage Owns

Stage | Responsibility | Analogy
Pre-training | Broad language/world pattern learning | Reading the world widely
Fine-tuning | Task and response-format adaptation | Job-specific training
RLHF | Preference and safety alignment | Service-level behavior training

Understanding all three explains why two Transformer-based models can feel very different in practice.


6. How Modern LLMs Expanded This Pipeline

Modern GPT, Claude, and Gemini families extend the same backbone with:

  • larger pre-training mixtures
  • longer context windows
  • better adaptation data
  • stronger preference-learning stacks
  • expanded AI feedback loops (RLAIF)
  • reasoning-focused RL methods (e.g., GRPO in open literature)
  • added safety/tool/memory layers

Examples:

  • Claude lineage publicly discusses Constitutional AI and RLAIF framing.
  • DeepSeekMath and DeepSeek-R1 explicitly describe GRPO usage.
  • OpenAI o1 describes large-scale RL benefits but does not publicly name a specific optimizer.

The high-level order remains:

learn language -> adapt to role -> align to preference


Next Episode Preview

Episode 10 will cover scaling laws and context expansion: why bigger models and longer context improved quality, and what trade-offs they introduced.


Key Takeaways

Concept | Practical takeaway
Pre-training | Builds the foundational language prior
Fine-tuning | Adapts behavior to the target use case
RLHF | Aligns outputs with human preference
PPO | Stability-oriented policy optimization in RLHF pipelines
DPO | Simpler preference optimization without a separate reward model
RLAIF | AI-generated feedback expands alignment scalability
GRPO | Group-relative optimization for reasoning-oriented RL
Modern LLMs | Refined and scaled versions of the same 3-stage backbone

FAQ

Q1. If pre-training is strong enough, do we still need chat alignment stages?

Yes. Pre-training gives capability, not product behavior. Conversational quality depends heavily on adaptation and alignment.

Q2. What happens without RLHF-like alignment?

Outputs can be accurate but unhelpful, brittle to intent, or unsafe in framing. Good knowledge alone does not make a usable assistant.

Q3. Do all LLMs follow the exact same training recipe?

No. Details vary by lab and use case. But the broad structure is usually similar: broad learning, task adaptation, preference alignment.

Execution Summary

Item | Practical guideline
Core topic | [Road to AI 09] Pre-training, Fine-tuning, and RLHF: How Conversational LLMs Are Built
Best fit | Prioritize for AI Infrastructure workflows
Primary action | Profile GPU utilization and memory bottlenecks before scaling horizontally
Risk check | Confirm cold-start latency, failover behavior, and cost-per-request at target scale
Next step | Set auto-scaling thresholds and prepare a runbook for capacity spikes

Data Basis

  • Series baseline: Core papers and public materials from GPT (2018), BERT (2018), InstructGPT (2022), and later alignment work
  • Validation set: Original papers for pre-training, supervised fine-tuning, and preference alignment
  • Interpretation principle: Prioritized pipeline role clarity over mathematical depth

