Overview and Motivation

  • Presentation on integrating reinforcement learning concepts with LLM engines to improve performance
  • Multiple approaches exist to enhance LLM performance: in-context learning, post-training (reinforcement learning), and fine-tuning
  • Focus on efficiently integrating RL concepts with LLMs, particularly for multi-agent systems
  • Covers differences between reinforcement learning on LLMs versus reinforcement learning on LLM agents
  • Addresses RL implementation in HPC systems due to memory-intensive requirements

Multi-Agent System Fundamentals

  • Multi-agent systems consist of multiple LLMs, each with specific roles and functions
  • State acts as history, compiling all previous agent turns, context, and evidence in multi-turn systems

Communication Topologies

  • Sequential (Chain): Agents run one after another; Agent B depends on Agent A’s output, and the master agent makes the final decision last
  • Hierarchical (Leader-Worker): Parallel approach where workers operate simultaneously and communicate with leader agent
  • Leader agent provides feedback to worker agents, enabling performance updates
  • Hierarchical approach is faster than sequential

Self-Consistency and Multi-Path Generation

  • Workers can generate multiple paths via self-consistency, prompting each LLM to explore several reasoning directions
  • Can sample at multiple temperatures or instruct the LLM to propose multiple alternative paths
  • Example: Four worker agents with four paths each creates 16 total paths for leader to evaluate
  • Inspired by principle: “agents sample candidates and only think when candidates conflict”
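
A minimal sketch of the multi-path idea, assuming a generic `generate(prompt, temperature)` client and an `extract_answer` helper (both hypothetical placeholders, not the presenter's code):

```python
from collections import Counter

def sample_paths(generate, prompt, temperatures=(0.3, 0.7, 1.0, 1.2)):
    """Ask one worker LLM for several independent reasoning paths."""
    # `generate(prompt, temperature)` stands in for whatever client call is used.
    return [generate(prompt, temperature=t) for t in temperatures]

def self_consistent_answer(paths, extract_answer):
    """Majority-vote over the final answers extracted from all sampled paths."""
    answers = [extract_answer(p) for p in paths]
    return Counter(answers).most_common(1)[0][0]

# Four workers x four temperatures each = 16 candidate paths for the leader to evaluate.
```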

Shared Memory

  • Shared workspace for agents that persists across turns, preventing information silos and giving leader full visibility
  • Implemented as structured JSON or database where agents post information
  • Worker agents have append-only access to their sections; master/leader has full read-write access
  • Benefits:
    • Consistency: Avoids contradictions between agents
    • Context management: Reduces context window requirements
    • Conflict resolution: All workers access same shared memory
  • Ray tool recommended for implementing shared memory and data parallelism in HPC
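
A minimal sketch of such a shared workspace implemented as a Ray actor; the class, section names, and worker ids are illustrative, not the presenter's implementation:

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class SharedMemory:
    """Blackboard that persists across turns; workers append, the leader has full access."""
    def __init__(self, worker_ids):
        self.sections = {wid: [] for wid in worker_ids}   # append-only per worker
        self.leader_notes = []

    def append(self, worker_id, entry):
        # Workers may only add to their own section (append-only access).
        self.sections[worker_id].append(entry)

    def leader_write(self, entry):
        # The master/leader has full read-write access.
        self.leader_notes.append(entry)

    def read_all(self):
        # Full visibility for the leader: every worker section plus its own notes.
        return {"workers": self.sections, "leader": self.leader_notes}

memory = SharedMemory.remote(worker_ids=["innovation", "practicality"])
ray.get(memory.append.remote("innovation", {"claim": "...", "evidence": "..."}))
snapshot = ray.get(memory.read_all.remote())
```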

Bottlenecks in Agentic Systems

Master-Worker Mismatch

  • Master/leader agent may hallucinate, omitting critical information during final synthesis
  • Problem occurs when leader agent lacks sufficient knowledge about worker agent contexts
  • Wrong feedback creates cascading errors, potentially leading to never-ending loops
  • Leader agent relies on parametric knowledge which may be insufficient

Other Challenges

  • Over-commitment trap: Agents commit to the first generated solution without deep deliberation or structural critique
  • Feedback loops: Agents may repeat points without progress
  • Poor feedback quality: Feedback from leader to workers may not be helpful
  • Credit assignment problem: Difficulty determining which agent failed when final output is poor
  • Groundedness gap: Similar to the master-worker mismatch; the challenge of determining what constitutes a “good” or “perfect” answer

Solutions to Bottlenecks

  • Add verifier agent between worker and leader agents to score groundedness (e.g., 80%, 70%, 30%)
  • Leader uses groundedness scores to direct workers appropriately
  • Provide external tools (web search, databases) for verification to reduce hallucinations
  • Implement stopping criteria based on iteration improvements: stop when score improvement is less than 10% between iterations
  • Threshold can be adjusted based on task sensitivity
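
A minimal sketch of the stopping rule, interpreting the 10% threshold as relative improvement of the verifier score between iterations (an absolute-point threshold works the same way):

```python
def should_stop(scores, min_rel_improvement=0.10):
    """Stop iterating once the verifier score improves by less than 10% between rounds."""
    if len(scores) < 2:
        return False
    prev, curr = scores[-2], scores[-1]
    if prev == 0:
        return False
    return (curr - prev) / prev < min_rel_improvement

# Groundedness scores from the verifier agent across rounds:
print(should_stop([0.30, 0.70]))  # False: large jump, keep iterating
print(should_stop([0.70, 0.74]))  # True: <10% relative improvement, stop
```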

Reinforcement Learning Approaches

RLHF (Reinforcement Learning from Human Feedback)

  • Uses human preference data (A vs B comparisons)
  • Trains separate reward model using human preferences
  • Employs PPO (Proximal Policy Optimization) algorithm

RLAIF (Reinforcement Learning from AI Feedback)

  • Replaces human feedback with teacher LLM (e.g., GPT-4)
  • Teacher LLM judges which output is better between pairs
  • Trains reward model based on AI preferences

PPO (Proximal Policy Optimization)

  • Actor-critic RL algorithm requiring four models: actor, reference, critic, and reward model
  • Uses clipped objective to limit model changes per update
  • Very memory-intensive due to need for critic model

DPO (Direct Preference Optimization)

  • Eliminates separate reward model required in PPO
  • Uses chosen/rejected pairs to act as own reward model
  • Less memory-intensive than PPO
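
For reference, the standard DPO objective in PyTorch, computed from the summed token log-probabilities of the chosen (y_w) and rejected (y_l) responses under the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss; each argument is a tensor of summed token log-probs, shape (batch,)."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen        # implicit reward of y_w
    rejected_margin = policy_logp_rejected - ref_logp_rejected  # implicit reward of y_l
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()
```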

GRPO (Group Relative Policy Optimization)

  • Model generates 4-8 responses for each input
  • Computes each response’s reward and its advantage relative to the group’s average
  • Responses scoring above the group average (e.g., R2 and R3) receive positive advantages
  • Trains the model toward these better-than-average responses
  • More powerful than DPO because comparisons are made against the group average rather than a single chosen/rejected pair
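
A minimal sketch of the group-relative advantage computation; normalizing by the group's standard deviation as well as subtracting the mean follows the usual GRPO formulation:

```python
import torch

def group_relative_advantages(rewards):
    """GRPO-style advantages: each response is scored against the group mean.

    `rewards` has shape (group_size,) for the 4-8 responses sampled for one prompt.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + 1e-8)  # positive => better than average

adv = group_relative_advantages(torch.tensor([0.2, 0.7, 0.8, 0.3]))
# R2 and R3 get positive advantages and are reinforced; R1 and R4 are discouraged.
```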

KTO (Kahneman-Tversky Optimization)

  • Only requires binary good/bad labels, not ranked pairs
  • Simpler labeling requirement compared to DPO

RL on LLM vs RL on LLM Agents

Key Differences

  • RL on LLM: Straightforward optimization of single-turn response quality
    • Goal: Improve output response based on input
    • Linear input-output relationship
  • RL on LLM Agents: Optimize agent collaboration and specialized roles
    • Worker agents: How to synthesize information from input
    • Leader agents: How to be better judge and provide good feedback
    • Master agents: How to synthesize and represent all information
    • Must ensure efficiency and performance of each agent type, not just final output

Implementation Differences

  • Input/State:
    • LLM: Current user prompt
    • LLM Agents: Prompt plus history, processes, and conversations between agents
  • Action Space:
    • LLM: Text generation in single block
    • LLM Agents: API calls, search queries, revisions, tool usage
  • Rewards:
    • LLM: Subjective (human-rated or LLM-as-judge)
    • LLM Agents: Objective (verification scores, evidence matching)
  • Risks:
    • LLM: Boring or repetitive text
    • LLM Agents: Information loss during synthesis

Supervised Fine-Tuning for Agents

Why SFT is Necessary

  • RL is a sparse-reward problem: the model must already know how to call tools and format feedback messages before it can discover any reward
  • Need high-quality training data to teach leader agents tool access and feedback provision
  • Data can be synthetic or human-preferred

SFT Dataset Construction

  • Structured with thought, action, and communication styles
  • Based on high-quality teacher trajectories
  • Uses imitation learning
  • Data collection involves labeling (human or RLAIF)
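
A minimal sketch of what one SFT record for a leader agent might look like; all field names and contents are illustrative assumptions, not the presenter's dataset schema:

```python
# One illustrative SFT record for a leader agent (field names are assumptions).
sft_example = {
    "role": "leader",
    "state": {
        "task": "Assess the groundedness of the worker's claimed limitation.",
        "worker_message": "The paper does not evaluate on multilingual data.",
        "verifier_score": 0.7,
    },
    "target": {
        "thought": "The claim is plausible but cites no section of the paper.",
        "action": "request_evidence",   # e.g., a tool call or feedback action
        "communication": "Please quote the experimental setup section that "
                         "supports this limitation, or revise the claim.",
    },
}
# A dataset of such (state, target) pairs drives imitation learning before any RL stage.
```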

Beyond Imitation: Trial and Error Learning

  • Use RL to punish lazy or generic feedback
  • Requires collecting: trajectories, rewards, and goal-based validation
  • When leader provides poor feedback, reward is low (e.g., 10%)
  • Model learns to avoid low-reward instructions and follow high-reward patterns

Trajectory Optimization

  • RL teaches agents to pivot when worker agents hallucinate
  • Reward based on number of communication rounds needed
  • Fewer rounds = higher reward (e.g., 2 rounds better than 5 rounds)
  • More back-and-forth conversations indicate inefficiency
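
A minimal sketch of a round-count reward; the linear penalty shape is an assumption, the talk only specifies that fewer rounds should score higher:

```python
def efficiency_reward(num_rounds, max_rounds=5, base=1.0):
    """Reward fewer leader-worker communication rounds (2 rounds beats 5)."""
    # Linearly penalize extra rounds; clip at zero for very long conversations.
    return max(0.0, base * (1 - (num_rounds - 1) / max_rounds))

print(efficiency_reward(2))  # 0.8  -> efficient trajectory
print(efficiency_reward(5))  # 0.2  -> inefficient, lots of back-and-forth
```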

RL Workflow for LLM Agents

Complete Training Loop

  1. Start with task/environment and expert/human (can be LLM or human)
  2. Expert provides preference data (e.g., y_w preferred over y_l)
  3. Calculate DPO loss and update policy
  4. Train LLM-based agents with updated policy
  5. LLM agents interact with environment and generate trajectories
  6. Trajectories go to reward function (environment model)
  7. Reward loss updates policy again
  8. Iterative process requiring 1-10 iterations for final response
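
A high-level skeleton of this loop; every object and method here (`expert`, `env`, `reward_fn`, the policy's update methods) is a placeholder for whatever stack is actually used:

```python
def train_agent_system(policy, expert, env, reward_fn, n_iterations=10):
    """Skeleton of the training loop above; all arguments are placeholder objects."""
    for _ in range(n_iterations):
        # Steps 1-3: expert preferences -> DPO loss -> policy update
        tasks = env.sample_tasks()
        preference_pairs = expert.label_preferences(policy, tasks)  # (x, y_w, y_l) triples
        policy.dpo_update(preference_pairs)

        # Steps 4-6: agents built on the updated policy interact with the environment
        trajectories = env.rollout(policy)  # worker/leader/master conversations

        # Steps 6-7: the reward function (environment model) scores trajectories,
        # and the reward loss updates the policy again
        rewards = [reward_fn(t) for t in trajectories]
        policy.rl_update(trajectories, rewards)

        # Step 8: iterate until the responses stop improving
        if env.converged(trajectories):
            break
    return policy
```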

Model Architecture for Multiple Agent Types

  • Most implementations use same base model for all agent types
  • Save separate LoRA adapters for each agent category (worker, leader, master)
  • A single adapter shared by all roles is not sufficient; three role-specific adapters are needed on top of the common base model
  • Datasets differ for each agent type based on their specific roles
  • At inference, load appropriate adapter for each agent type
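
A minimal sketch of role-specific adapter switching, assuming Hugging Face PEFT; the base checkpoint and adapter paths are illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Attach one LoRA adapter per agent role to the same base model.
model = PeftModel.from_pretrained(base, "adapters/worker", adapter_name="worker")
model.load_adapter("adapters/leader", adapter_name="leader")
model.load_adapter("adapters/master", adapter_name="master")

model.set_adapter("worker")   # activate the worker role
# ... generate worker outputs ...
model.set_adapter("leader")   # switch roles without reloading the base model
```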

Implementation: Scientific Paper Limitation Generation

Project Overview

  • Task: Generate scientific paper limitations using LLM agents and RL
  • Input: Research paper with limitations section removed
  • Ground truth: Paper’s mentioned limitations and peer reviews
  • Uses one-to-one matching between model responses and ground truth to measure performance

System Architecture - Left Branch

  • Input paper goes to multiple worker agents
  • LLM generates knowledge checklist as additional information for workers
  • Table database stores cited paper information for evidence extraction
  • Worker agents communicate with verifier and reward agents based on criteria
  • Leader agent receives scores from verifier and inputs from workers
  • Leader provides feedback to workers: approve, improve, or modify
  • Stops when performance improvement is less than 10% across two iterations
  • Worker agent categories: innovation and practicality (communicate with leader in parallel)
  • Master agent synthesizes all information, de-duplicates, and generates final limitations

System Architecture - Right Branch (Novelty Agents)

  • Takes input paper plus relevant papers from RAG (Retrieval-Augmented Generation)
  • Cross-checks novelty differences between input paper and RAG papers
  • Measures: novelty, technicality, experimental validation, literature aspects
  • Novelty agents also communicate with leader agents for feedback
  • Merger agent combines outputs from both left and right branches

Data Collection (Rollout Process)

  • Entire process called “rollout” for collecting training dataset
  • For SFT: Must ensure data quality is very good (human or AI validation)
  • For GRPO: Need at least 4-8 responses to calculate average

GRPO Implementation with Multiple Modes

  • Four different modes tested (task-specific):
    1. Strict grounding: Leader requires citations for all worker claims
    2. Critic heavy: Leader acts as harsh critic
    3. Retrieval heavy: Emphasizes communicative RAG
    4. Additional specialized approaches
  • Each mode generates four responses (16 total paths)
  • Note: For standard GRPO, multiple modes not required - can use single mode with four responses via self-consistency or temperature variation
  • Most common approach: Chain of thought with self-consistency

Training Requirements

  • Three essential components:
    1. Log probabilities (token probabilities)
    2. Trajectories (agent communication patterns)
    3. Rewards (from reward model based on criteria)
  • Need three LoRA adapters: worker, leader, master
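
A minimal sketch of the per-turn record these three components imply; field names are assumptions made for illustration:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RolloutRecord:
    """One training example collected during rollout."""
    role: str                    # "worker", "leader", or "master"
    prompt: str                  # state: task plus shared-memory context
    response: str                # the agent's generated turn
    token_logprobs: List[float]  # log-probabilities of the generated tokens
    trajectory_id: str           # links turns of the same agent conversation
    reward: float                # from the reward model / verifier rubric
    extra: Dict = field(default_factory=dict)
```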

Inference Stage

  • Load three fine-tuned LoRA models
  • Repeat entire process for testing
  • Generate final limitations
  • Evaluate against ground truth (collected from papers’ limitation sections and peer reviews)

Reward Model

  • Based on criteria: groundedness, specificity, adversarial penalty
  • Uses LLM’s parametric knowledge (no external information currently)
  • Presenter acknowledges potential for hallucination in reward model
  • Considering adding external information sources (RAG, databases) for verification

DPO Experiments

  • Tested DPO only on master agent (for merging capability)
  • Used LLM-as-judge to determine chosen vs rejected pairs
  • Generated two responses with different temperatures for comparison
  • DPO showed considerable hallucinations
  • GRPO performed better than DPO
  • LLM agents alone (without RL) currently show superior performance
  • Presenter acknowledges many improvements needed in implementation

Ground Truth Collection

  • Extracted from explicit limitation sections or subsections
  • If no explicit section, searched in discussion, conclusions, or future work sections
  • Used LLM-as-judge for better extraction from noisy data (when limitations not clearly bounded in conclusion sections)

Training Approaches: Online vs Offline Policy

Offline Policy

  • Less memory-hungry compared to online policy
  • Stage 1: Apply rollout to collect data
    • Collect trajectories (agent conversations)
    • Collect log probabilities (token probabilities)
    • Save LoRA adapters for each agent type
    • Save all information to JSON
  • Stage 2: Update policy using same base model
  • Two stages operate separately

Online Policy

  • Both stages work together simultaneously
  • Does not save information from first stage separately
  • Requires very large VRAM
  • More memory-intensive but potentially more efficient

Verification and Reward Models

Influence-Based Verification

  • Agent’s reward includes both current score and future influence
  • Formula: reward = immediate score + γ × (impact on the final master output)
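
A minimal sketch of this reward; the discount factor value is chosen arbitrarily for illustration:

```python
def influence_reward(immediate_score, impact_on_final, gamma=0.9):
    """Agent reward = immediate verifier score + gamma * impact on the master's final output."""
    return immediate_score + gamma * impact_on_final

# e.g., a worker claim that scores 0.7 now and contributes 0.5 to the final answer:
print(influence_reward(0.7, 0.5))  # 1.15
```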

Revision Trajectories

  • Identify error steps and correct bad prefixes
  • Use deletion signal and add good prefix
  • Iterative correction approach

Verifier Rubric

  • Groundedness: How well-supported claims are
  • Specificity: Level of detail and precision
  • Adversarial penalty: Reverse scoring (lower is better)

Stopping Criteria

  • Based on 10% improvement threshold between iterations
  • Task-dependent adjustment possible

Memory Efficiency and HPC Considerations

Model Size and GPU Requirements

  • LLaMA 3 8B model needs at least an 80GB GPU for inference
  • Recommended: Use Q-LoRA with BFloat16 for optimization
  • Some performance drop but acceptable for memory constraints
  • 80GB GPU sufficient for LLaMA 3 8B with online policy
  • Larger models require more GPU resources
  • Recommended for online policy implementations

Specific Memory Requirements

  • LLaMA 3 8B inference: Listed as 35GB but actually needs over 40GB
  • Does not fit on a single 40GB GPU
  • LLaMA 3 70B inference: Over 40GB
  • For environments without 80GB GPUs (only 40GB available): Must use two GPUs working in parallel

Fine-Tuning Requirements

  • LLaMA 3 70B fine-tuning: More than 80GB, possibly around 100GB

Implementation Recommendations

  • Log probability collection challenging with offline policy
  • PyTorch has code for collecting log probabilities but implementation is difficult
  • Strongly recommend online policy over offline policy for easier implementation
  • Collect version information and proceed with online approach
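
Whichever mode is used, the per-token log probabilities themselves can be gathered as below, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`; the `response_start` argument is a hypothetical convenience parameter:

```python
import torch
import torch.nn.functional as F

def sequence_logprobs(model, input_ids, response_start):
    """Per-token log-probabilities of the generated part of a sequence.

    input_ids: (1, seq_len) tensor of prompt + response tokens.
    response_start: index in input_ids where the response begins.
    """
    with torch.no_grad():
        logits = model(input_ids).logits                 # (1, seq_len, vocab)
    logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)  # position t predicts token t+1
    targets = input_ids[:, 1:]                           # next-token targets
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logprobs[:, response_start - 1:]        # keep only the response tokens
```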

LoRA Adapter Management

  • Must save three adapters: worker, leader, master
  • Load appropriate adapters during inference/testing

Tools and Parallelization

Ray for Parallel Processing

  • Excellent tool for multi-agent parallel processing
  • Industry-recommended
  • Enables parallel worker-leader communication
  • Example: Two parallel branches where worker agents communicate with leaders simultaneously
  • For environments with two 40GB GPU nodes: Ray enables both nodes to work simultaneously
  • Essential for parallel processing when worker agents are independent
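
A minimal sketch of independent worker agents dispatched as parallel Ray tasks; the worker prompts and return values are illustrative:

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote  # pass num_gpus=1 here to pin each worker to its own GPU node
def run_worker(worker_prompt):
    # Placeholder for a worker agent's generation step.
    return {"worker": worker_prompt, "draft": "..."}

# Independent worker agents run simultaneously instead of one after another.
futures = [run_worker.remote(p) for p in ["innovation analysis", "practicality analysis"]]
worker_outputs = ray.get(futures)  # the leader agent consumes all outputs at once
```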

Sequential vs Parallel Approaches

  • Sequential approach: Very time-consuming
  • Sequential means: Worker 1 completes, then Worker 2 starts, etc.
  • Parallel approach (with Ray): Much faster, workers operate simultaneously

Other Recommended Tools

  • LLaMA Engine for optimization

GRPO Mathematical Details

Training Stage (Old Model)

  • Log probability calculation: log P_old(y|x)
  • Where y = response, x = input

Update Stage (New Model)

  • Log probability calculation: log P_theta(y|x)
  • Uses same model for policy update
  • GRPO equation applies to calculate loss
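
A minimal, sequence-level sketch of the clipped GRPO update term using the two quantities above (the full objective works per token and adds a KL penalty against a reference model):

```python
import torch

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped policy-gradient term used in GRPO.

    logp_new:    log P_theta(y|x) for each sampled response, shape (group_size,)
    logp_old:    log P_old(y|x) recorded during rollout, same shape
    advantages:  group-relative advantages from the reward model
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    per_sample = torch.minimum(ratio * advantages, clipped * advantages)
    return -per_sample.mean()
```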

Q&A Discussion Highlights

Question on Agent Specialization

  • Hypothesis: Multiple agents focusing on specific tasks (programmer, tester, UI designer) may be more effective for complex tasks
  • Question raised: Why train all agents together rather than individually?
  • Presenter’s response: Two approaches possible
    1. Train each agent individually to maximize performance
    2. Improve collaborative system (presenter’s focus)
  • Assumes individual agents (like GPT) are already powerful
  • Problem may come from collaboration, not individual capability
  • Performance may drop across iterations (rounds 2, 3) indicating communication issues
  • Training uses trajectories showing agent interactions and feedback patterns

Verifier Reliability

  • Concern: Verifier agent itself may hallucinate
  • Not working with fixed milestones currently
  • Depends on worker task
  • Self-consistency may produce varying scores (e.g., 80% then 70%)
  • Solution: Integrate external information sources (tools, databases) for verification

Threshold Sensitivity

  • 10% threshold varies by task
  • For text-based tasks, a 10% change is not a huge jump
  • Fixed at 10% for presenter’s specific task
  • Acknowledged as open parameter requiring tuning
  • Initial experiments tried 40-50 rounds but took too much time
  • May need to decrease threshold in future work

Training Details

  • Batch size and gradient accumulation: 16-32 (presenter needs to verify exact configuration)
  • Training takes considerable time
  • Gradient accumulation cycles need verification