Overview and Motivation
- Presentation on integrating reinforcement learning concepts with LLM engines to improve performance
- Multiple approaches exist to enhance LLM performance: in-context learning, post-training (reinforcement learning), and fine-tuning
- Focus on efficiently integrating RL concepts with LLMs, particularly for multi-agent systems
- Covers differences between reinforcement learning on LLMs versus reinforcement learning on LLM agents
- Addresses RL implementation in HPC systems due to memory-intensive requirements
Multi-Agent System Fundamentals
- Multi-agent systems consist of multiple LLMs, each with specific roles and functions
- State acts as history, compiling all previous agent turns, context, and evidence in multi-turn systems
Communication Topologies
- Sequential (Chain): Agents run one after another; Agent B depends on Agent A’s output before the master agent makes the final decision
- Hierarchical (Leader-Worker): Parallel approach where workers operate simultaneously and communicate with leader agent
- Leader agent provides feedback to worker agents, enabling performance updates
- The hierarchical approach is faster than the sequential one because workers run in parallel
Self-Consistency and Multi-Path Generation
- Workers can generate multiple reasoning paths via self-consistency, forcing each individual LLM to explore several directions
- Paths can be diversified by sampling at multiple temperatures or by prompting the LLM to propose several alternatives (see the sketch after this list)
- Example: Four worker agents with four paths each creates 16 total paths for leader to evaluate
- Inspired by principle: “agents sample candidates and only think when candidates conflict”
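
As a concrete illustration of multi-path generation, the sketch below samples several candidate paths from one worker at different temperatures using Hugging Face `transformers`; the model name, prompt, and temperature values are placeholders, not taken from the presentation.

```python
# Minimal self-consistency sketch: one worker samples several candidate
# reasoning paths at different temperatures; the leader later compares them.
# Model name and prompt are placeholders, not from the presentation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "List possible limitations of the following paper: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

candidate_paths = []
for temperature in (0.3, 0.7, 1.0, 1.2):  # diversify by temperature
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=256,
    )
    candidate_paths.append(tokenizer.decode(output[0], skip_special_tokens=True))

# With 4 workers each producing 4 paths like this, the leader receives 16 candidates.
```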
Shared Memory
- Shared workspace for agents that persists across turns, preventing information silos and giving leader full visibility
- Implemented as structured JSON or database where agents post information
- Worker agents have append-only access to their sections; master/leader has full read-write access
- Benefits:
- Consistency: Avoids contradictions between agents
- Context management: Reduces context window requirements
- Conflict resolution: All workers access same shared memory
- Ray tool recommended for implementing shared memory and data parallelism in HPC
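
One possible way to realize such a shared workspace with Ray is an actor that enforces append-only access for workers and full read/write access for the leader; the class and method names below are illustrative, not from the presentation.

```python
# Illustrative shared-memory workspace as a Ray actor: workers may only
# append to their own section, the leader can read and overwrite everything.
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class SharedWorkspace:
    def __init__(self):
        self.sections = {}  # e.g., {"worker_innovation": [...], "leader": [...]}

    def append(self, section: str, entry: dict):
        # Append-only path used by worker agents.
        self.sections.setdefault(section, []).append(entry)

    def read_all(self) -> dict:
        # Leader/master gets full visibility across all sections.
        return self.sections

    def overwrite(self, section: str, entries: list):
        # Full read-write path reserved for the leader/master agent.
        self.sections[section] = entries

workspace = SharedWorkspace.remote()
workspace.append.remote("worker_innovation", {"claim": "...", "evidence": "..."})
print(ray.get(workspace.read_all.remote()))
```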
Bottlenecks in Agentic Systems
Master-Worker Mismatch
- Master/leader agent may hallucinate or omit critical information during final synthesis
- Problem occurs when leader agent lacks sufficient knowledge about worker agent contexts
- Wrong feedback creates cascading errors, potentially leading to never-ending loops
- Leader agent relies on parametric knowledge which may be insufficient
Other Challenges
- Over-commitment trap: Agents commit to first generated solution without deep deliberation or structural critique
- Feedback loops: Agents may repeat points without progress
- Poor feedback quality: Feedback from leader to workers may not be helpful
- Credit assignment problem: Difficulty determining which agent failed when final output is poor
- Groundedness gap: Related to the master-worker mismatch; it is difficult to determine what constitutes a “good” or “perfect” answer
Solutions to Bottlenecks
- Add verifier agent between worker and leader agents to score groundedness (e.g., 80%, 70%, 30%)
- Leader uses groundedness scores to direct workers appropriately
- Provide external tools (web search, databases) for verification to reduce hallucinations
- Implement stopping criteria based on iteration improvements: stop when score improvement is less than 10% between iterations
- Threshold can be adjusted based on task sensitivity
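
A minimal sketch of the stopping rule described above, assuming the verifier returns a groundedness score in [0, 1] per iteration; `run_iteration`, the default round count, and the 10% threshold are placeholders.

```python
# Sketch of the iteration loop with a verifier-driven stopping criterion:
# stop once the relative improvement between iterations drops below 10%.
def run_until_converged(run_iteration, max_rounds: int = 10, threshold: float = 0.10):
    """run_iteration() is a placeholder: it executes one worker -> verifier -> leader
    round and returns a groundedness score in [0, 1]."""
    previous = None
    for _ in range(max_rounds):
        score = run_iteration()
        if previous is not None and (score - previous) / max(previous, 1e-8) < threshold:
            break  # improvement below the threshold: stop iterating
        previous = score
    return score
```

The threshold argument is what gets tuned per task, as noted above.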
Reinforcement Learning Approaches
RLHF (Reinforcement Learning from Human Feedback)
- Uses human preference data (A vs B comparisons)
- Trains separate reward model using human preferences
- Employs PPO (Proximal Policy Optimization) algorithm
RLAIF (Reinforcement Learning from AI Feedback)
- Replaces human feedback with teacher LLM (e.g., GPT-4)
- Teacher LLM judges which output is better between pairs
- Trains reward model based on AI preferences
PPO (Proximal Policy Optimization)
- Actor-critic RL algorithm requiring four models: actor, reference, critic, and reward model
- Uses clipped objective to limit model changes per update
- Very memory-intensive due to need for critic model
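
For reference, the standard clipped PPO surrogate objective (the “clipped objective” mentioned above), in the usual notation:

```latex
% Clipped PPO surrogate objective
L^{\text{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[
    \min\!\Big(
      r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t
    \Big)
  \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```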
DPO (Direct Preference Optimization)
- Eliminates separate reward model required in PPO
- Uses chosen/rejected pairs so the policy implicitly acts as its own reward model
- Less memory-intensive than PPO
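
The standard DPO loss over chosen/rejected pairs makes the implicit reward explicit through log-probability ratios against the reference model:

```latex
% DPO loss for a preference pair (y_w chosen, y_l rejected) given prompt x
\mathcal{L}_{\text{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
    \log \sigma\!\Big(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \Big)
  \right]
```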
GRPO (Group Relative Policy Optimization)
- Model generates 4-8 responses for each input
- Computes the group’s average reward and each response’s advantage relative to it
- Identifies the responses that score better than the group average (e.g., R2 and R3)
- Trains the model to favor the better-than-average responses
- Considered stronger than DPO because comparisons are made against a group average rather than a single chosen/rejected pair
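
A small sketch of the group-relative advantage computation, assuming 4-8 scalar rewards for responses to the same input (the reward values here are invented):

```python
# Group-relative advantages as used by GRPO: each response in a group sampled
# for the same input is compared against the group's mean reward.
import torch

rewards = torch.tensor([0.2, 0.8, 0.7, 0.3])  # invented rewards for R1..R4
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Responses with positive advantage (here R2 and R3) are reinforced;
# the ones below the group average are pushed down.
print(advantages)
```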
KTO (Kahneman-Tversky Optimization)
- Only requires binary good/bad labels, not ranked pairs
- Simpler labeling requirement compared to DPO
RL on LLM vs RL on LLM Agents
Key Differences
- RL on LLM: Straightforward optimization of single-turn response quality
- Goal: Improve output response based on input
- Linear input-output relationship
- RL on LLM Agents: Optimize agent collaboration and specialized roles
- Worker agents: How to synthesize information from input
- Leader agents: How to be better judge and provide good feedback
- Master agents: How to synthesize and represent all information
- Must ensure efficiency and performance of each agent type, not just final output
Implementation Differences
- Input/State:
- LLM: Current user prompt
- LLM Agents: Prompt plus history, processes, and conversations between agents
- Action Space:
- LLM: Text generation in single block
- LLM Agents: API calls, search queries, revisions, tool usage
- Rewards:
- LLM: Subjective (human-rated or LLM-as-judge)
- LLM Agents: Objective (verification scores, evidence matching)
- Risks:
- LLM: Boring or repetitive text
- LLM Agents: Information loss during synthesis
Supervised Fine-Tuning for Agents
Why SFT is Necessary
- RL is a sparse-reward problem: the model must already know how to call tools and format feedback messages before it can ever reach a reward
- Need high-quality training data to teach leader agents tool access and feedback provision
- Data can be synthetic or human-preferred
SFT Dataset Construction
- Structured with thought, action, and communication styles
- Based on high-quality teacher trajectories
- Uses imitation learning
- Data collection involves labeling (human or RLAIF)
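
One possible shape for a single SFT example built from a teacher trajectory, with separate thought, action, and communication fields; the schema and field names are illustrative assumptions, not the presenter’s exact format.

```python
# Illustrative SFT record derived from a high-quality teacher trajectory.
# Field names are an assumption, not the presenter's exact schema.
import json

sft_example = {
    "role": "leader",                      # which adapter this example trains
    "state": "Paper summary + worker drafts so far ...",
    "thought": "Worker 2's claim lacks a citation; ask for evidence.",
    "action": {"type": "feedback", "target": "worker_2", "verdict": "improve"},
    "communication": "Please ground the scalability claim in Section 4 or cite evidence.",
    "source": "teacher_trajectory",        # human-labeled or RLAIF-labeled
}

print(json.dumps(sft_example, indent=2))
```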
Beyond Imitation: Trial and Error Learning
- Use RL to punish lazy or generic feedback
- Requires collecting: trajectories, rewards, and goal-based validation
- When leader provides poor feedback, reward is low (e.g., 10%)
- Model learns to avoid low-reward instructions and follow high-reward patterns
Trajectory Optimization
- RL teaches agents to pivot when worker agents hallucinate
- Reward based on number of communication rounds needed
- Fewer rounds = higher reward (e.g., 2 rounds better than 5 rounds)
- More back-and-forth conversations indicate inefficiency
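
A toy reward-shaping sketch showing one way “fewer rounds = higher reward” could be encoded; the function and the per-round cost are assumptions, not the presenter’s actual reward.

```python
# Toy reward shaping: reward final quality but charge a cost per communication
# round, so a 2-round trajectory beats a 5-round one of equal quality.
def trajectory_reward(quality: float, num_rounds: int, round_cost: float = 0.05) -> float:
    """quality in [0, 1] from the verifier; round_cost is an assumed hyperparameter."""
    return quality - round_cost * num_rounds

print(trajectory_reward(0.9, 2))  # 0.80
print(trajectory_reward(0.9, 5))  # 0.65
```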
RL Workflow for LLM Agents
Complete Training Loop
- Start with task/environment and expert/human (can be LLM or human)
- The expert provides preference data (e.g., y_w preferred over y_l)
- Calculate DPO loss and update policy
- Train LLM-based agents with updated policy
- LLM agents interact with environment and generate trajectories
- Trajectories go to reward function (environment model)
- Reward loss updates policy again
- Iterative process requiring 1-10 iterations for final response
Model Architecture for Multiple Agent Types
- Most implementations use same base model for all agent types
- Save separate LoRA adapters for each agent category (worker, leader, master)
- A single unified set of weights cannot specialize all roles - three different adapters are needed on top of the shared base model
- Datasets differ for each agent type based on their specific roles
- At inference, load appropriate adapter for each agent type
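
With Hugging Face PEFT this typically looks like one base model plus three named adapters that are switched at inference time; the base model and adapter paths below are placeholders.

```python
# One base model, three role-specific LoRA adapters (worker, leader, master).
# Adapter paths are placeholders for wherever the fine-tuned adapters are saved.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder base model
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

model = PeftModel.from_pretrained(base, "adapters/worker", adapter_name="worker")
model.load_adapter("adapters/leader", adapter_name="leader")
model.load_adapter("adapters/master", adapter_name="master")

# At inference, activate the adapter that matches the agent's role.
model.set_adapter("leader")
```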
Implementation: Scientific Paper Limitation Generation
Project Overview
- Task: Generate scientific paper limitations using LLM agents and RL
- Input: Research paper with limitations section removed
- Ground truth: Paper’s mentioned limitations and peer reviews
- Uses one-to-one matching between model responses and ground truth to measure performance
System Architecture - Left Branch
- Input paper goes to multiple worker agents
- LLM generates knowledge checklist as additional information for workers
- Table database stores cited paper information for evidence extraction
- Worker agents communicate with verifier and reward agents based on criteria
- Leader agent receives scores from verifier and inputs from workers
- Leader provides feedback to workers: approve, improve, or modify
- Stops when performance improvement is less than 10% across two iterations
- Worker agent categories: innovation and practicality (communicate with leader in parallel)
- Master agent synthesizes all information, de-duplicates, and generates final limitations
System Architecture - Right Branch (Novelty Agents)
- Takes input paper plus relevant papers from RAG (Retrieval-Augmented Generation)
- Cross-checks novelty differences between input paper and RAG papers
- Measures: novelty, technicality, experimental validation, literature aspects
- Novelty agents also communicate with leader agents for feedback
- Merger agent combines outputs from both left and right branches
Data Collection (Rollout Process)
- Entire process called “rollout” for collecting training dataset
- For SFT: Must ensure data quality is very good (human or AI validation)
- For GRPO: Need at least 4-8 responses to calculate average
GRPO Implementation with Multiple Modes
- Four different modes tested (task-specific):
- Strict grounding: Leader requires citations for all worker claims
- Critic heavy: Leader acts as harsh critic
- Retrieval heavy: Emphasizes communicative RAG
- A fourth, more specialized mode
- Each mode generates four responses (16 total paths)
- Note: For standard GRPO, multiple modes not required - can use single mode with four responses via self-consistency or temperature variation
- Most common approach: Chain of thought with self-consistency
Training Requirements
- Three essential components:
- Log probabilities (token probabilities)
- Trajectories (agent communication patterns)
- Rewards (from reward model based on criteria)
- Need three LoRA adapters: worker, leader, master
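
A sketch of collecting per-token log probabilities for a generated response with `transformers`, which is one of the three ingredients listed above; the model name and text are placeholders, and in practice the prompt tokens would be masked out.

```python
# Sketch: per-token log probabilities of a response y given prompt x,
# needed (together with trajectories and rewards) for the policy update.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

text = "PROMPT ... RESPONSE ..."  # placeholder concatenation of x and y
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    logits = model(input_ids).logits            # (1, seq_len, vocab)

log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_log_probs = log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
# token_log_probs[t] = log p(token_{t+1} | tokens_{<=t}); mask to response tokens in practice.
```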
Inference Stage
- Load three fine-tuned LoRA models
- Repeat entire process for testing
- Generate final limitations
- Evaluate against ground truth (collected from papers’ limitation sections and peer reviews)
Reward Model
- Based on criteria: groundedness, specificity, adversarial penalty
- Uses LLM’s parametric knowledge (no external information currently)
- Presenter acknowledges potential for hallucination in reward model
- Considering adding external information sources (RAG, databases) for verification
DPO Experiments
- Tested DPO only on master agent (for merging capability)
- Used LLM-as-judge to determine chosen vs rejected pairs
- Generated two responses with different temperatures for comparison
- DPO showed considerable hallucinations
- GRPO performed better than DPO
- LLM agents alone (without RL) currently show superior performance
- Presenter acknowledges many improvements needed in implementation
Ground Truth Collection
- Extracted from explicit limitation sections or subsections
- If no explicit section, searched in discussion, conclusions, or future work sections
- Used LLM-as-judge for better extraction from noisy data (when limitations not clearly bounded in conclusion sections)
Training Approaches: Online vs Offline Policy
Offline Policy
- Less memory-hungry compared to online policy
- Stage 1: Apply rollout to collect data
- Collect trajectories (agent conversations)
- Collect log probabilities (token probabilities)
- Save LoRA adapters for each agent type
- Save all information to JSON
- Stage 2: Update policy using same base model
- Two stages operate separately
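
A schematic of the two-stage offline flow described above; the function names and JSON path are placeholders standing in for the rollout and policy-update steps.

```python
# Offline-policy schematic: Stage 1 saves rollout data to JSON, Stage 2 reloads
# it and updates the policy with the same base model. Function names are placeholders.
import json

def offline_training(rollout_fn, update_fn, path="rollouts.json", epochs=3):
    # Stage 1: collect trajectories, log probabilities, and rewards, then persist.
    records = rollout_fn()  # e.g., list of {"trajectory": ..., "logprobs": ..., "reward": ...}
    with open(path, "w") as f:
        json.dump(records, f)

    # Stage 2: reload the saved data and update the policy separately.
    with open(path) as f:
        saved = json.load(f)
    for _ in range(epochs):
        update_fn(saved)
```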
Online Policy
- Both stages work together simultaneously
- Does not save information from first stage separately
- Requires very large VRAM
- More memory-intensive but potentially more efficient
Verification and Reward Models
Influence-Based Verification
- Agent’s reward includes both its current score and its future influence
- Formula: reward = immediate score + γ × (impact on the final master output)
Revision Trajectories
- Identify the erroneous step and correct the bad prefix
- Apply a deletion signal to the bad prefix and append a good prefix in its place
- Iterative correction approach
Verifier Rubric
- Groundedness: How well-supported claims are
- Specificity: Level of detail and precision
- Adversarial penalty: Reverse scoring (lower is better)
Stopping Criteria
- Based on 10% improvement threshold between iterations
- Task-dependent adjustment possible
Memory Efficiency and HPC Considerations
Model Size and GPU Requirements
- LLaMA 3 8B needs at least an 80GB GPU once the full online-policy RL pipeline is run
- Recommended: use QLoRA with bfloat16 compute for memory optimization (see the sketch after this list)
- Some performance drop, but acceptable given the memory constraints
- An 80GB GPU is sufficient for LLaMA 3 8B with the online policy
- Larger models require more GPU resources
- 80GB-class GPUs are recommended for online-policy implementations
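
A sketch of the recommended QLoRA setup with bfloat16 compute, using bitsandbytes 4-bit quantization plus a LoRA config; the model name and hyperparameters are placeholders.

```python
# QLoRA-style setup: load the base model in 4-bit with bfloat16 compute,
# then attach a LoRA adapter. Hyperparameters below are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```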
Specific Memory Requirements
- LLaMA 3 70B inference: listed at around 35GB but in practice needs over 40GB, so it does not fit on a single 40GB GPU
- For environments without 80GB GPUs (only 40GB available): Must use two GPUs working in parallel
Fine-Tuning Requirements
- LLaMA 3 70B fine-tuning: More than 80GB, possibly around 100GB
Implementation Recommendations
- Log probability collection challenging with offline policy
- PyTorch provides code paths for collecting log probabilities, but the implementation is difficult
- Strongly recommend the online policy over the offline policy for easier implementation
- Collect the necessary information and proceed with the online approach
LoRA Adapter Management
- Must save three adapters: worker, leader, master
- Load appropriate adapters during inference/testing
Tools and Parallelization
Ray for Parallel Processing
- Excellent tool for multi-agent parallel processing
- Industry-recommended
- Enables parallel worker-leader communication
- Example: Two parallel branches where worker agents communicate with leaders simultaneously
- For environments with two 40GB GPU nodes: Ray enables both nodes to work simultaneously
- Essential for parallel processing when worker agents are independent
Sequential vs Parallel Approaches
- Sequential approach: Very time-consuming
- Sequential means: Worker 1 completes, then Worker 2 starts, etc.
- Parallel approach (with Ray): Much faster, workers operate simultaneously
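
A minimal contrast of the two execution styles with Ray remote tasks; `run_worker` is a placeholder for one worker agent’s turn, and the task names are illustrative.

```python
# Sequential vs. parallel worker execution. run_worker is a placeholder for
# one worker agent's turn; with Ray, all workers run simultaneously.
import time
import ray

ray.init(ignore_reinit_error=True)

def run_worker(task):           # placeholder for a worker agent call
    time.sleep(1)
    return f"draft for {task}"

# Sequential: Worker 1 finishes before Worker 2 starts, and so on.
sequential = [run_worker(t) for t in ["innovation", "practicality", "novelty"]]

# Parallel with Ray: all workers are launched at once and gathered together.
run_worker_remote = ray.remote(run_worker)
parallel = ray.get(
    [run_worker_remote.remote(t) for t in ["innovation", "practicality", "novelty"]]
)
```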
Other Recommended Tools
- LLaMA Engine for optimization
GRPO Mathematical Details
Training Stage (Old Model)
- Log-probability under the old policy: log π_old(y | x)
- Where y = response, x = input
Update Stage (New Model)
- Log-probability under the current policy: log π_θ(y | x)
- Uses the same base model for the policy update
- The GRPO objective combines the two to compute the loss, as written out below
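
In the standard GRPO formulation, these two log-probabilities enter through a clipped importance ratio averaged over a group of G responses sampled for the same input, with a KL penalty against the reference model:

```latex
% GRPO objective for a group of G responses {y_i} sampled from the old policy
\mathcal{J}_{\text{GRPO}}(\theta) =
  \mathbb{E}\!\left[
    \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
    \min\!\Big( \rho_{i,t}\,\hat{A}_i,\;
                \operatorname{clip}\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big)
  \right]
  - \beta\, \mathbb{D}_{\text{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big),
\qquad
\rho_{i,t} = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\text{old}}(y_{i,t} \mid x, y_{i,<t})},
\quad
\hat{A}_i = \frac{r_i - \operatorname{mean}(r)}{\operatorname{std}(r)}
```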
Q&A Discussion Highlights
Question on Agent Specialization
- Hypothesis: Multiple agents focusing on specific tasks (programmer, tester, UI designer) may be more effective for complex tasks
- Question raised: Why train all agents together rather than individually?
- Presenter’s response: Two approaches possible
- Train each agent individually to maximize performance
- Improve collaborative system (presenter’s focus)
- Assumes individual agents (like GPT) are already powerful
- Problem may come from collaboration, not individual capability
- Performance may drop across iterations (rounds 2, 3) indicating communication issues
- Training uses trajectories showing agent interactions and feedback patterns
Verifier Reliability
- Concern: Verifier agent itself may hallucinate
- Not working with fixed milestones currently
- Depends on worker task
- Self-consistency may produce varying scores (e.g., 80% then 70%)
- Solution: Integrate external information sources (tools, databases) for verification
Threshold Sensitivity
- 10% threshold varies by task
- For text-based tasks, a 10% improvement is not a huge jump
- Fixed at 10% for presenter’s specific task
- Acknowledged as open parameter requiring tuning
- Initial experiments tried 40-50 rounds but took too much time
- May need to decrease threshold in future work
Training Details
- Batch size and gradient accumulation: 16-32 (presenter needs to verify exact configuration)
- Training takes considerable time
- Gradient accumulation cycles need verification