AI Feed

paper · ai · ml · data-science · Tue, 02 Ju

Universal One-third Time Scaling in Learning Peaked Distributions

arxiv · score 12

Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debated. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components generically yield losses and gradients that vanish as power laws, regardless of many microscopic details, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
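The bottleneck described in the abstract can be illustrated with a minimal sketch. This is not the paper's toy model: it assumes the simplest possible case, a two-class softmax with a fully peaked (one-hot) target, reduced to a single logit gap `z` and trained by plain gradient descent. The function names, learning rate, and step counts are illustrative assumptions. In this one-logit case the loss decays as a power law with an exponent close to 1, not the $1/3$ reported in the abstract; the sketch only shows how softmax plus cross-entropy alone makes gradients vanish and convergence slow to a power law.

```python
import math

def softmax_ce_loss(z):
    # Two-class softmax cross-entropy against a one-hot target,
    # written in terms of the logit gap z (correct minus incorrect).
    return math.log1p(math.exp(-z))

def run_gd(steps=100_000, lr=1.0, z0=0.0):
    # Plain gradient descent on the single parameter z (illustrative setup,
    # not the paper's model). Records the loss before each update.
    z = z0
    losses = []
    for _ in range(steps):
        losses.append(softmax_ce_loss(z))
        grad = -1.0 / (1.0 + math.exp(z))  # dL/dz, shrinks as z grows
        z -= lr * grad
    return losses

def fitted_exponent(losses, start, end):
    # Negative slope of log(loss) versus log(step) between two late steps.
    return -(math.log(losses[end]) - math.log(losses[start])) / (
        math.log(end) - math.log(start)
    )

losses = run_gd()
exponent = fitted_exponent(losses, 1_000, 99_999)
print(f"estimated power-law exponent: {exponent:.3f}")
```

The gradient is roughly `exp(-z)` for large `z`, so `z` grows only logarithmically in the step count and the loss falls off polynomially rather than exponentially. Reproducing the $1/3$ exponent would require the additional structure of the paper's models, which this sketch does not attempt.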

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

EMoE: Training-Free Expert Disagreement for Uncertainty-Aware Text-to-Image Diffusion

DOT-MoE: Differentiable Optimal Transport for MoEfication

PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning

Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling

Cross-Environment Neural Reranking for Sample-Efficient Action Selection in Text-Based Agents

Agent Guide: A Simple Agent Behavioral Watermarking Framework

Confidence-Adaptive SwiGLU for Mixture-of-Experts

Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

Optimizing Diversity and Quality through Base-Aligned Model Collaboration

AdaptiveK: Complexity-Driven Sparse Autoencoders for Interpretable Language Model Representations

Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference

$M^3$ Scaling Law: Optimizing Multi-Epoch, Multi-Lingual, and Multi-Stage Training for Low-Resource Language Models

When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

VERA: Variational Inference Framework for Jailbreaking Large Language Models

A Theoretical Framework for Statistical Evaluability of Generative Models

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

How to Correctly Report LLM-as-a-Judge Evaluations

GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection

Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

Learning Fine-grained Parameter Sharing via Sparse Tensor Decomposition

Inverse Depth Scaling From Most Layers Being Similar

Everywhere Learning: Artificial Intelligence with Pointwise Constraints

Business Utility of Large Language Models as Exploratory Data Analysis Agents

FLARE: Diffusion for Hybrid Language Model

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

Measuring and Mitigating Bias in Code Generated by Large Language Models

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

FLaG: Fine-Grained Latent Grouping for Hallucination Detection

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

MESA: Improving MoE Safety Alignment via Decentralized Expertise

Constitutional Black-Box Monitoring for Scheming in LLM Agents

Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

Training Prompt Matters: State-Adaptive Optimization for Robust Fine-Tuning

Fair Finetuning Mitigates Distribution Inference Attacks

Score $\times$ Decoder: A Unified View of Unsupervised Inference-Time Scaling for Hallucination Mitigation

ACON: Optimizing Context Compression for Long-horizon LLM Agents

When Data Is Scarce: Scaling Sparse Language Models with Repeated Training

TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

Language Model Networks: Supervision-Efficient Learning through Dense Communication

iML: Executable, Problem-Grounded, and Broadly Exploratory Code-Driven AutoML

The Social Cost of Intelligence: Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems

VESTA: Visual Exploration with Statistical Tool Agents

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Retrieval-Augmented Linguistic Calibration

CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards

PropLLM: Propagation-Aware Scene Reconstruction for Network Fault Diagnosis