MLSys 2026
Papers
Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks
Attribution-based Sparse Activation in Large Language Models
FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
ExecuTorch - A Unified PyTorch Solution to Run ML Models On-Device
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
When Machine Learning Isn’t Sure: Building Resilient ML-Based Computer Systems by Embracing Uncertainty
TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
ContextPilot: Fast Long-Context Inference via Context Reuse
CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
Machine Learning Fleet Efficiency: Improving TPU Systems at Scale with ML Productivity Goodput
HipKittens: Fast and Furious AMD Kernels
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
Efficient, VRAM-Constrained xLM Inference on Clients
Breaking the Ice: Analyzing Cold Start Latency in vLLM
ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler
Blueprint, Bootstrap, and Bridge: A Security Look at NVIDIA GPU Confidential Computing
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
LEANN: A Low-Storage Overhead Vector Index
VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation
db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
When Enough is Enough: Rank-Aware Early Termination for Vector Search
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
SwiftGS: Algorithm and System Co-Optimization for Fast 3D Gaussian Splatting on GPUs
Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems
Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
Hippocampus: An Efficient and Scalable Memory Module for Agentic AI
FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents
PLA-Serve: A Prefill-Length-Aware LLM Serving System
FlashAgents: Accelerating Multi-Agent LLM Systems via Streaming Prefill Overlap
TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
Ontology-Guided Long-Term Agent Memory for Conversational RAG
G-HEMP: Fast Multi-GPU Private Inference for Large-Scale GCNs with Homomorphic Encryption
ProToken: Token-Level Attribution for Federated Large Language Models
CSLE: A Reinforcement Learning Platform for Autonomous Security Management
FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models
Scaling Up Large Language Models Serving Systems for Semantic Job Search
The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents
StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation
PLayer-FL: A Principled Approach to Personalized Layer-wise Cross-Silo Federated Learning
TiDAR: Think in Diffusion, Talk in Autoregression
Zero redundancy distributed learning with differential privacy
BEAM: Joint Resource–Power Optimization for Energy-Efficient LLM Inference under SLO Constraints
Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT
Efficient Long-Context Language Model Training by Core Attention Disaggregation
ProTrain: Efficient LLM Training via Automatic Memory Management
BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization
Locality-Aware Beam Scheduling for Efficient Test-Time Compute with a Consumer-grade GPU
SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding
TriInfer: Hybrid EPD Disaggregation for Efficient Multimodal Large Language Model Inference
SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
MAC-Attention: A Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Unified LLM Model for Power, Performance, and Area Prediction from Hardware Code
Using Span Queries to Optimize Cache and Attention Locality
Automated Algorithm Design for Auto-Tuning Optimizers
Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost
RaidServe: High-performance Resilient Serving
ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments
fabric-lib: RDMA Point-to-Point Communication for LLM Systems
Practical Adversarial Multi-Armed Bandits with Sublinear Runtime
Pylo: Towards Accessible Learned Optimizers in PyTorch
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
GriNNder: Breaking the Memory Capacity Wall in Full-Graph GNN Training with Storage Offloading
Shannonic: Efficient Entropy-Optimal Compression for ML Workloads
NEST: Network- and Memory-Aware Device Placement for Distributed Deep Learning
HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware
Once-for-All Channel Mixers (HyperTinyPW): Generative Compression for TinyML
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
Search Your Block Floating Point Scales!
PRISM: Parametrically Refactor Inference for Speculative Decoding Draft Models
PROMPTS: PeRformance Optimization via Multi-Agent Planning for LLM Training and Serving
CDLM: Consistency Diffusion Language Models for Faster Sampling
Speculative Decoding: Performance or Illusion?
HELIOS: Adaptive Model and Early-Exit Selection for Efficient LLM Inference Serving
IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching
REPARO: Loss-Resilient Generative Codec for Video Conferencing
EarthSight: A Distributed Framework for Low-Latency Satellite Intelligence
Massive-Scale Out-Of-Core UMAP on the GPU
Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference
DriftBench: Measuring and Predicting Infrastructure Drift in LLM Serving Systems
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
FaaScale: Unlocking Fast LLM Scaling for Serverless Inference
DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling
PARROT: Persuasion and Agreement Robustness Rating of Output Truth — A Sycophancy Robustness Benchmark for LLMs
CATWILD: Compiler Autotuning for TPU workloads in the Wild
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
FlexTrain: Scalable Hybrid-Parallel Training with Elastic Resource Utilization and Consistent Accuracy
Virtual Machine NUMA Placement at Scale: Learning the Norm, Shielding the Tail
Wave: A Symbolic Python DSL And Compiler for High-Performance Machine Learning
Optimizing Deployment Configurations for LLM Inference
AIRS: Scaling Live Inference in Resource Constrained Environments
DreamDDP: Accelerating Low-Bandwidth Geo-Distributed LLM Training with Layer-wise Partial Synchronization
ZK-APEX: Zero-Knowledge Approximate Personalized Unlearning with Executable Proofs
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
Rethinking DVFS for Mobile LLMs: Unified Energy-Aware Scheduling with CORE
MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing
Beyond the Buzz: A Pragmatic Take on Inference Disaggregation
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
XProf: An Open, Scalable, and Extensible Profiling System for the Modern ML Stack
Hawkeye: Reproducing GPU-Level Non-Determinism
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
GUARD: Scalable Straggler Detection and Node Health Management for Large-Scale Training
FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
DisAgg: Distributed Aggregators for Efficient Secure Aggregation
CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training
Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token
OPKV: A High-Throughput Plugin-Driven Framework for Recallable Sparsity in Paged KV Cache Systems
Dataflow Is All You Need
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
SONAR: Benchmarking Topology and Collaboration in Decentralized Learning
AXLearn: Modular, Hardware-Agnostic Large Model Training
veScale-FSDP: Flexible and High-Performance FSDP at Scale
Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
Demystifying the Mixture of Experts Serving Tax
Cost-aware Duration Prediction for Software Upgrades in Datacenters
Agentic Operator Generation for ML ASICs
SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving
Sparing Strategies to Minimize Reliability Impact On Large Training Jobs
ADR: An Agentic Detection System for Enterprise Agentic AI Security
ApproxMLIR: Accuracy-Aware Compiler for Compound ML System
SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment