Wire Riot

paper

arXiv cs.LG

November 18th, 2025 at 5:00 AM

3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency

arXiv:2510.18905v3. Abstract: AI inference scaling is often tuned through 1D heuristics (a fixed reasoning pass) or 2D bivariate trade-offs (e.g., accuracy vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling as an MOO problem shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference scaling $k$. Results show that knee-point optimization based on Pareto frontiers achieves the best balance, while accuracy-maximization remains favorable when accuracy is prioritized. Our results further show that smaller models, when combined with optimal inference scaling, can match or exceed the performance of larger models at a fraction of the cost. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational conditions.
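
To make the idea of knee-point selection over accuracy, cost, and latency concrete, here is a minimal Python sketch. It is not the paper's implementation. Everything in it is an assumption made for illustration: the toy accuracy model 1 - (1 - p)^k for k independent passes with per-pass success probability p, cost and latency that grow linearly in k, the hypothetical function names (candidates, pareto_front, knee_point, select_k), and the knee heuristic itself (the Pareto point closest to the ideal corner after min-max normalization), which is one common definition and may differ from the one the authors use.

```python
from math import sqrt


def candidates(p, cost_per_pass, latency_per_pass, k_max):
    """Return (k, accuracy, cost, latency) rows for k = 1..k_max.

    Toy assumptions (not from the paper): accuracy is 1 - (1 - p)^k,
    and cost and latency both grow linearly with k.
    """
    rows = []
    for k in range(1, k_max + 1):
        accuracy = 1.0 - (1.0 - p) ** k
        rows.append((k, accuracy, k * cost_per_pass, k * latency_per_pass))
    return rows


def dominates(a, b):
    """True if row a is no worse than b on every objective and better on one."""
    no_worse = a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]
    better = a[1] > b[1] or a[2] < b[2] or a[3] < b[3]
    return no_worse and better


def pareto_front(rows):
    """Keep only rows that no other row dominates."""
    return [r for r in rows if not any(dominates(o, r) for o in rows)]


def knee_point(front):
    """Pick the front row closest to the ideal corner (max accuracy,
    min cost, min latency) after min-max normalizing each objective."""
    def norm(values):
        lo, hi = min(values), max(values)
        span = hi - lo
        return [(v - lo) / span if span > 0 else 0.0 for v in values]

    acc = norm([r[1] for r in front])
    cost = norm([r[2] for r in front])
    lat = norm([r[3] for r in front])
    dist = [sqrt((1.0 - a) ** 2 + c ** 2 + l ** 2)
            for a, c, l in zip(acc, cost, lat)]
    return front[dist.index(min(dist))]


def select_k(p, cost_per_pass, latency_per_pass, k_max, max_cost, max_latency):
    """Apply the cost and latency budgets, then return the knee row or None."""
    feasible = [r for r in candidates(p, cost_per_pass, latency_per_pass, k_max)
                if r[2] <= max_cost and r[3] <= max_latency]
    if not feasible:
        return None
    return knee_point(pareto_front(feasible))


if __name__ == "__main__":
    # Hypothetical numbers: 30% per-pass accuracy, unit cost, 0.5 s per pass.
    print(select_k(0.3, 1.0, 0.5, k_max=16, max_cost=16.0, max_latency=8.0))
```

Under these toy assumptions every k is Pareto-optimal, since accuracy and cost both rise with k, so the knee heuristic does the real work of choosing. Tightening max_cost or max_latency shrinks the feasible set and can move the selected k, which is the environment-adaptive behaviour the abstract describes.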

#ai

#llm

Open source

Canonical link: https://arxiv.org/abs/2510.18905