# CAI_AMP_Inference_Scaling_Optimization
**Repository Path**: mirrors_cloudera/CAI_AMP_Inference_Scaling_Optimization
## Basic Information
- **Project Name**: CAI_AMP_Inference_Scaling_Optimization
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-15
- **Last Updated**: 2025-12-13
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Inference Scaling - Multiobjective Optimization
[arXiv](https://arxiv.org/abs/2510.18905)
[License](LICENSE)
[GitHub](https://github.com/yourname/inference-scaling)
# Efficient AI Inference
Efficient AI inference scaling is essential for practical deployment. Instead of relying on static scaling heuristics or a simple bivariate trade-off between performance and compute, we should account for multiple factors such as cost, latency, and accuracy. This project models inference scaling as a multi-objective optimization (MOO) problem and simulates it in both 3D and 2D spaces. Below is a sample output:
A feasible space shaped by 3D constraints captures values that a 2D space fails to account for.
## Project Structure
This repository contains a Jupyter notebook that implements Monte Carlo simulations for optimizing inference scaling in AI models. It explores trade-offs between cost, time, and accuracy across various pre-configured models.
```
inference-optimization-moo/
├── 01_Installer/
│   ├── install.sh
│   └── requirements.txt
├── 02_MultiobjectiveOptimization/
│   ├── __init__.py
│   ├── Inference_scaling_MOO.ipynb
│   ├── inference_scaling.py
│   └── __pycache__/
├── .project-metadata.yaml
├── pyproject.toml
└── README.md
```
## Features
- **Model Configurations**: Pre-defined settings for multiple AI models (e.g., GPT-5 variants, Nvidia Nemotron, Qwen3 series), including cost per token, latency, and accuracy distributions.
- **Monte Carlo Simulations**: Statistical estimation of performance metrics with configurable trial counts and parallelization factors.
- **Optimization Methods**:
  1. Maximal accuracy selection
  2. Maximal cube selection
  3. Pareto frontier with utopia-closest point
  4. Pareto frontier with knee-point detection
- **Interactive Visualizations**: 3D feasible space plots
- **Constraints**: Feasibility checks based on total cost and time budgets, and minimal accuracy requirements.
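To illustrate the Pareto-frontier method listed above, here is a minimal standalone sketch that extracts the non-dominated points from a set of (cost, time, 1 − accuracy) candidates, with all objectives minimized. The function name and sample values are illustrative only and are not taken from the repository's code:

```python
import numpy as np

def pareto_frontier(points):
    """Return indices of non-dominated points (all objectives minimized)."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # p is dominated if some other point is <= in every objective
        # and strictly < in at least one (a point never dominates itself).
        dominated = np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))
        if not dominated:
            keep.append(i)
    return keep

# Objectives per candidate: (cost in $, time in s, 1 - accuracy) -- all minimized.
candidates = [(1.0, 2.0, 0.30), (2.0, 1.0, 0.20), (3.0, 3.0, 0.25), (1.5, 1.5, 0.15)]
print(pareto_frontier(candidates))  # → [0, 1, 3]
```

Here candidate 2 is dominated by candidate 3 (worse on all three objectives), so only the other three points lie on the frontier.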
## Requirements
- Python 3.9+
- Jupyter Notebook
- Libraries: `numpy`, `matplotlib` (including the bundled `mpl_toolkits.mplot3d`), `ipywidgets`, `pyyaml`
## Repository Organization
- `01_Installer/`: Contains the installation script (`install.sh`) and dependency list (`requirements.txt`); the packaging configuration (`pyproject.toml`) lives at the project root.
- `02_MultiobjectiveOptimization/`: Core Python module (`inference_scaling.py`) and the interactive Jupyter notebook (`Inference_scaling_MOO.ipynb`).
## Installation
### Option 1: Manual Installation with Virtual Environment
1. **Create and activate a virtual environment:**
```bash
# Create virtual environment
python -m venv inference-optimization-moo_env
# Activate virtual environment (Windows)
inference-optimization-moo_env\Scripts\activate
# Activate virtual environment (macOS/Linux)
source inference-optimization-moo_env/bin/activate
```
2. **Install dependencies (includes Jupyter):**
```bash
pip install -r 01_Installer/requirements.txt
```
3. **Install Jupyter kernel for the virtual environment:**
```bash
python -m ipykernel install --user --name=inference-optimization-moo_env --display-name="Inference Optimization"
```
### Option 2: Shell
Run the provided shell script to set up dependencies automatically:
```bash
./01_Installer/install.sh
```
## Usage
### Launch from project root
```bash
# From the project root directory
jupyter notebook 02_MultiobjectiveOptimization/Inference_scaling_MOO.ipynb
```
### Using VS Code
1. Open the file `02_MultiobjectiveOptimization/Inference_scaling_MOO.ipynb` in VS Code
2. Select the "Inference Optimization" kernel (if using virtual environment)
3. Run cells interactively
### Running the Notebook
1. Run the cells sequentially to load model configurations and functions.
2. Use the interactive widget at the end to select a model, adjust budget constraints (max-cost, max-time, min-accuracy), and visualize results.
3. Key parameters:
- `selected_model`: Choose from available sample models (e.g., 'gpt5', 'nvidia-nemotron-ultra-253b').
- `C_max_total`: Maximum total cost in dollars.
- `T_max_total`: Maximum total time in seconds.
- `acc_min`: Minimum acceptable accuracy.
- `k_max`: Maximum number of inferences to test.
- `mc_trials`: Number of Monte Carlo trials for statistical robustness.
- `parallel_factor`: Degree of parallelism (P).
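The interplay between `k_max` and `mc_trials` can be sketched with a minimal Monte Carlo simulation. This is not the notebook's actual implementation: the per-call success probability `p_correct` and the best-of-k success model (a query succeeds if at least one of k independent calls is correct) are simplifying assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_accuracy_vs_k(p_correct, k_max, mc_trials):
    """Estimate best-of-k accuracy by Monte Carlo.

    For each k in 1..k_max, draw mc_trials queries of k independent
    inference calls each; a query counts as correct if any call succeeds.
    """
    accs = []
    for k in range(1, k_max + 1):
        hits = rng.random((mc_trials, k)) < p_correct
        accs.append(hits.any(axis=1).mean())
    return accs

accs = mc_accuracy_vs_k(p_correct=0.6, k_max=5, mc_trials=20_000)
# Estimates approach the analytic curve 1 - (1 - 0.6)**k as mc_trials grows.
```

Under this model, accuracy rises with k while total cost grows linearly in k, which is exactly the tension the 3D feasible space is meant to expose.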
**Note:** If you encounter import errors, launch Jupyter from the `02_MultiobjectiveOptimization` directory, where `inference_scaling.py` is located.
## Output
- **3D Feasible Cube Plot**: Visualizes the trade-off in 3D space with constraint planes, MC trajectories, and optimal points (which point is optimal depends on the prioritized objective).
- **Accuracy vs. k Plot**: Shows how accuracy improves with more inferences.
- **Total Cost vs. k Plot**: Displays cost scaling with k.
- **Text Summary**: Prints optimal k values and metrics for each method.
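The utopia-closest selection reported in the text summary can be sketched as follows. This is a hypothetical standalone implementation, assuming each objective is min-max normalized to [0, 1] before measuring Euclidean distance to the utopia point at the origin:

```python
import numpy as np

def utopia_closest(points):
    """Pick the candidate closest (Euclidean) to the utopia point,
    after min-max normalizing each objective column to [0, 1]."""
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    norm = (pts - lo) / span
    return int(np.argmin(np.linalg.norm(norm, axis=1)))

# Objectives per Pareto point: (cost, time, 1 - accuracy) -- all minimized.
pareto = [(1.0, 2.0, 0.30), (2.0, 1.0, 0.20), (1.5, 1.5, 0.15)]
print(utopia_closest(pareto))  # → 2
```

Normalizing first matters: without it, whichever objective has the largest raw scale (e.g. cost in dollars vs. accuracy in [0, 1]) would dominate the distance.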
## Contributors
Thanks to **Nashua Springberry (Cloudera)** and **Michael Schuler (Cloudera)** for constructive comments on the design and implementation of the simulation.
## Citation
```bibtex
@misc{jung2025optimizeinference,
  title={3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency},
  author={Minseok Jung and Abhas Ricky and Muhammad Rameez Chatni},
  year={2025},
  eprint={2510.18905},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.18905},
}
```
## License
This project is licensed under the Apache License 2.0 — see the [LICENSE](LICENSE) file for details.