
👉 This post was initially written in 2022 and has been updated to reflect the significant changes in the ML inference landscape since then (PyTorch 2.x, the LLM inference explosion, vLLM, TensorRT-LLM, and more). Some tool versions and links may still refer to older releases.

While you can argue about the legitimacy of the acronym AI, you cannot ignore that machine learning models are now embedded in production systems at massive scale — from real-time bidding and fraud detection to recommendation engines and content generation. At Adobe, I’ve seen firsthand how the demand for fast, reliable, and cost-efficient inference has grown as models get larger and more complex.

This post surveys the main tools and strategies for scaling ML inference. I’ll cover serving platforms (NVIDIA Triton, TorchServe, ONNX Runtime), model optimization (PyTorch compilation, OpenAI Triton), and GPU orchestration on Kubernetes — with opinions on what works and what’s been superseded.

Understanding Machine Learning Inference

What is Inference?

After a machine learning model has been trained on a dataset and learned the underlying patterns, it can be used to generate predictions on new inputs or unseen data. This process — called inference — is how ML models get used in the real world: serving predictions, classifying inputs, generating text, or scoring risk in real time.

Inference runs on different hardware profiles than training. Training is computationally intensive and requires powerful GPUs or clusters. Inference can be executed on various devices, from cloud-based servers to edge devices, but at scale it still demands serious GPU resources — especially for large language models.

Challenges in Scaling Inference

Scaling inference gets hard in four dimensions:

Serving Models for Inference

NVIDIA Triton Inference Server

Overview and Features

NVIDIA Triton Inference Server (formerly TensorRT Inference Server) is an open-source serving platform for ML models across multiple frameworks. It’s one of the more mature options for production inference and handles a lot of the operational concerns out of the box.

Notable features include:

Implementing Triton for Scalable Inference

To deploy machine learning models using NVIDIA Triton Inference Server, follow these steps:

  1. Install Triton: First, you need to install Triton on your target hardware. You can use the pre-built Docker container provided by NVIDIA or build Triton from the source code. For detailed installation instructions, refer to the official documentation.
  2. Prepare your models: Convert your trained machine-learning models into a format supported by Triton. This step may involve exporting models from TensorFlow, PyTorch, or other frameworks to ONNX or TensorRT formats. Additionally, organize your models in a directory structure that follows Triton’s model repository layout.
  3. Configure Triton: Create a configuration file for each model you want to deploy, specifying parameters like input and output tensor names, dimensions, data types, and optimization settings. For more information on creating configuration files, consult the Triton documentation.
  4. Launch Triton: Start the Triton server with your prepared model repository, specifying the path to your models and any additional settings like the number of GPUs, HTTP/GRPC ports, and logging preferences.
  5. Send inference requests: Once Triton is running, you can send inference requests to the server over HTTP or gRPC. Client libraries are available for several programming languages, making it easy to integrate Triton with your existing applications.
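Steps 2 and 3 can be sketched in a few lines of Python. The model name, tensor names, and dimensions below are placeholders for illustration — substitute the ones from your exported model:

```python
from pathlib import Path

# Triton expects: <repo>/<model-name>/config.pbtxt plus numbered
# version directories containing the model file itself.
repo = Path("model_repository")
version_dir = repo / "my_classifier" / "1"
version_dir.mkdir(parents=True, exist_ok=True)

# Minimal config for a hypothetical ONNX image classifier.
config = """\
name: "my_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
"""
(repo / "my_classifier" / "config.pbtxt").write_text(config)

# Then copy your exported model into the version directory, e.g.:
# shutil.copy("model.onnx", version_dir / "model.onnx")
```

With the repository in place, step 4 is typically a single `docker run` against NVIDIA’s container, e.g. `docker run --gpus all -v $PWD/model_repository:/models nvcr.io/nvidia/tritonserver:<release>-py3 tritonserver --model-repository=/models` (pick a release tag from NGC).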

Triton remains one of the strongest options for multi-model, multi-framework serving — especially if you’re already in the NVIDIA ecosystem.

TorchServe

Introduction to TorchServe

TorchServe is an open-source tool for serving PyTorch models in production, developed jointly by AWS and Facebook. If your stack is PyTorch-centric, TorchServe is the most direct path from trained model to production endpoint.

Key features:

Deploying Models with TorchServe

To deploy your PyTorch models using TorchServe, at a high level, you will have to follow these steps:

  1. Install TorchServe: Begin by installing TorchServe and its dependencies. You can do this using pip or by building TorchServe from the source. For detailed installation instructions, refer to the official documentation.
  2. Export your model: Export your trained PyTorch model as a TorchScript file using the torch.jit.trace or torch.jit.script methods. TorchScript is a statically-typed subset of Python that can be optimized and executed by the Torch JIT (Just-In-Time) compiler, improving inference performance.
  3. Create a model archive: Package your TorchScript model and any necessary metadata and configuration files into a model archive file. This file is a compressed archive containing all the required components for TorchServe to serve your model.
  4. Start TorchServe: Launch TorchServe with your model archive, specifying the desired settings for REST APIs, logging, and other configurable options.
  5. Send inference requests: Once TorchServe is running, you can send inference requests to the server using its REST APIs. Client libraries are available for several programming languages, making it easy to integrate TorchServe with your existing applications.
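Step 2 looks roughly like this — a minimal sketch with a toy stand-in model; swap in your trained network and a representative example input:

```python
import torch
import torch.nn as nn

# Toy model standing in for a trained network.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

# torch.jit.trace records the operations executed for one example input;
# use torch.jit.script instead if the model has data-dependent control flow.
example = torch.randn(1, 16)
scripted = torch.jit.trace(model, example)
scripted.save("model.pt")
```

For step 3, the `torch-model-archiver` CLI that ships alongside TorchServe packages the saved file, roughly `torch-model-archiver --model-name my_model --version 1.0 --serialized-file model.pt --handler <handler>`, where the handler is one of TorchServe’s built-ins (e.g. `image_classifier`) or your own handler module.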

TorchServe is a solid choice if you’re all-in on PyTorch and want a lightweight serving layer without the overhead of a multi-framework platform.

ONNX Inference

Introduction to ONNX

The Open Neural Network Exchange (ONNX) is an open standard for representing machine learning models, developed by an open-source community with open governance (founding members include Microsoft, Facebook, and IBM). ONNX provides a common format for model interchange between frameworks like TensorFlow, PyTorch, and Caffe2. This matters because it decouples your training framework from your deployment target.

ONNX Runtime is the cross-platform inference engine for ONNX models. It runs on CPUs, GPUs, and edge devices, and it applies graph-level optimizations that can meaningfully reduce latency.

Benefits of ONNX Inference

Deploying ONNX Models for Inference

To leverage ONNX for scalable machine learning inference, at a high level, you will follow these steps:

  1. Convert your model to ONNX format: Export your trained machine learning model from your preferred deep learning framework (e.g., TensorFlow or PyTorch) to the ONNX format using the appropriate conversion tools or libraries. Refer to the ONNX tutorials.
  2. Install ONNX Runtime: Install the ONNX Runtime inference engine on your target platform, ensuring you have the necessary dependencies and hardware support.
  3. Load and run your ONNX model: Use the ONNX Runtime APIs to load your ONNX model, prepare input data, and execute inference requests. The APIs are available for various programming languages like Python, C++, and C#.
  4. Integrate with serving solutions: You can also deploy your ONNX models using popular serving solutions, such as NVIDIA Triton Inference Server or TorchServe, which offer native support for ONNX models.

ONNX is particularly valuable when you need to deploy the same model across different hardware targets or when you want to decouple your training stack from your serving stack.

Model Optimization for Inference

PyTorch Compilation (torch.compile / TorchDynamo)

👉 Update (2024): Since I originally wrote this section about TorchDynamo as a standalone tool, PyTorch 2.0 was released and fundamentally changed the compilation story. TorchDynamo is now integrated into PyTorch as the backend for torch.compile(), which is the recommended way to optimize models. The legacy torchdynamo repository is archived.

How It Works

torch.compile() uses TorchDynamo under the hood to capture the computation graph from your PyTorch model and optimize it through a backend compiler (TorchInductor by default). The result: faster execution with minimal code changes.

The main optimizations include:

Using torch.compile

The simplest path — and the one I’d recommend starting with:

  1. Install PyTorch 2.x: torch.compile() is available in PyTorch 2.0 and later.
  2. Compile your model: Wrap your model with torch.compile(). This is often a single line change: model = torch.compile(model).
  3. Deploy: Use your preferred serving solution (TorchServe, Triton, etc.) with the compiled model.

For inference specifically, you can specify the mode: torch.compile(model, mode="reduce-overhead") minimizes framework overhead at the cost of a longer compilation step. In my experience, the compilation step itself can take a while on the first run, but the inference speedup is significant — especially for models with complex control flow that the old TorchScript approach struggled with.

Facebook AITemplate (AIT) — Archived

⚠️ Update (2024): Facebook AITemplate has been archived and is no longer maintained. The project was promising but has been superseded by improvements in PyTorch’s native compilation stack (torch.compile) and NVIDIA’s TensorRT-LLM. I’m keeping this section for historical reference, but I would not start a new project with AITemplate.

Facebook AITemplate was an open-source framework that rendered neural networks into high-performance CUDA/HIP C++ code. It achieved near-roofline performance on NVIDIA TensorCore and AMD MatrixCore for fp16 calculations and supported both GPU vendors from a single codebase.

Its key strengths were:

The project’s abandonment is a reminder that betting on a single optimization framework carries risk. The PyTorch compilation stack and NVIDIA’s own TensorRT path have proven more durable.

OpenAI Triton

What is OpenAI Triton?

OpenAI Triton (not to be confused with NVIDIA Triton Inference Server) is an open-source programming language and compiler for writing high-performance GPU kernels. Triton provides a Python-embedded DSL that lets you write GPU code without dropping down to raw CUDA — the compiler handles the low-level optimizations.

This is relevant to inference scaling because it’s what powers torch.compile’s TorchInductor backend. When you call torch.compile(), the generated kernels are often Triton kernels under the hood.

Key benefits:

Integrating OpenAI Triton into Your Inference Pipeline

To leverage OpenAI Triton to accelerate your machine learning models, at a high level you will follow these steps:

  1. Install OpenAI Triton: Begin by installing the Triton compiler and its dependencies. Detailed installation instructions can be found in the official documentation.
  2. Implement custom GPU kernels: Write custom GPU kernels using OpenAI Triton’s Python-like syntax, focusing on the specific operations and optimizations most relevant to your machine learning workloads.
  3. Compile and test your kernels: Compile your custom Triton kernels and test them for correctness and performance. Benchmark your kernels against existing implementations to ensure your optimizations are effective.
  4. Integrate Triton kernels into your models: Modify your machine learning models to use your custom Triton kernels instead of the default implementations provided by your deep learning framework. This process may involve updating your model code or creating custom PyTorch or TensorFlow layers that utilize your Triton kernels.
  5. Deploy your optimized models: With your custom Triton kernels integrated, deploy your optimized machine learning models using your preferred serving solution, such as NVIDIA Triton Inference Server or TorchServe.

Writing custom Triton kernels is most relevant when you have specific bottleneck operations that the default PyTorch kernels don’t handle well — attention mechanisms, custom activation functions, or domain-specific operations where you need every last bit of GPU performance.
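For a flavor of the language, here is the canonical vector-add kernel from the Triton tutorials — each program instance handles one block of elements, and a small Python wrapper launches a 1-D grid. Triton ships with recent PyTorch builds on Linux, but a CUDA GPU is required to actually execute the kernel:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one BLOCK_SIZE-wide slice.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

if torch.cuda.is_available():
    a = torch.randn(4096, device="cuda")
    b = torch.randn(4096, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```

A real-world kernel (fused attention, a custom activation) follows the same shape: a `@triton.jit` function operating on pointers and blocks, plus a thin launcher you call from your model code.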

LLM Inference Techniques (2024–2026)

The explosion of large language model deployment since ChatGPT has driven a wave of inference-specific optimizations that didn’t exist when I originally wrote this post. These techniques are now fundamental to anyone serving transformer-based models at scale.

FlashAttention

FlashAttention is arguably the single most impactful inference optimization of the past few years. Standard attention computes the full N×N attention matrix, which is memory-bound and slow. FlashAttention rewrites the attention computation to be IO-aware — it tiles the computation to minimize reads/writes to GPU high-bandwidth memory (HBM) and keeps intermediate results in on-chip SRAM.

The practical impact: FlashAttention-3 reaches 75-85% utilization on H100 GPUs, up from 35% with FlashAttention-2. With FP8 support, it achieves 1.3 PFLOPS. Every major inference engine (vLLM, SGLang, TensorRT-LLM) uses FlashAttention under the hood. If you’re writing custom attention kernels without it, you’re leaving significant performance on the table.
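In application code you rarely call FlashAttention directly. In PyTorch, `torch.nn.functional.scaled_dot_product_attention` dispatches to a FlashAttention kernel when the hardware and dtypes allow it, and falls back to a math implementation otherwise:

```python
import torch
import torch.nn.functional as F

# Shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 8, 1024, 64)
k = torch.randn(2, 8, 1024, 64)
v = torch.randn(2, 8, 1024, 64)

# One call; PyTorch picks the fastest available backend
# (FlashAttention on supported GPUs, a math fallback on CPU).
# Note: the full N x N attention matrix is never materialized.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```

This is also why upgrading PyTorch can be a "free" inference optimization: the dispatch picks up newer FlashAttention kernels without any change to your model code.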

Continuous Batching

Traditional static batching waits for a full batch of requests before processing them together. This adds latency (you wait for the batch to fill) and wastes compute (shorter sequences pad to the longest sequence length).

Continuous batching (also called iteration-level batching) processes requests at the token level. As soon as one request in a batch finishes generating, a new request takes its slot — no waiting, no padding. This is how vLLM, SGLang, and TensorRT-LLM handle batching by default. It’s one of the main reasons these tools dramatically outperform naive serving solutions.
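The scheduling idea can be illustrated with a toy, framework-free simulation — the request IDs and token counts below are made up, and one loop iteration stands in for one decode step:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy iteration-level scheduler. `requests` maps request id ->
    number of tokens to generate. Returns ids in completion order."""
    waiting = deque(sorted(requests))
    active = {}    # request id -> tokens still to generate
    finished = []
    while waiting or active:
        # Fill freed slots immediately -- no waiting for a full batch.
        while waiting and len(active) < max_batch:
            rid = waiting.popleft()
            active[rid] = requests[rid]
        # One decode step: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]       # slot frees up *mid-batch* ...
                finished.append(rid)  # ... for the next waiting request
    return finished

# Short requests finish and release their slots without waiting
# for the long request "b" to drain the whole batch.
print(continuous_batching({"a": 2, "b": 5, "c": 1, "d": 3, "e": 2}))
# ['c', 'a', 'd', 'e', 'b']
```

With static batching, request "e" could not start until the entire first batch (including the 5-token "b") had finished; here it slips into the slot "c" vacates after a single step.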

KV-Cache Management

During autoregressive generation, the model computes key-value pairs for each token. Recomputing them at every step would be prohibitively expensive, so they’re cached — the KV-cache. For large models with long contexts, this cache can consume tens of gigabytes of GPU memory per request.

Two approaches dominate:

Speculative Decoding

LLM inference is bottlenecked by the sequential nature of autoregressive generation — one token at a time. Speculative decoding works around this by using a small, fast “draft” model to generate multiple candidate tokens, then verifying them in a single forward pass through the full model. Since verification is parallelizable (unlike generation), this yields 2-3x speedups in practice.

Both vLLM and SGLang support speculative decoding in production. Meta’s EAGLE-based approach for Llama models achieves ~4ms per token on 8 H100s. The technique is now mature enough that I’d consider it standard practice for any latency-sensitive LLM deployment.
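A toy greedy sketch of the draft-then-verify loop can make the mechanics concrete. Real systems verify with rejection sampling over probability distributions and run the verification as a single batched forward pass; the two lambda “models” here are made up for illustration:

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Toy greedy speculative decoding. `target` and `draft` map a
    context tuple to the next token. The draft proposes k tokens; the
    target keeps the longest prefix it agrees with, plus one token of
    its own (the "free" token from the verification pass)."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft phase: k cheap sequential steps with the small model.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(tuple(ctx))
            proposal.append(t)
            ctx.append(t)
        # Verify phase: in a real system this is ONE parallel forward
        # pass of the target model over all proposed positions.
        for t in proposal:
            if target(tuple(out)) == t:
                out.append(t)  # accepted draft token
            else:
                break          # first disagreement ends acceptance
        out.append(target(tuple(out)))  # target's own next token
    return out[len(prompt):len(prompt) + n_tokens]

# Hypothetical models: the target emits (last token + 1) mod 5; the
# draft agrees except when the context length is a multiple of 3.
target = lambda ctx: (ctx[-1] + 1) % 5
draft = lambda ctx: (ctx[-1] + 1) % 5 if len(ctx) % 3 else 0
print(speculative_decode(target, draft, (0,), 6))  # [1, 2, 3, 4, 0, 1]
```

The output is identical to what the target model alone would produce — the draft model only changes how many expensive target steps are needed, which is why the technique is lossless.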

Disaggregated Prefill and Decode

LLM inference has two phases with very different compute profiles: prefill (processing the input prompt — compute-bound, parallelizable) and decode (generating output tokens — memory-bound, sequential). Running both phases on the same hardware forces a compromise.

Disaggregated serving splits these phases across different hardware or processes. Prefill runs on compute-optimized instances, decode runs on memory-optimized instances. Research results show up to 74% reduction in P99 latency. This is still an emerging pattern in production, but vLLM and SGLang both support it, and I expect it to become standard for large-scale deployments.

Multi-LoRA Serving

If you’re serving multiple fine-tuned model variants (common in multi-tenant platforms), you don’t need a separate GPU for each variant. Multi-LoRA serving loads the base model once and dynamically swaps lightweight LoRA adapters per request. Both vLLM and SGLang support batching requests across different LoRA adapters on the same base model — a significant cost optimization when you have dozens or hundreds of fine-tuned variants.

Specialized GPU Orchestration and Scheduling for Optimizing GPU Usage

The Importance of GPU Orchestration and Scheduling

Once you’ve picked your serving platform and optimized your models, the next challenge is managing GPU resources at scale. This is where most teams underestimate the complexity. GPU hardware is expensive, and poor scheduling means you’re either wasting money on idle GPUs or starving workloads that need them.

The core challenges:

Key Solutions for GPU Orchestration and Scheduling

The main options:

  1. Kubernetes: The dominant choice for most teams. Extended with NVIDIA GPU Operator, Kubeflow, and tools like Karpenter for node provisioning. At Adobe, Kubernetes is our primary platform for GPU-accelerated ML workloads.
  2. SLURM: SLURM remains the standard for HPC and research clusters. It provides fine-grained GPU scheduling based on memory, power, and device type. If you’re in an academic or research environment, this is likely what you’re using.
  3. Apache Mesos: Moved to the Apache Attic and effectively end-of-life. If you’re still running Mesos, plan your migration.

Several commercial options sit on top of Kubernetes, such as Run:ai and CoreWeave, providing GPU-aware scheduling and fractional GPU sharing.

GPU Orchestration with Kubernetes

What You Need to Run ML Inference on Kubernetes

Running inference workloads on Kubernetes is not the same as running web services. Here’s what matters:

  1. GPU-aware scheduling: Kubernetes needs to know about your GPUs. The NVIDIA GPU Operator handles driver installation, device plugin registration, and GPU monitoring. Without it, Kubernetes treats your expensive GPU nodes like any other compute.
  2. High-bandwidth storage and networking: Large models need to be loaded fast. If your model takes 30 seconds to load from storage, that’s 30 seconds of cold-start latency on every new pod. Fast storage (NVMe, high-throughput PVCs) and network fabric matter.
  3. Compute provisioning: GPU nodes are expensive and take longer to provision than CPU nodes. Auto-provisioners like Karpenter can help, but you need to plan for the provisioning lag — especially for spot/preemptible instances.
  4. Model weight management: Even with pre-trained models, serving and fine-tuning are compute-heavy. Managing model artifacts (weights, configs) across pods requires a strategy — shared volumes, model registries, or init containers that pull weights at startup.
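A minimal Deployment illustrating points 1 and 4 — the `nvidia.com/gpu` resource request assumes the NVIDIA device plugin (installed by the GPU Operator) is present, and the names, image tag, and PVC are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference          # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-inference
  template:
    metadata:
      labels:
        app: triton-inference
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3   # pick your release
          args: ["tritonserver", "--model-repository=/models"]
          resources:
            limits:
              nvidia.com/gpu: 1   # schedules the pod onto a GPU node
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-repo-pvc   # pre-populated with model weights
```

Mounting weights from a fast PVC (rather than baking them into the image or downloading at startup) is one way to keep the cold-start latency from point 2 under control.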

Scaling Strategies on Kubernetes

Kubernetes supports both horizontal scaling (more pods) and vertical scaling (bigger pods). For inference workloads, horizontal scaling is generally the right approach — you want more replicas of your serving pod behind a load balancer, not bigger pods.

The cost dimension matters a lot here. GPU nodes are 5-10x more expensive than CPU nodes. In a multi-cloud or hybrid setup, you can reduce costs by using spot/preemptible GPU instances for inference traffic that can tolerate occasional interruptions, while keeping a baseline of on-demand instances for guaranteed capacity.

HPA, VPA, and Cluster Autoscaler

Three Kubernetes autoscaling mechanisms work at different levels:

  1. Cluster Autoscaler (or Karpenter): Adjusts the number of nodes. When pods are pending because no GPU node is available, the autoscaler provisions one. When nodes are underutilized, it drains and removes them. For GPU nodes, this is your biggest cost lever.

  2. Horizontal Pod Autoscaler (HPA): Scales the number of pod replicas based on metrics — typically CPU, memory, or custom metrics like request queue depth. For inference, HPA based on GPU utilization or request latency is more useful than default CPU-based scaling.

  3. Vertical Pod Autoscaler (VPA): Adjusts resource requests and limits per pod based on observed usage. VPA is especially useful for preventing OOM kills during model loading, where memory usage can spike well above steady-state.
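As a sketch of point 2, here is an HPA that scales on a custom queue-depth metric rather than CPU. The metric name is hypothetical and assumes you have wired up a metrics adapter (such as Prometheus Adapter) to expose it:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa                 # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference         # the serving Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth   # custom metric via adapter
        target:
          type: AverageValue
          averageValue: "5"             # scale out above ~5 queued requests/pod
```

Queue depth (or request latency) tracks GPU saturation far better than CPU utilization does, since a fully-loaded GPU server can sit at low CPU while requests pile up.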

These three work together: the Cluster Autoscaler ensures you have enough nodes, HPA ensures you have enough pods, and VPA ensures each pod has the right resource allocation. Getting this right for GPU workloads takes tuning — GPU utilization metrics aren’t as straightforward as CPU metrics, and GPU node provisioning is slow compared to CPU nodes.

Load Balancing for Inference

Adding more inference pods only helps if traffic is distributed properly. A few things to watch:

What’s Changed Since 2022

The ML inference landscape has shifted dramatically since I first wrote this post. ChatGPT’s release in late 2022 triggered an industry-wide scramble to deploy LLMs in production, and the tooling has evolved accordingly. Here are the developments that matter most.

LLM Serving Engines

NVIDIA NIM

NVIDIA NIM is NVIDIA’s higher-level inference platform — prebuilt, optimized containers for 220+ models that deploy in minutes. NIM sits above Triton and TensorRT-LLM, abstracting away the complexity of model optimization, quantization, and serving configuration. The NIM Operator for Kubernetes handles multi-model deployment, dynamic resource allocation, and KServe integration.

If you’re in the NVIDIA ecosystem and want the fastest path from model to production, NIM is increasingly the recommended entry point — with Triton and TensorRT-LLM as the underlying infrastructure you can drop down to when you need finer control.

Local and Edge Inference

Serverless GPU Inference

An entire deployment paradigm that didn’t exist when I wrote this post. Platforms like Modal, Replicate, and RunPod provide on-demand GPU access with per-second billing:

Optimization Techniques

Choosing the Right Stack

There is no single right answer — the right inference stack depends on your models, your scale, and your team’s expertise. Here’s how I’d think about it in 2026:

Serving LLMs at scale:

Serving traditional ML/DL models (classification, detection, embeddings):

Local development and edge deployment:

Bursty or low-utilization workloads:

Resource-constrained / no GPU:

The tools keep evolving fast, but the fundamentals don’t change: measure your latency and throughput, understand your cost per prediction, and don’t over-engineer before you have the traffic that demands it.