👉 This post was initially written in 2022 and has been updated to reflect the significant changes in the ML inference landscape since then (PyTorch 2.x, the LLM inference explosion, vLLM, TensorRT-LLM, and more). Some tool versions and links may still refer to older releases.
While you can argue about the legitimacy of the acronym AI, you cannot ignore that machine learning models are now embedded in production systems at massive scale — from real-time bidding and fraud detection to recommendation engines and content generation. At Adobe, I’ve seen firsthand how the demand for fast, reliable, and cost-efficient inference has grown as models get larger and more complex.
This post surveys the main tools and strategies for scaling ML inference. I’ll cover serving platforms (NVIDIA Triton, TorchServe, ONNX Runtime), model optimization (PyTorch compilation, OpenAI Triton), and GPU orchestration on Kubernetes — with opinions on what works and what’s been superseded.
Understanding Machine Learning Inference
What is Inference?
After a machine learning model has been trained on a dataset and learned the underlying patterns, it can be used to generate predictions on new inputs or unseen data. This process — called inference — is how ML models get used in the real world: serving predictions, classifying inputs, generating text, or scoring risk in real time.
Inference runs on different hardware profiles than training. Training is computationally intensive and requires powerful GPUs or clusters. Inference can be executed on various devices, from cloud-based servers to edge devices, but at scale it still demands serious GPU resources — especially for large language models.
Challenges in Scaling Inference
Scaling inference gets hard in four dimensions:
- Latency: Low-latency predictions are non-negotiable for fraud detection, real-time bidding, or any application with stringent UX expectations. As request volume climbs, keeping p99 latency under control becomes the primary constraint.
- Throughput: Handling many parallel requests efficiently. At TubeMogul, we were processing 350 billion bid requests daily — throughput at that scale is an engineering discipline in itself.
- Cost Efficiency: GPUs are expensive. Balancing inference performance against hardware cost — and minimizing cost per prediction — is critical for long-term viability.
- Reliability: Consistent performance and high availability for business-critical ML services. Downtime on a serving pipeline can have direct revenue impact.
Serving Models for Inference
NVIDIA Triton Inference Server
Overview and Features
NVIDIA Triton Inference Server (formerly TensorRT Inference Server) is an open-source serving platform for ML models across multiple frameworks. It’s one of the more mature options for production inference and handles a lot of the operational concerns out of the box.
Notable features include:
- Multi-framework support: Compatible with TensorFlow, PyTorch, ONNX Runtime, and TensorRT — so you’re not locked into a single ecosystem.
- Model ensemble support: Combine multiple models into a single inference pipeline.
- Dynamic batching: Automatically aggregates multiple inference requests for better GPU utilization and throughput.
- Model versioning: Built-in support for versioned model deployments and rollbacks.
- GPU acceleration: Designed to make full use of NVIDIA GPUs, including TensorRT optimization.
Implementing Triton for Scalable Inference
To deploy machine learning models using NVIDIA Triton Inference Server, follow these steps:
- Install Triton: First, you need to install Triton on your target hardware. You can use the pre-built Docker container provided by NVIDIA or build Triton from the source code. For detailed installation instructions, refer to the official documentation.
- Prepare your models: Convert your trained machine-learning models into a format supported by Triton. This step may involve exporting models from TensorFlow, PyTorch, or other frameworks to ONNX or TensorRT formats. Additionally, organize your models in a directory structure that follows Triton’s model repository layout.
- Configure Triton: Create a configuration file for each model you want to deploy, specifying parameters like input and output tensor names, dimensions, data types, and optimization settings. For more information on creating configuration files, consult the Triton documentation.
- Launch Triton: Start the Triton server with your prepared model repository, specifying the path to your models and any additional settings like the number of GPUs, HTTP/GRPC ports, and logging preferences.
- Send inference requests: Once Triton is running, you can send inference requests to the server using the HTTP or gRPC APIs. Client libraries are also available for several programming languages, which makes it easy to integrate Triton with your existing applications.
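To make step three concrete, here is what a minimal config.pbtxt might look like for a hypothetical ONNX image classifier (the model name, tensor names, and dimensions are placeholders for the example):

```
name: "resnet50_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

Triton expects this file at models/resnet50_onnx/config.pbtxt, with the weights at models/resnet50_onnx/1/model.onnx (the numbered directory is the model version). The dynamic_batching block is what enables the request aggregation described above; the queue delay bounds how long Triton waits to fill a batch.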
Triton remains one of the strongest options for multi-model, multi-framework serving — especially if you’re already in the NVIDIA ecosystem.
TorchServe
Introduction to TorchServe
TorchServe is an open-source tool for serving PyTorch models in production, developed jointly by AWS and Facebook. If your stack is PyTorch-centric, TorchServe is the most direct path from trained model to production endpoint.
Key features:
- Native PyTorch support: No model conversion needed — serve PyTorch models directly.
- Model versioning: Simplified model management with version control for deployments.
- Batching: Configurable batching to improve GPU utilization and throughput.
- Customizable pre/post-processing: Plug in custom logic around inference without modifying the model itself.
- Metrics and monitoring: Exposes metrics via a RESTful API for integration with your existing monitoring stack.
Deploying Models with TorchServe
At a high level, deploying your PyTorch models with TorchServe involves the following steps:
- Install TorchServe: Begin by installing TorchServe and its dependencies. You can do this using pip or by building TorchServe from the source. For detailed installation instructions, refer to the official documentation.
- Export your model: Export your trained PyTorch model as a TorchScript file using the torch.jit.trace or torch.jit.script methods. TorchScript is a statically-typed subset of Python that can be optimized and executed by the Torch JIT (Just-In-Time) compiler, improving inference performance.
- Create a model archive: Package your TorchScript model and any necessary metadata and configuration files into a model archive file. This file is a compressed archive containing all the required components for TorchServe to serve your model.
- Start TorchServe: Launch TorchServe with your model archive, specifying the desired settings for REST APIs, logging, and other configurable options.
- Send inference requests: Once TorchServe is running, you can send inference requests to the server using the REST APIs. Client libraries are also available for several programming languages, which makes it easy to integrate TorchServe with your existing applications.
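To give a feel for the customizable pre/post-processing mentioned above, here is a stdlib-only sketch of the handler contract. A real TorchServe handler subclasses ts.torch_handler.base_handler.BaseHandler and runs a loaded TorchScript module; the class, the dummy model, and the payload shape here are invented stand-ins:

```python
import json

class SentimentHandler:
    """Simplified stand-in for a TorchServe custom handler.

    A real handler subclasses ts.torch_handler.base_handler.BaseHandler,
    and self.model would be a loaded TorchScript module.
    """

    def __init__(self, model=None):
        # Dummy "model": odd-length text -> positive, even-length -> negative.
        self.model = model or (lambda texts: [len(t) % 2 for t in texts])

    def preprocess(self, requests):
        # TorchServe hands the handler a list of request dicts (one per
        # batched request); pull the raw payload out of each.
        return [json.loads(r["body"])["text"] for r in requests]

    def inference(self, inputs):
        return self.model(inputs)

    def postprocess(self, outputs):
        # Must return one JSON-serializable response per request, in order.
        return [{"label": "positive" if o else "negative"} for o in outputs]

    def handle(self, requests):
        return self.postprocess(self.inference(self.preprocess(requests)))

responses = SentimentHandler().handle(
    [{"body": json.dumps({"text": "ok"})}, {"body": json.dumps({"text": "bad"})}]
)
```

The three-stage shape (preprocess, inference, postprocess) is the useful part: it keeps serialization concerns out of the model itself, which is exactly the point of the feature.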
TorchServe is a solid choice if you’re all-in on PyTorch and want a lightweight serving layer without the overhead of a multi-framework platform.
ONNX Inference
Introduction to ONNX
The Open Neural Network Exchange (ONNX) is an open standard for representing machine learning models, developed by an open source community with open governance (founding members include Microsoft, Facebook, and IBM). ONNX provides a common format for model interchange between frameworks like TensorFlow, PyTorch, and Caffe2. This matters because it decouples your training framework from your deployment target.
ONNX Runtime is the cross-platform inference engine for ONNX models. It runs on CPUs, GPUs, and edge devices, and it applies graph-level optimizations that can meaningfully reduce latency.
Benefits of ONNX Inference
- Framework interoperability: Train in PyTorch, deploy with ONNX Runtime. No vendor lock-in.
- Optimized performance: ONNX Runtime applies graph optimizations, operator fusion, and quantization to reduce latency and improve throughput.
- Hardware compatibility: Deploy the same model across CPUs, GPUs, and edge devices.
Deploying ONNX Models for Inference
At a high level, leveraging ONNX for scalable machine learning inference involves the following steps:
- Convert your model to ONNX format: Export your trained machine learning model from your preferred deep learning framework (e.g., TensorFlow or PyTorch) to the ONNX format using the appropriate conversion tools or libraries. Refer to the ONNX tutorials.
- Install ONNX Runtime: Install the ONNX Runtime inference engine on your target platform, ensuring you have the necessary dependencies and hardware support.
- Load and run your ONNX model: Use the ONNX Runtime APIs to load your ONNX model, prepare input data, and execute inference requests. The APIs are available for various programming languages like Python, C++, and C#.
- Integrate with serving solutions: You can also deploy your ONNX models using popular serving solutions, such as NVIDIA Triton Inference Server or TorchServe, which offer native support for ONNX models.
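Loading and running a model with ONNX Runtime’s Python API takes only a few lines. A hedged sketch (the model path and input shape are placeholders, and this assumes onnxruntime is installed via pip):

```python
import numpy as np
import onnxruntime as ort

# Create an inference session; ONNX Runtime picks the first available
# execution provider from the list (GPU if present, else CPU).
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Input/output names come from the exported graph, not from your code.
input_name = session.get_inputs()[0].name
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Run inference: the first argument lists the outputs to fetch (None = all).
outputs = session.run(None, {input_name: batch})
print(outputs[0].shape)
```

The providers list is the key knob for the hardware-compatibility benefit above: the same script runs on a GPU box and a CPU-only box without changes.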
ONNX is particularly valuable when you need to deploy the same model across different hardware targets or when you want to decouple your training stack from your serving stack.
Model Optimization for Inference
PyTorch Compilation (torch.compile / TorchDynamo)
👉 Update (2024): Since I originally wrote this section about TorchDynamo as a standalone tool, PyTorch 2.0 was released and fundamentally changed the compilation story. TorchDynamo is now integrated into PyTorch as the backend for torch.compile(), which is the recommended way to optimize models. The legacy torchdynamo repository is archived.
How It Works
torch.compile() uses TorchDynamo under the hood to capture the computation graph from your PyTorch model and optimize it through a backend compiler (TorchInductor by default). The result: faster execution with minimal code changes.
The main optimizations include:
- Operator fusion: Combining multiple operations into a single kernel, reducing overhead.
- Kernel specialization: Generating optimized kernels for specific input shapes and data types.
- Memory optimizations: Reusing buffers and minimizing intermediate allocations.
Using torch.compile
The simplest path — and the one I’d recommend starting with:
- Install PyTorch 2.x: torch.compile() is available in PyTorch 2.0 and later.
- Compile your model: Wrap your model with torch.compile(). This is often a single-line change: model = torch.compile(model).
- Deploy: Use your preferred serving solution (TorchServe, Triton, etc.) with the compiled model.
For inference specifically, you can specify the mode: torch.compile(model, mode="reduce-overhead") minimizes framework overhead at the cost of a longer compilation step. In my experience, the compilation step itself can take a while on the first run, but the inference speedup is significant — especially for models with complex control flow that the old TorchScript approach struggled with.
Facebook AITemplate (AIT) — Archived
⚠️ Update (2024): Facebook AITemplate has been archived and is no longer maintained. The project was promising but has been superseded by improvements in PyTorch’s native compilation stack (torch.compile) and NVIDIA’s TensorRT-LLM. I’m keeping this section for historical reference, but I would not start a new project with AITemplate.
Facebook AITemplate was an open-source framework that rendered neural networks into high-performance CUDA/HIP C++ code. It achieved near-roofline performance on NVIDIA TensorCore and AMD MatrixCore for fp16 calculations and supported both GPU vendors from a single codebase.
Its key strengths were:
- High performance: Near-roofline fp16 performance on major models (ResNet, BERT, VisionTransformer, Stable Diffusion).
- Cross-vendor GPU support: A unified framework for both NVIDIA and AMD GPUs.
- Extensive fusion support: More fusion patterns than competing solutions at the time.
The project’s abandonment is a reminder that betting on a single optimization framework carries risk. The PyTorch compilation stack and NVIDIA’s own TensorRT path have proven more durable.
OpenAI Triton
What is OpenAI Triton?
OpenAI Triton (not to be confused with NVIDIA Triton Inference Server) is an open-source programming language and compiler for writing high-performance GPU kernels. Triton provides a Python-embedded DSL that lets you write GPU code without dropping down to raw CUDA — the compiler handles the low-level optimizations.
This is relevant to inference scaling because it’s what powers torch.compile’s TorchInductor backend. When you call torch.compile(), the generated kernels are often Triton kernels under the hood.
Key benefits:
- Near-CUDA performance: Triton kernels can match hand-tuned CUDA code without the development cost.
- Python-native: Write and test GPU kernels directly in Python — no separate compilation toolchain.
- Powers the PyTorch stack: Triton is now deeply integrated into PyTorch’s compilation pipeline, making it increasingly important for anyone working with PyTorch inference optimization.
Integrating OpenAI Triton into Your Inference Pipeline
At a high level, leveraging OpenAI Triton to accelerate your machine learning models involves the following steps:
- Install OpenAI Triton: Begin by installing the Triton compiler and its dependencies. Detailed installation instructions can be found in the official documentation.
- Implement custom GPU kernels: Write custom GPU kernels using OpenAI Triton’s Python-like syntax, focusing on the specific operations and optimizations most relevant to your machine learning workloads.
- Compile and test your kernels: Compile your custom Triton kernels and test them for correctness and performance. Benchmark your kernels against existing implementations to ensure your optimizations are effective.
- Integrate Triton kernels into your models: Modify your machine learning models to use your custom Triton kernels instead of the default implementations provided by your deep learning framework. This process may involve updating your model code or creating custom PyTorch or TensorFlow layers that utilize your Triton kernels.
- Deploy your optimized models: With your custom Triton kernels integrated, deploy your optimized machine learning models using your preferred serving solution, such as NVIDIA Triton Inference Server or TorchServe.
Writing custom Triton kernels is most relevant when you have specific bottleneck operations that the default PyTorch kernels don’t handle well — attention mechanisms, custom activation functions, or domain-specific operations where you need every last bit of GPU performance.
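For a taste of the programming model, the vector-add kernel from the Triton tutorials is the canonical starting point (this requires an NVIDIA GPU and the triton package; the block size is an arbitrary choice for the example):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```

Notice what is absent: no CUDA toolchain, no explicit thread indexing within the block, no shared-memory management. The compiler handles those, which is the whole pitch.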
LLM Inference Techniques (2024–2026)
The explosion of large language model deployment since ChatGPT has driven a wave of inference-specific optimizations that didn’t exist when I originally wrote this post. These techniques are now fundamental to anyone serving transformer-based models at scale.
FlashAttention
FlashAttention is arguably the single most impactful inference optimization of the past few years. Standard attention computes the full N×N attention matrix, which is memory-bound and slow. FlashAttention rewrites the attention computation to be IO-aware — it tiles the computation to minimize reads/writes to GPU high-bandwidth memory (HBM) and keeps intermediate results in on-chip SRAM.
The practical impact: FlashAttention-3 reaches 75-85% utilization on H100 GPUs, up from 35% with FlashAttention-2. With FP8 support, it achieves 1.3 PFLOPS. Every major inference engine (vLLM, SGLang, TensorRT-LLM) uses FlashAttention under the hood. If you’re writing custom attention kernels without it, you’re leaving significant performance on the table.
Continuous Batching
Traditional static batching waits for a full batch of requests before processing them together. This adds latency (you wait for the batch to fill) and wastes compute (shorter sequences pad to the longest sequence length).
Continuous batching (also called iteration-level batching) processes requests at the token level. As soon as one request in a batch finishes generating, a new request takes its slot — no waiting, no padding. This is how vLLM, SGLang, and TensorRT-LLM handle batching by default. It’s one of the main reasons these tools dramatically outperform naive serving solutions.
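A toy scheduler makes the difference concrete. This is a pure-Python simulation where the "decode step" just decrements a token counter; all names are invented for the sketch:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Simulate iteration-level batching.

    requests: list of (request_id, tokens_to_generate) tuples.
    Returns (completion order, total decode iterations).
    """
    pending = deque(requests)
    active = {}          # request_id -> tokens still to generate
    finished = []
    steps = 0

    while pending or active:
        # Refill freed slots immediately instead of waiting for a full batch.
        while pending and len(active) < max_batch:
            rid, n_tokens = pending.popleft()
            active[rid] = n_tokens
        # One decode iteration: every active request emits one token.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                finished.append(rid)   # done: its slot frees up this iteration
                del active[rid]
    return finished, steps

order, steps = continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 1)])
```

For these five requests, static batches of four would need six decode iterations (five for the first batch, padded to its longest member, plus one for e), while the continuous scheduler finishes in five, and e never waits for a batch to fill.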
KV-Cache Management
During autoregressive generation, the model computes key-value pairs for each token. Recomputing them at every step would be prohibitively expensive, so they’re cached — the KV-cache. For large models with long contexts, this cache can consume tens of gigabytes of GPU memory per request.
Two approaches dominate:
- PagedAttention (vLLM): Treats KV-cache like OS virtual memory — allocating it in fixed-size pages rather than contiguous blocks. This eliminates memory fragmentation and allows much higher concurrency.
- RadixAttention (SGLang): Uses a radix tree (a compressed prefix trie) to automatically share KV-cache across requests with common prefixes. Particularly effective for RAG pipelines, few-shot prompting, and multi-turn conversations where many requests share the same system prompt.
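The page-table idea behind PagedAttention fits in a short stdlib sketch (the page size and free-list policy here are simplifications; real engines also track reference counts for prefix sharing):

```python
class PagedKVCache:
    """Toy block allocator in the spirit of vLLM's PagedAttention.

    KV memory is split into fixed-size pages; each sequence holds a
    page table mapping its logical token positions to physical pages.
    """

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}      # seq_id -> tokens written so far

    def append_token(self, seq_id: str) -> int:
        """Reserve cache space for one new token; return its physical page."""
        table = self.page_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:  # current page full (or first token)
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            # Any free page will do: no contiguity requirement, hence
            # no external fragmentation.
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's pages to the pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8, page_size=4)
for _ in range(5):  # 5 tokens for sequence "s1" -> spills onto a second page
    cache.append_token("s1")
```

The contrast with contiguous allocation is the point: a sequence only ever wastes the tail of its last page, rather than a whole pre-reserved max-length buffer.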
Speculative Decoding
LLM inference is bottlenecked by the sequential nature of autoregressive generation — one token at a time. Speculative decoding works around this by using a small, fast “draft” model to generate multiple candidate tokens, then verifying them in a single forward pass through the full model. Since verification is parallelizable (unlike generation), this yields 2-3x speedups in practice.
Both vLLM and SGLang support speculative decoding in production. Meta’s EAGLE-based approach for Llama models achieves ~4ms per token on 8 H100s. The technique is now mature enough that I’d consider it standard practice for any latency-sensitive LLM deployment.
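The draft-then-verify loop can be sketched without any real model. Here the token "models" are stand-in functions, and acceptance is exact-match on greedy tokens; production systems compare draft and target probability distributions instead:

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """One round of speculative decoding.

    draft_model / target_model: fn(context) -> next token (greedy stand-ins).
    Returns the tokens accepted this round (always at least one).
    """
    # 1. Draft k candidate tokens cheaply, one at a time.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify the candidates with the big model. In a real engine this is
    #    a single batched forward pass, which is where the speedup comes from.
    accepted = []
    ctx = list(prefix)
    for t in draft:
        if target_model(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break

    # 3. On mismatch (or full acceptance) the target model's own next token
    #    is appended, so every round makes progress and output quality is
    #    exactly what the target model alone would produce.
    accepted.append(target_model(ctx))
    return accepted

# Toy models: the target predicts the next integer; the draft agrees
# except after seeing a 2, where it guesses wrong.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] != 2 else 99

out = speculative_step(draft, target, prefix=[0], k=4)
```

In this run the draft gets two tokens right before diverging, so three tokens come out of one verification pass instead of one.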
Disaggregated Prefill and Decode
LLM inference has two phases with very different compute profiles: prefill (processing the input prompt — compute-bound, parallelizable) and decode (generating output tokens — memory-bound, sequential). Running both phases on the same hardware forces a compromise.
Disaggregated serving splits these phases across different hardware or processes. Prefill runs on compute-optimized instances, decode runs on memory-optimized instances. Research results show up to 74% reduction in P99 latency. This is still an emerging pattern in production, but vLLM and SGLang both support it, and I expect it to become standard for large-scale deployments.
Multi-LoRA Serving
If you’re serving multiple fine-tuned model variants (common in multi-tenant platforms), you don’t need a separate GPU for each variant. Multi-LoRA serving loads the base model once and dynamically swaps lightweight LoRA adapters per request. Both vLLM and SGLang support batching requests across different LoRA adapters on the same base model — a significant cost optimization when you have dozens or hundreds of fine-tuned variants.
Specialized GPU Orchestration and Scheduling for Optimizing GPU Usage
The Importance of GPU Orchestration and Scheduling
Once you’ve picked your serving platform and optimized your models, the next challenge is managing GPU resources at scale. This is where most teams underestimate the complexity. GPU hardware is expensive, and poor scheduling means you’re either wasting money on idle GPUs or starving workloads that need them.
The core challenges:
- Resource contention: Multiple models and teams competing for the same GPU pool. Without proper scheduling, you get bottlenecks and wasted capacity.
- Dynamic scaling: Inference traffic is rarely constant. You need to scale GPU resources up and down with demand — not just CPU and memory.
- Fault tolerance: GPUs fail. Drivers crash. You need graceful handling of hardware failures without dropping inference requests.
Key Solutions for GPU Orchestration and Scheduling
The main options:
- Kubernetes: The dominant choice for most teams. Extended with NVIDIA GPU Operator, Kubeflow, and tools like Karpenter for node provisioning. At Adobe, Kubernetes is our primary platform for GPU-accelerated ML workloads.
- SLURM: SLURM remains the standard for HPC and research clusters. It provides fine-grained GPU scheduling based on memory, power, and device type. If you’re in an academic or research environment, this is likely what you’re using.
- Apache Mesos: Moved to the Apache Attic and effectively end-of-life. If you’re still running Mesos, plan your migration.
Several commercial options sit on top of Kubernetes, such as Run:ai and CoreWeave, providing GPU-aware scheduling and fractional GPU sharing.
GPU Orchestration with Kubernetes
What You Need to Run ML Inference on Kubernetes
Running inference workloads on Kubernetes is not the same as running web services. Here’s what matters:
- GPU-aware scheduling: Kubernetes needs to know about your GPUs. The NVIDIA GPU Operator handles driver installation, device plugin registration, and GPU monitoring. Without it, Kubernetes treats your expensive GPU nodes like any other compute.
- High-bandwidth storage and networking: Large models need to be loaded fast. If your model takes 30 seconds to load from storage, that’s 30 seconds of cold-start latency on every new pod. Fast storage (NVMe, high-throughput PVCs) and network fabric matter.
- Compute provisioning: GPU nodes are expensive and take longer to provision than CPU nodes. Auto-provisioners like Karpenter can help, but you need to plan for the provisioning lag — especially for spot/preemptible instances.
- Model weight management: Even with pre-trained models, serving and fine-tuning are compute-heavy. Managing model artifacts (weights, configs) across pods requires a strategy — shared volumes, model registries, or init containers that pull weights at startup.
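A minimal pod spec ties these concerns together. The image tags, bucket path, and names below are placeholders; this assumes the NVIDIA device plugin (via GPU Operator) is installed so nvidia.com/gpu is a schedulable resource:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: triton-server
spec:
  initContainers:
    - name: fetch-weights            # pull model artifacts before serving starts
      image: amazon/aws-cli
      command: ["aws", "s3", "sync", "s3://my-models/resnet50/", "/models/resnet50/"]
      volumeMounts:
        - name: model-repo
          mountPath: /models
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.01-py3
      args: ["tritonserver", "--model-repository=/models"]
      resources:
        limits:
          nvidia.com/gpu: 1          # GPU-aware scheduling via the device plugin
      volumeMounts:
        - name: model-repo
          mountPath: /models
      readinessProbe:                # no traffic until models are loaded
        httpGet:
          path: /v2/health/ready
          port: 8000
  volumes:
    - name: model-repo
      emptyDir: {}
```

The init container is one of the weight-management strategies mentioned above; a shared read-only PVC or a model registry sidecar are the usual alternatives when weights are large enough that per-pod pulls hurt cold-start time.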
Scaling Strategies on Kubernetes
Kubernetes supports both horizontal scaling (more pods) and vertical scaling (bigger pods). For inference workloads, horizontal scaling is generally the right approach — you want more replicas of your serving pod behind a load balancer, not bigger pods.
The cost dimension matters a lot here. GPU nodes are 5-10x more expensive than CPU nodes. In a multi-cloud or hybrid setup, you can reduce costs by using spot/preemptible GPU instances for inference traffic that can tolerate occasional interruptions, while keeping a baseline of on-demand instances for guaranteed capacity.
HPA, VPA, and Cluster Autoscaler
Three Kubernetes autoscaling mechanisms work at different levels:
Cluster Autoscaler (or Karpenter): Adjusts the number of nodes. When pods are pending because no GPU node is available, the autoscaler provisions one. When nodes are underutilized, it drains and removes them. For GPU nodes, this is your biggest cost lever.
Horizontal Pod Autoscaler (HPA): Scales the number of pod replicas based on metrics — typically CPU, memory, or custom metrics like request queue depth. For inference, HPA based on GPU utilization or request latency is more useful than default CPU-based scaling.
Vertical Pod Autoscaler (VPA): Adjusts resource requests and limits per pod based on observed usage. VPA is especially useful for preventing OOM kills during model loading, where memory usage can spike well above steady-state.
These three work together: the Cluster Autoscaler ensures you have enough nodes, HPA ensures you have enough pods, and VPA ensures each pod has the right resource allocation. Getting this right for GPU workloads takes tuning — GPU utilization metrics aren’t as straightforward as CPU metrics, and GPU node provisioning is slow compared to CPU nodes.
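As an example of non-default HPA metrics, here is a manifest scaling on average GPU utilization. This assumes a metrics adapter (such as prometheus-adapter reading DCGM exporter data) exposes DCGM_FI_DEV_GPU_UTIL as a per-pod metric; all names are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-server
  minReplicas: 2              # keep warm capacity: GPU pods are slow to start
  maxReplicas: 16
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "70"  # scale out above ~70% average GPU utilization
```

The generous minReplicas is deliberate: because GPU node provisioning is slow, scaling from zero on a traffic spike means minutes of queueing, not seconds.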
Load Balancing for Inference
Adding more inference pods only helps if traffic is distributed properly. A few things to watch:
- Request routing matters: ML inference requests are not all equal. A request to a large language model takes orders of magnitude longer than a classification request. If your load balancer uses round-robin, you’ll end up with some pods overloaded and others idle. Least-connections or latency-aware routing works better.
- Warm-up time: A freshly started inference pod may need to load model weights into GPU memory before it can serve requests. Your load balancer should respect readiness probes and not send traffic to pods that aren’t ready.
- gRPC vs HTTP: Many inference servers (Triton, TorchServe) support both. gRPC generally performs better for high-throughput inference, but HTTP/2 load balancing in Kubernetes requires an L7-aware proxy (like Istio or Envoy).
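The least-connections policy itself is tiny. A stdlib sketch (in production you would get this from Envoy or Istio rather than writing it yourself):

```python
class LeastConnectionsRouter:
    """Route each request to the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self.in_flight = {b: 0 for b in backends}

    def acquire(self) -> str:
        # Pick the least-loaded backend; ties break by insertion order.
        backend = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[backend] += 1
        return backend

    def release(self, backend: str) -> None:
        self.in_flight[backend] -= 1

router = LeastConnectionsRouter(["pod-a", "pod-b"])
slow = router.acquire()   # a long LLM request lands on pod-a
fast = router.acquire()   # a quick classification request lands on pod-b
router.release(fast)      # the quick request finishes...
nxt = router.acquire()    # ...so the next request also goes to pod-b,
                          # where round-robin would have sent it to busy pod-a
```

This is exactly the failure mode in the first bullet: round-robin counts requests, least-connections counts outstanding work, and for heterogeneous inference traffic only the latter tracks reality.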
What’s Changed Since 2022
The ML inference landscape has shifted dramatically since I first wrote this post. ChatGPT’s release in late 2022 triggered an industry-wide scramble to deploy LLMs in production, and the tooling has evolved accordingly. Here are the developments that matter most.
LLM Serving Engines
- vLLM: Introduced PagedAttention for efficient KV-cache management and became the first widely-adopted production LLM serving engine. Supports the broadest range of hardware (NVIDIA, AMD, Intel, AWS Trainium, TPU) and 50+ model architectures.
- SGLang: Emerged as vLLM’s primary competitor, using RadixAttention for prefix-caching and excelling at structured output, multi-turn conversations, and agentic workloads. Now powers 400,000+ GPUs in production. In benchmarks, SGLang shows 29% throughput advantages on smaller models and significantly better tail latency. For agent-based systems and RAG pipelines, I’d lean toward SGLang.
- NVIDIA TensorRT-LLM: NVIDIA’s kernel-optimized inference library for LLMs, integrating with Triton Inference Server. Extracts maximum performance from NVIDIA hardware but with less model flexibility than vLLM or SGLang.
- Hugging Face TGI: Was a popular early option but was placed in maintenance mode in December 2025, with Hugging Face directing users toward vLLM or SGLang for new deployments.
NVIDIA NIM
NVIDIA NIM is NVIDIA’s higher-level inference platform — prebuilt, optimized containers for 220+ models that deploy in minutes. NIM sits above Triton and TensorRT-LLM, abstracting away the complexity of model optimization, quantization, and serving configuration. The NIM Operator for Kubernetes handles multi-model deployment, dynamic resource allocation, and KServe integration.
If you’re in the NVIDIA ecosystem and want the fastest path from model to production, NIM is increasingly the recommended entry point — with Triton and TensorRT-LLM as the underlying infrastructure you can drop down to when you need finer control.
Local and Edge Inference
- llama.cpp (100K+ GitHub stars): The standard for running LLMs locally. A single statically-linked C++ binary that serves an OpenAI-compatible API with startup times under 5 seconds. The GGUF format packages model weights with full metadata, and quantization formats like Q4_K_M allow 70B+ parameter models to run on consumer hardware (Apple M-series, NVIDIA consumer GPUs, even CPUs). As of late 2025, the server supports dynamic model loading/unloading with LRU eviction.
- Ollama: A user-friendly wrapper around llama.cpp that simplifies model management. Good for local development and prototyping — I covered it in my agentic development post.
Serverless GPU Inference
An entire deployment paradigm that didn’t exist when I wrote this post. Platforms like Modal, Replicate, and RunPod provide on-demand GPU access with per-second billing:
- Cold starts: Modal achieves 2-8 seconds on A10G, which is fast enough for many production workloads.
- Pricing: H100s at $3.20-5.00/hr depending on the platform — significantly cheaper than reserved cloud instances at low utilization.
- When it makes sense: Bursty inference traffic, prototyping, or workloads where GPU utilization would be below 50% on dedicated instances. Above 50% utilization, persistent instances on RunPod or bare-metal cloud become more cost-effective.
Optimization Techniques
- PyTorch 2.x and torch.compile(): As discussed in the updated section above, torch.compile() replaced TorchScript as the standard PyTorch optimization path.
- Quantization: INT8 and INT4 quantization (via GPTQ, AWQ, or bitsandbytes) is now standard practice. The accuracy trade-off is minimal for most applications, and the memory/cost savings are substantial.
- Fractional GPU sharing: Tools like NVIDIA MPS and Run:ai make it practical to share a single GPU across multiple smaller inference workloads.
- Speculative decoding, FlashAttention, continuous batching, KV-cache management: Covered in the LLM Inference Techniques section above. These are no longer optional optimizations — they’re built into every serious serving framework.
Choosing the Right Stack
There is no single right answer — the right inference stack depends on your models, your scale, and your team’s expertise. Here’s how I’d think about it in 2026:
Serving LLMs at scale:
- vLLM or SGLang behind Triton (or via NVIDIA NIM), on Kubernetes with GPU Operator and Karpenter. Quantize with AWQ or GPTQ. Use speculative decoding for latency-sensitive workloads.
- SGLang if you’re doing multi-turn, structured output, or agentic workflows. vLLM for broader hardware support and ecosystem maturity.
- TensorRT-LLM if you need maximum throughput on NVIDIA hardware and can accept the tighter model support.
Serving traditional ML/DL models (classification, detection, embeddings):
- Triton or TorchServe, with torch.compile() for PyTorch models and ONNX Runtime for cross-framework portability.
Local development and edge deployment:
- llama.cpp with GGUF models. Ollama for quick local prototyping. Apple MLX for Apple Silicon workloads.
Bursty or low-utilization workloads:
- Serverless GPU (Modal or RunPod Serverless). Avoid paying for idle GPUs.
Resource-constrained / no GPU:
- ONNX Runtime on CPU can handle smaller models (embeddings, classification) at production scale. llama.cpp on CPU with aggressive quantization for LLMs — slower but functional.
The tools keep evolving fast, but the fundamentals don’t change: measure your latency and throughput, understand your cost per prediction, and don’t over-engineer before you have the traffic that demands it.