Edge Inference in 2026: Building AI Products That Run On-Device
Cloud AI adds latency, costs, and privacy concerns. A practical guide to deploying ML models on-device with LiteRT, ONNX Runtime, and modern edge infrastructure.
TL;DR
- Edge inference runs ML models directly on devices (phones, browsers, IoT) instead of calling cloud APIs.
- Benefits: lower latency (5–50ms vs 200–500ms), works offline, keeps sensitive data on-device, reduces cloud costs.
- LiteRT (Google’s successor to TensorFlow Lite) powers 100,000+ apps across billions of devices.
- ONNX Runtime provides cross-platform support for PyTorch, TensorFlow, and JAX models.
- Modern hardware acceleration (NPUs, GPU via ML Drift) makes complex models viable on mobile.
- Trade-offs: limited model size, device fragmentation, harder debugging.
- Use edge for latency-critical, privacy-sensitive, or offline-required features; cloud for complex reasoning.
Why Edge Inference Matters
Cloud AI has dominated the last decade, but edge inference is becoming the default for many use cases:
| Factor | Cloud AI | Edge AI |
|---|---|---|
| Latency | 200–500ms+ | 5–50ms |
| Offline capability | None | Full functionality |
| Privacy | Data leaves device | Data stays on-device |
| Cost | Per-inference pricing | One-time model deployment |
| Reliability | Network dependent | Always available |
The shift isn’t about replacing cloud AI—it’s about using the right approach for each use case.
When to Use Edge vs. Cloud
Use Edge When:
- Latency matters: Real-time camera effects, voice transcription, gaming
- Privacy is critical: Health data, financial info, biometrics
- Offline is required: Field work, travel, unreliable connectivity
- Cost scales with usage: High-frequency inference, millions of users
- Personalization: On-device learning, user-specific models
Use Cloud When:
- Model complexity exceeds device capability: Large language models, complex reasoning
- Frequent model updates: A/B testing, continuous learning
- Cross-device consistency: Results must match exactly across platforms
- Compute-heavy tasks: Video generation, large-scale analysis
Hybrid Approach
Many products use both:
```
User interaction → Edge model (fast response)
        ↓
  Low confidence?
        ↓
Cloud model (accurate response)
```
Example: Voice assistant recognizes “Hey Siri” on-device (fast), sends complex queries to cloud (accurate).
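The routing step above can be sketched as a confidence gate. This is illustrative only: the threshold value and the `edge_model`/`cloud_model` call signatures are assumptions, not a specific API.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.8  # product-specific; tune against real traffic

def classify(input_data, edge_model, cloud_model):
    """Answer on-device when the edge model is confident, else escalate."""
    probs = edge_model(input_data)            # fast local pass
    if float(np.max(probs)) >= CONFIDENCE_THRESHOLD:
        return int(np.argmax(probs))          # confident: stay on-device
    return cloud_model(input_data)            # uncertain: pay the network cost
```

In practice the threshold is chosen by measuring how often escalations actually change the answer.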
Edge AI Platforms
LiteRT (Google AI Edge)
Successor to TensorFlow Lite, powering 100,000+ apps:
| Feature | Capability |
|---|---|
| Hardware acceleration | GPU via ML Drift, NPUs across vendors |
| Platform support | Android, iOS, web, IoT, desktop |
| Model sources | TensorFlow, JAX, Keras, PyTorch (via conversion) |
| Optimization | Quantization, pruning, compilation |
Getting started:
```kotlin
// Android with LiteRT
val interpreter = Interpreter(loadModel("model.tflite"))
val inputBuffer = ByteBuffer.allocateDirect(INPUT_SIZE)
val outputBuffer = ByteBuffer.allocateDirect(OUTPUT_SIZE)
interpreter.run(inputBuffer, outputBuffer)
```
ONNX Runtime
Cross-platform inference engine:
| Feature | Capability |
|---|---|
| Model format | ONNX (converted from PyTorch, TensorFlow, etc.) |
| Platforms | Windows, Linux, macOS, iOS, Android, web |
| Execution providers | CPU, CUDA, CoreML, NNAPI, DirectML |
| On-device training | Supported |
Getting started:
```python
import onnxruntime as ort

# Passing providers explicitly avoids surprises on builds with GPU support
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": input_data})
```
MediaPipe (High-Level APIs)
Pre-built ML solutions for common tasks:
| Task | Use Case |
|---|---|
| Face detection | Camera effects, authentication |
| Pose estimation | Fitness apps, gaming |
| Hand tracking | Gesture control, sign language |
| Object detection | AR, accessibility |
| Text classification | Content moderation, intent |
MediaPipe handles model optimization, platform abstraction, and hardware acceleration.
Model Optimization
Edge devices have limited compute. Optimization is essential.
Quantization
Reduce precision from float32 to smaller formats:
| Format | Size Reduction | Accuracy Impact |
|---|---|---|
| float16 | 2x | Minimal |
| int8 | 4x | Small (<1% loss typical) |
| int4 | 8x | Moderate (needs calibration) |
```python
# TensorFlow post-training int8 quantization
# A representative dataset is required to calibrate activation ranges
converter = tf.lite.TFLiteConverter.from_saved_model(model_path)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
quantized_model = converter.convert()
```
Pruning
Remove unnecessary weights:
- Structured pruning: Remove entire neurons/filters
- Unstructured pruning: Remove individual weights
- Magnitude-based: Remove smallest weights
- Typical result: 50–90% weight reduction with <2% accuracy loss
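The magnitude-based variant is simple to sketch with NumPy. This is illustrative only; production toolchains (e.g. TensorFlow Model Optimization) apply pruning gradually during training rather than in one shot.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned
```

Sparse weights only save inference time when the runtime or hardware can exploit the zeros, which is why structured pruning is often preferred on mobile.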
Knowledge Distillation
Train small “student” model to mimic large “teacher”:
```
Large cloud model (teacher)
          │
          ▼
Train student on teacher outputs
          │
          ▼
Small edge model (student)
```
Result: Small model that captures most of the teacher’s capability.
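The training objective is typically a weighted blend of soft targets (the teacher's temperature-scaled distribution) and hard targets (ground-truth labels). The temperature `T` and weight `alpha` below are conventional hyperparameters, not values from this article; the sketch uses NumPy for clarity.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: cross-entropy against the teacher at temperature T,
    # scaled by T^2 to keep gradient magnitudes comparable
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = -(p_t * np.log(p_s + 1e-9)).sum(axis=-1).mean() * (T * T)
    # Hard targets: standard cross-entropy against ground-truth labels
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels] + 1e-9).mean()
    return alpha * soft + (1 - alpha) * hard
```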
Model Compilation
Compile for specific hardware:
```python
# LiteRT compiled model targeting the GPU
compiled_model = litert.CompiledModel(
    model_path,
    target="gpu",
    optimization_level="balanced",
)
```
Compiling for the target hardware often yields a 2x+ speedup over the generic interpreter path.
Hardware Acceleration
NPU (Neural Processing Unit)
Dedicated AI hardware in modern devices:
| Vendor | NPU | Devices |
|---|---|---|
| Google | Tensor TPU | Pixel phones |
| Apple | Neural Engine | iPhone, iPad, Mac |
| Qualcomm | Hexagon | Android flagships |
| MediaTek | APU | Mid-range Android |
| Samsung | NPU | Galaxy devices |
GPU Acceleration
LiteRT’s ML Drift provides unified GPU access:
- Runs models on mobile GPU
- Automatic fallback to CPU if unsupported
- Async execution for reduced latency
Choosing Hardware
| Workload | Best Hardware | Why |
|---|---|---|
| Large models | NPU | Highest throughput |
| Small models | CPU | Lowest latency for simple ops |
| Batched inference | GPU | Parallelism advantage |
| Background tasks | NPU | Power efficient |
Implementation Patterns
Async Inference
Don’t block UI on inference:
```kotlin
// Android coroutine-based async inference
class ModelManager(private val interpreter: Interpreter) {
    suspend fun runInference(input: FloatArray): FloatArray =
        withContext(Dispatchers.Default) {
            val output = FloatArray(OUTPUT_SIZE)
            interpreter.run(input, output)
            output
        }
}

// Usage
lifecycleScope.launch {
    val result = modelManager.runInference(inputData)
    updateUI(result)
}
```
Batching
Combine multiple inputs for efficiency:
```python
# Instead of: one call per item (slow)
for image in images:
    result = model.predict(image)

# Use batching: one call for all items (fast)
results = model.predict(images)
```
Model Caching
Load models once, reuse:
```kotlin
object ModelCache {
    private val interpreters = mutableMapOf<String, Interpreter>()

    fun getInterpreter(modelName: String): Interpreter =
        interpreters.getOrPut(modelName) {
            Interpreter(loadModel(modelName))
        }
}
```
Fallback Strategy
Handle model failures gracefully:
```typescript
async function runInference(input: Float32Array): Promise<Result> {
  try {
    // Try edge inference first
    return await edgeModel.run(input);
  } catch (error) {
    if (navigator.onLine) {
      // Fall back to cloud
      return await cloudAPI.inference(input);
    } else {
      // Return cached/default result
      return getDefaultResult();
    }
  }
}
```
Platform-Specific Considerations
iOS
- Use CoreML for Apple hardware optimization
- Metal Performance Shaders for GPU
- Neural Engine access via CoreML
- Xcode Model Preview for testing
Android
- LiteRT/TensorFlow Lite native support
- NNAPI for hardware abstraction
- GPU delegate for acceleration
- Benchmark app for performance testing
Web (Browser)
- TensorFlow.js with WebGL backend
- ONNX Runtime Web with WebAssembly
- WebGPU (emerging, higher performance)
- Model size limits (~100MB practical)
IoT/Embedded
- LiteRT Micro for microcontrollers
- ARM Cortex-M optimization
- Minimal runtime (~100KB possible)
- Power/memory constrained
Debugging and Profiling
Model Performance
| Tool | Platform | What It Shows |
|---|---|---|
| Model Explorer | Cross-platform | Model structure, ops |
| LiteRT Benchmark | Android | Inference time, memory |
| Instruments | iOS | Timeline, memory, CPU |
| Perfetto | Android | System-wide traces |
Common Issues
| Issue | Symptom | Solution |
|---|---|---|
| Slow inference | High latency | Enable GPU/NPU delegate |
| Memory crash | OOM errors | Reduce model size, quantize |
| Accuracy loss | Wrong predictions | Check quantization calibration |
| Device fragmentation | Works on some devices | Use hardware fallbacks |
Implementation Checklist
Model Preparation
- Train/export model for edge deployment
- Convert to target format (TFLite, ONNX)
- Apply quantization (int8 for mobile)
- Test accuracy post-optimization
- Benchmark on target devices
Integration
- Choose inference runtime (LiteRT, ONNX, CoreML)
- Implement async inference
- Add hardware acceleration
- Handle model loading/caching
- Implement fallback strategy
Testing
- Test on low-end devices
- Verify offline functionality
- Profile memory and battery impact
- Test edge cases and failures
- Measure real-world latency
FAQ
How small does my model need to be?
For mobile: <100MB is comfortable, <50MB is ideal, and <10MB suits real-time use. For web: aim for <20MB for a fast first load, though up to ~100MB is workable with caching. These are rough guidelines; test on target devices.
Can I run LLMs on-device?
Small LLMs (1–7B parameters, quantized) can run on flagship phones. Expect 5–20 tokens/second. For complex reasoning, cloud is still better.
How do I update models without app updates?
Implement model download and versioning. Check for updates on app launch, download in background, swap models when ready.
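One way to sketch the version check; the manifest shape and the `download` callback are assumptions for illustration, not a specific API.

```python
import os

def resolve_model(manifest: dict, current_version: int, download,
                  model_dir: str = "models") -> str:
    """Return the path of the model to load, fetching a newer one if published.

    `manifest` is assumed to look like {"version": 3, "url": "..."};
    `download(url, dest_path)` is whatever transport the app already has.
    """
    if manifest["version"] <= current_version:
        return os.path.join(model_dir, f"model_v{current_version}.tflite")
    dest = os.path.join(model_dir, f"model_v{manifest['version']}.tflite")
    download(manifest["url"], dest)   # fetch in the background
    return dest                       # swap in once the file is complete
```

Swapping should happen only after the download is verified (checksum, load test), so a half-written file never reaches the interpreter.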
What about on-device training?
ONNX Runtime supports on-device training. Use for personalization, federated learning, or privacy-preserving fine-tuning. Keep training data on-device.
How do I handle device fragmentation?
Use hardware abstraction layers (NNAPI on Android), implement fallbacks (GPU → CPU), test on representative device set.
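The fallback chain can be expressed generically; the backend names below are placeholders, and on Android each loader would wrap the corresponding NNAPI/GPU/CPU delegate setup.

```python
def first_available(loaders):
    """Try accelerator backends in priority order; return the first that initializes.

    `loaders` is a list of (name, init_fn) pairs, e.g. NPU → GPU → CPU.
    """
    errors = {}
    for name, init in loaders:
        try:
            return name, init()
        except RuntimeError as exc:   # backend missing or unsupported on this device
            errors[name] = exc
    raise RuntimeError(f"no inference backend available: {errors}")
```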
Is edge AI ready for production?
Yes. Apps like Google Photos, Snapchat, and voice assistants use edge AI at massive scale. The tooling and hardware have matured significantly.
Sources & Further Reading
- Google AI Edge — Official LiteRT documentation
- ONNX Runtime — Cross-platform inference
- LiteRT Inference Guide — On-device inference patterns
- ONNX On-Device Training — Personalization
- AWS Greengrass ML — IoT deployment
- LLM Cost Optimization — Related: cloud cost reduction
- Agent Routing Strategies — Related: model selection