Edge Inference in 2026: Building AI Products That Run On-Device
Cloud AI adds latency, costs, and privacy concerns. A practical guide to deploying ML models on-device with LiteRT, ONNX Runtime, and modern edge infrastructure.
TL;DR
- Edge inference runs ML models directly on devices (phones, browsers, IoT) instead of calling cloud APIs.
- Benefits: lower latency (5–50ms vs 200–500ms), works offline, keeps sensitive data on-device, reduces cloud costs.
- LiteRT (Google’s successor to TensorFlow Lite) powers 100,000+ apps across billions of devices.
- ONNX Runtime provides cross-platform support for PyTorch, TensorFlow, and JAX models.
- Modern hardware acceleration (NPUs, GPU via ML Drift) makes complex models viable on mobile.
- Trade-offs: limited model size, device fragmentation, harder debugging.
- Use edge for latency-critical, privacy-sensitive, or offline-required features; cloud for complex reasoning.
Why Edge Inference Matters
Cloud AI has dominated the last decade, but edge inference is becoming the default for many use cases:
| Factor | Cloud AI | Edge AI |
|---|---|---|
| Latency | 200–500ms+ | 5–50ms |
| Offline capability | None | Full functionality |
| Privacy | Data leaves device | Data stays on-device |
| Cost | Per-inference pricing | One-time model deployment |
| Reliability | Network dependent | Always available |
The shift isn’t about replacing cloud AI—it’s about using the right approach for each use case.
When to Use Edge vs. Cloud
Use Edge When:
- Latency matters: Real-time camera effects, voice transcription, gaming
- Privacy is critical: Health data, financial info, biometrics
- Offline is required: Field work, travel, unreliable connectivity
- Cost scales with usage: High-frequency inference, millions of users
- Personalization: On-device learning, user-specific models
Use Cloud When:
- Model complexity exceeds device capability: Large language models, complex reasoning
- Frequent model updates: A/B testing, continuous learning
- Cross-device consistency: Results must match exactly across platforms
- Compute-heavy tasks: Video generation, large-scale analysis
Hybrid Approach
Many products use both:
```
User interaction → Edge model (fast response)
        ↓
  Low confidence?
        ↓
Cloud model (accurate response)
```
Example: Voice assistant recognizes “Hey Siri” on-device (fast), sends complex queries to cloud (accurate).
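The routing step above can be sketched as a confidence gate. This is illustrative only: the threshold value and the `edge_model`/`cloud_model` call signatures are assumptions, not a specific API.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.8  # product-specific; tune against real traffic

def classify(input_data, edge_model, cloud_model):
    """Answer on-device when the edge model is confident, else escalate."""
    probs = edge_model(input_data)            # fast local pass
    if float(np.max(probs)) >= CONFIDENCE_THRESHOLD:
        return int(np.argmax(probs))          # confident: stay on-device
    return cloud_model(input_data)            # uncertain: pay the network cost
```

In practice the threshold is chosen by measuring how often escalations actually change the answer.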
Edge AI Platforms
LiteRT (Google AI Edge)
Successor to TensorFlow Lite, powering 100,000+ apps:
| Feature | Capability |
|---|---|
| Hardware acceleration | GPU via ML Drift, NPUs across vendors |
| Platform support | Android, iOS, web, IoT, desktop |
| Model sources | TensorFlow, JAX, Keras, PyTorch (via conversion) |
| Optimization | Quantization, pruning, compilation |
Getting started:
```kotlin
// Android with LiteRT
val interpreter = Interpreter(loadModel("model.tflite"))
val inputBuffer = ByteBuffer.allocateDirect(INPUT_SIZE)
val outputBuffer = ByteBuffer.allocateDirect(OUTPUT_SIZE)
interpreter.run(inputBuffer, outputBuffer)
```
ONNX Runtime
Cross-platform inference engine:
| Feature | Capability |
|---|---|
| Model format | ONNX (converted from PyTorch, TensorFlow, etc.) |
| Platforms | Windows, Linux, macOS, iOS, Android, web |
| Execution providers | CPU, CUDA, CoreML, NNAPI, DirectML |
| On-device training | Supported |
Getting started:
```python
import onnxruntime as ort

# Passing providers explicitly avoids surprises on builds with GPU support
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": input_data})
```
MediaPipe (High-Level APIs)
Pre-built ML solutions for common tasks:
| Task | Use Case |
|---|---|
| Face detection | Camera effects, authentication |
| Pose estimation | Fitness apps, gaming |
| Hand tracking | Gesture control, sign language |
| Object detection | AR, accessibility |
| Text classification | Content moderation, intent |
MediaPipe handles model optimization, platform abstraction, and hardware acceleration.
Model Optimization
Edge devices have limited compute. Optimization is essential.
Quantization
Reduce precision from float32 to smaller formats:
| Format | Size Reduction | Accuracy Impact |
|---|---|---|
| float16 | 2x | Minimal |
| int8 | 4x | Small (<1% loss typical) |
| int4 | 8x | Moderate (needs calibration) |
```python
# TensorFlow post-training int8 quantization
# A representative dataset is required to calibrate activation ranges
converter = tf.lite.TFLiteConverter.from_saved_model(model_path)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
quantized_model = converter.convert()
```
Pruning
Remove unnecessary weights:
- Structured pruning: Remove entire neurons/filters
- Unstructured pruning: Remove individual weights
- Magnitude-based: Remove smallest weights
- Typical result: 50–90% weight reduction with <2% accuracy loss
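The magnitude-based variant is simple to sketch with NumPy. This is illustrative only; production toolchains (e.g. TensorFlow Model Optimization) apply pruning gradually during training rather than in one shot.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned
```

Sparse weights only save inference time when the runtime or hardware can exploit the zeros, which is why structured pruning is often preferred on mobile.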
Knowledge Distillation
Train small “student” model to mimic large “teacher”:
```
Large cloud model (teacher)
          │
          ▼
Train student on teacher outputs
          │
          ▼
Small edge model (student)
```
Result: Small model that captures most of the teacher’s capability.
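The training objective is typically a weighted blend of soft targets (the teacher's temperature-scaled distribution) and hard targets (ground-truth labels). The temperature `T` and weight `alpha` below are conventional hyperparameters, not values from this article; the sketch uses NumPy for clarity.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: cross-entropy against the teacher at temperature T,
    # scaled by T^2 to keep gradient magnitudes comparable
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = -(p_t * np.log(p_s + 1e-9)).sum(axis=-1).mean() * (T * T)
    # Hard targets: standard cross-entropy against ground-truth labels
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels] + 1e-9).mean()
    return alpha * soft + (1 - alpha) * hard
```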
Model Compilation
Compile for specific hardware:
```python
# LiteRT compiled model targeting the GPU
compiled_model = litert.CompiledModel(
    model_path,
    target="gpu",
    optimization_level="balanced",
)
```
Compiling for the target hardware often yields a 2x+ speedup over the generic interpreter path.
Hardware Acceleration
NPU (Neural Processing Unit)
Dedicated AI hardware in modern devices:
| Vendor | NPU | Devices |
|---|---|---|
| Google | Tensor TPU | Pixel phones |
| Apple | Neural Engine | iPhone, iPad, Mac |
| Qualcomm | Hexagon | Android flagships |
| MediaTek | APU | Mid-range Android |
| Samsung | NPU | Galaxy devices |
GPU Acceleration
LiteRT’s ML Drift provides unified GPU access:
- Runs models on mobile GPU
- Automatic fallback to CPU if unsupported
- Async execution for reduced latency
Choosing Hardware
| Workload | Best Hardware | Why |
|---|---|---|
| Large models | NPU | Highest throughput |
| Small models | CPU | Lowest latency for simple ops |
| Batched inference | GPU | Parallelism advantage |
| Background tasks | NPU | Power efficient |
Implementation Patterns
Async Inference
Don’t block UI on inference:
```kotlin
// Android coroutine-based async inference
class ModelManager(private val interpreter: Interpreter) {
    suspend fun runInference(input: FloatArray): FloatArray =
        withContext(Dispatchers.Default) {
            val output = FloatArray(OUTPUT_SIZE)
            interpreter.run(input, output)
            output
        }
}

// Usage
lifecycleScope.launch {
    val result = modelManager.runInference(inputData)
    updateUI(result)
}
```
Batching
Combine multiple inputs for efficiency:
```python
# Instead of: one call per item (slow)
for image in images:
    result = model.predict(image)

# Use batching: one call for all items (fast)
results = model.predict(images)
```
Model Caching
Load models once, reuse:
```kotlin
object ModelCache {
    private val interpreters = mutableMapOf<String, Interpreter>()

    fun getInterpreter(modelName: String): Interpreter =
        interpreters.getOrPut(modelName) {
            Interpreter(loadModel(modelName))
        }
}
```
Fallback Strategy
Handle model failures gracefully:
```typescript
async function runInference(input: Float32Array): Promise<Result> {
  try {
    // Try edge inference first
    return await edgeModel.run(input);
  } catch (error) {
    if (navigator.onLine) {
      // Fall back to cloud
      return await cloudAPI.inference(input);
    } else {
      // Return cached/default result
      return getDefaultResult();
    }
  }
}
```
Platform-Specific Considerations
iOS
- Use CoreML for Apple hardware optimization
- Metal Performance Shaders for GPU
- Neural Engine access via CoreML
- Xcode Model Preview for testing
Android
- LiteRT/TensorFlow Lite native support
- NNAPI for hardware abstraction
- GPU delegate for acceleration
- Benchmark app for performance testing
Web (Browser)
- TensorFlow.js with WebGL backend
- ONNX Runtime Web with WebAssembly
- WebGPU (emerging, higher performance)
- Model size limits (~100MB practical)
IoT/Embedded
- LiteRT Micro for microcontrollers
- ARM Cortex-M optimization
- Minimal runtime (~100KB possible)
- Power/memory constrained
Debugging and Profiling
Model Performance
| Tool | Platform | What It Shows |
|---|---|---|
| Model Explorer | Cross-platform | Model structure, ops |
| LiteRT Benchmark | Android | Inference time, memory |
| Instruments | iOS | Timeline, memory, CPU |
| Perfetto | Android | System-wide traces |
Common Issues
| Issue | Symptom | Solution |
|---|---|---|
| Slow inference | High latency | Enable GPU/NPU delegate |
| Memory crash | OOM errors | Reduce model size, quantize |
| Accuracy loss | Wrong predictions | Check quantization calibration |
| Device fragmentation | Works on some devices | Use hardware fallbacks |
Implementation Checklist
Model Preparation
- Train/export model for edge deployment
- Convert to target format (TFLite, ONNX)
- Apply quantization (int8 for mobile)
- Test accuracy post-optimization
- Benchmark on target devices
Integration
- Choose inference runtime (LiteRT, ONNX, CoreML)
- Implement async inference
- Add hardware acceleration
- Handle model loading/caching
- Implement fallback strategy
Testing
- Test on low-end devices
- Verify offline functionality
- Profile memory and battery impact
- Test edge cases and failures
- Measure real-world latency
FAQ
How small does my model need to be?
For mobile: <100MB is comfortable, <50MB is ideal, and <10MB suits real-time use. For web: aim for <20MB for a fast first load, though up to ~100MB is workable with caching. These are rough guidelines; test on target devices.
Can I run LLMs on-device?
Small LLMs (1–7B parameters, quantized) can run on flagship phones. Expect 5–20 tokens/second. For complex reasoning, cloud is still better.
How do I update models without app updates?
Implement model download and versioning. Check for updates on app launch, download in background, swap models when ready.
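One way to sketch the version check; the manifest shape and the `download` callback are assumptions for illustration, not a specific API.

```python
import os

def resolve_model(manifest: dict, current_version: int, download,
                  model_dir: str = "models") -> str:
    """Return the path of the model to load, fetching a newer one if published.

    `manifest` is assumed to look like {"version": 3, "url": "..."};
    `download(url, dest_path)` is whatever transport the app already has.
    """
    if manifest["version"] <= current_version:
        return os.path.join(model_dir, f"model_v{current_version}.tflite")
    dest = os.path.join(model_dir, f"model_v{manifest['version']}.tflite")
    download(manifest["url"], dest)   # fetch in the background
    return dest                       # swap in once the file is complete
```

Swapping should happen only after the download is verified (checksum, load test), so a half-written file never reaches the interpreter.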
What about on-device training?
ONNX Runtime supports on-device training. Use for personalization, federated learning, or privacy-preserving fine-tuning. Keep training data on-device.
How do I handle device fragmentation?
Use hardware abstraction layers (NNAPI on Android), implement fallbacks (GPU → CPU), test on representative device set.
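The fallback chain can be expressed generically; the backend names below are placeholders, and on Android each loader would wrap the corresponding NNAPI/GPU/CPU delegate setup.

```python
def first_available(loaders):
    """Try accelerator backends in priority order; return the first that initializes.

    `loaders` is a list of (name, init_fn) pairs, e.g. NPU → GPU → CPU.
    """
    errors = {}
    for name, init in loaders:
        try:
            return name, init()
        except RuntimeError as exc:   # backend missing or unsupported on this device
            errors[name] = exc
    raise RuntimeError(f"no inference backend available: {errors}")
```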
Is edge AI ready for production?
Yes. Apps like Google Photos, Snapchat, and voice assistants use edge AI at massive scale. The tooling and hardware have matured significantly.
Sources & Further Reading
- Google AI Edge — Official LiteRT documentation
- ONNX Runtime — Cross-platform inference
- LiteRT Inference Guide — On-device inference patterns
- ONNX On-Device Training — Personalization
- AWS Greengrass ML — IoT deployment
- LLM Cost Optimization — Related: cloud cost reduction
- Agent Routing Strategies — Related: model selection