
Edge Inference in 2026: Building AI Products That Run On-Device

Cloud AI adds latency, costs, and privacy concerns. A practical guide to deploying ML models on-device with LiteRT, ONNX Runtime, and modern edge infrastructure.

15 min · January 18, 2026 · Updated January 27, 2026

TL;DR

  • Edge inference runs ML models directly on devices (phones, browsers, IoT) instead of calling cloud APIs.
  • Benefits: lower latency (5–50ms vs 200–500ms), works offline, keeps sensitive data on-device, reduces cloud costs.
  • LiteRT (Google’s successor to TensorFlow Lite) powers 100,000+ apps across billions of devices.
  • ONNX Runtime provides cross-platform support for PyTorch, TensorFlow, and JAX models.
  • Modern hardware acceleration (NPUs, GPU via ML Drift) makes complex models viable on mobile.
  • Trade-offs: limited model size, device fragmentation, harder debugging.
  • Use edge for latency-critical, privacy-sensitive, or offline-required features; cloud for complex reasoning.

Why Edge Inference Matters

Cloud AI has dominated the last decade, but edge inference is becoming the default for many use cases:

| Factor | Cloud AI | Edge AI |
| --- | --- | --- |
| Latency | 200–500ms+ | 5–50ms |
| Offline capability | None | Full functionality |
| Privacy | Data leaves device | Data stays on-device |
| Cost | Per-inference pricing | One-time model deploy |
| Reliability | Network dependent | Always available |

The shift isn’t about replacing cloud AI—it’s about using the right approach for each use case.

When to Use Edge vs. Cloud

Use Edge When:

  • Latency matters: Real-time camera effects, voice transcription, gaming
  • Privacy is critical: Health data, financial info, biometrics
  • Offline is required: Field work, travel, unreliable connectivity
  • Cost scales with usage: High-frequency inference, millions of users
  • Personalization: On-device learning, user-specific models

Use Cloud When:

  • Model complexity exceeds device capability: Large language models, complex reasoning
  • Frequent model updates: A/B testing, continuous learning
  • Cross-device consistency: Results must match exactly across platforms
  • Compute-heavy tasks: Video generation, large-scale analysis

Hybrid Approach

Many products use both:

User interaction → Edge model (fast response)
            ↓ low confidence?
        Cloud model (accurate response)

Example: Voice assistant recognizes “Hey Siri” on-device (fast), sends complex queries to cloud (accurate).
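This routing logic can be sketched as a confidence gate. A minimal illustration in Python, where the threshold value and the stub models are assumptions chosen for the example, not production values:

```python
# Confidence-gated router: answer on-device when the edge model is sure,
# escalate to the cloud only when it is not.
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune per product


def route(input_data, edge_model, cloud_model, threshold=CONFIDENCE_THRESHOLD):
    label, confidence = edge_model(input_data)  # fast, on-device
    if confidence >= threshold:
        return label, "edge"
    return cloud_model(input_data), "cloud"     # slower, more accurate


# Stub models for illustration only
edge = lambda x: ("play music", 0.95) if x == "easy" else ("play music", 0.4)
cloud = lambda x: "set a reminder"

print(route("easy", edge, cloud))  # confident → served on-device
print(route("hard", edge, cloud))  # unsure → escalated to cloud
```

The key design choice is where to set the threshold: too low and users see wrong fast answers, too high and the cloud fallback eats the latency win.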

Edge AI Platforms

LiteRT (Google AI Edge)

Successor to TensorFlow Lite, powering 100,000+ apps:

| Feature | Capability |
| --- | --- |
| Hardware acceleration | GPU via ML Drift, NPUs across vendors |
| Platform support | Android, iOS, web, IoT, desktop |
| Model sources | TensorFlow, JAX, Keras, PyTorch (via conversion) |
| Optimization | Quantization, pruning, compilation |

Getting started:

// Android with LiteRT (loadModel is a helper that memory-maps the .tflite asset)
val interpreter = Interpreter(loadModel("model.tflite"))

// Direct buffers must use native byte order for the runtime
val inputBuffer = ByteBuffer.allocateDirect(INPUT_SIZE).order(ByteOrder.nativeOrder())
val outputBuffer = ByteBuffer.allocateDirect(OUTPUT_SIZE).order(ByteOrder.nativeOrder())

interpreter.run(inputBuffer, outputBuffer)

ONNX Runtime

Cross-platform inference engine:

| Feature | Capability |
| --- | --- |
| Model format | ONNX (converted from PyTorch, TensorFlow, etc.) |
| Platforms | Windows, Linux, macOS, iOS, Android, web |
| Execution providers | CPU, CUDA, CoreML, NNAPI, DirectML |
| On-device training | Supported |

Getting started:

import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
outputs = session.run(
    None,
    {"input": input_data}
)

MediaPipe (High-Level APIs)

Pre-built ML solutions for common tasks:

| Task | Use Case |
| --- | --- |
| Face detection | Camera effects, authentication |
| Pose estimation | Fitness apps, gaming |
| Hand tracking | Gesture control, sign language |
| Object detection | AR, accessibility |
| Text classification | Content moderation, intent |

MediaPipe handles model optimization, platform abstraction, and hardware acceleration.

Model Optimization

Edge devices have limited compute. Optimization is essential.

Quantization

Reduce precision from float32 to smaller formats:

| Format | Size Reduction | Accuracy Impact |
| --- | --- | --- |
| float16 | 2x | Minimal |
| int8 | 4x | Small (<1% loss typical) |
| int4 | 8x | Moderate (needs calibration) |

# TensorFlow post-training int8 quantization
converter = tf.lite.TFLiteConverter.from_saved_model(model_path)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset  # calibration samples
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
quantized_model = converter.convert()
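Under the hood, int8 quantization maps each float to an integer through a scale and zero point: real ≈ scale × (q − zero_point). A pure-Python sketch of that mapping (an illustration of the math, not the optimized TFLite kernels):

```python
# Affine (scale + zero-point) quantization round trip.
def quantize(values, num_bits=8):
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -128..127 for int8
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid zero scale for constant tensors
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale + zero_point))) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    return [scale * (v - zero_point) for v in q]


weights = [-0.5, 0.0, 0.25, 1.0]
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)

# Round-trip error is bounded by roughly one quantization step
assert all(abs(a - b) <= scale for a, b in zip(weights, restored))
```

This is why calibration data matters: the observed min/max range picks the scale, and outliers in that range stretch it, costing precision everywhere else.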

Pruning

Remove unnecessary weights:

  • Structured pruning: Remove entire neurons/filters
  • Unstructured pruning: Remove individual weights
  • Magnitude-based: Remove smallest weights
  • Typical result: 50–90% weight reduction with <2% accuracy loss
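Magnitude-based unstructured pruning boils down to zeroing the smallest-magnitude weights. A toy sketch (real toolchains apply this gradually during training rather than in one shot):

```python
# Zero out the smallest-|w| fraction of weights.
def prune_by_magnitude(weights, sparsity=0.5):
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)  # the three smallest-magnitude weights become 0.0
```

Note that zeroed weights only save memory and compute if the runtime exploits sparsity, which is why structured pruning (removing whole filters) often wins in practice.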

Knowledge Distillation

Train small “student” model to mimic large “teacher”:

Large cloud model (teacher)
            ↓ train student on teacher outputs
Small edge model (student)

Result: Small model that captures most of the teacher’s capability.
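The trick that makes distillation work is training the student on the teacher's temperature-softened probabilities ("soft targets"), which carry more signal than hard labels. A minimal sketch of the softening step, with illustrative logits:

```python
import math


def softmax(logits, temperature=1.0):
    # Divide logits by the temperature before normalizing; T > 1 flattens
    # the distribution, exposing how the teacher ranks the wrong classes.
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [4.0, 1.0, 0.2]
hard = softmax(teacher_logits)                   # near one-hot
soft = softmax(teacher_logits, temperature=4.0)  # flatter soft targets

# Higher temperature spreads probability mass across classes
assert max(soft) < max(hard)
```

The student is then trained to match these soft distributions (typically via a KL-divergence loss), alongside the ordinary hard-label loss.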

Model Compilation

Compile for specific hardware:

# LiteRT compiled model for GPU
compiled_model = litert.CompiledModel(
    model_path,
    target="gpu",
    optimization_level="balanced"
)

Compilation optimizes for specific hardware, often achieving 2x+ speedup.

Hardware Acceleration

NPU (Neural Processing Unit)

Dedicated AI hardware in modern devices:

| Vendor | NPU | Devices |
| --- | --- | --- |
| Google | Tensor TPU | Pixel phones |
| Apple | Neural Engine | iPhone, iPad, Mac |
| Qualcomm | Hexagon | Android flagships |
| MediaTek | APU | Mid-range Android |
| Samsung | NPU | Galaxy devices |

GPU Acceleration

LiteRT’s ML Drift provides unified GPU access:

  • Runs models on mobile GPU
  • Automatic fallback to CPU if unsupported
  • Async execution for reduced latency

Choosing Hardware

| Workload | Best Hardware | Why |
| --- | --- | --- |
| Large models | NPU | Highest throughput |
| Small models | CPU | Lowest latency for simple ops |
| Batched inference | GPU | Parallelism advantage |
| Background tasks | NPU | Power efficient |
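In code, this usually becomes a preference list per workload with a CPU fallback when the preferred accelerator is missing. A toy dispatcher (the workload and backend names are illustrative, not a real API):

```python
# Preferred backends per workload, best first.
PREFERENCE = {
    "large_model": ["npu", "gpu", "cpu"],
    "small_model": ["cpu"],
    "batched": ["gpu", "cpu"],
    "background": ["npu", "cpu"],
}


def pick_backend(workload, available):
    for backend in PREFERENCE.get(workload, ["cpu"]):
        if backend in available:
            return backend
    return "cpu"  # always-safe fallback


print(pick_backend("large_model", {"gpu", "cpu"}))  # no NPU on this device → GPU
print(pick_backend("batched", {"cpu"}))             # GPU missing → CPU
```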

Implementation Patterns

Async Inference

Don’t block UI on inference:

// Android coroutine-based async inference
class ModelManager(private val interpreter: Interpreter) {
    
    suspend fun runInference(input: FloatArray): FloatArray = 
        withContext(Dispatchers.Default) {
            val output = FloatArray(OUTPUT_SIZE)
            interpreter.run(input, output)
            output
        }
}

// Usage
lifecycleScope.launch {
    val result = modelManager.runInference(inputData)
    updateUI(result)
}

Batching

Combine multiple inputs for efficiency:

# Instead of:
for image in images:
    result = model.predict(image)  # Slow

# Use batching:
results = model.predict(images)  # Fast

Model Caching

Load models once, reuse:

object ModelCache {
    private val interpreters = mutableMapOf<String, Interpreter>()
    
    fun getInterpreter(modelName: String): Interpreter {
        return interpreters.getOrPut(modelName) {
            Interpreter(loadModel(modelName))
        }
    }
}

Fallback Strategy

Handle model failures gracefully:

async function runInference(input: Float32Array): Promise<Result> {
    try {
        // Try edge inference
        return await edgeModel.run(input);
    } catch (error) {
        if (navigator.onLine) {
            // Fallback to cloud
            return await cloudAPI.inference(input);
        } else {
            // Return cached/default result
            return getDefaultResult();
        }
    }
}

Platform-Specific Considerations

iOS

  • Use CoreML for Apple hardware optimization
  • Metal Performance Shaders for GPU
  • Neural Engine access via CoreML
  • Xcode Model Preview for testing

Android

  • LiteRT/TensorFlow Lite native support
  • NNAPI for hardware abstraction
  • GPU delegate for acceleration
  • Benchmark app for performance testing

Web (Browser)

  • TensorFlow.js with WebGL backend
  • ONNX Runtime Web with WebAssembly
  • WebGPU (emerging, higher performance)
  • Model size limits (~100MB practical)

IoT/Embedded

  • LiteRT Micro for microcontrollers
  • ARM Cortex-M optimization
  • Minimal runtime (~100KB possible)
  • Power/memory constrained

Debugging and Profiling

Model Performance

| Tool | Platform | What It Shows |
| --- | --- | --- |
| Model Explorer | Cross-platform | Model structure, ops |
| LiteRT Benchmark | Android | Inference time, memory |
| Instruments | iOS | Timeline, memory, CPU |
| Perfetto | Android | System-wide traces |

Common Issues

| Issue | Symptom | Solution |
| --- | --- | --- |
| Slow inference | High latency | Enable GPU/NPU delegate |
| Memory crash | OOM errors | Reduce model size, quantize |
| Accuracy loss | Wrong predictions | Check quantization calibration |
| Device fragmentation | Works on some devices | Use hardware fallbacks |

Implementation Checklist

Model Preparation

  • Train/export model for edge deployment
  • Convert to target format (TFLite, ONNX)
  • Apply quantization (int8 for mobile)
  • Test accuracy post-optimization
  • Benchmark on target devices

Integration

  • Choose inference runtime (LiteRT, ONNX, CoreML)
  • Implement async inference
  • Add hardware acceleration
  • Handle model loading/caching
  • Implement fallback strategy

Testing

  • Test on low-end devices
  • Verify offline functionality
  • Profile memory and battery impact
  • Test edge cases and failures
  • Measure real-world latency

FAQ

How small does my model need to be?

For mobile: <100MB is comfortable, <50MB is ideal, <10MB for real-time. For web: <20MB practical limit. These are rough guidelines—test on target devices.

Can I run LLMs on-device?

Small LLMs (1–7B parameters, quantized) can run on flagship phones. Expect 5–20 tokens/second. For complex reasoning, cloud is still better.

How do I update models without app updates?

Implement model download and versioning. Check for updates on app launch, download in background, swap models when ready.
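The update check itself is simple: compare the installed model's version and hash against a server manifest, then download and atomically swap. A sketch, where the manifest fields are an assumed format rather than any standard:

```python
# Decide whether the on-device model is stale relative to the server manifest.
def needs_update(installed, manifest):
    return (
        manifest["version"] > installed["version"]
        or manifest["sha256"] != installed["sha256"]  # guard against corrupt files
    )


installed = {"version": 3, "sha256": "abc"}
manifest = {"version": 4, "sha256": "def"}

if needs_update(installed, manifest):
    print("download in background, verify hash, swap on next model load")
```

Always verify the downloaded file's hash before swapping, and keep the previous model on disk so a bad download can roll back.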

What about on-device training?

ONNX Runtime supports on-device training. Use for personalization, federated learning, or privacy-preserving fine-tuning. Keep training data on-device.

How do I handle device fragmentation?

Use hardware abstraction layers (NNAPI on Android), implement fallbacks (GPU → CPU), test on representative device set.

Is edge AI ready for production?

Yes. Apps like Google Photos, Snapchat, and voice assistants use edge AI at massive scale. The tooling and hardware have matured significantly.
