How Arm Reached 3 Billion Devices by Targeting 2015 CPUs
TRIGGER
Teams building on-device LLM experiences typically optimize for the latest flagship hardware with NPUs and advanced CPU features. This excludes the majority of users, whose devices are 3-5 years old and lack those capabilities, limiting adoption to a small fraction of the potential user base.
APPROACH
Arm and ExecuTorch teams optimized Int4 matrix multiplication using the SDOT (signed dot product) instruction, available since Armv8.2 (2015), rather than requiring the newer I8MM (Int8 matrix multiply) instructions introduced in Armv8.6. SDOT computes dot products over signed 8-bit integer vectors, producing four 32-bit accumulations per 128-bit instruction. Input: Int4-quantized Llama 3.2 1B model weights and activations. Output: on older SDOT-only devices, decode speed exceeds average human reading speed; on newer I8MM devices (e.g., Galaxy S24+), 350+ tokens/second prefill and 40+ tokens/second decode. This enabled targeting 72% of all Arm devices (~3 billion devices) rather than just recent flagships.
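To make the mechanism concrete, here is a pure-Python sketch of what one SDOT instruction computes (not the NEON intrinsic itself): two vectors of sixteen signed 8-bit lanes are processed as four groups of four, and each group's dot product is accumulated into one 32-bit lane. The Int4 unpacking step is also sketched, since Int4 weights must be sign-extended to Int8 before SDOT can consume them. All names here are illustrative, not from the ExecuTorch kernels.

```python
def sdot(acc, a, b):
    """Emulate Armv8.2 SDOT semantics on a 128-bit vector:
    a, b are sixteen signed int8 lanes; acc is four signed int32 lanes.
    Each int32 lane i accumulates the dot product of lanes 4i..4i+3."""
    assert len(a) == len(b) == 16 and len(acc) == 4
    out = list(acc)
    for i in range(4):
        out[i] += sum(a[4 * i + j] * b[4 * i + j] for j in range(4))
    return out

def unpack_int4(byte):
    """Unpack one packed byte into two sign-extended Int4 values
    (two's complement, range -8..7)."""
    lo = byte & 0xF
    hi = (byte >> 4) & 0xF
    return (lo - 16 if lo >= 8 else lo, hi - 16 if hi >= 8 else hi)

acts = [1, 2, 3, 4] * 4        # int8 activations
wts = [1, -1, 1, -1] * 4       # int8 weights, already unpacked from Int4
print(sdot([0, 0, 0, 0], acts, wts))  # -> [-2, -2, -2, -2]
print(unpack_int4(0xF8))              # -> (-8, -1)
```

One hardware SDOT replaces the four multiplies and three adds per lane group shown in the inner loop, which is why it directly accelerates quantized matrix multiplication.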
PATTERN
“72% device coverage (3 billion Arm devices) vs flagship-only by targeting SDOT from 2015 instead of the latest I8MM. Trading 20% performance on new phones for reaching users on 3-5 year old hardware that actually exists in the wild.”
✓ WORKS WHEN
- Target instruction (e.g., SDOT) has been in silicon for 5+ years and is present in 70%+ of deployed devices
- Quantized model inference (Int4/Int8) is the workload, where dot-product instructions directly accelerate matrix multiplication
- Decode latency above human reading speed (~250 words/minute) is acceptable for the use case
- Use cases are latency-tolerant (text completion, summarization) rather than real-time (live translation, voice interruption)
- Privacy or offline operation is a key requirement that justifies on-device tradeoffs
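The reading-speed criterion above converts to a concrete tokens-per-second floor. Assuming roughly 0.75 English words per token (a common rule of thumb for LLM tokenizers, not a figure from the source):

```python
# Back-of-envelope floor for "decode above human reading speed".
WORDS_PER_MIN = 250      # average reading speed cited above
WORDS_PER_TOKEN = 0.75   # assumption: typical English tokenization

min_tokens_per_sec = WORDS_PER_MIN / 60 / WORDS_PER_TOKEN
print(f"{min_tokens_per_sec:.1f} tokens/s")  # ~5.6 tokens/s
```

So even single-digit tokens/second on SDOT-only hardware clears the bar for latency-tolerant text generation.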
✗ FAILS WHEN
- Application requires real-time streaming output where sub-100ms latency is mandatory
- Model size exceeds device memory constraints even with Int4 quantization (models >3B parameters on 4GB RAM devices)
- Target platform lacks the baseline instruction entirely (pre-Armv8.2 devices, non-Arm architectures)
- Use case demands highest-quality output where quantization degradation is unacceptable (medical, legal)
- You're building for a known homogeneous device fleet (enterprise tablets, kiosks) where you can target specific hardware
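The memory-constraint failure mode can be checked with a rough weights-only estimate. This ignores the KV cache, activations, and per-group quantization scales, so real usage is higher; the numbers are illustrative, not from the source:

```python
def int4_weight_gb(params_billion):
    """Weights-only footprint of an Int4-quantized model in GiB.
    4 bits = 0.5 byte per parameter; overheads are ignored."""
    return params_billion * 1e9 * 0.5 / 2**30

for p in (1.0, 3.0, 8.0):
    print(f"{p:.0f}B params -> ~{int4_weight_gb(p):.2f} GiB of Int4 weights")
```

A 1B model fits comfortably on a 4 GB device (~0.47 GiB of weights), while an 8B model (~3.7 GiB) leaves no headroom for the OS, KV cache, or the app itself, matching the >3B cutoff above.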