How Arm Reached 3 Billion Devices by Targeting 2015 CPUs
TRIGGER
Teams building on-device LLM experiences typically optimize for the latest flagship hardware with NPUs and advanced CPU features. This excludes the majority of users, whose devices are 3-5 years old and lack those capabilities, limiting adoption to a small fraction of the potential user base.
APPROACH
Arm and ExecuTorch teams optimized Int4 matrix multiplication using the SDOT (signed dot product) instruction, available since Armv8.2 (2015), rather than requiring the newer I8MM (Int8 matrix multiply) instructions introduced in Armv8.6. SDOT computes dot products over signed 8-bit integer vectors, producing four 32-bit accumulations per 128-bit instruction. Input: Int4-quantized Llama 3.2 1B model weights and activations. Output: on older SDOT-only devices, decode speed exceeds average human reading speed; on newer I8MM devices (e.g., Galaxy S24+), 350+ tokens/second prefill and 40+ tokens/second decode. This enabled targeting 72% of all Arm devices (~3 billion devices) rather than just recent flagships.
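To make the mechanism concrete, here is a pure-Python sketch of what one SDOT instruction computes (not the NEON intrinsic itself): two vectors of sixteen signed 8-bit lanes are processed as four groups of four, and each group's dot product is accumulated into one 32-bit lane. The Int4 unpacking step is also sketched, since Int4 weights must be sign-extended to Int8 before SDOT can consume them. All names here are illustrative, not from the ExecuTorch kernels.

```python
def sdot(acc, a, b):
    """Emulate Armv8.2 SDOT semantics on a 128-bit vector:
    a, b are sixteen signed int8 lanes; acc is four signed int32 lanes.
    Each int32 lane i accumulates the dot product of lanes 4i..4i+3."""
    assert len(a) == len(b) == 16 and len(acc) == 4
    out = list(acc)
    for i in range(4):
        out[i] += sum(a[4 * i + j] * b[4 * i + j] for j in range(4))
    return out

def unpack_int4(byte):
    """Unpack one packed byte into two sign-extended Int4 values
    (two's complement, range -8..7)."""
    lo = byte & 0xF
    hi = (byte >> 4) & 0xF
    return (lo - 16 if lo >= 8 else lo, hi - 16 if hi >= 8 else hi)

acts = [1, 2, 3, 4] * 4        # int8 activations
wts = [1, -1, 1, -1] * 4       # int8 weights, already unpacked from Int4
print(sdot([0, 0, 0, 0], acts, wts))  # -> [-2, -2, -2, -2]
print(unpack_int4(0xF8))              # -> (-8, -1)
```

One hardware SDOT replaces the four multiplies and three adds per lane group shown in the inner loop, which is why it directly accelerates quantized matrix multiplication.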
PATTERN
“72% device coverage (3 billion Arm devices) vs flagship-only by targeting SDOT from 2015 instead of the latest I8MM. Trading 20% performance on new phones for reaching users on 3-5 year old hardware that actually exists in the wild.”
✓ WORKS WHEN
- Target instruction (e.g., SDOT) has been in silicon for 5+ years and is present in 70%+ of deployed devices
- Quantized model inference (Int4/Int8) is the workload, where dot-product instructions directly accelerate matrix multiplication
- Decode latency above human reading speed (~250 words/minute) is acceptable for the use case
- Use cases are latency-tolerant (text completion, summarization) rather than real-time (live translation, voice interruption)
- Privacy or offline operation is a key requirement that justifies on-device tradeoffs
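The reading-speed criterion above converts to a concrete tokens-per-second floor. Assuming roughly 0.75 English words per token (a common rule of thumb for LLM tokenizers, not a figure from the source):

```python
# Back-of-envelope floor for "decode above human reading speed".
WORDS_PER_MIN = 250      # average reading speed cited above
WORDS_PER_TOKEN = 0.75   # assumption: typical English tokenization

min_tokens_per_sec = WORDS_PER_MIN / 60 / WORDS_PER_TOKEN
print(f"{min_tokens_per_sec:.1f} tokens/s")  # ~5.6 tokens/s
```

So even single-digit tokens/second on SDOT-only hardware clears the bar for latency-tolerant text generation.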
✗ FAILS WHEN
- Application requires real-time streaming output where sub-100ms latency is mandatory
- Model size exceeds device memory constraints even with Int4 quantization (models >3B parameters on 4GB RAM devices)
- Target platform lacks the baseline instruction entirely (pre-Armv8.2 devices, non-Arm architectures)
- Use case demands highest-quality output where quantization degradation is unacceptable (medical, legal)
- You're building for a known homogeneous device fleet (enterprise tablets, kiosks) where you can target specific hardware
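The memory-constraint failure mode can be checked with a rough weights-only estimate. This ignores the KV cache, activations, and per-group quantization scales, so real usage is higher; the numbers are illustrative, not from the source:

```python
def int4_weight_gb(params_billion):
    """Weights-only footprint of an Int4-quantized model in GiB.
    4 bits = 0.5 byte per parameter; overheads are ignored."""
    return params_billion * 1e9 * 0.5 / 2**30

for p in (1.0, 3.0, 8.0):
    print(f"{p:.0f}B params -> ~{int4_weight_gb(p):.2f} GiB of Int4 weights")
```

A 1B model fits comfortably on a 4 GB device (~0.47 GiB of weights), while an 8B model (~3.7 GiB) leaves no headroom for the OS, KV cache, or the app itself, matching the >3B cutoff above.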