ActQuant

Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models

Post-Training Quantization · Sub-4-bit VLA · OpenVLA-OFT · π0.5
Arash Akbari¹,*, Arman Akbari¹, Masih Eskandar¹, Qitao Tan², Yixiao Chen¹, Jingwu Luo¹,
Bertha Pangaribuan¹, Liyun Zhang¹, Jennifer Dy¹, Geng Yuan², Xue Lin¹, Gaowen Liu³, Stratis Ioannidis¹, Yanzhi Wang¹
*Corresponding author
¹Northeastern University, ²University of Georgia, ³Cisco Systems
ActQuant teaser figure

ActQuant is an action-guided mixed-precision post-training quantization framework for Vision-Language-Action models. It is the only method that retains task success at or below 3 bits per weight (bpw): on LIBERO with OpenVLA-OFT it reaches a 95.0% average success rate at 3 bpw and 90.1% at 2.5 bpw, where the backbone is compressed 5.3× (14.3 GB → 2.7 GB).

Abstract

Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute makes deployment on edge platforms impractical. Aggressive, sub-4-bit weight quantization is the natural solution, yet existing post-training quantization (PTQ) methods suffer severe performance degradation in this regime. To address this, we introduce ActQuant, an action-guided mixed-precision PTQ framework that operates in two stages: (1) an inter-tensor bit allocator that assigns each weight matrix a single bit-width based on how much it contributes to predicting the agent's actions; and (2) an intra-tensor scale optimizer that tunes per-block quantization scales using action-aware curvature, so that dynamic range is concentrated on the weights most influential for control. To deliver the on-device benefits of our aggressive quantization, we further introduce OmniModel.cpp, an agentic conversion pipeline that ports architectures into a native C/C++ runtime with efficient low-bit kernels.
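Stage (1) can be pictured with a small sketch. It assumes a per-tensor sensitivity score is already available (for example, derived from an action loss) and uses a simple greedy rule to stay within an average bits-per-weight budget. The function name `allocate_bits`, the candidate bit-widths, the tensor names, and the scores are all hypothetical; the allocator actually used by ActQuant is not specified on this page and may differ.

```python
def allocate_bits(sizes, sensitivity, target_bpw, choices=(2, 3, 4)):
    """Greedy mixed-precision allocation (illustrative, not the paper's rule).

    sizes[name]       -> number of weights in the tensor
    sensitivity[name] -> hypothetical action-loss sensitivity (higher = more important)
    Every tensor starts at the lowest bit-width; the most sensitive tensors are
    upgraded one step at a time while the average stays within target_bpw.
    """
    choices = sorted(choices)
    bits = {name: choices[0] for name in sizes}
    total = sum(sizes.values())
    budget = target_bpw * total                # total bits we may spend
    used = sum(bits[n] * sizes[n] for n in sizes)

    upgraded = True
    while upgraded:
        upgraded = False
        # Visit tensors from most to least sensitive.
        for name in sorted(sizes, key=lambda n: sensitivity[n], reverse=True):
            idx = choices.index(bits[name])
            if idx + 1 == len(choices):
                continue                       # already at the highest bit-width
            extra = (choices[idx + 1] - bits[name]) * sizes[name]
            if used + extra <= budget:
                bits[name] = choices[idx + 1]
                used += extra
                upgraded = True
                break                          # re-rank from the top after each upgrade
    return bits, used / total


# Toy tensors with made-up sizes and sensitivities.
sizes = {"action_head_proj": 1000, "llm_mlp": 8000, "vision_patch": 1000}
sensitivity = {"action_head_proj": 0.9, "llm_mlp": 0.1, "vision_patch": 0.4}
bits, avg_bpw = allocate_bits(sizes, sensitivity, target_bpw=2.5)
```

In this toy run the two small, sensitive tensors end up at 4 bits and the large, insensitive one stays at 2 bits, for an average of 2.4 bpw. A greedy rule is only one possible design; a knapsack-style solver would use the budget more tightly.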

We evaluate ActQuant both in simulation and on a real-world 6-DoF UR3 arm, with all models deployed through OmniModel.cpp. On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits per weight (bpw), retaining an average success rate of 95.0% on OpenVLA-OFT and 94.8% on π0.5. Pushed further, ActQuant reaches 2.5 bpw at 90.1% on OpenVLA-OFT, compressing the backbone from 14.3 GB to 2.7 GB (5.3×). On the physical UR3 arm, π0.5 quantized with ActQuant retains the baseline's success rate while reducing the memory footprint by 2.5×.

Method

ActQuant method overview

Overview of ActQuant. Stage 1 allocates a per-tensor bit-width using an action-loss sensitivity signal. Stage 2 optimizes per-block scales with action-aware curvature so quantization error is concentrated away from weights that drive control.
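The idea behind Stage 2 can be sketched as a per-block scale search that minimizes an importance-weighted reconstruction error instead of a plain one. The sketch below assumes the "action-aware curvature" is available as one non-negative weight per parameter (for instance a diagonal Fisher-style estimate); that reading, the grid search, and the names `search_scale` and `weighted_error` are assumptions made for illustration, not the ActQuant optimizer.

```python
def quantize(w, scale, bits):
    """Symmetric round-to-nearest onto a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    return q * scale


def weighted_error(block, curvature, scale, bits):
    """sum_i curvature_i * (w_i - q(w_i))^2 for one block."""
    return sum(h * (w - quantize(w, scale, bits)) ** 2
               for w, h in zip(block, curvature))


def search_scale(block, curvature, bits, n_grid=64):
    """Grid-search a block scale that minimizes the curvature-weighted error."""
    qmax = 2 ** (bits - 1) - 1
    base = max(abs(w) for w in block) / qmax   # plain absmax scale
    if base == 0.0:
        return 1.0, 0.0                        # all-zero block: nothing to tune
    best_scale = base
    best_err = weighted_error(block, curvature, base, bits)
    for k in range(1, n_grid):
        # Candidates shrink from the absmax scale toward roughly 0.3x of it;
        # a smaller scale clips outliers but resolves small weights better.
        scale = base * (1.0 - 0.7 * k / n_grid)
        err = weighted_error(block, curvature, scale, bits)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err


# Toy block: one large outlier with low (made-up) curvature, small weights with high curvature.
block = [8.0, 0.9, -0.7, 0.4, -0.2, 0.1, 0.6, -0.5]
curvature = [0.001, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
absmax_scale = max(abs(w) for w in block) / 3          # qmax = 3 at 3 bits
absmax_err = weighted_error(block, curvature, absmax_scale, 3)
tuned_scale, tuned_err = search_scale(block, curvature, 3)
```

With the absmax scale, every small weight rounds to zero because the outlier sets the range. The weighted search accepts clipping the low-curvature outlier in exchange for finer resolution on the weights marked as important, so `tuned_err` comes out below `absmax_err` on this toy block.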

The Sub-4-bit Cliff

Figure 1: sub-4-bit cliff and ActQuant overview

Existing PTQ methods (RTN, AWQ, GPTQ, QVLA) collapse to 0% success below ~3 bpw. ActQuant is the first PTQ method that crosses the sub-4-bit cliff while preserving task success, unlocking real memory savings for on-device deployment.
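To give a feel for why low bit-widths are hard, the following sketch measures the weight reconstruction error of plain round-to-nearest (RTN) with per-block absmax scaling at 4, 3, and 2 bits. The data is synthetic Gaussian noise standing in for a weight tensor, and the block size of 32 is an arbitrary choice. Weight error is only a proxy: it does not by itself predict the task-success numbers reported on this page.

```python
import random


def rtn(w, scale, bits):
    """Round-to-nearest onto a symmetric signed grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    return max(-qmax - 1, min(qmax, round(w / scale))) * scale


def rtn_mse(weights, bits, block=32):
    """Mean squared error of per-block absmax RTN at the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    total = 0.0
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        scale = max(abs(w) for w in chunk) / qmax or 1.0   # guard all-zero blocks
        total += sum((w - rtn(w, scale, bits)) ** 2 for w in chunk)
    return total / len(weights)


rng = random.Random(0)                        # synthetic stand-in for a weight tensor
weights = [rng.gauss(0.0, 1.0) for _ in range(1024)]
mse = {bits: rtn_mse(weights, bits) for bits in (4, 3, 2)}
```

Each bit removed roughly halves the number of grid levels, so the step size and the error grow quickly; at 2 bits only four levels remain per block.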

Main Results on LIBERO

Comparison of quantization methods on two VLA models. RTN, AWQ, and GPTQ use uniform integer precision; QVLA and ActQuant support mixed precision (reported as average bits-per-weight). Bold = best at that bpw; underline = second best. ActQuant rows are highlighted.

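Success rates are in %. Δ is the change in Avg. relative to the 16-bit baseline.

OpenVLA-OFT

| Method | Vision+LLM BPW | Spatial | Object | Goal | Long | Avg. | Δ | VLA BPW | Mem. (GB) |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 16.0 | 97.6 | 98.4 | 96.8 | 95.1 | 96.9 | 0.00 | 16.0 | 14.3 |
| AWQ | 4.0 | 94.6 | 98.6 | 96.8 | 94.4 | 96.1 | −0.8 | 4.3 | 4.1 |
| AWQ | 3.0 | 86.4 | 98.0 | 90.2 | 91.4 | 91.5 | −5.4 | 3.4 | 3.2 |
| AWQ | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | −96.9 | 2.7 | 2.4 |
| GPTQ | 4.0 | 90.8 | 98.8 | 96.4 | 92.6 | 94.6 | −2.3 | 4.3 | 4.1 |
| GPTQ | 3.0 | 82.4 | 98.0 | 88.8 | 87.2 | 89.0 | −7.9 | 3.4 | 3.2 |
| GPTQ | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | −96.9 | 2.7 | 2.4 |
| QVLA | 4.0 | 96.0 | 98.7 | 97.4 | 95.2 | 96.8 | −0.1 | 4.8 | 4.5 |
| QVLA | 3.5 | 82.9 | 97.2 | 91.4 | 93.6 | 91.3 | −5.6 | 4.3 | 4.1 |
| QVLA | 3.0 | 40.8 | 56.8 | 20.2 | 31.8 | 37.4 | −59.5 | 3.9 | 3.7 |
| QVLA | 2.5 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 | −96.9 | 3.4 | 3.1 |
| ActQuant | 4.0 | 96.0 | 98.2 | 97.2 | 95.0 | 96.6 | −0.3 | 4.4 | 4.2 |
| ActQuant | 3.5 | 95.8 | 98.4 | 96.0 | 95.6 | 96.5 | −0.4 | 3.8 | 3.4 |
| ActQuant | 3.0 | 92.8 | 97.4 | 94.0 | 95.6 | 95.0 | −1.9 | 3.5 | 3.2 |
| ActQuant | 2.5 | 86.4 | 98.2 | 84.8 | 91.0 | 90.1 | −6.8 | 3.0 | 2.7 |

π0.5

| Method | Vision+LLM BPW | Spatial | Object | Goal | Long | Avg. | Δ | VLA BPW | Mem. (GB) |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 16.0 | 98.4 | 98.4 | 97.6 | 93.6 | 97.0 | 0.00 | 16.0 | 6.7 |
| AWQ | 4.0 | 96.6 | 98.0 | 96.8 | 93.0 | 96.1 | −0.9 | 5.6 | 2.4 |
| AWQ | 3.0 | 93.4 | 97.0 | 91.4 | 90.2 | 93.0 | −4.0 | 5.0 | 2.1 |
| AWQ | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | −97.0 | 4.1 | 1.7 |
| GPTQ | 4.0 | 97.2 | 96.8 | 97.0 | 91.8 | 95.7 | −1.3 | 5.6 | 2.4 |
| GPTQ | 3.0 | 92.2 | 95.4 | 88.6 | 89.0 | 91.3 | −5.7 | 5.0 | 2.1 |
| GPTQ | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | −97.0 | 4.1 | 1.7 |
| QVLA | 4.0 | 98.0 | 97.2 | 96.4 | 91.8 | 95.8 | −1.2 | 6.3 | 2.7 |
| QVLA | 3.5 | 97.8 | 96.2 | 91.4 | 92.6 | 94.5 | −2.5 | 6.0 | 2.5 |
| QVLA | 3.0 | 97.4 | 97.8 | 87.8 | 86.8 | 92.5 | −4.5 | 5.6 | 2.4 |
| QVLA | 2.5 | 93.0 | 97.6 | 71.0 | 55.0 | 79.2 | −17.8 | 5.3 | 2.2 |
| QVLA | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | −97.0 | 4.7 | 2.0 |
| ActQuant | 4.0 | 98.4 | 99.4 | 96.8 | 91.8 | 96.6 | −0.4 | 6.2 | 2.7 |
| ActQuant | 3.5 | 99.0 | 99.6 | 94.0 | 92.4 | 96.3 | −0.7 | 5.8 | 2.5 |
| ActQuant | 3.0 | 98.2 | 98.8 | 95.0 | 87.2 | 94.8 | −2.2 | 5.6 | 2.4 |
| ActQuant | 2.5 | 96.6 | 98.4 | 74.0 | 73.6 | 85.7 | −11.3 | 5.2 | 2.2 |
| ActQuant | 2.0 | 61.2 | 80.0 | 25.0 | 25.0 | 48.0 | −49.0 | 4.8 | 2.0 |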
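The Δ and compression figures quoted on this page follow directly from the table. As a quick check, the snippet below recomputes them for two ActQuant rows of the OpenVLA-OFT block; the dictionary layout and the helper name `summarize` are illustrative only.

```python
# Values copied from the OpenVLA-OFT rows of the table above.
baseline = {"avg": 96.9, "mem_gb": 14.3}
actquant = {
    3.0: {"avg": 95.0, "mem_gb": 3.2},
    2.5: {"avg": 90.1, "mem_gb": 2.7},
}


def summarize(row, ref):
    """Return (change in average success, memory compression ratio)."""
    delta = round(row["avg"] - ref["avg"], 1)
    ratio = round(ref["mem_gb"] / row["mem_gb"], 1)
    return delta, ratio


summary = {bpw: summarize(row, baseline) for bpw, row in actquant.items()}
# summary[2.5] == (-6.8, 5.3): a 6.8-point drop for a 5.3x smaller backbone
```

The 2.5 bpw entry reproduces the 5.3× figure (14.3 GB → 2.7 GB) quoted in the abstract.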

Real-World Deployment on UR3

Real-world UR3 experiments

On a physical 6-DoF UR3 arm, π0.5 quantized with ActQuant matches the full-precision baseline's success rate while reducing memory by 2.5×. All models are deployed through the OmniModel.cpp runtime with low-bit kernels.

Real-World Rollouts on UR3

π0.5 quantized with ActQuant and deployed via OmniModel.cpp on a 6-DoF UR3 arm. Each task is shown from the 3rd-person and wrist cameras simultaneously.

Open the pot by removing its lid
Pick up the pink cylinder and place it in the orange box
Pick up the white glass and put on a brown coaster
Remove cup from nested cups
Single-finger push to blue marker

BibTeX

@article{akbari2026actquant,
  title   = {ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models},
  author  = {Akbari, Arash and Akbari, Arman and Eskandar, Masih and Tan, Qitao and Chen, Yixiao and Luo, Jingwu and Pangaribuan, Bertha and Zhang, Liyun and Dy, Jennifer and Yuan, Geng and Lin, Xue and Liu, Gaowen and Ioannidis, Stratis and Wang, Yanzhi},
  year    = {2026}
}