STRIDE-2B

STRIDE (Structured Temporal Refinement with Iterative DEnoising) is a lightweight proactive activation model for streaming video understanding. It decides when a downstream Video-LLM should respond during a live video stream — without waiting for explicit user queries.


Paper: STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

Junho Kim*, Hosu Lee*, James M. Rehg, Minsu Kim, Yong Man Ro

UIUC, KAIST, Google DeepMind

What is STRIDE?

Existing streaming Video-LLMs are reactive — they only respond when a user explicitly asks a question. STRIDE makes them proactive by adding a lightweight front-end that continuously monitors incoming frames and predicts coherent activation spans indicating when to trigger a response.

The key insight is that activation in streaming video is not a point-wise binary decision ("should I respond now?"), but a span-structured sequence modeling problem — the model must capture consistent onset (0 → 1), persistence (1 → 1), and offset (1 → 0) transitions. STRIDE achieves this through masked diffusion over a temporal activation window, jointly predicting and iteratively refining activation signals across the window.
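The refinement loop can be sketched as a toy masked-denoising procedure: start from a fully masked activation window, and at each step commit the positions the scorer is most confident about, re-scoring the rest conditioned on what is already fixed. Here `score_fn` is a hypothetical stand-in for STRIDE's learned denoiser, not the actual model:

```python
# Toy sketch of masked-diffusion denoising over a binary activation window.
# `score_fn` is a hypothetical stand-in for STRIDE's learned denoiser: given
# a partially committed window, it returns per-position P(active).
MASK = -1  # sentinel for not-yet-committed positions

def denoise_window(window_len, score_fn, steps=4):
    window = [MASK] * window_len         # start fully masked
    per_step = -(-window_len // steps)   # ceil: everything committed within `steps`
    for _ in range(steps):
        masked = [i for i, v in enumerate(window) if v == MASK]
        if not masked:
            break
        probs = score_fn(window)         # re-score, conditioned on committed labels
        # Commit the most confident masked positions first (confidence = |p - 0.5|).
        masked.sort(key=lambda i: abs(probs[i] - 0.5), reverse=True)
        for i in masked[:per_step]:
            window[i] = int(probs[i] >= 0.5)
    return window

# Toy scorer: positions 3-5 look "active"; a single denoising pass recovers
# a coherent onset / persistence / offset span rather than isolated flips.
def toy_score(window):
    return [0.9 if 3 <= i <= 5 else 0.1 for i in range(len(window))]

print(denoise_window(9, toy_score))  # [0, 0, 0, 1, 1, 1, 0, 0, 0]
```

Because every step re-scores the remaining masked positions with the committed labels in context, the final window forms a contiguous span instead of independent per-frame decisions.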

Two-Stage Architecture

Video Stream
  │
  ▼
[STRIDE Activation Model]  ← this model (2B)
  │
  │ trigger (only if active)
  ▼
[Downstream Video-LLM]     ← frozen, any off-the-shelf
  │
  ▼
Response

  • Stage 1 — Activation (STRIDE): monitors the stream at 1 FPS, maintains a sliding activation window, and iteratively denoises binary activation labels via masked diffusion.
  • Stage 2 — Response (downstream Video-LLM): when triggered, the frozen downstream Video-LLM receives the accumulated frame cache and generates a response. STRIDE is fully plug-and-play, compatible with any off-the-shelf Video-LLM.
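The two-stage loop can be wired together in a few lines. This is a minimal simulation, not the actual STRIDE API: `activation_model` and `video_llm` are illustrative stand-ins for Stage 1 and Stage 2.

```python
# Minimal sketch of the two-stage streaming loop (names are illustrative).
# Frames arrive at 1 FPS; the activation model decides *when* to trigger,
# and only then does the frozen Video-LLM see the accumulated frame cache.
def stream_pipeline(frames, activation_model, video_llm, window_size=8):
    frame_cache, window, responses = [], [], []
    prev_active = False
    for t, frame in enumerate(frames):       # one frame per second
        frame_cache.append(frame)
        window.append(frame)
        if len(window) > window_size:
            window.pop(0)                    # sliding activation window
        active = activation_model(window)    # Stage 1: denoised activation bit
        if active and not prev_active:       # onset (0 -> 1): trigger a response
            responses.append((t, video_llm(list(frame_cache))))  # Stage 2
        prev_active = active
    return responses

# Stub predictor and LLM for illustration: trigger once frame value 5 is seen.
responses = stream_pipeline(
    frames=list(range(10)),
    activation_model=lambda window: any(f >= 5 for f in window),
    video_llm=lambda cache: f"response over {len(cache)} frames",
    window_size=3,
)
print(responses)  # [(5, 'response over 6 frames')]
```

Triggering only on the onset transition is what keeps the downstream model from being re-invoked on every frame of a persisting activation span.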

Results

OVO-Bench (Online Video Understanding)

| Method | Real-Time Perception | Backward Tracing | Forward Active Responding | Overall |
|---|---|---|---|---|
| Flash-VStream-7B | 28.37 | 27.38 | 45.09 | 33.61 |
| Dispider | 54.55 | 36.06 | 34.72 | 41.78 |
| TimeChat-Online-7B | 58.60 | 42.00 | 36.40 | 45.60 |
| QueryStream-7B | 61.40 | 42.10 | 39.03 | 47.51 |
| StreamAgent-7B | 61.30 | 41.70 | 45.40 | 49.40 |
| STRIDE + Gemma3-4B | 60.93 | 34.87 | 55.73 | 50.51 |
| STRIDE + InternVL3-8B | 67.72 | 45.23 | 58.00 | 56.98 |
| STRIDE + Qwen3-VL-8B | 69.68 | 47.83 | 59.70 | 59.07 |

StreamingBench (Streaming Comprehension)

| Method | Real-Time Visual | Omni-Source | Contextual | Overall |
|---|---|---|---|---|
| Flash-VStream-7B | 23.23 | 26.00 | 24.12 | 24.04 |
| VideoLLM-Online-8B | 35.99 | 28.45 | 26.55 | 32.48 |
| Dispider | 67.63 | 35.66 | 33.61 | 53.12 |
| StreamAgent-7B | 74.31 | 36.26 | 34.62 | 57.02 |
| STRIDE + Gemma3-4B | 60.00 | 36.80 | 38.80 | 50.14 |
| STRIDE + InternVL3-8B | 72.45 | 39.20 | 38.80 | 57.58 |
| STRIDE + Qwen3-VL-8B | 74.24 | 41.30 | 39.90 | 59.29 |

ET-Bench (Temporal Grounding, Activation-Only)

| Model | Params | TVG | EPM | TAL | DVC | SLC | Avg |
|---|---|---|---|---|---|---|---|
| *Temporal-Localization Specialized* | | | | | | | |
| VTimeLLM | 7B | 7.6 | 1.9 | 18.2 | 12.4 | 8.7 | 9.8 |
| TimeChat | 7B | 26.2 | 3.9 | 10.1 | 16.6 | 5.6 | 12.5 |
| VTG-LLM | 7B | 15.9 | 3.7 | 14.4 | 40.2 | 20.8 | 19.0 |
| LITA | 13B | 22.2 | 4.6 | 18.0 | 39.7 | 21.0 | 21.1 |
| ETChat | 5B | 38.6 | 10.2 | 30.8 | 38.4 | 24.4 | 28.5 |
| *Streaming Baselines* | | | | | | | |
| VideoLLM-Online | 8B | 13.2 | 3.8 | 9.1 | 24.0 | 9.9 | 12.0 |
| Dispider | 9B | 36.1 | 15.5 | 27.3 | 33.8 | 18.8 | 26.3 |
| StreamBridge | 8B | 34.3 | – | 24.3 | 38.3 | – | 22.6 |
| *Ours* | | | | | | | |
| STRIDE | 2B | 62.8 | 10.7 | 24.6 | 36.5 | 28.5 | 32.6 |

STRIDE achieves the best overall average with only 2B parameters, outperforming both the 7B-13B temporal-localization specialized models and the streaming baselines.
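The Avg column appears to be the unweighted mean of the five per-task scores (it checks out for every fully reported row in the table); for the STRIDE row:

```python
# Sanity check: ET-Bench "Avg" as the plain mean of the five task scores.
stride = {"TVG": 62.8, "EPM": 10.7, "TAL": 24.6, "DVC": 36.5, "SLC": 28.5}
avg = round(sum(stride.values()) / len(stride), 1)
print(avg)  # 32.6, matching the reported Avg
```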

Usage

For the full streaming inference pipeline and evaluation scripts, please refer to the STRIDE GitHub repository.

Training

  • Architecture: Qwen3VLForProactiveMDM (Qwen3-VL backbone with a temporal activation head)
  • Base model: Qwen/Qwen3-VL-2B-Instruct
  • Training data: Temporal activation annotations curated from eight publicly available video understanding datasets (ActivityNet-Captions, LITA, YouCook2, ET-Instruct, Charades, CharadesEgo, DiDeMo, Grounded-VideoLLM).

See the paper for full training details and hyperparameters.

Model Variants

| Model | Params | Description |
|---|---|---|
| STRIDE-2B (this model) | 2B | Default activation model |
| STRIDE-4B | 4B | Scaled variant with improved accuracy |

Citation

@article{kim2025stride,
  title={STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding},
  author={Kim, Junho and Lee, Hosu and Rehg, James M. and Kim, Minsu and Ro, Yong Man},
  journal={arXiv preprint arXiv:2603.XXXXX},
  year={2025}
}

License

This model is released under the Apache 2.0 License.
