STRIDE-2B
STRIDE (Structured Temporal Refinement with Iterative DEnoising) is a lightweight proactive activation model for streaming video understanding. It decides when a downstream Video-LLM should respond during a live video stream — without waiting for explicit user queries.
Paper: STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding
Junho Kim*, Hosu Lee*, James M. Rehg, Minsu Kim, Yong Man Ro
UIUC, KAIST, Google DeepMind
What is STRIDE?
Existing streaming Video-LLMs are reactive — they only respond when a user explicitly asks a question. STRIDE makes them proactive by adding a lightweight front-end that continuously monitors incoming frames and predicts coherent activation spans indicating when to trigger a response.
The key insight is that activation in streaming video is not a point-wise binary decision ("should I respond now?"), but a span-structured sequence modeling problem — the model must capture consistent onset (0 → 1), persistence (1 → 1), and offset (1 → 0) transitions. STRIDE achieves this through masked diffusion over a temporal activation window, jointly predicting and iteratively refining activation signals across the window.
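Below is a minimal PyTorch sketch of that idea, not the released implementation: starting from a fully masked window, the model commits its most confident per-frame predictions at each step and re-masks the rest, so onsets, persistence, and offsets are refined jointly rather than decided frame by frame. The `model` callable, the `MASK` sentinel, and the commit schedule are all illustrative assumptions.

```python
import torch

MASK = -1  # hypothetical sentinel marking "not yet decided" positions

@torch.no_grad()
def denoise_activation_window(model, frames, window_len=16, steps=4):
    """Iteratively refine a binary activation window via masked diffusion.

    `model(frames, window)` is a hypothetical callable returning a
    (window_len,) tensor of per-frame activation probabilities; it stands
    in for the STRIDE forward pass, not its actual API.
    """
    # Start fully masked: every position in the window is undecided.
    window = torch.full((window_len,), MASK, dtype=torch.long)
    for step in range(steps):
        probs = model(frames, window)             # (window_len,) in [0, 1]
        conf = torch.maximum(probs, 1.0 - probs)  # per-position confidence
        labels = (probs > 0.5).long()
        # Commit only the most confident predictions this step and re-mask
        # the rest, so later steps can revise low-confidence transitions.
        keep = int(window_len * (step + 1) / steps)
        keep_idx = conf.topk(keep).indices
        window = torch.full_like(window, MASK)
        window[keep_idx] = labels[keep_idx]
    return window  # coherent 0/1 span, e.g. [0, 0, 1, 1, 1, 1, 0, 0, ...]
```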
Two-Stage Architecture
```
Video Stream
     │
     ▼
[STRIDE Activation Model]   ← this model (2B)
     │
     │ trigger (only if active)
     ▼
[Downstream Video-LLM]      ← frozen, any off-the-shelf
     │
     ▼
  Response
```
- Stage 1 — Activation (STRIDE): Monitors the stream at 1 FPS, maintains a sliding activation window, and iteratively denoises binary activation labels via masked diffusion.
- Stage 2 — Response (Downstream LLM): When triggered, the frozen downstream Video-LLM receives the accumulated frame cache and generates a response. STRIDE is fully plug-and-play — compatible with any off-the-shelf Video-LLM.
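As a concrete illustration of this two-stage pipeline, here is a minimal streaming loop in Python; `stride_model.is_active` and `video_llm.respond` are hypothetical wrappers standing in for the actual entry points in the STRIDE repository.

```python
from collections import deque

def streaming_loop(stride_model, video_llm, stream, window_len=16):
    """Proactive streaming: invoke the frozen Video-LLM only when STRIDE fires.

    `stride_model.is_active` and `video_llm.respond` are hypothetical
    wrappers; the real entry points live in the STRIDE GitHub repository.
    """
    frame_cache = []                    # all frames seen so far (Stage 2 input)
    window = deque(maxlen=window_len)   # sliding activation window (Stage 1)
    for frame in stream:                # stream yields one frame per second
        frame_cache.append(frame)
        window.append(frame)
        if stride_model.is_active(list(window)):   # denoised activation decision
            print(video_llm.respond(frame_cache))  # frozen, off-the-shelf LLM
```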
Results
OVO-Bench (Online Video Understanding)
| Method | Real-Time Perception | Backward Tracing | Forward Active Responding | Overall |
|---|---|---|---|---|
| Flash-VStream-7B | 28.37 | 27.38 | 45.09 | 33.61 |
| Dispider | 54.55 | 36.06 | 34.72 | 41.78 |
| TimeChat-Online-7B | 58.60 | 42.00 | 36.40 | 45.60 |
| QueryStream-7B | 61.40 | 42.10 | 39.03 | 47.51 |
| StreamAgent-7B | 61.30 | 41.70 | 45.40 | 49.40 |
| STRIDE + Gemma3-4B | 60.93 | 34.87 | 55.73 | 50.51 |
| STRIDE + InternVL3-8B | 67.72 | 45.23 | 58.00 | 56.98 |
| STRIDE + Qwen3-VL-8B | 69.68 | 47.83 | 59.70 | 59.07 |
StreamingBench (Streaming Comprehension)
| Method | Real-Time Visual | Omni-Source | Contextual | Overall |
|---|---|---|---|---|
| Flash-VStream-7B | 23.23 | 26.00 | 24.12 | 24.04 |
| VideoLLM-Online-8B | 35.99 | 28.45 | 26.55 | 32.48 |
| Dispider | 67.63 | 35.66 | 33.61 | 53.12 |
| StreamAgent-7B | 74.31 | 36.26 | 34.62 | 57.02 |
| STRIDE + Gemma3-4B | 60.00 | 36.80 | 38.80 | 50.14 |
| STRIDE + InternVL3-8B | 72.45 | 39.20 | 38.80 | 57.58 |
| STRIDE + Qwen3-VL-8B | 74.24 | 41.30 | 39.90 | 59.29 |
ET-Bench (Temporal Grounding, Activation-Only)
| Model | Params | TVG | EPM | TAL | DVC | SLC | Avg |
|---|---|---|---|---|---|---|---|
| *Temporal-Localization Specialized* | | | | | | | |
| VTimeLLM | 7B | 7.6 | 1.9 | 18.2 | 12.4 | 8.7 | 9.8 |
| TimeChat | 7B | 26.2 | 3.9 | 10.1 | 16.6 | 5.6 | 12.5 |
| VTG-LLM | 7B | 15.9 | 3.7 | 14.4 | 40.2 | 20.8 | 19.0 |
| LITA | 13B | 22.2 | 4.6 | 18.0 | 39.7 | 21.0 | 21.1 |
| ETChat | 5B | 38.6 | 10.2 | 30.8 | 38.4 | 24.4 | 28.5 |
| *Streaming Baselines* | | | | | | | |
| VideoLLM-Online | 8B | 13.2 | 3.8 | 9.1 | 24.0 | 9.9 | 12.0 |
| Dispider | 9B | 36.1 | 15.5 | 27.3 | 33.8 | 18.8 | 26.3 |
| StreamBridge | 8B | 34.3 | – | 24.3 | 38.3 | 22.6 | – |
| *Ours* | | | | | | | |
| STRIDE | 2B | 62.8 | 10.7 | 24.6 | 36.5 | 28.5 | 32.6 |
STRIDE achieves the best overall average with only 2B parameters, outperforming 7-13B temporal-localization specialized models and streaming baselines.
Usage
For the full streaming inference pipeline and evaluation scripts, please refer to the STRIDE GitHub repository.
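As a starting point, a minimal loading sketch with Hugging Face transformers is shown below. It assumes the custom `Qwen3VLForProactiveMDM` class is exposed through `trust_remote_code`; the streaming activation call itself is defined in the repository, so treat this only as a scaffold.

```python
import torch
from transformers import AutoModel, AutoProcessor

# Assumes the custom Qwen3VLForProactiveMDM architecture is loadable via
# trust_remote_code; consult the STRIDE repo for the streaming entry points.
model = AutoModel.from_pretrained(
    "interlive/STRIDE-2B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
processor = AutoProcessor.from_pretrained(
    "interlive/STRIDE-2B", trust_remote_code=True
)
```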
Training
- Architecture: `Qwen3VLForProactiveMDM` (Qwen3-VL backbone with a temporal activation head)
- Base model: Qwen/Qwen3-VL-2B-Instruct
- Training data: Temporal activation annotations curated from eight publicly available video understanding datasets (ActivityNet-Captions, LITA, YouCook2, ET-Instruct, Charades, CharadesEgo, DiDeMo, Grounded-VideoLLM). See the paper for full training details and hyperparameters.
Model Variants
| Model | Params | Description |
|---|---|---|
| STRIDE-2B (this) | 2B | Default activation model |
| STRIDE-4B | 4B | Scaled variant with improved accuracy |
Citation
```bibtex
@article{kim2025stride,
  title={STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding},
  author={Kim, Junho and Lee, Hosu and Rehg, James M. and Kim, Minsu and Ro, Yong Man},
  journal={arXiv preprint arXiv:2603.XXXXX},
  year={2025}
}
```
License
This model is released under the Apache 2.0 License.