Wire Riot

paper

arXiv cs.CV

November 18th, 2025 at 5:00 AM

Video Finetuning Improves Reasoning Between Frames

arXiv:2511.12868v1 Announce Type: new Abstract: Multimodal large language models (LLMs) have made rapid progress in visual understanding, yet their extension from images to videos often reduces to a naive concatenation of frame tokens. In this work, we investigate what video finetuning brings to multimodal LLMs. We propose Visual Chain-of-Thought (vCoT), an explicit reasoning process that generates transitional event descriptions between consecutive frames. Using vCoT, we systematically compare image-only LVLMs with their video-finetuned counterparts, both with and without access to these transitional cues. Our experiments show that vCoT significantly improves the performance of image-only models on long-form video question answering, while yielding only marginal gains for video-finetuned models. This suggests that the latter already capture frame-to-frame transitions implicitly. Moreover, we find that video models transfer this temporal reasoning ability to purely static settings, outperforming image models' baselines on relational visual reasoning tasks.

#ai

#llm

Open source

Score: 2.80

Engagement proxy: 0

Canonical link: https://arxiv.org/abs/2511.12868