Cascade Context-oriented Spatio-temporal Attention Network for Efficient and Fine-grained Video-grounded Dialogues

09/01/2026 Frontiers Journals

Video-grounded dialogue systems (VGDS), which generate reasonable responses based on a multi-turn dialogue context and a given video, have received intensive attention recently. Existing approaches struggle to identify the video parts relevant to the dialogue context, and they disregard the impact of redundant information in long-form, content-dynamic videos. In addition, current methods usually align all semantics across modalities uniformly in a single cross-attention pass, which neglects the intricate correspondence between visual and textual concepts at different granularities.
To address these problems, a research team led by Bin GUO published new research on 15 July 2025 in Frontiers of Computer Science, a journal co-published by Higher Education Press and Springer Nature.
The team proposed a novel system named Cascade cOntext-oriented Spatio-Temporal Attention Network (COSTA). First, COSTA adopts a cascade attention network that localizes only the most relevant video clips and regions in a coarse-to-fine manner, which effectively filters out irrelevant visual semantics. Second, COSTA uses a memory distillation-inspired iterative visual-textual cross-attention strategy to progressively integrate visual semantics with dialogue contexts across varying granularities, facilitating extensive multi-modal alignment. Experiments on several benchmarks show significant improvements of COSTA over state-of-the-art methods across various metrics.
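The coarse-to-fine localization can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`dot`, `top_k`, `cascade_filter`), the mean pooling of region features into a clip feature, and the hard top-k selection by dot-product relevance are all assumptions chosen for illustration, standing in for the learned attention in COSTA.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    """Indices of the k highest scores, returned in their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def cascade_filter(context, clips, k_clips, k_regions):
    """Hypothetical coarse-to-fine filter.

    context: dialogue-context vector (list of floats).
    clips:   list of clips; each clip is a list of region feature vectors.
    Returns a dict mapping each kept clip index to its kept region indices.
    """
    # Coarse step: score each clip by its mean-pooled region features
    # (the pooling choice is an assumption) and keep the k_clips best.
    clip_scores = []
    for regions in clips:
        dim = len(regions[0])
        pooled = [sum(r[d] for r in regions) / len(regions) for d in range(dim)]
        clip_scores.append(dot(context, pooled))
    kept_clips = top_k(clip_scores, k_clips)

    # Fine step: only inside the kept clips, score individual regions
    # against the same context and keep the k_regions best.
    selected = {}
    for c in kept_clips:
        region_scores = [dot(context, r) for r in clips[c]]
        selected[c] = top_k(region_scores, k_regions)
    return selected


# Toy example: three clips with two or three regions each, 2-d features.
context = [1.0, 0.0]
clips = [
    [[0.1, 0.9], [0.0, 1.0]],
    [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]],
    [[0.5, 0.5], [0.6, 0.4]],
]
print(cascade_filter(context, clips, k_clips=2, k_regions=1))
```

Because regions are scored only inside the clips that survive the coarse step, the fine step never touches the discarded clips, which is where the efficiency gain of a cascade comes from.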
COSTA is designed for efficient reasoning over long-form, content-dynamic videos and for generating accurate, fine-grained responses. The system has two key components. (1) To prevent interference from redundant visual information and to make video reasoning efficient, the system employs a cascade attention network for region-of-query localization. In short, it first identifies several candidate video clips that are highly relevant to the dialogue context (temporal filtering) and then localizes the relevant spatial regions in the corresponding video frames (spatial filtering). This coarse-to-fine visual-context attention scheme yields more context-specific, discriminative visual features. (2) For comprehensive multi-modal alignment and semantic co-reasoning, the team proposes a memory distillation-inspired iterative visual-textual cross-attention strategy. Different kinds of semantic correlations between videos and dialogue contexts are progressively integrated over multiple cross-attention steps, and the memory distillation refines the video features based on cross-modal semantic interactions so that information propagates dynamically from early steps to later ones.
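The iterative cross-attention idea can be sketched in a few lines. This is only an illustration under stated assumptions: the release does not give the update equations, so the scaled dot-product attention, the additive fusion into the query, and the weight-gated memory refinement below are hypothetical stand-ins, as are the names `cross_attend` and `iterative_cross_attention`.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, memory):
    """One text-to-video attention step (scaled dot-product)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, feat) / scale for feat in memory])
    summary = [
        sum(w * feat[d] for w, feat in zip(weights, memory))
        for d in range(len(query))
    ]
    return weights, summary


def iterative_cross_attention(text, video, steps):
    """Hypothetical multi-step fusion of a text vector with video features."""
    query = list(text)
    memory = [list(feat) for feat in video]  # copy, so the input is untouched
    for _ in range(steps):
        weights, summary = cross_attend(query, memory)
        # Fuse the attended visual summary into the textual query.
        query = [q + s for q, s in zip(query, summary)]
        # Refine the video memory (assumed rule): features that received
        # more attention are pulled more strongly toward the fused query,
        # so later steps see what earlier steps found relevant.
        memory = [
            [(1 - w) * m + w * q for m, q in zip(feat, query)]
            for w, feat in zip(weights, memory)
        ]
    return query, memory


fused, refined = iterative_cross_attention(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], steps=2
)
print(fused, refined)
```

The point of the loop is that each step attends over a memory already reshaped by the previous step, rather than aligning everything in a single pass.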
Extensive experiments demonstrate the superiority of COSTA over state-of-the-art methods. The team also conducted ablation studies and case studies to analyze the impact of each proposed module.

DOI: 10.1007/s11704-024-40387-w
Attached files
  • Fig. 1 Proposed COSTA architecture
Regions: Asia, China
Keywords: Applied science, Computing

