Latest

Fresh from the feed

Filter by timeframe and category to zero in on the moves that matter.

Open-Insect: Benchmarking Open-Set Recognition of Novel Species in Biodiversity Monitoring
paper
arXiv cs.CV · 3 days ago

arXiv:2503.01691v2 Announce Type: replace Abstract: Global biodiversity is declining at an unprecedented rate, yet little information is known about most species and how their populations are changing. Indeed, some 90% of Earth's species are estimated to be completely unknown. Machine learning has recently emerged as a promising tool to facilitate long-term, large-scale biodiversity monitoring, including algorithms for fine-grained classification of species from images. However, such algorithms typically are not designed to detect examples from categories unseen during training -- the problem of open-set recognition (OSR) -- limiting their applicability for highly diverse, poorly studied taxa such as insects. To address this gap, we introduce Open-Insect, a large-scale, fine-grained dataset to evaluate unknown species detection across different geographic regions with varying difficulty. We benchmark 38 OSR algorithms across three categories: post-hoc, training-time regularization, and training with auxiliary data, finding that simple post-hoc approaches remain a strong baseline. We also demonstrate how to leverage auxiliary data to improve species discovery in regions with limited data. Our results provide insights to guide the development of computer vision methods for biodiversity monitoring and species discovery.
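
The benchmark's finding that simple post-hoc approaches remain a strong baseline is easy to make concrete. Below is a minimal sketch of two common post-hoc open-set scores (maximum softmax probability and the energy score) computed from a closed-set classifier's logits; the classifier, batch, and threshold are illustrative assumptions, not part of Open-Insect itself.

```python
import torch
import torch.nn.functional as F

def openset_scores(logits: torch.Tensor, temperature: float = 1.0):
    """Post-hoc open-set scores from closed-set logits.

    Higher scores mean "looks like a known class"; thresholding them flags
    likely novel species without retraining the classifier.
    """
    probs = F.softmax(logits / temperature, dim=-1)
    msp = probs.max(dim=-1).values                                        # maximum softmax probability
    energy = temperature * torch.logsumexp(logits / temperature, dim=-1)  # energy score
    return msp, energy

# Hypothetical usage: flag samples whose MSP falls below a validation-tuned threshold.
logits = torch.randn(4, 1000)     # stand-in batch of species logits
msp, _ = openset_scores(logits)
is_novel = msp < 0.5              # threshold is an illustrative placeholder
```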

#ai
Score · 2.80
BAT: Learning Event-based Optical Flow with Bidirectional Adaptive Temporal Correlation
paper
arXiv cs.CV · 3 days ago

arXiv:2503.03256v2 Announce Type: replace Abstract: Event cameras deliver visual information characterized by a high dynamic range and high temporal resolution, offering significant advantages in estimating optical flow for complex lighting conditions and fast-moving objects. Current advanced optical flow methods for event cameras largely adopt established image-based frameworks. However, the spatial sparsity of event data limits their performance. In this paper, we present BAT, an innovative framework that estimates event-based optical flow using bidirectional adaptive temporal correlation. BAT includes three novel designs: 1) a bidirectional temporal correlation that transforms bidirectional temporally dense motion cues into spatially dense ones, enabling accurate and spatially dense optical flow estimation; 2) an adaptive temporal sampling strategy for maintaining temporal consistency in correlation; 3) spatially adaptive temporal motion aggregation to efficiently and adaptively aggregate consistent target motion features into adjacent motion features while suppressing inconsistent ones. Our results rank 1st on the DSEC-Flow benchmark, outperforming existing state-of-the-art methods by a large margin while also exhibiting sharp edges and high-quality details. Notably, our BAT can accurately predict future optical flow using only past events, significantly outperforming E-RAFT's warm-start approach. Code: https://github.com/gangweiX/BAT.
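
For readers unfamiliar with correlation volumes, the sketch below shows the generic all-pairs feature correlation that RAFT-style flow estimators (the image-based frameworks mentioned above) build on. BAT's bidirectional adaptive temporal correlation extends this idea; none of its adaptive components are reproduced here, so treat this as background only.

```python
import torch

def correlation_volume(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """All-pairs feature correlation, as used in RAFT-style flow estimators.

    f1, f2: (B, C, H, W) feature maps from two (event) representations.
    Returns a (B, H, W, H, W) volume of scaled dot-product similarities.
    """
    b, c, h, w = f1.shape
    f1 = f1.flatten(2)                                    # (B, C, H*W)
    f2 = f2.flatten(2)                                    # (B, C, H*W)
    corr = torch.einsum('bci,bcj->bij', f1, f2) / c**0.5  # (B, H*W, H*W)
    return corr.view(b, h, w, h, w)
```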

#ai
#research
#open_source
Score · 2.80
S4M: 4-points to Segment Anything
paper
arXiv cs.CV · 3 days ago

arXiv:2503.05534v2 Announce Type: replace Abstract: Purpose: The Segment Anything Model (SAM) promises to ease the annotation bottleneck in medical segmentation, but overlapping anatomy and blurred boundaries make its point prompts ambiguous, leading to cycles of manual refinement to achieve precise masks. Better prompting strategies are needed. Methods: We propose a structured prompting strategy using 4 points as a compact instance-level shape description. We study two 4-point variants: extreme points and the proposed major/minor axis endpoints, inspired by ultrasound measurement practice. SAM cannot fully exploit such structured prompts because it treats all points identically and lacks geometry-aware reasoning. To address this, we introduce S4M (4-points to Segment Anything), which augments SAM to interpret 4 points as relational cues rather than isolated clicks. S4M expands the prompt space with role-specific embeddings and adds an auxiliary "Canvas" pretext task that sketches coarse masks directly from prompts, fostering geometry-aware reasoning. Results: Across eight datasets in ultrasound and surgical endoscopy, S4M improves segmentation by +3.42 mIoU over a strong SAM baseline at equal prompt budget. An annotation study with three clinicians further shows that major/minor prompts enable faster annotation. Conclusion: S4M increases performance, reduces annotation effort, and aligns prompting with clinical practice, enabling more scalable dataset development in medical imaging.
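
The two 4-point prompt variants are simple to derive from an existing mask, e.g. when simulating prompts for training or evaluation. The sketch below computes extreme points and PCA-based major/minor axis endpoints from a binary mask; it is an illustrative stand-in for the prompt construction, not the authors' code.

```python
import numpy as np

def four_point_prompts(mask: np.ndarray):
    """Derive two 4-point prompt variants from a binary mask of shape (H, W).

    Returns (extreme_points, axis_endpoints), each a (4, 2) array of (x, y).
    Axis endpoints are taken along the principal axes of the mask pixels,
    a rough stand-in for the major/minor axis measurements described above.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)

    # Extreme points: left-, right-, top-, and bottom-most mask pixels.
    extreme = np.array([pts[xs.argmin()], pts[xs.argmax()],
                        pts[ys.argmin()], pts[ys.argmax()]])

    # Major/minor axis endpoints via PCA of the pixel coordinates.
    center = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - center, full_matrices=False)
    axis_pts = []
    for axis in vt:                          # major axis first, then minor
        proj = (pts - center) @ axis
        axis_pts += [pts[proj.argmin()], pts[proj.argmax()]]
    return extreme, np.array(axis_pts)
```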

#research
Score · 2.80
SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model
paper
arXiv cs.CV · 3 days ago

arXiv:2503.06515v2 Announce Type: replace Abstract: Segment Anything Model (SAM) exhibits remarkable zero-shot segmentation capability; however, its prohibitive computational costs make edge deployment challenging. Although post-training quantization (PTQ) offers a promising compression solution, existing methods yield unsatisfactory results when applied to SAM, owing to its specialized model components and promptable workflow: (i) The mask decoder's attention exhibits extreme activation outliers, and we find that aggressive clipping (even 100x), without smoothing or isolation, is effective in suppressing outliers while maintaining performance. Unfortunately, traditional distribution-based metrics (e.g., MSE) fail to arrive at such large-scale clipping. (ii) Existing quantization reconstruction methods neglect the semantic interactivity of SAM, leading to misalignment between image features and prompt intention. To address the above issues, we propose SAQ-SAM in this paper, which boosts PTQ for SAM from the perspective of semantic alignment. Specifically, we propose Perceptual-Consistency Clipping, which exploits attention focus overlap to promote aggressive clipping while preserving semantic capabilities. Furthermore, we propose Prompt-Aware Reconstruction, which incorporates image-prompt interactions by leveraging cross-attention in the mask decoder, thus facilitating alignment in both distribution and semantics. Moreover, to ensure interaction efficiency, we design a layer-skipping strategy for image tokens in the encoder. Extensive experiments are conducted on various SAM sizes and tasks, including instance segmentation, oriented object detection, and semantic segmentation, and the results show that our method consistently exhibits advantages. For example, when quantizing SAM-B to 4-bit, SAQ-SAM achieves 11.7% higher mAP than the baseline on the instance segmentation task.
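
A rough illustration of the "aggressive clipping" observation: the sketch below clips activations to a small fraction of their original range before uniform fake-quantization. The clip ratio and bit-width are placeholders; the paper's Perceptual-Consistency Clipping chooses the range from attention focus overlap, which is not modeled here.

```python
import torch

def clip_and_quantize(x: torch.Tensor, clip_ratio: float = 0.01, n_bits: int = 4):
    """Aggressively clip activations, then fake-quantize to n_bits.

    clip_ratio=0.01 keeps only 1% of the original max range (roughly a 100x clip),
    illustrating the scale of clipping discussed above; how to pick the ratio
    without hurting semantics is exactly what the paper addresses.
    """
    clip_val = clip_ratio * x.abs().max()
    x_c = x.clamp(-clip_val, clip_val)
    qmax = 2 ** (n_bits - 1) - 1
    scale = clip_val / qmax
    x_q = torch.round(x_c / scale).clamp(-qmax - 1, qmax)  # integer grid
    return x_q * scale                                      # dequantized ("fake quant")
```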

#ai
#research
Score · 2.80
SimROD: A Simple Baseline for Raw Object Detection with Global and Local Enhancements
paper
arXiv cs.CV · 3 days ago

arXiv:2503.07101v3 Announce Type: replace Abstract: Most visual models are designed for sRGB images, yet RAW data offers significant advantages for object detection by preserving sensor information before ISP processing. This enables improved detection accuracy and more efficient hardware designs by bypassing the ISP. However, RAW object detection is challenging due to limited training data, unbalanced pixel distributions, and sensor noise. To address this, we propose SimROD, a lightweight and effective approach for RAW object detection. We introduce a Global Gamma Enhancement (GGE) module, which applies a learnable global gamma transformation with only four parameters, improving feature representation while keeping the model efficient. Additionally, we leverage the green channel's richer signal to enhance local details, aligning with the human eye's sensitivity and Bayer filter design. Extensive experiments on multiple RAW object detection datasets and detectors demonstrate that SimROD outperforms state-of-the-art methods like RAW-Adapter and DIAP while maintaining efficiency. Our work highlights the potential of RAW data for real-world object detection. Code is available at https://ocean146.github.io/SimROD2025/.
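
As a sketch of what a four-parameter global gamma module could look like, the snippet below learns one gamma per packed Bayer plane (R, Gr, Gb, B). Reading the "four parameters" as per-plane gammas is an assumption made here for illustration; the published GGE module may be parameterized differently.

```python
import torch
import torch.nn as nn

class GlobalGamma(nn.Module):
    """Learnable global gamma on packed RAW, as a sketch of a GGE-style module.

    Assumes the RAW frame is packed into 4 Bayer planes (R, Gr, Gb, B) with
    values in [0, 1]; one learnable gamma per plane gives four parameters.
    This is an interpretation, not the authors' exact implementation.
    """
    def __init__(self, num_channels: int = 4):
        super().__init__()
        self.log_gamma = nn.Parameter(torch.zeros(num_channels))  # gamma = 1 at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:           # x: (B, 4, H, W)
        gamma = self.log_gamma.exp().view(1, -1, 1, 1)
        return x.clamp(min=1e-6) ** gamma
```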

#ai
#open_source
Score · 2.80
Revealing the Implicit Noise-based Imprint of Generative Models
paper
arXiv cs.CV · 3 days ago

arXiv:2503.09314v2 Announce Type: replace Abstract: With the rapid advancement of vision generation models, the potential security risks stemming from synthetic visual content have garnered increasing attention, posing significant challenges for AI-generated image detection. Existing methods suffer from inadequate generalization capabilities, resulting in unsatisfactory performance on emerging generative models. To address this issue, this paper presents NIRNet (Noise-based Imprint Revealing Network), a novel framework that leverages noise-based imprints for the detection task. Specifically, we propose a novel Noise-based Imprint Simulator to capture intrinsic patterns imprinted in images generated by different models. By aggregating imprints from various generative models, the imprints of future models can be extrapolated to expand training data, thereby enhancing generalization and robustness. Furthermore, we design a new pipeline that pioneers the use of noise patterns, derived from a Noise-based Imprint Extractor, alongside other visual features for AI-generated image detection, significantly improving detection performance. Our approach achieves state-of-the-art performance across seven diverse benchmarks, including five public datasets and two newly proposed generalization tests, demonstrating its superior generalization and effectiveness.
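
The core intuition, isolating a noise-like residual that carries the generator's imprint, can be illustrated with a crude high-pass filter. The paper's Noise-based Imprint Extractor is learned; the box-blur residual below is only a hand-rolled stand-in for the concept.

```python
import torch
import torch.nn.functional as F

def noise_residual(img: torch.Tensor, ksize: int = 5) -> torch.Tensor:
    """Crude noise residual: image minus a box-blurred copy (a high-pass filter).

    img: (B, C, H, W). A learned extractor, as in the abstract above, would
    replace the blur; this only illustrates isolating a noise-like imprint.
    """
    c = img.shape[1]
    pad = ksize // 2
    kernel = torch.full((c, 1, ksize, ksize), 1.0 / ksize**2, device=img.device)
    blurred = F.conv2d(F.pad(img, [pad] * 4, mode='reflect'), kernel, groups=c)
    return img - blurred
```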

#ai
#research
Score · 2.80
QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution
paper
arXiv cs.CV · 3 days ago

arXiv:2503.12015v2 Announce Type: replace Abstract: Deep learning-based super-resolution (SR) methods often perform pixel-wise computations uniformly across entire images, even in homogeneous regions where high-resolution refinement is redundant. We propose the Quadtree Diffusion Model (QDM), a region-adaptive diffusion framework that leverages a quadtree structure to selectively enhance detail-rich regions while reducing computations in homogeneous areas. By guiding the diffusion with a quadtree derived from the low-quality input, QDM identifies key regions (represented by leaf nodes) where fine detail is essential and applies minimal refinement elsewhere. This mask-guided, two-stream architecture adaptively balances quality and efficiency, producing high-fidelity outputs with low computational redundancy. Experiments demonstrate QDM's effectiveness in high-resolution SR tasks across diverse image types, particularly in medical imaging (e.g., CT scans), where large homogeneous regions are prevalent. Furthermore, QDM outperforms or is comparable to state-of-the-art SR methods on standard benchmarks while significantly reducing computational costs, highlighting its efficiency and suitability for resource-limited environments. Our code is available at https://github.com/linYDTHU/QDM.
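
The quadtree idea is easy to sketch: recursively split regions of the low-quality input until local variance drops below a threshold, and treat leaf size as a proxy for how much refinement a region needs. The recursion below is an illustration under those assumptions (square, roughly power-of-two crops); the exact QDM mask construction and its coupling to the diffusion steps are not reproduced.

```python
import numpy as np

def quadtree_leaves(img: np.ndarray, thr: float = 1e-3, min_size: int = 8):
    """Recursively split an image into leaf boxes wherever local variance is high.

    Returns a list of (y, x, size) leaves; detail-rich areas get small leaves,
    homogeneous areas get large ones. A mask built from leaf sizes could then
    gate where refinement is spent, in the spirit of the quadtree guidance above.
    """
    size = min(img.shape[:2])
    leaves = []

    def split(y, x, s):
        patch = img[y:y + s, x:x + s]
        if s <= min_size or patch.var() < thr:
            leaves.append((y, x, s))        # homogeneous or minimal: stop splitting
            return
        half = s // 2
        for dy in (0, half):
            for dx in (0, half):
                split(y + dy, x + dx, half)

    split(0, 0, size)
    return leaves
```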

#ai
#open_source
Score · 2.80
SFMNet: Sparse Focal Modulation for 3D Object Detection
paper
arXiv cs.CV · 3 days ago

arXiv:2503.12093v2 Announce Type: replace Abstract: We propose SFMNet, a novel 3D sparse detector that combines the efficiency of sparse convolutions with the ability to model long-range dependencies. While traditional sparse convolution techniques efficiently capture local structures, they struggle with modeling long-range relationships. However, capturing long-range dependencies is fundamental for 3D object detection. In contrast, transformers are designed to capture these long-range dependencies through attention mechanisms. Yet they come with high computational costs due to their quadratic query-key-value interactions. Furthermore, directly applying attention to non-empty voxels is inefficient due to the sparse nature of 3D scenes. Our SFMNet is built on a novel Sparse Focal Modulation (SFM) module, which integrates short- and long-range contexts with linear complexity by leveraging a new hierarchical sparse convolution design. This approach enables SFMNet to achieve high detection performance with improved efficiency, making it well-suited for large-scale LiDAR scenes. We show that our detector achieves state-of-the-art performance on autonomous driving datasets.

Score · 2.80
Is clustering enough for LiDAR instance segmentation? A state-of-the-art training-free baseline
paper
arXiv cs.CV · 3 days ago

arXiv:2503.13203v3 Announce Type: replace Abstract: Panoptic segmentation of LiDAR point clouds is fundamental to outdoor scene understanding, with autonomous driving being a primary application. While state-of-the-art approaches typically rely on end-to-end deep learning architectures and extensive manual annotations of instances, the significant cost and time investment required for labeling large-scale point cloud datasets remains a major bottleneck in this field. In this work, we demonstrate that competitive panoptic segmentation can be achieved using only semantic labels, with instances predicted without any training or annotations. Our method outperforms most state-of-the-art supervised methods on standard benchmarks including SemanticKITTI and nuScenes, and outperforms every publicly available method on SemanticKITTI as a drop-in instance head replacement, while running in real-time on a single-threaded CPU and requiring no instance labels. It is fully explainable, and requires no learning or parameter tuning. Alpine combined with state-of-the-art semantic segmentation ranks first on the official panoptic segmentation leaderboard of SemanticKITTI. Code is available at https://github.com/valeoai/Alpine/
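
The "clustering is enough" recipe can be approximated in a few lines: take the semantic prediction, group the points of each "thing" class into spatial clusters, and call each cluster an instance. The DBSCAN call below is an illustrative stand-in for the paper's Alpine grouping rule; only the training-free, annotation-free structure of the idea is shown.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_instances(points: np.ndarray, sem_labels: np.ndarray,
                      thing_classes: set, eps: float = 0.8) -> np.ndarray:
    """Training-free instance IDs from semantic labels via per-class clustering.

    points: (N, 3) LiDAR coordinates; sem_labels: (N,) semantic class per point.
    Each 'thing' class is clustered independently; clusters become instances.
    DBSCAN stands in for whichever grouping rule the paper's instance head uses;
    the key point is that no instance labels or training are needed.
    """
    instance_ids = np.zeros(len(points), dtype=np.int64)
    next_id = 1
    for cls in thing_classes:
        idx = np.nonzero(sem_labels == cls)[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(points[idx])
        for k in np.unique(labels):
            if k == -1:                      # noise points keep instance id 0
                continue
            instance_ids[idx[labels == k]] = next_id
            next_id += 1
    return instance_ids
```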

#ai
#open_source
Score · 2.80
TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment
paper
arXiv cs.CV · 3 days ago

arXiv:2503.16929v3 Announce Type: replace Abstract: Video Large Language Models (Video LLMs) have achieved significant success by adopting the paradigm of large-scale pre-training followed by supervised fine-tuning (SFT). However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the data and over-reliance on the next-token prediction paradigm, which collectively result in the absence of temporal supervision. To address these limitations, we propose TEMPLE (TEMporal Preference LEarning), a systematic framework that enhances temporal reasoning capabilities through Direct Preference Optimization (DPO). To address temporal information scarcity in data, we introduce an automated pipeline for systematically constructing temporality-intensive preference pairs comprising three steps: selecting temporally rich videos, designing video-specific perturbation strategies, and evaluating model responses on clean and perturbed inputs. Complementing this data pipeline, we provide additional supervision signals via preference learning and propose a novel Progressive Pre-SFT Alignment strategy featuring two key innovations: a curriculum learning strategy which progressively increases perturbation difficulty to maximize data efficiency; and applying preference optimization before instruction tuning to incentivize fundamental temporal alignment. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. Our findings highlight TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
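
The preference-learning step itself is standard DPO. Assuming summed token log-probabilities for the response tied to the clean video (chosen) and to its perturbed counterpart (rejected), under both the policy and a frozen reference model, the objective is a few lines; the pair-construction pipeline described above is not shown.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on temporal preference pairs.

    All inputs are per-sample summed token log-probs. 'Chosen' responses come
    from temporally rich clean videos, 'rejected' ones from perturbed inputs,
    following the pair-construction pipeline sketched in the abstract above.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```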

#ai
#llm
Score · 2.80
Dereflection Any Image with Diffusion Priors and Diversified Data
paper
arXiv cs.CV · 3 days ago

arXiv:2503.17347v2 Announce Type: replace Abstract: Reflection removal of a single image remains a highly challenging task due to the complex entanglement between target scenes and unwanted reflections. Despite significant progress, existing methods are hindered by the scarcity of high-quality, diverse data and insufficient restoration priors, resulting in limited generalization across various real-world scenarios. In this paper, we propose Dereflection Any Image, a comprehensive solution with an efficient data preparation pipeline and a generalizable model for robust reflection removal. First, we introduce a dataset named Diverse Reflection Removal (DRR) created by randomly rotating reflective mediums in target scenes, enabling variation of reflection angles and intensities, and setting a new benchmark in scale, quality, and diversity. Second, we propose a diffusion-based framework with one-step diffusion for deterministic outputs and fast inference. To ensure stable learning, we design a three-stage progressive training strategy, including reflection-invariant finetuning to encourage consistent outputs across varying reflection patterns that characterize our dataset. Extensive experiments show that our method achieves SOTA performance on both common benchmarks and challenging in-the-wild images, showing superior generalization across diverse real-world scenes.

#ai
#research
Score · 2.80
GaussianFocus: Constrained Attention Focus for 3D Gaussian Splatting
paper
arXiv cs.CV · 3 days ago

arXiv:2503.17798v2 Announce Type: replace Abstract: Recent developments in 3D reconstruction and neural rendering have significantly propelled the capabilities of photo-realistic 3D scene rendering across various academic and industrial fields. The 3D Gaussian Splatting technique, alongside its derivatives, integrates the advantages of primitive-based and volumetric representations to deliver top-tier rendering quality and efficiency. Despite these advancements, the method tends to generate excessive redundant noisy Gaussians overfitted to every training view, which degrades the rendering quality. Additionally, while 3D Gaussian Splatting excels in small-scale and object-centric scenes, its application to larger scenes is hindered by constraints such as limited video memory, excessive optimization duration, and variable appearance across views. To address these challenges, we introduce GaussianFocus, an innovative approach that incorporates a patch attention algorithm to refine rendering quality and implements a Gaussian constraints strategy to minimize redundancy. Moreover, we propose a subdivision reconstruction strategy for large-scale scenes, dividing them into smaller, manageable blocks for individual training. Our results indicate that GaussianFocus significantly reduces unnecessary Gaussians and enhances rendering quality, surpassing existing State-of-The-Art (SoTA) methods. Furthermore, we demonstrate the capability of our approach to effectively manage and render large scenes, such as urban environments, whilst maintaining high fidelity in the visual output.

#ai
Score · 2.80
Improved tissue sodium concentration quantification in breast cancer by reducing partial volume effects: a preliminary study
paper
arXiv cs.CV · 3 days ago

arXiv:2503.19570v2 Announce Type: replace Abstract: Introduction: In sodium (23Na) magnetic resonance imaging (MRI), partial volume effects (PVE) are one of the most common causes of errors in the in vivo quantification of tissue sodium concentration (TSC). Advanced image reconstruction algorithms, such as compressed sensing (CS), have the potential to reduce PVE. Therefore, we investigated the feasibility of using CS-based methods to improve image quality and TSC quantification accuracy in patients with breast cancer. Subjects and methods: In this study, three healthy participants and 12 female participants with breast cancer were examined on a 7T MRI scanner. 23Na-MRI images were reconstructed using weighted total variation (wTV), directional total variation (dTV), anatomically guided total variation (AG-TV) and adaptive combine (ADC) methods. The consistency of tumor volume delineations based on sodium data was assessed using the Dice score, and TSC quantification was performed for various image reconstruction methods. Pearson's correlation coefficients were calculated to assess the relationships between wTV, dTV, AG-TV, and ADC values. Results: All methods provided breast MRI images with well-preserved sodium signal and tissue structures. The mean Dice scores for wTV, dTV, and AG-TV were 65%, 72%, and 75%, respectively. Average TSC values in breast tumors were 61.0, 72.0, 73.0, and 88.0 mmol/L for wTV, dTV, AG-TV, and ADC, respectively. A strong negative correlation was observed between wTV and dTV (r = -0.78, 95% CI [-0.94, -0.31], p = 0.0076), and a strong positive correlation was found between dTV and AG-TV (r = 0.71, 95% CI [0.16, 0.92], p = 0.0207). Conclusion: The results of this study showed that differences in tumor appearance and TSC estimations may depend on the type of image reconstruction and the parameters used. This is most likely due to differences in their ability to reduce PVE.
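
For reference, the Dice score used to compare tumor delineations across reconstructions is the standard overlap measure below (a generic implementation, not the study's analysis code).

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice overlap between two binary delineations (e.g., tumor masks)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
```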

#research
Score · 2.80
CamSAM2: Segment Anything Accurately in Camouflaged Videos
paper
arXiv cs.CV · 3 days ago

arXiv:2503.19730v3 Announce Type: replace Abstract: Video camouflaged object segmentation (VCOS), aiming at segmenting camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real-world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2's capability of segmenting camouflaged videos is suboptimal, especially when given simple prompts such as point and box. To address the problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2's ability to handle camouflaged scenes without modifying SAM2's parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high-quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with click prompt on MoCA-Mask and 19.6 mDice gains with mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone. The code is available at https://github.com/zhoustan/CamSAM2.

#ai
#product
#open_source
Score · 2.80
vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition
paper
arXiv cs.CV · 3 days ago

arXiv:2503.21262v2 Announce Type: replace Abstract: Capturing long-range dependencies efficiently is essential for visual recognition tasks, yet existing methods face limitations. Convolutional neural networks (CNNs) struggle with restricted receptive fields, while Vision Transformers (ViTs) achieve global context and long-range modeling at a high computational cost. State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. At its core is the Gamba bottleneck block, which includes the Gamba Cell, an adaptation of Mamba for 2D spatial structures, alongside a Multi-Head Self-Attention (MHSA) mechanism and a Gated Fusion Module for effective feature representation. The interplay of these components ensures that vGamba leverages the low computational demands of SSMs while maintaining the accuracy of attention mechanisms for modeling long-range dependencies in vision tasks. Additionally, the Fusion module enables seamless interaction between these components. Extensive experiments on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.

#ai
Score · 2.80
KernelDNA: Dynamic Kernel Sharing via Decoupled Naive Adapters
paper
arXiv cs.CV · 3 days ago

arXiv:2503.23379v2 Announce Type: replace Abstract: Dynamic convolution enhances model capacity by adaptively combining multiple kernels, yet faces critical trade-offs: prior works either (1) incur significant parameter overhead by scaling kernel numbers linearly, (2) compromise inference speed through complex kernel interactions, or (3) struggle to jointly optimize dynamic attention and static kernels. We observe that pre-trained Convolutional Neural Networks (CNNs) exhibit inter-layer redundancy akin to that in Large Language Models (LLMs). Specifically, dense convolutional layers can be efficiently replaced by derived "child" layers generated from a shared "parent" convolutional kernel through an adapter. To address these limitations and implement the weight-sharing mechanism, we propose a lightweight convolution kernel plug-in, named KernelDNA. It decouples kernel adaptation into input-dependent dynamic routing and pre-trained static modulation, ensuring both parameter efficiency and hardware-friendly inference. Unlike existing dynamic convolutions that expand parameters via multi-kernel ensembles, our method leverages cross-layer weight sharing and adapter-based modulation, enabling dynamic kernel specialization without altering the standard convolution structure. This design preserves the native computational efficiency of standard convolutions while enhancing representation power through input-adaptive kernel adjustments. Experiments on image classification and dense prediction tasks demonstrate that KernelDNA achieves a state-of-the-art accuracy-efficiency balance among dynamic convolution variants.
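
A simplified reading of the weight-sharing idea: each "child" layer reuses a frozen parent kernel and owns only a per-channel static modulation plus an input-dependent gate. The module below sketches that structure under those assumptions; the actual KernelDNA adapter and routing design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChildConv(nn.Module):
    """A 'child' conv layer that borrows a shared parent kernel via a tiny adapter.

    The parent weight is shared (and frozen here); the child owns only a
    per-output-channel static modulation and an input-dependent routing gate.
    This is a simplified illustration, not the exact KernelDNA design.
    """
    def __init__(self, parent_weight: torch.Tensor):
        super().__init__()
        out_c, in_c, k, _ = parent_weight.shape
        self.register_buffer('parent', parent_weight)       # shared, not trained here
        self.static_scale = nn.Parameter(torch.ones(out_c, 1, 1, 1))
        self.router = nn.Linear(in_c, out_c)                 # input-dependent gate
        self.padding = k // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.router(x.mean(dim=(2, 3))))   # (B, out_c)
        w = self.parent * self.static_scale                     # static modulation
        y = F.conv2d(x, w, padding=self.padding)
        return y * gate[:, :, None, None]                       # dynamic routing
```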

#ai
#llm
Score · 2.80
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
paper
arXiv cs.CV · 3 days ago

arXiv:2503.23447v2 Announce Type: replace Abstract: We propose Cross-Attention in Audio, Space, and Time (CA^2ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate the CAST on benchmarks with different characteristics, EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400, consistently showing balanced performance. We also validate the CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS. With a favorable performance of CAVA across these datasets, we demonstrate the effective information exchange among multiple experts within the B-CA module. In summary, CA^2ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.
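
The exchange between experts can be pictured as two plain cross-attention calls, one in each direction, as sketched below. The bottleneck-token construction of B-CA and the audio pathway are omitted, so this is only a structural illustration of how two expert streams trade information.

```python
import torch
import torch.nn as nn

class CrossAttentionExchange(nn.Module):
    """Two expert token streams exchange information via cross-attention.

    A simplified stand-in for a B-CA-style block: each expert queries the
    other's tokens and adds the result back to its own stream.
    """
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        a_upd, _ = self.b_to_a(a, b, b)      # expert A queries expert B
        b_upd, _ = self.a_to_b(b, a, a)      # expert B queries expert A
        return a + a_upd, b + b_upd
```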

Score · 2.80
PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
paper
arXiv cs.CV · 3 days ago

arXiv:2504.00844v2 Announce Type: replace Abstract: In Scene Graph Generation (SGG), structured representations are extracted from visual inputs as object nodes and connecting predicates, enabling image-based reasoning for diverse downstream tasks. While fully supervised SGG has improved steadily, it suffers from training bias due to limited curated data and long-tail predicate distributions, leading to poor predicate diversity and degraded downstream performance. We present PRISM-0, a zero-shot open-vocabulary SGG framework that leverages foundation models in a bottom-up pipeline to capture a broad spectrum of predicates. Detected object pairs are filtered, described via a Vision-Language Model (VLM), and processed by a Large Language Model (LLM) to generate fine- and coarse-grained predicates, which are then validated by a Visual Question Answering (VQA) model. PRISM-0's modular, dataset-independent design enriches existing SGG datasets such as Visual Genome and produces diverse, unbiased graphs. While operating entirely in a zero-shot setting, PRISM-0 achieves performance on par with state-of-the-art weakly-supervised models on SGG benchmarks and even state-of-the-art supervised methods in tasks such as Sentence-to-Graph Retrieval.

#ai
#llm
Score · 2.80
SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation
paper
arXiv cs.CV · 3 days ago

arXiv:2504.04519v4 Announce Type: replace Abstract: Inspired by Segment Anything 2, which generalizes segmentation from images to videos, we propose SAM2MOT, a novel segmentation-driven paradigm for multi-object tracking that breaks away from the conventional detection-association framework. In contrast to previous approaches that treat segmentation as auxiliary information, SAM2MOT places it at the heart of the tracking process, systematically tackling challenges like false positives and occlusions. Its effectiveness has been thoroughly validated on major MOT benchmarks. Furthermore, SAM2MOT integrates a pre-trained detector and a pre-trained segmentor with tracking logic into a zero-shot MOT system that requires no fine-tuning. This significantly reduces dependence on labeled data and paves the way for transitioning MOT research from task-specific solutions to general-purpose systems. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT. Code is available at https://github.com/TripleJoy/SAM2MOT.

#ai
#research
#open_source
Score · 2.80
Harnessing the Computation Redundancy in ViTs to Boost Adversarial Transferability
paper
arXiv cs.CV · 3 days ago

arXiv:2504.10804v3 Announce Type: replace Abstract: Vision Transformers (ViTs) have demonstrated impressive performance across a range of applications, including many safety-critical tasks. However, their unique architectural properties raise new challenges and opportunities in adversarial robustness. In particular, we observe that adversarial examples crafted on ViTs exhibit higher transferability compared to those crafted on CNNs, suggesting that ViTs contain structural characteristics favorable for transferable attacks. In this work, we investigate the role of computational redundancy in ViTs and its impact on adversarial transferability. Unlike prior studies that aim to reduce computation for efficiency, we propose to exploit this redundancy to improve the quality and transferability of adversarial examples. Through a detailed analysis, we identify two forms of redundancy, at the data level and the model level, that can be harnessed to amplify attack effectiveness. Building on this insight, we design a suite of techniques, including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training. Extensive experiments on the ImageNet-1k dataset validate the effectiveness of our approach, showing that our methods significantly outperform existing baselines in both transferability and generality across diverse model architectures.

#ai
Score · 2.80
Page 32 of 93