new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Dec 11

ED$^4$: Explicit Data-level Debiasing for Deepfake Detection

Learning intrinsic bias from limited data has been considered the main reason for the failure of deepfake detection with generalizability. Apart from the discovered content and specific-forgery bias, we reveal a novel spatial bias, where detectors inertly anticipate observing structural forgery clues appearing at the image center, also can lead to the poor generalization of existing methods. We present ED^4, a simple and effective strategy, to address aforementioned biases explicitly at the data level in a unified framework rather than implicit disentanglement via network design. In particular, we develop ClockMix to produce facial structure preserved mixtures with arbitrary samples, which allows the detector to learn from an exponentially extended data distribution with much more diverse identities, backgrounds, local manipulation traces, and the co-occurrence of multiple forgery artifacts. We further propose the Adversarial Spatial Consistency Module (AdvSCM) to prevent extracting features with spatial bias, which adversarially generates spatial-inconsistent images and constrains their extracted feature to be consistent. As a model-agnostic debiasing strategy, ED^4 is plug-and-play: it can be integrated with various deepfake detectors to obtain significant benefits. We conduct extensive experiments to demonstrate its effectiveness and superiority over existing deepfake detection approaches.

  • 7 authors
·
Aug 13, 2024

Multi-Dimensional Hyena for Spatial Inductive Bias

In recent years, Vision Transformers have attracted increasing interest from computer vision researchers. However, the advantage of these transformers over CNNs is only fully manifested when trained over a large dataset, mainly due to the reduced inductive bias towards spatial locality within the transformer's self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel generalization to multiple axes of the very recent Hyena layer. We propose several alternative approaches for obtaining this generalization and delve into their unique distinctions and considerations from both empirical and theoretical perspectives. Our empirical findings indicate that the proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT across multiple datasets. Furthermore, in the small dataset regime, our Hyena-based ViT is favorable to ViT variants from the recent literature that are specifically designed for solving the same challenge, i.e., working with small datasets or incorporating image-specific inductive bias into the self-attention mechanism. Finally, we show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.

  • 2 authors
·
Sep 24, 2023

LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers

Positional embeddings (PE) play a crucial role in Vision Transformers (ViTs) by providing spatial information otherwise lost due to the permutation invariant nature of self attention. While absolute positional embeddings (APE) have shown theoretical advantages over relative positional embeddings (RPE), particularly due to the ability of sinusoidal functions to preserve spatial inductive biases like monotonicity and shift invariance, a fundamental challenge arises when mapping a 2D grid to a 1D sequence. Existing methods have mostly overlooked or never explored the impact of patch ordering in positional embeddings. To address this, we propose LOOPE, a learnable patch-ordering method that optimizes spatial representation for a given set of frequencies, providing a principled approach to patch order optimization. Empirical results show that our PE significantly improves classification accuracy across various ViT architectures. To rigorously evaluate the effectiveness of positional embeddings, we introduce the "Three Cell Experiment", a novel benchmarking framework that assesses the ability of PEs to retain relative and absolute positional information across different ViT architectures. Unlike standard evaluations, which typically report a performance gap of 4 to 6% between models with and without PE, our method reveals a striking 30 to 35% difference, offering a more sensitive diagnostic tool to measure the efficacy of PEs. Our experimental analysis confirms that the proposed LOOPE demonstrates enhanced effectiveness in retaining both relative and absolute positional information.

  • 3 authors
·
Apr 19

SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes

The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between objects, remains underexplored in current 3D vision-language research. Existing datasets often mix semantic cues (e.g., object name) with spatial context, leading models to rely on superficial shortcuts rather than genuinely interpreting spatial relationships. To address this gap, we introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision language pairs across 900+ detailed indoor scenes from ScanNet++ v2, including more than 2.8k unique object classes. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object name, thereby mitigating shortcut biases in spatial understanding. These queries comprehensively cover various spatial reasoning skills, such as relative position, narrative perspective, parametric perspective, and absolute distance reasoning. Initial benchmarks demonstrate significant challenges for current state-of-the-art expert 3D visual grounding methods and 3D-LLMs, underscoring the necessity of our dataset and the accompanying 3D Spatial Reasoning Segmentation (3D-SRS) benchmark suite. Surprise3D and 3D-SRS aim to facilitate advancements in spatially aware AI, paving the way for effective embodied interaction and robotic planning. The code and datasets can be found in https://github.com/liziwennba/SUPRISE.

  • 9 authors
·
Jul 10

Spatial-frequency channels, shape bias, and adversarial robustness

What spatial frequency information do humans and neural networks use to recognize objects? In neuroscience, critical band masking is an established tool that can reveal the frequency-selective filters used for object recognition. Critical band masking measures the sensitivity of recognition performance to noise added at each spatial frequency. Existing critical band masking studies show that humans recognize periodic patterns (gratings) and letters by means of a spatial-frequency filter (or "channel'') that has a frequency bandwidth of one octave (doubling of frequency). Here, we introduce critical band masking as a task for network-human comparison and test 14 humans and 76 neural networks on 16-way ImageNet categorization in the presence of narrowband noise. We find that humans recognize objects in natural images using the same one-octave-wide channel that they use for letters and gratings, making it a canonical feature of human object recognition. On the other hand, the neural network channel, across various architectures and training strategies, is 2-4 times as wide as the human channel. In other words, networks are vulnerable to high and low frequency noise that does not affect human performance. Adversarial and augmented-image training are commonly used to increase network robustness and shape bias. Does this training align network and human object recognition channels? Three network channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (53% variance explained) and with robustness of adversarially-trained networks (74% variance explained). Adversarial training increases robustness but expands the channel bandwidth even further away from the human bandwidth. Thus, critical band masking reveals that the network channel is more than twice as wide as the human channel, and that adversarial training only increases this difference.

  • 4 authors
·
Sep 22, 2023

Benchmarking Spatial Relationships in Text-to-Image Generation

Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. To benchmark existing models, we introduce a dataset, SR_{2D}, that contains sentences describing two or more objects and the spatial relationships between them. We construct an automated evaluation pipeline to recognize objects and their spatial relationships, and employ it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them. Our analyses demonstrate several biases and artifacts of T2I models such as the difficulty with generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study that shows the alignment between VISOR and human judgement about spatial understanding. We offer the SR_{2D} dataset and the VISOR metric to the community in support of T2I reasoning research.

  • 8 authors
·
Dec 20, 2022

PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners

Spatiotemporal predictive learning methods generally fall into two categories: recurrent-based approaches, which face challenges in parallelization and performance, and recurrent-free methods, which employ convolutional neural networks (CNNs) as encoder-decoder architectures. These methods benefit from strong inductive biases but often at the expense of scalability and generalization. This paper proposes PredFormer, a pure transformer-based framework for spatiotemporal predictive learning. Motivated by the Vision Transformers (ViT) design, PredFormer leverages carefully designed Gated Transformer blocks, following a comprehensive analysis of 3D attention mechanisms, including full-, factorized-, and interleaved-spatial-temporal attention. With its recurrent-free, transformer-based design, PredFormer is both simple and efficient, significantly outperforming previous methods by large margins. Extensive experiments on synthetic and real-world datasets demonstrate that PredFormer achieves state-of-the-art performance. On Moving MNIST, PredFormer achieves a 51.3% reduction in MSE relative to SimVP. For TaxiBJ, the model decreases MSE by 33.1% and boosts FPS from 533 to 2364. Additionally, on WeatherBench, it reduces MSE by 11.1% while enhancing FPS from 196 to 404. These performance gains in both accuracy and efficiency demonstrate PredFormer's potential for real-world applications. The source code will be released at https://github.com/yyyujintang/PredFormer .

  • 6 authors
·
Oct 6, 2024

SpaRTAN: Spatial Reinforcement Token-based Aggregation Network for Visual Recognition

The resurgence of convolutional neural networks (CNNs) in visual recognition tasks, exemplified by ConvNeXt, has demonstrated their capability to rival transformer-based architectures through advanced training methodologies and ViT-inspired design principles. However, both CNNs and transformers exhibit a simplicity bias, favoring straightforward features over complex structural representations. Furthermore, modern CNNs often integrate MLP-like blocks akin to those in transformers, but these blocks suffer from significant information redundancies, necessitating high expansion ratios to sustain competitive performance. To address these limitations, we propose SpaRTAN, a lightweight architectural design that enhances spatial and channel-wise information processing. SpaRTAN employs kernels with varying receptive fields, controlled by kernel size and dilation factor, to capture discriminative multi-order spatial features effectively. A wave-based channel aggregation module further modulates and reinforces pixel interactions, mitigating channel-wise redundancies. Combining the two modules, the proposed network can efficiently gather and dynamically contextualize discriminative features. Experimental results in ImageNet and COCO demonstrate that SpaRTAN achieves remarkable parameter efficiency while maintaining competitive performance. In particular, on the ImageNet-1k benchmark, SpaRTAN achieves 77. 7% accuracy with only 3.8M parameters and approximately 1.0 GFLOPs, demonstrating its ability to deliver strong performance through an efficient design. On the COCO benchmark, it achieves 50.0% AP, surpassing the previous benchmark by 1.2% with only 21.5M parameters. The code is publicly available at [https://github.com/henry-pay/SpaRTAN].

  • 5 authors
·
Jul 15

SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs

We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, relative positions grounding, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. Our categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that single-object simpler tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2\%), task difficulty as measured by human response time shows strong correlation with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared across humans and VLMs. We believe SpinBench provides critical insights into spatial reasoning in VLMs and highlights key gaps in their ability to reason about physical space. Our website can be found at https://spinbench25.github.io/.

  • 5 authors
·
Sep 29

RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?

For the past ten years, CNN has reigned supreme in the world of computer vision, but recently, Transformer has been on the rise. However, the quadratic computational cost of self-attention has become a serious problem in practice applications. There has been much research on architectures without CNN and self-attention in this context. In particular, MLP-Mixer is a simple architecture designed using MLPs and hit an accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. This leaves open the possibility of incorporating a non-convolutional (or non-local) inductive bias into the architecture, so we used two simple ideas to incorporate inductive bias into the MLP-Mixer while taking advantage of its ability to capture global correlations. A way is to divide the token-mixing block vertically and horizontally. Another way is to make spatial correlations denser among some channels of token-mixing. With this approach, we were able to improve the accuracy of the MLP-Mixer while reducing its parameters and computational complexity. The small model that is RaftMLP-S is comparable to the state-of-the-art global MLP-based model in terms of parameters and efficiency per calculation. In addition, we tackled the problem of fixed input image resolution for global MLP-based models by utilizing bicubic interpolation. We demonstrated that these models could be applied as the backbone of architectures for downstream tasks such as object detection. However, it did not have significant performance and mentioned the need for MLP-specific architectures for downstream tasks for global MLP-based models. The source code in PyTorch version is available at https://github.com/okojoalg/raft-mlp.

  • 2 authors
·
Aug 9, 2021

EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision

We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, amplifying the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly-dynamic scenes self-sufficiently, without relying on ground truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation. Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly-dynamic settings.

  • 11 authors
·
Nov 3, 2023 1

DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation

We introduce a novel state-space architecture for diffusion models, effectively harnessing spatial and frequency information to enhance the inductive bias towards local features in input images for image generation tasks. While state-space networks, including Mamba, a revolutionary advancement in recurrent neural networks, typically scan input sequences from left to right, they face difficulties in designing effective scanning strategies, especially in the processing of image data. Our method demonstrates that integrating wavelet transformation into Mamba enhances the local structure awareness of visual inputs and better captures long-range relations of frequencies by disentangling them into wavelet subbands, representing both low- and high-frequency components. These wavelet-based outputs are then processed and seamlessly fused with the original Mamba outputs through a cross-attention fusion layer, combining both spatial and frequency information to optimize the order awareness of state-space models which is essential for the details and overall quality of image generation. Besides, we introduce a globally-shared transformer to supercharge the performance of Mamba, harnessing its exceptional power to capture global relationships. Through extensive experiments on standard benchmarks, our method demonstrates superior results compared to DiT and DIFFUSSM, achieving faster training convergence and delivering high-quality outputs. The codes and pretrained models are released at https://github.com/VinAIResearch/DiMSUM.git.

  • 6 authors
·
Nov 6, 2024

Non-Uniform Spatial Alignment Errors in sUAS Imagery From Wide-Area Disasters

This work presents the first quantitative study of alignment errors between small uncrewed aerial systems (sUAS) geospatial imagery and a priori building polygons and finds that alignment errors are non-uniform and irregular. The work also introduces a publicly available dataset of imagery, building polygons, and human-generated and curated adjustments that can be used to evaluate existing strategies for aligning building polygons with sUAS imagery. There are no efforts that have aligned pre-existing spatial data with sUAS imagery, and thus, there is no clear state of practice. However, this effort and analysis show that the translational alignment errors present in this type of data, averaging 82px and an intersection over the union of 0.65, which would induce further errors and biases in downstream machine learning systems unless addressed. This study identifies and analyzes the translational alignment errors of 21,619 building polygons in fifty-one orthomosaic images, covering 16787.2 Acres (26.23 square miles), constructed from sUAS raw imagery from nine wide-area disasters (Hurricane Ian, Hurricane Harvey, Hurricane Michael, Hurricane Ida, Hurricane Idalia, Hurricane Laura, the Mayfield Tornado, the Musset Bayou Fire, and the Kilauea Eruption). The analysis finds no uniformity among the angle and distance metrics of the building polygon alignments as they present an average degree variance of 0.4 and an average pixel distance variance of 0.45. This work alerts the sUAS community to the problem of spatial alignment and that a simple linear transform, often used to align satellite imagery, will not be sufficient to align spatial data in sUAS orthomosaic imagery.

  • 6 authors
·
May 10, 2024

Does resistance to style-transfer equal Global Shape Bias? Measuring network sensitivity to global shape configuration

Deep learning models are known to exhibit a strong texture bias, while human tends to rely heavily on global shape structure for object recognition. The current benchmark for evaluating a model's global shape bias is a set of style-transferred images with the assumption that resistance to the attack of style transfer is related to the development of global structure sensitivity in the model. In this work, we show that networks trained with style-transfer images indeed learn to ignore style, but its shape bias arises primarily from local detail. We provide a Disrupted Structure Testbench (DiST) as a direct measurement of global structure sensitivity. Our test includes 2400 original images from ImageNet-1K, each of which is accompanied by two images with the global shapes of the original image disrupted while preserving its texture via the texture synthesis program. We found that black{(1) models that performed well on the previous cue-conflict dataset do not fare well in the proposed DiST; (2) the supervised trained Vision Transformer (ViT) lose its global spatial information from positional embedding, leading to no significant advantages over Convolutional Neural Networks (CNNs) on DiST. While self-supervised learning methods, especially mask autoencoder significantly improves the global structure sensitivity of ViT. (3) Improving the global structure sensitivity is orthogonal to resistance to style-transfer, indicating that the relationship between global shape structure and local texture detail is not an either/or relationship. Training with DiST images and style-transferred images are complementary, and can be combined to train network together to enhance the global shape sensitivity and robustness of local features.} Our code will be hosted in github: https://github.com/leelabcnbc/DiST

  • 4 authors
·
Oct 11, 2023

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Models

Recently, DALL-E, a multimodal transformer language model, and its variants (including diffusion models) have shown high-quality text-to-image generation capabilities. However, despite the interesting image generation results, there has not been a detailed analysis on how to evaluate such models. In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models. First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding. For this, we propose PaintSkills, a compositional diagnostic dataset and evaluation toolkit that measures these skills. In our experiments, there exists a large gap between the performance of recent text-to-image models and the upper bound accuracy in object counting and spatial relation understanding skills. Second, we assess gender and skin tone biases by measuring the variance of the gender/skin tone distribution based on automated and human evaluation. We demonstrate that recent text-to-image models learn specific gender/skin tone biases from web image-text pairs. We hope that our work will help guide future progress in improving text-to-image generation models on visual reasoning skills and learning socially unbiased representations. Code and data: https://github.com/j-min/DallEval

  • 3 authors
·
Feb 8, 2022

How to Detect Network Dependence in Latent Factor Models? A Bias-Corrected CD Test

In a recent paper Juodis and Reese (2022) (JR) show that the application of the CD test proposed by Pesaran (2004) to residuals from panels with latent factors results in over-rejection. They propose a randomized test statistic to correct for over-rejection, and add a screening component to achieve power. This paper considers the same problem but from a different perspective, and shows that the standard CD test remains valid if the latent factors are weak in the sense the strength is less than half. In the case where latent factors are strong, we propose a bias-corrected version, CD*, which is shown to be asymptotically standard normal under the null of error cross-sectional independence and have power against network type alternatives. This result is shown to hold for pure latent factor models as well as for panel regression models with latent factors. The case where the errors are serially correlated is also considered. Small sample properties of the CD* test are investigated by Monte Carlo experiments and are shown to have the correct size for strong and weak factors as well as for Gaussian and non-Gaussian errors. In contrast, it is found that JR's test tends to over-reject in the case of panels with non-Gaussian errors, and has low power against spatial network alternatives. In an empirical application, using the CD* test, it is shown that there remains spatial error dependence in a panel data model for real house price changes across 377 Metropolitan Statistical Areas in the U.S., even after the effects of latent factors are filtered out.

  • 2 authors
·
Sep 1, 2021

Social Biases through the Text-to-Image Generation Lens

Text-to-Image (T2I) generation is enabling new applications that support creators, designers, and general end users of productivity software by generating illustrative content with high photorealism starting from a given descriptive text as a prompt. Such models are however trained on massive amounts of web data, which surfaces the peril of potential harmful biases that may leak in the generation process itself. In this paper, we take a multi-dimensional approach to studying and quantifying common social biases as reflected in the generated images, by focusing on how occupations, personality traits, and everyday situations are depicted across representations of (perceived) gender, age, race, and geographical location. Through an extensive set of both automated and human evaluation experiments we present findings for two popular T2I models: DALLE-v2 and Stable Diffusion. Our results reveal that there exist severe occupational biases of neutral prompts majorly excluding groups of people from results for both models. Such biases can get mitigated by increasing the amount of specification in the prompt itself, although the prompting mitigation will not address discrepancies in image quality or other usages of the model or its representations in other scenarios. Further, we observe personality traits being associated with only a limited set of people at the intersection of race, gender, and age. Finally, an analysis of geographical location representations on everyday situations (e.g., park, food, weddings) shows that for most situations, images generated through default location-neutral prompts are closer and more similar to images generated for locations of United States and Germany.

  • 2 authors
·
Mar 30, 2023

Training for temporal sparsity in deep neural networks, application in video processing

Activation sparsity improves compute efficiency and resource utilization in sparsity-aware neural network accelerators. As the predominant operation in DNNs is multiply-accumulate (MAC) of activations with weights to compute inner products, skipping operations where (at least) one of the two operands is zero can make inference more efficient in terms of latency and power. Spatial sparsification of activations is a popular topic in DNN literature and several methods have already been established to bias a DNN for it. On the other hand, temporal sparsity is an inherent feature of bio-inspired spiking neural networks (SNNs), which neuromorphic processing exploits for hardware efficiency. Introducing and exploiting spatio-temporal sparsity, is a topic much less explored in DNN literature, but in perfect resonance with the trend in DNN, to shift from static signal processing to more streaming signal processing. Towards this goal, in this paper we introduce a new DNN layer (called Delta Activation Layer), whose sole purpose is to promote temporal sparsity of activations during training. A Delta Activation Layer casts temporal sparsity into spatial activation sparsity to be exploited when performing sparse tensor multiplications in hardware. By employing delta inference and ``the usual'' spatial sparsification heuristics during training, the resulting model learns to exploit not only spatial but also temporal activation sparsity (for a given input data distribution). One may use the Delta Activation Layer either during vanilla training or during a refinement phase. We have implemented Delta Activation Layer as an extension of the standard Tensoflow-Keras library, and applied it to train deep neural networks on the Human Action Recognition (UCF101) dataset. We report an almost 3x improvement of activation sparsity, with recoverable loss of model accuracy after longer training.

  • 2 authors
·
Jul 15, 2021

Error-Driven Scene Editing for 3D Grounding in Large Language Models

Despite recent progress in 3D-LLMs, they remain limited in accurately grounding language to visual and spatial elements in 3D environments. This limitation stems in part from training data that focuses on language reasoning rather than spatial understanding due to scarce 3D resources, leaving inherent grounding biases unresolved. To address this, we propose 3D scene editing as a key mechanism to generate precise visual counterfactuals that mitigate these biases through fine-grained spatial manipulation, without requiring costly scene reconstruction or large-scale 3D data collection. Furthermore, to make these edits targeted and directly address the specific weaknesses of the model, we introduce DEER-3D, an error-driven framework following a structured "Decompose, Diagnostic Evaluation, Edit, and Re-train" workflow, rather than broadly or randomly augmenting data as in conventional approaches. Specifically, upon identifying a grounding failure of the 3D-LLM, our framework first diagnoses the exact predicate-level error (e.g., attribute or spatial relation). It then executes minimal, predicate-aligned 3D scene edits, such as recoloring or repositioning, to produce targeted counterfactual supervision for iterative model fine-tuning, significantly enhancing grounding accuracy. We evaluate our editing pipeline across multiple benchmarks for 3D grounding and scene understanding tasks, consistently demonstrating improvements across all evaluated datasets through iterative refinement. DEER-3D underscores the effectiveness of targeted, error-driven scene editing in bridging linguistic reasoning capabilities with spatial grounding in 3D LLMs.

Coordinate-Aware Modulation for Neural Fields

Neural fields, mapping low-dimensional input coordinates to corresponding signals, have shown promising results in representing various signals. Numerous methodologies have been proposed, and techniques employing MLPs and grid representations have achieved substantial success. MLPs allow compact and high expressibility, yet often suffer from spectral bias and slow convergence speed. On the other hand, methods using grids are free from spectral bias and achieve fast training speed, however, at the expense of high spatial complexity. In this work, we propose a novel way for exploiting both MLPs and grid representations in neural fields. Unlike the prevalent methods that combine them sequentially (extract features from the grids first and feed them to the MLP), we inject spectral bias-free grid representations into the intermediate features in the MLP. More specifically, we suggest a Coordinate-Aware Modulation (CAM), which modulates the intermediate features using scale and shift parameters extracted from the grid representations. This can maintain the strengths of MLPs while mitigating any remaining potential biases, facilitating the rapid learning of high-frequency components. In addition, we empirically found that the feature normalizations, which have not been successful in neural filed literature, proved to be effective when applied in conjunction with the proposed CAM. Experimental results demonstrate that CAM enhances the performance of neural representation and improves learning stability across a range of signals. Especially in the novel view synthesis task, we achieved state-of-the-art performance with the least number of parameters and fast training speed for dynamic scenes and the best performance under 1MB memory for static scenes. CAM also outperforms the best-performing video compression methods using neural fields by a large margin.

  • 5 authors
·
Nov 25, 2023

PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval

Remote sensing image-text retrieval plays a crucial role in remote sensing interpretation, yet remains challenging under both closed-domain and open-domain scenarios due to semantic noise and domain shifts. To address these issues, we propose a visual prior-guided vision-language model, PriorCLIP, which leverages visual priors for unbiased representation learning and adaptive vision-language alignment. In the closed-domain setting, PriorCLIP introduces two Progressive Attention Encoder (PAE) structures: Spatial-PAE constructs a belief matrix with instruction embeddings to filter key features and mitigate semantic bias. At the same time, Temporal-PAE exploits cyclic activation across time steps to enhance text representation. For the open-domain setting, we design a two-stage prior representation learning strategy, consisting of large-scale pre-training on coarse-grained image-text pairs, followed by fine-tuning on fine-grained pairs using vision-instruction, which enables robust retrieval across long-tail concepts and vocabulary shifts. Furthermore, a cluster-based symmetric contrastive Attribution Loss is proposed to constrain inter-class relations and alleviate semantic confusion in the shared embedding space. Extensive experiments on RSICD and RSITMD benchmarks demonstrate that PriorCLIP achieves substantial improvements, outperforming existing methods by 4.9% and 4.0% in closed-domain retrieval, and by 7.3% and 9.4% in open-domain retrieval, respectively.

  • 5 authors
·
May 16, 2024

Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a ktimes k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20times faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2times faster than the previous best-performing model without relying on vision foundation modules (\eg, DINOv2) or advanced guidance interval sampling.

  • 6 authors
·
Feb 27 2

Emergent Properties of Foveated Perceptual Systems

The goal of this work is to characterize the representational impact that foveation operations have for machine vision systems, inspired by the foveated human visual system, which has higher acuity at the center of gaze and texture-like encoding in the periphery. To do so, we introduce models consisting of a first-stage fixed image transform followed by a second-stage learnable convolutional neural network, and we varied the first stage component. The primary model has a foveated-textural input stage, which we compare to a model with foveated-blurred input and a model with spatially-uniform blurred input (both matched for perceptual compression), and a final reference model with minimal input-based compression. We find that: 1) the foveated-texture model shows similar scene classification accuracy as the reference model despite its compressed input, with greater i.i.d. generalization than the other models; 2) the foveated-texture model has greater sensitivity to high-spatial frequency information and greater robustness to occlusion, w.r.t the comparison models; 3) both the foveated systems, show a stronger center image-bias relative to the spatially-uniform systems even with a weight sharing constraint. Critically, these results are preserved over different classical CNN architectures throughout their learning dynamics. Altogether, this suggests that foveation with peripheral texture-based computations yields an efficient, distinct, and robust representational format of scene information, and provides symbiotic computational insight into the representational consequences that texture-based peripheral encoding may have for processing in the human visual system, while also potentially inspiring the next generation of computer vision models via spatially-adaptive computation. Code + Data available here: https://github.com/ArturoDeza/EmergentProperties

  • 2 authors
·
Jun 14, 2020

E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization

The estimation of optical flow and 6-DoF ego-motion, two fundamental tasks in 3D vision, has typically been addressed independently. For neuromorphic vision (e.g., event cameras), however, the lack of robust data association makes solving the two problems separately an ill-posed challenge, especially in the absence of supervision via ground truth. Existing works mitigate this ill-posedness by either enforcing the smoothness of the flow field via an explicit variational regularizer or leveraging explicit structure-and-motion priors in the parametrization to improve event alignment. The former notably introduces bias in results and computational overhead, while the latter, which parametrizes the optical flow in terms of the scene depth and the camera motion, often converges to suboptimal local minima. To address these issues, we propose an unsupervised framework that jointly optimizes egomotion and optical flow via implicit spatial-temporal and geometric regularization. First, by modeling camera's egomotion as a continuous spline and optical flow as an implicit neural representation, our method inherently embeds spatial-temporal coherence through inductive biases. Second, we incorporate structure-and-motion priors through differential geometric constraints, bypassing explicit depth estimation while maintaining rigorous geometric consistency. As a result, our framework (called E-MoFlow) unifies egomotion and optical flow estimation via implicit regularization under a fully unsupervised paradigm. Experiments demonstrate its versatility to general 6-DoF motion scenarios, achieving state-of-the-art performance among unsupervised methods and competitive even with supervised approaches.

  • 6 authors
·
Oct 14

Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is concluded to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, that is, the spatial relevance and diverse channel representation. First, on spatial aspect, objects are locally compact and relevant, thus fine-grained feature needs to be extracted from a token and its neighbors. While the lack of data hinders ViTs to attend the spatial relevance. Second, on channel aspect, representation exhibits diversity on different channels. But the scarce data can not enable ViTs to learn strong enough representation for accurate recognition. To this end, we propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases. On spatial aspect, we adopt a hybrid structure, in which convolution is integrated into patch embedding and multi-layer perceptron module, forcing the model to capture the token features as well as their neighboring features. On channel aspect, we introduce a dynamic feature aggregation module in MLP and a brand new "head token" design in multi-head self-attention module to help re-calibrate channel representation and make different channel group representation interacts with each other. The fusion of weak channel representation forms a strong enough representation for classification. With this design, we successfully eliminate the performance gap between CNNs and ViTs, and our DHVT achieves a series of state-of-the-art performance with a lightweight model, 85.68% on CIFAR-100 with 22.8M parameters, 82.3% on ImageNet-1K with 24.0M parameters. Code is available at https://github.com/ArieSeirack/DHVT.

  • 4 authors
·
Oct 12, 2022

Punching Bag vs. Punching Person: Motion Transferability in Videos

Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action "punching" when presented with an unseen variation such as "punching person"? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. Our analysis reveals: 1) Multimodal models struggle more with fine-grained unknown actions than with coarse ones; 2) The bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition. Datasets and relevant code: https://github.com/raiyaan-abdullah/Motion-Transfer.

  • 5 authors
·
Jul 31

HRVMamba: High-Resolution Visual State Space Model for Dense Prediction

Recently, State Space Models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have demonstrated significant potential in computer vision tasks due to their linear computational complexity with respect to token length and their global receptive field. However, Mamba's performance on dense prediction tasks, including human pose estimation and semantic segmentation, has been constrained by three key challenges: insufficient inductive bias, long-range forgetting, and low-resolution output representation. To address these challenges, we introduce the Dynamic Visual State Space (DVSS) block, which utilizes multi-scale convolutional kernels to extract local features across different scales and enhance inductive bias, and employs deformable convolution to mitigate the long-range forgetting problem while enabling adaptive spatial aggregation based on input and task-specific information. By leveraging the multi-resolution parallel design proposed in HRNet, we introduce High-Resolution Visual State Space Model (HRVMamba) based on the DVSS block, which preserves high-resolution representations throughout the entire process while promoting effective multi-scale feature learning. Extensive experiments highlight HRVMamba's impressive performance on dense prediction tasks, achieving competitive results against existing benchmark models without bells and whistles. Code is available at https://github.com/zhanghao5201/HRVMamba.

  • 6 authors
·
Oct 4, 2024

Eliminating Position Bias of Language Models: A Mechanistic Approach

Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. Specifically, we find that causal attention generally causes models to favor distant content, while relative positional encodings like RoPE prefer nearby ones based on the analysis of retrieval-augmented question answering (QA). Further, our empirical study on object detection reveals that position bias is also present in vision-language models (VLMs). Based on the above analyses, we propose to ELIMINATE position bias caused by different input segment orders (e.g., options in LM-as-a-judge, retrieved documents in QA) in a TRAINING-FREE ZERO-SHOT manner. Our method changes the causal attention to bidirectional attention between segments and utilizes model attention values to decide the relative orders of segments instead of using the order provided in input prompts, therefore enabling Position-INvariant inferencE (PINE) at the segment level. By eliminating position bias, models achieve better performance and reliability in downstream tasks where position bias widely exists, such as LM-as-a-judge and retrieval-augmented QA. Notably, PINE is especially useful when adapting LMs for evaluating reasoning pairs: it consistently provides 8 to 10 percentage points performance gains in most cases, and makes Llama-3-70B-Instruct perform even better than GPT-4-0125-preview on the RewardBench reasoning subset.

  • 9 authors
·
Jul 1, 2024 1

A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge

Negative bias refers to the tendency of large language models (LLMs) to excessively generate negative responses in binary decision tasks (e.g., yes-no question answering). Previous research has focused on detecting and addressing negative attention heads that induce negative bias. However, the underlying detailed factors influencing negative bias remain underexplored. In this paper, we demonstrate that LLMs exhibit format-level negative bias, meaning the prompt format more influences their responses than the semantics of the negative response. For the fine-grained study of the negative bias, we introduce a pipeline for constructing the evaluation set, which systematically categorizes the dataset into three subsets based on the model's parametric knowledge: correct, incorrect, and insufficient relevant knowledge. Through analysis of this evaluation set, we identify a shortcut behavior in which models tend to generate negative responses when they lack sufficient knowledge to answer a yes-no question, leading to negative bias. We further examine how negative bias changes under various prompting scenarios related to parametric knowledge. We observe that providing relevant context and offering an "I don't know" option generally reduces negative bias, whereas chain-of-thought prompting tends to amplify the bias. Finally, we demonstrate that the degree of negative bias can vary depending on the type of prompt, which influences the direction of the response. Our work reveals the various factors that influence negative bias, providing critical insights for mitigating it in LLMs.

  • 3 authors
·
Nov 13

Assessing Social and Intersectional Biases in Contextualized Word Representations

Social bias in machine learning has drawn significant attention, with work ranging from demonstrations of bias in a multitude of applications, curating definitions of fairness for different contexts, to developing algorithms to mitigate bias. In natural language processing, gender bias has been shown to exist in context-free word embeddings. Recently, contextual word representations have outperformed word embeddings in several downstream NLP tasks. These word representations are conditioned on their context within a sentence, and can also be used to encode the entire sentence. In this paper, we analyze the extent to which state-of-the-art models for contextual word representations, such as BERT and GPT-2, encode biases with respect to gender, race, and intersectional identities. Towards this, we propose assessing bias at the contextual word level. This novel approach captures the contextual effects of bias missing in context-free word embeddings, yet avoids confounding effects that underestimate bias at the sentence encoding level. We demonstrate evidence of bias at the corpus level, find varying evidence of bias in embedding association tests, show in particular that racial bias is strongly encoded in contextual word models, and observe that bias effects for intersectional minorities are exacerbated beyond their constituent minority identities. Further, evaluating bias effects at the contextual word level captures biases that are not captured at the sentence level, confirming the need for our novel approach.

  • 2 authors
·
Nov 4, 2019

Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space Perspective

Deep neural networks (DNNs) often rely on easy-to-learn discriminatory features, or cues, that are not necessarily essential to the problem at hand. For example, ducks in an image may be recognized based on their typical background scenery, such as lakes or streams. This phenomenon, also known as shortcut learning, is emerging as a key limitation of the current generation of machine learning models. In this work, we introduce a set of experiments to deepen our understanding of shortcut learning and its implications. We design a training setup with several shortcut cues, named WCST-ML, where each cue is equally conducive to the visual recognition problem at hand. Even under equal opportunities, we observe that (1) certain cues are preferred to others, (2) solutions biased to the easy-to-learn cues tend to converge to relatively flat minima on the loss surface, and (3) the solutions focusing on those preferred cues are far more abundant in the parameter space. We explain the abundance of certain cues via their Kolmogorov (descriptional) complexity: solutions corresponding to Kolmogorov-simple cues are abundant in the parameter space and are thus preferred by DNNs. Our studies are based on the synthetic dataset DSprites and the face dataset UTKFace. In our WCST-ML, we observe that the inborn bias of models leans toward simple cues, such as color and ethnicity. Our findings emphasize the importance of active human intervention to remove the inborn model biases that may cause negative societal impacts.

  • 5 authors
·
Oct 6, 2021

Visual Search Asymmetry: Deep Nets and Humans Share Similar Inherent Biases

Visual search is a ubiquitous and often challenging daily task, exemplified by looking for the car keys at home or a friend in a crowd. An intriguing property of some classical search tasks is an asymmetry such that finding a target A among distractors B can be easier than finding B among A. To elucidate the mechanisms responsible for asymmetry in visual search, we propose a computational model that takes a target and a search image as inputs and produces a sequence of eye movements until the target is found. The model integrates eccentricity-dependent visual recognition with target-dependent top-down cues. We compared the model against human behavior in six paradigmatic search tasks that show asymmetry in humans. Without prior exposure to the stimuli or task-specific training, the model provides a plausible mechanism for search asymmetry. We hypothesized that the polarity of search asymmetry arises from experience with the natural environment. We tested this hypothesis by training the model on augmented versions of ImageNet where the biases of natural images were either removed or reversed. The polarity of search asymmetry disappeared or was altered depending on the training protocol. This study highlights how classical perceptual properties can emerge in neural network models, without the need for task-specific training, but rather as a consequence of the statistical properties of the developmental diet fed to the model. All source code and data are publicly available at https://github.com/kreimanlab/VisualSearchAsymmetry.

  • 5 authors
·
Jun 5, 2021

Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective

Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We attribute this gap to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. To address this, we present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints. ReMindView-Bench systematically varies viewpoint spatial pattern and query type to probe key factors of spatial cognition. Evaluations of 15 current VLMs reveals consistent failures in cross-view alignment and perspective-taking in multi-view spatial reasoning, motivating deeper analysis on the reasoning process. Explicit phase-wise analysis using LLM-as-a-judge and self-consistency prompting shows that VLMs perform well on in-frame perception but degrade sharply when integrating information across views. Implicit analysis, including linear probing and entropy dynamics, further show progressive loss of task-relevant information and uncertainty separation between correct and incorrect trajectories. These results provide a cognitively grounded diagnosis of VLM spatial reasoning and reveal how multi-view spatial mental models are formed, degraded and destabilized across reasoning phases. The ReMindView-Bench benchmark is available at https://huggingface.co/datasets/Xue0823/ReMindView-Bench, and the source codes of benchmark construction and VLM reasoning analysis are available at https://github.com/pittisl/ReMindView-Bench.

  • 6 authors
·
Dec 1

Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability

Diffusion models have demonstrated impressive capabilities in synthesizing diverse content. However, despite their high-quality outputs, these models often perpetuate social biases, including those related to gender and race. These biases can potentially contribute to harmful real-world consequences, reinforcing stereotypes and exacerbating inequalities in various social contexts. While existing research on diffusion bias mitigation has predominantly focused on guiding content generation, it often neglects the intrinsic mechanisms within diffusion models that causally drive biased outputs. In this paper, we investigate the internal processes of diffusion models, identifying specific decision-making mechanisms, termed bias features, embedded within the model architecture. By directly manipulating these features, our method precisely isolates and adjusts the elements responsible for bias generation, permitting granular control over the bias levels in the generated content. Through experiments on both unconditional and conditional diffusion models across various social bias attributes, we demonstrate our method's efficacy in managing generation distribution while preserving image quality. We also dissect the discovered model mechanism, revealing different intrinsic features controlling fine-grained aspects of generation, boosting further research on mechanistic interpretability of diffusion models.

  • 8 authors
·
Mar 26

Quantifying Bias in Text-to-Image Generative Models

Bias in text-to-image (T2I) models can propagate unfair social representations and may be used to aggressively market ideas or push controversial agendas. Existing T2I model bias evaluation methods only focus on social biases. We look beyond that and instead propose an evaluation methodology to quantify general biases in T2I generative models, without any preconceived notions. We assess four state-of-the-art T2I models and compare their baseline bias characteristics to their respective variants (two for each), where certain biases have been intentionally induced. We propose three evaluation metrics to assess model biases including: (i) Distribution bias, (ii) Jaccard hallucination and (iii) Generative miss-rate. We conduct two evaluation studies, modelling biases under general, and task-oriented conditions, using a marketing scenario as the domain for the latter. We also quantify social biases to compare our findings to related works. Finally, our methodology is transferred to evaluate captioned-image datasets and measure their bias. Our approach is objective, domain-agnostic and consistently measures different forms of T2I model biases. We have developed a web application and practical implementation of what has been proposed in this work, which is at https://huggingface.co/spaces/JVice/try-before-you-bias. A video series with demonstrations is available at https://www.youtube.com/channel/UCk-0xyUyT0MSd_hkp4jQt1Q

  • 4 authors
·
Dec 20, 2023

A Closer Look at AUROC and AUPRC under Class Imbalance

In machine learning (ML), a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior in cases of class imbalance and might even be a harmful metric, given its inclination to unduly favor model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, a thorough review of existing ML literature was conducted, utilizing large language models to analyze over 1.5 million papers from arXiv. Our investigation focused on the prevalence and substantiation of the purported AUPRC superiority. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC's supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need.

  • 5 authors
·
Jan 11, 2024

Chinese vs. World Bank Development Projects: Insights from Earth Observation and Computer Vision on Wealth Gains in Africa, 2002-2013

Debates about whether development projects improve living conditions persist, partly because observational estimates can be biased by incomplete adjustment and because reliable outcome data are scarce at the neighborhood level. We address both issues in a continent-scale, sector-specific evaluation of Chinese and World Bank projects across 9,899 neighborhoods in 36 African countries (2002 to 2013), representative of 88% of the population. First, we use a recent dataset that measures living conditions with a machine-learned wealth index derived from contemporaneous satellite imagery, yielding a consistent panel of 6.7 km square mosaics. Second, to strengthen identification, we proxy officials' map-based placement criteria using pre-treatment daytime satellite images and fuse these with rich tabular covariates to estimate funder- and sector-specific ATEs via inverse-probability weighting. Incorporating imagery systematically shrinks effects relative to tabular-only models, indicating prior work likely overstated benefits. On average, both donors raise wealth, with larger gains for China; sector extremes in our sample include Trade and Tourism for the World Bank (+6.27 IWI points), and Emergency Response for China (+14.32). Assignment-mechanism analyses show World Bank placement is generally more predictable from imagery alone, as well as from tabular covariates. This suggests that Chinese project placements are more driven by non-visible, political, or event-driven factors than World Bank placements. To probe residual concerns about selection on observables, we also estimate within-neighborhood (unit) fixed-effects models at a spatial resolution about 450 times finer than prior fixed effects analyses, leveraging the computer-vision-imputed IWI panels; these deliver smaller but directionally consistent effects.

CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

Text-to-image diffusion models excel at generating photorealistic images, but commonly struggle to render accurate spatial relationships described in text prompts. We identify two core issues underlying this common failure: 1) the ambiguous nature of spatial-related data in existing datasets, and 2) the inability of current text encoders to accurately interpret the spatial semantics of input descriptions. We address these issues with CoMPaSS, a versatile training framework that enhances spatial understanding of any T2I diffusion model. CoMPaSS solves the ambiguity of spatial-related data with the Spatial Constraints-Oriented Pairing (SCOP) data engine, which curates spatially-accurate training data through a set of principled spatial constraints. To better exploit the curated high-quality spatial priors, CoMPaSS further introduces a Token ENcoding ORdering (TENOR) module to allow better exploitation of high-quality spatial priors, effectively compensating for the shortcoming of text encoders. Extensive experiments on four popular open-weight T2I diffusion models covering both UNet- and MMDiT-based architectures demonstrate the effectiveness of CoMPaSS by setting new state-of-the-arts with substantial relative gains across well-known benchmarks on spatial relationships generation, including VISOR (+98%), T2I-CompBench Spatial (+67%), and GenEval Position (+131%). Code will be available at https://github.com/blurgyy/CoMPaSS.

  • 8 authors
·
Dec 17, 2024

Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning

Vision language models (VLMs) have demonstrated impressive performance across a wide range of downstream tasks. However, their proficiency in spatial reasoning remains limited, despite its crucial role in tasks involving navigation and interaction with physical environments. Specifically, most of these tasks rely on the core spatial reasoning capabilities in two-dimensional (2D) environments, and our evaluation reveals that state-of-the-art VLMs frequently generate implausible and incorrect responses to composite spatial reasoning problems, including simple pathfinding tasks that humans can solve effortlessly at a glance. To address this, we explore an effective approach to enhance 2D spatial reasoning within VLMs by training the model solely on basic spatial capabilities. We begin by disentangling the key components of 2D spatial reasoning: direction comprehension, distance estimation, and localization. Our central hypothesis is that mastering these basic spatial capabilities can significantly enhance a model's performance on composite spatial tasks requiring advanced spatial understanding and combinatorial problem-solving, with generalized improvements in visual-spatial tasks. To investigate this hypothesis, we introduce Sparkle, a framework that fine-tunes VLMs on these three basic spatial capabilities by synthetic data generation and targeted supervision to form an instruction dataset for each capability. Our experiments demonstrate that VLMs fine-tuned with Sparkle achieve significant performance gains, not only in the basic tasks themselves but also in generalizing to composite and out-of-distribution spatial reasoning tasks. These findings underscore the effectiveness of mastering basic spatial capabilities in enhancing composite spatial problem-solving, offering insights into systematic strategies for improving VLMs' spatial reasoning capabilities.

  • 10 authors
·
Oct 21, 2024

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model's overall spatial reasoning ability. Extensive experiments over massive MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.

  • 5 authors
·
Nov 26

Latent Compass: Creation by Navigation

In Marius von Senden's Space and Sight, a newly sighted blind patient describes the experience of a corner as lemon-like, because corners "prick" sight like lemons prick the tongue. Prickliness, here, is a dimension in the feature space of sensory experience, an effect of the perceived on the perceiver that arises where the two interact. In the account of the newly sighted, an effect familiar from one interaction translates to a novel context. Perception serves as the vehicle for generalization, in that an effect shared across different experiences produces a concrete abstraction grounded in those experiences. Cezanne and the post-impressionists, fluent in the language of experience translation, realized that the way to paint a concrete form that best reflected reality was to paint not what they saw, but what it was like to see. We envision a future of creation using AI where what it is like to see is replicable, transferrable, manipulable - part of the artist's palette that is both grounded in a particular context, and generalizable beyond it. An active line of research maps human-interpretable features onto directions in GAN latent space. Supervised and self-supervised approaches that search for anticipated directions or use off-the-shelf classifiers to drive image manipulation in embedding space are limited in the variety of features they can uncover. Unsupervised approaches that discover useful new directions show that the space of perceptually meaningful directions is nowhere close to being fully mapped. As this space is broad and full of creative potential, we want tools for direction discovery that capture the richness and generalizability of human perception. Our approach puts creators in the discovery loop during real-time tool use, in order to identify directions that are perceptually meaningful to them, and generate interpretable image translations along those directions.

  • 3 authors
·
Dec 19, 2020

Evaluate Bias without Manual Test Sets: A Concept Representation Perspective for LLMs

Bias in Large Language Models (LLMs) significantly undermines their reliability and fairness. We focus on a common form of bias: when two reference concepts in the model's concept space, such as sentiment polarities (e.g., "positive" and "negative"), are asymmetrically correlated with a third, target concept, such as a reviewing aspect, the model exhibits unintended bias. For instance, the understanding of "food" should not skew toward any particular sentiment. Existing bias evaluation methods assess behavioral differences of LLMs by constructing labeled data for different social groups and measuring model responses across them, a process that requires substantial human effort and captures only a limited set of social concepts. To overcome these limitations, we propose BiasLens, a test-set-free bias analysis framework based on the structure of the model's vector space. BiasLens combines Concept Activation Vectors (CAVs) with Sparse Autoencoders (SAEs) to extract interpretable concept representations, and quantifies bias by measuring the variation in representational similarity between the target concept and each of the reference concepts. Even without labeled data, BiasLens shows strong agreement with traditional bias evaluation metrics (Spearman correlation r > 0.85). Moreover, BiasLens reveals forms of bias that are difficult to detect using existing methods. For example, in simulated clinical scenarios, a patient's insurance status can cause the LLM to produce biased diagnostic assessments. Overall, BiasLens offers a scalable, interpretable, and efficient paradigm for bias discovery, paving the way for improving fairness and transparency in LLMs.

  • 9 authors
·
May 21 2

ViG-Bias: Visually Grounded Bias Discovery and Mitigation

The proliferation of machine learning models in critical decision making processes has underscored the need for bias discovery and mitigation strategies. Identifying the reasons behind a biased system is not straightforward, since in many occasions they are associated with hidden spurious correlations which are not easy to spot. Standard approaches rely on bias audits performed by analyzing model performance in pre-defined subgroups of data samples, usually characterized by common attributes like gender or ethnicity when it comes to people, or other specific attributes defining semantically coherent groups of images. However, it is not always possible to know a-priori the specific attributes defining the failure modes of visual recognition systems. Recent approaches propose to discover these groups by leveraging large vision language models, which enable the extraction of cross-modal embeddings and the generation of textual descriptions to characterize the subgroups where a certain model is underperforming. In this work, we argue that incorporating visual explanations (e.g. heatmaps generated via GradCAM or other approaches) can boost the performance of such bias discovery and mitigation frameworks. To this end, we introduce Visually Grounded Bias Discovery and Mitigation (ViG-Bias), a simple yet effective technique which can be integrated to a variety of existing frameworks to improve both, discovery and mitigation performance. Our comprehensive evaluation shows that incorporating visual explanations enhances existing techniques like DOMINO, FACTS and Bias-to-Text, across several challenging datasets, including CelebA, Waterbirds, and NICO++.

  • 6 authors
·
Jul 2, 2024

This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models

Do the Spratly Islands belong to China, the Philippines, or Vietnam? A pretrained large language model (LLM) may answer differently if asked in the languages of each claimant country: Chinese, Tagalog, or Vietnamese. This contrasts with a multilingual human, who would likely answer consistently. In this paper, we show that LLMs recall certain geographical knowledge inconsistently when queried in different languages -- a phenomenon we term geopolitical bias. As a targeted case study, we consider territorial disputes, an inherently controversial and multilingual task. We introduce BorderLines, a dataset of territorial disputes which covers 251 territories, each associated with a set of multiple-choice questions in the languages of each claimant country (49 languages in total). We also propose a suite of evaluation metrics to precisely quantify bias and consistency in responses across different languages. We then evaluate various multilingual LLMs on our dataset and metrics to probe their internal knowledge and use the proposed metrics to discover numerous inconsistencies in how these models respond in different languages. Finally, we explore several prompt modification strategies, aiming to either amplify or mitigate geopolitical bias, which highlights how brittle LLMs are and how they tailor their responses depending on cues from the interaction context. Our code and data are available at https://github.com/manestay/borderlines

  • 3 authors
·
May 23, 2023

Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs

Recent works have showcased the ability of LLMs to embody diverse personas in their responses, exemplified by prompts like 'You are Yoda. Explain the Theory of Relativity.' While this ability allows personalization of LLMs and enables human behavior simulation, its effect on LLMs' capabilities remains unclear. To fill this gap, we present the first extensive study of the unintended side-effects of persona assignment on the ability of LLMs to perform basic reasoning tasks. Our study covers 24 reasoning datasets, 4 LLMs, and 19 diverse personas (e.g. an Asian person) spanning 5 socio-demographic groups. Our experiments unveil that LLMs harbor deep rooted bias against various socio-demographics underneath a veneer of fairness. While they overtly reject stereotypes when explicitly asked ('Are Black people less skilled at mathematics?'), they manifest stereotypical and erroneous presumptions when asked to answer questions while adopting a persona. These can be observed as abstentions in responses, e.g., 'As a Black person, I can't answer this question as it requires math knowledge', and generally result in a substantial performance drop. Our experiments with ChatGPT-3.5 show that this bias is ubiquitous - 80% of our personas demonstrate bias; it is significant - some datasets show performance drops of 70%+; and can be especially harmful for certain groups - some personas suffer statistically significant drops on 80%+ of the datasets. Overall, all 4 LLMs exhibit this bias to varying extents, with GPT-4-Turbo showing the least but still a problematic amount of bias (evident in 42% of the personas). Further analysis shows that these persona-induced errors can be hard-to-discern and hard-to-avoid. Our findings serve as a cautionary tale that the practice of assigning personas to LLMs - a trend on the rise - can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.

  • 7 authors
·
Nov 8, 2023

"Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters

Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content, including professional documents such as recommendation letters. Though bringing convenience, this application also introduces unprecedented fairness concerns. Model-generated reference letters might be directly used by users in professional scenarios. If underlying biases exist in these model-constructed letters, using them without scrutinization could lead to direct societal harms, such as sabotaging application success rates for female applicants. In light of this pressing issue, it is imminent and necessary to comprehensively study fairness issues and associated harms in this real-world use case. In this paper, we critically examine gender biases in LLM-generated reference letters. Drawing inspiration from social science findings, we design evaluation methods to manifest biases through 2 dimensions: (1) biases in language style and (2) biases in lexical content. We further investigate the extent of bias propagation by analyzing the hallucination bias of models, a term that we define to be bias exacerbation in model-hallucinated contents. Through benchmarking evaluation on 2 popular LLMs- ChatGPT and Alpaca, we reveal significant gender biases in LLM-generated recommendation letters. Our findings not only warn against using LLMs for this application without scrutinization, but also illuminate the importance of thoroughly studying hidden biases and harms in LLM-generated professional documents.

  • 6 authors
·
Oct 13, 2023