Papers
arxiv:2602.11733

Adapting Vision-Language Models for E-commerce Understanding at Scale

Published on Feb 12
· Submitted by
Matteo Nulli
on Feb 13
· eBay
Authors:

Abstract

General-purpose Vision-Language Models can be effectively adapted for e-commerce applications through targeted techniques that enhance product understanding while maintaining broad multimodal capabilities.

AI-generated summary

E-commerce product understanding inherently demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no well-documented strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show, through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.

Community

Figure 1: Output of our E-commerce Adapted VLMs compared against a same-size LLaVA-OneVision. We show our models' ability to extract attributes from e-commerce items more faithfully. In red, we highlight wrong model predictions that are neither tied to the image nor valid item attributes.

Figure 2: Visual Verification Pipeline. The figure shows the pipeline we use to create the 4M e-commerce visual instruction tuning examples. We begin by collecting raw listing data from the web (left). We then clean and pre-process the textual entries. In parallel, we create detailed captions for the corresponding images with InternVL-2.5-26B. Finally, we provide the captions together with the cleaned listings to Mistral-Small-3-24B to obtain the verified instructions, which, along with the original images, are used to train our models 🔥.
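The pipeline in Figure 2 can be sketched as a small orchestration loop. This is a minimal, illustrative sketch only: the `caption_model` and `verifier_model` callables stand in for InternVL-2.5-26B and Mistral-Small-3-24B, and `clean_listing` and the prompt format are assumptions, not the paper's actual implementation.

```python
import re


def clean_listing(listing: dict) -> dict:
    """Pre-process textual entries: normalize whitespace, drop empty fields."""
    cleaned = {}
    for key, value in listing.items():
        if isinstance(value, str):
            value = re.sub(r"\s+", " ", value).strip()
        if value:
            cleaned[key] = value
    return cleaned


def build_verification_prompt(caption: str, listing: dict) -> str:
    """Pair the image caption with the cleaned listing so the verifier
    keeps only attributes grounded in both sources."""
    attrs = "\n".join(f"{k}: {v}" for k, v in listing.items())
    return (
        "Image caption:\n" + caption + "\n\n"
        "Listing attributes:\n" + attrs + "\n\n"
        "Return an instruction-answer pair using only attributes "
        "supported by both the caption and the listing."
    )


def verify_listing(listing: dict, caption_model, verifier_model) -> dict:
    """One pass of the pipeline: clean -> caption -> verify."""
    cleaned = clean_listing(listing)
    caption = caption_model(listing["image"])      # detailed image caption
    prompt = build_verification_prompt(caption, cleaned)
    instruction = verifier_model(prompt)           # verified instruction
    return {"image": listing["image"], "instruction": instruction}
```

Mapping the loop over the raw listings would yield the verified instruction set paired with the original images for training.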

Figure 3: eBay Single-Image Visual Instruction Tuning Set. We break down the components of our internal single-image instruction tuning set. The pie chart on the left shows the percentage of each task in our set. On the right, we list each task with its sub-tasks and the total number of instructions in parentheses.

