Mercari AI’s research paper, “Zero-Shot Retrieval for Scalable Visual Search in a Two-Sided Marketplace,” has been accepted to the TSMO 2025 workshop at KDD 2025

Overview

We are pleased to announce that the paper "Zero-Shot Retrieval for Scalable Visual Search in a Two-Sided Marketplace" by Andre Rusli, Shoma Ishimoto, Sho Akiyama, and Aman Kumar Singh of the Mercari AI engineering and product teams has been accepted for presentation at TSMO 2025, a workshop held at KDD 2025, an international conference on knowledge discovery and data mining.

TSMO (Two-sided Marketplace Optimization) is a specialized workshop focusing on research related to two-sided marketplace platforms and online ecosystems.

KDD (Knowledge Discovery and Data Mining) is one of the most prestigious conferences in the field of data science and machine learning. The conference will be held in Toronto, Canada, from August 3 to 7, 2025.

The presentation is scheduled for the TSMO 2025 workshop on August 4, 2025.

Key points of the presentation

This research presents a scalable visual search system deployed at Mercari, addressing the unique challenges of consumer-to-consumer (C2C) platforms where end-users act as both buyers and sellers. The key innovation lies in evaluating recent vision-language models for zero-shot image retrieval and demonstrating their superiority over existing baselines.

The main contributions include:

  • Development of a production-ready visual search system that integrates real-time inference and background indexing workflows
  • Comprehensive evaluation of the multilingual SigLIP model, showing a 13.3% improvement in nDCG@5 over the baseline
  • Real-world validation through online A/B testing, with up to a 40.9% increase in transaction rate via image search
  • Demonstration that zero-shot models can serve as strong practical baselines for production use

Background

C2C marketplace platforms like Mercari face unique challenges compared to traditional B2C platforms. Most sellers are everyday individuals rather than professional merchants, leading to diverse and often inconsistent product catalogs with varied naming conventions, category assignments, and visual quality. This creates difficulties for customers navigating the catalog, whether for purchasing or researching similar listings before selling.

While text-based search engines remain the default method for product discovery, they often struggle in such environments. Many items are difficult to describe precisely in words or are identified primarily by visual traits (e.g., fashion, character goods, or collectibles). Visual search provides an intuitive alternative, helping end-users locate visually similar items without requiring clean textual metadata.


Illustration of a customer need for a visual search feature


UX flow of the proposed visual search feature

Summary of paper

The research evaluates recent vision-language models, specifically multilingual SigLIP, for zero-shot image retrieval in a large-scale marketplace setting. The system architecture combines real-time inference and background indexing workflows, supported by a unified embedding pipeline optimized through dimensionality reduction from 768 to 128 dimensions.
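
As a rough illustration of how the two workflows can share one embedding pipeline, the sketch below indexes listings in the background and answers a query in real time with brute-force cosine search. It is a minimal sketch under stated assumptions, not Mercari's implementation. The post does not say how the 768-to-128 reduction is performed, so a seeded random projection stands in for whatever projection is used in production. The function names (`make_projection`, `reduce_embedding`, `build_index`, `search`) are hypothetical, and the 768-dimensional inputs are random vectors standing in for SigLIP image embeddings.

```python
import math
import random

FULL_DIM = 768     # embedding size before reduction (from the post)
REDUCED_DIM = 128  # embedding size after reduction (from the post)


def l2_normalize(vec):
    """Scale to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def make_projection(in_dim=FULL_DIM, out_dim=REDUCED_DIM, seed=0):
    """ASSUMPTION: a random Gaussian projection used as a placeholder for the
    unspecified reduction method. Returns an out_dim x in_dim matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def reduce_embedding(vec, projection):
    """Shared pipeline step: project to the reduced space, then normalize."""
    reduced = [sum(w * x for w, x in zip(row, vec)) for row in projection]
    return l2_normalize(reduced)


def build_index(item_embeddings, projection):
    """Background indexing: store one reduced vector per listing id."""
    return {item_id: reduce_embedding(vec, projection)
            for item_id, vec in item_embeddings.items()}


def search(query_embedding, index, projection, k=5):
    """Real-time path: reduce the query the same way, then rank by cosine."""
    query = reduce_embedding(query_embedding, projection)
    scored = [(sum(q * v for q, v in zip(query, vec)), item_id)
              for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Toy data: random vectors stand in for model embeddings of 20 listings.
rng = random.Random(42)
items = {f"item-{i}": [rng.gauss(0.0, 1.0) for _ in range(FULL_DIM)]
         for i in range(20)}
projection = make_projection()
index = build_index(items, projection)

# The query is a slightly perturbed copy of item-7's embedding.
query = [x + rng.gauss(0.0, 0.1) for x in items["item-7"]]
results = search(query, index, projection, k=5)
print(results[0])  # id of the nearest neighbour in the reduced space
```

Because the query and the indexed items pass through the same `reduce_embedding` step, their vectors stay comparable. At marketplace scale the brute-force loop in `search` would normally be replaced by an approximate nearest-neighbor index; the post does not name one.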


System flow when a user uploads a search query image

Key findings include:

  • Multilingual SigLIP outperformed other models across multiple retrieval metrics in offline evaluation
  • Online A/B testing confirmed real-world impact with substantial gains in engagement and conversion
  • The system serves approximately 1.5 million customers per month (around 7% of Mercari's monthly active customers) as of the time of writing, and that number is still growing
  • Zero-shot approaches can eliminate the need for extensive downstream training while providing robustness to distribution shifts

Comparison of results of different models on the offline evaluation dataset

Model                                  nDCG@5   Prec@5   Recall@5
baseline                               0.510    0.190    0.551
clip-japanese-base                     0.472    0.175    0.516
siglip-base-patch16-256-multilingual   0.578    0.212    0.634
dinov2-large                           0.526    0.190    0.565
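
For readers less familiar with the column headers, the sketch below shows one common way to compute these metrics for a single query, along with the relative-improvement arithmetic behind the headline figure. It assumes binary relevance labels and the standard log2 discount. The post does not spell out the exact definitions used in the evaluation, and the item ids in the example are made up.

```python
import math


def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(retrieved, relevant, k=5):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(retrieved[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def relative_improvement(new, baseline):
    """Relative change of a metric with respect to the baseline value."""
    return (new - baseline) / baseline


# Hypothetical single query: two relevant items, ranked 1st and 3rd.
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"a", "c"}
print(precision_at_k(retrieved, relevant))  # 0.4
print(recall_at_k(retrieved, relevant))     # 1.0
print(round(ndcg_at_k(retrieved, relevant), 3))

# nDCG@5 values taken from the table: SigLIP vs. baseline.
gain = relative_improvement(0.578, 0.510)
print(f"{gain:.1%}")  # 13.3%
```

Applied to the table, the same arithmetic gives (0.578 - 0.510) / 0.510 ≈ 13.3% for nDCG@5, which matches the improvement quoted in the list of contributions.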

The study demonstrates that recent zero-shot models can serve as strong and practical baselines for production use, enabling teams to deploy effective visual search systems with minimal overhead while retaining flexibility for future fine-tuning based on domain-specific needs.

About the Authors' Teams

AI/LLM Team:

Focuses on advancing artificial intelligence and machine learning capabilities across Mercari's products and services. The team conducts experiments in computer vision, natural language processing, and large language models, working closely with various product teams to experiment with and implement the latest AI technologies to enhance customer experiences and solve complex marketplace challenges.

Search Team:

Dedicated to building and continuously improving search systems that help customers efficiently find products across Mercari's platform. The team develops and maintains various search modalities including text-based and visual search, ensuring optimal product experiences for millions of customers.

ML Platform Team:

Specializes in providing robust, scalable infrastructure and platforms that enable product teams across the organization to deploy and operate machine learning systems effectively. The team focuses on building reliable ML infrastructure, optimizing performance, and ensuring seamless integration of AI capabilities into production environments.

Acknowledgments

We would like to express our sincere gratitude to the engineering members of the Search, AI/LLM, and ML Platform teams for their outstanding contributions throughout the development of this work. Their deep technical expertise, collaborative spirit, and relentless focus on scalability and performance were instrumental in building and deploying the system. We also extend our heartfelt thanks to our cross-functional partners, including designers, product managers, and UX researchers, whose thoughtful input ensured a seamless and user-centric experience for our customers. Special appreciation goes to the Business Intelligence (BI) team for their analytical support in shaping our evaluation strategy, to the QA team for their rigorous testing and quality assurance, and to the Marketing team for their efforts in driving awareness and adoption. Their dedication and expertise have collectively played a vital role in bringing this work to life.