Enterprise AI Analysis: CAT-ID2: Category-Tree Integrated Document Identifier Learning for Generative Retrieval In E-commerce

Enterprise AI Analysis: Generative Retrieval in E-commerce

CAT-ID2: Category-Tree Integrated Document Identifier Learning for Generative Retrieval In E-commerce

Authors: Xiaoyu Liu et al. | Publication Date: 21 February 2026

Generative retrieval (GR) integrates LLMs for document retrieval, but constructing effective discrete semantic identifiers (DocIDs) is challenging. Existing methods overlook native hierarchical category information crucial in e-commerce. This paper proposes CAT-ID2, a novel ID learning method incorporating prior category information into semantic IDs. CAT-ID2 utilizes a Hierarchical Class Constraint Loss, Cluster Scale Constraint Loss, and Dispersion Loss to generate IDs that make similar documents more alike while preserving uniqueness. Offline and online A/B tests confirm its effectiveness, showing a 0.33% increase in average orders for ambiguous intent queries and 0.24% for long-tail queries.

Executive Impact & Key Findings

This analysis covers CAT-ID2, a method for generative retrieval in e-commerce that improves document identifier learning by integrating hierarchical category-tree information. Where earlier approaches struggle with long-tail queries and information loss, CAT-ID2 learns semantically rich, distinct document IDs that an LLM then generates at retrieval time. Its three components are a Hierarchical Class Constraint Loss for category alignment, a Cluster Scale Constraint Loss to prevent codebook collapse, and a Dispersion Loss to keep IDs unique. In offline experiments and a 10-day online A/B test, CAT-ID2 achieved a +0.13% overall increase in average orders, with larger gains for ambiguous-intent (+0.33%) and long-tail (+0.24%) queries.

+0.13% Online A/B Test Order Increase
23.37% CAT-ID2 Recall@100 (ESCI-us)

Deep Analysis & Enterprise Applications

Understanding Generative Retrieval

Generative retrieval (GR) uses an LLM to generate document IDs directly from the query, combining query understanding and retrieval in one model. It aims to overcome the limitations of decoupled query-rewriting methods, particularly for ambiguous and complex queries. CAT-ID2 contributes high-quality semantic IDs that are easier for the LLM to memorize and to retrieve accurately. Effective GR IDs have three properties: similar documents receive similar IDs, dissimilar documents receive clearly different IDs, and every document's ID is unique.

The Core of Semantic ID Learning

Semantic ID learning (SIL) is the first stage of GR: it discretizes continuous semantic embeddings into sequences of ID tokens. GR's effectiveness depends heavily on this stage because discretization is a form of quantization and therefore loses information. Poorly constructed IDs degrade retrieval, while high-quality IDs are easier for the LLM to memorize and retrieve accurately. CAT-ID2 improves this stage by integrating hierarchical category information.
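To make the discretization step concrete, here is a minimal residual quantization sketch in plain Python. The codebook values and the function names `nearest` and `residual_quantize` are illustrative assumptions, not taken from the paper; a real system would learn the codebooks jointly with the losses discussed later.

```python
import math

def nearest(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, math.inf
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes what earlier levels missed."""
    residual = list(vec)
    ids = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        ids.append(idx)
        # Subtract the chosen code so the next level only sees the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return ids, residual

# Hypothetical toy codebooks: 2 levels over 2-dimensional embeddings.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],                 # level 1: coarse direction
    [[0.25, 0.0], [0.0, 0.25], [0.0, 0.0]],   # level 2: fine correction
]
ids, residual = residual_quantize([1.25, 0.0], codebooks)
print(ids, residual)
```

The list of per-level indices is the token ID sequence; whatever remains in `residual` is the information lost to quantization.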

Leveraging E-commerce Hierarchy

E-commerce data naturally possesses hierarchical category structures (e.g., Clothes -> Shoes -> Sneakers). Existing SIL methods often ignore this vital information, treating labels as plain text. CAT-ID2 explicitly incorporates this hierarchical category tree into the ID indexing process. This ensures documents within the same category are more similar, leveraging domain-specific knowledge for improved semantic representation.
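One simple way to expose the category tree to an ID learner is to store each product's root-to-leaf path and measure how many levels two products share. The sketch below is an illustration only: the helper name `shared_depth` and every category other than Clothes -> Shoes -> Sneakers are hypothetical.

```python
def shared_depth(path_a, path_b):
    """Count how many leading category levels two root-to-leaf paths have in common."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth

sneaker = ("Clothes", "Shoes", "Sneakers")
boot = ("Clothes", "Shoes", "Boots")     # hypothetical sibling category
phone = ("Electronics", "Phones")        # hypothetical unrelated category

print(shared_depth(sneaker, boot))   # share the first two levels
print(shared_depth(sneaker, phone))  # share nothing
```

A larger shared depth is the signal that a category-aware method can use to pull two products' identifiers closer together.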

CAT-ID2's Innovative Loss Functions

CAT-ID2 introduces three key loss functions: 1. Hierarchical Class Constraint Loss (HCCL) integrates category information layer-by-layer for contrastive learning, ensuring intra-category compactness and inter-category separation. 2. Cluster Scale Constraint Loss (CSCL) promotes uniform ID token distribution, preventing codebook collapse. 3. Dispersion Loss (DisL) enhances distinctiveness of reconstructed embeddings, ensuring unique IDs. Together, these losses create IDs with robust semantic properties.
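This summary does not give the exact formulas for the three losses, so the sketch below uses generic stand-ins that follow the same intent: a supervised contrastive loss at one category level (for HCCL), the KL divergence from code usage to a uniform distribution (for CSCL), and the mean pairwise cosine similarity of reconstructed embeddings (for DisL). All function names and the temperature value are assumptions, and the paper's actual definitions may differ.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def level_contrastive_loss(vectors, labels, temperature=0.1):
    """Stand-in for HCCL at one tree level: same-label items are positives."""
    losses = []
    n = len(vectors)
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an item with no same-category partner contributes nothing
        logits = {j: cosine(vectors[i], vectors[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(x) for x in logits.values()))
        losses.append(-sum(logits[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0

def usage_kl_to_uniform(assignments, codebook_size):
    """Stand-in for CSCL: zero when all codes are used equally, larger as usage collapses."""
    total = len(assignments)
    kl = 0.0
    for count in Counter(assignments).values():
        p = count / total
        kl += p * math.log(p * codebook_size)
    return kl

def mean_pairwise_similarity(vectors):
    """Stand-in for DisL: lower values mean reconstructed embeddings are more spread out."""
    sims = [cosine(vectors[i], vectors[j])
            for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(sims) / len(sims)

# Hypothetical toy data: two tight groups of 2-dimensional embeddings.
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = level_contrastive_loss(vectors, ["a", "a", "b", "b"])
misaligned = level_contrastive_loss(vectors, ["a", "b", "a", "b"])
print(aligned < misaligned)
```

Applying the contrastive term once per tree level, with labels taken from that level, is one way to read "layer-by-layer"; in training the three terms would be added to the quantizer's reconstruction objective with weights that this summary does not specify.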

Real-world Performance & Impact

Offline and online experiments both support CAT-ID2's effectiveness. Offline, it outperforms all generative retrieval (GR) and dense retrieval (DR) baselines on the ESCI datasets (e.g., Recall@100 of 23.37% on ESCI-us). The online A/B test showed a +0.13% overall increase in average orders, with +0.33% for ambiguous and +0.24% for long-tail queries. These results indicate that CAT-ID2 improves product discovery in a production e-commerce setting.
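For readers unfamiliar with the offline metric, Recall@k is commonly computed as the share of a query's relevant documents that appear among the top k retrieved results, averaged over queries. The sketch below assumes that common definition (the paper's exact evaluation protocol is not described here) and uses made-up document IDs.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k of the ranked list."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)

# Hypothetical toy query: four retrieved documents, two relevant ones.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, 2))
print(recall_at_k(ranked, relevant, 4))
```

With k = 100, the same function yields the Recall@100 figure quoted for ESCI-us.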

+0.13% Overall Order Increase in A/B Test

CAT-ID2 delivered a +0.13% overall increase in average orders per thousand users during a 10-day online A/B test, evidence of its impact in production e-commerce search.

Enterprise Process Flow: CAT-ID2 Methodology

Document Embedding
Residual Quantization
Hierarchical Class Constraint Loss
Cluster Scale Constraint Loss
Dispersion Loss
Semantic ID Generation
Generative Model Training

The CAT-ID2 framework integrates hierarchical category information into document identifier learning through a multi-stage process, ensuring semantic IDs are both distinct and representative.

Key Differentiators: CAT-ID2 vs. TIGER

Each feature below lists CAT-ID2 (proposed) first, then TIGER (baseline).
Hierarchical Category Integration
  • CAT-ID2: Explicitly integrates category-tree labels via HCCL for robust semantic alignment and separation.
  • TIGER: Uses a rigid prefix-based approach that lacks flexibility and global semantic relationships.
Codebook Utilization & Stability
  • CAT-ID2: Employs CSCL to ensure uniform codebook distribution, preventing collapse and improving ID distinctiveness.
  • TIGER: Prone to codebook collapse issues due to less robust mechanisms for uniform distribution.
Semantic ID Uniqueness
  • CAT-ID2: Integrates Dispersion Loss to maximize distinctiveness of reconstructed embeddings, ensuring unique semantic IDs.
  • TIGER: Does not explicitly optimize for distinctiveness of reconstructed embeddings in the same way.
Offline Performance (Recall@100)
  • CAT-ID2: 23.37% (ESCI-us, 512 codebook)
  • TIGER: 18.59% (ESCI-us)
Online A/B Test Impact
  • CAT-ID2: +0.13% overall order increase, significant gains for ambiguous/long-tail queries.
  • TIGER: Not explicitly designed or tested for this level of online impact.

CAT-ID2 introduces a novel combination of loss functions to address limitations of existing methods, ensuring superior semantic ID properties and retrieval performance.

Case Study: CAT-ID2's Semantic ID Structure in E-commerce

A detailed examination of DocIDs generated by CAT-ID2, particularly for categories like 'Connectivity Devices' or 'Protective Cases', reveals a highly organized and semantically coherent hierarchical structure. For instance, the ID <a_31> represents broad 'Connectivity Devices,' which then branches into specific sub-categories like smart TVs (<a_31><b_229>) or hardware adapters (<a_31><b_454>). This multi-level encoding, driven by the specialized loss functions, ensures that similar products share common prefixes while distinct products maintain unique identifiers. Visualizations (like t-SNE in Figure 4) demonstrate that CAT-ID2 produces tighter, more compact clusters for intra-category items and clearer separation between different categories compared to TIGER, confirming its superior ability to capture and leverage product hierarchies for improved retrieval accuracy.
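The prefix structure described above can be illustrated with a short sketch that renders per-level code indices as tokens and groups items by their first-level token. The token format mirrors the `<a_31><b_229>` example; the item names, the `phone_case` indices, and both helper functions are hypothetical.

```python
import string
from collections import defaultdict

def to_docid(code_indices):
    """Render per-level code indices as tokens like <a_31><b_229> (level letter + index)."""
    return "".join(f"<{string.ascii_lowercase[level]}_{idx}>"
                   for level, idx in enumerate(code_indices))

def group_by_first_level(items):
    """Map each first-level token to the items whose DocID starts with it."""
    groups = defaultdict(list)
    for name, codes in items.items():
        groups[to_docid(codes[:1])].append(name)
    return dict(groups)

items = {
    "smart_tv": (31, 229),          # indices from the example above
    "hardware_adapter": (31, 454),  # indices from the example above
    "phone_case": (7, 12),          # hypothetical indices for illustration
}
print(to_docid(items["smart_tv"]))
print(group_by_first_level(items))
```

Items that land under the same first-level token share a coarse category, which is the property the case study attributes to CAT-ID2's IDs.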

Your Implementation Roadmap

A structured approach ensures successful integration and maximum impact. Our experts will guide you through each phase, tailored to your enterprise's specific needs and existing infrastructure.

Phase 1: Discovery & Strategy Alignment

Initial consultation to understand current search systems, data architecture, and business objectives. Define key performance indicators (KPIs) and tailor CAT-ID2's deployment strategy.

Phase 2: Data Preparation & ID Learning Setup

Assist in preparing e-commerce product data, including hierarchical category information. Configure and train the CAT-ID2 model for optimal semantic ID generation based on your unique dataset.

Phase 3: Generative Model Integration & Fine-tuning

Integrate the learned semantic IDs with your existing LLM infrastructure. Fine-tune the generative retrieval model to ensure accurate and relevant document ID sequence generation for diverse queries.

Phase 4: A/B Testing & Performance Monitoring

Deploy CAT-ID2 in a controlled A/B testing environment. Continuously monitor key metrics like order increase, recall, and query satisfaction, with iterative optimization based on real-world user feedback.

Phase 5: Full-Scale Rollout & Ongoing Optimization

Seamlessly transition to a full production environment. Provide ongoing support, maintenance, and further enhancements to adapt to evolving e-commerce trends and business requirements.
