Enterprise AI Analysis: Generative Retrieval in E-commerce
CAT-ID2: Category-Tree Integrated Document Identifier Learning for Generative Retrieval In E-commerce
Authors: Xiaoyu Liu et al. | Publication Date: 21 February 2026
Generative retrieval (GR) uses LLMs to retrieve documents directly, but constructing effective discrete semantic identifiers (DocIDs) is challenging. Existing methods overlook the native hierarchical category information that is crucial in e-commerce. This paper proposes CAT-ID2, a novel ID learning method that incorporates prior category information into semantic IDs. CAT-ID2 combines a Hierarchical Class Constraint Loss, a Cluster Scale Constraint Loss, and a Dispersion Loss to generate IDs that are more alike for similar documents while remaining unique per document. Offline experiments and an online A/B test confirm its effectiveness, showing a 0.33% increase in average orders for ambiguous-intent queries and 0.24% for long-tail queries.
Executive Impact & Key Findings
This analysis highlights CAT-ID2, a method for generative retrieval in e-commerce that improves document identifier learning by integrating hierarchical category-tree information. Where traditional approaches struggle with long-tail queries and information loss, CAT-ID2 produces semantically rich, distinct document IDs for an LLM-based retriever to generate. Key innovations include a Hierarchical Class Constraint Loss for category alignment, a Cluster Scale Constraint Loss to prevent codebook collapse, and a Dispersion Loss for unique ID generation. Evaluated through extensive offline experiments and a 10-day online A/B test, CAT-ID2 achieved a +0.13% overall increase in average orders in the online test, with larger gains for ambiguous (+0.33%) and long-tail (+0.24%) queries, demonstrating its practical value for product discovery in enterprise search systems.
Deep Analysis & Enterprise Applications
Understanding Generative Retrieval
GR uses an LLM to generate document IDs directly, combining query understanding and retrieval in a single model. It aims to overcome the limitations of decoupled query-rewriting methods, particularly for ambiguous and complex queries. CAT-ID2 enhances GR by generating high-quality semantic IDs that are easier for the LLM to memorize and that improve retrieval accuracy. Effective GR IDs have three key properties: similar documents receive similar IDs, dissimilar documents receive distinct IDs, and every document's ID is unique.
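One common way to make an LLM emit only valid DocIDs is to restrict each decoding step to tokens that continue some known ID. The sketch below illustrates that idea with a nested-dict trie and a greedy decoder. It is not taken from the paper: `build_trie`, `allowed_next_tokens`, `greedy_decode`, the `toy_score` stand-in for the LLM's next-token score, and the `<a_7><b_12>` ID are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def build_trie(doc_ids: Sequence[Sequence[str]]) -> Dict:
    """Index every valid DocID token sequence in a nested-dict trie."""
    root: Dict = {}
    for tokens in doc_ids:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
    return root


def allowed_next_tokens(trie: Dict, prefix: Sequence[str]) -> List[str]:
    """Tokens that extend `prefix` toward at least one valid DocID."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []  # prefix is not part of any valid DocID
        node = node[tok]
    return sorted(node)


def greedy_decode(trie: Dict,
                  score: Callable[[Sequence[str], str], float],
                  max_len: int) -> List[str]:
    """Greedy decoding restricted to valid DocIDs.

    `score(prefix, token)` is a hypothetical stand-in for the LLM's
    next-token score given the query and the tokens generated so far.
    """
    prefix: List[str] = []
    for _ in range(max_len):
        options = allowed_next_tokens(trie, prefix)
        if not options:
            break  # reached a leaf: the DocID is complete
        prefix.append(max(options, key=lambda t: score(prefix, t)))
    return prefix


# Toy catalogue: the first two IDs appear in the case study, the third is made up.
DOC_IDS = [
    ["<a_31>", "<b_229>"],
    ["<a_31>", "<b_454>"],
    ["<a_7>", "<b_12>"],
]
trie = build_trie(DOC_IDS)


def toy_score(prefix: Sequence[str], token: str) -> float:
    """Hypothetical scores favouring the 'hardware adapters' ID."""
    return {"<a_31>": 0.9, "<b_454>": 0.8}.get(token, 0.1)


decoded = greedy_decode(trie, toy_score, max_len=2)
print("".join(decoded))
```

Because every step is limited to the trie's children, the decoder cannot produce an ID that is missing from the catalogue, whatever the scores are. A production system would more likely use beam search to return several candidates.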
The Core of Semantic ID Learning
Semantic ID learning (SIL) is the first stage of GR: it discretizes continuous semantic embeddings into sequences of ID tokens. The effectiveness of GR depends heavily on this stage, because discretization is a lossy quantization of the embedding. Poorly constructed IDs degrade performance, while high-quality IDs are easier for the LLM to memorize and lead to more accurate retrieval. CAT-ID2 addresses this by integrating hierarchical category information.
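To make the discretization step concrete, here is a minimal sketch of residual quantization, the scheme used by RQ-VAE-style semantic IDs such as TIGER's. This summary does not say which quantizer CAT-ID2 uses, so treat the scheme as an assumption. The codebooks, the input vector, and the helper names (`nearest`, `residual_quantize`, `to_doc_id`) are hypothetical, and a real system would learn the codebooks rather than hard-code them.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codeword closest to x in squared Euclidean distance."""
    def sq_dist(c: Vector) -> float:
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(x: Vector,
                      codebooks: Sequence[Sequence[Vector]]
                      ) -> Tuple[List[int], List[float]]:
    """Quantize x level by level.

    Each level picks the codeword nearest to the residual left over by
    the previous levels, so later tokens refine earlier ones.
    Returns the token indices and the reconstructed embedding.
    """
    residual = list(x)
    recon = [0.0] * len(x)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        code = codebook[idx]
        tokens.append(idx)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return tokens, recon


def to_doc_id(tokens: Sequence[int]) -> str:
    """Render token indices in the '<a_31><b_229>' style: one letter per level."""
    return "".join(f"<{chr(ord('a') + lvl)}_{idx}>" for lvl, idx in enumerate(tokens))


# Hypothetical two-level codebooks over 2-D embeddings.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]],       # level a: coarse codewords
    [[0.25, -0.1], [-0.25, 0.0], [0.0, 0.25]],  # level b: fine corrections
]

tokens, recon = residual_quantize([1.2, 0.9], CODEBOOKS)
print(to_doc_id(tokens), recon)
```

The gap between the input and `recon` is the information lost to quantization, which is why the quality of the learned codebooks matters so much for the downstream retriever.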
Leveraging E-commerce Hierarchy
E-commerce data naturally possesses hierarchical category structures (e.g., Clothes -> Shoes -> Sneakers). Existing SIL methods often ignore this vital information, treating category labels as plain text. CAT-ID2 explicitly incorporates the hierarchical category tree into the ID indexing process, so that documents within the same category receive more similar IDs and domain-specific knowledge improves the semantic representation.
CAT-ID2's Innovative Loss Functions
CAT-ID2 introduces three key loss functions: 1. Hierarchical Class Constraint Loss (HCCL) integrates category information layer-by-layer for contrastive learning, ensuring intra-category compactness and inter-category separation. 2. Cluster Scale Constraint Loss (CSCL) promotes uniform ID token distribution, preventing codebook collapse. 3. Dispersion Loss (DisL) enhances distinctiveness of reconstructed embeddings, ensuring unique IDs. Together, these losses create IDs with robust semantic properties.
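The summary does not give the exact formulas for these losses, so the sketch below uses simple stand-ins that follow the same intuitions. It substitutes a supervised-contrastive (SupCon-style) loss averaged over category levels for HCCL, a KL divergence from code usage to the uniform distribution for CSCL, and the mean pairwise cosine similarity for DisL. All function names, the toy embeddings, and the "Shirts" label are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def level_contrastive_loss(embs: Sequence[Vector], labels: Sequence[Hashable],
                           temperature: float = 0.1) -> float:
    """SupCon-style loss for one category level: pull same-label items together."""
    n, total, anchors = len(embs), 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-category partner contributes nothing
        logits = {j: cosine(embs[i], embs[j]) / temperature for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def hierarchical_class_loss(embs: Sequence[Vector],
                            labels_per_level: Sequence[Sequence[Hashable]],
                            temperature: float = 0.1) -> float:
    """Stand-in for HCCL: average the per-level loss over the category tree's levels."""
    losses = [level_contrastive_loss(embs, lv, temperature) for lv in labels_per_level]
    return sum(losses) / len(losses)


def cluster_scale_loss(assignments: Sequence[int], num_codes: int) -> float:
    """Stand-in for CSCL: KL(code usage || uniform); 0 when all codes are used equally."""
    counts = Counter(assignments)
    n = len(assignments)
    kl = 0.0
    for k in range(num_codes):
        p = counts.get(k, 0) / n
        if p > 0:
            kl += p * math.log(p * num_codes)
    return kl


def dispersion_loss(recon: Sequence[Vector]) -> float:
    """Stand-in for DisL: mean cosine similarity over distinct pairs (lower = more spread)."""
    n = len(recon)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(recon[i], recon[j]) for i, j in pairs) / len(pairs)


# Toy batch: two tight pairs of 2-D embeddings.
EMBS = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
LEVELS = [
    ["Clothes", "Clothes", "Clothes", "Clothes"],  # level 1
    ["Shoes", "Shoes", "Shirts", "Shirts"],        # level 2 ("Shirts" is made up)
]
aligned = level_contrastive_loss(EMBS, LEVELS[1])
misaligned = level_contrastive_loss(EMBS, ["Shoes", "Shirts", "Shoes", "Shirts"])
print(aligned < misaligned)
print(hierarchical_class_loss(EMBS, LEVELS))
```

In training, terms like these would be weighted and added to the quantizer's reconstruction objective. The weights and the exact formulations are in the paper and are not reproduced here.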
Real-world Performance & Impact
Extensive offline and online experiments demonstrate CAT-ID2's effectiveness. Offline, it outperforms all generative retrieval (GR) and dense retrieval (DR) baselines on the ESCI datasets (e.g., Recall@100 of 23.37% on ESCI-us). The online A/B test showed a +0.13% overall increase in average orders, with +0.33% for ambiguous and +0.24% for long-tail queries. This confirms that CAT-ID2 improves product discovery in real-world e-commerce scenarios.
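For readers unfamiliar with the offline metric, the sketch below computes Recall@k under its standard definition: the fraction of a query's relevant documents that appear in the top k retrieved results, averaged over queries. The summary does not spell out the paper's exact evaluation protocol, so this is an assumption, and the document IDs and function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents found among the top-k retrieved ones."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: Dict[str, Tuple[List[str], Set[str]]], k: int) -> float:
    """Average Recall@k over queries that have at least one relevant document."""
    scores = [recall_at_k(ret, rel, k) for ret, rel in runs.values() if rel]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical ranked results and relevance judgments for two queries.
RUNS = {
    "wireless hdmi adapter": (["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"}),
    "phone case": (["d5", "d6"], {"d5"}),
}

print(mean_recall_at_k(RUNS, k=3))
```

Reported values such as 23.37% correspond to this kind of average with k = 100, expressed as a percentage.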
CAT-ID2 delivered a significant +0.13% overall increase in average orders per thousand users during a 10-day online A/B test, demonstrating its real-world impact in e-commerce search.
Enterprise Process Flow: CAT-ID2 Methodology
The CAT-ID2 framework integrates hierarchical category information into document identifier learning through a multi-stage process, ensuring semantic IDs are both distinct and representative.
| Feature | CAT-ID2 (Proposed) | TIGER (Baseline) |
|---|---|---|
| Hierarchical Category Integration | | |
| Codebook Utilization & Stability | | |
| Semantic ID Uniqueness | | |
| Offline Performance (Recall@100) | | |
| Online A/B Test Impact | | |
CAT-ID2 introduces a novel combination of loss functions to address limitations of existing methods, ensuring superior semantic ID properties and retrieval performance.
Case Study: CAT-ID2's Semantic ID Structure in E-commerce
A detailed examination of DocIDs generated by CAT-ID2, particularly for categories like 'Connectivity Devices' or 'Protective Cases', reveals a highly organized and semantically coherent hierarchical structure. For instance, the ID <a_31> represents broad 'Connectivity Devices,' which then branches into specific sub-categories like smart TVs (<a_31><b_229>) or hardware adapters (<a_31><b_454>). This multi-level encoding, driven by the specialized loss functions, ensures that similar products share common prefixes while distinct products maintain unique identifiers. Visualizations (like t-SNE in Figure 4) demonstrate that CAT-ID2 produces tighter, more compact clusters for intra-category items and clearer separation between different categories compared to TIGER, confirming its superior ability to capture and leverage product hierarchies for improved retrieval accuracy.
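The shared-prefix property described above can be checked mechanically. The sketch below parses DocID strings in the `<a_31><b_229>` format, measures how many leading tokens two IDs share, and groups IDs by prefix. The helper names and the third ID (`<a_12><b_229>`) are hypothetical, and the token pattern is inferred from the examples in this case study rather than from a published specification.

```python
import re
from collections import defaultdict
from typing import Dict, List, Sequence

# One token looks like "<a_31>": a level letter, an underscore, a code index.
TOKEN_RE = re.compile(r"<[a-z]_\d+>")


def parse_doc_id(doc_id: str) -> List[str]:
    """Split '<a_31><b_229>' into ['<a_31>', '<b_229>']."""
    return TOKEN_RE.findall(doc_id)


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading tokens two DocIDs have in common."""
    n = 0
    for x, y in zip(parse_doc_id(a), parse_doc_id(b)):
        if x != y:
            break
        n += 1
    return n


def group_by_prefix(doc_ids: Sequence[str], depth: int) -> Dict[str, List[str]]:
    """Bucket DocIDs by their first `depth` tokens."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc_id in doc_ids:
        groups["".join(parse_doc_id(doc_id)[:depth])].append(doc_id)
    return dict(groups)


SMART_TV = "<a_31><b_229>"   # from the case study
ADAPTER = "<a_31><b_454>"    # from the case study
OTHER = "<a_12><b_229>"      # made-up ID in a different top-level cluster

print(shared_prefix_len(SMART_TV, ADAPTER))
print(group_by_prefix([SMART_TV, ADAPTER, OTHER], depth=1))
```

Note that `SMART_TV` and `OTHER` share the second token but not the first, so they have no common prefix: in a hierarchical ID, a later token is only meaningful in the context of the tokens before it.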
Your Implementation Roadmap
A structured approach ensures successful integration and maximum impact. Our experts will guide you through each phase, tailored to your enterprise's specific needs and existing infrastructure.
Phase 1: Discovery & Strategy Alignment
Initial consultation to understand current search systems, data architecture, and business objectives. Define key performance indicators (KPIs) and tailor CAT-ID2's deployment strategy.
Phase 2: Data Preparation & ID Learning Setup
Assist in preparing e-commerce product data, including hierarchical category information. Configure and train the CAT-ID2 model for optimal semantic ID generation based on your unique dataset.
Phase 3: Generative Model Integration & Fine-tuning
Integrate the learned semantic IDs with your existing LLM infrastructure. Fine-tune the generative retrieval model to ensure accurate and relevant document ID sequence generation for diverse queries.
Phase 4: A/B Testing & Performance Monitoring
Deploy CAT-ID2 in a controlled A/B testing environment. Continuously monitor key metrics like order increase, recall, and query satisfaction, with iterative optimization based on real-world user feedback.
Phase 5: Full-Scale Rollout & Ongoing Optimization
Seamlessly transition to a full production environment. Provide ongoing support, maintenance, and further enhancements to adapt to evolving e-commerce trends and business requirements.
Ready to Transform Your E-commerce Search?
Unlock the full potential of generative retrieval for your enterprise. Our experts are ready to design a tailored strategy that drives measurable improvements in product discovery and customer satisfaction.