Progressive discretization for generative retrieval: A self-supervised approach to high-quality DocID generation
China Telecom Research Institute, Beijing, 102209, China.
Summary
This study introduces Self-supervised Progressive Discretization (SPD) to create better document identifiers (DocIDs) for generative retrieval. SPD enhances generative retrieval performance by improving DocID quality and reducing information distortion.
Area of Science:
- Information Retrieval
- Machine Learning
- Natural Language Processing
Background:
- Generative retrieval utilizes large language models as differentiable indices for document memorization and retrieval.
- Traditional retrieval methods encode documents and queries separately, which limits performance.
- Existing methods for generating document identifiers (DocIDs) often suffer from information distortion due to unsupervised discretization.
Purpose of the Study:
- To propose a novel framework, Self-supervised Progressive Discretization (SPD), for generating high-quality document identifiers (DocIDs).
- To improve the performance of generative retrieval systems by addressing limitations in DocID creation.
Main Methods:
- SPD distills document information into multi-perspective continuous representations using self-supervised learning.
- A progressive discretization algorithm transforms continuous representations into approximate vectors and discrete DocIDs.
- The self-supervised model, approximate vectors, and DocIDs are integrated into a query-side training pipeline.
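The discretization step can be illustrated with a small sketch. This summary does not specify SPD's actual algorithm, so the code below swaps in a standard technique, residual quantization with k-means, purely as an illustration of turning continuous vectors into approximate vectors and discrete DocIDs one level at a time. The function names (`kmeans`, `progressive_discretize`), the parameters (`levels`, `k`), and the random vectors standing in for the self-supervised document representations are all assumptions, not part of the paper.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid.
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(0)
    # Final assignment against the updated centroids.
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, d.argmin(1)


def progressive_discretize(doc_vecs, levels=3, k=8, seed=0):
    """Hypothetical stand-in for progressive discretization.

    Each level quantizes what the previous levels failed to capture,
    so the approximate vector is refined step by step and each document
    receives one discrete code per level (its DocID).
    """
    residual = doc_vecs.astype(np.float64).copy()
    approx = np.zeros_like(residual)
    codes, distortion = [], []
    for level in range(levels):
        centroids, assign = kmeans(residual, k, seed=seed + level)
        step = centroids[assign]
        approx += step      # refine the approximate vector
        residual -= step    # what is still unexplained
        codes.append(assign)
        # Mean squared reconstruction error after this level.
        distortion.append(float((residual ** 2).sum(1).mean()))
    doc_ids = np.stack(codes, axis=1)  # shape: (num_docs, levels)
    return doc_ids, approx, distortion


# Random vectors stand in for learned document representations.
rng = np.random.default_rng(42)
doc_vecs = rng.normal(size=(200, 16))
doc_ids, approx, distortion = progressive_discretize(doc_vecs)
print(doc_ids[:3])   # one code per level for the first three documents
print(distortion)    # reconstruction error after each level
```

In this sketch the per-level reconstruction error serves as a rough proxy for the information distortion mentioned above: it cannot increase as levels are added, because each level only quantizes the remaining residual.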
Main Results:
- SPD successfully creates high-quality, search-oriented DocIDs.
- The proposed framework achieves state-of-the-art performance on generative retrieval benchmarks.
- SPD mitigates information distortion during the discretization process.
Conclusions:
- Self-supervised Progressive Discretization (SPD) offers a robust method for generating effective DocIDs for generative retrieval.
- The SPD framework significantly advances the capabilities of large-scale generative retrieval systems.
- This work demonstrates the potential of self-supervised learning in optimizing document representations for retrieval.