What is Generative Recommendation
21 Feb 2026

Introduction
Generative Recommendation (GR) has recently gained significant popularity. Top-tier companies claim to have successfully converted their traditional Deep Learning Recommendation Models (DLRM) to GR, and it appears that recommendation systems (RS) without generative capabilities may soon be considered outdated.
The definition of GR is straightforward: generate the item to expose (as Large Language Models (LLMs) generate tokens) instead of ranking candidates and selecting the top-ranked ones, where ranking is based on the predicted probability of interaction (click, purchase).
However, a question arises: when an LLM generates text, it seems to do something similar - it, too, tends to select the tokens with the highest probability. Why, then, isn’t DLRM considered generative?
LLM’s Generation
Unlike traditional recommendation systems that simply rank items and select the top-1, LLMs use sophisticated sampling strategies for text generation. The key difference is that LLMs don’t always choose the most probable token - they introduce controlled randomness through temperature and nucleus (top-p) sampling to create diverse and natural text.
The Generation Process
At each step of generation, the LLM produces a probability distribution over the entire vocabulary:
P(token_i | context) = softmax(logits_i)
The generation strategy determines which token to select from this distribution.
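To make this concrete, here is a minimal pure-Python sketch of the softmax step over a toy four-token vocabulary (the logits are made up for illustration):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocabulary of 4 tokens with raw model scores (logits)
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

# Greedy decoding picks the argmax; sampling draws from the distribution
greedy_token = probs.index(max(probs))
print(greedy_token)  # 0
```

Greedy decoding always returns token 0 here; a sampling strategy would occasionally pick the others in proportion to their probability.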
Temperature Scaling
Temperature (τ) controls the “flatness” of the probability distribution:
- τ = 1.0: Original distribution (no scaling)
- τ < 1.0: Sharpens distribution (more deterministic, focuses on high-probability tokens)
- τ > 1.0: Flattens distribution (more random, allows low-probability tokens)
The temperature-scaled logits are computed as:
def temperature_scale(logits, temperature):
    """
    Scale logits by temperature to control randomness
    Args:
        logits: Raw output scores from the model [vocab_size]
        temperature: Temperature parameter (0 < τ < ∞)
    Returns:
        Temperature-scaled logits
    """
    if temperature <= 0:
        raise ValueError("Temperature must be positive")
    return logits / temperature
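A quick way to see the effect is to push the same toy logits through different temperatures (a self-contained pure-Python sketch; the values are made up):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.0]

cold = softmax([x / 0.5 for x in logits])  # τ = 0.5, sharper
base = softmax(logits)                     # τ = 1.0, unchanged
hot  = softmax([x / 2.0 for x in logits])  # τ = 2.0, flatter

# The top token's probability grows as temperature drops
print(cold[0] > base[0] > hot[0])  # True
```

Lower temperatures concentrate mass on the top token (more deterministic); higher temperatures spread it out (more random).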
Top-p (Nucleus) Sampling
Top-p sampling keeps the smallest set of highest-probability tokens whose cumulative probability exceeds a threshold p (typically 0.9-0.95). This focuses sampling on the “nucleus” of the distribution while maintaining diversity.
Why it’s important:
- Removes unlikely tokens that would degrade quality
- Keeps the number of candidates dynamic based on confidence
- Better than fixed top-k in handling uncertain situations
import torch

def top_p_sampling(logits, top_p=0.9):
    """
    Nucleus sampling: keep tokens with cumulative probability ≤ p
    Args:
        logits: Raw or temperature-scaled model output [vocab_size]
        top_p: Cumulative probability threshold (default: 0.9)
    Returns:
        Sampled token index
    """
    # Sort logits in descending order
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    # Compute softmax probabilities
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    # Mark tokens beyond the nucleus for removal
    sorted_indices_to_remove = cumulative_probs > top_p
    # Shift the mask right so the first token that crosses top_p is kept
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False
    # Remove low-probability tokens by setting their logits to -inf
    sorted_logits[sorted_indices_to_remove] = float('-inf')
    # Re-normalize probabilities over the remaining tokens
    filtered_probs = torch.softmax(sorted_logits, dim=-1)
    # Sample one token from the filtered distribution
    sampled_index = torch.multinomial(filtered_probs, num_samples=1)
    # Map back to original vocabulary indices
    final_token = sorted_indices[sampled_index]
    return final_token
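The same nucleus logic can be sketched without torch; this toy version filters a hand-made probability vector (the values are illustrative):

```python
def nucleus_filter(probs, top_p=0.9):
    # Sort indices by descending probability and keep the smallest
    # prefix whose cumulative probability reaches top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept tokens
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = nucleus_filter(probs, top_p=0.9)
print(sorted(nucleus))  # [0, 1, 2] -- token 3 is filtered out
```

Note how the nucleus size is dynamic: a confident distribution keeps one or two tokens, while a flat one keeps many.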
Combined Temperature + Top-p Generation
In practice, these two strategies are combined for optimal results:
def generate_with_sampling(model, input_ids,
                           max_length=100,
                           temperature=0.7,
                           top_p=0.9):
    """
    Generate text using temperature + top-p sampling
    Args:
        model: Pre-trained LLM
        input_ids: Tokenized input prompt
        max_length: Maximum tokens to generate
        temperature: Temperature parameter
        top_p: Nucleus sampling threshold
    Returns:
        Generated token sequence
    """
    generated_ids = input_ids.clone()
    for step in range(max_length):
        # Forward pass to get logits
        outputs = model(generated_ids)
        logits = outputs.logits[:, -1, :]  # Get next-token predictions
        # Apply temperature scaling
        scaled_logits = logits / temperature
        # Apply top-p sampling
        next_token = top_p_sampling(scaled_logits, top_p=top_p)
        # Append to sequence
        generated_ids = torch.cat([generated_ids, next_token], dim=-1)
        # Stop if EOS token is generated
        if next_token == model.config.eos_token_id:
            break
    return generated_ids
Practical Settings
# Different scenarios use different sampling parameters:

# Creative writing: high randomness
generation_config = {
    'temperature': 0.8,
    'top_p': 0.95
}

# Technical documentation: more deterministic
generation_config = {
    'temperature': 0.3,
    'top_p': 0.9
}

# Code generation: highly deterministic
generation_config = {
    'temperature': 0.2,
    'top_p': 0.8
}
This sampling-based approach is what makes LLMs “generative” - they create novel content through probabilistic selection rather than deterministic ranking.
DLRM’s Ranking vs. LLM’s Generation
| Aspect | Traditional DLRM | LLM Generation |
|---|---|---|
| Selection Method | Argmax (deterministic) | Sampling (stochastic) |
| Candidate Set | Fixed (all items) | Dynamic (top-p nucleus) |
| Output | Single best item | Sampled sequence |
| Diversity | Low (same output) | High (varied outputs) |
| Parameter Control | None | Temperature, top-p |
Though the mechanics differ, both approaches rank a probability distribution and select from it, yet no one would call an LLM deterministic. The key difference between DLRM and GR is therefore not how they obtain the result, but what they model.
Before going deeper, let’s review what and how recommendation systems model.
Objective of Deep Learning Recommendation System
Traditionally, recommendation systems (RS) select and expose the item that a user is most likely to interact with. Click-Through Rate (CTR) is the usual metric for a user’s interest in an item: the item with the highest predicted CTR is selected and exposed. Therefore, instead of generating whatever is most likely to appear next, as an LLM does when modeling the nature of language, RS deliberately reshapes what gets exposed toward high conversion. Specifically, the common objectives are:
(1) Click-Through Rate (CTR) Prediction
# Modeling the probability of user clicking on an item
P(click = 1 | user, item, context)
This is the most fundamental objective where the model learns to predict whether a user will interact with a given item based on user profile, item features, and contextual information.
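As a toy illustration (not a production DLRM, which would use embedding tables and a deep interaction network), CTR prediction can be sketched as a logistic model over hand-made features:

```python
import math

def predict_ctr(user_feats, item_feats, weights, bias=0.0):
    """Toy CTR model: logistic regression over concatenated features.
    Illustrative only; the features and weights below are made up."""
    x = user_feats + item_feats
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # P(click = 1 | user, item)

user = [1.0, 0.2]   # e.g. activity level, category affinity
item = [0.5, -0.3]  # e.g. popularity, price signal
w = [0.8, 1.5, 0.6, 0.4]

ctr = predict_ctr(user, item, w)
print(0.0 < ctr < 1.0)  # True -- a valid probability
```

Ranking then reduces to scoring every candidate item this way and exposing the argmax.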
(2) Conversion Rate (CVR) Prediction
# Modeling the probability of conversion after click
P(conversion = 1 | click = 1, user, item, context)
For e-commerce scenarios, CVR prediction is crucial as it focuses on the actual purchase or conversion events rather than just clicks.
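Because CVR is conditioned on a click, the full impression-to-conversion funnel factorizes into a product of the two heads (the decomposition popularized by ESMM-style multi-task models); the probabilities below are made up:

```python
# P(conv, click | x) = P(click | x) * P(conv | click = 1, x)
#                    = pCTR * pCVR  ("pCTCVR")

p_ctr = 0.10  # P(click = 1 | user, item)     -- made-up value
p_cvr = 0.05  # P(conv = 1 | click = 1, ...)  -- made-up value

p_ctcvr = p_ctr * p_cvr
print(round(p_ctcvr, 6))  # 0.005
```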
(3) Next Item Prediction
# Predicting the next item in user's interaction sequence
P(item_{t+1} | item_1, item_2, ..., item_t, user)
This objective models the sequential nature of user behavior, where the history of interactions influences future choices.
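A minimal sketch of this objective, with a mean-of-history scorer standing in for a real sequence model (the item embeddings are made up):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    t = sum(exps)
    return [e / t for e in exps]

def next_item_probs(history, item_embeddings):
    """Toy sequential model: score each item by dot product with the
    mean of the user's history embeddings (stand-in for a transformer)."""
    dim = len(item_embeddings[0])
    mean = [sum(item_embeddings[i][d] for i in history) / len(history)
            for d in range(dim)]
    scores = [sum(m * e for m, e in zip(mean, emb)) for emb in item_embeddings]
    return softmax(scores)  # P(item_{t+1} | item_1..item_t)

# Four items with made-up 2-d embeddings
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
probs = next_item_probs(history=[0, 1], item_embeddings=embs)
print(probs.index(max(probs)))  # 0 -- closest to the user's history
```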
(4) Multi-Objective Optimization
# Combining multiple objectives with weighted importance
Loss = α · L_CTR + β · L_CVR + γ · L_Time
Modern systems often optimize multiple objectives simultaneously, balancing engagement, conversion, and other business metrics.
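The weighted combination above can be sketched directly; the per-task losses and weights below are made up, and `L_Time` is left as a placeholder constant:

```python
import math

def bce(p, y):
    # Binary cross-entropy for a single example
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Per-task losses on one training example (predictions/labels made up)
l_ctr  = bce(p=0.8, y=1)  # click head
l_cvr  = bce(p=0.1, y=0)  # conversion head
l_time = 0.3              # e.g. dwell-time regression loss (placeholder)

# Loss = α · L_CTR + β · L_CVR + γ · L_Time
alpha, beta, gamma = 1.0, 0.5, 0.2
loss = alpha * l_ctr + beta * l_cvr + gamma * l_time
print(loss > 0)  # True
```

In practice the weights α, β, γ are tuned to trade engagement against conversion and other business metrics.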
Generative Recommendation
Following the analysis above, it becomes clear that generative recommendation models the nature of exposure itself; that objective is what justifies calling it generation.
Unlike traditional ranking-based systems that predict probabilities for fixed candidate items and select top-ranked ones, generative recommendation systems treat the recommendation problem as a generation task where items are “generated” from a learned distribution, similar to how LLMs generate tokens.
The key insight is that rather than ranking existing candidates, the model learns to directly sample/produce item IDs from a probability distribution conditioned on user preferences and context.
OneRec: A Representative Generative Recommendation System
OneRec is a pioneering work that applies generative modeling principles to recommendation systems. It treats item IDs as discrete tokens in a “vocabulary” and learns to generate appropriate item sequences using autoregressive generation, similar to language modeling.
Key Innovations of OneRec
(1) Item as Vocabulary
OneRec conceptualizes the entire item catalog as a vocabulary where each item ID corresponds to a token, using item embeddings similar to word embeddings in LLMs. This allows the model to leverage techniques from natural language processing and sequence generation.
(2) Autoregressive Item Generation
OneRec models the recommendation process as an autoregressive generation task using transformer-based architecture. The model learns to predict next item probabilities based on user history and contextual information, generating items sequentially where each item conditions subsequent generation.
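The following is a hypothetical sketch of such an autoregressive loop, not OneRec’s actual implementation; `toy_scores` stands in for a transformer decoder over item tokens:

```python
import math, random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    t = sum(exps)
    return [e / t for e in exps]

def generate_items(score_fn, history, num_items, steps=3, seed=0):
    """Autoregressive item generation sketch: each generated item is
    appended to the context and conditions the next step."""
    rng = random.Random(seed)
    context = list(history)
    for _ in range(steps):
        probs = softmax(score_fn(context, num_items))
        # Sample the next "item token" instead of taking the argmax
        next_item = rng.choices(range(num_items), weights=probs)[0]
        context.append(next_item)
    return context[len(history):]

# Toy scorer: favors items close (by ID) to the last item in the context
def toy_scores(context, num_items):
    last = context[-1]
    return [-abs(i - last) for i in range(num_items)]

recs = generate_items(toy_scores, history=[2], num_items=6)
print(len(recs))  # 3
```

The essential property is visible even in this toy: item t+1 is drawn conditioned on item t, so the output is a sequence, not an independent top-k list.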
(3) Generation with Sampling Strategies
OneRec employs similar sampling strategies as LLMs for item generation, including temperature scaling and nucleus (top-p) sampling. This enables controlled randomness in generation, allowing for diverse and novel recommendations rather than always selecting the most probable items.
(4) Training Objective
OneRec is trained using standard language modeling objectives with cross-entropy loss for positive items and optional contrastive loss for negative items. This approach learns the underlying distribution of items given user context.
Advantages of OneRec’s Generative Approach
| Aspect | Traditional DLRM | OneRec (Generative) |
|---|---|---|
| Candidate Selection | Pre-defined candidate set | Generates from entire item space |
| Diversity | Limited to top candidates | Sampling provides natural diversity |
| Novelty | Biased to popular items | Can generate unexpected items |
| Scalability | O(N) scoring over candidates | Generation cost independent of catalog size N |
| Cold Start | Needs item embeddings | Can generate from distribution |
Comparison with Traditional Ranking
The fundamental difference lies in the modeling philosophy:
# Traditional DLRM: Probability modeling for ranking
def dlrm_objective(model, user, item):
    return log P(click | user, item)  # Learn to predict interactions

# OneRec: Distribution modeling for generation
def onerec_objective(model, user, context):
    return log P(item | user, context)  # Learn the item distribution
While both approaches involve probability distributions, the key distinction is:
- DLRM: Models interaction probabilities for existing candidates → ranking
- OneRec: Models the underlying item distribution → generation
This paradigm shift enables generative recommendation systems to directly produce recommendations rather than ranking pre-defined candidates, offering greater flexibility and potential for novel discovery.
Scaling vs. Generative Recommendation
Most so-called Generative Recommendation does not actually model the probability of item occurrence; it still models interaction probabilities (CTR, CVR). Such systems adopt a Transformer-based architecture and scale it up, which exploits scaling laws but is not generative.
Kunlun: Establishing Scaling Laws for Recommendation Systems
Meta’s recent work “Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design” establishes scaling laws for recommendation systems similar to those found in large language models. These laws describe how model performance scales with:
- Model Size: Number of parameters in the recommendation model
- Data Volume: Amount of training data (user interactions, impressions)
- Compute Budget: Computational resources available for training and inference
The key insight is that as these factors increase, recommendation performance improves predictably according to power-law relationships rather than hitting diminishing returns and plateauing early.
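The power-law form can be sketched directly; the constants `a` and `b` below are made up for illustration, not Kunlun’s fitted values:

```python
# Scaling-law sketch: loss follows a power law in model size N,
#   L(N) = a * N ** (-b)
a, b = 10.0, 0.1

def predicted_loss(n_params):
    return a * n_params ** (-b)

# Under a power law, doubling model size cuts loss by the same
# constant factor at every scale, rather than plateauing early
ratio_small = predicted_loss(2e8) / predicted_loss(1e8)
ratio_large = predicted_loss(2e10) / predicted_loss(1e10)
print(abs(ratio_small - ratio_large) < 1e-9)  # True
```

This scale-invariance is what makes the laws useful for planning: a fit at small scale extrapolates to large scale.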
Performance Optimization through Scaling Laws
Rather than arbitrary architectural choices, Kunlun uses scaling laws to guide model design:
- Optimal Model Size: Determining the ideal number of parameters for given constraints
- Resource Allocation: Balancing memory, latency, and throughput requirements
- Data Efficiency: Understanding how much data is needed to train models of different sizes
This principled approach leads to better performance with more efficient resource utilization.
Conclusion
Generative Recommendation differs from traditional RS in its modeling objective: it models the exposure sequence itself, rather than ranking candidates by interaction probability and exposing the top ones.