2026-06-30

RoPE (Rotary Position Embedding)

Rotary Position Embedding (RoPE) encodes position by rotating the query and key vectors in self-attention by angles proportional to their absolute positions. The dot product between the rotated vectors then depends on position only through the relative offset, so an operation defined on absolute positions yields a relative position encoding. RoPE is the dominant position encoding scheme in modern LLMs, including LLaMA, GPT-NeoX, PaLM, Qwen, and Mistral.

1. Core Concept

1.1 Motivation

Problem with existing position encodings:

Sinusoidal (Absolute): Encodes absolute positions but doesn’t explicitly model relative distances in attention computation
Learned Absolute: Fixed maximum sequence length, cannot extrapolate
Relative (Shaw et al.): Adds learned relative-position embeddings inside the attention computation, which requires $O (N^{2})$ extra pairwise terms and complicates efficient attention kernels
ALiBi: Simple linear bias but lacks expressiveness for complex position patterns

RoPE’s key insight: Instead of adding position information to the token embeddings or attention scores, rotate the query and key vectors so that their inner product inherently encodes relative position:

⟨ f_{q} (x_{m}, m), f_{k} (x_{n}, n) ⟩ = g (x_{m}, x_{n}, m - n)

[!NOTE] Intuition
Imagine two vectors in 2D. If you rotate both by the same angle, their dot product stays the same. But if you rotate them by different angles proportional to their positions, the dot product depends only on the difference in rotation angles — i.e., the relative position.

1.2 High-Level Mechanism

Without RoPE:
  Attention(Q, K) = softmax(Q @ K^T / sqrt(d))

With RoPE:
  Q' = Rotate(Q by position m)
  K' = Rotate(K by position n)
  Attention(Q', K') = softmax(Q' @ K'^T / sqrt(d))
                      → naturally encodes (m - n)

Key advantage: RoPE is applied directly to Q and K before the attention computation, so no extra parameters or architectural changes are needed.

2. Mathematical Formulation

2.1 2D Case: Rotation Matrix

For a 2D vector $x = (x_{1}, x_{2})$ , rotating by angle $θ$ :

R (θ) x = (\begin{matrix} \cos θ & - \sin θ \\ \sin θ & \cos θ \end{matrix}) (\begin{matrix} x_{1} \\ x_{2} \end{matrix})

For position $m$ , set $θ = m \cdot ω$ where $ω$ is a frequency:

f_{{q, k}} (x_{m}, m) = R (m \cdot ω) \cdot W_{{q, k}} x_{m}

Then the dot product between query at position $m$ and key at position $n$ :

⟨ f_{q} (x_{m}, m), f_{k} (x_{n}, n) ⟩ = (R (m ω) q)^{⊤} (R (n ω) k) = q^{⊤} R ((n - m) ω) k

The result depends only on the relative position $(n - m)$ — the key property.
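The 2D identity above can be checked numerically. The sketch below is a minimal illustration in plain Python; the frequency `omega` and the vectors `q` and `k` are arbitrary values chosen for the example, not taken from any model.

```python
import math

def rotate2d(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

omega = 0.3          # illustrative frequency (assumption)
q = (1.0, 2.0)       # illustrative query vector (already projected)
k = (-0.5, 1.5)      # illustrative key vector (already projected)

def score(m, n):
    """Dot product of the query rotated to position m and the key rotated to position n."""
    return dot(rotate2d(q, m * omega), rotate2d(k, n * omega))

# The same offset n - m = 3 at different absolute positions gives the same score.
s1 = score(2, 5)
s2 = score(10, 13)
s3 = score(100, 103)

# It also matches q^T R((n - m) * omega) k computed directly.
direct = dot(q, rotate2d(k, 3 * omega))
print(s1, s2, s3, direct)
```

Changing the offset (for example `score(2, 6)`) changes the value, while shifting both positions by the same amount does not.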

2.2 General $d$ -Dimensional Case

For a $d$ -dimensional vector ( $d$ even), RoPE pairs up adjacent dimensions $(2 i, 2 i + 1)$ and applies a 2D rotation to each pair with a different frequency:

Θ = \{ θ_{i} = 10000^{- 2 i / d} \mid i = 0, 1, \dots, d / 2 - 1 \}

The rotary matrix $R_{Θ, m}^{d}$ is a block-diagonal matrix:

R_{Θ, m}^{d} = (\begin{matrix} \cos m θ_{0} & - \sin m θ_{0} & 0 & 0 & \dots \\ \sin m θ_{0} & \cos m θ_{0} & 0 & 0 & \dots \\ 0 & 0 & \cos m θ_{1} & - \sin m θ_{1} & \dots \\ 0 & 0 & \sin m θ_{1} & \cos m θ_{1} & \dots \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋱ \end{matrix})

RoPE definition:

f_{{q, k}} (x_{m}, m) = R_{Θ, m}^{d} \cdot W_{{q, k}} x_{m}
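Because the matrix is block-diagonal, it never needs to be materialized: each adjacent pair can be rotated on its own. The following is a minimal plain-Python sketch of that idea; the function names and the sample vector are illustrative, and the input stands in for an already-projected vector $W_{{q, k}} x_{m}$.

```python
import math

def rope_frequencies(d, base=10000.0):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def apply_rope(x, m, base=10000.0):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by the angle m * theta_i."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even dimension")
    out = [0.0] * d
    for i, theta in enumerate(rope_frequencies(d, base)):
        c, s = math.cos(m * theta), math.sin(m * theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = c * x0 - s * x1
        out[2 * i + 1] = s * x0 + c * x1
    return out

x = [1.0, 0.0, 0.5, -0.5, 2.0, 1.0, 0.0, 3.0]   # illustrative values
rotated = apply_rope(x, m=7)
print(rotated)
```

Each block is an orthogonal rotation, so the vector norm is unchanged and position 0 leaves the vector as it was.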

2.3 Compact Form Using Complex Numbers

Since each 2D rotation corresponds to multiplication by a complex exponential $e^{i θ}$ , RoPE can be expressed elegantly in complex form:

View $x \in \mathbb{R}^{d}$ as a complex vector $x \in \mathbb{C}^{d / 2}$ by pairing dimensions:

x_{complex} = (x_{0} + i x_{1}, x_{2} + i x_{3}, \dots, x_{d - 2} + i x_{d - 1})

Then RoPE is simply:

f_{{q, k}} (x_{m}, m)_{j} = (W_{{q, k}} x_{m})_{complex}^{(j)} \cdot e^{i m θ_{j}}

where $θ_{j} = 10000^{- 2 j / d}$ and $j = 0, 1, \dots, d / 2 - 1$ .
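The complex form maps directly onto Python's built-in complex numbers. The sketch below is an illustration with made-up vectors: it pairs dimensions, multiplies by $e^{i m θ_{j}}$, and uses the fact that the real inner product of two paired vectors equals the real part of $\sum_{j} a_{j} \overline{b_{j}}$.

```python
import cmath

def to_complex(x):
    """Pair (x[2j], x[2j+1]) into the complex number x[2j] + i * x[2j+1]."""
    return [complex(x[2 * j], x[2 * j + 1]) for j in range(len(x) // 2)]

def rope_complex(x, m, base=10000.0):
    """Multiply the j-th complex pair by exp(i * m * theta_j), theta_j = base^(-2j/d)."""
    d = len(x)
    return [z * cmath.exp(1j * m * base ** (-2.0 * j / d))
            for j, z in enumerate(to_complex(x))]

def real_dot(a, b):
    """Real inner product of two vectors stored as complex pairs."""
    return sum((za * zb.conjugate()).real for za, zb in zip(a, b))

q = [0.3, -1.2, 0.8, 0.1, -0.4, 0.9]    # illustrative values
k = [1.1, 0.2, -0.7, 0.5, 0.6, -0.3]    # illustrative values

score_a = real_dot(rope_complex(q, 4), rope_complex(k, 9))     # offset 5
score_b = real_dot(rope_complex(q, 40), rope_complex(k, 45))   # offset 5 again
print(score_a, score_b)
```

Both scores agree because only the phase difference $(m - n) θ_{j}$ survives in each term.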

2.4 Frequency Design

The frequencies follow a geometric progression inspired by the sinusoidal position encoding:

$j$	$θ_{j}$	Period (tokens for full rotation)
0	$10000^{0} = 1$	$2 π \approx 6.28$
1	$10000^{- 2 / d}$	$\approx 2 π \cdot 10000^{2 / d}$
$d / 2 - 1$	$10000^{- (d - 2) / d} \approx 10^{- 4}$	$\approx 2 π \cdot 10000$

Design rationale:

High frequencies (small $j$ ): Capture local, short-range position patterns
Low frequencies (large $j$ ): Capture global, long-range position patterns
The exponential spacing ensures coverage across all scales
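To make the scale of these periods concrete, the following sketch tabulates $θ_{j}$ and the corresponding period $2 π / θ_{j}$. The head dimension `d=128` is an assumption chosen for illustration.

```python
import math

def rope_periods(d, base=10000.0):
    """Return (j, theta_j, period in tokens) for each dimension pair."""
    rows = []
    for j in range(d // 2):
        theta = base ** (-2.0 * j / d)
        rows.append((j, theta, 2.0 * math.pi / theta))
    return rows

rows = rope_periods(d=128)
for j, theta, period in (rows[0], rows[1], rows[32], rows[-1]):
    print(f"j={j:3d}  theta={theta:.6f}  period={period:10.1f} tokens")
```

For finite $d$ the slowest pair has period $2 π \cdot 10000^{(d - 2) / d}$, which is somewhat below the limiting value $2 π \cdot 10000$ (roughly 5.4e4 tokens for `d=128`).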

3. Key Properties

3.1 Relative Position Encoding

Theorem: RoPE implicitly encodes relative position through rotation.

Proof sketch:

\begin{aligned} ⟨ f_{q} (x_{m}, m), f_{k} (x_{n}, n) ⟩ & = (R_{Θ, m}^{d} q_{m})^{⊤} (R_{Θ, n}^{d} k_{n}) \\ & = q_{m}^{⊤} (R_{Θ, m}^{d})^{⊤} R_{Θ, n}^{d} k_{n} \\ & = q_{m}^{⊤} R_{Θ, n - m}^{d} k_{n} \end{aligned}

The last equality uses the rotation matrix property: $(R_{Θ, m}^{d})^{⊤} R_{Θ, n}^{d} = R_{Θ, n - m}^{d}$ .

The attention score depends on $q_{m}$ , $k_{n}$ , and the relative position $n - m$ — never on the absolute positions independently.

3.2 Long-Term Decay

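A further property: as the relative distance $| m - n |$ increases, an upper bound on the attention score decays. In complex form, the score is the real part of a sum over dimension pairs:

⟨ f_{q} (x_{m}, m), f_{k} (x_{n}, n) ⟩ = Re [ \sum_{j = 0}^{d / 2 - 1} h_{j} e^{i (m - n) θ_{j}} ], h_{j} = q_{m}^{(j)} \overline{k_{n}^{(j)}}

where $q_{m}^{(j)}$ and $k_{n}^{(j)}$ are the $j$ -th complex pairs of the projected query and key. With partial sums $S_{j} = \sum_{l = 0}^{j - 1} e^{i (m - n) θ_{l}}$ and the convention $h_{d / 2} = 0$ , summation by parts (Su et al., 2021) gives:

| \sum_{j = 0}^{d / 2 - 1} h_{j} e^{i (m - n) θ_{j}} | \leq ( max_{j} | h_{j + 1} - h_{j} | ) \sum_{j = 0}^{d / 2 - 1} | S_{j + 1} |

Because the complex exponentials have different frequencies, they increasingly cancel in the partial sums as $| m - n |$ grows, so the average of $| S_{j} |$ shrinks with distance. This creates a natural locality bias without any explicit distance penalty. Note that the decay applies to the bound: individual scores are not guaranteed to decrease monotonically with distance.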
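The decay can be inspected numerically. The RoFormer paper's argument bounds the score by a quantity proportional to the magnitudes of the partial sums $S_{j} = \sum_{l < j} e^{i (m - n) θ_{l}}$ ; the sketch below computes their average for a few relative distances. The head dimension `d=128` and the sampled distances are assumptions for illustration.

```python
import cmath

def mean_abs_partial_sums(rel_dist, d=128, base=10000.0):
    """Average of |S_j| for j = 1 .. d/2, where S_j = sum_{l<j} exp(i * rel_dist * theta_l)."""
    half = d // 2
    total, partial = 0.0, 0j
    for l in range(half):
        partial += cmath.exp(1j * rel_dist * base ** (-2.0 * l / d))
        total += abs(partial)
    return total / half

for r in (0, 1, 10, 50, 100, 250):
    print(r, round(mean_abs_partial_sums(r), 2))
```

At distance 0 every term has phase 0, so the average is at its maximum; any nonzero distance introduces phase differences and lowers it.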

3.3 Properties Summary

Property	RoPE	Sinusoidal Absolute	Learned Absolute	ALiBi
Relative encoding	✅ (implicit)	❌	❌	✅ (additive)
No extra parameters	✅	✅	❌	✅
Theoretically unbounded length	✅	✅	❌	✅
Long-term decay	✅ (theoretical)	❌	❌	✅ (linear)
Dimension-specific frequencies	✅ (per pair)	✅ (per dim)	❌	❌
Extrapolation ability	Good (with tuning)	Poor	Cannot	Best

4. Implementation

4.1 Efficient RoPE (Precomputed Frequencies)

import torch
import torch.nn as nn

class RotaryPositionEmbedding(nn.Module):
    """Efficient RoPE implementation with precomputed sin/cos tables.
    
    Args:
        dim: Head dimension (must be even)
        max_seq_len: Maximum sequence length for precomputation
        theta: Base frequency (default: 10000.0)
    """
    def __init__(self, dim: int, max_seq_len: int = 2048, theta: float = 10000.0):
        super().__init__()
        self.dim = dim
        self.max_seq_len = max_seq_len
        
        # Compute frequencies: theta_i = theta^(-2i/d)
        # Shape: (dim // 2,)
        freq_seq = torch.arange(0, dim, 2, dtype=torch.float32)
        inv_freq = 1.0 / (theta ** (freq_seq / dim))
        
        # Register as buffer (not a parameter, but moved with model)
        self.register_buffer('inv_freq', inv_freq, persistent=False)
        
        # Precompute sin/cos for all positions
        self._compute_sin_cos(max_seq_len)
    
    def _compute_sin_cos(self, seq_len: int):
        """Precompute sin and cos tables for fast lookup."""
        # Position indices: (seq_len,)
        t = torch.arange(seq_len, dtype=torch.float32, device=self.inv_freq.device)
        # Frequencies: (seq_len, dim // 2)
        freqs = torch.outer(t, self.inv_freq)
        # Embeddings: (seq_len, dim)
        emb = torch.cat([freqs, freqs], dim=-1)
        
        self.register_buffer('cos_cached', emb.cos(), persistent=False)
        self.register_buffer('sin_cached', emb.sin(), persistent=False)
    
    def forward(self, x: torch.Tensor, position_ids: torch.Tensor = None):
        """Apply rotary position embedding to x.
        
        Args:
            x: Input tensor of shape (batch, seq_len, num_heads, head_dim)
            position_ids: Optional position indices, shape (batch, seq_len)
        
        Returns:
            Rotated tensor of same shape
        """
        seq_len = x.shape[1]
        
        if position_ids is not None:
            cos = self.cos_cached[position_ids]  # (batch, seq_len, dim)
            sin = self.sin_cached[position_ids]
        else:
            cos = self.cos_cached[:seq_len].unsqueeze(0)  # (1, seq_len, dim)
            sin = self.sin_cached[:seq_len].unsqueeze(0)
        
        # Add head dimension: (batch, seq_len, 1, dim)
        cos = cos.unsqueeze(2)
        sin = sin.unsqueeze(2)
        
        # Apply rotation: x_rot = x * cos + rotate_half(x) * sin
        return (x * cos) + (self._rotate_half(x) * sin)
    
    @staticmethod
    def _rotate_half(x: torch.Tensor) -> torch.Tensor:
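        """Map the two halves (x1, x2) of the last dimension to (-x2, x1).
        
        This pairs dimension i with dimension i + dim/2, not the adjacent
        pairs (2i, 2i+1) used in the formulas. The two layouts are
        equivalent up to a fixed permutation of the dimensions.
        """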
        # x shape: (..., dim)
        x1 = x[..., : x.shape[-1] // 2]
        x2 = x[..., x.shape[-1] // 2 :]
        return torch.cat([-x2, x1], dim=-1)
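The implementation above pairs dimension `i` with dimension `i + dim/2`, whereas the formulas in Section 2 pair adjacent dimensions `(2i, 2i+1)`. The plain-Python sketch below (function names are illustrative) checks that the two layouts agree once the input is permuted so that even-indexed entries come first.

```python
import math

def rope_interleaved(x, m, base=10000.0):
    """Formula convention: rotate adjacent pairs (x[2i], x[2i+1])."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[2 * i] = c * x[2 * i] - s * x[2 * i + 1]
        out[2 * i + 1] = s * x[2 * i] + c * x[2 * i + 1]
    return out

def rope_half_split(x, m, base=10000.0):
    """rotate_half convention: pair x[i] with x[i + d/2]."""
    d = len(x)
    half = d // 2
    out = list(x)
    for i in range(half):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = c * x[i] - s * x[i + half]
        out[i + half] = s * x[i] + c * x[i + half]
    return out

def to_half_layout(x):
    """Permutation that moves even-indexed entries to the first half."""
    return x[0::2] + x[1::2]

x = [0.2, -1.0, 0.7, 0.4, -0.3, 1.5, 0.9, -0.6]   # illustrative values
a = rope_half_split(to_half_layout(x), m=11)
b = to_half_layout(rope_interleaved(x, m=11))
print(max(abs(u - v) for u, v in zip(a, b)))
```

Since the projection weights are learned, a model trained with one layout simply absorbs the permutation; the layouts only need to match when loading weights from a different implementation.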

4.2 Applying RoPE in Self-Attention

import torch.nn.functional as F  # needed for F.softmax below

class RoPESelfAttention(nn.Module):
    """Self-attention with Rotary Position Embedding."""
    
    def __init__(self, hidden_dim: int, num_heads: int, max_seq_len: int = 2048):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        
        self.q_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.k_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.o_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
        
        # RoPE only applied to Q and K
        self.rotary = RotaryPositionEmbedding(self.head_dim, max_seq_len)
    
    def forward(self, x: torch.Tensor):
        B, N, D = x.shape
        
        # Project to Q, K, V and reshape to multi-head
        q = self.q_proj(x).view(B, N, self.num_heads, self.head_dim)
        k = self.k_proj(x).view(B, N, self.num_heads, self.head_dim)
        v = self.v_proj(x).view(B, N, self.num_heads, self.head_dim)
        
        # Apply RoPE to Q and K (NOT to V)
        q = self.rotary(q)  # (B, N, H, d)
        k = self.rotary(k)  # (B, N, H, d)
        
        # Compute attention
        q = q.transpose(1, 2)  # (B, H, N, d)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)
        
        scale = self.head_dim ** -0.5
        attn_weights = torch.matmul(q, k.transpose(-2, -1)) * scale
        attn_weights = F.softmax(attn_weights, dim=-1)
        
        out = torch.matmul(attn_weights, v)
        out = out.transpose(1, 2).reshape(B, N, D)
        
        return self.o_proj(out)

4.3 HuggingFace-style Implementation

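# Minimal RoPE helper in the style of HuggingFace's LLaMA/Mistral code.
# Note: tensors here are (batch, seq_len, num_heads, head_dim); HuggingFace places num_heads before seq_len.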
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None):
    """Apply rotary position embedding to q and k tensors.
    
    Args:
        q, k: Query and key tensors, shape (batch, seq_len, num_heads, head_dim)
        cos, sin: Precomputed cosine/sine tables, shape (max_seq_len, head_dim)
        position_ids: Optional explicit position indices
    
    Returns:
        q_embed, k_embed: Rotated query and key tensors
    """
    if position_ids is not None:
        cos = cos[position_ids].unsqueeze(2)  # (batch, seq_len, 1, head_dim)
        sin = sin[position_ids].unsqueeze(2)
    else:
        cos = cos[:q.shape[1]].unsqueeze(0).unsqueeze(2)
        sin = sin[:q.shape[1]].unsqueeze(0).unsqueeze(2)
    
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


def rotate_half(x):
    """Rotate half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

4.4 Computational Cost

Operation	Complexity	Notes
Frequency precomputation	$O (d \cdot L_{max})$	One-time, negligible
RoPE application	$O (B \cdot N \cdot d)$	Simple element-wise operations
Vs. Attention computation	$O (B \cdot H \cdot N^{2} \cdot d)$	RoPE adds $< 1 %$ overhead

RoPE is cheap: per token it costs two element-wise multiplications, one addition, and one concatenation, which is negligible next to the attention computation itself.
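A rough back-of-the-envelope count supports the overhead figure. The sketch below assumes a simple FLOP convention (three FLOPs per element for the rotation of each of Q and K, and two matrix products of about $2 N^{2} d$ FLOPs each per head for attention, ignoring the softmax); the model sizes are illustrative.

```python
def rope_flops(batch, seq_len, num_heads, head_dim):
    """Two multiplies and one add per element, applied to both Q and K (rough count)."""
    return 2 * 3 * batch * seq_len * num_heads * head_dim

def attention_flops(batch, seq_len, num_heads, head_dim):
    """Q @ K^T and A @ V, each about 2 * N^2 * d FLOPs per head (softmax ignored)."""
    return 2 * 2 * batch * num_heads * seq_len * seq_len * head_dim

for n in (512, 2048, 8192):
    ratio = rope_flops(1, n, 32, 128) / attention_flops(1, n, 32, 128)
    print(f"N={n}: RoPE / attention FLOPs = {ratio:.4%}")
```

Under this convention the ratio is $1.5 / N$ , so the relative cost of RoPE shrinks further as the sequence gets longer.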

5. Comparison with Other Position Encodings

5.1 Sinusoidal (Vaswani et al., 2017)

Original Transformer:

P E_{(p o s, 2 i)} = \sin (\frac{p o s}{10000^{2 i / d}}), P E_{(p o s, 2 i + 1)} = \cos (\frac{p o s}{10000^{2 i / d}})

Aspect	Sinusoidal	RoPE
Where applied	Added to token embeddings (before attention)	Rotates Q and K (within attention)
Relative position	Not directly encoded	Encoded via rotation difference
Extrapolation	Poor (in practice)	Good (with tuning)
Trainable	No	No
Theoretical elegance	Moderate	High (group-theoretic foundation)

5.2 Learned Absolute

Positions 0 through $L_{max} - 1$ each have a learnable embedding vector.

Aspect	Learned Absolute	RoPE
Parameters	$L_{max} \times d$	0
Max length	Fixed (hard limit)	Theoretically unbounded
Extrapolation	Cannot (no embeddings for $> L_{max}$ )	Good
Inductive bias	None (must learn everything)	Rotation-based relative encoding
Training cost	Minimal	Minimal

5.3 ALiBi (Press et al., 2022)

Adds a static, non-learned linear bias to attention scores:

Attention (Q, K, V) = softmax (\frac{Q K^{⊤}}{\sqrt{d_{k}}} + m \cdot B)

where $B_{i j} = - | i - j |$ and $m$ is a head-specific slope.
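For comparison with RoPE's rotations, the ALiBi bias is easy to construct explicitly. The sketch below uses the geometric slope schedule described by Press et al. for power-of-two head counts; the function names are illustrative.

```python
def alibi_slopes(num_heads):
    """Head-specific slopes 2^(-8/H), 2^(-16/H), ... for a power-of-two head count H."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias matrix slope * B with B[i][j] = -|i - j|."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

The bias is added to the attention logits before the softmax, so larger distances are penalized linearly, with each head using its own slope.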

Aspect	ALiBi	RoPE
Mechanism	Add bias to attention scores	Rotate Q and K vectors
Bias form	Linear decay	Sinusoidal decay (richer pattern)
Extrapolation	Excellent (by design)	Good
Expressiveness	Low (single slope per head)	High (dimension-specific frequencies)
Adoption	BLOOM, early models	LLaMA, GPT-NeoX, PaLM, most modern LLMs

5.4 Decision Guide

Scenario	Recommended Encoding	Reason
LLM training (2024+)	RoPE	Industry standard, best overall performance
Extreme extrapolation (>10×)	ALiBi or NTK-RoPE	ALiBi excels at length generalization
Maximum simplicity	ALiBi	No learnable or computed embeddings
Legacy compatibility	Learned or Sinusoidal	BERT, GPT-2 style
Theory-oriented research	RoPE	Rich mathematical structure

6. RoPE in Modern LLMs

6.1 Adoption Timeline

2021: RoPE proposed (Su et al.)
  │
2022: GPT-NeoX-20B (EleutherAI) — first major Western LLM with RoPE
  │    PaLM (Google) — adopts RoPE
  │
2023: LLaMA (Meta) — adopts RoPE, becomes most influential open-source LLM
  │    LLaMA 2 (Meta) — retains RoPE
  │    Qwen (Alibaba) — adopts RoPE
  │    Mistral — adopts RoPE with sliding window
  │    Yi (01.AI) — adopts RoPE
  │
2024: LLaMA 3 (Meta) — RoPE with increased theta (500,000)
  │    DeepSeek-V2/V3 — RoPE with YaRN extension
  │    Qwen 2 — RoPE
  │    Phi-3 (Microsoft) — RoPE

6.2 Key Design Choices

Model	$θ$ (base frequency)	Max Length	RoPE Variant
LLaMA	10000	2048	Standard
LLaMA 2	10000	4096	Standard
LLaMA 3	500000	8192	High-theta
GPT-NeoX	10000	2048	Standard
Mistral	10000	8192 (sliding window)	Standard
Qwen 2	1000000	32768	High-theta
DeepSeek-V2	10000	128K	YaRN
Code LLaMA	1000000	16384	High-theta (long-context fine-tuning)

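6.3 Partial RoPE Configuration

Some models apply RoPE to only a fraction of the head dimension (partial RoPE) and leave the remaining dimensions unrotated, so those dimensions contribute a position-independent term to the attention score. GPT-NeoX-20B, for example, rotates the first 25% of each head, and GPT-J rotates 64 of its 256 head dimensions. LLaMA itself rotates the full head dimension.

# Partial RoPE (GPT-NeoX-style): rotate only the first `partial_rope_dim` dimensions.
# `cos` and `sin` must be built for `partial_rope_dim`, not the full head dimension.
def apply_partial_rope(q, k, cos, sin, partial_rope_dim=None):
    if partial_rope_dim is not None:
        # Split each head into a rotated part and a pass-through part
        q_rot, q_pass = q[..., :partial_rope_dim], q[..., partial_rope_dim:]
        k_rot, k_pass = k[..., :partial_rope_dim], k[..., partial_rope_dim:]
        q_rot, k_rot = apply_rotary_pos_emb(q_rot, k_rot, cos, sin)
        q = torch.cat([q_rot, q_pass], dim=-1)
        k = torch.cat([k_rot, k_pass], dim=-1)
    else:
        # Full RoPE (as in LLaMA)
        q, k = apply_rotary_pos_emb(q, k, cos, sin)
    return q, k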

7. Extensions and Variants

7.1 NTK-Aware Scaled RoPE

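Problem: Standard RoPE with $θ = 10000$ degrades when extrapolating to sequences longer than the training length, because the low-frequency pairs reach rotation angles never seen during training. Plain position interpolation avoids this by compressing every frequency equally, but that blurs the high-frequency pairs that distinguish neighboring tokens.

Solution: Increase the base $θ$ so that the lowest frequency is slowed by the full extension factor while the highest frequency is left unchanged: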

θ_{NTK} = θ \cdot α^{\frac{d}{d - 2}}

where $α$ is the desired extrapolation factor (e.g., $α = 2$ for 2× context extension).

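def ntk_aware_scaling(inv_freq, scale_factor: float):
    """Apply NTK-aware scaling to inverse frequencies.
    
    Equivalent to replacing the base theta with
    theta * scale_factor ** (dim / (dim - 2)): the highest frequency
    is unchanged and the lowest is divided by scale_factor.
    Requires dim > 2.
    """
    half_dim = inv_freq.shape[0]
    dim = half_dim * 2
    # Multiplier applied to the base theta
    ntk_factor = scale_factor ** (dim / (dim - 2))
    # Exponents 2j/d for j = 0 .. d/2 - 1, matching inv_freq = theta^(-2j/d)
    exponents = torch.arange(half_dim, dtype=inv_freq.dtype, device=inv_freq.device) * 2 / dim
    # (theta * ntk_factor)^(-2j/d) = inv_freq * ntk_factor^(-2j/d)
    return inv_freq / (ntk_factor ** exponents)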

Key insight: High frequencies (local patterns) should be preserved; low frequencies (global patterns) should be extended. NTK-aware scaling strikes this balance by interpolating in frequency space.
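A quick numeric check of the formula $θ_{NTK} = θ \cdot α^{d / (d - 2)}$ shows how the scaling is distributed across frequencies. The sketch below is plain Python; `d=128` and `alpha=2` are assumptions for illustration.

```python
def ntk_scaled_inv_freq(d, alpha, base=10000.0):
    """Inverse frequencies after replacing the base with base * alpha^(d / (d - 2))."""
    new_base = base * alpha ** (d / (d - 2))
    return [new_base ** (-2.0 * j / d) for j in range(d // 2)]

d, alpha = 128, 2.0
original = [10000.0 ** (-2.0 * j / d) for j in range(d // 2)]
scaled = ntk_scaled_inv_freq(d, alpha)

# Ratio of scaled to original frequency for every dimension pair
ratios = [s / o for s, o in zip(scaled, original)]
print(ratios[0], ratios[len(ratios) // 2], ratios[-1])
```

The ratio is 1 for the fastest pair and $1 / α$ for the slowest pair, with a smooth geometric transition in between. The exponent $d / (d - 2)$ is chosen precisely so that the slowest pair is stretched by exactly $α$ .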

7.2 YaRN (Yet another RoPE extensioN)

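YaRN (Peng et al., 2023) combines a per-frequency ("NTK-by-parts") interpolation of the RoPE frequencies with temperature scaling of the attention logits:

Attention = softmax (\frac{Q K^{⊤}}{\sqrt{d_{k}} \cdot T})

where the temperature $T$ depends only on the extension factor $s = L_{target} / L_{train}$ ; the paper recommends $\sqrt{1 / T} = 0.1 \ln s + 1$ for LLaMA-family models.

YaRN components:

High-frequency pairs: Left unchanged to preserve local resolution
Low-frequency pairs: Fully interpolated (divided by $s$ ), as in position interpolation
Intermediate pairs: Blended between the two with a linear ramp
Attention temperature: Rescale the softmax logits for long contexts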

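import math

def yarn_position_interpolation(inv_freq, original_max_len, extended_max_len,
                                 alpha: float = 1.0, beta: float = 32.0):
    """YaRN sketch: NTK-by-parts frequency interpolation plus attention temperature.
    
    alpha and beta are the ramp thresholds, measured in rotations over the
    original context; 1 and 32 are the values suggested for LLaMA in
    Peng et al. (2023). Position indices are left unchanged.
    """
    scale_factor = extended_max_len / original_max_len
    
    # Part 1: rotations each pair completes over the original context window
    rotations = original_max_len * inv_freq / (2 * math.pi)
    
    # Part 2: ramp in [0, 1]; 0 = interpolate fully (low freq), 1 = keep (high freq)
    ramp = ((rotations - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    scaled_inv_freq = (1.0 - ramp) * inv_freq / scale_factor + ramp * inv_freq
    
    # Part 3: temperature, from sqrt(1 / T) = 0.1 * ln(s) + 1
    temperature = 1.0 / (0.1 * math.log(scale_factor) + 1.0) ** 2
    
    return scaled_inv_freq, temperature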

7.3 Linear (PI) vs Dynamic NTK

Method	Mechanism	Quality	Compute Cost	Example
Position Interpolation (PI)	Linearly scale all positions	Moderate (blurring)	Zero	Early LLaMA extensions
NTK-Aware	Scale $θ$ exponentially	Good	Zero	Code LLaMA
Dynamic NTK	Adapt $θ$ per sequence length	Very Good	Minimal	Modern deployments
YaRN	NTK-by-parts + temperature tuning	Best	Minimal	DeepSeek, long-context models
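The difference between the first three rows can be made concrete by comparing rotation angles. In the sketch below, the Dynamic NTK formula is an assumption modeled on a commonly used open-source form (the base grows only once the sequence exceeds the training length); all names and sizes are illustrative.

```python
def inv_freq(d, base=10000.0):
    """theta_j = base^(-2j/d) for j = 0 .. d/2 - 1."""
    return [base ** (-2.0 * j / d) for j in range(d // 2)]

def pi_angles(pos, d, alpha):
    """Position Interpolation: divide the position index by alpha."""
    return [(pos / alpha) * f for f in inv_freq(d)]

def ntk_angles(pos, d, alpha, base=10000.0):
    """NTK-aware: keep positions, enlarge the base."""
    return [pos * f for f in inv_freq(d, base * alpha ** (d / (d - 2)))]

def dynamic_ntk_base(seq_len, train_len, alpha, d, base=10000.0):
    """Dynamic NTK (assumed form): enlarge the base only when seq_len > train_len."""
    if seq_len <= train_len:
        return base
    return base * ((alpha * seq_len / train_len) - (alpha - 1)) ** (d / (d - 2))

d, alpha, train_len = 128, 2.0, 4096
slowest = inv_freq(d)[-1]

# Slowest pair at twice the training length: both methods map it back into the trained range.
print(pi_angles(2 * train_len, d, alpha)[-1], ntk_angles(2 * train_len, d, alpha)[-1], train_len * slowest)

# Fastest pair, one token apart: PI halves the angular step, NTK-aware keeps it.
print(pi_angles(1, d, alpha)[0], ntk_angles(1, d, alpha)[0])
```

This is the "blurring" noted in the table: PI shrinks the angular step between neighboring tokens for every pair, while NTK-aware scaling leaves the fastest pair untouched.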

7.4 2D RoPE (Vision)

RoPE generalizes naturally to 2D for vision Transformers. For a patch at position $(h, w)$ , two independent sets of frequencies encode row and column positions separately:

f_{{q, k}} (x_{(h, w)}, (h, w)) = R_{Θ_{h}, h} \cdot R_{Θ_{w}, w} \cdot W x_{(h, w)}

This encodes 2D relative position naturally — the dot product depends on $(h_{q} - h_{k}, w_{q} - w_{k})$ .

Applications: [[Vision Transformer (ViT)|ViT]] variants, [[DiT]], video Transformers.
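A minimal sketch of the 2D scheme, assuming an axial split in which the first half of the complex pairs is rotated by the row index and the second half by the column index. This split is one common design and an assumption here, as are all names and values.

```python
import cmath

def rope_1d(z, pos, base=10000.0):
    """Rotate complex pairs z_j by pos * theta_j, with theta_j = base^(-j / len(z))."""
    n = len(z)
    return [zj * cmath.exp(1j * pos * base ** (-j / n)) for j, zj in enumerate(z)]

def rope_2d(z, h, w):
    """Axial 2D RoPE sketch: first half of the pairs encodes the row, second half the column."""
    half = len(z) // 2
    return rope_1d(z[:half], h) + rope_1d(z[half:], w)

def score(a, b):
    """Real inner product of two vectors stored as complex pairs."""
    return sum((x * y.conjugate()).real for x, y in zip(a, b))

q = [0.5 + 1.0j, -1.0 + 0.2j, 0.3 - 0.7j, 1.1 + 0.4j]   # illustrative values
k = [0.9 - 0.1j, 0.4 + 0.6j, -0.8 + 0.5j, 0.2 - 1.3j]   # illustrative values

s1 = score(rope_2d(q, 2, 3), rope_2d(k, 5, 4))       # offsets (3, 1)
s2 = score(rope_2d(q, 12, 13), rope_2d(k, 15, 14))   # same offsets, shifted patch grid
print(s1, s2)
```

Shifting both patches by the same amount leaves the score unchanged, which is the 2D analogue of the relative-position property.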

8. Theoretical Analysis

8.1 Group-Theoretic Interpretation

RoPE has a clean interpretation via representation theory of the rotation group $S O (2)$ :

Each $(2 i, 2 i + 1)$ pair is a 2D irreducible representation of $S O (2)$
Different frequencies $θ_{i}$ correspond to different “rotation speeds”
The block-diagonal form is the canonical decomposition of rotation in high dimensions

This makes RoPE mathematically more principled than heuristic positional encodings.

8.2 Why Not Rotate V?

RoPE is typically applied only to Q and K, not to V:

Reason: The attention output is a weighted sum of V vectors:

{Output}_{m} = \sum_{n} A_{m n} V_{n}

If V were also rotated by position $n$ , the output would encode position information that might interfere with subsequent layers. Keeping V unrotated allows position information to be “consumed” by the attention pattern without accumulating in the residual stream.

8.3 Partial RoPE Analysis

Why some models (e.g., GPT-NeoX and GPT-J, but not LLaMA) use partial RoPE, rotating only part of the head dimension:

Position-independent channel: Unrotated dimensions let attention match on content regardless of position
Lower cost: Fewer dimensions are rotated per token
Empirical results: Some reported experiments find partial rotation comparable to or slightly better than full rotation, but the effect is not universal

The trade-off: the partial rotation ratio $ρ$ is typically set between 0.25 and 1.0, with $ρ = 1.0$ being full RoPE and smaller values trading relative encoding strength for position-independent capacity.

9. Practical Guidelines

9.1 Choosing $θ$ (Base Frequency)

$θ$	Max Effective Context	Use Case
10000 (default)	2K–4K tokens	Standard training
500000 (LLaMA 3)	8K tokens	Longer pre-training context
1000000 (Code LLaMA)	16K–32K tokens	Code, long documents
10000000+	128K+ tokens	Extreme long context

Rule of thumb: $θ = 10000$ works well for training length ≤ 4096. For longer contexts, either increase $θ$ or use NTK-aware scaling at inference.
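One quantity that changes with the base is the longest wavelength, i.e., the period of the slowest-rotating pair. The sketch below computes it for the bases in the table, assuming a head dimension of 128. It is a diagnostic for comparing bases, not the rule that produced the context lengths in the table.

```python
import math

def longest_wavelength(base, d=128):
    """Period in tokens of the slowest pair: 2 * pi * base^((d - 2) / d)."""
    return 2.0 * math.pi * base ** ((d - 2) / d)

for base in (1e4, 5e5, 1e6, 1e7):
    print(f"base={base:>12,.0f}  longest wavelength ~ {longest_wavelength(base):,.0f} tokens")
```

A larger base means the slowest pairs sweep a smaller fraction of a full turn over the same context, which leaves more room before unseen rotation angles appear at longer lengths.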

9.2 Sequence Length Extension Checklist

[ ] Set appropriate $θ$ for target context length
[ ] Apply NTK-aware scaling if extending beyond training length
[ ] Tune softmax temperature for long sequences (YaRN-style)
[ ] Verify perplexity doesn’t degrade at target length
[ ] Test on long-context benchmarks (Passkey Retrieval, LongBench)
[ ] Consider sliding window attention + RoPE (Mistral approach)

9.3 Common Pitfalls

Pitfall	Symptom	Fix
$θ$ too small	Poor performance on long sequences	Increase $θ$ or use NTK
$θ$ too large	Loss of local positional resolution	Decrease or use dynamic NTK
RoPE applied to V	Position leakage, degraded quality	Never rotate V
RoPE cached with wrong dtype	Numerical drift in long sequences	Use float32 for sin/cos tables
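The dtype pitfall can be illustrated without any ML library. The sketch below uses Python's `struct` module, whose `"e"` format is IEEE 754 half precision, to show what happens when a position index is stored in float16 before the angle is computed; the position 4097 is an arbitrary example.

```python
import math
import struct

def to_float16(x):
    """Round a Python float to IEEE 754 half precision and back (stdlib only)."""
    return struct.unpack("e", struct.pack("e", x))[0]

pos = 4097
stored = to_float16(float(pos))   # position index after a round trip through float16
print(stored)                     # no longer equal to 4097

# For the fastest pair (theta_0 = 1) the angle equals the position itself,
# so the rounding error becomes a phase error of about one radian.
print(math.cos(float(pos)), math.cos(stored))
```

Above 2048, float16 can no longer represent every integer, so neighboring positions collapse onto the same value. Computing the tables in float32 and casting only the final cos/sin values avoids this.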

10. Core Formula Cards

[!QUOTE] RoPE Definition (Matrix Form)
$f_{{q, k}} (x_{m}, m) = R_{Θ, m}^{d} \cdot W_{{q, k}} x_{m}$

[!QUOTE] Frequency Schedule
$Θ = \{ θ_{i} = 10000^{- 2 i / d} \mid i = 0, 1, \dots, \frac{d}{2} - 1 \}$

[!QUOTE] Relative Position Property
$⟨ f_{q} (x_{m}, m), f_{k} (x_{n}, n) ⟩ = g (x_{m}, x_{n}, m - n)$

[!QUOTE] Efficient Computation (Complex Form)
$RoPE (x, m)^{(j)} = (x_{2 j} + i x_{2 j + 1}) \cdot e^{i m θ_{j}}$

[!QUOTE] Efficient Computation (Real Form)
$(\begin{matrix} x_{2 j}^{'} \\ x_{2 j + 1}^{'} \end{matrix}) = (\begin{matrix} \cos (m θ_{j}) & - \sin (m θ_{j}) \\ \sin (m θ_{j}) & \cos (m θ_{j}) \end{matrix}) (\begin{matrix} x_{2 j} \\ x_{2 j + 1} \end{matrix})$

[!QUOTE] Attention Score with RoPE
$a_{m n} = \frac{1}{\sqrt{d_{k}}} \cdot q_{m}^{⊤} R_{Θ, n - m}^{d} k_{n}$

[!QUOTE] NTK-Aware Scaling
$θ_{NTK} = θ \cdot α^{\frac{d}{d - 2}}, α = \frac{L_{target}}{L_{train}}$

11. Summary

Aspect	Description
Core idea	Rotate Q and K by position-dependent angles so their dot product encodes relative position
Key mechanism	Block-diagonal 2D rotation matrices applied per dimension pair
Mathematical foundation	Rotation group $S O (2)$ representation theory
Why it works	$(R_{m} Q)^{⊤} (R_{n} K) = Q^{⊤} R_{n - m} K$ — pure relative encoding
Parameters	Zero additional learnable parameters
Computational cost	Negligible ( $< 1 %$ of attention cost)
Adoption	Dominant position encoding: LLaMA, GPT-NeoX, Mistral, Qwen, PaLM, DeepSeek, Phi-3
Extensions	NTK-aware scaling, YaRN, Dynamic NTK, 2D RoPE (vision), Partial RoPE
Key hyperparameter	$θ$ (base frequency) — controls long-range capacity
Comparison	Generally preferred over sinusoidal and learned absolute encodings; ALiBi extrapolates better out of the box, while RoPE is more expressive and narrows the gap with scaling methods

[[Vision Transformer (ViT)]]
[[DiT]]
[[Transformer]]
[[U-Net]]
[[ResNet]]
[[Diffusion Model]]

Dataview Query

LIST
FROM #rope OR #position_encoding OR #rotary_position_embedding
SORT file.ctime DESC

References

Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
Paper: RoPE: Rotary Position Embedding (Su et al., 2023 — extended analysis)
Paper: LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
Paper: YaRN: Efficient Context Window Extension of Large Language Models (Peng et al., 2023)
Paper: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (Press et al., 2022 — ALiBi)
Paper: GPT-NeoX-20B: An Open-Source Autoregressive Language Model (Black et al., 2022)
Paper: Code LLaMA: Open Foundation Models for Code (Rozière et al., 2023)
Blog: Rotary Embeddings: A Relative Revolution — EleutherAI Blog
Paper: Extending Context Window of Large Language Models via Positional Interpolation (Chen et al., 2023)
Blog: Applied RoPE Scaling — HuggingFace Blog
Code: https://github.com/huggingface/transformers (LlamaRotaryEmbedding)
Code: https://github.com/eleutherai/gpt-neox (original GPT-NeoX RoPE)
Code: https://github.com/jquesnelle/yarn (YaRN implementation)