RoPE (Rotary Position Embedding)

Rotary Position Embedding (RoPE) encodes position information by rotating the query and key vectors in self-attention according to their absolute positions. The dot product between the rotated vectors then depends on position only through the relative offset, so RoPE combines the simplicity of an absolute-position formulation with the inductive bias of relative position encoding. It is the dominant position encoding scheme in modern LLMs, including LLaMA, GPT-NeoX, PaLM, Qwen, and Mistral.


1. Core Concept

1.1 Motivation

Problems with existing position encodings:

  1. Sinusoidal (Absolute): Encodes absolute positions but doesn’t explicitly model relative distances in the attention computation
  2. Learned Absolute: Fixed maximum sequence length, cannot extrapolate
  3. Relative (Shaw et al.): Adds learned relative-position terms to every query-key pair, which is computationally expensive ($O(N^2)$ extra pairwise terms)
  4. ALiBi: Simple linear bias but lacks expressiveness for complex position patterns

RoPE’s key insight: Instead of adding position information to the token embeddings or attention scores, rotate the query and key vectors so that their inner product inherently encodes relative position:

$$\langle f_q(x_m, m),\, f_k(x_n, n) \rangle = g(x_m, x_n, m - n)$$

[!NOTE] Intuition
Imagine two vectors in 2D. If you rotate both by the same angle, their dot product stays the same. But if you rotate them by different angles proportional to their positions, the dot product depends only on the difference in rotation angles — i.e., the relative position.

1.2 High-Level Mechanism

Without RoPE:
Attention(Q, K) = softmax(Q @ K^T / sqrt(d))

With RoPE:
Q' = Rotate(Q by position m)
K' = Rotate(K by position n)
Attention(Q', K') = softmax(Q' @ K'^T / sqrt(d))
→ naturally encodes (m - n)

Key advantage: RoPE is applied directly to Q and K before the attention computation, so no extra parameters or architectural changes are needed.


2. Mathematical Formulation

2.1 2D Case: Rotation Matrix

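For a 2D vector $x = (x_1, x_2)^\top$, rotating by angle $\theta$ gives:

$$R(\theta)\,x = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

For position $m$, set $\theta = m\omega$, where $\omega$ is a frequency:

$$f_{\{q,k\}}(x_m, m) = R(m\omega)\, W_{\{q,k\}}\, x_m$$

Then the dot product between the query at position $m$ and the key at position $n$, writing $q = W_q x_m$ and $k = W_k x_n$, is:

$$\langle f_q(x_m, m),\, f_k(x_n, n) \rangle = (R(m\omega)\,q)^\top (R(n\omega)\,k) = q^\top R((n - m)\omega)\, k$$

The result depends only on the relative position $(n - m)$, which is the key property.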
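The 2D identity above can be checked numerically. The following is a minimal sketch in plain Python; the vectors, the frequency `omega = 0.1`, and the helper names are illustrative assumptions, not part of any library.

```python
import math

def rotate2d(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def score(q, k, m, n, omega=0.1):
    """Inner product of q rotated to position m and k rotated to position n."""
    return dot(rotate2d(q, m * omega), rotate2d(k, n * omega))

q, k = (1.0, 2.0), (0.5, -1.5)

# Same offset n - m = 4 at very different absolute positions.
s_near = score(q, k, m=3, n=7)
s_far = score(q, k, m=103, n=107)

# Closed form from the text: q^T R((n - m) * omega) k
s_closed = dot(q, rotate2d(k, 4 * 0.1))

print(s_near, s_far, s_closed)
```

All three printed values should agree up to floating-point error, since only the offset $n - m$ enters the score.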

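2.2 General d-Dimensional Case

For a $d$-dimensional vector ($d$ even), RoPE pairs up adjacent dimensions $(2i, 2i+1)$ and applies a 2D rotation to each pair with a different frequency:

$$\Theta = \left\{\theta_i = 10000^{-2i/d},\; i = 0, 1, \dots, d/2 - 1\right\}$$

The rotary matrix $R^d_{\Theta,m}$ is a block-diagonal matrix:

$$R^d_{\Theta,m} = \begin{pmatrix}
\cos m\theta_0 & -\sin m\theta_0 & 0 & 0 & \cdots \\
\sin m\theta_0 & \cos m\theta_0 & 0 & 0 & \cdots \\
0 & 0 & \cos m\theta_1 & -\sin m\theta_1 & \cdots \\
0 & 0 & \sin m\theta_1 & \cos m\theta_1 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}$$

RoPE definition:

$$f_{\{q,k\}}(x_m, m) = R^d_{\Theta,m}\, W_{\{q,k\}}\, x_m$$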
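Because $R^d_{\Theta,m}$ is block-diagonal, it never has to be materialized: each adjacent pair can be rotated independently. The sketch below applies the definition to plain Python lists; the function name `rope` and the sample vectors are illustrative assumptions.

```python
import math

def rope(x, pos, base=10000.0):
    """Apply RoPE to a list of floats `x` (even length) at integer position `pos`.

    Adjacent pairs (x[2i], x[2i+1]) are rotated by the angle pos * theta_i,
    with theta_i = base^(-2i/d).
    """
    d = len(x)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [c * a - s * b, s * a + c * b]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.7, 2.0, -0.5, 0.1, 1.1, -0.9]
k = [1.0, 0.4, -0.6, 0.8, 0.2, -1.3, 0.5, 0.0]

# Same offset (7) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 12))
score_b = dot(rope(q, 1005), rope(k, 1012))

# Rotations preserve the vector norm.
norm_before = dot(q, q)
norm_after = dot(rope(q, 5), rope(q, 5))
```

Here `score_a` and `score_b` should match up to rounding, and the squared norm is unchanged by the rotation.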

2.3 Compact Form Using Complex Numbers

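Since each 2D rotation corresponds to multiplication by a complex exponential $e^{i\theta}$, RoPE can be expressed elegantly in complex form:

View $x \in \mathbb{R}^d$ as a complex vector $x_{\text{complex}} \in \mathbb{C}^{d/2}$ by pairing dimensions:

$$x_{\text{complex}} = (x_0 + i x_1,\; x_2 + i x_3,\; \dots,\; x_{d-2} + i x_{d-1})$$

Then RoPE is simply:

$$f_{\{q,k\}}(x_m, m)_j = (W_{\{q,k\}}\, x_m)^{(j)}_{\text{complex}} \cdot e^{i m \theta_j}$$

where $\theta_j = 10000^{-2j/d}$ and $j = 0, 1, \dots, d/2 - 1$.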
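To make the equivalence concrete, the sketch below implements both the real (rotation-matrix) form and the complex form on plain Python lists and compares them. The function names and the sample input are assumptions for illustration.

```python
import cmath
import math

def rope_real(x, pos, base=10000.0):
    """Reference: rotate adjacent pairs (x[2j], x[2j+1]) with a 2x2 rotation."""
    d = len(x)
    out = []
    for j in range(d // 2):
        angle = pos * base ** (-2.0 * j / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [c * x[2 * j] - s * x[2 * j + 1], s * x[2 * j] + c * x[2 * j + 1]]
    return out

def rope_complex(x, pos, base=10000.0):
    """Same rotation, written as multiplication by exp(i * pos * theta_j)."""
    d = len(x)
    out = []
    for j in range(d // 2):
        z = complex(x[2 * j], x[2 * j + 1])
        w = z * cmath.exp(1j * pos * base ** (-2.0 * j / d))
        out += [w.real, w.imag]
    return out

x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]
real_form = rope_real(x, pos=17)
complex_form = rope_complex(x, pos=17)
max_diff = max(abs(a - b) for a, b in zip(real_form, complex_form))
```

The two results should differ only by floating-point rounding, because $(a + ib)(\cos\phi + i\sin\phi)$ expands to exactly the 2×2 rotation of $(a, b)$.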

2.4 Frequency Design

The frequencies follow a geometric progression inspired by the sinusoidal position encoding:

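| $j$ | $\theta_j$ | Period (tokens per full rotation) |
|---|---|---|
| $0$ | $10000^{0} = 1$ | $2\pi \approx 6.28$ |
| $1$ | $10000^{-2/d}$ | $2\pi \cdot 10000^{2/d}$ |
| $d/2 - 1$ | $10000^{-(d-2)/d} \approx 10^{-4}$ | $\approx 2\pi \cdot 10000$ |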

Design rationale:

  • High frequencies (small $j$): Capture local, short-range position patterns
  • Low frequencies (large $j$): Capture global, long-range position patterns
  • The exponential spacing ensures coverage across all scales
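The table and rationale above can be reproduced with a few lines of Python. This is a small sketch; the head dimension `d = 128` is an assumed example value.

```python
import math

def rope_frequencies(d, base=10000.0):
    """theta_j = base^(-2j/d) for j = 0 .. d/2 - 1."""
    return [base ** (-2.0 * j / d) for j in range(d // 2)]

def rope_periods(d, base=10000.0):
    """Number of tokens needed for one full rotation of each dimension pair."""
    return [2.0 * math.pi / theta for theta in rope_frequencies(d, base)]

d = 128  # assumed head dimension, for illustration only
thetas = rope_frequencies(d)
periods = rope_periods(d)

for j in (0, 1, d // 2 - 1):
    print(f"j={j:3d}  theta={thetas[j]:.6g}  period={periods[j]:.1f} tokens")
```

Consecutive frequencies differ by the constant ratio $10000^{-2/d}$, which is what makes the progression geometric.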

3. Key Properties

3.1 Relative Position Encoding

Theorem: RoPE implicitly encodes relative position through rotation.

Proof sketch:

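$$\langle f_q(x_m, m),\, f_k(x_n, n) \rangle = (R^d_{\Theta,m}\, q_m)^\top (R^d_{\Theta,n}\, k_n) = q_m^\top (R^d_{\Theta,m})^\top R^d_{\Theta,n}\, k_n = q_m^\top R^d_{\Theta,n-m}\, k_n$$

The last equality uses the rotation matrix property: $(R^d_{\Theta,m})^\top R^d_{\Theta,n} = R^d_{\Theta,n-m}$.

The attention score depends on $q_m$, $k_n$, and the relative position $n - m$, never on the absolute positions independently.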

3.2 Long-Term Decay

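A crucial property: as the relative distance $|m - n|$ increases, an upper bound on the attention score decays. In the complex form, the score is the real part of a sum of rotating terms, so:

$$\left|\langle f_q(x_m, m),\, f_k(x_n, n) \rangle\right| = \left|\operatorname{Re} \sum_{j=0}^{d/2-1} q_m^{(j)}\, \overline{k_n^{(j)}}\, e^{i(m-n)\theta_j}\right| \le \left|\sum_{j=0}^{d/2-1} q_m^{(j)}\, \overline{k_n^{(j)}}\, e^{i(m-n)\theta_j}\right|$$

where $q_m^{(j)}$ and $k_n^{(j)}$ are the $j$-th complex components of $W_q x_m$ and $W_k x_n$. Su et al. bound this sum using Abel summation and show that the bound shrinks as $|m - n|$ grows.

The sum of complex exponentials with incommensurable frequencies tends to cancel out at large $|m - n|$, creating a natural locality bias: tokens further apart have a weaker maximum attention score, without any explicit distance penalty. The decay applies to the bound and to averages, not to every individual offset.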
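A quick numerical illustration of this decay, under a strong simplifying assumption: every entry of the query and key is 1, so each dimension pair contributes $2\cos((m-n)\theta_j)$ to the score. The head dimension `d = 128` and the window sizes are arbitrary choices for the sketch.

```python
import math

def relative_score(r, d=128, base=10000.0):
    """RoPE score at relative distance r when every q and k entry equals 1.

    Each pair (1, 1) has squared norm 2, so pair j contributes 2 * cos(r * theta_j).
    """
    return sum(2.0 * math.cos(r * base ** (-2.0 * j / d)) for j in range(d // 2))

def window_mean(start, width=50):
    """Average score over offsets start .. start + width - 1."""
    return sum(relative_score(r) for r in range(start, start + width)) / width

near = window_mean(1)      # offsets 1..50
far = window_mean(1000)    # offsets 1000..1049

print(f"score at r=0: {relative_score(0):.1f}")
print(f"mean score, near offsets: {near:.1f}")
print(f"mean score, far offsets:  {far:.1f}")
```

The individual scores oscillate, so the comparison uses window averages; the average over distant offsets comes out clearly lower than the average over nearby ones.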

3.3 Properties Summary

| Property | RoPE | Sinusoidal Absolute | Learned Absolute | ALiBi |
|---|---|---|---|---|
| Relative encoding | ✅ (implicit) | ❌ | ❌ | ✅ (additive) |
| No extra parameters | ✅ | ✅ | ❌ | ✅ |
| Theoretically unbounded length | ✅ | ✅ | ❌ | ✅ |
| Long-term decay | ✅ (theoretical) | ❌ | ❌ | ✅ (linear) |
| Dimension-specific frequencies | ✅ (per pair) | ✅ (per dim) | ❌ | ❌ |
| Extrapolation ability | Good (with tuning) | Poor | Cannot | Best |

4. Implementation

4.1 Efficient RoPE (Precomputed Frequencies)

    from typing import Optional

    import torch
    import torch.nn as nn

    class RotaryPositionEmbedding(nn.Module):
        """Efficient RoPE implementation with precomputed sin/cos tables.

        Args:
            dim: Head dimension (must be even)
            max_seq_len: Maximum sequence length for precomputation
            theta: Base frequency (default: 10000.0)
        """
        def __init__(self, dim: int, max_seq_len: int = 2048, theta: float = 10000.0):
            super().__init__()
            if dim % 2 != 0:
                raise ValueError("dim must be even")
            self.dim = dim
            self.max_seq_len = max_seq_len

            # Compute frequencies: theta_i = theta^(-2i/d)
            # Shape: (dim // 2,)
            freq_seq = torch.arange(0, dim, 2, dtype=torch.float32)
            inv_freq = 1.0 / (theta ** (freq_seq / dim))

            # Register as buffer (not a parameter, but moved with the model)
            self.register_buffer('inv_freq', inv_freq, persistent=False)

            # Precompute sin/cos for all positions
            self._compute_sin_cos(max_seq_len)

        def _compute_sin_cos(self, seq_len: int):
            """Precompute sin and cos tables for fast lookup."""
            # Position indices: (seq_len,)
            t = torch.arange(seq_len, dtype=torch.float32, device=self.inv_freq.device)
            # Angles position * theta_i: (seq_len, dim // 2)
            freqs = torch.outer(t, self.inv_freq)
            # Duplicate so both halves of the head share the angles: (seq_len, dim)
            emb = torch.cat([freqs, freqs], dim=-1)

            self.register_buffer('cos_cached', emb.cos(), persistent=False)
            self.register_buffer('sin_cached', emb.sin(), persistent=False)

        def forward(self, x: torch.Tensor, position_ids: Optional[torch.Tensor] = None):
            """Apply rotary position embedding to x.

            Args:
                x: Input tensor of shape (batch, seq_len, num_heads, head_dim)
                position_ids: Optional position indices, shape (batch, seq_len)

            Returns:
                Rotated tensor of same shape
            """
            seq_len = x.shape[1]

            if position_ids is not None:
                cos = self.cos_cached[position_ids]  # (batch, seq_len, dim)
                sin = self.sin_cached[position_ids]
            else:
                if seq_len > self.max_seq_len:
                    raise ValueError("seq_len exceeds the precomputed max_seq_len")
                cos = self.cos_cached[:seq_len].unsqueeze(0)  # (1, seq_len, dim)
                sin = self.sin_cached[:seq_len].unsqueeze(0)

            # Add head dimension: (batch, seq_len, 1, dim); match the input dtype
            cos = cos.unsqueeze(2).to(x.dtype)
            sin = sin.unsqueeze(2).to(x.dtype)

            # Apply rotation: x_rot = x * cos + rotate_half(x) * sin
            return (x * cos) + (self._rotate_half(x) * sin)

        @staticmethod
        def _rotate_half(x: torch.Tensor) -> torch.Tensor:
            """Swap the two halves of the last dimension and negate the second.

            This "half-split" layout pairs dimension i with dimension i + dim/2,
            so each pair (x_i, x_{i + dim/2}) maps to (-x_{i + dim/2}, x_i).
            It matches the adjacent-pair formulation in Section 2 up to a fixed
            permutation of dimensions.
            """
            # x shape: (..., dim)
            x1 = x[..., : x.shape[-1] // 2]
            x2 = x[..., x.shape[-1] // 2 :]
            return torch.cat([-x2, x1], dim=-1)

4.2 Applying RoPE in Self-Attention

    import torch.nn.functional as F  # needed for F.softmax below

    class RoPESelfAttention(nn.Module):
        """Self-attention with Rotary Position Embedding (no attention mask)."""

        def __init__(self, hidden_dim: int, num_heads: int, max_seq_len: int = 2048):
            super().__init__()
            assert hidden_dim % num_heads == 0, "hidden_dim must be divisible by num_heads"
            self.num_heads = num_heads
            self.head_dim = hidden_dim // num_heads

            self.q_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
            self.k_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
            self.v_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
            self.o_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)

            # RoPE only applied to Q and K
            self.rotary = RotaryPositionEmbedding(self.head_dim, max_seq_len)

        def forward(self, x: torch.Tensor):
            B, N, D = x.shape

            # Project to Q, K, V and reshape to multi-head
            q = self.q_proj(x).view(B, N, self.num_heads, self.head_dim)
            k = self.k_proj(x).view(B, N, self.num_heads, self.head_dim)
            v = self.v_proj(x).view(B, N, self.num_heads, self.head_dim)

            # Apply RoPE to Q and K (NOT to V)
            q = self.rotary(q)  # (B, N, H, d)
            k = self.rotary(k)  # (B, N, H, d)

            # Move heads in front of the sequence axis for batched matmul
            q = q.transpose(1, 2)  # (B, H, N, d)
            k = k.transpose(1, 2)
            v = v.transpose(1, 2)

            scale = self.head_dim ** -0.5
            attn_weights = torch.matmul(q, k.transpose(-2, -1)) * scale  # (B, H, N, N)
            attn_weights = F.softmax(attn_weights, dim=-1)

            out = torch.matmul(attn_weights, v)         # (B, H, N, d)
            out = out.transpose(1, 2).reshape(B, N, D)  # merge heads

            return self.o_proj(out)

4.3 HuggingFace-style Implementation

    # Minimal functional RoPE, in the style of the HuggingFace LLaMA/Mistral code.
    # This version assumes the layout (batch, seq_len, num_heads, head_dim),
    # which is why the head axis is added with unsqueeze(2).
    import torch

    def rotate_half(x):
        """Rotate half the hidden dims of the input."""
        x1 = x[..., : x.shape[-1] // 2]
        x2 = x[..., x.shape[-1] // 2 :]
        return torch.cat((-x2, x1), dim=-1)


    def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None):
        """Apply rotary position embedding to q and k tensors.

        Args:
            q, k: Query and key tensors, shape (batch, seq_len, num_heads, head_dim)
            cos, sin: Precomputed cosine/sine tables, shape (max_seq_len, head_dim)
            position_ids: Optional explicit position indices, shape (batch, seq_len)

        Returns:
            q_embed, k_embed: Rotated query and key tensors
        """
        if position_ids is not None:
            cos = cos[position_ids].unsqueeze(2)  # (batch, seq_len, 1, head_dim)
            sin = sin[position_ids].unsqueeze(2)
        else:
            cos = cos[:q.shape[1]].unsqueeze(0).unsqueeze(2)  # (1, seq_len, 1, head_dim)
            sin = sin[:q.shape[1]].unsqueeze(0).unsqueeze(2)

        q_embed = (q * cos) + (rotate_half(q) * sin)
        k_embed = (k * cos) + (rotate_half(k) * sin)
        return q_embed, k_embed

4.4 Computational Cost

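| Operation | Complexity | Notes |
|---|---|---|
| Frequency precomputation | $O(d \cdot L_{\max})$ | One-time, negligible |
| RoPE application | $O(B N H d)$ | Simple element-wise operations |
| Attention computation (for comparison) | $O(B H N^2 d)$ | RoPE's relative overhead shrinks as $1/N$, typically <1% |

RoPE is extremely efficient: per token it needs two element-wise multiplications, one addition, and one sign-flipped concatenation (`rotate_half`), which is negligible compared to the attention computation itself.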
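A back-of-the-envelope estimate makes the overhead claim concrete. The sketch below counts only the element-wise RoPE work and the two attention matmuls; it ignores projections, softmax, and memory traffic, and the counting conventions (2 FLOPs per multiply-add) are assumptions of this estimate.

```python
def rope_flops(batch, seq_len, hidden_dim):
    """Element-wise RoPE cost: 2 multiplies + 1 add per element, for both Q and K."""
    return 2 * 3 * batch * seq_len * hidden_dim

def attention_matmul_flops(batch, seq_len, hidden_dim):
    """Q @ K^T and weights @ V, summed over heads: 2 * B * N^2 * D FLOPs each."""
    return 2 * 2 * batch * seq_len * seq_len * hidden_dim

def rope_overhead(batch, seq_len, hidden_dim):
    """Ratio of RoPE FLOPs to attention-matmul FLOPs (equals 1.5 / seq_len)."""
    return rope_flops(batch, seq_len, hidden_dim) / attention_matmul_flops(batch, seq_len, hidden_dim)

for n in (256, 2048, 8192):
    print(f"seq_len={n:5d}  overhead={rope_overhead(1, n, 4096):.4%}")
```

Under these assumptions the ratio simplifies to $1.5 / N$, so it falls below 1% once the sequence is longer than about 150 tokens.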


5. Comparison with Other Position Encodings

5.1 Sinusoidal (Vaswani et al., 2017)

Original Transformer:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

| Aspect | Sinusoidal | RoPE |
|---|---|---|
| Where applied | Added to token embeddings (before attention) | Rotates Q and K (within attention) |
| Relative position | Not directly encoded | Encoded via rotation difference |
| Extrapolation | Poor (in practice) | Good (with tuning) |
| Trainable | No | No |
| Theoretical elegance | Moderate | High (group-theoretic foundation) |

5.2 Learned Absolute

Positions $0$ through $L_{\max} - 1$ each have a learnable embedding vector.

| Aspect | Learned Absolute | RoPE |
|---|---|---|
| Parameters | $L_{\max} \times d$ | 0 |
| Max length | Fixed (hard limit) | Theoretically unbounded |
| Extrapolation | Cannot (no embeddings for positions $\ge L_{\max}$) | Good (with tuning) |
| Inductive bias | None (must learn everything) | Rotation-based relative encoding |
| Training cost | Minimal | Minimal |

5.3 ALiBi (Press et al., 2022)

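Adds a static, non-learned linear bias to attention scores:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + m \cdot B\right) V$$

where $B_{ij} = -|i - j|$ and $m$ is a positive head-specific slope, so the penalty grows linearly with distance.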

| Aspect | ALiBi | RoPE |
|---|---|---|
| Mechanism | Add bias to attention scores | Rotate Q and K vectors |
| Bias form | Linear decay | Sinusoidal decay (richer pattern) |
| Extrapolation | Excellent (by design) | Good (with tuning) |
| Expressiveness | Low (single slope per head) | High (dimension-specific frequencies) |
| Adoption | BLOOM, early models | LLaMA, GPT-NeoX, PaLM, most modern LLMs |
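For comparison with RoPE's rotations, the ALiBi bias can be sketched in a few lines. The slope schedule below (a geometric sequence starting at $2^{-8/H}$) follows the description in Press et al. for a power-of-two number of heads $H$; the function names are illustrative, and the symmetric $-|i - j|$ form matches the formula above rather than a causal-only implementation.

```python
def alibi_slopes(num_heads):
    """Head-specific slopes 2^(-8/H), 2^(-16/H), ..., 2^(-8) for H heads."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias matrix m * B with B[i][j] = -|i - j|, added to the attention logits."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])

for row in bias:
    print(row)
```

Unlike RoPE, nothing here touches Q or K: the positional signal is a fixed penalty that depends only on the distance and the head.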

5.4 Decision Guide

| Scenario | Recommended Encoding | Reason |
|---|---|---|
| LLM training (2024+) | RoPE | Industry standard, best overall performance |
| Extreme extrapolation (>10×) | ALiBi or NTK-RoPE | ALiBi excels at length generalization |
| Maximum simplicity | ALiBi | No learnable or computed embeddings |
| Legacy compatibility | Learned or Sinusoidal | BERT, GPT-2 style |
| Theory-oriented research | RoPE | Rich mathematical structure |

6. RoPE in Modern LLMs

6.1 Adoption Timeline

2021: RoPE proposed (Su et al.)

2022: GPT-NeoX-20B (EleutherAI), an early large open-source LLM with RoPE, following GPT-J-6B (2021)
│ PaLM (Google) — adopts RoPE

2023: LLaMA (Meta) — adopts RoPE, becomes most influential open-source LLM
│ LLaMA 2 (Meta) — retains RoPE
│ Qwen (Alibaba) — adopts RoPE
│ Mistral — adopts RoPE with sliding window
│ Yi (01.AI) — adopts RoPE

2024: LLaMA 3 (Meta) — RoPE with increased theta (500,000)
│ DeepSeek-V2/V3 — RoPE with YaRN extension
│ Qwen 2 — RoPE
│ Phi-3 (Microsoft) — RoPE

6.2 Key Design Choices

| Model | $\theta$ (base frequency) | Max Length | RoPE Variant |
|---|---|---|---|
| LLaMA | 10000 | 2048 | Standard |
| LLaMA 2 | 10000 | 4096 | Standard |
| LLaMA 3 | 500000 | 8192 | High-theta |
| GPT-NeoX | 10000 | 2048 | Standard (partial rotation) |
| Mistral | 10000 | 8192 (sliding window) | Standard |
| Qwen 2 | 1000000 | 32768 | High-theta |
| DeepSeek-V2 | 10000 | 128K | YaRN |
| Code LLaMA | 1000000 | 16384 | High-theta |
6.3 Partial RoPE Configuration

LLaMA applies RoPE to the full head dimension. Some other models, such as GPT-NeoX and GPT-J, apply RoPE only to a fraction of the head dimension (partial RoPE), leaving the remaining dimensions unrotated and therefore position-independent:

    # Partial RoPE: rotate only the first `partial_rope_dim` dimensions of each head.
    # cos and sin must be precomputed for width `partial_rope_dim`, not the full head_dim.
    def apply_partial_rope(q, k, cos, sin, partial_rope_dim=None):
        if partial_rope_dim is not None:
            # Split each head into a rotated part and a pass-through part
            q_rot, q_pass = q[..., :partial_rope_dim], q[..., partial_rope_dim:]
            k_rot, k_pass = k[..., :partial_rope_dim], k[..., partial_rope_dim:]
            q_rot, k_rot = apply_rotary_pos_emb(q_rot, k_rot, cos, sin)
            q = torch.cat([q_rot, q_pass], dim=-1)
            k = torch.cat([k_rot, k_pass], dim=-1)
        else:
            # Full RoPE (the LLaMA configuration)
            q, k = apply_rotary_pos_emb(q, k, cos, sin)
        return q, k

7. Extensions and Variants

7.1 NTK-Aware Scaled RoPE

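Problem: Standard RoPE with $\theta = 10000$ performs poorly when extrapolating to sequences longer than the training length: positions beyond the training range produce rotation angles, especially in the low-frequency dimensions, that the model never saw during training.

Solution: Increase the base $\theta$ so that the lowest frequencies are slowed down by the extension factor while the highest frequencies stay almost unchanged:

$$\theta_{\text{NTK}} = \theta \cdot \alpha^{\frac{d}{d-2}}$$

where $\alpha$ is the desired extrapolation factor (e.g., $\alpha = 2$ for 2× context extension).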

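    def ntk_aware_scaling(inv_freq, scale_factor: float):
        """Apply NTK-aware scaling to inverse frequencies.

        Equivalent to replacing the base by base * scale_factor^(dim / (dim - 2)):
        the highest frequency (j = 0) is unchanged and the lowest frequency
        (j = dim/2 - 1) is divided by scale_factor. Requires dim > 2.
        """
        dim = inv_freq.shape[0] * 2
        # Pair index j = 0 .. dim/2 - 1
        j = torch.arange(inv_freq.shape[0], dtype=inv_freq.dtype, device=inv_freq.device)
        # (base * s^(dim/(dim-2)))^(-2j/dim) = base^(-2j/dim) * s^(-2j/(dim-2))
        return inv_freq * scale_factor ** (-2.0 * j / (dim - 2))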

Key insight: High frequencies (local patterns) should be preserved; low frequencies (global patterns) should be extended. NTK-aware scaling strikes this balance by interpolating in frequency space.
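The effect of the base change on individual frequencies can be verified without any tensor library. The sketch below recomputes the frequencies from the scaled base $\theta \cdot \alpha^{d/(d-2)}$; `d = 128` and `alpha = 4` are assumed example values.

```python
import math

def inv_freqs(d, base):
    """theta_j = base^(-2j/d) for j = 0 .. d/2 - 1."""
    return [base ** (-2.0 * j / d) for j in range(d // 2)]

def ntk_base(base, alpha, d):
    """NTK-aware base: base * alpha^(d / (d - 2))."""
    return base * alpha ** (d / (d - 2))

d, base, alpha = 128, 10000.0, 4.0
original = inv_freqs(d, base)
scaled = inv_freqs(d, ntk_base(base, alpha, d))

# Per-dimension slowdown factor: 1 for the highest frequency, alpha for the lowest.
slowdown = [o / s for o, s in zip(original, scaled)]
print(f"highest frequency slowdown: {slowdown[0]:.3f}")
print(f"lowest frequency slowdown:  {slowdown[-1]:.3f}")
```

The slowdown grows smoothly from 1 at $j = 0$ to $\alpha$ at $j = d/2 - 1$, which is the "preserve high, extend low" behavior described above.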

7.2 YaRN (Yet another RoPE extensioN)

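YaRN (Peng et al., 2023) combines frequency-dependent interpolation ("NTK-by-parts") with a temperature on the attention logits:

$$\text{Attention} = \text{softmax}\!\left(\frac{QK^\top}{t\,\sqrt{d_k}}\right) V$$

where the temperature $t$ depends on the context extension factor $s = L_{\text{target}} / L_{\text{train}}$; the paper recommends $\sqrt{1/t} = 0.1 \ln s + 1$ for LLaMA-family models.

YaRN components:

  1. Frequency-dependent interpolation: High-frequency dimensions are left unchanged, low-frequency dimensions are interpolated by $1/s$, and a linear ramp blends the two in between
  2. Length scaling: For the low-frequency dimensions, dividing the frequency by $s$ is equivalent to directly scaling position indices
  3. Attention temperature: Tune the softmax temperature for long contexts

    import math
    import torch

    def yarn_position_interpolation(inv_freq: torch.Tensor,
                                    original_max_len: int, extended_max_len: int,
                                    beta_fast: float = 32.0, beta_slow: float = 1.0):
        """YaRN-style scaling: NTK-by-parts frequencies plus an attention scale.

        beta_fast / beta_slow are rotation-count thresholds (32 and 1 are the
        values the YaRN paper reports for LLaMA models).
        """
        scale_factor = extended_max_len / original_max_len

        # Part 1: full rotations each dimension pair completes over the original context
        rotations = original_max_len * inv_freq / (2 * math.pi)

        # Part 2: ramp is 1 for high frequencies (keep as-is) and 0 for low
        # frequencies (interpolate, i.e. divide positions by scale_factor)
        ramp = ((rotations - beta_slow) / (beta_fast - beta_slow)).clamp(0.0, 1.0)
        scaled_inv_freq = ramp * inv_freq + (1.0 - ramp) * inv_freq / scale_factor

        # Part 3: temperature, applied by multiplying q and k (or the cos/sin
        # tables) by sqrt(1/t) = 0.1 * ln(s) + 1
        attention_scale = 0.1 * math.log(scale_factor) + 1.0

        return scaled_inv_freq, attention_scale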

7.3 Linear (PI) vs Dynamic NTK

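| Method | Mechanism | Quality | Compute Cost | Example |
|---|---|---|---|---|
| Position Interpolation (PI) | Linearly scale all positions | Moderate (blurring) | Zero | Early LLaMA extensions |
| NTK-Aware | Scale the base $\theta$ so low frequencies stretch most | Good | Zero | Code LLaMA (increased base) |
| Dynamic NTK | Adapt $\theta$ to the current sequence length | Very Good | Minimal | Modern deployments |
| YaRN | Frequency-dependent interpolation + temperature tuning | Best | Minimal | DeepSeek, long-context models |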
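The first and third rows of the table can be sketched side by side. This is a simplified illustration: the function names are hypothetical, and the dynamic rule used here ($\alpha = \max(1, L_{\text{current}} / L_{\text{train}})$) is the simplest possible choice, an assumption of this sketch; library implementations use somewhat different formulas for $\alpha$.

```python
def pi_positions(positions, train_len, target_len):
    """Position Interpolation: squeeze positions back into the trained range."""
    scale = max(1.0, target_len / train_len)
    return [p / scale for p in positions]

def dynamic_ntk_base(seq_len, train_len, d, base=10000.0):
    """Dynamic NTK (simplified): grow the base only once seq_len exceeds train_len."""
    alpha = max(1.0, seq_len / train_len)
    return base * alpha ** (d / (d - 2))

# PI: an 8192-token sequence mapped into a 2048-token trained range.
squeezed = pi_positions([0, 4096, 8191], train_len=2048, target_len=8192)

# Dynamic NTK: base is untouched for short inputs and grows for long ones.
short_base = dynamic_ntk_base(seq_len=1024, train_len=2048, d=128)
long_base = dynamic_ntk_base(seq_len=8192, train_len=2048, d=128)
```

PI keeps every position inside the trained range at the cost of compressing neighboring positions ("blurring"), while the dynamic rule leaves short sequences exactly as trained.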

7.4 2D RoPE (Vision)

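RoPE generalizes naturally to 2D for vision Transformers. For a patch at position $(h, w)$, two independent sets of frequencies encode row and column positions separately:

$$f_{\{q,k\}}(x_{(h,w)}, (h, w)) = R_{\Theta_h, h}\, R_{\Theta_w, w}\, W_{\{q,k\}}\, x_{(h,w)}$$

A common choice is to let the two rotations act on disjoint halves of the head dimension, so that they commute. This encodes 2D relative position naturally: the dot product depends on $(h_q - h_k,\; w_q - w_k)$.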
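One way to realize this, sketched below under the assumption that the head dimension is split into two halves (the first rotated by the row index, the second by the column index). The helper names and test vectors are illustrative; actual vision models differ in how they split and order dimensions.

```python
import math

def rope_1d(x, pos, base=10000.0):
    """Rotate adjacent pairs of `x` by pos * theta_i (standard 1D RoPE)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [c * x[2 * i] - s * x[2 * i + 1], s * x[2 * i] + c * x[2 * i + 1]]
    return out

def rope_2d(x, h, w, base=10000.0):
    """2D RoPE sketch: first half encodes the row h, second half the column w."""
    assert len(x) % 4 == 0, "need an even number of dimensions per axis"
    half = len(x) // 2
    return rope_1d(x[:half], h, base) + rope_1d(x[half:], w, base)

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.2, -0.7, 1.3, 0.4, -1.1, 0.9, 0.5, -0.3]
k = [1.0, 0.1, -0.4, 0.6, 0.8, -0.2, 0.3, 1.2]

# Same relative offset (dh, dw) = (3, -2) at two different absolute locations.
score_a = dot(rope_2d(q, 2, 3), rope_2d(k, 5, 1))
score_b = dot(rope_2d(q, 12, 23), rope_2d(k, 15, 21))
```

Both scores should agree up to rounding, since only the row and column offsets enter the dot product.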

Applications: [[Vision Transformer (ViT)|ViT]] variants, [[DiT]], video Transformers.


8. Theoretical Analysis

8.1 Group-Theoretic Interpretation

RoPE has a clean interpretation via representation theory of the rotation group $SO(2)$:

  • Each $(2i, 2i+1)$ pair is a 2D irreducible representation of $SO(2)$
  • Different frequencies $\theta_i$ correspond to different “rotation speeds”
  • The block-diagonal form is the canonical decomposition of rotation in high dimensions

This makes RoPE mathematically more principled than heuristic positional encodings.

8.2 Why Not Rotate V?

RoPE is typically applied only to Q and K, not to V:

Reason: The attention output is a weighted sum of V vectors:

$$\text{Output}_m = \sum_n A_{mn} V_n$$

If V were also rotated by position $n$, the output would encode position information that might interfere with subsequent layers. Keeping V unrotated allows position information to be “consumed” by the attention pattern without accumulating in the residual stream.

8.3 Partial RoPE Analysis

Why some models use partial RoPE (rotating only part of the head dimensions):

  1. Position-independent channels: Unrotated dimensions carry no positional rotation, so heads can match on content regardless of distance
  2. Numerical stability: May reduce sensitivity at very long contexts, since part of each head is unaffected by position
  3. Empirical results: Some experiments report that partial rotation performs comparably to or slightly better than full rotation, though this is not a universal finding

The trade-off: the rotated fraction $\rho$ is typically set between 0.25 and 1.0, with $\rho = 1.0$ being full RoPE (as in LLaMA) and smaller values trading relative encoding strength for position-independent capacity.


9. Practical Guidelines

9.1 Choosing θ (Base Frequency)

| $\theta$ | Max Effective Context | Use Case |
|---|---|---|
| 10000 (default) | 2K–4K tokens | Standard training |
| 500000 (LLaMA 3) | 8K tokens | Longer pre-training context |
| 1000000 (Code LLaMA) | 16K–32K tokens | Code, long documents |
| 10000000+ | 128K+ tokens | Extreme long context |

Rule of thumb: θ=10000 works well for training length ≤ 4096. For longer contexts, either increase θ or use NTK-aware scaling at inference.
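One rough way to reason about the choice of $\theta$ is to look at the longest wavelength among the rotation frequencies, $2\pi \cdot \theta^{(d-2)/d}$: a larger base stretches it. Treating this wavelength as a proxy for usable context is a heuristic assumption of this sketch, not a rule stated by any of the models above, and `d = 128` is an example head dimension.

```python
import math

def longest_wavelength(base, d):
    """Tokens for one full rotation of the slowest pair: 2*pi * base^((d-2)/d)."""
    return 2.0 * math.pi * base ** ((d - 2) / d)

d = 128  # assumed head dimension
for base in (10_000.0, 500_000.0, 1_000_000.0):
    print(f"base={base:>11,.0f}  longest wavelength = {longest_wavelength(base, d):,.0f} tokens")
```

The wavelength grows almost linearly with the base, which is consistent with larger $\theta$ values being paired with longer training contexts in the table.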

9.2 Sequence Length Extension Checklist

  • [ ] Set appropriate θ for target context length
  • [ ] Apply NTK-aware scaling if extending beyond training length
  • [ ] Tune softmax temperature for long sequences (YaRN-style)
  • [ ] Verify perplexity doesn’t degrade at target length
  • [ ] Test on long-context benchmarks (Passkey Retrieval, LongBench)
  • [ ] Consider sliding window attention + RoPE (Mistral approach)

9.3 Common Pitfalls

| Pitfall | Symptom | Fix |
|---|---|---|
| $\theta$ too small | Poor performance on long sequences | Increase $\theta$ or use NTK |
| $\theta$ too large | Loss of local positional resolution | Decrease $\theta$ or use dynamic NTK |
| RoPE applied to V | Position leakage, degraded quality | Never rotate V |
| RoPE cached with wrong dtype | Numerical drift in long sequences | Use float32 for sin/cos tables |

10. Core Formula Cards

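[!QUOTE] RoPE Definition (Matrix Form)

$$f_{\{q,k\}}(x_m, m) = R^d_{\Theta,m}\, W_{\{q,k\}}\, x_m$$

[!QUOTE] Frequency Schedule

$$\Theta = \left\{\theta_i = 10000^{-2i/d} \;\middle|\; i = 0, 1, \dots, \tfrac{d}{2} - 1\right\}$$

[!QUOTE] Relative Position Property

$$\langle f_q(x_m, m),\, f_k(x_n, n) \rangle = g(x_m, x_n, m - n)$$

[!QUOTE] Efficient Computation (Complex Form)

$$\text{RoPE}(x, m)^{(j)} = (x_{2j} + i\,x_{2j+1})\, e^{i m \theta_j}$$

[!QUOTE] Efficient Computation (Real Form)

$$\begin{pmatrix} x'_{2j} \\ x'_{2j+1} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_j) & -\sin(m\theta_j) \\ \sin(m\theta_j) & \cos(m\theta_j) \end{pmatrix} \begin{pmatrix} x_{2j} \\ x_{2j+1} \end{pmatrix}$$

[!QUOTE] Attention Score with RoPE

$$a_{mn} = \frac{1}{\sqrt{d_k}}\, q_m^\top R^d_{\Theta,n-m}\, k_n$$

[!QUOTE] NTK-Aware Scaling

$$\theta_{\text{NTK}} = \theta \cdot \alpha^{\frac{d}{d-2}}, \qquad \alpha = \frac{L_{\text{target}}}{L_{\text{train}}}$$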

11. Summary

| Aspect | Description |
|---|---|
| Core idea | Rotate Q and K by position-dependent angles so their dot product encodes relative position |
| Key mechanism | Block-diagonal 2D rotation matrices applied per dimension pair |
| Mathematical foundation | Rotation group $SO(2)$ representation theory |
| Why it works | $(R_m Q)^\top (R_n K) = Q^\top R_{n-m} K$, a purely relative encoding |
| Parameters | Zero additional learnable parameters |
| Computational cost | Negligible (<1% of attention cost) |
| Adoption | Dominant position encoding: LLaMA, GPT-NeoX, Mistral, Qwen, PaLM, DeepSeek, Phi-3 |
| Extensions | NTK-aware scaling, YaRN, Dynamic NTK, 2D RoPE (vision), Partial RoPE |
| Key hyperparameter | $\theta$ (base frequency), which controls long-range capacity |
| Comparison | Outperforms sinusoidal and learned absolute; extrapolates well with scaling methods, though ALiBi extrapolates better out of the box; richer than ALiBi |

  • [[Vision Transformer (ViT)]]
  • [[DiT]]
  • [[Transformer]]
  • [[U-Net]]
  • [[ResNet]]
  • [[Diffusion Model]]

Dataview Query

LIST
FROM #rope OR #position_encoding OR #rotary_position_embedding
SORT file.ctime DESC

References

  • Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
  • Paper: RoPE: Rotary Position Embedding (Su et al., 2023 — extended analysis)
  • Paper: LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
  • Paper: YaRN: Efficient Context Window Extension of Large Language Models (Peng et al., 2023)
  • Paper: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (Press et al., 2022 — ALiBi)
  • Paper: GPT-NeoX-20B: An Open-Source Autoregressive Language Model (Black et al., 2022)
  • Paper: Code LLaMA: Open Foundation Models for Code (Rozière et al., 2023)
  • Blog: Rotary Embeddings: A Relative Revolution — EleutherAI Blog
  • Blog: Extending Context Window of LLMs with Position Interpolation — KAIST AI Blog
  • Blog: Applied RoPE Scaling — HuggingFace Blog
  • Code: https://github.com/huggingface/transformers (LlamaRotaryEmbedding)
  • Code: https://github.com/eleutherai/gpt-neox (original GPT-NeoX RoPE)
  • Code: https://github.com/jquesnelle/yarn (YaRN implementation)