⚡️ Speed up method Dinov2WithRegistersSelfAttention.transpose_for_scores by 17% in PR #1250 (feature/inference-v1-models) #1272

Conversation

codeflash-ai[bot]
Contributor

@codeflash-ai codeflash-ai bot commented May 14, 2025

⚡️ This pull request contains optimizations for PR #1250

If you approve this dependent PR, these changes will be merged into the original PR branch feature/inference-v1-models.

This PR will be automatically closed if the original PR is merged.


📄 17% (0.17x) speedup for Dinov2WithRegistersSelfAttention.transpose_for_scores in inference/v1/models/rfdetr/dinov2_with_windowed_attn.py

⏱️ Runtime: 416 microseconds → 357 microseconds (best of 80 runs)

📝 Explanation and details

Here is an optimized version of your program. The original code is already efficient but can benefit from minor optimizations in PyTorch tensor reshaping (by using reshape and transpose instead of view and permute, which have better performance when shapes are guaranteed compatible), reduction of attribute lookups in the hot path, and reduced overhead in the constructor.

No change in function name, signature, or output—only improved runtime. All original comments are preserved.

Key changes for speed:

  • Stash repeated configuration lookups as local variables in __init__.
  • Use integer division (//) in attention head size computation.
  • In transpose_for_scores, switched from .view() and .permute() to .reshape() and .transpose(); since the input and target shapes are compatible, these calls perform the same rearrangement with less overhead.
  • Avoid repeated member lookups in the hot path.

All outputs remain identical to the original.
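
For illustration, here is a minimal sketch of the before/after reshaping logic described above. The class name, attribute names, and driver code are simplified stand-ins (they are not the exact merged module), and the optimized variant follows the description rather than the final PR code; only the transpose_for_scores patterns are shown.

import torch
from torch import nn


class SelfAttentionReshapeSketch(nn.Module):
    """Simplified stand-in showing only the reshaping logic."""

    def __init__(self, hidden_size: int, num_attention_heads: int):
        super().__init__()
        if hidden_size % num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        self.num_attention_heads = num_attention_heads
        # Integer division keeps the head size an int, as in the listed changes.
        self.attention_head_size = hidden_size // num_attention_heads

    def transpose_for_scores_original(self, x: torch.Tensor) -> torch.Tensor:
        # Original pattern: view into (batch, seq, heads, head_size), then permute.
        new_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        return x.view(new_shape).permute(0, 2, 1, 3)

    def transpose_for_scores_optimized(self, x: torch.Tensor) -> torch.Tensor:
        # Optimized pattern from the description: reshape, then swap the
        # sequence and head dimensions with a single transpose.
        new_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        return x.reshape(new_shape).transpose(1, 2)


# Both variants yield the same (batch, num_heads, seq_len, head_size) layout.
module = SelfAttentionReshapeSketch(hidden_size=8, num_attention_heads=2)
hidden_states = torch.randn(2, 4, 8)
assert torch.equal(
    module.transpose_for_scores_original(hidden_states),
    module.transpose_for_scores_optimized(hidden_states),
)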

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 62 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage
🌀 Generated Regression Tests Details
import pytest  # used for our unit tests
import torch
from inference.v1.models.rfdetr.dinov2_with_windowed_attn import \
    Dinov2WithRegistersSelfAttention
from torch import nn

# function to test
# ------------------------------------------------------------------------
# RF-DETR
# Copyright (c) 2025 Roboflow. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
# ------------------------------------------------------------------------
# Modified from HuggingFace Dinov2 (https://github.com/huggingface/transformers)
# Copyright 2024 Meta Inc. and the HuggingFace Inc. team. All rights reserved.
# ------------------------------------------------------------------------


class DummyConfig:
    """Minimal config class for testing."""
    def __init__(
        self,
        hidden_size,
        num_attention_heads,
        qkv_bias=False,
        attention_probs_dropout_prob=0.1,
    ):
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        self.qkv_bias = qkv_bias
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
from inference.v1.models.rfdetr.dinov2_with_windowed_attn import \
    Dinov2WithRegistersSelfAttention

# unit tests

# ---------------------------#
# Basic Test Cases
# ---------------------------#

def make_attention_module(hidden_size, num_heads):
    """Helper to create attention module with given config."""
    return Dinov2WithRegistersSelfAttention(
        DummyConfig(hidden_size=hidden_size, num_attention_heads=num_heads)
    )

def test_basic_transpose_shape_and_values():
    """Test that output shape and values are correct for a simple case."""
    batch, seq, hidden, heads = 2, 4, 8, 2
    module = make_attention_module(hidden, heads)
    # Create tensor with increasing values for easy tracking
    x = torch.arange(batch * seq * hidden).float().reshape(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    # Output should have shape (batch, num_heads, seq_len, head_size)
    assert out.shape == (batch, heads, seq, hidden // heads)
    # Check that values are correctly rearranged
    # For head 0, should be the first half of last dim; for head 1, second half
    for b in range(batch):
        for s in range(seq):
            for h in range(heads):
                expected = x[b, s, h*(hidden//heads):(h+1)*(hidden//heads)]
                actual = out[b, h, s, :]
                assert torch.equal(actual, expected)

def test_single_batch_single_head():
    """Test with batch size 1 and single attention head."""
    batch, seq, hidden, heads = 1, 3, 6, 1
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch, heads, seq, hidden // heads)
    # With a single head, the head axis simply wraps the original tensor
    assert torch.equal(out[:, 0, :, :], x)

def test_single_sequence_length():
    """Test with sequence length 1."""
    batch, seq, hidden, heads = 2, 1, 4, 2
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch, heads, seq, hidden // heads)

def test_hidden_size_equals_num_heads():
    """Test where hidden size equals number of heads (head size = 1)."""
    batch, seq, hidden, heads = 2, 5, 4, 4
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch, heads, seq, 1)
    # Each value in hidden dim should map to a different head
    for h in range(heads):
        assert torch.equal(out[:, h, :, 0], x[:, :, h])

# ---------------------------#
# Edge Test Cases
# ---------------------------#

def test_hidden_size_not_divisible_by_heads_raises():
    """Should raise ValueError if hidden_size not divisible by num_attention_heads."""
    with pytest.raises(ValueError):
        make_attention_module(hidden_size=10, num_heads=3)

def test_zero_batch_size():
    """Test with batch size 0 (should return empty tensor with correct shape)."""
    batch, seq, hidden, heads = 0, 3, 6, 2
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch, heads, seq, hidden // heads)

def test_zero_sequence_length():
    """Test with sequence length 0 (should return empty tensor with correct shape)."""
    batch, seq, hidden, heads = 2, 0, 8, 2
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch, heads, seq, hidden // heads)

def test_zero_hidden_size():
    """Test with hidden size 0 (should return empty tensor with correct shape)."""
    batch, seq, hidden, heads = 2, 3, 0, 1
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output

def test_large_head_size():
    """Test with large head size (hidden_size much larger than num_heads)."""
    batch, seq, hidden, heads = 1, 2, 128, 2
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch, heads, seq, hidden // heads)

def test_non_contiguous_input():
    """Test with non-contiguous input tensor (should not raise error)."""
    batch, seq, hidden, heads = 2, 4, 8, 2
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden * 2)[:, :, ::2]  # Slicing makes it non-contiguous
    # Should still work
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output

def test_dtype_preservation():
    """Test that dtype is preserved in output."""
    batch, seq, hidden, heads = 2, 3, 6, 2
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden, dtype=torch.float64)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    assert out.dtype == torch.float64

def test_device_preservation():
    """Test that device is preserved in output."""
    if torch.cuda.is_available():
        device = torch.device("cuda")
        batch, seq, hidden, heads = 2, 3, 6, 2
        module = make_attention_module(hidden, heads).to(device)
        x = torch.randn(batch, seq, hidden, device=device)
        codeflash_output = module.transpose_for_scores(x); out = codeflash_output
        assert out.device.type == "cuda"

def test_gradients_flow():
    """Test that gradients flow through the operation."""
    batch, seq, hidden, heads = 2, 3, 6, 2
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden, requires_grad=True)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    out.sum().backward()
    assert x.grad is not None
    assert x.grad.shape == x.shape

# ---------------------------#
# Large Scale Test Cases
# ---------------------------#

def test_large_batch_and_sequence():
    """Test with large batch and sequence length, but under 100MB."""
    batch, seq, hidden, heads = 32, 64, 128, 8  # 32*64*128*4 = 1MB
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch, heads, seq, hidden // heads)

def test_large_hidden_size():
    """Test with large hidden size and moderate batch/seq."""
    batch, seq, hidden, heads = 4, 8, 1024, 16  # 4*8*1024*4 = 128KB
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch, heads, seq, hidden // heads)

def test_maximum_reasonable_tensor():
    """Test with the largest tensor allowed under 100MB."""
    # 100MB / 4 = 25 million floats
    # Let's pick batch=8, seq=128, hidden=256 (8*128*256*4 = 1MB)
    batch, seq, hidden, heads = 16, 128, 512, 32  # 16*128*512*4 = 4MB
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch, heads, seq, hidden // heads)

def test_large_number_of_heads():
    """Test with a large number of heads."""
    batch, seq, hidden, heads = 2, 10, 320, 32
    module = make_attention_module(hidden, heads)
    x = torch.randn(batch, seq, hidden)
    codeflash_output = module.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch, heads, seq, hidden // heads)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import pytest  # used for our unit tests
import torch
from inference.v1.models.rfdetr.dinov2_with_windowed_attn import \
    Dinov2WithRegistersSelfAttention
from torch import nn

# function to test
# ------------------------------------------------------------------------
# RF-DETR
# Copyright (c) 2025 Roboflow. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 [see LICENSE for details]
# ------------------------------------------------------------------------
# Modified from HuggingFace Dinov2 (https://github.com/huggingface/transformers)
# Copyright 2024 Meta Inc. and the HuggingFace Inc. team. All rights reserved.
# ------------------------------------------------------------------------


# Minimal config class for testing
class WindowedDinov2WithRegistersConfig:
    def __init__(
        self,
        hidden_size=8,
        num_attention_heads=2,
        qkv_bias=True,
        attention_probs_dropout_prob=0.1,
        embedding_size=None,
    ):
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        self.qkv_bias = qkv_bias
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.embedding_size = embedding_size
from inference.v1.models.rfdetr.dinov2_with_windowed_attn import \
    Dinov2WithRegistersSelfAttention

# unit tests

# -------------------- BASIC TEST CASES --------------------

def test_transpose_for_scores_basic_2d():
    # Basic test with batch_size=1, seq_len=2, hidden_size=4, num_heads=2
    config = WindowedDinov2WithRegistersConfig(hidden_size=4, num_attention_heads=2)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.arange(1*2*4).float().reshape(1,2,4)
    # Expected shape: (1, 2, 2, 2) after view and permute
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (1, 2, 2, 2)

def test_transpose_for_scores_basic_3d():
    # Test with batch_size=2, seq_len=3, hidden_size=6, num_heads=3
    config = WindowedDinov2WithRegistersConfig(hidden_size=6, num_attention_heads=3)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.arange(2*3*6).float().reshape(2,3,6)
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (2, 3, 3, 2)

def test_transpose_for_scores_basic_values():
    # Check that the values are correctly split and permuted
    config = WindowedDinov2WithRegistersConfig(hidden_size=4, num_attention_heads=2)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.tensor([[[1.0,2.0,3.0,4.0],[5.0,6.0,7.0,8.0]]])
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    expected = torch.tensor([[[[1.0, 2.0], [5.0, 6.0]],
                              [[3.0, 4.0], [7.0, 8.0]]]])
    assert torch.equal(out, expected)

# -------------------- EDGE TEST CASES --------------------

def test_transpose_for_scores_single_token():
    # Single token sequence
    config = WindowedDinov2WithRegistersConfig(hidden_size=6, num_attention_heads=3)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.arange(1*1*6).float().reshape(1,1,6)
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (1, 3, 1, 2)

def test_transpose_for_scores_single_head():
    # Only one attention head
    config = WindowedDinov2WithRegistersConfig(hidden_size=4, num_attention_heads=1)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.arange(2*2*4).float().reshape(2,2,4)
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (2, 1, 2, 4)
    assert torch.equal(out[:, 0, :, :], x)

def test_transpose_for_scores_single_batch():
    # Only one batch
    config = WindowedDinov2WithRegistersConfig(hidden_size=8, num_attention_heads=4)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.arange(1*3*8).float().reshape(1,3,8)
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (1, 4, 3, 2)


def test_transpose_for_scores_zero_batch():
    # Zero batch size
    config = WindowedDinov2WithRegistersConfig(hidden_size=4, num_attention_heads=2)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.empty(0, 2, 4)
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (0, 2, 2, 2)

def test_transpose_for_scores_zero_seq():
    # Zero sequence length
    config = WindowedDinov2WithRegistersConfig(hidden_size=4, num_attention_heads=2)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.empty(1, 0, 4)
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (1, 2, 0, 2)

def test_transpose_for_scores_zero_hidden():
    # Zero hidden size, should raise error on view
    config = WindowedDinov2WithRegistersConfig(hidden_size=0, num_attention_heads=1)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.empty(1, 1, 0)
    with pytest.raises(RuntimeError):
        attn.transpose_for_scores(x)

def test_transpose_for_scores_high_dim_input():
    # Input with extra dimensions (e.g., batch, seq, hidden, extra)
    config = WindowedDinov2WithRegistersConfig(hidden_size=6, num_attention_heads=2)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.arange(2*3*4*6).float().reshape(2,3,4,6)
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output

def test_transpose_for_scores_non_contiguous_input():
    # Input tensor is non-contiguous
    config = WindowedDinov2WithRegistersConfig(hidden_size=4, num_attention_heads=2)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.arange(2*3*4).float().reshape(2,3,4)
    x_t = x.transpose(0,1)  # Now shape (3,2,4), non-contiguous
    # Should still work
    codeflash_output = attn.transpose_for_scores(x_t); out = codeflash_output

# -------------------- LARGE SCALE TEST CASES --------------------

def test_transpose_for_scores_large_batch_and_seq():
    # Large batch and sequence, but under 100MB
    batch_size = 32
    seq_len = 32
    hidden_size = 64  # 32*32*64*4 = 262144 bytes = 0.25 MB
    num_heads = 8
    config = WindowedDinov2WithRegistersConfig(hidden_size=hidden_size, num_attention_heads=num_heads)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.randn(batch_size, seq_len, hidden_size)
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch_size, num_heads, seq_len, hidden_size // num_heads)
    # Check a few boundary values for consistency
    for b in [0, batch_size-1]:
        for h in [0, num_heads-1]:
            for s in [0, seq_len-1]:
                expected = x[b, s, h*(hidden_size//num_heads):(h+1)*(hidden_size//num_heads)]
                assert torch.equal(out[b, h, s, :], expected)

def test_transpose_for_scores_large_heads():
    # Large number of heads
    batch_size = 4
    seq_len = 8
    hidden_size = 128
    num_heads = 32
    config = WindowedDinov2WithRegistersConfig(hidden_size=hidden_size, num_attention_heads=num_heads)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.randn(batch_size, seq_len, hidden_size)
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch_size, num_heads, seq_len, hidden_size // num_heads)

def test_transpose_for_scores_large_hidden():
    # Large hidden size, small batch/seq
    batch_size = 2
    seq_len = 2
    hidden_size = 512
    num_heads = 16
    config = WindowedDinov2WithRegistersConfig(hidden_size=hidden_size, num_attention_heads=num_heads)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.randn(batch_size, seq_len, hidden_size)
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch_size, num_heads, seq_len, hidden_size // num_heads)

def test_transpose_for_scores_performance():
    # Stress test with a tensor close to 100MB
    # 100_000_000 / 4 = 25_000_000 floats
    # Let's use batch=16, seq=64, hidden=96 (16*64*96=98304)
    # Actually, that's only ~0.38MB, so we can go bigger:
    # batch=32, seq=128, hidden=256 = 32*128*256=1048576 floats = 4MB
    # Let's use batch=64, seq=128, hidden=256 = 64*128*256=2097152 floats = 8MB
    batch_size = 64
    seq_len = 128
    hidden_size = 256
    num_heads = 16
    config = WindowedDinov2WithRegistersConfig(hidden_size=hidden_size, num_attention_heads=num_heads)
    attn = Dinov2WithRegistersSelfAttention(config)
    x = torch.randn(batch_size, seq_len, hidden_size)
    codeflash_output = attn.transpose_for_scores(x); out = codeflash_output
    assert out.shape == (batch_size, num_heads, seq_len, hidden_size // num_heads)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-pr1250-2025-05-14T12.06.49 and push.

Codeflash

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label May 14, 2025
@grzegorz-roboflow
Contributor

Not relevant anymore; the source branch has received further updates.

@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr1250-2025-05-14T12.06.49 branch June 10, 2025 18:03