Does SwiftInfer integrate well with PagedAttention? #9

@gawainx

PagedAttention is a widely used method for LLM serving. It splits the KV cache of a request into multiple blocks, and each block contains multiple slots (tokens). I think PagedAttention might hinder StreamingLLM, since individual slots within a block cannot be evicted. So I would like to know: how does SwiftInfer integrate with PagedAttention?

ref: [2309.06180] Efficient Memory Management for Large Language Model Serving with PagedAttention
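
To make the concern concrete, here is a minimal Python sketch of the mismatch. It assumes a StreamingLLM-style policy that keeps the first few "attention sink" tokens plus a recent window, and a paged layout where token position `p` lives in block `p // block_size`. This is not SwiftInfer or vLLM code; every name and parameter value below is hypothetical and only meant for illustration.

```python
# Illustrative only: not SwiftInfer or vLLM code. All names are hypothetical.

def streaming_keep_set(seq_len: int, num_sinks: int, window: int) -> list[int]:
    """Token positions kept by a StreamingLLM-style policy:
    the first `num_sinks` tokens plus the most recent `window` tokens."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))
    return list(range(num_sinks)) + list(range(seq_len - window, seq_len))


def block_occupancy(kept: list[int], block_size: int) -> dict[int, int]:
    """Map each block index to the number of kept (live) slots it holds."""
    occupancy: dict[int, int] = {}
    for pos in kept:
        block = pos // block_size  # assumed layout: contiguous positions per block
        occupancy[block] = occupancy.get(block, 0) + 1
    return occupancy


def freeable_blocks(seq_len: int, kept: list[int], block_size: int) -> list[int]:
    """Blocks with no live slots; these can be returned to the allocator whole."""
    total_blocks = (seq_len + block_size - 1) // block_size
    occupancy = block_occupancy(kept, block_size)
    return [b for b in range(total_blocks) if b not in occupancy]


def partial_blocks(seq_len: int, kept: list[int], block_size: int) -> list[int]:
    """Blocks that mix live and evicted slots; they cannot be freed,
    so their evicted slots remain allocated but unused."""
    result = []
    for block, live in block_occupancy(kept, block_size).items():
        # The last block may hold fewer than `block_size` written slots.
        written = min(block_size, seq_len - block * block_size)
        if live < written:
            result.append(block)
    return sorted(result)


if __name__ == "__main__":
    seq_len, num_sinks, window, block_size = 14, 2, 5, 4
    kept = streaming_keep_set(seq_len, num_sinks, window)
    print("kept positions:", kept)
    print("freeable blocks:", freeable_blocks(seq_len, kept, block_size))
    print("partially evicted blocks:", partial_blocks(seq_len, kept, block_size))
```

With these example values, only one block is fully evicted and can be freed, while two blocks still hold a mix of live and evicted slots. Aligning the sink count and the eviction step to the block size would avoid such partial blocks, but whether SwiftInfer does anything like this is exactly the open question here.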
