
sync #65


Open · wants to merge 59 commits into base: main
Changes from all commits (59 commits)
8efeb7f  add print statements for debugging (skrider, Feb 8, 2024)
ac5e78a  add print statements for debugging (skrider, Feb 9, 2024)
14b190b  reshape gmem copy (skrider, Feb 11, 2024)
409431b  only test trivial block size (skrider, Feb 11, 2024)
70dd049  implement kv page iteration functions (skrider, Feb 11, 2024)
59e76be  rearrange initial offset computation (skrider, Feb 11, 2024)
175369f  tests passing for single page k (skrider, Feb 11, 2024)
3691677  paged copy refactor working for page size 256 (skrider, Feb 11, 2024)
c05b857  allow small page sizes in flash api (skrider, Feb 11, 2024)
347a625  remove print statements (skrider, Feb 11, 2024)
3bb71a9  tidy flash_fwd_kernel (skrider, Feb 11, 2024)
fa13c6b  compiles for all h but 128 (skrider, Feb 13, 2024)
bde5aec  all working except rotary embedding (skrider, Feb 13, 2024)
bc66858  add page size 16 to tests (skrider, Feb 13, 2024)
0f5a45e  reshape rotary sin/cos copy to align with paged KV copy (skrider, Feb 26, 2024)
802dd6a  revert hardcoded rotcossin thread layout (skrider, Feb 26, 2024)
135a1da  resolve page offsets absolutely not relatively (skrider, Mar 26, 2024)
a63157e  add test for page table overflow (skrider, Mar 26, 2024)
7968148  allow smaller page sizes in varlen api (skrider, Mar 26, 2024)
0c07b79  Minor fix in compute_attn_1rowblock_splitkv (#900) (beginlner, Mar 28, 2024)
69dfbf6  Add the option for the macro and note (#893) (drisspg, Mar 28, 2024)
06c2389  Sync up with Dao-AILab/main (WoosukKwon, Mar 28, 2024)
47ad076  Remove backward pass (WoosukKwon, Mar 28, 2024)
6ac8e63  flash_attn -> vllm_flash_attn (WoosukKwon, Mar 28, 2024)
ae856f3  Remove unnecessary files (WoosukKwon, Mar 28, 2024)
498cd8c  flash-attn -> vllm-flash-attn (WoosukKwon, Mar 28, 2024)
a43fbbf  Merge remote-tracking branch 'tri/main' (WoosukKwon, Apr 22, 2024)
0e892ad  Upgrade to PyTorch 2.2.1 (WoosukKwon, Apr 22, 2024)
cb02853  Pin PyTorch and CUDA versions (WoosukKwon, Apr 22, 2024)
733117e  Add build script (WoosukKwon, Apr 22, 2024)
a45acf9  Upgrade to PyTorch 2.3.0 (WoosukKwon, May 6, 2024)
422545b  Version up to 2.5.8 (WoosukKwon, May 6, 2024)
f638e65  Upgrade to v2.5.8.post1 (WoosukKwon, May 7, 2024)
961cfbd  Github action for build (WoosukKwon, May 7, 2024)
d6cd3cd  Move (WoosukKwon, May 7, 2024)
7731823  Remove cuda 11.8 (WoosukKwon, May 7, 2024)
f80aa0f  MAX_JOBS=1 (WoosukKwon, May 7, 2024)
50601bf  Use int64_t for page pointer arth (WoosukKwon, May 19, 2024)
264a683  Fix typo (WoosukKwon, May 19, 2024)
3c263a9  Upgrade to 2.5.8.post2 (WoosukKwon, May 19, 2024)
eee8e47  Remove dropout & Uneven K (WoosukKwon, May 19, 2024)
b16c279  Expose out in python API (#2) (Yard1, May 22, 2024)
7f3b182  Upgrade to v2.5.8.post3 (WoosukKwon, May 22, 2024)
03bf1f8  Don't use kwargs in autograd functions (#3) (Yard1, May 27, 2024)
e5da6e4  Fix out kwarg shape check with ngroups swapped (#4) (Yard1, May 31, 2024)
a3dd38d  Bump up to v2.5.9 (WoosukKwon, May 31, 2024)
ba625d5  Upgrade to torch 2.3.1 (#5) (WoosukKwon, Jun 7, 2024)
537f75e  Upgrade to v2.5.9.post1 (#6) (WoosukKwon, Jun 7, 2024)
8f48a54  use global function rather than lambda (#7) (youkaichao, Jul 24, 2024)
5a3e6eb  Update torch to 2.4.0 (#8) (SageMoore, Jul 29, 2024)
e23f458  Add CUDA 11.8 (#9) (WoosukKwon, Jul 29, 2024)
f424d25  Bump up to 2.6.0 (WoosukKwon, Jul 29, 2024)
1237570  Adds Python 3.12 to publish.yml (#10) (mgoin, Aug 1, 2024)
d562aa6  Sync with FA v2.6.0 to support soft capping (#13) (WoosukKwon, Aug 1, 2024)
30a44ae  Support non-default CUDA version (#14) (WoosukKwon, Aug 1, 2024)
f9d2c10  Bump up to v2.6.1 (#15) (WoosukKwon, Aug 1, 2024)
9dfc07d  bump cuda to 12.4 (AlpinDale, Aug 31, 2024)
f26c056  bump to 2.6.1.post1 (AlpinDale, Aug 31, 2024)
49b45ed  fix import and bump to 2.6.1.post2 (AlpinDale, Sep 1, 2024)
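
The commits above pin the CI build to MAX_JOBS=1, PyTorch 2.4.0, and CUDA 12.4. A minimal local-build sketch under those same settings (the pins are taken from the commit messages above; the exact command sequence mirrors the new scripts/build.sh and is otherwise an assumption, not part of this PR):

# Illustrative local wheel build mirroring the CI settings from the commits above
export MAX_JOBS=1                            # limit parallel compile jobs to avoid OOM (commit f80aa0f)
pip install torch==2.4.0                     # PyTorch pin from commit 5a3e6eb; should match setup.py
pip install ninja packaging wheel
python setup.py bdist_wheel --dist-dir=dist  # builds the vllm-flash-attn wheel into dist/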
232 changes: 52 additions & 180 deletions .github/workflows/publish.yml
@@ -1,227 +1,99 @@
# This workflow will:
# - Create a new Github release
# - Build wheels for supported architectures
# - Deploy the wheels to the Github release
# - Release the static code to PyPi
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
# This workflow will upload a Python Package to Release asset
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions

name: Build wheels and deploy
name: Create Release

on:
create:
push:
tags:
- v*

jobs:
# Needed to create release and upload assets
permissions:
contents: write

setup_release:
jobs:
release:
# Retrieve tag and create release
name: Create Release
runs-on: ubuntu-latest
outputs:
upload_url: ${{ steps.create_release.outputs.upload_url }}
steps:
- name: Get the tag version
id: extract_branch
run: echo ::set-output name=branch::${GITHUB_REF#refs/tags/}
- name: Checkout
uses: actions/checkout@v3

- name: Extract branch info
shell: bash
run: |
echo "release_tag=${GITHUB_REF#refs/*/}" >> $GITHUB_ENV

- name: Create Release
id: create_release
uses: actions/create-release@v1
uses: "actions/github-script@v6"
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
RELEASE_TAG: ${{ env.release_tag }}
with:
tag_name: ${{ steps.extract_branch.outputs.branch }}
release_name: ${{ steps.extract_branch.outputs.branch }}
github-token: "${{ secrets.GITHUB_TOKEN }}"
script: |
const script = require('.github/workflows/scripts/create_release.js')
await script(github, context, core)

build_wheels:
wheel:
name: Build Wheel
needs: setup_release
runs-on: ${{ matrix.os }}
needs: release

strategy:
fail-fast: false
matrix:
# Using ubuntu-20.04 instead of 22.04 for more compatibility (glibc). Ideally we'd use the
# manylinux docker image, but I haven't figured out how to install CUDA on manylinux.
os: [ubuntu-20.04]
python-version: ['3.7', '3.8', '3.9', '3.10', '3.11']
torch-version: ['1.12.1', '1.13.1', '2.0.1', '2.1.2', '2.2.0', '2.3.0.dev20240207']
cuda-version: ['11.8.0', '12.2.2']
# We need separate wheels that either uses C++11 ABI (-D_GLIBCXX_USE_CXX11_ABI) or not.
# Pytorch wheels currently don't use it, but nvcr images have Pytorch compiled with C++11 ABI.
# Without this we get import error (undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs)
# when building without C++11 ABI and using it on nvcr images.
cxx11_abi: ['FALSE', 'TRUE']
exclude:
# see https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix
# Pytorch <= 1.12 does not support Python 3.11
- torch-version: '1.12.1'
python-version: '3.11'
# Pytorch >= 2.0 only supports Python >= 3.8
- torch-version: '2.0.1'
python-version: '3.7'
- torch-version: '2.1.2'
python-version: '3.7'
- torch-version: '2.2.0'
python-version: '3.7'
- torch-version: '2.3.0.dev20240207'
python-version: '3.7'
# Pytorch <= 2.0 only supports CUDA <= 11.8
- torch-version: '1.12.1'
cuda-version: '12.2.2'
- torch-version: '1.13.1'
cuda-version: '12.2.2'
- torch-version: '2.0.1'
cuda-version: '12.2.2'
os: ['ubuntu-20.04']
python-version: ['3.8', '3.9', '3.10', '3.11', '3.12']
pytorch-version: ['2.4.0'] # Should be synced with setup.py.
cuda-version: ['12.4', '11.8']

steps:
- name: Checkout
uses: actions/checkout@v3

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Setup ccache
uses: hendrikmuhs/[email protected]

- name: Set CUDA and PyTorch versions
run: |
echo "MATRIX_CUDA_VERSION=$(echo ${{ matrix.cuda-version }} | awk -F \. {'print $1 $2'})" >> $GITHUB_ENV
echo "MATRIX_TORCH_VERSION=$(echo ${{ matrix.torch-version }} | awk -F \. {'print $1 "." $2'})" >> $GITHUB_ENV
echo "MATRIX_PYTHON_VERSION=$(echo ${{ matrix.python-version }} | awk -F \. {'print $1 $2'})" >> $GITHUB_ENV

- name: Free up disk space
- name: Set up Linux Env
if: ${{ runner.os == 'Linux' }}
# https://github.com/easimon/maximize-build-space/blob/master/action.yml
# https://github.com/easimon/maximize-build-space/tree/test-report
run: |
sudo rm -rf /usr/share/dotnet
sudo rm -rf /opt/ghc
sudo rm -rf /opt/hostedtoolcache/CodeQL
bash -x .github/workflows/scripts/env.sh

- name: Set up swap space
if: runner.os == 'Linux'
uses: pierotofy/[email protected]
- name: Set up Python
uses: actions/setup-python@v4
with:
swap-size-gb: 10
python-version: ${{ matrix.python-version }}

- name: Install CUDA ${{ matrix.cuda-version }}
if: ${{ matrix.cuda-version != 'cpu' }}
uses: Jimver/[email protected]
id: cuda-toolkit
with:
cuda: ${{ matrix.cuda-version }}
linux-local-args: '["--toolkit"]'
# default method is "local", and we're hitting some error with caching for CUDA 11.8 and 12.1
# method: ${{ (matrix.cuda-version == '11.8.0' || matrix.cuda-version == '12.1.0') && 'network' || 'local' }}
method: 'network'
# We need the cuda libraries (e.g. cuSparse, cuSolver) for compiling PyTorch extensions,
# not just nvcc
# sub-packages: '["nvcc"]'

- name: Install PyTorch ${{ matrix.torch-version }}+cu${{ matrix.cuda-version }}
run: |
pip install --upgrade pip
# If we don't install before installing Pytorch, we get error for torch 2.0.1
# ERROR: Could not find a version that satisfies the requirement setuptools>=40.8.0 (from versions: none)
pip install lit
# We want to figure out the CUDA version to download pytorch
# e.g. we can have system CUDA version being 11.7 but if torch==1.12 then we need to download the wheel from cu116
# see https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix
# This code is ugly, maybe there's a better way to do this.
export TORCH_CUDA_VERSION=$(python -c "from os import environ as env; \
minv = {'1.12': 113, '1.13': 116, '2.0': 117, '2.1': 118, '2.2': 118, '2.3': 118}[env['MATRIX_TORCH_VERSION']]; \
maxv = {'1.12': 116, '1.13': 117, '2.0': 118, '2.1': 121, '2.2': 121, '2.3': 121}[env['MATRIX_TORCH_VERSION']]; \
print(max(min(int(env['MATRIX_CUDA_VERSION']), maxv), minv))" \
)
if [[ ${{ matrix.torch-version }} == *"dev"* ]]; then
if [[ ${MATRIX_TORCH_VERSION} == "2.2" ]]; then
# --no-deps because we can't install old versions of pytorch-triton
pip install typing-extensions jinja2
pip install --no-cache-dir --no-deps --pre https://download.pytorch.org/whl/nightly/cu${TORCH_CUDA_VERSION}/torch-${{ matrix.torch-version }}%2Bcu${TORCH_CUDA_VERSION}-cp${MATRIX_PYTHON_VERSION}-cp${MATRIX_PYTHON_VERSION}-linux_x86_64.whl
else
pip install --no-cache-dir --pre torch==${{ matrix.torch-version }} --index-url https://download.pytorch.org/whl/nightly/cu${TORCH_CUDA_VERSION}
fi
else
pip install --no-cache-dir torch==${{ matrix.torch-version }} --index-url https://download.pytorch.org/whl/cu${TORCH_CUDA_VERSION}
fi
nvcc --version
python --version
python -c "import torch; print('PyTorch:', torch.__version__)"
python -c "import torch; print('CUDA:', torch.version.cuda)"
python -c "from torch.utils import cpp_extension; print (cpp_extension.CUDA_HOME)"
shell:
bash

- name: Build wheel
run: |
# We want setuptools >= 49.6.0 otherwise we can't compile the extension if system CUDA version is 11.7 and pytorch cuda version is 11.6
# https://github.com/pytorch/pytorch/blob/664058fa83f1d8eede5d66418abff6e20bd76ca8/torch/utils/cpp_extension.py#L810
# However this still fails so I'm using a newer version of setuptools
pip install setuptools==68.0.0
pip install ninja packaging wheel
export PATH=/usr/local/nvidia/bin:/usr/local/nvidia/lib64:$PATH
export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Limit MAX_JOBS otherwise the github runner goes OOM
MAX_JOBS=2 FLASH_ATTENTION_FORCE_BUILD="TRUE" FLASH_ATTENTION_FORCE_CXX11_ABI=${{ matrix.cxx11_abi}} python setup.py bdist_wheel --dist-dir=dist
tmpname=cu${MATRIX_CUDA_VERSION}torch${MATRIX_TORCH_VERSION}cxx11abi${{ matrix.cxx11_abi }}
wheel_name=$(ls dist/*whl | xargs -n 1 basename | sed "s/-/+$tmpname-/2")
ls dist/*whl |xargs -I {} mv {} dist/${wheel_name}
echo "wheel_name=${wheel_name}" >> $GITHUB_ENV
bash -x .github/workflows/scripts/cuda-install.sh ${{ matrix.cuda-version }} ${{ matrix.os }}

- name: Log Built Wheels
- name: Install PyTorch ${{ matrix.pytorch-version }} with CUDA ${{ matrix.cuda-version }}
run: |
ls dist
bash -x .github/workflows/scripts/pytorch-install.sh ${{ matrix.python-version }} ${{ matrix.pytorch-version }} ${{ matrix.cuda-version }}

- name: Get the tag version
id: extract_branch
run: echo ::set-output name=branch::${GITHUB_REF#refs/tags/}

- name: Get Release with tag
id: get_current_release
uses: joutvhu/get-release@v1
with:
tag_name: ${{ steps.extract_branch.outputs.branch }}
- name: Build wheel
shell: bash
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
CMAKE_BUILD_TYPE: Release # do not compile with debug symbol to reduce wheel size
run: |
bash -x .github/workflows/scripts/build.sh ${{ matrix.python-version }} ${{ matrix.cuda-version }}
wheel_name=$(ls dist/*whl | xargs -n 1 basename)
asset_name=${wheel_name//"linux"/"manylinux1"}
echo "wheel_name=${wheel_name}" >> $GITHUB_ENV
echo "asset_name=${asset_name}" >> $GITHUB_ENV

- name: Upload Release Asset
id: upload_release_asset
uses: actions/upload-release-asset@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
upload_url: ${{ steps.get_current_release.outputs.upload_url }}
asset_path: ./dist/${{env.wheel_name}}
asset_name: ${{env.wheel_name}}
upload_url: ${{ needs.release.outputs.upload_url }}
asset_path: ./dist/${{ env.wheel_name }}
asset_name: ${{ env.asset_name }}
asset_content_type: application/*

publish_package:
name: Publish package
needs: [build_wheels]

runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v4
with:
python-version: '3.10'

- name: Install dependencies
run: |
pip install ninja packaging setuptools wheel twine
# We don't want to download anything CUDA-related here
pip install torch --index-url https://download.pytorch.org/whl/cpu

- name: Build core package
env:
FLASH_ATTENTION_SKIP_CUDA_BUILD: "TRUE"
run: |
python setup.py sdist --dist-dir=dist

- name: Deploy
env:
TWINE_USERNAME: "__token__"
TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
run: |
python -m twine upload dist/*
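
The new workflow above runs when a v-prefixed tag is pushed (see the on: push: tags: - v* block). An illustrative trigger, using the version from the final commit in this PR:

# Pushing a v* tag kicks off the Create Release and Build Wheel jobs above
git tag v2.6.1.post2
git push origin v2.6.1.post2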
16 changes: 16 additions & 0 deletions .github/workflows/scripts/build.sh
@@ -0,0 +1,16 @@
#!/bin/bash

python_executable=python$1
cuda_home=/usr/local/cuda-$2

# Update paths
PATH=${cuda_home}/bin:$PATH
LD_LIBRARY_PATH=${cuda_home}/lib64:$LD_LIBRARY_PATH

# Install requirements
$python_executable -m pip install wheel packaging

# Limit the number of parallel jobs to avoid OOM
export MAX_JOBS=1
# Build
$python_executable setup.py bdist_wheel --dist-dir=dist
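
The wheel job invokes this script with the matrix's Python and CUDA versions as positional arguments. A hypothetical local run with values from the matrix above, assuming python3.10 and /usr/local/cuda-12.4 are installed:

# $1 = Python version, $2 = CUDA version (values below are illustrative)
bash -x .github/workflows/scripts/build.sh 3.10 12.4
ls dist/*.whl   # the built wheel ends up in dist/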
20 changes: 20 additions & 0 deletions .github/workflows/scripts/create_release.js
@@ -0,0 +1,20 @@
// Uses Github's API to create the release and wait for result.
// We use a JS script since github CLI doesn't provide a way to wait for the release's creation and returns immediately.

module.exports = async (github, context, core) => {
try {
const response = await github.rest.repos.createRelease({
draft: false,
generate_release_notes: true,
name: process.env.RELEASE_TAG,
owner: context.repo.owner,
prerelease: true,
repo: context.repo.repo,
tag_name: process.env.RELEASE_TAG,
});

core.setOutput('upload_url', response.data.upload_url);
} catch (error) {
core.setFailed(error.message);
}
}
23 changes: 23 additions & 0 deletions .github/workflows/scripts/cuda-install.sh
@@ -0,0 +1,23 @@
#!/bin/bash

# Replace '.' with '-' ex: 11.8 -> 11-8
cuda_version=$(echo $1 | tr "." "-")
# Removes '-' and '.' ex: ubuntu-20.04 -> ubuntu2004
OS=$(echo $2 | tr -d ".\-")

# Installs CUDA
wget -nv https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
rm cuda-keyring_1.1-1_all.deb
sudo apt -qq update
sudo apt -y install cuda-${cuda_version} cuda-nvcc-${cuda_version} cuda-libraries-dev-${cuda_version}
sudo apt clean

# Test nvcc
PATH=/usr/local/cuda-$1/bin:${PATH}
nvcc --version

# Log gcc, g++, c++ versions
gcc --version
g++ --version
c++ --version
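
Like build.sh, this script takes the workflow matrix values as positional arguments; an illustrative invocation with the CUDA version and runner OS used in the matrix above:

# $1 = CUDA version, $2 = runner OS; installs cuda-12-4 from the NVIDIA apt repository
bash -x .github/workflows/scripts/cuda-install.sh 12.4 ubuntu-20.04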
56 changes: 56 additions & 0 deletions .github/workflows/scripts/env.sh
@@ -0,0 +1,56 @@
#!/bin/bash

# This file installs common linux environment tools

export LANG C.UTF-8

# python_version=$1

sudo apt-get update && \
sudo apt-get install -y --no-install-recommends \
software-properties-common \

sudo apt-get install -y --no-install-recommends \
build-essential \
apt-utils \
ca-certificates \
wget \
git \
vim \
libssl-dev \
curl \
unzip \
unrar \
cmake \
net-tools \
sudo \
autotools-dev \
rsync \
jq \
openssh-server \
tmux \
screen \
htop \
pdsh \
openssh-client \
lshw \
dmidecode \
util-linux \
automake \
autoconf \
libtool \
net-tools \
pciutils \
libpci-dev \
libaio-dev \
libcap2 \
libtinfo5 \
fakeroot \
devscripts \
debhelper \
nfs-common

# Remove github bloat files to free up disk space
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf "/usr/share/dotnet"