Local AI Stack

This project currently deploys Qwen/Qwen3.6-27B from ModelScope behind an OpenAI-compatible text chat API, while keeping the gateway and registry generic for additional local models.

The public API process is a lightweight proxy on port 8000. It starts an internal model worker only when a chat request arrives, then stops that worker after IDLE_UNLOAD_SECONDS seconds without active requests. Because the worker process exits, its GPU memory is fully released while the stack is idle.
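
The start-on-demand, stop-when-idle cycle described above can be sketched as follows. This is a minimal illustration, not the code in proxy_qwen36.py: the class name LazyWorker, its methods, and the injected start/stop callables are all hypothetical.

```python
import time
from typing import Callable


class LazyWorker:
    """Start a worker on first use and stop it after an idle period."""

    def __init__(self, start: Callable[[], None], stop: Callable[[], None],
                 idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._start = start            # hypothetical: launches the worker process
        self._stop = stop              # hypothetical: terminates the worker process
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._active = 0               # requests currently in flight
        self._last_used = clock()
        self.running = False

    def begin_request(self) -> None:
        """Call before forwarding a request; starts the worker if needed."""
        if not self.running:
            self._start()
            self.running = True
        self._active += 1

    def end_request(self) -> None:
        """Call after a request finishes; restarts the idle countdown."""
        self._active -= 1
        self._last_used = self._clock()

    def tick(self) -> None:
        """Call periodically; stops the worker once it has idled long enough."""
        idle_for = self._clock() - self._last_used
        if self.running and self._active == 0 and idle_for >= self._idle_seconds:
            self._stop()
            self.running = False
```

Counting in-flight requests keeps a long generation from being cut off by the timer, which matches the "without active requests" wording above.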

Current endpoint

LAN base URL: http://<LAN_IP>:8000/v1
Local base URL: http://127.0.0.1:8000/v1
API key: local-dev-key
Model name: Qwen/Qwen3.6-27B

Files

Model: models/Qwen3.6-27B
Public proxy: proxy_qwen36.py
Internal worker: serve_qwen36.py
Download script: download_model.py
Start foreground: ./start.sh
Start background: ./start_background.sh
Stop: ./stop.sh

Visual overview

On-demand start

To avoid keeping the API proxies resident all the time, start the Web UI and both APIs (text chat on port 8000 and image generation on port 8001) only when needed:

./api.sh

Then open http://127.0.0.1:8080. For another device on the same network, use the LAN address printed by ./api.sh, and replace <LAN_IP> in the examples with that address.

Keep that terminal open while using the Web UI or APIs. Press Ctrl+C to stop the Web UI, both API proxies, and any worker they started.

Quick helpers:

./api.sh status
./api.sh stop

Web UI and model switching

The Web UI runs on port 8080 by default and calls the local OpenAI-compatible APIs through web_ui.py, so the browser does not need to hold the API key directly.

Model choices are loaded from model_registry.json. To add another OpenAI-compatible model later, add another item with:

{
  "id": "provider/model-name",
  "label": "Display name",
  "type": "chat",
  "base_url": "http://127.0.0.1:8002/v1",
  "api_key_env": "API_KEY",
  "default": false
}

Use "type": "image" for image generation models. Restart ./api.sh after editing the registry.
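
As a rough illustration of how entries like the one above could be validated and a default chosen, here is a small sketch. It assumes model_registry.json is a JSON array of such entries, which the README does not confirm; the function names parse_registry and pick_default are hypothetical and are not taken from web_ui.py.

```python
import json
from typing import Optional

# Keys shown in the example entry above; "default" is treated as optional.
REQUIRED_KEYS = {"id", "label", "type", "base_url", "api_key_env"}


def parse_registry(text: str) -> list:
    """Parse registry text, assuming a top-level JSON array of entries."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of model entries")
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(
                f"entry {entry.get('id', '?')} is missing {sorted(missing)}")
    return entries


def pick_default(entries: list, model_type: str) -> Optional[dict]:
    """Return the default entry of a type, else the first one, else None."""
    candidates = [e for e in entries if e["type"] == model_type]
    for entry in candidates:
        if entry.get("default"):
            return entry
    return candidates[0] if candidates else None
```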

User systemd service

The service can also be managed as a user unit, but the on-demand ./api.sh flow avoids keeping it resident:

systemctl --user status qwen36-api.service
systemctl --user stop qwen36-api.service

Health

curl http://127.0.0.1:8000/health

worker_running:false means the model worker is stopped and GPU memory should be near baseline.
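
For scripted checks, the same health probe might look like the sketch below. It assumes the /health response body is a JSON object with a boolean worker_running field, as the note above suggests; the helper names are hypothetical.

```python
import json
import urllib.request


def fetch_health(base: str = "http://127.0.0.1:8000", timeout: float = 5.0) -> str:
    """Return the raw /health response body from the proxy."""
    with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def worker_is_running(health_body: str) -> bool:
    """Interpret the body; a missing field is treated as 'not running'."""
    return bool(json.loads(health_body).get("worker_running", False))
```

With the proxy running, worker_is_running(fetch_health()) reports whether the model worker is currently loaded.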

Example

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer local-dev-key' \
  -d '{
    "model": "Qwen/Qwen3.6-27B",
    "messages": [{"role": "user", "content": "Say OK in one word."}],
    "max_tokens": 8,
    "temperature": 0,
    "extra_body": {"chat_template_kwargs": {"enable_thinking": false}}
  }'
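
The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl payload above; the helper names are hypothetical, and the response parsing assumes the usual OpenAI chat completion shape (choices[0].message.content), which the README implies but does not show.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://127.0.0.1:8000/v1",
                       api_key: str = "local-dev-key",
                       model: str = "Qwen/Qwen3.6-27B") -> urllib.request.Request:
    """Build (but do not send) a chat completion request like the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 8,
        "temperature": 0,
        "extra_body": {"chat_template_kwargs": {"enable_thinking": False}},
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def first_reply(response_body: str) -> str:
    """Extract the assistant text, assuming an OpenAI-style response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

To send it, pass the request to urllib.request.urlopen with a generous timeout, since the first call also has to load the model.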

Idle behavior

The default idle timeout is 300 seconds. To override it, set IDLE_UNLOAD_SECONDS when starting the proxy, for example to 60 seconds:

IDLE_UNLOAD_SECONDS=60 ./start.sh

When the timeout is reached, the proxy terminates the internal worker process. This is intentional: process termination is the reliable way to release CUDA memory on this hardware.
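
Terminating a child process with a grace period can be sketched as below. This is a generic pattern built on Python's subprocess module, not the proxy's actual shutdown code; the function name stop_worker and the 10-second default are assumptions.

```python
import subprocess


def stop_worker(proc: subprocess.Popen, grace_seconds: float = 10.0) -> int:
    """Ask the worker to exit, then force-kill it if it does not comply."""
    if proc.poll() is None:           # still running
        proc.terminate()              # SIGTERM on POSIX
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()               # SIGKILL on POSIX
            proc.wait()
    return proc.returncode
```

Once the process has exited, the driver reclaims the GPU memory it held, which is why the proxy ends the process rather than trying to free memory inside it.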

The wrapper currently targets text chat. The upstream model is multimodal; for image/video API support use the official transformers serve or vLLM/SGLang on compatible hardware.

Qwen-Image text-to-image API

Qwen/Qwen-Image-2512 is downloaded to models/Qwen-Image-2512 and served through an OpenAI-compatible Images API.

LAN base URL: http://<LAN_IP>:8001/v1
Local base URL: http://127.0.0.1:8001/v1
API key: local-dev-key
Image model name: Qwen/Qwen-Image-2512
Public proxy: proxy_qwen_image.py
Internal worker: qwen_image_worker.py
Start: ./start_qwen_image.sh
Stop: ./stop_qwen_image.sh

The image API also uses lazy loading. The proxy starts the worker on the first image request and stops it after IMAGE_IDLE_UNLOAD_SECONDS=300 seconds of inactivity. On this Tesla P40 machine, the working configuration is IMAGE_DTYPE=float32 plus IMAGE_DEVICE_MAP=sequential; FP16 generated black images because Qwen-Image-2512 expects BF16-class numerics, which P40 does not support.
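
The README does not say exactly how FP16 fails here. One common cause of all-black output is an intermediate value exceeding the FP16 range, and the range gap itself is easy to demonstrate: FP16 tops out at 65504, while BF16 keeps the 8-bit exponent of float32. The sketch below uses only the standard library; the helper names are hypothetical, and the BF16 helper truncates rather than rounds, so it is only an approximation.

```python
import struct


def fits_in_fp16(value: float) -> bool:
    """True if value can be stored as a finite IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


def round_trip_bf16(value: float) -> float:
    """Approximate BF16 by keeping the top 16 bits of the float32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (result,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return result
```

A value such as 70000.0 cannot be stored in FP16 but survives the BF16 round trip with reduced precision. Since float32 covers the same exponent range as BF16, this is consistent with IMAGE_DTYPE=float32 working on the P40.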

Like the chat API, the image API can also be managed as a user unit:

systemctl --user status qwen-image-api.service
systemctl --user stop qwen-image-api.service

Health check:

curl http://127.0.0.1:8001/health

Example request:

curl http://127.0.0.1:8001/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer local-dev-key" \
  -d "{\"model\":\"Qwen/Qwen-Image-2512\",\"prompt\":\"A red circle on a white background\",\"size\":\"512x512\",\"n\":1,\"response_format\":\"b64_json\",\"extra_body\":{\"num_inference_steps\":12,\"true_cfg_scale\":4.0,\"seed\":42}}"
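
A client that requests b64_json has to decode the payload itself. The sketch below assumes the response follows the OpenAI Images shape (a data array whose items carry b64_json) and that the bytes are PNG; neither detail is confirmed by the README, and save_images is a hypothetical helper.

```python
import base64
import json
from pathlib import Path


def save_images(response_body: str, out_dir: Path, stem: str = "image") -> list:
    """Decode every b64_json item in the response and write it to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, item in enumerate(json.loads(response_body)["data"]):
        path = out_dir / f"{stem}-{index}.png"   # assumption: PNG output
        path.write_bytes(base64.b64decode(item["b64_json"]))
        paths.append(path)
    return paths
```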

Generated image files are saved under outputs/qwen-image/. The quick smoke test is:

.venv/bin/python smoke_test_qwen_image.py

Layered local AI stack

The Web UI now also acts as a lightweight LiteLLM-style gateway. In addition to the existing direct model endpoints, external tools can use one unified OpenAI-compatible base URL:

http://127.0.0.1:8080/v1
http://<LAN_IP>:8080/v1

Supported gateway endpoints:

GET  /v1/models
POST /v1/chat/completions
POST /v1/images/generations
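
The routing these endpoints imply can be sketched as a lookup from the requested model to its upstream URL. This is an illustration only, not the logic in web_ui.py: it assumes registry entries with the id, type, and base_url fields shown earlier, and resolve_upstream and ENDPOINT_TYPES are hypothetical names.

```python
# Which registry "type" each gateway POST endpoint is assumed to serve.
ENDPOINT_TYPES = {
    "/v1/chat/completions": "chat",
    "/v1/images/generations": "image",
}


def resolve_upstream(entries: list, path: str, model_id: str) -> str:
    """Return the upstream URL for a gateway path and requested model id."""
    wanted_type = ENDPOINT_TYPES[path]   # KeyError for unsupported paths
    for entry in entries:
        if entry["id"] == model_id and entry["type"] == wanted_type:
            # base_url already ends in /v1, so append only what follows it.
            return entry["base_url"].rstrip("/") + path[len("/v1"):]
    raise LookupError(f"no {wanted_type} model named {model_id!r}")
```

Checking the entry type as well as the id keeps an image model from being sent a chat request by mistake.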

The Stack tab in the Web UI shows the active layers, model health, and copyable Open WebUI/LiteLLM/curl snippets.

Relationship to community projects

This project references community projects for product shape and architecture layering; it does not copy their code. The mapping is:

UI layer: Open WebUI informs the local model console, multi-model entry points, and OpenAI-compatible client experience [1].
Gateway layer: LiteLLM informs the unified API gateway, model routing, and external tool integration pattern [2].
Serving layer: vLLM and SGLang inform the high-throughput OpenAI-compatible serving direction for a future text worker replacement [3,4].
Image and multi-modal layer: ComfyUI informs image workflow design, while LocalAI informs local multi-modal API aggregation [5,6].

Local reference snapshots and integration notes are in:

docs/STACK.md
integrations/
references/

Reference projects

[1] Open WebUI. open-webui/open-webui. GitHub repository. https://github.com/open-webui/open-webui. Accessed: 2026-06-24.

[2] LiteLLM. BerriAI/litellm. GitHub repository. https://github.com/BerriAI/litellm. Accessed: 2026-06-24.

[3] vLLM. vllm-project/vllm. GitHub repository. https://github.com/vllm-project/vllm. Accessed: 2026-06-24.

[4] SGLang. sgl-project/sglang. GitHub repository. https://github.com/sgl-project/sglang. Accessed: 2026-06-24.

[5] ComfyUI. Comfy-Org/ComfyUI. GitHub repository. https://github.com/Comfy-Org/ComfyUI. Accessed: 2026-06-24.

[6] LocalAI. mudler/LocalAI. GitHub repository. https://github.com/mudler/LocalAI. Accessed: 2026-06-24.

Gateway smoke test (run the second command from another terminal, because ./api.sh stays in the foreground):

./api.sh
.venv/bin/python smoke_test_gateway.py
