
Conversation

@deftdawg (Contributor) commented on Mar 19, 2025

[Screenshot: Intel-Arc-Sky-blue]

Initial attempt at Intel Arc support

  • Detects Arc A770
  • Loads weights
  • Produces model output

Eventually fix #557

TODO:

  • Change elif back to if for NV fall-through
  • Bump tinygrad to latest git rev
  • Rerun with DEBUG > 2
  • Tidy up and squash commits

@deftdawg changed the title from "Intel arc" to "Intel Arc Support" on Mar 19, 2025
@deftdawg (Contributor, Author) commented on Mar 19, 2025

Llama 3.2 1B - can't find eos_token?

loaded weights in 2930.88 ms, 2.47 GB loaded at 0.84 GB/s
Checking if local path exists to load tokenizer from local
local_path=PosixPath('/root/.cache/exo/downloads/--root--.cache--exo--downloads--unsloth--Llama-3.2-1B-Instruct')
Trying AutoProcessor for /root/.cache/exo/downloads/unsloth--Llama-3.2-1B-Instruct
Failed to load processor for /root/.cache/exo/downloads/unsloth--Llama-3.2-1B-Instruct. Error: 'bool' object has no attribute 'eos_token_id'
Traceback (most recent call last):
  File "/source/exo/exo/inference/tokenizers.py", line 46, in _resolve_tokenizer
    processor.eos_token_id = getattr(processor, 'tokenizer', getattr(processor, '_tokenizer', processor)).eos_token_id
AttributeError: 'bool' object has no attribute 'eos_token_id'

Trying AutoTokenizer for /root/.cache/exo/downloads/unsloth--Llama-3.2-1B-Instruct
update_peers: added=[] removed=[] updated=[] unchanged=[] to_disconnect=[] to_connect=[]
did_peers_change=False
Collecting topology max_depth=4 visited=set()
Received request: GET /v1/download/progress
Received request: GET /v1/topology
Received request: GET /v1/download/progress
Received request: GET /v1/download/progress
Received request: GET /v1/download/progress
[95c4486d-aa96-4db2-9213-34ea5702488e] result size: 5386752, is finished: False, buffered tokens: 1
Triggering all on_token callbacks with request_id='95c4486d-aa96-4db2-9213-34ea5702488e' tokens=[128256] is_finished=False
[95c4486d-aa96-4db2-9213-34ea5702488e] process prompt: base_shard=Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=0, n_layers=16)
shard=Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16) prompt='<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting
Knowledge Date: December 2023\nToday Date: 19 Mar 2025\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhy is the sky
blue?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' elapsed_time_ns=8612216238
Broadcasting result: request_id='95c4486d-aa96-4db2-9213-34ea5702488e' result=[128256] is_finished=False
target partition index: 0
Computed target from: Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16) 0, Topology(Nodes: {01e581cf-46a1-4592-b254-14e0735baecb: Model:
Linux Box (Intel(R) Arc(TM) A770 Graphics). Chip: Intel(R) Arc(TM) A770 Graphics. Memory: 15473MB. Flops: fp32: 19.66 TFLOPS, fp16: 39.32 TFLOPS, int8: 78.64
TFLOPS}, Edges: {}). target shard: Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16)
[ChatGPTAPI] Waiting for response to finish. timeout=900s
[ChatGPTAPI] Waiting for token from queue: request_id='95c4486d-aa96-4db2-9213-34ea5702488e'
[ChatGPTAPI] Got token from queue: request_id='95c4486d-aa96-4db2-9213-34ea5702488e' tokens=[128256] is_finished=False
eos_token_id=128009 tokens[-1]=128256 finish_reason=None
[ChatGPTAPI] Waiting for token from queue: request_id='95c4486d-aa96-4db2-9213-34ea5702488e'
Received request: GET /v1/download/progress
update_peers: added=[] removed=[] updated=[] unchanged=[] to_disconnect=[] to_connect=[]
did_peers_change=False
Collecting topology max_depth=4 visited=set()
Received request: GET /v1/download/progress
Received request: GET /v1/download/progress
[95c4486d-aa96-4db2-9213-34ea5702488e] result size: 128256, is finished: False, buffered tokens: 2
Triggering all on_token callbacks with request_id='95c4486d-aa96-4db2-9213-34ea5702488e' tokens=[128256] is_finished=False
[95c4486d-aa96-4db2-9213-34ea5702488e] process_tensor: base_shard=Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16)
shard=Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16) tensor.size=1 tensor.shape=(1, 1) elapsed_time_ns=1460651511
Broadcasting result: request_id='95c4486d-aa96-4db2-9213-34ea5702488e' result=[128256] is_finished=False
target partition index: 0
Computed target from: Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16) 0, Topology(Nodes: {01e581cf-46a1-4592-b254-14e0735baecb: Model:
Linux Box (Intel(R) Arc(TM) A770 Graphics). Chip: Intel(R) Arc(TM) A770 Graphics. Memory: 15473MB. Flops: fp32: 19.66 TFLOPS, fp16: 39.32 TFLOPS, int8: 78.64
TFLOPS}, Edges: {}). target shard: Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16)
[ChatGPTAPI] Got token from queue: request_id='95c4486d-aa96-4db2-9213-34ea5702488e' tokens=[128256] is_finished=False
eos_token_id=128009 tokens[-1]=128256 finish_reason=None
[ChatGPTAPI] Waiting for token from queue: request_id='95c4486d-aa96-4db2-9213-34ea5702488e'
Received request: GET /v1/download/progress
Received request: GET /v1/topology
Received request: GET /v1/download/progress
update_peers: added=[] removed=[] updated=[] unchanged=[] to_disconnect=[] to_connect=[]
did_peers_change=False
Collecting topology max_depth=4 visited=set()
[95c4486d-aa96-4db2-9213-34ea5702488e] result size: 128256, is finished: False, buffered tokens: 3
Triggering all on_token callbacks with request_id='95c4486d-aa96-4db2-9213-34ea5702488e' tokens=[128256] is_finished=False
[95c4486d-aa96-4db2-9213-34ea5702488e] process_tensor: base_shard=Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16)
shard=Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16) tensor.size=1 tensor.shape=(1, 1) elapsed_time_ns=962302979
Broadcasting result: request_id='95c4486d-aa96-4db2-9213-34ea5702488e' result=[128256] is_finished=False
target partition index: 0
Computed target from: Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16) 0, Topology(Nodes: {01e581cf-46a1-4592-b254-14e0735baecb: Model:
Linux Box (Intel(R) Arc(TM) A770 Graphics). Chip: Intel(R) Arc(TM) A770 Graphics. Memory: 15473MB. Flops: fp32: 19.66 TFLOPS, fp16: 39.32 TFLOPS, int8: 78.64
TFLOPS}, Edges: {}). target shard: Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16)
[ChatGPTAPI] Got token from queue: request_id='95c4486d-aa96-4db2-9213-34ea5702488e' tokens=[128256] is_finished=False
eos_token_id=128009 tokens[-1]=128256 finish_reason=None
[ChatGPTAPI] Waiting for token from queue: request_id='95c4486d-aa96-4db2-9213-34ea5702488e'
[95c4486d-aa96-4db2-9213-34ea5702488e] result size: 128256, is finished: False, buffered tokens: 4
Triggering all on_token callbacks with request_id='95c4486d-aa96-4db2-9213-34ea5702488e' tokens=[128256] is_finished=False
[95c4486d-aa96-4db2-9213-34ea5702488e] process_tensor: base_shard=Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16)
shard=Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16) tensor.size=1 tensor.shape=(1, 1) elapsed_time_ns=137632566
Broadcasting result: request_id='95c4486d-aa96-4db2-9213-34ea5702488e' result=[128256] is_finished=False
target partition index: 0
Computed target from: Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16) 0, Topology(Nodes: {01e581cf-46a1-4592-b254-14e0735baecb: Model:
Linux Box (Intel(R) Arc(TM) A770 Graphics). Chip: Intel(R) Arc(TM) A770 Graphics. Memory: 15473MB. Flops: fp32: 19.66 TFLOPS, fp16: 39.32 TFLOPS, int8: 78.64
TFLOPS}, Edges: {}). target shard: Shard(model_id='llama-3.2-1b', start_layer=0, end_layer=15, n_layers=16)
[ChatGPTAPI] Got token from queue: request_id='95c4486d-aa96-4db2-9213-34ea5702488e' tokens=[128256] is_finished=False
eos_token_id=128009 tokens[-1]=128256 finish_reason=None
[ChatGPTAPI] Waiting for token from queue: request_id='95c4486d-aa96-4db2-9213-34ea5702488e'

@deftdawg (Contributor, Author) commented

@joshuacoles: sorry to ping you out of the blue. I noticed you're working on #734, so I figure you'd have some insight into how EOS token parsing works.

Am I out to lunch thinking exo is just not seeing the EOS token, so it keeps going (GPU activity seems to be pinned, though the token count stops increasing)?

I'm thinking of rebasing onto your branch to see if your changes might help resolve this behaviour.

I'd be grateful if you could share any pointers on what to look at next... 🍻
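
For what it's worth, here's the stop condition as I understand it, as a minimal Python sketch (illustrative names only, not exo's actual loop):

# Illustrative only, not exo's actual loop: if sampling keeps producing a
# token that never equals eos_token_id, this check never fires and the GPU
# stays busy even though no new (valid) tokens reach the API.
def should_stop(tokens: list[int], eos_token_id: int, max_tokens: int) -> bool:
    return tokens[-1] == eos_token_id or len(tokens) >= max_tokens

print(should_stop([128256], eos_token_id=128009, max_tokens=4096))  # False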

Llama 3.2 1B - can't find eos_token? (same log as in the previous comment)

@joshuacoles

I am out and haven't had a chance to give this more than a glance (I'll look at it properly when I can), but IIRC EOS determination is done in two places:

  • In the process_inference_result method of the node with the final shard.
  • In the ChatGPTAPI class when determining when to send the stop chunk.

I think these can happen independently, so it is possible for inference to continue (i.e. GPU usage) past the point when the API has stopped serving new tokens, which looks like it might be what you're experiencing.

In one log I see an error at the start about not being able to find the EOS id (from tokenizers.py), and further down a line from ChatGPTAPI containing the eos_token_id, which would lend credence to this theory.

So the first port of call would be to see whether the EOS is determined differently in these two places and how each interacts with your model / inference engine.
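
To make that concrete, roughly the two checks as I understand them (an illustrative sketch; the real implementations differ):

# Illustrative only; the real code lives in exo's node and ChatGPTAPI classes.

# 1. Node side: the node holding the final shard marks inference finished.
def node_is_finished(token: int, eos_token_id: int) -> bool:
    return token == eos_token_id

# 2. API side: ChatGPTAPI decides when to emit the final stop chunk.
def api_finish_reason(tokens: list[int], eos_token_id: int):
    return 'stop' if tokens and tokens[-1] == eos_token_id else None

If the two resolve eos_token_id differently, they can disagree about when generation is done.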

As I said I'll have a better look over this later on, probably tomorrow.

@joshuacoles

Taking a closer look at this, I think something is going wrong during the inference process rather than in EOS token determination. The initial error in your logs,

loaded weights in 2930.88 ms, 2.47 GB loaded at 0.84 GB/s
Checking if local path exists to load tokenizer from local
local_path=PosixPath('/root/.cache/exo/downloads/--root--.cache--exo--downloads--unsloth--Llama-3.2-1B-Instruct')
Trying AutoProcessor for /root/.cache/exo/downloads/unsloth--Llama-3.2-1B-Instruct
Failed to load processor for /root/.cache/exo/downloads/unsloth--Llama-3.2-1B-Instruct. Error: 'bool' object has no attribute 'eos_token_id'
Traceback (most recent call last):
  File "/source/exo/exo/inference/tokenizers.py", line 46, in _resolve_tokenizer
    processor.eos_token_id = getattr(processor, 'tokenizer', getattr(processor, '_tokenizer', processor)).eos_token_id
AttributeError: 'bool' object has no attribute 'eos_token_id'

seems to occur because we initially load the tokenizer with AutoProcessor.from_pretrained(..., use_fast=False). If we run this in the REPL we see:

from transformers import AutoProcessor

AutoProcessor.from_pretrained('unsloth/Llama-3.2-1B-Instruct', use_fast=False)  # => False
AutoProcessor.from_pretrained('unsloth/Llama-3.2-1B-Instruct', use_fast=True)   # => Works!

This means the initial loading attempt in _resolve_tokenizer fails; however, we then fall back to AutoTokenizer.from_pretrained(repo_id_or_local_path, trust_remote_code=True), which loads correctly, so this isn't your issue.
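
Roughly, the load-with-fallback path looks like this (a simplified sketch of what tokenizers.py does, not the exact code):

from transformers import AutoProcessor, AutoTokenizer

def resolve_tokenizer(repo_id_or_local_path: str):
    # Simplified sketch; the real logic lives in exo/inference/tokenizers.py.
    try:
        processor = AutoProcessor.from_pretrained(repo_id_or_local_path, use_fast=False)
        # For this repo with use_fast=False the call returns a bare bool, so the
        # attribute lookup below raises the AttributeError seen in the log.
        processor.eos_token_id = getattr(processor, 'tokenizer', getattr(processor, '_tokenizer', processor)).eos_token_id
        return processor
    except Exception as e:
        print(f'Failed to load processor. Error: {e}')
    # Fallback, which loads correctly for unsloth/Llama-3.2-1B-Instruct:
    return AutoTokenizer.from_pretrained(repo_id_or_local_path, trust_remote_code=True)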

If we look at the logs from the ChatGPT API, we see that it has correctly determined the EOS token id; however, it is receiving an invalid token (128256, which is the vocab size of the model and hence outside the valid token range) from the inference process at each iteration.

[ChatGPTAPI] Got token from queue: request_id='95c4486d-aa96-4db2-9213-34ea5702488e' tokens=[128256] is_finished=False
eos_token_id=128009 tokens[-1]=128256 finish_reason=None
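
You can see the out-of-range value directly (a quick illustrative check; the vocab size also matches the result sizes in the log):

# Valid ids run from 0 to vocab_size - 1, so 128256 is one past the end.
VOCAB_SIZE = 128256  # llama-3.2-1b

def is_valid_token(token_id: int) -> bool:
    return 0 <= token_id < VOCAB_SIZE

print(is_valid_token(128256))  # False: the token the engine keeps emitting
print(is_valid_token(128009))  # True: the expected EOS id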

Looking at your changes I see you updated the tinygrad version, I presume to support the new hardware, so I would suggest focusing on stepping through the tinygrad inference code to see if you can spot where the issue originates. This is doubly so given your later error of ValueError: cannot broadcast (2048,) to new_shape=(1, 53, 3072), which strongly indicates that something has gone astray somewhere in the inference engine.

@deftdawg (Contributor, Author) commented

I would suggest focusing on stepping through the tinygrad inference code to see if you can spot where the issue originates.

Thank you so much for your guidance, I'll chase that path. 🍻

@deftdawg (Contributor, Author) commented

@joshuacoles - thanks for the assist, I was able to track down the issue... Essentially the renderer for Intel doesn't come on unless both GPU=1 and INTEL=1 are present in the env...

I'm not sure INTEL=1 is actually supposed to be required anymore... but I put in a PR to make it an OR condition that matches either INTEL=1 or a device_name containing Intel:
tinygrad/tinygrad#9524
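
The change is essentially this condition (a sketch of the idea; the actual patch in tinygrad/tinygrad#9524 differs in detail):

import os

def intel_renderer_enabled(device_name: str) -> bool:
    # Before: only the INTEL=1 env var (together with GPU=1) enabled the
    # Intel renderer path. After: INTEL=1 OR an Intel device name works.
    return os.getenv('INTEL', '0') == '1' or 'Intel' in device_name

print(intel_renderer_enabled('Intel(R) Arc(TM) A770 Graphics'))  # True, no INTEL=1 needed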

Not super fast at ~8 tok/s, but faster than CPU-only (CPU is <1 tok/s).

[Screenshot: Intel-Arc-hello-world]

@Pingasmaster

Very nice PR! I am not very familiar with this codebase, though: does this PR add support for the newest Battlemage GPUs? I don't know whether the tinygrad version you pinned supports them. I think we also might need to add other Intel Arc devices to the list of known devices, like the A750, A580, A380, and A310. And if the Arc B series is supported, we also need to add the B580 and B570. I will soon have a B580 to test things out; if needed, I have a couple of friends who already have some that I can borrow for short tests.
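
Something like this hypothetical table is what I have in mind (purely a sketch; only the A770 figures are grounded, taken from the log above):

# Hypothetical sketch; exo's actual device-capabilities table may be shaped differently.
ARC_TFLOPS = {
    'Intel(R) Arc(TM) A770 Graphics': {'fp32': 19.66, 'fp16': 39.32, 'int8': 78.64},
    # Alchemist: A750, A580, A380, A310 -- measured figures needed before adding
    # Battlemage: B580, B570 -- likewise, pending tests on real hardware
}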

@deftdawg mentioned this pull request on Jun 25, 2025
@AlexCheema force-pushed the main branch 2 times, most recently from a39f85b to 56f783b on October 21, 2025