Description
Hi, I am trying to run some TensorFlow training on an Intel Arc A770 again (I am almost certain this worked in the past). I am using n2v here as an example, but other applications failed in the same way (the kernel dies silently). I am running inside a Docker container, but I was also able to reproduce the error in a plain Python environment.
Here is my minimal reproducer of the error:

```python
import tensorflow as tf

print(f"TensorFlow version: {tf.__version__}")
print(f"Available devices: {tf.config.list_physical_devices()}")
print()

# This single line crashes with a bus error on the Intel Arc A770
print("Creating constant tensor on XPU device...")
with tf.device('/XPU:0'):
    a = tf.constant(1.0)
print(f"Success! Tensor value: {a}")
```

Run with:

```shell
cd /workspaces/n2v/n2v/examples/2D/denoising2D_SEM && chmod +x minimal_crash_reproducer.py && timeout 10 python minimal_crash_reproducer.py 2>&1 || echo "Exit code: $?"
```
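Since the process dies without any Python-level error, one generic way to get at least a traceback at the moment of the crash is the standard-library `faulthandler` module (a debugging sketch, nothing ITEX-specific):

```python
import faulthandler
import sys

# Dump a Python traceback if the process receives a fatal signal
# (SIGSEGV, SIGBUS, SIGFPE, ...) before the interpreter can report anything.
faulthandler.enable(file=sys.stderr, all_threads=True)

print("faulthandler armed:", faulthandler.is_enabled())
# ... the tf.device('/XPU:0') block from the reproducer would go here ...
```

This at least shows which Python line was executing when the bus error arrives, even if the fault itself happens inside native code.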
```
2025-11-07 23:49:25.274659: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-07 23:49:25.318626: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-11-07 23:49:25.318687: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-11-07 23:49:25.320203: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-11-07 23:49:25.327549: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-07 23:49:25.327775: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-07 23:49:26.089688: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2025-11-07 23:49:27.070977: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:49:27.071271: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /xla/service/gpu/compiled_programs_count. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:49:27.072538: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_executions. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:49:27.072560: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_execution_time_usecs. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:49:27.313421: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2025-11-07 23:49:27.314303: I external/local_xla/xla/pjrt/pjrt_api.cc:67] PJRT_Api is set for device type xpu
2025-11-07 23:49:27.314324: I external/local_xla/xla/pjrt/pjrt_api.cc:72] PJRT plugin for XPU has PJRT API version 0.33. The framework PJRT API version is 0.34.
2025-11-07 23:49:27.343488: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:134] Selected platform: Intel(R) oneAPI Unified Runtime over Level-Zero
2025-11-07 23:49:27.343572: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:159] number of sub-devices is zero, expose root device.
2025-11-07 23:49:27.346140: I external/xla/xla/service/service.cc:168] XLA service 0x3a458e80 initialized for platform SYCL (this does not guarantee that XLA will be used). Devices:
2025-11-07 23:49:27.346165: I external/xla/xla/service/service.cc:176] StreamExecutor device (0): Intel(R) Arc(TM) A770 Graphics, <undefined>
2025-11-07 23:49:27.346272: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) oneAPI Unified Runtime over Level-Zero
2025-11-07 23:49:27.346295: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
2025-11-07 23:49:27.347501: I external/intel_xla/xla/pjrt/se_xpu_pjrt_client.cc:97] Using BFC allocator.
2025-11-07 23:49:27.347532: I external/xla/xla/pjrt/gpu/gpu_helpers.cc:106] XLA backend allocating 14602718822 bytes on device 0 for BFCAllocator.
2025-11-07 23:49:27.361948: I external/local_xla/xla/pjrt/pjrt_c_api_client.cc:119] PjRtCApiClient created.
TensorFlow version: 2.15.1
Available devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU')]
Creating constant tensor on XPU device...
2025-11-07 23:49:28.179984: I tensorflow/core/common_runtime/next_pluggable_device/next_pluggable_device_factory.cc:118] Created 1 TensorFlow NextPluggableDevices. Physical device type: XPU
timeout: the monitored command dumped core
Bus error
Exit code: 135
```
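Exit code 135 follows the usual shell convention of 128 + signal number, i.e. signal 7, which is SIGBUS on Linux; this matches the "Bus error" line above and can be checked with the standard library:

```python
import signal

# Shell/timeout convention: an exit code of 128 + N means the child
# process was killed by signal N.
exit_code = 135
sig = signal.Signals(exit_code - 128)
print(sig.name)  # SIGBUS on Linux
```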
I would love it if someone could help me fix this error, or help me debug where and why the kernel dies. To make it easier to reproduce, here is my setup using Docker:
System setup

Ubuntu 24.04.3 LTS with Docker version 28.5.2 (build ecc6942), using https://ppa.launchpadcontent.net/kobuk-team/intel-graphics/ubuntu/ for the drivers.
```
compeng@dev05:~$ apt list --installed | grep ppa

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

apparmor/noble-updates,now 4.0.1really4.0.1-0ubuntu0.24.04.4 amd64 [installed,automatic]
intel-gsc/noble,now 0.9.5-0ubuntu1~24.04~ppa1 amd64 [installed]
intel-media-va-driver-non-free/noble,now 25.4.0-0ubuntu1~24.04~ppa1 amd64 [installed]
intel-metrics-discovery/noble,now 1.14.182-0ubuntu1~24.04~ppa1 amd64 [installed]
intel-ocloc/noble,now 25.35.35096.9-1~24.04~ppa3 amd64 [installed]
intel-opencl-icd/noble,now 25.35.35096.9-1~24.04~ppa3 amd64 [installed]
libapparmor1/noble-updates,now 4.0.1really4.0.1-0ubuntu0.24.04.4 amd64 [installed,automatic]
libigdgmm12/noble,now 22.8.2-0ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libigsc0/noble,now 0.9.5-0ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libmetee-dev/noble,now 4.3.0-0ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libmetee4/noble,now 4.3.0-0ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libmfx-gen1/noble,now 25.4.0-0ubuntu1~24.04~ppa1 amd64 [installed]
libva-drm2/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libva-glx2/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed]
libva-wayland2/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libva-x11-2/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libva2/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libvpl-tools/noble,now 1.4.0-0ubuntu1~24.04~ppa1 amd64 [installed]
libvpl2/noble,now 1:2.15.0-0ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libze-dev/noble,now 1.24.1-1~24.04~ppa1 amd64 [installed]
libze-intel-gpu1/noble,now 25.35.35096.9-1~24.04~ppa3 amd64 [installed]
libze1/noble,now 1.24.1-1~24.04~ppa1 amd64 [installed,automatic]
va-driver-all/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed,automatic]
vainfo/noble,now 2.22.0-0ubuntu1~24.04~ppa1 amd64 [installed]
```
The GPU shows up on the host:

```
clinfo | grep "Device Name"
WARNING: Small BAR detected for device 0000:26:00.0
  Device Name                                     Intel(R) Arc(TM) A770 Graphics
  Device Name                                     NVIDIA GeForce RTX 2080 Ti
    Device Name                                   Intel(R) Arc(TM) A770 Graphics
    Device Name                                   Intel(R) Arc(TM) A770 Graphics
    Device Name                                   Intel(R) Arc(TM) A770 Graphics
```
And it also shows up inside the container:

```
>>> import tensorflow as tf
2025-11-07 23:07:35.208131: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-07 23:07:35.256439: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-11-07 23:07:35.256513: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-11-07 23:07:35.258237: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-11-07 23:07:35.266459: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-07 23:07:35.266678: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-07 23:07:36.062417: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2025-11-07 23:07:37.216239: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:07:37.216474: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /xla/service/gpu/compiled_programs_count. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:07:37.217777: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_executions. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:07:37.217800: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_execution_time_usecs. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:07:37.453929: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2025-11-07 23:07:37.454522: I external/local_xla/xla/pjrt/pjrt_api.cc:67] PJRT_Api is set for device type xpu
2025-11-07 23:07:37.454541: I external/local_xla/xla/pjrt/pjrt_api.cc:72] PJRT plugin for XPU has PJRT API version 0.33. The framework PJRT API version is 0.34.
2025-11-07 23:07:37.481900: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:134] Selected platform: Intel(R) oneAPI Unified Runtime over Level-Zero
2025-11-07 23:07:37.481966: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:159] number of sub-devices is zero, expose root device.
2025-11-07 23:07:37.484427: I external/xla/xla/service/service.cc:168] XLA service 0x3ce84ae0 initialized for platform SYCL (this does not guarantee that XLA will be used). Devices:
2025-11-07 23:07:37.484449: I external/xla/xla/service/service.cc:176] StreamExecutor device (0): Intel(R) Arc(TM) A770 Graphics, <undefined>
2025-11-07 23:07:37.484524: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) oneAPI Unified Runtime over Level-Zero
2025-11-07 23:07:37.484547: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
2025-11-07 23:07:37.485729: I external/intel_xla/xla/pjrt/se_xpu_pjrt_client.cc:97] Using BFC allocator.
2025-11-07 23:07:37.485758: I external/xla/xla/pjrt/gpu/gpu_helpers.cc:106] XLA backend allocating 14602718822 bytes on device 0 for BFCAllocator.
2025-11-07 23:07:37.499916: I external/local_xla/xla/pjrt/pjrt_c_api_client.cc:119] PjRtCApiClient created.
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU')]
```
Dockerfile

```dockerfile
FROM intel/intel-extension-for-tensorflow:2.15.0.3-xpu-pip-base
RUN pip install jupyterlab n2v
RUN apt update && apt install -y git
RUN cd / && git clone https://github.com/juglab/n2v
```
Docker run command

```shell
docker run -it --rm --network=host --device=/dev/dri --volume=/dev/dri/by-path:/dev/dri/by-path --ipc=host
```
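For completeness, a full build-and-run invocation might look like the following; the image tag `n2v-xpu` is hypothetical (the image name was not part of the command above):

```shell
# Build the image from the Dockerfile above (tag name is hypothetical)
docker build -t n2v-xpu .

# Run it with the GPU device nodes passed through, as in the command above
docker run -it --rm --network=host --device=/dev/dri \
  --volume=/dev/dri/by-path:/dev/dri/by-path --ipc=host n2v-xpu
```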
Working with the notebook

I think this does not really matter, because I was able to create the very simple reproducer above, but I am adding it for completeness. Inside the Docker image I start `jupyter-lab --ip="*"` in `/` and run the notebook `/n2v/examples/2D/denoising2D_SEM/01_training.ipynb`. When I reach `model = N2V(config, model_name, basedir=basedir)`, the kernel just dies. You also have to fix one line: `imgs = datagen.load_imgs_from_directory(directory = "data/")` needs to be `imgs = datagen.load_imgs_from_directory(directory = "data/SEM")`. I have logs of this working, but at some point it stopped. Sadly, no log or error message is produced that would allow further investigation.