Silent Kernel Crash - Bus error #91

@alexschroeter

Description

Hi, I am trying to run some TensorFlow training on an Intel Arc A770 again (I am almost certain this worked in the past). I am using n2v as an example here, but other applications failed in the same way (the kernel dies silently). I am running inside a Docker container, but I also managed to reproduce the error in a plain Python environment.

Here is my minimal reproducer of the error:

import tensorflow as tf

print(f"TensorFlow version: {tf.__version__}")
print(f"Available devices: {tf.config.list_physical_devices()}")
print()

# This single line crashes with Bus Error on Intel Arc A770
print("Creating constant tensor on XPU device...")
with tf.device('/XPU:0'):
    a = tf.constant(1.0)
    print(f"Success! Tensor value: {a}")
Running it:

cd /workspaces/n2v/n2v/examples/2D/denoising2D_SEM && chmod +x minimal_crash_reproducer.py && timeout 10 python minimal_crash_reproducer.py 2>&1 || echo "Exit code: $?"
2025-11-07 23:49:25.274659: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-07 23:49:25.318626: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-11-07 23:49:25.318687: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-11-07 23:49:25.320203: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-11-07 23:49:25.327549: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-07 23:49:25.327775: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-07 23:49:26.089688: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2025-11-07 23:49:27.070977: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:49:27.071271: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /xla/service/gpu/compiled_programs_count. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:49:27.072538: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_executions. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:49:27.072560: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_execution_time_usecs. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:49:27.313421: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2025-11-07 23:49:27.314303: I external/local_xla/xla/pjrt/pjrt_api.cc:67] PJRT_Api is set for device type xpu
2025-11-07 23:49:27.314324: I external/local_xla/xla/pjrt/pjrt_api.cc:72] PJRT plugin for XPU has PJRT API version 0.33. The framework PJRT API version is 0.34.
2025-11-07 23:49:27.343488: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:134] Selected platform: Intel(R) oneAPI Unified Runtime over Level-Zero
2025-11-07 23:49:27.343572: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:159] number of sub-devices is zero, expose root device.
2025-11-07 23:49:27.346140: I external/xla/xla/service/service.cc:168] XLA service 0x3a458e80 initialized for platform SYCL (this does not guarantee that XLA will be used). Devices:
2025-11-07 23:49:27.346165: I external/xla/xla/service/service.cc:176]   StreamExecutor device (0): Intel(R) Arc(TM) A770 Graphics, <undefined>
2025-11-07 23:49:27.346272: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) oneAPI Unified Runtime over Level-Zero
2025-11-07 23:49:27.346295: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
2025-11-07 23:49:27.347501: I external/intel_xla/xla/pjrt/se_xpu_pjrt_client.cc:97] Using BFC allocator.
2025-11-07 23:49:27.347532: I external/xla/xla/pjrt/gpu/gpu_helpers.cc:106] XLA backend allocating 14602718822 bytes on device 0 for BFCAllocator.
2025-11-07 23:49:27.361948: I external/local_xla/xla/pjrt/pjrt_c_api_client.cc:119] PjRtCApiClient created.
TensorFlow version: 2.15.1
Available devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU')]

Creating constant tensor on XPU device...
2025-11-07 23:49:28.179984: I tensorflow/core/common_runtime/next_pluggable_device/next_pluggable_device_factory.cc:118] Created 1 TensorFlow NextPluggableDevices. Physical device type: XPU
timeout: the monitored command dumped core
Bus error
Exit code: 135

I would appreciate any help fixing this error, or any pointers on how to debug where and why the kernel dies.
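In the meantime, one low-effort way to get more information out of a silent Bus error is Python's built-in faulthandler module, which installs signal handlers for SIGSEGV/SIGBUS/SIGABRT and dumps the Python-level traceback before the process dies. This is generic CPython, not specific to TensorFlow or ITEX, so it may only show which Python line triggered the fatal signal, but that is more than nothing:

```python
import faulthandler
import sys

# Install handlers for SIGSEGV, SIGFPE, SIGABRT and SIGBUS so the
# interpreter prints the Python traceback of every thread before
# the process is killed by the signal.
faulthandler.enable(file=sys.stderr, all_threads=True)

print(faulthandler.is_enabled())
```

Equivalently, the reproducer can be run unchanged with `python -X faulthandler minimal_crash_reproducer.py`; even if the actual fault is inside the SYCL runtime, the dump at least shows the Python frame that was active.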

To make it easier for you to reproduce, here is my setup using Docker:

system setup

Ubuntu 24.04.3 LTS with Docker version 28.5.2 (build ecc6942), using https://ppa.launchpadcontent.net/kobuk-team/intel-graphics/ubuntu/ for the graphics drivers.

compeng@dev05:~$ apt list --installed | grep ppa

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

apparmor/noble-updates,now 4.0.1really4.0.1-0ubuntu0.24.04.4 amd64 [installed,automatic]
intel-gsc/noble,now 0.9.5-0ubuntu1~24.04~ppa1 amd64 [installed]
intel-media-va-driver-non-free/noble,now 25.4.0-0ubuntu1~24.04~ppa1 amd64 [installed]
intel-metrics-discovery/noble,now 1.14.182-0ubuntu1~24.04~ppa1 amd64 [installed]
intel-ocloc/noble,now 25.35.35096.9-1~24.04~ppa3 amd64 [installed]
intel-opencl-icd/noble,now 25.35.35096.9-1~24.04~ppa3 amd64 [installed]
libapparmor1/noble-updates,now 4.0.1really4.0.1-0ubuntu0.24.04.4 amd64 [installed,automatic]
libigdgmm12/noble,now 22.8.2-0ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libigsc0/noble,now 0.9.5-0ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libmetee-dev/noble,now 4.3.0-0ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libmetee4/noble,now 4.3.0-0ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libmfx-gen1/noble,now 25.4.0-0ubuntu1~24.04~ppa1 amd64 [installed]
libva-drm2/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libva-glx2/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed]
libva-wayland2/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libva-x11-2/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libva2/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libvpl-tools/noble,now 1.4.0-0ubuntu1~24.04~ppa1 amd64 [installed]
libvpl2/noble,now 1:2.15.0-0ubuntu1~24.04~ppa1 amd64 [installed,automatic]
libze-dev/noble,now 1.24.1-1~24.04~ppa1 amd64 [installed]
libze-intel-gpu1/noble,now 25.35.35096.9-1~24.04~ppa3 amd64 [installed]
libze1/noble,now 1.24.1-1~24.04~ppa1 amd64 [installed,automatic]
va-driver-all/noble,now 2.22.0-1ubuntu1~24.04~ppa1 amd64 [installed,automatic]
vainfo/noble,now 2.22.0-0ubuntu1~24.04~ppa1 amd64 [installed]

GPU shows up on the host

clinfo | grep "Device Name"
WARNING: Small BAR detected for device 0000:26:00.0
  Device Name                                     Intel(R) Arc(TM) A770 Graphics
  Device Name                                     NVIDIA GeForce RTX 2080 Ti
    Device Name                                   Intel(R) Arc(TM) A770 Graphics
    Device Name                                   Intel(R) Arc(TM) A770 Graphics
    Device Name                                   Intel(R) Arc(TM) A770 Graphics

and the GPU is also visible inside the container:

>>> import tensorflow as tf
2025-11-07 23:07:35.208131: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-07 23:07:35.256439: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-11-07 23:07:35.256513: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-11-07 23:07:35.258237: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-11-07 23:07:35.266459: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2025-11-07 23:07:35.266678: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-07 23:07:36.062417: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2025-11-07 23:07:37.216239: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:07:37.216474: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /xla/service/gpu/compiled_programs_count. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:07:37.217777: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_executions. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:07:37.217800: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_execution_time_usecs. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2025-11-07 23:07:37.453929: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2025-11-07 23:07:37.454522: I external/local_xla/xla/pjrt/pjrt_api.cc:67] PJRT_Api is set for device type xpu
2025-11-07 23:07:37.454541: I external/local_xla/xla/pjrt/pjrt_api.cc:72] PJRT plugin for XPU has PJRT API version 0.33. The framework PJRT API version is 0.34.
2025-11-07 23:07:37.481900: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:134] Selected platform: Intel(R) oneAPI Unified Runtime over Level-Zero
2025-11-07 23:07:37.481966: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:159] number of sub-devices is zero, expose root device.
2025-11-07 23:07:37.484427: I external/xla/xla/service/service.cc:168] XLA service 0x3ce84ae0 initialized for platform SYCL (this does not guarantee that XLA will be used). Devices:
2025-11-07 23:07:37.484449: I external/xla/xla/service/service.cc:176]   StreamExecutor device (0): Intel(R) Arc(TM) A770 Graphics, <undefined>
2025-11-07 23:07:37.484524: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) oneAPI Unified Runtime over Level-Zero
2025-11-07 23:07:37.484547: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
2025-11-07 23:07:37.485729: I external/intel_xla/xla/pjrt/se_xpu_pjrt_client.cc:97] Using BFC allocator.
2025-11-07 23:07:37.485758: I external/xla/xla/pjrt/gpu/gpu_helpers.cc:106] XLA backend allocating 14602718822 bytes on device 0 for BFCAllocator.
2025-11-07 23:07:37.499916: I external/local_xla/xla/pjrt/pjrt_c_api_client.cc:119] PjRtCApiClient created.
>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU')]
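Since clinfo enumerates the A770 under several platforms and an NVIDIA card is present as well, it might be worth pinning the process to a single Level-Zero device before TensorFlow is imported. ONEAPI_DEVICE_SELECTOR and ZE_AFFINITY_MASK are standard oneAPI/Level-Zero environment variables; whether they change anything about this crash is just a guess on my part:

```python
import os

# The SYCL runtime reads these variables at initialization, so they
# must be set before tensorflow / ITEX is imported for the first time.
os.environ["ONEAPI_DEVICE_SELECTOR"] = "level_zero:0"  # expose only the first Level-Zero device
os.environ["ZE_AFFINITY_MASK"] = "0"                   # pin Level-Zero to root device 0

# import tensorflow as tf  # <- import only after the variables are set
print(os.environ["ONEAPI_DEVICE_SELECTOR"], os.environ["ZE_AFFINITY_MASK"])
```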

Dockerfile

FROM intel/intel-extension-for-tensorflow:2.15.0.3-xpu-pip-base 
RUN pip install jupyterlab n2v
RUN apt update && apt install -y git
RUN cd / && git clone https://github.com/juglab/n2v
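The startup log above also reports a PJRT API version mismatch (plugin 0.33 vs framework 0.34), so comparing the installed tensorflow and intel-extension-for-tensorflow wheel versions inside the container is a cheap first check. A generic sketch using only the standard library:

```python
import importlib.metadata as md

def wheel_version(pkg: str) -> str:
    """Return the installed version of a wheel, or 'not installed'."""
    try:
        return md.version(pkg)
    except md.PackageNotFoundError:
        return "not installed"

# The PJRT log lines report plugin API 0.33 vs framework API 0.34,
# which could indicate that the TF wheel and the ITEX plugin are out of step.
for pkg in ("tensorflow", "intel-extension-for-tensorflow"):
    print(pkg, wheel_version(pkg))
```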

Docker run command
docker run -it --rm --network=host --device=/dev/dri --volume=/dev/dri/by-path:/dev/dri/by-path --ipc=host

Working with the notebook
I think this part doesn't matter, since the simple reproducer above already triggers the crash, but I am adding it for completeness.

Inside the Docker image I start `jupyter-lab --ip="*"` in / and run the notebook /n2v/examples/2D/denoising2D_SEM/01_training.ipynb. When I reach `model = N2V(config, model_name, basedir=basedir)`, the kernel just dies. Note that one line in the notebook needs fixing first: `imgs = datagen.load_imgs_from_directory(directory = "data/")` must become `imgs = datagen.load_imgs_from_directory(directory = "data/SEM")`. I have logs of this working in the past, but at some point it stopped. Sadly, no log or error message is produced that would allow further investigation.
