
Testing Stable Diffusion Inference Performance with Latest NVIDIA Driver including TensorRT ONNX

FurkanGozukara edited this page Oct 19, 2025 · 1 revision


🚀 UNLOCK INSANE SPEED BOOSTS with NVIDIA's Latest Driver Update or not? 🚀 Are you ready to turbocharge your AI performance? Watch me compare the brand-new NVIDIA 555 driver against the older 552 driver on an RTX 3090 TI for #StableDiffusion. Discover how TensorRT and ONNX models can skyrocket your speed! Don't miss out on these game-changing results!

1-Click fresh Automatic1111 SD Web UI Installer Script with TensorRT and more ⤵️

https://www.patreon.com/posts/86307255

00:00:00 Introduction to the NVIDIA newest driver update performance boost claims

00:00:25 What I am going to test and compare in this video

00:01:11 How to install latest version of Automatic1111 Web UI

00:01:40 The very best sampler of Automatic1111 for Stable Diffusion image generation / inference

00:01:57 Automatic1111 SD Web UI default installation versions

00:02:12 RTX 3090 TI image generation / inference speed for SDXL model with default Automatic1111 SD Web UI installation

00:02:22 How to see your NVIDIA driver version and many more info with nvitop library

00:02:40 Default installation speed for NVIDIA 551.23 driver

00:02:53 How to update Automatic1111 SD Web UI to the latest Torch and xFormers

00:03:05 Which CPU and RAM used to conduct these speed tests CPU-Z results

00:03:54 nvitop status while generating an image with Stable Diffusion XL (SDXL) on Automatic1111 Web UI

00:04:10 The new generation speed after updating Torch (2.3.0) and xFormers (0.0.26) to the latest version

00:04:20 How to install TensorRT extension on Automatic1111 SD Web UI

00:05:28 How to generate a TensorRT ONNX model for huge speed up during image generation / inference

00:06:39 How to enable SD Unet selection to be able to use TensorRT generated model

00:07:13 TensorRT pros and cons

00:07:38 TensorRT image generation / inference speed results

00:08:09 How to download and install the latest NVIDIA driver properly and cleanly on Windows

00:09:03 Repeating all the testing again on the newest NVIDIA driver (555.85)

00:10:06 Comparison of other optimizations such as SDP attention or Doggettx

00:10:35 Conclusion of the tutorial

NVIDIA's Latest Driver: Does It Really Deliver?

In this video, we dive deep into NVIDIA's newest driver update, comparing the performance of driver versions 552 and 555 on an RTX 3090 TI running Windows 10. We'll explore the claims of speed improvements, particularly with #ONNX runtime and TensorRT integration, using the popular Automatic1111 Web UI.

What You'll Learn:

Driver Comparison: Direct performance comparison between NVIDIA drivers 552 and 555.

Setup and Installation: Step-by-step guide on setting up a fresh #Automatic1111 Web UI installation, including the latest versions of Torch and xFormers.

ONNX and TensorRT Models: Detailed testing of default and TensorRT-generated models to measure speed differences.

Hardware Specifications: Insights into the hardware used for testing, including CPU and memory configurations.
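The video reads the driver version off nvitop and the `nvidia-smi` command. As a minimal sketch (not from the video), here is how the CSV output of `nvidia-smi --query-gpu=driver_version,name --format=csv,noheader` could be parsed; the sample string mirrors the values shown on screen, and the helper name is made up for illustration.

```python
# Hypothetical helper: parse one line of
#   nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
# into a (driver_version, gpu_name) tuple. The sample below mirrors
# the video's setup (driver 551.23, RTX 3090 Ti).

def parse_gpu_query(csv_line: str) -> tuple[str, str]:
    # Split only on the first comma so commas in the GPU name survive.
    driver_version, gpu_name = (field.strip() for field in csv_line.split(",", 1))
    return driver_version, gpu_name

sample = "551.23, NVIDIA GeForce RTX 3090 Ti"
driver, name = parse_gpu_query(sample)
print(driver, "-", name)  # → 551.23 - NVIDIA GeForce RTX 3090 Ti
```

Running the real query of course requires an NVIDIA GPU and driver; the parsing itself does not.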

Testing Procedure:

Initial Setup:

Fresh installation using a custom installer script which includes necessary models and styles.

Initial speed test with default settings and configurations.

Driver 552 Performance:

Speed testing on driver 552 with default models and configurations.

Detailed performance metrics and image generation speed analysis.

Upgrading to Latest Torch and xFormers:

Updating to the latest versions of Torch (2.3.0) and xFormers (0.0.26).

Performance testing after updates and comparison with initial setup.
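The update step above can be sanity-checked programmatically. A minimal sketch, assuming dotted-integer version strings (real Torch builds often carry local suffixes like `+cu121`, which this strips); the target numbers are the ones the video installs:

```python
# Minimal sketch: check whether installed versions meet the targets
# used in the video (Torch 2.3.0, xFormers 0.0.26). Assumes dotted
# integer version cores; not a full PEP 440 parser.

def version_tuple(version: str) -> tuple[int, ...]:
    core = version.split("+")[0]  # drop local suffixes such as "+cu121"
    return tuple(int(part) for part in core.split("."))

def meets_target(installed: str, target: str) -> bool:
    # Tuple comparison handles (2, 3, 0) >= (2, 1, 2) and similar.
    return version_tuple(installed) >= version_tuple(target)

print(meets_target("2.3.0+cu121", "2.3.0"))  # → True
print(meets_target("0.0.23", "0.0.26"))      # → False
```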

TensorRT Installation and Testing:

Installing TensorRT extension and generating TensorRT models.

Overcoming common installation errors and optimizations.

Speed testing with TensorRT models and analysis of performance improvements.
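For reference, the dynamic-shape profile chosen in the video, written out as a plain dict. The key names here are made up for clarity; they are not the TensorRT extension's actual field names:

```python
# Illustrative only: the export settings used in the video.
# min/opt/max batch sizes of 1/1/4 and an optimal prompt length of
# 150 tokens with a maximum of 225. A wider max range makes the
# engine more flexible but can cost some peak speed versus a fully
# static profile.

export_profile = {
    "batch_size": {"min": 1, "opt": 1, "max": 4},
    "prompt_tokens": {"opt": 150, "max": 225},
}

# Sanity check: min <= opt <= max must hold for a valid profile.
bs = export_profile["batch_size"]
assert bs["min"] <= bs["opt"] <= bs["max"]
print(export_profile)
```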

Upgrading to Driver 555:

Step-by-step guide on downloading and installing NVIDIA driver 555.

Performance comparison between driver 552 and 555.

Analyzing the impact on speed and efficiency.

Results and Conclusions:

Performance Metrics: Detailed analysis of speed improvements (or lack thereof) with the newest NVIDIA driver.

TensorRT Benefits: How TensorRT models significantly boost performance.

Driver Update Impact: Understanding the real-world impact of updating to the latest NVIDIA driver.
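The percentages quoted in the video come from a simple relative-change calculation over the measured it/s numbers (3.41 it/s baseline, 5.51 it/s with TensorRT, 3.27 it/s after the driver update). A quick sketch:

```python
# Relative speed change from the it/s figures measured in the video.

def speedup_percent(baseline_its: float, new_its: float) -> float:
    """Relative speed change in percent: (new - old) / old * 100."""
    return (new_its - baseline_its) / baseline_its * 100.0

# TensorRT engine vs the default pipeline on the older driver:
print(round(speedup_percent(3.41, 5.51), 1))  # → 61.6 (the video rounds to 61.5%)

# Default pipeline before vs after updating to driver 555.85:
print(round(speedup_percent(3.41, 3.27), 1))  # → -4.1 (a small regression)
```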

Video Transcription

  • 00:00:00 NVIDIA claims that the newest driver brings huge speed improvements when you are using AI.

  • 00:00:06 It is claimed that the newest driver brings huge  performance with ONNX runtime. Automatic1111 Web  

  • 00:00:13 UI supports ONNX models with TensorRT. So today I  am going to compare this newest driver with basic  

  • 00:00:21 installation and also TensorRT ONNX models. I am  going to do testing on the RTX 3090 TI on Windows  

  • 00:00:30 10. I am going to compare NVIDIA drivers 552 vs  555, which is the latest driver. All tests are  

  • 00:00:38 compared on both drivers. I am going to do testing  on fresh Automatic1111 Web UI installation. I  

  • 00:00:45 am going to test the speed with the latest torch  version and the xFormers version. Moreover, I will  

  • 00:00:52 install TensorRT and test and repeat the testing  on TensorRT generated model, and we will see the  

  • 00:01:00 speed differences between older driver, newer  driver, between default and TensorRT model.  

  • 00:01:07 Make sure to watch the entire tutorial because  it is super important. For fresh installation,  

  • 00:01:12 I will use my installer script. So let's use this  folder. Extract here. Just let's install. This  

  • 00:01:19 installer will install everything automatically  for us, including downloading the VAE fixed SDXL  

  • 00:01:26 base model and the very best styles. So the  installation has been completed and the Web  

  • 00:01:32 UI started. Let's see the default downloaded  models. Let's try the default speed. Okay,  

  • 00:01:38 photo of an amazing sports car. The very best  sampler that I am finding is UniPC. Let's make it  

  • 00:01:44 40 steps. Change the resolution to the default  resolution. So initially I will do a warm-up  

  • 00:01:51 generation. Then I will generate four images to  see the speed. This is the default installation.  

  • 00:01:57 You see version 1.9.3, Python 3.10.11,  Torch version is 2.1.2, xFormers is 0.0.23  

  • 00:02:06 and the image is generated. To see the speed, I  will pause the video and generate images. Okay,  

  • 00:02:12 four images are generated. The IT per second is  3.42. So what is my GPU and my driver right now?  

  • 00:02:21 To show you that, I will use nvitop. You can use  this with pip install nvitop. My driver version is  

  • 00:02:30 551.23, CUDA version is 12.4. The GPU model  is not shown in the nvitop. So this is the GPU  

  • 00:02:38 version that I have. The nvidia-smi command shows a 3090 TI. So with default fresh installation, the speed is  

  • 00:02:44 3.42 for this driver. Now I am going to update  my installation to the latest Torch version and  

  • 00:02:52 xFormers. To do that, I will use this .bat file.  It will update my installation to the latest.  

  • 00:03:00 For the speed comparison, the CPU also matters. This is my CPU, the 13900K. This is the frequency it is  

  • 00:03:07 running at right now. Also, my memory is 64 gigabytes of 2500 megahertz DDR4. The version updater will  

  • 00:03:17 show you this error, but it is not important because it will be fixed automatically. Yes, it's fixed  

  • 00:03:22 and the installation completed. So let's start  again our web UI with this .bat file. You can also  

  • 00:03:30 start it from the web UI user .bat file here.  Okay, the latest version started. You will also  

  • 00:03:35 notice that this updater installed the latest version of one of the very best extensions, After Detailer (ADetailer).  

  • 00:03:42 So now the Torch version is 2.3.0 and xFormers  is 0.0.26. So let's generate the warmup image  

  • 00:03:50 with the same settings. Let's set them up and  let's generate. Let's see the nvitop status while  

  • 00:03:56 generating. You see, these are the parameters nvitop displays: the watt usage, the GPU utilization,  

  • 00:04:02 the memory utilization and the warmup image is  generated. Now let's try the actual speed. The  

  • 00:04:09 generation speed looks unchanged. You see the same speed. Probably for this NVIDIA driver  

  • 00:04:16 to be effective, we need TensorRT. So now I  am going to install TensorRT extension and  

  • 00:04:22 generate TensorRT model to calculate its speed.  And after this, I will update my NVIDIA driver  

  • 00:04:30 to the latest version, and I will repeat the  test. So we will see the speed difference. Okay,  

  • 00:04:36 the TensorRT installed. Then let's restart the web  UI. In the initial run of TensorRT installation,  

  • 00:04:44 it may take a while. Just patiently wait.  Okay, during the initial installation,  

  • 00:04:48 you will also get this error. Unfortunately,  TensorRT developers still didn't fix this error,  

  • 00:04:54 but we will fix it. So just click OK to  these errors. Okay, okay, okay, okay.

  • 00:05:00 Then it will start and the web UI started with  TensorRT. Now I will show you how to fix those  

  • 00:05:05 annoying pop-ups. So close the web UI, then run  the TensorRT installer one more time like this.  

  • 00:05:13 And it is done. Then start the web UI again. So  after doing that, you will not see that error  

  • 00:05:19 anymore while starting your web UI. You see it  is getting started. No errors and started. For  

  • 00:05:26 TensorRT to work, we need to generate a TensorRT engine. Make sure that you have selected the model  

  • 00:05:30 for which you want to generate TensorRT. I will use this batch size one static preset. Actually,  

  • 00:05:37 I am going to change this. So the min batch  size one, optimal batch size one, let's say  

  • 00:05:42 maximum batch size four. It depends on you. You  can set the minimum height, whichever you want  

  • 00:05:47 like this. Okay, let's also set this like this and  prompt. This is important. Let's make the optimal  

  • 00:05:54 prompt like 150 and let's make the maximum 225.  Okay, so then export engine. To see the status,  

  • 00:06:03 you should check out the CMD window. It may take  a while depending on your GPU model. In the CMD  

  • 00:06:10 window, it doesn't show anything. However, in  the nvitop, I can see that it is using GPU.  

  • 00:06:16 GPU utilization increases or decreases. So it is  working. You can also see the memory usage of the  

  • 00:06:21 GPU. After a while, you will start seeing messages  like this. It is working and progressing. Okay,  

  • 00:06:27 the model generation has been completed. You can  see that TensorRT engines have been saved to the  

  • 00:06:33 disk. It took around six minutes, a little  bit longer perhaps. And we can see exported  

  • 00:06:39 successfully. To be able to use it, we need to  go to the settings. And in here search for quick,  

  • 00:06:45 you will see this part and type here Unet.  And you will see SD Unet apply settings.  

  • 00:06:53 Don't worry, you will also get this message  and reload UI. Now we have SD Unet and in here  

  • 00:06:59 we can select none, which will use the default  Unet of the model or we can use the Unet of the  

  • 00:07:06 Realistic Vision which we are going to use.  This is the TensorRT model. And let's use  

  • 00:07:11 the same prompt. This doesn't change quality, but  this increases the speed significantly. However,  

  • 00:07:18 this takes time to compile, but compiling is only  one time and it is ready. And let's generate the  

  • 00:07:24 initial warm-up. We can see that it is loading the  TensorRT Unet and the generation started. So this  

  • 00:07:30 is the initial warm-up generation while recording  video. Okay, now it is time to test. So I'm going  

  • 00:07:36 to make batch count four. So the test has been  completed. We can see that now we are getting  

  • 00:07:41 5.51 it per second. This is almost the speed of a 4090. The speed increase is (5.51 minus 3.41)  

  • 00:07:53 over 3.41. We got a 61.5% speed increase without any quality change, loss, or difference.  

  • 00:08:04 Now I am going to install the NVIDIA driver to see  the difference. I am going to install the latest  

  • 00:08:10 version of NVIDIA driver. So how are we going to  download it? Download NVIDIA drivers, because I  

  • 00:08:16 see that some people are having issues. So go to  here, select your GPU model. It is this one. And  

  • 00:08:23 yes, I am using the Game Ready Driver. Let's search; I think it works better. Download, and the  

  • 00:08:28 download should start. If it doesn't start, we need to click here. Download, and  

  • 00:08:33 the download started. So from driver version 551, I  am going to 555. Okay, it is downloaded. So all we  

  • 00:08:40 need to do is run as administrator. Click Yes,  when it is asking and click OK. I am going to  

  • 00:08:46 show you the selection that I am making NVIDIA  Graphics Driver, agree and continue. Custom, advanced  

  • 00:08:53 and I am going to perform a clean installation.  Now I will just do next. But before doing that,  

  • 00:08:58 I am going to turn off the video recording and I  will restart the computer and we will return back.  

  • 00:09:03 Okay, so I have installed the latest driver and restarted the system. You can see the newest  

  • 00:09:09 driver version (555.85) and the CUDA version here. And let's make a warm-up test. And let's see the results. So  

  • 00:09:16 after the warm-up, I have generated four images  and we can see that we have a significant speed  

  • 00:09:22 drop from 3.41 it per second to 3.27 it per  second. Let's see the TensorRT speed because  

  • 00:09:31 this is what I wonder. So let's just change the  batch size to one and do the initial warm-up.  

  • 00:09:37 You should watch the messages here. When doing the TensorRT optimization, xFormers gets  

  • 00:09:43 disabled, so TensorRT does the work. And yes, we can see the speed right now. So here are the speeds 

  • 00:09:50 after generating four images. We got a speed  drop after the newest drivers. NVIDIA is  

  • 00:09:56 disappointing us. I don't know why. Before ending  this video, I will also test the optimization of  

  • 00:10:04 the PyTorch itself. So you see there are different  optimizations here. Let's try the SDP. This is  

  • 00:10:12 the optimization of the PyTorch as far as I  know. Let's see the speed of generation. OK,  

  • 00:10:18 this one is even slower than xFormers. I'm  going to test these other ones as well. Let's  

  • 00:10:24 see which one will perform best. This other one  was even slower. Let's see the Doggettx. This is  

  • 00:10:31 the latest one. So this is it. It is not always  best to upgrade NVIDIA drivers. Unfortunately,  

  • 00:10:38 currently, this new driver didn't bring me any speed increase for some reason. If you get a speed  

  • 00:10:44 increase, let me know. But the TensorRT is working  amazingly. So you should use it if you are going  

  • 00:10:51 to generate a lot of images on the same model.  Hopefully, see you in another amazing tutorial.
