New PC Crashing - 4090 with Ryzen 7

ubuysa

The BSOD Doctor
First things first, you've been having BSODs with a 0x133 DPC_WATCHDOG_VIOLATION bugcheck. All five of these dumps failed because an ISR or DPC ran for too long.

Without going into details of what an ISR or DPC is (I can go into detail if you want to know), the problem driver in all these dumps is nvlddmkm.sys - the Nvidia graphics driver. Here's the failure bucket data for one of your 0x133 dumps...
Code:
FAILURE_BUCKET_ID:  0x133_DPC_nvlddmkm!unknown_function
The 'unknown_function' just means the debugger couldn't resolve a function name inside nvlddmkm.sys, almost certainly because public symbols aren't available for the Nvidia driver. So the dump tells us the graphics driver was running when the watchdog fired, but not whether the driver itself is at fault (a driver problem) or whether it was stuck waiting on the card (a graphics card problem). In other words, this could be a driver or a graphics card problem.
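
If you have several dumps and want to compare their buckets quickly, here's a minimal Python sketch that splits a failure bucket ID like the one above into its parts. The helper name parse_failure_bucket and the field names are my own labels, not anything from the debugger, and the pattern is only inferred from this one example, so other bucket formats may not match.

```python
import re

# Pattern for bucket IDs shaped like "0x133_DPC_nvlddmkm!unknown_function".
# Assumption: this shape is inferred from the single example in this thread.
BUCKET_RE = re.compile(
    r"^(?P<bugcheck>0x[0-9A-Fa-f]+)_(?P<kind>[A-Za-z0-9]+)_"
    r"(?P<module>[^!]+)!(?P<function>.+)$"
)


def parse_failure_bucket(bucket_id: str) -> dict:
    """Split a failure bucket ID into its parts (hypothetical helper)."""
    match = BUCKET_RE.match(bucket_id.strip())
    if match is None:
        raise ValueError(f"unrecognised bucket format: {bucket_id!r}")
    return match.groupdict()


parts = parse_failure_bucket("0x133_DPC_nvlddmkm!unknown_function")
print(parts["module"], parts["function"])  # prints: nvlddmkm unknown_function
```

If every dump comes back with the same module, that's a quick confirmation that the same driver is on the stack each time.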

I've also looked at the live kernel dumps; these are dumps taken when Windows detects an error but is able to recover from it. Most of these dumps are 0x141 VIDEO_ENGINE_TIMEOUT_DETECTED, which are caused by a graphics hang that timed out and invoked the Timeout Detection and Recovery (TDR) feature of Windows. TDR resets the graphics driver and the graphics card (which crashes whatever app or game was causing the hang to the desktop) to allow the system to recover from the hang. These live kernel dumps are also graphics driver or graphics card related, and there are a lot of them (14 in total for the 19th Nov).

In your System log, on 19th Nov, there are a whole host of these errors...
Code:
Log Name:      System
Source:        nvlddmkm
Date:          19/11/2023 22:22:02
Event ID:      13
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      Gaming-PC
Description:
The description for Event ID 13 from source nvlddmkm cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

\Device\Video3
Graphics Exception on (GPC 5, PPC 0):  PEM_VSC_BETA_WK_P12_ERR

The message resource is present but the message was not found in the message table
I'm not able to analyse the graphics exception at the bottom there; I suspect this is Nvidia-specific debug data. These errors almost certainly mirror the live kernel dumps you've had - although there are many more of these error messages than live kernel dumps.
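
If you want to see whether the same exception keeps recurring, here's a rough Python sketch that counts the "Graphics Exception" lines in event text you've copied out of the System log. The function name count_graphics_exceptions is hypothetical, and the line layout is assumed from the single event quoted above, so treat it as a starting point only.

```python
import re
from collections import Counter

# Matches lines like:
#   Graphics Exception on (GPC 5, PPC 0):  PEM_VSC_BETA_WK_P12_ERR
# Assumption: the layout is taken from the one event quoted in this thread.
EXCEPTION_RE = re.compile(
    r"Graphics Exception on \(GPC (?P<gpc>\d+), PPC (?P<ppc>\d+)\):\s+(?P<error>\S+)"
)


def count_graphics_exceptions(log_text: str) -> Counter:
    """Count (gpc, ppc, error) triples found in exported event text."""
    counts = Counter()
    for match in EXCEPTION_RE.finditer(log_text):
        key = (int(match["gpc"]), int(match["ppc"]), match["error"])
        counts[key] += 1
    return counts


sample = "Graphics Exception on (GPC 5, PPC 0):  PEM_VSC_BETA_WK_P12_ERR\n"
for (gpc, ppc, error), n in count_graphics_exceptions(sample).items():
    print(f"GPC {gpc}, PPC {ppc}: {error} x{n}")
```

The counts won't tell you what the exception means, only whether it's always the same one.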

All the evidence then is pointing squarely at either the graphics card or the graphics driver. The way to tell which is to do the following...
  1. Download DDU.
  2. Download the latest driver for your card, and the two immediately prior versions, from the Nvidia website (and from nowhere else).
  3. Disconnect from the Internet (to stop Windows Update trying to install a graphics driver).
  4. Use DDU to uninstall the current driver (the system will reboot).
  5. Install the latest driver version.
  6. If it still fails then use DDU to uninstall that driver and install the immediately prior version.
  7. If that fails then use DDU to uninstall it and install the two-levels back driver version.
If all three most recent driver versions fail then the problem lies with the graphics card. In that case try popping the card out and reinserting it firmly. Also remove and reinsert the additional power cable - at both ends.
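
The logic of the steps above can be sketched in a few lines of Python. This is only an illustration of the decision, with hypothetical function names; you supply, newest driver first, whether each version still crashed.

```python
from typing import Optional, Sequence


def first_stable_driver(crashed: Sequence[bool]) -> Optional[int]:
    """Return the position of the first driver that did not crash, else None.

    crashed is ordered newest first: latest, one back, two back.
    """
    for index, still_crashes in enumerate(crashed):
        if not still_crashes:
            return index
    return None


def verdict(crashed: Sequence[bool]) -> str:
    """Turn the test results into the conclusion described in the steps."""
    stable = first_stable_driver(crashed)
    if stable is None:
        return "all versions crash: suspect the graphics card"
    return f"driver at position {stable} is stable: suspect the driver"


print(verdict([True, True, True]))
print(verdict([True, False]))
```

In practice you'd stop testing as soon as one version is stable, which is why the second example only has two results.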
 

itsmrgray

Member
This is a great analysis and, in a way, I’m glad that it’s pointing to the graphics card, as other checks have leant towards this too. I’ve already run DDU on the graphics driver, but I’ll do it again following your steps to see if that helps.

I’ve been in touch with PCS outside of this, and they’ve also recommended reseating the graphics card, so I’ll do that later this afternoon if DDU fails.

I will post back on success or failure!
 

itsmrgray

Member
Unfortunately, even after running DDU and reseating the graphics card, the PC still crashes to a black screen and eventually restarts.

I will do a clean install of Windows but I’m now more certain the GPU is faulty?
 

ubuysa

The BSOD Doctor
A clean install is rather pointless IMO, I'd be staggered if it made any difference. Some of the dumps failed because the ISR ran for too long and IMO that has to be the card.

I would ask PCS to RMA the card now you've effectively proven it's the card that's faulty.
 

itsmrgray

Member
It wouldn’t be the RAM, would it? I just find it weird how the RGB for both sticks either turns off or goes solid white (when the rest of the system remains unchanged). I have reseated these anyway.
 

SpyderTracks

We love you Ukraine
That's because the RGB is controlled via software (iCUE), so when the system crashes, the lighting falls back to the default RGB mode that isn't being driven by software.
 

itsmrgray

Member
Hi All -
Thanks for your help and guidance throughout this problem. It’s now Day 4 and I’m waiting on PCS to reply back to me with their next steps.
In the meantime, should I be considering RMA of the PC or RMA of just the GPU to swap out?
 

Martinr36

MOST VALUED CONTRIBUTOR
RMA the GPU. Details of how to do this are on your main account page, but basically they will send you a new GPU, and the DPD driver will take the old one and send it back to PCS.
 