BSOD

ubuysa

The BSOD Doctor
While I was on the phone to them I got an email saying it had already been dispatched back to me! Oh well.

When it arrives I'll do a full test for a good few days and see how it goes.
If you can, don't do a thing to it except install the stress testing software. That way they can't claim it's a software issue again.
 

DanCarter194

Bronze Level Poster
PC has been returned. I have installed Heaven and it is now running. Didn't install any updates so that I can test the system as they have sent it to me. Details here: https://docs.google.com/document/d/1fRd1G0Z3pe_gXhNhsIx0VC_oZrHg2hHWn_mH6OvX9U4/edit?usp=sharing

However, while trying to download Heaven I got a BSOD. Seriously starting to get fed up of this now.

Event logs attached below, memory dump is taking a while to upload so I'll share a link when it has finished. Any idea what caused it?

 

ubuysa

The BSOD Doctor
Either RMA it again with some stern (but polite) comments or seek a refund. I'll look at the dumps but it's clear this isn't a software issue because you're running exactly the system that PCS installed and tested it with.

This just isn't good enough IMO. Be nice but be firm.
 

DanCarter194

Bronze Level Poster
As far as I can tell, here are the differences between how PCS tested the system and how I tested the system:
  • The monitor that was plugged into the graphics card.
  • The plug socket that the machine was plugged into.
  • The keyboard/mouse connected via USB.
  • I used Heaven to test the system, not sure what they used.
  • I connected to the internet using my home WiFi network, not sure what they used.
Could any of these be the problem? The reason I ask is that if I get a refund I'll be buying a PC from somewhere else, and if it does turn out to be the monitor then I'll have the same problem again. I realise I'm clutching at straws here - I can't say I believe it IS any of these causing the issue ...
 

ubuysa

The BSOD Doctor
As far as I can tell, here are the differences between how PCS tested the system and how I tested the system:
  • The monitor that was plugged into the graphics card.
It's very unlikely but not completely impossible, though I've never heard of a monitor causing a BSOD. Could you possibly borrow a friend's monitor? Or do you have an HDMI TV you could try?
  • The plug socket that the machine was plugged into.
Again, very unlikely. I've never heard of bad power causing a BSOD but it's not impossible. Try it at a friend's house if you can to be sure (and borrow their monitor at the same time).
  • The keyboard/mouse connected via USB.
This is a very remote possibility, slightly less remote if the keyboard/mouse have lots of customisable functions and require specific drivers. What keyboard and mouse are they? Try borrowing a basic keyboard and mouse to be sure.
  • I used Heaven to test the system, not sure what they used.
It doesn't matter what they used, it shouldn't BSOD running Heaven. If you RMA it again tell them to run Heaven.
  • I connected to the internet using my home WiFi network, not sure what they used.
WiFi can cause a BSOD, I've seen a few, but these are related to bad network drivers and PCS should have installed the correct drivers for all the kit in your rig - including the WiFi. Get an Ethernet cable and temporarily connect via Ethernet and eliminate WiFi to be sure.
Could any of these be the problem? The reason I ask is that if I get a refund I'll be buying a PC from somewhere else, and if it does turn out to be the monitor then I'll have the same problem again. I realise I'm clutching at straws here - I can't say I believe it IS any of these causing the issue ...
I don't think it is any of the above but your thinking is good. If it fails so easily at your home but not at PCS then something must be different. Try my suggestions to eliminate each of these areas by borrowing kit etc...
 

ubuysa

The BSOD Doctor
The kernel dump has a stop code of DPC_WATCHDOG_VIOLATION, which means that a deferred procedure call (the way that drivers implement the back-end processing after an interrupt) or interrupt service routine (these handle the front end as the interrupt occurs) has been running for too long.

The stack trace for the active thread (part of the msedge.exe process) shows that a page fault occurred as soon as the kernel was entered. The function calls that follow also appear to be related to errors (nt!MmAccessFault+0x189, nt!MiDispatchFault+0x3d5, nt!MiResolveProtoPteFault+0x66a) and these are followed by what look like attempts to clean up any pending interrupts. It's during this interrupt cleanup that the watchdog timer pops. That suggests to me that there has been a hardware problem with whatever device was involved in the interrupt cleanup.

The stack trace doesn't look to be related to a driver issue and there's no indication elsewhere in the dump of a driver error. The list of driver calls for the active thread suggests that a disk I/O operation was in progress, which makes me wonder whether your M.2 SSD might be flaky somehow? It wouldn't be the first time a flaky SSD has caused a BSOD. I'd suggest you display the SMART data for that drive to see whether it's been having problems - CrystalDiskInfo can do that for you.

I would be sure to keep that kernel dump and make it available to PCS.
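To make the "running for too long" idea concrete, here is a toy Python model of a watchdog. It is only a sketch under stated assumptions: the class and function names are hypothetical, the tick counts are made up, and it says nothing about how Windows actually implements its DPC watchdog.

```python
class WatchdogViolation(Exception):
    """Raised when a routine overruns its time budget."""


class Watchdog:
    """Counts simulated timer ticks against a fixed budget (hypothetical model)."""

    def __init__(self, budget_ticks):
        self.budget_ticks = budget_ticks
        self.elapsed = 0

    def tick(self):
        # Each call stands in for one timer tick while the routine is still running.
        self.elapsed += 1
        if self.elapsed > self.budget_ticks:
            raise WatchdogViolation(
                f"routine still running after {self.elapsed} ticks "
                f"(budget {self.budget_ticks})"
            )


def run_routine(work_ticks, budget_ticks):
    """Run a pretend deferred routine that needs `work_ticks` ticks to finish."""
    watchdog = Watchdog(budget_ticks)
    try:
        for _ in range(work_ticks):
            watchdog.tick()
    except WatchdogViolation:
        return "watchdog fired"
    return "completed"


# A routine that finishes inside its budget, and one that never gets there.
healthy = run_routine(work_ticks=3, budget_ticks=10)
stuck = run_routine(work_ticks=50, budget_ticks=10)
```

In this model a device that never responds looks the same as a routine with a huge `work_ticks`: the budget runs out before the work does.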
 

DanCarter194

Bronze Level Poster
Thanks so much for looking into it. The BSOD happened as I was in the middle of downloading Heaven, so an I/O operation definitely sounds about right as it will have been saving to disk.

I downloaded CrystalDiskInfo and here's the output: https://drive.google.com/file/d/1LYf_1z2klCVDULTDX9xrG_h9WVfMZlGf/view?usp=sharing I'll be honest I don't know what I'm looking for - does that suggest any problems with the SSD?

Good suggestion to keep hold of the kernel dump - will do.

I spoke to a friend earlier this evening who has some spare parts as described above so I'm going to do another test where I vary all the things mentioned and see what I get. Hopefully that will help narrow down the problem. Ideally I'll try to keep it running for a week before considering it "fixed"!
 

SpyderTracks

We love you Ukraine
Heaven is a really handy tool because it ONLY stresses the GPU, whereas most other benchmarks stress both CPU and GPU.

FurMark is a very handy tool for the same reason: it's a torture test for the GPU.

If either of those is crashing, it suggests some kind of hardware issue, either RAM or GPU, in my uneducated opinion.

The other tools are handy for judging performance but make diagnosis more difficult, as they are full system tests.
 

ubuysa

The BSOD Doctor
Sorry, I forgot to reply on your SMART data. It looks fine, the key number is the Media & Data Integrity Errors count, which is zero.
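As a rough illustration of what to look for in NVMe SMART output, the sketch below checks a handful of health values. The dictionary keys and the sample numbers are hypothetical (they are not taken from the CrystalDiskInfo report linked above), and the choice of checks is an assumption about which fields matter most, not a complete health test.

```python
def review_nvme_health(health):
    """Return a list of concerns found in a dict of NVMe health values.

    The key names are hypothetical; map them from whatever your tool reports.
    """
    concerns = []
    if health.get("critical_warning", 0) != 0:
        concerns.append("critical warning flag is set")
    if health.get("media_and_data_integrity_errors", 0) > 0:
        concerns.append("media and data integrity errors reported")
    spare = health.get("available_spare_percent")
    threshold = health.get("available_spare_threshold_percent")
    if spare is not None and threshold is not None and spare < threshold:
        concerns.append("available spare is below its threshold")
    if health.get("percentage_used", 0) >= 100:
        concerns.append("drive has reached its rated endurance")
    return concerns


# Made-up sample values, for illustration only.
clean_drive = {
    "critical_warning": 0,
    "media_and_data_integrity_errors": 0,
    "available_spare_percent": 100,
    "available_spare_threshold_percent": 10,
    "percentage_used": 1,
}
suspect_drive = dict(clean_drive, media_and_data_integrity_errors=4)

clean_report = review_nvme_health(clean_drive)
suspect_report = review_nvme_health(suspect_drive)
```

An empty list only means the drive has not logged these particular problems; it does not prove the drive is healthy.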
 

DanCarter194

Bronze Level Poster
I'm borrowing some kit from a friend this afternoon, so I hope to set Heaven running again later today.

Sorry, I forgot to reply on your SMART data. It looks fine, the key number is the Media & Data Integrity Errors count, which is zero.
Thanks for this. Does this rule out SSD issues? If not, do you know of any SSD testing tools? (Do they even exist?) Just thinking that if I can focus my testing I can be more specific when sending it back for RMA.
 

SpyderTracks

We love you Ukraine
You can use CrystalDiskMark to test the drive; just use the standard edition:

 

DanCarter194

Bronze Level Poster
OK, I've managed to make Heaven crash again. It took 7 hrs and didn't BSOD, but it did crash the process. Application event log is at https://drive.google.com/file/d/1lQQ5isV1T_9o9HqaccRwTdXu1eSjYKUr/view?usp=sharing.

The only information I have is what was logged, namely:
Faulting application name: heaven.exe, version: 1.0.0.0, time stamp: 0x511b9e02
Faulting module name: nvwgf2um.dll, version: 27.21.14.6109, time stamp: 0x5fed9689
Exception code: 0xc0000005
Fault offset: 0x0030431b
Faulting process id: 0x2460
Faulting application start time: 0x01d71e9e2a084913
Faulting application path: C:\Program Files (x86)\Unigine\Heaven Benchmark 4.0\bin\heaven.exe
Faulting module path: C:\windows\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_3621da861144492b\nvwgf2um.dll
Report Id: ab3d3667-80c9-4f4c-b5e4-8df7fcf908f7
Faulting package full name:
Faulting package-relative application ID:

Does that tell us anything? I believe nvwgf2um.dll is related to the nvidia graphics driver?
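For anyone who wants to pick the logged entry apart, here is a minimal Python sketch that splits the fields and decodes the application start time. It assumes the start time is a Windows FILETIME (100 ns ticks since 1 January 1601 UTC); the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Lines copied from the event log entry above.
EVENT_TEXT = """\
Faulting application name: heaven.exe, version: 1.0.0.0, time stamp: 0x511b9e02
Faulting module name: nvwgf2um.dll, version: 27.21.14.6109, time stamp: 0x5fed9689
Exception code: 0xc0000005
Fault offset: 0x0030431b
Faulting process id: 0x2460
Faulting application start time: 0x01d71e9e2a084913
"""


def parse_event(text):
    """Split each 'key: value' line at its first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def filetime_to_utc(hex_value):
    """Convert a hex FILETIME (assumed format) to a timezone-aware datetime."""
    ticks = int(hex_value, 16)  # 100 ns units since 1601-01-01 UTC
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=ticks // 10)


fields = parse_event(EVENT_TEXT)
# The module line also carries version and time stamp; keep only the file name.
module = fields["Faulting module name"].split(",")[0]
started = filetime_to_utc(fields["Faulting application start time"])
```

Comparing `started` with the time of the crash event itself gives the run length without relying on memory.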

Here's what I did to ensure this really is a hardware issue:
  • Plugged it in using a different IEC lead, into a kitchen socket (which is on a different loop to the rest of the house).
  • Used a different mouse and keyboard (both wired USB).
  • Plugged it into my TV (not a monitor) using a different HDMI lead.
  • Connected to my router using a network cable rather than using wireless.
The only thing in common between all the errors and crashes I have had is the PC hardware itself - I think we've comprehensively shown that something inside the PC is faulty. No idea what, mind (probably not the GPU as that has already been replaced), but TBH I think that is PCS's job to work out - it was them that sold me the faulty machine.
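The elimination reasoning above can be written as a set intersection. This is a simplified sketch: the labels are my own shorthand for the two test setups, and it ignores software, which was PCS's own install in both runs.

```python
# Factors present in the first failing run (the BSOD while downloading Heaven).
first_run = {
    "PC hardware",
    "usual wall socket",
    "own monitor",
    "own keyboard and mouse",
    "WiFi",
}

# Factors present in the second failing run (Heaven crashed after 7 hrs).
second_run = {
    "PC hardware",
    "kitchen socket and different IEC lead",
    "TV over a different HDMI lead",
    "different wired keyboard and mouse",
    "Ethernet cable",
}

# Anything that could explain both failures must appear in both runs.
common_to_all_failures = first_run & second_run
```

Every external factor drops out of the intersection, leaving only the PC itself.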

I think my next step is to do another RMA with some VERY clear instructions.
 

ubuysa

The BSOD Doctor
Thanks for this. Does this rule out SSD issues? If not, do you know of any SSD testing tools? (Do they even exist?) Just thinking that if I can focus my testing I can be more specific when sending it back for RMA.
It doesn't rule out SSD issues but it
The nvwgf2um.dll file is the NVIDIA Compatible D3D10 Driver (whatever that is?) and the exception code of 0xc0000005 is an access violation.
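As a small illustration of how that exception code breaks down, the sketch below decodes it using the standard NTSTATUS layout (severity in the top two bits, facility in bits 16 to 27, code in the low 16 bits). The lookup table is a short hand-picked subset, and the function name is hypothetical.

```python
# A few well-known NTSTATUS values; this is a subset, not a full table.
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}

SEVERITY = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def describe_exception(code_text):
    """Decode a hex exception code string such as '0xc0000005'."""
    code = int(code_text, 16)
    return {
        "name": KNOWN_CODES.get(code, "unknown"),
        "severity": SEVERITY[code >> 30],   # top two bits
        "facility": (code >> 16) & 0xFFF,   # bits 16-27
        "code": code & 0xFFFF,              # low 16 bits
    }


result = describe_exception("0xc0000005")
```

An access violation means the process touched memory it was not allowed to, which can come from a software bug or from bad data caused by faulty hardware, so on its own it does not point at one component.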

I think your testing is as good as anyone could be expected to do. I agree that an RMA with very detailed instructions is the right way to go....
 

DanCarter194

Bronze Level Poster
So ... I did another RMA, and I've just got this message from PCS:
The RMA on your system has been completed, we have performed a full core rebuild and replaced all components. Before the rebuild we can confirm the system was randomly crashing and blue screening, afterwards we have had none. To ensure that your system had no further issues we have ran your system on a stress test period for 24hrs which it has now passed. We will now begin your systems dispatch process.
The PC is being dispatched to me tomorrow, so although I don't have it back yet, it looks like they've responded to my request for a rebuild!

Thanks to everyone who has helped me diagnose the problem so far - I hope I won't need to post any more on this thread!
 