Random restarts - Ryzen 5950x

So, about a month ago I got my new workstation, and for the last couple of days I've had some random restarts, for which the Event Viewer logs:
A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 0

The details view of this entry contains further information.
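For anyone comparing several of these entries, here is a minimal sketch that turns the "Key: Value" lines of the event message into a dictionary so the fields can be compared side by side. The function name `parse_whea_event` is hypothetical, and the sample text is simply the message quoted above; it assumes you have copied the event text out of Event Viewer yourself.

```python
def parse_whea_event(text: str) -> dict[str, str]:
    """Collect the 'Key: Value' lines of a hardware-error event message."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines with no colon or with nothing after it.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


# Sample text copied from the event quoted in this post.
sample = """A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 0
"""

event = parse_whea_event(sample)
print(event["Error Source"], "/", event["Error Type"])
```

If every crash reports the same source, type and APIC ID, that at least suggests one repeatable fault rather than several unrelated ones.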

The crashes are completely random; when idle, when browsing, when editing... any solutions?
My specs are:

Case
COOLERMASTER SILENCIO S600 QUIET MID TOWER CASE
Processor (CPU)
AMD Ryzen 9 5950X 16 Core CPU (3.4GHz-4.9GHz/72MB CACHE/AM4)
Motherboard
ASUS® CROSSHAIR VIII HERO (DDR4, PCIe 4.0, CrossFireX/SLI) - RGB Ready!
Memory (RAM)
64GB Corsair VENGEANCE RGB PRO DDR4 3200MHz (4 x 16GB)
Graphics Card
24GB NVIDIA GEFORCE RTX 3090 - HDMI, DP
1st Storage Drive
2TB PCS 2.5" SSD, SATA 6 Gb (520MB/R, 470MB/W)
1st M.2 SSD Drive
500GB SAMSUNG 970 EVO PLUS M.2, PCIe NVMe (up to 3500MB/R, 3200MB/W)
1st M.2 SSD Drive
1TB INTEL® 665p M.2 NVMe PCIe SSD (up to 2000MB/sR | 1925MB/sW)
Memory Card Reader
USB 3.0 EXTERNAL SD/MICRO SD CARD READER
Power Supply
CORSAIR 850W RM SERIES™ MODULAR 80 PLUS® GOLD, ULTRA QUIET
Power Cable
1 x 1 Metre European Power Cable (Kettle Lead)
Processor Cooling
Noctua NH-U14S Ultra Quiet Performance CPU Cooler
Thermal Paste
ARCTIC MX-4 EXTREME THERMAL CONDUCTIVITY COMPOUND
Extra Case Fans
2x 120mm Black Case Fan (configured to extract from rear/roof)
Sound Card
ONBOARD 8 CHANNEL (7.1) HIGH DEF AUDIO (AS STANDARD)
Network Card
10/100/1000 GIGABIT LAN PORT (Wi-Fi NOT INCLUDED)
Wireless Network Card
WIRELESS 802.11N 300Mbps/2.4GHz PCI-E CARD
USB/Thunderbolt Options
MIN. 2 x USB 3.0 & 6 x USB 2.0 PORTS @ BACK PANEL + MIN. 2 FRONT PORTS
Operating System
Windows 10 Professional 64 Bit - inc. Single Licence [MUP-00003]
Operating System Language
United Kingdom - English Language
Windows Recovery Media
Windows 10 Multi-Language Recovery Image - Unlimited Downloads from Online Account
Office Software
FREE 30 Day Trial of Microsoft 365® (Operating System Required)
Anti-Virus
NO ANTI-VIRUS SOFTWARE
Browser
Microsoft® Edge (Windows 10 Only)
 

Martinr36

MOST VALUED CONTRIBUTOR
Might be an idea to add your details to the link below. To discover whether it's related to 4 sticks, try running on 2 and see if the crashes still occur.

@ubuysa

 
Thank you for the suggestion, @Martinr36. By the way - I'm not really up to date on how things work these days - is it OK to just pull RAM out of the sockets, or should I do anything in BIOS as well?
Regarding the restarts - I checked "Armoury Crate", and there were some updates pending for "Asus HAL Central" and "Device Kit". So I updated those, and since then the computer has not restarted, even though it was idling for most of the last two days. It's too early to say whether the problem has gone away completely, but just a heads up for anyone experiencing the same on ASUS motherboards.
 

Martinr36

MOST VALUED CONTRIBUTOR
No, you can just take it out. Make sure the PC is turned off but still plugged in, and touch some bare metal on the PC to earth yourself.

Fingers crossed it's sorted then.
 

ubuysa

The BSOD Doctor
This looks like a manifestation of the "4 RAM sticks in an AMD build" issue, though most users have problems with 3600MHz RAM. I think this is the first time I've seen it with 3200MHz RAM.

There should be a kernel memory dump in the C:\Windows\Memory.dmp file. Please upload it to the cloud somewhere (it will be large) and post a link to it here. I'll let you know whether it's a similar problem to the ones we know about. :)
 
Thank you, but I cannot find that file anywhere. I searched the web for possible locations and checked all of them, but no such file exists on my system 🤷‍♂️
 

ubuysa

The BSOD Doctor
OK, so no dump was taken then. Have you modified your pagefile size at all?
 

ubuysa

The BSOD Doctor
I actually wouldn't even know how to do that :)
That's good then because Windows will be using the default - which is fine. :)

Take a look in the folder C:\Windows\Minidump (if it exists) and if you find any .dmp files in there, upload them to the cloud and post a link here. These are minidumps; they don't contain all the kernel data structures and aren't as useful as a kernel dump, but they're worth uploading if you have any. :)
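If hunting through folders by hand is a pain, here is a rough sketch that lists any .dmp files in the usual places. The locations are assumptions based on Windows defaults (the kernel dump directly in C:\Windows, minidumps in C:\Windows\Minidump, and live kernel reports such as WHEA dumps under C:\Windows\LiveKernelReports); the function name `find_dump_files` is made up for illustration, and your system may be configured differently.

```python
from pathlib import Path

# Assumed default locations; each entry is (directory, search subfolders?).
DEFAULT_LOCATIONS = [
    (Path(r"C:\Windows"), False),                   # Memory.dmp (kernel dump)
    (Path(r"C:\Windows\Minidump"), False),          # minidumps
    (Path(r"C:\Windows\LiveKernelReports"), True),  # e.g. a WHEA subfolder
]


def find_dump_files(locations):
    """Return every *.dmp file found in the given (directory, recursive) pairs."""
    found = []
    for directory, recursive in locations:
        try:
            if not directory.is_dir():
                continue  # folder missing, e.g. no minidump was ever written
            matches = directory.rglob("*.dmp") if recursive else directory.glob("*.dmp")
            found.extend(sorted(p for p in matches if p.is_file()))
        except PermissionError:
            print(f"No permission to read {directory}; try an elevated prompt")
    return found


for dump in find_dump_files(DEFAULT_LOCATIONS):
    print(dump, f"({dump.stat().st_size / 1_048_576:.1f} MiB)")
```

An empty result simply means no dump was written to those folders; it does not rule out a hardware error.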
 

ubuysa

The BSOD Doctor
I found some WHEA.dmp files, and uploaded them here:

Hope that helps!
Unfortunately minidumps are of little use with WHEA errors. WHEA is the Windows Hardware Error Architecture and a minidump just doesn't contain the kernel hardware data areas necessary to diagnose the problem.

What I can tell you is that they are all machine check errors (a hardware error) and they are identical. In all cases the failing process is smss.exe - the Windows Session Manager subsystem - which is responsible for starting and managing user sessions. It's an essential Windows component.

The list of driver calls is pretty similar in all dumps and from the stack trace of the active thread it appears that the thread is attempting to call and initialise a driver of some sort. Check that there are no devices in Device Manager missing a driver. Download the AMD Driver Detect Tool and see whether there are updated chipset drivers for your AMD processor.

Looking again at your spec (and I missed this the first time), you're running an AMD processor with four RAM cards. This configuration does seem to be causing problems for some users (though mostly those running 3600MHz RAM). Please add your details to this survey thread. I would also contact PCS and seek their advice; they know about this AMD processor and four RAM cards issue.
 
Thank you @ubuysa for your time, examination of the dump files and a great explanation. I will follow suggested actions, and if I come up with a solution, I will update here!
I also just wanted to say - it is great to see/experience a forum with people actually trying to help each other, that is a rare occasion on the interwebs of today...
Anyway, thanks again!
 

SpyderTracks

We love you Ukraine
@ubuysa is an absolute wizard when it comes to RAM and dump file analysis. That's the kind of stuff you can't teach, just comes with experience.... although he is helping to teach me! 🙏
 

ubuysa

The BSOD Doctor
Thanks for this. TBH I love analysing dump files... :)
 
Hey everyone! Just wanted to update the thread in case anyone is interested. It turns out the Nvidia Studio Driver for my RTX 3090 was what caused the restarts for me :oops:
Last week I went to Zotac's website, downloaded the latest driver there (the regular "gaming" driver), and suddenly the system was rock stable. Then I went to Nvidia's website, installed the Studio Driver from there, and had a WHEA error and restart within 5 minutes. I went back to Zotac's one, and the system has been rock solid since.
Still not sure what to make of it... 🙃
 

ubuysa

The BSOD Doctor
What happens if you install the latest Nvidia Game Ready driver....?

AFAIK the Nvidia Studio driver is optimised for content creators and is a relatively new(ish) concept. Perhaps it's thus less stable??
 
What happens if you install the latest Nvidia Game Ready driver....?
System is stable.

AFAIK the Nvidia Studio driver is optimised for content creators and is a relatively new(ish) concept. Perhaps it's thus less stable??
Yes, and that's what bothers/confuses me... I would expect it to be more stable, since it's geared towards content creation, which relies on overnight rendering, etc...
 

ubuysa

The BSOD Doctor
I guess that content creation is what you do, rather than gaming? I have no idea (of course) what performance improvements are in the Studio driver, nor for which applications it is optimised. I also don't know what performance difference it will make using the Studio driver vs the Game Ready driver for content creation applications.

It's always a delicate balance between stability and performance, both in software and hardware, and it's an undeniable fact that as soon as you mess with driver code, in order to introduce some performance enhancement in one area for example, you run the risk of creating instability in another. Personally I'm always happy when I look at a driver for one of my devices and see that it has a date that is many years old. Old code has been run billions of times and any bugs in it were ironed out long ago. The most stable code is old code. That is why I always recommend that drivers are only ever updated when you're having problems with the device or when you need the functionality in the new driver. Updating all drivers just because there's a new version is dumb IMO - it's just a faster way to a BSOD.

IMO - and it is just an opinion, it's not based on any real facts other than experience - the main reason we have the problems that we do with Nvidia graphics drivers is that they are updated so regularly. I fully appreciate that the updates are (in the main) performance enhancements for specific games (or for specific content creation applications in the Studio drivers) and that's a big deal for people playing those games. The downside is the unexpected instabilities that messing with the code introduces elsewhere, and often where you least expect it.

I'm sure that the Nvidia developers are masters at managing their code but they're human and humans make mistakes.

Sorry, I got carried away a bit there. ☺️

If your system is stable with the Game Ready driver and not the Studio driver then I think your choice is clear. No matter what the improvements in the Studio driver, stability is everything. You get no performance gains at all whilst you're looking at a BSOD....
 

Wuffles

Active member
Thanks for this. TBH I love analysing dump files... :)
I have enjoyed (and learnt so much from) your informative posts, and as a way of saying thank you, given your love of dump files, I thought I would send you this one to show my appreciation (y)
Dump File.jpg
 