Computer had incident

Hon1nbo, Wed Aug 27 2014, 06:39PM

Hi all,

So I had my AC fail over the weekend, and I shut my computer off to protect it (my condo got over 100 °F). When the AC was restored, I brought the computer back online. However, since it would take time to get the entire unit down to temp, I put a block of dry ice on my radiator (I'd taken measures to prevent the water from freezing).

Once the AC had things cool enough, I figured I'd benchmark an overclock while I was at it, since there was still plenty of dry ice left. After the CPU portion of the benchmark finished, I went into the next room for a couple of minutes while the GPU portion ran.

I came back to white screens. I shut the computer off, and then I couldn't even get POST to start. I started disabling devices (the motherboard has a handy set of DIP switches for disabling controllers and PCIe devices for debugging, as well as a hex readout). I found that everything worked fine when the GPU on PCIe lane 1 was disabled. I know the water lines didn't freeze on me: the flow window I have showed water still moving through the system while dry ice remained.

I would chalk it up to a failed GPU (possibly an overheat), but after POST finally started I found something much worse: the incident had dropped two drives from my RAID 5 array (4 drives, which can only tolerate 1 failure). I figured I'd start restoring from backup; maybe the system crashed as the benchmark started doing I/O. However, I found that Windows Backup does not check the integrity of files before updating with an incremental (how?). I am currently having to restore all files by hand, and the process starts over whenever I encounter an invalid ZIP file.
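
One way to avoid the restart-on-bad-ZIP problem might be to scan the whole backup set before restoring anything. Below is a minimal Python sketch, assuming the backup is just a folder tree of ordinary .zip files; the function name find_bad_zips and the folder you point it at are hypothetical, not anything Windows Backup provides. It only uses the standard zipfile module, whose testzip() reads every member and checks its CRC.

```python
import zipfile
import zlib
from pathlib import Path


def find_bad_zips(backup_dir):
    """Return (path, reason) for every .zip under backup_dir that fails a check.

    Hypothetical helper: assumes the backup set is a tree of plain ZIP files.
    """
    bad = []
    for path in sorted(Path(backup_dir).rglob("*.zip")):
        try:
            with zipfile.ZipFile(path) as zf:
                # testzip() returns the name of the first member with a bad
                # CRC or header, or None if every member reads back cleanly.
                first_bad = zf.testzip()
                if first_bad is not None:
                    bad.append((path, f"CRC error in member {first_bad}"))
        except (zipfile.BadZipFile, zlib.error, OSError) as exc:
            # The archive could not be opened or decompressed at all.
            bad.append((path, f"unreadable archive: {exc}"))
    return bad
```

Calling find_bad_zips on the backup folder and printing the result would give a list of archives to set aside up front, so a manual restore only has to touch the ones that read back cleanly. It says nothing about whether the contents are the right version, only that each archive is internally consistent.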

So this leaves me with a couple questions for the community:

I am wary of assuming the GPU is the only issue, as the two drives were dropped off the array (though their SMART history shows no issues, it could have been an extreme case of a write hole or the array was being reinitialized during the crash).
I don't currently have a PCIe device I can safely use to test whether the lane is bad rather than the GPU (or whether both are bad). I am afraid to put my other GPU in it until I can confirm the lane is good.

But I am trying to think of anything else that could have happened. The mobo seems to be operating fine, though the 12V rail is registering low. Maybe the PSU got tripped from not cooling off fully after the AC was restored? Or the benchmark drew more power than expected? I currently have a 750 W unit, which should be enough, but I won't rule out a surge. This could also explain how the HDDs went out of sync with the RAID and dropped their membership.


Any ideas?

-Jim

P.S. I am also looking for a new backup solution. I used to have Acronis, but it has issues with my RAMDisk drivers, and their hotfix doesn't work for me.

Re: Computer had incident
hen918, Wed Sept 03 2014, 06:22PM

I had exactly the same issue with a GPU. I was playing Crysis too hard, got a slight glitch, and checked the GPU temp: 95 °C. Oops. I shut down the computer and let it cool, but it then failed to display POST and the GPU got hot very quickly. It must have been drawing a fair bit of power, so perhaps this could have caused power supply damage? I RMA'd the graphics card in the end.

Anyway, I can't see what caused the RAID failure. Possibly the under-voltage, but I don't think that's likely.

Good Luck!
Henry