
Catching crashes?


Fleuria

Lately, my machine has taken to crashing a *lot*. And by crash, I mean a power-down crash which is not specific to warframe, but is specific to gpu-heavy applications. So my windows 10 nvidia drivers are a likely (but not guaranteed) culprit. (But, also, these crashes have persisted through several driver updates and some major windows updates also. And, yes, I've been scanning heavily for malware on my machine, also... for all the good that does...)

Thing is, though... I am pretty sure that the nature of these crashes has meant that nothing is being reported upstream. (Usually my display freezes, often glitches out, and the machine shuts down.)

So... is there a way that anyone knows of finding out whether crash details are being reported by my machine? (Or... does anyone know of any good way of tackling this kind of issue?)

(Oh, yes, one other thing -- this all started right after I got my machine serviced for shutting down too often on overheat. And they replaced both my GPU and my motherboard for that... Maybe I got defective parts from that servicing, but how would I be able to tell?)


  • 0

run 'Event Viewer' from the Start Menu.

Custom Views -> Administrative Events
skim through the timeline to see if any entries seem to correlate in time with your problem. note that Event Viewer logs basically everything that happens, so not every entry is a problem that needs to be fixed; it's just trying to collect anything and everything that it can.
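
For anyone who would rather script this than scroll, here is a minimal Python sketch of the same idea. It assumes the events have already been exported from Event Viewer and loaded into a list of dicts; the keys `time`, `source`, `event_id` and `message` are made up for this example and are not the export's real column names. It treats Event ID 41 (Kernel-Power) and 6008 (EventLog) as "unexpected shutdown" markers. Those markers are written on the next boot, so the entries just before them are the last things logged before the machine went down.

```python
# Event IDs treated as "the previous shutdown was unexpected" markers.
# 41 (Kernel-Power) and 6008 (EventLog) are the usual ones; this set is
# a starting point, not a complete list.
UNEXPECTED_SHUTDOWN_IDS = {41, 6008}


def last_events_before_crash(events, count=3):
    """For each shutdown marker, return the `count` entries logged just before it.

    `events` is a list of dicts with the hypothetical keys
    "time" (datetime), "source", "event_id" (int) and "message".
    """
    ordered = sorted(events, key=lambda e: e["time"])
    seen = []      # non-marker events, oldest first
    reports = []
    for event in ordered:
        if event["event_id"] in UNEXPECTED_SHUTDOWN_IDS:
            preceding = seen[-count:] if count > 0 else []
            reports.append({"marker": event, "preceding": preceding})
        else:
            seen.append(event)
    return reports
```

If the same source keeps showing up in `preceding` across several crashes, that is worth a closer look; if nothing does, the crash is probably too abrupt for Windows to log anything.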

 

if you want to check Hardware, be sure that the supplemental Power for your GPU if applicable is plugged in correctly (if there are multiple Power Connectors, ideally using separate Cables for each Connector rather than bridging off of the same one).

you can run a stress test that isn't a game as another datapoint. pick your poison, it won't matter all that much. i use Furmark for most stuff by default but you could use any sort of Benchmarking Software too, and Benchmark the GPU to give it a non game load.


  • 0
14 minutes ago, Fleuria said:

So my windows 10 nvidia drivers are a likely (but not guaranteed) culprit.

I can assure you that Win10 is the culprit, and not the nVidia drivers.

17 minutes ago, Fleuria said:

I am pretty sure that the nature of these crashes has meant that nothing is being reported upstream. (Usually my display freezes, often glitches out, and the machine shuts down.)

This is consistent with insufficient power being supplied. I have those same issues under Win7 when I'm doing something more CPU- and/or GPU-intensive, but not under Debian (Linux deals with the hardware properly and more accurately than any WinOS, without being a bother to the user).

I wouldn't be surprised that the fat cow known as "Win10" is actually doing more harm than good at hardware "handling" and is causing those one way or another, during its "hardware polling procedure" that are usually hidden by the "System Idle Process". (a.k.a. Telemetry Central)

 

Taking into consideration taiiat's suggestion about testing your hardware, I suggest getting Prime95 for it, since it's an unbiased general hardware stress tester, and I suggest Hardware Monitor for cross-checking the readings. (more than one source is always nice)


  • 0
19 hours ago, taiiat said:

run 'Event Viewer' from the Start Menu.

Custom Views -> Administrative Events
skim through the timeline to see if any entries seem to correlate in time with your problem. note that Event Viewer logs basically everything that happens, so not every entry is a problem that needs to be fixed; it's just trying to collect anything and everything that it can.

 

if you want to check Hardware, be sure that the supplemental Power for your GPU if applicable is plugged in correctly (if there are multiple Power Connectors, ideally using separate Cables for each Connector rather than bridging off of the same one).

you can run a stress test that isn't a game as another datapoint. pick your poison, it won't matter all that much. i use Furmark for most stuff by default but you could use any sort of Benchmarking Software too, and Benchmark the GPU to give it a non game load.

Event viewer did not seem to have anything significant in administrative events.

There were some errors from turning on the machine, before networking came up, then nothing until an error about the previous shutdown being unexpected, and a couple of entries later an error saying that dump file creation failed due to an error during dump creation.

So it's trying to record details but failing because it's crashing...
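
One way around a dump that never finishes writing is a tiny heartbeat log: a script that appends a timestamp every few seconds and forces it to disk, so that after a hard power-off the last line shows roughly when the machine died. A minimal Python sketch follows; the function names, file name and interval are arbitrary choices for this example, and `os.fsync` only asks the OS to flush, so a drive with its own write cache may still lose the final line.

```python
import os
import time
from datetime import datetime


def write_heartbeat(path, note=""):
    """Append one timestamped line and push it to disk before returning."""
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} {note}\n")
        log.flush()
        os.fsync(log.fileno())  # ask the OS to write it out now
    return stamp


def last_heartbeat(path):
    """Return the last non-empty line of the log, or None if it is empty."""
    with open(path, encoding="utf-8") as log:
        lines = [line.strip() for line in log if line.strip()]
    return lines[-1] if lines else None


def run_heartbeat(path, interval_seconds=5.0, beats=None):
    """Write a heartbeat every `interval_seconds`; run forever if `beats` is None."""
    n = 0
    while beats is None or n < beats:
        write_heartbeat(path, f"beat {n}")
        n += 1
        time.sleep(interval_seconds)
```

Start something like `run_heartbeat("heartbeat.log")` before launching the game, then call `last_heartbeat("heartbeat.log")` after the reboot and compare that time against whatever sensor log was running.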

It's a sealed system (I didn't put it together), so checking power to the GPU is a bit problematic (I don't know how to do that). But I ran hwinfo in logging mode (a log entry every couple minutes) and ran warframe until it crashed (didn't take long -- maybe 10 minutes), and the GPU never went above 80 watts (this is on geforce 1070 which might burn 450 watts for other people, so a lack of available power could very well be the issue -- I'll have to ask the support people that repaired my machine about that).

19 hours ago, Uhkretor said:

I can assure you that Win10 is the culprit, and not the nVidia drivers.

This is consistent with insufficient power being supplied. I have those same issues under Win7 when I'm doing something more CPU- and/or GPU-intensive, but not under Debian (Linux deals with the hardware properly and more accurately than any WinOS, without being a bother to the user).

I wouldn't be surprised that the fat cow known as "Win10" is actually doing more harm than good at hardware "handling" and is causing those one way or another, during its "hardware polling procedure" that are usually hidden by the "System Idle Process". (a.k.a. Telemetry Central)

 

Taking into consideration taiiat's suggestion about testing your hardware, I suggest getting Prime95 for it, since it's an unbiased general hardware stress tester, and I suggest Hardware Monitor for cross-checking the readings. (more than one source is always nice)

I am not sure I understand the specifics of what you are trying to tell me, here, but lack of power seems a highly plausible issue.


  • 0
29 minutes ago, Fleuria said:

I am not sure I understand the specifics of what you are trying to tell me, here, but lack of power seems a highly plausible issue.

Prime95 will lock your computer automatically if it's an issue related to lack of power supplied by your PSU, since it's a stress tester. And a stress tester will test your entire rig under extreme stress, meaning that it will max out your PC's power consumption from the PSU.

My computer takes 3 seconds to lock due to insufficient power, under a standard test from Prime95. But my PC rig spent 2 years circulating 60ºC air through it due to the restrictive area where it was placed before, so the PSU is in pretty bad shape. Now my rig is circulating air at ambient temperature, and the hot air that comes out is being actively extracted outside, so my rig isn't idling at 60ºC/70ºC on a permanent basis anymore; it's idling around 40ºC/50ºC and working at 55ºC/60ºC. But it still locks up once in a while, especially when I'm demanding too much of my CPU/GPU combo, which then draws more power in order to work harder.

This may be your case, hence why I suggested testing your hardware with Prime95. However, if it isn't a PSU problem, the stress test will most likely continue undisturbed until a hardware malfunction is triggered, IF there is one. At that point, the test will most likely be automatically interrupted.

Edited by Uhkretor

  • 0
41 minutes ago, Fleuria said:

It's a sealed system (I didn't put it together), so checking power to the GPU is a bit problematic (I don't know how to do that).

But I ran hwinfo in logging mode (a log entry every couple minutes) and ran warframe until it crashed (didn't take long -- maybe 10 minutes), and the GPU never went above 80 watts (this is on geforce 1070 which might burn 450 watts for other people, so a lack of available power could very well be the issue -- I'll have to ask the support people that repaired my machine about that).

ok, well assuming it's not some alien shaped Case, the left sidepanel comes off with a couple screws on the back edge of it. once removed/loosened(if they are captive and don't come out) the Panel can just be opened kinda like a door.

[Image: 08Go3YK.png]

so, in this visual aid, you want to look at the designated square regions. these are where the Power Connectors for any GPU reside. sometimes they're on the long side close to the end, or sometimes they are on the short side/end.
just make sure they are fully inserted (they have clips on them so when they are fully inserted those clamp down). nothing complicated, just applying firm pressure to make sure they're actually plugged in.
then like i said, if there are multiple Power Connectors, you will prefer them to be two separate Cables rather than bridged off of one Cable.

 

80 Watts is certainly rather low for a 1070 under significant load(granted that the Power it will try to draw is dependent on the load it is given, so if the GPU is at like 40% Utilization then that could be perfectly reasonable), i would expect more in the 120-160 range.
a GTX1070 can't possibly draw 450Watt though, that would be more than double what any 1070 could possibly draw without manually modifying the BIOS or physically modifying the GPU. a 1070 would absolutely max out around 200 Watt, peak.

 

the next step seems like giving individual parts/the entire System non-game loads, so you have a separate data point on whether the Hardware itself is for whatever reason unstable in its current state in some way.

11 minutes ago, Uhkretor said:

Prime95 will lock your computer automatically if it's an issue related to lack of power supplied by your PSU, since it's a stress tester. And a stress tester will test your entire rig under extreme stress, meaning that it will max out your PC's power consumption from the PSU.

uhh.... no it won't. Prime95 is almost exclusively used for stressing the CPU, it uses almost no Memory, and doesn't touch any of the other Hardware.


  • 0
3 minutes ago, taiiat said:

ok, well assuming it's not some alien shaped Case, the left sidepanel comes off with a couple screws on the back edge of it. once removed/loosened(if they are captive and don't come out) the Panel can just be opened kinda like a door.

Negative.


Form factor is "laptop" (albeit - unwieldy and heavy "laptop").


  • 0
30 minutes ago, Uhkretor said:

Prime95 will lock your computer automatically if it's an issue related to lack of power supplied by your PSU, since it's a stress tester. And a stress tester will test your entire rig under extreme stress, meaning that it will max out your PC's power consumption from the PSU.

3 minutes ago, taiiat said:

uhh.... no it won't. Prime95 is almost exclusively used for stressing the CPU, it uses almost no Memory, and doesn't touch any of the other Hardware.

Thanks for agreeing with me.

 

1 hour ago, Fleuria said:

It's a sealed system (I didn't put it together), so checking power to the GPU is a bit problematic (I don't know how to do that).

"Hardware Monitor" shows a decent amount of information, including temperatures, core usage and power usage. You should try it out.


  • 0
1 minute ago, Fleuria said:

Form factor is "laptop" (albeit - unwieldy and heavy "laptop").

oh.....

ok, well servicing a Notebook is a lot more finicky so since you're not familiar with what you're doing here, then sure leave it for the place you had it serviced last and let them do it instead. since it's not working right that's their fault then IMO anyways.

Just now, Uhkretor said:

Thanks for agreeing with me.

i didn't agree with you. if the problem was say, supply to the GPU, or anything other than the CPU, then Prime95 wouldn't express any issues.

Edited by taiiat

  • 0
3 minutes ago, Uhkretor said:

Thanks for agreeing with me.

 

"Hardware Monitor" shows a decent amount of information, including temperatures, core usage and power usage. You should try it out.

(1) Symptoms point to GPU being the issue, so I am currently not inclined to think that prime95 would be particularly relevant, here.

(2) HWInfo shows a "decent amount of information" also, including: temperatures, core usage and power usage. (Seriously, the csv file it generated for me when I turned on its logging had about 300 columns -- far more information than I know how to use, but it also had gpu voltage and power, which might be providing relevant clues here.)
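
With that many columns it can help to boil the log down to one sensor at a time. Below is a minimal Python sketch that reads a CSV sensor log and reports the sample count, the peak, and the last reading for a single column. The column name used in the usage note, `GPU Power [W]`, is an assumption for illustration; check the header row of the actual file. Rows whose value is not a number (such as a repeated header at the end of the file) are skipped.

```python
import csv


def summarize_column(csv_file, column):
    """Summarize one numeric column of a CSV sensor log.

    `csv_file` is any open text file (or file-like object) whose first
    row is the header. Returns None if the column has no numeric values.
    """
    values = []
    for row in csv.DictReader(csv_file):
        raw = (row.get(column) or "").strip()
        try:
            values.append(float(raw))
        except ValueError:
            continue  # blank cell, repeated header, or other non-numeric text
    if not values:
        return None
    return {"samples": len(values), "max": max(values), "last": values[-1]}
```

For example, `summarize_column(open("log.csv", newline="", encoding="utf-8", errors="replace"), "GPU Power [W]")` would give the peak draw and the final reading before the crash. With a log entry only every couple of minutes, though, the last reading may be well before the actual failure, so a shorter logging interval would make this more useful.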


  • 0
9 minutes ago, Fleuria said:

(2) HWInfo shows a "decent amount of information" also, including: temperatures, core usage and power usage. (Seriously, the csv file it generated for me when I turned on its logging had about 300 columns -- far more information than I know how to use, but it also had gpu voltage and power, which might be providing relevant clues here.)

https://www.cpuid.com/softwares/hwmonitor.html

^ Hardware Monitor ^

If it isn't this one that you're using, you're seeing more information than you really need to know... Not to mention that if it's for a WinOS, you might be seeing values that the OS wants you to see and not the real ones...

Edited by Uhkretor

  • 0
49 minutes ago, Uhkretor said:

https://www.cpuid.com/softwares/hwmonitor.html

^ Hardware Monitor ^

If it isn't this one that you're using, you're seeing more information than you really need to know... Not to mention that if it's for a WinOS, you might be seeing values that the OS wants you to see and not the real ones...

How much information hwinfo shows is configurable.

And there were bug reports against it, a few years ago, about showing invalid values because of some intermediate software layer. Those problems have been fixed.

(Or: I am willing to believe problems can exist. But I need specifics before I can believe that problems do exist. It's too easy for people to just say stuff when they are talking about some other problem on something else. So I need specific details so I can check what I am being told against what I can observe.)


  • 0
1 hour ago, Uhkretor said:

If it isn't this one that you're using, you're seeing more information than you really need to know... Not to mention that if it's for a WinOS, you might be seeing values that the OS wants you to see and not the real ones...

HW Info gets accurate data, as far as Software monitoring is concerned. limited Accuracy compared to getting out the Multi-Meter, but that's true of any Software based readouts, so shrug.


  • 0

Stupid question but how is the battery? 

I had an issue a while back with a bunk battery or maybe a bunk charging circuit (never did bother figuring that out... anyhow...), what was happening was the system would intermittently turn off even when plugged in. No warnings, just WTF?

After a short bit of trial and error I ended up pulling the battery and just running it off of the power brick and that kept it going until it eventually died and I'm reasonably sure it was the brick that died not the laptop but again I wasn't suuuper into caring at that point the thing was getting really long in the tooth and I've never really liked Laptops anyhow. 


  • 0
20 hours ago, Oreades said:

Stupid question but how is the battery?

Measured how?

Anyways, I run the thing hooked up to AC power. Sometimes the cord comes loose, but typically when it comes unplugged I notice the sharply reduced performance (and reduced brightness) and plug it back in before it shuts down from the battery reaching too low of a level.


  • 0
5 hours ago, Fleuria said:

Measured how?

Anyways, I run the thing hooked up to AC power. Sometimes the cord comes loose, but typically when it comes unplugged I notice the sharply reduced performance (and reduced brightness) and plug it back in before it shuts down from the battery reaching too low of a level.

Measured? Uhhh I just removed my battery and took note of the fact that the strange behavior (intermittently shutting down) in my laptop stopped.

I mean I suppose there would probably be a way to determine if there is a bad cell by checking the voltages with a DMM but most batteries these days aren't user serviceable and even if they are it wouldn't be advisable. 

You could try running it without the battery, exclusively off the power brick for a while and see if it continues with the current odd behavior. If it stops the behavior then it very well could be something to do with the battery (or maybe charging circuit) if it keeps shutting off intermittently then it probably isn't anything to do with the battery.

I think people have already covered thermal issues, but those are also a distinct possibility. Even if removing the battery does stop the shutdown it could still be thermal because the batteries/charging circuit are a decent source of heat. So removing the battery from the equation could also just be lowering the average temp enough that temp stops being an issue. 

 


  • 0

Ok... I no longer think that "disconnected power cable" is a viable suggestion.

I can run furmark on this thing and gpu power quickly rises to around 100 watts, no crashes though. And, I can unplug the power cable from the back of the machine, and I get a rendering hiccough and my screen goes darker (because I have the machine configured for minimum brightness when on battery power -- I want to notice when the power comes unplugged), but no crashes despite power use dropping to the 25..50 watt range (it's not constant).

So something else is wrong.

The machine crashes with the same apparent problem signature in a variety of games, but does not crash in benchmarks.

So I am thinking that the cause might be a defect in some aspect of the gpu that benchmarks do not normally exercise. (Maybe something having to do with loading textures??)

But that leaves me in a quandary: how can I demonstrate to the technical people that this problem happens? How can they know if they have fixed it? I need some program that they can run (which isn't logging into my account and playing my games) that causes the same crash.

Does anyone have any suggestions on that?


  • 0
On 2019-11-12 at 6:54 PM, Fleuria said:

ran warframe until it crashed (didn't take long -- maybe 10 minutes)

Maybe run Furmark for 10 minutes, then repeatedly press space bar to turn Furmark's donut rendering off and on.

Maybe run Prime95 and Furmark both at the same time. This can test if the PSU, battery or AC adapter can handle all that power.

My AMD RX 580 at the default 1405 MHz core clock: Furmark for 10 minutes, up to my 89 C thermal throttling point, then space bar, space bar, freeze and crash. Warframe may also sometimes freeze and crash the computer, requiring a power cycle to reset. At 1250 MHz core it works perfectly fine without problems.

Maybe reduce the GPU clock speed like I did. For AMD, WattMan under the global Radeon Settings is good enough. For NVIDIA, maybe download a tool like NVIDIA Inspector, which has overclocking options that can also reduce the clock speed.


  • 0
On 2019-11-12 at 7:54 PM, Fleuria said:

GPU never went above 80 watts (this is on geforce 1070 which might burn 450 watts for other people, so a lack of available power could very well be the issue -- I'll have to ask the support people that repaired my machine about that).

The NVIDIA GeForce 1070, as pulled from the NVIDIA website, has a maximum power draw of 180 watts. Needless to say, you're running into power problems. An effective, and permanent, fix would be to ensure the card is fully connected to the motherboard, and purchase a higher-capacity power block, maybe a 750 watt at the very least.


  • 0

Either your GPU is not getting enough power, or your laptop is overheating again. If your battery is removable, then remove the battery and plug in your laptop. Your laptop will then run directly off the mains supply, which should solve the problem if the GPU is not getting enough power.

Though it may be possible that those who serviced your laptop fiddled with your GPU to reduce heating. (Not accusing them, but it's possible)


  • 0
7 hours ago, sam686 said:

Maybe run Furmark for 10 minutes, then repeatedly press space bar to turn Furmark's donut rendering off and on.

Maybe run Prime95 and Furmark both at the same time. This can test if the PSU, battery or AC adapter can handle all that power.

My AMD RX 580 at the default 1405 MHz core clock: Furmark for 10 minutes, up to my 89 C thermal throttling point, then space bar, space bar, freeze and crash. Warframe may also sometimes freeze and crash the computer, requiring a power cycle to reset. At 1250 MHz core it works perfectly fine without problems.

Maybe reduce the GPU clock speed like I did. For AMD, WattMan under the global Radeon Settings is good enough. For NVIDIA, maybe download a tool like NVIDIA Inspector, which has overclocking options that can also reduce the clock speed.

Those are some cute ideas, but they don't crash for me. I can run furmark for hours and toggle the donut all over the place -- no problem. And running p95 in the background doesn't seem to matter, either. Right now I'm drawing about 95 watts on my gpu and about 60 on my cpu and about 7 on my DRAM, and things are behaving. My power supply is only 330 watts, but it's doing fine with this kind of load.

And warframe is crashing when drawing less power than this. (And other games have had the same problem, so ... if it's a bug in warframe it's a bug somehow shared by a variety of other games).
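
The readings above can be added up as a rough sanity check on the power budget. This is a minimal Python sketch using only the three figures quoted in this post and the 330 watt supply rating; the function name is made up, and it ignores the display, drives, fans and conversion losses, so the real total is somewhat higher.

```python
def power_headroom(component_watts, supply_watts):
    """Sum the measured component draws and compare them to the supply rating."""
    total = sum(component_watts.values())
    return {
        "total_w": total,
        "headroom_w": supply_watts - total,
        "load_fraction": total / supply_watts,
    }


# Figures quoted in the post: GPU, CPU and DRAM draw under Furmark + Prime95.
readings = {"gpu": 95, "cpu": 60, "dram": 7}
budget = power_headroom(readings, supply_watts=330)
```

Here `budget["total_w"]` comes to 162 watts, which is roughly half of the supply's rating, so total draw alone does not look like the explanation.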

Edited by Fleuria

  • 0
10 hours ago, Fleuria said:

I can run furmark on this thing and gpu power quickly rises to around 100 watts, no crashes though.

interesting. we have to assume it's probably a Software issue now. (the unstressed parts of the CPU/GPU by certain loads could also be an explanation but the circuitry doesn't generally just suddenly break for no reason)
we can 'skip' the diagnosis part if you say, create a new Partition on your Drive(s), and install another copy of Windows to that Partition. then boot to that and install your Drivers and some games, Et Cetera and see if it still has issues there.
if it doesn't then the road is a long and complicated one to try and stab into the dark as to what Software is misbehaving(as that includes literally all Software from the Operating System to some small tool you install, all of it could theoretically break something), heh.

if that Partition does still express the same issue then......................
uhh....


  • 0
12 minutes ago, taiiat said:

interesting. we have to assume it's probably a Software issue now. (the unstressed parts of the CPU/GPU by certain loads could also be an explanation but the circuitry doesn't generally just suddenly break for no reason)
we can 'skip' the diagnosis part if you say, create a new Partition on your Drive(s), and install another copy of Windows to that Partition. then boot to that and install your Drivers and some games, Et Cetera and see if it still has issues there.
if it doesn't then the road is a long and complicated one to try and stab into the dark as to what Software is misbehaving(as that includes literally all Software from the Operating System to some small tool you install, all of it could theoretically break something), heh.

if that Partition does still express the same issue then......................
uhh....

I have already reinstalled windows twice on this machine, trying that route. And, I've used at least four versions of the drivers with no changes in symptoms. That said, I just now noticed a new driver with a date from four days ago, and have installed that... so... time for another round of tests (and, hopefully, no crashes, but... I am not feeling optimistic).


  • 0

 

12 hours ago, taiiat said:

ok....
all i got left then is resetting the Bios to default, or an inspection of the components to look for some damage that only shows in some load situations?

Or, maybe it's like you were originally suggesting and some connector is the problem... just... instead of being completely disconnected, maybe it's connected but not seated properly.

That laptop form factor means me using the keyboard introduces vibration (and shipping can loosen connectors).

Computers are just so fragile, it's hard to think about how to fix them sometimes.

Edited by Fleuria