Uncommanded shutdowns

Is there a log made somewhere every time the computer shuts down?
If not, can I set some debug level somewhere?
The reason I ask is that after a while (the period is getting shorter every day) the computer just shuts down of its own accord.
If it’s the motherboard then fine, I’ll get a new one, but what if it’s not? It could be a drive or a card…
Can I get Ubuntu to tell me what the last thing that happened was? (I know, horrible grammar.)

When you say “shuts down”, do you mean an orderly shutdown, or does it just power off without shutting down?

It’s a “normal” orderly shutdown, as though I had actually asked it to.
So I’m thinking now is the time to find out why before it decides otherwise…

The fan is working fine, and I’ve checked that the power settings are set to Shutdown/hibernate: Never.
Ubuntu Lucid, by the way. It’s been working fine with Lucid since the distro came out; it’s just the last 3 or 4 days I’ve had the problem.

I’ve not added any applications for a few months, so methinks a bit of hardware. It would just be annoying if I change the motherboard only to find out the NIC was overheating or something.

OK, first port of call should be the Log File Viewer, and see if there is anything in syslog or messages with a timestamp around when the system shuts down.
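If you'd rather search the log than scroll through it, something like the following might help. It is only a rough sketch: the phrases it looks for are my guesses at what a shutdown-related line could contain, not a complete list, and `/var/log/syslog` is assumed to be where your syslog lives.

```python
import re

# Assumed phrases that may accompany a shutdown in syslog; extend as needed.
SHUTDOWN_PATTERNS = [
    re.compile(r"critical temperature reached", re.IGNORECASE),
    re.compile(r"shutting down", re.IGNORECASE),
    re.compile(r"kernel logging \(proc\) stopped", re.IGNORECASE),
]


def find_shutdown_hints(lines):
    """Return the log lines that match any of the shutdown-related patterns."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in SHUTDOWN_PATTERNS)
    ]


def scan_log(path="/var/log/syslog"):
    """Scan a log file on disk; the default path is an assumption."""
    with open(path, errors="replace") as handle:
        return find_shutdown_hints(handle)
```

Calling `scan_log()` (possibly with sudo, depending on the log's permissions) returns the matching lines in file order, so the last few entries are the ones closest to the most recent shutdown.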

If it’s not an overheating issue, it sounds like it’s probably “power management” related… as it’s a desktop PC, I’d try either stopping the gnome-power-manager process in System Monitor, or stopping it from starting at bootup in Startup Applications… if this makes a difference, turn it back on and attempt to debug the issue… there is a page on logging power management events here:

https://wiki.ubuntu.com/DebuggingGNOMEPowerManager

Thanks.
Found the syslog. The last 4 lines are:
Jul 8 11:21:01 office sm-msp-queue[2219]: unable to qualify my own domain name (office) -- using short name
Jul 8 11:30:01 office CRON[2335]: (brafferton) CMD (curl "http://www.ehabich.info/images/synchro/europecrescent.jpg" -o "europecrescent.jpg" >/dev/null 2>&1)
Jul 8 11:39:24 office kernel: [ 2084.681031] Critical temperature reached (78 C), shutting down.
Jul 8 11:39:24 office kernel: Kernel logging (proc) stopped.

So it’s out with screwdriver, pliers, hammer, spanner, whip the heatsink off, clean it and change the heatsink compound.
I’ll see if that cures it.

Mind you, is 78C a bit low?

Ubuntu is getting the critical shutdown temp settings (S5) from your motherboard through ACPI.

I would have thought that 78C was about right… check to see what temp the BIOS says your CPU is running at… if there is a clear discrepancy and your CPU is obviously running cool, you could try either the:

acpi=off

kernel boot parameter (which effectively stops the kernel listening to the BIOS ACPI), or the:

thermal.nocrt=1

kernel boot parameter (which disables the thermal “critical” shutdown).

BE AWARE - Turning thermal shutdown OFF is “high risk” and could result in a fried CPU. Only turn it off if you are 100% SURE the thermal shutdown isn’t working properly.
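After rebooting with either parameter, it may be worth confirming the kernel actually picked it up. Linux exposes the running command line in `/proc/cmdline`; the sketch below just splits it into names and values. The function names are made up for illustration, and it does not handle quoted values containing spaces.

```python
def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def running_cmdline(path="/proc/cmdline"):
    """Read and parse the command line of the currently running kernel."""
    with open(path) as handle:
        return parse_cmdline(handle.read())
```

For example, `running_cmdline().get("thermal.nocrt")` should give `"1"` if that parameter is in effect, and `None` if it is absent.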

In either case installing lm-sensors and using it to keep an eye on your CPU temp would be a good idea… though whether it would be recording the correct values if there IS an ACPI issue is another matter.
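If you want to log the readings rather than eyeball them, the text that `sensors` prints can be parsed. This is a minimal sketch that assumes lines shaped like `temp1: +53.0°C (crit = +75.0°C)`; the layout varies between chips and lm-sensors versions, so treat the regular expression as a starting point, not a definitive parser.

```python
import re

# Assumed line shape: "<label>: +<temp>°C" optionally followed by "(crit = +<temp>°C)".
LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s*\+?(?P<temp>-?\d+(?:\.\d+)?)\s*°C"
    r"(?:.*?crit\s*=\s*\+?(?P<crit>-?\d+(?:\.\d+)?)\s*°C)?"
)


def parse_sensors(text):
    """Return a list of (label, temperature, critical_or_None) tuples."""
    readings = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            crit = match.group("crit")
            readings.append((
                match.group("label").strip(),
                float(match.group("temp")),
                float(crit) if crit is not None else None,
            ))
    return readings


def margin_to_critical(readings):
    """Degrees of headroom for each reading that reports a critical limit."""
    return {
        label: crit - temp
        for label, temp, crit in readings
        if crit is not None
    }
```

Feeding it the captured output of `sensors` (for instance via `subprocess.check_output(["sensors"], text=True)`) gives the headroom per sensor; a margin that keeps shrinking under load would point back at cooling.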

Depending on the mobo, there is also a chance a BIOS update may fix the issue… though this too isn’t without risk.

BTW, high CPU temps can also be caused by a bad/failing power supply, where the voltages are unstable… so if you have a spare, you may want to try that too.

I think I’ll start looking at the BIOS temp as you suggested.
Cleaned the innards and repasted the heatsink - lasted half an hour

I did notice on the system monitor that CPUs 1 & 2 were at 100% for 90% of the time, yet the processes only showed system monitor working, everything else idle.
So in BIOS, I expect no shutdown, i.e. whatever Linux has as a hidden process won’t be on when I leave the computer on the BIOS setup page.

Does top tell you what’s using all the system resources?

Open a terminal and enter

top

because both cores at 100% for 90% of the time doesn’t sound right at all.
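top gives the per-process picture. To double-check the per-core load that System Monitor reported, the counters in `/proc/stat` can be sampled twice and compared. A rough sketch, assuming the usual `cpuN user nice system idle iowait irq softirq steal ...` field order and counting iowait as idle:

```python
import time


def parse_proc_stat(text):
    """Return {core_name: (idle_ticks, total_ticks)} for each per-core 'cpuN' line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and anything that is not a per-core line.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            ticks = [int(value) for value in fields[1:]]
            idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
            # Only the first eight fields are summed; guest time is already in user.
            cores[fields[0]] = (idle, sum(ticks[:8]))
    return cores


def busy_percent(before, after):
    """Percentage of non-idle time per core between two parsed samples."""
    usage = {}
    for name in before.keys() & after.keys():
        d_idle = after[name][0] - before[name][0]
        d_total = after[name][1] - before[name][1]
        if d_total > 0:
            usage[name] = 100.0 * (d_total - d_idle) / d_total
    return usage


def sample(interval=1.0, path="/proc/stat"):
    """Take two readings of /proc/stat 'interval' seconds apart."""
    with open(path) as handle:
        first = parse_proc_stat(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        second = parse_proc_stat(handle.read())
    return busy_percent(first, second)
```

Running `sample()` in a loop while the machine is "idle" would show whether both cores really are pegged, independently of what the process list claims.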

Weird!
On restarting (to do the suggested “top”) the computer did a disk check.
Eventually, it continued startup as normal, so I “topped” it at about 7pm, and it’s still running at 10:30.
Not quite sure what’s happened, but top shows only Xorg, sysmon and Firefox - the only 3 applications running, and all is well so far with both CPUs around 20% each.
Doesn’t really tell us what was wrong in the first place, or how it miraculously repaired itself, but what the heck!
I’ll let it run overnight and report tomorrow…

Left on overnight and going fine…

I think I found something…
Clamscan uses 95-100% for quite a long time when I’m browsing the web and typing this post…
So I’ll switch Clamscan off for a while and see what happens.

Is that on both cores? … That seems a bit intense even for Clam. :o

You’re right. It was one CPU only, so the investigation continued…
Sitting square-eyed for about half an hour not blinking once :o I noticed boinc had come up.
Straight away, both CPUs were at 100%.
Shortly after that, about three or four minutes, it shut down.
Boinc uses the idle period to compute whether there’s extraterrestrial intelligence somewhere out there, 'cos there sure as hell ain’t any on Earth.
As luck would have it, boinc was idle last night due to some problem with their website, so nothing happened.
I’m pretty sure it’s boinc, so I’ve removed it.

Thanks for the help. I’d have spent another couple of hundred quid, and never have known the real fault as I wouldn’t have loaded boinc on the new mobo.

Will keep the damn thing on until midnight tonight, but I don’t expect any more problems (famous last words!)

Cheers Mark.

Mind you, is 78C a bit low?

Eeek! Once you start running your chip over 60 degrees or so, you’ll start to seriously shorten the chip’s lifespan. If you look at the spec of the chip (assuming you built from parts) it’ll tell you the min/max thresholds in the tech spec.

What you should really be looking for on a new machine with a powerful CPU in a ‘residential’ environment is 30-35 when idle and a 60 top-end.
If you’re over this, typically the air-flow in your box is deficient.

If you’re worried about noise, take a look at QuietPC.com - they’ve a range of quiet / silent CPU fans to fit pretty much anything. They’re not cheap and you might find specific units cheaper, but they have a good catalogue.

If on the other hand you’re building a 2U server and noise isn’t an issue, I use these …

http://uk.farnell.com/sunon/pmd1208pmb1-a-2-gn/fan-80x80x38mm-12vdc/dp/1568485

Stick 4 of these in a 2U and you can cool pretty much anything, no matter how bad the airflow design of the case … :)

Hello Mad.
Now that everything is “fixed”, after a few hours on, this is what I get from “sensors”:

Adapter: Virtual device
temp1: +53.0°C (crit = +75.0°C)

k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +31.0°C
Core0 Temp: +28.0°C
Core1 Temp: +29.0°C
Core1 Temp: +17.0°C

So do I still have a problem?