exp wrote:And yes, I installed a software watchdog last time, but for some reason it does not seem to take action x_X
I coded some other stuff now which I will install once the server is back up.
Note if you are going to write your own watchdog: it should run at higher priority (possibly with "realtime" scheduling), be small, and should detect its own unexpected behaviour and commit suicide if that happens. Also, it must handle SIGSEGV (which is not simple, since your stack and heap may be f***ed up).
exp wrote:And for those who like numbers:
this is the current situation @ server
sysload: 674.82 412.29 574.67 - processes running: 4147/11685 - swap used: 997 MB
Err... how did you get this information without shell or web access?
So something is varying... in the last 5 minutes we had less load than in the last 15 minutes, yet in the last 1 minute we had more load than in the last 15 minutes. I couldn't understand the two process numbers... perhaps current and maximum?
But your swap is way too big, and here we have a real problem. Looks like you have about 1GB of swap, and this is what is limiting your system load. 1GB is too much swap for current harddisk technology. People tend to follow that outdated rule that you should have twice as much swap space as you have RAM. This is no longer true, since RAM modules are getting larger and faster at a much higher rate than HDs are getting faster.
Do yourself a small experiment: write a 1GB file to your disk, then read it back, and measure how long it takes (you probably already know, presuming that you regularly hash your files). Now imagine that space being used as virtual RAM for a couple of processes that allocated too much memory in an unpredictable way, being swapped in and out all the time... This can surely bring your system down. Linux is pretty stupid about swapping... if you give it enough rope, it will hang itself.
Currently, I recommend you to think twice before setting up more than 256MB or 512MB of swap. A swap of 1GB is only suitable for >=2GB of RAM, if you really need it and have a good SCSI 160MB/s HD. A swap of 2GB is only for >=8GB of RAM on a RAID-1 array with 320MB/s throughput. Even in these cases, you have to have good control over the behaviour of your processes (tip: use ulimit).
In all other cases, having a small swap is better. In a swap-intensive situation, where processes allocate memory out of control, they will fill the swap in less time and get killed sooner, without raising the system load. Also, with a smaller swap space the harddisk heads won't have to move as much in a swap-intensive situation, so system performance won't degrade as badly.
Finally, now that you said the swap is full, and I presume you have been in a swap-intensive situation with a high system load for more than 24 hours, you have yet another problem: hardware stress. Even the toughest harddisks don't survive for long when writing constantly to the same region of the disk for so much time. The magnetic surface of the disk degrades a tiny bit on each rewrite (bad sectors being the result in the long run).
Even if you can't reboot it, turning it off until the admin gets back (i.e. call his mother and ask her to turn off his room's circuit breaker, whatever...) is better than leaving it in its current state.
exp wrote:Problem is that I would need to update the kernel to really use that. And I don't dare to do that, because the system might no longer boot up for good then. And as you can see the response times of the local admin are pretty bad
You know that these days you have the option to run a Linux inside another Linux, right? Although it is not easy to get right, it gives you finer control over what look like virtual servers inside a single machine.
Best regards.