[torqueusers] sudden pbs_server & pbs_mom segfaults
dzila at tassadar.physics.auth.gr
Mon Jun 1 02:16:25 MDT 2009
>>> That is expected.
>>> The old database is gone, and with it the job counter.
>> Well I do not mind much about the job counter, though I see that
>> /var/spool/pbs/tmpdir has some leftovers that could cause name
>> collisions in the future. I am in the process of cleaning them up.
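That cleanup might look something like the sketch below. It assumes the tmpdir path quoted above, a one-directory-per-job layout, and a 7-day staleness threshold; the function name is made up for illustration:

```shell
# List per-job scratch directories under the PBS mom tmpdir that have
# not been touched for more than 7 days, so they can be inspected
# before removal. The 7-day threshold is an assumption.
list_stale_job_dirs() {
    tmpdir=$1
    [ -d "$tmpdir" ] || return 0
    find "$tmpdir" -mindepth 1 -maxdepth 1 -type d -mtime +7
}

# Inspect the list first, then remove, e.g.:
#   list_stale_job_dirs /var/spool/pbs/tmpdir | xargs -r rm -rf
list_stale_job_dirs /var/spool/pbs/tmpdir
```

Listing first and removing in a second, explicit step avoids deleting a directory that belongs to a job you still hope will finish.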
>>> Likewise for any pending and running jobs.
>>> You may need to kill the leftover processes on the nodes by hand.
>> I cannot do that because my users would complain; they already complain
>> about being unable to see the status of their jobs.
> Oh! Users ... they complain so often ...
> and ask for a level of perfection that they usually
> don't show themselves ..
That is so true... still, I would probably also complain if I were in
their place. I am talking about regular folks, not the kind of
"Mr Perfect" that tends to be abundant in academic environments.
>> I have marked the
>> nodes with jobs as off-line and I let them know when the running jobs
> On our very old cluster with ancient PBS I tried to do this a few times
> after crashes, with very limited success, even though the database was
> still the same. Some processes may have lost connection to their peers,
> others in the same MPI job may have died, etc.
> However, it may be worth waiting a reasonable amount of time to see if
> they finish (depending on the max walltime of your queues).
> The tradeoff to offer the users is: wait and see if the jobs finish with
> no guarantee of success, or kill everything now and start fresh with
> no additional wasted time.
>> When they do finish, though, I am not sure whether they are going to be
>> moved from /var/spool/pbs/tmpdir to the home directory of the user now
>> that serverdb was recreated... can anyone guess?
> I guess not: since pbs_server lost track of them, they are no longer in
> the database. You may need to move them by hand.
I may be lucky on this one: it looks like, despite our advice, all that
most of the running jobs do is cd /to/shared/home/folder and run
from there.
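For anyone who does have to move leftovers by hand, as suggested above, here is a minimal sketch. The function name, the paths, and the never-overwrite policy are all my assumptions, not anything PBS provides:

```shell
# Copy leftover output files from a job's mom tmpdir into the user's
# home directory, without overwriting anything already present there.
recover_job_output() {
    jobdir=$1
    dest=$2
    [ -d "$jobdir" ] || return 1
    mkdir -p "$dest" || return 1
    ( cd "$jobdir" && find . -type f ) | while IFS= read -r f; do
        target=$dest/${f#./}
        # never overwrite something already in the destination
        [ -e "$target" ] && continue
        mkdir -p "$(dirname "$target")"
        cp "$jobdir/$f" "$target"
    done
}

# Usage, with placeholder paths:
#   recover_job_output /var/spool/pbs/tmpdir/123.server /home/someuser
```

Skipping existing files errs on the safe side: if the job already wrote its results to the shared home directory itself, nothing gets clobbered.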
> BTW, since your PBS database was corrupted, and the reason was not
> determined, you may want to shutdown the (head) node where the
> pbs_server is installed, and check the disk and filesystem integrity.
> Some users may try to bite your jugular for this,
> but if the database gets corrupted again,
> they will come in a pack to do it,
> so it may be worth having some downtime to check the disk.
> Or better, you can schedule periodic maintenance downtime.
> Maybe a different schedule for the head node,
> compute nodes, storage, etc.
> Post a note to all users (/etc/motd ?), so that they are advised
> ahead of time.
> Large clusters may have alternate head nodes, etc, and can stay up
> all the time, but on small clusters
> some downtime is fair (and wise) game.
> Running all the hardware 24/7/365/forever is risky.
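Posting the advance notice to /etc/motd, as suggested, can be as simple as appending a line; the helper name and the wording of the notice below are just an illustration:

```shell
# Append a maintenance notice to the motd so users see it at login.
add_maintenance_notice() {
    motd=$1
    when=$2
    printf 'NOTICE: scheduled maintenance downtime: %s\n' "$when" >> "$motd"
}

# Usage (writing to the real /etc/motd requires root):
#   add_maintenance_notice /etc/motd "Saturday 08:00-12:00"
```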
All these are good ideas, thank you for sharing them. I was advocating
for something like that (regular maintenance), but unfortunately I didn't
have enough time to pursue it. This is a good time to resume that effort, I guess.
For the record, the pbs_server has not crashed again, but last time it
took about a week for the issue to resurface, so we are not in the clear yet.
GridAUTH Operations Centre @ Aristotle University of Thessaloniki , Greece
Tel: +302310998988 Fax: +302310994309