[torqueusers] sudden pbs_server & pbs_mom segfaults

Dimitris Zilaskos dzila at tassadar.physics.auth.gr
Mon Jun 1 02:16:25 MDT 2009


>>> That is expected.
>>> The old database is gone, and with it the job counter.
>> Well I do not mind much about the job counter, though I see that 
>> /var/spool/pbs/tmpdir has some leftovers that could cause name 
>> collisions in the future. I am in the process of cleaning them up.
>>
>>> Likewise for any pending and running jobs.
>>> You may need to kill the leftover processes on the nodes by hand.
>> I cannot do that because my users will complain; they already complain 
>> about being unable to see the status of their jobs. 
> 
> Oh! Users ... they complain so often ...
> and ask for a level of perfection that they usually
> don't show themselves ..

That is so true... still, I would probably also complain if I were in 
their place. I am talking about regular folks, not the kind of 
"Mr Perfect" that tends to be abundant in academic environments.
> 
>> I have marked the 
>> nodes with jobs as off-line and I let them know when the running jobs 
>> finish. 
> 
> On our very old cluster with ancient PBS I tried to do this a few times
> after crashes, with very limited success, even though the database was 
> still the same.  Some processes may have lost connection to their peers,
> others on a same MPI job may have died, etc.
> However, it may be worth waiting a reasonable amount of time to see if 
> they finish (depending on what you expect from your queues max walltime).
> 
> The tradeoff to offer the users is: wait and see if the jobs finish with
> no guarantee of success, or kill everything now and start fresh with
> no additional wasted time.
> 
>> When they do finish, though, I am not sure whether they are going to 
>> get moved from /var/spool/pbs/tmpdir to the home directory of the user 
>> now that serverdb was recreated... can anyone guess?
> 
> I guess not, as the pbs_server lost track of them, they are no longer on
> the database.  You may need to do it by hand.
> 
I may be lucky on this one: it looks like, despite our advice, all most 
of the running job scripts do is cd /to/shared/home/folder and run from 
there, so their output should already be on the shared filesystem.
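For any jobs that did write into the spool tmpdir, moving the leftovers by hand is straightforward. A rough sketch (the spool path, the per-job directory layout, and the "directory owner equals job owner" convention are all assumptions about a typical Torque install; verify before running as root):

```shell
# Hedged sketch: salvage leftover job directories from the PBS spool
# tmpdir into each owner's home directory. Copy, don't move, until the
# results are verified.
salvage_tmpdir() {
    spool=$1    # e.g. /var/spool/pbs/tmpdir
    homes=$2    # e.g. /home

    for dir in "$spool"/*/; do
        [ -d "$dir" ] || continue
        owner=$(stat -c %U "$dir")      # dir owner is usually the job owner
        dest="$homes/$owner/salvaged-$(basename "$dir")"
        mkdir -p "$dest"
        cp -a "$dir"/. "$dest"/         # preserve times/modes where possible
    done
}

# Example (as root on the server):
#   salvage_tmpdir /var/spool/pbs/tmpdir /home
```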

> BTW, since your PBS database was corrupted, and the reason was not 
> determined, you may want to shutdown the (head) node where the 
> pbs_server is installed, and check the disk and filesystem integrity.
> Some users may try to bite your jugular for this,
> but if the database gets corrupted again,
> they will come in a pack to do it,
> hence it may be worth taking some downtime to check the disk.
> 
> Or better, you can schedule periodic maintenance downtime.
> Maybe a different schedule for the head node,
> compute nodes, storage, etc.
> Post a note to all users (/etc/motd ?), so that they are advised
> ahead of time.
> Large clusters may have alternate head nodes, etc, and can stay up
> all the time, but on small clusters
> some downtime is fair (and wise) game.
> Running all the hardware 24/7/365/forever is risky.
> 
>


All these are good ideas, thank you for sharing them. I was advocating 
for something like that (regular maintenance), but unfortunately I didn't 
have enough time to pursue it. This seems like a good time to resume the 
effort.
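For the announcement side of that, a minimal sketch: append a notice to /etc/motd so users see it at login, then drain the nodes ahead of the window with pbsnodes -o. The wording and the dates are placeholders, not a real schedule:

```shell
# Hedged sketch: announce a maintenance window in the motd file.
announce_maintenance() {
    motd=$1; start=$2; end=$3
    {
        echo "*** Scheduled cluster maintenance: $start - $end ***"
        echo "*** Queued jobs will be held; running jobs that   ***"
        echo "*** cannot finish before $start may be killed.    ***"
    } >> "$motd"
}

# Example:
#   announce_maintenance /etc/motd "2009-06-15 09:00" "2009-06-15 13:00"
#   pbsnodes -o node01 node02   # mark nodes offline so no new jobs start
#   # during the window, on the head node: unmount and fsck the spool disk
```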

For the record, the pbs_server has not crashed again, but last time it 
took about a week for the issue to resurface, so we are not out of the 
woods yet.

Cheers,

-- 
=============================================================================
Dimitris Zilaskos
GridAUTH Operations Centre @ Aristotle University of Thessaloniki , Greece
Tel: +302310998988 Fax: +302310994309
http://www.grid.auth.gr
=============================================================================

