[torqueusers] sudden pbs_server & pbs_mom segfaults

Gus Correa gus at ldeo.columbia.edu
Thu May 28 08:56:39 MDT 2009


Hi Dimitris

Dimitris Zilaskos wrote:
> Hi Gus,
> 
>>>
>>> Looks like, after recreating serverdb, the job counter has been
>>> reset, and already running jobs are invisible to qstat. Can I do
>>> something so they show up? I have a backup of the old serverdb.
>>>
>> Hi Dimitris
>>
>> That is expected.
>> The old database is gone, and with it the job counter.
> 
> Well I do not mind much about the job counter, though I see that 
> /var/spool/pbs/tmpdir has some leftovers that could cause name 
> collisions in the future. I am in the process of cleaning them up.
> 
>> Likewise for any pending and running jobs.
>> You may need to kill the leftover processes on the nodes by hand.
> 
> I cannot do that because my users will complain; they already complain 
> about being unable to see the status of their jobs. 

Oh! Users ... they complain so often ...
and ask for a level of perfection that they usually
don't show themselves ...

> I have marked the 
> nodes with jobs as offline and will let them know when the running jobs 
> finish. 

On our very old cluster with ancient PBS I tried to do this a few times
after crashes, with very limited success, even though the database was 
still the same.  Some processes may have lost the connection to their peers,
others in the same MPI job may have died, etc.
However, it may be worth waiting a reasonable amount of time to see if 
they finish (depending on your queues' max walltime).

The tradeoff to offer the users is: wait and see if the jobs finish with
no guarantee of success, or kill everything now and start fresh with
no additional wasted time.
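
Either way, before deciding it may help to see what is actually still
running on the nodes.  Just as a rough sketch (Python; it assumes
passwordless ssh from the head node, and the node names and user name
below are placeholders, not from your setup), something like this could
poll the nodes for leftover processes:

#!/usr/bin/env python
# Rough sketch: list processes still owned by a given user on each
# compute node, so you can decide whether to wait for them or kill them.
# Assumes passwordless ssh from the head node; node names and the user
# name are placeholders -- adapt to your site.

import subprocess

NODES = ["node01", "node02", "node03"]   # placeholder node names
USER = "someuser"                        # placeholder owner of the orphaned job

for node in NODES:
    try:
        out = subprocess.check_output(
            ["ssh", node, "ps", "-u", USER, "-o", "pid,etime,comm"],
            stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError:
        # ps exits non-zero when the user has no processes on that node
        print("%s: no processes left for %s" % (node, USER))
        continue
    print("%s:\n%s" % (node, out.decode()))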

> When they do finish though, I am not sure if they are going to get 
> moved from /var/spool/pbs/tmpdir to the home directory of the user now 
> that serverdb was recreated... can anyone guess?

I guess not: the pbs_server lost track of them, and they are no longer in
the database.  You may need to move them by hand.
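
If it comes to that, here is a rough sketch of what "by hand" could look
like (Python; the tmpdir path is the one you mention, but picking the
destination from file ownership is just my assumption, and it only prints
what it would do until you uncomment the copy part):

#!/usr/bin/env python
# Rough sketch: copy leftover job files from the mom's tmpdir back to
# their owners' home directories.  The tmpdir path is the one from this
# thread; choosing the destination from file ownership is my assumption.
# Dry run by default: it only prints, the copy lines are commented out.
# Run on each affected node after checking the output.

import os
import pwd
import shutil

TMPDIR = "/var/spool/pbs/tmpdir"

for entry in os.listdir(TMPDIR):
    src = os.path.join(TMPDIR, entry)
    owner = pwd.getpwuid(os.stat(src).st_uid)
    dest = os.path.join(owner.pw_dir, "recovered_jobs", entry)
    print("would copy %s -> %s (owner %s)" % (src, dest, owner.pw_name))
    # Uncomment after checking the dry run:
    # if not os.path.isdir(os.path.dirname(dest)):
    #     os.makedirs(os.path.dirname(dest))
    # if os.path.isdir(src):
    #     shutil.copytree(src, dest)
    # else:
    #     shutil.copy2(src, dest)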

BTW, since your PBS database was corrupted, and the reason was not 
determined, you may want to shut down the (head) node where the 
pbs_server is installed, and check the disk and filesystem integrity.
Some users may try to bite your jugular for this,
but if the database gets corrupted again,
they will come in a pack to do it,
hence it may be worth having some downtime and checking the disk.

Or better, you can schedule periodic maintenance downtime.
Maybe a different schedule for the head node,
compute nodes, storage, etc.
Post a note to all users (/etc/motd ?), so that they are advised
ahead of time.
Large clusters may have alternate head nodes, etc, and can stay up
all the time, but on small clusters
some downtime is fair (and wise) game.
Running all the hardware 24/7/365/forever is risky.

Good luck!

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

> 
>> I recreated the database a few times here, after the whole Torque+Maui
>> was working, just to start the job counter fresh.
>>
>> Not sure if the old mom, server, and scheduler logs
>> are preserved, though, but they may be.
>>
>>
> 
> Cheers,
> 
> 


