[torqueusers] sudden pbs_server & pbs_mom segfaults

Dimitris Zilaskos dzila at tassadar.physics.auth.gr
Thu May 28 08:17:26 MDT 2009


Hi Gus,

>>
>> It looks like, after recreating serverdb, the job counter has been
>> reset, and already running jobs are invisible to qstat. Can I do
>> something so they show up? I have a backup of the old serverdb.
>>
> Hi Dimitris
> 
> That is expected.
> The old database is gone, and with it the job counter.

Well, I do not mind much about the job counter, though I see that
/var/spool/pbs/tmpdir has some leftovers that could cause name
collisions in the future, since the reset counter will reuse old job IDs.
I am in the process of cleaning them up.

> Likewise for any pending and running jobs.
> You may need to kill the leftover processes on the nodes by hand.

I cannot do that because my users will complain; they already complain
about being unable to see the status of their jobs. I have marked the
nodes with jobs as offline, and I will let the users know when the running
jobs finish. When they do finish, though, I am not sure whether their
output will be moved from /var/spool/pbs/tmpdir to the user's home
directory now that serverdb has been recreated. Can anyone guess?

> I recreated the database a few times here, after the whole Torque+Maui
> was working, just to start the job counter fresh.
> 
> I am not sure whether the old mom, server, and scheduler logs
> are preserved, but they may be.

Cheers,


-- 
=============================================================================
Dimitris Zilaskos
GridAUTH Operations Centre @ Aristotle University of Thessaloniki , Greece
Tel: +302310998988 Fax: +302310994309
http://www.grid.auth.gr
=============================================================================

