[torqueusers] mom server communication failings.

David Groep davidg at nikhef.nl
Wed Aug 4 01:25:44 MDT 2004


Hi Steve,

Steve Traylen wrote:
 > ...
 > Looking at this some more it looks like a job can get into a state that
 > it can never get out of.
 >
 > The exec_host appears to have been set during the first execution attempt
 > when the pbs_mom rejected. So now the job is queued with an exec_host set
 >

I have no solution for the stability problems, but at least you
should be able to move your jobs away from the stuck nodes.
I found this recipe from Davide on the NIKHEF NDPF syslog list:

> Message-Id: <200407161159.i6GBx09X000743 at arwen.nikhef.nl>
> From: "Davide Salomoni" <Davide.Salomoni at nikhef.nl>
> To: <ndpf-syslog at nikhef.nl>
> Date: Fri, 16 Jul 2004 13:59:11 +0200
> Subject: [ndpf-syslog] running a pbs job on a node different from the one
> 	allocated
> A torque recipe:
> 
> today I had the case where torque for some reason got confused: it was
> marking a node that I physically removed yesterday from the farm as "free"
> and, consequently, it had allocated a job to be run on that node.
> 
> I restarted torque and things got better (the removed node was now marked as
> "down"), but the job in question had been allocated the dead node already
> and restarting torque did not change that.
> 
> Obviously the job could not run and was therefore put in the "idle" state. 
> If you tried to run the job explicitly with qrun, you'd get the message 
> 
> [root@tbn18 pytail]# qrun 72924
> qrun: Resource temporarily unavailable 72924.tbn18.nikhef.nl
> 
> The solution was to find a free node (with 'pbsnodes -a') and then move the
> job to that node with
> 
> [root@tbn18 pytail]# qrun -H node16-5.farmnet.nikhef.nl 72924
> 
> I checked on the worker node (node16-5), and the job actually ran there.
> 
> Davide
> 
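
If you have more than a handful of stuck jobs, picking the free node
by hand gets tedious. Below is a rough sketch (not part of Davide's
recipe) of how the "find a free node, then qrun -H" step could be
scripted. It assumes the usual 'pbsnodes -a' layout: the node name
starts in the first column and its attributes follow on indented
"key = value" lines. The function names free_nodes and qrun_command
are made up for illustration. The sketch only builds the command and
does not run anything itself.

```python
def free_nodes(pbsnodes_output):
    """Return the names of nodes whose state is exactly 'free'.

    Assumes 'pbsnodes -a' style text: a node name in the first column,
    followed by indented 'key = value' attribute lines.
    """
    nodes = []
    current = None
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            # A blank line ends the current node record.
            current = None
        elif not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
        elif current is not None:
            key, _, value = line.strip().partition("=")
            # Only accept plain 'free', not combined states such as
            # 'down,offline' or 'job-exclusive'.
            if key.strip() == "state" and value.strip() == "free":
                nodes.append(current)
    return nodes


def qrun_command(job_id, node):
    """Build the argument list for 'qrun -H <node> <job_id>'."""
    return ["qrun", "-H", node, job_id]
```

Feed free_nodes() the text printed by 'pbsnodes -a' on the server,
take one of the returned names, and run the command that
qrun_command() builds as root, the same way as the manual qrun above.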

Hope this helps a bit.

	David Groep.

-- 
David Groep

** National Institute for Nuclear and High Energy Physics, PDP/Grid group **
** Room: H1.56 Phone: +31 20 5922179, PObox 41882, NL-1009DB Amsterdam NL **

