[torqueusers] mom server communication failings.
davidg at nikhef.nl
Wed Aug 4 01:25:44 MDT 2004
Steve Traylen wrote:
> Looking at this some more it looks like a job can get into a state that
> it can never get out of.
> The exec_host appears to have been set during the first execution attempt
> when the pbs_mom rejected. So now the job is queued with an exec_host set
No solution for the stability problems, but at least you
should be able to move your jobs away from the stuck nodes.
I found this solution by Davide on the NIKHEF NDPF syslog list:
> Message-Id: <200407161159.i6GBx09X000743 at arwen.nikhef.nl>
> From: "Davide Salomoni" <Davide.Salomoni at nikhef.nl>
> To: <ndpf-syslog at nikhef.nl>
> Date: Fri, 16 Jul 2004 13:59:11 +0200
> Subject: [ndpf-syslog] running a pbs job on a node different from the one
> A torque recipe:
> today I had the case where torque for some reason got confused: it was
> marking a node that I physically removed yesterday from the farm as "free"
> and, consequently, it had allocated a job to be run on that node.
> I restarted torque and things got better (the removed node was now marked as
> "down"), but the job in question had been allocated the dead node already
> and restarting torque did not change that.
> Obviously the job could not run and was therefore put in the "idle" state.
> If you tried to run the job explicitly with qrun, you'd get the message
> [root at tbn18 pytail]# qrun 72924
> qrun: Resource temporarily unavailable 72924.tbn18.nikhef.nl
> The solution was to find a free node (with 'pbsnodes -a') and then move the
> job to that node with
> [root at tbn18 pytail]# qrun -H node16-5.farmnet.nikhef.nl 72924
> I checked on the worker node (node16-5), and the job actually ran there.
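The recipe above can be scripted: parse `pbsnodes -a`-style output to pick the
first node marked "free", then hand the stuck job to it with `qrun -H`. This is
only a sketch; the sample output and node names below are assumptions, and on a
real server the node list would come from running `pbsnodes -a` directly.

```shell
# Sample of pbsnodes -a output (an assumption, for illustration only):
# node names start at column 1, attributes are indented.
pbsnodes_output='node16-4.farmnet.nikhef.nl
     state = down
node16-5.farmnet.nikhef.nl
     state = free'

# Pick the first node whose state line reads "free".
free_node=$(printf '%s\n' "$pbsnodes_output" | awk '
  /^[^ ]/ { node = $1 }          # unindented line: remember node name
  /state = free/ { print node; exit }')

echo "$free_node"
# On a live server you would then force the job onto that node, e.g.:
#   qrun -H "$free_node" 72924
```

The `awk` script relies only on the indentation convention of `pbsnodes -a`
output, so it should work regardless of which other attributes each node lists.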
Hope this helps a bit.
** National Institute for Nuclear and High Energy Physics, PDP/Grid group **
** Room: H1.56 Phone: +31 20 5922179, PObox 41882, NL-1009DB Amsterdam NL **