[torqueusers] mom server communication failings.

Steve Traylen s.traylen at rl.ac.uk
Mon Aug 9 11:53:46 MDT 2004


On Wed, Aug 04, 2004 at 09:25:44AM +0200 or thereabouts, David Groep wrote:
> Hi Steve,
> 
> Steve Traylen wrote:
> > ...
> >Obviously the job could not run and was therefore put in the "idle" state. 
> >If you tried to run the job explicitly with qrun, you'd get the message 
> >
> >[root at tbn18 pytail]# qrun 72924
> >qrun: Resource temporarily unavailable 72924.tbn18.nikhef.nl
> >
> >The solution was to find a free node (with 'pbsnodes -a') and then move the
> >job to that node with
> >
> >[root at tbn18 pytail]# qrun -H node16-5.farmnet.nikhef.nl 72924
> >
> >I checked on the worker node (node16-5), and the job actually ran there.

Hi, there is a warning that goes with doing that.

If we have a maui.log failure such as

   MPBSJobModify(33180,Resource_List,Resource,lcg0600.gridpp.rl.ac.uk)
   job '33180' cannot be started: (rc: 15070  errmsg: 'Server could  
         not connect to MOM'  hostlist: 'lcg0600.gridpp.rl.ac.uk')


   and then we tell the job to go somewhere else, e.g.

   qrun -H lcg0612.gridpp.rl.ac.uk 33177

   then maui gets confused about where the job is and how many CPUs
   it is using: maui thinks the job is still on the first host, the
   one it failed against, while pbs thinks it is on the second host.

   $ checkjob 33180
       Allocated Nodes:
       [lcg0600.gridpp.rl.ac:1]

   $ checknode lcg0600.gridpp.rl.ac.uk
       Reservations:
       Job '32924'(x1)  -2:22:57 -> 11:21:37:03 (12:00:00:00)
       Job '33180'(x1)  -00:04:12 -> 11:23:55:48 (12:00:00:00)
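
   A rough way to spot this kind of disagreement is to compare the
   host listed under "Allocated Nodes:" in checkjob with the exec_host
   line that qstat -f reports for the same job. The sketch below only
   parses text you have already captured. The function names are made
   up for illustration, and the exec_host layout (host/cpu joined by
   "+") is an assumption about the qstat -f output, so check it against
   your own install. checkjob may truncate long host names, as in the
   output above, so hosts are compared by prefix.

```python
import re


def maui_hosts(checkjob_text):
    """Return host names listed under 'Allocated Nodes:' in checkjob output."""
    hosts = []
    in_block = False
    for line in checkjob_text.splitlines():
        if line.strip().startswith("Allocated Nodes:"):
            in_block = True
            continue
        if in_block:
            # Entries look like [hostname:taskcount]
            found = re.findall(r"\[([^\]:]+):\d+\]", line)
            if not found:
                break
            hosts.extend(found)
    return hosts


def pbs_hosts(qstat_text):
    """Return host names from an 'exec_host = host/cpu+host/cpu' line (assumed layout)."""
    m = re.search(r"exec_host\s*=\s*(\S+)", qstat_text)
    if not m:
        return []
    return sorted({part.split("/")[0] for part in m.group(1).split("+")})


def same_host(a, b):
    # checkjob may truncate long names, so accept a prefix match
    return a.startswith(b) or b.startswith(a)


def disagreements(checkjob_text, qstat_text):
    """Hosts that maui has allocated but that pbs does not list for the job."""
    pbs = pbs_hosts(qstat_text)
    return [h for h in maui_hosts(checkjob_text)
            if not any(same_host(h, p) for p in pbs)]
```

   An empty list from disagreements() means both sides agree on where
   the job is; anything else is a host that maui still counts against
   the job while pbs runs it elsewhere.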

   The important bit here appears to be these two log lines, and maybe
   maui is at fault: if it modifies a job resource and the change then
   cannot be carried out, it should set the resource back to what it was.

   MPBSJobModify(33180,Resource_List,Resource,lcg0600.gridpp.rl.ac.uk)
   job '33180' cannot be started: (rc: 15070  errmsg: 'Server could  
         not connect to MOM'  hostlist: 'lcg0600.gridpp.rl.ac.uk')
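
   To find the jobs that are at risk before anyone runs qrun -H on
   them, one option is to scan maui.log for these failed starts. This
   is only a sketch: the pattern is modelled on the two entries quoted
   above and nothing else, failed_starts is a hypothetical helper name,
   and other maui versions may word the message differently.

```python
import re

# Modelled on the maui.log entry quoted above; whitespace (including a
# wrapped line inside the error message) is matched loosely.
PATTERN = re.compile(
    r"job '(?P<job>\d+)' cannot be started: \(rc: (?P<rc>\d+)\s+"
    r"errmsg: '(?P<errmsg>[^']*)'\s+hostlist: '(?P<hosts>[^']*)'\)"
)


def failed_starts(log_text):
    """Return (job id, rc, error message, hostlist) for each failed start."""
    results = []
    for m in PATTERN.finditer(log_text):
        # Collapse runs of whitespace left over from line wrapping
        errmsg = " ".join(m.group("errmsg").split())
        results.append((m.group("job"), int(m.group("rc")), errmsg, m.group("hosts")))
    return results
```

   Any job returned with the "Server could not connect to MOM" message
   is one where maui may still hold the failed host against the job.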

   Steve

> >
> >Davide
> >
> 
> Hope this helps a bit.
> 
> 	David Groep.
> 
> -- 
> David Groep
> 
> ** National Institute for Nuclear and High Energy Physics, PDP/Grid group **
> ** Room: H1.56 Phone: +31 20 5922179, PObox 41882, NL-1009DB Amsterdam NL **
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://supercluster.org/mailman/listinfo/torqueusers

-- 
Steve Traylen
s.traylen at rl.ac.uk
http://www.gridpp.ac.uk/

