[torqueusers] jobs assigned to node, but fail to run
Ronald Starink
ronalds at nikhef.nl
Sat Feb 4 03:12:57 MST 2006
We haven't tried the upgrade yet. It takes some planning because it is a
pretty large cluster. If you think that 2.0.0p7 contains changes that
may resolve this, we will certainly give it a try.
Thanks,
Ronald
Garrick Staples wrote:
> Is this reproducible with 2.0.0p7?
>
> On Wed, Feb 01, 2006 at 04:17:06PM +0100, Ronald Starink alleged:
>> Hi,
>>
>> We have this frequently recurring problem that jobs get assigned to a
>> node according to the Torque server, but remain in state Queued.
>> Unfortunately, this also blocks scheduling of other queued jobs, leaving
>> us with a mostly empty cluster and a full queue. The MOM log file
>> repeatedly says that a job is in unexpected state TRANSICM.
>>
>> Below I have added a detailed overview of information from log files. In
>> this case, a simple restart of the MOM resolved the problem. However, we
>> see this problem several times per week (or even per day). Sometimes, Torque
>> becomes unresponsive, i.e. timeouts when running qstat or pbsnodes. The
>> only solution then is to kill (-9) Maui and Torque and restart them to
>> get the system up again. This is pretty annoying, particularly since we
>> have no idea how to prevent this problem. Therefore, any help or
>> suggestions are welcome!
>>
>> We run Torque version 2.0.0p4 together with Maui 3.2.6p13 on a linux
>> system (Red Hat Enterprise 3 compatible).
>>
>> Thanks,
>> Ronald
>>
>> --
>> Ronald Starink
>>
>> ** National Institute for Nuclear and High Energy Physics
>> ** Room: H1.57 Phone: +31 20 5925180
>> ** PObox 41882, NL-1009DB Amsterdam NL
>>
>>
>>
>>
>> --- Lengthy details below ---
>>
>>
>> There are many queued jobs, but also many free CPUs:
>> [root@tbn20 server_logs]# pbsnodes -a | grep "state = free" | wc -l
>> 106
>>
>> Normally, at least a fraction of the queued jobs should be running.
>> However, there are jobs assigned to one particular node (node15-7),
>> where they won't run.
>> [root@tbn20 server_logs]# qstat -an1 | grep node15-7
>> 748172.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- --
>> 72:00 Q -- node15-7
>> 748173.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- --
>> 72:00 Q -- node15-7
>> 748175.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- --
>> 72:00 Q -- node15-7
>> 748176.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- --
>> 72:00 Q -- node15-7
>> 748178.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- --
>> 72:00 Q -- node15-7
>> 748180.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- --
>> 72:00 Q -- node15-7
>> 748181.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- --
>> 72:00 Q -- node15-7
>> 748182.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- --
>> 72:00 Q -- node15-7
>> 748183.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- --
>> 72:00 Q -- node15-7
>> 748184.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- --
>> 72:00 Q -- node15-7
>> 748185.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- --
>> 72:00 Q -- node15-7
>> 748264.tbn20.nikhef. dteam004 test STDIN -- 1 -- --
>> 00:30 Q -- node15-7
>> 748265.tbn20.nikhef. lhcb002 qlong STDIN -- 1 -- --
>> 64:00 Q -- node15-7
>>
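The symptom in the listing above (state Q, yet an exec host already filled in) can be checked for mechanically. What follows is only a rough sketch: it assumes each job occupies a single line of `qstat -an1` output with 11 fixed columns followed by the exec host, as in the unwrapped form of the listing, and that job names contain no spaces. The function name `stuck_queued_jobs` is made up for illustration.

```python
def stuck_queued_jobs(qstat_lines):
    """Return (job_id, node) pairs for jobs in state Q that already list an exec host."""
    stuck = []
    for line in qstat_lines:
        fields = line.split()
        # Assumed layout: 11 fixed columns, then the exec host column.
        if len(fields) < 12:
            continue
        job_id, state, node = fields[0], fields[9], fields[11]
        if state == "Q" and node != "--":
            stuck.append((job_id, node))
    return stuck


# Sample lines taken from the listings in this message, joined onto one line each.
sample = [
    "748172.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7",
    "748265.tbn20.nikhef. lhcb002 qlong STDIN 25893 1 -- -- 64:00 R -- node15-7",
]
print(stuck_queued_jobs(sample))
```

On a live system the input would be the lines of `qstat -an1` rather than the hard-coded sample.
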
>> Apparently this also blocks Torque from assigning jobs to other nodes.
>>
>> The problematic node (node15-7) is a dual CPU machine, configured to run
>> 2 jobs:
>> [root@tbn20 server_logs]# pbsnodes -a node15-7.farmnet.nikhef.nl
>> node15-7.farmnet.nikhef.nl
>> state = free
>> np = 2
>> properties = farm,rhel3,xps2800,halloween
>> ntype = cluster
>> status = opsys=linux,uname=Linux node15-7.farmnet.nikhef.nl
>> 2.4.21-32.0.1.ELsmp #1 SMP Wed May 25 14:26:33 EDT 2005 i686,sessions=?
>> 0,nsessions=?
>> 0,nusers=0,idletime=615,totmem=5204136kb,availmem=4942120kb,physmem=2055436kb,ncpus=4,loadave=0.00,netload=341352578,state=free,jobs=?
>>
>> 0,rectime=1138779543
>>
>>
>>
>> On node15-7, the MOM log repeats the following 4 lines every second:
>>
>> 02/01/2006 08:43:38;0001; pbs_mom;Svr;pbs_mom;Success (0) in
>> req_jobscript, job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'
>> 02/01/2006 08:43:38;0080; pbs_mom;Req;req_reject;Reject reply
>> code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job
>> 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0,
>> type=JobScript, from PBS_Server@tbn20.nikhef.nl
>> 02/01/2006 08:43:38;0001; pbs_mom;Svr;pbs_mom;Success (0) in
>> req_jobscript, job 747900.tbn20.nikhef.nl in unexpected
>> state 'TRANSICM'
>> 02/01/2006 08:43:38;0080; pbs_mom;Req;req_reject;Reject reply
>> code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job
>> 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0,
>> type=JobScript, from PBS_Server@tbn20.nikhef.nl
>>
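To see which jobs a MOM keeps rejecting, the log can be scanned for the message quoted above. This is a minimal sketch, assuming the log text matches the excerpt in this message. It counts only the `req_jobscript` entries, so the matching `req_reject` line is not counted twice. The helper name `unexpected_state_counts` is hypothetical.

```python
import re
from collections import Counter

# Matches the Svr entry only; the Req entry carries the same text after "MSG=job".
PATTERN = re.compile(r"req_jobscript, job (\S+) in unexpected state '(\w+)'")


def unexpected_state_counts(log_text):
    """Count (job_id, state) pairs reported by req_jobscript in a MOM log excerpt."""
    # Entries pasted into mail are wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return Counter(PATTERN.findall(flat))


sample = """02/01/2006 08:43:38;0001; pbs_mom;Svr;pbs_mom;Success (0) in
req_jobscript, job 747900.tbn20.nikhef.nl in unexpected
state 'TRANSICM'
02/01/2006 08:43:38;0080; pbs_mom;Req;req_reject;Reject reply
code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job
747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0,
type=JobScript, from PBS_Server@tbn20.nikhef.nl"""

for (job_id, state), count in unexpected_state_counts(sample).items():
    print(job_id, state, count)
```
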
>>
>> The Torque server doesn't know about job 747900:
>> [root@tbn20 server_logs]# qstat -an1 | grep 747900
>> [root@tbn20 server_logs]#
>>
>> And the node apparently also doesn't know about the jobs assigned by the
>> Torque server:
>> [root@tbn20 server_logs]# momctl -h node15-7.farmnet.nikhef.nl -d 0
>>
>> Host: node15-7.farmnet.nikhef.nl/node15-7.farmnet.nikhef.nl Version:
>> 2.0.0p4
>> Server[0]: tbn20.nikhef.nl (connection is active)
>> Last Msg From Server: 0 seconds (JobScript)
>> Last Msg To Server: 11 seconds
>> HomeDirectory: /var/spool/pbs/mom_priv
>> MOM active: 675964 seconds
>> LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
>> JobList: NONE
>>
>> diagnostics complete
>>
>>
>> Restarting the MOM on the node gets it out of this state:
>>
>> [root@node15-7 root]# service pbs_mom restart
>> Stopping pbs_mom: [ OK ]
>> Starting pbs_mom: [ OK ]
>>
>> Now there are only 2 jobs assigned to node15-7 and they are running:
>>
>> [root@tbn20 server_logs]# qstat -an1 | grep node15-7
>> 748264.tbn20.nikhef. dteam004 test STDIN 25880 1 -- --
>> 00:30 R -- node15-7
>> 748265.tbn20.nikhef. lhcb002 qlong STDIN 25893 1 -- --
>> 64:00 R -- node15-7
>>
>> The other jobs that were assigned to this node but still queued are now
>> also running, but on different nodes:
>>
>> [root@tbn20 server_logs]# qstat -an1 | grep 74817
>> 748171.tbn20.nikhef. dzero013 dzero STDIN 31345 1 -- --
>> 72:00 R 00:03 node15-2
>> 748172.tbn20.nikhef. dzero013 dzero STDIN 14078 1 -- --
>> 72:00 R 00:00 node15-8
>> 748173.tbn20.nikhef. dzero013 dzero STDIN 14122 1 -- --
>> 72:00 R 00:00 node15-8
>> 748174.tbn20.nikhef. dzero013 dzero STDIN 30344 1 -- --
>> 72:00 R 00:00 node15-9
>> 748175.tbn20.nikhef. dzero013 dzero STDIN 30164 1 -- --
>> 72:00 R 00:00 node15-11
>> 748176.tbn20.nikhef. dzero013 dzero STDIN 25087 1 -- --
>> 72:00 R 00:00 node15-12
>> 748177.tbn20.nikhef. dzero013 dzero STDIN 25135 1 -- --
>> 72:00 R 00:00 node15-12
>> 748178.tbn20.nikhef. dzero013 dzero STDIN 11590 1 -- --
>> 72:00 R 00:00 node15-13
>> 748179.tbn20.nikhef. dzero013 dzero STDIN 26972 1 -- --
>> 72:00 R 00:00 node15-14
>>
>> All jobs that were queued are now running again.
>>
>> Interesting detail: after restarting the MOM, the job 747900 that was in
>> unexpected state TRANSICM is now visible from the Torque server:
>>
>> [root@tbn20 server_logs]# momctl -h node15-7.farmnet.nikhef.nl -d 0
>>
>> Host: node15-7.farmnet.nikhef.nl/node15-7.farmnet.nikhef.nl Version:
>> 2.0.0p4
>> Server[0]: tbn20.nikhef.nl (connection is active)
>> Last Msg From Server: 9 seconds (StatusJob)
>> Last Msg To Server: 42 seconds
>> HomeDirectory: /var/spool/pbs/mom_priv
>> MOM active: 3906 seconds
>> LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
>> Job[747900.tbn20.nikhef.nl] State=TRANSICM
>> Job[748265.tbn20.nikhef.nl] State=RUNNING
>> Job[748266.tbn20.nikhef.nl] State=RUNNING
>> Assigned CPU Count: 3
>>
>> diagnostics complete
>>
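The `Job[...] State=...` lines in the momctl diagnostics above lend themselves to a quick filter for jobs that a MOM holds in anything other than RUNNING. This is a sketch only: it assumes the line format shown in this message (other momctl versions may print it differently), and the name `jobs_not_running` is invented for the example.

```python
import re

JOB_LINE = re.compile(r"Job\[([^\]]+)\]\s+State=(\w+)")


def jobs_not_running(momctl_output):
    """Return (job_id, state) for every job the MOM lists in a state other than RUNNING."""
    return [
        (job_id, state)
        for job_id, state in JOB_LINE.findall(momctl_output)
        if state != "RUNNING"
    ]


# Job lines copied from the diagnostics output quoted above.
sample = """Job[747900.tbn20.nikhef.nl] State=TRANSICM
Job[748265.tbn20.nikhef.nl] State=RUNNING
Job[748266.tbn20.nikhef.nl] State=RUNNING
Assigned CPU Count: 3"""

print(jobs_not_running(sample))
```
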
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers