[torqueusers] jobs assigned to node, but fail to run
Garrick Staples
garrick at usc.edu
Fri Feb 3 09:41:21 MST 2006
Is this reproducible with 2.0.0p7?
On Wed, Feb 01, 2006 at 04:17:06PM +0100, Ronald Starink alleged:
> Hi,
>
> We have this frequently recurring problem that jobs get assigned to a
> node according to the Torque server, but remain in state Queued.
> Unfortunately, this also blocks scheduling of other queued jobs, leaving
> us with a mostly empty cluster and a full queue. The MOM log file
> repeatedly says that a job is in unexpected state TRANSICM.
>
> Below I have added a detailed overview of information from log files. In
> this case, a simple restart of the MOM resolved the problem. However, we
> see this problem several times per week (or even per day). Sometimes Torque
> becomes unresponsive, i.e. qstat and pbsnodes time out. The
> only solution then is to kill (-9) Maui and Torque and restart them to
> get the system up again. This is pretty annoying, particularly since we
> have no idea how to prevent this problem. Therefore, any help or
> suggestions are welcome!
>
> We run Torque version 2.0.0p4 together with Maui 3.2.6p13 on a Linux
> system (Red Hat Enterprise 3 compatible).
>
> Thanks,
> Ronald
>
> --
> Ronald Starink
>
> ** National Institute for Nuclear and High Energy Physics
> ** Room: H1.57 Phone: +31 20 5925180
> ** PObox 41882, NL-1009DB Amsterdam NL
>
>
>
>
> --- Lengthy details below ---
>
>
> There are many queued jobs, but also many free CPUs:
> [root@tbn20 server_logs]# pbsnodes -a | grep "state = free" | wc -l
> 106
>
> Normally, at least a fraction of the queued jobs should be running.
> However, there are jobs assigned to one particular node (node15-7),
> where they won't run.
> [root@tbn20 server_logs]# qstat -an1 | grep node15-7
> 748172.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7
> 748173.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7
> 748175.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7
> 748176.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7
> 748178.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7
> 748180.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7
> 748181.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7
> 748182.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7
> 748183.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7
> 748184.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7
> 748185.tbn20.nikhef. dzero013 dzero STDIN -- 1 -- -- 72:00 Q -- node15-7
> 748264.tbn20.nikhef. dteam004 test  STDIN -- 1 -- -- 00:30 Q -- node15-7
> 748265.tbn20.nikhef. lhcb002  qlong STDIN -- 1 -- -- 64:00 Q -- node15-7
>
> Apparently this also prevents Torque from assigning jobs to other nodes.
>
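A minimal sketch for spotting this condition, assuming `qstat -an1` prints one job per line with the state in column 10 and the assigned node in column 12 (as in the listing above once the mail wrapping is undone). The function name `stuck_queued_jobs` is hypothetical and not part of Torque:

```shell
# Print "jobid node" for each job that is still Queued but already has a node.
# Reads qstat -an1 style output on stdin; the column positions are an
# assumption taken from the listing in this thread.
stuck_queued_jobs() {
    awk '$10 == "Q" && NF >= 12 && $12 != "--" { print $1, $12 }'
}
```

It could be used as `qstat -an1 | stuck_queued_jobs`; an empty result would mean no queued job is pinned to a node.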
> The problematic node (node15-7) is a dual CPU machine, configured to run
> 2 jobs:
> [root@tbn20 server_logs]# pbsnodes -a node15-7.farmnet.nikhef.nl
> node15-7.farmnet.nikhef.nl
>      state = free
>      np = 2
>      properties = farm,rhel3,xps2800,halloween
>      ntype = cluster
>      status = opsys=linux,uname=Linux node15-7.farmnet.nikhef.nl 2.4.21-32.0.1.ELsmp #1 SMP Wed May 25 14:26:33 EDT 2005 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=615,totmem=5204136kb,availmem=4942120kb,physmem=2055436kb,ncpus=4,loadave=0.00,netload=341352578,state=free,jobs=? 0,rectime=1138779543
>
>
>
> On node15-7, the MOM log repeats the following 4 lines every second:
>
> 02/01/2006 08:43:38;0001; pbs_mom;Svr;pbs_mom;Success (0) in req_jobscript, job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'
> 02/01/2006 08:43:38;0080; pbs_mom;Req;req_reject;Reject reply code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0, type=JobScript, from PBS_Server@tbn20.nikhef.nl
> 02/01/2006 08:43:38;0001; pbs_mom;Svr;pbs_mom;Success (0) in req_jobscript, job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'
> 02/01/2006 08:43:38;0080; pbs_mom;Req;req_reject;Reject reply code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0, type=JobScript, from PBS_Server@tbn20.nikhef.nl
>
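To see which job IDs the MOM is complaining about, the log can be summarised roughly as follows. This is only a sketch: it assumes each log entry sits on a single line (the wrapping above comes from the mail client), and `transicm_jobs` is a hypothetical helper name, not a Torque tool.

```shell
# Print "jobid count" for every job the MOM log reports in state TRANSICM.
# Reads MOM log lines on stdin; lines without the message are ignored.
transicm_jobs() {
    sed -n "s/.*job \([^ ]*\) in unexpected state 'TRANSICM'.*/\1/p" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Feeding it the current MOM log file on stdin would show how many messages each stuck job ID has produced.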
>
> The Torque server doesn't know about job 747900:
> [root@tbn20 server_logs]# qstat -an1 | grep 747900
> [root@tbn20 server_logs]#
>
> And the node apparently also doesn't know about the jobs assigned by the
> Torque server:
> [root@tbn20 server_logs]# momctl -h node15-7.farmnet.nikhef.nl -d 0
>
> Host: node15-7.farmnet.nikhef.nl/node15-7.farmnet.nikhef.nl Version: 2.0.0p4
> Server[0]: tbn20.nikhef.nl (connection is active)
> Last Msg From Server: 0 seconds (JobScript)
> Last Msg To Server: 11 seconds
> HomeDirectory: /var/spool/pbs/mom_priv
> MOM active: 675964 seconds
> LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
> JobList: NONE
>
> diagnostics complete
>
>
> Restarting the MOM on the node gets it out of this state:
>
> [root@node15-7 root]# service pbs_mom restart
> Stopping pbs_mom: [ OK ]
> Starting pbs_mom: [ OK ]
>
> Now there are only 2 jobs assigned to node15-7 and they are running:
>
> [root@tbn20 server_logs]# qstat -an1 | grep node15-7
> 748264.tbn20.nikhef. dteam004 test  STDIN 25880 1 -- -- 00:30 R -- node15-7
> 748265.tbn20.nikhef. lhcb002  qlong STDIN 25893 1 -- -- 64:00 R -- node15-7
>
> The other jobs that were assigned to this node but still queued are
> now running as well, but on different nodes:
>
> [root@tbn20 server_logs]# qstat -an1 | grep 74817
> 748171.tbn20.nikhef. dzero013 dzero STDIN 31345 1 -- -- 72:00 R 00:03 node15-2
> 748172.tbn20.nikhef. dzero013 dzero STDIN 14078 1 -- -- 72:00 R 00:00 node15-8
> 748173.tbn20.nikhef. dzero013 dzero STDIN 14122 1 -- -- 72:00 R 00:00 node15-8
> 748174.tbn20.nikhef. dzero013 dzero STDIN 30344 1 -- -- 72:00 R 00:00 node15-9
> 748175.tbn20.nikhef. dzero013 dzero STDIN 30164 1 -- -- 72:00 R 00:00 node15-11
> 748176.tbn20.nikhef. dzero013 dzero STDIN 25087 1 -- -- 72:00 R 00:00 node15-12
> 748177.tbn20.nikhef. dzero013 dzero STDIN 25135 1 -- -- 72:00 R 00:00 node15-12
> 748178.tbn20.nikhef. dzero013 dzero STDIN 11590 1 -- -- 72:00 R 00:00 node15-13
> 748179.tbn20.nikhef. dzero013 dzero STDIN 26972 1 -- -- 72:00 R 00:00 node15-14
>
> All jobs that were queued are now running again.
>
> Interesting detail: after restarting the MOM, the job 747900 that was in
> unexpected state TRANSICM is now visible from the Torque server:
>
> [root@tbn20 server_logs]# momctl -h node15-7.farmnet.nikhef.nl -d 0
>
> Host: node15-7.farmnet.nikhef.nl/node15-7.farmnet.nikhef.nl Version: 2.0.0p4
> Server[0]: tbn20.nikhef.nl (connection is active)
> Last Msg From Server: 9 seconds (StatusJob)
> Last Msg To Server: 42 seconds
> HomeDirectory: /var/spool/pbs/mom_priv
> MOM active: 3906 seconds
> LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
> Job[747900.tbn20.nikhef.nl] State=TRANSICM
> Job[748265.tbn20.nikhef.nl] State=RUNNING
> Job[748266.tbn20.nikhef.nl] State=RUNNING
> Assigned CPU Count: 3
>
> diagnostics complete
>
>
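A rough way to flag leftover jobs like 747900 on a node is to filter the MOM diagnostics for anything that is not RUNNING. This sketch relies on the `Job[id] State=X` line format shown in the `momctl -d 0` output above, which may differ in other versions; `mom_nonrunning_jobs` is a hypothetical helper name.

```shell
# Print "jobid state" for every job the MOM lists that is not RUNNING.
# Reads momctl -d 0 style output on stdin; tolerates a leading "> " quote prefix.
mom_nonrunning_jobs() {
    sed -n 's/^[> ]*Job\[\([^]]*\)] *State=\([A-Za-z_]*\).*/\1 \2/p' |
        awk '$2 != "RUNNING"'
}
```

For example: `momctl -h node15-7.farmnet.nikhef.nl -d 0 | mom_nonrunning_jobs`.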
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California