[torqueusers] jobs assigned to node, but fail to run

Garrick Staples garrick at usc.edu
Fri Feb 3 09:41:21 MST 2006


Is this reproducible with 2.0.0p7?

On Wed, Feb 01, 2006 at 04:17:06PM +0100, Ronald Starink alleged:
> Hi,
> 
> We have this frequently recurring problem that jobs get assigned to a
> node according to the Torque server, but remain in state Queued.
> Unfortunately, this also blocks scheduling of other queued jobs, leaving
> us with a mostly empty cluster and a full queue. The MOM log file
> repeatedly says that a job is in unexpected state TRANSICM.
> 
> Below I have added a detailed overview of information from log files. In
> this case, a simple restart of the MOM resolved the problem. However, we
> see this problem several times per week (or even per day). Sometimes Torque
> becomes unresponsive, i.e. qstat or pbsnodes time out. The
> only solution then is to kill (-9) Maui and Torque and restart them to
> get the system up again. This is pretty annoying, particularly since we
> have no idea how to prevent this problem. Therefore, any help or
> suggestions are welcome!
> 
> We run Torque version 2.0.0p4 together with Maui 3.2.6p13 on a linux
> system (Red Hat Enterprise 3 compatible).
> 
> Thanks,
> Ronald
> 
> -- 
> Ronald Starink
> 
> ** National Institute for Nuclear and High Energy Physics
> ** Room: H1.57 Phone: +31 20 5925180
> ** PObox 41882, NL-1009DB Amsterdam NL
> 
> 
> 
> 
> --- Lengthy details below ---
> 
> 
> There are many queued jobs, but also many free CPUs:
> [root@tbn20 server_logs]# pbsnodes -a | grep "state = free" | wc -l
>     106
> 
> Normally, at least a fraction of the queued jobs should be running.
> However, there are jobs assigned to one particular node (node15-7),
> where they won't run.
> [root@tbn20 server_logs]# qstat -an1 | grep node15-7
> 748172.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
> 748173.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
> 748175.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
> 748176.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
> 748178.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
> 748180.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
> 748181.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
> 748182.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
> 748183.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
> 748184.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
> 748185.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
> 748264.tbn20.nikhef. dteam004 test     STDIN         --      1  --    --  00:30 Q   --    node15-7
> 748265.tbn20.nikhef. lhcb002  qlong    STDIN         --      1  --    --  64:00 Q   --    node15-7
> 
> Apparently this also blocks Torque from assigning jobs to other nodes.
> 
> The problematic node (node15-7) is a dual CPU machine, configured to run
> 2 jobs:
> [root@tbn20 server_logs]# pbsnodes -a node15-7.farmnet.nikhef.nl
> node15-7.farmnet.nikhef.nl
>      state = free
>      np = 2
>      properties = farm,rhel3,xps2800,halloween
>      ntype = cluster
>      status = opsys=linux,uname=Linux node15-7.farmnet.nikhef.nl 2.4.21-32.0.1.ELsmp #1 SMP Wed May 25 14:26:33 EDT 2005 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=615,totmem=5204136kb,availmem=4942120kb,physmem=2055436kb,ncpus=4,loadave=0.00,netload=341352578,state=free,jobs=? 0,rectime=1138779543
> 
> 
> 
> On node15-7, the MOM log repeats the following 4 lines every second:
> 
> 02/01/2006 08:43:38;0001;   pbs_mom;Svr;pbs_mom;Success (0) in req_jobscript, job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'
> 02/01/2006 08:43:38;0080;   pbs_mom;Req;req_reject;Reject reply code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0, type=JobScript, from PBS_Server@tbn20.nikhef.nl
> 02/01/2006 08:43:38;0001;   pbs_mom;Svr;pbs_mom;Success (0) in req_jobscript, job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'
> 02/01/2006 08:43:38;0080;   pbs_mom;Req;req_reject;Reject reply code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0, type=JobScript, from PBS_Server@tbn20.nikhef.nl
> 
> 
> The Torque server doesn't know about job 747900:
> [root@tbn20 server_logs]# qstat -an1 | grep 747900
> [root@tbn20 server_logs]#
> 
> And the node apparently also doesn't know about the jobs assigned by the
> Torque server:
> [root@tbn20 server_logs]# momctl -h node15-7.farmnet.nikhef.nl -d 0
> 
> Host: node15-7.farmnet.nikhef.nl/node15-7.farmnet.nikhef.nl   Version: 2.0.0p4
> Server[0]: tbn20.nikhef.nl (connection is active)
>   Last Msg From Server:   0 seconds (JobScript)
>   Last Msg To Server:     11 seconds
> HomeDirectory:          /var/spool/pbs/mom_priv
> MOM active:             675964 seconds
> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
> JobList:                NONE
> 
> diagnostics complete
> 
> 
> Restarting the MOM on the node gets it out of this state:
> 
> [root@node15-7 root]# service pbs_mom restart
> Stopping pbs_mom:                                          [  OK  ]
> Starting pbs_mom:                                          [  OK  ]
> 
> Now there are only 2 jobs assigned to node15-7 and they are running:
> 
> [root@tbn20 server_logs]# qstat -an1 | grep node15-7
> 748264.tbn20.nikhef. dteam004 test     STDIN       25880     1  --    --  00:30 R   --    node15-7
> 748265.tbn20.nikhef. lhcb002  qlong    STDIN       25893     1  --    --  64:00 R   --    node15-7
> 
> The other jobs that were assigned to this node but were still queued,
> are now also running, but on different nodes:
> 
> [root@tbn20 server_logs]# qstat -an1 | grep 74817
> 748171.tbn20.nikhef. dzero013 dzero    STDIN       31345     1  --    --  72:00 R 00:03   node15-2
> 748172.tbn20.nikhef. dzero013 dzero    STDIN       14078     1  --    --  72:00 R 00:00   node15-8
> 748173.tbn20.nikhef. dzero013 dzero    STDIN       14122     1  --    --  72:00 R 00:00   node15-8
> 748174.tbn20.nikhef. dzero013 dzero    STDIN       30344     1  --    --  72:00 R 00:00   node15-9
> 748175.tbn20.nikhef. dzero013 dzero    STDIN       30164     1  --    --  72:00 R 00:00   node15-11
> 748176.tbn20.nikhef. dzero013 dzero    STDIN       25087     1  --    --  72:00 R 00:00   node15-12
> 748177.tbn20.nikhef. dzero013 dzero    STDIN       25135     1  --    --  72:00 R 00:00   node15-12
> 748178.tbn20.nikhef. dzero013 dzero    STDIN       11590     1  --    --  72:00 R 00:00   node15-13
> 748179.tbn20.nikhef. dzero013 dzero    STDIN       26972     1  --    --  72:00 R 00:00   node15-14
> 
> All jobs that were queued are now running again.
> 
> Interesting detail: after restarting the MOM, the job 747900 that was in
> unexpected state TRANSICM is now visible from the Torque server:
> 
> [root@tbn20 server_logs]# momctl -h node15-7.farmnet.nikhef.nl -d 0
> 
> Host: node15-7.farmnet.nikhef.nl/node15-7.farmnet.nikhef.nl   Version: 2.0.0p4
> Server[0]: tbn20.nikhef.nl (connection is active)
>   Last Msg From Server:   9 seconds (StatusJob)
>   Last Msg To Server:     42 seconds
> HomeDirectory:          /var/spool/pbs/mom_priv
> MOM active:             3906 seconds
> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
> Job[747900.tbn20.nikhef.nl]  State=TRANSICM
> Job[748265.tbn20.nikhef.nl]  State=RUNNING
> Job[748266.tbn20.nikhef.nl]  State=RUNNING
> Assigned CPU Count:     3
> 
> diagnostics complete
> 
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
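
To see whether other nodes are stuck the same way, it may help to scan each
MOM log for the req_jobscript rejections quoted above. The following is only
a minimal sketch in Python, not part of Torque: the function name
count_unexpected_states is hypothetical, the message format is assumed to
match the log lines in the report, and the log file path is whatever you
pass on the command line.

```python
import re
import sys
from collections import Counter

# Assumed message format, taken from the MOM log lines in the report:
#   ... in req_jobscript, job <jobid> in unexpected state '<STATE>'
PATTERN = re.compile(r"req_jobscript, job (\S+) in unexpected state '(\w+)'")


def count_unexpected_states(log_text):
    """Count req_jobscript rejections per (job id, state) pair."""
    # Collapse every run of whitespace (including line wraps) into one
    # space, so that entries broken across lines still match the pattern.
    flat = " ".join(log_text.split())
    return Counter(PATTERN.findall(flat))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_mom_log.py <path to a MOM log file>
    with open(sys.argv[1], errors="replace") as handle:
        counts = count_unexpected_states(handle.read())
    for (job_id, state), hits in counts.most_common():
        print(f"{job_id} {state} {hits}")
```

Only the req_jobscript entries are counted, so the matching "Reject reply"
lines do not double the totals. A job with a large and still growing count
is probably in the same loop as 747900 above.
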

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California