[torqueusers] jobs assigned to node, but fail to run

Ronald Starink ronalds at nikhef.nl
Wed Feb 1 08:17:06 MST 2006


Hi,

We have this frequently recurring problem that jobs get assigned to a
node according to the Torque server, but remain in state Queued.
Unfortunately, this also blocks scheduling of other queued jobs, leaving
us with a mostly empty cluster and a full queue. The MOM log file
repeatedly says that a job is in unexpected state TRANSICM.

Below I have added a detailed overview of information from log files. In
this case, a simple restart of the MOM resolved the problem. However, we
see this problem several times per week (sometimes per day). Sometimes
Torque becomes unresponsive, i.e. qstat or pbsnodes time out. The
only solution then is to kill (-9) Maui and Torque and restart them to
get the system up again. This is pretty annoying, particularly since we
have no idea how to prevent this problem. Therefore, any help or
suggestions are welcome!

We run Torque version 2.0.0p4 together with Maui 3.2.6p13 on a Linux
system (Red Hat Enterprise 3 compatible).

Thanks,
Ronald

-- 
Ronald Starink

** National Institute for Nuclear and High Energy Physics
** Room: H1.57 Phone: +31 20 5925180
** PObox 41882, NL-1009DB Amsterdam NL




--- Lengthy details below ---


There are many queued jobs, but also many free CPUs:
[root@tbn20 server_logs]# pbsnodes -a | grep "state = free" | wc -l
     106

Normally, at least a fraction of the queued jobs should be running.
However, there are jobs assigned to one particular node (node15-7),
where they won't run.
[root@tbn20 server_logs]# qstat -an1 | grep node15-7
748172.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748173.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748175.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748176.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748178.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748180.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748181.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748182.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748183.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748184.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748185.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748264.tbn20.nikhef. dteam004 test     STDIN         --      1  --    --  00:30 Q   --    node15-7
748265.tbn20.nikhef. lhcb002  qlong    STDIN         --      1  --    --  64:00 Q   --    node15-7

Apparently this also prevents Torque from assigning jobs to other nodes.
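
A check like the one above could be automated. The following is only a
minimal sketch, not something taken from our setup: it assumes that each
job row of `qstat -an1` is on a single line with 12 whitespace-separated
fields (as the command prints it, not as a mail client wraps it), and
that a job without an assigned node shows "--" in the last column. The
function name is hypothetical.

```python
def queued_jobs_with_node(qstat_text):
    """Return (job_id, node) pairs for jobs in state Q that already list a node."""
    stuck = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Assumed column layout of a `qstat -an1` job row:
        # id user queue name sessid nds tsk mem time state elapsed node
        if len(fields) != 12:
            continue  # header, separator, or unrelated line
        job_id, state, node = fields[0], fields[9], fields[11]
        if state == "Q" and node != "--":
            stuck.append((job_id, node))
    return stuck


sample = """\
748172.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --  72:00 Q   --    node15-7
748264.tbn20.nikhef. dteam004 test     STDIN       25880     1  --    --  00:30 R   --    node15-7
"""
print(queued_jobs_with_node(sample))
```

Only the first sample row is reported, because the second job is already
running.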

The problematic node (node15-7) is a dual CPU machine, configured to run
2 jobs:
[root@tbn20 server_logs]# pbsnodes -a node15-7.farmnet.nikhef.nl
node15-7.farmnet.nikhef.nl
      state = free
      np = 2
      properties = farm,rhel3,xps2800,halloween
      ntype = cluster
      status = opsys=linux,uname=Linux node15-7.farmnet.nikhef.nl 2.4.21-32.0.1.ELsmp #1 SMP Wed May 25 14:26:33 EDT 2005 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=615,totmem=5204136kb,availmem=4942120kb,physmem=2055436kb,ncpus=4,loadave=0.00,netload=341352578,state=free,jobs=? 0,rectime=1138779543



On node15-7, the MOM log repeats the following 4 lines every second:

02/01/2006 08:43:38;0001;   pbs_mom;Svr;pbs_mom;Success (0) in
req_jobscript, job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'
02/01/2006 08:43:38;0080;   pbs_mom;Req;req_reject;Reject reply
code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job
747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0,
type=JobScript, from PBS_Server@tbn20.nikhef.nl
02/01/2006 08:43:38;0001;   pbs_mom;Svr;pbs_mom;Success (0) in
req_jobscript, job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'
02/01/2006 08:43:38;0080;   pbs_mom;Req;req_reject;Reject reply
code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job
747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0,
type=JobScript, from PBS_Server@tbn20.nikhef.nl
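
To spot this condition in a MOM log without reading it by eye, one could
count the "unexpected state" messages per job. This is a rough sketch
under the assumption that the message wording is exactly as in the
excerpt above; the function name is hypothetical, and whitespace is
collapsed first so that wrapped messages still match.

```python
import re
from collections import Counter

# Wording taken from the log excerpt; other Torque versions may differ.
PATTERN = re.compile(r"\bjob (\S+) in unexpected state '(\w+)'")


def unexpected_state_counts(log_text):
    """Count (job_id, state) pairs mentioned in 'unexpected state' messages."""
    # Collapse all whitespace so a message split across lines is matched too.
    flat = " ".join(log_text.split())
    return Counter(PATTERN.findall(flat))


sample = """\
02/01/2006 08:43:38;0001;   pbs_mom;Svr;pbs_mom;Success (0) in
req_jobscript, job 747900.tbn20.nikhef.nl in unexpected
  state 'TRANSICM'
02/01/2006 08:43:38;0080;   pbs_mom;Req;req_reject;Reject reply
code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job
747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0,
"""
for (job_id, state), count in unexpected_state_counts(sample).items():
    print(job_id, state, count)
```

A count that keeps growing for a single job would point at the situation
described here.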


The Torque server doesn't know about job 747900:
[root@tbn20 server_logs]# qstat -an1 | grep 747900
[root@tbn20 server_logs]#

And the node apparently also doesn't know about the jobs assigned by the
Torque server:
[root@tbn20 server_logs]# momctl -h node15-7.farmnet.nikhef.nl -d 0

Host: node15-7.farmnet.nikhef.nl/node15-7.farmnet.nikhef.nl   Version: 2.0.0p4
Server[0]: tbn20.nikhef.nl (connection is active)
   Last Msg From Server:   0 seconds (JobScript)
   Last Msg To Server:     11 seconds
HomeDirectory:          /var/spool/pbs/mom_priv
MOM active:             675964 seconds
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete


Restarting the MOM on the node gets it out of this state:

[root@node15-7 root]# service pbs_mom restart
Stopping pbs_mom:                                          [  OK  ]
Starting pbs_mom:                                          [  OK  ]

Now there are only 2 jobs assigned to node15-7 and they are running:

[root@tbn20 server_logs]# qstat -an1 | grep node15-7
748264.tbn20.nikhef. dteam004 test     STDIN       25880     1  --    --  00:30 R   --    node15-7
748265.tbn20.nikhef. lhcb002  qlong    STDIN       25893     1  --    --  64:00 R   --    node15-7

The other jobs that were assigned to this node but were still queued,
are now also running, but on different nodes:

[root@tbn20 server_logs]# qstat -an1 | grep 74817
748171.tbn20.nikhef. dzero013 dzero    STDIN       31345     1  --    --  72:00 R 00:03   node15-2
748172.tbn20.nikhef. dzero013 dzero    STDIN       14078     1  --    --  72:00 R 00:00   node15-8
748173.tbn20.nikhef. dzero013 dzero    STDIN       14122     1  --    --  72:00 R 00:00   node15-8
748174.tbn20.nikhef. dzero013 dzero    STDIN       30344     1  --    --  72:00 R 00:00   node15-9
748175.tbn20.nikhef. dzero013 dzero    STDIN       30164     1  --    --  72:00 R 00:00   node15-11
748176.tbn20.nikhef. dzero013 dzero    STDIN       25087     1  --    --  72:00 R 00:00   node15-12
748177.tbn20.nikhef. dzero013 dzero    STDIN       25135     1  --    --  72:00 R 00:00   node15-12
748178.tbn20.nikhef. dzero013 dzero    STDIN       11590     1  --    --  72:00 R 00:00   node15-13
748179.tbn20.nikhef. dzero013 dzero    STDIN       26972     1  --    --  72:00 R 00:00   node15-14

All jobs that were queued are now running again.

Interesting detail: after restarting the MOM, the job 747900 that was in
unexpected state TRANSICM is now visible from the Torque server:

[root@tbn20 server_logs]# momctl -h node15-7.farmnet.nikhef.nl -d 0

Host: node15-7.farmnet.nikhef.nl/node15-7.farmnet.nikhef.nl   Version: 2.0.0p4
Server[0]: tbn20.nikhef.nl (connection is active)
   Last Msg From Server:   9 seconds (StatusJob)
   Last Msg To Server:     42 seconds
HomeDirectory:          /var/spool/pbs/mom_priv
MOM active:             3906 seconds
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
Job[747900.tbn20.nikhef.nl]  State=TRANSICM
Job[748265.tbn20.nikhef.nl]  State=RUNNING
Job[748266.tbn20.nikhef.nl]  State=RUNNING
Assigned CPU Count:     3

diagnostics complete
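
The job list in the momctl diagnostics could also be scanned for jobs
that the MOM holds in a state other than RUNNING. Again this is only a
sketch: it assumes the "Job[<id>]  State=<state>" line format shown
above, which may differ between Torque versions, and the function name
is hypothetical.

```python
import re

# Line format as it appears in the `momctl -d 0` output above (assumed stable).
JOB_LINE = re.compile(r"^Job\[([^\]]+)\]\s+State=(\S+)", re.MULTILINE)


def jobs_not_running(momctl_text):
    """Return (job_id, state) pairs for jobs the MOM lists in a non-RUNNING state."""
    return [
        (job_id, state)
        for job_id, state in JOB_LINE.findall(momctl_text)
        if state != "RUNNING"
    ]


sample = """\
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
Job[747900.tbn20.nikhef.nl]  State=TRANSICM
Job[748265.tbn20.nikhef.nl]  State=RUNNING
Job[748266.tbn20.nikhef.nl]  State=RUNNING
Assigned CPU Count:     3
"""
print(jobs_not_running(sample))
```

Note that before the restart the MOM reported "JobList: NONE", so a scan
like this would only have caught job 747900 afterwards.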



