[torqueusers] jobs assigned to node, but fail to run

Ronald Starink ronalds at nikhef.nl
Sat Feb 4 03:12:57 MST 2006


We haven't tried the upgrade yet. It takes some planning because it is a 
  pretty large cluster. If you think that 2.0.0p7 contains changes that 
may resolve this, we will certainly give it a try.

Thanks,
Ronald

Garrick Staples wrote:
> Is this reproducible with 2.0.0p7?
> 
> On Wed, Feb 01, 2006 at 04:17:06PM +0100, Ronald Starink alleged:
>> Hi,
>>
>> We have this frequently recurring problem that jobs get assigned to a
>> node according to the Torque server, but remain in state Queued.
>> Unfortunately, this also blocks scheduling of other queued jobs, leaving
>> us with a mostly empty cluster and a full queue. The MOM log file
>> repeatedly says that a job is in unexpected state TRANSICM.
>>
>> Below is a detailed overview of the information from the log files. In
>> this case, a simple restart of the MOM resolved the problem. However, we
>> see this problem several times per week (or even per day). Sometimes
>> Torque becomes unresponsive, i.e. qstat and pbsnodes time out. The only
>> solution then is to kill (-9) Maui and Torque and restart them to get
>> the system up again. This is pretty annoying, particularly since we
>> have no idea how to prevent this problem. Therefore, any help or
>> suggestions are welcome!
>>
>> We run Torque version 2.0.0p4 together with Maui 3.2.6p13 on a linux
>> system (Red Hat Enterprise 3 compatible).
>>
>> Thanks,
>> Ronald
>>
>> -- 
>> Ronald Starink
>>
>> ** National Institute for Nuclear and High Energy Physics
>> ** Room: H1.57 Phone: +31 20 5925180
>> ** PObox 41882, NL-1009DB Amsterdam NL
>>
>>
>>
>>
>> --- Lengthy details below ---
>>
>>
>> There are many queued jobs, but also many free CPUs:
>> [root at tbn20 server_logs]# pbsnodes -a | grep "state = free" | wc -l
>>     106
>>
>> Normally, at least a fraction of the queued jobs should be running.
>> However, there are jobs assigned to one particular node (node15-7),
>> where they won't run.
>> [root at tbn20 server_logs]# qstat -an1 | grep node15-7
>> 748172.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --
>>  72:00 Q   --    node15-7
>> 748173.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --
>>  72:00 Q   --    node15-7
>> 748175.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --
>>  72:00 Q   --    node15-7
>> 748176.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --
>>  72:00 Q   --    node15-7
>> 748178.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --
>>  72:00 Q   --    node15-7
>> 748180.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --
>>  72:00 Q   --    node15-7
>> 748181.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --
>>  72:00 Q   --    node15-7
>> 748182.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --
>>  72:00 Q   --    node15-7
>> 748183.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --
>>  72:00 Q   --    node15-7
>> 748184.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --
>>  72:00 Q   --    node15-7
>> 748185.tbn20.nikhef. dzero013 dzero    STDIN         --      1  --    --
>>  72:00 Q   --    node15-7
>> 748264.tbn20.nikhef. dteam004 test     STDIN         --      1  --    --
>>  00:30 Q   --    node15-7
>> 748265.tbn20.nikhef. lhcb002  qlong    STDIN         --      1  --    --
>>  64:00 Q   --    node15-7
>>
>> Apparently this also blocks Torque from assigning jobs to other nodes.
>>
>> The problematic node (node15-7) is a dual CPU machine, configured to run
>> 2 jobs:
>> [root at tbn20 server_logs]# pbsnodes -a node15-7.farmnet.nikhef.nl
>> node15-7.farmnet.nikhef.nl
>>      state = free
>>      np = 2
>>      properties = farm,rhel3,xps2800,halloween
>>      ntype = cluster
>>      status = opsys=linux,uname=Linux node15-7.farmnet.nikhef.nl
>> 2.4.21-32.0.1.ELsmp #1 SMP Wed May 25 14:26:33 EDT 2005 i686,sessions=?
>> 0,nsessions=?
>> 0,nusers=0,idletime=615,totmem=5204136kb,availmem=4942120kb,physmem=2055436kb,ncpus=4,loadave=0.00,netload=341352578,state=free,jobs=? 
>>
>> 0,rectime=1138779543
>>
>>
>>
>> On node15-7, the MOM log shows the following 4 lines every second:
>>
>> 02/01/2006 08:43:38;0001;   pbs_mom;Svr;pbs_mom;Success (0) in
>> req_jobscript, job 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'
>> 02/01/2006 08:43:38;0080;   pbs_mom;Req;req_reject;Reject reply
>> code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job
>> 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0,
>> type=JobScript, from PBS_Server at tbn20.nikhef.nl
>> 02/01/2006 08:43:38;0001;   pbs_mom;Svr;pbs_mom;Success (0) in
>> req_jobscript, job 747900.tbn20.nikhef.nl in unexpected
>>  state 'TRANSICM'
>> 02/01/2006 08:43:38;0080;   pbs_mom;Req;req_reject;Reject reply
>> code=15004(Invalid request REJHOST=node15-7.farmnet.nikhef.nl MSG=job
>> 747900.tbn20.nikhef.nl in unexpected state 'TRANSICM'), aux=0,
>> type=JobScript, from PBS_Server at tbn20.nikhef.nl
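>>
>> These four lines repeat once per second, so the log fills quickly. A small
>> helper like the following (hypothetical, not part of TORQUE; the match
>> pattern is taken from the log excerpt above and may differ in other TORQUE
>> versions) can summarize which jobs a MOM is rejecting:

```shell
#!/bin/sh
# count_transicm: count TRANSICM rejections per job id in a MOM log file.
# The message text matched here is copied from the log excerpt above;
# adjust the pattern if your TORQUE version words it differently.
count_transicm() {
    grep "unexpected state 'TRANSICM'" "$1" \
        | grep -o "job [0-9.a-z-]*" \
        | sort | uniq -c | sort -rn
}
```

>> For example, `count_transicm /var/spool/pbs/mom_logs/20060201` would list
>> each stuck job id together with the number of rejections logged for it.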
>>
>>
>> The Torque server doesn't know about job 747900:
>> [root at tbn20 server_logs]# qstat -an1 | grep 747900
>> [root at tbn20 server_logs]#
>>
>> And the node apparently also doesn't know about the jobs assigned by the
>> Torque server:
>> [root at tbn20 server_logs]# momctl -h node15-7.farmnet.nikhef.nl -d 0
>>
>> Host: node15-7.farmnet.nikhef.nl/node15-7.farmnet.nikhef.nl   Version:
>> 2.0.0p4
>> Server[0]: tbn20.nikhef.nl (connection is active)
>>   Last Msg From Server:   0 seconds (JobScript)
>>   Last Msg To Server:     11 seconds
>> HomeDirectory:          /var/spool/pbs/mom_priv
>> MOM active:             675964 seconds
>> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
>> JobList:                NONE
>>
>> diagnostics complete
>>
>>
>> Restarting the MOM on the node gets it out of this state:
>>
>> [root at node15-7 root]# service pbs_mom restart
>> Stopping pbs_mom:                                          [  OK  ]
>> Starting pbs_mom:                                          [  OK  ]
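>>
>> Until the underlying bug is fixed, a cron-driven workaround along these
>> lines (a sketch, not an official TORQUE tool; the init-script name matches
>> the restart command above) could detect the symptom and restart the MOM
>> automatically:

```shell
#!/bin/sh
# needs_mom_restart: succeed (exit 0) if the given MOM log file contains
# TRANSICM rejections, i.e. the MOM is likely wedged in this state.
needs_mom_restart() {
    grep -q "unexpected state 'TRANSICM'" "$1"
}

# Example cron usage (hypothetical paths): check today's log and restart.
#   needs_mom_restart /var/spool/pbs/mom_logs/$(date +%Y%m%d) \
#       && service pbs_mom restart
```

>> This is a blunt workaround, but as shown below the jobs that were stuck
>> on the node simply get rescheduled to other nodes after the restart.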
>>
>> Now there are only 2 jobs assigned to node15-7 and they are running:
>>
>> [root at tbn20 server_logs]# qstat -an1 | grep node15-7
>> 748264.tbn20.nikhef. dteam004 test     STDIN       25880     1  --    --
>>  00:30 R   --    node15-7
>> 748265.tbn20.nikhef. lhcb002  qlong    STDIN       25893     1  --    --
>>  64:00 R   --    node15-7
>>
>> The other jobs that were assigned to this node but still queued are now
>> also running, on different nodes:
>>
>> [root at tbn20 server_logs]# qstat -an1 | grep 74817
>> 748171.tbn20.nikhef. dzero013 dzero    STDIN       31345     1  --    --
>>  72:00 R 00:03   node15-2
>> 748172.tbn20.nikhef. dzero013 dzero    STDIN       14078     1  --    --
>>  72:00 R 00:00   node15-8
>> 748173.tbn20.nikhef. dzero013 dzero    STDIN       14122     1  --    --
>>  72:00 R 00:00   node15-8
>> 748174.tbn20.nikhef. dzero013 dzero    STDIN       30344     1  --    --
>>  72:00 R 00:00   node15-9
>> 748175.tbn20.nikhef. dzero013 dzero    STDIN       30164     1  --    --
>>  72:00 R 00:00   node15-11
>> 748176.tbn20.nikhef. dzero013 dzero    STDIN       25087     1  --    --
>>  72:00 R 00:00   node15-12
>> 748177.tbn20.nikhef. dzero013 dzero    STDIN       25135     1  --    --
>>  72:00 R 00:00   node15-12
>> 748178.tbn20.nikhef. dzero013 dzero    STDIN       11590     1  --    --
>>  72:00 R 00:00   node15-13
>> 748179.tbn20.nikhef. dzero013 dzero    STDIN       26972     1  --    --
>>  72:00 R 00:00   node15-14
>>
>> All jobs that were queued are now running again.
>>
>> Interesting detail: after restarting the MOM, job 747900, which was stuck
>> in the unexpected state TRANSICM, is visible again in the momctl output:
>>
>> [root at tbn20 server_logs]# momctl -h node15-7.farmnet.nikhef.nl -d 0
>>
>> Host: node15-7.farmnet.nikhef.nl/node15-7.farmnet.nikhef.nl   Version: 
>> 2.0.0p4
>> Server[0]: tbn20.nikhef.nl (connection is active)
>>   Last Msg From Server:   9 seconds (StatusJob)
>>   Last Msg To Server:     42 seconds
>> HomeDirectory:          /var/spool/pbs/mom_priv
>> MOM active:             3906 seconds
>> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
>> Job[747900.tbn20.nikhef.nl]  State=TRANSICM
>> Job[748265.tbn20.nikhef.nl]  State=RUNNING
>> Job[748266.tbn20.nikhef.nl]  State=RUNNING
>> Assigned CPU Count:     3
>>
>> diagnostics complete
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> 


