[torqueusers] qstat showing wrong

Laurence Dawson larry.dawson at vanderbilt.edu
Wed Aug 24 11:08:09 MDT 2005


We are running torque 1.2.0p5 on our cluster. qstat is showing all jobs 
with very low cputimes (0 seconds up to about 17 seconds). A sample 
extract from qstat is pasted below. These times are clearly incorrect; 
the jobs are a mix of single and multiple cpu jobs. According to qstat, 
no job's cputime is being recorded correctly. See the logs below for 
details, but is the problem related to this message in the mom log?
pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot 
specify protocol

Extract from qstat:
3569.vmpsched    oct_0.20         redmilps         00:00:14 R all
3578.vmpsched    pbsFullDamaged1  guratzrf         00:00:07 R all
3749.vmpsched    si1000.pbs       albadra          00:00:00 R all
3750.vmpsched    si2000.pbs       albadra          00:00:01 R all
3753.vmpsched    kr700.pbs        albadra          00:00:00 R all
3754.vmpsched    kr1300.pbs       albadra          00:00:00 R all
3758.vmpsched    t1.091251_fin    delhomjp         00:00:01 R all
3759.vmpsched    t1.09125         delhomjp         00:00:01 R all

Here are some details on one of these jobs as a sample; it is pretty 
representative:
[root@vmps18 root]# checkjob 3749
job 3749

State: Running
Creds:  user:albadra  group:isde  account:isde  class:all
WallTime: 2:31:41 of 25:00:00:00
SubmitTime: Wed Aug 24 09:25:06
  (Time Queued  Total: 00:01:41  Eligible: 00:00:08)

StartTime: Wed Aug 24 09:26:47
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: base
Network: ---  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: ---  Arch: ---  Features: x86
Dedicated Resources Per Task: PROCS: 1  MEM: 400M
Allocated Nodes:
[vmp400:1]


IWD: /home/albadra/ise/diode/iso/test8d/ionsi  Executable:  si1000.pbs
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:      RESTARTABLE,FSVIOLATION
Attr:        FSVIOLATION

Reservation '3749' (-2:30:13 -> 24:21:29:47  Duration: 25:00:00:00)
PE:  1.00  StartPriority:  995

[root@vmp400 root]# ps -eaf | grep albadra
albadra  14992 21276  0 09:24 ?        00:00:00 -bash
albadra  15295 14992  0 09:24 ?        00:00:00 /bin/bsh /usr/spool/PBS/mom_priv/jobs/3749.vmpsch.SC
albadra  15296 15295  6 09:24 ?        00:09:22 dessis si1000_des.cmd
albadra  15337 21276  0 09:28 ?        00:00:00 -bash
albadra  15638 15337  0 09:28 ?        00:00:00 /bin/bsh /usr/spool/PBS/mom_priv/jobs/3754.vmpsch.SC
albadra  15639 15638  4 09:28 ?        00:07:00 dessis kr1300_des.cmd
root     16245 16159  0 12:00 pts/0    00:00:00 grep albadra
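
To put numbers on the mismatch, here is a rough Python sketch that uses only 
values copied from the qstat and ps output above. The helper name 
hms_to_seconds is made up for illustration, and the two snapshots were not 
taken at exactly the same moment, so the gap is approximate.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# cputime column from the qstat extract above.
qstat_cput = {"3749.vmpsched": "00:00:00", "3754.vmpsched": "00:00:00"}

# TIME column of the dessis process that ps shows under each job script.
ps_time = {"3749.vmpsched": "00:09:22", "3754.vmpsched": "00:07:00"}

for job_id in sorted(qstat_cput):
    reported = hms_to_seconds(qstat_cput[job_id])
    observed = hms_to_seconds(ps_time[job_id])
    print(f"{job_id}: qstat={reported}s ps={observed}s gap={observed - reported}s")
```

For both jobs ps shows several minutes of CPU time that qstat does not report.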

The mom logs show something that might be related:

08/24/2005 09:24:51;0001;   pbs_mom;Job;TMomFinalizeJob3;job 3749.vmpsched started, pid = 14992
08/24/2005 09:24:51;0008;   pbs_mom;Job;3749.vmpsched;Job Modified at request of PBS_Server@vmpsched
08/24/2005 09:28:20;0001;   pbs_mom;Job;TMomFinalizeJob3;job 3754.vmpsched started, pid = 15337
08/24/2005 09:28:20;0008;   pbs_mom;Job;3754.vmpsched;Job Modified at request of PBS_Server@vmpsched
08/24/2005 11:39:02;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 10.0.0.3:15001
08/24/2005 11:47:33;0001;   pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol version
08/24/2005 11:47:33;0001;   pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001;   pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001;   pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 10.0.0.3:15001
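
In case it helps anyone scan a longer mom log for the same pattern, here is a 
small Python sketch that tallies the messages in the records above. The field 
names in parse_record are my own guesses at what the six semicolon-separated 
fields mean, not something taken from the TORQUE documentation.

```python
from collections import Counter

# Records copied from the mom log extract above, one record per line.
MOM_LOG = """\
08/24/2005 11:39:02;0001;   pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 10.0.0.3:15001
08/24/2005 11:47:33;0001;   pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol version
08/24/2005 11:47:33;0001;   pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001;   pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001;   pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001;   pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 10.0.0.3:15001
"""


def parse_record(line):
    """Split one record into six fields (field names are assumed)."""
    timestamp, code, daemon, obj_class, obj_name, message = line.split(";", 5)
    return timestamp, code, daemon.strip(), obj_class, obj_name, message.strip()


# Count how often each distinct message text appears.
counts = Counter(
    parse_record(line)[5] for line in MOM_LOG.splitlines() if line.strip()
)

for message, n in counts.most_common():
    print(n, message)
```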




