[torqueusers] qstat showing wrong
Laurence Dawson
larry.dawson at vanderbilt.edu
Wed Aug 24 11:08:09 MDT 2005
We are running torque 1.2.0p5 on our cluster. qstat is showing all jobs
with very low cputimes (0 seconds up to about 17 seconds). A sample
extract from qstat is pasted below. These times are clearly incorrect;
the jobs are a mix of single- and multiple-cpu jobs. No jobs are being
recorded correctly according to qstat. See the logs below for details,
but is the problem related to this message in the mom log below?
pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot
specify protocol
Extract from qstat:
3569.vmpsched  oct_0.20         redmilps  00:00:14  R  all
3578.vmpsched  pbsFullDamaged1  guratzrf  00:00:07  R  all
3749.vmpsched  si1000.pbs       albadra   00:00:00  R  all
3750.vmpsched  si2000.pbs       albadra   00:00:01  R  all
3753.vmpsched  kr700.pbs        albadra   00:00:00  R  all
3754.vmpsched  kr1300.pbs       albadra   00:00:00  R  all
3758.vmpsched  t1.091251_fin    delhomjp  00:00:01  R  all
3759.vmpsched  t1.09125         delhomjp  00:00:01  R  all
Here are some details on one of these jobs as a sample; it is pretty
representative:
[root at vmps18 root]# checkjob 3749
job 3749
State: Running
Creds: user:albadra group:isde account:isde class:all
WallTime: 2:31:41 of 25:00:00:00
SubmitTime: Wed Aug 24 09:25:06
(Time Queued Total: 00:01:41 Eligible: 00:00:08)
StartTime: Wed Aug 24 09:26:47
Total Tasks: 1
Req[0] TaskCount: 1 Partition: base
Network: --- Memory >= 0 Disk >= 0 Swap >= 0
Opsys: --- Arch: --- Features: x86
Dedicated Resources Per Task: PROCS: 1 MEM: 400M
Allocated Nodes:
[vmp400:1]
IWD: /home/albadra/ise/diode/iso/test8d/ionsi Executable: si1000.pbs
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE,FSVIOLATION
Attr: FSVIOLATION
Reservation '3749' (-2:30:13 -> 24:21:29:47 Duration: 25:00:00:00)
PE: 1.00 StartPriority: 995
[root at vmp400 root]# ps -eaf | grep albadra
albadra 14992 21276 0 09:24 ? 00:00:00 -bash
albadra 15295 14992 0 09:24 ? 00:00:00 /bin/bsh
/usr/spool/PBS/mom_priv/jobs/3749.vmpsch.SC
albadra 15296 15295 6 09:24 ? 00:09:22 dessis si1000_des.cmd
albadra 15337 21276 0 09:28 ? 00:00:00 -bash
albadra 15638 15337 0 09:28 ? 00:00:00 /bin/bsh
/usr/spool/PBS/mom_priv/jobs/3754.vmpsch.SC
albadra 15639 15638 4 09:28 ? 00:07:00 dessis kr1300_des.cmd
root 16245 16159 0 12:00 pts/0 00:00:00 grep albadra
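To put a number on the discrepancy, here is a minimal Python sketch that
converts the HH:MM:SS values to seconds and compares what qstat reports
with the TIME column ps shows for the dessis process of the same job. The
helper names (hms_to_seconds, cput_gap) are hypothetical, and the values
are simply copied by hand from the qstat and ps output in this message.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# cputime as reported by qstat (copied from the extract in this message).
QSTAT_CPUT = {
    "3749.vmpsched": "00:00:00",
    "3754.vmpsched": "00:00:00",
}

# TIME column from ps on vmp400 for the dessis process of each job.
PS_TIME = {
    "3749.vmpsched": "00:09:22",  # dessis si1000_des.cmd, pid 15296
    "3754.vmpsched": "00:07:00",  # dessis kr1300_des.cmd, pid 15639
}


def cput_gap(job_id):
    """CPU seconds that ps shows for the job but qstat does not."""
    return hms_to_seconds(PS_TIME[job_id]) - hms_to_seconds(QSTAT_CPUT[job_id])


if __name__ == "__main__":
    for job_id in sorted(QSTAT_CPUT):
        print(job_id, cput_gap(job_id), "seconds unaccounted for")
```

For both jobs the gap is several minutes of CPU time, which is why the
qstat figures look wrong rather than just slightly stale.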
The mom logs show something that might be related:
08/24/2005 09:24:51;0001; pbs_mom;Job;TMomFinalizeJob3;job
3749.vmpsched started, pid = 14992
08/24/2005 09:24:51;0008; pbs_mom;Job;3749.vmpsched;Job Modified at
request of PBS_Server at vmpsched
08/24/2005 09:28:20;0001; pbs_mom;Job;TMomFinalizeJob3;job
3754.vmpsched started, pid = 15337
08/24/2005 09:28:20;0008; pbs_mom;Job;3754.vmpsched;Job Modified at
request of PBS_Server at vmpsched
08/24/2005 11:39:02;0001; pbs_mom;Svr;pbs_mom;im_eof, End of File from
addr 10.0.0.3:15001
08/24/2005 11:47:33;0001; pbs_mom;Svr;pbs_mom;No child processes (10)
in is_update_stat, cannot specify protocol version
08/24/2005 11:47:33;0001; pbs_mom;Svr;pbs_mom;No child processes (10)
in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001; pbs_mom;Svr;pbs_mom;No child processes (10)
in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001; pbs_mom;Svr;pbs_mom;No child processes (10)
in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of
message from addr 10.0.0.3:15001
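To see how often the is_update_stat message shows up, something like the
following Python sketch could be run over the mom log. It assumes each
record is one unwrapped line with semicolon-separated fields in the order
seen in the extract above (timestamp, code, daemon, class, name,
message); that layout is inferred from this extract only, and the
function name count_by_timestamp is hypothetical. The sample text is the
11:39 to 11:47 part of the extract with the mail line wrapping undone.

```python
from collections import Counter

# Records copied from the mom log extract, one record per line.
SAMPLE_LOG = """\
08/24/2005 11:39:02;0001; pbs_mom;Svr;pbs_mom;im_eof, End of File from addr 10.0.0.3:15001
08/24/2005 11:47:33;0001; pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol version
08/24/2005 11:47:33;0001; pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001; pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001; pbs_mom;Svr;pbs_mom;No child processes (10) in is_update_stat, cannot specify protocol
08/24/2005 11:47:33;0001; pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 10.0.0.3:15001
"""


def count_by_timestamp(log_text, needle="is_update_stat"):
    """Count records whose message contains `needle`, keyed by timestamp."""
    counts = Counter()
    for line in log_text.splitlines():
        # Assumed layout: timestamp;code;daemon;class;name;message
        fields = line.split(";", 5)
        if len(fields) == 6 and needle in fields[5]:
            counts[fields[0]] += 1
    return counts


if __name__ == "__main__":
    for stamp, hits in sorted(count_by_timestamp(SAMPLE_LOG).items()):
        print(stamp, hits)
```

On the extract above this reports four hits, all within the same second
at 11:47:33.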