[torqueusers] "Cloned" job

Nicolas Ferré nicolas.ferre at univ-provence.fr
Sun Oct 16 01:52:34 MDT 2011


Dear Torque users,

For the past few days, something odd has been happening quite often on
our cluster, which runs Torque + Maui. Some single-node jobs are launched
two or three times, i.e. "clones" run simultaneously on different nodes.
Example: job 61213
> qstat -n1|grep 61213
 61213.slater.up.     fredbip7 journey  opt_part_anionMQ    629     1  4    4gb 48:00 R 44:18   lame10/3+lame10/2+lame10/1+lame10/0
> pbsnodes -a|grep 61213 (or better pestat |grep 61213)
  lame7  busy*   11*  24103   8  32104   1285  4/2    8      0 61215 fredbip7 61212 fredbip7 61213 fredbip7
  lame9  free     7*  24103   8  32104   1190  3/2    4      0 61214 fredbip7 61213 fredbip7
  lame10  busy*   11*  24103   8  32104   1318  4/3    8      0 61215 fredbip7 61213 fredbip7 61289 stereo
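
To look for such clones across the whole cluster rather than one job at a
time, the pestat output can be scanned for job IDs that show up on more
than one node. The following Python sketch does this for the three lines
above. It rests on an assumption taken from this sample only: that the
first 10 whitespace-separated columns are node statistics and everything
after them is a list of "jobid user" pairs. The names N_STAT_COLUMNS,
jobs_by_node and find_clones are made up for this illustration.

```python
from collections import defaultdict

# pestat lines from the example above, with each wrapped line re-joined.
PESTAT_OUTPUT = """\
  lame7  busy*   11*  24103   8  32104   1285  4/2    8      0 61215 fredbip7 61212 fredbip7 61213 fredbip7
  lame9  free     7*  24103   8  32104   1190  3/2    4      0 61214 fredbip7 61213 fredbip7
  lame10  busy*   11*  24103   8  32104   1318  4/3    8      0 61215 fredbip7 61213 fredbip7 61289 stereo
"""

# Assumption (from the sample only): node name plus 9 status columns
# come before the job list on every line.
N_STAT_COLUMNS = 10


def jobs_by_node(pestat_text):
    """Map each job ID to the set of nodes that report it."""
    nodes = defaultdict(set)
    for line in pestat_text.splitlines():
        fields = line.split()
        if len(fields) <= N_STAT_COLUMNS:
            continue  # node without jobs, or a header/blank line
        node = fields[0]
        job_fields = fields[N_STAT_COLUMNS:]
        # The job list alternates "jobid user jobid user ...";
        # keep only the job IDs.
        for job_id in job_fields[0::2]:
            nodes[job_id].add(node)
    return nodes


def find_clones(pestat_text):
    """Return the jobs that are reported on more than one node."""
    return {
        job: sorted(node_set)
        for job, node_set in jobs_by_node(pestat_text).items()
        if len(node_set) > 1
    }


if __name__ == "__main__":
    for job, node_list in sorted(find_clones(PESTAT_OUTPUT).items()):
        print(job, "reported on", ", ".join(node_list))
```

This flags every job seen on several nodes, so a genuine multi-node job
would be listed too. Each hit still has to be compared with the
allocation shown by qstat -n1 before calling it a clone.
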
> checkjob 61213

checking job 61213

State: Running
Creds:  user:fredbip7  group:fredbip7  class:journey  qos:DEFAULT
WallTime: 1:20:22:15 of 2:00:00:00
SubmitTime: Fri Oct 14 13:23:40
  (Time Queued  Total: 00:02:21  Eligible: 00:01:04)

StartTime: Fri Oct 14 13:26:01
Total Tasks: 4

Req[0]  TaskCount: 4  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 1024M
Allocated Nodes:
[lame10:4]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 3
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '61213' (-1:20:21:46 -> 3:38:14  Duration: 2:00:00:00)
Messages:  cannot start job - RM failure, rc: 15084, msg: 'Premature end of message'
PE:  4.00  StartPriority:  1
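
To check several jobs for the same symptoms, the two fields of interest
in the checkjob output above (StartCount and the RM failure in the
Messages line) can be pulled out with a short script. This is only a
sketch: it assumes the field layout shown in this output, and it assumes
that a StartCount greater than 1 means Maui tried to start the job more
than once. The function name summarize_checkjob is made up for this
illustration.

```python
import re

# Excerpt of the checkjob output above, including the wrapped message.
CHECKJOB_OUTPUT = """\
State: Running
StartTime: Fri Oct 14 13:26:01
Allocated Nodes:
[lame10:4]
Bypass: 0  StartCount: 3
Messages:  cannot start job - RM failure, rc: 15084, msg: 'Premature
end of message'
"""


def summarize_checkjob(text):
    """Extract StartCount and the RM failure code/message, if present."""
    summary = {"start_count": None, "rc": None, "msg": None}

    start = re.search(r"StartCount:\s*(\d+)", text)
    if start:
        summary["start_count"] = int(start.group(1))

    # The quoted message may be wrapped over two lines, so match up to
    # the closing quote and normalise the whitespace afterwards.
    failure = re.search(r"rc:\s*(\d+),\s*msg:\s*'([^']*)'", text)
    if failure:
        summary["rc"] = int(failure.group(1))
        summary["msg"] = " ".join(failure.group(2).split())

    return summary


if __name__ == "__main__":
    info = summarize_checkjob(CHECKJOB_OUTPUT)
    if info["start_count"] is not None and info["start_count"] > 1:
        print("job was started", info["start_count"], "times")
    if info["rc"] is not None:
        print("RM failure", info["rc"], "-", info["msg"])
```
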


I can see the strange message "Premature end of message" in the checkjob
output, but judging from the Torque documentation it does not seem to be
related to this problem.

Any ideas?

Nicolas Ferré
Laboratoire Chimie Provence
Aix-Marseille Université, France
Phone : +33 413550532

