[torqueusers] "Cloned" job
Nicolas Ferré
nicolas.ferre at univ-provence.fr
Sun Oct 16 01:52:34 MDT 2011
Dear Torque users,
For the past few days, something odd has been happening quite often on
our cluster, which runs Torque + Maui. Some single-node jobs are launched
two or three times, i.e. "clones" of the same job run simultaneously on
different nodes.
Example: job 61213
> qstat -n1|grep 61213
61213.slater.up. fredbip7 journey opt_part_anionMQ 629 1
4 4gb 48:00 R 44:18 lame10/3+lame10/2+lame10/1+lame10/0
> pbsnodes -a|grep 61213 (or better pestat |grep 61213)
lame7 busy* 11* 24103 8 32104 1285 4/2 8 0
61215 fredbip7 61212 fredbip7 61213 fredbip7
lame9 free 7* 24103 8 32104 1190 3/2 4 0
61214 fredbip7 61213 fredbip7
lame10 busy* 11* 24103 8 32104 1318 4/3 8 0
61215 fredbip7 61213 fredbip7 61289 stereo
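The manual check above (grepping the per-node listing for one job id) could be automated roughly as follows. This is only a minimal sketch: it assumes pestat output in the wrapped layout shown in this message (a node line followed by a line of "jobid user" pairs), and the function name find_cloned_jobs is hypothetical, not part of Torque or pestat. It reports every job id seen on more than one node, so a legitimate multi-node job would also be listed; the result still has to be compared with the allocation reported by qstat -n1.

```python
from collections import defaultdict


def find_cloned_jobs(pestat_text):
    """Map each job id seen on more than one node to the sorted list of those nodes."""
    nodes_by_job = defaultdict(set)
    current_node = None
    for line in pestat_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # A line whose first token is not purely numeric starts a new node record.
        if not tokens[0].isdigit():
            current_node = tokens[0]
            tokens = tokens[1:]
        if current_node is None:
            continue
        # Assumption: a job entry is a numeric id immediately followed by a
        # user name that starts with a letter.
        for tok, nxt in zip(tokens, tokens[1:]):
            if tok.isdigit() and nxt[0].isalpha():
                nodes_by_job[tok].add(current_node)
    return {job: sorted(nodes) for job, nodes in nodes_by_job.items() if len(nodes) > 1}


# Sample copied from the pestat output quoted in this message.
sample = """\
lame7 busy* 11* 24103 8 32104 1285 4/2 8 0
61215 fredbip7 61212 fredbip7 61213 fredbip7
lame9 free 7* 24103 8 32104 1190 3/2 4 0
61214 fredbip7 61213 fredbip7
lame10 busy* 11* 24103 8 32104 1318 4/3 8 0
61215 fredbip7 61213 fredbip7 61289 stereo
"""

for job, nodes in find_cloned_jobs(sample).items():
    print(job, "appears on", ", ".join(nodes))
```

On the sample, job 61213 is reported on lame10, lame7 and lame9, and job 61215 on lame10 and lame7.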
> checkjob 61213
checking job 61213
State: Running
Creds: user:fredbip7 group:fredbip7 class:journey qos:DEFAULT
WallTime: 1:20:22:15 of 2:00:00:00
SubmitTime: Fri Oct 14 13:23:40
(Time Queued Total: 00:02:21 Eligible: 00:01:04)
StartTime: Fri Oct 14 13:26:01
Total Tasks: 4
Req[0] TaskCount: 4 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Dedicated Resources Per Task: PROCS: 1 MEM: 1024M
Allocated Nodes:
[lame10:4]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 3
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '61213' (-1:20:21:46 -> 3:38:14 Duration: 2:00:00:00)
Messages: cannot start job - RM failure, rc: 15084, msg: 'Premature
end of message'
PE: 4.00 StartPriority: 1
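The fields of the checkjob output above that seem most relevant here (StartCount, the allocated nodes, and the RM failure code) could be extracted with a short script. This is a hedged sketch that assumes the text layout shown in this message; summarize_checkjob is a hypothetical helper, not a Maui command or API. A StartCount greater than 1 together with a single allocated node may be worth comparing against the number of nodes on which the job actually appears.

```python
import re


def summarize_checkjob(text):
    """Pull StartCount, allocated nodes and any RM failure code out of checkjob text."""
    start = re.search(r"StartCount:\s*(\d+)", text)
    # Allocated nodes appear as bracketed "[node:tasks]" entries;
    # placeholders such as [NONE] or [ALL] do not match this pattern.
    nodes = re.findall(r"\[([\w.-]+):(\d+)\]", text)
    rc = re.search(r"RM failure, rc:\s*(\d+)", text)
    return {
        "start_count": int(start.group(1)) if start else None,
        "allocated": {name: int(tasks) for name, tasks in nodes},
        "rm_failure_rc": int(rc.group(1)) if rc else None,
    }


# Excerpt copied from the checkjob output quoted in this message.
excerpt = """\
Allocated Nodes:
[lame10:4]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 3
PartitionMask: [ALL]
Messages: cannot start job - RM failure, rc: 15084, msg: 'Premature
end of message'
"""

print(summarize_checkjob(excerpt))
```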
I can see the strange message "Premature end of message" in the checkjob
output, but judging from the Torque documentation it does not seem to be
related to the problem.
Any ideas?
Nicolas Ferré
Laboratoire Chimie Provence
Aix-Marseille Université, France
Phone : +33 413550532