[torqueusers] cannot start job - RM failure
bill at Princeton.EDU
Thu Dec 23 07:29:55 MST 2004
TORQUE: torque-1.1.0p4 or torque-1.1.0p5 (doesn't seem to make a difference)
Kernel: 2.4.21-20.EL.c0 SMP on an EM64T
Build: torque/maui compiled with gcc-3.2.3-42, 64-bit
Once the machines were placed into production, we started seeing constant
problems whenever hundreds of jobs are queued. At first the jobs start to
run, but after that they only run sporadically.
Here is the checkjob output for one of the affected jobs:
checking job 9168
Creds: user:martin group:casl class:default qos:DEFAULT
WallTime: 00:00:00 of 4:00:00:00
SubmitTime: Wed Dec 22 09:16:25
(Time Queued Total: 1:00:01:50 Eligible: 00:00:04)
Total Tasks: 1
Req TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 191
Messages: cannot start job - RM failure, rc: 15041, msg: ' MSG=send
PE: 1.00 StartPriority: 1
job can run in partition DEFAULT (150 procs available. 1 procs required)
This is a serial job. The mom logs don't seem to shed any light, yet the
moms seem to be rejecting jobs because of some resource. Memory is abundant,
with no shared memory segments in use. The job requests nothing but a
walltime and a single node.
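
In case it helps anyone checking whether other queued jobs show the same symptom, here is a minimal sketch that pulls the job id, the StartCount, and the RM failure return code out of saved checkjob text. The function name summarize_checkjob is made up for illustration, and the patterns assume the output looks like the listing above; other Maui versions may format these fields differently.

```python
import re

def summarize_checkjob(text):
    """Extract job id, StartCount, and RM failure rc from checkjob output text."""
    # Patterns are based only on the field layout shown in this post (an assumption).
    job = re.search(r"checking job (\S+)", text)
    starts = re.search(r"StartCount:\s*(\d+)", text)
    failure = re.search(r"RM failure, rc:\s*(\d+)", text)
    return {
        "job": job.group(1) if job else None,
        "start_count": int(starts.group(1)) if starts else None,
        "rm_failure_rc": int(failure.group(1)) if failure else None,
    }

# Lines copied from the checkjob output in this post (message left truncated as posted).
sample = """checking job 9168
Bypass: 0 StartCount: 191
Messages: cannot start job - RM failure, rc: 15041, msg: ' MSG=send
"""

print(summarize_checkjob(sample))
# {'job': '9168', 'start_count': 191, 'rm_failure_rc': 15041}
```

A job with a high start_count and a non-empty rm_failure_rc matches the pattern described here; a job where both are missing or zero is probably waiting for some other reason.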
Is this a problem with torque, or is there something I've missed along the
way that is causing these errors? I've searched the archives and found a
reference to this problem, but no solution.
Thanks and have a great holiday to all!