[torqueusers] cannot start job - RM failure
Bill Wichser
bill at Princeton.EDU
Thu Dec 23 07:29:55 MST 2004
torque-1.1.0p4 or torque-1.1.0p5 - doesn't seem to make a difference
maui-3.2.6p11
CentOS-3.3 x86_64
2.4.21-20.EL.c0 SMP on an EM64T
torque/maui compiled with gcc-3.2.3-42 64bit
Once the machines were placed into production, constant problems have
appeared whenever 100s of jobs are queued. At first the jobs start to run,
but then they run only sporadically.
Here is the checkjob output for one of the stuck jobs:
checking job 9168
State: Idle
Creds: user:martin group:casl class:default qos:DEFAULT
WallTime: 00:00:00 of 4:00:00:00
SubmitTime: Wed Dec 22 09:16:25
(Time Queued Total: 1:00:01:50 Eligible: 00:00:04)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 191
PartitionMask: [ALL]
Flags: RESTARTABLE
Messages: cannot start job - RM failure, rc: 15041, msg: ' MSG=send
failed, JOB_SUBSTATE_RUNNING'
PE: 1.00 StartPriority: 1
job can run in partition DEFAULT (150 procs available. 1 procs required)
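The fields that stand out in the output above are StartCount (191 start attempts) and the RM return code in the Messages line. For anyone wanting to collect these across many queued jobs, here is a minimal sketch that pulls them out of saved checkjob text. It assumes the layout shown above; the function name summarize_checkjob and the SAMPLE string are hypothetical helpers for illustration, not part of Maui or TORQUE.

```python
import re

# Sample text copied from the checkjob output in this post (abbreviated).
SAMPLE = """\
checking job 9168
State: Idle
Bypass: 0 StartCount: 191
Messages: cannot start job - RM failure, rc: 15041, msg: ' MSG=send
failed, JOB_SUBSTATE_RUNNING'
"""


def summarize_checkjob(text):
    """Extract job id, state, start count and RM return code from checkjob text.

    Hypothetical helper: it assumes the field labels appear exactly as in
    the output quoted in this post. Missing fields are returned as None.
    """
    # Collapse whitespace so that lines wrapped by the mailer still match.
    flat = " ".join(text.split())
    patterns = {
        "job": r"checking job (\S+)",
        "state": r"State: (\w+)",
        "start_count": r"StartCount: (\d+)",
        "rm_rc": r"RM failure, rc: (\d+)",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, flat)
        result[key] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    print(summarize_checkjob(SAMPLE))
```

Running this over the saved output of each idle job would show whether every stuck job reports the same return code and a similarly high StartCount.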
This is a serial job. The mom logs don't seem to shed any light, yet the
moms seem to be rejecting jobs due to some resource. Memory is abundant,
with no shared memory segments. The job requests nothing but a walltime
and a single node.
Is this a problem with torque, or is there something that I've missed
along the way that is causing these errors? I've searched through the
archives and seen a reference to this problem, but no solutions.
Thanks and have a great holiday to all!
Bill