[torqueusers] cannot start job - RM failure

Bill Wichser bill at Princeton.EDU
Thu Dec 23 07:29:55 MST 2004


torque-1.1.0p4 or torque-1.1.0p5 - doesn't seem to make a difference
maui-3.2.6p11
CentOS-3.3 x86_64
2.4.21-20.EL.c0 SMP on an EM64T

torque/maui compiled with gcc-3.2.3-42 64bit

Once the machines were placed into production, constant problems began
to appear whenever 100s of jobs are queued.  At first the jobs start to
run, but after that they run only sporadically.

Running checkjob on one of the idle jobs gives this output:

checking job 9168

State: Idle
Creds:  user:martin  group:casl  class:default  qos:DEFAULT
WallTime: 00:00:00 of 4:00:00:00
SubmitTime: Wed Dec 22 09:16:25
   (Time Queued  Total: 1:00:01:50  Eligible: 00:00:04)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 191
PartitionMask: [ALL]
Flags:       RESTARTABLE

Messages:  cannot start job - RM failure, rc: 15041, msg: ' MSG=send 
failed, JOB_SUBSTATE_RUNNING'
PE:  1.00  StartPriority:  1
job can run in partition DEFAULT (150 procs available.  1 procs required)


This is a serial job.  The mom logs don't shed any light, yet the moms
seem to be rejecting jobs because of some resource.  Memory is abundant
and no shared memory segments are in use.  The job requests nothing but
a walltime and a single node.
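
To see whether other idle jobs show the same pattern (a high StartCount
together with an RM failure message), the checkjob text can be
summarized with a small script.  This is only a rough sketch in Python:
the function name summarize_checkjob and the regular expressions are
assumptions based on the output format shown above, not part of Maui or
torque, and the sample text is trimmed from job 9168.

```python
import re

def summarize_checkjob(text):
    """Pull the state, StartCount and any RM failure out of checkjob text.

    Hypothetical helper: the patterns are guessed from the output
    shown in this message, not taken from Maui documentation.
    """
    # Long lines may be wrapped, so collapse all whitespace first.
    flat = " ".join(text.split())
    state = re.search(r"State: (\w+)", flat)
    starts = re.search(r"StartCount: (\d+)", flat)
    failure = re.search(r"RM failure, rc: (\d+), msg: '(.*?)'", flat)
    return {
        "state": state.group(1) if state else None,
        "start_count": int(starts.group(1)) if starts else None,
        "rc": int(failure.group(1)) if failure else None,
        "msg": failure.group(2).strip() if failure else None,
    }

# Lines trimmed from the checkjob output for job 9168.
sample = """State: Idle
Bypass: 0  StartCount: 191
Messages:  cannot start job - RM failure, rc: 15041, msg: ' MSG=send
failed, JOB_SUBSTATE_RUNNING'
"""

print(summarize_checkjob(sample))
```

Feeding it the checkjob output for each idle job would show how many of
them have been started repeatedly and are failing with the same return
code.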

Is this a problem with torque, or is there something that I've missed
along the way that is causing these errors?  I've searched through the
archives and found a reference to this problem, but no solution.

Thanks and have a great holiday to all!

Bill


