[torqueusers] Unexpected state TRANSICM
spitfire14 at bluewin.ch
Mon Apr 4 16:51:52 MDT 2005
I have the following problem on an Itanium2 cluster running RedHat AS
2.1 with torque 1.2.0p2 and Maui 3.2.6p11. We have a user submitting
a large number of jobs (1 per second). After some time, jobs get
deferred and the throughput gets ridiculously low.
On the master we see the following messages:
04/05 00:11:03 ERROR: job '2418184' cannot be started: (rc: 15031 errmsg: 'Premature end of message' hostlist: 'cpt14')
04/05 00:28:12 ERROR: job '2418191' cannot be started: (rc: 15041 errmsg: ' MSG=send failed, STARTING' hostlist: 'cpt14')
And on the compute node we see these:
04/05/2005 00:28:12;0001; pbs_mom;Svr;pbs_mom;Success (0) in req_jobscript, job in unexpected state 'TRANSICM'
04/05/2005 00:28:12;0080; pbs_mom;Req;req_reject;Reject reply code=15004( MSG=job in unexpected state 'TRANSICM'), aux=0, type=JobScript, from PBS_Server at frt
Restarting pbs_mom cures the symptom, but it keeps coming back.
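
To get a feel for how often the node rejects requests between restarts, something like the following could tally the state named in each rejection. This is only a rough sketch: the function name and the regular expression are made up for illustration, and it assumes the log wording matches the "job in unexpected state" text in the excerpt above.

```python
import re
from collections import Counter

# Hypothetical pattern, based only on the wording seen in the log excerpt.
STATE_RE = re.compile(r"job in unexpected state '([A-Z]+)'")


def count_unexpected_states(lines):
    """Count how many times each unexpected job state appears in the log lines."""
    counts = Counter()
    for line in lines:
        for state in STATE_RE.findall(line):
            counts[state] += 1
    return counts


# Sample entries copied from the compute node log above.
sample = [
    "04/05/2005 00:28:12;0001; pbs_mom;Svr;pbs_mom;Success (0) in "
    "req_jobscript, job in unexpected state 'TRANSICM'",
    "04/05/2005 00:28:12;0080; pbs_mom;Req;req_reject;Reject reply "
    "code=15004( MSG=job in unexpected state 'TRANSICM'), aux=0, "
    "type=JobScript, from PBS_Server at frt",
]

print(count_unexpected_states(sample))
```

For a real log, an open file object can be passed in place of the sample list, since the function only iterates over lines.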