[torqueusers] Unexpected state TRANSICM

spitfire14 spitfire14 at bluewin.ch
Mon Apr 4 16:51:52 MDT 2005


I have the following problem on an Itanium2 cluster running RedHat AS 
2.1 with torque 1.2.0p2 and Maui maui-3.2.6p11. We have a user 
submitting a large number of jobs (1 per second). After some time, jobs 
get deferred and the throughput gets ridiculously low.

On the master we see the following messages:

04/05 00:11:03 ERROR:    job '2418184' cannot be started: (rc: 15031 errmsg: 'Premature end of message'  hostlist: 'cpt14')
04/05 00:28:12 ERROR:    job '2418191' cannot be started: (rc: 15041 errmsg: ' MSG=send failed, STARTING'  hostlist: 'cpt14')

And on the compute node, these:

04/05/2005 00:28:12;0001;   pbs_mom;Svr;pbs_mom;Success (0) in req_jobscript, job in unexpected state 'TRANSICM'
04/05/2005 00:28:12;0080;   pbs_mom;Req;req_reject;Reject reply code=15004( MSG=job in unexpected state 'TRANSICM'), aux=0, type=JobScript, from PBS_Server at frt
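
To gauge how often a node hits this, the pbs_mom log could be scanned with 
something like the sketch below. It is only a rough illustration in Python: 
the semicolon-separated field layout is assumed from the two pbs_mom lines 
quoted above, the function name count_unexpected_states is made up for this 
example, and each log record is assumed to sit on a single line (unlike the 
mail-wrapped copy above).

```python
from collections import Counter


def count_unexpected_states(lines):
    """Count "job in unexpected state '<STATE>'" messages per state name.

    Assumed record layout (inferred from the quoted log lines only):
    timestamp;event code;daemon;class;name;message
    """
    counts = Counter()
    marker = "job in unexpected state '"
    for line in lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # not a complete log record
        message = fields[5]
        start = message.find(marker)
        if start == -1:
            continue  # record does not mention an unexpected state
        start += len(marker)
        end = message.find("'", start)
        if end == -1:
            continue  # closing quote missing, skip the record
        counts[message[start:end]] += 1
    return counts


# The two records from the compute node, each joined onto one line.
sample = [
    "04/05/2005 00:28:12;0001;   pbs_mom;Svr;pbs_mom;Success (0) in "
    "req_jobscript, job in unexpected state 'TRANSICM'",
    "04/05/2005 00:28:12;0080;   pbs_mom;Req;req_reject;Reject reply "
    "code=15004( MSG=job in unexpected state 'TRANSICM'), aux=0, "
    "type=JobScript, from PBS_Server at frt",
]

for state, n in count_unexpected_states(sample).items():
    print(state, n)
```

Feeding it the lines of a real log file (for example via open(path)) would 
give a per-state count that can be compared before and after a pbs_mom 
restart.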

Restarting pbs_mom cures the symptom but it keeps coming back.

Any solutions?
