[torqueusers] pbs_mom error
ajit mote
ajitm at cdac.in
Thu Nov 2 21:30:08 MST 2006
Hi all,
We are using torque-2.0.0.p5 with the Moab scheduler.
Our 18-node cluster has been running without problems for the last 3 months.
Suddenly, for the last 2 days, jobs are being deferred.
Output of checkjob -vv Job_id:
Node Availability for Partition PARAM-Linux --------
NOTE: job cannot run (job has hold in place)
xn01.npsf.cdac.ernet.in available: 2 tasks supported
xn02.npsf.cdac.ernet.in available: 2 tasks supported
xn03.npsf.cdac.ernet.in rejected: State (Busy)
xn04.npsf.cdac.ernet.in rejected: State (Busy)
xn05.npsf.cdac.ernet.in rejected: State (Busy)
xn06.npsf.cdac.ernet.in rejected: State (Busy)
xn07.npsf.cdac.ernet.in rejected: State (Busy)
xn08.npsf.cdac.ernet.in rejected: State (Busy)
xn09.npsf.cdac.ernet.in available: 2 tasks supported
xn10.npsf.cdac.ernet.in available: 2 tasks supported
xn11.npsf.cdac.ernet.in available: 2 tasks supported
xn12.npsf.cdac.ernet.in available: 2 tasks supported
xn13.npsf.cdac.ernet.in available: 2 tasks supported
xn14.npsf.cdac.ernet.in available: 2 tasks supported
xn15.npsf.cdac.ernet.in available: 2 tasks supported
xn16.npsf.cdac.ernet.in available: 2 tasks supported
NOTE: non-idle expected state 'Deferred'
Message[0] job rejected by RM 'PARAM-Linux' - job started on hostlist
xn02.npsf.cdac.ernet.in at time 08:40:36_11/03, job reported idle at
time 08:40:38_11/03
This shows that the nodes are free and up, yet the job still does not get
executed.
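
For longer node lists, the per-node verdicts in the checkjob output above can be tallied with a small script. This is only a rough Python sketch and not part of Moab or TORQUE: the function name summarize_nodes and the regular expression are made up for illustration, and they assume each node line has the "hostname available|rejected: detail" shape seen above.

```python
import re
from collections import Counter

# Matches lines such as
#   "xn03.npsf.cdac.ernet.in rejected: State (Busy)"
# (assumed line shape, taken from the checkjob output quoted above)
NODE_LINE = re.compile(r"^(\S+)\s+(available|rejected):\s*(.*)$")

def summarize_nodes(checkjob_text):
    """Return (Counter of verdicts, {node: (verdict, detail)})."""
    verdicts = Counter()
    nodes = {}
    for line in checkjob_text.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            node, verdict, detail = match.groups()
            verdicts[verdict] += 1
            nodes[node] = (verdict, detail)
    return verdicts, nodes

sample = """\
xn01.npsf.cdac.ernet.in available: 2 tasks supported
xn02.npsf.cdac.ernet.in available: 2 tasks supported
xn03.npsf.cdac.ernet.in rejected: State (Busy)
NOTE: non-idle expected state 'Deferred'
"""

verdicts, nodes = summarize_nodes(sample)
print(dict(verdicts))
for node, (verdict, detail) in sorted(nodes.items()):
    print(node, verdict, detail)
```

NOTE and Message lines do not match the pattern, so they are skipped.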
I also observed that when I submit a number of different jobs at different
times, xn02 is selected every time and the job is deferred.
I checked the mom log; it gives the following error:
11/03/2006 09:42:09;0100; pbs_mom;Req;;Type JobScript request received from PBS_Server at xn01.npsf.cdac.ernet.0
11/03/2006 09:42:09;0100; pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at xn01.npsf.cdac.er0
11/03/2006 09:42:09;0100; pbs_mom;Req;;Type Commit request received from PBS_Server at xn01.npsf.cdac.ernet.in,0
11/03/2006 09:42:09;0001; pbs_mom;Svr;pbs_mom;Success (0) in TMomFinalizeJob3, read of pipe for sid failed f)
11/03/2006 09:42:09;0001; pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
11/03/2006 09:42:09;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
11/03/2006 09:42:09;0100; pbs_mom;Req;;Type StatusJob request received from PBS_Server at xn01.npsf.cdac.ernet.2
11/03/2006 09:42:09;0100; pbs_mom;Req;;Type ModifyJob request received from PBS_Server at xn01.npsf.cdac.ernet.0
11/03/2006 09:42:09;0008; pbs_mom;Job;51808.xn01.npsf.cdac.ernet.in;Job Modified at request of PBS_Server at xnn
11/03/2006 09:42:09;0100; pbs_mom;Req;;Type DeleteJob request received from PBS_Server at xn01.npsf.cdac.ernet.3
This is the error that comes up every time: "read of pipe for sid failed
for job".
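
Mom log text pasted into mail often arrives hard-wrapped or with records run together, which makes the failing entries hard to spot. The following is a rough Python sketch (not a TORQUE tool; split_records and failed_starts are hypothetical names) that splits such text back into one record per timestamp and keeps only the records mentioning the sid failure. It assumes every record starts with "MM/DD/YYYY HH:MM:SS;" followed by a four-digit event code, as in the log above, and it needs Python 3.7 or later because it splits on an empty match.

```python
import re

# A record is assumed to start with "MM/DD/YYYY HH:MM:SS;NNNN;".
# The lookahead is zero-width, so nothing is consumed by the split.
RECORD_START = re.compile(r"(?=\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};\d{4};)")

def split_records(raw):
    """Split wrapped or fused log text into one string per record."""
    flat = " ".join(raw.split())  # undo hard line wrapping
    return [part.strip() for part in RECORD_START.split(flat) if part.strip()]

def failed_starts(records):
    """Keep records that mention the sid failure seen in this thread."""
    needles = ("read of pipe for sid failed", "improper sid")
    return [rec for rec in records if any(n in rec for n in needles)]

sample = (
    "11/03/2006 09:42:09;0100; pbs_mom;Req;;Type Commit request received from\n"
    "PBS_Server at xn01.npsf.cdac.ernet.in,011/03/2006 09:42:09;0001;\n"
    "pbs_mom;Svr;pbs_mom;Success (0) in TMomFinalizeJob3, read of pipe for\n"
    "sid failed f)11/03/2006 09:42:09;0001;\n"
    "pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid\n"
)

records = split_records(sample)
failures = failed_starts(records)
for rec in failures:
    print(rec)
```

Any stray characters left at the end of a truncated record stay with that record; the sketch does not try to guess what was cut off.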
What should I do to get rid of this?
Thanks.
--
"Live Life Dangerously"