[torqueusers] pbs_mom error

ajit mote ajitm at cdac.in
Thu Nov 2 21:30:08 MST 2006


Hi all,
   We are using torque-2.0.0.p5 with the Moab scheduler.
   Our 18-node cluster has been running without problems for the last 3 months.
   For the last 2 days, however, jobs have been getting deferred.
   The output of checkjob -vv Job_id shows:
               Node Availability for Partition PARAM-Linux --------
NOTE:  job cannot run  (job has hold in place)
xn01.npsf.cdac.ernet.in  available: 2 tasks supported
xn02.npsf.cdac.ernet.in  available: 2 tasks supported
xn03.npsf.cdac.ernet.in  rejected: State (Busy)
xn04.npsf.cdac.ernet.in  rejected: State (Busy)
xn05.npsf.cdac.ernet.in  rejected: State (Busy)
xn06.npsf.cdac.ernet.in  rejected: State (Busy)
xn07.npsf.cdac.ernet.in  rejected: State (Busy)
xn08.npsf.cdac.ernet.in  rejected: State (Busy)
xn09.npsf.cdac.ernet.in  available: 2 tasks supported
xn10.npsf.cdac.ernet.in  available: 2 tasks supported
xn11.npsf.cdac.ernet.in  available: 2 tasks supported
xn12.npsf.cdac.ernet.in  available: 2 tasks supported
xn13.npsf.cdac.ernet.in  available: 2 tasks supported
xn14.npsf.cdac.ernet.in  available: 2 tasks supported
xn15.npsf.cdac.ernet.in  available: 2 tasks supported
xn16.npsf.cdac.ernet.in  available: 2 tasks supported
NOTE:  non-idle expected state 'Deferred'
Message[0] job rejected by RM 'PARAM-Linux' - job started on hostlist
xn02.npsf.cdac.ernet.in at time 08:40:36_11/03, job reported idle at
time 08:40:38_11/03

  It shows that nodes are free and up, yet the job still does not get
executed.
  I also observed that when I submit several different jobs at different
times, xn02 is selected every time and the job is deferred.

  I checked the mom log, and it gives the following error:
11/03/2006 09:42:09;0100;   pbs_mom;Req;;Type JobScript request received from PBS_Server at xn01.npsf.cdac.ernet.0
11/03/2006 09:42:09;0100;   pbs_mom;Req;;Type ReadyToCommit request received from PBS_Server at xn01.npsf.cdac.er0
11/03/2006 09:42:09;0100;   pbs_mom;Req;;Type Commit request received from PBS_Server at xn01.npsf.cdac.ernet.in,0
11/03/2006 09:42:09;0001;   pbs_mom;Svr;pbs_mom;Success (0) in TMomFinalizeJob3, read of pipe for sid failed f)
11/03/2006 09:42:09;0001;   pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
11/03/2006 09:42:09;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
11/03/2006 09:42:09;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at xn01.npsf.cdac.ernet.2
11/03/2006 09:42:09;0100;   pbs_mom;Req;;Type ModifyJob request received from PBS_Server at xn01.npsf.cdac.ernet.0
11/03/2006 09:42:09;0008;   pbs_mom;Job;51808.xn01.npsf.cdac.ernet.in;Job Modified at request of PBS_Server at xnn
11/03/2006 09:42:09;0100;   pbs_mom;Req;;Type DeleteJob request received from PBS_Server at xn01.npsf.cdac.ernet.3
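
  Since log entries pasted into mail can run together, here is a minimal
Python sketch (Python 3.7 or later) for splitting such text back into one
entry per line and picking out the failed job starts. It is not part of
TORQUE; the function names split_entries and failed_starts are made up for
illustration, and it assumes every entry begins with a timestamp of the
form MM/DD/YYYY HH:MM:SS; as in the excerpt above.

```python
import re

# Assumption: each pbs_mom log entry starts with "MM/DD/YYYY HH:MM:SS;".
# The lookahead makes the split zero-width, so the timestamp stays with its entry.
ENTRY_START = re.compile(r"(?=\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};)")


def split_entries(raw):
    """Split run-together log text into one string per entry."""
    # Undo mail line wrapping by collapsing all whitespace runs to one space.
    text = " ".join(raw.split())
    return [e.strip() for e in ENTRY_START.split(text) if e.strip()]


def failed_starts(entries):
    """Keep entries that mention a failed sid read or a failed job start."""
    needles = ("read of pipe for sid failed", "start failed")
    return [e for e in entries if any(n in e for n in needles)]


if __name__ == "__main__":
    import sys

    # Usage: python split_momlog.py < pasted_log.txt
    for entry in failed_starts(split_entries(sys.stdin.read())):
        print(entry)
```

  Characters that were cut off or fused at the end of an entry stay attached
to that entry; the sketch does not try to repair them.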

    This is the error that comes up every time: "read of pipe for sid
failed for job".

     What should I do to get rid of this?

    Thanks.
-- 
"Live Life Dangerously"


