[torqueusers] possible protocol problem.

Dave Jackson jacksond at supercluster.org
Tue Nov 16 21:21:35 MST 2004


Chris,

  TORQUE 1.1.0p5 has at least 10 patches which may be directly
applicable to your current problem.  If it is possible, a first step
would be to upgrade to the latest release.  Second, I would recommend
reviewing section 3.1 of the online TORQUE admin manual.  It recommends
a number of key settings for large systems including disabling RPP.  

  If the problem persists after taking these steps, we would be very
interested in working with you directly to correct this problem.
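
  As a stopgap for the symptom described below (nodes that answer ping
but not ssh or rsh), here is a minimal sketch of an external probe.  It
is not part of TORQUE.  The function names, the port and the timeout
are illustrative assumptions.  It treats a node as responsive only if
the service sends a greeting, as sshd does with its banner, because a
hung node may still complete the TCP handshake in the kernel.

```python
import socket


def service_responds(host, port=22, timeout=5.0):
    """Return True if host:port accepts a connection AND sends at least
    one byte (e.g. the sshd banner) within `timeout` seconds.

    Assumption: the probed service speaks first, as sshd does.
    """
    try:
        # The timeout applies to the connect and to the recv that follows.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            return len(sock.recv(64)) > 0
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def find_unresponsive(nodes, port=22, timeout=5.0):
    """Return the subset of `nodes` that fail the greeting probe."""
    return [n for n in nodes if not service_responds(n, port, timeout)]


if __name__ == "__main__":
    # Hypothetical node names; substitute your own list.
    for node in find_unresponsive(["node001", "node002"]):
        # Such nodes could then be marked offline by hand with
        # `pbsnodes -o <node>` so the scheduler stops waiting on them.
        print("no greeting from", node)
```

  Run from cron on the server host, this would only report suspect
nodes; whether to mark them offline automatically is a site decision.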

Dave

On Tue, 2004-11-16 at 06:59, Chris Johnson wrote:
>      Hi people,
> 
>      We have a strange problem that we saw in OpenPBS as well a few
> years ago.  This caused us to switch to PBSPro which seems to have
> solved the problem.  We were seeing this in Open on earlier RH
> versions.  We're currently seeing it with torque 1.0.1p6 in FC2.
> 
>      Our nodes sometimes go into a comatose state in which they can
> be pinged but not ssh'ed or rsh'ed to at all.  The network seems up
> but many of the higher functions are unresponsive.  Torque, and
> earlier OpenPBS, don't respond well at all.  The master doesn't detect
> these nodes as being down.  The scheduler thinks they're still up and
> hangs on them unless they are specifically marked as offline.  This of
> course makes scheduling jobs a real pain because it doesn't happen
> very fast if at all. 
> 
>      We were hoping to replace Pro with torque to avoid the licensing
> hassles.  But this undetected down node problem pretty much squashes
> that idea.  
> 
>      Has anybody seen this problem?  As I said, PBSPro seems to have
> solved this problem very nicely.  Torque does not seem to have done so.  
> Are there any clues on this one?  Are there any time-out tunings that
> would fix it?
> 
>      Help appreciated.
> 
> -------------------------------------------------------------------------------
> Chris Johnson               |Internet: johnson at nmr.mgh.harvard.edu
> Systems Administrator       |Web:      http://www.nmr.mgh.harvard.edu/~johnson
> NMR Center                  |Voice:    617.726.0949
> Mass. General Hospital      |FAX:      617.726.7422
> 149 (2301) 13th Street      |If ignorance is bliss, why aren't there more
> Charlestown, MA., 02129 USA |happy people?    Tea bag tag
> -------------------------------------------------------------------------------
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://supercluster.org/mailman/listinfo/torqueusers


