[torqueusers] possible protocol problem.

Chris Johnson johnson at nmr.mgh.harvard.edu
Tue Nov 16 06:59:55 MST 2004

     Hi people,

     We have a strange problem that we also saw in OpenPBS a few
years ago.  This caused us to switch to PBSPro, which seems to have
solved the problem.  We were seeing this in OpenPBS on earlier RH
versions.  We're currently seeing it with torque 1.0.1p6 on FC2.

     Our nodes sometimes go into a comatose state in which they can
be pinged but cannot be reached by ssh or rsh at all.  The network
seems up but many of the higher functions are unresponsive.  Torque,
and earlier OpenPBS, don't handle this well at all.  The master doesn't
detect these nodes as being down.  The scheduler thinks they're still
up and hangs on them unless they are specifically marked as offline.
This of course makes scheduling jobs a real pain because scheduling
doesn't happen very fast, if at all.
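
     As a stopgap for the symptom described above (a node that answers
ping but not ssh), an external check can flag such nodes so they can be
marked offline by hand.  The following is only a minimal sketch, not
anything Torque provides: the function names, the choice of port 22 and
the 5-second timeout are assumptions for illustration.  It only prints
the "pbsnodes -o" commands (the usual way to mark a node offline) rather
than running them.

```python
import socket


def ssh_banner_ok(host, port=22, timeout=5.0):
    """Return True if host accepts a TCP connection and sends an SSH banner in time.

    A comatose node may still complete the TCP handshake in the kernel,
    so we also wait for the "SSH-" greeting from the daemon itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(64)
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
    return banner.startswith(b"SSH-")


def offline_commands(nodes, check=ssh_banner_ok):
    """Build one 'pbsnodes -o <node>' command per node that fails the check."""
    return [f"pbsnodes -o {node}" for node in nodes if not check(node)]


if __name__ == "__main__":
    # Hypothetical node names; replace with the names from your nodes file.
    for cmd in offline_commands(["node01", "node02"]):
        print(cmd)
```

     Run from cron on the server host, this would only work around the
problem; it does not make the server itself notice the dead nodes.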

     We were hoping to replace Pro with torque to avoid the licensing
hassles.  But this undetected down node problem pretty much squashes
that idea.  

     Has anybody seen this problem?  As I said, PBSPro seems to have
solved this problem very nicely.  Torque does not seem to have done so.
Are there any clues on this one?  Are there any time-out tunings that
would fix it?

     Help appreciated.

Chris Johnson               |Internet: johnson at nmr.mgh.harvard.edu
Systems Administrator       |Web:      http://www.nmr.mgh.harvard.edu/~johnson
NMR Center                  |Voice:    617.726.0949
Mass. General Hospital      |FAX:      617.726.7422
149 (2301) 13th Street      |If ignorance is bliss, why aren't there more
Charlestown, MA., 02129 USA |happy people?    Tea bag tag
