[torqueusers] maui & old gcc optimizer bug. (fwd)
Chris Johnson
johnson at nmr.mgh.harvard.edu
Wed Nov 16 11:47:07 MST 2005
Hi all,
I have a little annoyance here which is driving me up the wall. I
have two mini maui clusters running torque. The well-behaved one
is on CentOS 4.1 with Opteron hardware. The less-than-ideal one
is on FC2 with P-III hardware.
The one on the Opterons is running terrifically.
The one on the P-IIIs gives me maui.log errors like this:
(rc: 15041 errmsg: 'Execution server rejected request MSG=send failed,
STARTING' hostlist: 'node15')
and doesn't run jobs very often, putting them in the deferred state, yada
yada. It isn't pretty. In fact it's quite horrifically ugly.
After some googling, I came across an old reference indicating a
similar problem caused by a gcc optimizer bug generating
bad code. OK, so I tried recompiling maui with -O0, although the
gcc man page says this is the default.
No joy, same bad behavior.
Do I need to recompile torque as well? Does -O0 work? What the
f*&k is going on? Excuse me, I've been fighting this one for a while
now. Help GREATLY appreciated. I need to replace the C scheduler
with something. I'd like to use maui. But I'd like to be able to get
it to work right twice before I commit the whole cluster to it.
One other thing, probably related: maui keeps crashing, and the
last line in the log is
ERROR: cannot get node info: Unknown Job Id
Thank you.
-------------------------------------------------------------------------------
Chris Johnson |Internet: johnson at nmr.mgh.harvard.edu
Systems Administrator |Web: http://www.nmr.mgh.harvard.edu/~johnson
NMR Center |Voice: 617.726.0949
Mass. General Hospital |FAX: 617.726.7422
149 (2301) 13th Street |"Quantum mechanics demands that magic exists"
Charlestown, MA., 02129 USA | Me
-------------------------------------------------------------------------------