[torquedev] Torque (server) bug in parsing nodecounts
Michael.Meier at rrze.uni-erlangen.de
Thu May 14 02:25:36 MDT 2009
[I would have preferred to handle this via bugzilla, but it's still down]
One of our users recently triggered a bad bug in torque. By hitting a few
wrong keys in 'vi', she submitted a jobscript requesting
128128128128[repeat 128 times]128128 nodes, i.e. a node count with 384
digits.
The result of this tiny user error however was pretty bad: Instead of just
rejecting it, the torque server accepted the job, then segfaulted. Each
subsequent attempt to restart the torque server resulted in a segfault again,
because it had already written the .JB file to its 'jobs' dir and crashed
when attempting to reread it.
I've since been able to have a look at the reason for this crash: the static
function "number" in src/server/node_manager.c. The function looks like it
was written by a 6 year old on crack doing his first C program. No sanity
or boundary checks are performed, input is just happily copied over the stack
when it's too long.
Attached you will find a quick patch against yesterday's 2.3.7 snapshot that
fixes at least the worst errors in this function - and prevents the crash in
the case mentioned above.
However, the whole function is probably redundant as it stands - it seems to
be just a (crappy!) reimplementation of strto(u)l. At least for a
non-hotfix release it would probably be better to use that instead of
reinventing the wheel.
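
To illustrate what a strtoul-based replacement might look like, here is a
minimal sketch. It is not the attached patch and not the actual Torque code:
the helper name parse_node_count, its signature, and the caller-supplied
upper bound are all assumptions made for this example. The point is only
that strtoul reports overflow via ERANGE, so an absurdly long digit string
is rejected without ever being copied into a fixed-size buffer.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the Torque sources): parse a decimal
 * node count at the start of 'str'.
 *
 * Returns 0 on success and stores the value in *out; if 'end' is not
 * NULL, *end is set to the first character after the digits.
 * Returns -1 if the string does not start with a digit, if the number
 * overflows unsigned long, or if it exceeds 'max' (a limit chosen by
 * the caller; the value used is an assumption, not a Torque setting).
 */
static int parse_node_count(const char *str, unsigned long max,
                            unsigned long *out, const char **end)
{
    char *stop;
    unsigned long val;

    /* strtoul would accept leading whitespace and a sign; reject both
     * by insisting that the first character is a digit. */
    if (str == NULL || out == NULL || !isdigit((unsigned char)*str))
        return -1;

    errno = 0;
    val = strtoul(str, &stop, 10);

    /* On overflow strtoul returns ULONG_MAX and sets errno to ERANGE. */
    if (errno == ERANGE || val > max)
        return -1;

    *out = val;
    if (end != NULL)
        *end = stop;
    return 0;
}
```

A caller would then refuse the job at submission time whenever the helper
returns -1, instead of accepting it and writing it to the 'jobs' dir.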
Michael Meier, HPC Services
Regionales Rechenzentrum Erlangen
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-28973, Fax: +49 9131 302941
michael.meier at rrze.uni-erlangen.de
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 554 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20090514/c5253717/attachment.bin