[torqueusers] Weird problem with 2.0.0p4
Åke Sandgren
ake.sandgren at hpc2n.umu.se
Fri Dec 23 11:53:43 MST 2005
On Thu, 2005-12-22 at 09:52 -0800, Garrick Staples wrote:
> On Thu, Dec 22, 2005 at 11:10:09AM +0100, Åke Sandgren alleged:
> > Hi!
> >
> > I just found a weird problem with our 2.0.0p4 install.
> > Jobs with nodes=2 or larger don't work...
> >
> > The only thing that starts on the MS are the following processes
> > UID PID PPID C STIME TTY TIME CMD
> > ake 4189 4074 99 11:08 ? 00:00:01 -bash
> > ake 4211 4189 0 11:08 ? 00:00:00 pbs_demux
> >
> > i.e. the submit script itself never starts.
> > There is nothing in the logs that I can find.
> > 2.0.0p2 works OK (although on another machine).
>
> So what's that bash process doing? Bash doesn't sit around doing
> nothing. It exits when it runs out of input. Is it stuck reading on
> stdin? Where does lsof show its stdio fds?
>
> Configure options? Using --enable-shell-pipe or
> --enable-shell-use-argv?
Stupid me sent the reply only to myself, so here it is again for the list.
What happens is that bash gets stuck in a waitpid loop in which
waitpid keeps returning 0. That 0 return comes from pbs_demux, which
happens to be a child of that bash. Why this didn't happen before, I
don't know. It happens on our Ubuntu Breezy cluster but not on the
Debian Woody one. Bash in this case is 3.0.
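
As a rough illustration of the symptom (not taken from the bash or TORQUE sources): under POSIX, waitpid returns 0 only when it is called with WNOHANG and the child exists but has not changed state. The sketch below assumes that is what the loop was seeing. The function name poll_running_child is hypothetical, and the paused child merely stands in for a long-lived pbs_demux.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that stays alive, poll it once with
 * WNOHANG, then clean it up. Returns whatever the non-blocking waitpid
 * reported, or -1 if fork failed. */
pid_t poll_running_child(void)
{
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: stand-in for pbs_demux. It blocks until a signal arrives. */
        pause();
        _exit(0);
    }

    /* Parent: the child exists but has not exited, so a non-blocking
     * wait has nothing to reap and returns 0. A loop that keeps
     * retrying on 0 would spin for as long as the child lives. */
    pid_t result = waitpid(child, &status, WNOHANG);

    /* Clean up the stand-in child so no process is left behind. */
    kill(child, SIGKILL);
    waitpid(child, &status, 0);

    return result;
}
```

A loop that treats a 0 return as "try again" never makes progress while such a child is alive, which would be consistent with the bash process showing 99% CPU in the ps output quoted above.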
The solution I finally went with is an extra fork in the parent
before exec'ing bash, which makes bash and demux siblings instead of
parent and child.
The exact patch is attached.
So far no ill effects.
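
The attached demux.patch is not reproduced here. What follows is only a minimal sketch of the process layout described above, assuming a launcher that starts the helper first and then forks once more before exec'ing the shell. The function name run_shell_beside_helper is hypothetical, /bin/sh stands in for bash, and the paused child stands in for pbs_demux.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch: run shell_cmd in a shell that is a sibling of the
 * helper process rather than its parent. Returns the shell's exit status,
 * or -1 on failure. */
int run_shell_beside_helper(const char *shell_cmd)
{
    int status;

    /* First fork: the helper (stand-in for pbs_demux). */
    pid_t helper = fork();
    if (helper < 0)
        return -1;
    if (helper == 0) {
        pause();                /* stays alive until signalled */
        _exit(0);
    }

    /* Extra fork: the shell is exec'ed in a new child, so the helper is
     * its sibling and never appears in the shell's own wait calls. */
    pid_t shell = fork();
    if (shell < 0) {
        kill(helper, SIGKILL);
        waitpid(helper, NULL, 0);
        return -1;
    }
    if (shell == 0) {
        execl("/bin/sh", "sh", "-c", shell_cmd, (char *)NULL);
        _exit(127);             /* reached only if exec failed */
    }

    /* The common parent reaps both children. */
    waitpid(shell, &status, 0);
    kill(helper, SIGKILL);
    waitpid(helper, NULL, 0);

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this layout a call such as run_shell_beside_helper("wait; exit 0") returns promptly, because the shell has no children of its own to wait for. The cost is one extra process that has to stay around to reap both children.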
-------------- next part --------------
A non-text attachment was scrubbed...
Name: demux.patch
Type: text/x-patch
Size: 862 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20051223/b647382a/demux.bin