[torqueusers] Weird problem with 2.0.0p4

Åke Sandgren ake.sandgren at hpc2n.umu.se
Fri Dec 23 11:53:43 MST 2005


On Thu, 2005-12-22 at 09:52 -0800, Garrick Staples wrote:
> On Thu, Dec 22, 2005 at 11:10:09AM +0100, Åke Sandgren alleged:
> > Hi!
> > 
> > I just found a weird problem with our 2.0.0p4 install.
> > Jobs with nodes=2 or larger don't work...
> > 
> > The only things that start on the MS (mother superior) are the
> > following processes:
> > UID        PID  PPID  C STIME TTY          TIME CMD
> > ake       4189  4074 99 11:08 ?        00:00:01 -bash
> > ake       4211  4189  0 11:08 ?        00:00:00 pbs_demux
> > 
> > I.e. the submit script itself never starts.
> > There is nothing in the logs that I can find.
> > 2.0.0p2 works OK (although on another machine).
> 
> So what's that bash process doing?  Bash doesn't sit around doing
> nothing.  It exits when it runs out of input.  Is it stuck reading on
> stdin?  Where does lsof show its stdio fds?
> 
> Configure options?  Using --enable-shell-pipe or
> --enable-shell-use-argv?

Stupid me, I sent the reply only to myself...

What happens is that bash gets stuck in a waitpid loop in which waitpid
keeps returning 0. (That 0 return is caused by pbs_demux, which happens
to be a child of that bash.) Why this didn't happen before, I don't know.
It happens on our Ubuntu Breezy cluster but not on the Debian Woody one.
Bash in this case is 3.0.
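
For illustration only: one documented way for waitpid to return 0 is a
call with WNOHANG while a child exists but has not changed state. Whether
bash 3.0 takes exactly this path is an assumption, not something checked
in this thread. The helper name poll_running_child is hypothetical.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that stays alive, poll once with
 * WNOHANG, and return whatever waitpid() reported. */
pid_t poll_running_child(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Stand-in for a long-lived child such as pbs_demux. */
        for (;;)
            pause();
    }

    /* Non-blocking wait for any child: the only child is still running,
     * so there is nothing to reap and waitpid() reports 0. */
    int status;
    pid_t polled = waitpid(-1, &status, WNOHANG);

    /* Clean up the stand-in child so nothing is left behind. */
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return polled;
}
```

A caller that treats 0 as "try again" and has no other exit condition
would spin for as long as that child stays alive.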

The solution I finally went with is to do an extra fork in the parent
before exec'ing bash, which makes bash and demux siblings instead of
parent and child.
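
A minimal sketch of that process layout, not the attached patch itself:
the function name run_shell_as_sibling, the use of /bin/sh, and the
pause() loop standing in for pbs_demux are all assumptions made for the
example.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `cmd` in /bin/sh while a stand-in for
 * pbs_demux runs as a sibling of the shell, not as its child.
 * Returns the shell's exit code, or -1 on error. */
int run_shell_as_sibling(const char *cmd)
{
    /* First fork: the demux stand-in, which blocks until killed. */
    pid_t demux = fork();
    if (demux < 0)
        return -1;
    if (demux == 0) {
        for (;;)
            pause();
    }

    /* Extra fork: the shell gets a process of its own, so the demux
     * stand-in is its sibling and is invisible to the shell's own
     * waitpid() calls. */
    pid_t shell = fork();
    if (shell == 0) {
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127); /* only reached if exec failed */
    }

    /* The common parent reaps the shell, then stops the stand-in. */
    int status = 0;
    pid_t reaped = -1;
    if (shell > 0)
        reaped = waitpid(shell, &status, 0);
    kill(demux, SIGKILL);
    waitpid(demux, NULL, 0);

    if (reaped != shell || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Without the second fork, the process that forked the demux stand-in
would itself exec the shell, and the shell would inherit it as a child.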

The exact patch is attached.
So far no ill effects.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: demux.patch
Type: text/x-patch
Size: 862 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20051223/b647382a/demux.bin

