[torqueusers] resources problem
bernd-schubert at gmx.de
Wed Nov 29 12:52:03 MST 2006
Thanks for your help.
On Wednesday 29 November 2006 20:06, Garrick Staples wrote:
> On Wed, Nov 29, 2006 at 03:25:34PM +0100, Bernd Schubert alleged:
> > Hi,
> > we just had the problem that a job of one of our group members used
> > more resources (memory) than requested, but torque still didn't kill it.
> > Also, qstat reports far too low resource usage for this particular program.
> > For all other programs presently running, the reported resources are fine;
> > only this program is troublesome.
> > While looking at what's so special about it, we see it's basically an MPI
> > program compiled with mpicc; however, it is NOT started using mpirun, but
> > on our cluster it's just queued like any other program using 'qsub
> > program_name'.
> 'qsub' can't submit binaries, so the batch script is probably running
Ah sorry, I forgot. We have a 'frontend' for qsub which creates those scripts
itself (among several other things). However, this tcsub script certainly
doesn't call mpirun. A simple call of tcsub is
"tcsub prog_name prog_parameters"
I'm so used to it that I forgot that qsub doesn't accept binaries. I have
attached the script given to qsub; there's really no mpirun call.
> If your mpirun is using rsh/ssh to spawn the remote processes, then
> they won't be tracked and added to the job's usage.
There's only one process on the node where the program is started.
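To make the point about untracked processes concrete, here is a minimal
sketch, not Torque's actual code. It assumes that per-job usage is collected
from the processes sharing the job's session ID, so that anything spawned
through rsh/ssh ends up in a different session and is missed. The function
name rss_by_session and the sample ps output are made up for illustration.

```python
from collections import defaultdict


def rss_by_session(ps_output):
    """Sum resident memory (kB) per session ID.

    Expects text in the layout printed by `ps -eo pid,sid,rss,comm`:
    one header line, then one line per process.
    """
    totals = defaultdict(int)
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # ignore malformed or truncated lines
        _pid, sid, rss, _comm = fields
        totals[int(sid)] += int(rss)
    return dict(totals)


# Hypothetical sample: two processes in session 4300 (the job's shell and
# the binary it started) and one process in an unrelated session 5012.
SAMPLE = """\
  PID   SID    RSS COMMAND
 4300  4300   1520 tcsh
 4321  4300 812340 transport
 5012  5012 204800 transport
"""

usage = rss_by_session(SAMPLE)
```

Under that assumption, only the total for the job's own session (4300 in the
sample) would be charged to the job; the process in session 5012 would not
be counted.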
I have absolutely no experience with MPI. Could a program compiled with mpicc
call mpirun itself? The attached script submitted to qsub calls an
executable "transport". This is not a script, but the binary compiled with
mpicc. Running "strings" on this binary, I see that it contains the string
"MPIRUN" several times and, once, the sentence "! Could not create p4 procgroup.
Possible missing file or program started without mpirun."
However, ps ax doesn't show any rsh/ssh processes, only this binary.
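For anyone who wants to repeat the "strings" check, the following is a rough
sketch of the same idea in Python: extract runs of printable ASCII from the
binary and keep those that mention mpirun or the p4 procgroup. The function
names and the sample bytes are hypothetical, and the extraction only
approximates what strings(1) does.

```python
import re


def printable_strings(data, min_len=4):
    """Yield runs of printable ASCII of at least min_len characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")


def mpi_hints(data, needles=("mpirun", "procgroup")):
    """Return the extracted strings that contain any needle (case-insensitive)."""
    return [
        s for s in printable_strings(data)
        if any(n in s.lower() for n in needles)
    ]


# Made-up sample bytes standing in for the contents of a real binary.
SAMPLE = (
    b"\x00\x01MPIRUN_HOME\x00ab\x00"
    b"Possible missing file or program started without mpirun.\x00\xff"
)

hints = mpi_hints(SAMPLE)

# On a real file one would read the bytes first, for example:
#   with open("transport", "rb") as fh:
#       hints = mpi_hints(fh.read())
```

Finding such strings only shows that the MPI library was linked in; it does
not show whether the binary launches anything itself at run time.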
PCI / Theoretische Chemie
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 1888 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20061129/98a3f829/21880.hitch.bin