[torqueusers] momctl error - A description

Garrick Staples garrick at clusterresources.com
Tue Mar 6 15:34:37 MST 2007


On Tue, Mar 06, 2007 at 05:31:46PM -0500, michael young alleged:
> Garrick Staples wrote:
> 
> >On Tue, Mar 06, 2007 at 04:53:16PM -0500, michael young alleged:
> > 
> >
> >>Sorry about that.
> >>
> >>Background:
> >>We have a cluster of Sun servers.
> >>1 master and 12 slave nodes.
> >>AMD Opteron Processor 248 2.2 GHz, 4GB ram, 74 GB SCSI HD
> >>It runs Spartan '04 on Red Hat Enterprise Linux AS release 4 (Nahant 
> >>Update 1).
> >>master node's name: cluster
> >>slave node's names: he1 - he12
> >>
> >>Problem:
> >>When a job is submitted to the cluster, it runs only on the master node.
> >>It does not pass any work to the slave nodes.
> >>   
> >>
> >
> >While the job is running, does 'qstat -n <jobid>' show that the job is
> >assigned to a node?
> > 
> >
> How do I determine the jobid?
> Just running qstat gives no output.

Run qstat while a job is running.  If no jobs are in the queue, then qstat
won't print anything.

The jobid is first printed to the user when running qsub, and running
'qstat' with no arguments lists all jobs with their ids.
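
As a rough sketch of the above (not a definitive recipe): capture the id
that qsub prints and hand it to 'qstat -n'.  The script name myjob.sh and
the helper job_number are made up for illustration, and the example id
123.cluster only assumes the usual <number>.<server> form.

```shell
#!/bin/sh
# job_number ID: strip the server suffix from an id such as "123.cluster".
# (Hypothetical helper, only for illustration.)
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Only try to submit if qsub is installed and the (hypothetical) script exists.
if command -v qsub >/dev/null 2>&1 && [ -f myjob.sh ]; then
    jobid=$(qsub myjob.sh)      # qsub prints the new job's id on stdout
    echo "submitted job $jobid (number $(job_number "$jobid"))"
    qstat -n "$jobid"           # shows which nodes the job was assigned
fi
```

If the job finishes very quickly, qstat may no longer list it by the time
you look.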

 
> >Since your master node isn't running pbs_mom, this implies that the
> >problem is in your job script.  Is your job script using $PBS_NODEFILE
> >to spawn the processes?
> >
> > 
> >
> Where do I find the job script?
> I did a 'env' and there is no "$PBS_NODEFILE"

$PBS_NODEFILE is set only inside the job's environment, so it will not show
up when you run 'env' in a login shell.  Within the job, it names the file
that lists the nodes assigned to the job.

For example, if you are launching an MPI program with mpirun, then you would
pass the nodes with something like:

  np=`wc -l < $PBS_NODEFILE`
  mpirun -machinefile $PBS_NODEFILE -np $np ./command
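
Putting that into a complete job script might look roughly like the sketch
below.  The resource request (nodes=2:ppn=2), the program name ./command,
and the helpers count_slots and list_hosts are assumptions for
illustration; adjust them for your site and your MPI's mpirun.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2

# count_slots FILE: one line per process slot in the node file.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# list_hosts FILE: unique hostnames in the node file.
list_hosts() {
    sort -u "$1"
}

if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    np=$(count_slots "$PBS_NODEFILE")
    echo "starting $np processes on: $(list_hosts "$PBS_NODEFILE" | tr '\n' ' ')"
    mpirun -machinefile "$PBS_NODEFILE" -np "$np" ./command
else
    # Outside of a job the variable is not set.
    echo "PBS_NODEFILE is not set; run this script through qsub" >&2
fi
```

Submit the script with qsub rather than running it directly, since
$PBS_NODEFILE only exists inside a job.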


