[torqueusers] Qstat reporting false node use

Garrick Staples garrick at clusterresources.com
Thu Apr 12 15:08:56 MDT 2007


On Wed, Apr 11, 2007 at 11:38:56AM -0700, Clevenger, Kevin alleged:
> Hi,
> 
> When running multiple NAMD jobs on the cluster (Rocks 4.2.1) we see qstat -n report that the jobs start on separate nodes, but when you look at the processes with cluster-ps they in fact are not. Anyone know why this is and how to straighten it out? Output below.

Note that TORQUE has nothing to do with launching processes; it just
runs your job script.  It is the job script's responsibility to launch
processes on the nodes listed in $PBS_NODEFILE.

 
> Thanks
> 
> Kevin
> 
> ###################################################
> 
> $ qstat -n
> 
> cluster.coh.org: 
>                                                                    Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
> 153.cluster.coh.org      bob     longrun  eq32.submi   5042     8   1    --  1000: R 00:23
>    c-0-24+c-0-24+c-0-23+c-0-23+c-0-22+c-0-22+c-0-21+c-0-21+c-0-20+c-0-20+c-0-19
>    +c-0-19+c-0-18+c-0-18+c-0-17+c-0-17
> 154.cluster.coh.org      bob     longrun  eq08.submi  32618     4   1    --  1000: R 00:22
>    c-0-16+c-0-16+c-0-15+c-0-15+c-0-14+c-0-14+c-0-13+c-0-13
> 155.cluster.coh.org      bob     longrun  TAK779-eq0   1383     4   1    --  1000: R 00:18
>    c-0-12+c-0-12+c-0-11+c-0-11+c-0-10+c-0-10+c-0-9+c-0-9

Ok, so we should see the job scripts running on c-0-12, c-0-16, and
c-0-24 (the first node listed for each job).


> c-0-12: 
> bob      1414  0.0  0.0  5848  764 ?        S    11:08   0:00 /home/bob/vaidsim ++remote-shell ssh ++nodelist /share/data/etc/nodelist +p16 /home/bob/vaidsimpl /home/bob/CCR2TAK779/MD/eq08-con.namd

> c-0-16: 
> bob     32649  0.0  0.0  5848  764 ?        S    11:04   0:00 /home/bob/vaidsim ++remote-shell ssh ++nodelist /share/data/etc/nodelist +p16 /home/bob/vaidsimpl /home/bob/CCR2APO/MD/eq08-con.namd

> c-0-24: 
> bob      5069  0.0  0.0  5848  764 ?        S    11:03   0:00 /home/bob/vaidsim ++remote-shell ssh ++nodelist /share/data/etc/nodelist +p16 /home/bob/vaidsimpl /home/bob/STAT3/eq32.namd

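Good, we see the jobs running on the correct nodes.  But it appears your
command is using a private nodes file (/share/data/etc/nodelist, passed
with ++nodelist) to launch processes wherever it wants, rather than on
the nodes TORQUE assigned to each job.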
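
One way to make the launcher follow TORQUE's allocation is to build the
nodelist inside the job script from $PBS_NODEFILE. The following is only a
rough sketch, not a tested recipe for this cluster. It assumes the launcher
reads a charmrun-style nodelist (a "group main" line followed by one
"host <name>" line per slot), which I am inferring from the ++nodelist
option in the ps output. make_nodelist and the output path are made-up
names for illustration.

```shell
#!/bin/sh
# Sketch: build a per-job nodelist from the hosts TORQUE assigned.
# make_nodelist is a hypothetical helper, not part of TORQUE or NAMD.
# Assumption: the launcher accepts a charmrun-style nodelist file.
make_nodelist() {
    # $1: TORQUE node file (one hostname per allocated slot)
    # $2: nodelist file to write
    {
        echo "group main"
        while read -r host; do
            # Skip blank lines; emit one "host" entry per slot.
            if [ -n "$host" ]; then
                echo "host $host"
            fi
        done < "$1"
    } > "$2"
}

# TORQUE sets PBS_NODEFILE inside a running job; do nothing outside one.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    nodelist="${TMPDIR:-/tmp}/nodelist.${PBS_JOBID:-$$}"
    make_nodelist "$PBS_NODEFILE" "$nodelist"
    # Number of slots allocated to this job.
    nprocs=$(grep -c . "$PBS_NODEFILE")
    echo "wrote $nodelist with $nprocs slots"
fi
```

The job script would then pass that file to ++nodelist in place of
/share/data/etc/nodelist, and use the slot count for +p instead of a
hard-coded +p16, so each job only starts processes on its own nodes.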


