[torqueusers] Qstat reporting false node use
Garrick Staples
garrick at clusterresources.com
Thu Apr 12 15:08:56 MDT 2007
On Wed, Apr 11, 2007 at 11:38:56AM -0700, Clevenger, Kevin alleged:
> Hi,
>
> When running multiple NAMD jobs on the cluster (Rocks 4.2.1), qstat -n reports that the jobs start on separate nodes, but looking at the processes with cluster-ps shows that they in fact do not. Does anyone know why this is and how to straighten it out? Output below.
Note that TORQUE has nothing to do with launching processes; it just
runs your job script. It is the job script's responsibility to launch
processes on the nodes listed in $PBS_NODEFILE.
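
As a rough illustration, a job script can check which hosts it was given
before it launches anything. This is only a sketch: the function name
summarize_nodefile is made up for this example, and it assumes the node
file holds one hostname per line, repeated once per assigned processor.

```shell
#!/bin/sh
# Hypothetical helper: print each host in a PBS node file with its slot count.
# Assumes one hostname per line, repeated once per assigned processor.
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job script one would call:
#   summarize_nodefile "$PBS_NODEFILE"
```

Comparing that summary with what the launcher actually starts is a quick
way to see whether the two agree.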
> Thanks
>
> Kevin
>
> ###################################################
>
> $ qstat -n
>
> cluster.coh.org:
>                                                                    Req'd  Req'd   Elap
> Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
> -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
> 153.cluster.coh.org  bob      longrun  eq32.submi   5042     8   1     -- 1000: R 00:23
>    c-0-24+c-0-24+c-0-23+c-0-23+c-0-22+c-0-22+c-0-21+c-0-21+c-0-20+c-0-20+c-0-19
>    +c-0-19+c-0-18+c-0-18+c-0-17+c-0-17
> 154.cluster.coh.org  bob      longrun  eq08.submi  32618     4   1     -- 1000: R 00:22
>    c-0-16+c-0-16+c-0-15+c-0-15+c-0-14+c-0-14+c-0-13+c-0-13
> 155.cluster.coh.org  bob      longrun  TAK779-eq0   1383     4   1     -- 1000: R 00:18
>    c-0-12+c-0-12+c-0-11+c-0-11+c-0-10+c-0-10+c-0-9+c-0-9
OK, so we should see the job scripts running on c-0-24, c-0-16, and
c-0-12, the first node listed for each job.
> c-0-12:
> bob 1414 0.0 0.0 5848 764 ? S 11:08 0:00 /home/bob/vaidsim ++remote-shell ssh ++nodelist /share/data/etc/nodelist +p16 /home/bob/vaidsimpl /home/bob/CCR2TAK779/MD/eq08-con.namd
> c-0-16:
> bob 32649 0.0 0.0 5848 764 ? S 11:04 0:00 /home/bob/vaidsim ++remote-shell ssh ++nodelist /share/data/etc/nodelist +p16 /home/bob/vaidsimpl /home/bob/CCR2APO/MD/eq08-con.namd
> c-0-24:
> bob 5069 0.0 0.0 5848 764 ? S 11:03 0:00 /home/bob/vaidsim ++remote-shell ssh ++nodelist /share/data/etc/nodelist +p16 /home/bob/vaidsimpl /home/bob/STAT3/eq32.namd
Good. We see the jobs running on the correct nodes. But it appears your
command is using a private nodes file (/share/data/etc/nodelist) to
launch processes wherever it wants.
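
One way to keep the launcher on the nodes TORQUE assigned is to build the
nodelist from $PBS_NODEFILE inside the job script rather than pointing at
a shared file. A minimal sketch, assuming the usual charmrun nodelist
layout (a "group main" line followed by "host" lines). The function name,
the output path, and the charmrun and namd2 names in the usage comment are
placeholders; the launcher options only mirror the ones visible in the ps
output above.

```shell
#!/bin/sh
# Hypothetical helper: write a charmrun-style nodelist from a PBS node file.
# $1 = PBS node file (one hostname per line), $2 = nodelist file to create.
make_nodelist() {
    {
        echo "group main"
        # Keep duplicates: each line stands for one assigned processor slot.
        while IFS= read -r host; do
            if [ -n "$host" ]; then
                echo "host $host"
            fi
        done < "$1"
    } > "$2"
}

# Possible use in a job script (paths and binary names are placeholders):
#   make_nodelist "$PBS_NODEFILE" "$PBS_O_WORKDIR/nodelist.$PBS_JOBID"
#   NP=$(wc -l < "$PBS_NODEFILE")
#   charmrun ++remote-shell ssh ++nodelist "$PBS_O_WORKDIR/nodelist.$PBS_JOBID" \
#       +p"$NP" namd2 input.namd
```

Writing a separate nodelist per job ID keeps concurrent jobs from
overwriting each other's file, and taking the process count from the node
file keeps +p in step with what was actually requested.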