[torqueusers] Dependencies being ignored from some submit hosts.

John Hanks griznog at gmail.com
Thu Feb 21 08:34:27 MST 2008


Another bit of data:

user@submitA ~/Testing $ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
179.hostA                 STDIN            user            00:00:00 R uinta
user@submitA ~/Testing $ qdel 179
qdel: Unknown Job Id 179.hostA.hpc.usu.edu
user@submitA ~/Testing $ qdel 179.moab
qdel: Unknown Job Id 179.hostA.hpc.usu.edu
user@submitA ~/Testing $ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
179.hostA                 STDIN            user            00:00:00 R uinta
user@submitA ~/Testing $

This all seems to be caused by the submitA host insisting on using the
FQDN when looking up a job running or submitted on hostA.
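
To make the mismatch concrete, here is a minimal shell sketch. Note that
short_jobid is a hypothetical helper I made up for illustration, not a
torque command. It reduces a job id to its number.shorthost form, which
is the spelling qstat shows for the job above:

```shell
# Hypothetical helper (not part of torque): reduce a job id such as
# 179.hostA.hpc.usu.edu to the short form 179.hostA.
short_jobid() {
    case $1 in
        *.*) ;;                              # has a host part, keep going
        *) printf '%s\n' "$1"; return 0 ;;   # bare number, print unchanged
    esac
    sj_num=${1%%.*}        # everything before the first dot: 179
    sj_rest=${1#*.}        # everything after the first dot: hostA.hpc.usu.edu
    sj_host=${sj_rest%%.*} # first label of the host name: hostA
    printf '%s.%s\n' "$sj_num" "$sj_host"
}

short_jobid 179.hostA.hpc.usu.edu   # prints 179.hostA
```

The id in the qdel error and the id in the qstat listing reduce to the
same short form, so the only difference between them is the appended
domain.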

jbh

On Wed, Feb 20, 2008 at 7:50 PM, John Hanks <griznog at gmail.com> wrote:
> Sorry for rambling, but...
>
> It looks to me like job 171 submits OK and is called 171.hostA. Then 172
> submits and tries to depend on 171, but for some reason torque looks for
> 171.hostA.hpc.usu.edu instead of 171.hostA. I've tried changing the
> server_name file on submitA to each of
>
> hostA
> hostA.hpc.usu.edu
>
> but get the same result for both. I've also tried using the job id syntax of
> 171 and 171.hostA in the "-W depend=...", but still get the same result
> either way.
>
> jbh
>
>
>
> On Wed, Feb 20, 2008 at 7:44 PM, John Hanks <griznog at gmail.com> wrote:
> > Looking at more log messages I see this:
> >
> > 02/20/2008 19:32:01;0080;PBS_Server;Job;167.hostA.hpc.usu.edu;Unknown Job Id
> >
> > Is this because the submitA host is adding .hpc.usu.edu to the job name?
> > If so, where is it picking this up from?
> >
> > jbh
> >
> >
> >
> >
> >
> >
> > On Wed, Feb 20, 2008 at 7:39 PM, John Hanks <griznog at gmail.com> wrote:
> >
> > > Jobs submitted on submitA do not have dependencies listed by qstat -f.
> > >
> > > Jobs submitted on hostA have this line in qstat -f output:
> > >
> > > depend = afterany:169.hostA@hostA
> > >
> > > All jobs correctly display the submit args with "-W depend=..."
> > >
> > > Torque logs these lines when the job asking for dependencies is submitted from submitA:
> > >
> > > 02/20/2008 19:31:58;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=RegisterDependency, from @hostA.hpc.usu.edu
> > > 02/20/2008 19:31:58;0008;PBS_Server;Job;167.hostA;Job Queued at request of A00017456@submitA, owner = A00017456@submitA, job name = job.sh, queue = uinta
> > > 02/20/2008 19:31:58;0008;PBS_Server;Job;167.hostA;Dependency request for job rejected by 166.hostA.hpc.usu.edu
> > >
> > >
> > > Thanks,
> > >
> > > jbh
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Feb 20, 2008 at 3:29 PM, Garrick Staples <garrick at usc.edu> wrote:
> > >
> > > >
> > > >
> > > >
> > > > On Wed, Feb 20, 2008 at 03:11:46PM -0700, John Hanks alleged:
> > > >
> > > >
> > > >
> > > > > Hello,
> > > > >
> > > > > I have a test setup, torque 2.2.1 and moab 5.2.1 running on a host,
> > > > > call it hostA, and a submit host called submitA which only has the
> > > > > torque clients (qsub, qstat, etc.). I can successfully submit jobs
> > > > > from submitA to hostA with qsub, but get odd behavior when using -W
> > > > > depend=afterany:JOBID. For example
> > > > >
> > > > > as a user on hostA I can do
> > > > >
> > > > > $ qsub job.sh
> > > > > hostA.165
> > > > > $ qsub -W depend=afterany:165 job.sh
> > > > > hostA.166
> > > > >
> > > > > Then look at job 166 with checkjob and see it correctly handles the dependency:
> > > > >
> > > > > NOTE:  job cannot run  (job has hold in place)
> > > > > NOTE:  job cannot run  (dependency 165 jobsuccessfulcomplete not met)
> > > > > BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)
> > > > >
> > > > > however, if I do the same thing from submitA
> > > > >
> > > > > $ qsub job.sh
> > > > > hostA.167
> > > > > $ qsub -W depend=afterany:167 job.sh
> > > > > hostA.168
> > > > >
> > > > > Then when I look at the job with checkjob it says:
> > > > >
> > > > > NOTE:  job cannot run  (job has hold in place)
> > > > > BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)
> > > > >
> > > > > and treats this as a hold, so that the job never runs until I do a
> > > > > manual releasehold to release the hold.
> > > > >
> > > > > I have server_name on both hostA and submitA set to point to hostA, and
> > > > > torque has
> > > > >
> > > > > set server submit_hosts = submitA
> > > > >
> > > > > in its configuration. What do I need to do to have dependencies
> > > > > handled correctly from any submit host?
> > > >
> > > > 'checkjob' is a maui program and doesn't really say what is going on within torque.
> > > >
> > > > Does 'qstat -f' show that the deps are correctly set up within torque?
> > > >
> > > > --
> > > > Garrick Staples, GNU/Linux HPCC SysAdmin
> > > > University of Southern California
> > > >
> > > > Please avoid sending me Word or PowerPoint attachments.
> > > > See http://www.gnu.org/philosophy/no-word-attachments.html
> > > >