[torqueusers] Dependencies being ignored from some submit hosts.

John Hanks griznog at gmail.com
Wed Feb 20 19:50:36 MST 2008


Sorry for rambling, but...

It looks to me like job 171 submits OK and is called 171.hostA. Then 172
submits and tries to depend on 171, but for some reason Torque looks for
171.hostA.hpc.usu.edu instead of 171.hostA. I've tried setting the
server_name file on submitA to each of

hostA
hostA.hpc.usu.edu

but get the same result with both. I've also tried both job id forms, 171
and 171.hostA, in the "-W depend=..." argument, and still get the same
result either way.
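
To make the mismatch concrete, here is a minimal sketch that reduces a job id
to its sequence number plus short host name, so that 171.hostA and
171.hostA.hpc.usu.edu compare as the same job. The helper names short_jobid
and same_job are hypothetical and not part of Torque; this only illustrates
the string difference in the ids, and makes no claim about how pbs_server
actually matches them.

```shell
# Hypothetical helpers, not part of Torque.

# short_jobid ID: print "<sequence>.<short host>" for a job id,
# dropping any domain suffix on the server name.
short_jobid() {
    seq=${1%%.*}          # text before the first dot: sequence number
    rest=${1#*.}          # text after the first dot: server name
    if [ "$rest" = "$1" ]; then
        # No dot at all, e.g. a bare "171": nothing to shorten.
        printf '%s\n' "$seq"
        return
    fi
    host=${rest%%.*}      # first label of the server name
    printf '%s.%s\n' "$seq" "$host"
}

# same_job A B: succeed when both ids reduce to the same short form.
same_job() {
    [ "$(short_jobid "$1")" = "$(short_jobid "$2")" ]
}
```

With these, "same_job 171.hostA 171.hostA.hpc.usu.edu" succeeds, while
"same_job 171.hostA 172.hostA" fails.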

jbh

On Wed, Feb 20, 2008 at 7:44 PM, John Hanks <griznog at gmail.com> wrote:

> Looking at more log messages, I see this:
>
> 02/20/2008 19:32:01;0080;PBS_Server;Job;167.hostA.hpc.usu.edu;Unknown Job Id
>
> Is this because the submitA host is adding .hpc.usu.edu to the job name?
> If so, where is it picking this up from?
>
> jbh
>
>
>
> On Wed, Feb 20, 2008 at 7:39 PM, John Hanks <griznog at gmail.com> wrote:
>
> > Jobs submitted on submitA do not have dependencies listed by qstat -f.
> >
> > Jobs submitted on hostA have this line in qstat -f output:
> >
> > depend = afterany:169.hostA@hostA
> >
> > All jobs correctly display the submit args with "-W depend=..."
> >
> > Torque logs these lines when the job asking for dependencies is
> > submitted from submitA:
> >
> > 02/20/2008 19:31:58;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=RegisterDependency, from @ hostA.hpc.usu.edu
> > 02/20/2008 19:31:58;0008;PBS_Server;Job;167.hostA;Job Queued at request of A00017456@submitA, owner = A00017456@submitA, job name = job.sh, queue = uinta
> > 02/20/2008 19:31:58;0008;PBS_Server;Job;167.hostA;Dependency request for job rejected by 166.hostA.hpc.usu.edu
> >
> >
> > Thanks,
> >
> > jbh
> >
> > On Wed, Feb 20, 2008 at 3:29 PM, Garrick Staples <garrick at usc.edu>
> > wrote:
> >
> > > On Wed, Feb 20, 2008 at 03:11:46PM -0700, John Hanks alleged:
> > > > Hello,
> > > >
> > > > I have a test setup, Torque 2.2.1 and Moab 5.2.1 running on a host,
> > > > call it hostA, and a submit host called submitA which only has the
> > > > Torque clients (qsub, qstat, etc.).  I can successfully submit jobs
> > > > from submitA to hostA with qsub, but get odd behavior when using -W
> > > > depend=afterany:JOBID. For example,
> > > >
> > > > as a user on hostA I can do
> > > >
> > > > $ qsub job.sh
> > > > hostA.165
> > > > $ qsub -W depend=afterany:165 job.sh
> > > > hostA.166
> > > >
> > > > Then look at job 166 with checkjob and see it correctly handles the
> > > > dependency:
> > > >
> > > > NOTE:  job cannot run  (job has hold in place)
> > > > NOTE:  job cannot run  (dependency 165 jobsuccessfulcomplete not met)
> > > > BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)
> > > >
> > > > however, if I do the same thing from submitA
> > > >
> > > > $ qsub job.sh
> > > > hostA.167
> > > > $ qsub -W depend=afterany:167 job.sh
> > > > hostA.168
> > > >
> > > > Then when I look at the job with checkjob, it says:
> > > >
> > > > NOTE:  job cannot run  (job has hold in place)
> > > > BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)
> > > >
> > > > and treats this as a hold, so that the job never runs until I do a
> > > > manual releasehold to release the hold.
> > > >
> > > > I have server_name on both hostA and submitA set to point to hostA,
> > > > and Torque has
> > > >
> > > > set server submit_hosts = submitA
> > > >
> > > > in its configuration. What do I need to do to have dependencies
> > > > handled correctly from any submit host?
> > >
> > > 'checkjob' is a maui program and doesn't really say what is going on
> > > within torque.
> > >
> > > Does 'qstat -f' show that the deps are correctly set up within torque?
> > >
> > > --
> > > Garrick Staples, GNU/Linux HPCC SysAdmin
> > > University of Southern California
> > >
> > > Please avoid sending me Word or PowerPoint attachments.
> > > See http://www.gnu.org/philosophy/no-word-attachments.html
> > >