[torqueusers] Dependencies being ignored from some submit hosts.
John Hanks
griznog at gmail.com
Fri Feb 22 09:04:38 MST 2008
So all the q* cmds I've looked at seem to call get_server() to
create/modify a job id from the job argument if it exists. The last
thing get_server does is:
if (get_fullhostname(def_server,host_server,PBS_MAXSERVERNAME,NULL) != 0)
{
  /* FAILURE */
  return(1);
}
strcat(job_id_out,host_server);
Which makes job_id_out be of the form JOBID.FQDN, which is breaking my
client's ability to use q* commands because the server calls all jobs
JOBID.SHORTHOSTNAME. I have to think I'm not the first person who
wanted to run q* commands on one machine that controlled pbs_server on
another machine, so this has to be something I've broken. I just have
no idea what it is. Any suggestions for where to poke next would be
appreciated. Is there a way to force pbs_server to use FQDN for job
ids?
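
To make the mismatch concrete, here is a minimal C sketch. It is not
TORQUE source: build_job_id and short_hostname are hypothetical helpers
written only for illustration, and the host names are the ones used in
this thread. It assumes the client simply appends whatever server name
it resolved to the job number, as the strcat call suggests.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, NOT TORQUE code: write "<seq>.<server>" into out.
 * Returns 0 on success, 1 if the result does not fit in outlen bytes. */
static int build_job_id(const char *seq, const char *server,
                        char *out, size_t outlen)
{
    assert(seq != NULL && server != NULL && out != NULL);
    int n = snprintf(out, outlen, "%s.%s", seq, server);
    return (n < 0 || (size_t)n >= outlen) ? 1 : 0;
}

/* Hypothetical helper: copy the part of a host name before the first dot.
 * Returns 0 on success, 1 if the short name does not fit in outlen bytes. */
static int short_hostname(const char *fqdn, char *out, size_t outlen)
{
    assert(fqdn != NULL && out != NULL);
    size_t len = strcspn(fqdn, ".");   /* length up to the first '.' */
    if (len >= outlen)
        return 1;
    memcpy(out, fqdn, len);
    out[len] = '\0';
    return 0;
}
```

Under these assumptions, build_job_id("179", "hostA.hpc.usu.edu", ...)
produces 179.hostA.hpc.usu.edu, which does not compare equal to the
server's 179.hostA. Passing the output of short_hostname instead produces
an id that matches.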
Thanks,
jbh
On Thu, Feb 21, 2008 at 8:34 AM, John Hanks <griznog at gmail.com> wrote:
> Another bit of data:
>
> user@submitA ~/Testing $ qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 179.hostA                 STDIN            user            00:00:00 R uinta
> user@submitA ~/Testing $ qdel 179
> qdel: Unknown Job Id 179.hostA.hpc.usu.edu
> user@submitA ~/Testing $ qdel 179.moab
> qdel: Unknown Job Id 179.hostA.hpc.usu.edu
> user@submitA ~/Testing $ qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 179.hostA                 STDIN            user            00:00:00 R uinta
> user@submitA ~/Testing $
>
> This seems to all be caused by the submitA host insisting on using the
> FQDN when looking for a job running/submitted on hostA.
>
> jbh
>
>
>
> On Wed, Feb 20, 2008 at 7:50 PM, John Hanks <griznog at gmail.com> wrote:
> > Sorry for rambling, but...
> >
> > it looks to me like job 171 submits OK and is called 171.hostA. Then 172
> > submits and tries to depend on 171, but for some reason torque looks for
> > 171.hostA.hpc.usu.edu instead of 171.hostA. I've tried changing the
> > server_name file on submitA to be
> >
> > hostA
> > hostA.hpc.usu.edu
> >
> > but get the same result for both. I've also tried using the job id syntax of
> > 171 and 171.hostA in the "-W depend=...", but still get the same result
> > either way.
> >
> > jbh
> >
> >
> >
> > On Wed, Feb 20, 2008 at 7:44 PM, John Hanks <griznog at gmail.com> wrote:
> > > Looking at more log messages, I see this:
> > >
> > > 02/20/2008 19:32:01;0080;PBS_Server;Job;167.hostA.hpc.usu.edu;Unknown Job Id
> > >
> > > Is this because the submitA host is adding .hpc.usu.edu to the job name?
> > > If so, where is it picking this up from?
> > >
> > > jbh
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Feb 20, 2008 at 7:39 PM, John Hanks <griznog at gmail.com> wrote:
> > >
> > > > Jobs submitted on submitA do not have dependencies listed by qstat -f.
> > > >
> > > > Jobs submitted on hostA have this line in qstat -f output:
> > > >
> > > > depend = afterany:169.hostA@hostA
> > > >
> > > > All jobs correctly display the submit args with "-W depend=..."
> > > >
> > > > Torque logs these lines when the job asking for dependencies is submitted from submitA:
> > > >
> > > > 02/20/2008 19:31:58;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=RegisterDependency, from @hostA.hpc.usu.edu
> > > > 02/20/2008 19:31:58;0008;PBS_Server;Job;167.hostA;Job Queued at request of A00017456@submitA, owner = A00017456@submitA, job name = job.sh, queue = uinta
> > > > 02/20/2008 19:31:58;0008;PBS_Server;Job;167.hostA;Dependency request for job rejected by 166.hostA.hpc.usu.edu
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > jbh
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Feb 20, 2008 at 3:29 PM, Garrick Staples <garrick at usc.edu> wrote:
> > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Feb 20, 2008 at 03:11:46PM -0700, John Hanks alleged:
> > > > >
> > > > >
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I have a test setup with torque 2.2.1 and moab 5.2.1 running on a host,
> > > > > > call it hostA, and a submit host called submitA which only has the
> > > > > > torque clients (qsub, qstat, etc.). I can successfully submit jobs
> > > > > > from submitA to hostA with qsub, but get odd behavior when using -W
> > > > > > depend=afterany:JOBID. For example:
> > > > > >
> > > > > > as a user on hostA I can do
> > > > > >
> > > > > > $ qsub job.sh
> > > > > > hostA.165
> > > > > > $ qsub -W depend=afterany:165 job.sh
> > > > > > hostA.166
> > > > > >
> > > > > > Then look at job 166 with checkjob and see it correctly handles the dependency:
> > > > > >
> > > > > > NOTE: job cannot run (job has hold in place)
> > > > > > NOTE: job cannot run (dependency 165 jobsuccessfulcomplete not met)
> > > > > > BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)
> > > > > >
> > > > > > however, if I do the same thing from submitA
> > > > > >
> > > > > > $ qsub job.sh
> > > > > > hostA.167
> > > > > > $ qsub -W depend=afterany:167 job.sh
> > > > > > hostA.168
> > > > > >
> > > > > > Then when I look at the job with checkjob, it says:
> > > > > >
> > > > > > NOTE: job cannot run (job has hold in place)
> > > > > > BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)
> > > > > >
> > > > > > and treats this as a hold, so that the job never runs until I do a
> > > > > > manual releasehold to release the hold.
> > > > > >
> > > > > > I have server_name on both hostA and submitA set to point to hostA,
> > > > > > and torque has
> > > > > >
> > > > > > set server submit_hosts = submitA
> > > > > >
> > > > > > in its configuration. What do I need to do to have dependencies
> > > > > > handled correctly from any submit host?
> > > > >
> > > > > 'checkjob' is a maui program and doesn't really say what is going on within torque.
> > > > >
> > > > > Does 'qstat -f' show that the deps are correctly set up within torque?
> > > > >
> > > > > --
> > > > > Garrick Staples, GNU/Linux HPCC SysAdmin
> > > > > University of Southern California
> > > > >
> > > > > Please avoid sending me Word or PowerPoint attachments.
> > > > > See http://www.gnu.org/philosophy/no-word-attachments.html
> > > > >
> > > > > _______________________________________________
> > > > > torqueusers mailing list
> > > > > torqueusers at supercluster.org
> > > > > http://www.supercluster.org/mailman/listinfo/torqueusers