[torqueusers] Dependencies being ignored from some submit hosts.

John Hanks griznog at gmail.com
Fri Feb 22 09:04:38 MST 2008


So all the q* cmds I've looked at seem to call get_server() to
create/modify a job id from the job argument if it exists. The last
thing get_server does is:

    if (get_fullhostname(def_server,host_server,PBS_MAXSERVERNAME,NULL) != 0)
      {
      /* FAILURE */

      return(1);
      }

    strcat(job_id_out,host_server);

This makes job_id_out take the form JOBID.FQDN, which breaks my
client's ability to use q* commands because the server calls all jobs
JOBID.SHORTHOSTNAME. I have to think I'm not the first person who
wanted to run q* commands on one machine that controlled pbs_server on
another machine, so this has to be something I've broken. I just have
no idea what it is. Any suggestions for where to poke next would be
appreciated. Is there a way to force pbs_server to use FQDN for job
ids?
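
To make the mismatch concrete, here is a minimal sketch of what I think is
happening. This is not TORQUE code: build_job_id() and job_ids_match() are
hypothetical helpers I made up for illustration. The first mimics the
strcat() step above, the second shows one way a short hostname and its FQDN
could be treated as the same host. I am not claiming pbs_server does or
should compare ids this way.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not TORQUE API): join a sequence number and a host
 * name as SEQ.HOST, the same shape get_server() produces with strcat().
 * Returns 0 on success, 1 if the result did not fit in out. */
static int build_job_id(const char *seq, const char *host,
                        char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "%s.%s", seq, host);
    return (n < 0 || (size_t)n >= outlen) ? 1 : 0;
}

/* Hypothetical helper (not TORQUE API): compare two job ids, looking only
 * at the sequence number and the first label of the host part, so that
 * 179.hostA and 179.hostA.hpc.usu.edu count as the same job.
 * An id with no host part matches on the sequence number alone. */
static int job_ids_match(const char *a, const char *b)
{
    size_t la = strcspn(a, ".");   /* length of sequence number in a */
    size_t lb = strcspn(b, ".");   /* length of sequence number in b */

    if (la != lb || strncmp(a, b, la) != 0)
        return 0;                  /* different sequence numbers */

    if (a[la] == '\0' || b[lb] == '\0')
        return 1;                  /* one side has no host part */

    const char *ha = a + la + 1;   /* host part of a */
    const char *hb = b + lb + 1;   /* host part of b */
    size_t ha_len = strcspn(ha, ".");
    size_t hb_len = strcspn(hb, ".");

    return ha_len == hb_len && strncmp(ha, hb, ha_len) == 0;
}
```

With these, build_job_id("179", "hostA.hpc.usu.edu", ...) yields
179.hostA.hpc.usu.edu, which a plain strcmp() against 179.hostA rejects,
while job_ids_match() accepts the pair. That is the difference between what
my client sends and what the server has on record.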

Thanks,

jbh

On Thu, Feb 21, 2008 at 8:34 AM, John Hanks <griznog at gmail.com> wrote:
> Another bit of data:
>
>  user at submitA ~/Testing $ qstat
>  Job id                    Name             User            Time Use S Queue
>  ------------------------- ---------------- --------------- -------- - -----
>  179.hostA                  STDIN            user          00:00:00 R uinta
>  user at submitA ~/Testing $ qdel 179
>  qdel: Unknown Job Id 179.hostA.hpc.usu.edu
>  user at submitA ~/Testing $ qdel 179.moab
>  qdel: Unknown Job Id 179.hostA.hpc.usu.edu
>  user at submitA ~/Testing $ qstat
>  Job id                    Name             User            Time Use S Queue
>  ------------------------- ---------------- --------------- -------- - -----
>  179.hostA                  STDIN            user          00:00:00 R uinta
>  user at submitA ~/Testing $
>
>  This seems to all be caused by the submitA host insisting on using the
>  FQDN when looking for a job running/submitted on hostA.
>
>  jbh
>
>
>
>  On Wed, Feb 20, 2008 at 7:50 PM, John Hanks <griznog at gmail.com> wrote:
>  > Sorry for rambling, but...
>  >
>  > it looks to me like job 171 submits ok and is called 171.hostA. Then 172
>  > submits and tries to depend on 171, but for some reason torque looks for
>  > 171.hostA.hpc.usu.edu instead of 171.hostA. I've tried changing the
>  > server_name file on submitA to be
>  >
>  > hostA
>  > hostA.hpc.usu.edu
>  >
>  > but get the same result for both. I've also tried using the job id syntax of
>  > 171 and 171.hostA in the "-W depend=...", but still get the same result
>  > either way.
>  >
>  > jbh
>  >
>  >
>  >
>  > On Wed, Feb 20, 2008 at 7:44 PM, John Hanks <griznog at gmail.com> wrote:
>  > > Looking at more log messages I see these:
>  > >
>  > > 02/20/2008 19:32:01;0080;PBS_Server;Job;167.hostA.hpc.usu.edu;Unknown Job Id
>  > >
>  > > Is this because the submitA host is adding .hpc.usu.edu to the job name?
>  > > If so, where is it picking this up from?
>  > >
>  > > jbh
>  > >
>  > >
>  > >
>  > >
>  > >
>  > >
>  > > On Wed, Feb 20, 2008 at 7:39 PM, John Hanks <griznog at gmail.com> wrote:
>  > >
>  > > > Jobs submitted on submitA do not have dependencies listed by qstat -f.
>  > > >
>  > > > Jobs submitted on hostA have this line in qstat -f output:
>  > > >
>  > > > depend = afterany:169.hostA at hostA
>  > > >
>  > > > All jobs correctly display the submit args with "-W depend=..."
>  > > >
>  > > > Torque logs these lines when the job asking for dependencies is submitted from submitA:
>  > > >
>  > > > 02/20/2008 19:31:58;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=RegisterDependency, from @hostA.hpc.usu.edu
>  > > > 02/20/2008 19:31:58;0008;PBS_Server;Job;167.hostA;Job Queued at request of A00017456 at submitA, owner = A00017456 at submitA, job name = job.sh, queue = uinta
>  > > > 02/20/2008 19:31:58;0008;PBS_Server;Job;167.hostA;Dependency request for job rejected by 166.hostA.hpc.usu.edu
>  > > >
>  > > >
>  > > > Thanks,
>  > > >
>  > > > jbh
>  > > >
>  > > >
>  > > >
>  > > >
>  > > >
>  > > > On Wed, Feb 20, 2008 at 3:29 PM, Garrick Staples <garrick at usc.edu> wrote:
>  > > >
>  > > > >
>  > > > >
>  > > > >
>  > > > > On Wed, Feb 20, 2008 at 03:11:46PM -0700, John Hanks alleged:
>  > > > >
>  > > > >
>  > > > >
>  > > > > > Hello,
>  > > > > >
>  > > > > > I have a test setup, torque 2.2.1 and moab 5.2.1 running on a host,
>  > > > > > call it hostA, and a submit host called submitA which only has the
>  > > > > > torque clients (qsub, qstat, etc.).  I can successfully submit jobs
>  > > > > > from submitA to hostA with qsub, but get odd behavior when using -W
>  > > > > > depend=afterany:JOBID. For example
>  > > > > >
>  > > > > > as a user on hostA I can do
>  > > > > >
>  > > > > > $ qsub job.sh
>  > > > > > hostA.165
>  > > > > > $ qsub -W depend=afterany:165 job.sh
>  > > > > > hostA.166
>  > > > > >
>  > > > > > Then look at job 166 with checkjob and see it correctly handles the dependency:
>  > > > > >
>  > > > > > NOTE:  job cannot run  (job has hold in place)
>  > > > > > NOTE:  job cannot run  (dependency 165 jobsuccessfulcomplete not met)
>  > > > > > BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)
>  > > > > >
>  > > > > > however, if I do the same thing from submitA
>  > > > > >
>  > > > > > $ qsub job.sh
>  > > > > > hostA.167
>  > > > > > $ qsub -W depend=afterany:167 job.sh
>  > > > > > hostA.168
>  > > > > >
>  > > > > > Then, looking at the job with checkjob, it says:
>  > > > > >
>  > > > > > NOTE:  job cannot run  (job has hold in place)
>  > > > > > BLOCK MSG: non-idle state 'Hold' (recorded at last scheduling iteration)
>  > > > > >
>  > > > > > and treats this as a hold, so that the job never runs until I do a
>  > > > > > manual releasehold to release the hold.
>  > > > > >
>  > > > > > I have server_name on both hostA and submitA set to point to hostA and
>  > > > > > torque has
>  > > > > >
>  > > > > > set server submit_hosts = submitA
>  > > > > >
>  > > > > > in its configuration. What do I need to do to have dependencies
>  > > > > > handled correctly from any submit host?
>  > > > >
>  > > > > 'checkjob' is a maui program and doesn't really say what is going on within torque.
>  > > > >
>  > > > > Does 'qstat -f' show that the deps are correctly set up within torque?
>  > > > >
>  > > > > --
>  > > > > Garrick Staples, GNU/Linux HPCC SysAdmin
>  > > > > University of Southern California
>  > > > >
>  > > > > Please avoid sending me Word or PowerPoint attachments.
>  > > > > See http://www.gnu.org/philosophy/no-word-attachments.html
>  > > > >
>  > > > >
>  > > > >
>  > > >
>  > > >
>  > >
>  > >
>  >
>  >
>