[torqueusers] Question about prologue scripts and sending jobs back to queue.

John Hanks griznog at gmail.com
Wed Jan 23 16:02:51 MST 2008


On Jan 23, 2008 3:14 PM, Garrick Staples <garrick at usc.edu> wrote:
> On Wed, Jan 23, 2008 at 02:30:59PM -0700, John Hanks alleged:
> > Moab 5.1.0.
> >
> > I found DEFERTIME and I think that'll solve my deferred issue. I'll
> > have to read more about resources, I just assumed adding
> > 'fakeresource' to my test nodes in my nodes file would be sufficient
> > but it makes sense that I'd have to tell moab about it. Just need to
> > figure out how.
>
> You don't necessarily have to tell moab about it.  Moab can read this stuff
> from pbs_server.
>

I added a NODECFG line to specify my fakeresource on a single node.
Then I submitted a job with things configured so that the health
check would fail and offline the node. That part worked: the job was
requeued, but a minute later it started on another node that doesn't
have the fake resource feature. I then qsub'd a second job and it
started directly on a node without the feature. Some snippets:

moab.cfg:

NODECFG[uinta-0003]   FEATURES+=griznog

nodes:

uinta-0003 np=4 myrinet griznog

I reset everything and did a fresh test. Here is qstat -f for the job
after it was correctly rejected due to the health check failure:

A00017456 at uinta ~ $ qstat -f 1632
Job Id: 1632.uinta
    Job_Name = job-testqpeek.sh
    Job_Owner = A00017456 at uinta
    job_state = Q
    queue = uinta
    server = uinta
    Checkpoint = u
    ctime = Wed Jan 23 15:36:29 2008
    Error_Path = uinta:/home/A00017456/job-testqpeek.sh.e1632
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Jan 23 15:36:34 2008
    Output_Path = uinta:/home/A00017456/job-testqpeek.sh.o1632
    Priority = 0
    qtime = Wed Jan 23 15:36:29 2008
    Rerunable = True
    Resource_List.feature = griznog
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=2
    Resource_List.walltime = 00:30:00
    Shell_Path_List = /bin/bash
    Variable_List = PBS_O_HOME=/home/A00017456,PBS_O_LANG=en_US.UTF-8,
	PBS_O_LOGNAME=A00017456,
	PBS_O_PATH=/home/A00017456/bin:/sbin:/usr/sbin:/usr/local/sbin:/bin:/
	usr/bin:/usr/local/bin:/usr/X11R6/bin:/opt/apps/softenv/softenv-1.6.2/
	bin:/opt/apps/tess/tess-1.1:/opt/apps/mrbayes/mrbayes-3.1.2:/opt/apps/
	genome/bin:/opt/apps/blast/wu-blast/wu-blast-2.0:/opt/apps/blast/ncbi-
	blast/blast-2.2.17/bin:/opt/apps/clustalw/clustalw2.0,
	PBS_O_MAIL=/var/spool/mail/A00017456,PBS_O_SHELL=/bin/bash,
	PBS_SERVER=uinta,PBS_O_HOST=uinta,PBS_O_WORKDIR=/home/A00017456,
	PBS_O_QUEUE=uinta
    comment = job rejected by RM 'uinta' - job started on hostlist uinta-0003
	at time 15:36:30_01/23,
	 job reported idle at time 15:36:34_01/23 (see RM logs for details)

    etime = Wed Jan 23 15:36:29 2008
    exit_status = -3
    submit_args = -l feature=griznog Testing/job-testqpeek.sh



And qstat -f after the job was requeued and then started on a node that
lacks the feature:

A00017456 at uinta ~ $ qstat -f 1632
Job Id: 1632.uinta
    Job_Name = job-testqpeek.sh
    Job_Owner = A00017456 at uinta
    resources_used.cput = 00:00:00
    resources_used.mem = 5244kb
    resources_used.vmem = 156184kb
    resources_used.walltime = 00:11:18
    job_state = R
    queue = uinta
    server = uinta
    Checkpoint = u
    ctime = Wed Jan 23 15:36:29 2008
    Error_Path = uinta:/home/A00017456/job-testqpeek.sh.e1632
    exec_host = uinta-0044/1+uinta-0044/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Jan 23 15:37:39 2008
    Output_Path = uinta:/home/A00017456/job-testqpeek.sh.o1632
    Priority = 0
    qtime = Wed Jan 23 15:36:29 2008
    Rerunable = True
    Resource_List.feature = griznog
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=2
    Resource_List.walltime = 00:30:00
    session_id = 9003
    Shell_Path_List = /bin/bash
    Variable_List = PBS_O_HOME=/home/A00017456,PBS_O_LANG=en_US.UTF-8,
	PBS_O_LOGNAME=A00017456,
	PBS_O_PATH=/home/A00017456/bin:/sbin:/usr/sbin:/usr/local/sbin:/bin:/
	usr/bin:/usr/local/bin:/usr/X11R6/bin:/opt/apps/softenv/softenv-1.6.2/
	bin:/opt/apps/tess/tess-1.1:/opt/apps/mrbayes/mrbayes-3.1.2:/opt/apps/
	genome/bin:/opt/apps/blast/wu-blast/wu-blast-2.0:/opt/apps/blast/ncbi-
	blast/blast-2.2.17/bin:/opt/apps/clustalw/clustalw2.0,
	PBS_O_MAIL=/var/spool/mail/A00017456,PBS_O_SHELL=/bin/bash,
	PBS_SERVER=uinta,PBS_O_HOST=uinta,PBS_O_WORKDIR=/home/A00017456,
	PBS_O_QUEUE=uinta
    comment = job rejected by RM 'uinta' - job started on hostlist uinta-0003
	at time 15:36:30_01/23,
	 job reported idle at time 15:36:34_01/23 (see RM logs for details)

    etime = Wed Jan 23 15:36:29 2008
    exit_status = -3
    submit_args = -l feature=griznog Testing/job-testqpeek.sh

I don't know what bit I'm missing to make it enforce the feature list
after the requeue. It's not critically important right now because I
am only using the feature to confine the jobs to the node(s) I am
using for testing. But I do have long-term plans for using arbitrary
features with licensing and other things, so I'd like to know how to
enforce them when they are requested.
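
A minimal sketch of how the mismatch in the two listings above could be
checked mechanically: the Python script below parses qstat -f text and a
nodes file, then reports any exec_host node that lacks the requested
feature. The function names are hypothetical, the uinta-0044 nodes line is
an assumption (the post only says that node lacks the feature), and a
single feature name per job is assumed. It only detects the problem; it
does not change how Moab or pbs_server place jobs.

```python
def parse_qstat_f(text):
    """Parse `qstat -f` text into a dict, joining wrapped continuation lines."""
    attrs = {}
    key = None
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            attrs["Job Id"] = line.split(":", 1)[1].strip()
            key = None
        elif line.startswith("\t") and key is not None:
            # Wrapped values continue on lines that begin with a tab.
            attrs[key] += line[1:]
        elif " = " in line:
            key, value = line.strip().split(" = ", 1)
            attrs[key] = value
    return attrs


def parse_nodes_file(text):
    """Map node name -> set of properties listed in a nodes file."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Tokens such as np=4 are settings, not properties.
        nodes[fields[0]] = {f for f in fields[1:] if "=" not in f}
    return nodes


def exec_hosts(attrs):
    """Return the distinct hosts in exec_host, e.g. 'uinta-0044/1+uinta-0044/0'."""
    slots = attrs.get("exec_host", "").split("+")
    return {slot.split("/", 1)[0] for slot in slots if slot}


def nodes_missing_feature(attrs, nodes):
    """Return the hosts the job runs on that lack the requested feature."""
    feature = attrs.get("Resource_List.feature")
    if feature is None:
        return set()
    return {h for h in exec_hosts(attrs) if feature not in nodes.get(h, set())}


# Trimmed from the second qstat -f listing in this post.
SAMPLE_QSTAT = "\n".join([
    "Job Id: 1632.uinta",
    "    job_state = R",
    "    exec_host = uinta-0044/1+uinta-0044/0",
    "    Resource_List.feature = griznog",
    "    Variable_List = PBS_O_HOME=/home/A00017456,PBS_O_LANG=en_US.UTF-8,",
    "\tPBS_O_LOGNAME=A00017456,",
    "    submit_args = -l feature=griznog Testing/job-testqpeek.sh",
])

# The uinta-0003 line is from the post; the uinta-0044 line is assumed.
SAMPLE_NODES = "\n".join([
    "uinta-0003 np=4 myrinet griznog",
    "uinta-0044 np=4 myrinet",
])

if __name__ == "__main__":
    job = parse_qstat_f(SAMPLE_QSTAT)
    bad = nodes_missing_feature(job, parse_nodes_file(SAMPLE_NODES))
    print(job["Job Id"], "runs on nodes lacking the feature:", sorted(bad))
```

To use it against a live system, the sample strings would be replaced
with the real output of qstat -f and the contents of the server's nodes
file.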

Thanks,

jbh

