[torqueusers] Question about prologue scripts and sending jobs back to queue.

John Hanks griznog at gmail.com
Wed Jan 23 17:00:45 MST 2008


On Jan 23, 2008 3:14 PM, Garrick Staples <garrick at usc.edu> wrote:
> On Wed, Jan 23, 2008 at 02:30:59PM -0700, John Hanks alleged:
> > Moab 5.1.0.
> >
> > I found DEFERTIME and I think that'll solve my deferred issue. I'll
> > have to read more about resources, I just assumed adding
> > 'fakeresource' to my test nodes in my nodes file would be sufficient
> > but it makes sense that I'd have to tell moab about it. Just need to
> > figure out how.
>
> You don't necessarily have to tell moab about it.  Moab can read this stuff
> from pbs_server.
>

I added a NODECFG line to specify my fakeresource on a single node.
Then I submitted a job and had things configured so that the health
check would fail and offline the node. It worked: the job requeued, then
a minute later started on another node that didn't have the fake
resource feature. I qsub'd a second job and it started directly on a
node without the feature. Some snippets are:

moab.cfg:

NODECFG[uinta-0003]   FEATURES+=griznog

nodes:

uinta-0003 np=4 myrinet griznog

I reset everything and did a fresh test. Here is the qstat -f output
for the job after it correctly gets rejected due to the health check
failure and the node gets offlined:

A00017456 at uinta ~ $ qstat -f 1632
Job Id: 1632.uinta
    Job_Name = job-testqpeek.sh
    Job_Owner = A00017456 at uinta
    job_state = Q
    queue = uinta
    server = uinta
    Checkpoint = u
    ctime = Wed Jan 23 15:36:29 2008
    Error_Path = uinta:/home/A00017456/job-testqpeek.sh.e1632
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Jan 23 15:36:34 2008
    Output_Path = uinta:/home/A00017456/job-testqpeek.sh.o1632
    Priority = 0
    qtime = Wed Jan 23 15:36:29 2008
    Rerunable = True
    Resource_List.feature = griznog
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=2
    Resource_List.walltime = 00:30:00
    Shell_Path_List = /bin/bash
    Variable_List = PBS_O_HOME=/home/A00017456,PBS_O_LANG=en_US.UTF-8,
	PBS_O_LOGNAME=A00017456,
	PBS_O_PATH=/home/A00017456/bin:/sbin:/usr/sbin:/usr/local/sbin:/bin:/
	usr/bin:/usr/local/bin:/usr/X11R6/bin:/opt/apps/softenv/softenv-1.6.2/
	bin:/opt/apps/tess/tess-1.1:/opt/apps/mrbayes/mrbayes-3.1.2:/opt/apps/
	genome/bin:/opt/apps/blast/wu-blast/wu-blast-2.0:/opt/apps/blast/ncbi-
	blast/blast-2.2.17/bin:/opt/apps/clustalw/clustalw2.0,
	PBS_O_MAIL=/var/spool/mail/A00017456,PBS_O_SHELL=/bin/bash,
	PBS_SERVER=uinta,PBS_O_HOST=uinta,PBS_O_WORKDIR=/home/A00017456,
	PBS_O_QUEUE=uinta
    comment = job rejected by RM 'uinta' - job started on hostlist uinta-0003
	at time 15:36:30_01/23,
	 job reported idle at time 15:36:34_01/23 (see RM logs for details)

    etime = Wed Jan 23 15:36:29 2008
    exit_status = -3
    submit_args = -l feature=griznog Testing/job-testqpeek.sh



qstat -f output after it gets requeued and incorrectly starts on a node
that lacks the feature:

A00017456 at uinta ~ $ qstat -f 1632
Job Id: 1632.uinta
    Job_Name = job-testqpeek.sh
    Job_Owner = A00017456 at uinta
    resources_used.cput = 00:00:00
    resources_used.mem = 5244kb
    resources_used.vmem = 156184kb
    resources_used.walltime = 00:11:18
    job_state = R
    queue = uinta
    server = uinta
    Checkpoint = u
    ctime = Wed Jan 23 15:36:29 2008
    Error_Path = uinta:/home/A00017456/job-testqpeek.sh.e1632
    exec_host = uinta-0044/1+uinta-0044/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Jan 23 15:37:39 2008
    Output_Path = uinta:/home/A00017456/job-testqpeek.sh.o1632
    Priority = 0
    qtime = Wed Jan 23 15:36:29 2008
    Rerunable = True
    Resource_List.feature = griznog
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=2
    Resource_List.walltime = 00:30:00
    session_id = 9003
    Shell_Path_List = /bin/bash
    Variable_List = PBS_O_HOME=/home/A00017456,PBS_O_LANG=en_US.UTF-8,
	PBS_O_LOGNAME=A00017456,
	PBS_O_PATH=/home/A00017456/bin:/sbin:/usr/sbin:/usr/local/sbin:/bin:/
	usr/bin:/usr/local/bin:/usr/X11R6/bin:/opt/apps/softenv/softenv-1.6.2/
	bin:/opt/apps/tess/tess-1.1:/opt/apps/mrbayes/mrbayes-3.1.2:/opt/apps/
	genome/bin:/opt/apps/blast/wu-blast/wu-blast-2.0:/opt/apps/blast/ncbi-
	blast/blast-2.2.17/bin:/opt/apps/clustalw/clustalw2.0,
	PBS_O_MAIL=/var/spool/mail/A00017456,PBS_O_SHELL=/bin/bash,
	PBS_SERVER=uinta,PBS_O_HOST=uinta,PBS_O_WORKDIR=/home/A00017456,
	PBS_O_QUEUE=uinta
    comment = job rejected by RM 'uinta' - job started on hostlist uinta-0003
	at time 15:36:30_01/23,
	 job reported idle at time 15:36:34_01/23 (see RM logs for details)

    etime = Wed Jan 23 15:36:29 2008
    exit_status = -3
    submit_args = -l feature=griznog Testing/job-testqpeek.sh

I don't know what bit I'm missing to make it enforce the feature list
after the requeue.
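
One way to double-check the mismatch without reading the dumps by eye is
to compare the job's exec_host against the properties in the nodes file.
Below is a minimal Python sketch of that check. The function names are
hypothetical, it assumes Resource_List.feature holds a single feature
name, and it only looks at the TORQUE nodes file, not at anything Moab
has been told via NODECFG.

```python
import re


def parse_qstat(text):
    """Collect 'key = value' attributes from `qstat -f` output.

    Wrapped continuation lines are appended to the previous attribute.
    """
    attrs = {}
    key = None
    for line in text.splitlines():
        m = re.match(r"^\s*([\w.]+) = (.*)$", line)
        if m:
            key = m.group(1)
            attrs[key] = m.group(2).strip()
        elif key and line.strip():
            attrs[key] += line.strip()
    return attrs


def parse_nodes(text):
    """Map node name -> set of properties from a TORQUE nodes file.

    Tokens containing '=' (such as np=4) are settings, not properties.
    """
    props = {}
    for line in text.splitlines():
        line = line.split("#")[0].strip()
        if not line:
            continue
        tokens = line.split()
        props[tokens[0]] = {t for t in tokens[1:] if "=" not in t}
    return props


def nodes_missing_feature(qstat_text, nodes_text):
    """Return the exec_host nodes that lack the requested feature."""
    attrs = parse_qstat(qstat_text)
    feature = attrs.get("Resource_List.feature")
    if not feature:
        return []
    # exec_host looks like "uinta-0044/1+uinta-0044/0"
    hosts = sorted(
        {part.split("/")[0] for part in attrs.get("exec_host", "").split("+") if part}
    )
    props = parse_nodes(nodes_text)
    return [h for h in hosts if feature not in props.get(h, set())]
```

Fed the second qstat -f dump above and a nodes file where only
uinta-0003 carries griznog, this should report uinta-0044 as missing
the feature.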

Thanks,

jbh

