[torqueusers] Question about prologue scripts and sending jobs
back to queue.
John Hanks
griznog at gmail.com
Wed Jan 23 16:02:51 MST 2008
On Jan 23, 2008 3:14 PM, Garrick Staples <garrick at usc.edu> wrote:
> On Wed, Jan 23, 2008 at 02:30:59PM -0700, John Hanks alleged:
> > Moab 5.1.0.
> >
> > I found DEFERTIME and I think that'll solve my deferred issue. I'll
> > have to read more about resources, I just assumed adding
> > 'fakeresource' to my test nodes in my nodes file would be sufficient
> > but it makes sense that I'd have to tell moab about it. Just need to
> > figure out how.
>
> You don't necessarily have to tell moab about it. Moab can read this stuff
> from pbs_server.
>
I added a NODECFG line to specify my fakeresource on a single node.
Then I submitted a job with things configured so that the health
check would fail and offline the node. The requeue itself worked: the
job went back to the queue, but a minute later it started on another
node that doesn't have the fake resource feature. I qsub'd a second job
and it started directly on a node without the feature. Some snippets are:
moab.cfg:
  NODECFG[uinta-0003] FEATURES+=griznog

nodes:
  uinta-0003 np=4 myrinet griznog
I reset everything and did a fresh test. Here is qstat -f for the job
after it is correctly rejected due to the health check failure:
A00017456 at uinta ~ $ qstat -f 1632
Job Id: 1632.uinta
Job_Name = job-testqpeek.sh
Job_Owner = A00017456 at uinta
job_state = Q
queue = uinta
server = uinta
Checkpoint = u
ctime = Wed Jan 23 15:36:29 2008
Error_Path = uinta:/home/A00017456/job-testqpeek.sh.e1632
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Jan 23 15:36:34 2008
Output_Path = uinta:/home/A00017456/job-testqpeek.sh.o1632
Priority = 0
qtime = Wed Jan 23 15:36:29 2008
Rerunable = True
Resource_List.feature = griznog
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=2
Resource_List.walltime = 00:30:00
Shell_Path_List = /bin/bash
Variable_List = PBS_O_HOME=/home/A00017456,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=A00017456,
PBS_O_PATH=/home/A00017456/bin:/sbin:/usr/sbin:/usr/local/sbin:/bin:/
usr/bin:/usr/local/bin:/usr/X11R6/bin:/opt/apps/softenv/softenv-1.6.2/
bin:/opt/apps/tess/tess-1.1:/opt/apps/mrbayes/mrbayes-3.1.2:/opt/apps/
genome/bin:/opt/apps/blast/wu-blast/wu-blast-2.0:/opt/apps/blast/ncbi-
blast/blast-2.2.17/bin:/opt/apps/clustalw/clustalw2.0,
PBS_O_MAIL=/var/spool/mail/A00017456,PBS_O_SHELL=/bin/bash,
PBS_SERVER=uinta,PBS_O_HOST=uinta,PBS_O_WORKDIR=/home/A00017456,
PBS_O_QUEUE=uinta
comment = job rejected by RM 'uinta' - job started on hostlist uinta-0003
at time 15:36:30_01/23,
job reported idle at time 15:36:34_01/23 (see RM logs for details)
etime = Wed Jan 23 15:36:29 2008
exit_status = -3
submit_args = -l feature=griznog Testing/job-testqpeek.sh
Here is qstat -f after the job is requeued and then starts on a node
(uinta-0044) that lacks the feature:
A00017456 at uinta ~ $ qstat -f 1632
Job Id: 1632.uinta
Job_Name = job-testqpeek.sh
Job_Owner = A00017456 at uinta
resources_used.cput = 00:00:00
resources_used.mem = 5244kb
resources_used.vmem = 156184kb
resources_used.walltime = 00:11:18
job_state = R
queue = uinta
server = uinta
Checkpoint = u
ctime = Wed Jan 23 15:36:29 2008
Error_Path = uinta:/home/A00017456/job-testqpeek.sh.e1632
exec_host = uinta-0044/1+uinta-0044/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Jan 23 15:37:39 2008
Output_Path = uinta:/home/A00017456/job-testqpeek.sh.o1632
Priority = 0
qtime = Wed Jan 23 15:36:29 2008
Rerunable = True
Resource_List.feature = griznog
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=2
Resource_List.walltime = 00:30:00
session_id = 9003
Shell_Path_List = /bin/bash
Variable_List = PBS_O_HOME=/home/A00017456,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=A00017456,
PBS_O_PATH=/home/A00017456/bin:/sbin:/usr/sbin:/usr/local/sbin:/bin:/
usr/bin:/usr/local/bin:/usr/X11R6/bin:/opt/apps/softenv/softenv-1.6.2/
bin:/opt/apps/tess/tess-1.1:/opt/apps/mrbayes/mrbayes-3.1.2:/opt/apps/
genome/bin:/opt/apps/blast/wu-blast/wu-blast-2.0:/opt/apps/blast/ncbi-
blast/blast-2.2.17/bin:/opt/apps/clustalw/clustalw2.0,
PBS_O_MAIL=/var/spool/mail/A00017456,PBS_O_SHELL=/bin/bash,
PBS_SERVER=uinta,PBS_O_HOST=uinta,PBS_O_WORKDIR=/home/A00017456,
PBS_O_QUEUE=uinta
comment = job rejected by RM 'uinta' - job started on hostlist uinta-0003
at time 15:36:30_01/23,
job reported idle at time 15:36:34_01/23 (see RM logs for details)
etime = Wed Jan 23 15:36:29 2008
exit_status = -3
submit_args = -l feature=griznog Testing/job-testqpeek.sh
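The mismatch in the dump above (Resource_List.feature = griznog, but exec_host on uinta-0044) could be checked mechanically rather than by eye. The following is only a rough Python sketch under stated assumptions: the function names are made up for illustration, the uinta-0044 line in the sample nodes file is hypothetical (only the uinta-0003 line appears in this message), and the requested feature is treated as a single property name. It reads saved text rather than calling qstat or Moab, so it reports what happened but does not enforce anything.

```python
import re


def parse_qstat_full(text):
    """Parse `qstat -f` text into a dict of attribute -> value.

    Assumption: wrapped continuation lines never contain ' = ', so they
    are appended to the value of the preceding attribute.
    """
    attrs = {}
    key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Job Id:"):
            continue
        m = re.match(r"^([A-Za-z_][\w.]*) = (.*)$", line)
        if m:
            key = m.group(1)
            attrs[key] = m.group(2)
        elif key is not None:
            attrs[key] += line
    return attrs


def parse_nodes_file(text):
    """Map host -> set of property names from a TORQUE nodes file."""
    props = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        host, *rest = line.split()
        # Tokens containing '=' (such as np=4) are settings, not properties.
        props[host] = {tok for tok in rest if "=" not in tok}
    return props


def hosts_missing_feature(qstat_text, nodes_text):
    """Return allocated hosts that do not list the requested feature."""
    job = parse_qstat_full(qstat_text)
    feature = job.get("Resource_List.feature")
    exec_host = job.get("exec_host")
    if not feature or not exec_host:
        return []
    props = parse_nodes_file(nodes_text)
    # exec_host looks like "uinta-0044/1+uinta-0044/0": host/slot pairs joined by '+'.
    hosts = sorted({entry.split("/")[0] for entry in exec_host.split("+")})
    return [h for h in hosts if feature not in props.get(h, set())]


# Trimmed-down sample taken from the running-job dump in this message.
QSTAT = """\
Job Id: 1632.uinta
job_state = R
exec_host = uinta-0044/1+uinta-0044/0
Resource_List.feature = griznog
Resource_List.nodes = 1:ppn=2
"""

# The uinta-0044 entry is hypothetical; only uinta-0003 is shown in this message.
NODES = """\
uinta-0003 np=4 myrinet griznog
uinta-0044 np=4 myrinet
"""

print(hosts_missing_feature(QSTAT, NODES))
```

An empty list would mean every allocated host carries the requested feature; any host names in the list are placements that ignored it.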
I don't know what bit I'm missing to make it enforce the feature list
after the requeue. It's not critically important right now because I
am only using the feature to confine the jobs to the node(s) I am
using for testing. But I do have long-term plans for using arbitrary
features with licensing and other things, so I'd like to know how to
enforce them when they are requested.
Thanks,
jbh