[torqueusers] Question about prologue scripts and sending jobs back to queue.
John Hanks
griznog at gmail.com
Wed Jan 23 17:00:45 MST 2008
On Jan 23, 2008 3:14 PM, Garrick Staples <garrick at usc.edu> wrote:
> On Wed, Jan 23, 2008 at 02:30:59PM -0700, John Hanks alleged:
> > Moab 5.1.0.
> >
> > I found DEFERTIME and I think that'll solve my deferred issue. I'll
> > have to read more about resources, I just assumed adding
> > 'fakeresource' to my test nodes in my nodes file would be sufficient
> > but it makes sense that I'd have to tell moab about it. Just need to
> > figure out how.
>
> You don't necessarily have to tell moab about it. Moab can read this stuff
> from pbs_server.
>
I added a NODECFG line to specify my fakeresource on a single node.
Then I submitted a job with things configured so that the health
check would fail and offline the node. That part worked: the job was
requeued, but a minute later it started on another node that didn't
have the fake resource feature. I qsub'd a second job and it started
directly on a node without the feature. Some snippets are:
moab.cfg:
NODECFG[uinta-0003] FEATURES+=griznog
nodes:
uinta-0003 np=4 myrinet griznog
I reset everything and did a fresh test. Here is the qstat -f output
for the job after it is correctly rejected due to the health check
failure and the node gets offlined:
A00017456 at uinta ~ $ qstat -f 1632
Job Id: 1632.uinta
Job_Name = job-testqpeek.sh
Job_Owner = A00017456 at uinta
job_state = Q
queue = uinta
server = uinta
Checkpoint = u
ctime = Wed Jan 23 15:36:29 2008
Error_Path = uinta:/home/A00017456/job-testqpeek.sh.e1632
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Jan 23 15:36:34 2008
Output_Path = uinta:/home/A00017456/job-testqpeek.sh.o1632
Priority = 0
qtime = Wed Jan 23 15:36:29 2008
Rerunable = True
Resource_List.feature = griznog
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=2
Resource_List.walltime = 00:30:00
Shell_Path_List = /bin/bash
Variable_List = PBS_O_HOME=/home/A00017456,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=A00017456,
PBS_O_PATH=/home/A00017456/bin:/sbin:/usr/sbin:/usr/local/sbin:/bin:/
usr/bin:/usr/local/bin:/usr/X11R6/bin:/opt/apps/softenv/softenv-1.6.2/
bin:/opt/apps/tess/tess-1.1:/opt/apps/mrbayes/mrbayes-3.1.2:/opt/apps/
genome/bin:/opt/apps/blast/wu-blast/wu-blast-2.0:/opt/apps/blast/ncbi-
blast/blast-2.2.17/bin:/opt/apps/clustalw/clustalw2.0,
PBS_O_MAIL=/var/spool/mail/A00017456,PBS_O_SHELL=/bin/bash,
PBS_SERVER=uinta,PBS_O_HOST=uinta,PBS_O_WORKDIR=/home/A00017456,
PBS_O_QUEUE=uinta
comment = job rejected by RM 'uinta' - job started on hostlist uinta-0003
at time 15:36:30_01/23,
job reported idle at time 15:36:34_01/23 (see RM logs for details)
etime = Wed Jan 23 15:36:29 2008
exit_status = -3
submit_args = -l feature=griznog Testing/job-testqpeek.sh
And here is the qstat -f output after it gets requeued and incorrectly
starts on a node that lacks the feature:
A00017456 at uinta ~ $ qstat -f 1632
Job Id: 1632.uinta
Job_Name = job-testqpeek.sh
Job_Owner = A00017456 at uinta
resources_used.cput = 00:00:00
resources_used.mem = 5244kb
resources_used.vmem = 156184kb
resources_used.walltime = 00:11:18
job_state = R
queue = uinta
server = uinta
Checkpoint = u
ctime = Wed Jan 23 15:36:29 2008
Error_Path = uinta:/home/A00017456/job-testqpeek.sh.e1632
exec_host = uinta-0044/1+uinta-0044/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Jan 23 15:37:39 2008
Output_Path = uinta:/home/A00017456/job-testqpeek.sh.o1632
Priority = 0
qtime = Wed Jan 23 15:36:29 2008
Rerunable = True
Resource_List.feature = griznog
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=2
Resource_List.walltime = 00:30:00
session_id = 9003
Shell_Path_List = /bin/bash
Variable_List = PBS_O_HOME=/home/A00017456,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=A00017456,
PBS_O_PATH=/home/A00017456/bin:/sbin:/usr/sbin:/usr/local/sbin:/bin:/
usr/bin:/usr/local/bin:/usr/X11R6/bin:/opt/apps/softenv/softenv-1.6.2/
bin:/opt/apps/tess/tess-1.1:/opt/apps/mrbayes/mrbayes-3.1.2:/opt/apps/
genome/bin:/opt/apps/blast/wu-blast/wu-blast-2.0:/opt/apps/blast/ncbi-
blast/blast-2.2.17/bin:/opt/apps/clustalw/clustalw2.0,
PBS_O_MAIL=/var/spool/mail/A00017456,PBS_O_SHELL=/bin/bash,
PBS_SERVER=uinta,PBS_O_HOST=uinta,PBS_O_WORKDIR=/home/A00017456,
PBS_O_QUEUE=uinta
comment = job rejected by RM 'uinta' - job started on hostlist uinta-0003
at time 15:36:30_01/23,
job reported idle at time 15:36:34_01/23 (see RM logs for details)
etime = Wed Jan 23 15:36:29 2008
exit_status = -3
submit_args = -l feature=griznog Testing/job-testqpeek.sh
I don't know what bit I'm missing to make it enforce the feature list
after the requeue.
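For anyone wanting to catch this automatically, here is a minimal sketch (not part of TORQUE or Moab) that compares the feature a job requested with the properties listed in the nodes file for each host in exec_host. The function names are hypothetical, the nodes-file entry for uinta-0044 is assumed for illustration since it is not shown above, and only a single feature name in Resource_List.feature is handled.

```python
import re


def parse_qstat_attrs(text):
    """Collect 'key = value' pairs from qstat -f output.

    Wrapped continuation lines (e.g. the rest of Variable_List) are ignored.
    """
    attrs = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\w.]+) = (.*)$", line)
        if m:
            attrs[m.group(1)] = m.group(2).strip()
    return attrs


def parse_nodes_file(text):
    """Map node name -> set of properties from a TORQUE nodes file.

    Tokens containing '=' (such as np=4) are treated as settings, not properties.
    """
    nodes = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if fields:
            nodes[fields[0]] = {f for f in fields[1:] if "=" not in f}
    return nodes


def misplaced_hosts(qstat_text, nodes_text):
    """Return the exec hosts that lack the feature the job requested."""
    attrs = parse_qstat_attrs(qstat_text)
    wanted = attrs.get("Resource_List.feature")
    exec_host = attrs.get("exec_host")
    if not wanted or not exec_host:
        return []  # nothing requested, or the job is not running
    nodes = parse_nodes_file(nodes_text)
    # exec_host looks like 'uinta-0044/1+uinta-0044/0'; keep the host part only.
    hosts = {entry.split("/")[0] for entry in exec_host.split("+")}
    return sorted(h for h in hosts if wanted not in nodes.get(h, set()))


if __name__ == "__main__":
    qstat_text = (
        "Job Id: 1632.uinta\n"
        "job_state = R\n"
        "exec_host = uinta-0044/1+uinta-0044/0\n"
        "Resource_List.feature = griznog\n"
    )
    nodes_text = (
        "uinta-0003 np=4 myrinet griznog\n"
        "uinta-0044 np=4 myrinet\n"  # assumed entry, for illustration only
    )
    print(misplaced_hosts(qstat_text, nodes_text))
```

In practice the two strings would come from the real `qstat -f <jobid>` output and the server's nodes file; a non-empty result means the job is running on a host without the requested feature.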
Thanks,
jbh