Fw: [torqueusers] More newbie questions

David Chin david.w.h.chin at gmail.com
Mon Jan 22 16:03:37 MST 2007


On 1/22/07, Ilja Livenson <ilja at nicpb.ee> wrote:
> OK, sorry, then what is the problem? :) I mean, if the job ran
> fine?
>

I want at least two queues:

* "long", which sends jobs only to nodes n01, n02, n03, ..., n18.
* "compile", which sends jobs only to the head node, named master.

As you can see, when I submit to the queue "long", the job runs
on master rather than one of the slaves. That's my problem.

Yes, the job runs fine, but on the wrong node.
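
The mismatch can be checked mechanically from the accounting record that
tracejob prints. The following is a minimal sketch, not part of TORQUE: the
record text is copied from the job 150 output quoted below, and the
INTENDED_HOSTS table and the helper names are hypothetical, written only to
express the intended queue-to-node mapping described above. It assumes
exec_host uses the host/index form, with multiple entries joined by "+".

```python
# Accounting ("A") record copied from the tracejob output for job 150.
RECORD = (
    "user=dwchin group=montecarlo jobname=STDIN queue=long "
    "ctime=1169503426 qtime=1169503426 etime=1169503426 "
    "start=1169503427 exec_host=master/0 "
    "Resource_List.neednodes=master"
)

# Hypothetical table: which hosts each queue is *meant* to use.
# This encodes the intent only; TORQUE knows nothing about it.
INTENDED_HOSTS = {
    "long": {f"n{i:02d}" for i in range(1, 19)},  # n01 .. n18
    "compile": {"master"},
}


def parse_record(text):
    """Split a whitespace-separated key=value record into a dict."""
    return dict(tok.split("=", 1) for tok in text.split() if "=" in tok)


def exec_hosts(fields):
    """Return the host names in exec_host, dropping the /index suffix."""
    return {part.split("/", 1)[0] for part in fields["exec_host"].split("+")}


def misplaced_hosts(text):
    """Hosts the job ran on that lie outside its queue's intended set."""
    fields = parse_record(text)
    return exec_hosts(fields) - INTENDED_HOSTS[fields["queue"]]


if __name__ == "__main__":
    # Any host printed here ran a job from a queue that should not use it.
    print(sorted(misplaced_hosts(RECORD)))
```

An empty result would mean the job landed only on nodes intended for its
queue; for the record above, the check flags master.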

--Dave

> Ilja
>
> David Chin wrote:
> > Yes, it ran correctly. I submitted to the queue "long":
> >
> > Job: 150.jvneumann.dfci.harvard.edu
> >
> > 01/22/2007 17:03:46  S    enqueuing into long, state 1 hop 1
> > 01/22/2007 17:03:46  S    Job Queued at request of dwchin at master.dfci.harvard.edu,
> >                          owner = dwchin at master.dfci.harvard.edu,
> >                          job name = STDIN, queue = long
> > 01/22/2007 17:03:46  A    queue=long
> > 01/22/2007 17:03:47  S    Job Modified at request of maui at master.dfci.harvard.edu
> > 01/22/2007 17:03:47  S    Job Run at request of maui at master.dfci.harvard.edu
> > 01/22/2007 17:03:47  S    Exit_status=0 resources_used.cput=00:00:00
> >                          resources_used.mem=0kb resources_used.vmem=0kb
> >                          resources_used.walltime=00:00:00
> > 01/22/2007 17:03:47  S    dequeuing from long, state COMPLETE
> > 01/22/2007 17:03:47  M    scan_for_terminated: job 150.jvneumann.dfci.harvard.edu
> >                          task 1 terminated, sid 4870
> > 01/22/2007 17:03:47  M    job was terminated
> > 01/22/2007 17:03:47  A    user=dwchin group=montecarlo jobname=STDIN queue=long
> >                          ctime=1169503426 qtime=1169503426 etime=1169503426
> >                          start=1169503427 exec_host=master/0
> >                          Resource_List.neednodes=master
> > 01/22/2007 17:03:47  A    user=dwchin group=montecarlo jobname=STDIN queue=long
> >                          ctime=1169503426 qtime=1169503426 etime=1169503426
> >                          start=1169503427 exec_host=master/0
> >                          Resource_List.neednodes=master session=4870 end=1169503427
> >                          Exit_status=0 resources_used.cput=00:00:00
> >                          resources_used.mem=0kb resources_used.vmem=0kb
> >                          resources_used.walltime=00:00:00
> >
> >
> > I do want the master node to be a worker, but only for
> > a queue that I call "compile". That's my intent: to have the
> > queues "short", "long" and "medium" only send jobs to
> > the cluster nodes, and the queue "compile" to only send
> > jobs to the master node.
> >
> > On 1/22/07, Ilja Livenson <ilja at nicpb.ee> wrote:
> >> Did the job succeed? If not, try looking in the
> >> %TORQUE_DIR%/undelivered. (torque dir might be /var/spool/torque).
> >> I see that you have also set the master node to be a worker node (might be
> >> a bad decision if the master node runs anything besides mom/maui).
> >> Try running it on a certain node and then doing tracejob.
> >>
> >> David Chin wrote:
> >> > On 1/22/07, Ilja Livenson <ilja at nicpb.ee> wrote:
> >> >>
> >> >> have you tried tracejob <JOBID> ?
> >> >>
> >> >
> >> > Here's the output.
> >> >
> >> > Job: 147.jvneumann.dfci.harvard.edu
> >> >
> >> > 01/22/2007 16:52:53  S    enqueuing into long, state 1 hop 1
> >> > 01/22/2007 16:52:53  S    Job Queued at request of
> >> >                          dwchin at master.dfci.harvard.edu, owner =
> >> >                          dwchin at master.dfci.harvard.edu, job name =
> >> >                          queue = long
> >> > 01/22/2007 16:52:53  A    queue=long
> >> > 01/22/2007 16:52:54  S    Job Modified at request of
> >> >                          maui at master.dfci.harvard.edu
> >> > 01/22/2007 16:52:54  S    Job Run at request of maui at master.dfci.har
> >> > 01/22/2007 16:52:54  A    user=dwchin group=montecarlo jobname=STDIN
> >> >                          ctime=1169502773 qtime=1169502773 etime=11
> >> >                          start=1169502774 exec_host=master/0
> >> >                          Resource_List.neednodes=master
> >> >
> >> >
> >>
> >>
> >
> >
>
>


-- 
Email: david.w.h.chin at gmail.com    dwchin at lroc.harvard.edu
Public key: http://gallatin.physics.lsa.umich.edu/~dwchin/crypto.html
      pub   1024D/1C557DDF 2006-07-21 [expires: 2007-07-21]
      Key fingerprint = 4EEB A409 5010 3679 4EA7  D420 4E52 202A 1C55 7DDF

