[torqueusers] acl_hosts oddity

Garrick Staples garrick at usc.edu
Tue Jan 31 17:07:31 MST 2006


On Tue, Jan 31, 2006 at 03:17:43PM -0500, nathaniel.x.woody at gsk.com alleged:
> First of all, thank you for your previous assistance on figuring out 
> $tmpdir.  For anyone else who struggles with that, the three pieces we 
> needed were 1) running configure with "--enable-wordexp" and 2) setting 
> $tmpdir /localscratch in the mom_priv/config file and 3) setting the 
> TMPDIR environment variable to $PBS_JOBID in the job request.  Now, torque 
> happily creates a directory for each job that wants it and keeps all the 
> jobs separate.  The job script just cd's to the $TMPDIR directory. Thanks, 
> it works quite nicely now!
> 

1) is no longer necessary with p6 and p7.
2) yup.
3) is only needed if the job wants to override $TMPDIR.  MOM sets $TMPDIR
for the job, but allows the job to override it.
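
As a rough illustration of point 3, a job written in Python could simply
use whatever $TMPDIR it ends up with, whether that value came from MOM or
from the job's own override. The helper name and the /tmp fallback below
are assumptions made for this sketch; they are not part of TORQUE.

```python
import os

def job_scratch_dir(env=None, fallback="/tmp"):
    """Return the scratch directory the job should work in.

    Reads TMPDIR from the given environment mapping (os.environ by
    default). The fallback value is a hypothetical default for this
    sketch, used only when TMPDIR is unset or empty.
    """
    env = os.environ if env is None else env
    return env.get("TMPDIR") or fallback
```

A job would then call os.chdir(job_scratch_dir()) before doing its work,
which mirrors the "cd to $TMPDIR" step in the job script described above.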


> I have noticed something of an oddity (I think), using torque2.0.0p5 and 
> am curious if what I'm seeing is the expected behavior.  When I enable 
> acl_hosts (qmgr -c "s s acl_host_enable=true"), this breaks torque in kind 
> of a bizarre way.  It looks like this prevents MOMs from returning 
> completed job information.  I have to add compute nodes to the acl_hosts 
> list (qmgr -c "s s acl_hosts += node1") in order to get the job to return. 
>  I suppose this means that returning the job info requires server services 
> that are blocked by enabling acl_hosts?

That does sound odd.  I've never used server acl_hosts, so I'm not
familiar with the behaviour.  But this sounds like something we can
change.

I have a bunch of stuff on my plate right now and will likely forget
this.  Can you make a bug report?  You can assign the bug to me.

 
> Eventually, after several minutes, the job gets reported as exceeding the 
> wallclock time.  I get a weird "MOAB_INFO: job exceeded wallclock limit" 
> error and the job gets deleted.  I think this is just the scheduler 
> stepping in at some statjob polling interval and killing the job? 

And this happens in advance of the actual walltime limit of the job?


> On a lark, I checked and specifying "ALLOWCOMPUTEHOSTSUBMIT true" in a 
> torque.cfg file didn't appear to have any effect on this, which it seems 
> like it should.  At this point it appears that setting that parameter 
> allows a compute node to do any operation except return a job result?

ALLOWCOMPUTEHOSTSUBMIT only controls whether the server accepts new job
submissions from compute nodes (that is, running qsub on the nodes).

 
> If the above is the expected behavior, what kind of wildcard matching is 
> allowed in the acl_hosts list?

You can use * as a glob.  *.gsk.com, node*.gsk.com, etc.
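
To get a feel for which hostnames such patterns would cover, here is a
minimal sketch using Python's fnmatch module. It only illustrates generic
"*" glob semantics; it is not TORQUE's own matching code, which may differ
in details such as case handling. The function name and the example.org
hostname are made up for the illustration.

```python
from fnmatch import fnmatchcase

def host_allowed(hostname, acl_patterns):
    """Return True if hostname matches any glob pattern in acl_patterns.

    Both sides are lowercased first, on the assumption that hostnames
    should compare case-insensitively.
    """
    host = hostname.lower()
    return any(fnmatchcase(host, pattern.lower()) for pattern in acl_patterns)

acl = ["*.gsk.com", "node*.gsk.com"]
print(host_allowed("node1.gsk.com", acl))
print(host_allowed("node1.example.org", acl))
```

Note that in this sketch a short name such as node1 does not match
*.gsk.com; whether the server compares short or fully qualified names is
not something this thread settles.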

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California