[torqueusers] torque 3.0.3 on uv

David Beer dbeer at adaptivecomputing.com
Wed Jan 4 09:53:13 MST 2012



----- Original Message -----
> 
> Hi All,
> 
> I’ve started configuring torque 3.0.3 on an SGI UV system (following
> http://www.clusterresources.com/torquedocs/1.7torqueonnuma.shtml )
> and am having problems.
> 
> I started with a working non-numa 3.0.3 setup as a sanity check.
> 
> I configured it with --enable-numa-support and made a nodes file:
> 
> cherax-1 np=48 num_numa_nodes=6
> 
> and mom.layout
> 
> #cpus=0-15 mem=0-1 /boot
> cpus=16-23 mem=2
> cpus=24-31 mem=3
> #cpus=32-47 mem=4-5 /user
> cpus=48-55 mem=6
> cpus=56-63 mem=7
> cpus=64-71 mem=8
> cpus=72-79 mem=9
> 
> (note that some of the blades are set aside for I/O etc. and not all
> are currently powered on or configured).

For me this is the first red flag. I don't know of anyone successfully using non-sequential layouts (skipping a blade in the middle). Other sites do skip some blades at the beginning or end for the boot set, and in fact that is typical, but I don't think anyone skips in the middle. Would it be possible to move that /user reservation either to the front or to the back? Something like the sketch below.
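Purely as an illustration (the cpu/mem numbers below are made up, and I'm assuming the reserved blades can actually be re-pinned on your hardware), a sequential mom.layout keeping six 8-cpu compute nodes might look like:

#cpus=0-15 mem=0-1 /boot
cpus=16-23 mem=2
cpus=24-31 mem=3
cpus=32-39 mem=4
cpus=40-47 mem=5
cpus=48-55 mem=6
cpus=56-63 mem=7
#cpus=64-79 mem=8-9 /user

That leaves num_numa_nodes=6 and np=48 unchanged while keeping the reservations at the edges instead of the middle.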

> 
> ‘pbsnodes -a’ then reports sensible info about (virtual) nodes
> cherax-1-0 through cherax-1-5
> 
> However, then it gets messy.
> 
> I couldn't submit jobs anymore (ruserok errors). Putting cherax-1 in
> a .rhosts file allowed me to submit a job, which seemed to run OK but
> failed to finish cleanly:
> 
> 01/04/2012 14:41:46 S Reply sent for request type JobObituary on
> socket 17
> 01/04/2012 14:41:46 M scan_for_terminated: job
> 248.cherax-1.hpsc.csiro.au task 1 terminated, sid=148991
> 01/04/2012 14:41:46 M job was terminated
> 01/04/2012 14:41:46 M obit sent to server
> 01/04/2012 14:41:46 M server rejected job obit - 15001
> 01/04/2012 14:41:47 M removed job script
> 01/04/2012 14:41:54 S preparing to send 'a' mail for job
> 248.cherax-1.hpsc.csiro.au to wil240 at cherax-1.hpsc.csiro.au (Job
> does not exist on node)
> 

Can you turn logging up (say to 10 or so) on the mom and the server, reproduce this, and email me the complete log? One way to raise the levels is sketched below.
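A minimal sketch, assuming stock paths (TORQUE_HOME defaults to /var/spool/torque; adjust if your install differs) and assuming your build accepts a log_level of 10, since some versions cap it lower:

# raise server verbosity on the fly
qmgr -c 'set server log_level = 10'

# raise mom verbosity: add this line to TORQUE_HOME/mom_priv/config
$loglevel 10

# then make pbs_mom re-read its config
kill -HUP $(pgrep pbs_mom)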

> 
> The server log has messages about nodes changing state (I think the
> state=512 is unexpected):
> 
> 01/04/2012 15:22:56;0040;PBS_Server;Req;is_stat_get;received status
> from node cherax-1
> 01/04/2012 15:22:56;0040;PBS_Server;Req;update_node_state;adjusting
> state for node cherax-1-4 - state=512, newstate=0
> 

If you could reproduce this in the same set of logs, that'd be great. It's hard to know whether 512 is unexpected or not without knowing why TORQUE set the state to 512. A quick way to confirm the capture includes the event is sketched below.
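To check that the capture includes the state change, something like this should find the entries (assuming the default spool location; adjust the path to your PBS_HOME, and the file name is just the date of the run):

grep 'update_node_state' /var/spool/torque/server_logs/20120104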

> 
> Is it possible that the name ‘cherax-1’ is being handled badly with a
> trailing hyphen-digit, similar to the virtual node designation?
> 

It is possible, although I know for a fact that we have a site with the same naming convention that doesn't experience these problems. I would look at the non-sequential blades first.

> 
> Lastly, I also tried a non-uniform layout with numa_node_str
> 
> Nodes file: cherax-1 numa_node_str=16,8,8,16
> 
> (with a compatible mom.layout) and pbs_server crashed:
> 
> Jan 4 13:55:27 cherax-1 kernel: [66593.114396] pbs_server[118916]
> trap divide error ip:4106f5 sp:7fffcb74ec90 error:0 in
> pbs_server[400000+58000]
> 
> Has anyone used such a setup successfully?
> 

I can't say for sure (hopefully someone else will chime in), but I thought we had sites using it. Would it be possible for you to enable core dumping (a quick sketch is below) and send me the core? If it is too large to email, you can upload it to our scp server; if that's necessary I'll send you the details directly. I would really like to fix this, and I'm thinking it should be fairly straightforward.
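Assuming pbs_server is started from a shell (if an init script starts it, the ulimit line needs to go into that script instead):

ulimit -c unlimited
pbs_server

After the crash, the core file usually lands in the daemon's working directory, often TORQUE_HOME/server_priv; if it isn't there, /proc/sys/kernel/core_pattern will tell you where your kernel writes cores.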


-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1712 S East Bay Blvd, Suite 300
     Provo, UT 84606


