[torqueusers] torque 3.0.3 on uv

David Beer dbeer at adaptivecomputing.com
Wed Jan 4 10:51:12 MST 2012


Actually, I was able to reproduce and fix the crash. Let me know if you would like a snapshot build.

David

----- Original Message -----
> Hi All,
> 
> 
> 
> I’ve started configuring torque 3.0.3 on an SGI UV system (following
> http://www.clusterresources.com/torquedocs/1.7torqueonnuma.shtml )
> and am having problems.
> 
> 
> 
> I started with a working non-numa 3.0.3 setup as a sanity check.
> 
> 
> 
> I configured it with --enable-numa-support and made a nodes file:
> 
> cherax-1 np=48 num_numa_nodes=6
> 
> and mom.layout
> 
> #cpus=0-15 mem=0-1 /boot
> cpus=16-23 mem=2
> cpus=24-31 mem=3
> #cpus=32-47 mem=4-5 /user
> cpus=48-55 mem=6
> cpus=55-63 mem=7
> cpus=64-71 mem=8
> cpus=72-79 mem=9
> 
> (note that some of the blades are set aside for io etc. and not all
> are currently on or configured).
> 
> 
> 
> ‘pbsnodes -a’ then reports sensible info about (virtual) nodes
> cherax-1-0 through cherax-1-5
> 
> 
> 
> However, then it gets messy.
> 
> 
> 
> I couldn’t submit jobs anymore (ruserok errors). Putting cherax-1 in
> a .rhosts file allowed me to submit a job which seemed to run ok but
> it failed to finish cleanly:
> 
> 01/04/2012 14:41:46 S Reply sent for request type JobObituary on socket 17
> 01/04/2012 14:41:46 M scan_for_terminated: job 248.cherax-1.hpsc.csiro.au task 1 terminated, sid=148991
> 01/04/2012 14:41:46 M job was terminated
> 01/04/2012 14:41:46 M obit sent to server
> 01/04/2012 14:41:46 M server rejected job obit - 15001
> 01/04/2012 14:41:47 M removed job script
> 01/04/2012 14:41:54 S preparing to send 'a' mail for job 248.cherax-1.hpsc.csiro.au to wil240 at cherax-1.hpsc.csiro.au (Job does not exist on node)
> 
> 
> 
> The server log has messages about nodes changing state (I think the
> state=512 is unexpected):
> 
> 01/04/2012 15:22:56;0040;PBS_Server;Req;is_stat_get;received status from node cherax-1
> 01/04/2012 15:22:56;0040;PBS_Server;Req;update_node_state;adjusting state for node cherax-1-4 - state=512, newstate=0
> 
> 
> 
> Is it possible that the name ‘cherax-1’ is being handled badly with a
> trailing hyphen-digit, similar to the virtual node designation?
> 
> 
> 
> Last, I also tried a non-uniform layout with numa_node_str
> 
> Nodes: cherax-1 numa_node_str=16,8,8,16
> 
> (with a compatible mom.layout) and pbs_server crashed:
> 
> Jan 4 13:55:27 cherax-1 kernel: [66593.114396] pbs_server[118916] trap divide error ip:4106f5 sp:7fffcb74ec90 error:0 in pbs_server[400000+58000]
> 
> Has anyone used such a setup successfully?
> 
> 
> 
> Regards,
> 
> 
> 
> Gareth
> 
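
A mom.layout like the one quoted above can be sanity-checked before starting pbs_mom. The following is only a minimal sketch: the helper names (expand, parse_layout, find_cpu_overlaps) are hypothetical and not part of TORQUE, and the line format (cpus=<range> mem=<range>, with '#' lines skipped) is assumed purely from the example in this thread, not from the TORQUE documentation.

```python
import re

# Hypothetical helpers, not part of TORQUE. The line format is inferred
# only from the mom.layout quoted in this thread:
#   cpus=<range> mem=<range>   ('#' lines are treated as disabled)
LINE_RE = re.compile(r"^cpus=(\S+)\s+mem=(\S+)")


def expand(spec):
    """Expand a spec such as '16-23' or '0,2-3' into a set of integers."""
    ids = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids


def parse_layout(text):
    """Return a list of (cpus, mems) sets, one per active layout line."""
    nodes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError(f"unrecognised layout line: {line!r}")
        nodes.append((expand(m.group(1)), expand(m.group(2))))
    return nodes


def find_cpu_overlaps(nodes):
    """Return (i, j, shared_cpus) for every pair of lines sharing a CPU."""
    overlaps = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            shared = nodes[i][0] & nodes[j][0]
            if shared:
                overlaps.append((i, j, sorted(shared)))
    return overlaps


# The layout exactly as posted in the original message.
LAYOUT = """\
#cpus=0-15 mem=0-1 /boot
cpus=16-23 mem=2
cpus=24-31 mem=3
#cpus=32-47 mem=4-5 /user
cpus=48-55 mem=6
cpus=55-63 mem=7
cpus=64-71 mem=8
cpus=72-79 mem=9
"""

if __name__ == "__main__":
    nodes = parse_layout(LAYOUT)
    print(len(nodes), "active layout lines")
    for i, j, shared in find_cpu_overlaps(nodes):
        print(f"active lines {i} and {j} share CPUs {shared}")
```

Run against the layout as posted, this finds six active lines and flags CPU 55 as listed in both the 48-55 and 55-63 lines. Whether that overlap is a transcription slip or has anything to do with the problems reported is not established in this thread.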

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1712 S East Bay Blvd, Suite 300
     Provo, UT 84606


