[torqueusers] torque 3.0.3 on uv
David Beer
dbeer at adaptivecomputing.com
Wed Jan 4 10:51:12 MST 2012
Actually, I was able to reproduce and fix the crash. Let me know if you would like a snapshot build.
David
----- Original Message -----
> Hi All,
>
>
>
> I’ve started configuring torque 3.0.3 on an SGI UV system (following
> http://www.clusterresources.com/torquedocs/1.7torqueonnuma.shtml )
> and am having problems.
>
>
>
> I started with a working non-numa 3.0.3 setup as a sanity check.
>
>
>
> I configured it with --enable-numa-support and made a nodes file:
>
> cherax-1 np=48 num_numa_nodes=6
>
> and mom.layout
>
> #cpus=0-15 mem=0-1 /boot
> cpus=16-23 mem=2
> cpus=24-31 mem=3
> #cpus=32-47 mem=4-5 /user
> cpus=48-55 mem=6
> cpus=55-63 mem=7
> cpus=64-71 mem=8
> cpus=72-79 mem=9
>
> (note that some of the blades are set aside for io etc. and not all
> are currently on or configured).
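A quick way to sanity-check a layout like the one quoted above is to look for CPU ranges that overlap. The sketch below is hypothetical helper code, not part of TORQUE. It assumes each active line carries a single cpus=a-b range and that lines starting with # are comments. Run on the layout exactly as posted, it reports that CPU 55 appears in both cpus=48-55 and cpus=55-63. Whether that overlap is related to the problems described is not established here.

```python
import re

# The mom.layout contents exactly as posted in the quoted message.
LAYOUT = """\
#cpus=0-15 mem=0-1 /boot
cpus=16-23 mem=2
cpus=24-31 mem=3
#cpus=32-47 mem=4-5 /user
cpus=48-55 mem=6
cpus=55-63 mem=7
cpus=64-71 mem=8
cpus=72-79 mem=9
"""


def parse_range(spec):
    """Expand 'a-b' (inclusive) or a single 'a' into a set of ints."""
    lo, _, hi = spec.partition("-")
    return set(range(int(lo), int(hi or lo) + 1))


def parse_layout(text):
    """Return [(line_number, cpu_set)] for active (non-comment) lines.

    Assumption: one cpus=a-b range per line; other syntaxes are not handled.
    """
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = re.search(r"\bcpus=(\S+)", line)
        if match:
            entries.append((lineno, parse_range(match.group(1))))
    return entries


def find_overlaps(entries):
    """Return (line_a, line_b, shared_cpus) for every overlapping pair."""
    overlaps = []
    for i, (line_a, cpus_a) in enumerate(entries):
        for line_b, cpus_b in entries[i + 1:]:
            shared = cpus_a & cpus_b
            if shared:
                overlaps.append((line_a, line_b, sorted(shared)))
    return overlaps


entries = parse_layout(LAYOUT)
overlaps = find_overlaps(entries)
total = len(set().union(*(cpus for _, cpus in entries)))

print("active layout lines:", len(entries))
print("distinct cpus:", total)
for line_a, line_b, shared in overlaps:
    print(f"lines {line_a} and {line_b} share cpus {shared}")
```

The line numbers in the report count every line of the layout text, including the commented-out ones.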
>
>
>
> ‘pbsnodes -a’ then reports sensible info about (virtual) nodes
> cherax-1-0 through cherax-1-5.
>
>
>
> However, then it gets messy.
>
>
>
> I couldn’t submit jobs anymore (ruserok errors). Putting cherax-1 in
> a .rhosts file allowed me to submit a job, which seemed to run OK, but
> it failed to finish cleanly:
>
> 01/04/2012 14:41:46 S Reply sent for request type JobObituary on
> socket 17
>
> 01/04/2012 14:41:46 M scan_for_terminated: job
> 248.cherax-1.hpsc.csiro.au task 1 terminated, sid=148991
>
> 01/04/2012 14:41:46 M job was terminated
>
> 01/04/2012 14:41:46 M obit sent to server
>
> 01/04/2012 14:41:46 M server rejected job obit - 15001
>
> 01/04/2012 14:41:47 M removed job script
>
> 01/04/2012 14:41:54 S preparing to send 'a' mail for job
> 248.cherax-1.hpsc.csiro.au to wil240 at cherax-1.hpsc.csiro.au (Job
> does not exist on node)
>
>
>
> The server log has messages about nodes changing state (I think the
> state=512 is unexpected):
>
> 01/04/2012 15:22:56;0040;PBS_Server;Req;is_stat_get;received status
> from node cherax-1
>
> 01/04/2012 15:22:56;0040;PBS_Server;Req;update_node_state;adjusting
> state for node cherax-1-4 - state=512, newstate=0
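For anyone collecting these transitions from a longer server log, here is a minimal sketch that pulls the node name and state values out of an update_node_state line. It is hypothetical helper code, and the dictionary keys are my own labels rather than TORQUE terminology. The set_bits helper rests on an unverified assumption that the state value is a bit field. Under that assumption 512 is the single bit 9 (0x200), but what that bit means is not determined here.

```python
import re

# The update_node_state entry from the quoted server log, with the
# email line wrap removed.
LINE = ("01/04/2012 15:22:56;0040;PBS_Server;Req;update_node_state;"
        "adjusting state for node cherax-1-4 - state=512, newstate=0")

STATE_RE = re.compile(
    r"adjusting state for node (\S+) - state=(\d+), newstate=(\d+)"
)


def parse_state_change(line):
    """Return a dict describing a node state change, or None if no match.

    The log line is split on ';' into six fields; only the first
    (timestamp) and the last (message) are used here.
    """
    fields = line.split(";", 5)
    if len(fields) != 6:
        return None
    match = STATE_RE.search(fields[5])
    if match is None:
        return None
    node, old, new = match.groups()
    return {
        "time": fields[0],
        "node": node,
        "state": int(old),
        "newstate": int(new),
    }


def set_bits(value):
    """List the positions of the 1-bits in value (ASSUMES a bit field)."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


change = parse_state_change(LINE)
print(change)
print("bits set in state:", set_bits(change["state"]))
```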
>
>
>
> Is it possible that the name ‘cherax-1’ is being handled badly with a
> trailing hyphen-digit, similar to the virtual node designation?
>
>
>
> Lastly, I also tried a non-uniform layout with numa_node_str:
>
> Nodes: cherax-1 numa_node_str=16,8,8,16
>
> (with a compatible mom.layout) and pbs_server crashed:
>
> Jan 4 13:55:27 cherax-1 kernel: [66593.114396] pbs_server[118916]
> trap divide error ip:4106f5 sp:7fffcb74ec90 error:0 in
> pbs_server[400000+58000]
>
> Has anyone used such a setup successfully?
>
>
>
> Regards,
>
>
>
> Gareth
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606