[torqueusers] torque 3.0.3 on uv

Gareth.Williams at csiro.au
Tue Jan 3 21:35:04 MST 2012

Hi all,

I've started configuring torque 3.0.3 on an SGI UV system (following http://www.clusterresources.com/torquedocs/1.7torqueonnuma.shtml) and am running into problems.

I started with a working non-numa 3.0.3 setup as a sanity check.

I then rebuilt it, configured with --enable-numa-support, and made a nodes file:
cherax-1 np=48 num_numa_nodes=6
and a mom.layout:
#cpus=0-15 mem=0-1 /boot
cpus=16-23 mem=2
cpus=24-31 mem=3
#cpus=32-47 mem=4-5 /user
cpus=48-55 mem=6
cpus=55-63 mem=7
cpus=64-71 mem=8
cpus=72-79 mem=9
(Note that some of the blades are set aside for I/O etc., and not all are currently powered on or configured.)

'pbsnodes -a' then reports sensible info about the (virtual) nodes cherax-1-0 through cherax-1-5.

However, then it gets messy.

I couldn't submit jobs anymore (ruserok errors). Putting cherax-1 in a .rhosts file allowed me to submit a job, which seemed to run OK but failed to finish cleanly:
01/04/2012 14:41:46  S    Reply sent for request type JobObituary on socket 17
01/04/2012 14:41:46  M    scan_for_terminated: job 248.cherax-1.hpsc.csiro.au task 1 terminated, sid=148991
01/04/2012 14:41:46  M    job was terminated
01/04/2012 14:41:46  M    obit sent to server
01/04/2012 14:41:46  M    server rejected job obit - 15001
01/04/2012 14:41:47  M    removed job script
01/04/2012 14:41:54  S    preparing to send 'a' mail for job 248.cherax-1.hpsc.csiro.au to wil240 at cherax-1.hpsc.csiro.au (Job does not exist on node)

The server log has messages about nodes changing state (I think the state=512 is unexpected):
01/04/2012 15:22:56;0040;PBS_Server;Req;is_stat_get;received status from node cherax-1
01/04/2012 15:22:56;0040;PBS_Server;Req;update_node_state;adjusting state for node cherax-1-4 - state=512, newstate=0
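
To see which virtual nodes are changing state and how often, lines like the one above can be pulled out of the server log. This is a rough Python sketch that assumes only the semicolon-separated format visible in the excerpt; the function name state_changes is hypothetical, and the sketch makes no attempt to interpret what a value such as 512 means.

```python
import re

# Matches the update_node_state message text seen in the excerpt above.
STATE_RE = re.compile(
    r"adjusting state for node (?P<node>\S+) - "
    r"state=(?P<old>\d+), newstate=(?P<new>\d+)"
)

def state_changes(lines):
    """Return (timestamp, node, old_state, new_state) for each matching line."""
    changes = []
    for line in lines:
        # Fields: timestamp;event;daemon;class;name;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue
        m = STATE_RE.search(fields[5])
        if m:
            changes.append(
                (fields[0], m.group("node"), int(m.group("old")), int(m.group("new")))
            )
    return changes

if __name__ == "__main__":
    sample = [
        "01/04/2012 15:22:56;0040;PBS_Server;Req;is_stat_get;received status from node cherax-1",
        "01/04/2012 15:22:56;0040;PBS_Server;Req;update_node_state;"
        "adjusting state for node cherax-1-4 - state=512, newstate=0",
    ]
    for change in state_changes(sample):
        print(change)
```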

Is it possible that the name 'cherax-1' is being handled badly with a trailing hyphen-digit, similar to the virtual node designation?
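
To illustrate the concern (a hypothetical Python sketch only, not TORQUE's actual parsing code): if virtual node names are formed as <host>-<index>, then any code that recovers the parent host by stripping a trailing hyphen-digit suffix cannot tell a real host named cherax-1 from virtual node 1 of a host named cherax.

```python
import re

def split_virtual(name):
    """Naively split 'host-N' into (host, N).

    Returns (name, None) when there is no trailing '-<digits>' suffix.
    This is an assumed, simplistic rule used only for illustration.
    """
    m = re.fullmatch(r"(.+)-(\d+)", name)
    if m:
        return m.group(1), int(m.group(2))
    return name, None

if __name__ == "__main__":
    for name in ("cherax-1-4", "cherax-1", "cherax"):
        print(name, "->", split_virtual(name))
```

Under this naive rule the virtual node cherax-1-4 maps back to cherax-1 as intended, but the real hostname cherax-1 is itself split into host cherax and index 1.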

Lastly, I also tried a non-uniform layout with numa_node_str:
Nodes: cherax-1 numa_node_str=16,8,8,16
(with a compatible mom.layout) and pbs_server crashed:
Jan  4 13:55:27 cherax-1 kernel: [66593.114396] pbs_server[118916] trap divide error ip:4106f5 sp:7fffcb74ec90 error:0 in pbs_server[400000+58000]
Has anyone used such a setup successfully?

