[torqueusers] IB epilogue/prolog, and any other concerns
Roy Dragseth
roy.dragseth at cc.uit.no
Fri May 30 16:00:06 MDT 2008
Does anyone have a good test algorithm to check whether all nodes in a job have a
healthy InfiniBand connection? I'm looking for something to put into a
prolog script.
We're running a 704-node cluster where 384 nodes have IB, and once in a while a
large job crashes with this message:
[0,1,132][btl_openib_component.c:1328:btl_openib_component_progress] from c12-3 to: c13-3 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 57073528 opcode 42
The problem is that it is only detectable by running a network-intensive
application. There are no reports of problems in the syslogs, and the
diagnostic tools do not show any problems with the HCA.
I'm leaning towards setting up a prolog script that runs a short MPI benchmark
to test the fabric and just offline the nodes having trouble and reschedule
the job.
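
A minimal sketch of such a check, under several assumptions that are not from
the setup described above: the job's host list is already available as a
Python list, BENCHMARK is a hypothetical path to a short two-rank MPI test,
and Open MPI's mpirun is on the PATH. Hosts are tested pairwise around a ring
so that each host appears in two runs; a host that shows up in two failed
pairs is the likely culprit, while a single failed pair leaves both of its
hosts suspect.

```python
import subprocess
from collections import Counter

# Hypothetical path to a short two-rank MPI benchmark (an assumption).
BENCHMARK = "/opt/tests/ib_pingpong"


def ring_pairs(hosts):
    """Pair each host with its successor so every host is tested twice."""
    n = len(hosts)
    if n < 2:
        return []
    if n == 2:
        return [(hosts[0], hosts[1])]
    return [(hosts[i], hosts[(i + 1) % n]) for i in range(n)]


def pair_ok(a, b, timeout=60):
    """Run the benchmark on two hosts; True if it exits cleanly in time."""
    cmd = ["mpirun", "-np", "2", "-host", f"{a},{b}", BENCHMARK]
    try:
        return subprocess.run(cmd, timeout=timeout).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def suspects(failed_pairs):
    """Hosts in two or more failed pairs, else every host in a failed pair."""
    counts = Counter(h for pair in failed_pairs for h in pair)
    repeated = sorted(h for h, c in counts.items() if c >= 2)
    return repeated or sorted(counts)


def check_fabric(hosts, test=pair_ok):
    """Test every ring pair and return the hosts that look unhealthy."""
    failed = [p for p in ring_pairs(hosts) if not test(*p)]
    return suspects(failed)


def offline(nodes):
    """Mark nodes offline with pbsnodes -o (needs operator privileges)."""
    if nodes:
        subprocess.run(["pbsnodes", "-o", *nodes], check=True)
```

A prolog built on this would call check_fabric(hosts), pass the result to
offline(), and exit with a non-zero status chosen to make TORQUE requeue the
job rather than abort it (the exact codes are listed in the TORQUE prologue
documentation). Two caveats: the pairs run one after another here, which may
be too slow for a large job, and a short two-rank test may not load the fabric
enough to trigger a failure that only appears under heavy traffic.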
Is the tm interface available to prolog scripts? If not, getting the node
list to the testing application will be a practical problem.
Does anyone have a better idea?
r.