[torqueusers] IB epilogue/prolog, and any other concerns
roy.dragseth at cc.uit.no
Fri May 30 16:00:06 MDT 2008
Does anyone have a good test algorithm to check whether all nodes in a job have a
healthy infiniband connection? I'm looking for something to put into a
We're running a 704-node cluster where 384 nodes have IB, and once in a while a
large job crashes with this message:
c12-3 to: c13-3 error polling
HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 57073528
The problem is that it is only detectable by running a network-intensive
application. There are no reports of problems in the syslogs, and the
diagnostic tools do not show any problems with the HCA.
I'm leaning towards setting up a prolog script that runs a short MPI benchmark
to test the fabric and just offline the nodes having trouble and reschedule
Is the TM interface available to prolog scripts? If not, getting the
nodelist to the testing application will be a practical problem.
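
To make the idea concrete, here is a minimal sketch of the prolog-side logic. It assumes the prolog can read the job's node list as plain text, one hostname per line. One candidate source is the per-job file under $PBS_HOME/aux/ named after the job id the prolog receives as its first argument, but whether that file is readable at prolog time is an assumption to verify on your installation. The function names, the ring pairing scheme and the command template are all hypothetical; the actual two-node MPI test is whatever benchmark and launcher your site uses.

```python
import subprocess
from collections import Counter


def unique_hosts(nodefile_text):
    """Return hostnames in first-seen order, dropping per-core duplicates."""
    hosts = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def ring_pairs(hosts):
    """Pair each host with its successor so every host is tested with two peers."""
    if len(hosts) < 2:
        return []
    if len(hosts) == 2:
        return [(hosts[0], hosts[1])]
    return [(hosts[i], hosts[(i + 1) % len(hosts)]) for i in range(len(hosts))]


def check_fabric(hosts, pair_ok):
    """Run pair_ok(a, b) on every ring pair.

    Returns (suspects, failed_pairs). A host is a suspect only if every
    pair test it took part in failed; a single bad link between two
    otherwise healthy hosts shows up in failed_pairs instead.
    """
    total, failures, failed_pairs = Counter(), Counter(), []
    for a, b in ring_pairs(hosts):
        ok = pair_ok(a, b)
        if not ok:
            failed_pairs.append((a, b))
        for host in (a, b):
            total[host] += 1
            if not ok:
                failures[host] += 1
    suspects = [h for h in hosts if total[h] and failures[h] == total[h]]
    return suspects, failed_pairs


def command_runner(template, timeout=60):
    """Build a pair_ok callable from a command template (hypothetical helper).

    template is a list of strings; "{a}" and "{b}" are replaced by the two
    hostnames. The command itself (MPI launcher plus a short benchmark) is
    site-specific and is not defined here.
    """
    def pair_ok(a, b):
        cmd = [part.format(a=a, b=b) for part in template]
        try:
            result = subprocess.run(cmd, timeout=timeout,
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
        except (subprocess.TimeoutExpired, OSError):
            return False
        return result.returncode == 0
    return pair_ok
```

The suspects returned by check_fabric could then be taken out of service with pbsnodes -o, and the prolog could exit non-zero so the job does not start on them. The ring pairing keeps the number of tests linear in the job size, at the cost of not exercising every link; a stricter scheme would test more pairs.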
Does anyone have a better idea?