[torqueusers] IB epilogue/prolog, and any other concerns

Roy Dragseth roy.dragseth at cc.uit.no
Fri May 30 16:00:06 MDT 2008


Does anyone have a good test algorithm to check whether all nodes in a job 
have a healthy InfiniBand connection?  I'm looking for something to put into 
a prolog script.

We're running a 704-node cluster where 384 nodes have IB, and once in a while 
a large job crashes with this message:

[0,1,132][btl_openib_component.c:1328:btl_openib_component_progress] from 
c12-3 to: c13-3 error polling 
HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id   57073528 
opcode 42

The problem is that it is only detectable by running a network-intensive 
application.  There are no reports of problems in the syslogs, and the 
diagnostic tools do not show any problems with the HCA.

I'm leaning towards setting up a prolog script that runs a short MPI benchmark 
to test the fabric, offlines the nodes having trouble, and reschedules the 
job.
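
To make the idea concrete, here is a rough Python sketch of what such a 
check could look like.  Everything in it is an assumption for illustration: 
the function names are made up, the benchmark path ./pingpong is a 
placeholder for any two-rank MPI test, the node list is assumed to be 
available as text with one hostname per line, and the "flag a host that 
fails with both ring neighbours" rule is just one possible policy.  It uses 
Open MPI's mpirun with -np and -host, and pbsnodes -o to mark nodes offline.

```python
import subprocess


def unique_hosts(nodefile_text):
    """Return hostnames in first-seen order (a nodefile repeats hosts per core)."""
    seen = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in seen:
            seen.append(host)
    return seen


def ring_pairs(hosts):
    """Pair each host with its successor; with 3+ hosts each is tested twice."""
    n = len(hosts)
    count = n if n > 2 else n - 1  # avoid testing the same pair twice
    return [(hosts[i], hosts[(i + 1) % n]) for i in range(count)]


def run_pair(a, b, benchmark="./pingpong", timeout=60):
    """Run a two-rank MPI test between a and b; True if it exits cleanly.

    The benchmark path is a placeholder, not a real tool name.
    """
    cmd = ["mpirun", "-np", "2", "-host", f"{a},{b}", benchmark]
    try:
        return subprocess.run(cmd, timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        return False


def suspects(failed_pairs):
    """Hosts in two or more failed pairs; if none, every host in a failed pair."""
    counts = {}
    for a, b in failed_pairs:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    repeat = sorted(h for h, n in counts.items() if n >= 2)
    return repeat or sorted(counts)


def check_job(nodefile_text, run=run_pair):
    """Test the ring of hosts and return the ones that look unhealthy."""
    hosts = unique_hosts(nodefile_text)
    failed = [pair for pair in ring_pairs(hosts) if not run(*pair)]
    return suspects(failed)


def offline(hosts):
    """Mark hosts offline in TORQUE so the scheduler stops using them."""
    for host in hosts:
        subprocess.run(["pbsnodes", "-o", host], check=False)
```

The prolog would call check_job() with the job's node list, pass any result 
to offline(), and then exit with whatever status makes pbs_mom requeue the 
job.  Where the node list comes from and which exit status triggers a 
requeue are left open here; both should be checked against the TORQUE 
documentation for the installed version.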

Is the tm interface available to prolog scripts?  If not, getting the 
nodelist to the testing application will be a practical problem.

Does anyone have a better idea?

r.

