[torqueusers] List of discussion or documentation on infiniband

Gus Correa gus at ldeo.columbia.edu
Wed Jul 8 14:19:30 MDT 2009


Hi Chris, list

Like you, I never found any substantial, clear,
and easy-to-follow InfiniBand (IB) documentation,
or a tutorial on how to set up, configure, monitor,
and maintain IB networks.
AFAIK, weak documentation is a general problem with IB.

There is the OpenFabrics site, which was already pointed out to you:
http://www.openfabrics.org/

There is also the OpenFabrics general list,
which IMHO is not really so general, but rather technical,
aimed mostly at developers, and not very "user-friendly":

http://lists.openfabrics.org/pipermail/general/
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

On a more user-oriented note, I was pointed to this
"Getting Started with Infiniband":
http://people.redhat.com/dledford/infiniband_get_started.html

I dug up this "Introduction to Infiniband Architecture":
http://www.oreillynet.com/pub/a/network/2002/02/04/windows.html

and of course the Wikipedia article:
http://en.wikipedia.org/wiki/InfiniBand

More generic info about IB is in Jeff Layton's Jan/2008 article
on ClusterMonkey:

http://www.clustermonkey.net//content/view/222/2/

Cisco had a reasonable IB technology guide for beginners,
but seems to have removed it from its web site
(at least I can't find it there anymore).
I found a copy here:

direkt.jacob-computer.de/content/datenblatt/168758_1.pdf

If you downloaded OFED or have it on your cluster,
there are text documents in the directory:

/wherever/you/untarred/it/OFED-1.4/docs

Start with the README file therein.

In /etc/infiniband/ there is some info about how
OFED was configured on your system (if you use OFED).
The startup scripts are /etc/init.d/openibd (all nodes)
and /etc/init.d/opensmd (subnet manager, most likely on your head node).

Some useful IB commands (try their man pages first):

sminfo
ibv_devinfo
ibnodes
ibhosts
ibstat
ibstatus
ibdiagnet
ibchecknet
qperf
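
Before reading the man pages, it may help to see which of these
commands are actually installed on a given node, since that depends
on how OFED (or your distribution's IB packages) was installed.
Here is a minimal sketch in plain sh. The function name
ib_tool_status is made up for illustration; it only looks the
commands up on the PATH and does not talk to the fabric at all.

```sh
#!/bin/sh
# Report which of the IB commands listed above are on the PATH.
# ib_tool_status is a hypothetical helper name, not part of OFED.

ib_tool_status() {
    # For each argument, print where the command was found, or "not found".
    for tool in "$@"; do
        if location=$(command -v "$tool" 2>/dev/null); then
            printf '%s: found at %s\n' "$tool" "$location"
        else
            printf '%s: not found\n' "$tool"
        fi
    done
}

# The command names are taken verbatim from the list above.
ib_tool_status sminfo ibv_devinfo ibnodes ibhosts ibstat ibstatus \
    ibdiagnet ibchecknet qperf
```

For anything reported as found, "man <command>" is the next step.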

I realize this is a bit off topic, but not so much,
since it is about clusters, and Torque is a cluster tool.
Anyway, you may find more information if you dig into the
cluster-specific lists, particularly the Beowulf and
Rocks Clusters list archives:

http://www.beowulf.org/archive/index.html
http://marc.info/?l=npaci-rocks-discussion


I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


ChrisJob.fr wrote:
>    Hi
> 
>    We have an InfiniBand HPC cluster. Sometimes we have problems with MPI 
> programs and we must restart InfiniBand. After that, everything is OK for 
> 2 weeks.
>    Do you know where I can find a discussion list about InfiniBand? Or 
> documentation on the subject?
> 
> Thank you for your help
> Chris


