[torqueusers] Same job on several nodes

Vincent LIARD vincent.liard at scilab.org
Thu Feb 4 03:55:59 MST 2010


Hi Simon,

Thanks for your answer!

> > I am building a heterogeneous cluster so as to compare performance of
> > the same program on various hardware architectures. For this purpose, I
> > was advised to use torque.
> > 
> > I therefore want to run the very same job on every node of my
> > cluster. So far, I've considered the '-t 1-n' and '-l nodes=n' qsub
> > options, but neither appears to fit my need.
> > 
> > On the one hand, '-l nodes=n' reserves n nodes but won't spread the
> > sequential job across them; on the other hand, '-t 1-n' will spawn n
> > jobs but won't necessarily place them on n different nodes. What I want
> > is a mix of both options: n jobs running on n different nodes.
> > 
> > Do you know of a way to do this? Of course, I could iterate over the
> > node hostnames and attach one job to each node... But I would rather
> > not resort to that if there is a more straightforward way.
> 
> I suppose the best way to do this is to add corresponding properties to
> nodes (describing the architecture) and simply generate the necessary
> amount of jobs with -l nodes=1:property.

I'm not sure I understand correctly. By "generate the necessary amount of
jobs", do you mean making that many individual calls to qsub, and using
"property" to attach each job to the desired node?
Am I correct?

As a first try, I ran:
for n in `cat nodes`; do (echo hostname | qsub -l nodes=$n); done;
Is this more or less what you are talking about?
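
The loop above could also be written as a small function. This is only a
sketch under assumptions: the file "nodes" holds one hostname per line,
job.sh is a hypothetical job script, and the QSUB variable exists only so
that the function can be dry-run with a stand-in command.

```shell
#!/bin/sh
# Sketch: submit the same job script once per node listed in a file.
# Assumptions (not from the original message): JOB_SCRIPT names a
# hypothetical job script, and QSUB may be overridden for a dry run.
JOB_SCRIPT=${JOB_SCRIPT:-job.sh}
QSUB=${QSUB:-qsub}

submit_all() {
    # $1: file with one node hostname per line
    while IFS= read -r node; do
        # Skip empty lines in the node list.
        [ -n "$node" ] || continue
        # Request exactly this host; stdin is redirected so the submit
        # command cannot consume the rest of the node list.
        "$QSUB" -l "nodes=$node" "$JOB_SCRIPT" < /dev/null
    done < "$1"
}

# Usage: submit_all nodes
```

With node properties defined, the same loop could pass
"nodes=1:property" in place of "nodes=$node".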

> > Moreover, once I get past this first step, I will be interested in
> > gathering performance measures (CPU load, RAM used, job duration...)
> > from all job executions to compare the results. Is there an easy way to
> > do so? Can Moab help in this regard?
> 
> Torque records cpu time, walltime (run time), memory and virtual memory
> usage. Check accounting information.

This is very good news. Unfortunately, I am stuck with tracejob, which I
still cannot get to run because of a persistent memory error:
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=49
I don't know where to start to make it work. I expect I will get past
this sooner or later, but for now I feel pretty lost.
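
The accounting information mentioned above is normally kept as plain-text
records on the server, so it may be readable without tracejob. A minimal
sketch, assuming the usual record layout "date time;E;jobid;key=value ..."
for job-end records, and assuming the files live somewhere like
server_priv/accounting/ under the Torque home directory (the exact path
depends on the installation):

```shell
#!/bin/sh
# Sketch: summarise job-end ("E") records from Torque accounting files.
# Assumption: each record is "date time;TYPE;jobid;key=value key=value ..."
acct_summary() {
    # Arguments: one or more accounting files.
    awk -F';' '$2 == "E" {
        line = $3                      # job id
        n = split($4, kv, " ")         # space-separated key=value pairs
        for (i = 1; i <= n; i++)
            # Keep the execution host and the resources_used.* counters.
            if (kv[i] ~ /^(exec_host|resources_used\.)/)
                line = line " " kv[i]
        print line
    }' "$@"
}

# Usage: acct_summary /path/to/server_priv/accounting/20100204
```

Each output line then gives the job id, the host it ran on, and the
recorded cput, mem, vmem and walltime values, which can be compared
across nodes.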

Vincent



