[torqueusers] Questions from a new Torque/Maui admin regarding user proxies, sparse mom instances, and node/resource scheduling

Douglas Wade Needham dneedham at cmu.edu
Wed Dec 2 12:49:40 MST 2009


Greetings all,

I apologize in advance for the length of this message and the
noob-ishness of the questions, but I am stuck between a learning curve
and time constraints.

In recent weeks, I was given the task of setting up Torque and Maui on a
cluster here at the university, which I took over administering a few
months ago.  The profs in charge of this cluster have made a couple of
requests that I would like input on from more experienced Torque/Maui
users.  In addition, I want to anticipate some of the requests I know
will come soon.  Here are some background details:

      * 64 compute nodes, each with two quad-core processors.
        
        Two nodes are currently dedicated to special functions,
        including the cloudhead (where Torque+Maui currently reside)
        
      * Additional storage servers providing several PVFS2 partitions.
      * Additional virtual servers on one machine, providing login (from
        where jobs will be submitted), monitoring (Ganglia), and install
        functionality.
      * Nodes are running Debian with Torque 2.4.2, Maui 3.2.6 (both
        installed from source), and OpenMPI 1.2.7rc2 (installed from the
        OS).

At the moment, two of the compute nodes have pbs_mom running on them,
and I am able to submit jobs from the login node to the cloudhead using
our cloud user id (which exists on every node in the cluster, but has a
local home directory on each node).  After some mistakes on my part, I
can get a job to run on these two nodes, using processors as I expect,
and get my results back into a global scratch directory.  This is with a
mostly out-of-the-tarball configuration; the changes are mainly to allow
submission from our login node and to enable auto_node_np.
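
For completeness, the non-default bits amount to roughly the following
qmgr settings (reproduced here from memory, so treat the exact attribute
names and values as approximate):

        qmgr -c 'set server submit_hosts = login'
        qmgr -c 'set server auto_node_np = True'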

Now for my first problem/question.  The people making the decisions have
asked that we avoid propagating all the user accounts and home
directories across the cluster, so I am looking at using qsub's '-u'
option.  When I submit as the cloud user from my personal account using
`-u clouduser`, the job submits fine but ends up queued, and showq lists
it as a blocked job with a state of BatchHold.  Checking the Maui log, I
find the following:

        12/02 14:08:25 MPBSJobLoad(36,36.cloudhead.opencloud,J,TaskList,0)
        12/02 14:08:25 MReqCreate(36,SrcRQ,DstRQ,DoCreate)
        12/02 14:08:25 INFO:     processing node request line '2:ppn=8'
        12/02 14:08:25 MJobSetHold(36,8,00:00:00,(null),job not authorized to use proxy credentials)
        12/02 14:08:25 ALERT:    job '36' cannot run (deferring job for 0 seconds)
        12/02 14:08:25 MJobSetCreds(36,dneedham,clouduser,)
        12/02 14:08:25 INFO:     default QOS for job 36 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
        12/02 14:08:25 INFO:     default QOS for job 36 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
        12/02 14:08:25 INFO:     default QOS for job 36 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
        12/02 14:08:25 INFO:     job '36' loaded:  16 dneedham clouduser   3600       Idle   0 1259780904   [NONE] [NONE] [NONE] >=      0 >=      0 [NONE] 1259780905
        12/02 14:08:25 INFO:     2 PBS jobs detected on RM cloudhead
        12/02 14:08:25 INFO:     jobs detected: 2
        12/02 14:08:25 MStatClearUsage(node,Active)
        12/02 14:08:25 MClusterUpdateNodeState()
        12/02 14:08:25 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
        12/02 14:08:25 INFO:     job '36' Priority:        1
        12/02 14:08:25 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
        12/02 14:08:25 MStatClearUsage([NONE],Active)
        12/02 14:08:25 INFO:     total jobs selected (ALL): 0/1 [Hold: 1]
        12/02 14:08:25 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
        12/02 14:08:25 INFO:     job '36' Priority:        1
        12/02 14:08:25 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
        12/02 14:08:25 MStatClearUsage([NONE],Idle)
        12/02 14:08:25 INFO:     total jobs selected (ALL): 0/1 [Hold: 1]


I have set allow_proxy_user to true in Torque, and the .rhosts file for
clouduser on the cloudhead contains the following:

        login dneedham
        login.opencloud dneedham
        login.opencloud.pdl.cmu.local dneedham

where a 'who am i' does show my account as connecting from
'login.opencloud' (though we unfortunately have rexec/rcmd disabled, so
I cannot test it to be 100% sure).  But the moment I issue the runjob...

So at this point, I have to ask... Can Torque+Maui support this, and if
so, what am I doing wrong/missing?
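
For reference, this is roughly what I am doing on each side (the job
script name below is just a placeholder):

        # on the cloudhead, as root
        qmgr -c 'set server allow_proxy_user = True'

        # on the login node, from my personal account
        qsub -u clouduser -l nodes=2:ppn=8 run_job.sh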



My second question is this.  The profs want to end up with a
non-homogeneous setup where only a couple of nodes run the client mom,
and MPI is used to run things on the other nodes (roughly the reverse of
the case in Appendix H).  Everything I have read says that mom has to
run on every node for Torque/Maui to manage that node.  This sounds
undesirable to me, since Torque/Maui would have no control over who runs
what on the mom-less nodes, and jobs could end up colliding.  Is this
assessment correct, and is this an undesirable configuration?


Next, we sometimes have users who want to allocate an entire set of
nodes on the same switch (or a section thereof) to avoid crossing switch
interconnects.  In anticipation of such requests, I defined each node
with a property like "SWITCH1" or "SWITCH3".  But is there a way for
these users to submit their job so that they get nodes on the same
switch without restricting themselves to a specific switch (e.g. they
will take N nodes on either switch 1 or switch 3, but not split across
both)?
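
Right now each entry in the nodes file carries its switch as a property
(node names here are made up):

        node01 SWITCH1
        node33 SWITCH3

so a user can already pin a job to one particular switch with something
like:

        qsub -l nodes=8:SWITCH1:ppn=8 run_job.sh

but that means naming a specific switch up front, which is exactly what
I would like to avoid.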

Lastly, I know some jobs will want to use MPI across a dynamic set of
nodes (perhaps handed out by Torque/Maui), but internally run multiple
processes on each node (via system()/exec()) and need all processors on
those nodes dedicated to the job.  From my quick passes through the
docs, my feeling is that this is best approached by the user submitting
a reservation for all resources on a set of nodes and then running
their job within that reservation, launching only one process per node.
Am I correct, or are there better ways to do this?
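
In other words, I had something like the following job script in mind (a
sketch only; the resource request and 'my_app' are placeholders for
whatever a user would actually run):

        #!/bin/sh
        #PBS -l nodes=4:ppn=8
        cd $PBS_O_WORKDIR
        # All 8 cores on each node are held by this job, but only one MPI
        # rank is started per node; each rank then spawns its own local
        # worker processes via system()/exec().
        sort -u $PBS_NODEFILE > nodes.uniq
        mpirun -np $(wc -l < nodes.uniq) -machinefile nodes.uniq ./my_app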


Thanks!

- Doug





-- 
Douglas Wade Needham            Senior Research Programmer
Carnegie Mellon University      Parallel Data Laboratory/Data Center Observatory
Email:  dneedham @ cmu . edu    http://www.pdl.cmu.edu
Disclaimer: My opinions are my own.  Since I don't want them, why
            should my employer, or anybody else for that matter! 


