[torquedev] PPN/Node state bug?
SimonT at mail.muni.cz
Thu Sep 17 07:08:18 MDT 2009
>>> I'm still not sure what this is meant to do!
>> Well, by default, all requests for nodes are exclusive (confirmed by
>> reading the server code). This should mean (according to Torque
>> documentation), that when you allocate a node to a job, you can't run
>> anything else on that node until the job finishes.
> Umm, but the server doesn't decide that, that's a scheduler
> decision IMHO. Certainly with Maui/Moab that is controlled
> by the policies you specify in them. It's the scheduler
> that decides which vnodes get assigned to which jobs, not
> the pbs_server.
Of course the server doesn't decide this. But it has to provide correct
information about the state of the cluster (it doesn't), and it has to
check the validity of requests (it does, but not correctly).
The server is the one that allocates jobs to nodes. Of course it doesn't
do this on its own initiative, but it is the one that does the actual
work (updating cluster status, sending the job to the MOM, etc.).
If I ask the server to run a job that requires 10 nodes and the
cluster only has 5, the request has to be rejected. The same goes for a
request to run another job on a node that is already allocated
exclusively.
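
The two rejections described above (more nodes requested than exist, and
a node already held exclusively) can be sketched roughly as follows.
This is only an illustration under stated assumptions: the `Node`
record, the `allocate` function and the `RequestRejected` exception are
hypothetical names, not anything taken from the pbs_server source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


class RequestRejected(Exception):
    """Raised when a run request cannot be satisfied."""


@dataclass
class Node:
    # Hypothetical node record, not the real pbs_server structure.
    name: str
    exclusive_job: Optional[str] = None
    shared_jobs: Set[str] = field(default_factory=set)


def allocate(nodes: Dict[str, Node], job_id: str, count: int,
             exclusive: bool = True) -> List[str]:
    """Assign `count` nodes to `job_id`, or raise RequestRejected."""
    if count > len(nodes):
        raise RequestRejected(
            f"{job_id} needs {count} nodes, cluster has {len(nodes)}")

    # A node held exclusively is never usable; an exclusive request
    # additionally needs a node with no shared jobs on it.
    usable = [n for n in nodes.values()
              if n.exclusive_job is None
              and not (exclusive and n.shared_jobs)]
    if len(usable) < count:
        raise RequestRejected(
            f"{job_id} needs {count} nodes, only {len(usable)} usable")

    chosen = usable[:count]
    for n in chosen:
        if exclusive:
            n.exclusive_job = job_id
        else:
            n.shared_jobs.add(job_id)
    return [n.name for n in chosen]
```

With a five-node table, a request for 10 nodes is rejected outright, and
once one node is held exclusively a request for all five is rejected as
well. Exclusivity is tracked per node here, not per CPU.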
I have never used Maui/Moab (and don't plan to), so I don't know how
they use the server, but I certainly need the server to verify every
request because the state of the server might have changed. For example,
another scheduler might have run an exclusive job on a node I still see
as free.
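
The stale-view case can be sketched like this: the server re-checks
every named node against its current state when the run request
arrives, instead of trusting what the scheduler saw earlier. The names
`run_job`, `StaleRequest` and the plain dict used as server state are
assumptions made for the example, not Torque APIs.

```python
from typing import Dict, List, Optional


class StaleRequest(Exception):
    """Raised when a request was built from an outdated cluster view."""


def run_job(server_state: Dict[str, Optional[str]], job_id: str,
            node_names: List[str]) -> None:
    """Start `job_id` exclusively on `node_names` if all are still free.

    `server_state` maps node name -> id of the job holding it, or None.
    """
    unknown = [n for n in node_names if n not in server_state]
    if unknown:
        raise StaleRequest(f"unknown nodes: {unknown}")

    busy = [n for n in node_names if server_state[n] is not None]
    if busy:
        raise StaleRequest(f"nodes no longer free: {busy}")

    # Only update state after every node passed the check, so a
    # rejected request leaves the cluster state untouched.
    for n in node_names:
        server_state[n] = job_id
```

If scheduler A runs a job on `n0` and scheduler B then submits a request
built from an older snapshot in which `n0` was free, B's request fails
the check and no node is touched.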
>> I also have PBS pro documentation here and it actually supports a
>> separate flag for this: "#excl" (not sure what is default).
> Never used PBSPro so no idea.
>> If you don't need exclusive allocation, you have to specify "#shared"
>> (for example if you just need to calculate something, then you don't
>> really care that there are other jobs on other CPUs of the machine).
> Never heard of that before, sorry!
> Torque has cpuset support to avoid some of those issues anyway.
How would cpuset help me in this case?
Anyway, there is no issue: the server has support for this, it just
doesn't work correctly. The server doesn't report the state, and shared
and exclusive are per-CPU, not per-node (as it should be according to
Mgr. Simon Toth
160 00 Praha 6