[torquedev] job arrays?
Andrew J Caird
acaird at umich.edu
Tue Apr 18 12:11:57 MDT 2006
Thanks for the detailed answer - I agree that your 'te' tool is very
similar to what I am thinking about.
The basic problem I have is that everyone is pretty well trained to use
"qstat" and "qsub" so integrating the functionality there would make more
sense to them (and make my job easier). Wrapping the q-commands would
also work, but then there's no functional difference between that and
simply adding the features to Torque. Also, people are comparing Torque
to PBSPro (which is also in use here) and it would be nice to offer
similar functionality since the command set is so similar.
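
To make the wrapping option concrete, here is a minimal Python sketch of a wrapper that submits one job per parameter and then treats the whole set as a single unit for deletion. It is only an illustration: `JobGroup`, `run_command`, and the `PARAM` environment variable are hypothetical names, and it assumes that `qsub` prints the new job's identifier on standard output and that the job script reads its parameter from the environment (passed with `qsub -v`).

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes an argument vector and returns the command's stdout.
Runner = Callable[[Sequence[str]], str]


def run_command(argv: Sequence[str]) -> str:
    """Run a command and return its standard output without the trailing newline."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout.strip()


class JobGroup:
    """Hypothetical wrapper: many qsub submissions handled as one unit."""

    def __init__(self, script: str, runner: Runner = run_command) -> None:
        self.script = script
        self.runner = runner
        self.job_ids: List[str] = []

    def submit(self, params: Sequence[object]) -> List[str]:
        """Submit one job per parameter; the scheduler sees ordinary jobs."""
        for param in params:
            # Assumption: the job script reads its parameter from $PARAM.
            job_id = self.runner(["qsub", "-v", f"PARAM={param}", self.script])
            self.job_ids.append(job_id)
        return list(self.job_ids)

    def delete(self) -> None:
        """Delete every job in the group, as a single qdel on an array would."""
        for job_id in self.job_ids:
            self.runner(["qdel", job_id])
        self.job_ids = []
```

Passing a different `runner` makes it possible to try the bookkeeping without a live Torque server. The sketch keeps the job identifiers only in memory; a real wrapper would need to store them somewhere so that a later `qstat`-style or `qdel`-style call could find the group again.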
Thanks again for the insight, it certainly gives me more to think about.
On Fri, 7 Apr 2006, Lennart Karlsson wrote:
> You wrote:
>> Suppose I have a job that I want to run with 500 parameters, and I have
>> 100 computers and 20 other users with limits of 20 nodes per person. I
>> submit my job array of 500 jobs, and they start when and where they can
>> within the constraints of the scheduler - to the scheduler it looks like
>> 500 jobs. qsub, qstat, qdel, etc., though, treat it as one job by
>> default, so qdel'ing it kills all of them. There would be an option to
>> qstat to get details out of a job array.
>> My weak understanding of mpiexec is that it doesn't do this.
>> Does that make sense? I am struggling with it myself, so any dialog would
>> be appreciated.
> Within HPC4U, an EC-funded research project (http://www.hpc4u.org) aiming
> for an SLA-defined, high level of fault tolerance for cluster and grid job
> runs, I have written a small tool (te - test engine) that does something
> similar.
> The tool is set up with a number of test cases, each test case containing
> a job script, a binary to run, some input data and a verification script.
> The verification script looks at the output data to determine whether the
> test has run to completion, and may also check whether the output looks
> "good" in some predefined way, like having a certain file size and having
> answers within a certain value range.
> To run a test means to run a test case, check for completion, save the
> results and verify the results.
> The similarity to your problem comes at the aggregate level. In the test tool
> you can make a combined test case, containing several simple test cases,
> and handle the aggregate test case in the same way as a simple one.
> An aggregated test case is saved within the test tool as a file,
> containing the names (actually directory names within a certain
> directory) of all the test cases contained within it, which makes
> it straightforward to implement the same operators for an aggregated test
> case as you have for a simple test case.
> Of course everything is done on a meta-level. You write commands like
>     te start grandtestcase     # Start a test
>     te status grandtestcase    # See if it is finished
>     te save grandtestcase      # Save results, so it can be verified
>     te verify grandtestcase    # Verify results
>     te stop grandtestcase      # Terminate the test prematurely
>     te delete grandtestcase    # Remove all traces of the test
>     te show grandtestcase      # Show definition of test case
> and the tool runs 'qstat', 'showq', and other such commands to implement this,
> if Torque and Maui are used. (Within HPC4U, CCS is mainly used as the
> scheduler and queueing system.)
> This presentation is just to give you an idea about a different way to
> solve your problem. I am not selling/promoting the tool itself, in fact
> it is not packaged in a way suited for distribution outside the project.
> In a way it would be better to have job aggregation integrated in the
> queueing system, as you wish, but perhaps this makes the queueing
> system unnecessarily complex; Unix is all about making lean, single-minded
> programs interoperate, isn't it? :-) And perhaps you would always like
> to be able to add something more, like a verification step for your
> jobs, ensuring, for example, that you do not have empty output files
> after an aggregated run. So I propose that you look into writing a tool
> along the same lines as my test engine.
> -- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
> National Supercomputer Centre in Linkoping, Sweden
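
The aggregate mechanism described in the quoted message (a file listing the directory names of the test cases it contains) can be sketched in a few lines of Python. This is not the te tool itself, which is not distributed; `expand` and `apply_operation` are hypothetical names, and the layout is an assumption: a simple test case is a directory under a common root, and an aggregate is a plain file under the same root with one test case name per line.

```python
from pathlib import Path
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def expand(root: Path, name: str) -> List[str]:
    """Return the simple test cases that `name` stands for.

    Assumed layout: root/name is a directory for a simple test case,
    or a text file of test case names (one per line) for an aggregate.
    """
    entry = root / name
    if entry.is_dir():
        return [name]
    # Aggregate: each non-blank line names a simple test case directory.
    return [line.strip() for line in entry.read_text().splitlines() if line.strip()]


def apply_operation(root: Path, name: str, operation: Callable[[Path], T]) -> Dict[str, T]:
    """Apply one operation (start, status, verify, ...) to every member.

    A simple test case and an aggregate are handled through the same call.
    """
    return {case: operation(root / case) for case in expand(root, name)}
```

Because an aggregate expands to a list of ordinary test cases, any operation written for a single test case works unchanged on the aggregate. For example, `apply_operation(root, "grandtestcase", check)` would return one result per contained test case, where `check` is whatever per-case function the caller supplies.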