[Mauiusers] Cluster resources.

Eygene Ryabinkin rea+maui at grid.kiae.ru
Fri Oct 3 08:56:17 MDT 2008


Emmanuel, good day.

Thu, Oct 02, 2008 at 01:55:31PM -0600, Emmanuel V. Dominguez wrote:
> I am fairly new to Maui and I apologize if my lingo is not correct. Here
> is my situation: I have a user who, when she requests time on our
> cluster, finds that on any node she gets, her application stops when
> RAM usage reaches 2.9 gigs. I have looked through the documentation but
> have either overlooked it or could not find whether there is a place in
> Maui where resource limits can be set. Here is what she sent me. All
> help is appreciated, thanks!

My understanding is that OS resource limits are set by the framework
that governs task startup (e.g., Torque), not by Maui -- it is only a
scheduler.  Yes, Maui can track resource usage and can even try to kill
jobs, but it won't do so unless the administrator has configured it
that way.  Since I assume you are the administrator of the system, you
would be aware of such a configuration.
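
As a quick check, something like the following could be run from inside
a job (submitted with qsub or started interactively) to see which limits
the process actually inherits from whatever started it.  This is only a
sketch in Python: the names LIMITS, fmt and show_limits are my own, and
the choice of which limits to print is an assumption about what might
matter on your nodes.

```python
import resource

# Limits most likely to matter for a process that dies as its memory grows.
# Which of them is actually relevant on a given node is an assumption.
LIMITS = {
    "address space (RLIMIT_AS)": resource.RLIMIT_AS,
    "data segment (RLIMIT_DATA)": resource.RLIMIT_DATA,
    "stack (RLIMIT_STACK)": resource.RLIMIT_STACK,
}


def fmt(value):
    """Render a limit in bytes, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes ({value / 2**30:.2f} GiB)"


def show_limits():
    """Return (name, soft, hard) tuples for the current process."""
    rows = []
    for name, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        rows.append((name, fmt(soft), fmt(hard)))
    return rows


if __name__ == "__main__":
    for name, soft, hard in show_limits():
        print(f"{name}: soft={soft} hard={hard}")
```

If every limit comes back as "unlimited" inside the job, the batch
system is probably not what is capping the process, and the cause has
to be looked for elsewhere.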

What batch system are you using?  If it is PBS/Torque (and since you
mention qsub, I assume that is the case; of course, you could be using
the PBS compatibility layer from SLURM or something else, so I may be
wrong), then you can try the 'tracejob' command to extract all the
information associated with a particular job.  The MOM logs on the
execution node(s) can be helpful too.

>  Okay, regardless of whether I qsub this or get a node interactively, I
> still have the same problem. One such example is:
> cd /cluster/..
> qsub run_vehgen
> which has in it #PBS -l nodes=1:ppn=2:p3,walltime=999:00:00
> in the log file I get the message
> terminate called recursively
> When I run one other script (run_blkgrploc on, say, test) where the
> network is just a little too big, I get a different message but the same issue:
> # ERROR in [POPL]:  Unknown exception
> @ ERROR in [POPL]: 24005 Failure exit, Unknown exception has occurred
> 
> Anyways, what I do is ssh onto the machine and watch the memory usage with
> top. It continues to grow until it reaches about 2.9GB and then the
> program dies and gives the respective error message.

Try running strace on the process in question -- what are the last few
dozen calls?  This can shed some light on why the program terminates.
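
If the trace is written to a file (for example with
"strace -f -o /tmp/job.trace -p PID", where the file name is just a
placeholder of mine), a small helper along these lines can pull out the
tail of the trace and the system calls that failed.  Treat it as a
sketch: last_calls and failed_calls are hypothetical names, and the
"= -1 " match relies on the way strace normally prints a failed call.

```python
from collections import deque


def last_calls(trace_path, count=24):
    """Return the last `count` lines of an strace output file."""
    with open(trace_path, "r", errors="replace") as fh:
        # deque with maxlen keeps only the final `count` lines in memory.
        return [line.rstrip("\n") for line in deque(fh, maxlen=count)]


def failed_calls(lines):
    """Keep only lines whose return value is -1 (failed system calls)."""
    return [line for line in lines if "= -1 " in line]
```

A failed mmap or brk with ENOMEM right before the exit would point to
the process running out of address space or hitting a limit, rather
than to something the scheduler did.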
-- 
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"