[torqueusers] getsize() failed for file in mom_set_limits

Arnau Bria arnaubria at pic.es
Tue Sep 23 08:35:40 MDT 2008


Hi all,

I tried to configure the size parameter in the pbs_mom config file on
the WNs, so that I could later specify, at queue level, the default
resources needed by a job.

WNs:
#cat /var/spool/pbs/mom_priv/config
[...]
size[fs=/home]
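
For reference, pbs_mom should re-read mom_priv/config on a SIGHUP, so
after editing the file on the WN I do something like

# pkill -HUP pbs_mom

and the new size value should then show up in the node status: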

# pbsnodes td033
td033
     state = free
     np = 8
     properties = slc4,magic
     ntype = cluster
     status = opsys=linux,uname=Linux td033.pic.es 2.6.9-42.0.3.ELsmp
#1 SMP Thu Oct 5 15:04:03 CDT 2006 i686,sessions=4836 9685
15182,nsessions=3,nusers=2,idletime=627924,totmem=32637848kb,availmem=30943568kb,physmem=16632016kb,ncpus=8,loadave=2.99,gres=cpu_factor:=1.52375,netload=2895785926,
size=82590804kb:108277440kb,stat
^^^^^^^^^^^^^^^^^^^^^^^^^^^


and I specify resources_default.file=9000000kb at queue level:
Qmgr: s q long resources_default.file=9000000kb
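
(To double-check what the queue ended up with, the setting can be
listed back from qmgr, e.g.

Qmgr: list queue long resources_default

which should print resources_default.file = 9000000kb.)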

Then I submit a job and I get this error:

getsize() failed for file in mom_set_limits
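
For what it's worth, the same message should also end up in the mom
log on the execution host (at least that is where I would expect it,
given the /var/spool/pbs layout above), so it can be grepped for
there:

# grep getsize /var/spool/pbs/mom_logs/*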


Googling it I found :

http://www.supercluster.org/pipermail/torqueusers/2007-February/005037.html

which recommends ddisk instead of size...

But that makes no sense if we read the TORQUE admin manual:

>        size[fs=<FS>]
>               Specifies that the available and configured disk space
>               in the <FS> filesystem is to be reported to the
>               pbs_server and scheduler.  NOTE: To request disk space
>               on a per job basis, specify the file resource as in
>               'qsub -l nodes=1,file=1000kb'.  For example, the
>               available and configured disk space in the
>               /localscratch filesystem will be reported:
>
>               size[fs=/localscratch]
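
Going by that paragraph, the per-job request should look something
like this (with the queue default only kicking in when nothing is
given on the command line):

$ echo sleep 5 | qsub -l nodes=1,file=1000000kb -q long
$ qstat -f <jobid> | grep Resource_List.file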

And it does make sense if it does what Dave Jackson says:


> >   The failure in TORQUE is occurring because TORQUE is trying to
> > set the file ulimit.  My guess is that this is not what you want.
> > You want Maui to schedule diskspace as a consumable resource but
> > not enforce anything at the OS level?  Is this correct?  Moab
> > supports a 'ddisk' (dedicated disk) RM extension which is a disk
> > constraint enforced by the scheduler as a consumable resource but
> > not enforced via ulimits, ie,
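
If I read that correctly, the difference should be visible from inside
a job: with '-l file=...' TORQUE sets the file-size ulimit for the
job, while with '-l ddisk=...' that limit should stay untouched.
Something like this would show it (assuming the file job starts at
all), comparing the two STDIN.o* files afterwards:

$ echo 'ulimit -f' | qsub -l nodes=1,file=1000000kb -q short
$ echo 'ulimit -f' | qsub -l ddisk=900gb -q short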

So, could someone clarify it for me?

Is "size" a valid resource for requesting (and not reserving) a
minimal amount of space when submitting a job?

Do I have to use ddisk? Because I tried it and torque/MAUI does not
take it into account:

[arnaubria at ui01 ~]$ echo sleep 5|qsub -l ddisk=900gb -q short
562161.pbs02.pic.es
[arnaubria at ui01 ~]$ qstat -f 562161.pbs02.pic.es
Job Id: 562161.pbs02.pic.es
    Job_Name = STDIN
    Job_Owner = arnaubria at ui01.pic.es
    job_state = Q
    queue = short
    server = pbs02.pic.es
    Checkpoint = u
    ctime = Tue Sep 23 16:32:10 2008
    Error_Path = ui01.pic.es:/nfs/pic.es/user/a/arnaubria/STDIN.e562161
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Sep 23 16:32:10 2008
    Output_Path = ui01.pic.es:/nfs/pic.es/user/a/arnaubria/STDIN.o562161
    Priority = 0
    qtime = Tue Sep 23 16:32:10 2008
    Rerunable = True
    Resource_List.cput = 01:30:00
    Resource_List.ddisk = 900gb
    Resource_List.walltime = 03:00:00
    Variable_List = PBS_O_HOME=/nfs/pic.es/user/a/arnaubria,
	PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=arnaubria,
	PBS_O_PATH=/usr/kerberos/bin:/opt/glite/bin:/opt/glite/externals/bin:
	/opt/lcg/bin:/opt/lcg/sbin:/opt/edg/bin:/opt/edg/sbin:/opt/globus/sbin
	:/opt/globus/bin:/opt/gpt/sbin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6
	/bin:/opt/d-cache//srm/bin:/opt/d-cache//dcap/bin:/usr/java/jdk1.5.0_1
	4/bin:/nfs/pic.es/user/a/arnaubria/bin,
	PBS_O_MAIL=/var/spool/mail/arnaubria,PBS_O_SHELL=/bin/bash,
	PBS_O_HOST=ui01.pic.es,PBS_O_WORKDIR=/nfs/pic.es/user/a/arnaubria,
	PBS_O_QUEUE=short
    etime = Tue Sep 23 16:32:10 2008
    submit_args = -l ddisk=900gb -q short


# checkjob 562161.pbs02.pic.es


checking job 562161

State: Running
Creds:  user:arnaubria  group:grid  class:short  qos:DEFAULT
WallTime: 00:00:00 of 3:00:00
SubmitTime: Tue Sep 23 16:32:10
  (Time Queued  Total: 00:00:20  Eligible: 00:00:20)

StartTime: Tue Sep 23 16:32:30
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [slc4]
Allocated Nodes:
[td020.pic.es:1]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       BACKFILL RESTARTABLE

Reservation '562161' (00:00:00 -> 3:00:00  Duration: 3:00:00)
PE:  1.00  StartPriority:  0

And I have no node with 900gb of free space...
(I have also tried using kb instead of gb)
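
Just in case it matters: the checkjob output above already shows
"Disk >= 0" in the job's Req, and checking a node directly, e.g.

# checknode td020.pic.es

(td020 is where the job landed), should show whether Maui knows about
any configured disk on it at all.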

TIA,
Arnau

