[torqueusers] getsize() failed for file in mom_set_limits
Arnau Bria
arnaubria at pic.es
Tue Sep 23 08:35:40 MDT 2008
Hi all,
I tried to configure the size parameter in the pbs_mom config file on
the WNs (worker nodes), so that I can later specify, at queue level,
the default resources needed by a job.
WNs:
#cat /var/spool/pbs/mom_priv/config
[...]
size[fs=/home]
# pbsnodes td033
td033
state = free
np = 8
properties = slc4,magic
ntype = cluster
status = opsys=linux,uname=Linux td033.pic.es 2.6.9-42.0.3.ELsmp
#1 SMP Thu Oct 5 15:04:03 CDT 2006 i686,sessions=4836 9685
15182,nsessions=3,nusers=2,idletime=627924,totmem=32637848kb, availmem=30943568kb,physmem=16632016kb,ncpus=8,loadave=2.99,gres=cpu_factor:=1.52375, etload=2895785926,size=82590804kb:108277440kb,stat
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
and I specify resources_default.file=9000000kb at queue level:
Qmgr: s q long resources_default.file=9000000kb
Then I submit a job and I get this error:
getsize() failed for file in mom_set_limits
Googling it, I found:
http://www.supercluster.org/pipermail/torqueusers/2007-February/005037.html
which recommends ddisk instead of size...
But that makes no sense if we read the TORQUE admin manual:
> size[fs=<FS>]
> Specifies that the available and configured disk
> space in the <FS> filesystem is to be reported to the pbs_server
> and scheduler. NOTE: To request disk space on a per job basis,
> specify the file resource as in 'qsub -l nodes=1,file=1000kb'. For
> example, the available and configured disk space in the
> /localscratch filesystem will be reported:
>
> size[fs=/localscratch]
And it does make sense if it does what Dave Jackson says:
> > The failure in TORQUE is occurring because TORQUE is trying to
> > set the file ulimit. My guess is that this is not what you want.
> > You want Maui to schedule diskspace as a consumable resource but
> > not enforce anything at the OS level? Is this correct? Moab
> > supports a 'ddisk' (dedicated disk) RM extension which is a disk
> > constraint enforced by the scheduler as a consumable resource but
> > not enforced via ulimits, ie,
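To illustrate what "setting the file ulimit" means at the OS level, here
is a minimal Python sketch. It is not TORQUE code: the function name
try_write_with_file_limit is made up for this example, and it assumes a
Linux/Unix host. A child process lowers its own RLIMIT_FSIZE and then
tries to write a file, which is roughly the kind of per-process limit
Dave describes, as opposed to a check on free disk space.

```python
import os
import resource
import signal
import tempfile


def try_write_with_file_limit(limit_bytes, payload_bytes):
    """Return True if a child limited to limit_bytes per file can write
    payload_bytes to a scratch file, False if the kernel refuses."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    pid = os.fork()
    if pid == 0:
        # Child: the limit set here affects only this process.
        status = 1
        try:
            # Ignore SIGXFSZ so an oversized write fails with an error
            # (EFBIG) instead of killing the process.
            signal.signal(signal.SIGXFSZ, signal.SIG_IGN)
            resource.setrlimit(resource.RLIMIT_FSIZE,
                               (limit_bytes, limit_bytes))
            with open(path, "wb") as handle:
                handle.write(b"x" * payload_bytes)
            status = 0
        except OSError:
            status = 1  # write or flush rejected: file too large
        finally:
            os._exit(status)
    # Parent: collect the child's result and clean up the scratch file.
    _, wait_status = os.waitpid(pid, 0)
    os.remove(path)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
```

Note that the limit caps the size of each file the process writes; it
says nothing about how much space is free on the node.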
So, could someone clarify this for me?
Is "size" a valid resource for requesting (and not reserving) a
minimum amount of space when submitting a job?
Do I have to use ddisk? Because I tried it, and TORQUE/Maui does not
take it into account:
[arnaubria@ui01 ~]$ echo sleep 5|qsub -l ddisk=900gb -q short
562161.pbs02.pic.es
[arnaubria@ui01 ~]$ qstat -f 562161.pbs02.pic.es
Job Id: 562161.pbs02.pic.es
Job_Name = STDIN
Job_Owner = arnaubria@ui01.pic.es
job_state = Q
queue = short
server = pbs02.pic.es
Checkpoint = u
ctime = Tue Sep 23 16:32:10 2008
Error_Path = ui01.pic.es:/nfs/pic.es/user/a/arnaubria/STDIN.e562161
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Tue Sep 23 16:32:10 2008
Output_Path = ui01.pic.es:/nfs/pic.es/user/a/arnaubria/STDIN.o562161
Priority = 0
qtime = Tue Sep 23 16:32:10 2008
Rerunable = True
Resource_List.cput = 01:30:00
Resource_List.ddisk = 900gb
Resource_List.walltime = 03:00:00
Variable_List = PBS_O_HOME=/nfs/pic.es/user/a/arnaubria,
PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=arnaubria,
PBS_O_PATH=/usr/kerberos/bin:/opt/glite/bin:/opt/glite/externals/bin:
/opt/lcg/bin:/opt/lcg/sbin:/opt/edg/bin:/opt/edg/sbin:/opt/globus/sbin
:/opt/globus/bin:/opt/gpt/sbin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6
/bin:/opt/d-cache//srm/bin:/opt/d-cache//dcap/bin:/usr/java/jdk1.5.0_1
4/bin:/nfs/pic.es/user/a/arnaubria/bin,
PBS_O_MAIL=/var/spool/mail/arnaubria,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=ui01.pic.es,PBS_O_WORKDIR=/nfs/pic.es/user/a/arnaubria,
PBS_O_QUEUE=short
etime = Tue Sep 23 16:32:10 2008
submit_args = -l ddisk=900gb -q short
# checkjob 562161.pbs02.pic.es
checking job 562161
State: Running
Creds: user:arnaubria group:grid class:short qos:DEFAULT
WallTime: 00:00:00 of 3:00:00
SubmitTime: Tue Sep 23 16:32:10
(Time Queued Total: 00:00:20 Eligible: 00:00:20)
StartTime: Tue Sep 23 16:32:30
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [slc4]
Allocated Nodes:
[td020.pic.es:1]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: BACKFILL RESTARTABLE
Reservation '562161' (00:00:00 -> 3:00:00 Duration: 3:00:00)
PE: 1.00 StartPriority: 0
The job started anyway, and yet I have no node with 900gb of free space...
(I have also tried using kb instead of gb)
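For reference, this is the check I would expect the scheduler to make,
written as a small Python sketch. The helper names (parse_size_kb,
node_disk_kb, request_fits) are made up for illustration, and I am
assuming that the "size" status value is reported as
available:configured and that kb/mb/gb are powers of 1024.

```python
def parse_size_kb(text):
    """Convert a string such as '900gb' or '9000000kb' to kilobytes.
    Assumption: units are binary (1 gb = 1024 * 1024 kb)."""
    factors = {"kb": 1, "mb": 1024, "gb": 1024 ** 2, "tb": 1024 ** 3}
    text = text.strip().lower()
    for suffix, factor in factors.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    raise ValueError("unrecognised size: %r" % text)


def node_disk_kb(status):
    """Extract (available_kb, configured_kb) from a pbsnodes status
    string, or None if the node reports no 'size' field."""
    for field in status.split(","):
        key, sep, value = field.strip().partition("=")
        if key == "size" and sep:
            available, _, configured = value.partition(":")
            return parse_size_kb(available), parse_size_kb(configured)
    return None


def request_fits(status, requested):
    """True if the node reports at least the requested disk as available."""
    disk = node_disk_kb(status)
    if disk is None:
        return False
    available_kb, _ = disk
    return parse_size_kb(requested) <= available_kb


# Values taken from the td033 status shown earlier in this message.
status = "ncpus=8,loadave=2.99,size=82590804kb:108277440kb"
print(node_disk_kb(status))
print(request_fits(status, "9000000kb"))
print(request_fits(status, "900gb"))
```

With the td033 numbers, a 9000000kb request fits in the reported
available space while a 900gb request does not, which is why I did not
expect the ddisk=900gb job to start.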
TIA,
Arnau