[torqueusers] getsize() failed for file in mom_set_limits
Steve Young
chemadm at hamilton.edu
Tue Sep 23 08:46:48 MDT 2008
I use this setup so that users can request machines with enough
scratch disk space for their jobs.
On the node in mom_priv/config:
size[fs=/scratch]
This corresponds to the /scratch partition on the machine.
Then when I do a checknode on the server I see:
checking node node0001
State: Running (in current state for 00:00:00)
Configured Resources: PROCS: 32 MEM: 30G SWAP: 18G DISK: 1649G
Utilized Resources: DISK: 154G
Dedicated Resources: PROCS: 30 MEM: 30G
Now all my users need to do is specify this in their batch scripts to
request the amount of scratch space they need:
#PBS -l file=500gb
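A complete batch script might look like the following minimal sketch. Only the file=500gb directive comes from the setup described here; the nodes=1 request, the free_kb helper, the SCRATCH_DIR variable and the free-space check are hypothetical additions for illustration, and TORQUE does not require them.

```shell
#!/bin/sh
#PBS -l nodes=1
#PBS -l file=500gb

# Hypothetical helper: print the free space (in kB) of the filesystem
# that holds the path given as $1. -P keeps each df entry on one line.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# SCRATCH_DIR is an assumed variable name; /scratch matches the
# size[fs=/scratch] entry in mom_priv/config.
SCRATCH_DIR=${SCRATCH_DIR:-/scratch}
NEED_KB=$((500 * 1024 * 1024))   # 500 GB expressed in kB

# Sanity check inside the job: warn if the node has less free scratch
# space than the job asked for.
if [ -d "$SCRATCH_DIR" ]; then
    avail=$(free_kb "$SCRATCH_DIR")
    if [ "$avail" -lt "$NEED_KB" ]; then
        echo "only ${avail} kB free on $SCRATCH_DIR" >&2
    fi
fi
```

The check only reports a shortfall; whether the job should abort in that case is a site decision.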
I'm not sure this solves your situation, as you're talking about /home
space as opposed to some other partition (/scratch in my example). But
I just wanted to point out that this setup has been working for me.
Hope this helps,
-Steve
On Sep 23, 2008, at 10:35 AM, Arnau Bria wrote:
> Hi all,
>
> I tried to configure the size parameter in the pbs_mom config file on
> the WNs, so that I can later specify, at queue level, the default
> resources needed by a job.
>
> WNs:
> #cat /var/spool/pbs/mom_priv/config
> [...]
> size[fs=/home]
>
> # pbsnodes td033
> td033
> state = free
> np = 8
> properties = slc4,magic
> ntype = cluster
> status = opsys=linux,uname=Linux td033.pic.es 2.6.9-42.0.3.ELsmp
> #1 SMP Thu Oct 5 15:04:03 CDT 2006 i686,sessions=4836 9685
> 15182,nsessions=3,nusers=2,idletime=627924,totmem=32637848kb,
> availmem=30943568kb,physmem=16632016kb,ncpus=8,loadave=2.99,
> gres=cpu_factor:=1.52375,
> netload=2895785926,size=82590804kb:108277440kb,stat
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>
> and I specify resources_default.file=9000000kb at queue level:
> Qmgr: s q long resources_default.file=9000000kb
>
> Then I submit a job and I get this error:
>
> getsize() failed for file in mom_set_limits
>
>
> Googling it, I found:
>
> http://www.supercluster.org/pipermail/torqueusers/2007-February/005037.html
>
> which recommends ddisk instead of size...
>
> But that makes no sense if we read the TORQUE admin manual:
>
>> size[fs=<FS>]
>> Specifies that the available and configured disk space in the
>> <FS> filesystem is to be reported to the pbs_server and
>> scheduler. NOTE: To request disk space on a per job basis,
>> specify the file resource as in 'qsub -l nodes=1,file=1000kb'.
>> For example, the available and configured disk space in the
>> /localscratch filesystem will be reported:
>>
>> size[fs=/localscratch]
>
> But it does make sense if it works the way Dave Jackson describes:
>
>
>>> The failure in TORQUE is occurring because TORQUE is trying to
>>> set the file ulimit. My guess is that this is not what you want.
>>> You want Maui to schedule diskspace as a consumable resource but
>>> not enforce anything at the OS level? Is this correct? Moab
>>> supports a 'ddisk' (dedicated disk) RM extension which is a disk
>>> constraint enforced by the scheduler as a consumable resource but
>>> not enforced via ulimits, ie,
>
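To make the "file ulimit" mentioned above concrete, here is a minimal sketch of what a per-process file size limit does at the OS level. It is not TORQUE's code: the limit of 8 blocks, the 1 MB write and the variable names are assumptions chosen for illustration, and the block unit of ulimit -f differs between shells.

```shell
#!/bin/sh
# Illustrate a per-process file size limit (RLIMIT_FSIZE), the kind of
# limit that a "file ulimit" refers to.
tmpfile=$(mktemp)

(
    # Lower the limit inside a subshell only, so the calling shell
    # keeps its own limit. Units are shell-dependent blocks.
    ulimit -f 8
    # Try to write 1 MB; the write is cut off once the limit is hit.
    head -c 1048576 /dev/zero > "$tmpfile"
) 2>/dev/null || true

# Arithmetic expansion strips any padding that wc may add.
written=$(( $(wc -c < "$tmpfile") ))
echo "wrote $written of 1048576 bytes"
rm -f "$tmpfile"
```

The limit caps the size of each file a process writes. It does not reserve or check free space on a filesystem, which is the distinction drawn in the quoted explanation.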
> So, could someone clarify it for me?
>
> Is "size" a valid resource for requesting (and not reserving) a
> minimal amount of space when submitting a job?
>
> Do I have to use ddisk? Because I tried it, and TORQUE/Maui does not
> take it into account:
>
> [arnaubria at ui01 ~]$ echo sleep 5|qsub -l ddisk=900gb -q short
> 562161.pbs02.pic.es
> [arnaubria at ui01 ~]$ qstat -f 562161.pbs02.pic.es
> Job Id: 562161.pbs02.pic.es
> Job_Name = STDIN
> Job_Owner = arnaubria at ui01.pic.es
> job_state = Q
> queue = short
> server = pbs02.pic.es
> Checkpoint = u
> ctime = Tue Sep 23 16:32:10 2008
> Error_Path = ui01.pic.es:/nfs/pic.es/user/a/arnaubria/STDIN.e562161
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = a
> mtime = Tue Sep 23 16:32:10 2008
> Output_Path = ui01.pic.es:/nfs/pic.es/user/a/arnaubria/STDIN.o562161
> Priority = 0
> qtime = Tue Sep 23 16:32:10 2008
> Rerunable = True
> Resource_List.cput = 01:30:00
> Resource_List.ddisk = 900gb
> Resource_List.walltime = 03:00:00
> Variable_List = PBS_O_HOME=/nfs/pic.es/user/a/arnaubria,
> PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=arnaubria,
> PBS_O_PATH=/usr/kerberos/bin:/opt/glite/bin:/opt/glite/externals/bin:
> /opt/lcg/bin:/opt/lcg/sbin:/opt/edg/bin:/opt/edg/sbin:/opt/globus/sbin:
> /opt/globus/bin:/opt/gpt/sbin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:
> /opt/d-cache//srm/bin:/opt/d-cache//dcap/bin:/usr/java/jdk1.5.0_14/bin:
> /nfs/pic.es/user/a/arnaubria/bin,
> PBS_O_MAIL=/var/spool/mail/arnaubria,PBS_O_SHELL=/bin/bash,
> PBS_O_HOST=ui01.pic.es,PBS_O_WORKDIR=/nfs/pic.es/user/a/arnaubria,
> PBS_O_QUEUE=short
> etime = Tue Sep 23 16:32:10 2008
> submit_args = -l ddisk=900gb -q short
>
>
> # checkjob 562161.pbs02.pic.es
>
>
> checking job 562161
>
> State: Running
> Creds: user:arnaubria group:grid class:short qos:DEFAULT
> WallTime: 00:00:00 of 3:00:00
> SubmitTime: Tue Sep 23 16:32:10
> (Time Queued Total: 00:00:20 Eligible: 00:00:20)
>
> StartTime: Tue Sep 23 16:32:30
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: DEFAULT
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [slc4]
> Allocated Nodes:
> [td020.pic.es:1]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 1
> PartitionMask: [ALL]
> Flags: BACKFILL RESTARTABLE
>
> Reservation '562161' (00:00:00 -> 3:00:00 Duration: 3:00:00)
> PE: 1.00 StartPriority: 0
>
> And I have no node with 900gb of free space...
> (I have also tried using kb instead of gb)
>
> TIA,
> Arnau