[torqueusers] getsize() failed for file in mom_set_limits

Steve Young chemadm at hamilton.edu
Tue Sep 23 08:46:48 MDT 2008


I have used this setup to let users request machines with enough
scratch disk space for their jobs:

On the node in mom_priv/config:

size[fs=/scratch]

This corresponds to the /scratch partition on the machine.
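
One caveat: pbs_mom only reads mom_priv/config at startup, so after
adding the size line you need to restart the mom before the new value
shows up. Something like this (the init script name may differ on your
install, and I believe newer moms also re-read the config on SIGHUP,
though I haven't verified that on every release):

    # service pbs_mom restart
    (or)
    # kill -HUP $(pgrep pbs_mom)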

Then when I do a checknode on the server I see:

checking node node0001

State:   Running  (in current state for 00:00:00)
Configured Resources: PROCS: 32  MEM: 30G  SWAP: 18G  DISK: 1649G
Utilized   Resources: DISK: 154G
Dedicated  Resources: PROCS: 30  MEM: 30G

Now all my users need to do is specify this in their batch scripts to  
request the amount of scratch space they need:

#PBS -l file=500gb
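
For example, a complete (made-up) batch script using this would look
something like the following; the job name, walltime, and the per-job
scratch directory are just placeholders for whatever your site uses:

    #!/bin/bash
    #PBS -N scratch_job
    #PBS -l nodes=1
    #PBS -l file=500gb
    #PBS -l walltime=01:00:00

    # work inside the partition the mom reports via size[fs=/scratch]
    SCRATCH=/scratch/$PBS_JOBID
    mkdir -p "$SCRATCH"
    cd "$SCRATCH"

    # ... run the real job here ...

    # clean up so the space is free for the next job
    cd /
    rm -rf "$SCRATCH"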

I'm not sure if this solves your situation, as you're talking about
/home space as opposed to some other partition (/scratch in my
example). But I just wanted to point out that this setup has been
working for me.
Hope this helps,

-Steve



On Sep 23, 2008, at 10:35 AM, Arnau Bria wrote:

> Hi all,
>
> I tried to configure the size parameter in the pbs config file on the
> WNs so that I could later specify, at queue level, the default
> resources a job needs.
>
> WNs:
> #cat /var/spool/pbs/mom_priv/config
> [...]
> size[fs=/home]
>
> # pbsnodes td033
> td033
>     state = free
>     np = 8
>     properties = slc4,magic
>     ntype = cluster
>     status = opsys=linux,uname=Linux td033.pic.es 2.6.9-42.0.3.ELsmp
>       #1 SMP Thu Oct 5 15:04:03 CDT 2006 i686,sessions=4836 9685 15182,
>       nsessions=3,nusers=2,idletime=627924,totmem=32637848kb,
>       availmem=30943568kb,physmem=16632016kb,ncpus=8,loadave=2.99,
>       gres=cpu_factor:=1.52375,netload=2895785926,
>       size=82590804kb:108277440kb,stat
>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>
> and I specify resources_default.file=9000000kb at queue level:
> Qmgr: s q long resources_default.file=9000000kb
>
> Then I submit a job and I get this error:
>
> getsize() failed for file in mom_set_limits
>
>
> Googling it I found :
>
> http://www.supercluster.org/pipermail/torqueusers/2007-February/005037.html
>
> which recommends ddisk instead of size...
>
> But that makes no sense if we read the torque admin manual:
>
>>       size[fs=<FS>]
>>              Specifies that the available and configured disk space
>>              in the <FS> filesystem is to be reported to the
>>              pbs_server and scheduler.  NOTE: To request disk space
>>              on a per job basis, specify the file resource as in
>>              'qsub -l nodes=1,file=1000kb'.  For example, the
>>              available and configured disk space in the
>>              /localscratch filesystem will be reported:
>>
>>              size[fs=/localscratch]
>
> And it makes sense if it does what Dave Jackson says:
>
>
>>>  The failure in TORQUE is occurring because TORQUE is trying to
>>> set the file ulimit.  My guess is that this is not what you want.
>>> You want Maui to schedule diskspace as a consumable resource but
>>> not enforce anything at the OS level?  Is this correct?  Moab
>>> supports a 'ddisk' (dedicated disk) RM extension which is a disk
>>> constraint enforced by the scheduler as a consumable resource but
>>> not enforced via ulimits, ie,
>
> So, could someone clarify it for me?
>
> Is "size" a valid resource for requesting (and not reserving) a
> minimum amount of space when submitting a job?
>
> Do I have to use ddisk? Because I tried it and torque/MAUI does not
> take it into account:
>
> [arnaubria at ui01 ~]$ echo sleep 5|qsub -l ddisk=900gb -q short
> 562161.pbs02.pic.es
> [arnaubria at ui01 ~]$ qstat -f 562161.pbs02.pic.es
> Job Id: 562161.pbs02.pic.es
>    Job_Name = STDIN
>    Job_Owner = arnaubria at ui01.pic.es
>    job_state = Q
>    queue = short
>    server = pbs02.pic.es
>    Checkpoint = u
>    ctime = Tue Sep 23 16:32:10 2008
>    Error_Path = ui01.pic.es:/nfs/pic.es/user/a/arnaubria/STDIN.e562161
>    Hold_Types = n
>    Join_Path = n
>    Keep_Files = n
>    Mail_Points = a
>    mtime = Tue Sep 23 16:32:10 2008
>    Output_Path = ui01.pic.es:/nfs/pic.es/user/a/arnaubria/STDIN.o562161
>    Priority = 0
>    qtime = Tue Sep 23 16:32:10 2008
>    Rerunable = True
>    Resource_List.cput = 01:30:00
>    Resource_List.ddisk = 900gb
>    Resource_List.walltime = 03:00:00
>    Variable_List = PBS_O_HOME=/nfs/pic.es/user/a/arnaubria,
> 	PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=arnaubria,
> 	PBS_O_PATH=/usr/kerberos/bin:/opt/glite/bin:/opt/glite/externals/bin:
> 	/opt/lcg/bin:/opt/lcg/sbin:/opt/edg/bin:/opt/edg/sbin:/opt/globus/sbin
> 	:/opt/globus/bin:/opt/gpt/sbin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6
> 	/bin:/opt/d-cache//srm/bin:/opt/d-cache//dcap/bin:/usr/java/jdk1.5.0_14
> 	/bin:/nfs/pic.es/user/a/arnaubria/bin,
> 	PBS_O_MAIL=/var/spool/mail/arnaubria,PBS_O_SHELL=/bin/bash,
> 	PBS_O_HOST=ui01.pic.es,PBS_O_WORKDIR=/nfs/pic.es/user/a/arnaubria,
> 	PBS_O_QUEUE=short
>    etime = Tue Sep 23 16:32:10 2008
>    submit_args = -l ddisk=900gb -q short
>
>
> # checkjob 562161.pbs02.pic.es
>
>
> checking job 562161
>
> State: Running
> Creds:  user:arnaubria  group:grid  class:short  qos:DEFAULT
> WallTime: 00:00:00 of 3:00:00
> SubmitTime: Tue Sep 23 16:32:10
>  (Time Queued  Total: 00:00:20  Eligible: 00:00:20)
>
> StartTime: Tue Sep 23 16:32:30
> Total Tasks: 1
>
> Req[0]  TaskCount: 1  Partition: DEFAULT
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [slc4]
> Allocated Nodes:
> [td020.pic.es:1]
>
>
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> Flags:       BACKFILL RESTARTABLE
>
> Reservation '562161' (00:00:00 -> 3:00:00  Duration: 3:00:00)
> PE:  1.00  StartPriority:  0
>
> And I have no node with 900gb of free space...
> (I have also tried using kb instead of gb)
>
> TIA,
> Arnau
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


