[Mauiusers] Resources problem : cannot select job 62 for partition DEFAULT (job hold active)
Daniel Boone
daniel.boone at kahosl.be
Thu May 17 02:18:15 MDT 2007
Hi
I did some further testing with more intensive logging and found the
following. Maybe this log helps a bit more:
/usr/local/maui/log/maui.log
-------------------------
05/16 16:02:19 MStatClearUsage([NONE],Idle)
05/16 16:02:19 MPolicyAdjustUsage(NULL,104,NULL,idle,PU,[ALL],1,NULL)
05/16 16:02:19 MPolicyAdjustUsage(NULL,104,NULL,idle,NULL,[ALL],1,NULL)
05/16 16:02:19 INFO: total jobs selected (ALL): 1/1
05/16 16:02:19 INFO: jobs selected:
[000: 1]
05/16 16:02:19 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
05/16 16:02:19 INFO: total jobs selected in partition ALL: 1/1
05/16 16:02:19 MQueueScheduleRJobs(Q)
05/16 16:02:19 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
05/16 16:02:19 INFO: total jobs selected in partition ALL: 1/1
05/16 16:02:19 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
05/16 16:02:19 INFO: total jobs selected in partition DEFAULT: 1/1
05/16 16:02:19 MQueueScheduleIJobs(Q,DEFAULT)
05/16 16:02:19 INFO: checking job 104(1) state: Idle (ex: Idle)
05/16 16:02:19 MJobSelectMNL(104,DEFAULT,NULL,MNodeList,NodeMap,MaxSpeed,2)
----------------- is this the reason why it fails? ------
05/16 16:02:19 MReqGetFNL(104,0,DEFAULT,NULL,DstNL,NC,TC,2140000000,0)
05/16 16:02:19 INFO: 2 feasible tasks found for job 104:0 in partition DEFAULT (10 Needed)
05/16 16:02:19 INFO: inadequate feasible tasks found for job 104:0 in partition DEFAULT (2 < 10)
05/16 16:02:19 INFO: 5/16 16:02:19
MJobPReserve(104,DEFAULT,ResCount,ResCountRej)
--------------------------------------------
05/16 16:02:19 MJobReserve(104,Priority)
05/16 16:02:19 MPolicyGetEStartTime(104,ALL,SOFT,Time)
05/16 16:02:19 INFO: policy start time found for job 104 in 00:00:00
05/16 16:02:19 MJobGetEStartTime(104,NULL,NodeCount,TaskCount,MNodeList,1179324139)
05/16 16:02:19 ALERT: job 104 cannot run in any partition
05/16 16:02:19 ALERT: cannot create new reservation for job 104 (shape[1] 10)
05/16 16:02:19 ALERT: cannot create new reservation for job 104
05/16 16:02:19 MJobSetHold(104,16,1:00:00,NoResources,cannot create reservation for job '104' (intital reservation attempt)
)
05/16 16:02:19 ALERT: job '104' cannot run (deferring job for 3600 seconds)
05/16 16:02:19 WARNING: cannot reserve priority job '104'
cannot locate adequate feasible tasks for job 104:0
---------------------------------
Maybe this helps some more.
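
The two INFO lines marked in the log excerpt are the ones worth searching for in a longer maui.log. Below is a minimal Python sketch that extracts them, assuming the message format stays exactly as in this excerpt. The function name feasible_shortfalls is made up for illustration and is not part of Maui.

```python
import re

# Excerpt of the maui.log lines quoted in this message.
LOG = """\
05/16 16:02:19 INFO: 2 feasible tasks found for job 104:0 in
partition DEFAULT (10 Needed)
05/16 16:02:19 INFO: inadequate feasible tasks found for job 104:0
in partition DEFAULT (2 < 10)
"""

# \s+ between words lets the pattern match even when mail wrapping
# has split one log record over two lines.
FEASIBLE = re.compile(
    r"INFO:\s+(\d+)\s+feasible\s+tasks\s+found\s+for\s+job\s+(\S+)\s+in\s+"
    r"partition\s+(\S+)\s+\((\d+)\s+Needed\)"
)


def feasible_shortfalls(log_text):
    """Return (job, partition, found, needed) for each request short of tasks."""
    results = []
    for m in FEASIBLE.finditer(log_text):
        found, needed = int(m.group(1)), int(m.group(4))
        if found < needed:
            results.append((m.group(2), m.group(3), found, needed))
    return results


for job, partition, found, needed in feasible_shortfalls(LOG):
    print(f"job {job}: {found} of {needed} tasks feasible in partition {partition}")
```

For the excerpt above this reports job 104:0 with 2 of 10 tasks feasible in partition DEFAULT, the same shortfall the log states as (2 < 10).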
Daniel Boone wrote:
>
> I tried some new parameters.
>
> print server output of qmgr
> ----------------
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_default.mem = 2000mb
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.pvmem = 16000mb
> set queue batch resources_default.walltime = 06:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server managers = abaqus at em-research00
> set server operators = abaqus at em-research00
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server pbs_version = 2.1.8
> ----------------------
> checkjob output:
> ----------------------
> checking job 90 (RM job '90.em-research00')
>
> State: Idle EState: Deferred
> Creds: user:abaqus group:users class:batch qos:DEFAULT
> WallTime: 00:00:00 of 5:00:00
> SubmitTime: Tue May 15 11:59:03
> (Time Queued Total: 1:58:17 Eligible: 00:00:00)
>
> Total Tasks: 4
>
> Req[0] TaskCount: 4 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 15G
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> Exec: '' ExecSize: 0 ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1 MEM: 250M SWAP: 15G
> NodeAccess: SHARED
> TasksPerNode: 2 NodeCount: 2
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 0
> PartitionMask: [ALL]
> SystemQueueTime: Tue May 15 13:00:06
>
> Flags: RESTARTABLE
>
> job is deferred. Reason: NoResources (cannot create reservation for
> job '90' (intital reservation attempt)
> )
> Holds: Defer (hold reason: NoResources)
> PE: 6.07 StartPriority: 57
> cannot select job 90 for partition DEFAULT (job hold active)
> -------------------
> pbs-script:
> -------------------
>
> #!/bin/bash
> #PBS -l nodes=2:ppn=2
> #PBS -l walltime=05:00:00
> #PBS -l mem=1000mb
> #PBS -l vmem=7000mb
> #PBS -j oe
> #PBS -M daniel.boone at kahosl.be
> #PBS -m bae
> # Go to the directory from which you submitted the job
> mkdir $PBS_O_WORKDIR
> string="$PBS_O_WORKDIR/plus2gb.inp"
>
> scp 10.1.0.52:$string $PBS_O_WORKDIR
>
> cd $PBS_O_WORKDIR
> #module load abaqus
> #
> /Apps/abaqus/Commands/abaqus job=plus2gb queue=abaqus4cpu input=Standard_plus2gbyte.inp cpus=4
> ---------------------------
> abaqus environment file.
> --------------------------
> import os
> os.environ['LAMRSH'] = 'ssh'
>
> max_cpus=6
>
> mp_host_list=[['em-research00',3],['10.1.0.97',2]]
>
>
> run_mode = BATCH
> scratch = "/home/abaqus"
>
> queue_name=["cpu","abaqus4cpu"]
> queue_cmd="qsub -r n -q batch -S /bin/bash -V -l nodes=1:ppn=1 %S"
> cpu="qsub -r n -q batch -S /bin/bash -V -l nodes=1:ppn=2 %S"
> abaqus4cpu="qsub -r n -q batch -S /bin/bash -V -l nodes=2:ppn=2 %S"
>
> pre_memory = "3000 mb"
> standard_memory = "7000 mb"
>
> ---------------------------
> But still no change.
>
> Thanks for all the help so far.
> rishi pathak wrote:
>
>> Also try this in your job script file:
>> #PBS -l pvmem=<amount of virtual memory>
>>
>> On 5/15/07, rishi pathak <mailmaverick666 at gmail.com> wrote:
>>
>> I did not see any specific queue in the submit script.
>> Have you specified the following for the queue you are using?
>>
>> resources_default.mem #available ram
>> resources_default.pvmem #virtual memory
>>
>>
>>
>>
>>
>> On 5/15/07, Daniel Boone <daniel.boone at kahosl.be> wrote:
>>
>> Hi
>>
>> I need to use the swap. I know I don't have enough RAM, but the job
>> must be able to run, even if it swaps a lot. Time is not an issue here.
>> On one machine the job uses about 7.4GB of swap. We don't have any
>> other machines with more RAM to run it on.
>> The other option is to run the job outside torque/maui, but I would
>> rather not do that.
>>
>> Can someone tell me how to read the checkjob -v output? I don't
>> understand how to find errors in it.
>>
>> rishi pathak wrote:
>> > Hi
>> > The system memory (RAM) available per process is less than the
>> > requested amount.
>> > Swap is not considered an extension of RAM.
>> > Try requesting less system memory.
>> >
>> >
>> >
>> > On 5/14/07, Daniel Boone <daniel.boone at kahosl.be> wrote:
>> >
>> > Hi
>> >
>> > I'm having the following problem. When I submit a very
>> > memory-intensive (mostly swap) job, the job doesn't want to start.
>> > It gives the error: cannot select job 62 for partition DEFAULT
>> > (job hold active)
>> > But I don't understand what the error means.
>> >
>> > I run torque 2.1.8 with maui 3.2.6p19
>> >
>> > checkjob -v returns the following:
>> > -------------------
>> > checking job 62 (RM job '62.em-research00')
>> >
>> > State: Idle EState: Deferred
>> > Creds: user:abaqus group:users class:batch qos:DEFAULT
>> > WallTime: 00:00:00 of 6:00:00
>> > SubmitTime: Mon May 14 14:13:41
>> > (Time Queued Total: 1:53:39 Eligible: 00:00:00)
>> >
>> > Total Tasks: 4
>> >
>> > Req[0] TaskCount: 4 Partition: ALL
>> > Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>> > Opsys: [NONE] Arch: [NONE] Features: [NONE]
>> > Exec: '' ExecSize: 0 ImageSize: 0
>> > Dedicated Resources Per Task: PROCS: 1 MEM: 3875M
>> > NodeAccess: SHARED
>> > TasksPerNode: 2 NodeCount: 2
>> >
>> >
>> > IWD: [NONE] Executable: [NONE]
>> > Bypass: 0 StartCount: 0
>> > PartitionMask: [ALL]
>> > SystemQueueTime: Mon May 14 15:14:13
>> >
>> > Flags: RESTARTABLE
>> >
>> > job is deferred. Reason: NoResources (cannot create reservation for
>> > job '62' (intital reservation attempt)
>> > )
>> > Holds: Defer (hold reason: NoResources)
>> > PE: 19.27 StartPriority: 53
>> > cannot select job 62 for partition DEFAULT (job hold active)
>> > ------------------------
>> > checknode of the two nodes:
>> > checking node em-research00
>> > ------------
>> > State: Idle (in current state for 2:31:21)
>> > Configured Resources: PROCS: 3 MEM: 2010M SWAP: 33G DISK: 72G
>> >
>> >
>> > Utilized Resources: DISK: 9907M
>> > Dedicated Resources: [NONE]
>> > Opsys: linux Arch: [NONE]
>> > Speed: 1.00 Load: 0.000
>> > Network: [DEFAULT]
>> > Features: [F]
>> > Attributes: [Batch]
>> > Classes: [batch 3:3]
>> >
>> > Total Time: 2:29:18 Up: 2:29:18 (100.00%) Active: 00:00:00 (0.00%)
>> >
>> > Reservations:
>> > NOTE: no reservations on node
>> >
>> > --------------------
>> > State: Idle (in current state for 2:31:52)
>> > Configured Resources: PROCS: 2 MEM: 2012M SWAP: 17G DISK: 35G
>> > Utilized Resources: DISK: 24G
>> > Dedicated Resources: [NONE]
>> > Opsys: linux Arch: [NONE]
>> > Speed: 1.00 Load: 0.590
>> > Network: [DEFAULT]
>> > Features: [NONE]
>> > Attributes: [Batch]
>> > Classes: [batch 2:2]
>> >
>> > Total Time: 2:29:49 Up: 2:29:49 (100.00%) Active: 00:00:00 (0.00%)
>> >
>> > Reservations:
>> > NOTE: no reservations on node
>> > -----------------
>> > The PBS script I'm using:
>> > #!/bin/bash
>> > #PBS -l nodes=2:ppn=2
>> > #PBS -l walltime=06:00:00
>> > #PBS -l mem=15500mb
>> > #PBS -j oe
>> > # Go to the directory from which you submitted the job
>> > mkdir $PBS_O_WORKDIR
>> > string="$PBS_O_WORKDIR/plus2gb.inp"
>> > scp 10.1.0.52:$string $PBS_O_WORKDIR
>> > #scp 10.1.0.52:$PBS_O_WORKDIR'/'$PBS_JOBNAME ./
>> > cd $PBS_O_WORKDIR
>> > #module load abaqus
>> > #
>> > /Apps/abaqus/Commands/abaqus job=plus2gb queue=cpu2 input=Standard_plus2gbyte.inp cpus=4 mem=15000mb
>> > ---------------------------
>> > If you need some extra info please let me know.
>> >
>> > Thank you
>> >
>> > _______________________________________________
>> > mauiusers mailing list
>> > mauiusers at supercluster.org
>> > http://www.supercluster.org/mailman/listinfo/mauiusers
>> >
>> >
>> >
>> >
>> > --
>> > Regards--
>> > Rishi Pathak
>> > National PARAM Supercomputing Facility
>> > Center for Development of Advanced Computing(C-DAC)
>> > Pune University Campus,Ganesh Khind Road
>> > Pune-Maharastra
>>
>>
>>
>>
>>
>
>
>
>
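
As a rough cross-check of the NoResources deferrals in the quoted checkjob output, the per-task figures can be compared with what checknode reports for each node. This is only a sketch under a strong assumption: a task counts as feasible on a node only if the node's configured PROCS, MEM and SWAP each cover the per-task dedicated amounts that checkjob prints. Maui's real feasibility test looks at more than this (TasksPerNode and node state, for example). The helper names and the "second node" label are placeholders, because the second checknode listing does not show a node name.

```python
# Node figures from the quoted checknode output (MEM in MB, SWAP in GB).
NODES = {
    "em-research00": {"procs": 3, "mem_mb": 2010, "swap_gb": 33},
    "second node": {"procs": 2, "mem_mb": 2012, "swap_gb": 17},
}


def feasible_tasks(node, procs, mem_mb, swap_gb):
    """Tasks that fit on one node if every per-task amount must be dedicated."""
    limits = [node["procs"] // procs]
    if mem_mb:
        limits.append(node["mem_mb"] // mem_mb)
    if swap_gb:
        limits.append(node["swap_gb"] // swap_gb)
    return min(limits)


def total_feasible(procs, mem_mb, swap_gb=0):
    """Sum of feasible tasks over all nodes (simplified model, not Maui's code)."""
    return sum(feasible_tasks(n, procs, mem_mb, swap_gb) for n in NODES.values())


# Job 62: mem=15500mb spread over 4 tasks gives 3875 MB per task, no swap request.
job62 = total_feasible(procs=1, mem_mb=15500 // 4)
# Job 90: PROCS 1, MEM 250M, SWAP 15G per task, as printed by checkjob.
job90 = total_feasible(procs=1, mem_mb=250, swap_gb=15)

print("job 62 feasible tasks:", job62, "of 4")
print("job 90 feasible tasks:", job90, "of 4")
```

Under this simplified model job 62 gets 0 feasible tasks, because 3875M per task exceeds the roughly 2010M of RAM on either node, which fits the earlier advice to request less memory. Job 90 gets 3 of the 4 tasks it needs, because the node with 17G of swap fits only one 15G task.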