[torqueusers] Starving job problem

Prakash Velayutham velayups at email.uc.edu
Wed Oct 19 07:57:10 MDT 2005


Thanks. But my question is whether Torque tries to drain only those 
resources that can satisfy the starving job's requirements, or whether 
it drains all resources, period.

Prakash

Thomas Dargel wrote:

>Hi Prakash,
>
>two settings can be adjusted in the sched_config file to modify
>the starving behavior of your queueing system:
>
>1) raise the time after which a job is considered starving:
>max_starve: 24:00:00 (default: 24h --> e.g. 240:00:00)
>
>or
>
>2) switch starving handling completely off:
>
>   change
>help_starving_jobs    true    ALL
>
>   to
>help_starving_jobs      false   ALL
>
>In both cases you have to restart your pbs_sched.
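For reference, the two alternatives described above would look like this in sched_config (a sketch only; whitespace and the exact location of the file vary by install, commonly under the sched_priv directory of your Torque spool):

```
# Alternative 1: only consider a job starving after 240 hours
max_starve: 240:00:00

# Alternative 2: disable starving-job handling entirely
help_starving_jobs      false   ALL
```

Remember that pbs_sched only reads this file at startup, hence the restart Thomas mentions.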
>
>Good luck, I hope this solves your problem.
>best regards,
>
> Thomas.
>
>On Tue, Oct 18, 2005 at 05:50:49PM -0400, Prakash Velayutham wrote:
>  
>
>>Hi,
>>
>>I have a heterogeneous cluster with 4 opteron and 2 xeon compute nodes. 
>>This is the issue I am seeing:
>>
>>1. Three jobs with "-lnodes=1:ppn=2:opteron" are running on 3 opterons.
>>2. One job with "-lnodes=1" is running on the remaining opteron node 
>>(with 1 free CPU).
>>3. One job with "-lnodes=1:ppn=2:opteron" is in queue as all the 
>>opterons are taken. This job has been in queue for over a day now and 
>>hence is officially a starving job.
>>4. One job (submitted after the starving job) is in queue with 
>>"-lnodes=1".
>>5. One job (submitted after the starving job) is in queue with 
>>"-lnodes=1:xeon".
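The submissions in the list above correspond to qsub resource requests along these lines (a sketch; the script names are placeholders, not from the original mail):

```
qsub -l nodes=1:ppn=2:opteron run_a.sh   # jobs 1 and 3: two CPUs on an opteron node
qsub -l nodes=1               run_b.sh   # jobs 2 and 4: any single node, no property
qsub -l nodes=1:xeon          run_c.sh   # job 5: any single node with the xeon property
```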
>>
>>The output of qstat -f shows:
>>
>>Job Id: 584.x.x.x
>>   Job_Name = xxxxxxxxxxx
>>   Job_Owner = x at x.x.x
>>   job_state = Q
>>   queue = users
>>   server = x.x.x
>>   Checkpoint = u
>>   ctime = Mon Oct 17 13:45:29 2005
>>   Error_Path = x.x.x:/x/x
>>   Hold_Types = n
>>   Join_Path = n
>>   Keep_Files = n
>>   Mail_Points = a
>>   mtime = Mon Oct 17 13:45:30 2005
>>   Output_Path = x.x.x:/x/x
>>   Priority = 0
>>   qtime = Mon Oct 17 13:45:29 2005
>>   Rerunable = True
>>   Resource_List.nodect = 1
>>   Resource_List.nodes = 1:ppn=2:opteron
>>   comment = Not Running: Not enough of the right type of nodes are 
>>available
>>   etime = Mon Oct 17 13:45:29 2005
>>
>>Job Id: 591.x.x.x
>>   Job_Name = xxxxxxx
>>   Job_Owner = x at x.x.x
>>   job_state = Q
>>   queue = users
>>   server = x.x.x
>>   Checkpoint = u
>>   ctime = Tue Oct 18 15:57:37 2005
>>   Error_Path = x.x.x:/x/x/x
>>   Hold_Types = n
>>   Join_Path = n
>>   Keep_Files = n
>>   Mail_Points = a
>>   mtime = Tue Oct 18 15:57:37 2005
>>   Output_Path = x.x.x:/x/x/x
>>   Priority = 0
>>   qtime = Tue Oct 18 15:57:37 2005
>>   Rerunable = True
>>   Resource_List.nodect = 1
>>   Resource_List.nodes = 1
>>   comment = Not Running: Draining system to allow starving job to run
>>   etime = Tue Oct 18 15:57:37 2005
>>
>>Job Id: 594.x.x.x
>>   Job_Name = xxxxxxx
>>   Job_Owner = x at x.x.x
>>   job_state = Q
>>   queue = users
>>   server = x.x.x
>>   Checkpoint = u
>>   ctime = Tue Oct 18 17:09:13 2005
>>   Error_Path = x.x.x:/x/x/x
>>   Hold_Types = n
>>   Join_Path = n
>>   Keep_Files = n
>>   Mail_Points = a
>>   mtime = Tue Oct 18 17:09:14 2005
>>   Output_Path = x.x.x:/x/x/x
>>   Priority = 0
>>   qtime = Tue Oct 18 17:09:13 2005
>>   Rerunable = True
>>   Resource_List.nodect = 1
>>   Resource_List.nodes = 1:xeon
>>   comment = Not Running: Draining system to allow starving job to run
>>   etime = Tue Oct 18 17:09:13 2005
>>
>>Why are the last 2 jobs not running on the xeon nodes? Neither of them 
>>requests opterons, so why should the xeon nodes be drained to run job 
>>584, which only requires opteron nodes?
>>
>>Any suggestions greatly appreciated.
>>
>>Prakash
>>
