[torqueusers] Starving job problem
Prakash Velayutham
velayups at email.uc.edu
Tue Oct 18 15:50:49 MDT 2005
Hi,
I have a heterogeneous cluster with 4 opteron and 2 xeon compute nodes.
This is the scheduling behavior I am seeing:
1. Three jobs with "-lnodes=1:ppn=2:opteron" are running on 3 of the
opteron nodes.
2. One job with "-lnodes=1" is running on the remaining opteron node
(which leaves that node with 1 free CPU).
3. One job with "-lnodes=1:ppn=2:opteron" is queued because all the
opterons are taken. This job has been in the queue for over a day now
and hence is officially a starving job.
4. One job (submitted after the starving job) is queued with
"-lnodes=1".
5. One job (submitted after the starving job) is queued with
"-lnodes=1:xeon".
The output of qstat -f for the three queued jobs shows:
Job Id: 584.x.x.x
Job_Name = xxxxxxxxxxx
Job_Owner = x at x.x.x
job_state = Q
queue = users
server = x.x.x
Checkpoint = u
ctime = Mon Oct 17 13:45:29 2005
Error_Path = x.x.x:/x/x
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Oct 17 13:45:30 2005
Output_Path = x.x.x:/x/x
Priority = 0
qtime = Mon Oct 17 13:45:29 2005
Rerunable = True
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=2:opteron
comment = Not Running: Not enough of the right type of nodes are
available
etime = Mon Oct 17 13:45:29 2005
Job Id: 591.x.x.x
Job_Name = xxxxxxx
Job_Owner = x at x.x.x
job_state = Q
queue = users
server = x.x.x
Checkpoint = u
ctime = Tue Oct 18 15:57:37 2005
Error_Path = x.x.x:/x/x/x
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Tue Oct 18 15:57:37 2005
Output_Path = x.x.x:/x/x/x
Priority = 0
qtime = Tue Oct 18 15:57:37 2005
Rerunable = True
Resource_List.nodect = 1
Resource_List.nodes = 1
comment = Not Running: Draining system to allow starving job to run
etime = Tue Oct 18 15:57:37 2005
Job Id: 594.x.x.x
Job_Name = xxxxxxx
Job_Owner = x at x.x.x
job_state = Q
queue = users
server = x.x.x
Checkpoint = u
ctime = Tue Oct 18 17:09:13 2005
Error_Path = x.x.x:/x/x/x
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Tue Oct 18 17:09:14 2005
Output_Path = x.x.x:/x/x/x
Priority = 0
qtime = Tue Oct 18 17:09:13 2005
Rerunable = True
Resource_List.nodect = 1
Resource_List.nodes = 1:xeon
comment = Not Running: Draining system to allow starving job to run
etime = Tue Oct 18 17:09:13 2005
Why are the last 2 jobs (591 and 594) not running on the xeon nodes?
Neither of them asked for opteron nodes, so why should the xeon nodes be
drained for job 584, which can only run on opteron nodes?
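To make the question concrete, here is a rough Python sketch of the state described above. It only models which nodes each queued request could match. The node names, the assumption that each xeon node has 2 free CPUs, and the matching rule itself are my own illustration and not how the scheduler actually decides.

```python
# Illustrative model only: node names, xeon CPU counts and the matching
# rule are assumptions, not TORQUE or scheduler internals.

# node name -> (node properties, free CPUs)
NODES = {
    "opteron1": ({"opteron"}, 0),  # busy with a ppn=2 job
    "opteron2": ({"opteron"}, 0),  # busy with a ppn=2 job
    "opteron3": ({"opteron"}, 0),  # busy with a ppn=2 job
    "opteron4": ({"opteron"}, 1),  # one "-lnodes=1" job, 1 CPU free
    "xeon1": ({"xeon"}, 2),        # assumed idle
    "xeon2": ({"xeon"}, 2),        # assumed idle
}

# Queued jobs and their Resource_List.nodes values from qstat -f.
QUEUED = {"584": "1:ppn=2:opteron", "591": "1", "594": "1:xeon"}


def parse_spec(spec):
    """Parse a single-chunk request such as '1:ppn=2:opteron'."""
    count, *rest = spec.split(":")
    ppn = 1
    props = set()
    for part in rest:
        if part.startswith("ppn="):
            ppn = int(part[len("ppn="):])
        else:
            props.add(part)
    return int(count), ppn, props


def matching_nodes(spec):
    """Return the nodes with the requested properties and enough free CPUs."""
    _count, ppn, props = parse_spec(spec)  # every request here asks for 1 node
    return sorted(
        name
        for name, (node_props, free) in NODES.items()
        if props <= node_props and free >= ppn
    )


for job, spec in QUEUED.items():
    print(job, spec, matching_nodes(spec))
```

Under these assumptions job 584 matches no node at the moment, while 591 and 594 both match the idle xeon nodes, which is why I expected them to start.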
Any suggestions greatly appreciated.
Prakash