[torqueusers] Starving job problem

Prakash Velayutham velayups at email.uc.edu
Tue Oct 18 15:50:49 MDT 2005


Hi,

I have a heterogeneous cluster with 4 Opteron and 2 Xeon compute nodes. 
This is the situation I am seeing:

1. Three jobs with "-lnodes=1:ppn=2:opteron" are running on three of 
the Opteron nodes.
2. One job with "-lnodes=1" is running on the remaining Opteron node, 
which leaves 1 CPU free on that node.
3. One job with "-lnodes=1:ppn=2:opteron" is queued because all the 
Opteron nodes are taken. This job has been in the queue for over a day 
now and hence is officially a starving job.
4. One job (submitted after the starving job) is queued with 
"-lnodes=1".
5. One job (submitted after the starving job) is queued with 
"-lnodes=1:xeon".

The output of qstat -f shows:

Job Id: 584.x.x.x
    Job_Name = xxxxxxxxxxx
    Job_Owner = x at x.x.x
    job_state = Q
    queue = users
    server = x.x.x
    Checkpoint = u
    ctime = Mon Oct 17 13:45:29 2005
    Error_Path = x.x.x:/x/x
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Oct 17 13:45:30 2005
    Output_Path = x.x.x:/x/x
    Priority = 0
    qtime = Mon Oct 17 13:45:29 2005
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=2:opteron
    comment = Not Running: Not enough of the right type of nodes are available
    etime = Mon Oct 17 13:45:29 2005

Job Id: 591.x.x.x
    Job_Name = xxxxxxx
    Job_Owner = x at x.x.x
    job_state = Q
    queue = users
    server = x.x.x
    Checkpoint = u
    ctime = Tue Oct 18 15:57:37 2005
    Error_Path = x.x.x:/x/x/x
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Oct 18 15:57:37 2005
    Output_Path = x.x.x:/x/x/x
    Priority = 0
    qtime = Tue Oct 18 15:57:37 2005
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    comment = Not Running: Draining system to allow starving job to run
    etime = Tue Oct 18 15:57:37 2005

Job Id: 594.x.x.x
    Job_Name = xxxxxxx
    Job_Owner = x at x.x.x
    job_state = Q
    queue = users
    server = x.x.x
    Checkpoint = u
    ctime = Tue Oct 18 17:09:13 2005
    Error_Path = x.x.x:/x/x/x
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Oct 18 17:09:14 2005
    Output_Path = x.x.x:/x/x/x
    Priority = 0
    qtime = Tue Oct 18 17:09:13 2005
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1:xeon
    comment = Not Running: Draining system to allow starving job to run
    etime = Tue Oct 18 17:09:13 2005
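
For anyone who wants to pull just the relevant fields out of a listing 
like the one above, here is a minimal Python sketch. It is not part of 
TORQUE; the function name parse_qstat_full and the trimmed sample text 
are assumptions made for illustration. It treats any line without " = " 
as the wrapped tail of the previous value, which can happen when output 
is pasted into mail.

```python
def parse_qstat_full(text):
    """Parse `qstat -f`-style text into {job_id: {attribute: value}}."""
    jobs = {}
    attrs = None      # attributes of the job currently being read
    last_key = None   # most recent attribute, for wrapped continuation lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Job Id:"):
            attrs = jobs.setdefault(line.split(":", 1)[1].strip(), {})
            last_key = None
        elif attrs is None:
            continue  # ignore anything before the first "Job Id:" line
        elif " = " in line:
            key, value = line.split(" = ", 1)
            last_key = key.strip()
            attrs[last_key] = value.strip()
        elif last_key is not None:
            # Assumption: a line without " = " continues the previous value.
            attrs[last_key] += " " + line
    return jobs


# Trimmed, hypothetical sample in the same shape as the listing above.
sample = """\
Job Id: 584.x.x.x
    job_state = Q
    Resource_List.nodes = 1:ppn=2:opteron
    comment = Not Running: Not enough of the right type of nodes are 
available
Job Id: 594.x.x.x
    job_state = Q
    Resource_List.nodes = 1:xeon
    comment = Not Running: Draining system to allow starving job to run
"""

for job_id, attrs in parse_qstat_full(sample).items():
    if attrs.get("job_state") == "Q":
        print(job_id, attrs["Resource_List.nodes"], "|", attrs["comment"])
```

Printing only the node request and the scheduler comment for each 
queued job makes the mismatch easier to see at a glance.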

Why are the last two jobs (591 and 594) not running on the Xeon nodes? 
Neither of them asks for Opteron nodes, and job 594 explicitly requests 
a Xeon node, so why should the Xeon nodes be drained for job 584, which 
can only run on an Opteron node?
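
To make the expectation concrete, here is a small Python sketch of the 
behaviour I would expect. It is only a model of the reasoning in the 
question, not how the scheduler is implemented. The node names, the 
helper functions, and the assumption that the Xeon nodes have 2 CPUs 
each are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    properties: frozenset
    total_cpus: int
    free_cpus: int


def parse_nodes_spec(spec):
    """Split a single-chunk spec such as '1:ppn=2:opteron' into (count, ppn, properties)."""
    parts = spec.split(":")
    count = int(parts[0])
    ppn = 1
    props = set()
    for part in parts[1:]:
        if part.startswith("ppn="):
            ppn = int(part[4:])
        else:
            props.add(part)
    return count, ppn, props


def candidate_nodes(spec, nodes):
    """Names of nodes that could ever satisfy the spec, ignoring current load."""
    _, ppn, props = parse_nodes_spec(spec)
    return {n.name for n in nodes if props <= n.properties and n.total_cpus >= ppn}


def runnable_now(spec, nodes):
    """True if enough nodes currently have the requested properties and free CPUs."""
    count, ppn, props = parse_nodes_spec(spec)
    fits = [n for n in nodes if props <= n.properties and n.free_cpus >= ppn]
    return len(fits) >= count


def can_run_without_delaying(spec, starving_spec, nodes):
    """True if the job fits on nodes the starving job could never use."""
    reserved = candidate_nodes(starving_spec, nodes)
    others = [n for n in nodes if n.name not in reserved]
    return runnable_now(spec, others)


# Hypothetical cluster state matching the description in this message.
opteron = frozenset({"opteron"})
xeon = frozenset({"xeon"})
cluster = [
    Node("opt1", opteron, 2, 0),
    Node("opt2", opteron, 2, 0),
    Node("opt3", opteron, 2, 0),
    Node("opt4", opteron, 2, 1),   # the "-lnodes=1" job leaves 1 CPU free
    Node("xeon1", xeon, 2, 2),     # assumed 2 CPUs per Xeon node
    Node("xeon2", xeon, 2, 2),
]

starving = "1:ppn=2:opteron"       # job 584
for job_id, spec in [("591", "1"), ("594", "1:xeon")]:
    print(job_id, can_run_without_delaying(spec, starving, cluster))
```

Under these assumptions both later jobs fit on the Xeon nodes, which 
job 584 can never use, so starting them should not delay it.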

Any suggestions greatly appreciated.

Prakash

