[torqueusers] Jobs in Q status

diego bacchin diego.bacchin at bmr-genomics.it
Thu Dec 20 09:47:55 MST 2012


Hi to all!
I have a cluster with 32 nodes (b001-b032).
When a user launches 25-30 jobs together, sometimes one job remains in Q 
status and never starts, although there are free nodes.
I have tried to move the job to another queue, but the problem remains.

This is the log:
12/19/2012 15:24:41;0100;PBS_Server;Job;54061.master.nfs;enqueuing into 
queuename, state 1 hop 1
12/19/2012 15:24:41;0008;PBS_Server;Job;54061.master.nfs;Job Queued at 
request of user at b032.nfs, owner = user at b032.nfs, job name = 4_2_SM_var, 
queue = queuename
12/19/2012 15:24:52;0008;PBS_Server;Job;54061.master.nfs;could not 
locate requested resources 'b013:ppn=12' (node_spec failed) cannot 
allocate node 'b013' to job - node not currently available (nps 
needed/free: 12/0, gpus needed/free: 0/0, joblist: 
54060.master.nfs:0,54060.master.nfs:1,54060.master.nfs:2,54060.master.nfs:3,54060.master.nfs:4,54060.master.nfs:5,54060.master.nfs:6,54060.master.nfs:7,54060.master.nfs:8,54060.master.nfs:9,54060.master.nfs:10,54060.master.nfs:11)
# I have tried to requeue the job in the same queue
12/19/2012 15:51:20;0100;PBS_Server;Job;54061.master.nfs;dequeuing from 
queuename, state QUEUED
12/19/2012 15:51:20;0100;PBS_Server;Job;54061.master.nfs;enqueuing into 
queuename, state 1 hop 1
12/19/2012 15:51:20;0008;PBS_Server;Job;54061.master.nfs;Job moved to 
queuename at request of root at master.nfs
12/19/2012 15:51:31;0100;PBS_Server;Job;54061.master.nfs;dequeuing from 
queuename, state QUEUED
# I have tried to requeue the job in another queue with the same nodes 
but different policies
12/19/2012 15:51:31;0100;PBS_Server;Job;54061.master.nfs;enqueuing into 
queuename2, state 1 hop 1
12/19/2012 15:51:31;0008;PBS_Server;Job;54061.master.nfs;Job moved to 
queuename2 at request of root at master.nfs
# I have tried to requeue the job in a third queue with different nodes
12/20/2012 12:21:20;0100;PBS_Server;Job;54061.master.nfs;dequeuing from 
queuename2, state QUEUED
12/20/2012 12:21:20;0100;PBS_Server;Job;54061.master.nfs;enqueuing into 
fatnode, state 1 hop 1
12/20/2012 12:21:20;0008;PBS_Server;Job;54061.master.nfs;Job moved to 
fatnode at request of root at master.nfs
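The decisive line in the log above seems to be the node_spec failure: the scheduler asked b013 for 12 processors, but 0 were free (all twelve slots were held by job 54060). A quick sketch that decodes that field from the log text (the helper function is hypothetical, not part of TORQUE):

```python
import re

# The relevant fragment of the 15:24:52 server-log line quoted above
log = ("could not locate requested resources 'b013:ppn=12' (node_spec failed) "
       "cannot allocate node 'b013' to job - node not currently available "
       "(nps needed/free: 12/0, gpus needed/free: 0/0)")

def parse_nps(line):
    """Extract (needed, free) processor counts from a node_spec failure line."""
    m = re.search(r"nps needed/free: (\d+)/(\d+)", line)
    return (int(m.group(1)), int(m.group(2))) if m else None

needed, free = parse_nps(log)
print(needed, free)  # 12 needed, 0 free: b013 had no slots for this request
```

So the failed allocation itself is expected at 15:24:52; the question is why the job was never retried on a node that later became free.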

qstat -f 54061.master.nfs
Job Id: 54061.master.nfs
     Job_Name = 4_2_SM_var
     Job_Owner = user at b032.nfs
     job_state = Q
     queue = fatnode
     server = master.nfs
     Checkpoint = u
     ctime = Wed Dec 19 15:24:41 2012
     Error_Path = b032.nfs:/lustre/projects/USER_data/4517695bf8edbc2
     e54ce6d59ab799f27/vcf/2_SM/4_2_SM_var.e54061
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = a
     mtime = Wed Dec 19 15:24:41 2012
     Output_Path = b032.nfs:/lustre/projects/USER_data/4517695bf8edbc
     2e54ce6d59ab799f27/vcf/2_SM/4_2_SM_var.o54061
     Priority = 0
     qtime = Thu Dec 20 12:21:20 2012
     Rerunable = True
     Resource_List.neednodes = 1:ppn=12
     Resource_List.nodect = 1
     Resource_List.nodes = 1:ppn=12
     Resource_List.walltime = 01:00:00
     substate = 10
     Variable_List = PBS_O_QUEUE=queuename,PBS_O_HOME=/u/user,
     PBS_O_LANG=C,PBS_O_LOGNAME=user,
PBS_O_PATH=/opt/mysql/client:/opt/python/2.6.7/bin:/opt/torque//bin:/opt/maui/bin:/bin:/usr/bin
     :/usr/local/sbin:/usr/sbin:/sbin:/u/user/bin,
     PBS_O_MAIL=/var/spool/mail/user,PBS_O_SHELL=/bin/bash,
     PBS_O_HOST=b032.nfs,PBS_SERVER=master.nfs,
     PBS_O_WORKDIR=/lustre/projects/USER_data/4517695bf8edbc2e54
     ce6d59ab799f27/vcf/2_SM
     euser = user
     egroup = group
     queue_rank = 2531
     queue_type = E
     etime = Wed Dec 19 15:24:41 2012
     submit_args = 4.variation.job
     fault_tolerant = False
     submit_host = b032.nfs
     init_work_dir = /lustre/projects/USER_data/4517695bf8edbc2e54ce6
     d59ab799f27/vcf/2_SM

Qmgr: list queue queuename
     queue_type = Execution
     total_jobs = 7
     state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:7 Exiting:0
     resources_max.nodect = 32
     resources_max.walltime = 96:00:00
     resources_min.nodect = 1
     resources_default.neednodes = bladenoht
     resources_default.walltime = 01:00:00
     acl_group_enable = True
     acl_groups = cribi
     mtime = Wed Sep 26 19:23:20 2012
     resources_assigned.nodect = 7
     enabled = True
     started = True

Qmgr: list server
Server master.nfs
     server_state = Active
     scheduling = True
     total_jobs = 18
     state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:16 Exiting:0
     acl_hosts = master.nfs,master,master.sp,localhost.localdomain
     acl_roots = root at master.nfs,root at master,root at master.sp
     managers = maui at master.nfs,maui at localhost.localdomain,root at master.nfs,
                    root at localhost.localdomain
     default_queue = queuename
     log_events = 511
     mail_from = adm
     query_other_jobs = True
     resources_default.walltime = 00:01:00
     resources_assigned.ncpus = 1
     resources_assigned.nodect = 16
     scheduler_iteration = 600
     node_check_rate = 150
     tcp_timeout = 6
     node_pack = True
     mom_job_sync = True
     pbs_version = 3.0.2
     kill_delay = 300
     keep_completed = 600
     submit_hosts = master2.nfs
     allow_node_submit = True
     next_job_number = 54163
     net_counter = 93 48 24

Any suggestion?
Thanks in advance

-- 
Diego Bacchin
IT System Administrator at
  BMR Genomics srl - Via Redipuglia, 19 - PADOVA (PD) - Italy
  CRIBI - University of Padova - Via U. Bassi, 58 - PADOVA (PD) - Italy
diego at bmr-genomics.it - diego.bacchin at cribi.unipd.it
366 72 97 232
