[torqueusers] Jobs in Q status
diego bacchin
diego.bacchin at bmr-genomics.it
Thu Dec 20 09:47:55 MST 2012
Hi all,
I have a cluster with 32 nodes, b001 to b032.
When a user launches 25-30 jobs together, sometimes one job remains in Q
status and never starts, although there are free nodes.
I have tried moving the job to other queues, but the problem remains.
This is the log:
12/19/2012 15:24:41;0100;PBS_Server;Job;54061.master.nfs;enqueuing into
queuename, state 1 hop 1
12/19/2012 15:24:41;0008;PBS_Server;Job;54061.master.nfs;Job Queued at
request of user@b032.nfs, owner = user@b032.nfs, job name = 4_2_SM_var,
queue = queuename
12/19/2012 15:24:52;0008;PBS_Server;Job;54061.master.nfs;could not
locate requested resources 'b013:ppn=12' (node_spec failed) cannot
allocate node 'b013' to job - node not currently available (nps
needed/free: 12/0, gpus needed/free: 0/0, joblist:
54060.master.nfs:0,54060.master.nfs:1,54060.master.nfs:2,54060.master.nfs:3,54060.master.nfs:4,54060.master.nfs:5,54060.master.nfs:6,54060.master.nfs:7,54060.master.nfs:8,54060.master.nfs:9,54060.master.nfs:10,54060.master.nfs:11)
# I have tried to requeue the job in the same queue
12/19/2012 15:51:20;0100;PBS_Server;Job;54061.master.nfs;dequeuing from
queuename, state QUEUED
12/19/2012 15:51:20;0100;PBS_Server;Job;54061.master.nfs;enqueuing into
queuename, state 1 hop 1
12/19/2012 15:51:20;0008;PBS_Server;Job;54061.master.nfs;Job moved to
queuename at request of root@master.nfs
12/19/2012 15:51:31;0100;PBS_Server;Job;54061.master.nfs;dequeuing from
queuename, state QUEUED
# I have tried to requeue the job in another queue with the same nodes
# but different policies
12/19/2012 15:51:31;0100;PBS_Server;Job;54061.master.nfs;enqueuing into
queuename2, state 1 hop 1
12/19/2012 15:51:31;0008;PBS_Server;Job;54061.master.nfs;Job moved to
queuename2 at request of root@master.nfs
# I have tried to requeue the job in a third queue with different nodes
12/20/2012 12:21:20;0100;PBS_Server;Job;54061.master.nfs;dequeuing from
queuename2, state QUEUED
12/20/2012 12:21:20;0100;PBS_Server;Job;54061.master.nfs;enqueuing into
fatnode, state 1 hop 1
12/20/2012 12:21:20;0008;PBS_Server;Job;54061.master.nfs;Job moved to
fatnode at request of root@master.nfs
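To make the "node_spec failed" message above easier to read, here is a minimal Python sketch that extracts the requested node spec, the needed/free core counts and the job ids occupying the node. The regular expression is based only on the wording of the log line quoted above, so it is an assumption that other TORQUE versions format the message the same way. The function name is hypothetical, and the sample string is the quoted message with its joblist shortened.

```python
import re

# Pattern for the 'node_spec failed' message written by pbs_server.
# Field layout is taken from the log line quoted in this post; other
# TORQUE versions may word it differently (assumption).
NODE_SPEC_RE = re.compile(
    r"could not\s+locate requested resources '(?P<spec>[^']+)'"
    r".*?nps\s+needed/free:\s*(?P<needed>\d+)/(?P<free>\d+)"
    r".*?joblist:\s*(?P<joblist>[^)]*)\)",
    re.DOTALL,  # the message is wrapped over several lines in the mail
)


def parse_node_spec_failure(message):
    """Return spec, needed/free core counts and blocking job ids, or None."""
    m = NODE_SPEC_RE.search(message)
    if m is None:
        return None
    # Joblist entries look like '54060.master.nfs:0'; drop the slot index.
    entries = [e.strip() for e in m.group("joblist").split(",") if e.strip()]
    jobs = sorted({e.rsplit(":", 1)[0] for e in entries})
    return {
        "spec": m.group("spec"),
        "needed": int(m.group("needed")),
        "free": int(m.group("free")),
        "blocking_jobs": jobs,
    }


# Message text from the log above, joblist shortened to two slots.
sample = (
    "could not locate requested resources 'b013:ppn=12' (node_spec failed) "
    "cannot allocate node 'b013' to job - node not currently available "
    "(nps needed/free: 12/0, gpus needed/free: 0/0, joblist: "
    "54060.master.nfs:0,54060.master.nfs:1)"
)
print(parse_node_spec_failure(sample))
```

For the line in my log this shows that the job was tried on b013 with ppn=12 while job 54060 held all 12 slots there, so 0 were free at that moment.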
qstat -f 54061.master.nfs
Job Id: 54061.master.nfs
Job_Name = 4_2_SM_var
Job_Owner = user@b032.nfs
job_state = Q
queue = fatnode
server = master.nfs
Checkpoint = u
ctime = Wed Dec 19 15:24:41 2012
Error_Path = b032.nfs:/lustre/projects/USER_data/4517695bf8edbc2
e54ce6d59ab799f27/vcf/2_SM/4_2_SM_var.e54061
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Dec 19 15:24:41 2012
Output_Path = b032.nfs:/lustre/projects/USER_data/4517695bf8edbc
2e54ce6d59ab799f27/vcf/2_SM/4_2_SM_var.o54061
Priority = 0
qtime = Thu Dec 20 12:21:20 2012
Rerunable = True
Resource_List.neednodes = 1:ppn=12
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=12
Resource_List.walltime = 01:00:00
substate = 10
Variable_List = PBS_O_QUEUE=queuename,PBS_O_HOME=/u/user,
PBS_O_LANG=C,PBS_O_LOGNAME=user,
PBS_O_PATH=/opt/mysql/client:/opt/python/2.6.7/bin:/opt/torque//bin:/opt/maui/bin:/bin:/usr/bin
:/usr/local/sbin:/usr/sbin:/sbin:/u/user/bin,
PBS_O_MAIL=/var/spool/mail/user,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=b032.nfs,PBS_SERVER=master.nfs,
PBS_O_WORKDIR=/lustre/projects/USER_data/4517695bf8edbc2e54
ce6d59ab799f27/vcf/2_SM
euser = user
egroup = group
queue_rank = 2531
queue_type = E
etime = Wed Dec 19 15:24:41 2012
submit_args = 4.variation.job
fault_tolerant = False
submit_host = b032.nfs
init_work_dir = /lustre/projects/USER_data/4517695bf8edbc2e54ce6
d59ab799f27/vcf/2_SM
Qmgr: list queue queuename
queue_type = Execution
total_jobs = 7
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:7 Exiting:0
resources_max.nodect = 32
resources_max.walltime = 96:00:00
resources_min.nodect = 1
resources_default.neednodes = bladenoht
resources_default.walltime = 01:00:00
acl_group_enable = True
acl_groups = cribi
mtime = Wed Sep 26 19:23:20 2012
resources_assigned.nodect = 7
enabled = True
started = True
Qmgr: list server
Server master.nfs
server_state = Active
scheduling = True
total_jobs = 18
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:16 Exiting:0
acl_hosts = master.nfs,master,master.sp,localhost.localdomain
acl_roots = root@master.nfs,root@master,root@master.sp
managers = maui@master.nfs,maui@localhost.localdomain,root@master.nfs,
root@localhost.localdomain
default_queue = queuename
log_events = 511
mail_from = adm
query_other_jobs = True
resources_default.walltime = 00:01:00
resources_assigned.ncpus = 1
resources_assigned.nodect = 16
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
node_pack = True
mom_job_sync = True
pbs_version = 3.0.2
kill_delay = 300
keep_completed = 600
submit_hosts = master2.nfs
allow_node_submit = True
next_job_number = 54163
net_counter = 93 48 24
Any suggestions?
Thanks in advance
--
Diego Bacchin
IT System Administrator at
BMR Genomics srl - Via Redipuglia, 19 - PADOVA (PD) - Italy
CRIBI - University of Padova - Via U. Bassi, 58 - PADOVA (PD) - Italy
diego at bmr-genomics.it - diego.bacchin at cribi.unipd.it
366 72 97 232