[torqueusers] Hanging jobs (queued forever)
Micha
onsager at gmx.net
Fri Apr 11 01:58:08 MDT 2008
@list,
Systems:
openSUSE 10.1 (mixed 32/64-bit)
Some contributed RPMs:
torque-2.3.0-17.1.x86_64.rpm
+ clients, servers, etc.
The problem is that I cannot get even a simple job to run, not even on a
node that is the same machine as the server (which should rule out ssh,
rsh, and access-rights issues). The job sits in the queue forever. I read
about incorrect resource requests and tried many of the qmgr options
without success. A typical example:
---snip---
m7:/home/micha/programming/frame/bin # qstat -f
Job Id: 341.m7.mbee.net
Job_Name = test.sh
Job_Owner = micha at m7.mbee.net
job_state = Q
queue = batch
server = m7.mbee.net
Checkpoint = u
ctime = Fri Apr 11 08:27:32 2008
Error_Path = m7.mbee.net:/home/micha/programming/frame/bin/test.sh.e341
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Fri Apr 11 08:27:32 2008
Output_Path = m7.mbee.net:/home/micha/programming/frame/bin/test.sh.o341
Priority = 0
qtime = Fri Apr 11 08:31:02 2008
Rerunable = True
Resource_List.cput = 00:00:30
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 00:01:00
substate = 10
Variable_List = PBS_O_HOME=/home/micha,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=micha,
PBS_O_PATH=/home/micha/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bi
n:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/jvm/jre/bin:/usr/li
b/mit/bin:/usr/lib/mit/sbin:/usr/lib/qt3/bin:/home/micha/programming/f
rame/bin:/home/micha/programming/frame/bin,
PBS_O_MAIL=/var/spool/mail/micha,PBS_O_SHELL=/bin/bash,
PBS_SERVER=m7.mbee.net,PBS_O_HOST=m7.mbee.net,
PBS_O_WORKDIR=/home/micha/programming/frame/bin,PBS_O_QUEUE=batch
euser = micha
egroup = users
queue_rank = 13
queue_type = E
etime = Fri Apr 11 08:27:32 2008
submit_args = -l nodes=1,walltime=00:01:00,cput=00:00:30 test.sh
m7:/home/micha/programming/frame/bin # qmgr -c "list queue batch"
Queue batch
queue_type = Execution
total_jobs = 1
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
resources_default.nodes = 1
mtime = Mon Apr 7 14:39:24 2008
enabled = True
started = True
m7:/home/micha/programming/frame/bin # qmgr -c "list server"
Server m7.mbee.net
server_state = Idle
total_jobs = 1
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
acl_hosts = m7
default_queue = batch
log_events = 511
mail_from = adm
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
pbs_version = 2.3.0
next_job_number = 342
net_counter = 1 0 0
m7:/home/micha/programming/frame/bin # qmgr -c "list node m7"
Node m7
state = free
np = 2
ntype = cluster
status = opsys=linux,
uname=Linux m7 2.6.16.21-0.25-smp #1 SMP Tue Sep 19 07:26:15 UTC 2006 x86_64,
sessions=4271 4319 3063 11722 19799,nsessions=5,nusers=2,
idletime=1,totmem=961176kb,availmem=793120kb,physmem=382876kb,
ncpus=? 0,loadave=0.00,netload=1583429,state=free,jobs=,
varattr=,rectime=1207898179
m7:/home/micha/programming/frame/bin #
---snap--
The result is the same with other, fairly standard, submit arguments (as seen in qstat).
BTW, all is well on Gentoo systems with an older torque version (2.1.6).
Besides this particular problem, what can I do in general to diagnose
this kind of issue? I raised the debug levels (as advised in the FAQ) to
the highest possible level, but I still cannot gather any further information.
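As one possible starting point for the general question, here is a minimal sketch (in Python, not part of the setup described above) that scans the `list server` output quoted earlier for two things: whether `server_state` is `Idle`, and whether a `scheduling` attribute is present at all. The function names are hypothetical. The reading that an `Idle` server without `scheduling = True` does not hand queued jobs to a scheduler is an assumption and should be checked against the TORQUE documentation for this version.

```python
def parse_qmgr_listing(text):
    """Collect 'name = value' lines from qmgr list output into a dict."""
    attrs = {}
    for line in text.splitlines():
        # Split on the first " = " only; header lines such as
        # "Server m7.mbee.net" have no separator and are skipped.
        name, sep, value = line.partition(" = ")
        if sep:
            attrs[name.strip()] = value.strip()
    return attrs


def scheduling_hints(server_attrs):
    """Return human-readable hints (assumed meanings, not verified)."""
    hints = []
    if server_attrs.get("server_state") == "Idle":
        hints.append("server_state is Idle")
    if server_attrs.get("scheduling", "").lower() != "true":
        hints.append("scheduling is not set to True")
    return hints


# Server attributes copied from the 'qmgr -c "list server"' output in this post.
listing = """Server m7.mbee.net
server_state = Idle
total_jobs = 1
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
acl_hosts = m7
default_queue = batch
log_events = 511
mail_from = adm
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
pbs_version = 2.3.0
next_job_number = 342
net_counter = 1 0 0
"""

if __name__ == "__main__":
    for hint in scheduling_hints(parse_qmgr_listing(listing)):
        print(hint)
```

If both hints are reported, it may be worth checking that a scheduler daemon such as pbs_sched is actually running, trying `qmgr -c "set server scheduling = true"`, or forcing the queued job with `qrun` to see whether the node itself accepts work. None of this is confirmed as the cause here.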
Micha