[torqueusers] jobs stuck in Q
Greenseid, Joseph M.
Joseph.Greenseid at ngc.com
Mon Dec 15 09:39:02 MST 2008
What scheduler are you using? Are you using TORQUE's own scheduler, Maui, or something else?
--Joe
________________________________
From: torqueusers-bounces at supercluster.org on behalf of Adrian Sevcenco
Sent: Mon 12/15/2008 10:40 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] jobs stuck in Q
Hi,
I have a server on which jobs are stuck in the queue. This is the output
from qstat -f:
Job Id: 2.grid01.spacescience.ro
Job_Name = STDIN
Job_Owner = alice001@grid01.spacescience.ro
job_state = Q
queue = alice
server = grid01.spacescience.ro
Checkpoint = u
ctime = Mon Dec 15 17:19:39 2008
Error_Path = grid01.spacescience.ro:/home/alice001/STDIN.e2
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Dec 15 17:20:22 2008
Output_Path = grid01.spacescience.ro:/home/alice001/STDIN.o2
Priority = 0
qtime = Mon Dec 15 17:20:49 2008
Rerunable = True
Resource_List.cput = 48:00:00
Resource_List.walltime = 72:00:00
Variable_List = PBS_O_HOME=/home/alice001,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=alice001,
PBS_O_PATH=/usr/kerberos/bin:/opt/edg/bin:/opt/glite/bin:/opt/lcg/bin
:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/alice001/bin,
PBS_O_MAIL=/var/spool/mail/alice001,PBS_O_SHELL=/bin/bash,
PBS_SERVER=grid01.spacescience.ro,PBS_O_HOST=grid01.spacescience.ro,
PBS_O_WORKDIR=/home/alice001,PBS_O_QUEUE=alice
etime = Mon Dec 15 17:20:49 2008
submit_args = -q alice
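The record above can be checked mechanically. What follows is a minimal sketch in Python, not part of TORQUE: `parse_qstat_full` is a hypothetical helper, and it assumes every attribute line has the form `name = value` and that any other line is a wrapped continuation of the previous attribute (as with Variable_List). The sample is an abbreviated copy of the output above.

```python
def parse_qstat_full(text):
    """Parse one job record from `qstat -f` output into a dict (hypothetical helper)."""
    job = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Job Id:"):
            job["Job Id"] = line.split(":", 1)[1].strip()
            last_key = None
        elif " = " in line:
            # A new attribute: "name = value".
            key, value = line.split(" = ", 1)
            job[key] = value
            last_key = key
        elif last_key is not None:
            # Wrapped continuation of the previous attribute (e.g. Variable_List).
            job[last_key] += line
    return job


# Abbreviated copy of the record posted above.
sample = """\
Job Id: 2.grid01.spacescience.ro
Job_Name = STDIN
job_state = Q
queue = alice
Hold_Types = n
Variable_List = PBS_O_HOME=/home/alice001,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=alice001,
PBS_O_WORKDIR=/home/alice001,PBS_O_QUEUE=alice
"""

job = parse_qstat_full(sample)
print(job["job_state"], job["Hold_Types"], "exec_host" in job)  # prints: Q n False
```

Read this way, the job is queued (job_state = Q), has no hold on it (Hold_Types = n), and has no exec_host attribute, so it has not been assigned to a node. That narrows the question to why nothing is dispatching it, but does not answer it.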
Running momctl against a worker node gives me this:
[root@grid01 ~]# momctl -d 3 -h wn01
Host: wn01.spacescience.ro/wn01.spacescience.ro Version: 2.3.0-snap.200801151629 PID: 7248
Server[0]: grid01.spacescience.ro (172.16.0.254)
Init Msgs Received: 0 hellos/1 cluster-addrs
Init Msgs Sent: 1 hellos
Last Msg From Server: 284242 seconds (CLUSTER_ADDRS)
Last Msg To Server: 21 seconds
HomeDirectory: /var/spool/pbs/mom_priv
stdout/stderr spool directory: '/var/spool/pbs/spool/' (1072793 blocks available)
NOTE: syslog enabled
MOM active: 284244 seconds
Server Update Interval: 45 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
MemLocked: TRUE (mlock)
TCP Timeout: 20 seconds
Prolog: /var/spool/pbs/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: 172.16.0.5,172.16.0.4,172.16.0.3,172.16.0.2,172.16.0.254,172.16.0.1,127.0.0.1
Copy Command: /usr/bin/scp -rpB
NOTE: no local jobs detected
diagnostics complete
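The timing fields in this momctl output can be compared directly. The following is a rough sketch in Python, assuming the `Label: N seconds` layout shown above; `momctl_seconds` is a hypothetical helper, not a TORQUE tool, and the sample holds only the relevant lines.

```python
import re


def momctl_seconds(text):
    """Return {label: seconds} for every 'Label: N seconds' line (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?):\s+(\d+) seconds\b", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


# Relevant lines copied from the momctl output above.
sample = """\
Last Msg From Server: 284242 seconds (CLUSTER_ADDRS)
Last Msg To Server: 21 seconds
MOM active: 284244 seconds
Server Update Interval: 45 seconds
"""

secs = momctl_seconds(sample)
# How long after MOM start-up the last server message arrived.
print(secs["MOM active"] - secs["Last Msg From Server"])  # prints: 2
```

So the last message from the server arrived about 2 seconds after the MOM started, roughly 3.3 days ago, while the MOM itself reported to the server 21 seconds ago, within its 45-second update interval. That is consistent with the server never having dispatched a job to this node, although on its own it does not say why.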
What could be wrong, and where should I look?
Thanks for any help,
Adrian