[torqueusers] jobs stuck in Q

Greenseid, Joseph M. Joseph.Greenseid at ngc.com
Mon Dec 15 09:39:02 MST 2008

What scheduler are you using? Are you using TORQUE's scheduler, Maui, or something else?
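
If you are not sure, one rough way to check is to look for a scheduler daemon in the process list on the server host. The sketch below is only an illustration: it assumes the usual daemon names (pbs_sched for TORQUE's bundled scheduler, maui, moab), and the helper names are made up for this example. Adjust the names if your install uses something different.

```python
import subprocess

# Assumed daemon names: pbs_sched (TORQUE's bundled scheduler), maui, moab.
KNOWN_SCHEDULERS = ("pbs_sched", "maui", "moab")


def running_schedulers(process_names):
    """Return the known scheduler daemons found in an iterable of process names."""
    seen = set(process_names)
    return [name for name in KNOWN_SCHEDULERS if name in seen]


def list_process_names():
    """Return the command names of all running processes, using `ps -e -o comm=`."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    try:
        found = running_schedulers(list_process_names())
    except (OSError, subprocess.CalledProcessError) as exc:
        # ps is missing or failed; report it instead of guessing.
        print("could not list processes:", exc)
    else:
        print("schedulers running:", ", ".join(found) if found else "none")
```

If this reports none on the server host, no scheduler is running to start jobs, and they would be expected to stay in state Q.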


From: torqueusers-bounces at supercluster.org on behalf of Adrian Sevcenco
Sent: Mon 12/15/2008 10:40 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] jobs stuck in Q

I have a server on which jobs are stuck in the queue. I get this output
from qstat -f:
Job Id: 2.grid01.spacescience.ro
    Job_Name = STDIN
    Job_Owner = alice001 at grid01.spacescience.ro
    job_state = Q
    queue = alice
    server = grid01.spacescience.ro
    Checkpoint = u
    ctime = Mon Dec 15 17:19:39 2008
    Error_Path = grid01.spacescience.ro:/home/alice001/STDIN.e2
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Dec 15 17:20:22 2008
    Output_Path = grid01.spacescience.ro:/home/alice001/STDIN.o2
    Priority = 0
    qtime = Mon Dec 15 17:20:49 2008
    Rerunable = True
    Resource_List.cput = 48:00:00
    Resource_List.walltime = 72:00:00
    Variable_List = PBS_O_HOME=/home/alice001,PBS_O_LANG=en_US.UTF-8,

    etime = Mon Dec 15 17:20:49 2008
    submit_args = -q alice

and momctl on a worker node gives me this:
[root at grid01 ~]# momctl -d 3 -h wn01

Host: wn01.spacescience.ro/wn01.spacescience.ro   Version: 2.3.0-snap.200801151629   PID: 7248
Server[0]: grid01.spacescience.ro (
  Init Msgs Received:     0 hellos/1 cluster-addrs
  Init Msgs Sent:         1 hellos
  Last Msg From Server:   284242 seconds (CLUSTER_ADDRS)
  Last Msg To Server:     21 seconds
HomeDirectory:          /var/spool/pbs/mom_priv
stdout/stderr spool directory: '/var/spool/pbs/spool/' (1072793 blocks
NOTE:  syslog enabled
MOM active:             284244 seconds
Server Update Interval: 45 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
MemLocked:              TRUE  (mlock)
TCP Timeout:            20 seconds
Prolog:                 /var/spool/pbs/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:,,,,,,
Copy Command:           /usr/bin/scp -rpB
NOTE:  no local jobs detected

diagnostics complete

What could be wrong, and where should I look?
Thanks for any help,
