[torqueusers] jobs stuck in Q

Adrian Sevcenco Adrian.Sevcenco at cern.ch
Mon Dec 15 08:40:45 MST 2008


Hi,
I have a server on which jobs are stuck in the queue. I have this output
from qstat -f:
Job Id: 2.grid01.spacescience.ro
    Job_Name = STDIN
    Job_Owner = alice001@grid01.spacescience.ro
    job_state = Q
    queue = alice
    server = grid01.spacescience.ro
    Checkpoint = u
    ctime = Mon Dec 15 17:19:39 2008
    Error_Path = grid01.spacescience.ro:/home/alice001/STDIN.e2
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Dec 15 17:20:22 2008
    Output_Path = grid01.spacescience.ro:/home/alice001/STDIN.o2
    Priority = 0
    qtime = Mon Dec 15 17:20:49 2008
    Rerunable = True
    Resource_List.cput = 48:00:00
    Resource_List.walltime = 72:00:00
    Variable_List = PBS_O_HOME=/home/alice001,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=alice001,
        PBS_O_PATH=/usr/kerberos/bin:/opt/edg/bin:/opt/glite/bin:/opt/lcg/bin
        :/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/alice001/bin,
        PBS_O_MAIL=/var/spool/mail/alice001,PBS_O_SHELL=/bin/bash,
        PBS_SERVER=grid01.spacescience.ro,PBS_O_HOST=grid01.spacescience.ro,
        PBS_O_WORKDIR=/home/alice001,PBS_O_QUEUE=alice
    etime = Mon Dec 15 17:20:49 2008
    submit_args = -q alice

and momctl on a worker node (wn) gives me this:
[root@grid01 ~]# momctl -d 3 -h wn01

Host: wn01.spacescience.ro/wn01.spacescience.ro   Version: 2.3.0-snap.200801151629   PID: 7248
Server[0]: grid01.spacescience.ro (172.16.0.254)
  Init Msgs Received:     0 hellos/1 cluster-addrs
  Init Msgs Sent:         1 hellos
  Last Msg From Server:   284242 seconds (CLUSTER_ADDRS)
  Last Msg To Server:     21 seconds
HomeDirectory:          /var/spool/pbs/mom_priv
stdout/stderr spool directory: '/var/spool/pbs/spool/' (1072793 blocks available)
NOTE:  syslog enabled
MOM active:             284244 seconds
Server Update Interval: 45 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    RPP
MemLocked:              TRUE  (mlock)
TCP Timeout:            20 seconds
Prolog:                 /var/spool/pbs/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:    172.16.0.5,172.16.0.4,172.16.0.3,172.16.0.2,172.16.0.254,172.16.0.1,127.0.0.1
Copy Command:           /usr/bin/scp -rpB
NOTE:  no local jobs detected

diagnostics complete

What could be wrong, and where should I look?
Thanks for any help,
Adrian