[torqueusers] jobs stuck in Q

Greenseid, Joseph M. Joseph.Greenseid at ngc.com
Mon Dec 15 11:58:49 MST 2008


What does `checkjob 2` show you (where 2 is the job ID, as taken from your first email)?
 
--Joe
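
Before running checkjob, it may help to confirm from the qstat -f output that the job is simply queued and not held. The following is a minimal sketch, not part of TORQUE or Maui; the helper name qstat_attr is hypothetical, and it assumes the "name = value" layout shown in the qstat -f output from the first email.

```shell
#!/bin/sh
# Hypothetical helper: read `qstat -f` output on stdin and print the value
# of one attribute, assuming lines of the form "    name = value".
qstat_attr() {
    awk -v k="$1" '$1 == k && $2 == "=" { print $3; exit }'
}

# Example usage on the server (job ID 2 is taken from the first email):
#   qstat -f 2 | qstat_attr job_state    # Q means the job is queued
#   qstat -f 2 | qstat_attr Hold_Types   # n means no hold is set
#   checkjob 2                           # then ask Maui about the job
```

If the job is in state Q with no hold, the reason it is not starting is more likely to be found on the scheduler side, which is what checkjob reports on.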

________________________________

From: torqueusers-bounces at supercluster.org on behalf of Adrian Sevcenco
Sent: Mon 12/15/2008 12:13 PM
To: Greenseid, Joseph M.
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] jobs stuck in Q



Greenseid, Joseph M. wrote:
> what scheduler are you using?  are you using torque's scheduler, or
> maui, or something else?
Hi,
I am using Maui. Do you think the problem could be there?
I now see that when I try to restart Maui I get:
ERROR:    lost connection to server
ERROR:    cannot request service (status)
This is my maui.cfg:
[root at grid01 maui]# cat maui.cfg
# MAUI configuration example

SERVERHOST              grid01.spacescience.ro
ADMIN1                  root
ADMIN3                  edginfo rgma edguser
ADMINHOSTS              grid01.spacescience.ro
RMCFG[base]             TYPE=PBS
SERVERPORT              40559
SERVERMODE              NORMAL

# Set PBS server polling interval. If you have short queues and/or
# jobs, it is worth setting a short interval (10 seconds).

RMPOLLINTERVAL        00:00:10

# a max. 10 MByte log file in a logical location

LOGFILE               /var/log/maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              1

# Set the delay to 1 minute before Maui tries to run a job again,
# in case it failed to run the first time.
# The default value is 1 hour.

DEFERTIME       00:01:00

# Necessary for MPI grid jobs
ENABLEMULTIREQJOBS TRUE

Any ideas, anyone?
Thanks,
Adrian
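
One possible check, given the "lost connection to server" errors above, is whether the Maui daemon is actually running and listening on the SERVERPORT set in maui.cfg. The following is only a sketch under assumptions: the helper name maui_cfg_value is hypothetical, the config file path is assumed to be ./maui.cfg, and the pgrep and netstat lines assume a Linux host where the daemon process is named maui.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a single-valued key
# (for example SERVERPORT) from a maui.cfg-style file.
# Comment lines do not match because their first field starts with '#'.
maui_cfg_value() {
    awk -v k="$1" '$1 == k { print $2; exit }' "$2"
}

# Example usage (path and process name are assumptions, adjust as needed):
#   port=$(maui_cfg_value SERVERPORT ./maui.cfg)
#   pgrep -x maui >/dev/null || echo "maui daemon is not running"
#   netstat -ltn | grep -q ":$port " || echo "nothing is listening on port $port"
```

If the daemon is not running, or nothing is listening on that port, the errors would be coming from the client commands failing to reach it, and /var/log/maui.log (the LOGFILE above) would be the next place to look.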


> --Joe
>
> ------------------------------------------------------------------------
> *From:* torqueusers-bounces at supercluster.org on behalf of Adrian Sevcenco
> *Sent:* Mon 12/15/2008 10:40 AM
> *To:* torqueusers at supercluster.org
> *Subject:* [torqueusers] jobs stuck in Q
>
> Hi,
> I have a server on which jobs are stuck in the queue. This is the output
> from qstat -f:
> Job Id: 2.grid01.spacescience.ro
>     Job_Name = STDIN
>     Job_Owner = alice001 at grid01.spacescience.ro
>     job_state = Q
>     queue = alice
>     server = grid01.spacescience.ro
>     Checkpoint = u
>     ctime = Mon Dec 15 17:19:39 2008
>     Error_Path = grid01.spacescience.ro:/home/alice001/STDIN.e2
>     Hold_Types = n
>     Join_Path = n
>     Keep_Files = n
>     Mail_Points = a
>     mtime = Mon Dec 15 17:20:22 2008
>     Output_Path = grid01.spacescience.ro:/home/alice001/STDIN.o2
>     Priority = 0
>     qtime = Mon Dec 15 17:20:49 2008
>     Rerunable = True
>     Resource_List.cput = 48:00:00
>     Resource_List.walltime = 72:00:00
>     Variable_List = PBS_O_HOME=/home/alice001,PBS_O_LANG=en_US.UTF-8,
>         PBS_O_LOGNAME=alice001,
>         PBS_O_PATH=/usr/kerberos/bin:/opt/edg/bin:/opt/glite/bin:/opt/lcg/bin
>         :/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/alice001/bin,
>         PBS_O_MAIL=/var/spool/mail/alice001,PBS_O_SHELL=/bin/bash,
>         PBS_SERVER=grid01.spacescience.ro,PBS_O_HOST=grid01.spacescience.ro,
>         PBS_O_WORKDIR=/home/alice001,PBS_O_QUEUE=alice
>     etime = Mon Dec 15 17:20:49 2008
>     submit_args = -q alice
>
> and momctl on a worker node gives me this:
> [root at grid01 ~]# momctl -d 3 -h wn01
>
> Host: wn01.spacescience.ro/wn01.spacescience.ro   Version: 2.3.0-snap.200801151629   PID: 7248
> Server[0]: grid01.spacescience.ro (172.16.0.254)
>   Init Msgs Received:     0 hellos/1 cluster-addrs
>   Init Msgs Sent:         1 hellos
>   Last Msg From Server:   284242 seconds (CLUSTER_ADDRS)
>   Last Msg To Server:     21 seconds
> HomeDirectory:          /var/spool/pbs/mom_priv
> stdout/stderr spool directory: '/var/spool/pbs/spool/' (1072793 blocks available)
> NOTE:  syslog enabled
> MOM active:             284244 seconds
> Server Update Interval: 45 seconds
> LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model:    RPP
> MemLocked:              TRUE  (mlock)
> TCP Timeout:            20 seconds
> Prolog:                 /var/spool/pbs/mom_priv/prologue (disabled)
> Alarm Time:             0 of 10 seconds
> Trusted Client List:  172.16.0.5,172.16.0.4,172.16.0.3,172.16.0.2,172.16.0.254,172.16.0.1,127.0.0.1
> Copy Command:           /usr/bin/scp -rpB
> NOTE:  no local jobs detected
>
> diagnostics complete
>
> What could be wrong, and where should I look?
> Thanks for any help,
> Adrian
>

