[torqueusers] jobs stuck in Q

Adrian Sevcenco Adrian.Sevcenco at cern.ch
Mon Dec 15 12:08:31 MST 2008


Greenseid, Joseph M. wrote:
> what does `checkjob 2` show you (where 2 is the jobid, as taken from
> your first email)?
Well, after some time it shows me the same error ("lost connection to
server"), but the process is running, and the log says that Maui cannot
open the address. I think this is a Maui problem, so I have also posted
to the Maui users mailing list. (By the way, I tried changing the ports,
but it did not help.)
Thanks,
Adrian
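
A "lost connection to server" error from a Maui client can be narrowed down by checking whether anything accepts TCP connections on the configured port. The following is a minimal sketch, assuming Python is available on the head node. The function name `port_is_open` is hypothetical and not part of TORQUE or Maui. It only tests reachability and says nothing about whether the listener is actually Maui.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only checks that something
    is listening, not which daemon it is.
    """
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `port_is_open("grid01.spacescience.ro", 40559)` would use the SERVERHOST and SERVERPORT values from the maui.cfg quoted in this thread. A `False` result suggests the daemon is not listening there, or that a firewall or name resolution problem is in the way.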

>  
> --Joe
> 
> ------------------------------------------------------------------------
> *From:* torqueusers-bounces at supercluster.org on behalf of Adrian Sevcenco
> *Sent:* Mon 12/15/2008 12:13 PM
> *To:* Greenseid, Joseph M.
> *Cc:* torqueusers at supercluster.org
> *Subject:* Re: [torqueusers] jobs stuck in Q
> 
> Greenseid, Joseph M. wrote:
>> what scheduler are you using?  are you using torque's scheduler, or
>> maui, or something else?
> Hi,
> I am using Maui. Do you think the problem could be there?
> Now I see that when I try to restart Maui I get:
> ERROR:    lost connection to server
> ERROR:    cannot request service (status)
> This is my maui.cfg:
> [root at grid01 maui]# cat maui.cfg
> # MAUI configuration example
> 
> SERVERHOST              grid01.spacescience.ro
> ADMIN1                  root
> ADMIN3                  edginfo rgma edguser
> ADMINHOSTS              grid01.spacescience.ro
> RMCFG[base]             TYPE=PBS
> SERVERPORT              40559
> SERVERMODE              NORMAL
> 
> # Set PBS server polling interval. If you have short queues and/or
> # jobs it is worth setting a short interval (10 seconds).
> 
> RMPOLLINTERVAL        00:00:10
> 
> # a max. 10 MByte log file in a logical location
> 
> LOGFILE               /var/log/maui.log
> LOGFILEMAXSIZE        10000000
> LOGLEVEL              1
> 
> # Set the delay to 1 minute before Maui tries to run a job again,
> # in case it failed to run the first time.
> # The default value is 1 hour.
> 
> DEFERTIME       00:01:00
> 
> # Necessary for MPI grid jobs
> ENABLEMULTIREQJOBS TRUE
> 
> Any ideas, anyone?
> Thanks,
> Adrian
> 
> 
>> --Joe
>>
>> ------------------------------------------------------------------------
>> *From:* torqueusers-bounces at supercluster.org on behalf of Adrian Sevcenco
>> *Sent:* Mon 12/15/2008 10:40 AM
>> *To:* torqueusers at supercluster.org
>> *Subject:* [torqueusers] jobs stuck in Q
>>
>> Hi,
>> I have a server on which jobs are stuck in the queue. This is the
>> output of qstat -f:
>> Job Id: 2.grid01.spacescience.ro
>>     Job_Name = STDIN
>>     Job_Owner = alice001 at grid01.spacescience.ro
>>     job_state = Q
>>     queue = alice
>>     server = grid01.spacescience.ro
>>     Checkpoint = u
>>     ctime = Mon Dec 15 17:19:39 2008
>>     Error_Path = grid01.spacescience.ro:/home/alice001/STDIN.e2
>>     Hold_Types = n
>>     Join_Path = n
>>     Keep_Files = n
>>     Mail_Points = a
>>     mtime = Mon Dec 15 17:20:22 2008
>>     Output_Path = grid01.spacescience.ro:/home/alice001/STDIN.o2
>>     Priority = 0
>>     qtime = Mon Dec 15 17:20:49 2008
>>     Rerunable = True
>>     Resource_List.cput = 48:00:00
>>     Resource_List.walltime = 72:00:00
>>     Variable_List = PBS_O_HOME=/home/alice001,PBS_O_LANG=en_US.UTF-8,
>>         PBS_O_LOGNAME=alice001,
>>         PBS_O_PATH=/usr/kerberos/bin:/opt/edg/bin:/opt/glite/bin:/opt/lcg/bin
>>         :/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/alice001/bin,
>>         PBS_O_MAIL=/var/spool/mail/alice001,PBS_O_SHELL=/bin/bash,
>>         PBS_SERVER=grid01.spacescience.ro,PBS_O_HOST=grid01.spacescience.ro,
>>         PBS_O_WORKDIR=/home/alice001,PBS_O_QUEUE=alice
>>     etime = Mon Dec 15 17:20:49 2008
>>     submit_args = -q alice
>>
>> and momctl on a worker node gives me this:
>> [root at grid01 ~]# momctl -d 3 -h wn01
>>
>> Host: wn01.spacescience.ro/wn01.spacescience.ro   Version: 2.3.0-snap.200801151629   PID: 7248
>> Server[0]: grid01.spacescience.ro (172.16.0.254)
>>   Init Msgs Received:     0 hellos/1 cluster-addrs
>>   Init Msgs Sent:         1 hellos
>>   Last Msg From Server:   284242 seconds (CLUSTER_ADDRS)
>>   Last Msg To Server:     21 seconds
>> HomeDirectory:          /var/spool/pbs/mom_priv
>> stdout/stderr spool directory: '/var/spool/pbs/spool/' (1072793 blocks available)
>> NOTE:  syslog enabled
>> MOM active:             284244 seconds
>> Server Update Interval: 45 seconds
>> LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
>> Communication Model:    RPP
>> MemLocked:              TRUE  (mlock)
>> TCP Timeout:            20 seconds
>> Prolog:                 /var/spool/pbs/mom_priv/prologue (disabled)
>> Alarm Time:             0 of 10 seconds
>> Trusted Client List:    172.16.0.5,172.16.0.4,172.16.0.3,172.16.0.2,172.16.0.254,172.16.0.1,127.0.0.1
>> Copy Command:           /usr/bin/scp -rpB
>> NOTE:  no local jobs detected
>>
>> diagnostics complete
>>
>> What could be wrong, and where should I look?
>> Thanks for any help,
>> Adrian
>>
> 
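
With more than one job on the server, reading `qstat -f` output by eye gets tedious. Below is a minimal sketch that lists the jobs still in state Q, assuming the `Job Id:` and `job_state = ...` line layout shown in the output quoted above. The function name `queued_jobs` is hypothetical, and the sample text is an abbreviated copy of that output.

```python
import re

def queued_jobs(qstat_text):
    """Return the IDs of jobs whose job_state is Q in `qstat -f` output."""
    queued = []
    job_id = None
    for line in qstat_text.splitlines():
        # A "Job Id:" line starts a new job record.
        header = re.match(r"\s*Job Id:\s*(\S+)", line)
        if header:
            job_id = header.group(1)
            continue
        # The job_state attribute belongs to the most recent job record.
        state = re.match(r"\s*job_state\s*=\s*(\S+)", line)
        if state and job_id is not None and state.group(1) == "Q":
            queued.append(job_id)
    return queued

# Abbreviated sample taken from the qstat -f output in this thread.
sample = """Job Id: 2.grid01.spacescience.ro
    Job_Name = STDIN
    job_state = Q
    queue = alice
"""
print(queued_jobs(sample))
```

On a live system the text could come from running `qstat -f` and capturing its standard output. The parsing assumes unquoted output, without the `>>` prefixes added by mail clients.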