[Mauiusers] Job is in 'Q' but checkjob shows it is running (!)

Mahmood Naderan nt_mahmood at yahoo.com
Mon Sep 12 13:22:06 MDT 2011


Since the node has been up for about 110 days, I think there may have been a problem with the maui service. After a restart, it is now fine.
Thanks for your help.

 
// Naderan *Mahmood;


----- Original Message -----
From: Steve Crusan <scrusan at ur.rochester.edu>
To: Mahmood Naderan <nt_mahmood at yahoo.com>
Cc: maui <mauiusers at supercluster.org>
Sent: Monday, September 12, 2011 10:31 PM
Subject: Re: [Mauiusers] Job is in 'Q' but checkjob shows it is running (!)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sep 12, 2011, at 12:27 PM, Mahmood Naderan wrote:

>> Do you mean why isn't the job running, even though it seems that it *should* be running?
> 
> Exactly...
> 
>> If so, I would say post the output of qstat -f for the job, and checkjob -v
> 
> mahmood at srv1:~$ qstat -f 49153
> Job Id: 49153.srv1
>     Job_Name = bwaves
>     Job_Owner = mahmood at srv1
>     job_state = Q
>     queue = Long
>     server = srv1
>     Checkpoint = u
>     ctime = Mon Sep 12 19:55:29 2011
>     Error_Path = srv1:/home/mahmood/multi2sim-3.0.3/410.bwave
>         s/bwaves.e49153
>     Hold_Types = n
>     Join_Path = oe
>     Keep_Files = n
>     Mail_Points = a
>     mtime = Mon Sep 12 19:55:29 2011
>     Output_Path = srv1:/home/mahmood/multi2sim-3.0.3/410.bwav
>         es/bwaves_128.out
>     Priority = 0
>     qtime = Mon Sep 12 19:55:29 2011
>     Rerunable = True
>     Resource_List.nodect = 1
>     Resource_List.nodes = node2
>     Resource_List.walltime = 960:00:00
>     Variable_List = PBS_O_QUEUE=Long,PBS_O_HOME=/home/mahmood,
>         ...
>     etime = Mon Sep 12 19:55:29 2011
>     submit_args = tor
>     fault_tolerant = False
> 
> mahmood at srv1:~$ checkjob -v 49153
> checking job 49153 (RM job '49153.srv1')
> 
> State: Idle
> Creds:  user:mahmood  group:mahmood  class:Long  qos:DEFAULT
> WallTime: 00:00:00 of 40:00:00:00
> SubmitTime: Mon Sep 12 19:55:29
>   (Time Queued  Total: 00:39:24  Eligible: 00:39:24)
> 
> Total Tasks: 1
> 
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Exec:  ''  ExecSize: 0  ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1
> NodeAccess: SHARED
> NodeCount: 0
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 3  StartCount: 0
> PartitionMask: [ALL]
> Flags:       HOSTLIST RESTARTABLE
> HostList:
>   [node2:1]
> PE:  1.00  StartPriority:  147
> job can run in partition DEFAULT (8 procs available.  1 procs required)

There has got to be a reason why the job won't start even though resources are available. I was hoping that checkjob -v would show the node information, but maybe it's different for maui. Can you run checkjob -v -n <nodeid> <jobid>?
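
The comparison being made here (TORQUE's job_state against Maui's State) can be sketched in a few lines of Python. This is only an illustration, assuming the text layout shown in the quoted qstat -f and checkjob -v output. The function names and the state mapping table are hypothetical helpers, not part of TORQUE or Maui, and only the Q/Idle pairing is taken from this thread.

```python
import re

# Assumed correspondence between TORQUE job_state letters and Maui job
# states. Q <-> Idle is taken from the output quoted in this thread;
# R <-> Running is an illustrative assumption.
EXPECTED_MAUI_STATE = {"Q": "Idle", "R": "Running"}

# Attribute lines in the quoted `qstat -f` output are indented by exactly
# four spaces; wrapped continuation lines are indented further.
ATTR_LINE = re.compile(r"^ {4}(\S+) = (.*)$")


def parse_qstat_f(text):
    """Turn `qstat -f` text into a dict, rejoining wrapped values."""
    attrs = {}
    key = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("Job Id:"):
            attrs["Job Id"] = line.split(":", 1)[1].strip()
            key = None
            continue
        match = ATTR_LINE.match(line)
        if match:
            key = match.group(1)
            attrs[key] = match.group(2)
        elif key is not None:
            # Deeper-indented line: continuation of the previous value.
            attrs[key] += line.strip()
    return attrs


def parse_checkjob_state(text):
    """Return the value of the `State:` line in checkjob output, or None."""
    match = re.search(r"^State:\s*(\S+)", text, re.MULTILINE)
    return match.group(1) if match else None


def states_agree(qstat_text, checkjob_text):
    """True when Maui's state is the one the TORQUE state implies."""
    torque_state = parse_qstat_f(qstat_text).get("job_state")
    expected = EXPECTED_MAUI_STATE.get(torque_state)
    return expected is not None and expected == parse_checkjob_state(checkjob_text)
```

For job 49153 the two outputs agree (Q in TORQUE, Idle in Maui), so the states themselves are consistent and the open question is only why the scheduler does not start the job.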

Either the specific node itself is having problems, or maui is not starting the job on it.

Do you see anything relevant in your /var/spool/maui/logs/maui.log file? If not, I would increase the verbosity of the logging, and restart the maui service.
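
As a rough aid for the log check, here is a minimal Python sketch that pulls out the lines of a log file mentioning a given job ID. It makes no assumption about the maui.log line format and does a plain substring match, so an ID such as 49153 would also match a longer number that contains it. The function name job_log_lines is hypothetical.

```python
def job_log_lines(log_path, job_id):
    """Return (line_number, text) pairs from log_path that mention job_id."""
    hits = []
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if job_id in line:
                hits.append((number, line.rstrip("\n")))
    return hits
```

With the path from this message, it would be called as job_log_lines("/var/spool/maui/logs/maui.log", "49153").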


> 
> 
>> which you seem to have manually selected in your qsub statement
> 
> Yes, as you can see, I requested node2:
> Resource_List.nodes = node2
> 
> and the output of "pbsnodes -l all" shows that this node is free
> 
> mahmood at srv1:~$ pbsnodes -l all
> srv1                  job-exclusive
> node2                 free
> node3                 job-exclusive
> node4                 free
> 
> 
> Any idea about that?
> 
> // Naderan *Mahmood;
> 
> 
> ----- Original Message -----
> From: Steve Crusan <scrusan at ur.rochester.edu>
> To: Mahmood Naderan <nt_mahmood at yahoo.com>
> Cc: maui <mauiusers at supercluster.org>
> Sent: Monday, September 12, 2011 6:17 PM
> Subject: Re: [Mauiusers] Job is in 'Q' but checkjob shows it is running (!)
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> On Sep 12, 2011, at 5:01 AM, Mahmood Naderan wrote:
> 
>> 
>> 
>> Hi,
>> I sent this email to the torque mailing list, but it seems that the problem is related to maui, so I am restating it here.
>> 
>> Can someone explain why qstat shows a job in "Q" while checkjob says everything is normal?
> 
> 
> Looking below, the job is queued in TORQUE, and idle in Maui (not running), so everything is normal.
> 
> Do you mean why isn't the job running, even though it seems that it *should* be running?
> 
> If so, I would say post the output of qstat -f for the job, and checkjob -v. This seems to be more or less a scheduler configuration issue, or possibly an issue with the node (which you seem to have manually selected in your qsub statement).
> 
> 
> 
>> 
>> mahmood at srv1:416.gamess$ qstat 49003
>> Job id                    Name             User            Time Use S Queue
>> ------------------------- ---------------- --------------- -------- - -----
>> 49003.srv1                 gamess           mahmood                0 Q Long
>> 
>> 
>> mahmood at srv1:416.gamess$ checkjob 49003
>> checking job 49003
>> 
>> State: Idle
>> Creds:  user:mahmood  group:mahmood  class:Long    qos:DEFAULT
>> WallTime: 00:00:00 of 40:00:00:00
>> SubmitTime: Sun Sep 11 09:51:26
>>    (Time Queued  Total: 00:02:36  Eligible: 00:02:36)
>> 
>> Total Tasks: 1
>> 
>> Req[0]  TaskCount: 1  Partition: ALL
>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>> 
>> 
>> IWD: [NONE]  Executable:  [NONE]
>> Bypass: 0  StartCount: 0
>> PartitionMask: [ALL]
>> Flags:       HOSTLIST RESTARTABLE
>> HostList:
>>    [hawk:1]
>> PE:  1.00  StartPriority:  129
>> job can run in partition DEFAULT (3 procs available.  1 procs required)
>> 
>> Thanks
>> // Naderan *Mahmood;
>> 
>> _______________________________________________
>> mauiusers mailing list
>> mauiusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/mauiusers
> 
> ----------------------
> Steve Crusan
> System Administrator
> Center for Research Computing
> University of Rochester
> https://www.crc.rochester.edu/
> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
> Comment: GPGTools - http://gpgtools.org
> 
> iQEcBAEBAgAGBQJObg2IAAoJENS19LGOpgqKAnIIAKHvbLmV9Hs31IZ4AGHIOFG9
> Wxp+qiXOnIMoKQQjhkkou1zVC4OKHnymcE/LxtiQcAuX+Lu8gd/GAR1tF5FeCF4g
> m7go12yb5Dx97sHgl2SjmRY3duDkx6YMfOGgxCuiN+O5SdkUazuW8GPkW+HPPS7/
> T3gDbG0jizZ6A5LzhJqgPyVC4LKkwYt5v9NQBs/f82ZOGqPusEWdJ4N5oaUYhyG/
> OXSj/xmzMTCYCqfdOUZynq4ACQotRbNmY7wrV+Uc0qWUFtZv/RIwQ/O4P261E/1/
> dfrVX3OEdz9FBy4uoNrgMyNxL2eOanNiKSlhHJnoM04zx0SkAYGDOeGPqYv/vi0=
> =QcC7
> -----END PGP SIGNATURE-----
> 

----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/


-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJObkjnAAoJENS19LGOpgqKwwQH/26RwQZX1BG/M3V/PztkOpPs
CwshkSuBkQGrNqshY6/BenrZpXHGgEYGbqYyFm29NWMyNQ1Vm33mfb0rq84DBkXk
gbME5qwg3uKeATUGuBQoMxdy/JEu1TdqDx4FNwLh8/wLxzhmJcQqatEX4qvEgJWP
oT3m0j29rgENLfVKpZ40P7vHAPafJrnTAQjPsqmoZLnkK0dGOD/zD5T/RiMBKLar
harduBX6s9FpKeHJTwEYGqBdMgxu1nBQ3wna+Tmmjq5HXxdlzlT7HfQSYzWQxtI2
kXU/1S6kaz1AXVUCsJt42MGbmWhAwCBbVP5RCfHvXB6pulMXyOinRDeoYNzc7HU=
=eijX
-----END PGP SIGNATURE-----


