[Mauiusers] Job is in 'Q' but checkjob shows it is running (!)
Steve Crusan
scrusan at ur.rochester.edu
Mon Sep 12 12:01:03 MDT 2011
On Sep 12, 2011, at 12:27 PM, Mahmood Naderan wrote:
>> Do you mean why isn't the job running, even though it seems that it *should* be running?
>
> Exactly...
>
>> If so, I would say post the output of qstat -f for the job, and checkjob -v
>
> mahmood at srv1:~$ qstat -f 49153
> Job Id: 49153.srv1
> Job_Name = bwaves
> Job_Owner = mahmood at srv1
> job_state = Q
> queue = Long
> server = srv1
> Checkpoint = u
> ctime = Mon Sep 12 19:55:29 2011
> Error_Path = srv1:/home/mahmood/multi2sim-3.0.3/410.bwave
> s/bwaves.e49153
> Hold_Types = n
> Join_Path = oe
> Keep_Files = n
> Mail_Points = a
> mtime = Mon Sep 12 19:55:29 2011
> Output_Path = srv1:/home/mahmood/multi2sim-3.0.3/410.bwav
> es/bwaves_128.out
> Priority = 0
> qtime = Mon Sep 12 19:55:29 2011
> Rerunable = True
> Resource_List.nodect = 1
> Resource_List.nodes = node2
> Resource_List.walltime = 960:00:00
> Variable_List = PBS_O_QUEUE=Long,PBS_O_HOME=/home/mahmood,
> ...
> etime = Mon Sep 12 19:55:29 2011
> submit_args = tor
> fault_tolerant = False
>
> mahmood at srv1:~$ checkjob -v 49153
> checking job 49153 (RM job '49153.srv1')
>
> State: Idle
> Creds: user:mahmood group:mahmood class:Long qos:DEFAULT
> WallTime: 00:00:00 of 40:00:00:00
> SubmitTime: Mon Sep 12 19:55:29
> (Time Queued Total: 00:39:24 Eligible: 00:39:24)
>
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [NONE]
> Exec: '' ExecSize: 0 ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1
> NodeAccess: SHARED
> NodeCount: 0
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 3 StartCount: 0
> PartitionMask: [ALL]
> Flags: HOSTLIST RESTARTABLE
> HostList:
> [node2:1]
> PE: 1.00 StartPriority: 147
> job can run in partition DEFAULT (8 procs available. 1 procs required)
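The two listings above can be cross-checked mechanically. The following is a minimal Python sketch, not part of TORQUE or Maui: the function names and the small state table are illustrative assumptions, and the sample text is trimmed from the output quoted above. It only shows that "Q" in qstat and "Idle" in checkjob describe the same situation, a job that is waiting to start.

```python
import re

# Simplified pairing of TORQUE job_state letters with the Maui state one
# would expect alongside them. This table is an illustrative assumption,
# not an exhaustive list of either system's states.
EXPECTED_MAUI_STATE = {"Q": "Idle", "R": "Running"}


def parse_qstat_f(text):
    """Collect the 'key = value' pairs printed by `qstat -f` into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()  # tolerate mail quoting
        if " = " in line:
            key, value = line.split(" = ", 1)
            fields[key.strip()] = value.strip()
    return fields


def parse_checkjob(text):
    """Pull the job state and processor counts out of `checkjob` output."""
    state = re.search(r"State:\s*(\S+)", text)
    procs = re.search(r"\((\d+) procs available\.\s+(\d+) procs required\)", text)
    return {
        "state": state.group(1) if state else None,
        "procs_available": int(procs.group(1)) if procs else None,
        "procs_required": int(procs.group(2)) if procs else None,
    }


def states_agree(qstat_fields, checkjob_info):
    """True when the Maui state matches what the TORQUE state implies."""
    expected = EXPECTED_MAUI_STATE.get(qstat_fields.get("job_state"))
    return expected is not None and expected == checkjob_info["state"]


# Sample text trimmed from the listings quoted in this thread.
QSTAT_SAMPLE = """\
Job Id: 49153.srv1
Job_Name = bwaves
job_state = Q
queue = Long
Resource_List.nodes = node2
"""

CHECKJOB_SAMPLE = """\
State: Idle
job can run in partition DEFAULT (8 procs available. 1 procs required)
"""

q = parse_qstat_f(QSTAT_SAMPLE)
c = parse_checkjob(CHECKJOB_SAMPLE)
print(q["job_state"], c["state"], states_agree(q, c))
```

With this sample the two views agree, so the open question is why the scheduler has not started the job, not why the two tools disagree.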
There has to be a reason why the job won't start even though resources are available. I was hoping that checkjob -v would show the node information, but maybe it's different for Maui. Can you run checkjob -v -n <nodeid> <jobid>?
Either the specific node is having problems, or Maui is not starting the job.
Do you see anything relevant in your /var/spool/maui/logs/maui.log file? If not, I would increase the logging verbosity and restart the maui service.
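The node and log checks could be scripted roughly as follows. This is a hedged Python sketch: the helper names are hypothetical, the node listing is copied from the "pbsnodes -l all" output quoted further down in this message, and the log filter is a plain substring match because no particular maui.log line format is assumed.

```python
def parse_pbsnodes_list(text):
    """Map node name -> state from `pbsnodes -l all` output."""
    states = {}
    for line in text.splitlines():
        parts = line.lstrip("> ").split()  # tolerate mail quoting
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def lines_mentioning(log_lines, *needles):
    """Return the log lines that contain any of the given substrings."""
    return [ln.rstrip("\n") for ln in log_lines if any(n in ln for n in needles)]


# Node listing as quoted in this thread.
PBSNODES_SAMPLE = """\
srv1 job-exclusive
node2 free
node3 job-exclusive
node4 free
"""

nodes = parse_pbsnodes_list(PBSNODES_SAMPLE)
print(nodes.get("node2"))

# Possible use against the real log (path taken from the message above):
# with open("/var/spool/maui/logs/maui.log") as log:
#     for hit in lines_mentioning(log, "49153", "node2"):
#         print(hit)
```

If the requested node reports as free and the filtered log shows nothing for the job or the node, that supports raising the log verbosity before looking further.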
>
>
>> which you seem to have manually selected in your qsub statement
>
> Yes, as you can see, I requested node2
> Resource_List.nodes = node2
>
> and the output of "pbsnodes -l all" shows that this node is free
>
> mahmood at srv1:~$ pbsnodes -l all
> srv1 job-exclusive
> node2 free
> node3 job-exclusive
> node4 free
>
>
> Any idea about that?
>
> // Naderan *Mahmood;
>
>
> ----- Original Message -----
> From: Steve Crusan <scrusan at ur.rochester.edu>
> To: Mahmood Naderan <nt_mahmood at yahoo.com>
> Cc: maui <mauiusers at supercluster.org>
> Sent: Monday, September 12, 2011 6:17 PM
> Subject: Re: [Mauiusers] Job is in 'Q' but checkjob shows it is running (!)
>
>
>
> On Sep 12, 2011, at 5:01 AM, Mahmood Naderan wrote:
>
>>
>>
>> Hi,
>> I sent this email to the TORQUE mailing list, but it seems to be related to Maui, so I am restating the problem here.
>>
>> Can someone explain why qstat shows a job in "Q" but checkjob says everything is normal?
>
>
> Looking below, the job is queued in TORQUE, and idle in Maui (not running), so everything is normal.
>
> Do you mean why isn't the job running, even though it seems that it *should* be running?
>
> If so, I would say post the output of qstat -f for the job, and checkjob -v. This seems to be more or less a scheduler configuration issue, or possibly an issue with the node (which you seem to have manually selected in your qsub statement).
>
>
>
>>
>> mahmood at srv1:416.gamess$ qstat 49003
>> Job id Name User Time Use S Queue
>> ------------------------- ---------------- --------------- -------- - -----
>> 49003.srv1 gamess mahmood 0 Q Long
>>
>>
>> mahmood at srv1:416.gamess$ checkjob 49003
>> checking job 49003
>>
>> State: Idle
>> Creds: user:mahmood group:mahmood class:Long qos:DEFAULT
>> WallTime: 00:00:00 of 40:00:00:00
>> SubmitTime: Sun Sep 11 09:51:26
>> (Time Queued Total: 00:02:36 Eligible: 00:02:36)
>>
>> Total Tasks: 1
>>
>> Req[0] TaskCount: 1 Partition: ALL
>> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
>> Opsys: [NONE] Arch: [NONE] Features: [NONE]
>>
>>
>> IWD: [NONE] Executable: [NONE]
>> Bypass: 0 StartCount: 0
>> PartitionMask: [ALL]
>> Flags: HOSTLIST RESTARTABLE
>> HostList:
>> [hawk:1]
>> PE: 1.00 StartPriority: 129
>> job can run in partition DEFAULT (3 procs available. 1 procs required)
>>
>> Thanks
>> // Naderan *Mahmood;
>>
>
> ----------------------
> Steve Crusan
> System Administrator
> Center for Research Computing
> University of Rochester
> https://www.crc.rochester.edu/
>
>
>
----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/