[torqueusers] Wrong number of allocated nodes

Regina Guilabert Canals regina.guilabert at uib.es
Fri Jul 27 02:57:41 MDT 2007


Thanks Paul,

No other processes are running for the user. In any case, we removed this
limit from the queue and the problem persists.

Tracing a job with tracejob, we see that exec_host is not set correctly:
the job requested 10 nodes with ppn=2, but exec_host lists only two slots
on cell14:


megacelula:/megadisk/people/regina# tracejob 2199
/var/spool/torque/mom_logs/20070727: No matching job records located

Job: 2199.megacelula

07/27/2007 10:45:14  S    enqueuing into batch, state 1 hop 1
07/27/2007 10:45:14  S    Job Queued at request of dfsvhs9@megacelula, owner = dfsvhs9@megacelula, job name = TEST, queue = batch
07/27/2007 10:45:14  S    Job Modified at request of Scheduler@megacelula
07/27/2007 10:45:14  S    Job Run at request of Scheduler@megacelula
07/27/2007 10:45:14  A    queue=batch
07/27/2007 10:45:15  L    Job Run
07/27/2007 10:45:15  A    user=dfsvhs9 group=models jobname=TEST queue=batch ctime=1185525914 qtime=1185525914 etime=1185525914 start=1185525915 exec_host=cell14/1+cell14/0 Resource_List.neednodes=10:ppn=2 Resource_List.nodect=10 Resource_List.nodes=10:ppn=2 Resource_List.walltime=00:20:30


Which component sets "exec_host", pbs_server or pbs_sched? And how can we
track down the error?
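
A minimal sketch, in Python, of one way to spot this kind of mismatch
in an accounting record like the one above. It assumes the record is a
whitespace-separated list of key=value pairs, that exec_host has the form
host/slot+host/slot, and that the node request is a simple "N:ppn=M"
string. The function names are hypothetical helpers for illustration, not
part of TORQUE.

```python
from collections import Counter


def parse_record(record):
    """Split a whitespace-separated key=value record into a dict."""
    fields = {}
    for token in record.split():
        # Split on the first "=" only, so "nodes=10:ppn=2" keeps "10:ppn=2".
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def parse_exec_host(exec_host):
    """Count slots per host in a string such as 'cell14/1+cell14/0'."""
    return Counter(entry.split("/")[0] for entry in exec_host.split("+"))


def parse_nodes_request(spec):
    """Return (nodes, ppn) for a simple 'N:ppn=M' request (assumed form)."""
    parts = spec.split(":")
    nodes = int(parts[0])
    ppn = 1  # assumption: one slot per node when ppn is not given
    for part in parts[1:]:
        if part.startswith("ppn="):
            ppn = int(part[len("ppn="):])
    return nodes, ppn


def check_allocation(record):
    """Compare the requested nodes/slots with what exec_host reports."""
    fields = parse_record(record)
    hosts = parse_exec_host(fields["exec_host"])
    nodes, ppn = parse_nodes_request(fields["Resource_List.nodes"])
    return {
        "requested_nodes": nodes,
        "requested_slots": nodes * ppn,
        "allocated_nodes": len(hosts),
        "allocated_slots": sum(hosts.values()),
    }


# The accounting record from job 2199, joined onto one line.
record = (
    "user=dfsvhs9 group=models jobname=TEST queue=batch "
    "ctime=1185525914 qtime=1185525914 etime=1185525914 "
    "start=1185525915 exec_host=cell14/1+cell14/0 "
    "Resource_List.neednodes=10:ppn=2 Resource_List.nodect=10 "
    "Resource_List.nodes=10:ppn=2 Resource_List.walltime=00:20:30"
)

result = check_allocation(record)
print(result)
```

For job 2199 this reports 10 requested nodes (20 slots) against 1
allocated node (2 slots), which is the discrepancy described above.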


Regina Guilabert Canals
Grup de Meteorologia

Edif. Mateu Orfila					Tel: +34 971 17 3213
Universitat de les Illes Balears		Fax: +34 971 17 3426
07122 Palma de Mallorca (Spain) 	email: regina.guilabert at uib.es



On 26/07/2007, at 16:53, Paul Gray wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Thu, Jul 26, 2007 at 12:16:49PM +0200, Regina Guilabert Canals wrote:
>> Dear TORQUE users,
>>
>> Yesterday, for no apparent reason, PBS stopped allocating the correct
>> number of nodes. Now, when we request, for instance, 4 nodes, the job
>> only gets 1 node assigned.
>>
>> Let me illustrate it with an example:
>>
>
> I had the same issue, and the cause was the queue limit on max user
> processes. Your "batch" queue has a limit of 5 processes; was another
> process running?
>
> - --
> Paul Gray                                         -o)
> 314 East Gym, Dept. of Computer Science           /\\
> University of Northern Iowa                      _\_V
> Message void if penguin violated ...  Don't mess with the penguin
> No one says, "Hey, I can't read that ASCII attachment ya sent me."
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.6 (GNU/Linux)
>
> iD8DBQFGqLVyOH45TZW7mh4RAqWSAKCogdqzimdCtO7qzP08XJIVBvPRDgCeNUH0
> CXOFY1nS0glk2Y+iSn6Vzx4=
> =DE+1
> -----END PGP SIGNATURE-----
