[torqueusers] node bad state

Ghislain ESCORNE ghislain.escorne at obs.ujf-grenoble.fr
Thu Dec 1 06:51:55 MST 2005


Thank you for your help.
I found my problem:
when I reinstalled my frontend, I forgot to transfer /etc/group.
These messages then appeared in the logs.

On the first node:
12/01/2005 14:37:02;0008;   pbs_mom;Job;52.rock-lgit.obs.ujf-grenoble.fr;ERROR:    received request 'ABORT_JOB' from 10.255.255.253:1023 for job '52.rock-lgit.obs.ujf-grenoble.fr' (job does not exist locally)
12/01/2005 14:37:02;0008;   pbs_mom;Job;52.rock-lgit.obs.ujf-grenoble.fr;No Group Entry for Group 1110

On the second node:

12/01/2005 14:37:03;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 52.rock-lgit.obs.ujf-grenoble.fr, job_start_error from node 10.255.255.254:15003 in job_start_error
12/01/2005 14:37:03;0001;   pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 52.rock-lgit.obs.ujf-grenoble.fr, abort attempted 16 times in job_start_error.  ignoring abort request from node 10.255.255.254:15003
12/01/2005 14:37:03;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
12/01/2005 14:37:03;0001;   pbs_mom;Req;obit reply;Job not found for obit reply
12/01/2005 14:37:03;0001;   pbs_mom;Job;52.rock-lgit.obs.ujf-grenoble.fr;server rejected job obit - unexpected job state
12/01/2005 14:37:03;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=compute-0-1.local MSG=cannot locate job to delete), aux=0, type=DeleteJob, from PBS_Server@rock-lgit.obs.ujf-grenoble.fr
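
The "No Group Entry for Group 1110" message above means the node could not resolve the job's GID to a group entry. A minimal sketch of a check that could be run on each machine before submitting jobs, assuming Python is available there. The name check_account is hypothetical and not part of TORQUE, and the two lookup parameters exist only so the function can be exercised without touching the real account databases:

```python
import grp
import pwd


def check_account(uid, gid, getpwuid=pwd.getpwuid, getgrgid=grp.getgrgid):
    """Return a list of problems found for this UID/GID on the local machine.

    Hypothetical helper: it only asks the local account databases
    (as seen through the pwd and grp modules) whether the IDs resolve.
    """
    problems = []
    try:
        getpwuid(uid)  # raises KeyError when the UID has no passwd entry
    except KeyError:
        problems.append(f"no passwd entry for UID {uid}")
    try:
        getgrgid(gid)  # raises KeyError when the GID has no group entry
    except KeyError:
        problems.append(f"no group entry for GID {gid}")
    return problems
```

For example, calling check_account(os.getuid(), 1110) as the submitting user on each node would report whether GID 1110 from the log is missing there. An empty list means both IDs resolve.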

Thanks for your help
Bye


Garrick Staples wrote:

>On Wed, Nov 30, 2005 at 06:02:42AM -0500, Ghislain ESCORNE alleged:
>
>>Garrick Staples wrote:
>>
>>>On Wed, Nov 30, 2005 at 05:18:46AM -0500, Ghislain ESCORNE alleged:
>>>
>>>>Hello,
>>>>I have a problem when I try to submit many jobs which need to run on 
>>>>more than one node.
>>>
>>>What exactly is the problem?  These emails have had a wealth of
>>>information, but I'm having trouble grasping the actual observed
>>>problem.
>>>
>>When I submit a script with
>>#PBS -l nodes=1:ppn=2
>>the script runs correctly, but when I submit one with
>>#PBS -l nodes=2:ppn=2
>>the job stays in the queue (it bounces from state R back to state Q).
>>The pbs_server logs show:
>>
>>compute-0-1.local with bad state (state: QUEUED)
>>code=15016(Request invalid for state of job)
>>
>>Thanks for your help
>>
>
>There are several reasons why a job will fail to start.  Do you see any
>errors in the MOM logs?  Be sure to increase the loglevel on MOM if you
>don't see anything.  Also be sure TORQUE is configured with
>--enable-syslog and look in /var/log/messages (or wherever your syslog
>writes).
>
>And verify the following on all machines:
>  - DNS resolution works correctly with matching forward and reverse
>  - the time is synced correctly
>  - user accounts exist
>  - user home directories can be mounted
>  - prologue scripts exit with 0
>


