[torqueusers] node bad state
Ghislain ESCORNE
ghislain.escorne at obs.ujf-grenoble.fr
Thu Dec 1 06:51:55 MST 2005
Thank you for your help.
My problem:
When I reinstalled my frontend, I forgot to transfer /etc/group.
This produced the following messages in the MOM logs.
On the first node:
12/01/2005 14:37:02;0008; pbs_mom;Job;52.rock-lgit.obs.ujf-grenoble.fr;ERROR: received request 'ABORT_JOB' from 10.255.255.253:1023 for job '52.rock-lgit.obs.ujf-grenoble.fr' (job does not exist locally)
12/01/2005 14:37:02;0008; pbs_mom;Job;52.rock-lgit.obs.ujf-grenoble.fr;No Group Entry for Group 1110
And on the second node:
12/01/2005 14:37:03;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 52.rock-lgit.obs.ujf-grenoble.fr, job_start_error from node 10.255.255.254:15003 in job_start_error
12/01/2005 14:37:03;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 52.rock-lgit.obs.ujf-grenoble.fr, abort attempted 16 times in job_start_error. ignoring abort request from node 10.255.255.254:15003
12/01/2005 14:37:03;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
12/01/2005 14:37:03;0001; pbs_mom;Req;obit reply;Job not found for obit reply
12/01/2005 14:37:03;0001; pbs_mom;Job;52.rock-lgit.obs.ujf-grenoble.fr;server rejected job obit - unexpected job state
12/01/2005 14:37:03;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=compute-0-1.local MSG=cannot locate job to delete), aux=0, type=DeleteJob, from PBS_Server at rock-lgit.obs.ujf-grenoble.fr
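The "No Group Entry for Group 1110" message suggests that the job owner's primary GID no longer resolves on the node. Here is a minimal sketch of a check that could be run on each node, assuming Python is available there. The helper name account_problems and the user name "alice" are hypothetical and not part of TORQUE:

```python
import grp
import pwd


def account_problems(username, getpwnam=pwd.getpwnam, getgrgid=grp.getgrgid):
    """Return a list of problems with the passwd/group entries for username.

    The lookup functions are parameters only so the check can be exercised
    without touching the real account databases.
    """
    try:
        entry = getpwnam(username)
    except KeyError:
        # The user itself is unknown on this machine.
        return [f"no passwd entry for user {username!r}"]
    try:
        getgrgid(entry.pw_gid)
    except KeyError:
        # The user exists, but its primary GID has no group entry.
        return [f"no group entry for GID {entry.pw_gid} (primary group of {username!r})"]
    return []


# Example: for problem in account_problems("alice"): print(problem)
```

An empty list means both the user and its primary group resolve on that machine. It does not check supplementary groups.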
Thanks for your help
Bye
Garrick Staples wrote:
>On Wed, Nov 30, 2005 at 06:02:42AM -0500, Ghislain ESCORNE alleged:
>
>
>>Garrick Staples wrote:
>>
>>
>>
>>>On Wed, Nov 30, 2005 at 05:18:46AM -0500, Ghislain ESCORNE alleged:
>>>
>>>
>>>
>>>
>>>>Hello,
>>>>I have a problem when I try to submit many jobs which need to run on
>>>>more than one node.
>>>>
>>>>
>>>>
>>>>
>>>What exactly is the problem? These emails have had a wealth of
>>>information, but I'm having trouble grasping the actual observed
>>>problem.
>>>
>>>
>>>
>>>
>>When I submit a script with
>>#PBS -l nodes=1:ppn=2 the script runs correctly
>>but when I submit
>>#PBS -l nodes=2:ppn=2 the job stays in queue (Job bounces from status
>>R to status Q)
>>the logs of pbs_server show :
>>
>>compute-0-1.local with bad state (state: QUEUED)
>>code=15016(Request invalid for state of job)
>>
>>Thanks for your help
>>
>>
>
>There are several reasons why a job will fail to start. Do you see any
>errors in the MOM logs? Be sure to increase the loglevel on MOM if you
>don't see anything. Also be sure TORQUE is configured with
>--enable-syslog and look in /var/log/messages (or wherever your syslog
>writes).
>
>And verify the following on all machines:
> - DNS resolution works correctly with matching forward and reverse
> - the time is synced correctly
> - user accounts exist
> - user home directories can be mounted
> - prologue scripts exit with 0
>
>
>
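The DNS item in the checklist above can be checked with a small script. This is only a sketch using Python's standard socket module. The helper name dns_mismatch is hypothetical, and it would need to be run on every machine for every node name:

```python
import socket


def dns_mismatch(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return None if forward and reverse lookups agree, else a description.

    The lookup functions are parameters only so the logic can be tested
    without a resolver.
    """
    try:
        address = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"forward lookup of {hostname} failed: {exc}"
    try:
        name, aliases, _addresses = reverse(address)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return f"reverse lookup of {address} failed: {exc}"
    # Host names are case-insensitive, so compare in lower case.
    known = {n.lower() for n in [name, *aliases]}
    if hostname.lower() not in known:
        return f"{hostname} -> {address} -> {name}"
    return None


# Example: print(dns_mismatch("compute-0-1.local"))
```

A result of None means the name resolves to an address and that address resolves back to the same name or one of its aliases.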