[torqueusers] Job bounces from status R to status Q

Jonathan G. Atencio jatencio at gmail.com
Mon Oct 31 16:36:55 MST 2005


Hello Fred,

I would make sure that the user IDs and group IDs match between the master
node and the compute nodes. I experienced a similar problem, and I am
pretty sure it is related. Please let me know if that turns out to be the
case.
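
If it helps, here is a rough sketch in Python of one way to compare the
IDs. It assumes passwordless ssh works for whoever runs it and that each
node has the standard `id` command. The helper names (parse_id_output,
find_mismatches, collect_ids) are made up for this example and are not
part of TORQUE.

```python
import re
import subprocess

# Matches the start of `id` output, e.g. "uid=500(fmagee) gid=500(fmagee) ..."
ID_PATTERN = re.compile(r"uid=(\d+)\S*\s+gid=(\d+)")


def parse_id_output(text):
    """Return (uid, gid) as integers from one line of `id` output."""
    match = ID_PATTERN.search(text)
    if match is None:
        raise ValueError(f"unrecognized id output: {text!r}")
    return int(match.group(1)), int(match.group(2))


def find_mismatches(ids_by_host, reference_host):
    """Return {host: ids} for every host whose ids differ from reference_host."""
    expected = ids_by_host[reference_host]
    return {host: ids for host, ids in ids_by_host.items() if ids != expected}


def collect_ids(hosts, user):
    """Run `id <user>` on each host over ssh (assumes passwordless ssh).

    A host where the command fails (for example, the user does not exist
    there) is recorded as None, which also counts as a mismatch.
    """
    result = {}
    for host in hosts:
        proc = subprocess.run(
            ["ssh", host, "id", user], capture_output=True, text=True
        )
        result[host] = parse_id_output(proc.stdout) if proc.returncode == 0 else None
    return result
```

Something like find_mismatches(collect_ids(["master", "node2", "node3"],
"fmagee"), "master") should come back empty if the accounts line up. The
host list here is only a guess based on the names in your log.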

Thanks,

Jonathan

On 10/31/05, Magee, Fred (MRC) <fred.magee at atk.com> wrote:
>
>
>
> Good afternoon.
>
>
>
> I've just installed torque-2.0.0p0 on an Athlon-based cluster running Red
> Hat EL 3 U4 with mpich-1.2.6 and mpiexec-0.77.  I also just reinstalled the
> OS on the master node due to a SCSI controller failure.  I have confirmed
> that root can ssh, scp, rsh and rcp from any node to any node without a
> password; however, rcp yields the following messages:
>
>
>
> Trying krb4 rcp...
>
> trying normal rcp (/usr/bin/rcp)
>
>
>
> and ssh from master to the compute nodes yields:
>
>
>
> Warning: No xauth data; using fake authentication data for X11 forwarding.
>
>
>
> But the command works.
>
>
>
> If I submit a simple test job to one node, it works.  The job just prints
> out the node/processor it is running on as in "Greetings from head process 0
> on node2…".  If I submit the job to two or more nodes I get the following:
>
>
>
> 10/31/2005 13:35:47;0100;PBS_Server;Job;6.master.cl.abq.mrc;enqueuing into
> workq, state 1 hop 1
>
> 10/31/2005 13:35:47;0008;PBS_Server;Job;6.master.cl.abq.mrc;Job Queued at
> request of fmagee at master.cl.abq.mrc, owner = fmagee at master.cl.abq.mrc,
> job name = mine.pbs, queue = workq
>
> 10/31/2005 13:35:47;0040;PBS_Server;Svr;master.cl.abq.mrc;Scheduler sent
> command new
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusServer request received
> from Scheduler at master.cl.abq.mrc, sock=10
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusNode request received
> from Scheduler at master.cl.abq.mrc, sock=10
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusQueue request received
> from Scheduler at master.cl.abq.mrc, sock=10
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type SelStat request received from
> Scheduler at master.cl.abq.mrc, sock=10
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type SelStat request received from
> Scheduler at master.cl.abq.mrc, sock=10
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type ResourceQuery request received
> from Scheduler at master.cl.abq.mrc, sock=10
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type ModifyJob request received
> from Scheduler at master.cl.abq.mrc, sock=10
>
> 10/31/2005 13:35:47;0008;PBS_Server;Job;6.master.cl.abq.mrc;Job Modified at
> request of Scheduler at master.cl.abq.mrc
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type RunJob request received from
> Scheduler at master.cl.abq.mrc, sock=10
>
> 10/31/2005 13:35:47;0008;PBS_Server;Job;6.master.cl.abq.mrc;Job Run at
> request of Scheduler at master.cl.abq.mrc
>
> 10/31/2005 13:35:47;0040;PBS_Server;Svr;master.cl.abq.mrc;Scheduler sent
> command recyc
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type JobObituary request received
> from pbs_mom at node3.cl.abq.mrc, sock=10
>
> 10/31/2005 13:35:47;0040;PBS_Server;Svr;master.cl.abq.mrc;Scheduler sent
> command new
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type JobObituary request received
> from pbs_mom at node3.cl.abq.mrc, sock=11
>
> 10/31/2005 13:35:47;0009;PBS_Server;Job;6.master.cl.abq.mrc;obit received
> for job 6.master.cl.abq.mrc from host node3.cl.abq.mrc with bad state
> (state: QUEUED)
>
> 10/31/2005 13:35:47;0080;PBS_Server;Req;req_reject;Reject
> reply code=15016(Request invalid for state of job), aux=0, type=JobObituary,
> from pbs_mom at node3.cl.abq.mrc
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusServer request received
> from Scheduler at master.cl.abq.mrc, sock=12
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusNode request received
> from Scheduler at master.cl.abq.mrc, sock=12
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type StatusQueue request received
> from Scheduler at master.cl.abq.mrc, sock=12
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type SelStat request received from
> Scheduler at master.cl.abq.mrc, sock=12
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type SelStat request received from
> Scheduler at master.cl.abq.mrc, sock=12
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type ResourceQuery request received
> from Scheduler at master.cl.abq.mrc, sock=12
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type RunJob request received from
> Scheduler at master.cl.abq.mrc, sock=12
>
> 10/31/2005 13:35:47;0008;PBS_Server;Job;6.master.cl.abq.mrc;Job Run at
> request of Scheduler at master.cl.abq.mrc
>
> 10/31/2005 13:35:47;0040;PBS_Server;Svr;master.cl.abq.mrc;Scheduler sent
> command recyc
>
> 10/31/2005 13:35:47;0100;PBS_Server;Req;;Type JobObituary request received
> from pbs_mom at node3.cl.abq.mrc, sock=10
>
>
>
> I have found no reference to reply code 15016 other than "Request invalid
> for state of job".
>
>
>
> Can anyone out there point me to a quick solution for this problem?
>
>
>
> Thanks for your help and have a great day.
>
>
>
> Fred
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>

