[torqueusers] Unknown Job Id Behavior
Joshua Bernstein
jbernstein at penguincomputing.com
Fri May 23 12:49:02 MDT 2008
> Date: Thu, 22 May 2008 15:42:29 -0700
> From: Joshua Bernstein <jbernstein at penguincomputing.com>
> Subject: [torqueusers] Unknown Job Id Behavior
> To: torqueusers at supercluster.org
> Message-ID: <4835F6D5.6040104 at penguincomputing.com>
>
> Hello All,
>
> Consider a cluster where a node has a job that is running on it, and
> the node reboots, killing the job, as well as pbs_mom. At this point
> pbs_server (via pbsnodes) still reports that the node is marked as
> "job-exclusive", which is fine for now.
>
> The node boots up and pbs_mom starts cleanly. Shortly afterward, the
> pbs_mom on the node receives a message from the server requesting the
> status of the job:
>
> 05/22/2008 14:49:11;0100; pbs_mom;Req;;Type StatusJob request received
> from PBS_Server@master, sock=10
> 05/22/2008 14:49:11;0080; pbs_mom;Req;req_reject;Reject reply
> code=15001(Unknown Job Id), aux=0, type=StatusJob, from PBS_Server@master
>
> Here, pbs_mom correctly reports that this JobID is unknown. I would
> figure that pbs_server should do something with this information, and
> hence mark the job as failed, dequeue, and mark the node as back up, but
> nothing seems to happen.
>
> What am I missing here? If a node reboots and the job running on it dies,
> why doesn't pbs_server understand that and report the job as failed?
>
> I'm running TORQUE 2.1.9
>
> -Josh
>
> On Thu, May 22, 2008 at 03:42:29PM -0700, Joshua Bernstein alleged:
>
> That's not correct. Upon boot, pbs_mom should recover the job state and send
> job exit notices to pbs_server.
Well, I'd imagine it's not correct, but that is the behavior I'm observing.

Perhaps I should mention that our configuration is completely diskless.
On boot, we create the proper directory structure for pbs_mom,
including /var/spool/torque/mom_priv/jobs. From what I understand, this
directory on the node contains information about running jobs. But when
a diskless node is rebooted, this directory (stored in tmpfs) is
obliterated and recreated. Hence, when the mom starts up, it knows
nothing about the job.
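
To illustrate what the mom has to work with after a reboot, here is a
minimal sketch that lists the jobs for which a saved job file exists in
a given directory. The function name recorded_jobs is hypothetical, and
the ".JB" suffix is my assumption about how pbs_mom names its saved job
files; adjust it if your installation differs. On a freshly recreated
tmpfs directory the result would simply be an empty list.

```python
from pathlib import Path


def recorded_jobs(jobs_dir):
    """Return the job IDs that have a saved job file in jobs_dir.

    Assumption: each job is saved as "<jobid>.JB". A missing or empty
    directory yields an empty list, which is the situation a diskless
    node is in after its tmpfs has been recreated.
    """
    path = Path(jobs_dir)
    if not path.is_dir():
        return []
    # Path.stem drops only the final ".JB" suffix, leaving the job ID.
    return sorted(p.stem for p in path.glob("*.JB"))


if __name__ == "__main__":
    # Directory taken from the configuration described above.
    print(recorded_jobs("/var/spool/torque/mom_priv/jobs"))
```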

So I go back to my first point: shouldn't pbs_mom tell pbs_server that
it doesn't know anything about that job, and shouldn't pbs_server then
mark the job as failed?
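
For anyone who wants to spot these rejections in their own mom logs,
here is a rough sketch that pulls the reject records out of log text.
The field layout is inferred only from the two log lines quoted above,
so treat the pattern as an assumption rather than a description of the
TORQUE log format. The name find_rejects is hypothetical. The pattern
accepts both "user@host" and the "user at host" form that the list
archive produces.

```python
import re

# Pattern inferred from the "Reject reply" line quoted in this thread.
REJECT = re.compile(
    r"Reject reply\s+code=(?P<code>\d+)\((?P<reason>[^)]*)\),\s*"
    r"aux=(?P<aux>-?\d+),\s*type=(?P<type>\w+),\s*"
    r"from\s+(?P<user>\S+?)(?:@|\s+at\s+)(?P<host>\S+)"
)


def find_rejects(log_text):
    """Return one dict per reject record found in log_text.

    Whitespace is collapsed first so that a record wrapped across two
    lines (as in a mail message) still matches.
    """
    flat = " ".join(log_text.split())
    return [m.groupdict() for m in REJECT.finditer(flat)]


if __name__ == "__main__":
    sample = (
        "05/22/2008 14:49:11;0080; pbs_mom;Req;req_reject;Reject reply\n"
        "code=15001(Unknown Job Id), aux=0, type=StatusJob, "
        "from PBS_Server@master\n"
    )
    for record in find_rejects(sample):
        print(record["code"], record["reason"], record["type"])
```

This only reports what the mom logged; it does nothing to make
pbs_server act on the rejection, which is the open question here.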

> Are you accidentally starting pbs_mom with an -r or -p option? The pbs_mom
> manpage details the options (a normal boot should have no options).

Nope, I tried it both ways, but our pbs_moms always start at boot
without the -r or -p option.

-Joshua Bernstein
Software Engineer
Penguin Computing