[torqueusers] Unknown Job Id Behavior

Joshua Bernstein jbernstein at penguincomputing.com
Fri May 23 12:49:02 MDT 2008


> Date: Thu, 22 May 2008 15:42:29 -0700
> From: Joshua Bernstein <jbernstein at penguincomputing.com>
> Subject: [torqueusers] Unknown Job Id Behavior
> To: torqueusers at supercluster.org
> Message-ID: <4835F6D5.6040104 at penguincomputing.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> Hello All,
> 
> 	Consider a cluster where a node has a job that is running on it, and 
> the node reboots, killing the job, as well as pbs_mom. At this point 
> pbs_server (via pbsnodes) still reports that the node is marked as 
> "job-exclusive", which is fine for now.
> 
> 	The node boots up and pbs_mom starts cleanly. Shortly afterwards, the 
> pbs_mom on the node receives a message from the server requesting the 
> status of the job:
> 
> 05/22/2008 14:49:11;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server@master, sock=10
> 05/22/2008 14:49:11;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=StatusJob, from PBS_Server@master
> 
> Here, pbs_mom correctly reports that this job ID is unknown. I would 
> expect pbs_server to act on this information: mark the job as failed, 
> dequeue it, and mark the node as up again. Instead, nothing seems to 
> happen.
> 
> What am I missing here? If a node reboots and the job running on it 
> dies, why doesn't pbs_server detect that and report the job as failed?
> 
> I'm running TORQUE 2.1.9
> 
> -Josh
>
> On Thu, May 22, 2008 at 03:42:29PM -0700, Joshua Bernstein alleged:
>> > [...]
> 
> That's not correct.  Upon boot, pbs_mom should recover the job state and send
> job exit notices to pbs_server.

Well, I'd imagine it's not correct, but that is the behavior I'm observing.

Perhaps I should mention that our configuration is completely diskless. 
So on boot, we create the proper directory structure for pbs_mom, 
including /var/spool/torque/mom_priv/jobs. From what I understand, this 
directory on the node contains information about running jobs.

However, when a diskless node is rebooted, this directory (stored in 
tmpfs) is wiped and recreated empty. Hence, when the mom starts up, it 
knows nothing about the job.
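
To make the tmpfs point concrete, here is a rough Python sketch of what 
a freshly started mom would find on disk with and without persistent 
storage. It is only an illustration, not pbs_mom's actual recovery code. 
It assumes that each job is recorded as a "<jobid>.JB" file under 
mom_priv/jobs; the job ID "42.master" and the helper names are made up 
for the example.

```python
import os
import shutil
import tempfile

# Assumption: pbs_mom keeps one "<jobid>.JB" file per job in mom_priv/jobs.
JOB_SUFFIX = ".JB"


def recoverable_jobs(jobs_dir):
    """Return the job IDs whose job files are present in jobs_dir."""
    if not os.path.isdir(jobs_dir):
        return []
    return sorted(
        name[: -len(JOB_SUFFIX)]
        for name in os.listdir(jobs_dir)
        if name.endswith(JOB_SUFFIX)
    )


def simulate_reboot(persistent):
    """Create one job file, simulate a reboot, and list what survives."""
    with tempfile.TemporaryDirectory() as root:
        jobs_dir = os.path.join(root, "mom_priv", "jobs")
        os.makedirs(jobs_dir)
        # Hypothetical job ID, used only for this illustration.
        open(os.path.join(jobs_dir, "42.master" + JOB_SUFFIX), "w").close()
        if not persistent:
            # tmpfs case: the contents are lost and the boot scripts
            # recreate an empty directory structure.
            shutil.rmtree(jobs_dir)
            os.makedirs(jobs_dir)
        return recoverable_jobs(jobs_dir)


if __name__ == "__main__":
    print("disk-backed:", simulate_reboot(persistent=True))
    print("tmpfs:      ", simulate_reboot(persistent=False))
```

In the tmpfs case the list comes back empty, which matches what I see: 
the mom has nothing to recover and so has no exit notice to send.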

So I go back to my first point: shouldn't pbs_mom tell pbs_server that 
it doesn't know anything about that job, and shouldn't pbs_server then 
mark the job as failed?
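
For what it's worth, the reject is easy to spot mechanically in the mom 
log. The Python sketch below splits a log line on its semicolon-separated 
fields and flags a StatusJob request rejected with code 15001. The field 
names and function names are my own labels, not anything defined by 
TORQUE, and the two sample lines are the ones from my original message. 
Note that these log lines do not carry the job ID, so this only detects 
the symptom on the node; it does not change what pbs_server does.

```python
import re

UNKNOWN_JOB_ID = 15001  # reject code seen in the mom log

# Matches the message part of a "Reject reply" log line.
REJECT_RE = re.compile(
    r"Reject reply code=(?P<code>\d+)\((?P<text>[^)]*)\), "
    r"aux=(?P<aux>\d+), type=(?P<type>\w+), from (?P<peer>.+)$"
)

# The list archive renders "@" as " at "; the real log uses "@".
REQUEST_LINE = (
    "05/22/2008 14:49:11;0100;   pbs_mom;Req;;"
    "Type StatusJob request received from PBS_Server@master, sock=10"
)
REJECT_LINE = (
    "05/22/2008 14:49:11;0080;   pbs_mom;Req;req_reject;"
    "Reject reply code=15001(Unknown Job Id), aux=0, "
    "type=StatusJob, from PBS_Server@master"
)


def parse_mom_log_line(line):
    """Split a log line into six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    timestamp, event, daemon, obj_class, obj_name, message = parts
    return {
        "timestamp": timestamp,
        "event": event,
        "daemon": daemon.strip(),
        "class": obj_class,
        "object": obj_name,
        "message": message,
    }


def is_unknown_job_reject(line):
    """True if the line is a StatusJob request rejected as Unknown Job Id."""
    record = parse_mom_log_line(line)
    if record is None:
        return False
    match = REJECT_RE.search(record["message"])
    return (
        match is not None
        and int(match.group("code")) == UNKNOWN_JOB_ID
        and match.group("type") == "StatusJob"
    )


if __name__ == "__main__":
    for sample in (REQUEST_LINE, REJECT_LINE):
        print(is_unknown_job_reject(sample), sample[:40])
```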

> Are you accidentally starting pbs_mom with an -r or -p option?  The pbs_mom
> manpage details the options (a normal boot should have no options).

Nope. I tried it both ways, but our pbs_moms always start at boot 
without the -r or -p option.

-Joshua Bernstein
Software Engineer
Penguin Computing

