[torquedev] New TORQUE job state

Garrick Staples garrick at clusterresources.com
Wed Jul 11 18:41:19 MDT 2007


On Wed, Jul 11, 2007 at 11:48:48AM -0700, Wesley Emeneker alleged:
> Garrick,
>   I am interning at CRI this summer, and I'm working on integrating
> virtual machine (VM) deployment into Moab.
> I would like to ask about adding some functionality to TORQUE, but I
> need to explain what I'm doing.
> Here goes...
> 
> The plan of what I'm doing is to create and destroy VMs dynamically so
> that we can switch out cluster software environments, on demand, for
> different jobs.
> We are currently able to dynamically create nodes in TORQUE and assign
> jobs to them without TORQUE knowing that the node is actually a virtual
> machine.
> Each virtual machine runs a MOM of its own and coordinates with the
> TORQUE server.
> When a job wants to run inside a VM, Moab provisions the nodes (aka
> boots the VMs), and then gives PBS the VM nodes as the nodelist.
> The job then executes inside the VMs, and when the job is done Moab
> destroys the VMs.
> 
> The problem I am facing occurs when I try to preserve the VMs.
> One of the great features of many VMs is that we can save the state of
> the entire VM to disk and restore it later.
> This will let us do transparent checkpointing and preemption/restoration
> of any job we desire (what I call preservation).
> 
> I'm able to make Moab preserve and restore the job (aka VM), but a
> problem arises because PBS sees the node as job-exclusive even if it is
> down (which it is because the entire VM was saved to disk).
> Because PBS sees the job as active, Moab gets confused and puts the job
> into the Running state (instead of the Idle state that the preservation
> step set).
> Dave and Josh suggested that a new TORQUE job state would be the best
> way to handle this since we must have some kind of coordination between
> Moab job state and PBS job state.
> What I would like is some way to say that the job is "frozen" or
> "preserved" that basically corresponds to some kind of state other than
> running (Queued maybe?).
> We should also be able to "disassociate" a frozen job from a node so
> that the node isn't job-exclusive once the job is frozen.
> 
> Hopefully my explanation is clear (probably not).
> Let me know if you have any questions about what I'm doing.
> I look forward to hearing if this functionality will be possible.

If you *suspend* a job in PBS, it goes into the S state.  Internally to
pbs_server, that is state=running and substate=suspended, which is
represented externally as "S".
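
As a rough illustration of that mapping, here is a minimal Python sketch.
The enum names, their members, and external_state_letter are hypothetical
stand-ins for this example, not pbs_server's actual identifiers. The only
facts taken from the description above are that a suspended job is
state=running with substate=suspended, and that it is shown as "S".

```python
from enum import Enum


class JobState(Enum):
    # Hypothetical subset of internal job states, for illustration only.
    QUEUED = "Q"
    RUNNING = "R"


class JobSubstate(Enum):
    # Hypothetical subset of internal substates, for illustration only.
    NONE = "none"
    SUSPENDED = "suspended"


def external_state_letter(state: JobState, substate: JobSubstate) -> str:
    """Return the single-letter state shown to clients.

    A job that is internally running but whose substate is suspended
    is reported as "S"; otherwise the letter follows the main state.
    """
    if state is JobState.RUNNING and substate is JobSubstate.SUSPENDED:
        return "S"
    return state.value
```

The point of the sketch is that "S" is not a separate top-level state:
it is derived from the running state plus a substate.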

If you *don't* suspend jobs before ripping out nodes, then obviously
pbs_server will still consider the downed nodes to have running jobs.

It would seem to me that Moab should first suspend the jobs
through pbs_server?
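
A minimal sketch of that ordering in Python, assuming the scheduler side
can shell out to TORQUE's qsig and that "qsig -s suspend <jobid>" is the
suspend request (check the qsig man page for your version). The names
preserve_vm_job, save_vm, and the run parameter are hypothetical and exist
only for this example; how Moab actually saves a VM is not covered here.

```python
import subprocess
from typing import Callable, List, Sequence


def suspend_command(job_id: str) -> List[str]:
    # Assumption: "qsig -s suspend" asks pbs_server to suspend the job.
    return ["qsig", "-s", "suspend", job_id]


def _run_checked(cmd: Sequence[str]) -> None:
    # Raises CalledProcessError if the command exits non-zero.
    subprocess.run(list(cmd), check=True)


def preserve_vm_job(
    job_id: str,
    save_vm: Callable[[], None],
    run: Callable[[Sequence[str]], None] = _run_checked,
) -> None:
    """Suspend the job through pbs_server first, then save the VM.

    If the suspend request raises, save_vm is never called, so the
    node is not taken down while pbs_server still sees a running job.
    """
    run(suspend_command(job_id))
    save_vm()
```

Passing a fake run callable gives a dry run that records the command
without needing a TORQUE installation.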


