[torquedev] trunk: job arrays

Glen Beane glen.beane at gmail.com
Thu Aug 9 19:15:08 MDT 2007


One nasty job array bug that I am planning on tackling right now
involves two linked lists: there is a server-wide linked list of job
arrays, and for each job array there is a linked list of the jobs that
belong to the array.

Neither of these is re-initialized after a server restart, so if you
have job arrays queued and the server is shut down and restarted, you
may end up with some nasty segmentation faults when pbs_server
attempts to use uninitialized linked list pointers that belong to
the job struct.
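
To make the failure mode concrete, here is a minimal sketch of the two
lists and of what recovery would have to do. It assumes a simple
intrusive circular list; struct link, struct job_array, struct job,
recover_array() and recover_job() are all hypothetical names for
illustration, not the actual TORQUE structures or list API. The point
is only that link fields in a struct restored from disk hold stale
pointers and must be reset before the struct is linked anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical intrusive circular doubly linked list (not TORQUE's API). */
struct link {
    struct link *prev;
    struct link *next;
    void        *owner;  /* struct that embeds this link; NULL for a list head */
};

/* Reset a link (or list head) to the empty, self-referencing state. */
static void link_init(struct link *l, void *owner) {
    l->prev  = l;
    l->next  = l;
    l->owner = owner;
}

/* Append an initialized, currently unlinked node to the tail of a list. */
static void link_append(struct link *head, struct link *node) {
    assert(node->next == node && node->prev == node);
    node->prev       = head->prev;
    node->next       = head;
    head->prev->next = node;
    head->prev       = node;
}

static size_t list_length(const struct link *head) {
    size_t n = 0;
    for (const struct link *l = head->next; l != head; l = l->next)
        n++;
    return n;
}

struct job_array {
    struct link all_arrays;  /* membership in the server-wide array list */
    struct link jobs;        /* head of the list of jobs in this array */
    char        id[64];
};

struct job {
    struct link array_link;  /* membership in the parent array's job list */
    char        id[64];
    char        parent_array_id[64];
};

/* Server-wide list of job arrays. */
static struct link server_arrays;

/* Must run once at startup, before any array or job is recovered. */
static void server_arrays_init(void) {
    link_init(&server_arrays, NULL);
}

static struct job_array *find_array(const char *id) {
    for (struct link *l = server_arrays.next; l != &server_arrays; l = l->next) {
        struct job_array *a = l->owner;
        if (strcmp(a->id, id) == 0)
            return a;
    }
    return NULL;
}

/* The array was just read back from disk: its link fields are stale,
 * so reset both before putting it on the server-wide list. */
static void recover_array(struct job_array *a) {
    link_init(&a->all_arrays, a);
    link_init(&a->jobs, NULL);
    link_append(&server_arrays, &a->all_arrays);
}

/* Same for a recovered job: reset its link, then reattach it to its
 * parent array. Returns -1 if the parent array is not known. */
static int recover_job(struct job *j) {
    link_init(&j->array_link, j);
    struct job_array *a = find_array(j->parent_array_id);
    if (a == NULL)
        return -1;
    link_append(&a->jobs, &j->array_link);
    return 0;
}
```

In this sketch the arrays have to be recovered before the jobs, since
recover_job() looks the parent up by id. A job whose parent array is
missing is reported to the caller instead of being left with dangling
links.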

I think I will probably set up a server_priv/arrays directory where
there would be a file for each array.  The data in these files would
allow me to rebuild the server list of arrays, and I could also track
how many of the array's jobs have been successfully spawned.  Since
the array job "cloning" is done in batches through pbs_server work
tasks, it would be possible for the server to get shut down after the
array has been partially built.  Upon restart, pbs_server does not
currently resume the job cloning process.  If we can update this array
file after every successful job clone (this would have to be pretty
fast), then it would be possible to resume the job cloning process
after a server restart.
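
A rough sketch of what such a per-array file and the resume logic could
look like is below. Everything in it is an assumption for illustration:
the one-line text format, struct array_state, the function names, and
the demo_clone() placeholder that stands in for the real cloning step.
It also assumes POSIX rename() semantics, where renaming over an
existing file replaces it, so that a reader never sees a half-written
file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-array record; not an actual TORQUE file format. */
struct array_state {
    char id[64];   /* array identifier, assumed to contain no whitespace */
    int  total;    /* number of jobs the array should end up with */
    int  cloned;   /* number of jobs successfully cloned so far */
};

/* Write the record to "<path>.tmp", then rename it over <path>.
 * Returns 0 on success, -1 on failure. */
static int array_state_save(const char *path, const struct array_state *st) {
    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%s %d %d\n", st->id, st->total, st->cloned) > 0;
    if (fclose(f) != 0)
        ok = 0;
    if (!ok) {
        remove(tmp);
        return -1;
    }
    /* Note: surviving power loss would also need an fsync before this. */
    if (rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}

/* Read a record back and sanity-check it. Returns 0 on success. */
static int array_state_load(const char *path, struct array_state *st) {
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int fields = fscanf(f, "%63s %d %d", st->id, &st->total, &st->cloned);
    fclose(f);
    if (fields != 3)
        return -1;
    if (st->total < 0 || st->cloned < 0 || st->cloned > st->total)
        return -1;
    return 0;
}

/* Callback that creates job number `index` of the array; 0 on success. */
typedef int (*clone_fn)(const char *array_id, int index);

/* Placeholder for the real cloning step: it only records the index. */
static int demo_last_index = -1;
static int demo_clone(const char *array_id, int index) {
    (void)array_id;
    demo_last_index = index;
    return 0;
}

/* Clone up to `batch` more jobs, updating the array file after each
 * successful clone. Returns the number cloned in this call, or -1. */
static int array_clone_batch(const char *path, struct array_state *st,
                             int batch, clone_fn clone_one) {
    assert(batch >= 0);
    int done = 0;
    while (done < batch && st->cloned < st->total) {
        if (clone_one(st->id, st->cloned) != 0)
            return -1;
        st->cloned++;
        if (array_state_save(path, st) != 0)
            return -1;
        done++;
    }
    return done;
}
```

On restart, the server would call array_state_load() for each file in
the arrays directory and pass the result back to array_clone_batch(),
which picks up at the first index that was not recorded. One caveat:
if the server dies between a clone and the file update, that index is
cloned a second time, so the cloning step would need to tolerate a job
that already exists.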

I would love to hear suggestions!

