[torqueusers] searching for the directory of running job

Christoph (Stucki) von Stuckrad stucki at mi.fu-berlin.de
Thu Jan 31 14:07:53 MST 2013


On Thu, 31 Jan 2013, Michael Jennings wrote:

> TORQUE copies job scripts but does NOT copy executables.  The Linux
> kernel on the compute node(s) keeps the text and data segments of the
> executable in memory even after it is overwritten.

This depends on how you submit things. Nothing keeps you from
explicitly sending the executables along with the job into
the job's spool area via the 'stagein=...files...' statement.
That would be the preferred way to run a binary which must
be compiled for each job and will change while jobs run.
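
As a rough sketch of how such a stage-in request could be assembled
for qsub: the helper below only builds the argument list. The format
'local_file@hostname:remote_file' for '-W stagein=' is how I recall it
from the qsub man page, so please check the documentation of your
TORQUE version. The function name, host name and paths are made up
for illustration.

```python
def stagein_args(files, source_host):
    """Build qsub arguments asking TORQUE to stage files in before the job.

    files: list of (local_name, remote_path) pairs, where local_name is the
    name on the execution host and remote_path is the file on source_host.
    Assumed format: -W stagein=local_file@hostname:remote_file[,...]
    """
    if not files:
        return []
    spec = ",".join(
        f"{local}@{source_host}:{remote}" for local, remote in files
    )
    return ["-W", f"stagein={spec}"]


# Hypothetical host and paths, for illustration only.
args = stagein_args(
    [("solver", "/home/user/build/solver")], "head.example.org"
)
print(" ".join(["qsub"] + args + ["job.sh"]))
```

The point of staging is that each job gets its own private copy, so a
later rebuild of the original file cannot affect a job already running.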

But if you work with a shared filesystem for all the
cluster nodes, the behavior of the nodes can be erratic
if you switch binaries under running processes. File deletes
on a local disk and via NFS have different semantics.
The moment at which a node 'notices' that a shared binary
has been changed can vary a lot: we have seen latencies of
more than 15 minutes with Linux NFS between a file changing
on the server and the client noticing the change.
Also, the dynamic loading of shared libraries may bring
different 'compiles' together (the old binary keeps running,
but later loads a newer library, which may have side effects).
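
The local-disk half of this is easy to see in a small sketch. It
assumes a local POSIX filesystem (the temp directory must not be on
NFS) and uses an open file handle as a stand-in for a running process;
the function name and file contents are made up. After the file is
replaced by rename, the old handle still sees the old data, while the
path already names the new file. Over NFS you should not rely on this
behavior.

```python
import os
import tempfile


def read_after_replace(directory):
    """Return (old_view, new_view) after a file is replaced while open."""
    path = os.path.join(directory, "binary")
    with open(path, "wb") as f:
        f.write(b"old build")

    # Stand-in for a running process that still has the executable open.
    old_handle = open(path, "rb")

    # Replace by writing a new file and renaming it over the old name.
    # The path now refers to a new inode; the old inode stays alive
    # for as long as something holds it open.
    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(b"new build")
    os.replace(tmp, path)

    try:
        old_view = old_handle.read()
    finally:
        old_handle.close()
    with open(path, "rb") as f:
        new_view = f.read()
    return old_view, new_view


with tempfile.TemporaryDirectory() as d:
    old_view, new_view = read_after_replace(d)
    print(old_view, new_view)
```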

So we warn our users NEVER to change a binary while a job
is still running it, and instead to install different variants
in different places.
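
One possible way to follow that advice, as a minimal sketch and not
something we prescribe: copy every build into a directory named after
a hash of its contents, and let each job use the path it was given.
The function name, directory layout and 12-character hash prefix are
arbitrary choices for illustration.

```python
import hashlib
import os
import shutil
import tempfile


def install_variant(binary_path, install_root):
    """Copy a binary into a directory named after its content hash.

    Returns the path of the installed copy. An existing copy is never
    overwritten, so jobs already running it are left alone.
    """
    with open(binary_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:12]
    target_dir = os.path.join(install_root, digest)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(binary_path))
    if not os.path.exists(target):
        shutil.copy2(binary_path, target)
    return target


with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "solver")
    with open(src, "wb") as f:
        f.write(b"build 1")
    first = install_variant(src, os.path.join(d, "variants"))

    # Rebuilding changes the source file, but not the installed copy.
    with open(src, "wb") as f:
        f.write(b"build 2")
    second = install_variant(src, os.path.join(d, "variants"))
    print(first != second)
```

Old variants have to be cleaned up by hand once no job uses them any
more; the sketch does not try to track that.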

Stucki

-- 
Christoph von Stuckrad      * * |nickname |Mail <stucki at mi.fu-berlin.de> \
Freie Universitaet Berlin   |/_*|'stucki' |Tel(Mo.,Mi.):+49 30 838-75 459|
Mathematik & Informatik EDV |\ *|if online|  (Di,Do,Fr):+49 30 77 39 6600|
Takustr. 9 / 14195 Berlin   * * |on IRCnet|Fax(home):   +49 30 77 39 6601/

