[torqueusers] Mother Superior talking to herself - mom crash

Christopher Samuel samuel at unimelb.edu.au
Sun Oct 9 20:44:04 MDT 2011


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi folks,

On Saturday morning (Melbourne time) we had a node (bruce005)
lose its pbs_mom with the final message in the logs being the
rather cryptic:

10/08/2011 03:36:36;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::check_ms, Mother Superior talking to herself

This is with Torque 2.4.16. I've never seen that before, so
I was wondering if anyone else had?

The error itself is from check_ms() in src/resmom/mom_comm.c
and has the comment:

 * Check to be sure this is a connection from Mother Superior on
 * a good port.
 * Check to make sure I am not Mother Superior (talking to myself).

So it appears to be something that's known to occur (or at
least be important enough to check that it doesn't happen).
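
To make that comment concrete, here is a minimal Python sketch of the
three conditions it names. It is not the Torque source: the function
name, its arguments, the order of the checks, and the reading of "a good
port" as a privileged port (below 1024) are all assumptions made for
illustration.

```python
# Illustrative only: NOT the check_ms() code from src/resmom/mom_comm.c.
PRIVILEGED_PORT_LIMIT = 1024  # assumption: "a good port" means a privileged port


def check_ms_sketch(sender_host, sender_port, my_host, ms_host):
    """Return None if the message is acceptable, else a reason string.

    All four parameters are hypothetical; ms_host is the job's Mother Superior.
    """
    if sender_port >= PRIVILEGED_PORT_LIMIT:
        return "non-privileged port"
    if sender_host != ms_host:
        return "sender is not Mother Superior"
    if my_host == ms_host:
        # The case reported in the log line above.
        return "Mother Superior talking to herself"
    return None


# A sister node (bruce006) hearing from the Mother Superior (bruce005).
print(check_ms_sketch("bruce005", 1023, "bruce006", "bruce005"))
# The Mother Superior receiving a message only a sister node should get.
print(check_ms_sketch("bruce005", 1023, "bruce005", "bruce005"))
```

In this sketch the last condition can only trigger when the node that is
Mother Superior for the job receives a message that only sister nodes
should receive.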

Here's the job info in case it helps.

[root at bruce-m vlsci]# tracejob -q -n 5 881198

Job: 881198.bruce-m.vlsci.unimelb.edu.au

10/07/2011 14:45:56  S    enqueuing into batch, state 1 hop 1
10/07/2011 14:45:56  S    Job Queued at request of evan at bruce.vlsci.unimelb.edu.au, owner = evan at bruce.vlsci.unimelb.edu.au, job name = 212221212112212121121221112212211, queue = batch
10/07/2011 14:45:56  A    queue=batch
10/08/2011 03:35:54  S    Job Run at request of root at bruce-m.vlsci.unimelb.edu.au
10/08/2011 03:36:00  S    unable to run job, MOM rejected/timeout
10/08/2011 03:36:03  S    Job Run at request of root at bruce-m.vlsci.unimelb.edu.au
10/08/2011 03:36:03  A    user=evan group=VR0062 jobname=212221212112212121121221112212211 queue=batch ctime=1317959156 qtime=1317959156 etime=1317959156 start=1318005363
                          owner=evan at bruce.vlsci.unimelb.edu.au
                          exec_host=bruce005/6+bruce005/5+bruce005/4+bruce005/3+bruce005/1+bruce005/0+bruce006/5+bruce006/1+bruce008/1+bruce008/0+bruce011/0+bruce012/6+bruce012/5+bruce012/4+bruce012/3+bruce108/7
                          Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.procs=16 Resource_List.pvmem=3gb Resource_List.walltime=08:00:00 
10/08/2011 11:46:05  S    Job deleted at request of root at bruce-m.vlsci.unimelb.edu.au
10/08/2011 11:46:05  S    Job sent signal SIGTERM on delete
10/08/2011 11:46:05  S    purging job without checking MOM
10/08/2011 11:46:05  S    dequeuing from batch, state RUNNING
10/08/2011 11:46:05  A    requestor=root at bruce-m.vlsci.unimelb.edu.au
[root at bruce-m vlsci]# 
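
The exec_host string above is hard to read by eye, so here is a short
Python sketch (not part of the original session) that splits it into
per-host slot counts. The helper name parse_exec_host is hypothetical.
It also prints the first host in the list, which, as far as I know, is
the one Torque treats as the job's Mother Superior.

```python
from collections import Counter

# exec_host value copied from the accounting record above.
EXEC_HOST = (
    "bruce005/6+bruce005/5+bruce005/4+bruce005/3+bruce005/1+bruce005/0"
    "+bruce006/5+bruce006/1+bruce008/1+bruce008/0+bruce011/0"
    "+bruce012/6+bruce012/5+bruce012/4+bruce012/3+bruce108/7"
)


def parse_exec_host(value):
    """Split 'host/slot+host/slot+...' into a list of (host, slot) pairs."""
    pairs = []
    for entry in value.split("+"):
        host, slot = entry.split("/")
        pairs.append((host, int(slot)))
    return pairs


pairs = parse_exec_host(EXEC_HOST)
slots_per_host = Counter(host for host, _ in pairs)
first_host = pairs[0][0]

print("first host:", first_host)
print("total slots:", len(pairs))
for host, count in slots_per_host.items():
    print(host, count)
```

The total comes to 16 slots, matching Resource_List.procs=16, and the
first host is bruce005, the node whose pbs_mom died.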
 

- -- 
    Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
         http://www.vlsci.unimelb.edu.au/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk6SW/QACgkQO2KABBYQAh+G4ACeOtTzVeIor4Hg7OpWMS5v6IAJ
ijAAnjvX7PKLHIpNkcOeUF14wOohMQwf
=n6uA
-----END PGP SIGNATURE-----

