[torqueusers] Jobs do not start

Albert Shih Albert.Shih at obspm.fr
Wed Nov 7 15:39:58 MST 2007

Hi all

I have a very strange problem with my cluster.

Until today every node had two interfaces: one for NFS (192....) and the
other (public IP) for everything else.

Now all my interfaces are on public IPs.

Since that change, something very strange has appeared.

When I submit a job, it only rarely (but not never) starts automatically. Many
jobs stay stuck in the queue, and I get this kind of message:

11/07/2007 23:32:35  S    entering post_sendmom
11/07/2007 23:32:35  S    child reported failure for job after 14 seconds (dest=blade2), rc=10
11/07/2007 23:32:35  S    unable to run job, MOM rejected/timeout
11/07/2007 23:32:40  S    MOM rejected modify request, error: 15001
11/07/2007 23:33:43  S    Job Run at request of root at frontal
11/07/2007 23:33:43  S    forking in send_job
11/07/2007 23:33:43  S    entering post_sendmom

But if I use

	qrun -H node job_id

the job starts correctly.
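
To see which nodes the rejections come from, failure lines like the one in
the log above can be tallied with a short script. This is only a minimal
sketch: it assumes the log keeps exactly the wording shown above, and the
names FAILURE_RE and count_failures are made up for this illustration; they
are not part of TORQUE.

```python
import re
from collections import Counter

# Matches the failure line from the log excerpt above.
# Assumption: the "child reported failure ... (dest=<node>), rc=<n>"
# wording is stable across log entries.
FAILURE_RE = re.compile(
    r"child reported failure for job after (?P<secs>\d+) seconds "
    r"\(dest=(?P<dest>[^)]+)\), rc=(?P<rc>-?\d+)"
)


def count_failures(lines):
    """Return a Counter keyed by (node, rc) for every failure line found."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[(match.group("dest"), int(match.group("rc")))] += 1
    return counts


# Sample input copied from the log excerpt above.
sample = [
    "11/07/2007 23:32:35  S    entering post_sendmom",
    "11/07/2007 23:32:35  S    child reported failure for job after 14 seconds (dest=blade2), rc=10",
    "11/07/2007 23:32:35  S    unable to run job, MOM rejected/timeout",
]

for (node, rc), n in count_failures(sample).items():
    print(f"{node}: {n} failure(s) with rc={rc}")
```

Fed a longer log, this would show whether the failures are spread over all
nodes or concentrated on a few.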

Does anyone have an idea?



Albert SHIH
Observatoire de Paris Meudon
SIO batiment 15
Telephone: 01 45 07 76 26
Local time:
Wed 7 Nov 2007 23:35:33 CET
