[torqueusers] Job not start
Albert.Shih at obspm.fr
Wed Nov 7 15:39:58 MST 2007
I have a very strange problem with my cluster.
Until today every node had two interfaces, one for NFS (192....) and one
(public IP) for everything else.
Now all my interfaces are on public IPs.
Since that change, something very strange has been happening.
When I submit a job, it only rarely (but not never) starts automatically. Many
jobs stay stuck in the queue, and I get this kind of message:
11/07/2007 23:32:35 S entering post_sendmom
11/07/2007 23:32:35 S child reported failure for job after 14 seconds (dest=blade2), rc=10
11/07/2007 23:32:35 S unable to run job, MOM rejected/timeout
11/07/2007 23:32:40 S MOM rejected modify request, error: 15001
11/07/2007 23:33:43 S Job Run at request of root@frontal
11/07/2007 23:33:43 S forking in send_job
11/07/2007 23:33:43 S entering post_sendmom
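To see at a glance which nodes are refusing jobs, the failure lines in a log like this can be pulled out with a small script. This is only a sketch: the regular expression is written against the exact wording of the lines above, and the function name `failed_dispatches` is made up for illustration. It is not part of TORQUE.

```python
import re

# Sample lines copied from the log excerpt above.
LOG = """\
11/07/2007 23:32:35 S entering post_sendmom
11/07/2007 23:32:35 S child reported failure for job after 14 seconds (dest=blade2), rc=10
11/07/2007 23:32:35 S unable to run job, MOM rejected/timeout
11/07/2007 23:32:40 S MOM rejected modify request, error: 15001
"""

# Matches only the "child reported failure" line format shown above.
FAILURE = re.compile(
    r"child reported failure for job after (?P<secs>\d+) seconds "
    r"\(dest=(?P<dest>[^)]+)\), rc=(?P<rc>-?\d+)"
)


def failed_dispatches(text):
    """Return (destination, seconds, rc) for each failed dispatch line."""
    results = []
    for line in text.splitlines():
        match = FAILURE.search(line)
        if match:
            results.append(
                (match.group("dest"), int(match.group("secs")), int(match.group("rc")))
            )
    return results


if __name__ == "__main__":
    for dest, secs, rc in failed_dispatches(LOG):
        print(f"{dest}: failed after {secs}s, rc={rc}")
```

Run over the whole log, this would show whether the failures are tied to particular nodes or spread across all of them.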
But if I use
qrun -H node job_id
the job starts correctly.
Does anyone have an idea?
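Since the node addresses changed, one thing that may be worth ruling out is inconsistent name resolution between the server and the nodes. This is an assumption about a possible cause, not something the log confirms. The sketch below resolves a node name to an address and back using the standard `socket` module; the function name `check_node` is hypothetical, and the resolver functions are parameters so the logic can be tried without a network.

```python
import socket


def check_node(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name to an address and back, and report whether they agree.

    Only the short host name (the part before the first dot) is compared,
    which is an assumption about how the nodes are named.
    """
    try:
        addr = forward(name)
        back = reverse(addr)[0]
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        return (name, None, None, f"lookup failed: {exc}")

    def short(host):
        return host.split(".")[0].lower()

    status = "ok" if short(back) == short(name) else "mismatch"
    return (name, addr, back, status)
```

Calling `check_node("blade2")` on the server, and `check_node("frontal")` on a node, would show whether each side still sees the other under the expected name after the interface change.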
Observatoire de Paris Meudon
SIO batiment 15
Telephone: 01 45 07 76 26
Local time:
Wed 7 Nov 2007 23:35:33 CET