[torqueusers] Jobs Jumping From Queued to Run to Queued???

Dr. Stephan Raub raub at uni-duesseldorf.de
Tue Apr 6 02:44:30 MDT 2010


Hi,

I have sometimes seen this behavior when

-) the globally mounted /home directory of the user who owns the job is
not mounted on the compute node,

-) the user is not known on the compute node (e.g. because the LDAP service
is down or nscd has hung),

-) the prolog script failed (for whatever reason), or

-) the ssh keys are not set up correctly, so that the user cannot hop
from compute node to compute node without entering a password.

Perhaps one of these is the cause of your problem.
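
If it helps, the four points above can be checked with a small script run on
the failing node as the affected user. This is only a rough sketch, not
anything that ships with TORQUE: the function names are made up for
illustration, the prolog path has to be supplied by you (it depends on where
your TORQUE spool directory lives), and the home-directory test only looks at
whether the directory exists, which is a proxy for "mounted" rather than a
real mount check.

```python
import os
import pwd
import subprocess


def user_is_known(username):
    """True if the name service (files, LDAP, nscd, ...) resolves the user."""
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False


def home_is_usable(username):
    """True if the user's home directory exists on this node.

    If the shared /home is not mounted, the directory is normally missing.
    """
    try:
        home = pwd.getpwnam(username).pw_dir
    except KeyError:
        return False
    return os.path.isdir(home)


def prolog_is_runnable(path):
    """True if no prolog script is installed at 'path' or it is executable.

    'path' is an assumption: pass the location used by your installation.
    This does not run the script, so a prolog that fails at runtime still
    has to be found in the MOM log.
    """
    if not os.path.exists(path):
        return True
    return os.path.isfile(path) and os.access(path, os.X_OK)


def passwordless_ssh_works(host, timeout=10):
    """True if 'ssh host true' succeeds without prompting for a password."""
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",               # fail instead of asking for a password
        "-o", f"ConnectTimeout={timeout}",
        host,
        "true",
    ]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

For example, calling passwordless_ssh_works("merlion00.qgeo.com") on merlion05
as the job owner, and the other three functions with that user's name and
your prolog path, should show quickly whether any of the four points applies.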

Stephan


--
---------------------------------------------------------
| | Dr. rer. nat. Stephan Raub
| | Dipl. Chem.
| | IT-Management / ZIM
| | Heinrich-Heine-Universität Düsseldorf Universitätsstr. 1 /
| | 25.41.O2.25-2
| | 40225 Düsseldorf / Germany
---------------------------------------------------------

Important Note: This e-mail may contain trade secrets or privileged,
undisclosed or otherwise confidential information. If you have received this
e-mail in error, you are hereby notified that any review, copying or
distribution of it is strictly prohibited. Please inform us immediately and
destroy the original transmittal. Thank you for your cooperation.


> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Ben Turner
> Sent: Tuesday, April 6, 2010 01:28
> To: torqueusers at supercluster.org
> Subject: [torqueusers] Jobs Jumping From Queued to Run to Queued???
> 
> Hi,
> 
> I have a cluster with five nodes, all configured the same. The queueing
> system is working fine on four of them, but on one of them jobs get
> queued, then start to run, and then jump back to the queued state
> immediately.
> 
> I have looked through all the logs, and the only clue I can find is in the
> mom_log on the compute node that fails:
> 
> Pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in do_rpp, cannot get protocol End of File
> 
> 
> On the server there is:
> PBS_Server; Svr;WARNING;ALERT:unable to contact node merlion05.qgeo.com
> PBS_Server;Job;14754.merlion00.qgeo.com;unable to run job, MOM rejected/rc=2
> PBS_Server;Req;req_reject;Reject reply code 15041(Execution server rejected request MSG=cannot send job to mom, state=PRERUN), aux=0, type-RunJob from Scheduler at merlion00.qgeo.com
> 
> Running pbsnodes -a reports that the dodgy node merlion05 is OK.
> 
> Does anybody have any insight into this problem? I have been bashing my
> head against a wall and have no idea where to go.
> 
> Cheers
> Ben
> 
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers



