[torqueusers] Unable to contact nodes

Corey Hirschman corey at rentec.com
Tue Nov 2 14:01:30 MST 2004


Hello,

I am writing again because after upgrading to torque-1.1.0p3 I continue to have problems with jobs not running.  I have enabled the verbose logging in hopes of getting some more information as to why the server loses contact with the mom and why the job is then reported as rejected.

Here are some snippets of the logs to illustrate the problems we are having:




11/02/2004 11:06:11;0008;PBS_Server;Job;225722.monstersq;Job Queued at request of lalit at monster250, owner = lalit at monster250, job name = wdf4795_i.compi, queue = workq
11/02/2004 11:06:11;0040;PBS_Server;Svr;monstersq;Scheduler sent command new
11/02/2004 11:06:11;0004;PBS_Server;Svr;is_request;message '4' received from monster554 (172.26.148.151:1023)
11/02/2004 11:06:11;0004;PBS_Server;Svr;is_request;IS_STATUS received from monster554

11/02/2004 11:06:14;0004;PBS_Server;Svr;is_request;message '4' received from monster572 (172.26.148.226:1023)
11/02/2004 11:06:14;0004;PBS_Server;Svr;is_request;IS_STATUS received from monster572

11/02/2004 11:06:14;0004;PBS_Server;Svr;is_request;message '4' received from monster624 (172.26.148.214:1023)
11/02/2004 11:06:14;0004;PBS_Server;Svr;is_request;IS_STATUS received from monster624




Here job 225722 is submitted.  I left the next few lines in, in case they mean anything.




11/02/2004 11:06:41;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 225722.monstersq state from QUEUED to QUEUED-QUEUED (1-10)

11/02/2004 11:06:41;0008;PBS_Server;Job;225722.monstersq;Job Modified at request of root at monstersq
11/02/2004 11:06:41;0008;PBS_Server;Job;225722.monstersq;Job Run at request of root at monstersq
11/02/2004 11:06:41;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 225722.monstersq state from QUEUED to RUNNING-JOB_SUBSTATE_RUNNING (4-41)

11/02/2004 11:06:41;0004;PBS_Server;Svr;WARNING;!!! unable to contact node monster611 !!!




The job then appears to switch state to running.  It looks like the job is assigned to node monster611, but monster611 could not be contacted.




11/02/2004 11:06:43;0008;PBS_Server;Job;225722.monstersq;unable to run job, MOM rejected
11/02/2004 11:06:43;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 225722.monstersq state from RUNNING to QUEUED-QUEUED (1-10)

11/02/2004 11:06:43;0004;PBS_Server;Svr;is_request;message '4' received from monster622 (172.26.148.212:1023)
11/02/2004 11:06:43;0004;PBS_Server;Svr;is_request;IS_STATUS received from monster622

11/02/2004 11:06:44;0004;PBS_Server;Svr;is_request;message '4' received from monster572 (172.26.148.226:1023)
11/02/2004 11:06:44;0004;PBS_Server;Svr;is_request;IS_STATUS received from monster572

11/02/2004 11:06:44;0004;PBS_Server;Svr;is_request;message '4' received from monster611 (172.26.148.201:1023)
11/02/2004 11:06:44;0004;PBS_Server;Svr;is_request;IS_STATUS received from monster611



We then get the "unable to run job, MOM rejected" error and the job is set back to a queued state.  The last line is interesting because just seconds after all this happens, there is contact with monster611, and the very next job, 225723, is assigned to monster611 and runs just fine.
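
To check how often this pattern occurs across a whole server log, the sequence above can be sketched as a small script. This is only a rough sketch: it assumes every log line has the six semicolon-separated fields seen in the excerpts (timestamp, event code, server name, object type, object name, message), and the function names parse_line and contact_gaps are made up for illustration. It pairs each "unable to contact node" warning with the next IS_STATUS message from the same node and reports the gap in seconds.

```python
import re
from datetime import datetime

# Patterns taken from the message text seen in the log excerpts above.
WARN_RE = re.compile(r"unable to contact node (\S+)")
STATUS_RE = re.compile(r"IS_STATUS received from (\S+)")


def parse_line(line):
    """Split one server log line into (timestamp, message).

    Assumes the six-field, semicolon-separated layout of the excerpts;
    returns None for anything that does not fit that layout.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return when, parts[5]


def contact_gaps(lines):
    """Return [(node, seconds)] from each contact warning to the
    next IS_STATUS message received from the same node."""
    pending = {}  # node name -> time of the earliest unanswered warning
    gaps = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, message = parsed
        warned = WARN_RE.search(message)
        if warned:
            pending.setdefault(warned.group(1), when)
            continue
        status = STATUS_RE.search(message)
        if status and status.group(1) in pending:
            node = status.group(1)
            gaps.append((node, (when - pending.pop(node)).total_seconds()))
    return gaps
```

Calling contact_gaps(open(path)) on a server log file (the path is whatever your installation uses) returns one entry per warning that was later followed by a status message.  A list full of gaps of only a few seconds would suggest the nodes are not really down when the server gives up on them.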

I also checked the mom logs on monster611 and there is no indication that job 225722 was ever even attempted to run there.  There are no errors in the log and it appears that monster611 never even knew the server tried to submit a job.

In addition to these errors, we are also getting errors such as this:

11/02/2004 09:49:14;0001;PBS_Server;Svr;PBS_Server;Address already in use (98) in contact_sched, Could not contact Scheduler - port 15004


It seems the server is having problems talking to the scheduler (Maui) as well as to the moms.  These errors seem to happen during periods of high activity, such as when a bunch of jobs are being submitted or deleted.  I also noticed that the number of connections in TIME_WAIT gets very high on the machine, with well over 1000 lines like this in netstat:

tcp        0      0 172.26.146.201:875      172.26.148.136:15002    TIME_WAIT
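
To put numbers on the TIME_WAIT buildup, the netstat output can be tallied by remote port.  This is a minimal sketch, assuming each connection line has the six whitespace-separated columns shown above (protocol, receive queue, send queue, local address, foreign address, state); the function name time_wait_by_remote_port is made up for illustration.

```python
from collections import Counter


def time_wait_by_remote_port(netstat_text):
    """Count TIME_WAIT connections per remote port.

    Assumes six whitespace-separated columns per connection line, as in
    the sample line above; header lines and other states are skipped.
    """
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "TIME_WAIT":
            continue
        # The foreign address is "host:port"; keep only the port.
        port = fields[4].rsplit(":", 1)[-1]
        counts[port] += 1
    return counts
```

Feeding it the saved text of a netstat run taken during a busy period and calling .most_common() on the result shows which remote ports account for most of the lingering connections.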

Sorry to be so verbose myself, but I would really like to figure out how I can solve this problem.

