From soubari at yahoo.com Thu Aug 2 08:40:48 2012
From: soubari at yahoo.com (sam oubari)
Date: Thu, 2 Aug 2012 07:40:48 -0700 (PDT)
Subject: [torqueusers] Help! Old Puzzle...
In-Reply-To: <1316063003.97212.YahooMailNeo@web110604.mail.gq1.yahoo.com>
References: <1316055881.47486.YahooMailNeo@web110609.mail.gq1.yahoo.com>
 <6254e42d-d6c3-4c96-a72f-4de7f618ecd0@mail>
 <1316063003.97212.YahooMailNeo@web110604.mail.gq1.yahoo.com>
Message-ID: <1343918448.56708.YahooMailNeo@web110605.mail.gq1.yahoo.com>

Hello All,

I am still having a puzzle where a job does not start when its time
arrives. It only impacts a repeating job on one queue that re-qsubs
itself at the end of each run at 10 or 30 minute intervals. About a
couple of times a week, it will get stuck at Q. It always happens during
work hours, mostly before 3pm, and many times around the supposedly slow
lunch hour. In the server_logs, there is an odd entry a minute or two
before the scheduled start:

07/09/2012 10:47:30;0008;PBS_Server;Job;6035.naboo.linnbenton.edu;Job Modified at request of rpt_prod at naboo.linnbenton.edu

qstat shows Hold_Types changing from n to o. When it happens, we simply
issue qrun on the stuck job. We average about 1000 qsubs per day, mostly
using two queues (most are small jobs, 1 minute or less). Restarting
TORQUE weekly did not help. We have a busy but very simple TORQUE 2.5.6
environment (no external nodes/users, all local in a VM host under
Oracle VM 2.2.2):

# uname -a
Linux naboo.linnbenton.edu 2.6.18-274.7.1.0.1.el5 #1 SMP Thu Oct 20 22:20:30 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

# qstat -q

server: naboo.linnbenton.edu

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
sys_ban            --      --       --      --    1  17  1   E R
sys_srv            --      --       --      --    8   8 10   E R
sys_tst            --      --       --      --    0   4  1   E R
sys_ban_quick      --      --       --      --    0   0  1   E R
                                               ----- -----
                                                   9    29

# qmgr -c "list que sys_ban"
Queue sys_ban
        queue_type = Execution
        max_queuable = 300
        total_jobs = 19
        state_count = Transit:0 Queued:0 Held:0 Waiting:18 Running:0 Exiting:0
        max_running = 1
        resources_default.nodes = 1
        resources_default.walltime = 168:00:00
        mtime = Sat Jul 28 01:36:45 2012
        resources_assigned.nodect = 0
        enabled = True
        started = True

# ps -ef|grep pbs
root      8860     1  0 Jul27 ?        00:03:32 /usr/local/sbin/pbs_mom
root      8865     1  0 Jul27 ?        00:00:44 /usr/local/sbin/pbs_server
root      8867     1  0 Jul27 ?        00:00:15 /usr/local/sbin/pbs_sched

During installs, I issue:

./configure --enable-docs --disable-dependency-tracking --disable-libtool-lock --with-scp  # USED SINCE 2.4.5

We've upgraded several times and I am running out of ideas, so if you
have a similar environment that works, I would love to see your
settings. For example, what options did you 'configure' with?

It was suggested to use gdb on MOM, but I have not installed gdb yet.

Thank you,
Sam.

From lloyd_brown at byu.edu Thu Aug 2 14:07:47 2012
From: lloyd_brown at byu.edu (Lloyd Brown)
Date: Thu, 02 Aug 2012 14:07:47 -0600
Subject: [torqueusers] Torque 2.5.x->4.x upgrade finding already-running jobs?
In-Reply-To: <5012B3BE.6070305@byu.edu>
References: <50083FA0.6060408@byu.edu> <5011F3C3.6050809@unimelb.edu.au>
 <5012B24A.8000407@byu.edu> <5012B3BE.6070305@byu.edu>
Message-ID: <501ADE13.60807@byu.edu>

Just so the community knows what happened here:

I have, thus far, been unable to get the v4.1.x pbs_mom's to reacquire
the already-running jobs.
I've also been unable to trace the exact problem through the code
myself. I also asked Adaptive Computing, via a support ticket, whether
this would be possible and what the likely reason for the problem might
be. This was their answer, in part:

> It would be nice if this worked, but we're not supporting the
> migration of running jobs between major upgrades.

So, until I can figure out how to follow the Torque code (which may take
a LONG time), I believe the answer is that this type of migration
(2.5.x->4.x) will require a full drain of the cluster. I had hoped to
avoid that, but right now, I just don't have the time to fight this
anymore.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 07/27/2012 09:29 AM, Lloyd Brown wrote:
> Well, so far, I have the pbs_server successfully upgrading and finding
> the jobs, and the pbs_mom's upgrading, but for some reason the pbs_mom's
> are ignoring the running processes, and deleting the
> TORQUEHOME/mom_priv/jobs/*.{JB,SC,TK} files. They don't delete the jobs
> on the server, or kill the running PIDs, though, so it's a little confusing.
>
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
>
> On 07/27/2012 09:26 AM, David Beer wrote:
>> Lloyd,
>>
>> If you don't use cpusets then the change to hwloc won't affect you at
>> all. You have to configure cpusets on in order to use them. At any rate
>> I'm interested to hear about the results of your test.
>>
>> David
>>
>> On Fri, Jul 27, 2012 at 9:22 AM, Lloyd Brown wrote:
>>
>>     Interesting. I didn't explicitly enable either one. I don't know
>>     about the hwloc stuff in 4.x, but don't I have to explicitly compile
>>     in the cpusets code in 2.x? I don't think I did, in this case. Would
>>     that have an effect?
>>
>>     In any case, this is why I'm testing this on a staging cluster, not on
>>     our production system. If I can pull it off, then we can upgrade
>>     without needing to drain the whole cluster. If not, then we're no
>>     worse off than if I hadn't tried.
>>
>>     Lloyd Brown
>>     Systems Administrator
>>     Fulton Supercomputing Lab
>>     Brigham Young University
>>     http://marylou.byu.edu
>>
>>     On 07/26/2012 07:49 PM, Christopher Samuel wrote:
>>     > To be honest I wouldn't even try that, 4.x uses hwloc rather than
>>     > its own cpusets code so I don't believe it's worth the risk.
>>
>> --
>> David Beer | Software Engineer
>> Adaptive Computing
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers

From knielson at adaptivecomputing.com Fri Aug 3 11:23:49 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 3 Aug 2012 11:23:49 -0600
Subject: [torqueusers] Job not running due to features when node name specified
In-Reply-To:
References:
Message-ID:

On Mon, Jul 30, 2012 at 10:23 AM, Andrus, Brian Contractor wrote:
>
> I am a bit confused as to how to troubleshoot and understand why this is.
>
> Running torque 2.5.12 and moab 6.1.6
> I submit a job with a specific node:
> qsub -l nodes=compute-3-1
> It queues up fine, but never runs.
> qshow shows it as eligible and queued.
> When I run checkjob -v on it, it shows:
> compute-3-1 rejected: Features
>
> ??? Um ok...
> qstat shows:
> Resource_List.nodect = 1
> Resource_List.nodes = compute-3-1
> Resource_List.pmem = 1gb
>
> But I can force it with 'qrun 839' and it does run on the node requested.
>
> One thing I would REALLY like to know is how to determine the specific
> features a job is being rejected for on a particular node.
> And why would a job be rejected due to features when it clearly can and
> should run?
>
> FWIW, compute-3-1 has no other jobs on it at the time.
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey, California
> voice: 831-656-6238

Brian,

If compute-3-1 is the name of the host then this is strange indeed. We
run jobs with -l nodes= all the time. Do you have some Moab logs and
TORQUE server logs from when this happens?

Ken

From knielson at adaptivecomputing.com Fri Aug 3 11:45:57 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Fri, 3 Aug 2012 11:45:57 -0600
Subject: [torqueusers] Hostname problems on compute node with pbs_mom?
In-Reply-To: <5013FF49.1080407@att.net>
References: <5013FF49.1080407@att.net>
Message-ID:

On Sat, Jul 28, 2012 at 9:03 AM, Jeff Layton wrote:
> Good morning,
>
> I'm running Torque 4.0.2 on a cluster and have installed
> torque and torque-client on the compute node. When I
> try to start pbs_mom, I get the following error in the
> mom_logs:
>
> 07/28/2012 11:24:20;0002; pbs_mom;Svr;Log;Log opened
> 07/28/2012 11:24:20;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0
> 07/28/2012 11:24:20;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::pbs_mom, Unable to get my full hostname for n0001 error -1
>
> The name of the compute node is n0001 (no domain
> or anything).
>
> TIA!
>
> Jeff

Jeff,

There is a problem in the code for reporting why the hostname could not
be resolved. Please create a bugzilla ticket and I will attach a patch
for you to try. It will give log information about why the full hostname
could not be found.

Ken

From laytonjb at att.net Fri Aug 3 12:35:31 2012
From: laytonjb at att.net (Jeff Layton)
Date: Fri, 03 Aug 2012 14:35:31 -0400
Subject: [torqueusers] Hostname problems on compute node with pbs_mom?
In-Reply-To:
References: <5013FF49.1080407@att.net> <501765C1.3040002@unimelb.edu.au>
Message-ID: <501C19F3.5050102@att.net>

David,

I found the error - the local host wasn't in /etc/hosts. I've fixed the
problem.

Thanks!

Jeff

> What this error message almost certainly means is that the call to
> either getnameinfo() or getaddrinfo() failed. Is there any reason to
> believe that is possible?
>
> David
>
> On Mon, Jul 30, 2012 at 10:57 PM, Christopher Samuel wrote:
>
>> On 29/07/12 01:03, Jeff Layton wrote:
>>
>>> 07/28/2012 11:24:20;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::pbs_mom,
>>> Unable to get my full hostname for n0001 error -1
>>
>> There's a bunch of reasons that can fail looking at the code, but they
>> only get reported if the optional argument EMsg is passed through to
>> get_fullhostname() which isn't the case for the call site that leads
>> to that error.. :-(
>>
>> You might need to instrument the code to find out where it's failing..
>>
>> I'd also suggest reporting a bug about it not logging those errors by
>> default. http://www.clusterresources.com/bugzilla/
>>
>> cheers,
>> Chris
>> --
>> Christopher Samuel Senior Systems Administrator
>> VLSCI - Victorian Life Sciences Computation Initiative
>> Email: samuel at unimelb.edu.au
>> Phone: +61 (0)3 903 55545
>> http://www.vlsci.org.au/ http://twitter.com/vlsci
>
> --
> David Beer | Software Engineer
> Adaptive Computing

From laytonjb at att.net Sun Aug 5 13:59:13 2012
From: laytonjb at att.net (Jeff Layton)
Date: Sun, 05 Aug 2012 15:59:13 -0400
Subject: [torqueusers] Job not running
Message-ID: <501ED091.8080208@att.net>

Good afternoon,

I apologize for the eternal question, "why isn't my job running", but
I'm not sure where to look next. I'm running Torque 4.0.2 that I built
on a Scientific Linux 6.2 box.

The job script is,

#!/bin/bash
#PBS -q batch
#PBS -l walltime=00:10:00
#PBS -l nodes=1:ppn=1

date
hostname
sleep 20
date

I submit using qsub and then "qstat -a" looks like,

[laytonjb at test1 TEST]$ qstat -a

test1:
                                                                               Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
11.test1             laytonjb    batch    pbs_test2            --     1      1     -- 00:10 Q    --

It stays like this forever. I looked in the logs and didn't see
anything obvious. Here is some output that may help.

Server logs:

08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into batch, state 1 hop 1
08/05/2012 16:15:35;0008;PBS_Server;Job;11.test1;Job Queued at request of laytonjb at test1, owner = laytonjb at test1, job name = pbs_test2, queue = batch

Scheduler logs (FIFO scheduler):

08/05/2012 15:44:44;0002; pbs_sched;Svr;die;caught signal 15
08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log closed
08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log opened
08/05/2012 15:44:44;0002; pbs_sched;Svr;TokenAct;Account file /opt/torque/sched_priv/accounting/20120805 opened
08/05/2012 15:44:44;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 4782

pbs_mom logs (I tried restarting the mom ("service pbs_mom restart")
and the output is below):

08/05/2012 16:17:28;0002; pbs_mom;n/a;rm_request;shutdown
08/05/2012 16:17:28;0002; pbs_mom;n/a;dep_cleanup;dependent cleanup
08/05/2012 16:17:28;0002; pbs_mom;Svr;Log;Log closed
08/05/2012 16:17:31;0002; pbs_mom;Svr;Log;Log opened
08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0
08/05/2012 16:17:31;0002; pbs_mom;Svr;setpbsserver;test1
08/05/2012 16:17:31;0002; pbs_mom;Svr;mom_server_add;server test1 added
08/05/2012 16:17:31;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in check_partition_confirm_script, Couldn't stat the partition confirm command '/opt/moab/default/tools/xt4/partition.create.xt4.pl' - ignore this if you aren't running a cray
08/05/2012 16:17:31;0002; pbs_mom;n/a;initialize;independent
08/05/2012 16:17:31;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Is up
08/05/2012 16:17:31;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1344179259
08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2, loglevel = 0

pbsnodes -a:

[root at test1 mom_logs]# pbsnodes -a
n0001
     state = free
     np = 1
     ntype = cluster
     status = rectime=1344197869,varattr=,jobs=,state=free,netload=120587595,gres=,loadave=0.02,ncpus=3,physmem=2956668kb,availmem=2836956kb,totmem=2956668kb,idletime=4196,nusers=1,nsessions=1,sessions=1560,uname=Linux n0001 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

qmgr -c "p s":

[root at test1 mom_logs]# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = test1
set server managers = laytonjb at test1
set server operators = laytonjb at test1
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server next_job_number = 12
set server moab_array_compatible = True

Not sure where to start looking from here.

TIA!

Jeff
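
The server, scheduler and mom log lines quoted in this thread are semicolon-delimited records of the form date time;event;daemon;class;object;message. As a minimal sketch, assuming that layout holds for the whole file, one job's entries can be pulled out with awk. The helper name job_log_entries and the log path in the usage comment are assumptions for illustration, not part of TORQUE; the stock tracejob command does a more complete version of this across all log directories.

```shell
#!/bin/sh
# job_log_entries JOBID  (hypothetical helper, not a TORQUE command)
# Reads TORQUE-style log records on stdin and prints
# "<timestamp> <message>" for every record whose object field
# (the fifth semicolon-separated field) equals JOBID.
job_log_entries() {
    awk -F';' -v job="$1" '
        $5 == job {
            # The message is everything after the fifth semicolon; it may
            # contain semicolons itself, so rebuild it from fields 6..NF.
            msg = $6
            for (i = 7; i <= NF; i++) msg = msg ";" $i
            print $1 " " msg
        }'
}

# Usage (the path is an assumption; adjust it to your TORQUE home):
#   job_log_entries 11.test1 < /opt/torque/server_logs/20120805
```

If this prints only the "enqueuing" and "Job Queued" records for a job, as in the server log excerpt in this thread, the server accepted the job but nothing ever asked it to run.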

From laytonjb at att.net Sun Aug 5 17:27:02 2012
From: laytonjb at att.net (Jeff Layton)
Date: Sun, 05 Aug 2012 19:27:02 -0400
Subject: [torqueusers] Job not running
In-Reply-To: <501ED091.8080208@att.net>
References: <501ED091.8080208@att.net>
Message-ID: <501F0146.3030406@att.net>

Just an FYI - the job would run once I used qrun. Does this point
to the scheduler? (I'm just using the default scheduler that comes
with Torque, i.e. not Maui.)

Thanks!

Jeff
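
A job that only starts under qrun is one that sits in state Q until someone intervenes, so it can help to list which jobs qstat -a currently reports as queued. A minimal sketch, assuming the 11-column qstat -a layout shown in this thread, where the state letter is the tenth whitespace-separated field; queued_jobs is a hypothetical helper name, not a TORQUE command.

```shell
#!/bin/sh
# queued_jobs  (hypothetical helper, not a TORQUE command)
# Reads `qstat -a` output on stdin and prints the ID of every job
# whose state column is Q. Header and separator lines are skipped
# because they do not have "Q" as their tenth field.
queued_jobs() {
    awk 'NF == 11 && $10 == "Q" { print $1 }'
}

# Usage:
#   qstat -a | queued_jobs
```

The sketch only reports job IDs. Whether to qrun them by hand is left to the operator, since that bypasses whatever the scheduler would have decided.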

From andre.gemuend at scai.fraunhofer.de Mon Aug 6 00:27:43 2012
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Mon, 6 Aug 2012 08:27:43 +0200 (CEST)
Subject: [torqueusers] Job not running
In-Reply-To: <501F0146.3030406@att.net>
Message-ID: <958750731.4011954.1344234463794.JavaMail.root@scai.fraunhofer.de>

Hi Jeff,

Yes, the fact that it runs with qrun points at the scheduler.
Did the pbs_sched daemon perhaps die? Does it work if you submit with
default resources (specifically without ppn)? Otherwise, could you paste
us a tracejob and qstat -f, and raise the log_level a bit? I don't see a
problem in your logs.

Greetings
Andre

--
André Gemünd
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend at scai.fraunhofer.de
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend

From laytonjb at att.net Mon Aug 6 07:24:12 2012
From: laytonjb at att.net (Jeff Layton)
Date: Mon, 06 Aug 2012 09:24:12 -0400
Subject: [torqueusers] Job not running
In-Reply-To: <958750731.4011954.1344234463794.JavaMail.root@scai.fraunhofer.de>
References: <958750731.4011954.1344234463794.JavaMail.root@scai.fraunhofer.de>
Message-ID: <501FC57C.4090706@att.net>

André,

Thanks! I tried running the job again this morning without
the "np=1" and it still hung. I'm attaching the tracejob output as well
as the scheduler logs. Note the mom logs are actually named by the
compute node name. For example, the first compute node is called n0001
so the mom logs are called n0001.log. I do this since all of the mom
logs are in one NFS directory (/opt/torque/mom_logs).

I don't know how to raise the log level - can you help with that?

Thanks!
Jeff Tracejob output: [root at test1 bin]# qstat -a test1: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 13.test1 laytonjb batch pbs_test2 -- 1 1 -- 00:10 Q -- [root at test1 bin]# ./tracejob 13.test1 /opt/torque/mom_logs/20120806: No such file or directory /opt/torque/sched_logs/20120806: No matching job records located Job: 13.test1 08/06/2012 09:32:53 S enqueuing into batch, state 1 hop 1 08/06/2012 09:32:53 S Job Queued at request of laytonjb at test1, owner = laytonjb at test1, job name = pbs_test2, queue = batch 08/06/2012 09:32:53 A queue=batch Scheduler Logs: [root at test1 sched_logs]# more 20120806 08/06/2012 09:25:54;0002; pbs_sched;Svr;Log;Log opened 08/06/2012 09:25:54;0002; pbs_sched;Svr;TokenAct;Account file /opt/torque/sched_priv/accounting/20120806 opened 08/06/2012 09:25:54;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 2090 > Hi Jeff, > > yes, that it runs with qrun points at the scheduler. > Did the pbs_sched daemon perhaps die? Does it work if you submit with default resources (specifically without ppn)? Otherwise, could you paste us a tracejob and qstat -f. And raise the log_level a bit. I don't see a problem in your logs. > > Greetings > Andre > > ----- Urspr?ngliche Mail ----- >> Just an FYI - the job would run once I used qrun. Does this point >> to the scheduler? (I'm just using the default scheduler that comes >> with Torque (i.e. not Maui). >> >> Thanks! >> >> Jeff >> >> >> >> Good afternoon, >> >> I apologize for the eternal question, "why isn't my job running" >> but I'm not sure where to look next. I'm running Torque 4.0.2 >> that I built on a Scientific Linux 6.2 box. 
>> >> The job script is, >> >> #!/bin/bash >> #PBS -q batch >> #PBS -l walltime=00:10:00 >> #PBS -l nodes=1:ppn=1 >> >> date >> hostname >> sleep 20 >> date >> >> >> I submit using qsub and then "qstat -a" looks like, >> >> [laytonjb at test1 TEST]$ qstat -a >> >> test1: >> Req'd Req'd Elap >> Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time >> -------------------- ----------- -------- ---------------- ------ >> ----- ------ ------ ----- - ----- >> 11.test1 laytonjb batch pbs_test2 -- 1 1 -- 00:10 Q -- >> >> >> It stays like this forever. I looked in the logs and didn't see any >> anything obvious. Here is some output that may help. >> >> >> Server logs: >> >> 08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into >> batch, state 1 hop 1 >> 08/05/2012 16:15:35;0008;PBS_Server;Job;11.test1;Job Queued at >> request of laytonjb at test1, owner = laytonjb at test1, job name = >> pbs_test2, queue = batch >> >> >> Scheduler logs: (FIFO scheduler): >> >> 08/05/2012 15:44:44;0002; pbs_sched;Svr;die;caught signal 15 >> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log closed >> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log opened >> 08/05/2012 15:44:44;0002; pbs_sched;Svr;TokenAct;Account file >> /opt/torque/sched_priv/accounting/20120805 opened >> 08/05/2012 15:44:44;0002; >> pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 4782 >> >> >> pbs_mom logs: (I tried restarting the mom ("service pbs_mom restart") >> and the output is below) >> >> 08/05/2012 16:17:28;0002; pbs_mom;n/a;rm_request;shutdown >> 08/05/2012 16:17:28;0002; pbs_mom;n/a;dep_cleanup;dependent cleanup >> 08/05/2012 16:17:28;0002; pbs_mom;Svr;Log;Log closed >> 08/05/2012 16:17:31;0002; pbs_mom;Svr;Log;Log opened >> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = >> 4.0.2, loglevel = 0 >> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setpbsserver;test1 >> 08/05/2012 16:17:31;0002; pbs_mom;Svr;mom_server_add;server test1 >> added >> 08/05/2012 16:17:31;0001; 
pbs_mom;Svr;pbs_mom;LOG_ERROR::No such file >> or directory (2) in check_partition_confirm_script, Couldn't stat >> the partition confirm command '/opt/moab/default/tools/xt4/ >> partition.create.xt4.pl ' - ignore this if you aren't running a cray >> 08/05/2012 16:17:31;0002; pbs_mom;n/a;initialize;independent >> 08/05/2012 16:17:31;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs >> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Is up >> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setup_program_environment;MOM >> executable path and mtime at launch: /usr/sbin/pbs_mom 1344179259 >> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = >> 4.0.2, loglevel = 0 >> >> >> pbsnodes -a: >> >> [root at test1 mom_logs]# pbsnodes -a >> n0001 >> state = free >> np = 1 >> ntype = cluster >> status = >> rectime=1344197869,varattr=,jobs=,state=free,netload=120587595,gres=,loadave=0.02,ncpus=3,physmem=2956668kb,availmem=2836956kb,totmem=2956668kb,idletime=4196,nusers=1,nsessions=1,sessions=1560,uname=Linux >> n0001 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011 >> x86_64,opsys=linux >> mom_service_port = 15002 >> mom_manager_port = 15003 >> gpus = 0 >> >> >> >> qmgr -c "p s": >> [root at test1 mom_logs]# qmgr -c "p s" >> # >> # Create queues and set their attributes. >> # >> # >> # Create and define queue batch >> # >> create queue batch >> set queue batch queue_type = Execution >> set queue batch resources_default.nodes = 1 >> set queue batch resources_default.walltime = 01:00:00 >> set queue batch enabled = True >> set queue batch started = True >> # >> # Set server attributes. 
>> #
>> set server scheduling = True
>> set server acl_hosts = test1
>> set server managers = laytonjb at test1
>> set server operators = laytonjb at test1
>> set server default_queue = batch
>> set server log_events = 511
>> set server mail_from = adm
>> set server scheduler_iteration = 600
>> set server node_check_rate = 150
>> set server tcp_timeout = 300
>> set server job_stat_rate = 45
>> set server poll_jobs = True
>> set server mom_job_sync = True
>> set server next_job_number = 12
>> set server moab_array_compatible = True
>>
>>
>> Not sure where to start looking from here.
>>
>> TIA!
>>
>> Jeff
>>

From gus at ldeo.columbia.edu Mon Aug 6 09:58:25 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Mon, 06 Aug 2012 11:58:25 -0400
Subject: [torqueusers] Job not running
In-Reply-To: <501FC57C.4090706@att.net>
References: <958750731.4011954.1344234463794.JavaMail.root@scai.fraunhofer.de> <501FC57C.4090706@att.net>
Message-ID: <501FE9A1.8050501@ldeo.columbia.edu>

Hi Jeff

Are you running pbs_server, pbs_mom, pbs_sched as yourself or root?

Would this work?

qmgr -c 'set server managers += root@@test1'
qmgr -c 'set server operators += root@@test1'

From the Torque Admin Guide:

http://www.adaptivecomputing.com/resources/docs/torque/4-0-1/help.htm#topics/1-installConfig/installing.htm

"Adaptive Computing recommends that the TORQUE administrator be root.
For information about customizing the build at configure time, see
Customizing the install."

"TORQUE must be installed by a root user. If running sudo fails, switch
to root with su-."

I am not sure if root must reign on Torque, but in the user guide
there are several other references to commands that need
to be run by root, processes that need to be owned by root, etc.
A list subscribing expert could shed some light here.

I hope it helps,
Gus Correa

On 08/06/2012 09:24 AM, Jeff Layton wrote:
> André,
>
> Thanks! I tried running the job again this morning without
> the "np=1" and it still hung.
I'm attaching the tracejob output > as well as the scheduler logs. Note the mom logs are actually > named by the compute node name. For example, the first > compute node is called n0001 so the mom logs are called > n0001.log. I do this since all of the mom logs are in one > NFS directory (/opt/torque/mom_logs). > > I don't know how to raise the log level - can you help with > that? > > Thanks! > > Jeff > > > Tracejob output: > [root at test1 bin]# qstat -a > > test1: > Req'd Req'd Elap > Job ID Username Queue Jobname SessID NDS > TSK Memory Time S Time > -------------------- ----------- -------- ---------------- ------ ----- > ------ ------ ----- - ----- > 13.test1 laytonjb batch pbs_test2 -- > 1 1 -- 00:10 Q -- > [root at test1 bin]# ./tracejob 13.test1 > /opt/torque/mom_logs/20120806: No such file or directory > /opt/torque/sched_logs/20120806: No matching job records located > > Job: 13.test1 > > 08/06/2012 09:32:53 S enqueuing into batch, state 1 hop 1 > 08/06/2012 09:32:53 S Job Queued at request of laytonjb at test1, owner > = laytonjb at test1, job name = pbs_test2, queue = batch > 08/06/2012 09:32:53 A queue=batch > > > > > Scheduler Logs: > [root at test1 sched_logs]# more 20120806 > 08/06/2012 09:25:54;0002; pbs_sched;Svr;Log;Log opened > 08/06/2012 09:25:54;0002; pbs_sched;Svr;TokenAct;Account file > /opt/torque/sched_priv/accounting/20120806 opened > 08/06/2012 09:25:54;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched > startup pid 2090 > > > >> Hi Jeff, >> >> yes, that it runs with qrun points at the scheduler. >> Did the pbs_sched daemon perhaps die? Does it work if you submit with default resources (specifically without ppn)? Otherwise, could you paste us a tracejob and qstat -f. And raise the log_level a bit. I don't see a problem in your logs. >> >> Greetings >> Andre >> >> ----- Urspr?ngliche Mail ----- >>> Just an FYI - the job would run once I used qrun. Does this point >>> to the scheduler? 
(I'm just using the default scheduler that comes >>> with Torque (i.e. not Maui). >>> >>> Thanks! >>> >>> Jeff >>> >>> >>> >>> Good afternoon, >>> >>> I apologize for the eternal question, "why isn't my job running" >>> but I'm not sure where to look next. I'm running Torque 4.0.2 >>> that I built on a Scientific Linux 6.2 box. >>> >>> The job script is, >>> >>> #!/bin/bash >>> #PBS -q batch >>> #PBS -l walltime=00:10:00 >>> #PBS -l nodes=1:ppn=1 >>> >>> date >>> hostname >>> sleep 20 >>> date >>> >>> >>> I submit using qsub and then "qstat -a" looks like, >>> >>> [laytonjb at test1 TEST]$ qstat -a >>> >>> test1: >>> Req'd Req'd Elap >>> Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time >>> -------------------- ----------- -------- ---------------- ------ >>> ----- ------ ------ ----- - ----- >>> 11.test1 laytonjb batch pbs_test2 -- 1 1 -- 00:10 Q -- >>> >>> >>> It stays like this forever. I looked in the logs and didn't see any >>> anything obvious. Here is some output that may help. 
>>>
>>> Jeff
>>>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From laytonjb at att.net Mon Aug 6 11:32:55 2012
From: laytonjb at att.net (Jeff Layton)
Date: Mon, 06 Aug 2012 13:32:55 -0400
Subject: [torqueusers] Job not running
In-Reply-To: <501FE9A1.8050501@ldeo.columbia.edu>
References: <958750731.4011954.1344234463794.JavaMail.root@scai.fraunhofer.de> <501FC57C.4090706@att.net> <501FE9A1.8050501@ldeo.columbia.edu>
Message-ID: <501FFFC7.7070509@att.net>

Gus.

Thanks for the email! Everything is run by root and was installed
by root. I tried your suggestions below to add root to the server
manager and operators but that didn't change anything. The jobs
still hang and I can't find out why.

I'm still trying some things but no joy so far. I think the problem is
in the scheduler but I can't seem to locate the problem. It's the
simple FIFO scheduler that is part of Torque so I don't see any
reason why it's holding jobs. The only thing I can think of is that
it doesn't think there are any resources available but I can't
find a reason why.

Thanks!

Jeff

> Hi Jeff
>
> Are you running pbs_server, pbs_mom, pbs_sched as yourself or root?
>
> Would this work?
>
> qmgr -c 'set server managers += root@@test1'
> qmgr -c 'set server operators += root@@test1'
>
> From the Torque Admin Guide:
>
> http://www.adaptivecomputing.com/resources/docs/torque/4-0-1/help.htm#topics/1-installConfig/installing.htm
>
> "Adaptive Computing recommends that the TORQUE administrator be root.
> For information about customizing the build at configure time, see
> Customizing the install."
>
> "TORQUE must be installed by a root user. If running sudo fails, switch
> to root with su-."
>
> I am not sure if root must reign on Torque, but in the user guide
> there are several other references to commands that need
> to be run by root, processes that need to be owned by root, etc.
> A list subscribing expert could shed some light here. > > I hope it helps, > Gus Correa > > > On 08/06/2012 09:24 AM, Jeff Layton wrote: >> Andr?, >> >> Thanks! I tried running the job again this morning without >> the "np=1" and it still hung. I'm attaching the tracejob output >> as well as the scheduler logs. Note the mom logs are actually >> named by the compute node name. For example, the first >> compute node is called n0001 so the mom logs are called >> n0001.log. I do this since all of the mom logs are in one >> NFS directory (/opt/torque/mom_logs). >> >> I don't know how to raise the log level - can you help with >> that? >> >> Thanks! >> >> Jeff >> >> >> Tracejob output: >> [root at test1 bin]# qstat -a >> >> test1: >> Req'd Req'd Elap >> Job ID Username Queue Jobname SessID NDS >> TSK Memory Time S Time >> -------------------- ----------- -------- ---------------- ------ ----- >> ------ ------ ----- - ----- >> 13.test1 laytonjb batch pbs_test2 -- >> 1 1 -- 00:10 Q -- >> [root at test1 bin]# ./tracejob 13.test1 >> /opt/torque/mom_logs/20120806: No such file or directory >> /opt/torque/sched_logs/20120806: No matching job records located >> >> Job: 13.test1 >> >> 08/06/2012 09:32:53 S enqueuing into batch, state 1 hop 1 >> 08/06/2012 09:32:53 S Job Queued at request of laytonjb at test1, owner >> = laytonjb at test1, job name = pbs_test2, queue = batch >> 08/06/2012 09:32:53 A queue=batch >> >> >> >> >> Scheduler Logs: >> [root at test1 sched_logs]# more 20120806 >> 08/06/2012 09:25:54;0002; pbs_sched;Svr;Log;Log opened >> 08/06/2012 09:25:54;0002; pbs_sched;Svr;TokenAct;Account file >> /opt/torque/sched_priv/accounting/20120806 opened >> 08/06/2012 09:25:54;0002; pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched >> startup pid 2090 >> >> >> >>> Hi Jeff, >>> >>> yes, that it runs with qrun points at the scheduler. >>> Did the pbs_sched daemon perhaps die? Does it work if you submit with default resources (specifically without ppn)? 
Otherwise, could you paste us a tracejob and qstat -f. And raise the log_level a bit. I don't see a problem in your logs. >>> >>> Greetings >>> Andre >>> >>> ----- Urspr?ngliche Mail ----- >>>> Just an FYI - the job would run once I used qrun. Does this point >>>> to the scheduler? (I'm just using the default scheduler that comes >>>> with Torque (i.e. not Maui). >>>> >>>> Thanks! >>>> >>>> Jeff >>>> >>>> >>>> >>>> Good afternoon, >>>> >>>> I apologize for the eternal question, "why isn't my job running" >>>> but I'm not sure where to look next. I'm running Torque 4.0.2 >>>> that I built on a Scientific Linux 6.2 box. >>>> >>>> The job script is, >>>> >>>> #!/bin/bash >>>> #PBS -q batch >>>> #PBS -l walltime=00:10:00 >>>> #PBS -l nodes=1:ppn=1 >>>> >>>> date >>>> hostname >>>> sleep 20 >>>> date >>>> >>>> >>>> I submit using qsub and then "qstat -a" looks like, >>>> >>>> [laytonjb at test1 TEST]$ qstat -a >>>> >>>> test1: >>>> Req'd Req'd Elap >>>> Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time >>>> -------------------- ----------- -------- ---------------- ------ >>>> ----- ------ ------ ----- - ----- >>>> 11.test1 laytonjb batch pbs_test2 -- 1 1 -- 00:10 Q -- >>>> >>>> >>>> It stays like this forever. I looked in the logs and didn't see any >>>> anything obvious. Here is some output that may help. 
>>>> >>>> Jeff >>>> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From rroman at cenat.ac.cr Mon Aug 6 17:00:48 2012 From: rroman at cenat.ac.cr (=?ISO-8859-1?Q?Ricardo_Rom=E1n_Brenes?=) Date: Mon, 6 Aug 2012 17:00:48 -0600 Subject: [torqueusers] a bit o' help with torque paths and such Message-ID: Hello guys I got a problem here, ill be as direct as i can be. I got 4 nodes, on node 0 there's torque/maui server and a mom, and on the 1-3 theres just the torque mom. my problem is that when i do "exports" as a user using a ssh login i get paths taht i dont get when i launch either an interactive job or a batch job using qsub: As a regular ssh login with a non-root user: > [rroman at zarate-0 ~]$ export > > > declare -x BASHERRECE="bashrc" > > > declare -x BASHLOCAL=".bashrc" > > > declare -x G_BROKEN_FILENAMES="1" > > > declare -x HISTSIZE="1000" > > > declare -x HOME="/home/rroman" > > > declare -x HOSTNAME="zarate-0.cnca" > > > declare -x KDEDIRS="/usr" > > > declare -x KDE_IS_PRELINKED="1" > > > declare -x KMIX_PULSEAUDIO_DISABLE="1" > > > declare -x LANG="en_US.UTF-8" > > > declare -x LD_LIBRARY_PATH="/usr/local/lib/:" > > > declare -x LESSOPEN="|/usr/bin/lesspipe.sh %s" > > > declare -x LOGNAME="rroman" > > > declare -x > LS_COLORS="rs=0:di=01;34:ln=01;36:hl=44;37:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;4 > > 2:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz > > 
=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*. > > rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01 > > ;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:* > > .nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;3 > > 5:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra > =01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:" > > > declare -x MAIL="/var/spool/mail/rroman" > > > declare -x OLDPWD > > > declare -x > PATH="/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/maui/bin:/usr/local/maui/sbin > :/usr/local/sbin:/usr/local/bin:/home/rroman/bin" > > > declare -x PERFIL="profile" > > > declare -x PWD="/home/rroman" > > > declare -x QTDIR="/usr/lib/qt-3.3" > > > declare -x QTINC="/usr/lib/qt-3.3/include" > > > declare -x QTLIB="/usr/lib/qt-3.3/lib" > > > declare -x SHELL="/bin/bash" > > > declare -x SHLVL="1" > > > declare -x SSH_CLIENT="192.168.1.50 64228 22" > > > declare -x SSH_CONNECTION="192.168.1.50 64228 192.168.1.200 22" > > > declare -x SSH_TTY="/dev/pts/1" > > > declare -x TERM="cygwin" > > > declare -x TZ="America/Costa_Rica" > > > declare -x USER="rroman" > > Please note the BASHERRECE, BASHLOCAL and PERFIL variables. 
these are defined in /etc/bashrc, /home/$USER/.bashrc and /etc/profile, also check the PATH, it has /usr/local/bin now with a interactive job (or batch, the results are the same) using qsub: > [rroman at zarate-0 jobs]$ qsub -q batch ./a.sh > > > 25.zarate-0.cnca > > > [rroman at zarate-0 jobs]$ cat a.sh.o > > > a.sh.o12 a.sh.o14 a.sh.o15 a.sh.o21 a.sh.o25 > > > [rroman at zarate-0 jobs]$ cat a.sh.o25 > > > uid=500(rroman) gid=503(cluster) groups=503(cluster) > > > Mon Aug 6 16:49:44 CST 2012 > > > zarate-3.cnca > > > /usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/bin:/usr/bin:/local/sbin:/usr/local/sbin:/usr/sbin:/sbin:/home/rroman/bin > > declare -x BASHLOCAL=".bashrc" > > > declare -x ENVIRONMENT="BATCH" > > > declare -x G_BROKEN_FILENAMES="1" > > > declare -x HISTSIZE="1000" > > > declare -x HOME="/home/rroman" > > > declare -x HOSTNAME="zarate-3.cnca" > > > declare -x KDEDIRS="/usr" > > > declare -x KDE_IS_PRELINKED="1" > > > declare -x KMIX_PULSEAUDIO_DISABLE="1" > > > declare -x LANG="C" > > > declare -x LESSOPEN="|/usr/bin/lesspipe.sh %s" > > > declare -x LOGNAME="rroman" > > > declare -x MAIL="/var/spool/mail/rroman" > > > declare -x OLDPWD > > > declare -x > PATH="/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/bin:/usr/bin:/local/sbin:/usr/local/sbin:/usr/sbin:/sbin:/home/rroman/bin" > > declare -x PBS_ENVIRONMENT="PBS_BATCH" > > > declare -x PBS_JOBCOOKIE="D6AD4C58275CD48820987209370F61CA" > > > declare -x PBS_JOBID="25.zarate-0.cnca" > > > declare -x PBS_JOBNAME="a.sh" > > > declare -x PBS_MOMPORT="15003" > > > declare -x PBS_NODEFILE="/var/torque/aux//25.zarate-0.cnca" > > > declare -x PBS_NODENUM="0" > > > declare -x PBS_O_HOME="/home/rroman" > > > declare -x PBS_O_HOST="zarate-0.cnca" > > > declare -x PBS_O_LANG="en_US.UTF-8" > > > declare -x PBS_O_LOGNAME="rroman" > > > declare -x PBS_O_MAIL="/var/spool/mail/rroman" > > > declare -x > 
PBS_O_PATH="/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/maui/bin:/usr/local/maui/sbin:/usr/local/sbin:/usr/local/bin:/home/rroman/bin"
> declare -x PBS_O_QUEUE="batch"
> declare -x PBS_O_SHELL="/bin/bash"
> declare -x PBS_O_TZ="America/Costa_Rica"
> declare -x PBS_O_WORKDIR="/home/rroman/jobs"
> declare -x PBS_QUEUE="batch"
> declare -x PBS_TASKNUM="1"
> declare -x PBS_VNODENUM="0"
> declare -x PWD="/home/rroman"
> declare -x QTDIR="/usr/lib/qt-3.3"
> declare -x QTINC="/usr/lib/qt-3.3/include"
> declare -x QTLIB="/usr/lib/qt-3.3/lib"
> declare -x SHELL="/bin/bash"
> declare -x SHLVL="2"
> declare -x USER="rroman"
> [rroman at zarate-0 jobs]$

Note that there is no BASHERRECE or PERFIL, and the paths are different.

My problem is that mpicc is in /usr/local/bin and I need it in the PBS jobs. I would really like NOT to add that path in the PBS job scripts, so I need to make it the default, but I don't know why it doesn't work...

Compare the PATH in a normal login and in an interactive job:

> [rroman at zarate-0 jobs]$ echo $PATH
> /usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/maui/bin:/usr/local/maui/sbin:/usr/local/sbin:/usr/local/bin:/home/rroman/bin
> [rroman at zarate-0 jobs]$ qsub -I -q batch
> qsub: waiting for job 27.zarate-0.cnca to start
> qsub: job 27.zarate-0.cnca ready
>
> [rroman at zarate-3 ~]$ echo $PATH
> /usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/bin:/usr/bin:/local/sbin:/usr/local/sbin:/usr/sbin:/sbin:/home/rroman/bin

I'm using CentOS 6, Torque 2.1.10 and Maui 3.3.1.

SO FINALLY THE QUESTIONS:

1. Which scripts are run when I launch a PBS job? (/etc/bashrc? /etc/profile.d/*?)
2. How can I add /usr/local/bin to the PATH of all PBS jobs?

Thanks.
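One way to see exactly which directories are lost is to compare the two values from inside a job. What follows is a minimal diagnostic sketch, assuming bash; the function name missing_path_entries is hypothetical and not part of TORQUE. It relies only on what the job output above shows: PBS_O_PATH holds the PATH of the submitting shell, while PATH is what the job actually received. It does not tell you which startup files the job shell reads.

```shell
#!/bin/bash
# Hypothetical helper (not part of TORQUE): print every directory that is
# in the submission PATH (first argument) but absent from the job PATH
# (second argument), one per line.
missing_path_entries() {
    local submit_path=$1 job_path=$2 entry
    local -a entries
    # Split the colon-separated list without triggering filename globbing.
    IFS=: read -r -a entries <<< "$submit_path"
    for entry in "${entries[@]}"; do
        case ":$job_path:" in
            *":$entry:"*) ;;                  # already present in the job PATH
            *) printf '%s\n' "$entry" ;;      # missing inside the job
        esac
    done
}

# Inside a job, PBS_O_PATH carries the PATH of the submitting shell.
if [ -n "${PBS_O_PATH:-}" ]; then
    missing_path_entries "$PBS_O_PATH" "$PATH"
fi
```

Placed at the top of a job script such as a.sh, this would list the directories that were dropped; for the two PATH values shown above, /usr/local/bin is among them.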
If more info is needed, please point it out! :)

-ricardo

From l.flis at cyf-kr.edu.pl Wed Aug 8 08:27:09 2012
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Wed, 08 Aug 2012 16:27:09 +0200
Subject: [torqueusers] torque IRC channel
Message-ID: <5022773D.1040507@cyf-kr.edu.pl>

Hi,

Is there an IRC channel available for the Torque community and/or developers?
I've checked #torque on freenode, which is registered but empty.

I think it could be useful to have more interactive, low-latency :) contact with the Torque community.

Cheers
--
Lukasz Flis

From brockp at umich.edu Thu Aug 9 16:18:12 2012
From: brockp at umich.edu (Brock Palen)
Date: Thu, 9 Aug 2012 18:18:12 -0400
Subject: [torqueusers] cgroup memory allocation problem
Message-ID: <28F7EF75-5623-4CB9-9544-5BE6E3E9B845@umich.edu>

I filed this with Adaptive, but others should be aware of a major problem for high-memory jobs on pbs_moms using cgroups:

cgroups in Torque 4 assign memory banks on NUMA systems based on core layout only.

Example:
8-core, 48 GB memory, two-socket machine
valid cpus 0-7
valid mems 0-1

If a job is only on the first socket, it is assigned mems 0; if it is on the second, mems 1; if a job is assigned cores on both, it is assigned both.

The above is fine. Now suppose I request 1 core and more memory; the node has two 24 GB memory banks:

qsub -l procs=1,mem=47gb

The mems is set to 0 and cpus to 0. When my job hits 24 GB (the size of mems 0), it starts to swap rather than getting all the assigned memory.

A similar case:
procs=1,mem=20gb
procs=1,mem=20gb
procs=1,mem=20gb

On an empty node, if they all land on the same node, they get assigned cpus 0, 1, and 2, but all get mems 0 and the jobs swap.

Is there a way to just assign all NUMA nodes to jobs and just use CPU binding? Currently we are most interested in CPU binding.
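The arithmetic behind both cases can be sketched as follows. This is illustrative only, assuming bash; the function names banks_needed and overcommit_gb are hypothetical, and the numbers are the 24 GB banks and the requests from the example above, not anything read from a real node.

```shell
#!/bin/bash
# Illustrative arithmetic only; both function names are hypothetical.

# banks_needed MEM_GB BANK_GB
# Number of memory banks required to hold MEM_GB (rounded up).
banks_needed() {
    local mem_gb=$1 bank_gb=$2
    echo $(( (mem_gb + bank_gb - 1) / bank_gb ))
}

# overcommit_gb BANK_GB MEM_GB...
# GB by which jobs confined to a single bank exceed it (0 if they fit).
overcommit_gb() {
    local bank_gb=$1 total=0 mem
    shift
    for mem in "$@"; do
        total=$(( total + mem ))
    done
    if (( total > bank_gb )); then
        echo $(( total - bank_gb ))
    else
        echo 0
    fi
}

banks_needed 47 24          # single-core job requesting 47 GB
overcommit_gb 24 20 20 20   # three 20 GB jobs all confined to mems 0
```

The first call prints 2: the 47 GB job needs both banks, although core-based placement gives it only mems 0. The second prints 36: three 20 GB jobs confined to one 24 GB bank exceed it by 36 GB, which is where the swapping comes from.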
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp at umich.edu
(734)936-1985

From andre.gemuend at scai.fraunhofer.de Tue Aug 7 01:49:56 2012
From: andre.gemuend at scai.fraunhofer.de (André Gemünd)
Date: Tue, 7 Aug 2012 09:49:56 +0200 (CEST)
Subject: [torqueusers] Job not running
In-Reply-To: <501FFFC7.7070509@att.net>
Message-ID: <609225716.4302603.1344325796267.JavaMail.root@scai.fraunhofer.de>

Hi Jeff,

please do a

qmgr -c 'set server log_level = 7'

and try again. Perhaps we can get some more information about the problem then. And please, send a qstat -f, not -a. :)

Greetings
André

----- Original Message -----
> Gus.
>
> Thanks for the email! Everything is run by root and was installed
> by root. I tried your suggestions below to add root to the server
> manager and operators but that didn't change anything. The jobs
> still hang and I can't find out why.
>
> I'm still trying some things but no joy so far. I think the problem is
> in the scheduler but I can't seem to locate the problem. It's the
> simple FIFO scheduler that is part of Torque so I don't see any
> reason why it's holding jobs. The only thing I can think of is that
> it doesn't think there are any resources available but I can't
> find a reason why.
>
> Thanks!
>
> Jeff
>
> > Hi Jeff
> >
> > Are you running pbs_server, pbs_mom, pbs_sched as yourself or root?
> >
> > Would this work?
> >
> > qmgr -c 'set server managers += root@@test1'
> > qmgr -c 'set server operators += root@@test1'
> >
> > From the Torque Admin Guide:
> >
> > http://www.adaptivecomputing.com/resources/docs/torque/4-0-1/help.htm#topics/1-installConfig/installing.htm
> >
> > "Adaptive Computing recommends that the TORQUE administrator be root.
> > For information about customizing the build at configure time, see
> > Customizing the install."
> >
> > "TORQUE must be installed by a root user. If running sudo fails, switch
> > to root with su-."
> > > > I am not sure if root must reign on Torque, but in the user guide > > there are several other references to commands that need > > to be run by root, processes that need to be owned by root, etc. > > A list subscribing expert could shed some light here. > > > > I hope it helps, > > Gus Correa > > > > > > On 08/06/2012 09:24 AM, Jeff Layton wrote: > >> Andr?, > >> > >> Thanks! I tried running the job again this morning without > >> the "np=1" and it still hung. I'm attaching the tracejob output > >> as well as the scheduler logs. Note the mom logs are actually > >> named by the compute node name. For example, the first > >> compute node is called n0001 so the mom logs are called > >> n0001.log. I do this since all of the mom logs are in one > >> NFS directory (/opt/torque/mom_logs). > >> > >> I don't know how to raise the log level - can you help with > >> that? > >> > >> Thanks! > >> > >> Jeff > >> > >> > >> Tracejob output: > >> [root at test1 bin]# qstat -a > >> > >> test1: > >> Req'd > >> Req'd > >> Elap > >> Job ID Username Queue Jobname SessID > >> NDS > >> TSK Memory Time S Time > >> -------------------- ----------- -------- ---------------- ------ > >> ----- > >> ------ ------ ----- - ----- > >> 13.test1 laytonjb batch pbs_test2 -- > >> 1 1 -- 00:10 Q -- > >> [root at test1 bin]# ./tracejob 13.test1 > >> /opt/torque/mom_logs/20120806: No such file or directory > >> /opt/torque/sched_logs/20120806: No matching job records located > >> > >> Job: 13.test1 > >> > >> 08/06/2012 09:32:53 S enqueuing into batch, state 1 hop 1 > >> 08/06/2012 09:32:53 S Job Queued at request of laytonjb at test1, > >> owner > >> = laytonjb at test1, job name = pbs_test2, queue = batch > >> 08/06/2012 09:32:53 A queue=batch > >> > >> > >> > >> > >> Scheduler Logs: > >> [root at test1 sched_logs]# more 20120806 > >> 08/06/2012 09:25:54;0002; pbs_sched;Svr;Log;Log opened > >> 08/06/2012 09:25:54;0002; pbs_sched;Svr;TokenAct;Account file > >> 
/opt/torque/sched_priv/accounting/20120806 opened > >> 08/06/2012 09:25:54;0002; > >> pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched > >> startup pid 2090 > >> > >> > >> > >>> Hi Jeff, > >>> > >>> yes, that it runs with qrun points at the scheduler. > >>> Did the pbs_sched daemon perhaps die? Does it work if you submit > >>> with default resources (specifically without ppn)? Otherwise, > >>> could you paste us a tracejob and qstat -f. And raise the > >>> log_level a bit. I don't see a problem in your logs. > >>> > >>> Greetings > >>> Andre > >>> > >>> ----- Original Message ----- > >>>> Just an FYI - the job would run once I used qrun. Does this > >>>> point > >>>> to the scheduler? (I'm just using the default scheduler that > >>>> comes > >>>> with Torque (i.e. not Maui)). > >>>> > >>>> Thanks! > >>>> > >>>> Jeff > >>>> > >>>> > >>>> > >>>> Good afternoon, > >>>> > >>>> I apologize for the eternal question, "why isn't my job running" > >>>> but I'm not sure where to look next. I'm running Torque 4.0.2 > >>>> that I built on a Scientific Linux 6.2 box. > >>>> > >>>> The job script is, > >>>> > >>>> #!/bin/bash > >>>> #PBS -q batch > >>>> #PBS -l walltime=00:10:00 > >>>> #PBS -l nodes=1:ppn=1 > >>>> > >>>> date > >>>> hostname > >>>> sleep 20 > >>>> date > >>>> > >>>> > >>>> I submit using qsub and then "qstat -a" looks like, > >>>> > >>>> [laytonjb at test1 TEST]$ qstat -a > >>>> > >>>> test1: > >>>> Req'd Req'd Elap > >>>> Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time > >>>> -------------------- ----------- -------- ---------------- > >>>> ------ > >>>> ----- ------ ------ ----- - ----- > >>>> 11.test1 laytonjb batch pbs_test2 -- 1 1 -- 00:10 Q -- > >>>> > >>>> > >>>> It stays like this forever. I looked in the logs and didn't see > >>>> anything obvious. Here is some output that may help. 
> >>>> > >>>> > >>>> Server logs: > >>>> > >>>> 08/05/2012 16:15:35;0100;PBS_Server;Job;11.test1;enqueuing into > >>>> batch, state 1 hop 1 > >>>> 08/05/2012 16:15:35;0008;PBS_Server;Job;11.test1;Job Queued at > >>>> request of laytonjb at test1, owner = laytonjb at test1, job name = > >>>> pbs_test2, queue = batch > >>>> > >>>> > >>>> Scheduler logs: (FIFO scheduler): > >>>> > >>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;die;caught signal 15 > >>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log closed > >>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;Log;Log opened > >>>> 08/05/2012 15:44:44;0002; pbs_sched;Svr;TokenAct;Account file > >>>> /opt/torque/sched_priv/accounting/20120805 opened > >>>> 08/05/2012 15:44:44;0002; > >>>> pbs_sched;Svr;main;/opt/torque/sbin/pbs_sched startup pid 4782 > >>>> > >>>> > >>>> pbs_mom logs: (I tried restarting the mom ("service pbs_mom > >>>> restart") > >>>> and the output is below) > >>>> > >>>> 08/05/2012 16:17:28;0002; pbs_mom;n/a;rm_request;shutdown > >>>> 08/05/2012 16:17:28;0002; pbs_mom;n/a;dep_cleanup;dependent > >>>> cleanup > >>>> 08/05/2012 16:17:28;0002; pbs_mom;Svr;Log;Log closed > >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;Log;Log opened > >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version > >>>> = > >>>> 4.0.2, loglevel = 0 > >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;setpbsserver;test1 > >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;mom_server_add;server > >>>> test1 > >>>> added > >>>> 08/05/2012 16:17:31;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::No such > >>>> file > >>>> or directory (2) in check_partition_confirm_script, Couldn't > >>>> stat > >>>> the partition confirm command '/opt/moab/default/tools/xt4/ > >>>> partition.create.xt4.pl ' - ignore this if you aren't running a > >>>> cray > >>>> 08/05/2012 16:17:31;0002; pbs_mom;n/a;initialize;independent > >>>> 08/05/2012 16:17:31;0080; pbs_mom;Svr;pbs_mom;before > >>>> init_abort_jobs > >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Is up > >>>> 
08/05/2012 16:17:31;0002; > >>>> pbs_mom;Svr;setup_program_environment;MOM > >>>> executable path and mtime at launch: /usr/sbin/pbs_mom > >>>> 1344179259 > >>>> 08/05/2012 16:17:31;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version > >>>> = > >>>> 4.0.2, loglevel = 0 > >>>> > >>>> > >>>> pbsnodes -a: > >>>> > >>>> [root at test1 mom_logs]# pbsnodes -a > >>>> n0001 > >>>> state = free > >>>> np = 1 > >>>> ntype = cluster > >>>> status = > >>>> rectime=1344197869,varattr=,jobs=,state=free,netload=120587595,gres=,loadave=0.02,ncpus=3,physmem=2956668kb,availmem=2836956kb,totmem=2956668kb,idletime=4196,nusers=1,nsessions=1,sessions=1560,uname=Linux > >>>> n0001 2.6.32-220.el6.x86_64 #1 SMP Sat Dec 10 17:04:11 CST 2011 > >>>> x86_64,opsys=linux > >>>> mom_service_port = 15002 > >>>> mom_manager_port = 15003 > >>>> gpus = 0 > >>>> > >>>> > >>>> > >>>> qmgr -c "p s": > >>>> [root at test1 mom_logs]# qmgr -c "p s" > >>>> # > >>>> # Create queues and set their attributes. > >>>> # > >>>> # > >>>> # Create and define queue batch > >>>> # > >>>> create queue batch > >>>> set queue batch queue_type = Execution > >>>> set queue batch resources_default.nodes = 1 > >>>> set queue batch resources_default.walltime = 01:00:00 > >>>> set queue batch enabled = True > >>>> set queue batch started = True > >>>> # > >>>> # Set server attributes. 
> >>>> # > >>>> set server scheduling = True > >>>> set server acl_hosts = test1 > >>>> set server managers = laytonjb at test1 > >>>> set server operators = laytonjb at test1 > >>>> set server default_queue = batch > >>>> set server log_events = 511 > >>>> set server mail_from = adm > >>>> set server scheduler_iteration = 600 > >>>> set server node_check_rate = 150 > >>>> set server tcp_timeout = 300 > >>>> set server job_stat_rate = 45 > >>>> set server poll_jobs = True > >>>> set server mom_job_sync = True > >>>> set server next_job_number = 12 > >>>> set server moab_array_compatible = True > >>>> > >>>> > >>>> Not sure where to start looking from here. > >>>> > >>>> TIA! > >>>> > >>>> Jeff > >>>> > >> _______________________________________________ > >> torqueusers mailing list > >> torqueusers at supercluster.org > >> http://www.supercluster.org/mailman/listinfo/torqueusers > -- André Gemünd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend From go-yoshimura at sstc.co.jp Wed Aug 8 00:47:40 2012 From: go-yoshimura at sstc.co.jp (Go Yoshimura) Date: Wed, 08 Aug 2012 15:47:40 +0900 Subject: [torqueusers] How can we cancel a job array with qdel of torque4.1.0? Message-ID: <201208080647.AA13811@winxp-pc.sstc.co.jp> Hello, My name is Go Yoshimura. I am new to torque4.1.0, but I have experience using older versions (torque2.3 or torque2.4). - My question is about canceling a job array. 
I cannot cancel jobs in a job array with qdel -t array_range job_identifier; it fails with errors like qdel: Unauthorized Request MSG=must have operator or manager privilege to use -m parameter 79.torque02 How can we cancel jobs in a job array with qdel? - I cannot cancel jobs in a job array with qdel -t even as the root user. I don't get any error messages, but no jobs are canceled. - I can cancel a specific job in a job array with a command like qdel 79[8] - I can cancel all jobs with qdel all Thank you, go ---- ((job array)) [test01 at torque02 torque-4.1.0]$ qstat --version version: 4.1.0 [test01 at torque02 ~]$ qsub -t 1-8 cilk.qsub 79[].torque02 [test01 at torque02 ~]$ qstat -Q Queue Max Tot Ena Str Que Run Hld Wat Trn Ext T Cpt ---------------- --- ---- -- -- --- --- --- --- --- --- - --- batch 0 8 yes yes 7 1 0 0 0 0 E 0 [test01 at torque02 ~]$ qstat -t -a torque02: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 79[1].torque02 test01 batch CILK-1 32606 1 8 -- -- C 00:00 79[2].torque02 test01 batch CILK-2 32710 1 8 -- -- R -- 79[3].torque02 test01 batch CILK-3 -- 1 8 -- -- Q -- 79[4].torque02 test01 batch CILK-4 -- 1 8 -- -- Q -- 79[5].torque02 test01 batch CILK-5 -- 1 8 -- -- Q -- 79[6].torque02 test01 batch CILK-6 -- 1 8 -- -- Q -- 79[7].torque02 test01 batch CILK-7 -- 1 8 -- -- Q -- 79[8].torque02 test01 batch CILK-8 -- 1 8 -- -- Q -- [test01 at torque02 ~]$ qdel 79[8] [test01 at torque02 ~]$ qstat -t -a torque02: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 79[1].torque02 test01 batch CILK-1 32606 1 8 -- -- C 00:00 79[2].torque02 test01 batch CILK-2 32710 1 8 -- -- C 00:00 79[3].torque02 test01 batch CILK-3 317 1 8 -- -- R -- 79[4].torque02 test01 batch CILK-4 -- 1 8 -- -- Q -- 79[5].torque02 test01 batch CILK-5 -- 1 
8 -- -- Q -- 79[6].torque02 test01 batch CILK-6 -- 1 8 -- -- Q -- 79[7].torque02 test01 batch CILK-7 -- 1 8 -- -- Q -- 79[8].torque02 test01 batch CILK-8 -- 1 8 -- -- C -- [test01 at torque02 ~]$ qdel 79 -t 6-7 qdel: Unauthorized Request MSG=must have operator or manager privilege to use -m parameter 79.torque02 ((cancel job array by root user)) [root at torque02 ~]# qstat -a -t torque02: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 80[1].torque02 test01 batch CILK-1 2257 1 8 -- -- C 00:00 80[2].torque02 test01 batch CILK-2 2332 1 8 -- -- R -- 80[3].torque02 test01 batch CILK-3 -- 1 8 -- -- Q -- 80[4].torque02 test01 batch CILK-4 -- 1 8 -- -- Q -- 80[5].torque02 test01 batch CILK-5 -- 1 8 -- -- Q -- 80[6].torque02 test01 batch CILK-6 -- 1 8 -- -- Q -- 80[7].torque02 test01 batch CILK-7 -- 1 8 -- -- Q -- 80[8].torque02 test01 batch CILK-8 -- 1 8 -- -- Q -- [root at torque02 ~]# qdel -t 4-8 80 [root at torque02 ~]# qstat -a -t torque02: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - ----- 80[1].torque02 test01 batch CILK-1 2257 1 8 -- -- C 00:00 80[2].torque02 test01 batch CILK-2 2332 1 8 -- -- C 00:00 80[3].torque02 test01 batch CILK-3 -- 1 8 -- -- Q -- 80[4].torque02 test01 batch CILK-4 -- 1 8 -- -- Q -- 80[5].torque02 test01 batch CILK-5 -- 1 8 -- -- Q -- 80[6].torque02 test01 batch CILK-6 -- 1 8 -- -- Q -- 80[7].torque02 test01 batch CILK-7 -- 1 8 -- -- Q -- 80[8].torque02 test01 batch CILK-8 -- 1 8 -- -- Q -- ---- Go Yoshimura Scalable Systems Co., Ltd. Osaka Office HONMACHI-COLLABO Bldg. 
4F, 4-4-2 Kita-kyuhoji-machi, Chuo-ku, Osaka 541-0057 Japan Tel: 81-6-6224-4115 Tokyo Kojimachi Office BUREX Kojimachi 11F, 3-5-2 Kojimachi, Chiyoda-ku, Tokyo 102-0083 Japan Tel: 81-3-5875-4718 Fax: 81-3-3237-7612 From asdasdi57 at yahoo.com Sat Aug 4 03:56:19 2012 From: asdasdi57 at yahoo.com (Asd Asdi) Date: Sat, 4 Aug 2012 02:56:19 -0700 (PDT) Subject: [torqueusers] problem Message-ID: <1344074179.20673.YahooMailNeo@web120101.mail.ne1.yahoo.com> Hello all, I have a cluster with 4 nodes. Each node has 24 cores, but when I send jobs with qsub to the nodes, only 18 cores on each node start to work and 6 cores on each node are not used. I do not know how to solve this problem. I was wondering if you can help me. Sincerely, Ahmad Bayat From jars99 at gmail.com Thu Aug 9 18:36:26 2012 From: jars99 at gmail.com (Jared Bristow) Date: Thu, 9 Aug 2012 18:36:26 -0600 Subject: [torqueusers] test Message-ID: This is a test to verify that the torqueusers mailing list is working -Jared Bristow IT Manager at Adaptive Computing From roman.ricardo at gmail.com Thu Aug 9 18:39:29 2012 From: roman.ricardo at gmail.com (Ricardo Román Brenes) Date: Thu, 9 Aug 2012 18:39:29 -0600 Subject: [torqueusers] problem In-Reply-To: <1344074179.20673.YahooMailNeo@web120101.mail.ne1.yahoo.com> References: <1344074179.20673.YahooMailNeo@web120101.mail.ne1.yahoo.com> Message-ID: Hi asad, could you post your node file and the torque server configuration? On Aug 9, 2012 6:23 PM, "Asd Asdi" wrote: > Hello all, > > I have a cluster with 4 nodes. 
Each node has 24 cores, but when I > send jobs with qsub to the nodes, > > only 18 cores on each node start to work and 6 cores on each node are not > used. I do not know how to solve this problem. > > I was wondering if you can help me. > > Sincerely, > > Ahmad Bayat > > From chemadm at hamilton.edu Thu Aug 9 09:53:51 2012 From: chemadm at hamilton.edu (Steven Young) Date: Thu, 9 Aug 2012 11:53:51 -0400 Subject: [torqueusers] duplicate job processes create IO wait state on node Message-ID: Hello, I'm running into an issue with a cluster where it seems like two copies of the same job are started, which ends up making the node unstable in an IO wait state, since both copies appear to be trying to use the same files for the job. When I do a ps and trace through the PIDs and PPIDs, it looks like the mom daemon fires up the login for the job (i.e., -tcsh) and then it creates the SC file. Then that process spawns another copy of the same process. I haven't been able to figure out how to reproduce this problem, but after one of the users runs a bunch of jobs, eventually some of the nodes get into this state. It's a 12-core machine and the load average will be up around 14. When you look at the running processes there isn't anything running on the machine. Looking at top you see rpciod just churning away and creating this IO wait on the machine. Every time a node gets into this state I see the same thing... all the jobs scheduled on the machine have 2 processes "/bin/sh /var/spool/torque/mom_priv/jobs/..SC", whereas on a working node you see only one of these, which is what I would expect. 
When I look at the mom logs for that node, it appears that there were some issues trying to remove the previous jobs from the node just before the jobs that end up being duplicates. Perhaps this is related? I'm hoping someone might be able to give me some hints as to why this could be happening. Thanks, -Steve OS: RedHat Enterprise 6 Torque: 3.0.2 Maui: 3.3.1 UID PID PPID C STIME TTY TIME CMD root 2123 1 0 Aug05 ? 00:01:19 /usr/local/sbin/pbs_mom -q -d /var/spool/torque 10913 2123 0 03:54 ? 00:00:00 -tcsh 11079 10913 0 03:54 ? 00:00:00 /bin/sh /var/spool/torque/mom_priv/jobs/56290..SC 2612 11079 0 03:59 ? 00:00:00 /bin/sh /var/spool/torque/mom_priv/jobs/56290..SC mom_logs on node (I replaced the real server name with pbs_server_name): 08/07/2012 03:03:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 08/07/2012 03:08:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 08/07/2012 03:13:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 08/07/2012 03:18:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 08/07/2012 03:23:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 08/07/2012 03:28:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 08/07/2012 03:33:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 08/07/2012 03:38:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 08/07/2012 03:39:53;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 1867 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 2491 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 7426 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 1976 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 2486 task 1 with sig 15 08/07/2012 03:39:53;0008; 
pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 6034 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2079 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2488 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 13274 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2368 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2568 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 18730 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3875 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3998 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 21115 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 3978 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 4128 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 29379 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4068 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4252 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 13662 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4407 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4544 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 31496 task 
1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5643 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5757 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 10527 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5728 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5890 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 27051 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 5835 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 6012 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 30157 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6018 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6297 task 1 with sig 15 08/07/2012 03:39:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 24477 task 1 with sig 15 08/07/2012 03:39:55;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 1867 task 1 gracefully with sig 15 08/07/2012 03:39:55;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:55;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:55;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:56;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:56;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 
03:39:56;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:56;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:57;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:57;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:57;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:57;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:58;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:58;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:58;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:58;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:59;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:59;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:59;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:39:59;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 08/07/2012 03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 1867 task 1 with sig 9 08/07/2012 03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 2491 task 1 gracefully with sig 15 08/07/2012 03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=R) after sig 15 08/07/2012 
03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:01;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:01;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:01;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:01;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:02;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:02;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:02;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:02;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:03;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:03;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:03;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:03;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:04;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:04;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:04;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:04;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 
03:40:05;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 08/07/2012 03:40:05;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 2491 task 1 with sig 9 08/07/2012 03:40:05;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 5020 task 1 gracefully with sig 15 08/07/2012 03:40:05;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=5020/state=R) after sig 15 08/07/2012 03:40:05;0080; pbs_mom;Job;52769.pbs_server_name;scan_for_terminated: job 52769.pbs_server_name task 1 terminated, sid=1867 08/07/2012 03:40:05;0008; pbs_mom;Job;52769.pbs_server_name;job was terminated 08/07/2012 03:40:05;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 1976 task 1 gracefully with sig 15 08/07/2012 03:40:05;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:05;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:06;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:06;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:06;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:06;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:07;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:07;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:07;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:07;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:08;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 
03:40:08;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:08;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:08;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:09;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:09;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:09;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:09;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 1976 task 1 with sig 9 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 2486 task 1 gracefully with sig 15 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=R) after sig 15 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 08/07/2012 03:40:11;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 08/07/2012 03:40:11;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 08/07/2012 03:40:11;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 08/07/2012 03:40:11;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 08/07/2012 03:40:12;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 08/07/2012 
03:40:12;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:12;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:12;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:13;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:13;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:13;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:13;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:14;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:14;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:14;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:14;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15
08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 2486 task 1 with sig 9
08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 5001 task 1 gracefully with sig 15
08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=5001/state=R) after sig 15
08/07/2012 03:40:15;0080; pbs_mom;Job;52770.pbs_server_name;scan_for_terminated: job 52770.pbs_server_name task 1 terminated, sid=1976
08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;job was terminated
08/07/2012 03:40:15;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2079 task 1 gracefully with sig 15
08/07/2012 03:40:15;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:16;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:16;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:16;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:16;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:17;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:17;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:17;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:17;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:18;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:18;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:18;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:18;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:19;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:19;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:19;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:19;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:20;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:20;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:20;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15
08/07/2012 03:40:20;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2079 task 1 with sig 9
08/07/2012 03:40:20;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2488 task 1 gracefully with sig 15
08/07/2012 03:40:20;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=R) after sig 15
08/07/2012 03:40:21;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:21;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:21;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:21;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:22;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:22;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:22;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:22;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:23;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:23;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:23;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:23;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:24;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:24;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:24;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:24;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15
08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2488 task 1 with sig 9
08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 5002 task 1 gracefully with sig 15
08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=5002/state=R) after sig 15
08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
08/07/2012 03:40:26;0080; pbs_mom;Job;52771.pbs_server_name;scan_for_terminated: job 52771.pbs_server_name task 1 terminated, sid=2079
08/07/2012 03:40:26;0080; pbs_mom;Job;52769.pbs_server_name;obit sent to server
08/07/2012 03:40:26;0008; pbs_mom;Job;52771.pbs_server_name;job was terminated
08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
08/07/2012 03:40:26;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2368 task 1 gracefully with sig 15
08/07/2012 03:40:26;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:26;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:26;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:26;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:27;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:27;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:27;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:27;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:28;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:28;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:28;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:28;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:29;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:29;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:29;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:29;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:30;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:30;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:30;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:30;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15
08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2368 task 1 with sig 9
08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2568 task 1 gracefully with sig 15
08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=R) after sig 15
08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:32;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:32;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:32;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:32;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:33;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:33;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:33;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:33;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:34;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:34;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:34;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:34;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15
08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2568 task 1 with sig 9
08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 5000 task 1 gracefully with sig 15
08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=5000/state=R) after sig 15
08/07/2012 03:40:36;0080; pbs_mom;Job;52772.pbs_server_name;scan_for_terminated: job 52772.pbs_server_name task 1 terminated, sid=2368
08/07/2012 03:40:36;0080; pbs_mom;Job;52770.pbs_server_name;obit sent to server
08/07/2012 03:40:36;0008; pbs_mom;Job;52772.pbs_server_name;job was terminated
08/07/2012 03:40:36;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
08/07/2012 03:40:36;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
08/07/2012 03:40:36;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
08/07/2012 03:40:36;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3875 task 1 gracefully with sig 15
08/07/2012 03:40:36;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:36;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:36;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:36;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:37;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:37;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:37;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:37;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:38;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:38;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:38;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:38;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:39;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:39;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:39;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:39;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:40;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:40;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:40;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:40;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15
08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3875 task 1 with sig 9
08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3998 task 1 gracefully with sig 15
08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=R) after sig 15
08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:42;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:42;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:42;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:42;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:43;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:43;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:43;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:43;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:44;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:44;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:44;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:44;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3998 task 1 with sig 9
08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 5006 task 1 gracefully with sig 15
08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=5006/state=R) after sig 15
08/07/2012 03:40:46;0080; pbs_mom;Job;52773.pbs_server_name;scan_for_terminated: job 52773.pbs_server_name task 1 terminated, sid=3875
08/07/2012 03:40:46;0080; pbs_mom;Job;52771.pbs_server_name;obit sent to server
08/07/2012 03:40:46;0008; pbs_mom;Job;52773.pbs_server_name;job was terminated
08/07/2012 03:40:46;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
08/07/2012 03:40:46;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
08/07/2012 03:40:46;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
08/07/2012 03:40:46;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 3978 task 1 gracefully with sig 15
08/07/2012 03:40:46;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:46;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:46;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:46;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:47;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:47;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:47;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:47;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:48;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:48;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:48;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:49;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:49;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:49;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:49;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:50;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:50;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:50;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:50;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 3978 task 1 with sig 9
08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 4128 task 1 gracefully with sig 15
08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=R) after sig 15
08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:52;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:52;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:52;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:52;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:54;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:54;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:54;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:54;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:55;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:55;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:55;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:55;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
08/07/2012 03:40:56;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 4128 task 1 with sig 9
08/07/2012 03:40:56;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 4997 task 1 gracefully with sig 15
08/07/2012 03:40:56;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4997/state=R) after sig 15
08/07/2012 03:40:56;0080; pbs_mom;Job;52774.pbs_server_name;scan_for_terminated: job 52774.pbs_server_name task 1 terminated, sid=3978
08/07/2012 03:40:56;0080; pbs_mom;Job;52772.pbs_server_name;obit sent to server
08/07/2012 03:40:56;0008; pbs_mom;Job;52774.pbs_server_name;job was terminated
08/07/2012 03:40:56;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
08/07/2012 03:40:56;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
08/07/2012 03:40:56;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
08/07/2012 03:40:56;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4068 task 1 gracefully with sig 15
08/07/2012 03:40:56;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:56;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:56;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:57;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:57;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:57;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:57;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:58;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:58;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:58;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:58;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:59;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:59;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:59;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:40:59;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:41:00;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:41:00;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:41:00;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:41:00;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4068 task 1 with sig 9
08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4252 task 1 gracefully with sig 15
08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=R) after sig 15
08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:02;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:02;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:02;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:02;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:03;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:03;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:03;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:03;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:04;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:04;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:04;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:04;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:05;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:05;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:05;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:05;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
08/07/2012 03:41:06;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4252 task 1 with sig 9
08/07/2012 03:41:06;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 5032 task 1 gracefully with sig 15
08/07/2012 03:41:06;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=5032/state=R) after sig 15
08/07/2012 03:41:06;0080; pbs_mom;Job;52775.pbs_server_name;scan_for_terminated: job 52775.pbs_server_name task 1 terminated, sid=4068
08/07/2012 03:41:06;0080; pbs_mom;Job;52773.pbs_server_name;obit sent to server
08/07/2012 03:41:06;0008; pbs_mom;Job;52775.pbs_server_name;job was terminated
08/07/2012 03:41:06;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
08/07/2012 03:41:06;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
08/07/2012 03:41:06;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
08/07/2012 03:41:06;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4407 task 1 gracefully with sig 15
08/07/2012 03:41:06;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:06;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:06;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:07;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:07;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:07;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:07;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:08;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:08;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:08;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:08;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:09;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:09;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:09;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:09;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:10;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:10;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:10;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:10;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4407 task 1 with sig 9
08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4544 task 1 gracefully with sig 15
08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=R) after sig 15
08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:12;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:12;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:12;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:12;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:13;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:13;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:13;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:13;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:14;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:14;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:14;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:14;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:15;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:15;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:15;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:15;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
08/07/2012 03:41:16;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4544 task 1 with sig 9
08/07/2012 03:41:16;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 5039 task 1 gracefully with sig 15
08/07/2012 03:41:16;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=5039/state=R) after sig 15
08/07/2012 03:41:16;0080; pbs_mom;Job;52776.pbs_server_name;scan_for_terminated: job 52776.pbs_server_name task 1 terminated, sid=4407
08/07/2012 03:41:16;0080; pbs_mom;Job;52774.pbs_server_name;obit sent to server
08/07/2012 03:41:16;0008; pbs_mom;Job;52776.pbs_server_name;job was terminated
08/07/2012 03:41:16;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
08/07/2012 03:41:16;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
08/07/2012 03:41:16;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
08/07/2012 03:41:16;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5036 task 1 gracefully with sig 15
08/07/2012 03:41:16;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5036/state=R) after sig 15
08/07/2012 03:41:16;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5643 task 1 gracefully with sig 15
08/07/2012 03:41:16;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:16;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:17;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:17;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:17;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:17;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:18;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:18;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:18;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:18;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:19;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:19;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:19;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:19;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:20;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:20;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:20;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:20;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5643 task 1 with sig 9
08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5757 task 1 gracefully with sig 15
08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=R) after sig 15
08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
08/07/2012 03:41:22;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
08/07/2012 03:41:22;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
08/07/2012 03:41:22;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
08/07/2012 03:41:22;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process
(pid=5757/state=S) after sig 15 08/07/2012 03:41:23;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:23;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:23;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:23;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:24;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:24;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:24;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:24;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:25;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:25;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:25;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:25;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:26;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15 08/07/2012 03:41:26;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5757 task 1 with sig 9 08/07/2012 03:41:26;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 6320 task 1 gracefully with sig 15 08/07/2012 03:41:26;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=6320/state=R) after sig 15 08/07/2012 03:41:26;0080; pbs_mom;Job;52777.pbs_server_name;scan_for_terminated: job 52777.pbs_server_name task 1 terminated, sid=5643 08/07/2012 03:41:26;0080; 
pbs_mom;Job;52775.pbs_server_name;obit sent to server 08/07/2012 03:41:26;0008; pbs_mom;Job;52777.pbs_server_name;job was terminated 08/07/2012 03:41:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:26;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:26;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 08/07/2012 03:41:26;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5054 task 1 gracefully with sig 15 08/07/2012 03:41:26;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5054/state=R) after sig 15 08/07/2012 03:41:26;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5728 task 1 gracefully with sig 15 08/07/2012 03:41:26;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:27;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:27;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:27;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:27;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:28;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:28;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:28;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:28;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:29;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:29;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 
03:41:29;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:29;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:30;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:30;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:30;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:30;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:31;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:31;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:31;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15 08/07/2012 03:41:31;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5728 task 1 with sig 9 08/07/2012 03:41:31;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5890 task 1 gracefully with sig 15 08/07/2012 03:41:31;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:32;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:32;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:32;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:32;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:33;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:33;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 
03:41:33;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:33;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:34;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:34;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:34;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:34;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:35;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:35;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:35;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:35;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5890 task 1 with sig 9 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 6196 task 1 gracefully with sig 15 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=6196/state=R) after sig 15 08/07/2012 03:41:37;0080; pbs_mom;Job;52778.pbs_server_name;scan_for_terminated: job 52778.pbs_server_name task 1 terminated, sid=5728 08/07/2012 03:41:37;0080; pbs_mom;Job;52776.pbs_server_name;obit sent to server 08/07/2012 
03:41:37;0008; pbs_mom;Job;52778.pbs_server_name;job was terminated 08/07/2012 03:41:37;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:37;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:37;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 5063 task 1 gracefully with sig 15 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5063/state=R) after sig 15 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 5835 task 1 gracefully with sig 15 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:38;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:38;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:38;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:38;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:39;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:39;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:39;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:39;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:40;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process 
(pid=5835/state=S) after sig 15 08/07/2012 03:41:40;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:40;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:40;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:41;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:41;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:41;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:41;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 5835 task 1 with sig 9 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 6012 task 1 gracefully with sig 15 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=R) after sig 15 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:43;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:43;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:43;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:43;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process 
(pid=6012/state=S) after sig 15 08/07/2012 03:41:44;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:44;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:44;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:44;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:45;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:45;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:45;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:45;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:46;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:46;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:46;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:46;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 08/07/2012 03:41:47;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 6012 task 1 with sig 9 08/07/2012 03:41:47;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 3605 task 1 gracefully with sig 15 08/07/2012 03:41:47;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=3605/state=R) after sig 15 08/07/2012 03:41:47;0080; pbs_mom;Job;52779.pbs_server_name;scan_for_terminated: job 52779.pbs_server_name task 1 terminated, sid=5835 08/07/2012 03:41:47;0080; pbs_mom;Job;52777.pbs_server_name;obit sent to server 08/07/2012 03:41:47;0008; pbs_mom;Job;52779.pbs_server_name;job was terminated 
08/07/2012 03:41:47;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:47;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:47;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 08/07/2012 03:41:47;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 5062 task 1 gracefully with sig 15 08/07/2012 03:41:47;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=5062/state=R) after sig 15 08/07/2012 03:41:47;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6018 task 1 gracefully with sig 15 08/07/2012 03:41:47;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:47;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:48;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:48;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:48;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:48;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:49;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:49;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:49;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:49;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:50;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:50;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:50;0008; 
pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:50;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:51;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:51;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:51;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:51;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6018 task 1 with sig 9 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6297 task 1 gracefully with sig 15 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=R) after sig 15 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:54;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:54;0008; 
pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:54;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:54;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:55;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:55;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:55;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:55;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:56;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:56;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:56;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:56;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:57;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 08/07/2012 03:41:57;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6297 task 1 with sig 9 08/07/2012 03:41:57;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 32585 task 1 gracefully with sig 15 08/07/2012 03:41:57;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=32585/state=R) after sig 15 08/07/2012 03:41:57;0080; pbs_mom;Job;52780.pbs_server_name;scan_for_terminated: job 52780.pbs_server_name task 1 terminated, sid=6018 08/07/2012 03:41:57;0080; pbs_mom;Job;52778.pbs_server_name;obit sent to server 08/07/2012 03:41:57;0008; pbs_mom;Job;52780.pbs_server_name;job was terminated 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of 
preobit_reply 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 08/07/2012 03:41:57;0080; pbs_mom;Job;52779.pbs_server_name;obit sent to server 08/07/2012 03:41:57;0001; pbs_mom;Job;52772.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 08/07/2012 03:41:57;0001; pbs_mom;Job;52773.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 08/07/2012 03:41:57;0001; pbs_mom;Job;52774.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 08/07/2012 03:41:57;0001; pbs_mom;Job;52775.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 08/07/2012 03:41:57;0001; pbs_mom;Job;52776.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 08/07/2012 03:41:57;0001; pbs_mom;Job;52777.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 08/07/2012 03:41:57;0001; pbs_mom;Job;52780.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 57 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req 08/07/2012 03:41:57;0080; pbs_mom;Job;52769.pbs_server_name;removed job script 08/07/2012 03:41:57;0080; pbs_mom;Job;52770.pbs_server_name;removed job script 08/07/2012 03:41:57;0080; pbs_mom;Job;52771.pbs_server_name;removed job script 08/07/2012 03:41:57;0080; pbs_mom;Job;52772.pbs_server_name;removed job script 08/07/2012 03:41:57;0080; pbs_mom;Job;52773.pbs_server_name;removed job script 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 
03:41:57;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:57;0001; pbs_mom;Job;52774.pbs_server_name;preobit_reply, unknown on server, deleting locally 08/07/2012 03:41:57;0080; pbs_mom;Job;52774.pbs_server_name;removed job script 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:57;0001; pbs_mom;Job;52775.pbs_server_name;preobit_reply, unknown on server, deleting locally 08/07/2012 03:41:57;0080; pbs_mom;Job;52775.pbs_server_name;removed job script 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:57;0001; pbs_mom;Job;52776.pbs_server_name;preobit_reply, unknown on server, deleting locally 08/07/2012 03:41:57;0080; pbs_mom;Job;52776.pbs_server_name;removed job script 08/07/2012 03:41:57;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=node0013 MSG=cannot locate job to delete), aux=0, type=DeleteJob, from PBS_Server at pbs_server_name 08/07/2012 03:41:57;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=node0013 MSG=cannot locate job to delete), aux=0, type=DeleteJob, from PBS_Server at pbs_server_name 08/07/2012 03:41:57;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=node0013 MSG=cannot locate 
job to delete), aux=0, type=DeleteJob, from PBS_Server at pbs_server_name 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat 08/07/2012 03:41:57;0080; pbs_mom;Job;52780.pbs_server_name;obit sent to server 08/07/2012 03:41:57;0080; pbs_mom;Job;52777.pbs_server_name;obit sent to server 08/07/2012 03:41:57;0080; pbs_mom;Job;52777.pbs_server_name;removed job script 08/07/2012 03:41:57;0001; pbs_mom;Job;52778.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 08/07/2012 03:41:57;0001; pbs_mom;Job;52779.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 08/07/2012 03:41:57;0001; pbs_mom;Req;obit reply;Job not found for obit reply 08/07/2012 03:41:57;0080; pbs_mom;Job;52779.pbs_server_name;removed job script 08/07/2012 03:41:57;0080; pbs_mom;Job;52778.pbs_server_name;removed job script 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req 08/07/2012 03:41:57;0080; pbs_mom;Job;52780.pbs_server_name;removed job script 08/07/2012 03:43:11;0002; 
pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 08/07/2012 03:48:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 08/07/2012 03:53:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56282.pbs_server_name started, pid = 7265 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56283.pbs_server_name started, pid = 7369 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56284.pbs_server_name started, pid = 7454 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56285.pbs_server_name started, pid = 7562 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56286.pbs_server_name started, pid = 8959 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56287.pbs_server_name started, pid = 9080 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56288.pbs_server_name started, pid = 9174 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56289.pbs_server_name started, pid = 9431 08/07/2012 03:54:13;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE 08/07/2012 03:54:13;0001; pbs_mom;Job;TMomFinalizeJob3;job 
56290.pbs_server_name started, pid = 10913
08/07/2012 03:54:13;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE
08/07/2012 03:54:13;0001; pbs_mom;Job;TMomFinalizeJob3;job 56291.pbs_server_name started, pid = 11034
08/07/2012 03:54:13;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE
08/07/2012 03:54:13;0001; pbs_mom;Job;TMomFinalizeJob3;job 56292.pbs_server_name started, pid = 11109
08/07/2012 03:54:13;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE
08/07/2012 03:54:13;0001; pbs_mom;Job;TMomFinalizeJob3;job 56293.pbs_server_name started, pid = 11247
08/07/2012 03:58:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
08/07/2012 04:03:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
08/07/2012 04:08:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
08/07/2012 04:13:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
08/07/2012 04:18:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
08/07/2012 04:23:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
08/07/2012 04:28:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
08/07/2012 04:33:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
08/07/2012 04:38:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
08/07/2012 04:43:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0

From listsarnau at gmail.com Wed Aug 8 01:17:17 2012
From: listsarnau at gmail.com (Arnau Bria)
Date: Wed, 8 Aug 2012 09:17:17 +0200
Subject: [torqueusers] negative values in qstat
Message-ID: <20120808091717.7f4e30f0@amarrosa.pic.es>

Hi all,

maybe someone has already seen this:

# qstat -q short

server: pbs03.pic.es

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
short              --   01:30:00 03:00:00   --    0  -3 --   E R
                                               ----- -----
                                                   0    -3
# qstat short
#

What's the meaning of this negative value?

torque-2.5.9-1.cri.x86_64

TIA,
Arnau

From gus at ldeo.columbia.edu Fri Aug 10 09:38:12 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 10 Aug 2012 11:38:12 -0400
Subject: [torqueusers] problem
In-Reply-To: <1344074179.20673.YahooMailNeo@web120101.mail.ne1.yahoo.com>
References: <1344074179.20673.YahooMailNeo@web120101.mail.ne1.yahoo.com>
Message-ID: <50252AE4.9090308@ldeo.columbia.edu>

On 08/04/2012 05:56 AM, Asd Asdi wrote:
> Hello all,
>
> I have a cluster with 4 nodes. Each node has 24 cores, but
> when I send jobs to the nodes with qsub,
>
> only 18 cores on each node start working and 6 cores on each node are
> not used. I do not know how to solve this problem.
>
> I was wondering if you can help me.
>
> Sincerely,
>
> Ahmad Bayat
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

Hi Asd

Does the $TORQUE/server_priv/nodes file on the Torque server
have lines like this [with your nodes' names, of course]?

node01 np=24
...

Does your Torque/PBS script request 24 cores/processors per node?
Something like this:

#PBS -l nodes=4:ppn=24
...
mpiexec -np 96 ./my_mpi_program

I think somebody already asked you to send your
$TORQUE/server_priv/nodes file and the output of

qmgr -c 'p s'

A snippet of your job submission script may help too.
These things may help us understand what the problem is.

Gus Correa

From chemadm at hamilton.edu Fri Aug 10 12:05:42 2012
From: chemadm at hamilton.edu (Steven Young)
Date: Fri, 10 Aug 2012 14:05:42 -0400
Subject: [torqueusers] Fwd: duplicate job processes create IO wait state on node
References:
Message-ID:

I sent this yesterday and didn't see it hit the list yet, so I thought I'd try sending it again.
Begin forwarded message:

> From: Steven Young
> Date: August 9, 2012 11:53:51 AM EDT
> To: torqueusers Mailing List
> Subject: duplicate job processes create IO wait state on node
>
> Hello,
> I'm running into an issue with a cluster where it seems like two copies of the same job are started, which ends up making the node unstable in an IO wait state since both copies appear to be trying to use the same files for the job. When I do a ps and trace through the PIDs and PPIDs, it looks like the mom daemon fires up the login for the job (i.e. -tcsh) and then it creates the SC file. Then that process spawns another copy of the same process. I haven't been able to figure out how to reproduce this problem, but after one of the users runs a bunch of jobs, eventually some of the nodes get into this state. It's a 12 core machine and the load average will be up around 14. When you look at the running processes there isn't anything running on the machine. Looking at top you see rpciod just churning away and creating this IO wait on the machine. Every time a node gets into this state I see the same thing: all the jobs scheduled on the machine have 2 processes "/bin/sh /var/spool/torque/mom_priv/jobs/..SC", whereas on a working node you only see one of these, which is what I would expect. When I look at the mom logs for that node, it appears that there were some issues trying to remove the previous jobs from the node just before the jobs that end up being duplicates. Perhaps this is related? I'm hoping someone might be able to give me some hints as to why this could be happening. Thanks,
>
> -Steve
>
> OS: RedHat Enterprise 6
> Torque: 3.0.2
> Maui: 3.3.1
>
>
> UID PID PPID C STIME TTY TIME CMD
>
> root 2123 1 0 Aug05 ? 00:01:19 /usr/local/sbin/pbs_mom -q -d /var/spool/torque
> 10913 2123 0 03:54 ? 00:00:00 -tcsh
> 11079 10913 0 03:54 ? 00:00:00 /bin/sh /var/spool/torque/mom_priv/jobs/56290..SC
> 2612 11079 0 03:59 ?
00:00:00 /bin/sh /var/spool/torque/mom_priv/jobs/56290..SC
>
>
> mom_logs on node (I replaced the real server name with pbs_server_name):
>
> 08/07/2012 03:03:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 03:08:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 03:13:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 03:18:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 03:23:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 03:28:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 03:33:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 03:38:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 03:39:53;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 1867 task 1 with sig 15
> 08/07/2012 03:39:53;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 2491 task 1 with sig 15
> 08/07/2012 03:39:53;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 7426 task 1 with sig 15
> 08/07/2012 03:39:53;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 1976 task 1 with sig 15
> 08/07/2012 03:39:53;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 2486 task 1 with sig 15
> 08/07/2012 03:39:53;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 6034 task 1 with sig 15
> 08/07/2012 03:39:53;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2079 task 1 with sig 15
> 08/07/2012 03:39:53;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2488 task 1 with sig 15
> 08/07/2012 03:39:53;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 13274 task 1 with sig 15
> 08/07/2012 03:39:53;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2368 task 1 with sig 15
> 08/07/2012 03:39:53;0008;
pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2568 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 18730 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3875 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3998 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 21115 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 3978 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 4128 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 29379 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4068 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4252 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 13662 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4407 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4544 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 31496 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5643 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5757 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 10527 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5728 task 1 with sig 15 > 08/07/2012 03:39:53;0008; 
pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5890 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 27051 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 5835 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 6012 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 30157 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6018 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6297 task 1 with sig 15 > 08/07/2012 03:39:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 24477 task 1 with sig 15 > 08/07/2012 03:39:55;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 1867 task 1 gracefully with sig 15 > 08/07/2012 03:39:55;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:55;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:55;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:56;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:56;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:56;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:56;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:57;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:57;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 
03:39:57;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:57;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:58;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:58;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:58;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:58;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:59;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:59;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:59;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:39:59;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=1867/state=S) after sig 15 > 08/07/2012 03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 1867 task 1 with sig 9 > 08/07/2012 03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 2491 task 1 gracefully with sig 15 > 08/07/2012 03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=R) after sig 15 > 08/07/2012 03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:00;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:01;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:01;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process 
(pid=2491/state=S) after sig 15 > 08/07/2012 03:40:01;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:01;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:02;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:02;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:02;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:02;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:03;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:03;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:03;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:03;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:04;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:04;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:04;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:04;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:05;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=2491/state=S) after sig 15 > 08/07/2012 03:40:05;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 2491 task 1 with sig 9 > 08/07/2012 03:40:05;0008; pbs_mom;Job;52769.pbs_server_name;kill_task: killing pid 5020 task 1 gracefully with sig 15 > 08/07/2012 03:40:05;0008; 
pbs_mom;Job;52769.pbs_server_name;kill_task: process (pid=5020/state=R) after sig 15 > 08/07/2012 03:40:05;0080; pbs_mom;Job;52769.pbs_server_name;scan_for_terminated: job 52769.pbs_server_name task 1 terminated, sid=1867 > 08/07/2012 03:40:05;0008; pbs_mom;Job;52769.pbs_server_name;job was terminated > 08/07/2012 03:40:05;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 1976 task 1 gracefully with sig 15 > 08/07/2012 03:40:05;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:05;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:06;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:06;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:06;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:06;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:07;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:07;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:07;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:07;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:08;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:08;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:08;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:08;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after 
sig 15 > 08/07/2012 03:40:09;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:09;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:09;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:09;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=1976/state=S) after sig 15 > 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 1976 task 1 with sig 9 > 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 2486 task 1 gracefully with sig 15 > 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=R) after sig 15 > 08/07/2012 03:40:10;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:11;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:11;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:11;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:11;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:12;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:12;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:12;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:12;0008; 
pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:13;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:13;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:13;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:13;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:14;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:14;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:14;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:14;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=2486/state=S) after sig 15 > 08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 2486 task 1 with sig 9 > 08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: killing pid 5001 task 1 gracefully with sig 15 > 08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;kill_task: process (pid=5001/state=R) after sig 15 > 08/07/2012 03:40:15;0080; pbs_mom;Job;52770.pbs_server_name;scan_for_terminated: job 52770.pbs_server_name task 1 terminated, sid=1976 > 08/07/2012 03:40:15;0008; pbs_mom;Job;52770.pbs_server_name;job was terminated > 08/07/2012 03:40:15;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2079 task 1 gracefully with sig 15 > 08/07/2012 03:40:15;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after 
sig 15 > 08/07/2012 03:40:16;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:16;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:16;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:16;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:17;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:17;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:17;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:17;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:18;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:18;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:18;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:18;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:19;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:19;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:19;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:19;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:20;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:20;0008; 
pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:20;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2079/state=S) after sig 15 > 08/07/2012 03:40:20;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2079 task 1 with sig 9 > 08/07/2012 03:40:20;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2488 task 1 gracefully with sig 15 > 08/07/2012 03:40:20;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=R) after sig 15 > 08/07/2012 03:40:21;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:21;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:21;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:21;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:22;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:22;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:22;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:22;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:23;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:23;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:23;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:23;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:24;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) 
after sig 15 > 08/07/2012 03:40:24;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:24;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:24;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=2488/state=S) after sig 15 > 08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 2488 task 1 with sig 9 > 08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: killing pid 5002 task 1 gracefully with sig 15 > 08/07/2012 03:40:25;0008; pbs_mom;Job;52771.pbs_server_name;kill_task: process (pid=5002/state=R) after sig 15 > 08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat > 08/07/2012 03:40:26;0080; pbs_mom;Job;52771.pbs_server_name;scan_for_terminated: job 52771.pbs_server_name task 1 terminated, sid=2079 > 08/07/2012 03:40:26;0080; pbs_mom;Job;52769.pbs_server_name;obit sent to server > 08/07/2012 03:40:26;0008; pbs_mom;Job;52771.pbs_server_name;job was terminated > 08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:40:26;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat > 08/07/2012 03:40:26;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2368 task 1 
gracefully with sig 15 > 08/07/2012 03:40:26;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:26;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:26;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:26;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:27;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:27;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:27;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:27;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:28;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:28;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:28;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:28;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:29;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:29;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:29;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:29;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:30;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:30;0008; 
pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:30;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:30;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2368/state=S) after sig 15 > 08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2368 task 1 with sig 9 > 08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2568 task 1 gracefully with sig 15 > 08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=R) after sig 15 > 08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:31;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:32;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:32;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:32;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:32;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:33;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:33;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:33;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:33;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) 
after sig 15 > 08/07/2012 03:40:34;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:34;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:34;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:34;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=2568/state=S) after sig 15 > 08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 2568 task 1 with sig 9 > 08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: killing pid 5000 task 1 gracefully with sig 15 > 08/07/2012 03:40:35;0008; pbs_mom;Job;52772.pbs_server_name;kill_task: process (pid=5000/state=R) after sig 15 > 08/07/2012 03:40:36;0080; pbs_mom;Job;52772.pbs_server_name;scan_for_terminated: job 52772.pbs_server_name task 1 terminated, sid=2368 > 08/07/2012 03:40:36;0080; pbs_mom;Job;52770.pbs_server_name;obit sent to server > 08/07/2012 03:40:36;0008; pbs_mom;Job;52772.pbs_server_name;job was terminated > 08/07/2012 03:40:36;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:40:36;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:40:36;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat > 08/07/2012 03:40:36;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3875 task 1 gracefully with sig 15 > 08/07/2012 03:40:36;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:36;0008; 
pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:36;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:36;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:37;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:37;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:37;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:37;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:38;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:38;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:38;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:38;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:39;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:39;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:39;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:39;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:40;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:40;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:40;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after 
sig 15 > 08/07/2012 03:40:40;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3875/state=S) after sig 15 > 08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3875 task 1 with sig 9 > 08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3998 task 1 gracefully with sig 15 > 08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=R) after sig 15 > 08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:41;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:42;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:42;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:42;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:42;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:43;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:43;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:43;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:43;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:44;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15 > 08/07/2012 03:40:44;0008; 
pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
> 08/07/2012 03:40:44;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
> 08/07/2012 03:40:44;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
> 08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
> 08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
> 08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=3998/state=S) after sig 15
> 08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 3998 task 1 with sig 9
> 08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: killing pid 5006 task 1 gracefully with sig 15
> 08/07/2012 03:40:45;0008; pbs_mom;Job;52773.pbs_server_name;kill_task: process (pid=5006/state=R) after sig 15
> 08/07/2012 03:40:46;0080; pbs_mom;Job;52773.pbs_server_name;scan_for_terminated: job 52773.pbs_server_name task 1 terminated, sid=3875
> 08/07/2012 03:40:46;0080; pbs_mom;Job;52771.pbs_server_name;obit sent to server
> 08/07/2012 03:40:46;0008; pbs_mom;Job;52773.pbs_server_name;job was terminated
> 08/07/2012 03:40:46;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 08/07/2012 03:40:46;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 08/07/2012 03:40:46;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 08/07/2012 03:40:46;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 3978 task 1 gracefully with sig 15
> 08/07/2012 03:40:46;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:46;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:46;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process
(pid=3978/state=S) after sig 15
> 08/07/2012 03:40:46;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:47;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:47;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:47;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:47;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:48;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:48;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:48;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:49;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:49;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:49;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:49;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:50;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:50;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:50;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:50;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=3978/state=S) after sig 15
> 08/07/2012 03:40:51;0008;
pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 3978 task 1 with sig 9
> 08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 4128 task 1 gracefully with sig 15
> 08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=R) after sig 15
> 08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:51;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:52;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:52;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:52;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:52;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:53;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:54;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:54;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:54;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:54;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S)
after sig 15
> 08/07/2012 03:40:55;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:55;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:55;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:55;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4128/state=S) after sig 15
> 08/07/2012 03:40:56;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 4128 task 1 with sig 9
> 08/07/2012 03:40:56;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: killing pid 4997 task 1 gracefully with sig 15
> 08/07/2012 03:40:56;0008; pbs_mom;Job;52774.pbs_server_name;kill_task: process (pid=4997/state=R) after sig 15
> 08/07/2012 03:40:56;0080; pbs_mom;Job;52774.pbs_server_name;scan_for_terminated: job 52774.pbs_server_name task 1 terminated, sid=3978
> 08/07/2012 03:40:56;0080; pbs_mom;Job;52772.pbs_server_name;obit sent to server
> 08/07/2012 03:40:56;0008; pbs_mom;Job;52774.pbs_server_name;job was terminated
> 08/07/2012 03:40:56;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 08/07/2012 03:40:56;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 08/07/2012 03:40:56;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 08/07/2012 03:40:56;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4068 task 1 gracefully with sig 15
> 08/07/2012 03:40:56;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:56;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:56;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:57;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:57;0008;
pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:57;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:57;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:58;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:58;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:58;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:58;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:59;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:59;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:59;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:40:59;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:41:00;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:41:00;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:41:00;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:41:00;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4068/state=S) after sig 15
> 08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4068 task 1 with sig 9
> 08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4252 task 1 gracefully
with sig 15
> 08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=R) after sig 15
> 08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:01;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:02;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:02;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:02;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:02;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:03;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:03;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:03;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:03;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:04;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:04;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:04;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:04;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:05;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:05;0008;
pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:05;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:05;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=4252/state=S) after sig 15
> 08/07/2012 03:41:06;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 4252 task 1 with sig 9
> 08/07/2012 03:41:06;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: killing pid 5032 task 1 gracefully with sig 15
> 08/07/2012 03:41:06;0008; pbs_mom;Job;52775.pbs_server_name;kill_task: process (pid=5032/state=R) after sig 15
> 08/07/2012 03:41:06;0080; pbs_mom;Job;52775.pbs_server_name;scan_for_terminated: job 52775.pbs_server_name task 1 terminated, sid=4068
> 08/07/2012 03:41:06;0080; pbs_mom;Job;52773.pbs_server_name;obit sent to server
> 08/07/2012 03:41:06;0008; pbs_mom;Job;52775.pbs_server_name;job was terminated
> 08/07/2012 03:41:06;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 08/07/2012 03:41:06;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 08/07/2012 03:41:06;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 08/07/2012 03:41:06;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4407 task 1 gracefully with sig 15
> 08/07/2012 03:41:06;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:06;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:06;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:07;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:07;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:07;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process
(pid=4407/state=S) after sig 15
> 08/07/2012 03:41:07;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:08;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:08;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:08;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:08;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:09;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:09;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:09;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:09;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:10;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:10;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:10;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:10;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4407/state=S) after sig 15
> 08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4407 task 1 with sig 9
> 08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4544 task 1 gracefully with sig 15
> 08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=R) after sig 15
> 08/07/2012 03:41:11;0008;
pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:11;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:12;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:12;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:12;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:12;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:13;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:13;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:13;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:13;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:14;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:14;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:14;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:14;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:15;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:15;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:15;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after
sig 15
> 08/07/2012 03:41:15;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=4544/state=S) after sig 15
> 08/07/2012 03:41:16;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 4544 task 1 with sig 9
> 08/07/2012 03:41:16;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: killing pid 5039 task 1 gracefully with sig 15
> 08/07/2012 03:41:16;0008; pbs_mom;Job;52776.pbs_server_name;kill_task: process (pid=5039/state=R) after sig 15
> 08/07/2012 03:41:16;0080; pbs_mom;Job;52776.pbs_server_name;scan_for_terminated: job 52776.pbs_server_name task 1 terminated, sid=4407
> 08/07/2012 03:41:16;0080; pbs_mom;Job;52774.pbs_server_name;obit sent to server
> 08/07/2012 03:41:16;0008; pbs_mom;Job;52776.pbs_server_name;job was terminated
> 08/07/2012 03:41:16;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 08/07/2012 03:41:16;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 08/07/2012 03:41:16;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 08/07/2012 03:41:16;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5036 task 1 gracefully with sig 15
> 08/07/2012 03:41:16;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5036/state=R) after sig 15
> 08/07/2012 03:41:16;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5643 task 1 gracefully with sig 15
> 08/07/2012 03:41:16;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:16;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:17;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:17;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:17;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:17;0008;
pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:18;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:18;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:18;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:18;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:19;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:19;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:19;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:19;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:20;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:20;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:20;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:20;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5643/state=S) after sig 15
> 08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5643 task 1 with sig 9
> 08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5757 task 1 gracefully with sig 15
> 08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=R)
after sig 15
> 08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:21;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:22;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:22;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:22;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:22;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:23;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:23;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:23;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:23;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:24;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:24;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:24;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:24;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:25;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:25;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:25;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:25;0008;
pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:26;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=5757/state=S) after sig 15
> 08/07/2012 03:41:26;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 5757 task 1 with sig 9
> 08/07/2012 03:41:26;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: killing pid 6320 task 1 gracefully with sig 15
> 08/07/2012 03:41:26;0008; pbs_mom;Job;52777.pbs_server_name;kill_task: process (pid=6320/state=R) after sig 15
> 08/07/2012 03:41:26;0080; pbs_mom;Job;52777.pbs_server_name;scan_for_terminated: job 52777.pbs_server_name task 1 terminated, sid=5643
> 08/07/2012 03:41:26;0080; pbs_mom;Job;52775.pbs_server_name;obit sent to server
> 08/07/2012 03:41:26;0008; pbs_mom;Job;52777.pbs_server_name;job was terminated
> 08/07/2012 03:41:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 08/07/2012 03:41:26;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 08/07/2012 03:41:26;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 08/07/2012 03:41:26;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5054 task 1 gracefully with sig 15
> 08/07/2012 03:41:26;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5054/state=R) after sig 15
> 08/07/2012 03:41:26;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5728 task 1 gracefully with sig 15
> 08/07/2012 03:41:26;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:27;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:27;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:27;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:27;0008; pbs_mom;Job;52778.pbs_server_name;kill_task:
process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:28;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:28;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:28;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:28;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:29;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:29;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:29;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:29;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:30;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:30;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:30;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:30;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:31;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:31;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:31;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5728/state=S) after sig 15
> 08/07/2012 03:41:31;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5728 task 1 with sig 9
> 08/07/2012 03:41:31;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5890 task 1 gracefully with sig 15
> 08/07/2012 03:41:31;0008;
pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:32;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:32;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:32;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:32;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:33;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:33;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:33;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:33;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:34;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:34;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:34;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:34;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:35;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:35;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:35;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:35;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after
sig 15
> 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=5890/state=S) after sig 15
> 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 5890 task 1 with sig 9
> 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: killing pid 6196 task 1 gracefully with sig 15
> 08/07/2012 03:41:36;0008; pbs_mom;Job;52778.pbs_server_name;kill_task: process (pid=6196/state=R) after sig 15
> 08/07/2012 03:41:37;0080; pbs_mom;Job;52778.pbs_server_name;scan_for_terminated: job 52778.pbs_server_name task 1 terminated, sid=5728
> 08/07/2012 03:41:37;0080; pbs_mom;Job;52776.pbs_server_name;obit sent to server
> 08/07/2012 03:41:37;0008; pbs_mom;Job;52778.pbs_server_name;job was terminated
> 08/07/2012 03:41:37;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 08/07/2012 03:41:37;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 08/07/2012 03:41:37;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 5063 task 1 gracefully with sig 15
> 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5063/state=R) after sig 15
> 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 5835 task 1 gracefully with sig 15
> 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:37;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:38;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:38;0008;
pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:38;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:38;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:39;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:39;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:39;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:39;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:40;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:40;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:40;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:40;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:41;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:41;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:41;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:41;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=5835/state=S) after sig 15
> 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 5835 task 1 with sig 9
> 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 6012 task 1 gracefully
with sig 15 > 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=R) after sig 15 > 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:42;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:43;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:43;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:43;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:43;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:44;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:44;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:44;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:44;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:45;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:45;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:45;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:45;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:46;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:46;0008; 
pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:46;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:46;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=6012/state=S) after sig 15 > 08/07/2012 03:41:47;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 6012 task 1 with sig 9 > 08/07/2012 03:41:47;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: killing pid 3605 task 1 gracefully with sig 15 > 08/07/2012 03:41:47;0008; pbs_mom;Job;52779.pbs_server_name;kill_task: process (pid=3605/state=R) after sig 15 > 08/07/2012 03:41:47;0080; pbs_mom;Job;52779.pbs_server_name;scan_for_terminated: job 52779.pbs_server_name task 1 terminated, sid=5835 > 08/07/2012 03:41:47;0080; pbs_mom;Job;52777.pbs_server_name;obit sent to server > 08/07/2012 03:41:47;0008; pbs_mom;Job;52779.pbs_server_name;job was terminated > 08/07/2012 03:41:47;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:47;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:47;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat > 08/07/2012 03:41:47;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 5062 task 1 gracefully with sig 15 > 08/07/2012 03:41:47;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=5062/state=R) after sig 15 > 08/07/2012 03:41:47;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6018 task 1 gracefully with sig 15 > 08/07/2012 03:41:47;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:47;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:48;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:48;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: 
process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:48;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:48;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:49;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:49;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:49;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:49;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:50;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:50;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:50;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:50;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:51;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:51;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:51;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:51;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6018/state=S) after sig 15 > 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6018 task 1 with sig 9 > 08/07/2012 03:41:52;0008; 
pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6297 task 1 gracefully with sig 15 > 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=R) after sig 15 > 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:52;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:53;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:54;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:54;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:54;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:54;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:55;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:55;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:55;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:55;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:56;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:56;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) 
after sig 15 > 08/07/2012 03:41:56;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:56;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:57;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=6297/state=S) after sig 15 > 08/07/2012 03:41:57;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 6297 task 1 with sig 9 > 08/07/2012 03:41:57;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: killing pid 32585 task 1 gracefully with sig 15 > 08/07/2012 03:41:57;0008; pbs_mom;Job;52780.pbs_server_name;kill_task: process (pid=32585/state=R) after sig 15 > 08/07/2012 03:41:57;0080; pbs_mom;Job;52780.pbs_server_name;scan_for_terminated: job 52780.pbs_server_name task 1 terminated, sid=6018 > 08/07/2012 03:41:57;0080; pbs_mom;Job;52778.pbs_server_name;obit sent to server > 08/07/2012 03:41:57;0008; pbs_mom;Job;52780.pbs_server_name;job was terminated > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat > 08/07/2012 03:41:57;0080; pbs_mom;Job;52779.pbs_server_name;obit sent to server > 08/07/2012 03:41:57;0001; pbs_mom;Job;52772.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 > 08/07/2012 03:41:57;0001; pbs_mom;Job;52773.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 > 08/07/2012 03:41:57;0001; pbs_mom;Job;52774.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 > 08/07/2012 03:41:57;0001; pbs_mom;Job;52775.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 > 08/07/2012 03:41:57;0001; pbs_mom;Job;52776.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 > 08/07/2012 03:41:57;0001; 
pbs_mom;Job;52777.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 > 08/07/2012 03:41:57;0001; pbs_mom;Job;52780.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 57 > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req > 08/07/2012 03:41:57;0080; pbs_mom;Job;52769.pbs_server_name;removed job script > 08/07/2012 03:41:57;0080; pbs_mom;Job;52770.pbs_server_name;removed job script > 08/07/2012 03:41:57;0080; pbs_mom;Job;52771.pbs_server_name;removed job script > 08/07/2012 03:41:57;0080; pbs_mom;Job;52772.pbs_server_name;removed job script > 08/07/2012 03:41:57;0080; pbs_mom;Job;52773.pbs_server_name;removed job script > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:57;0001; pbs_mom;Job;52774.pbs_server_name;preobit_reply, unknown on server, deleting locally > 08/07/2012 03:41:57;0080; pbs_mom;Job;52774.pbs_server_name;removed job script > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:57;0080; 
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:57;0001; pbs_mom;Job;52775.pbs_server_name;preobit_reply, unknown on server, deleting locally > 08/07/2012 03:41:57;0080; pbs_mom;Job;52775.pbs_server_name;removed job script > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:57;0001; pbs_mom;Job;52776.pbs_server_name;preobit_reply, unknown on server, deleting locally > 08/07/2012 03:41:57;0080; pbs_mom;Job;52776.pbs_server_name;removed job script > 08/07/2012 03:41:57;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=node0013 MSG=cannot locate job to delete), aux=0, type=DeleteJob, from PBS_Server at pbs_server_name > 08/07/2012 03:41:57;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=node0013 MSG=cannot locate job to delete), aux=0, type=DeleteJob, from PBS_Server at pbs_server_name > 08/07/2012 03:41:57;0080; pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id REJHOST=node0013 MSG=cannot locate job to delete), aux=0, type=DeleteJob, from PBS_Server at pbs_server_name > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat > 08/07/2012 03:41:57;0080; pbs_mom;Job;52780.pbs_server_name;obit sent to server > 08/07/2012 03:41:57;0080; pbs_mom;Job;52777.pbs_server_name;obit sent to server > 
08/07/2012 03:41:57;0080; pbs_mom;Job;52777.pbs_server_name;removed job script > 08/07/2012 03:41:57;0001; pbs_mom;Job;52778.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 > 08/07/2012 03:41:57;0001; pbs_mom;Job;52779.pbs_server_name;job recycled into exiting on SIGNULL/KILL from substate 53 > 08/07/2012 03:41:57;0001; pbs_mom;Req;obit reply;Job not found for obit reply > 08/07/2012 03:41:57;0080; pbs_mom;Job;52779.pbs_server_name;removed job script > 08/07/2012 03:41:57;0080; pbs_mom;Job;52778.pbs_server_name;removed job script > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop > 08/07/2012 03:41:57;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req > 08/07/2012 03:41:57;0080; pbs_mom;Job;52780.pbs_server_name;removed job script > 08/07/2012 03:43:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 > 08/07/2012 03:48:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 > 08/07/2012 03:53:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0 > 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56282.pbs_server_name started, pid = 7265 > 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56283.pbs_server_name started, pid = 7369 > 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 
03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56284.pbs_server_name started, pid = 7454 > 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56285.pbs_server_name started, pid = 7562 > 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56286.pbs_server_name started, pid = 8959 > 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56287.pbs_server_name started, pid = 9080 > 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56288.pbs_server_name started, pid = 9174 > 08/07/2012 03:54:12;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 03:54:12;0001; pbs_mom;Job;TMomFinalizeJob3;job 56289.pbs_server_name started, pid = 9431 > 08/07/2012 03:54:13;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 03:54:13;0001; pbs_mom;Job;TMomFinalizeJob3;job 56290.pbs_server_name started, pid = 10913 > 08/07/2012 03:54:13;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 03:54:13;0001; pbs_mom;Job;TMomFinalizeJob3;job 56291.pbs_server_name started, pid = 11034 > 08/07/2012 03:54:13;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 03:54:13;0001; pbs_mom;Job;TMomFinalizeJob3;job 56292.pbs_server_name started, pid = 11109 > 08/07/2012 03:54:13;0001; pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE > 08/07/2012 03:54:13;0001; pbs_mom;Job;TMomFinalizeJob3;job 56293.pbs_server_name started, pid = 11247 > 08/07/2012 03:58:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, 
loglevel = 0
> 08/07/2012 04:03:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 04:08:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 04:13:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 04:18:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 04:23:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 04:28:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 04:33:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 04:38:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
> 08/07/2012 04:43:11;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.2, loglevel = 0
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120810/df846c84/attachment-0001.html

From Gareth.Williams at csiro.au Sun Aug 12 19:55:15 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Mon, 13 Aug 2012 11:55:15 +1000
Subject: [torqueusers] cgroup memory allocation problem
In-Reply-To: <28F7EF75-5623-4CB9-9544-5BE6E3E9B845@umich.edu>
References: <28F7EF75-5623-4CB9-9544-5BE6E3E9B845@umich.edu>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620123995D9929@exvic-mbx04.nexus.csiro.au>

> -----Original Message-----
> From: Brock Palen [mailto:brockp at umich.edu]
> Sent: Friday, 10 August 2012 8:18 AM
> To: Torque Users Mailing List
> Subject: [torqueusers] cgroup memory allocation problem
>
> I filed this with adaptive but others should be aware of a major
> problem for high memory use jobs on pbs_moms using cgroups:
>
> cgroups in torque4 are assigning memory banks in numa systems based on
> core layout only.
>
> Example:
>
> 8 core 48GB memory two socket machine, valid cpus 0-7, valid mems 0-1
>
> If a job is only on the first socket it is assigned to mems 0; if it
> is on the second, mems 1; if a job is assigned cores on both it is
> assigned both.
>
> The above is fine.
>
> Now if I request 1 core and more memory, node has two 24GB memory banks
> qsub procs=1,mem=47gb
>
> the mems is set to 0 and cpus 0; when my job hits 24 gb (the size of
> mems 0) I start to swap rather than giving me all the assigned memory.
>
> A similar case:
> procs=1,mem=20gb
> procs=1,mem=20gb
> procs=1,mem=20gb
>
> On an empty node, if they are all on the same one, they get assigned
> cpu 0, 1, and 2 but all get mems 0 and jobs swap.
>
> Is there a way to just assign all numa nodes in jobs? and just use CPU
> binding? Currently we are most interested in cpu binding.
>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> brockp at umich.edu
> (734)936-1985

Hi Brock,

For reference, we've noticed something related on our UV system. To work around this we set the numa virtual node configuration to define each virtual node to correspond to a socket, and are asking/forcing users to request whole nodes except for low processor count, low memory jobs. The machine topology would be better reflected if we defined virtual nodes to correspond to socket pairs.

> Is there a way to just assign all numa nodes in jobs? and just use CPU
> binding? Currently we are most interested in cpu binding.

You could use a submit filter to round requests up to full nodes or reject jobs... Also you could use the prologue to alter the existing cpuset to include more mems.

Note we are running a torque 3 version with cpusets rather than cgroups per se, if that matters.
Gareth

From nt_mahmood at yahoo.com Mon Aug 13 06:27:49 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Mon, 13 Aug 2012 05:27:49 -0700 (PDT)
Subject: [torqueusers] qdel with regulat expressions
Message-ID: <1344860869.76801.YahooMailNeo@web111707.mail.gq1.yahoo.com>

Hi,
Assume these jobs are running:

set1-model1
set1-model2
set2-model1
set1-model4
set2-model7

Now assume I want to delete set1*
Is there any way to delete the jobs with regular expressions?
Currently I have to use "qdel 54151 54686 56588"

Regards,
Mahmood

From gas5x at yahoo.com Mon Aug 13 06:54:24 2012
From: gas5x at yahoo.com (Grigory Shamov)
Date: Mon, 13 Aug 2012 05:54:24 -0700 (PDT)
Subject: [torqueusers] qdel with regulat expressions
In-Reply-To: <1344860869.76801.YahooMailNeo@web111707.mail.gq1.yahoo.com>
Message-ID: <1344862464.18790.YahooMailClassic@web111316.mail.gq1.yahoo.com>

Hi Mahmood,

You could use qselect, then Unix commands to form the list, with glob or regex expressions, and then xargs it to qdel.

qselect -u naderan | grep something | xargs qdel

--
Grigory Shamov

--- On Mon, 8/13/12, Mahmood Naderan wrote:

> From: Mahmood Naderan
> Subject: [torqueusers] qdel with regulat expressions
> To: "torque cluster"
> Date: Monday, August 13, 2012, 5:27 AM
> Hi,
> Assume these jobs are running:
>
> set1-model1
> set1-model2
> set2-model1
> set1-model4
> set2-model7
>
> Now assume I want to delete set1*
> Is there any way to delete the jobs with regular
> expressions?
> Currently I have to use "qdel 54151 54686 56588"
>
> Regards,
> Mahmood
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From nt_mahmood at yahoo.com Mon Aug 13 08:54:27 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Mon, 13 Aug 2012 07:54:27 -0700 (PDT)
Subject: [torqueusers] qdel with regulat expressions
In-Reply-To: <1344862464.18790.YahooMailClassic@web111316.mail.gq1.yahoo.com>
References: <1344860869.76801.YahooMailNeo@web111707.mail.gq1.yahoo.com> <1344862464.18790.YahooMailClassic@web111316.mail.gq1.yahoo.com>
Message-ID: <1344869667.37643.YahooMailNeo@web111709.mail.gq1.yahoo.com>

>qselect -u naderan | grep something | xargs qdel

That is also good. Thanks...

Regards,
Mahmood

----- Original Message -----
From: Grigory Shamov
To: Mahmood Naderan ; Torque Users Mailing List
Cc:
Sent: Monday, August 13, 2012 2:54 PM
Subject: Re: [torqueusers] qdel with regulat expressions

Hi Mahmood,

You could use qselect, then Unix commands to form the list, with glob or regex expressions, and then xargs it to qdel.

qselect -u naderan | grep something | xargs qdel

--
Grigory Shamov

--- On Mon, 8/13/12, Mahmood Naderan wrote:

> From: Mahmood Naderan
> Subject: [torqueusers] qdel with regulat expressions
> To: "torque cluster"
> Date: Monday, August 13, 2012, 5:27 AM
> Hi,
> Assume these jobs are running:
>
> set1-model1
> set1-model2
> set2-model1
> set1-model4
> set2-model7
>
> Now assume I want to delete set1*
> Is there any way to delete the jobs with regular
> expressions?
> Currently I have to use "qdel 54151 54686 56588"
>
> Regards,
> Mahmood
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From knielson at adaptivecomputing.com Mon Aug 13 11:59:01 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Mon, 13 Aug 2012 11:59:01 -0600
Subject: [torqueusers] TORQUE mailing lists back up
Message-ID:

Hi all,

You may have noticed the TORQUE mailing list has been pretty quiet the last few days. The problems have been fixed and we are back up.

Regards
Ken Nielson
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120813/7c0a4231/attachment.html

From fcaba at uns.edu.ar Mon Aug 13 12:13:35 2012
From: fcaba at uns.edu.ar (Fernando Caba)
Date: Mon, 13 Aug 2012 15:13:35 -0300
Subject: [torqueusers] Moving jobs from one node to another
Message-ID: <502943CF.2050506@uns.edu.ar>

Hi, I want to know something about moving jobs from one node to another.
If I need to do some maintenance on one node with a certain number of running jobs (they cannot be killed), can I move all of those jobs (or specific ones) to another node (free or not)? If yes, how?

Sorry for asking the same thing again; is it a dumb question?

Regards

Fernando

--
----------------------------------------------------
Ing. Fernando Caba
Director General de Telecomunicaciones
Universidad Nacional del Sur
http://www.dgt.uns.edu.ar
Tel/Fax: (54)-291-4595166
Tel: (54)-291-4595101 int. 2050
Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina
----------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4533 bytes
Desc: S/MIME cryptographic signature
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120813/45030d3c/attachment.bin

From brockp at umich.edu Mon Aug 13 12:15:12 2012
From: brockp at umich.edu (Brock Palen)
Date: Mon, 13 Aug 2012 14:15:12 -0400
Subject: [torqueusers] cgroup memory allocation problem
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620123995D9929@exvic-mbx04.nexus.csiro.au>
References: <28F7EF75-5623-4CB9-9544-5BE6E3E9B845@umich.edu> <007DECE986B47F4EABF823C1FBB19C620123995D9929@exvic-mbx04.nexus.csiro.au>
Message-ID: <4AA41E44-B979-43D6-AEAA-E587AF1656A7@umich.edu>

On Aug 12, 2012, at 9:55 PM, wrote:

>> -----Original Message-----
>> From: Brock Palen [mailto:brockp at umich.edu]
>> Sent: Friday, 10 August 2012 8:18 AM
>> To: Torque Users Mailing List
>> Subject: [torqueusers] cgroup memory allocation problem
>>
>> I filed this with adaptive but others should be aware of a major
>> problem for high memory use jobs on pbs_moms using cgroups:
>>
>> cgroups in torque4 are assigning memory banks in numa systems based on
>> core layout only.
>>
>> Example:
>>
>> 8 core 48GB memory two socket machine, valid cpus 0-7, valid mems 0-1
>>
>> If a job is only on the first socket it is assigned to mems 0; if it
>> is on the second, mems 1; if a job is assigned cores on both it is
>> assigned both.
>>
>> The above is fine.
>>
>> Now if I request 1 core and more memory, node has two 24GB memory banks
>> qsub procs=1,mem=47gb
>>
>> the mems is set to 0 and cpus 0; when my job hits 24 gb (the size of
>> mems 0) I start to swap rather than giving me all the assigned memory.
>>
>> A similar case:
>> procs=1,mem=20gb
>> procs=1,mem=20gb
>> procs=1,mem=20gb
>>
>> On an empty node, if they are all on the same one, they get assigned
>> cpu 0, 1, and 2 but all get mems 0 and jobs swap.
>>
>> Is there a way to just assign all numa nodes in jobs? and just use CPU
>> binding?
>> Currently we are most interested in cpu binding.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>
> Hi Brock,
>
> For reference, we've noticed something related on our UV system. To work around this we set the numa virtual node configuration to define each virtual node to correspond to a socket, and are asking/forcing users to request whole nodes except for low processor count, low memory jobs. The machine topology would be better reflected if we defined virtual nodes to correspond to socket pairs.
>
>> Is there a way to just assign all numa nodes in jobs? and just use CPU
>> binding? Currently we are most interested in cpu binding.
>
> You could use a submit filter to round requests up to full nodes or reject jobs... Also you could use the prologue to alter the existing cpuset to include more mems.
>
> Note we are running a torque 3 version with cpusets rather than cgroups per se, if that matters.

Gareth,

Looking in the mom source in cpuset.c, it looks like all torque does is find a match for all memory domains that overlap with the assigned cpus. So there is no bookkeeping at all to report to the scheduler about the memory layout, what has been assigned, etc.

So we did exactly what you said and created a prologue that copies the memory nodes of the entire system into the job when it starts. Very simple, works well. Appears to have solved our problem for now.
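A minimal sketch of what a prologue along those lines might look like. The thread does not show the actual script, so everything here is an assumption: that the cpuset filesystem is mounted at /dev/cpuset, that TORQUE keeps each job's cpuset under torque/<jobid>, and that the memory-node file is called mems (on some mounts it is cpuset.mems instead). The function name widen_job_mems is made up for illustration. The only TORQUE behaviour relied on is that the prologue receives the job ID as its first argument.

```shell
#!/bin/sh
# Hypothetical prologue sketch: give a job's cpuset every memory node.
# ASSUMED layout (not confirmed by the thread): $CPUSET_ROOT/torque/<jobid>/mems
CPUSET_ROOT="${CPUSET_ROOT:-/dev/cpuset}"

# widen_job_mems JOBID
# Copy the top-level list of memory nodes into the job's cpuset.
widen_job_mems() {
    job_dir="$CPUSET_ROOT/torque/$1"
    # No cpuset for this job (or a different layout): leave things alone.
    [ -f "$job_dir/mems" ] || return 0
    # The kernel only accepts nodes the parent cpuset already allows, so
    # this assumes the parent "torque" cpuset spans all memory nodes.
    cat "$CPUSET_ROOT/mems" > "$job_dir/mems"
}

# TORQUE passes the job ID to the prologue as its first argument.
if [ "$#" -gt 0 ]; then
    widen_job_mems "$1"
fi
```

This leaves the cpus file alone, so CPU binding stays as pbs_mom set it and only the memory placement is relaxed, which is the trade-off described above.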
>
> Gareth
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From gus at ldeo.columbia.edu Mon Aug 13 12:48:13 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Mon, 13 Aug 2012 14:48:13 -0400
Subject: [torqueusers] [Mauiusers] Moving jobs from one node to another
In-Reply-To:
References: <502943FE.2030607@uns.edu.ar>
Message-ID: <50294BED.6070201@ldeo.columbia.edu>

On 08/13/2012 02:17 PM, Denis wrote:
> 2012/8/13 Fernando Caba:
>> Hi, I want to know something about moving jobs from one node to another.
>> If I need to do some maintenance on one node with a certain number of
>> running jobs (they cannot be killed), can I move all of those jobs
>> (or specific ones) to another node (free or not)? If yes, how?
>>
>> Sorry for asking the same thing again; is it a dumb question?
> Hello, Fernando.
>
> You cannot move a running job to another node. That would be possible
> with Condor if you link your code against its libraries when
> compiling.
>
> D.

Hi Fernando

The best thing is to use algorithms and programs that can be restarted from a given state/configuration, and run them for a relatively short time [hours, not days, or weeks, or months], restarting as needed. Not all programs are written this way, but often they have this capability, and users simply don't know about it or how to use it.

This way, if the user loses one job, [s]he doesn't lose too much, and can restart from the state/configuration saved by the previous job in the sequence. Also, you won't feel too guilty for killing a job that has been running for a few hours only, but your user may become very upset if you kill her/his job that has been running for three weeks.

Our queues here have a maximum walltime of 12h, but 6h is common in many public computers.
A modest job runtime also improves the overall throughput of the cluster, and prevents hogging of the cluster nodes by one or a few users.

I hope this helps,
Gus Correa

>> Regards
>>
>> Fernando
>>
>> --
>> ----------------------------------------------------
>> Ing. Fernando Caba
>> Director General de Telecomunicaciones
>> Universidad Nacional del Sur
>> http://www.dgt.uns.edu.ar
>> Tel/Fax: (54)-291-4595166
>> Tel: (54)-291-4595101 int. 2050
>> Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina
>> ----------------------------------------------------
>>
>>
>>
>> _______________________________________________
>> mauiusers mailing list
>> mauiusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>
>
>

From fcaba at uns.edu.ar Mon Aug 13 13:07:26 2012
From: fcaba at uns.edu.ar (Fernando Caba)
Date: Mon, 13 Aug 2012 16:07:26 -0300
Subject: [torqueusers] [Mauiusers] Moving jobs from one node to another
In-Reply-To: <50294BED.6070201@ldeo.columbia.edu>
References: <502943FE.2030607@uns.edu.ar> <50294BED.6070201@ldeo.columbia.edu>
Message-ID: <5029506E.5030800@uns.edu.ar>

Gus, thanks for your answer.
I'll send it to our cluster's users.

Fernando

On 13/08/2012 03:48 PM, Gus Correa wrote:
> On 08/13/2012 02:17 PM, Denis wrote:
>> 2012/8/13 Fernando Caba:
>>> Hi, I want to know something about moving jobs from one node to another.
>>> If I need to do some maintenance on one node with a certain number of
>>> running jobs (they cannot be killed), can I move all of those jobs
>>> (or specific ones) to another node (free or not)? If yes, how?
>>>
>>> Sorry for asking the same thing again; is it a dumb question?
>> Hello, Fernando.
>>
>> You cannot move a running job to another node. That would be possible
>> with Condor if you link your code against its libraries when
>> compiling.
>>
>> D.
> Hi Fernando
>
> The best thing is to use algorithms and programs that can be restarted
> from a given state/configuration,
> and run them for a relatively short time [hours, not days, weeks, or
> months], restarting as needed.
> Not all programs are written this way, but oftentimes they have this
> capability, and users simply don't know about it
> or how to use it.
>
> This way, if the user loses one job, [s]he doesn't lose too much, and
> can restart from the state/configuration
> saved by the previous job in the sequence.
> Also, you won't feel too guilty for killing a job that has been running
> for only a few hours,
> but your user may become very upset if you kill her/his job that has
> been running for three weeks.
>
> Our queues here have a maximum walltime of 12h, but 6h is common
> in many public computers.
> A modest job runtime also improves the overall throughput of the cluster,
> and prevents hogging of the cluster nodes by one or a few users.
>
> I hope this helps,
> Gus Correa
>
>>> Regards
>>>
>>> Fernando
>>>
>>> --
>>> ----------------------------------------------------
>>> Ing. Fernando Caba
>>> Director General de Telecomunicaciones
>>> Universidad Nacional del Sur
>>> http://www.dgt.uns.edu.ar
>>> Tel/Fax: (54)-291-4595166
>>> Tel: (54)-291-4595101 int. 2050
>>> Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina
>>> ----------------------------------------------------
>>>
>>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4533 bytes
Desc: S/MIME cryptographic signature
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120813/49364acc/attachment.bin

From gas5x at yahoo.com Mon Aug 13 14:44:19 2012
From: gas5x at yahoo.com (Grigory Shamov)
Date: Mon, 13 Aug 2012 13:44:19 -0700 (PDT)
Subject: [torqueusers] qdel with regulat expressions
In-Reply-To: <1344869667.37643.YahooMailNeo@web111709.mail.gq1.yahoo.com>
Message-ID: <1344890659.56526.YahooMailClassic@web111312.mail.gq1.yahoo.com>

Hi,

Oops, I was wrong. qselect returns a list of job IDs, which can indeed be
passed to grep, but to no avail (the list holds only job IDs, not names).
It can select by name with the qselect -N parameter, but that doesn't seem
to accept globs. So one can try to choose jobs by other qselect parameters
(submission time etc.) if they are known!

--
Grigory Shamov

--- On Mon, 8/13/12, Mahmood Naderan wrote:

> From: Mahmood Naderan
> Subject: Re: [torqueusers] qdel with regulat expressions
> To: "Grigory Shamov"
> Cc: "Torque Users Mailing List"
> Date: Monday, August 13, 2012, 7:54 AM
> >qselect -u naderan | grep something | xargs qdel
>
> That is also good. Thanks...
>
> Regards,
> Mahmood
>
>
> ----- Original Message -----
> From: Grigory Shamov
> To: Mahmood Naderan ; Torque Users Mailing List
> Cc:
> Sent: Monday, August 13, 2012 2:54 PM
> Subject: Re: [torqueusers] qdel with regulat expressions
>
> Hi Mahmood,
>
> You could use qselect, then Unix commands to form the list,
> with glob or regex expressions, and then xargs it to qdel.
>
> qselect -u naderan | grep something | xargs qdel
>
> --
> Grigory Shamov
>
> --- On Mon, 8/13/12, Mahmood Naderan
> wrote:
>
> > From: Mahmood Naderan
> > Subject: [torqueusers] qdel with regulat expressions
> > To: "torque cluster"
> > Date: Monday, August 13, 2012, 5:27 AM
> > Hi,
> > Assume these jobs are running:
> >
> > set1-model1
> > set1-model2
> > set2-model1
> > set1-model4
> > set2-model7
> >
> > Now assume I want to delete set1*
> > Is there any way to delete the jobs with regular
> > expressions?
> > Currently I have to use "qdel 54151 54686 56588"
> >
> > Regards,
> > Mahmood
> >
>

From gas5x at yahoo.com Mon Aug 13 14:47:44 2012
From: gas5x at yahoo.com (Grigory Shamov)
Date: Mon, 13 Aug 2012 13:47:44 -0700 (PDT)
Subject: [torqueusers] upgrading from Torque 2 to Torque 3, live?
Message-ID: <1344890864.28514.YahooMailClassic@web111316.mail.gq1.yahoo.com>

Hi All,

Perhaps this question has been answered before, but I don't have it in my
mailbox, so I'll ask again. I wonder if it is possible to upgrade from 2.5
to 3.0 without losing current jobs? The documentation says that Torque 1
jobs will be lost, and that Torque 4 is incompatible too; but what about
3.0? Did anyone actually try it?

Thanks!
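
A possible shell take on the qdel-by-name question from the thread above, as a minimal sketch rather than a tested tool. Since qselect prints only job IDs, the filter below reads `qstat -f` output instead and matches on the Job_Name attribute. The function name jobs_by_name is made up for this illustration, and the layout it assumes (a "Job Id:" line followed by an indented "Job_Name = ..." line) should be checked against your own qstat -f output first.

```shell
#!/bin/sh
# Print the IDs of jobs whose Job_Name matches a regular expression.
# Reads `qstat -f` style text on stdin, so it can be tried on a saved file.
# jobs_by_name is a hypothetical helper name, not a Torque command.
jobs_by_name() {
    awk -v re="$1" '
        /^Job Id:/ { id = $3 }                   # remember the current job ID
        /^[ \t]*Job_Name = / {
            name = $0
            sub(/^[ \t]*Job_Name = /, "", name)  # strip the attribute label
            if (name ~ re) print id              # emit the ID on a name match
        }
    '
}

# Dry run, to see which jobs would be selected:
#   qstat -f | jobs_by_name '^set1-'
# Delete once the list looks right (-r is a GNU xargs extension that
# skips the command when the input is empty):
#   qstat -f | jobs_by_name '^set1-' | xargs -r qdel
```

Running the dry-run form first is deliberate: a regex that is slightly too broad would otherwise delete more jobs than intended.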
--
Grigory Shamov
HPC Analyst,
University of Manitoba

From chris.evert at geokinetics.com Mon Aug 13 15:03:22 2012
From: chris.evert at geokinetics.com (Chris Evert)
Date: Mon, 13 Aug 2012 16:03:22 -0500
Subject: [torqueusers] qdel with regulat expressions
In-Reply-To: <1344890659.56526.YahooMailClassic@web111312.mail.gq1.yahoo.com>
References: <1344890659.56526.YahooMailClassic@web111312.mail.gq1.yahoo.com>
Message-ID: <50296B9A.80408@geokinetics.com>

Grigory,

A perl script to print job ids when the job name contains arg 1:

#!/usr/bin/perl
use strict;
my $id;
my $re;
$re = $ARGV[0];               # regex to match against the job name
open F, "qstat -f|";          # read the full job listing
while (<F>) {
  m/^Job Id: (.*)/ && ($id = $1);       # remember the current job id
  if (m/Job_Name = (.*)/) {
    print "$id\n" if ($1 =~ m/$re/);    # print the id when the name matches
  }
}
close F;

Perhaps I should have checked for a non-existent $ARGV[0] or printed usage.
Left as an exercise for the reader.

Enjoy,
Chris

On 08/13/2012 03:44 PM, Grigory Shamov wrote:
> Hi,
>
> Oops, I was wrong. qselect returns a list of job IDs, which can indeed be
> passed to grep, but to no avail (the list holds only job IDs, not names).
> It can select by name with the qselect -N parameter, but that doesn't seem
> to accept globs. So one can try to choose jobs by other qselect parameters
> (submission time etc.) if they are known!
>
> --
> Grigory Shamov
>
> --- On Mon, 8/13/12, Mahmood Naderan wrote:
>
>> From: Mahmood Naderan
>> Subject: Re: [torqueusers] qdel with regulat expressions
>> To: "Grigory Shamov"
>> Cc: "Torque Users Mailing List"
>> Date: Monday, August 13, 2012, 7:54 AM
>>> qselect -u naderan | grep something | xargs qdel
>>
>> That is also good. Thanks...
>>
>> Regards,
>> Mahmood
>>
>>
>> ----- Original Message -----
>> From: Grigory Shamov
>> To: Mahmood Naderan ; Torque Users Mailing List
>> Cc:
>> Sent: Monday, August 13, 2012 2:54 PM
>> Subject: Re: [torqueusers] qdel with regulat expressions
>>
>> Hi Mahmood,
>>
>> You could use qselect, then Unix commands to form the list,
>> with glob or regex expressions, and then xargs it to qdel.
>>
>> qselect -u naderan | grep something | xargs qdel
>>
>> --
>> Grigory Shamov
>>
>> --- On Mon, 8/13/12, Mahmood Naderan
>> wrote:
>>
>>> From: Mahmood Naderan
>>> Subject: [torqueusers] qdel with regulat expressions
>>> To: "torque cluster"
>>> Date: Monday, August 13, 2012, 5:27 AM
>>> Hi,
>>> Assume these jobs are running:
>>>
>>> set1-model1
>>> set1-model2
>>> set2-model1
>>> set1-model4
>>> set2-model7
>>>
>>> Now assume I want to delete set1*
>>> Is there any way to delete the jobs with regular
>>> expressions?
>>> Currently I have to use "qdel 54151 54686 56588"
>>>
>>> Regards,
>>> Mahmood
>>>
>>
>>
>

From shade34321 at gmail.com Fri Aug 10 21:18:21 2012
From: shade34321 at gmail.com (Shade Alabsa)
Date: Fri, 10 Aug 2012 23:18:21 -0400
Subject: [torqueusers] Constantly having to restart Torque
Message-ID:

Recently we had to upgrade our xcat, pbs, and maui install, and since then
we have had to restart our cluster/pbs roughly once a week. I'm not sure
what the problem is, but our pbs log files are full of the following errors.

08/10/2012 23:17:09;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:09;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:09;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:13;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:13;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:13;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:17;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:17;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:17;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:21;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:21;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:21;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:25;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:25;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:25;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:29;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:29;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:29;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:33;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:33;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:33;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server

Any help you can provide would be great! Thanks!

Shade Alabsa
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120810/197e103f/attachment-0001.html

From azher at hep.caltech.edu Sun Aug 12 02:10:06 2012
From: azher at hep.caltech.edu (Azher Mughal)
Date: Sun, 12 Aug 2012 01:10:06 -0700
Subject: [torqueusers] PBS_Server Errors
Message-ID: <502764DE.7030508@hep.caltech.edu>

Hi All,

I have users' jobs in the queue, but they are not running even though the
cluster is free. I am getting these errors in /var/log/messages:

Aug 11 20:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 to host 0 has timed out after 900 seconds - closing stale connection
Aug 11 21:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 to host 0 has timed out after 900 seconds - closing stale connection
Aug 11 23:47:30 omega PBS_Server: LOG_ERROR::wait_request, connection 10 to host 0 has timed out after 900 seconds - closing stale connection

Some outputs are below. Any suggestions as to what could be wrong or what
else needs to be checked?

Thanks
-Azher

[root at omega torque]# qstat -q

server: omega

Queue            Memory CPU Time Walltime Node  Run   Que Lm  State
---------------- ------ -------- -------- ----  --- ----- --  -----
superb             --      --       --      --    0    19 53   E R
babar              --      --       --      --    0     0 --   E R
babar100           --   72:00:00    --      --    0     0 10   E R
minos              --      --       --      --    0 33004 --   E R
dque               --      --       --      --    0     0 --   E R
io                 --      --       --      --    0     0 10   E R
                                               ----- -----
                                                   0 33023

[root at omega torque]# showbf --loglevel=9
INFO: LOGLEVEL set to 9
MUGetOpt(1,ArgV,C:D:F:hP:V?-:Aa:c:d:f:g:m:M:n:p:q:r:Su:vV,OptArg)
INFO: flags loaded
INFO: 1 command line args remaining: 'showbf'
MSUConnect(S,FALSE,EMsg)
INFO: trying to connect to 192.168.1.213 (Port: 40559)
INFO: non-blocking mode established
MSUSelectWrite(3,30000000)
INFO: successful connect to TCP server (sd: 3)
MCSendRequest(S)
MSUSendData(S,30000000,TRUE,FALSE)
MSecGetChecksum2(Buf1,27,Buf2,81,Checksum,[NONE],CSKey)
INFO: header created '00000128 CK=522ae0966230ab95 TS=1344758465 AUTH=root DT='
INFO: sending short packet '00000128 CK=522ae0966230ab95 TS=1344758465 AUTH=root DT=CMD=showbf AUTH=root ARG=root root ALL ALL 0 0 0 0 0 NC 0 0 [NONE] [NONE] [NONE] '
MSUSendPacket(3,Buf,137,30000000,SC)
INFO: sending packet '00000128 CK=522ae0966230ab95 TS=1344758465 AUTH=root DT=CMD=showbf AUTH=root ARG=root root ALL ALL 0 0 0 0 0 NC 0 0 [NONE] [NONE] [NONE] '
MSUSelectWrite(3,30000000)
INFO: packet sent (137 bytes of 137)
INFO: message sent to server
INFO: message sent: 'CMD=showbf AUTH=root ARG=root root ALL ALL 0 0 0 0 0 NC 0 0 [NONE] [NONE] [NONE] '
MSURecvData(S,30000000,TRUE,SC,EMsg)
MSURecvPacket(3,BufP,9,NULL,30000000,SC)
MSUSelectRead(3,30000000)
MSUSelectRead-select failed
WARNING: cannot receive message within 30.000000 second timeout (aborting)
ALERT: cannot determine packet size
ERROR: lost connection to server
ERROR: cannot request service (status)

[root at omega torque]# showstart 1584704.omega
ERROR: lost connection to server
ERROR: cannot request service (status)

Both daemons are running:

[root at omega torque]# /etc/init.d/pbs_server status
pbs_server (pid 5057) is running...
[root at omega torque]# /etc/init.d/maui status
maui (pid 6096) is running...

From guy4261 at gmail.com Mon Aug 13 05:24:50 2012
From: guy4261 at gmail.com (Guy Rapaport)
Date: Mon, 13 Aug 2012 14:24:50 +0300
Subject: [torqueusers] Is it mandatory to configure ports during installation?
Message-ID:

Hello,

I am installing Torque 4.0.1 according to these instructions:
http://www.adaptivecomputing.com/resources/docs/torque/4-0-1/torqueAdminGuide-4.0.1.pdf

Section 1.2.3 deals with Configuring Ports.
However, even if I don't follow the instructions in this section,
eventually I do get port numbers for the different services:

pbs_server port is 15001
mom_service_port is 15002
mom_manager_port is 15003
15004 is not used AFAIK
trqauthd port is 15005

My question is: is it mandatory to follow these steps, i.e. launching
pbs_server with the proper arguments, editing /etc/services etc.?

I am asking because after skipping this step, pbsnodes -a lists my only
node (I'm using the same machine as head node, execution node and
submission node) with 'state = down', and I have no idea why.

Best regards,
>g.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120813/032f3c1d/attachment-0001.html

From azher at hep.caltech.edu Mon Aug 13 08:00:47 2012
From: azher at hep.caltech.edu (Azher Mughal)
Date: Mon, 13 Aug 2012 07:00:47 -0700
Subject: [torqueusers] exceeds available partition procs
Message-ID: <5029088F.1000902@hep.caltech.edu>

Hi All,

I am trying to find out why jobs from all users stay queued; checkjob
reports:

checking job 1588541

State: Idle
Creds: user:bckhouse group:minos class:minos qos:DEFAULT
WallTime: 00:00:00 of 10:00:00:00
SubmitTime: Fri Aug 10 17:32:51
(Time Queued Total: 2:13:23:49 Eligible: 18:18:51)

Total Tasks: 1

Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [sl5]
Dedicated Resources Per Task: PROCS: 1 MEM: 1024M

IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 18
PartitionMask: [ALL]
Flags: RESTARTABLE

Holds: Defer
Messages: exceeds available partition procs
PE: 1.14 StartPriority: 1
cannot select job 1588541 for partition DEFAULT (job hold active)

Any hints?

Thanks
-Azher

From guy4261 at gmail.com Mon Aug 13 06:19:43 2012
From: guy4261 at gmail.com (Guy Rapaport)
Date: Mon, 13 Aug 2012 15:19:43 +0300
Subject: [torqueusers] head node and execution node installed on same machine and do not recognize each other
Message-ID:

service pbs_mom restart
qterm -t quick pbs_server
pbsnodes -a

pbs_server port is 15001
mom_service_port is 15002
mom_manager_port is 15003
? 15004
trqauthd port is 15005

netstat -tuplen

when restarting mom:

08/13/2012 18:08:19;0002; pbs_mom;n/a;rm_request;shutdown
08/13/2012 18:08:19;0002; pbs_mom;n/a;dep_cleanup;dependent cleanup
08/13/2012 18:08:19;0002; pbs_mom;Svr;Log;Log closed
08/13/2012 18:08:21;0002; pbs_mom;Svr;Log;Log opened
08/13/2012 18:08:21;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.1.0, loglevel = 0
08/13/2012 18:08:21;0002; pbs_mom;Svr;setpbsserver;biostation06
08/13/2012 18:08:21;0002; pbs_mom;Svr;mom_server_add;server biostation06 added
08/13/2012 18:08:21;0002; pbs_mom;n/a;initialize;independent
08/13/2012 18:08:21;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
08/13/2012 18:08:21;0002; pbs_mom;Svr;pbs_mom;Is up
08/13/2012 18:08:21;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/local/sbin/pbs_mom 1344185749
08/13/2012 18:08:21;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.1.0, loglevel = 0
08/13/2012 18:11:03;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Network is unreachable (101) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 132.72.216.159:15001]
08/13/2012 18:11:03;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Inappropriate ioctl for device (25) in tcp_connect_sockaddr, cannot connect to port 8 in socket_connect_addr - errno:101 Network is unreachable
08/13/2012 18:11:03;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Cannot get a valid stream to send update to server 'biostation06'
08/13/2012 18:12:52;0002; pbs_mom;n/a;rm_request;shutdown
08/13/2012 18:12:52;0002; pbs_mom;n/a;dep_cleanup;dependent cleanup
08/13/2012 18:12:52;0002; pbs_mom;Svr;Log;Log closed
08/13/2012 18:12:54;0002; pbs_mom;Svr;Log;Log opened
08/13/2012 18:12:54;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.1.0, loglevel = 0
08/13/2012 18:12:54;0002; pbs_mom;Svr;setpbsserver;biostation06
08/13/2012 18:12:54;0002; pbs_mom;Svr;mom_server_add;server biostation06 added
08/13/2012 18:12:54;0002; pbs_mom;n/a;initialize;independent
08/13/2012 18:12:54;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
08/13/2012 18:12:54;0002; pbs_mom;Svr;pbs_mom;Is up
08/13/2012 18:12:54;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/local/sbin/pbs_mom 1344185749
08/13/2012 18:12:54;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.1.0, loglevel = 0

n when restarting headnode:

08/13/2012 17:33:16;0086;PBS_Server;Svr;PBS_Server;Shutdown request from root at biostation06.bgu.ac.il
08/13/2012 17:33:16;0086;PBS_Server;Svr;PBS_Server;Starting to shutdown the server, type is Quick
08/13/2012 17:33:17;0002;PBS_Server;Svr;PBS_Server;Server shutdown completed
08/13/2012 17:33:17;0002;PBS_Server;Svr;Log;Log closed
08/13/2012 17:33:19;0002;PBS_Server;Svr;Log;Log opened
08/13/2012 17:33:19;0006;PBS_Server;Svr;PBS_Server;Server biostation06.bgu.ac.il started, initialization type = 1
08/13/2012 17:33:19;0002;PBS_Server;Svr;get_default_threads;Defaulting min_threads to 49 threads
08/13/2012 17:33:19;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20120813 opened
08/13/2012 17:33:19;0040;PBS_Server;Req;setup_nodes;setup_nodes()
08/13/2012 17:33:19;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch
08/13/2012 17:33:19;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues
08/13/2012 17:33:19;0080;PBS_Server;Svr;PBS_Server;2 total files read from disk
08/13/2012 17:33:19;0002;PBS_Server;Svr;PBS_Server;handle_job_recovery:3
08/13/2012 17:33:19;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'biostation06.bgu.ac.il')
08/13/2012 17:33:19;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 13368, loglevel=0
08/13/2012 17:33:34;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.1.0, loglevel = 0
08/13/2012 17:33:34;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::get_node_from_str, Node biostation06 is reporting on node biostation06.bgu.ac.il, which pbs_server doesn't know about
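
One possible reading of the last server log line is a name mismatch: the MOM reports under one form of the host name (short or fully qualified) while the server's nodes file lists the other. That is an interpretation, not something the log proves. The sketch below only checks which form of a name appears as an entry in a nodes file. The helper name node_listed is invented for this example, and the nodes file location in the usage comment is an assumption derived from the /var/spool/torque spool directory visible in the log.

```shell
#!/bin/sh
# Succeed (exit status 0) if a host name appears as the first field of a
# line in a Torque nodes file; blank lines and comment lines are skipped.
# node_listed is a hypothetical helper name, not part of Torque.
node_listed() {
    # $1 = name to look for, $2 = path to the nodes file
    awk -v want="$1" '
        NF == 0 || $1 ~ /^#/ { next }   # ignore blank and comment lines
        $1 == want { found = 1 }        # exact match on the node name field
        END { exit !found }             # 0 when found, 1 otherwise
    ' "$2"
}

# Example (the path is an assumption; adjust to your installation):
#   nodes=/var/spool/torque/server_priv/nodes
#   for n in "$(hostname -s)" "$(hostname -f)"; do
#       if node_listed "$n" "$nodes"; then
#           echo "$n: listed"
#       else
#           echo "$n: not listed"
#       fi
#   done
```

If only one of the two names is listed, comparing it with the name in the MOM's own log ("setpbsserver;biostation06" above) may show where the two sides disagree.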
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120813/5a1b9e1d/attachment-0001.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: torqueAdminGuide-4.0.1.pdf
Type: application/pdf
Size: 2403366 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20120813/5a1b9e1d/attachment-0001.pdf

From shade34321 at gmail.com Mon Aug 13 12:54:39 2012
From: shade34321 at gmail.com (Shade Alabsa)
Date: Mon, 13 Aug 2012 14:54:39 -0400
Subject: [torqueusers] Constantly having to restart PBS
Message-ID:

Recently we had to upgrade our xcat, pbs, and maui install, and since then
we have had to restart our cluster/pbs roughly once a week. I'm not sure
what the problem is, but our pbs log files are full of the following errors.

08/10/2012 23:17:09;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:09;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:09;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:13;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:13;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:13;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:17;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:17;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:17;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:21;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:21;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:21;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:25;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:25;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:25;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:29;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:29;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:29;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
08/10/2012 23:17:33;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
08/10/2012 23:17:33;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
08/10/2012 23:17:33;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server

Any help you can provide would be great! Thanks!

Shade Alabsa
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120813/23318ac5/attachment-0001.html

From vipingjo at gmail.com Tue Aug 14 02:02:59 2012
From: vipingjo at gmail.com (vipingjo)
Date: Tue, 14 Aug 2012 16:02:59 +0800
Subject: [torqueusers] Not enough of the right type of nodes are available
Message-ID: <201208141602531685695@gmail.com>

Hi all,

I'm running Torque on an Ubuntu 10.04 LTS virtual machine. Only one user is
submitting jobs, but only one job runs at a time. Can anyone tell me how to
configure it so that more jobs run at once? Thanks!

Here are some commands and output. If more information is needed, please
let me know.

$cat 37.N3.1.573.pbs
#!/bin/bash
#PBS -d /var/N3.0_lane_opt/
#PBS -o /var/37.N3.0.573.pbs.o
#PBS -e /var/37.N3.0.573.pbs.e
#PBS -l nodes=1:ppn=1
/usr/bin/time -v perl lt.pl

$cat 37.N3.1.576.pbs
#!/bin/bash
#PBS -d /var/N3.1_lane_opt/
#PBS -o /var/37.N3.1.576.pbs.o
#PBS -e /var/37.N3.1.576.pbs.e
#PBS -l nodes=1:ppn=1
/usr/bin/time -v perl opt.pl

$qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
313.ubuntu                37.N3.1.573.pbs  viping          00:28:32 R dque
314.ubuntu                37.N3.1.576.pbs  viping                 0 Q dque

$qstat -f 314
...
    Priority = 0
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=1
    Resource_List.walltime = 1000:00:00
...
    comment = Not Running: Not enough of the right type of nodes are available

Qmgr: list queue dque
Queue dque
    queue_type = Execution
    total_jobs = 2
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:1 Exiting:0
    max_running = 4
    resources_max.ncpus = 4
    resources_max.nodes = 2
    resources_min.ncpus = 1
    resources_default.ncpus = 1
    resources_default.neednodes = 1:ppn=1
    resources_default.nodect = 1
    resources_default.nodes = 1
    resources_default.walltime = 1000:00:00
    resources_assigned.ncpus = 1
    resources_assigned.nodect = 1
    max_user_run = 6
    enabled = True
    started = True

Qmgr: q
root at ubuntu:/usr/bin# qmgr
Max open servers: 4
Qmgr: list server
Server ubuntu
    server_state = Active
    scheduling = True
    total_jobs = 2
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:1 Exiting:0
    acl_hosts = igenas
    default_queue = dque
    log_events = 511
    mail_from = adm
    resources_assigned.ncpus = 1
    resources_assigned.nodect = 1
    scheduler_iteration = 600
    node_check_rate = 150
    tcp_timeout = 6
    pbs_version = 2.3.6
    next_job_number = 315
    net_counter = 6 2 1

#cat /var/lib/torque/server_name
igenas

Regards,
Viping

2012-08-14
vipingjo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120814/1d0c184f/attachment-0001.html

From guy4261 at gmail.com Tue Aug 14 05:12:13 2012
From: guy4261 at gmail.com (Guy Rapaport)
Date: Tue, 14 Aug 2012 14:12:13 +0300
Subject: [torqueusers] Is it mandatory to configure ports during installation?
In-Reply-To:
References:
Message-ID:

Hello,

I am installing Torque 4.0.1 according to these instructions:
http://www.adaptivecomputing.com/resources/docs/torque/4-0-1/torqueAdminGuide-4.0.1.pdf

Section 1.2.3 deals with Configuring Ports.
However, even if I don't follow the instructions in this section,
eventually I do get port numbers for the different services:

pbs_server port is 15001
mom_service_port is 15002
mom_manager_port is 15003
15004 is not used AFAIK
trqauthd port is 15005

My question is: is it mandatory to follow these steps, i.e. launching
pbs_server with the proper arguments, editing /etc/services etc.? It was
weird to do because a) I didn't know what values I should normally use and
b) while mom is a service, pbs_server isn't; I run it like a normal
executable.

I am asking because after skipping this step, pbsnodes -a lists my only
node (I'm using the same machine as head node, execution node and
submission node) with 'state = down', and I have no idea why.

Also, I'm installing pbs_server and mom on the same machine (don't ask...)
and the pbs_server log says:

08/13/2012 17:33:34;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::get_node_from_str, Node biostation06 is reporting on node biostation06.bgu.ac.il, which pbs_server doesn't know about

(biostation06 is the DNS name of my machine; the FQDN is
biostation06.bgu.ac.il)

Ideas, anyone?

Best regards,
>g.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120814/5430dfaf/attachment.html

From tom.asbury at sequentainc.com Mon Aug 13 11:55:42 2012
From: tom.asbury at sequentainc.com (Tom Asbury)
Date: Mon, 13 Aug 2012 10:55:42 -0700
Subject: [torqueusers] multiple job array dependency bug with simple test case script
Message-ID: <4424F658BE5DB042A4E55CDE3B50CA4923B32F5D11@EX-BE-016-SFO.shared.themessagecenter.com>

I've boiled down what I think is a significant job array dependency bug to
a simple test case. The problem is that job array dependencies are not
honored when you have more than one array dependency. The script below
creates 2 small sleep jobs of different lengths and submits each as an array.
A final wrap-up job is submitted as depending upon both of the job arrays
finishing successfully. However, the dependent job does not wait for both
arrays to exit successfully before starting; rather, it starts early, after
the first array job finishes:

sleep120_array   sleep60_array
      |                |
      |                |
      |                -
      |               done
      |          dep job starts
      |              *bug*
      |
     ---
     done
dep job should start here

#--------SH SCRIPT START------
cat << EOF > sleeper60
sleep 60
EOF
cat << EOF > sleeper120
sleep 120
EOF
JOB1=`cat sleeper120 | qsub -t 1-2`
JOB2=`cat sleeper60 | qsub -t 1-2`
echo "cat sleeper60 | qsub -W depend=afterokarray:${JOB1}:${JOB2}"
cat sleeper60 | qsub -W depend=afterokarray:${JOB1}:${JOB2}
# --------SH SCRIPT END------

Running the script above will give something like the following:

> sh SCRIPT
cat sleeper60 | qsub -W depend=afterokarray:55349[].madrid:55350[].madrid
55351.madrid

-- after 90 seconds --

> qstat
55349[].madrid STDIN tasbu 0 R batch
55350[].madrid STDIN tasbu 0 C batch
55351.madrid STDIN tasbu 0 R batch

55351.madrid should not be running!

Torque version: 3.0.5

A solution to this problem is crucial to our pipeline and I would
appreciate any fixes / workarounds.
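
One variation that might be worth testing, offered only as an untested idea: the dependency list of qsub -W depend= accepts several comma-separated entries, so each array can be given its own afterokarray entry instead of sharing one colon-separated entry. Whether this changes the behaviour on Torque 3.0.5 is not known here. The helper name depend_per_array is invented for this sketch; it only builds the option string and does not call qsub itself.

```shell
#!/bin/sh
# Build a "depend=..." value with one afterokarray entry per array ID,
# joined by commas, e.g. depend=afterokarray:A,afterokarray:B
# depend_per_array is a hypothetical helper name, not a Torque command.
depend_per_array() {
    out=""
    for id in "$@"; do
        # ${out:+$out,} adds a comma only when out is already non-empty
        out="${out:+$out,}afterokarray:$id"
    done
    printf 'depend=%s\n' "$out"
}

# Hypothetical use with the variables from the script above:
#   cat sleeper60 | qsub -W "$(depend_per_array "$JOB1" "$JOB2")"
```

Quoting the command substitution keeps the square brackets in array IDs such as 55349[].madrid away from shell globbing.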

Thanks
- Tom

From craig.tierney at noaa.gov Wed Aug 15 21:10:25 2012
From: craig.tierney at noaa.gov (Craig Tierney)
Date: Wed, 15 Aug 2012 21:10:25 -0600
Subject: [torqueusers] exceeds available partition procs
In-Reply-To: <5029088F.1000902@hep.caltech.edu>
References: <5029088F.1000902@hep.caltech.edu>
Message-ID: <502C64A1.9020000@noaa.gov>

On 8/13/12 8:00 AM, Azher Mughal wrote:
> Hi All,
>
> I am trying to find out why jobs from all users stay queued; checkjob
> reports:
>
>
> checking job 1588541
>
> State: Idle
> Creds: user:bckhouse group:minos class:minos qos:DEFAULT
> WallTime: 00:00:00 of 10:00:00:00
> SubmitTime: Fri Aug 10 17:32:51
> (Time Queued Total: 2:13:23:49 Eligible: 18:18:51)
>
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [sl5]
> Dedicated Resources Per Task: PROCS: 1 MEM: 1024M
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 18
> PartitionMask: [ALL]
> Flags: RESTARTABLE
>
> Holds: Defer
> Messages: exceeds available partition procs
> PE: 1.14 StartPriority: 1
> cannot select job 1588541 for partition DEFAULT (job hold active)
>
> Any hints?
>

What does "mdiag -P -v" say?

Craig

From s.prabhakaran at grs-sim.de Thu Aug 16 07:55:35 2012
From: s.prabhakaran at grs-sim.de (Suraj Prabhakaran)
Date: Thu, 16 Aug 2012 15:55:35 +0200
Subject: [torqueusers] Torque Maui Communication during Job Submission
Message-ID: <98BEC501-4BC5-4D4C-B565-406AD5549F09@grs-sim.de>

Hello,

I have been looking into Torque and Maui communication for some days. I
have a question regarding job submission.
During a qsub command, does Maui get the information about the qsub only
from Torque, or does it also get it directly from the client?

Again, any pointers to Torque-Maui documentation with more descriptions
would be very helpful!
Best regards,
Suraj

--------------------------
Suraj Prabhakaran

German Research School for
Simulation Sciences GmbH
Laboratory for Parallel Programming
52062 Aachen | Germany

Tel +49 241 80 99743
Fax +49 241 80 92742
EMail s.prabhakaran at grs-sim.de
Web www.grs-sim.de

Members: Forschungszentrum Jülich GmbH | RWTH Aachen University
Registered in the commercial register of the local court of Düren
(Amtsgericht Düren) under registration number HRB 5268
Registered office: Jülich
Executive board: Prof. Marek Behr Ph.D. | Dr. Norbert Drewes

From l.flis at cyf-kr.edu.pl Thu Aug 16 07:58:14 2012
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Thu, 16 Aug 2012 15:58:14 +0200
Subject: [torqueusers] Torque Maui Communication during Job Submission
In-Reply-To: <98BEC501-4BC5-4D4C-B565-406AD5549F09@grs-sim.de>
References: <98BEC501-4BC5-4D4C-B565-406AD5549F09@grs-sim.de>
Message-ID: <502CFC76.5070906@cyf-kr.edu.pl>

Hello,

Maui periodically queries pbs_server and gets information about new jobs.
Or, if you have "server_scheduling" set to true in pbs_server, Maui gets notified by pbs_server about each new job.

There is no direct communication qsub->maui(moab)

Cheers,
--
Lukasz Flis
CYFRONET

>
> I have been looking into torque and maui communication for some days. I
> have a question regarding job submission.
> During a qsub command, does Maui get the information about the qsub only
> from torque or does it also get directly from the client?
>
> Again, any pointers to torque-maui documentation with more descriptions
> could be very helpful!
>
> Best regards,
> Suraj
>
> --------------------------
> Suraj Prabhakaran
>
> German Research School for
> Simulation Sciences GmbH
> Laboratory for Parallel Programming
> 52062 Aachen | Germany
>
> Tel +49 241 80 99743
> Fax +49 241 80 92742
> EMail s.prabhakaran at grs-sim.de
> Web www.grs-sim.de
>
> Members: Forschungszentrum Jülich GmbH | RWTH Aachen University
> Registered in the commercial register of the local court of Düren
> (Amtsgericht Düren) under registration number HRB 5268
> Registered office: Jülich
> Executive board: Prof. Marek Behr Ph.D. | Dr. Norbert Drewes
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

From ytt515 at yahoo.cn Fri Aug 17 02:58:01 2012
From: ytt515 at yahoo.cn (TingtingYang)
Date: Fri, 17 Aug 2012 16:58:01 +0800 (CST)
Subject: [torqueusers] Failure on restart job on different node
Message-ID: <1345193881.20202.YahooMailClassic@web92203.mail.cnh.yahoo.com>

Hi all,

Right now I want to restart my job on a different node, but something goes wrong. I followed the steps below to checkpoint/restart:

1. qsub -c enabled ctest.sh (ctest.sh is a serial job)
2. qhold JobID (I can successfully hold an MPI/serial job and restart it on the same node)
3. qalter -l nodes=<another node> JobID (not the node that previously ran job JobID)
4. qrls JobID

The job exits with exit_code=139 and /var/log/messages said:

Aug 17 16:47:13 node6 restart_script: Invoked: /var/spool/torque/mom_priv/blcr_restart_script 15858 486.node8 ytt ytt /home/share/pbs/486.node8.CK ckpt.486.node8.1345193231
Aug 17 16:48:35 node6 kernel: crtest[15860]: segfault at 0000003bcf633600 rip 0000003bcf633600 rsp 00007fff2e6cb218 error 14
Aug 17 16:48:35 node6 kernel: 486.node8.SC[15859]: segfault at 0000003bcf6e71a0 rip 0000003bcf6e71a0 rsp 00007fffe0da4e18 error 14
Aug 17 16:48:35 node6 kernel: bash[15858]: segfault at 0000003bcf6e71a0 rip 0000003bcf6e71a0 rsp 00007fff4196c7b8 error 14
Aug 17 16:48:35 node6 restart_script: Subcommand (cr_restart --run-on-success='qalter -W checkpoint_restart_status="Successfully restarted job from checkpoint" 486.node8' --run-on-fail-perm='qalter -W checkpoint_restart_status="Permanent failure restarting job from checkpoint" 486.node8' --run-on-fail-temp='qalter -W checkpoint_restart_status="Temporary failure restarting job from checkpoint" 486.node8' --run-on-fail-args='qalter -W checkpoint_restart_status="Argument failure restarting job from checkpoint" 486.node8' --run-on-fail-env='qalter -W checkpoint_restart_status="Environment failure restarting job from checkpoint" 486.node8' --run-on-failure='qalter -W checkpoint_restart_status="General failure restarting job from checkpoint" 486.node8' ckpt.486.node8.1345193231) failed with rc=139:

I use blcr-0.8.4, openmpi-1.6 and torque-2.5.8, and I installed blcr, openmpi and torque on my shared file system as https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink said.

Can anyone help me, please? Thank you.

tingting.yang at Beihang university

From amjadcsu at gmail.com Fri Aug 17 05:05:12 2012
From: amjadcsu at gmail.com (amjad syed)
Date: Fri, 17 Aug 2012 15:05:12 +0400
Subject: [torqueusers] Pbsnodes does not give information about load average in numa configuration.
Message-ID:

Hello

I am trying Torque 3.0.3 with NUMA support.
This is my first attempt to understand NUMA and Torque, so please consider this a newbie question.

I have configured two NUMA nodes in my mom.layout:

cpus=0-15 mem=0
cpus=15-31 mem=1

This is my server_priv/nodes file:

v2 np=32 num_numa_nodes=2

When I run pbsnodes -a, I don't get the load average reported by pbs_mom:

pbsnodes -a
v2-0
     state = free
     np = 16
     ntype = cluster
     status = rectime=1345200884,varattr=,jobs=,state=free,netload=? 0,gres=,loadave=? 0,ncpus=16,physmem=16768036kb,availmem=15385544kb,totmem=16768036kb,idletime=355542,nusers=0,nsessions=0,uname=Linux v2 2.6.32-71.el6.x86_64 #1 SMP Fri May 20 03:51:51 BST 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

v2-1
     state = free
     np = 16
     ntype = cluster
     status = rectime=1345200902,varattr=,jobs=,state=free,netload=? 0,gres=,loadave=? 0,ncpus=16,physmem=16777216kb,availmem=16073452kb,totmem=16777216kb,idletime=355542,nusers=0,nsessions=0,uname=Linux v2 2.6.32-71.el6.x86_64 #1 SMP Fri May 20 03:51:51 BST 2011 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

My question is: how can I get this load average information in a NUMA configuration, if it provides one? And how does pbs_sched then change the state of a node (from busy to free and vice versa) based on load average?

Thanks

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120817/4f7fcdab/attachment.html

From nt_mahmood at yahoo.com Sat Aug 18 05:18:37 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Sat, 18 Aug 2012 04:18:37 -0700 (PDT)
Subject: [torqueusers] tcp_request, bad connect from
Message-ID: <1345288717.67706.YahooMailNeo@web111703.mail.gq1.yahoo.com>

Dear all,
In the server log, I see a lot of these errors:

Aug 16 00:05:06 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:19894
Aug 16 00:05:06 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:38070
Aug 16 00:05:06 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:39606
Aug 16 00:05:07 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:41398
Aug 16 00:05:07 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:42934
Aug 16 00:05:07 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:44214
Aug 16 00:05:08 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:45238
Aug 16 00:05:08 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:46518
Aug 16 00:05:08 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:47542
Aug 16 00:05:09 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:48566
Aug 16 00:05:09 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:50614
Aug 16 00:05:10 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:51382
Aug 16 00:05:10 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:52150
Aug 16 00:05:10 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:53430
Aug 16 00:05:11 hpclab
pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:54454
Aug 16 00:05:11 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:55734
Aug 16 00:05:11 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:57270
Aug 16 00:05:12 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:60086
Aug 16 00:05:13 hpclab pbs_mom: LOG_ERROR::Success (0) in tcp_request, bad connect from 109.163.230.253:61110

What do they mean? What is 109.163.230.253? The nodes are connected to the server with 192.168.X.Y IP addresses.

Regards,
Mahmood

From nt_mahmood at yahoo.com Sat Aug 18 12:11:32 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Sat, 18 Aug 2012 11:11:32 -0700 (PDT)
Subject: [torqueusers] gui program does not work in interactive mode
Message-ID: <1345313492.48424.YahooMailNeo@web111707.mail.gq1.yahoo.com>

Dear all,
It seems that running a graphical program (which uses X) is not possible in the interactive mode.
1) On the server, I run "ssh -X ws01" and then run the program. Everything is fine and I see the main GUI of the program.
2) On the server I run "qsub -I -q long", then I enter ws01. This time when I run the program, it says it cannot open the display.

Any thought on this?

Regards,
Mahmood

From shenglong.wang at nyu.edu Sat Aug 18 13:29:37 2012
From: shenglong.wang at nyu.edu (Shenglong Wang)
Date: Sat, 18 Aug 2012 15:29:37 -0400
Subject: [torqueusers] gui program does not work in interactive mode
In-Reply-To: <1345313492.48424.YahooMailNeo@web111707.mail.gq1.yahoo.com>
References: <1345313492.48424.YahooMailNeo@web111707.mail.gq1.yahoo.com>
Message-ID:

Try qsub -X -I -q long

On Aug 18, 2012 12:14 PM, "Mahmood Naderan" wrote:

> Dear all,
> It seems that running a graphical program (which uses X) is not possible
> in the interactive mode.
> 1) on the server, i run "ssh -X ws01" and then run the program.
Everything
> is fine and I see the main gui of the program
> 2) on the server I run "qsub -I -q long", then I enter ws01. This time
> when I run the program, it says can not open the display.
>
> Any thought on this?
>
>
> Regards,
> Mahmood
>

From nt_mahmood at yahoo.com Sat Aug 18 14:20:37 2012
From: nt_mahmood at yahoo.com (Mahmood Naderan)
Date: Sat, 18 Aug 2012 13:20:37 -0700 (PDT)
Subject: [torqueusers] gui program does not work in interactive mode
In-Reply-To:
References: <1345313492.48424.YahooMailNeo@web111707.mail.gq1.yahoo.com>
Message-ID: <1345321237.43435.YahooMailNeo@web111714.mail.gq1.yahoo.com>

>qsub -X -I -q long

Thanks. It is now working

Regards,
Mahmood

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120818/bdf8cd57/attachment.html From laytonjb at att.net Sun Aug 19 09:08:41 2012 From: laytonjb at att.net (Jeff Layton) Date: Sun, 19 Aug 2012 11:08:41 -0400 Subject: [torqueusers] Job not running In-Reply-To: <609225716.4302603.1344325796267.JavaMail.root@scai.fraunhofer.de> References: <609225716.4302603.1344325796267.JavaMail.root@scai.fraunhofer.de> Message-ID: <50310179.1030901@att.net> Andr? Apologies for the tardiness of my reply - I was away from my system for a few weeks. I changed the log_level to 7 as you indicated and the server logs are below. I just changed the log_level, restarted the server, fired up a compute node, and submitted the job. I apologize for the length of the log file (BTW - this is the server log file). Any help is greatly appreciated. Thanks! Jeff 08/19/2012 10:47:21;0002;PBS_Server;Svr;Log;Log opened 08/19/2012 10:47:21;0006;PBS_Server;Svr;PBS_Server;Server test1 started, initialization type = 1 08/19/2012 10:47:21;0002;PBS_Server;Svr;get_default_threads;Defaulting min_threads to 9 threads 08/19/2012 10:47:21;0002;PBS_Server;Svr;Act;Account file /opt/torque/server_priv/accounting/20120819 opened 08/19/2012 10:47:21;0040;PBS_Server;Req;setup_nodes;setup_nodes() 08/19/2012 10:47:21;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch 08/19/2012 10:47:21;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues 08/19/2012 10:47:21;0080;PBS_Server;Svr;PBS_Server;2 total files read from disk 08/19/2012 10:47:21;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs 08/19/2012 10:47:21;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 't est1') 08/19/2012 10:47:21;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 2128, loglevel=0 08/19/2012 10:47:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Faile d when trying to open tcp connection - connect() failed [rc = 15096] [addr = 
10.1.0.1:15003] 08/19/2012 10:47:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003 08/19/2012 10:47:36;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0 08/19/2012 10:48:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Faile d when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003] 08/19/2012 10:48:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003 08/19/2012 10:48:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Faile d when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003] 08/19/2012 10:48:33;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003 08/19/2012 10:49:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Faile d when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003] 08/19/2012 10:49:03;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003 08/19/2012 10:52:37;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0 08/19/2012 10:57:37;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0 08/19/2012 11:02:37;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0 08/19/2012 11:07:48;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0 08/19/2012 11:12:48;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0 08/19/2012 11:17:48;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 0 08/19/2012 11:20:06;0004;PBS_Server;Svr;PBS_Server;attributes set: at request of root at test1 08/19/2012 11:20:06;0004;PBS_Server;Svr;PBS_Server;attributes set: log_level = 7 
08/19/2012 11:20:18;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:20:18;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:20:24;0086;PBS_Server;Svr;PBS_Server;Starting to shutdown the server, type is By Signal 08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Server shutdown completed 08/19/2012 11:20:24;0002;PBS_Server;Svr;Log;Log closed 08/19/2012 11:20:24;0002;PBS_Server;Svr;Log;Log opened 08/19/2012 11:20:24;0006;PBS_Server;Svr;PBS_Server;Server test1 started, initialization type = 1 08/19/2012 11:20:24;0002;PBS_Server;Svr;get_default_threads;Defaulting min_threads to 9 threads 08/19/2012 11:20:24;0002;PBS_Server;Svr;Act;Account file /opt/torque/server_priv/accounting/20120819 opened 08/19/2012 11:20:24;0040;PBS_Server;Req;setup_nodes;setup_nodes() 08/19/2012 11:20:24;0086;PBS_Server;Svr;PBS_Server;Recovered queue batch 08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Expected 1, recovered 1 queues 08/19/2012 11:20:24;0080;PBS_Server;Svr;PBS_Server;2 total files read from disk 08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Expected 0, recovered 0 jobs 08/19/2012 11:20:24;0006;PBS_Server;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 't est1') 08/19/2012 11:20:24;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 6383, loglevel=0 08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:20:25;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos 08/19/2012 11:20:25;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname 08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:20:25;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:20:25;0080;PBS_Server;node;next_queue;locking batch in method next_queue 08/19/2012 
11:20:25;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:20:29;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:20:29;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:20:37;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::No route to host (113) in tcp_connect_sockaddr, Failed when trying to op en tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003] 08/19/2012 11:20:37;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003 08/19/2012 11:20:37;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:20:39;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.2, loglevel = 7 08/19/2012 11:20:39;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:20:54;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:20:54;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:20:54;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos 08/19/2012 11:20:54;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname 08/19/2012 11:20:54;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:20:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Connection refused (111) in tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 10.1.0.1:15003] 08/19/2012 11:20:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::send_hierarchy, Could not send mom hierarchy to host n0001:15003 08/19/2012 
11:20:55;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:21:09;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:21:09;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;message received from sock 8 (version 3) 08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;message received from addr 10.1.0.1:496: mom_port 15002 - rm_port 15003 08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;locking start n0001 in method svr_is_request-AVL_find 08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;locking complete n0001 in method svr_is_request 08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host n0001 (10.1.0.1:496) (sock 8) 08/19/2012 11:21:14;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from n0001 08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-before is_stat_get 08/19/2012 11:21:14;0040;PBS_Server;Req;is_stat_get;received status from node n0001 08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos 08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname 08/19/2012 11:21:14;0040;PBS_Server;Req;update_node_state;adjusting state for node n0001 - state=514, newstate=0 08/19/2012 11:21:14;0040;PBS_Server;Req;update_node_state;node n0001 marked free 08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos 08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname 08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:21:14;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-close 08/19/2012 
11:21:14;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos 08/19/2012 11:21:14;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname 08/19/2012 11:21:14;0002;PBS_Server;Svr;send_hierarchy_threadtask;Successfully sent hierarchy to n0001 08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:21:14;0080;PBS_Server;node;next_queue;locking batch in method next_queue 08/19/2012 11:21:14;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:21:24;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:21:24;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;message received from sock 10 (version 3) 08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;message received from addr 10.1.0.1:628: mom_port 15002 - rm_port 15003 08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;locking start n0001 in method svr_is_request-AVL_find 08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;locking complete n0001 in method svr_is_request 08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host n0001 (10.1.0.1:628) (sock 10) 08/19/2012 11:21:29;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from n0001 08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-before is_stat_get 08/19/2012 11:21:29;0040;PBS_Server;Req;is_stat_get;received status from node n0001 08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos 08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname 08/19/2012 
11:21:29;0040;PBS_Server;Req;update_node_state;adjusting state for node n0001 - state=512, newstate=0 08/19/2012 11:21:29;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:21:29;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos 08/19/2012 11:21:29;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname 08/19/2012 11:21:29;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-close 08/19/2012 11:21:39;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:21:39;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command AuthenticateUser from laytonjb 08/19/2012 11:21:46;0100;PBS_Server;Req;;Type AuthenticateUser request received from laytonjb at test1, sock=10 08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=10 08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type AuthenticateUser on socket 10 08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command QueueJob from laytonjb 08/19/2012 11:21:46;0100;PBS_Server;Req;;Type QueueJob request received from laytonjb at test1, sock=8 08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching request QueueJob on sd=8 08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command Disconnect from laytonjb 08/19/2012 11:21:46;0080;PBS_Server;node;find_queuebyname;locking batch in method find_queuebyname 08/19/2012 11:21:46;0040;PBS_Server;Req;node_spec;entered spec=1 08/19/2012 11:21:46;0040;PBS_Server;Req;node_spec;job allocation debug: 1 requested, 3 svr_clnodes, 1 svr_totnodes 08/19/2012 11:21:46;0080;PBS_Server;node;next_node;locking start n0001 in method next_node-next != NULL 08/19/2012 
11:21:46;0080;PBS_Server;node;next_node;locking complete n0001 in method next_node 08/19/2012 11:21:46;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count, Counted 0 gpus available on node n0001 08/19/2012 11:21:46;0001;PBS_Server;Svr;PBS_Server;LOG_DEBUG::gpu_count, Counted 0 gpus free on node n0001 08/19/2012 11:21:46;0080;PBS_Server;node;node_spec;unlocking n0001 in method node_spec-no pos 08/19/2012 11:21:46;0040;PBS_Server;Req;node_spec;job allocation debug(3): returning 1 requested 08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type QueueJob on socket 8 08/19/2012 11:21:46;0080;PBS_Server;node;req_quejob;unlocking batch in method req_quejob-success 08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command JobScript from laytonjb 08/19/2012 11:21:46;0100;PBS_Server;Req;;Type JobScript request received from laytonjb at test1, sock=8 08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching request JobScript on sd=8 08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type JobScript on socket 8 08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command ReadyToCommit from laytonjb 08/19/2012 11:21:46;0100;PBS_Server;Req;;Type ReadyToCommit request received from laytonjb at test1, sock=8 08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching request ReadyToCommit on sd=8 08/19/2012 11:21:46;0008;PBS_Server;Job;20.test1;ready to commit job 08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type ReadyToCommit on socket 8 08/19/2012 11:21:46;0008;PBS_Server;Job;20.test1;ready to commit job completed 08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command Commit from laytonjb 08/19/2012 11:21:46;0100;PBS_Server;Req;;Type Commit request received from laytonjb at test1, sock=8 08/19/2012 11:21:46;0008;PBS_Server;Job;dispatch_request;dispatching request Commit on sd=8 08/19/2012 
11:21:46;0008;PBS_Server;Job;20.test1;committing job 08/19/2012 11:21:46;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 20.test1 state from TRANSIT-TRANSICM to QUEUED-QUEUED (1-10) 08/19/2012 11:21:46;0080;PBS_Server;node;find_queuebyname;locking batch in method find_queuebyname 08/19/2012 11:21:46;0100;PBS_Server;Job;20.test1;enqueuing into batch, state 1 hop 1 08/19/2012 11:21:46;0080;PBS_Server;node;set_resc_deflt;unlocking batch in method set_resc_deflt-no pos 08/19/2012 11:21:46;0080;PBS_Server;node;svr_enquejob;unlocking batch in method svr_enquejob-anything 08/19/2012 11:21:46;0080;PBS_Server;node;req_commit;unlocking batch in method req_commit-route success 08/19/2012 11:21:46;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type Commit on socket 8 08/19/2012 11:21:46;0008;PBS_Server;Job;20.test1;Job Queued at request of laytonjb at test1, owner = laytonjb at test1, job name = pbs_test2 , queue = batch 08/19/2012 11:21:46;0080;PBS_Server;Req;dis_request_read;decoding command Disconnect from laytonjb 08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding command AuthenticateUser from laytonjb 08/19/2012 11:21:48;0100;PBS_Server;Req;;Type AuthenticateUser request received from laytonjb at test1, sock=8 08/19/2012 11:21:48;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=8 08/19/2012 11:21:48;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type AuthenticateUser on socket 8 08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding command Disconnect from laytonjb 08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding command StatusServer from laytonjb 08/19/2012 11:21:48;0100;PBS_Server;Req;;Type StatusServer request received from laytonjb at test1, sock=11 08/19/2012 11:21:48;0008;PBS_Server;Job;dispatch_request;dispatching request StatusServer on sd=11 08/19/2012 11:21:48;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type StatusServer on socket 11 
08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding command StatusJob from laytonjb 08/19/2012 11:21:48;0100;PBS_Server;Req;;Type StatusJob request received from laytonjb at test1, sock=11 08/19/2012 11:21:48;0008;PBS_Server;Job;dispatch_request;dispatching request StatusJob on sd=11 08/19/2012 11:21:48;0008;PBS_Server;Job;reply_send_svr;Reply sent for request type StatusJob on socket 11 08/19/2012 11:21:48;0002;PBS_Server;Job;req_statjob;Successfully returned the status of queued jobs 08/19/2012 11:21:48;0080;PBS_Server;Req;dis_request_read;decoding command Disconnect from laytonjb 08/19/2012 11:21:49;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:21:49;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:21:54;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:21:54;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:22:04;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:22:04;0080;PBS_Server;node;next_queue;locking batch in method next_queue 08/19/2012 11:22:04;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:22:09;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:22:09;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;message received from sock 10 (version 3) 08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;message received from addr 10.1.0.1:339: mom_port 15002 - rm_port 15003 08/19/2012 11:22:14;0080;PBS_Server;node;svr_is_request;locking start n0001 in method svr_is_request-AVL_find 08/19/2012 11:22:14;0080;PBS_Server;node;svr_is_request;locking complete n0001 in method svr_is_request 08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;message STATUS (4) received from mom on host n0001 (10.1.0.1:339) (sock 10) 08/19/2012 11:22:14;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received from n0001 08/19/2012 
11:22:14;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-before is_stat_get 08/19/2012 11:22:14;0040;PBS_Server;Req;is_stat_get;received status from node n0001 08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos 08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname 08/19/2012 11:22:14;0040;PBS_Server;Req;update_node_state;adjusting state for node n0001 - state=512, newstate=0 08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking start n0001 in method find_nodebyname-no pos 08/19/2012 11:22:14;0080;PBS_Server;node;find_nodebyname;locking complete n0001 in method find_nodebyname 08/19/2012 11:22:14;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:22:14;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:22:14;0080;PBS_Server;node;svr_is_request;unlocking n0001 in method svr_is_request-close 08/19/2012 11:22:24;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:22:24;0002;PBS_Server;Svr;work_thread;finished work from thread 08/19/2012 11:22:39;0002;PBS_Server;Svr;work_thread;starting work from thread 08/19/2012 11:22:39;0002;PBS_Server;Svr;work_thread;finished work from thread > Hi Jeff, > > please do a > > qmgr -c 'set server log_level = 7' > > and try again. Perhaps we can get some more information about the problem then. > And please, send a qstat -f, not -a. :) > > Greetings > André > > ----- Original Message ----- >> Gus. >> >> Thanks for the email! Everything is run by root and was installed >> by root. I tried your suggestions below to add root to the server >> manager and operators but that didn't change anything. The jobs >> still hang and I can't find out why. >> >> I'm still trying some things but no joy so far. I think the problem >> is in the scheduler but I can't seem to locate the problem. 
It's the >> simple FIFO scheduler that is part of Torque so I don't see any >> reason why it's holding jobs. The only thing I can think of is that >> it doesn't think there are any resources available but I can't >> find a reason why. >> >> Thanks! >> >> Jeff >> >> >> From jkusznir at gmail.com Mon Aug 20 13:23:54 2012 From: jkusznir at gmail.com (Jim Kusznir) Date: Mon, 20 Aug 2012 12:23:54 -0700 Subject: [torqueusers] Node Priority Message-ID: Hi all: I've recently updated my cluster and added some more nodes. I now have three categories of nodes: 24 8-core intel nodes (features: intel) 6 16-core AMD nodes with infiniband (features: amd,infiniband) 4 64-core AMD nodes (features: amd, smp) Some of my users don't submit a feature request and don't much care where they get dumped; some users do supply the feature set. I would like the priority of non-featured jobs that can go anywhere to be in the order above. However, right now, they all go to my most limited node type: the 64 core nodes. How do I change the order when "its all the same" to the scheduler? Thanks! --Jim From rsvancara at wsu.edu Mon Aug 20 14:08:00 2012 From: rsvancara at wsu.edu (Svancara, Randall) Date: Mon, 20 Aug 2012 20:08:00 +0000 Subject: [torqueusers] Node Priority In-Reply-To: References: Message-ID: <1F880D7A2494B346B5AB96481EAE704A1EB506@EXMB-03.ad.wsu.edu> Have you considered a routing queue? Randall Svancara High Performance Computing Systems Administrator Washington State University 509-335-3039 -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Jim Kusznir Sent: Monday, August 20, 2012 12:24 PM To: torqueusers at supercluster.org; mauiusers at supercluster.org Subject: [torqueusers] Node Priority Hi all: I've recently updated my cluster and added some more nodes. 
I now have three categories of nodes: 24 8-core intel nodes (features: intel) 6 16-core AMD nodes with infiniband (features: amd,infiniband) 4 64-core AMD nodes (features: amd, smp) Some of my users don't submit a feature request and don't much care where they get dumped; some users do supply the feature set. I would like the priority of non-featured jobs that can go anywhere to be in the order above. However, right now, they all go to my most limited node type: the 64 core nodes. How do I change the order when "its all the same" to the scheduler? Thanks! --Jim _______________________________________________ torqueusers mailing list torqueusers at supercluster.org http://www.supercluster.org/mailman/listinfo/torqueusers From akohlmey at cmm.chem.upenn.edu Mon Aug 20 14:17:28 2012 From: akohlmey at cmm.chem.upenn.edu (Axel Kohlmeyer) Date: Mon, 20 Aug 2012 22:17:28 +0200 Subject: [torqueusers] [Mauiusers] Node Priority In-Reply-To: References: Message-ID: On Mon, Aug 20, 2012 at 9:23 PM, Jim Kusznir wrote: > Hi all: > > I've recently updated my cluster and added some more nodes. I now > have three categories of nodes: > > 24 8-core intel nodes (features: intel) > 6 16-core AMD nodes with infiniband (features: amd,infiniband) > 4 64-core AMD nodes (features: amd, smp) > > Some of my users don't submit a feature request and don't much care > where they get dumped; some users do supply the feature set. > > I would like the priority of non-featured jobs that can go anywhere to > be in the order above. However, right now, they all go to my most > limited node type: the 64 core nodes. > > How do I change the order when "its all the same" to the scheduler? usually nodes are handed out in the reverse order they are listed in the node file. just try to order the nodes in that file accordingly and see if that helps. cheers, axel. > > Thanks! 
> --Jim > _______________________________________________ > mauiusers mailing list > mauiusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/mauiusers -- Dr. Axel Kohlmeyer akohlmey at gmail.com http://sites.google.com/site/akohlmey/ Institute for Computational Molecular Science Temple University, Philadelphia PA, USA. From rsvancara at wsu.edu Mon Aug 20 17:15:44 2012 From: rsvancara at wsu.edu (Svancara, Randall) Date: Mon, 20 Aug 2012 23:15:44 +0000 Subject: [torqueusers] [Mauiusers] Node Priority In-Reply-To: References: Message-ID: <1F880D7A2494B346B5AB96481EAE704A1EC6E4@EXMB-03.ad.wsu.edu> sort -r for the win!! Randall Svancara High Performance Computing Systems Administrator Washington State University 509-335-3039 -----Original Message----- From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Axel Kohlmeyer Sent: Monday, August 20, 2012 1:17 PM To: Jim Kusznir Cc: mauiusers at supercluster.org; torqueusers at supercluster.org Subject: Re: [torqueusers] [Mauiusers] Node Priority On Mon, Aug 20, 2012 at 9:23 PM, Jim Kusznir wrote: > Hi all: > > I've recently updated my cluster and added some more nodes. I now > have three categories of nodes: > > 24 8-core intel nodes (features: intel) > 6 16-core AMD nodes with infiniband (features: amd,infiniband) > 4 64-core AMD nodes (features: amd, smp) > > Some of my users don't submit a feature request and don't much care > where they get dumped; some users do supply the feature set. > > I would like the priority of non-featured jobs that can go anywhere to > be in the order above. However, right now, they all go to my most > limited node type: the 64 core nodes. > > How do I change the order when "its all the same" to the scheduler? usually nodes are handed out in the reverse order they are listed in the node file. just try to order the nodes in that file accordingly and see if that helps. cheers, axel. > > Thanks! 
> --Jim > _______________________________________________ > mauiusers mailing list > mauiusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/mauiusers -- Dr. Axel Kohlmeyer akohlmey at gmail.com http://sites.google.com/site/akohlmey/ Institute for Computational Molecular Science Temple University, Philadelphia PA, USA. From andre.gemuend at scai.fraunhofer.de Mon Aug 20 23:55:46 2012 From: andre.gemuend at scai.fraunhofer.de (André Gemünd) Date: Tue, 21 Aug 2012 07:55:46 +0200 (CEST) Subject: [torqueusers] Node Priority In-Reply-To: Message-ID: <1044094277.5993412.1345528546150.JavaMail.root@scai.fraunhofer.de> If you use Maui, you have a lot of powerful tools to deal with that. The easiest way would be to specify MINRESOURCE node allocation, which puts jobs on the nodes with the fewest resources that still meet the job's constraints. That should automatically put all "normal" jobs on the small nodes. Then you can add prioritization with node features / classes / groups, etc. I'd also prioritize the jobs with requirements globally, and use the other ones to backfill. Greetings André ----- Original Message ----- > Hi all: > > I've recently updated my cluster and added some more nodes. I now > have three categories of nodes: > > 24 8-core intel nodes (features: intel) > 6 16-core AMD nodes with infiniband (features: amd,infiniband) > 4 64-core AMD nodes (features: amd, smp) > > Some of my users don't submit a feature request and don't much care > where they get dumped; some users do supply the feature set. > > I would like the priority of non-featured jobs that can go anywhere to > be in the order above. However, right now, they all go to my most > limited node type: the 64 core nodes. > > How do I change the order when "its all the same" to the scheduler? 
> > Thanks! > --Jim > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -- André Gemünd Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemuend at scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend From pablo.fernandez at cscs.ch Fri Aug 24 02:01:30 2012 From: pablo.fernandez at cscs.ch (Pablo Fernandez) Date: Fri, 24 Aug 2012 10:01:30 +0200 Subject: [torqueusers] Torque 2.5 end-of-support Message-ID: <503734DA.8020702@cscs.ch> Dear all, Does anybody know when support for the Torque 2.5.x series is expected to end? I have checked here: http://www.adaptivecomputing.com/support/download-center/torque-download/ and it does not say explicitly. Thanks, Pablo From pablo.fernandez at cscs.ch Fri Aug 24 02:47:59 2012 From: pablo.fernandez at cscs.ch (Pablo Fernandez) Date: Fri, 24 Aug 2012 10:47:59 +0200 Subject: [torqueusers] Torque 2.5 end-of-support In-Reply-To: <50373B8E.5010102@cyf-kr.edu.pl> References: <503734DA.8020702@cscs.ch> <50373B8E.5010102@cyf-kr.edu.pl> Message-ID: <50373FBF.1070304@cscs.ch> Yes, I read that... and there is no explicit end date, while other series have one. It seems strange that 2.4 and 3.0 are being discontinued, but not 2.5, hence the question. What I would like to know is whether work is being done on the 2.5 series, and whether it will continue for the foreseeable future. Thanks, pablo On 08/24/2012 10:30 AM, Lukasz Flis wrote: > Hi *, > > http://www.adaptivecomputing.com/support/download-center/torque-download/ > > " > The 2.5.x branch is the current maintenance build and will continue to > have minor features and functionality added to it. The build has been > deployed successfully in many environments and is considered reliable. > Build torque-2.5.12.tar.gz is the latest release and the one we > recommend most sites use." 
> > > Version 3 was in fact the NUMA development branch (2.5 with NUMA awareness) > replaced later by 4.x > > On 24.08.2012 10:01, Pablo Fernandez wrote: >> Dear all, >> >> Does anybody know when is the expected date to end support to Torque >> 2.5.x series? >> >> I have checked here: >> http://www.adaptivecomputing.com/support/download-center/torque-download/ and >> it does not say explicitly. >> >> Thanks, >> Pablo >> >> _______________________________________________ >> torqueusers mailing list >> torqueusers at supercluster.org >> http://www.supercluster.org/mailman/listinfo/torqueusers >> From jkusznir at gmail.com Fri Aug 24 18:01:15 2012 From: jkusznir at gmail.com (Jim Kusznir) Date: Fri, 24 Aug 2012 17:01:15 -0700 Subject: [torqueusers] extracting a saved serverdb Message-ID: Hi all: I recently rebuilt my head node with a much newer version of everything. I realized I forgot to dump my torque config to a human-readable text file before doing so, but I did grab my torque server_priv directory. Is there a way for me to dump the db in a human-readable format without disturbing my running torque copy on the head node (e.g., not copying it over and restarting torque)? Thanks! --Jim From arkaaloke at gmail.com Mon Aug 27 13:50:29 2012 From: arkaaloke at gmail.com (Arka Aloke Bhattacharya) Date: Mon, 27 Aug 2012 12:50:29 -0700 Subject: [torqueusers] health check Message-ID: Hi, I was configuring torque on a 100-server cluster. I was wondering how common a practice it is to configure a PBS_MOM to use a health-check script? How does one ensure that the health-check script covers all eventualities? Can you give me advice on the most common types of failures that a health-check script usually detects? Thanks a lot, Arka. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120827/3cba508f/attachment.html From lloyd_brown at byu.edu Mon Aug 27 13:58:16 2012 From: lloyd_brown at byu.edu (Lloyd Brown) Date: Mon, 27 Aug 2012 13:58:16 -0600 Subject: [torqueusers] health check In-Reply-To: References: Message-ID: <503BD158.8010302@byu.edu>

We use it. Generally we check for things like:

- host's firmware up to date (based on a list of "current" that we create)
- host's local hard drive not full
- host's local filesystems are writable
- host can connect to NFS central storage
- host's InfiniBand interface is up
- host doesn't have a sudden drop in total physical memory since last time it booted (indicates DIMM failure)
- host's CPUs are not being throttled (pulled from "rdmsr" somehow)

One thing we have run into is that when checks block inside the health check script and something takes too long, pbs_mom becomes unresponsive, which can cause scheduling problems. The simple work-around is to have the individual checks put their output/status in some sort of file or local database, and then have the health check just grab the most recent statuses from those files/databases. Of course you'll also want to deal with the situation where the most recent output is "too old" (whatever that means in your environment).

Lloyd Brown Systems Administrator Fulton Supercomputing Lab Brigham Young University http://marylou.byu.edu

On 08/27/2012 01:50 PM, Arka Aloke Bhattacharya wrote: > Hi, > > I was configuring torque on a 100-server cluster. > I was wondering how common of a practice is it to configure a PBS_MOM to > use a health-check script ? > How does one ensure that the health-check script covers all eventualities ? > Can you give me advice help regarding what are the most common types of > failures that the health-check script usually detects ? > > Thanks a lot, > Arka. 
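The work-around Lloyd describes (individual checks write their status to files, and the health check only reads the most recent results) could be sketched roughly as below. This is a minimal sketch under stated assumptions, not anyone's production script: the directory /var/run/nodecheck, the one-file-per-check "*.status" layout, the "OK" / "ERROR <reason>" file contents, the function name, and the 600-second staleness limit are all hypothetical. It also assumes GNU stat. The failure message starts with ERROR because, as far as I know, that is the keyword pbs_mom looks for in health check output; please verify against the documentation for your Torque version.

```shell
#!/bin/bash
# Hypothetical layout: each background check writes a single line,
# "OK" or "ERROR <reason>", to $STATUS_DIR/<checkname>.status.
STATUS_DIR="${STATUS_DIR:-/var/run/nodecheck}"   # assumed location
MAX_AGE="${MAX_AGE:-600}"                        # assumed "too old" limit, in seconds

check_statuses() {
    local now f age line name
    now=$(date +%s)
    for f in "$STATUS_DIR"/*.status; do
        # An unmatched glob stays literal, so this also catches "no files at all".
        if [ ! -e "$f" ]; then
            echo "ERROR no status files in $STATUS_DIR"
            return 1
        fi
        name=$(basename "$f" .status)
        # File modification time tells us how old the last result is (GNU stat).
        age=$(( now - $(stat -c %Y "$f") ))
        if [ "$age" -gt "$MAX_AGE" ]; then
            echo "ERROR $name result is stale (${age}s old)"
            return 1
        fi
        # Only the first line of the status file is considered.
        line=""
        read -r line < "$f" || true
        case "$line" in
            OK*) ;;
            *) echo "ERROR $name: $line"; return 1 ;;
        esac
    done
    return 0
}
```

The health check script itself would then only call check_statuses, which never blocks on NFS, InfiniBand, or firmware tools; the slow checks run separately (for example from cron) and refresh their status files.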
> > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From shade34321 at gmail.com Mon Aug 27 14:19:59 2012 From: shade34321 at gmail.com (Shade Alabsa) Date: Mon, 27 Aug 2012 16:19:59 -0400 Subject: [torqueusers] Gathering accounting information Message-ID: I'm fairly new to Torque and I was trying to get a feel of how other people get their accounting information from Torque's accounting logs. I know it's possible to write a script that parses those logs but my scripting skills just aren't there yet and limited and previously we were using gold but due to complications after updating it Gold no longer works. I've looked for tools through google but most seem out of date or I couldn't find some sort of source code to look at. Thanks! Shade Alabsa -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120827/8ec4974d/attachment.html From mej at lbl.gov Mon Aug 27 14:54:13 2012 From: mej at lbl.gov (Michael Jennings) Date: Mon, 27 Aug 2012 13:54:13 -0700 Subject: [torqueusers] health check In-Reply-To: References: Message-ID: <20120827205411.GN30193@lbl.gov> On Monday, 27 August 2012, at 12:50:29 (-0700), Arka Aloke Bhattacharya wrote: > I was configuring torque on a 100-server cluster. > I was wondering how common of a practice is it to configure a PBS_MOM to > use a health-check script ? It's quite common. > How does one ensure that the health-check script covers all eventualities ? Experience. ;-) > Can you give me advice help regarding what are the most common types > of failures that the health-check script usually detects ? 
At the risk of self-promotion, I recommend you check out our NHC project: http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check It's specifically engineered to be cross-site, portable, flexible, and support any check you can write in a shell script. And it has specific safeguards in place to protect against the type of lockup issue Lloyd cited in his prior e-mail. Give it a look-see; feedback is most welcome! And for those who are already using it, I know I've been quiet, but the new release will be out very soon with some great new features! :-) HTH, Michael -- Michael Jennings Senior HPC Systems Engineer High-Performance Computing Services Lawrence Berkeley National Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F: 510-486-8615 From gus at ldeo.columbia.edu Mon Aug 27 16:26:54 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 27 Aug 2012 18:26:54 -0400 Subject: [torqueusers] Gathering accounting information In-Reply-To: References: Message-ID: <503BF42E.5030107@ldeo.columbia.edu> On 08/27/2012 04:19 PM, Shade Alabsa wrote: > I'm fairly new to Torque and I was trying to get a feel of how other > people get their accounting information from Torque's accounting logs. > I know it's possible to write a script that parses those logs but my > scripting skills just aren't there yet and limited and previously we > were using gold but due to complications after updating it Gold no > longer works. I've looked for tools through google but most seem out > of date or I couldn't find some sort of source code to look at. Thanks! > > Shade Alabsa > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers Hi Shade Ole Holm Nielsen wrote a simple little set of bash scripts [pbsacct, pestat] that parse and extract information from the accounting logs. 
That may be a good starting point: ftp://ftp.fysik.dtu.dk/pub/Torque/ What I like most about them is that it is easy to tweak with the scripts to show the information you need, formatted the way you want. For a visual snapshot of cluster use, I like Fotis Georgatos' 'qtop', which is also just a set of shell scripts built on top of Torque's qstat command, and displays the cluster occupancy in ASCII art: http://fotis.web.cern.ch/fotis/QTOP/ Both tools are simple, no-frills, text based, customizable, but effective, IMHO. I hope this helps. Gus Correa From craig.tierney at noaa.gov Mon Aug 27 16:29:39 2012 From: craig.tierney at noaa.gov (Craig Tierney) Date: Mon, 27 Aug 2012 16:29:39 -0600 Subject: [torqueusers] Torque 2.5 end-of-support In-Reply-To: <50373FBF.1070304@cscs.ch> References: <503734DA.8020702@cscs.ch> <50373B8E.5010102@cyf-kr.edu.pl> <50373FBF.1070304@cscs.ch> Message-ID: <503BF4D3.6050604@noaa.gov> On 8/24/12 2:47 AM, Pablo Fernandez wrote: > Yes, I read that... and there is no explicit date of end, when others have. > > It seems strange that 2.4 and 3.0 are being discontinued, but not 2.5, > hence the question. What I would like to know is that there is work > being done on the 2.5 series, and it will be in the foreseeable future. > > Thanks, > pablo > Pablo, If I understand what I was told, the 3.0 branch was a fork of the 2.5 for issues related to NUMA and CPUSets. If you aren't using CPUsets, then you should be using 2.5.X. Since 4.X is out and provides CPUsets, I think Adaptive is trying to push people off of the separate 3.X branch and get them to the branch they are maintaining for everything. Craig > > On 08/24/2012 10:30 AM, Lukasz Flis wrote: >> Hi *, >> >> http://www.adaptivecomputing.com/support/download-center/torque-download/ >> >> " >> The 2.5.x branch is the current maintenance build and will continue to >> have minor features and functionality added to it. 
The build has been >> deployed successfully in many environments and is considered reliable. >> Build torque-2.5.12.tar.gz is the latest release and the one we >> recommend most sites use." >> >> >> Version 3 was in fact numa development branch (2.5 with numa awareness) >> replaced later by 4.x >> >> On 24.08.2012 10:01, Pablo Fernandez wrote: >>> Dear all, >>> >>> Does anybody know when is the expected date to end support to Torque >>> 2.5.x series? >>> >>> I have checked here: >>> http://www.adaptivecomputing.com/support/download-center/torque-download/ and >>> it does not say explicitly. >>> >>> Thanks, >>> Pablo >>> >>> _______________________________________________ >>> torqueusers mailing list >>> torqueusers at supercluster.org >>> http://www.supercluster.org/mailman/listinfo/torqueusers >>> > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > From shade34321 at gmail.com Mon Aug 27 16:30:56 2012 From: shade34321 at gmail.com (Shade Alabsa) Date: Mon, 27 Aug 2012 18:30:56 -0400 Subject: [torqueusers] Gathering accounting information In-Reply-To: <503BF42E.5030107@ldeo.columbia.edu> References: <503BF42E.5030107@ldeo.columbia.edu> Message-ID: Thanks for replying. I have looked Ole Holm Nielsen script and when I tried running it the script told me I had no accounting files. I've started looking at the script but it's slow progress with me. I will look at the other one as well. Thanks! Shade On Mon, Aug 27, 2012 at 6:26 PM, Gus Correa wrote: > On 08/27/2012 04:19 PM, Shade Alabsa wrote: > > I'm fairly new to Torque and I was trying to get a feel of how other > > people get their accounting information from Torque's accounting logs. 
> > I know it's possible to write a script that parses those logs but my > > scripting skills just aren't there yet and limited and previously we > > were using gold but due to complications after updating it Gold no > > longer works. I've looked for tools through google but most seem out > > of date or I couldn't find some sort of source code to look at. Thanks! > > > > Shade Alabsa > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > Hi Shade > > Ole Holm Nielsen wrote a simple little set of bash scripts > [pbsacct, pestat] that parse and extract information from the accounting > logs. > That may be a good starting point: > > ftp://ftp.fysik.dtu.dk/pub/Torque/ > > What I like most about them is that it is easy to tweak > with the scripts to show the information you need, > formatted the way you want. > > For a visual snapshot of cluster use, I like Fotis Georgatos' > 'qtop', which is also just a set of shell scripts built > on top of Torque's qstat command, and displays the > cluster occupancy in ASCII art: > > http://fotis.web.cern.ch/fotis/QTOP/ > > Both tools are simple, no-frills, text based, customizable, > but effective, IMHO. > > I hope this helps. > Gus Correa > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120827/60495715/attachment.html From gus at ldeo.columbia.edu Mon Aug 27 16:43:16 2012 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 27 Aug 2012 18:43:16 -0400 Subject: [torqueusers] Gathering accounting information In-Reply-To: References: <503BF42E.5030107@ldeo.columbia.edu> Message-ID: <503BF804.8040605@ldeo.columbia.edu> On 08/27/2012 06:30 PM, Shade Alabsa wrote: > Thanks for replying. I have looked Ole Holm Nielsen script and when I > tried running it the script told me I had no accounting files. I've > started looking at the script but it's slow progress with me. I will > look at the other one as well. Thanks! > > Shade Hi Shade For the standard output report: pbsacct /path/to/your/server_priv/accounting/file_names which are probably somewhere in your Torque/PBS server. You can use shell file name wildcards to select a range of time in the file_names, say, 201207?? to get a report for July/2012. I hope it helps, Gus Correa > > On Mon, Aug 27, 2012 at 6:26 PM, Gus Correa > wrote: > > On 08/27/2012 04:19 PM, Shade Alabsa wrote: > > I'm fairly new to Torque and I was trying to get a feel of how other > > people get their accounting information from Torque's accounting > logs. > > I know it's possible to write a script that parses those logs but my > > scripting skills just aren't there yet and limited and previously we > > were using gold but due to complications after updating it Gold no > > longer works. I've looked for tools through google but most seem out > > of date or I couldn't find some sort of source code to look at. > Thanks! 
> > > > Shade Alabsa > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > Hi Shade > > Ole Holm Nielsen wrote a simple little set of bash scripts > [pbsacct, pestat] that parse and extract information from the > accounting > logs. > That may be a good starting point: > > ftp://ftp.fysik.dtu.dk/pub/Torque/ > > What I like most about them is that it is easy to tweak > with the scripts to show the information you need, > formatted the way you want. > > For a visual snapshot of cluster use, I like Fotis Georgatos' > 'qtop', which is also just a set of shell scripts built > on top of Torque's qstat command, and displays the > cluster occupancy in ASCII art: > > http://fotis.web.cern.ch/fotis/QTOP/ > > Both tools are simple, no-frills, text based, customizable, > but effective, IMHO. > > I hope this helps. > Gus Correa > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers From shade34321 at gmail.com Tue Aug 28 10:54:36 2012 From: shade34321 at gmail.com (Shade Alabsa) Date: Tue, 28 Aug 2012 12:54:36 -0400 Subject: [torqueusers] Gathering accounting information In-Reply-To: <503BF804.8040605@ldeo.columbia.edu> References: <503BF42E.5030107@ldeo.columbia.edu> <503BF804.8040605@ldeo.columbia.edu> Message-ID: Gus, I was changing into my Torque directory, /var/spool/torque/server_priv/accounting/ and from there running running pbsacct 201208* to get all accounting for the month of August. It produces the error, There are no accounting records in the input files. 
Has Torque changed they way the accounting is kept from when pbsacct was last updated or is my case just special? I plan on looking at the code more closely today to try and figure out what's going on and how to manipulate it to what I need. Thanks! Shade On Mon, Aug 27, 2012 at 6:43 PM, Gus Correa wrote: > On 08/27/2012 06:30 PM, Shade Alabsa wrote: > > Thanks for replying. I have looked Ole Holm Nielsen script and when I > > tried running it the script told me I had no accounting files. I've > > started looking at the script but it's slow progress with me. I will > > look at the other one as well. Thanks! > > > > Shade > > Hi Shade > > For the standard output report: > > pbsacct /path/to/your/server_priv/accounting/file_names > > which are probably somewhere in your Torque/PBS server. > > You can use shell file name wildcards to select a range of time in the > file_names, > say, 201207?? to get a report for July/2012. > > I hope it helps, > Gus Correa > > > > > > On Mon, Aug 27, 2012 at 6:26 PM, Gus Correa > > wrote: > > > > On 08/27/2012 04:19 PM, Shade Alabsa wrote: > > > I'm fairly new to Torque and I was trying to get a feel of how > other > > > people get their accounting information from Torque's accounting > > logs. > > > I know it's possible to write a script that parses those logs but > my > > > scripting skills just aren't there yet and limited and previously > we > > > were using gold but due to complications after updating it Gold no > > > longer works. I've looked for tools through google but most seem > out > > > of date or I couldn't find some sort of source code to look at. > > Thanks! 
> > > > > > Shade Alabsa > > > > > > > > > _______________________________________________ > > > torqueusers mailing list > > > torqueusers at supercluster.org > > > http://www.supercluster.org/mailman/listinfo/torqueusers > > Hi Shade > > > > Ole Holm Nielsen wrote a simple little set of bash scripts > > [pbsacct, pestat] that parse and extract information from the > > accounting > > logs. > > That may be a good starting point: > > > > ftp://ftp.fysik.dtu.dk/pub/Torque/ > > > > What I like most about them is that it is easy to tweak > > with the scripts to show the information you need, > > formatted the way you want. > > > > For a visual snapshot of cluster use, I like Fotis Georgatos' > > 'qtop', which is also just a set of shell scripts built > > on top of Torque's qstat command, and displays the > > cluster occupancy in ASCII art: > > > > http://fotis.web.cern.ch/fotis/QTOP/ > > > > Both tools are simple, no-frills, text based, customizable, > > but effective, IMHO. > > > > I hope this helps. > > Gus Correa > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > > > > > > > > > _______________________________________________ > > torqueusers mailing list > > torqueusers at supercluster.org > > http://www.supercluster.org/mailman/listinfo/torqueusers > > _______________________________________________ > torqueusers mailing list > torqueusers at supercluster.org > http://www.supercluster.org/mailman/listinfo/torqueusers > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120828/4cd09973/attachment-0001.html From Ole.H.Nielsen at fysik.dtu.dk Wed Aug 29 03:28:27 2012 From: Ole.H.Nielsen at fysik.dtu.dk (Ole Holm Nielsen) Date: Wed, 29 Aug 2012 11:28:27 +0200 Subject: [torqueusers] Missing Torque job accounting information In-Reply-To: References: Message-ID: <503DE0BB.5060706@fysik.dtu.dk>

Hi Shade (Cc: torqueusers list)

I looked at the Torque accounting file you sent to me, and strangely your file doesn't contain any "resources_used" data whatsoever. Because missing data is equivalent to zero usage, the pbsacct script reports that there are NO accounting records from your cluster!

In your accounting file please consider this example line (jobs that Ended will have an ;E; in field no. 2):

08/26/2012 06:11:00;E;2043.mgt.cluster.net;user=masquelet group=users jobname=DAVIS queue=batch ctime=1345845922 qtime=1345845922 etime=1345845922 start=1345889422 owner=masquelet at node256 exec_host=(list truncated/OHN) Resource_List.neednodes=60:ppn=4 Resource_List.nodect=60 Resource_List.nodes=60:ppn=4 Resource_List.walltime=24:00:00 session=8892 end=1345975860 Exit_status=-10

In a correct example accounting file entry from our cluster you'll see additional usage information at the end of the line:

(data omitted) Exit_status=0 resources_used.cput=143:56:09 resources_used.mem=40639016kb resources_used.vmem=56211260kb resources_used.walltime=03:36:06

Questions to the Torqueusers list
---------------------------------

Under what circumstances may jobs that finish have an "E" entry in the accounting file which is missing all the resources_used data? Is this due to a sysadmin misconfiguration? How can this error be corrected?

Shade: Will you kindly reply to the list with information about your version of Torque (do a "qstat --version").

Regards, Ole

On 08/28/2012 09:59 PM, Shade Alabsa wrote: > I recently came across your script for torque accounting. 
> I have tried to use it with the current version of Torque and it
> complains that there are no accounting records in the input files and
> I was wondering if you could help me in figuring out what I'm doing
> wrong. Below is a copy of what I'm doing,
>
> [root at mgt accounting]# pwd
> /var/spool/torque/server_priv/accounting
> [root at mgt accounting]# pbsacct 2012****

Shade: The "*" matches any string of zero or more characters, so "****"
is redundant (it behaves like a single "*") and is bad practice. The
correct usage would be "pbsacct 2012????", where "?" matches exactly 1
character. You could also do "pbsacct 201207??" to select just the
files from the month of July.

The pbsacct package includes a script pbsreportmonth which generates
monthly accounting statistics. This is quite useful.

>
> Portable Batch System accounting statistics
> -------------------------------------------
>
> Processing a total of 69 accounting files... done.
> /usr/local/bin/pbsacct ERROR: There are no accounting records in the
> input files:
> 20120621 20120622 20120623 20120624 20120625 20120626 20120627 20120628
> 20120629 20120630 20120701 20120702 20120703 20120704 20120705 20120706
> 20120707 20120708 20120709 20120710 20120711 20120712 20120713 20120714
> 20120715 20120716 20120717 20120718 20120719 20120720 20120721 20120722
> 20120723 20120724 20120725 20120726 20120727 20120728 20120729 20120730
> 20120731 20120801 20120802 20120803 20120804 20120805 20120806 20120807
> 20120808 20120809 20120810 20120811 20120812 20120813 20120814 20120815
> 20120816 20120817 20120818 20120819 20120820 20120821 20120822 20120823
> 20120824 20120825 20120826 20120827 20120828
>
> The four stars are because we installed torque in June and we need the
> accounting since then, and the logs only include dates since then.
> Also attached is an accounting file if that helps at all. Thank you!
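
The wildcard advice and the missing "resources_used" check in this message can be tried out safely on throwaway files. The following is only a minimal sketch, assuming bash; the temporary directory, the file 2012-notes.txt and the two accounting records are made up for illustration and are not taken from any real cluster.

```shell
#!/bin/bash
# Illustration only: the directory, file names, and records are made up.
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp"

# Two fake daily accounting files (YYYYMMDD) plus one unrelated file.
echo '08/26/2012 06:11:00;E;1.example;user=u1 Exit_status=-10' > 20120826
echo '08/27/2012 07:00:00;E;2.example;user=u2 Exit_status=0 resources_used.walltime=03:36:06' > 20120827
touch 2012-notes.txt

# "?" matches exactly one character: only the YYYYMMDD names are selected.
exact=(2012????)
# "*" matches zero or more characters, so 2012**** behaves like 2012*
# and also picks up the unrelated file.
loose=(2012****)
echo "2012???? matches ${#exact[@]} files"
echo "2012**** matches ${#loose[@]} files"

# Count ended-job (;E;) records that carry no resources_used data.
missing=$(cat 2012???? | grep ';E;' | grep -vc 'resources_used\.' || true)
echo "E records without resources_used: $missing"
```

The same grep pipeline could be pointed at real files under server_priv/accounting to see how many "E" records lack usage data before running pbsacct.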
--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

From anthony.schreiner at bc.edu Wed Aug 29 08:38:33 2012
From: anthony.schreiner at bc.edu (Tony Schreiner)
Date: Wed, 29 Aug 2012 14:38:33 +0000
Subject: [torqueusers] pbs_server not keeping up
Message-ID: <28F4B3452639F040B99A1738C510B50B068945F8@EBHAZARD01.bc.edu>

On my smallish cluster with torque 2.5.7.

A user submitted about 8000 jobs to a routing queue, which feeds to an execution queue with 200 runnable slots.

At the moment, pbs_server is unable to handle it, pbsnodes returns no nodes found, qstat -q takes a long time and shows nothing.
This is the tail of the latest server_logs file

08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux24 not detected in 1346250363 seconds, marking node down
08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux25 not detected in 1346250363 seconds, marking node down
08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux27 not detected in 1346250363 seconds, marking node down
08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux29 not detected in 1346250363 seconds, marking node down
08/29/2012 10:26:40;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=51)
08/29/2012 10:27:18;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=51)
08/29/2012 10:27:18;000d;PBS_Server;Job;797357.portal;Post job file processing error; job 797357.portal on host linux29/1
08/29/2012 10:27:18;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=53)
08/29/2012 10:28:33;0004;PBS_Server;Svr;check_nodes;node linux20 not detected in 1346250513 seconds, marking node down

here are some server settings

set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server resources_default.nodect = 1
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 86400
set server log_keep_days = 365
set server next_job_number = 808497
set server job_log_keep_days = 365

Is there anything I can change to help move things along?
Thanks

Tony Schreiner

From anthony.schreiner at bc.edu Wed Aug 29 09:35:53 2012
From: anthony.schreiner at bc.edu (Tony Schreiner)
Date: Wed, 29 Aug 2012 15:35:53 +0000
Subject: [torqueusers] pbs_server not keeping up
In-Reply-To: <28F4B3452639F040B99A1738C510B50B068945F8@EBHAZARD01.bc.edu>
References: <28F4B3452639F040B99A1738C510B50B068945F8@EBHAZARD01.bc.edu>
Message-ID: <28F4B3452639F040B99A1738C510B50B068948E2@EBHAZARD01.bc.edu>

On Aug 29, 2012, at 10:38 AM, Tony Schreiner wrote:

> On my smallish cluster with torque 2.5.7.
>
> A user submitted about 8000 jobs to a routing queue, which feeds to an execution queue with 200 runnable slots.
>
> At the moment, pbs_server is unable to handle it, pbsnodes returns no nodes found, qstat -q takes a long time and shows nothing.
> This is the tail of the latest server_logs file
>
> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux24 not detected in 1346250363 seconds, marking node down
> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux25 not detected in 1346250363 seconds, marking node down
> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux27 not detected in 1346250363 seconds, marking node down
> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux29 not detected in 1346250363 seconds, marking node down
> 08/29/2012 10:26:40;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=51)
> 08/29/2012 10:27:18;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=51)
> 08/29/2012 10:27:18;000d;PBS_Server;Job;797357.portal;Post job file processing error; job 797357.portal on host linux29/1
> 08/29/2012 10:27:18;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=53)
> 08/29/2012 10:28:33;0004;PBS_Server;Svr;check_nodes;node linux20 not
> detected in 1346250513 seconds, marking node down
>
> here are some server settings
>
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server resources_default.ncpus = 1
> set server resources_default.nodect = 1
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server mom_job_sync = True
> set server keep_completed = 86400
> set server log_keep_days = 365
> set server next_job_number = 808497
> set server job_log_keep_days = 365
>
>
> is there anything I can change to help move things along.
> Thanks
>
> Tony Schreiner

Addendum, it seems to have more to do with the number of entries in the server_priv/jobs directory. There were about 50,000 in there. When I deleted the older ones (about half), operation returned to normal. I'm going to reduce keep_completed, at least temporarily.

Tony

From gus at ldeo.columbia.edu Wed Aug 29 10:07:28 2012
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Wed, 29 Aug 2012 12:07:28 -0400
Subject: [torqueusers] pbs_server not keeping up
In-Reply-To: <28F4B3452639F040B99A1738C510B50B068948E2@EBHAZARD01.bc.edu>
References: <28F4B3452639F040B99A1738C510B50B068945F8@EBHAZARD01.bc.edu> <28F4B3452639F040B99A1738C510B50B068948E2@EBHAZARD01.bc.edu>
Message-ID: <503E3E40.1090807@ldeo.columbia.edu>

On 08/29/2012 11:35 AM, Tony Schreiner wrote:
> On Aug 29, 2012, at 10:38 AM, Tony Schreiner wrote:
>
>> On my smallish cluster with torque 2.5.7.
>>
>> A user submitted about 8000 jobs to a routing queue, which feeds to an execution queue with 200 runnable slots.
>>
>> At the moment, pbs_server is unable to handle it, pbsnodes returns no nodes found, qstat -q takes a long time and shows nothing.
>> This is the tail of the latest server_logs file
>>
>> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux24 not detected in 1346250363 seconds, marking node down
>> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux25 not detected in 1346250363 seconds, marking node down
>> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux27 not detected in 1346250363 seconds, marking node down
>> 08/29/2012 10:26:03;0004;PBS_Server;Svr;check_nodes;node linux29 not detected in 1346250363 seconds, marking node down
>> 08/29/2012 10:26:40;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=51)
>> 08/29/2012 10:27:18;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=51)
>> 08/29/2012 10:27:18;000d;PBS_Server;Job;797357.portal;Post job file processing error; job 797357.portal on host linux29/1
>> 08/29/2012 10:27:18;0008;PBS_Server;Job;NULL;on_job_exit valid pjob: 0xe3d4180 (substate=53)
>> 08/29/2012 10:28:33;0004;PBS_Server;Svr;check_nodes;node linux20 not detected in 1346250513 seconds, marking node down
>>
>> here are some server settings
>>
>> set server log_events = 511
>> set server mail_from = adm
>> set server query_other_jobs = True
>> set server resources_default.ncpus = 1
>> set server resources_default.nodect = 1
>> set server scheduler_iteration = 600
>> set server node_check_rate = 150
>> set server tcp_timeout = 6
>> set server mom_job_sync = True
>> set server keep_completed = 86400
>> set server log_keep_days = 365
>> set server next_job_number = 808497
>> set server job_log_keep_days = 365
>>
>>
>> is there anything I can change to help move things along.
>> Thanks
>>
>> Tony Schreiner
> Addendum, it seems to have more to do with the number of entries in the server_priv/jobs directory. There were about 50,000 in there. When I deleted the older ones (about half), operation returned to normal. I'm going to reduce keep_completed, at least temporarily.
>
> Tony
>
>
Hi Tony

Have you tried to set the max_queuable or max_user_queuable attribute
of your execution queue?

http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/4.1queueconfig.php#attributes

I guess this will throttle the routing-to-execution queue job flux,
and reduce the clutter.

Gus Correa

From mej at lbl.gov Wed Aug 29 10:05:18 2012
From: mej at lbl.gov (Michael Jennings)
Date: Wed, 29 Aug 2012 09:05:18 -0700
Subject: [torqueusers] Missing Torque job accounting information
In-Reply-To: <503DE0BB.5060706@fysik.dtu.dk>
References: <503DE0BB.5060706@fysik.dtu.dk>
Message-ID: <20120829160515.GH8827@lbl.gov>

On Wednesday, 29 August 2012, at 11:28:27 (+0200),
Ole Holm Nielsen wrote:

> Questions to the Torqueusers list
> ---------------------------------
>
> Under what circumstances may jobs that finish have an "E" entry in the
> accounting file which is missing all the resources_used data?
>
> Is this due to a sysadmin misconfiguration? How can this error be
> corrected?

This is a known bug in TORQUE 4.1 and should be fixed in 4.1.1. We
filed a bug with Adaptive Support on this after 4.1.0 came out, and
we've just confirmed the fix in 4.1.1 within the last couple weeks.

Users with a support contract through Adaptive should consult with
them before making any version changes, but others running 4.1.x will
want to move to 4.1.1 (or the 4.1-fixes SVN branch) very soon.
Numerous issues have been resolved.
Michael

--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615

From anthony.schreiner at bc.edu Wed Aug 29 10:13:32 2012
From: anthony.schreiner at bc.edu (Tony Schreiner)
Date: Wed, 29 Aug 2012 16:13:32 +0000
Subject: [torqueusers] pbs_server not keeping up
In-Reply-To: <503E3E40.1090807@ldeo.columbia.edu>
References: <28F4B3452639F040B99A1738C510B50B068945F8@EBHAZARD01.bc.edu> <28F4B3452639F040B99A1738C510B50B068948E2@EBHAZARD01.bc.edu> <503E3E40.1090807@ldeo.columbia.edu>
Message-ID: <28F4B3452639F040B99A1738C510B50B06894AAF@EBHAZARD01.bc.edu>

On Aug 29, 2012, at 12:07 PM, Gus Correa wrote:

> On 08/29/2012 11:35 AM, Tony Schreiner wrote:
>> On Aug 29, 2012, at 10:38 AM, Tony Schreiner wrote:
>>
>>> On my smallish cluster with torque 2.5.7.
>>>
>>> A user submitted about 8000 jobs to a routing queue, which feeds to an execution queue with 200 runnable slots.
>>>
>>> At the moment, pbs_server is unable to handle it, pbsnodes returns no nodes found, qstat -q takes a long time and shows nothing.
>>> This is the tail of the latest server_logs file
>>>
>>> ...
>>>
>>> is there anything I can change to help move things along.
>>> Thanks
>>>
>>> Tony Schreiner
>> Addendum, it seems to have more to do with the number of entries in the server_priv/jobs directory. There were about 50,000 in there. When I deleted the older ones (about half), operation returned to normal. I'm going to reduce keep_completed, at least temporarily.
>>
>> Tony
>>
>>
> Hi Tony
>
> Have you tried to set the max_queuable or max_user_queuable attribute
> of your execution queue?
> http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/4.1queueconfig.php#attributes
>
> I guess this will throttle the routing-to-execution queue job flux, and
> reduce the clutter.
>
> Gus Correa
> _______________________________________________

I mis-spoke earlier.
I had set max_user_queuable = 200, not runnable.

Tony Schreiner

From shade34321 at gmail.com Wed Aug 29 12:13:42 2012
From: shade34321 at gmail.com (Shade Alabsa)
Date: Wed, 29 Aug 2012 14:13:42 -0400
Subject: [torqueusers] Missing Torque job accounting information
In-Reply-To: <20120829160515.GH8827@lbl.gov>
References: <503DE0BB.5060706@fysik.dtu.dk> <20120829160515.GH8827@lbl.gov>
Message-ID:

Thank you for answering. I did not know that's what the ? did, I will
keep that in mind. As for what Torque version we are using it is
version 4.0.2. I've looked at some of my other log files as well and
none of them state resources used. I will keep looking since it was
only a brief look.

Shade

On Wed, Aug 29, 2012 at 12:05 PM, Michael Jennings wrote:

> On Wednesday, 29 August 2012, at 11:28:27 (+0200),
> Ole Holm Nielsen wrote:
>
> > Questions to the Torqueusers list
> > ---------------------------------
> >
> > Under what circumstances may jobs that finish have an "E" entry in the
> > accounting file which is missing all the resources_used data?
> >
> > Is this due to a sysadmin misconfiguration? How can this error be
> > corrected?
>
> This is a known bug in TORQUE 4.1 and should be fixed in 4.1.1. We
> filed a bug with Adaptive Support on this after 4.1.0 came out, and
> we've just confirmed the fix in 4.1.1 within the last couple weeks.
>
> Users with a support contract through Adaptive should consult with
> them before making any version changes, but others running 4.1.x will
> want to move to 4.1.1 (or the 4.1-fixes SVN branch) very soon.
> Numerous issues have been resolved.
>
> Michael
>
> --
> Michael Jennings
> Senior HPC Systems Engineer
> High-Performance Computing Services
> Lawrence Berkeley National Laboratory
> Bldg 50B-3209E W: 510-495-2687
> MS 050B-3209 F: 510-486-8615
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120829/0f12699a/attachment-0001.html

From knielson at adaptivecomputing.com Thu Aug 30 14:34:11 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 30 Aug 2012 14:34:11 -0600
Subject: [torqueusers] TORQUE 4.1.1 available
Message-ID:

TORQUE version 4.1.1 is now available for general download.

There were several bugs fixed in this version of TORQUE. Several
deadlock issues were fixed around the combination of job arrays and
routing queues. x11-forwarding was fixed for interactive jobs. There
were fixes for memory corruption and double free. There were 5 memory
leaks that were fixed. The mail feature was re-enabled. It had been
removed in earlier versions of TORQUE 4.x.

For a complete list of fixes see the CHANGELOG.

We want to thank The University of Michigan, NOAA, University of
Florida, LBNL and Cray for their help in finding and fixing many of
the bugs for this release. We also appreciate the contributions made
by others to the code base.

The tar ball for this release can be downloaded at the following URL.

http://www.adaptivecomputing.com/support/download-center/torque-download/torque-4.1.1.tar.gz

Thanks again for all of the help. The feedback from the community is
what makes TORQUE the best it can be.

Regards

Ken Nielson
Adaptive Computing
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120830/eb3da18a/attachment.html

From shantanugadgil at yahoo.com Fri Aug 31 00:12:02 2012
From: shantanugadgil at yahoo.com (Shantanu Gadgil)
Date: Thu, 30 Aug 2012 23:12:02 -0700 (PDT)
Subject: [torqueusers] TORQUE 4.1.1 available
In-Reply-To:
Message-ID: <1346393522.39784.YahooMailClassic@web120606.mail.ne1.yahoo.com>

Hi,

Has the 'allow submit as root' issue been resolved?
Ref: http://www.supercluster.org/pipermail/torqueusers/2012-March/014325.html

Thanks and Regards,
Shantanu

--- On Fri, 8/31/12, Ken Nielson wrote:

From: Ken Nielson
Subject: [torqueusers] TORQUE 4.1.1 available
To: "torqueusers" , "Torque Developers mailing list"
Date: Friday, August 31, 2012, 2:04 AM

TORQUE version 4.1.1 is now available for general download.

There were several bugs fixed in this version of TORQUE. Several
deadlock issues were fixed around the combination of job arrays and
routing queues. x11-forwarding was fixed for interactive jobs. There
were fixes for memory corruption and double free. There were 5 memory
leaks that were fixed. The mail feature was re-enabled. It had been
removed in earlier versions of TORQUE 4.x.

For a complete list of fixes see the CHANGELOG.

We want to thank The University of Michigan, NOAA, University of
Florida, LBNL and Cray for their help in finding and fixing many of
the bugs for this release. We also appreciate the contributions made
by others to the code base.

The tar ball for this release can be downloaded at the following URL.

http://www.adaptivecomputing.com/support/download-center/torque-download/torque-4.1.1.tar.gz

Thanks again for all of the help. The feedback from the community is
what makes TORQUE the best it can be.
Regards

Ken Nielson
Adaptive Computing

-----Inline Attachment Follows-----

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120830/610184af/attachment.html

From go-yoshimura at sstc.co.jp Fri Aug 31 06:50:16 2012
From: go-yoshimura at sstc.co.jp (Go Yoshimura)
Date: Fri, 31 Aug 2012 21:50:16 +0900
Subject: [torqueusers] prioritize a set of users for using a set of nodes.
Message-ID: <201208311250.AA14040@winxp-pc.sstc.co.jp>

Hello,

My name is Go Yoshimura.
I am using Torque 4.1.0 / Maui 3.3.1.

- My question is about prioritizing a set of users for using a certain set of nodes.

- Granted that
  (1) There are node groups "node group A" and "node group B"
      nodeA1,nodeA2,nodeA3 ....
      nodeB1,nodeB2,nodeB3 ....
  (2) There are user groups "user group A" and "user group B"
      userA1,userA2,userA3 ....
      userB1,userB2,userB3 ....

- What we want to do is
  (1) "user group A" uses "node group A"
  (2) "user group B" uses "node group B"
  (3) "user group A" can use "node group B" if "node group B" has no idle job of "user group B".
  (4) "user group B" can use "node group A" if "node group A" has no idle job of "user group A".

- I have succeeded in doing (1)/(2) by using partitions
  http://www.adaptivecomputing.com/resources/docs/maui/7.2partitions.php

  NODECFG[nodeA1] PARTITION=groupA
  NODECFG[nodeA2] PARTITION=groupA
  ..
  NODECFG[nodeB1] PARTITION=groupB
  NODECFG[nodeB2] PARTITION=groupB
  ..
  USERCFG[userA1] PLIST=groupA
  USERCFG[userA2] PLIST=groupA
  ..
  USERCFG[userB1] PLIST=groupB
  USERCFG[userB2] PLIST=groupB
  ..
  SYSCFG[base] PLIST=

- But (3)/(4) are difficult.
- I succeeded in doing (1a)/(2a)/(3a)/(4a)
  (1a) "user group A" mainly uses "node group A"
  (2a) "user group B" mainly uses "node group B"
  (3a) "user group A" can use "node group B" if "user group A" needs more nodes.
  (4a) "user group B" can use "node group A" if "user group B" needs more nodes.

  USERCFG[userA1] PLIST=groupA:groupB PDEF=groupA
  USERCFG[userA2] PLIST=groupA:groupB PDEF=groupA
  ..
  USERCFG[userB1] PLIST=groupB:groupA PDEF=groupB
  USERCFG[userB2] PLIST=groupB:groupA PDEF=groupB
  ..

  With (1a)/(2a)/(3a)/(4a), it is possible that all nodes are used by
  "user group A" while jobs of "user group B" are idle.

- How can we prioritize a set of users for using a certain set of nodes?

  if((node==node group A) && (user==user group A)) Priority +=100 ;
  if((node==node group B) && (user==user group B)) Priority +=100 ;

thank you
go

----
----
Go Yoshimura
Scalable Systems Co., Ltd.
Osaka Office HONMACHI-COLLABO Bldg. 4F, 4-4-2 Kita-kyuhoji-machi, Chuo-ku, Osaka 541-0057 Japan
Tel: 81-6-6224-4115
Tokyo Kojimachi Office BUREX Kojimachi 11F, 3-5-2 Kojimachi, Chiyoda-ku, Tokyo 102-0083 Japan
Tel: 81-3-5875-4718 Fax: 81-3-3237-7612

From brockp at umich.edu Fri Aug 31 12:41:39 2012
From: brockp at umich.edu (Brock Palen)
Date: Fri, 31 Aug 2012 14:41:39 -0400
Subject: [torqueusers] osc mpiexec and torque4
In-Reply-To: <5011F0C0.2090805@unimelb.edu.au>
References: <20120725160613.GF5670@lbl.gov> <20120725170115.GJ5670@lbl.gov> <50102AAE.2010204@byu.edu> <5011F0C0.2090805@unimelb.edu.au>
Message-ID: <3783D5FD-F573-422D-981D-67A61BF9852B@umich.edu>

Chris,

As an update I have filed an issue with adaptive.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
brockp at umich.edu
(734)936-1985

On Jul 26, 2012, at 9:37 PM, Christopher Samuel wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hiya Brock,
>
> On 26/07/12 04:15, Brock Palen wrote:
>
>> I will look at filing a bug with adaptive about this behavior
>> change.
>
> In case nobody's mentioned it yet, this would be a good place to start:
>
> http://www.clusterresources.com/bugzilla/
>
> If you've got Moab then once you've reported that you can open a
> support case with Adaptive for 4.1 and reference that bug.
>
> Best of luck!
> Chris (still on 2.4, I'm turning into a Garrick)
> - --
> Christopher Samuel Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAlAR8MAACgkQO2KABBYQAh/MDgCdEd7ppPrcqysFiFmx8Pe48TJU
> mxMAn1+o7tpdCtQp36UxaFCtU5GATpIh
> =kLbG
> -----END PGP SIGNATURE-----
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

From azher at hep.caltech.edu Fri Aug 31 13:23:39 2012
From: azher at hep.caltech.edu (Azher Mughal)
Date: Fri, 31 Aug 2012 12:23:39 -0700
Subject: [torqueusers] jobs stuck in queue
Message-ID: <50410F3B.6000702@hep.caltech.edu>

Hi all,

I have jobs stuck in the queue. A sample job and the related node
output are below. Server is 2.3.7 with maui.

Any help?
Thanks
-Azher

[root at omega server_priv]# checkjob -v 1621827.omega

checking job 1621827 (RM job '1621827.omega.cluster.hep.caltech.edu')

State: Idle
Creds:  user:bays  group:minos  class:minos  qos:DEFAULT
WallTime: 00:00:00 of 10:00:00:00
SubmitTime: Mon Aug 27 16:31:10
  (Time Queued  Total: 3:19:49:56  Eligible: 00:00:00)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1  MEM: 1024M
NodeAccess: SHARED
NodeCount: 0

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
SystemQueueTime: Fri Aug 31 09:12:14

Flags:       HOSTLIST RESTARTABLE
HostList:
  [node151:1]
Holds:    Defer
Messages:  exceeds available partition procs
PE:  1.00  StartPriority:  20188
cannot select job 1621827 for partition DEFAULT (job hold active)

[root at omega server_priv]# pbsnodes node151
node151
     state = free
     np = 16
     properties = sl5,MEM24G,workdisk
     ntype = cluster
     status = opsys=linux,uname=Linux node151 2.6.18-274.18.1.el5 #1 SMP Thu Feb 9 12:20:03 EST 2012 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=68747,totmem=24675820kb,availmem=24521680kb,physmem=24675820kb,ncpus=16,loadave=0.00,gres=,netload=1921779741,size=1628314552kb:1788585084kb,state=free,jobs=,varattr=,rectime=1346440829

From zacharybs at ornl.gov Mon Aug 20 12:40:27 2012
From: zacharybs at ornl.gov (Zachary, Brian S.)
Date: Mon, 20 Aug 2012 18:40:27 -0000
Subject: [torqueusers] Hardcoded CPU time limit?
Message-ID: <60D410E9-382A-4C42-B09E-0AE10BDBDB95@ornl.gov>

Hello,

We are running torque 3.0.2, and are seeing an issue where jobs that
run longer than 10,000 hours are killed with a message similar to
"PBS: job killed: cput job total 36010171 secs exceeded limit 36000000
secs". This happens even though the queue we are seeing the problem on
is configured with "resources_max.cput = 24000:00:00".
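
The two limits quoted above are easier to compare in the same units. A minimal sketch of the arithmetic in shell, using only the numbers from this message (the variable names are made up for illustration):

```shell
# Numbers copied from the message; variable names are illustrative only.
killed_limit_secs=36000000     # limit reported in the "job killed" message
queue_limit_hours=24000        # hours part of resources_max.cput = 24000:00:00

# Convert each limit into the other's unit.
killed_limit_hours=$(( killed_limit_secs / 3600 ))
queue_limit_secs=$(( queue_limit_hours * 3600 ))

echo "limit in kill message:    ${killed_limit_hours} hours"
echo "queue resources_max.cput: ${queue_limit_secs} secs"
```

The limit in the kill message works out to exactly 10,000 hours, while the queue setting corresponds to 86,400,000 seconds, so the enforced limit is well below the configured one.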
Does anyone know how to get around this limit, either through source
modification, configuration changes, using a different version of
torque, etc.?

Thanks,
Brian Zachary

From msmorioka-tky at umin.ac.jp Tue Aug 28 01:07:48 2012
From: msmorioka-tky at umin.ac.jp (Masaki MS)
Date: Tue, 28 Aug 2012 07:07:48 -0000
Subject: [torqueusers] A node state down
Message-ID:

Dear everyone:

I am trying to install torque on a single workstation running CentOS 6,
because I want to manage jobs on one PC. Almost all of the setup is
done. A test script is submitted successfully, but it does not run and
there is no error message.

When I check "pbsnodes -a", I get the following:

#####################################################
[suimye at localhost ~]$ pbsnodes -a
localhost
     state = down
     np = 12
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
#####################################################

Then I googled my problem and found this page:
http://www.clusterresources.com/pages/resources/documentation/common-issues/torque.php

Following this, I checked:

1. *ps -ef|grep pbs_mom -> OK*
2. Restart the *pbs_server* and check *pbsnodes -a -> still DOWN*
3. *mom_priv/config -> pbs_server = 127.0.0.1 (I think this is ok)*
4. *server_name -> localhost*
5. ping localhost *-> OK*
6. Ports 15000-15005 are open.

I could not resolve this problem. If you have any ideas or suggestions,
please let me know.

Best regards:
Suimye
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120828/f120b99a/attachment-0001.html

From pabloguaza at ugr.es Wed Aug 29 05:47:19 2012
From: pabloguaza at ugr.es (Pablo Guaza Peces)
Date: Wed, 29 Aug 2012 11:47:19 -0000
Subject: [torqueusers] Programs hang when accessing MPI file
Message-ID: <503E0141.7070606@ugr.es>

Hi everybody!
I've been having this problem for a while now and I haven't been able
to solve it: whenever I access an MPI shared file, my program freezes
and doesn't give any output or errors. I made this very simple program
in C to test it:

#include "mpi.h"
#include <stdio.h>   /* header name was lost in the archive; stdio.h is assumed */

int main(int argc, char **argv)
{
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

As I said, it freezes and I have to kill it myself with the qdel
command. It actually creates the file "datafile", and there's no output
in the error or output files besides the lines related to being
manually killed.

I submit this program to torque with this PBS script:

#!/bin/bash
#PBS -S /bin/bash
#PBS -A batch
#PBS -N test_mpi_file
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:02:50
#PBS -j oe
cd "$PBS_O_WORKDIR"
mpiexec.hydra -rmk pbs /home/pablo/Programs/mbg/c/test_mpi_file

I have the following software configuration:

- mpich2 1.2.1 using Hydra
- Torque 2.5.7
- Maui 3.2.6

Maybe it has something to do with the NFS home directory that is shared
with all the nodes, because I can execute the program with no problem
when I do it on just one machine, whether the head node or any other.
It only fails when two or more machines are accessing the file.

Any help would be very appreciated! :)

Thanks

From mike.vandewege at gmail.com Fri Aug 31 13:35:18 2012
From: mike.vandewege at gmail.com (mike vandewege)
Date: Fri, 31 Aug 2012 19:35:18 -0000
Subject: [torqueusers] install torque-4.1 on compute
Message-ID:

I'm a noob, I'm building an HPC cluster on openSUSE. I've installed
torque-4.1 on the headnode:

./configure --with-default-server= --with-server-home=/var/spool/torque --prefix=/usr/local/torque
make
make install
make packages

I have /usr/local/ NFS'd with my compute node(s). When I copy
./torque-package-mom-linux-x86_64.sh and try to install the package on
the compute node I get errors such as:

tar: ./usr/local/torque/sbin/momctl: Cannot open: File exists
tar: Exiting with failure status due to previous errors
chown: changing ownership of `/./usr/local/torque/sbin/momctl': Invalid argument
chgrp: changing group of `/./usr/local/torque/sbin/momctl': Invalid argument
etc.

Given these packages are already present on the compute node, do I
really need to do this step (in a different way) or is there a better
method to install torque on my compute nodes that accounts for the
shared binaries?

Mike

--
Michael Vandewege, Ph.D. Student
Graduate Research Assistant
Dept.
of Biochemistry and Molecular Biology
Mississippi State University
Mississippi State, MS 39762
Email: mike.vandewege at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.supercluster.org/pipermail/torqueusers/attachments/20120831/80bf784d/attachment-0001.html