From dbeer at adaptivecomputing.com Wed Feb 1 09:33:19 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 1 Feb 2012 09:33:19 -0700
Subject: [torquedev] [Bug 173] [torque-3.0.4] pbs_mom buffer overflow /
segfaults when using --enable-nvidia-gpus [with BUG FIX]
In-Reply-To: <20120201043550.899FA6780FC@http.supercluster.org>
References:
<20120201043550.899FA6780FC@http.supercluster.org>
Message-ID:
We have a dynamic_string struct in TORQUE. Perhaps we can re-write things
so that it uses the dynamic string struct instead of a static buffer.
David
On Tue, Jan 31, 2012 at 9:35 PM, wrote:
> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=173
>
> --- Comment #2 from Nicolas Pinto 2012-01-31 21:35:50
> MST ---
> Agreed. What you suggest would be the correct way to handle this. I just
> hacked
> something in a way that is "compatible" with the current implementation.
--
David Beer | Software Engineer
Adaptive Computing
From knielson at adaptivecomputing.com Thu Feb 2 10:04:21 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 02 Feb 2012 10:04:21 -0700 (MST)
Subject: [torquedev] Looking for help with RPMs on TORQUE 4.0
In-Reply-To:
Message-ID: <4b31f863-4104-427c-97d5-40f9eb42f414@mail>
Hi all,
Are there any RPM experts out there who could look at the make rpm part of TORQUE 4.0 and help us fix the holes we have in the rpm build?
Thanks
Ken
From rea+maui at grid.kiae.ru Thu Feb 2 14:14:17 2012
From: rea+maui at grid.kiae.ru (Eygene Ryabinkin)
Date: Fri, 3 Feb 2012 01:14:17 +0400
Subject: [torquedev] Looking for help with RPMs on TORQUE 4.0
In-Reply-To: <4b31f863-4104-427c-97d5-40f9eb42f414@mail>
References:
<4b31f863-4104-427c-97d5-40f9eb42f414@mail>
Message-ID:
Thu, Feb 02, 2012 at 10:04:21AM -0700, Ken Nielson wrote:
> Are there any RPM experts out there who could look at the make rpm
> part of TORQUE 4.0 and help us fix the holes we have in the rpm
> build?
Yes, I have been building Torque RPMs for 3-4 years and using them on our
farm. What's the problem?
--
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"
From knielson at adaptivecomputing.com Thu Feb 2 11:01:33 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 02 Feb 2012 11:01:33 -0700 (MST)
Subject: [torquedev] Looking for help with RPMs on TORQUE 4.0
In-Reply-To:
Message-ID: <71716b39-f5b6-445d-9630-5eb343389152@mail>
----- Original Message -----
> From: "Eygene Ryabinkin"
> To: "Torque Developers mailing list"
> Sent: Thursday, February 2, 2012 2:14:17 PM
> Subject: Re: [torquedev] Looking for help with RPMs on TORQUE 4.0
>
> Thu, Feb 02, 2012 at 10:04:21AM -0700, Ken Nielson wrote:
> > Are there any RPM experts out there who could look at the make rpm
> > part of TORQUE 4.0 and help us fix the holes we have in the rpm
> > build?
>
> Yes, I have been building Torque RPMs for 3-4 years and using them on our
> farm. What's the problem?
> --
> Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"
Eygene,
Thanks for volunteering.
We neglect our RPMs because we do not support them. Sadly we also lack the expertise right now.
We have added and deleted many files from the source for TORQUE 4.0. There are likely other problems as well. We would like you to run make rpm and then fix any deficiencies.
Let me know if you have any other questions.
Regards
Ken
From mej at lbl.gov Thu Feb 2 11:07:30 2012
From: mej at lbl.gov (Michael Jennings)
Date: Thu, 2 Feb 2012 10:07:30 -0800
Subject: [torquedev] Looking for help with RPMs on TORQUE 4.0
In-Reply-To: <4b31f863-4104-427c-97d5-40f9eb42f414@mail>
References:
<4b31f863-4104-427c-97d5-40f9eb42f414@mail>
Message-ID:
On Feb 2, 2012 9:04 AM, "Ken Nielson"
wrote:
> Are there any RPM experts out there who could look at the make rpm part
> of TORQUE 4.0 and help us fix the holes we have in the rpm build?
All you had to do was ask. :-)
I'll try some builds and see what I find.
Michael
--
Michael Jennings
Linux Systems and Cluster Engineer
High-Performance Computing Services
Bldg 50B-3209E. W: 510-495-2687
MS 050C-3396 F: 510-486-8615
From knielson at adaptivecomputing.com Thu Feb 2 11:13:25 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Thu, 02 Feb 2012 11:13:25 -0700 (MST)
Subject: [torquedev] Looking for help with RPMs on TORQUE 4.0
In-Reply-To:
Message-ID:
Thanks Michael.
Ken
----- Original Message -----
> From: "Michael Jennings"
> To: "Torque Developers mailing list"
> Sent: Thursday, February 2, 2012 11:07:30 AM
> Subject: Re: [torquedev] Looking for help with RPMs on TORQUE 4.0
>
>
>
>
> On Feb 2, 2012 9:04 AM, "Ken Nielson" <
> knielson at adaptivecomputing.com > wrote:
>
> > Are there any RPM experts out there who could look at the make rpm
> > part of TORQUE 4.0 and help us fix the holes we have in the rpm
> > build?
>
> All you had to do was ask. :-)
>
> I'll try some builds and see what I find.
>
> Michael
>
> --
> Michael Jennings < mej at lbl.gov >
> Linux Systems and Cluster Engineer
> High-Performance Computing Services
> Bldg 50B-3209E. W: 510-495-2687
> MS 050C-3396 F: 510-486-8615
From mej at lbl.gov Thu Feb 2 14:40:26 2012
From: mej at lbl.gov (Michael Jennings)
Date: Thu, 2 Feb 2012 13:40:26 -0800
Subject: [torquedev] Looking for help with RPMs on TORQUE 4.0
In-Reply-To:
References:
Message-ID: <20120202214024.GQ2104@lbl.gov>
On Thursday, 02 February 2012, at 11:13:25 (-0700),
Ken Nielson wrote:
> Thanks Michael.
Not a problem. :-)
Attached is my initial patch. I still need to do some testing, but
this is what I ran into right off the bat. If there's more, I'll send
another patch.
Please let me know when this is committed to trunk or if there are any
issues with it. It was made against trunk as of 14:38 Utah time.
Michael
--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615
-------------- next part --------------
Index: buildutils/torque.spec.in
===================================================================
--- buildutils/torque.spec.in (revision 5693)
+++ buildutils/torque.spec.in (working copy)
@@ -32,13 +32,13 @@
%bcond_without syslog
### Autoconf macro expansions
-%define ac_with_blcr --%{?with_blcr:en}%{!?with_blcr:dis}able-blcr
-%define ac_with_cpuset --%{?with_cpuset:en}%{!?with_cpuset:dis}able-cpuset
-%define ac_with_drmaa --%{?with_drmaa:en}%{!?with_drmaa:dis}able-drmaa
+%define ac_with_blcr --%{?with_blcr:en}%{!?with_blcr:dis}able-blcr
+%define ac_with_cpuset --%{?with_cpuset:en}%{!?with_cpuset:dis}able-cpuset
+%define ac_with_drmaa --%{?with_drmaa:en}%{!?with_drmaa:dis}able-drmaa
%define ac_with_gui --%{?with_gui:en}%{!?with_gui:dis}able-gui --with%{!?with_gui:out}-tcl
-%define ac_with_ha --%{?with_ha:en}%{!?with_ha:dis}able-high-availability
-%define ac_with_pthreads --%{?with_ha:en}%{!?with_ha:dis}able-pthreads
-%define ac_with_munge --%{?with_munge:en}%{!?with_munge:dis}able-munge-auth
+%define ac_with_ha --%{?with_ha:en}%{!?with_ha:dis}able-high-availability
+%define ac_with_pthreads --%{?with_ha:en}%{!?with_ha:dis}able-pthreads
+%define ac_with_munge --%{?with_munge:en}%{!?with_munge:dis}able-munge-auth
%define ac_with_numa --%{?with_numa:en}%{!?with_numa:dis}able-numa-support
%define ac_with_memacct --%{?with_memacct:en}%{!?with_memacct:dis}able-memacct
%define ac_with_libcpuset --%{?with_libcpuset:en}%{!?with_libcpuset:dis}able-libcpuset
@@ -189,7 +189,7 @@
%{__mkdir_p} $RPM_BUILD_ROOT%{_initrddir}
INIT_PREFIX=""
test -f /etc/SuSE-release && INIT_PREFIX="suse."
-for PROG in pbs_mom pbs_sched pbs_server ; do
+for PROG in pbs_mom pbs_sched pbs_server trqauthd ; do
%{__sed} -e 's|^PBS_HOME=.*|PBS_HOME=%{torque_home}|' \
-e 's|^PBS_DAEMON=.*|PBS_DAEMON=%{_sbindir}/'$PROG'|' contrib/init.d/$INIT_PREFIX$PROG \
> $RPM_BUILD_ROOT%{_initrddir}/$PROG
@@ -208,8 +208,21 @@
mv $RPM_BUILD_ROOT%{_docdir}/*drmaa* drmaa-doc-dir
%endif
-%post -p /sbin/ldconfig
+%post
+/sbin/ldconfig
+if [ $1 -eq 1 ]; then
+ chkconfig --add trqauthd >/dev/null 2>&1 || :
+ chkconfig trqauthd on >/dev/null 2>&1 || :
+ service trqauthd condrestart >/dev/null 2>&1 || :
+fi
+%preun
+if [ $1 -eq 0 ]; then
+ chkconfig trqauthd off >/dev/null 2>&1 || :
+ service trqauthd stop >/dev/null 2>&1 || :
+ chkconfig --del trqauthd >/dev/null 2>&1 || :
+fi
+
%postun -p /sbin/ldconfig
%post server
@@ -248,14 +261,16 @@
qmgr -c "set server default_queue = batch" >/dev/null 2>&1 || :
qmgr -c "set node $TORQUE_SERVER state = free" >/dev/null 2>&1 || :
+ chkconfig --add pbs_server >/dev/null 2>&1 || :
chkconfig pbs_server on >/dev/null 2>&1 || :
service pbs_server condrestart >/dev/null 2>&1 || :
fi
%preun server
if [ $1 -eq 0 ]; then
- chkconfig pbs_server off
+ chkconfig pbs_server off >/dev/null 2>&1 || :
service pbs_server stop >/dev/null 2>&1 || :
+ chkconfig --del pbs_server >/dev/null 2>&1 || :
fi
%post client
@@ -275,26 +290,30 @@
TORQUE_SERVER=`hostname`
perl -pi -e "s/localhost/$TORQUE_SERVER/g" %{torque_home}/mom_priv/config 2>/dev/null || :
fi
- chkconfig pbs_mom on
+ chkconfig --add pbs_mom >/dev/null 2>&1 || :
+ chkconfig pbs_mom on >/dev/null 2>&1 || :
service pbs_mom condrestart >/dev/null 2>&1 || :
fi
%preun client
if [ $1 -eq 0 ]; then
- chkconfig pbs_mom off
+ chkconfig pbs_mom off >/dev/null 2>&1 || :
service pbs_mom stop >/dev/null 2>&1 || :
+ chkconfig --del pbs_mom >/dev/null 2>&1 || :
fi
%post scheduler
if [ $1 -eq 1 ]; then
- chkconfig pbs_sched on
+ chkconfig --add pbs_sched >/dev/null 2>&1 || :
+ chkconfig pbs_sched on >/dev/null 2>&1 || :
service pbs_sched condrestart >/dev/null 2>&1 || :
fi
%preun scheduler
if [ $1 -eq 0 ]; then
- chkconfig pbs_sched off
+ chkconfig pbs_sched off >/dev/null 2>&1 || :
service pbs_sched stop >/dev/null 2>&1 || :
+ chkconfig --del pbs_sched >/dev/null 2>&1 || :
fi
%clean
@@ -306,6 +325,7 @@
%doc doc/READ_ME doc/doc_fonts doc/soelim.c doc/ers
%config(noreplace) %{torque_home}/pbs_environment
%config(noreplace) %{torque_home}/server_name
+%{_initrddir}/trqauthd
%{_bindir}/chk_tree
%{_bindir}/hostn
%{_bindir}/nqs2pbs
@@ -317,7 +337,7 @@
%{_bindir}/printtracking
%{_bindir}/q*
%{_bindir}/tracejob
-%attr(4755, root, root) %{_sbindir}/pbs_iff
+%{_sbindir}/trqauthd
%if %{without scp}
%attr(4755, root, root) %{_sbindir}/pbs_rcp
%endif
Index: contrib/init.d/trqauthd
===================================================================
--- contrib/init.d/trqauthd (revision 5693)
+++ contrib/init.d/trqauthd (working copy)
@@ -19,7 +19,7 @@
# let see how we were called
case "$1" in
start)
- echo -n "Starting TORQUE Scheduler: "
+ echo -n "Starting TORQUE Authorization Daemon: "
status trqauthd 2>&1 > /dev/null
RET=$?
[ $RET -eq 0 ] && echo -n "trqauthd already running" && success && echo && exit 0
@@ -30,7 +30,7 @@
echo
;;
stop)
- echo -n "Shutting down TORQUE Scheduler: "
+ echo -n "Shutting down TORQUE Authorization Daemon: "
status trqauthd 2>&1 > /dev/null
RET=$?
[ ! $RET -eq 0 ] && echo -n "trqauthd already stopped" && success && echo && exit 0
@@ -49,7 +49,7 @@
$0 start
;;
reload)
- echo -n "Reloading trqauthd: "
+ echo -n "Reloading TORQUE Authorization Daemon: "
killproc trqauthd -HUP
RET=$?
echo
From jayavant.patil82 at gmail.com Thu Feb 2 23:49:03 2012
From: jayavant.patil82 at gmail.com (Jayavant Patil)
Date: Fri, 3 Feb 2012 12:19:03 +0530
Subject: [torquedev] Queue to Node Mapping
Message-ID:
Hi,
I am using Torque 3.0.0 and Maui 3.3. I want jobs submitted to a
specific queue to run only on the nodes allocated to that queue (i.e.
queue-to-node mapping).
Does anybody know how to do this?
--
Thanks & Regards,
Jayavant Ningoji Patil
Engineer: System Software
Computational Research Laboratories Ltd.
Pune-411 004.
Maharashtra, India.
+91 9923536030.
From jayavant.patil82 at gmail.com Fri Feb 3 01:18:24 2012
From: jayavant.patil82 at gmail.com (Jayavant Patil)
Date: Fri, 3 Feb 2012 13:48:24 +0530
Subject: [torquedev] [Mauiusers] Queue to Node Mapping
In-Reply-To: <2d7330a.22258.135420519a4.Coremail.zgp121@126.com>
References:
<2d7330a.22258.135420519a4.Coremail.zgp121@126.com>
Message-ID:
On Fri, Feb 3, 2012 at 12:31 PM, Guangping Zhang wrote:
> I will give you one example as follows, as far as I know this works in
> Torque 2.4.6+maui 3.3.1
>
> 1. edit the file /var/spool/torque/server_priv/nodes
>
> node01 np=12 sugon siesta dalton gaussian
> node02 np=12 sugon siesta dalton gaussian
> node03 np=12 sugon siesta dalton gaussian
> node04 np=12 sugon siesta dalton gaussian
> node05 np=12 sugon siesta dalton gaussian
> node06 np=12 sugon siesta dalton gaussian
> node07 np=12 sugon siesta dalton gaussian
> node08 np=12 sugon siesta dalton gaussian
> node09 np=12 sugon siesta dalton gaussian
> node10 np=12 sugon siesta dalton gaussian
> node11 np=12 sugon siesta dalton gaussian
> node12 np=12 sugon siesta dalton gaussian
> node31 np=8 powerlead siesta dalton gaussian others
> node32 np=8 powerlead siesta dalton gaussian others
> node33 np=8 powerlead siesta dalton gaussian others
> node34 np=8 powerlead siesta dalton gaussian others
> node35 np=8 powerlead siesta dalton gaussian others
> node36 np=8 powerlead siesta dalton gaussian others
> node38 np=8 powerlead siesta dalton gaussian others
> node39 np=8 powerlead siesta dalton gaussian others
> node40 np=8 powerlead siesta dalton gaussian others
> node41 np=8 dell siesta dalton gaussian others
> node42 np=8 dell siesta dalton gaussian others
> node43 np=8 dell siesta dalton gaussian others
> node44 np=8 dell molpro
> node45 np=8 dell molpro
> node46 np=8 dell molpro
> 2.create a queue named SIESTA
> qmgr -c "create queue SIESTA queue_type=execution"
> qmgr -c "set queue SIESTA started=true"
> qmgr -c "set queue SIESTA enabled=true"
> qmgr -c "set queue SIESTA acl_group_enable=true"
> qmgr -c "set queue SIESTA acl_groups=siesta"
> qmgr -c "set queue SIESTA acl_group_sloppy=true"
> qmgr -c "set queue SIESTA resources_default.neednodes=siesta"
> 3.restart service
>
> qterm -t quick
> pbs_server
> ps -A |grep maui
> 18066 ? 00:00:00 maui
> kill 18066
> /usr/local/software/maui-3.3.1/sbin/maui
>
> 4. That is Ok
>
> Only users that belong to group siesta can submit jobs to queue SIESTA,
> and those jobs can only use the nodes which have the property "siesta"
>
>
> Best
>
> 2012-02-03
> ------------------------------
> Guangping Zhang
> ------------------------------
> From: Jayavant Patil
> Sent: 2012-02-03 14:49
> Subject: [Mauiusers] Queue to Node Mapping
> To: torquedev,mauiusers
> Cc:
>
> Hi,
>
> I am using Torque 3.0.0 and Maui 3.3. I want jobs submitted to a
> specific queue to run only on the nodes allocated to that queue (i.e.
> queue-to-node mapping).
>
>
> Does anybody know how to do this?
>
> --
>
> Thanks & Regards,
> Jayavant Ningoji Patil
> Engineer: System Software
> Computational Research Laboratories Ltd.
> Pune-411 004.
> Maharashtra, India.
> +91 9923536030.
>
Hi Guangping Zhang,
Thanks a lot. It works.
Just out of curiosity, is this the only way to achieve this?
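For anyone trying the nodes-file setup quoted above, here is a minimal shell sketch for checking which nodes carry a given property, and therefore which nodes a queue with resources_default.neednodes set to that property should be limited to. The function name nodes_with_property is hypothetical and not part of TORQUE; the sketch assumes the simple "name np=N property ..." layout shown in the example, and skipping blank lines and "#" lines is also an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): list the nodes in a
# server_priv/nodes-style file that carry a given property.
# Assumes one node per line: "<name> [np=N] [property ...]".
nodes_with_property() {
    nodes_file=$1
    property=$2
    awk -v prop="$property" '
        # Skip comment lines and blank lines (assumed conventions).
        /^[[:space:]]*#/ || NF == 0 { next }
        {
            # Field 1 is the node name; compare every later word exactly.
            for (i = 2; i <= NF; i++) {
                if ($i == prop) { print $1; break }
            }
        }
    ' "$nodes_file"
}

# Example usage (path taken from the message quoted above):
#   nodes_with_property /var/spool/torque/server_priv/nodes siesta
```

The match is on whole words, so a node tagged only "molpro" is not reported for "siesta".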
--
Thanks & Regards,
Jayavant Ningoji Patil
Engineer: System Software
Computational Research Laboratories Ltd.
Pune-411 004.
Maharashtra, India.
+91 9923536030.
From Gareth.Williams at csiro.au Fri Feb 3 01:44:05 2012
From: Gareth.Williams at csiro.au (Gareth.Williams at csiro.au)
Date: Fri, 3 Feb 2012 19:44:05 +1100
Subject: [torquedev] [Mauiusers] Queue to Node Mapping
In-Reply-To:
References:
<2d7330a.22258.135420519a4.Coremail.zgp121@126.com>
Message-ID: <007DECE986B47F4EABF823C1FBB19C620102CDD74DD0@exvic-mbx04.nexus.csiro.au>
> Just out of curiosity, is this the only way to achieve this?
We use a different setup where all the configuration is in moab (it will probably work with maui too - it would be interesting to know). Features are used to decide which nodes map to which queues - but the features are from moab's perspective rather than torque's. A similar acl setup is needed to restrict access to the queues.
The moab part looks like:
## for io jobs in io queue
CLASSCFG[io] DEFAULT.FEATURES=io
SRCFG[io] NODEFEATURES=io
SRCFG[io] PERIOD=WEEK DEPTH=1
SRCFG[io] CLASSLIST=io
SRCFG[io] HOSTLIST=.
SRCFG[io] FLAGS=ACLOVERLAP,IGNSTATE
NODECFG[ionode01] FEATURES+=io
Gareth
From dbeer at adaptivecomputing.com Fri Feb 3 09:09:41 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Fri, 3 Feb 2012 09:09:41 -0700
Subject: [torquedev] [Mauiusers] Queue to Node Mapping
In-Reply-To: <007DECE986B47F4EABF823C1FBB19C620102CDD74DD0@exvic-mbx04.nexus.csiro.au>
References:
<2d7330a.22258.135420519a4.Coremail.zgp121@126.com>
<007DECE986B47F4EABF823C1FBB19C620102CDD74DD0@exvic-mbx04.nexus.csiro.au>
Message-ID:
I can point out that the two different ways are essentially equivalent, but
one does it through Moab and the other through TORQUE. These are the two
ways I have seen it done, although there may be others.
David
On Fri, Feb 3, 2012 at 1:44 AM, wrote:
> > Just out of curiosity, is this the only way to achieve this?
>
> We use a different setup where all the configuration is in moab (it will
> probably work with maui too - it would be interesting to know). Features
> are used to decide which nodes map to which queues - but the features are from
> moab's perspective rather than torque's. A similar acl setup is needed to
> restrict access to the queues.
>
> The moab part looks like:
> ## for io jobs in io queue
> CLASSCFG[io] DEFAULT.FEATURES=io
> SRCFG[io] NODEFEATURES=io
> SRCFG[io] PERIOD=WEEK DEPTH=1
> SRCFG[io] CLASSLIST=io
> SRCFG[io] HOSTLIST=.
> SRCFG[io] FLAGS=ACLOVERLAP,IGNSTATE
> NODECFG[ionode01] FEATURES+=io
>
> Gareth
--
David Beer | Software Engineer
Adaptive Computing
From mej at lbl.gov Fri Feb 3 16:42:11 2012
From: mej at lbl.gov (Michael Jennings)
Date: Fri, 3 Feb 2012 15:42:11 -0800
Subject: [torquedev] Looking for help with RPMs on TORQUE 4.0
In-Reply-To: <4b31f863-4104-427c-97d5-40f9eb42f414@mail>
References:
<4b31f863-4104-427c-97d5-40f9eb42f414@mail>
Message-ID: <20120203234210.GZ2104@lbl.gov>
On Thursday, 02 February 2012, at 10:04:21 (-0700),
Ken Nielson wrote:
> Are there any RPM experts out there who could look at the make rpm
> part of TORQUE 4.0 and help us fix the holes we have in the rpm
> build?
Here's the second half of the patch. I've tested this one about as
thoroughly as I can here without putting it into production, and
everything builds and installs (and uninstalls) fine for me.
Some of these fixes will need to find their way into 2.5.x and 3.x as
well, but I can generate separate patches for those.
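For readers less familiar with RPM scriptlets: the "[ $1 -eq 1 ]" and "[ $1 -eq 0 ]" tests in the spec file rely on the count that RPM passes as the first argument, which is the number of instances of the package that will remain installed once the transaction finishes. The sketch below mirrors that branching in plain shell. The function names are made up for illustration, and the echo lines merely stand in for the chkconfig and service calls in the patch.

```shell
#!/bin/sh
# Illustration only: mimics how %post and %preun scriptlets branch on $1,
# the number of package instances left installed after the transaction.
post_action() {
    if [ "$1" -eq 1 ]; then
        echo "first install: register and start the service"
    else
        echo "upgrade: condrestart the service"
    fi
}

preun_action() {
    if [ "$1" -eq 0 ]; then
        echo "final erase: stop and unregister the service"
    else
        echo "upgrade: leave the service registered"
    fi
}

post_action 1    # fresh install
post_action 2    # upgrade
preun_action 0   # package being removed entirely
preun_action 1   # old package's %preun running during an upgrade
```

This is why the patch only stops and deregisters a daemon when the count is 0: during an upgrade the old package's %preun still runs, but with a count of 1.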
HTH,
Michael
--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615
-------------- next part --------------
Index: buildutils/torque.spec.in
===================================================================
--- buildutils/torque.spec.in (revision 5704)
+++ buildutils/torque.spec.in (working copy)
@@ -18,15 +18,14 @@
%bcond_with cpuset
%bcond_with drmaa
%bcond_with gui
+%bcond_with libcpuset
+%bcond_with memacct
%bcond_with munge
+%bcond_with numa
%bcond_with pam
-%bcond_with numa
-%bcond_with memacct
-%bcond_with libcpuset
%bcond_with top
### Features enabled by default
-%bcond_without ha
%bcond_without scp
%bcond_without spool
%bcond_without syslog
@@ -36,8 +35,6 @@
%define ac_with_cpuset --%{?with_cpuset:en}%{!?with_cpuset:dis}able-cpuset
%define ac_with_drmaa --%{?with_drmaa:en}%{!?with_drmaa:dis}able-drmaa
%define ac_with_gui --%{?with_gui:en}%{!?with_gui:dis}able-gui --with%{!?with_gui:out}-tcl
-%define ac_with_ha --%{?with_ha:en}%{!?with_ha:dis}able-high-availability
-%define ac_with_pthreads --%{?with_ha:en}%{!?with_ha:dis}able-pthreads
%define ac_with_munge --%{?with_munge:en}%{!?with_munge:dis}able-munge-auth
%define ac_with_numa --%{?with_numa:en}%{!?with_numa:dis}able-numa-support
%define ac_with_memacct --%{?with_memacct:en}%{!?with_memacct:dis}able-memacct
@@ -162,7 +159,7 @@
Group: Applications/System
Provides: pbs-drmaa = %{version}-%{release}
-%description
+%description drmaa
This package contains a DRMAA 1.0 implementation for use with TORQUE.
%endif
@@ -177,7 +174,7 @@
%configure --includedir=%{_includedir}/%{name} --with-default-server=%{torque_server} \
--with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} \
--disable-dependency-tracking %{ac_with_gui} %{ac_with_scp} %{ac_with_syslog} \
- --disable-gcc-warnings %{ac_with_munge} %{ac_with_pam} %{ac_with_drmaa} %{ac_with_ha} \
+ --disable-gcc-warnings %{ac_with_munge} %{ac_with_pam} %{ac_with_drmaa} \
--disable-qsub-keep-override %{ac_with_blcr} %{ac_with_cpuset} %{ac_with_spool} %{?acflags}
%{__make} %{?_smp_mflags} %{?mflags}
@@ -191,6 +188,8 @@
test -f /etc/SuSE-release && INIT_PREFIX="suse."
for PROG in pbs_mom pbs_sched pbs_server trqauthd ; do
%{__sed} -e 's|^PBS_HOME=.*|PBS_HOME=%{torque_home}|' \
+ -e 's|^BIN_PATH=.*|BIN_PATH=%{_bindir}|' \
+ -e 's|^SBIN_PATH=.*|SBIN_PATH=%{_sbindir}|' \
-e 's|^PBS_DAEMON=.*|PBS_DAEMON=%{_sbindir}/'$PROG'|' contrib/init.d/$INIT_PREFIX$PROG \
> $RPM_BUILD_ROOT%{_initrddir}/$PROG
%{__chmod} 0755 $RPM_BUILD_ROOT%{_initrddir}/$PROG
@@ -213,6 +212,8 @@
if [ $1 -eq 1 ]; then
chkconfig --add trqauthd >/dev/null 2>&1 || :
chkconfig trqauthd on >/dev/null 2>&1 || :
+ service trqauthd start >/dev/null 2>&1 || :
+else
service trqauthd condrestart >/dev/null 2>&1 || :
fi
@@ -238,31 +239,36 @@
pbs_resmom 15003/udp # mom resource management requests
pbs_sched 15004/tcp # scheduler
pbs_sched 15004/udp # scheduler
+ trqauthd 15005/tcp # authorization daemon
+ trqauthd 15005/udp # authorization daemon
EOF
- pbs_server -t create || exit 0
+ if [ ! -e %{torque_home}/server_priv/serverdb ]; then
+ if [ "%{torque_server}" = "localhost" ]; then
+ TORQUE_SERVER=`hostname`
+ perl -pi -e "s/localhost/$TORQUE_SERVER/g" %{torque_home}/server_name %{torque_home}/server_priv/nodes 2>/dev/null || :
+ else
+ TORQUE_SERVER="%{torque_server}"
+ fi
- if [ "%{torque_server}" = "localhost" ]; then
- TORQUE_SERVER=`hostname`
- perl -pi -e "s/localhost/$TORQUE_SERVER/g" %{torque_home}/server_name %{torque_home}/server_priv/nodes 2>/dev/null || :
- else
- TORQUE_SERVER="%{torque_server}"
+ pbs_server -t create -f >/dev/null 2>&1 || :
+ sleep 1
+ qmgr -c "set server scheduling = true" >/dev/null 2>&1 || :
+ qmgr -c "set server managers += root@$TORQUE_SERVER" >/dev/null 2>&1 || :
+ qmgr -c "set server managers += %{torque_user}@$TORQUE_SERVER" >/dev/null 2>&1 || :
+ qmgr -c "create queue batch queue_type = execution" >/dev/null 2>&1 || :
+ qmgr -c "set queue batch started = true" >/dev/null 2>&1 || :
+ qmgr -c "set queue batch enabled = true" >/dev/null 2>&1 || :
+ qmgr -c "set queue batch resources_default.walltime = 1:00:00" >/dev/null 2>&1 || :
+ qmgr -c "set queue batch resources_default.nodes = 1" >/dev/null 2>&1 || :
+ qmgr -c "set server default_queue = batch" >/dev/null 2>&1 || :
+ qmgr -c "set node $TORQUE_SERVER state = free" >/dev/null 2>&1 || :
+ killall -TERM pbs_server >/dev/null 2>&1 || :
fi
- qmgr -c "set server scheduling = true" >/dev/null 2>&1 || :
- qmgr -c "set server operators += root@$TORQUE_SERVER" >/dev/null 2>&1 || :
- qmgr -c "set server managers += root@$TORQUE_SERVER" >/dev/null 2>&1 || :
- qmgr -c "set server operators += %{torque_user}@$TORQUE_SERVER" >/dev/null 2>&1 || :
- qmgr -c "set server managers += %{torque_user}@$TORQUE_SERVER" >/dev/null 2>&1 || :
- qmgr -c "create queue batch queue_type = execution" >/dev/null 2>&1 || :
- qmgr -c "set queue batch started = true" >/dev/null 2>&1 || :
- qmgr -c "set queue batch enabled = true" >/dev/null 2>&1 || :
- qmgr -c "set queue batch resources_default.walltime = 1:00:00" >/dev/null 2>&1 || :
- qmgr -c "set queue batch resources_default.nodes = 1" >/dev/null 2>&1 || :
- qmgr -c "set server default_queue = batch" >/dev/null 2>&1 || :
- qmgr -c "set node $TORQUE_SERVER state = free" >/dev/null 2>&1 || :
-
chkconfig --add pbs_server >/dev/null 2>&1 || :
chkconfig pbs_server on >/dev/null 2>&1 || :
+ service pbs_server start >/dev/null 2>&1 || :
+else
service pbs_server condrestart >/dev/null 2>&1 || :
fi
@@ -285,6 +291,8 @@
pbs_resmom 15003/udp # mom resource management requests
pbs_sched 15004/tcp # scheduler
pbs_sched 15004/udp # scheduler
+ trqauthd 15005/tcp # authorization daemon
+ trqauthd 15005/udp # authorization daemon
EOF
if [ "%{torque_server}" = "localhost" ]; then
TORQUE_SERVER=`hostname`
@@ -292,6 +300,8 @@
fi
chkconfig --add pbs_mom >/dev/null 2>&1 || :
chkconfig pbs_mom on >/dev/null 2>&1 || :
+ service pbs_mom start >/dev/null 2>&1 || :
+else
service pbs_mom condrestart >/dev/null 2>&1 || :
fi
@@ -306,6 +316,8 @@
if [ $1 -eq 1 ]; then
chkconfig --add pbs_sched >/dev/null 2>&1 || :
chkconfig pbs_sched on >/dev/null 2>&1 || :
+ service pbs_sched start >/dev/null 2>&1 || :
+else
service pbs_sched condrestart >/dev/null 2>&1 || :
fi
@@ -386,6 +398,7 @@
%dir %{torque_home}/mom_priv
%config(noreplace) %{torque_home}/mom_priv/config
%{_initrddir}/pbs_mom
+%{_sbindir}/momctl
%{_sbindir}/pbs_demux
%{_sbindir}/pbs_mom
%{_sbindir}/qnoded
Index: contrib/init.d/pbs_server
===================================================================
--- contrib/init.d/pbs_server (revision 5704)
+++ contrib/init.d/pbs_server (working copy)
@@ -6,7 +6,7 @@
# description: PBS is a versatile batch system for SMPs and clusters
#
# Source the library functions
-#. /etc/rc.d/init.d/functions
+test -f /etc/rc.d/init.d/functions && . /etc/rc.d/init.d/functions
create() {
echo -n "Creating initial TORQUE configuration: "
@@ -15,12 +15,18 @@
exit 1
fi
- $PBS_DAEMON -d $PBS_HOME -t create &
- while [ ! -r $PBS_SERVERDB ]; do
- sleep 1
+ for SLEEP in 2 4 6 8 10 ; do
+ $PBS_DAEMON -d $PBS_HOME -t create &
+ sleep $SLEEP
+ $BIN_PATH/qterm
done
- $BIN_PATH/qterm
- RET=$?
+ if [ -r $PBS_SERVERDB ]; then
+ success
+ RET=0
+ else
+ failure
+ RET=1
+ fi
}
start() {
@@ -29,7 +35,6 @@
echo "pbs_server is already running."
exit 0
fi
- echo $PBS_SERVERDB
if [ ! -r $PBS_SERVERDB ]; then
create
fi
@@ -47,7 +52,7 @@
exit 0
fi
echo -n "Shutting down TORQUE Server: "
- qterm pbs_server
+ killproc pbs_server -TERM
RET=$?
rm -f /var/lock/subsys/pbs_server
echo
From knielson at adaptivecomputing.com Mon Feb 6 11:16:51 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Mon, 06 Feb 2012 11:16:51 -0700 (MST)
Subject: [torquedev] Looking for help with RPMs on TORQUE 4.0
In-Reply-To: <20120203234210.GZ2104@lbl.gov>
Message-ID:
----- Original Message -----
> From: "Michael Jennings"
> To: torquedev at supercluster.org
> Sent: Friday, February 3, 2012 4:42:11 PM
> Subject: Re: [torquedev] Looking for help with RPMs on TORQUE 4.0
>
> On Thursday, 02 February 2012, at 10:04:21 (-0700),
> Ken Nielson wrote:
>
> > Are there any RPM experts out there who could look at the make rpm
> > part of TORQUE 4.0 and help us fix the holes we have in the rpm
> > build?
>
> Here's the second half of the patch. I've tested this one about as
> thoroughly as I can here without putting it into production, and
> everything builds and installs (and uninstalls) fine for me.
>
> Some of these fixes will need to find their way into 2.5.x and 3.x as
> well, but I can generate separate patches for those.
>
> HTH,
> Michael
>
> --
> Michael Jennings
> Senior HPC Systems Engineer
> High-Performance Computing Services
> Lawrence Berkeley National Laboratory
> Bldg 50B-3209E W: 510-495-2687
> MS 050B-3209 F: 510-486-8615
>
Michael,
Thanks for the patch. I am getting the following error message when I do make rpm.
rpmbuild --with syslog --define "torque_home/sdb/torque" --with scp --without pam --without drmaa -tb torque-4.0.0.tar.gz
error: Failed dependencies:
/usr/bin/scp is needed by torque-4.0.0-1.cri.x86_64
make is needed by torque-4.0.0-1.cri.x86_64
make: *** [rpm] Error 1
Is this something in my setup or is there an error in the script?
Ken
From mej at lbl.gov Mon Feb 6 11:36:41 2012
From: mej at lbl.gov (Michael Jennings)
Date: Mon, 6 Feb 2012 10:36:41 -0800
Subject: [torquedev] Looking for help with RPMs on TORQUE 4.0
In-Reply-To:
References: <20120203234210.GZ2104@lbl.gov>
Message-ID: <20120206183640.GG2104@lbl.gov>
On Monday, 06 February 2012, at 11:16:51 (-0700),
Ken Nielson wrote:
> Thanks for the patch. I am getting the following error message when I do make rpm.
>
> rpmbuild --with syslog --define "torque_home/sdb/torque" --with scp --without pam --without drmaa -tb torque-4.0.0.tar.gz
> error: Failed dependencies:
> /usr/bin/scp is needed by torque-4.0.0-1.cri.x86_64
> make is needed by torque-4.0.0-1.cri.x86_64
> make: *** [rpm] Error 1
>
> Is this something in my setup or is there an error in the script?
Are you building on an RPM-based system? It strikes me as very odd
that "make" wouldn't be installed, so the other possibility is that it
isn't installed via RPM (or your RPM database is corrupt).
What is the output of the following commands?
rpm -q make
rpm -qf `which scp`
rpm -qf `which make`
To force the build to happen anyway, add the --nodeps flag.
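The checks above can be wrapped in a small shell sketch that reports, for each build tool, whether it is on the PATH and whether the RPM database knows which package owns it. The function name check_rpm_owner is hypothetical; only "command -v" and "rpm -qf" are real commands here, and the output wording is my own.

```shell
#!/bin/sh
# Hypothetical helper: report whether each named tool is on PATH and,
# when the rpm command is available, which package (if any) owns it.
check_rpm_owner() {
    for tool in "$@"; do
        # Resolve the tool to a path; fall back to empty if it is missing.
        path=$(command -v "$tool" 2>/dev/null) || path=""
        if [ -z "$path" ]; then
            echo "$tool: not found on PATH"
        elif ! command -v rpm >/dev/null 2>&1; then
            echo "$tool: $path (rpm command not available)"
        elif owner=$(rpm -qf "$path" 2>/dev/null); then
            echo "$tool: $path owned by $owner"
        else
            # rpm -qf exits non-zero when no package owns the file.
            echo "$tool: $path is not owned by any RPM package"
        fi
    done
}

# Example usage:
#   check_rpm_owner make scp
```

A tool that is present but reported as not owned by any package would explain a "Failed dependencies" error even though the binary runs fine.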
Michael
--
Michael Jennings
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615
From knielson at adaptivecomputing.com Mon Feb 6 12:12:22 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Mon, 06 Feb 2012 12:12:22 -0700 (MST)
Subject: [torquedev] Looking for help with RPMs on TORQUE 4.0
In-Reply-To: <20120206183640.GG2104@lbl.gov>
Message-ID:
----- Original Message -----
> From: "Michael Jennings"
> To: torquedev at supercluster.org
> Sent: Monday, February 6, 2012 11:36:41 AM
> Subject: Re: [torquedev] Looking for help with RPMs on TORQUE 4.0
>
> On Monday, 06 February 2012, at 11:16:51 (-0700),
> Ken Nielson wrote:
>
> > Thanks for the patch. I am getting the following error message when
> > I do make rpm.
> >
> > rpmbuild --with syslog --define "torque_home/sdb/torque" --with scp
> > --without pam --without drmaa -tb torque-4.0.0.tar.gz
> > error: Failed dependencies:
> > /usr/bin/scp is needed by torque-4.0.0-1.cri.x86_64
> > make is needed by torque-4.0.0-1.cri.x86_64
> > make: *** [rpm] Error 1
> >
> > Is this something in my setup or is there an error in the script?
>
> Are you building on an RPM-based system? It strikes me as very odd
> that "make" wouldn't be installed, so the other possibility is that
> it
> isn't installed via RPM (or your RPM database is corrupt).
>
> What is the output of the following commands?
> rpm -q make
> rpm -qf `which scp`
> rpm -qf `which make`
>
> To force the build to happen anyway, add the --nodeps flag.
>
> Michael
>
Michael,
That is the problem.
Thanks
Ken
From knielson at adaptivecomputing.com Wed Feb 15 11:06:37 2012
From: knielson at adaptivecomputing.com (Ken Nielson)
Date: Wed, 15 Feb 2012 11:06:37 -0700 (MST)
Subject: [torquedev] New TORQUE 4.0 beta available
In-Reply-To:
Message-ID:
Hi all,
We are getting closer. There is a TORQUE 4.0 beta available for download at http://www.adaptivecomputing.com/resources/downloads/torque/4.0-beta/torque-4.0.0-snap.201202151044.tar.gz
Since the February 10 beta snapshot we have the following fixes:
- Added code to im_join_job_as_sister to set the pjob->ji_radix to 2 for
  leaf nodes in a job radix. This is used later for IM_SPAWN_TASK to
  get the correct address of an intermediate MOM node in start_process.
- TRQ-768. Add the -w option to allow pbs_moms to wait until they get the
  mom hierarchy file from pbs_server to send their first update.
- Change the moab_array_compatible server parameter so it defaults to true.
  TRQ-761
- pbs_o variables moved to upper case to support previously working scripts.
  Bug TRQ-765
- Change send_update_within_ten to send_update_soon and make it
  update interval / 3.
- Creating a new branch for TORQUE 4.0. The new branch is 4.0-fixes.
- Another fix for where pbs_mom would not automatically set gpu mode.
- Commit a version of Jon's patch to fix a segfault in the log locking
  messages. It was an off-by-one buffer error.
Documentation for TORQUE 4.0 can be found at http://www.adaptivecomputing.com/resources/docs/. If you find bugs in the documentation we would like to have those reported as well.
Please download and try this version and let us know if you find any issues.
Regards
Ken
From l.flis at cyf-kr.edu.pl Wed Feb 22 05:19:55 2012
From: l.flis at cyf-kr.edu.pl (Lukasz Flis)
Date: Wed, 22 Feb 2012 13:19:55 +0100
Subject: [torquedev] torque 2.5.10 - interactive jobs startup
Message-ID: <4F44DD6B.3000209@cyf-kr.edu.pl>
Hi,
Torque: 2.5.10
On busy clusters where many jobs share the same node, we observe that
some interactive jobs get interrupted during startup.
From the user's point of view, the problem manifests itself with the message:
qsub: Job apparently deleted
The corresponding pbs_mom log file shows an "Interrupted system call"
error during read() on a socket in the function rcvttype:
Feb 19 11:00:35 n6-2-32.local pbs_mom[]: LOG_ERROR::Interrupted system call (4) in TMomFinalizeChild, cannot get termtype
It looks like the read is interrupted by SIGCHLD or SIGALRM (pbs_mom
defines a 5-second limit for rcvtermtype() to return, which might not be
long enough on busy systems).
I didn't have time to write a fix for it, but it should be trivial.
Cheers
--
Lukasz Flis
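The interrupted read() described in this message is commonly handled by
restarting the call when errno is EINTR. The following is only a minimal
sketch of that idea, not the actual TORQUE fix: read_retry_eintr is a
hypothetical helper name, and nothing here is taken from the pbs_mom source.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not TORQUE code): read up to `len` bytes from `fd`,
 * restarting the call whenever a signal such as SIGCHLD interrupts it
 * before any data has been transferred.
 *
 * Returns the byte count from read(), 0 on end of file, or -1 with errno
 * set for any error other than EINTR.
 */
ssize_t read_retry_eintr(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);

    return n;
}
```

If the 5-second limit is enforced with alarm(), a blind retry like this
would also swallow the timeout, so a real fix would probably need to tell
the two signals apart, for example with a flag set by the SIGALRM handler.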
From dbeer at adaptivecomputing.com Wed Feb 22 11:17:03 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 22 Feb 2012 11:17:03 -0700
Subject: [torquedev] TORQUE 4.0.0 Update
Message-ID:
All,
Thanks for all of the help with beta testing and input for assisting us in
preparing TORQUE 4.0 for release. Just to keep everyone updated on the
progress, we want to let you all know that QA has approved TORQUE and
everyone should look for the official release announcement, together with
the Moab HPC Suite 7.0, on March 13th.
--
David Beer | Software Engineer
Adaptive Computing
From Sean.Kellogg at fei.com Wed Feb 22 11:43:55 2012
From: Sean.Kellogg at fei.com (Kellogg, Sean)
Date: Wed, 22 Feb 2012 18:43:55 +0000
Subject: [torquedev] Question about IamUserByName, Win32 mom
Message-ID: <40F21C77600CA844B0874E52E9867017143E9F08@hlexc07.w2k.feico.com>
Dear Torque development community,
I'm trying to troubleshoot a PBS error that I have on a Win32 execute host. I have run into a dead end.
The symptoms are as follows: after a job is queued to the Torque system, the job is passed to a Win32 execute host and then exits with Exit_status=-1. The PBS mom log on the execute host contains the following seven lines as a record of the failure:
02/20/2012 07:57:02;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::IamUserByName, WARNING!!! Can`t find user "simuser"!
02/20/2012 07:57:02;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::start_exec, Torque Mom Version = 2.5.4, loglevel = 0
02/20/2012 07:57:02;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters for job 2777.hl-vcomputenodemaster.DOMAIN.COMPANYNAME.com
02/20/2012 07:57:05;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
02/20/2012 07:57:05;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
02/20/2012 07:57:05;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
02/20/2012 07:57:05;0080; pbs_mom;Job;2777.hl-vcomputenodemaster.DOMAIN.COMPANYNAME.com;obit sent to server
I suspect that the most interesting information is contained in the first line, following the function name IamUserByName. I have found limited information about this error, and it's all contained in the torquedev mailing list thread 2173 (http://www.supercluster.org/pipermail/torquedev/2010-June/002173.html)
To make matters more confusing, this problem only occurs on about 80% of the job submissions; the other 20% are executed normally. Therefore I wonder if there is a reliability issue with the function call IamUserByName.
Can anyone provide any insight?
Thank you,
Sean Kellogg
FEI Company
From dbeer at adaptivecomputing.com Wed Feb 22 13:53:16 2012
From: dbeer at adaptivecomputing.com (David Beer)
Date: Wed, 22 Feb 2012 13:53:16 -0700
Subject: [torquedev] [torqueusers] Performance of non-GPU codes on GPU
nodes reduced by nvidia-smi overhead
In-Reply-To:
References:
Message-ID:
Doug,
This is now officially in the list of things to do, and I'll keep you
updated on it. That may sound terrible, but it should be done quickly.
David
On Fri, Feb 17, 2012 at 3:12 PM, Doug Johnson wrote:
> At Fri, 17 Feb 2012 13:10:41 -0700,
> David Beer wrote:
> >
> > Doug,
> >
> > I have created a ticket for our documentation team to note that the
> > TDK is where nvml.h can be found.
> >
> > We also thank you for the patch. I believe there is some more work
> > that needs to be done beyond just this change, but we will look to
> > get those done very soon. I think it would be ideal to allow people
> > to use the same binary for both GPU enabled and non-GPU enabled
> > nodes.
> >
>
> Yeah, conceptual versus actually working. There's no really proper
> way to do this with how the gpu code is currently inline with many
> ifdefs. It's a surprisingly small number of entry points that need to
> be modified. See the attached patch, this allows an NVML enabled
> build to run on either a GPU or non-GPU node. If all the GPU routines
> were moved into their own file this could be done cleanly and without
> a lot of effort.
>
> Doug
>
> PS. caveat emptor with the patch, I've run two jobs on the nodes so
> it's not exactly thought out.
>
> PPS. Should move thread to torque devel, sorry.
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
--
David Beer | Software Engineer
Adaptive Computing
From djohnson at osc.edu Wed Feb 22 14:50:32 2012
From: djohnson at osc.edu (Doug Johnson)
Date: Wed, 22 Feb 2012 16:50:32 -0500
Subject: [torquedev] [torqueusers] Performance of non-GPU codes on GPU
nodes reduced by nvidia-smi overhead
In-Reply-To:
References:
Message-ID:
Hi David,
I'm in the process of devising a cleaner way to do what I did in my
previous quick and dirty patch. I'm working out of the torque-2.5
branch. I'd be very interested in working with whoever gets assigned
this problem at Adaptive.
And as an update, I've executed many jobs on the GPU and non-GPU nodes
with the patch referenced below. Same mom binary with nvml support on
both types of nodes with no apparent problems.
Doug
At Wed, 22 Feb 2012 13:53:16 -0700,
David Beer wrote:
>
> Doug,
>
> This is now officially in the list of things to do, and I'll keep you updated on it. That may sound
> terrible, but it should be done quickly.
>
> David
>
> On Fri, Feb 17, 2012 at 3:12 PM, Doug Johnson wrote:
>
> At Fri, 17 Feb 2012 13:10:41 -0700,
> David Beer wrote:
> >
> > Doug,
> >
> > I have created a ticket for our documentation team to note that the TDK is where nvml.h can be
> > found.
> >
> > We also thank you for the patch. I believe there is some more work that needs to be done beyond
> > just this change, but we will look to get those done very soon. I think it would be ideal to allow
> > people to use the same binary for both GPU enabled and non-GPU enabled nodes.
> >
>
> Yeah, conceptual versus actually working. There's no really proper
> way to do this with how the gpu code is currently inline with many
> ifdefs. It's a surprisingly small number of entry points that need to
> be modified. See the attached patch, this allows an NVML enabled
> build to run on either a GPU or non-GPU node. If all the GPU routines
> were moved into their own file this could be done cleanly and without
> a lot of effort.
>
> Doug
>
> PS. caveat emptor with the patch, I've run two jobs on the nodes so
> it's not exactly thought out.
>
> PPS. Should move thread to torque devel, sorry.
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
> --
> David Beer | Software Engineer
> Adaptive Computing
>
>
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev
From bugzilla-daemon at supercluster.org Thu Feb 23 11:33:50 2012
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Thu, 23 Feb 2012 11:33:50 -0700 (MST)
Subject: [torquedev] [Bug 139] Negative value in 'Que' when using qstat
In-Reply-To:
References:
Message-ID: <20120223183350.5E345678195@http.supercluster.org>
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=139
Victor Gregorio changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vgregorio at penguincomputing.com
--- Comment #4 from Victor Gregorio 2012-02-23 11:33:49 MST ---
Hello, I am also seeing this problem with 2.5.10. In our case, not only is a
negative number listed in Que, but 2 non-existent jobs are listed in Run. I am
not sure if these two issues are related, though.
qstat as a privileged user lists zero jobs, but qstat -q shows:
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch              --      --       --      --    2  -4 --   E R
                                               ----- -----
                                                   2    -4
And qmgr -c 'list server' | grep state_count shows:
state_count = Transit:0 Queued:0 Held:-4 Waiting:0 Running:2 Exiting:0
The only way to clear these erroneous numbers is to restart pbs_server.
Has this issue been resolved in 3.0.4 or 4.0.0?
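Until the underlying bug is fixed, a site could at least detect the bad
counters automatically. The following is a minimal sketch under stated
assumptions: count_negative_states is a hypothetical helper, not part of
TORQUE, and it only assumes the "Name:value" layout of the state_count line
shown above.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of TORQUE): scan one state_count line as
 * printed by qmgr and count the "Name:value" fields whose value is
 * negative.
 *
 * Returns the number of negative counters, or -1 if the line contains no
 * parsable "Name:value" field at all.
 */
int count_negative_states(const char *line)
{
    int negatives = 0;
    int fields = 0;
    const char *p = line;

    while ((p = strchr(p, ':')) != NULL) {
        char *end;
        long value;

        p++;                          /* step past the ':' */
        value = strtol(p, &end, 10);  /* parse the counter, if any */
        if (end != p) {
            fields++;
            if (value < 0)
                negatives++;
        }
        p = end;                      /* continue after the parsed value */
    }

    return fields > 0 ? negatives : -1;
}
```

Fed the state_count line above, the helper reports one negative counter
(Held:-4); a cron job could use a nonzero result as a hint that pbs_server
needs attention.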
--
Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
From bugzilla-daemon at supercluster.org Thu Feb 23 11:55:31 2012
From: bugzilla-daemon at supercluster.org (bugzilla-daemon at supercluster.org)
Date: Thu, 23 Feb 2012 11:55:31 -0700 (MST)
Subject: [torquedev] [Bug 139] Negative value in 'Que' when using qstat
In-Reply-To:
References:
Message-ID: <20120223185531.C9FDE4120EBA@http.supercluster.org>
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=139
--- Comment #5 from Nicolas Pinto 2012-02-23 11:55:31 MST ---
> Has this issue been resolved in 3.0.4 or 4.0.0?
The bug is still present in 3.0.4:
% sudo pbs_server --version
version: 3.0.4
% qmgr -c 'l s' | grep state_count
state_count = Transit:0 Queued:-22737 Held:22590 Waiting:0 Running:139