<html><head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head><body bgcolor="#FFFFFF" text="#000000">Hah! I've been deep into
SGE (user, trainer, consultant) for years. <br>
<br>
Our setups are pretty similar, but I'm hoping to use the AWS cfnCluster
stack (<a class="moz-txt-link-freetext" href="https://github.com/awslabs/cfncluster">https://github.com/awslabs/cfncluster</a>) because it is officially
blessed by AWS and, since it's a CloudFormation template at the end of
the day, it's both easy to support and extend. It also does all the hard
work (auto-scaling etc.) that I don't want to have to code myself via
Ansible. Since I'm a consultant, I need to hand off something to my users
that is easy for them to operate moving forward without me. <br>
<br>
Your experience using IPA in an HPC environment is very helpful. We also
use ansible to automate "ipa-client-install --unattended ..." so
scripting the install and remove commands should be pretty
straightforward. <span><br>
<br>
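As a rough sketch of what that scripting could look like outside of ansible, the snippet below only builds the two command lines, using the same flags quoted further down in this thread. The function names and the sample values are hypothetical, and adding --unattended to the uninstall call is my assumption, not something from the quoted setup.<br>
<pre>
```python
import shlex


def enroll_command(domain, client, ip_address, join_user, join_password):
    """Build the unattended ipa-client-install call as an argument list."""
    return [
        "ipa-client-install",
        f"--domain={domain}",
        "--mkhomedir",
        f"--hostname={client}.{domain}",
        f"--ip-address={ip_address}",
        "-p", join_user,
        "-w", join_password,
        "--unattended",
    ]


def unenroll_command():
    """Build the matching removal call, to run before a node is destroyed."""
    # Assumption: --unattended is added so the removal never prompts.
    return ["ipa-client-install", "--uninstall", "--unattended"]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cmd = enroll_command("example.test", "node001", "10.0.0.11", "joiner", "s3cret")
    print(shlex.join(cmd))
    print(shlex.join(unenroll_command()))
```
</pre>
Keeping the commands as argument lists means they can be handed to subprocess.run without shell quoting problems.<br>
<br>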
Just trying to compare my appreciation for IPA with what I saw on the
ground at massive HPC installations, where the operators jumped through
hoops to remove network services that could break user info or affect
stability. I lost count of how many sites I saw dumping NIS maps
and LDAP directories into plaintext files every 4-6 hours, which they'd
spread across the cluster simply to remove any chance that a failed
NIS/LDAP query could mess up a node, user or job. <br>
<br>
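For anyone who hasn't seen that flat-file trick, here is a minimal sketch of the idea, not any particular site's implementation. It formats accounts in passwd layout and swaps the snapshot into place with an atomic rename, so a node never reads a half-written file. The function names and the snapshot path are hypothetical.<br>
<pre>
```python
import os
import tempfile


def passwd_line(name, uid, gid, gecos, home, shell):
    """Format one account in passwd(5) layout; 'x' marks a shadowed password."""
    return f"{name}:x:{uid}:{gid}:{gecos}:{home}:{shell}"


def write_snapshot(path, lines):
    """Write the snapshot to a temp file, then rename it over the old copy.

    os.replace is atomic on POSIX when both paths are on the same filesystem,
    so readers see either the old file or the new one, never a partial write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if the write or the rename failed.
        os.unlink(tmp_path)
        raise
```
</pre>
The entries could come from pwd.getpwall(), though whether that returns directory users depends on the NSS setup (enumeration has to be enabled), so treat the data source as an open question.<br>
<br>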
Thanks!<br>
<br>
Chris<br>
<br>
</span>
<blockquote style="border: 0px none;"
cite="mid:1492089680.4713.47.camel@googlemail.com" type="cite">
<div style="margin:30px 25px 10px 25px;" class="__pbConvHr"><div
style="width:100%;border-top:2px solid #EDF1F4;padding-top:10px;"> <div
style="display:inline-block;white-space:nowrap;vertical-align:middle;width:49%;">
<a moz-do-not-send="true" href="mailto:gmzgames.de@googlemail.com"
style="color:#485664
!important;padding-right:6px;font-weight:500;text-decoration:none
!important;">Gerald-Markus Zabos</a></div> <div
style="display:inline-block;white-space:nowrap;vertical-align:middle;width:48%;text-align:
right;"> <font color="#909AA4"><span style="padding-left:6px">April
13, 2017 at 9:21 AM</span></font></div> </div></div>
<div style="color:#909AA4;margin-left:24px;margin-right:24px;"
__pbrmquotes="true" class="__pbConvBody"><div><!----><br>Hi Chris,<br><br>we're
facing a similar use case from day to day, but changed from AWS to<br>another
cloud provider. Our use case works on both, so I am referring to<br>AWS.<br><br>We
decided...<br><br>...to use SGE for our HPC infrastructure<br>...recycle
network ranges for 100 static IP addresses + 100 static<br>hostnames<br>...to
use scripts &amp; cronjobs &amp; ansible (depending on "qstat" and
"qhost"<br>output) on the cluster head node to determine how many
additional<br>cluster nodes have to be created as an additional reserve
for<br>"What-if-we-need-more-nodes?" scenarios<br>...to create cluster
nodes via ansible-playbook on AWS from a<br>pre-defined image, do
software installation &amp; configuration via<br>ansible-playbook, do
the IPA domain join via ansible-playbook<br>("ipa-client-install
--domain=&lt;DOMAIN&gt; --mkhomedir<br>--hostname=&lt;FreeIPA-Client&gt;.&lt;DOMAIN&gt;
--ip-address=&lt;FreeIPA-Client IP<br>address&gt; -p &lt;Join User&gt;
-w &lt;Join User's password&gt; --unattended")<br>...to destroy cluster
nodes in two steps: 1) ansible-playbook<br>"ipa-client-install
--uninstall", 2) ansible-playbook destroy cluster<br>node on AWS via API<br><br>(Right
now, I am working on a bulk creation script for IPA users/groups<br>to
expand our single HPC cluster into several, where we have<br>the
same set of users (~65-100) with a differing suffix in the username,<br>e.g.
"it_ops01", "it_ops20", etc...)<br><br>We're using 2x IPA-Servers (ESXi
VMs, 4GB RAM, 2 CPU) in replication<br>with another 2x IPA Servers
(same dimensions) on our main physical<br>datacenter. Didn't see much
impact on the IPA servers during<br>enrollment/removal of domain hosts.
So far after three months of<br>operations, we had several "bad box"
scenarios, all of them because of<br>problems with SGE. We solved these
problems manually, by removing/adding<br>cluster nodes via SGE commands.
<br><br>As you can see, I tend toward [Option 1], since it does all the
magic with<br>pre-defined software commands (sge, ansible, ipa cli),
instead of jumping<br>around with additional scripts doing work that
can be done by<br>"built-in" commands. For us, this works best.<br><br>Regards,<br><br>Gerald<br></div></div>
</blockquote>
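<br>
On the bulk user creation mentioned in the quoted message, one possible shape for it is sketched below, assuming the suffix is a zero-padded cluster number appended to a shared base name (as in "it_ops01" and "it_ops20"). The helper names, the second base name and the first/last name values are hypothetical; the sketch only prints "ipa user-add" command lines and does not run them.<br>
<pre>
```python
import shlex


def cluster_usernames(base_names, cluster_index, width=2):
    """Append a zero-padded cluster suffix to each base name."""
    suffix = str(cluster_index).zfill(width)
    return [f"{base}{suffix}" for base in base_names]


def user_add_command(login, first, last):
    """Build an 'ipa user-add' call as an argument list."""
    return ["ipa", "user-add", login, f"--first={first}", f"--last={last}"]


if __name__ == "__main__":
    # Hypothetical base names, for illustration only.
    for login in cluster_usernames(["it_ops", "it_dev"], 1):
        print(shlex.join(user_add_command(login, "IT", "Account")))
```
</pre>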
<br>
</body></html>