[Linux-cluster] SSI, Virtual Servers, ShareRoot, Etc

gordan at bobich.net
Fri Jan 25 10:20:38 UTC 2008



On Thu, 24 Jan 2008, isplist at logicore.net wrote:

>> Indeed. That sounds rather like you were using a SAN just for the sake of
>> using a SAN, and taking all the disadvantages without any of the
>> advantages.
>
> I'm not sure what you mean. I use an FC SAN because it allows me to separate
> the storage from the machine.

But is that really cost-effective? SANs aren't exactly cheap. Even if you 
build one yourself, it takes a lot of disks before you actually break even 
on disk costs.

> What I was hoping to do with the partitions was to give each blade its own
> boot and scratch space, then allow each blade to have/use shared storage for
> the rest. I was hoping to boot the machines from FC or perhaps PXE.

The point I was making was that a SAN isn't cost-effective unless you are 
reaping other benefits (such as simplified administration) in addition to 
saving on storage space (by the time you put the rest of the SAN box 
together, the chances are that your net price per GB will increase). SAN 
is also typically slower than local disks (don't believe the marketing 
hype).

> Then
> someone mentioned OpenSharedroot which sounded more interesting than carving
> up a chassis into a bunch of complicated partitions. Just never got back to it
> but want to again now. I badly want to eliminate the drives from each blade in
> place of PXE or FC boot with something such as a sharedroot.

Sure, PXE+OSR is exactly the sort of administration simplification that I 
was talking about. It also saves you extra space on top. The fact that 
most configuration files (with only a few exceptions) are shared saves
you from having to implement really naff things like coming up with 
complex frameworks to push the configuration to all machines and keep them 
in sync.
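
For the few files that genuinely do have to differ per node (hostname, 
network settings and the like), the shared root approach keeps one copy 
per host and points the shared tree at it - OSR calls these CDSLs 
(context dependent symbolic links). Very roughly, and with made-up 
paths, the idea looks like this:

  # one directory of node-specific files per host (paths are made up)
  mkdir -p /cluster/cdsl/node1/etc/sysconfig
  mkdir -p /cluster/cdsl/node2/etc/sysconfig
  cp /etc/sysconfig/network /cluster/cdsl/node1/etc/sysconfig/
  cp /etc/sysconfig/network /cluster/cdsl/node2/etc/sysconfig/

  # the shared tree points at a fixed per-host mount point...
  ln -sf /cdsl.local/etc/sysconfig/network /etc/sysconfig/network

  # ...and at boot each node bind-mounts its own directory there
  mkdir -p /cdsl.local
  mount --bind /cluster/cdsl/$(uname -n) /cdsl.local

The OSR tooling sets this up for you, so treat that as an illustration 
of the mechanism rather than the exact commands.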

The same applies to keeping installed packages the same across the cluster 
(you install on any node, and all nodes have it), and, most importantly, 
to the data itself. All of it is always going to be consistent. It 
effectively reduces the administration and maintenance complexity from 
O(n) to O(1).
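
To make the package point concrete: because the filesystem and the RPM 
database are the same everywhere, installing something on one node really 
is installing it everywhere. Something like:

  # on any one node
  yum install httpd

  # on any other node, it's already there
  rpm -q httpd

The only per-node step left is starting or restarting the service on the 
nodes where you actually want it running.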

The only thing to watch out for is that swapping onto SAN is likely to be 
relatively slow, as is scratch space. The only time I'd use completely 
diskless nodes is when I can get away without swap/scratch partitions and 
just use shared space for everything.
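
In practice that tends to mean leaving a small cheap local disk (or flash) 
in each node purely for swap and scratch, with everything else on the 
shared root. A hypothetical fstab fragment (device names are assumptions, 
and only work in a shared fstab because the nodes are identical):

  /dev/sda1   swap       swap   defaults           0 0
  /dev/sda2   /scratch   ext3   defaults,noatime   0 0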

>>> A shared root system would be better because then I don't have
>>> to have fixed partitions, just files.
>>> Then, each node would have its storage over other storage chassis on
>>> the network.
>
>> Not sure what you mean there, the last two sentences didn't quite parse.
>> Can you please elaborate?
>
> These two statements you mean?
>
> What I meant was first, that when I was trying this out, I was not aware of
> the sharedroot project. However, I could take one of my 12-drive FC chassis
> and partition it into say 32 partitions. My hope was to be able to boot 32
> machines from that storage chassis. So, for the cost of running 12 drives in a
> RAID5 array, I would eliminate 32 drives of all sorts of sizes for 12.

But does that really end up being cheaper, when you factor in the cost of 
the chassis itself?

> Since I was not able to boot from FC for what ever reason, I found that I
> could install a small cheap flash IDE card into each blade for their /boot
> partition, then its main / partition would be the storage device. This worked
> but of course I ran into other problems.
>
> The problem had to do with not just zoning complications but that the storage
> device was static in that if say I needed to do something with a certain
> partition, I was unable to make any changes unless I changed all of the
> partitions. Not a good idea.

This is where PXE booting the OSR initrd is useful. Update the initrd, do 
a rolling reboot, and it's all updated. :-)
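
For reference, the PXE side of that is just a pxelinux entry pointing at 
the shared kernel and the OSR initrd on your TFTP server - the filenames 
below are placeholders, and the OSR howto lists the boot parameters it 
actually wants:

  # /tftpboot/pxelinux.cfg/default
  DEFAULT osr
  LABEL osr
      KERNEL vmlinuz
      APPEND initrd=initrd-osr.img

Updating the cluster's boot image then really is just dropping a new 
initrd into /tftpboot and doing the rolling reboot.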

>>> maybe before going right into SSI which seems to be quite an effort to
>>> get running, you might want to think about "cloning" system images by
>>> using any kind of volume manager and copy on write volumes. lvm2 for
>>> instance does support this.

That sounds like misusing LVM for what DRBD is designed for...
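
For what it's worth, the LVM2 copy-on-write cloning being suggested there 
looks roughly like this (volume group and sizes are made up):

  # master image, installed once
  lvcreate -L 10G -n goldenroot vg0
  # one writable CoW snapshot per node, each used as that node's root
  lvcreate -s -L 2G -n node1-root /dev/vg0/goldenroot
  lvcreate -s -L 2G -n node2-root /dev/vg0/goldenroot

It works for stamping out images quickly, but every clone then diverges 
and has to be patched and kept in sync separately, which is exactly the 
O(n) administration problem the shared root avoids.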

> But my thinking is not about ease of creating servers, it is about wasting the
> resources of servers which are sitting there idle for the most part.

Actually, unless the administrator's time is worthless, you'll likely find 
that the cost of the extra man-hours required to admin the system outweighs 
whatever you save on the hardware.

> Important
> machines yet when they aren't doing anything, really, just wasted resources.
> My curiosity was about creating a cluster of machines, which could use all of
> the processing power of the others when needed. Instead of having a bunch of
> machines sitting around mostly idle, when something came up, any one of them
> could use what it needed for resources, better utilizing the resources.

The point is that all the resources you have are spinning all the time 
anyway, and if you are using SSI/OSR, the chances are that you have a 
homogeneous cluster and all the machines are evenly load balanced.

> That, of course, is making the assumption that I would be using applications
> which put to use such resources as an SSI cluster.

One of the neat things you CAN achieve using SSI and clustering is that 
you don't need all the nodes up all the time. Say you have 31 nodes. You 
need 16 of those to maintain quorum. You want another 2 in there just to 
cover any unexpected failures, so that's 18. The other 13 can be powered 
off to conserve power, with their customer-facing IPs HA failed over to 
the 18 working nodes, with the weighting adjusted accordingly. If the load 
goes up, you can bring the additional nodes online to cope with it, fail 
their IPs back to them and adjust the load balancing weights.

It gives you a reasonably nice way of only using the hardware you need for 
normal load, while giving you a transparent way to bring up extra capacity 
when your load spikes (e.g. after a big marketing campaign). The 
hardware that is powered down isn't using power and isn't depleting 
its MTBF, so it prolongs the operational life of your cluster, too.
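
With cman that falls out of the votes arithmetic: each node contributes 
one vote and quorum is expected_votes/2 + 1 (integer division), so with 31 
single-vote nodes you need 16, and the 18-up/13-off split above stays 
quorate. The relevant cluster.conf bits look something like this (names 
and IDs are placeholders):

  <cman expected_votes="31"/>
  <clusternodes>
    <clusternode name="node01" nodeid="1" votes="1"/>
    <!-- ... one entry per node ... -->
    <clusternode name="node31" nodeid="31" votes="1"/>
  </clusternodes>

cman_tool status on any live node shows the current vote count and quorum 
if you want to sanity-check it before powering nodes down.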

>> That would make things more complicated to set up AND more complicated to
>> maintain and keep in sync afterwards. SSI isn't hard to set up. Follow the
>> OSR howto and you'll have it up and running in no time.
>
> From what I understand of SSI clusters, the applications have to be able to
> put that sort of processing to proper use. While I'm using the term SSI, I am
> really just asking everyone here for thoughts :).

SSI in a very simple setup is just shared-root. It doesn't necessarily 
include Mosix-like things where all the CPUs from all the nodes appear as 
local CPUs. That adds an additional layer of complexity that is a lot less 
useful for most applications. SR SSI, OTOH, gives you immediate and 
obvious benefits such as reduced space usage and more importantly, reduced 
administration complexity.

> All I really mean is kind of where computing is going anyhow. Build something
> very powerful (SSI for lack of better word), allow it to be sliced as many
> times as required (VM), allow all of it's power to be used by any one or more
> requests or share it amongst request automatically.

You'll find that even with Mosix-type node unification, it's more 
efficient to use the virtualizer's own VM migration solution. Mosix is 
useful if you want to be lazy, but it isn't a magic bullet and doesn't 
work well in all scenarios. Most of the time application-level clustering 
is better in terms of performance, and it is usually a lot better at 
handling error conditions (e.g. node failure) gracefully.

> On the resources side, make it easy to add more power by easily being able to
> add servers, memory, storage, what ever, into the mix. Isn't this where
> computing is heading?

That's exactly what I described above. SR makes that really easy. You can 
add additional nodes in the time it takes you to add them to dhcpd.conf 
and cluster.conf.
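
i.e. something along these lines for each new node (the MAC, addresses 
and names are obviously made up):

  # dhcpd.conf
  host node19 {
      hardware ethernet 00:16:3e:00:00:13;
      fixed-address 192.168.0.119;
      next-server 192.168.0.10;     # TFTP server
      filename "pxelinux.0";
  }

and in cluster.conf (remembering to bump config_version):

  <clusternode name="node19" nodeid="19" votes="1"/>

PXE boot the blade and it comes up straight onto the shared root.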

> In my case, I just want to stop wasting all the power I'm wasting, have
> something flexible, easy to grow and manage.

See above. :-)

>> the top of my head. It's useful to keep things like /var/cache shared. But
>> that's all fairly minor stuff. The howto will get you up and running.
>
> I'll take a peek, thanks much.
>
>> The major upshot of SSI is that you only need to manage one file system,
>> which means you can both use smaller disks in the nodes, and save yourself
>> the hassle of keeping all the packages/libraries/configs in sync.
>
> The very first thing I'd like to achieve is being able to get rid of the
> drives in all of my machines in place of an FC HBA or using PXE. Then using
> central storage for each server's needs from there.

You might as well do it all in one go. If you have a single machine set 
up, you can get the OSR set up pretty quickly. Certainly less time than 
it'll take you to create volumes for all the nodes and copy their FS to 
their SAN volume.

> On SSI, again, this is where it is unclear to me and perhaps I am using the
> wrong term. I understand SSI as meaning a single system image but one which
> you can only take advantage of with special applications. In other words, a
> LAMP system would not take advantage of it.

In that case you are misunderstanding. Any homogeneous cluster can take 
advantage of it. In fact, you can do it even where the nodes don't all 
have the exact same job, but then you have to unshare the configs, which 
makes it more complicated. You could still work around that by having / 
shared via GFS and then having a separate /etc volume per node, but then 
you still have the same package set on all the nodes (which may not be a 
problem).
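
As a sketch of that work-around (volume and node names are made up): keep 
/ on the shared GFS, give each node a small volume of its own, seed it 
from the shared config, and mount it over /etc early in that node's boot:

  lvcreate -L 256M -n etc-node01 vg0
  mkfs.ext3 /dev/vg0/etc-node01
  # copy the shared /etc onto it once, then on node01 at boot:
  mount /dev/vg0/etc-node01 /etc

You still get the single shared package set and data; only the config 
tree is duplicated per node.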

To summarize - most clusters can take advantage of SSI.

Gordan



