OSSNET - Proposal for Swarming Data Propagation

Warren Togami warren at togami.com
Wed Oct 29 05:04:52 UTC 2003


(Alan Cox mentioned a theoretical idea of using bittorrent for data
propagation in yum... so this seemed like the most appropriate time to
post this again. Comments would be greatly welcomed.)


OSSNET Proposal
October 28, 2003
Warren Togami <warren at togami.com>

The following describes my proposal for the "OSSNET" swarming data
propagation network. This was originally posted to mirror-list-d
during April 2003. This proposal has been cleaned up a bit and
amended.

Unified Namespace
=================
This can be shared with all Open Source projects and distributions.
Imagine this type of unified namespace for theoretical protocol "ossnet".

ossnet://%{publisher}/path/to/data
Where %{publisher} is the vendor or project's master tracker.
The client finds it with standard DNS.

Examples:
ossnet://swarm.redhat.com/linux/fedora/1/en/iso/i386/
ossnet://ossnet.kernel.org/pub/linux/kernel/
ossnet://swarm.openoffice.org/stable/1.2beta/
ossnet://central.debian.org/dists/woody/
ossnet://swarm.k12ltsp.org/3.1.1/
ossnet://master.mozilla.org/mozilla1.7/

Each project tracker has its own official data source, with the entire
repository GPG-signed for automatic verification by ossnet clients.
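
As a rough illustration (not an existing API), here is a minimal
Python sketch of how a client might resolve the %{publisher} portion
of an ossnet URI with standard DNS before contacting the tracker; the
scheme handling and function name are my own assumptions:

    import socket
    from urllib.parse import urlsplit

    def resolve_ossnet(uri):
        # Split an ossnet:// URI into (tracker IP, path at the publisher).
        parts = urlsplit(uri)
        if parts.scheme != "ossnet":
            raise ValueError("not an ossnet URI: %s" % uri)
        tracker_ip = socket.gethostbyname(parts.netloc)  # plain DNS lookup
        return tracker_ip, parts.path

    # resolve_ossnet("ossnet://swarm.redhat.com/linux/fedora/1/en/iso/i386/")
    # -> (IP of swarm.redhat.com, "/linux/fedora/1/en/iso/i386/")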

Phase 1 - Swarming for Mirrors only
===================================
Initial implementation would be something like rsync, except swarming
like bittorrent and used only for mirror propagation. It may need
encryption, some kind of access control, and tracking in order to
prevent unauthorized access, e.g. to hold a new release secret until
release day.

(The paragraph below about access control and encryption was written
after the release of RH9 and the failure of "Easy ISO" early access due
to bandwidth overloading and bittorrent.  In the new Fedora episteme
this access control stuff may actually not be needed anymore.  We can
perhaps implement OSSNET without it at first.)

I believe access control can be done with the central tracker (i.e. Red
Hat) generating public/private keys, and giving the public key to the
mirror maintainers. Each mirror maintainer would choose which
directories they want to permanently mirror, and which to exclude. Each
mirror server that communicates with another mirror would first need to
verify identity with the master tracker somehow. If somebody leaks
before a release, they can be punished by revoking their key, after
which the master tracker and other mirrors will reject them.
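
To make the "verify identity somehow" step concrete, one possible
handshake (purely a sketch, not part of any existing protocol) is for
the master tracker to send a random challenge that the mirror signs
with the private half of the key pair it was issued at signup, using
plain gpg; the key ID and file names below are invented:

    import subprocess

    def mirror_sign_challenge(challenge_file, key_id, sig_file):
        # Mirror side: detach-sign the tracker's random challenge with
        # the key issued to this mirror (key_id is hypothetical).
        subprocess.check_call(["gpg", "--local-user", key_id,
                               "--output", sig_file,
                               "--detach-sign", challenge_file])

    def tracker_verify_challenge(sig_file, challenge_file):
        # Tracker side: verify against the mirror's public key in the
        # tracker's keyring; revoking a leaker's key makes this fail.
        return subprocess.call(["gpg", "--verify",
                                sig_file, challenge_file]) == 0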

Even without the encryption/authorization part this would be powerful.
This would make mirror propagation far faster while dramatically
reducing load on the master mirror. Huge money savings for the data
publisher... but it gets better.


Phase 2 - Swarming clients for users
====================================
I was also thinking about end-user swarming clients. up2date, apt or yum
could integrate this functionality, and this would work well because
they already maintain local caches. The protocol described above would
need to behave differently for end-users in several ways.

Other than the package manager tools, a simple "wget"-like program
would be best for ISO downloads.

Unauthenticated clients could join the swarm with upload turned off by
default and encryption turned off (to reduce server CPU usage). Most
users don't want to upload, and that's okay because the Linux mirrors
are always swarming outgoing data. Clients can optionally turn on
upload, set an upload rate cap, and specify network subnets where
uploading is allowed. This would allow clients within an organization
to act as caches for each other, or a network administrator could set
up a client running as a swarm cache server uploading only to the LAN,
saving tons of ISP bandwidth. A DSL/cable modem ISP would be easy to
convince to set up its own cache server to efficiently serve its
customers, because setting up a server can be done quickly &
unofficially.
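
A minimal sketch of that client-side policy: upload disabled by
default, with an optional rate cap and a list of subnets it may serve.
All option names here are invented for illustration, not part of any
existing tool:

    import ipaddress

    class UploadPolicy:
        def __init__(self, enabled=False, rate_cap_kbps=None,
                     allowed_subnets=()):
            self.enabled = enabled
            self.rate_cap_kbps = rate_cap_kbps
            self.allowed_subnets = [ipaddress.ip_network(s)
                                    for s in allowed_subnets]

        def may_upload_to(self, peer_ip):
            if not self.enabled:
                return False          # default: download-only client
            if not self.allowed_subnets:
                return True           # no subnet restriction configured
            addr = ipaddress.ip_address(peer_ip)
            return any(addr in net for net in self.allowed_subnets)

    # Example: a LAN cache server serving only 10.5.0.0/16, capped at
    # 512 kbit/s of upload.
    lan_cache = UploadPolicy(enabled=True, rate_cap_kbps=512,
                             allowed_subnets=["10.5.0.0/16"])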

Clients joining the swarm would greatly complicate things because the
protocol would need to know about "nearby" nodes, like your nearest
swarming mirror or your LAN cache server. This may need to be a
configuration option for end-user clients. These clients would need to
make more intelligent use of nearby caches rather than randomly swarm
packets from hosts over the ISP link. The (bittorrent) protocol would
need to be changed to allow "leeching" under certain conditions without
returning packets to the network. Much additional thought would be
needed in these design considerations.

Region Complication
===================
Due to the higher cost of intercontinental bandwidth, or of commodity
Internet versus I2 within America, we may need to implement a "cost
table" system that calculates the best nearby nodes taking bandwidth
cost into account.
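
One possible shape for such a cost table, sketched below with made-up
regions and numbers: each candidate peer is tagged with a region, and
the client prefers peers whose region is cheapest to reach from its
own.

    # Cost of moving data between regions; values are invented examples.
    COST = {
        ("us-i2", "us-i2"): 1,    # Internet2 to Internet2
        ("us-i2", "us"):    5,    # commodity US Internet
        ("us",    "au"):   50,    # intercontinental link
        ("au",    "au"):    2,
    }

    def link_cost(my_region, peer_region):
        # Unknown pairs fall back to a high default cost.
        return COST.get((my_region, peer_region),
                        COST.get((peer_region, my_region), 100))

    def pick_peers(my_region, peers, limit=4):
        # peers is a list of (address, region); keep the cheapest few.
        return sorted(peers, key=lambda p: link_cost(my_region, p[1]))[:limit]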

Perhaps this may somehow use dedicated "alternate master trackers"
within each cost region, for example Australia, which are GPG identified 
by the master tracker as being authoritative for the entire region. Then 
end-user clients that connect to the master tracker are immediately told 
about their nearer regional tracker.

Possible Multicasting
=====================
This isn't required, but multicasting could be utilized in addition to
unicast in order to more efficiently seed the larger and more connected
worldwide mirrors.  Multicast would significantly increase the
complexity of both router configuration and the software, so I am not
advocating this be worked on until the rest of the system is implemented.


Possible Benefits?
==================
* STATISTICS!
As BitTorrent has demonstrated, end-user downloads could
possibly be tracked and counted. It would be fairly easy to
standardize data collection in this type of system (a small sketch of
tracker-side counting follows this list). Today we have no realistic
way to collect download data from many mirrors due to the setup
hassles and many different types of servers. Imagine how useful
package download frequency data would be. We would have a real idea of
what software people are using, and could use that data to gauge where
QA should be focused to make users/customers happier.

* Unified namespace!
Users never need to find mirrors again, although optionally setting
cache addresses would make downloads faster and more efficient.

* Public mirrors (even unofficial) can easily be set up and shut down
at any time. Immediately after going online they will join the swarm
and begin contributing packets to the world. THAT is an unprecedented
and amazing ability. The server maintainer can set an upload cap so it
never kills their network. For example, businesses or schools could
increase their upload cap during periods of low activity (like night?)
and contribute to the world. The only difference between an official
and an unofficial mirror would be that an unofficial mirror cannot
download or serve access-controlled data, since it is not
cryptographically identified by the master tracker.  Any client
(client == mirror) can choose what data it wants to serve, and what it
does not.

* Automatic failover: If your nearest or preferred mirror is down, as
long as you can still reach the master tracker you can still download
from the swarm.

* Most of what I described above consists of ALREADY WRITTEN AND PROVEN
CONCEPTS in existing Open Source implementations like bittorrent
(closest in features), freenet (unified namespace) and swarmcast
(progenitor?). I do not think the access control and dynamic update
mechanisms have been implemented yet. bittorrent may be a good starting
point for development since it is written in python ... although
scalability may be a factor with python, so a C rewrite may be needed. (?)
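
Returning to the statistics point above, a master tracker that already
handles client announces could standardize download counting as little
more than a per-path counter. The announce format below is
hypothetical, only a sketch of the idea:

    from collections import Counter

    download_counts = Counter()

    def handle_announce(announce):
        # announce is a dict such as
        # {"path": "/linux/fedora/1/i386/foo.rpm", "event": "completed"}
        if announce.get("event") == "completed":
            download_counts[announce["path"]] += 1

    def most_requested(n=10):
        # Which packages people are actually downloading; useful for
        # deciding where QA effort matters most.
        return download_counts.most_common(n)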

FAQ
===
1. This idea sucks, I don't want to upload!
RTFM! This proposal says that clients have upload DISABLED by default.

2. This idea sucks, I don't want to upload to other people!
RTFM! In this proposal you can set your mirror to upload only to certain
subnets, at a set upload rate cap.

3. Won't this plan fail for clients behind NAT?
Incoming TCP sessions are only needed if you upload to the swarm, since
other clients connect to you.  Uploading is DISABLED by default.
Downloading only requires outgoing TCP connections.

4. What if outgoing connections on high ports are disallowed?
Then you are SOL, unless we implement a "proxy" mode. Your LAN can have
a single proxy mirror that serves only your local network and downloads
your requests on your behalf.


Conclusion
==========
Just imagine how much of a benefit this would be to the entire Open
Source community! Never again would anyone need to find mirrors.
Simply point ossnet-compatible clients to the unified namespace URI, and
it *just works*. We could make a libossnet library, and easily extend
existing programs like wget, curl, Mozilla, galeon, or Konqueror to
browse this namespace.

This is an AWESOME example of legitimate use of P2P, and abuse is far
easier to police than with traditional illegal uses of P2P clients.
Data publishers need to run a publicly accessible tracker and must be
held legally accountable. This is more like a web server with legal
content and millions of worldwide proxy caches. In any case the web
server would be held accountable for the legality of its content.

That is how this differs from Freenet, which uses encryption everywhere
and is decentralized. Freenet can be used for both good and evil, while
ossnet can only sustainably be used for good, because normal law
enforcement can easily locate and (rightly) prosecute offenders. This is
existing copyright law applied the way it was meant to be. If this idea
became reality, we could point to this glowing example of legitimate P2P
as a weapon to fight RIAA/MPAA interests.

I hope I can work on this project one day. This could be world
changing... and it sure would be fun to develop. Maybe Red Hat could
develop this, in cooperation with other community benefactors of such an
awesome distribution system.

Comments? =)

Warren Togami
warren at togami.com

p.s. Time to short Akamai stock. <evil grin>




