Distributed computing in REBOL

Author: Gabriele Santilli
Last update: 17-Jan-2005

===Introduction

In this document I'd like to outline my vision for distributed computing, mainly inspired by the advancements in P2P network technology such as the Chord lookup protocol.

=url http://www.pdos.lcs.mit.edu/chord/

While I'm implementing Chord in REBOL as the basis for a distributed architecture for the next version of SURFnet's Network Detective, I will present here the possibilities offered by a completely decentralized architecture, and how to solve some of the problems that come from decentralization.

===Chord

Chord is basically a simple lookup protocol that can scale very well with the number of nodes in the network. It works by using the mathematical concept of a ring; to give an example, the hours on an analog clock form a ring. Three hours after 3 is 6, but three hours after 10 is 1. So on the clock you can say that 1 comes after 10.

To explain how Chord works, let's continue with the clock analogy. If you assign a number from 1 to 12 to each node (with each node having a different number), you can then place each node on our imaginary clock. For example, we could have four nodes, at 1, 3, 7 and 10. Each node only needs to know about its successor and predecessor nodes; so 3 only knows that its successor is 7 and its predecessor is 1, 7 knows 10 and 3, 10 knows 7 and 1, and 1 knows 3 and 10.

Now, to locate any resource in the network, we need to assign a number to each resource too. (Of course, this needs to be done in such a way that it is easy to calculate the number from the resource you want to locate.) Then, for each resource, the node that is considered responsible for it is the resource's successor node. So in our example, if we wanted to know about resource number 4, we would have to ask node 7.

Chord is an algorithm that allows finding the successor node responsible for any given number. It works this way: if the number is between your predecessor's number and your number, then you are its successor. If it is between your number and your successor's number, then your successor is the one responsible for it. Otherwise, you don't know, so you ask your successor; it will apply the same logic and will thus either find the responsible node and tell you, or ask its own successor, and so on. This is like a linear search in a list; it does not scale well if you have a lot of nodes. So in practice each node has to know more than just its successor and predecessor; the full details, for those interested, are in the Chord papers, but in the end you get O(log N) lookup time, where N is the number of nodes.

In my REBOL implementation of Chord, I'm using a ring modulo 2^160 (i.e. with all the 160-bit integer numbers, from 0 to 2^160 - 1); each number is actually called a key and is obtained using a hash function (CHECKSUM/SECURE in our case).
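Just to make the lookup logic concrete, here is a minimal sketch of the naive version, using the clock example with small integer keys and simple node objects made up for the illustration (the actual implementation works on 160-bit keys and keeps extra routing information):

    ; Naive Chord lookup on the "clock" ring; the node objects here are
    ; only illustrative. Real keys are 160-bit values, e.g.:
    ; key: checksum/secure "name of the resource"

    between?: func [
        "True if KEY lies in the ring interval (LO, HI]"
        key [integer!] lo [integer!] hi [integer!]
    ] [
        either lo < hi [
            all [key > lo key <= hi]
        ] [
            ; the interval wraps around the ring (e.g. from 10 back to 1)
            any [key > lo key <= hi]
        ]
    ]

    find-successor: func [
        "Return the node responsible for KEY, asking along the ring"
        node [object!] key [integer!]
    ] [
        either between? key node/id node/successor/id [
            node/successor
        ] [
            find-successor node/successor key
        ]
    ]

    ; the four-node example: nodes at 1, 3, 7 and 10
    n1:  make object! [id: 1  successor: none predecessor: none]
    n3:  make object! [id: 3  successor: none predecessor: none]
    n7:  make object! [id: 7  successor: none predecessor: none]
    n10: make object! [id: 10 successor: none predecessor: none]
    n1/successor: n3    n1/predecessor: n10
    n3/successor: n7    n3/predecessor: n1
    n7/successor: n10   n7/predecessor: n3
    n10/successor: n1   n10/predecessor: n7

    responsible: find-successor n1 4
    print ["resource 4 is handled by node" responsible/id]   ; node 7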
===Distributed Hash Tables

Usually the Chord protocol is used to implement distributed hash tables (DHTs for short); what you want to do in a hash table is associate data with keys, so that you can retrieve the data by knowing the key. This fits Chord nicely, because you can just store the data on the key's successor node. A very good DHT based on Chord is described here:

=url http://www.pdos.lcs.mit.edu/papers/chord:cates-meng.pdf

I plan to implement it (even if not fully) in REBOL too.

A DHT removes the need for a centralized server to store data. Data can be stored on the network itself, with automatic redundancy, so that it is much more reliable than a single server (if the server fails, you lose the data); and as more users want to access the data, the number of nodes grows, so the load on each node stays constant. In the end, a distributed approach is more reliable, more affordable (you don't need a dedicated server), and more scalable.

---A simple distributed web service

One great example of the possibilities of a DHT is a distributed web service for static files: imagine just hashing each URL and looking it up in the DHT. No need for web servers anymore, no more load problems for highly visited web sites, no more site downtime due to server maintenance or failure. Too bad legacy systems won't allow this for a long time...

===Problems

Of course, there are drawbacks too. While a distributed system cannot be attacked as easily as a single server (an attacker would need to attack thousands -- if not millions! -- of nodes), it is vulnerable to an attacker joining the ring with malicious nodes. This can be solved by adding redundant checks on the results you get from other nodes, and by having a trust level for each node, so that you trust well-known, good nodes more than new, possibly evil ones. This is not trivial and may get a bit complicated, but I think it's doable. Of course, you need a way to identify nodes and to authenticate data and messages...

===Identification, authentication and authorization

In a fully distributed architecture without any authority, how do you identify and authorize users? Of course, you need some form of authority. The good thing is that this can be done in a distributed fashion too, except that you need a hierarchy (as you don't want all nodes to be equal -- you want some to have more authority than others). However, the hierarchy doesn't need to be in the network; it just needs to be in the trust. What you do is similar to DNS; there's even a proposed DHT-based DNS service that shows how DNS can be fully distributed using Chord:

=url http://www.pdos.lcs.mit.edu/chord/papers/ddns.pdf

What I propose is having one master Certification Authority that is recognized by every node. Each certificate has a trust level; the main CA has trust level 10. A certificate cannot have a trust level greater than or equal to that of the CA signing it, so the main CA can at most sign level 9 certificates. This trust level can be used to control how much the data from a certain node is trusted, but it is not strictly needed for identification itself.

All the main CA has to do is create a few level 9 certificates for the top level CAs (TLCAs). You can have as many TLCAs as you want, but usually you will want just a few, much like the TLDs in DNS. Each TLCA will then be able to create lower level CAs, down to the final user certificates. Each CA is responsible for the uniqueness of the names of the certificates it creates, just like in DNS. So CAs will need to keep a database of user names, but since you can have many levels this easily scales to any number of users.

This way it is possible to assign a unique name to each user; each user just registers with a "local" CA (for example, its organization's CA). So if user "john" registers with the "acme" CA, which is registered by the "inc" CA, which is in turn registered by the TLCA "us", he can be uniquely identified as "us/inc/acme/john", or "john.acme.inc.us" in a DNS-like notation.
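Just to illustrate the notation, a little sketch that turns such a CA path into the DNS-like form could look like this (the function is only an example, not part of any actual protocol):

    ; Turn a CA path like "us/inc/acme/john" into "john.acme.inc.us".
    ; The notation itself is just an example.
    to-dns-name: func [id [string!] /local parts result] [
        parts: reverse parse id "/"
        result: copy first parts
        foreach part next parts [repend result ["." part]]
        result
    ]

    print to-dns-name "us/inc/acme/john"   ; john.acme.inc.us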
When you can easily identify users and user domains, authorization is no big deal anymore. Authentication is just a matter of verifying the certificates.
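Here is a rough sketch of how the trust level rule could be checked when walking a chain of certificates; the certificate objects and the levels below are made up for the example, and a real check would of course also verify the signatures:

    ; Each certificate's level must be strictly lower than its signer's.
    main-ca-level: 10

    valid-levels?: func [
        "Check that trust levels strictly decrease along a chain (main CA first)"
        chain [block!]
        /local limit
    ] [
        limit: main-ca-level + 1
        foreach cert chain [
            if cert/level >= limit [return false]
            limit: cert/level
        ]
        true
    ]

    ; example chain: main CA -> TLCA "us" -> "inc" -> "acme" -> user "john"
    chain: reduce [
        make object! [name: "root" level: 10]
        make object! [name: "us"   level: 9]
        make object! [name: "inc"  level: 7]
        make object! [name: "acme" level: 5]
        make object! [name: "john" level: 1]
    ]
    print valid-levels? chain   ; true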