Distributed computing in REBOL

Author: Gabriele Santilli
Last update: 17-Jan-2005

===Introduction

In this document I'd like to outline my vision for distributed computing, mainly inspired by the advancements in P2P network technology such as the Chord lookup protocol.

=url http://www.pdos.lcs.mit.edu/chord/

While I'm implementing Chord in REBOL as the basis for a distributed architecture for the next version of SURFnet's Network Detective, I will present here the possibilities offered by a completely decentralized architecture, and how to solve some of the problems that come from decentralization.

===Chord

Chord is basically a simple lookup protocol that can scale very well with the number of nodes in the network. It works by using the mathematical concept of a ring; to give an example, the hours on an analog clock form a ring. Three hours after 3 is 6, but three hours after 10 is 1. So on the clock you can say that 1 comes after 10.

To explain how Chord works, let's continue with the clock analogy. If you assign a number from 1 to 12 to each node (with each node having a different number), you can then place each node on our imaginary clock. For example, we could have four nodes, at 1, 3, 7 and 10. Each node only needs to know about its successor and predecessor nodes; so 3 only knows that its successor is 7 and its predecessor is 1, 7 knows 10 and 3, 10 knows 7 and 1, and 1 knows 3 and 10.

Now, to locate any resource in the network, we need to assign a number to each resource too. (Of course, this needs to be done in such a way that it is easy to calculate the number from the resource you want to locate.) Then, for each resource, the node that is considered responsible for it is the resource's successor node. So in our example, if we wanted to know about resource number 4, we would have to ask node 7.

Chord is an algorithm that allows finding the successor node responsible for any given number. It works this way: if the number is between your predecessor's number and your number, then you are its successor. If it is between your number and your successor's number, then your successor is the one responsible for it. Otherwise, you don't know, so you ask your successor; it will apply the same logic and will thus either find the responsible node and tell you, or ask its own successor, and so on. This is like a linear search in a list; it does not scale well if you have a lot of nodes. So in practice each node has to know more than just its successor and predecessor; the full details, for those interested, are in the Chord papers, but in the end you get O(log N) lookup time, where N is the number of nodes.

In my REBOL implementation of Chord, I'm using a ring modulo 2^160 (i.e. with all the 160-bit integer numbers, from 0 to 2^160 - 1); each number is actually called a key and is obtained using a hash function (CHECKSUM/SECURE in our case).
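Just to make the lookup logic concrete, here is a minimal sketch of the naive version, using the clock example with small integer keys and simple node objects made up for the illustration (the actual implementation works on 160-bit keys and keeps extra routing information):

    ; Naive Chord lookup on the "clock" ring; the node objects here are
    ; only illustrative. Real keys are 160-bit values, e.g.:
    ; key: checksum/secure "name of the resource"

    between?: func [
        "True if KEY lies in the ring interval (LO, HI]"
        key [integer!] lo [integer!] hi [integer!]
    ] [
        either lo < hi [
            all [key > lo key <= hi]
        ] [
            ; the interval wraps around the ring (e.g. from 10 back to 1)
            any [key > lo key <= hi]
        ]
    ]

    find-successor: func [
        "Return the node responsible for KEY, asking along the ring"
        node [object!] key [integer!]
    ] [
        either between? key node/id node/successor/id [
            node/successor
        ] [
            find-successor node/successor key
        ]
    ]

    ; the four-node example: nodes at 1, 3, 7 and 10
    n1:  make object! [id: 1  successor: none predecessor: none]
    n3:  make object! [id: 3  successor: none predecessor: none]
    n7:  make object! [id: 7  successor: none predecessor: none]
    n10: make object! [id: 10 successor: none predecessor: none]
    n1/successor: n3    n1/predecessor: n10
    n3/successor: n7    n3/predecessor: n1
    n7/successor: n10   n7/predecessor: n3
    n10/successor: n1   n10/predecessor: n7

    responsible: find-successor n1 4
    print ["resource 4 is handled by node" responsible/id]   ; node 7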
===Distributed Hash Tables

Usually the Chord protocol is used to implement distributed hash tables (DHTs for short); what you want to do in a hash table is associate data with keys, so that you can retrieve the data by knowing the key. This fits Chord nicely, because you can just store the data on the key's successor node. A very good DHT based on Chord is described here:

=url http://www.pdos.lcs.mit.edu/papers/chord:cates-meng.pdf

I plan to implement it (even if not fully) in REBOL too.

A DHT removes the need for a centralized server to store data. Data can be stored on the network itself, with automatic redundancy, so that it is much more reliable than a single server (if the server fails, you lose the data); and as more users want to access the data, the number of nodes grows, so the load on each node stays constant. In the end, a distributed approach is more reliable, more affordable (you don't need a dedicated server), and more scalable.

---A simple distributed web service

One great example of the possibilities of a DHT is a distributed web service for static files: imagine just hashing each URL and looking it up in the DHT. No need for web servers anymore, no more load problems for highly visited web sites, no more site downtime due to server maintenance or failure. Too bad legacy systems won't allow this for a long time...

===Problems

Of course, there are drawbacks too. While a distributed system cannot be attacked as easily as a single server (an attacker would need to attack thousands -- if not millions! -- of nodes), it is vulnerable to an attacker joining the ring with malicious nodes. This can be solved by adding redundant checks on the results you get from other nodes, and by having a trust level for each node, so that you trust well-known, good nodes more than new, possibly evil ones. This is not trivial and may get a bit complicated, but I think it's doable. Of course, you need a way to identify nodes and to authenticate data and messages...

===Identification, authentication and authorization

In a fully distributed architecture without any authority, how do you identify and authorize users? Of course, you need some form of authority. The good thing is that this can be done in a distributed fashion too, except that you need a hierarchy (as you don't want all nodes to be equal -- you want some to have more authority than others). However, the hierarchy doesn't need to be in the network; it just needs to be in the trust. What you do is similar to DNS; there's even a proposed DHT-based DNS service that shows how DNS can be fully distributed using Chord:

=url http://www.pdos.lcs.mit.edu/chord/papers/ddns.pdf

What I propose is having one master Certification Authority that is recognized by every node. Each certificate has a trust level; the main CA has trust level 10. A certificate cannot have a trust level greater than or equal to that of the CA signing it, so the main CA can at most sign level 9 certificates. This trust level can be used to control how much the data from a certain node is trusted, but it is not strictly needed for identification itself.

All the main CA has to do is create a few level 9 certificates for the top level CAs (TLCAs). You can have as many TLCAs as you want, but usually you will want just a few, much like the TLDs in DNS. Each TLCA will then be able to create lower level CAs, down to the final user certificates. Each CA is responsible for the uniqueness of the names of the certificates it creates, just like in DNS. So CAs will need to keep a database of user names, but since you can have many levels this easily scales to any number of users.

This way it is possible to assign a unique name to each user; each user just registers with a "local" CA (for example, its organization's CA). So if user "john" registers with the "acme" CA, which is registered by the "inc" CA, which is in turn registered by the TLCA "us", he can be uniquely identified as "us/inc/acme/john", or "john.acme.inc.us" in a DNS-like notation.
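Just to illustrate the notation, a little sketch that turns such a CA path into the DNS-like form could look like this (the function is only an example, not part of any actual protocol):

    ; Turn a CA path like "us/inc/acme/john" into "john.acme.inc.us".
    ; The notation itself is just an example.
    to-dns-name: func [id [string!] /local parts result] [
        parts: reverse parse id "/"
        result: copy first parts
        foreach part next parts [repend result ["." part]]
        result
    ]

    print to-dns-name "us/inc/acme/john"   ; john.acme.inc.us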
When you can easily identify users and user domains, authorization is no big deal anymore. Authentication is just a matter of verifying the certificates.
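Here is a rough sketch of how the trust level rule could be checked when walking a chain of certificates; the certificate objects and the levels below are made up for the example, and a real check would of course also verify the signatures:

    ; Each certificate's level must be strictly lower than its signer's.
    main-ca-level: 10

    valid-levels?: func [
        "Check that trust levels strictly decrease along a chain (main CA first)"
        chain [block!]
        /local limit
    ] [
        limit: main-ca-level + 1
        foreach cert chain [
            if cert/level >= limit [return false]
            limit: cert/level
        ]
        true
    ]

    ; example chain: main CA -> TLCA "us" -> "inc" -> "acme" -> user "john"
    chain: reduce [
        make object! [name: "root" level: 10]
        make object! [name: "us"   level: 9]
        make object! [name: "inc"  level: 7]
        make object! [name: "acme" level: 5]
        make object! [name: "john" level: 1]
    ]
    print valid-levels? chain   ; true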