Thursday, April 20, 2017

Theseus Protocol v0.1 Overview

Previously:
Theseus: A Robust System for Preserving and Sharing Research
Resisting Sybil Attacks in Distributed Hash Tables
Securely Finding Friends via DHT Dead Drops 
Distributed Search in Theseus
The State of Theseus, One Month In

Introduction

Hope everyone's having a relaxing day. Things have been a bit quiet over here lately, as I've been busy sampling the boundless delights of graduate life. But fear not: progress on Theseus is still being made. I've been hard at work on the data storage formats described in the previous post's TODO list, and on the specifics of the Theseus protocol itself.

These elements of the design are all interrelated, so it seems appropriate to summarize and discuss them as a unit. That means we've got a lot to cover here, so we're going to break description and discussion into separate sections and make the description section deliberately exposition-light.

Theseus is growing more realizable by the day. Soon, interface design, not protocol details, will end up being the project's top priority. That'll be an exciting shift and I'm looking forward to finding people to collaborate with on it.

For now, though... Let's get technical.


Description


Model

The protocol is conceptualized as a set of bencoded RPC messages following the KRPC protocol format as described by the Mainline DHT implementation of Kademlia, with a custom set of RPCs. BEP-05 defines the KRPC format as follows:
A KRPC message is a single dictionary with two keys common to every message and additional keys depending on the type of message. Every message has a key "t" with a string value representing a transaction ID. This transaction ID is generated by the querying node and is echoed in the response, so responses may be correlated with multiple queries to the same node. The transaction ID should be encoded as a short string of binary numbers, typically 2 characters are enough as they cover 2^16 outstanding queries. The other key contained in every KRPC message is "y" with a single character value describing the type of message. The value of the "y" key is one of "q" for query, "r" for response, or "e" for error. 
Note that we follow the MLDHT KRPC protocol as regards message format, but not as regards message transport.

We define the following queries: find_node, get_peers, announce_peer, get_signatures, announce_signatures, get_raw, announce_raw, search_files, search_nodes, get_info, and the colorfully-named outlier socialist_millionaires_auth. Note the absence of MLDHT's ping query, which get_info renders redundant.

It is expected that from time to time additional queries will be added to this list in support of new features.

Transport

We deviate from the MLDHT specification in running the protocol over TCP rather than UDP and encapsulating all protocol messages within an encrypted channel established and maintained via a variant of the OTR protocol. The duration of TCP connections is left for individual clients to decide. It is noted that peers who communicate often or wish to authenticate to each other would likely see performance benefits from maintaining long-lived connections -- though it is also noted that connection duration is trivially observable by third parties, a fact which may be of significance to privacy-minded users.

Modern iterations of OTR includes provisions for authentication as part of the initial handshake. Authentication is not necessarily desired in our use case -- some nodes may wish to authenticate to each other, and others may not -- and so rather than handling authentication as part of the initial handshake, we break it out into a specialized, optional RPC message exchange which may be carried out at any point post-handshake.

Descriptions of RPC Query and Response Messages

Some of these queries are modeled on the BEP-05 queries, but the resemblance is limited. Most notably, the fact that we are discarding MLDHT's connectionless UDP transport in favor of persistent TCP connections means that we can relax MLDHT's requirement of including "id" keys in every query, since it's now straightforward to track identities by associating IDs with connections. Another difference is that per the anti-Sybil measures discussed in Resisting Sybil Attacks in Distributed Hash Tables, where node IDs do appear they must be accompanied by hash preimages.

It is expected that extensions to any of these queries will, more likely than not, take the form of extra dictionary keys. As such, any implementation of this protocol should be capable of gracefully handling (i.e. silently ignoring) the presence of unexpected keys in any of these messages, as long as the expected parts of the message are correct.

find_node

This is analogous to BEP-5's find_node query, but lacks an "id" key.

arguments: {"target" : "<id of target node>"}

response: {"nodes" : "<compact node info>"}

get_peers

Also analogous to the BEP-5 definition. In addition to omitting the "id" key, the "token" key is omitted. This key is a security measure which is appropriate over UDP but unnecessary over properly encrypted TCP, as discussed below.

arguments: {"info_hash" : "<20-byte infohash of target data>"}

response: {"values" : ["<peer 1 info string>", "<peer 2 info string>"]}

or: {"nodes" : "<compact node info>"}

announce_peer

For this query, which is the last of those directly derived from BEP-05's MLDHT specification, we specify a new optional key, "sybil". "sybil" keys to an integer value of 1 or 0 depending on whether the sending node believes a vertical Sybil attack is taking place at the write address. If "sybil" is present and nonzero, the receiving node may attempt to verify the claim and subsequently increase its timeout for stored data.

We also omit "id" and "token", for the same reasons as above. Note that this leaves the response dictionary empty. This empty response should, of course, still be sent, to acknowledge query receipt.

arguments: {"implied_port": <0 or 1>, "info_hash" : "<20-byte infohash of target torrent>", "port" : <port number>, "sybil" : <0 or 1>}

response: {}

get_signatures

This is much like the get_peers query, but instead of returning a list of peers, it returns a list of 2-tuples containing various bytestrings and their signatures.

If the query is a miss, i.e. if the node does not have any stored signatures, it replies with its best peers, mimicking the response to a find_node query just like the alternate response for get_peers does.

The "key_fingerprint" argument is optional. If it is omitted, all signatures stored at the queried node are to be returned. This is perhaps unexpected. The motivation is that all signed data is meant to be essentially public-record between peers in the protocol, and in fact there are certain situations in which trawling for data is valuable -- for instance to collect cite_keyserver entries, as discussed below.

arguments: {"key_fingerprint" : "<20-byte fingerprint of signing key>"}

response: {signatures" : [["message_one", "signature_one"], ... , ["message_n", "signature_n"]]}

or: {nodes" : "<compact node info>"}

announce_signatures

This is used to publish new signed data associated with a given key. The key's address in the DHT is determined by a 160-bit key signature. When a node receives these submissions, it should, before storing the signatures, check that they are valid. It will also want to check how close the key signature is to the node's own IP, to know how long to cache it. An optional "sybil" argument may be included, which operates just like it does in announce_peer.

Note that signatures need not be announced by someone holding the private key that generated them -- once they're in the network, anyone can re-publish them.

arguments: {"signing_key" : "<public key>", "signatures" : [["message_one", "signature_one"], ... , ["message_n", "signature_n"]], "sybil" : <0 or 1>}

response: {}

get_raw

Similar in operation to get_peers or get_signatures, get_raw is used to retrieve raw binary data blobs stored at the DHT node (as used by systems like the one described in Securely Finding Friends via DHT Dead Drops). The data is returned as a bencoded list of binary strings.

arguments: {"address" : "<20-byte address>"}

response: {data" : ["<first bytestring>", ...]}

or: {nodes" : "<compact node info>"}

announce_raw

Used to store the raw binary data retrieved by get_raw. Like the other announce_* queries, announce_raw includes an optional "sybil" argument.

arguments: {"address" : "<20-byte address>", "data" : "<bytestring>", "sybil" : <0 or 1>}

response: {}

search_files

This query operates as described in Distributed Search in Theseus. It takes as an argument a search string. The remote peer checks its own local files for this string in any way it sees fit, and returns a list of file descriptions including, among other things, titles and magnet links.

If the remote peer has hits, it returns a (compressed) JSON list containing the full metadata for each matching file, along with a string specifying the compression algorithm used. "compression" : "none" is valid, and may even be appropriate when the result string is small enough for the overhead of compression to negate its benefits.

If the remote peer doesn't have any hits, its response should instead contain a "nodes" key suggesting other nodes to query. The list of nodes technically can be empty (though it probably shouldn't be).

The format of the query string is left unspecified, at least for now. The intention is that this could even just contain whatever the user has typed into a search box. The question of how to actually carry out the lookup is left for the remote peer to answer.

arguments: {"query_string" : "string containing terms to be searched for"}

response: {"compression" : "compression algorithm name", "data" : "<compressed bytestring for JSON-format file metadata>"}

or: {"nodes" : "<compact node info>"}

search_nodes

This also works as described in Distributed Search in Theseus. It has a "target" key which maps to a (compressed) Bloom filter bitmask.

A "compression" key is included, specifying the algorithm used to compress the bitmask. It may be set to "none" (though that is not recommended). The default algorithm is left undefined at this point -- expect a future post on this subject (and on related Bloom filter parameters).

arguments: {"compression" : "algorithm name", "target" : "<compressed bitmask bytestring>"}

response: {"nodes" : "<compact node info>"}

get_info

Used to ask a remote peer to describe themself to the querying node. The reply contains a dictionary encoding information such as the remote node's ID, their local content Bloom filter, a protocol version or details on specific features they do or don't support, and so on.

By default, all available data is returned. The querying peer may limit the data returned by including the optional "keys" argument in their query and providing a comprehensive list of keys desired. This prevents large data like Bloom filters from being transmitted unnecessarily. The querying peer may also report that its own info has changed (such as would happen when a node changes ID or when files are added to its cache) by including an optional "advertise" key.

Submitting the query with "keys" included and mapped to an empty list is allowed. The reply's "info" key should map to an empty dictionary.

Including "advertise" : {"id" : ["<querying node's id>", "<querying node's id preimage>"]} and "keys" : ["id"] allows two nodes to exchange ID information using only one query and response between them. This is functionally equivalent to MLDHT's ping query (hence ping's omission).

Note that the "values" associated with keys within the "info" dictionary may be arbitrary bencoded data, even though the example below only shows strings. It is perfectly fine to include a set of flags as a binary string, to include nested lists or dictionaries, etc.

A node may have as many info fields as it wants, but it should at the very minimum provide these: {"bloom" : "<search keyword bloom filter>", "id" : ["<node's id>", "<id hash preimage>"], "max_version" : "protocol version string"}


arguments: {"advertise" : {"sender_key_one" : "sender_value_one", ...}, "keys" : ["key_one", "key_two", ..., "key_n"]}

response: {"info" : {"key_one" : "value_one", "key_two" : "value_two", ... , "key_n" : "value_n"}}

socialist_millionaires_auth

Used when two nodes wish to authenticate to one another and ensure that an active MitM attack is not taking place on their communications channel. Motivation discussed below.

All significant data for the protocol is contained within the "data" dictionary to make it simpler to sign it all at once. This dictionary has three keys: "token", which keys to a persistent random token identifying the particular SMP exchange taking place -- necessary because this is a multi-message protocol; "step", which keys to an integer identifying the step in the protocol performed by the current message; and "values", which keys to a dictionary listing the actual information being sent in the course of performing the protocol.

Further details are omitted for now; we'll just say this is an adaptation of OTR's SMP protocol and leave it at that for the time being. We'll deal with this in depth in the forthcoming blog post on message encryption.


arguments: {"data" : {"token" : "<large bytestring>", "step" : <int>, "values" : { ... }}, "signature" : "<signature of dictionary keyed by 'data'>"}

response: {}


Error Codes

The BEP-05 specification of KRPC lists four error codes. They are:

201: "Generic Error"

202: "Server Error"

203: "Protocol Error (i.e. malformed packet, or bad query parameters)"

204: "Method Unknown"

We'll define one more error code on top of these:

211: "Rate-limiting active"

It seems unlikely that any non-malicious node would generate enough traffic to trigger this error, but we'll include it just in case so such nodes can figure out why they've stopped getting responses. I'm numbering this code 21x instead of 20x in an optimistic attempt at avoiding collisions with any other KRPC extensions which might exist. BEP-05 gives no motivation for the numbering of the original error codes, so this is just sort of a best guess at how to properly extend things.

Types of Signed Data

There are a few different reasons for people to store signed data in the DHT. You might be stating your trust in a given public key, or identifying a magnet link for a metadata file summarizing the torrents you've tagged, or sharing a Diffie-Hellman public key integer, or even submitting a signed revocation notice invalidating the content of a previous signed message.

Consistency is important in all these for obvious reasons. Here are formats these signed data could use. Like the queries described above, it's possible that future versions of the protocol may make modifications to some of these data structures, and so implementations should be able to gracefully handle unexpectedly long lists or seemingly extraneous keys appearing in these data structures. Future extensions to the protocol should attempt to limit their changes to those two categories whenever possible, for the sake of preserving backward-compatibility.

Each message-plus-signature datum should be a bencoded data structure of the following format:

[["Theseus Protocol Signed System Data", {"arguments" : { ... }, "expires" : unix_timestamp, "type" : "name"}], signature : "<digital signature of previous list entries>"]

This consists of two nested 2-tuples. The outer 2-tuple contains the inner 2-tuple, then that 2-tuple's signature.

The first list entry within the inner 2-tuple is fixed and serves no purpose other than to tag the signed data as internal to the Theseus protocol, greatly reducing the signature's value for e.g. social engineering attacks (or really anything except its intended use).

The second entry within the inner 2-tuple provides details of the data. Its "expires" key is optional and, if included, should key to an integer representing a Unix timestamp past which nodes should refuse to store this information. An expiration date, essentially. The "type" key defines what this datum is intended to communicate, and "arguments" maps to a dictionary containing any extra required information.

As far as types of data are concerned, we define the following type names: endorse_key, endorse_metadata, endorse_dh, revoke_signature, cite_keyserver.

endorse_key

This is used to signify that the signer trusts the signed key. One purpose for this data is to create an ad-hoc, semi-connected web of trust with several uses, as discussed in Theseus's introductory post.

The "identity" key is optional, and may be used to associate e.g. an email address with the target_key.

The "relation" key is used to indicate the relation between the signed and signing keys. Users who wish to compartmentalize key use may specify "relation" : "self" to indicate that the signed keys should be trusted equally to the signing key. Keying to "peer" is the most common expected use case for establishing web-of-trust relationships. Additional values for "relation" may be defined in future.

arguments: {"identity" : "<optional additional identifying info>", "relation" : "self|peer", "target_key" : "<public key to endorse>", "type" : "name of public key format"}

endorse_metadata

This is used to specify a magnet link corresponding to a torrent for a Theseus torrent metadata file. Theseus nodes build their content indices from files distributed in this way.

arguments: {"magnet" : "<metadata file magnet link>"}

endorse_dh

This is used to provide a Diffie-Hellman public key for use in secure friend-finding. A good client will want to change this value somewhat often, so it probably makes sense to have these entries expire pretty quickly (maybe after six hours, say).

arguments: {"dh_key" : <Diffie-Hellman public key integer>}

revoke_signature

This is used to unconditionally countermand any previously signed datum or data. Revocations may be issued individually or in batches. A list of cryptographic hashes is provided. These should be the full hashes of whatever records are being invalidated. The hash function used should also be specified. For the forseeable future SHA512 should be exclusively used, but this argument is still included for the sake of future-proofing.

Note that including an expiration date in one's original published data is more reliable than planning on issuing revoke_signature at a future point. This functionality is only really intended for things like un-endorsing a compromised key or replacing an obsolete version of a metadata file -- things where it's just about impossible to know when you'll want to do them.

arguments: {"data_hashes" : ["<hash of signed data to revoke>", "<second data hash>", ...], "hash_function" : "SHA512"}

cite_keyserver

This is used to announce that a remote key server is willing to vouch for the identity of the given key. It includes the identity used to query for the key and the server to query. The motivation for this data type's inclusion is discussed at length below. Note that we don't need to include the key itself, since we can retrieve a copy of it from the remote server and use that copy to check the signature.

Nodes should feel encouraged to actually query the keyserver and validate the signature against the returned key before storing it. However, nodes may opt not to do so if they are resource-strapped or if they don't want to expose themselves by generating the required network traffic. Verifications, if performed, should be rate-limited, to prevent naive nodes from being compelled by malicious nodes to flood keyservers with traffic.

arguments: {"identity" : "<keyserver query string>", "key" : "<public key>", "server" : "<server URL>"}


Torrent Metadata File Format

The format for metadata files is very simple. It should follow JSON format. The top-level data object should be a dictionary with two keys: "version" and "data". "version" should map to a version string used to tell clients how to read the file. "data" should map to a list of files, each of which is represented by a dictionary containing key/value pairs detailing the file. Required keys are "title", "authors", and "magnet"; recommended keys include "subject""tags", and "abstract". An example with bare-bones metadata on one file:

{
    "version" : "1",
    "data" : [
        {
            "title" : "A Field Study of Digital Forensics of Intrusions in the Electrical Power Grid",
            "authors" : ["Eli Sohl", "Curtis Fielding", "Tyler Hanlon", "Julian Rrushi", "Hassan Farhangi", "Clay Howey", "Kelly Carmichael", "Joey Dabell"],
            "magnet" : "<magnet link>",
            "subject" : "computer science",
            "tags" : ["computer science", "security", "power grid", "forensics"],
            "abstract" : "text of abstract goes here (omitted for brevity in this example)"
        }
    ]
}


The Bloom filter for a node storing this file should include the first and last names of each author, as well as the following words: field, study, digital, forensics, intrusions, electrical, power, grid, computer, science, security.

One minor detail concerning the Bloom filters: any worthwhile hash function is going to be case-sensitive, so we might as well specify that all entries be converted to lowercase before hashing them into the Bloom filter to avoid e.g. a filter containing "Forensics" failing to register a hit in a search for "forensics".


Description Section Revision Log

4/20/2017: Published.
4/21/2017: Clarifications to metadata format description; structure slightly changed to support addition of "version".


Discussion


Transport

Perhaps the most discussion-worthy aspect of the design is the decision to operate over an encrypted TCP channel, rather than over unencrypted UDP packets as e.g. MLDHT and parts of Tribler operate. We'll save a detailed discussion of this topic for the future post in which the specifics of the channel are described. For now, a few brief comments:

A primary goal is to make the protocol as hard to fingerprint as possible. Hence all messages being exchanged in an encrypted channel (which on the wire appears to be fully random). If all an observer sees is a Diffie-Hellman key exchange followed by a large amount of seemingly random noise, clearly something's going on, but it's nontrivial to figure out what. It could be Theseus -- or it could be anything else that encrypts all its communications. This has two advantages: first, nobody can tell for sure that you're using Theseus from traffic analysis; second, any signature-based firewall rules written to block Theseus traffic would necessarily also block traffic from many other things, which is likely a prohibitive level of collateral damage. Making Theseus hard to fingerprint makes it hard to censor without also accidentally censoring a whole lot of other things you didn't want to break.

Note, as an addendum, that of course users of any peer-to-peer regularly advertise their contact information to other peers -- meaning, of course, that if anyone wants to find out whether you're using Theseus, they can just maintain a presence on the network and see if your IP address ever shows up in e.g. peer lookups. There are ways around this, for instance if some peers are willing to act as proxies for others, which is a topic I'll be addressing in a future blog post on preserving user privacy. But it's important to recognize that these can only ever protect a proper subset of the peers on the network. So making the protocol hard to fingerprint does not prevent its users from ever being identified. What it does do is increase the effort required from a passive attack to an active attack -- and a sustained, involved, and only partly reliable one at that -- which is a significant win for us.

Another goal is to make it as hard as possible to interfere with normal operation of the protocol. As a case study, MLDHT has to go through some contortions to pull this off: for instance, the token management between "get_peers" and "announce_peer". "get_peers" queries return, along with their data, a one-time token which must be included in a subsequent announce_peer query. The purpose of this token is "to prevent malicious hosts from signing up other hosts for torrents."

You might not think this would be a problem in the first place, since the recipient of an announce_peer query just signs up the sending peer for the torrent -- but source addresses of UDP packets can be spoofed, and so a malicious peer could try to sign up others by sending spoofed queries. The token solves this by introducing state of which the malicious peer is (presumably) not aware. A different solution is to just encrypt the communications channel with a secret known only by the participants. An outside observer who tries to inject queries into this channel would fail: since they don't know the encryption key, their messages would decrypt as gibberish. This simplifies the protocol design process, since it greatly weakens the assumptions we must make about the powers available to our attackers.

Some explanation should probably also be given for the choice of OTR (or a lightly tweaked version thereof) for the encryption. Why isn't it enough to just agree on a key and use it for the duration of the communication?

The answer is that using a single long-lived key is probably enough -- but that I don't like dealing in "probably"s.

A little context: OTR was originally developed to protect social communication, which the original authors saw as having different and more stringent privacy requirements than communications between automated systems. I agree with their assessment of the privacy requirements for social communications. I think it's interesting, though, that people never seem to even consider extending these requirements to the automatic communications made by programs at our behest. Wouldn't it make sense to expect the same privacy properties of, say, a question asked to a friend and a question asked to a search engine?

Quoting from the original paper on OTR, "Off-the-Record Communication, or, Why Not To Use PGP":
The notion of an off-the-record conversation well-captures the semantics one intuitively wants from private communication: only the two parties involved are privy to the contents of the conversation; after the conversation is over, no one (not even the parties involved) can produce a transcript; and although the participants are assured of each other’s identities, neither they nor anyone else can prove this information to a third party.
We absolutely want these properties for private conversations -- but really, they all seem like they'd be nice to have in any communications which aren't explicitly intended to be part of a public record. It's less that we'd necessarily miss these properties' absence, and more that their presence can hardly hurt. Forward secrecy in particular is a very appealing property in that it somewhat lessens the impact of key compromise.

We should also say a few words about the decision to break authentication out into an RPC exchange. The reason authentication was introduced into OTR in the first place was to allow the detection of man-in-the-middle attacks of the sort to which vanilla Diffie-Hellman is vulnerable. This is only possible when the transacting nodes know some information about each other a priori -- typically a public key -- or else possess a shared secret established over a separate, trusted channel. This works well enough in social contexts, but in this peer-to-peer context, where nodes don't necessarily know anything about each other and in all likelihood have been introduced to each other by other nodes to whom they are otherwise total strangers, we can't make such assumptions about key distribution. Not everyone will want to authenticate -- indeed, not everyone will possess a key their remote peer would trust enough to accept authentication by in the first place.

As such, the unfortunate truth is that traditional defenses against man-in-the-middle attacks cannot be applied in this context. It is however noted that nine times out of ten, it makes more sense in terms of cost vs benefit for an active attacker to simply run malicious nodes in the network rather than MitMing the communications of existing nodes which don't trust each other to begin with.

That said, some users do possess keys with which they can mutually authenticate. For these users we have the Socialist Millionaires authentication queries, in which the peers carry out the Socialist Millionaires protocol to verify that they possess the same shared encryption secret and sign their messages to prevent tampering in-transit. This allows the minority of peer connections for which authentication is desired to carry out this authentication securely.

Note that schemes like secure friend-finding securely associate a node's contact info with its public key, meaning that if that node subsequently refuses to authenticate using this public key (which it previously, securely attested to owning), then this refusal itself may be taken as an indication that communications are being tampered with by an active MitM attacker.

One piece of popular wisdom among the security-minded is, "don't roll your own crypto". This is very good advice and some might say I'm flirting with breaking it here by making protocol tweaks. All I can say is that whenever I deviate from standard practice, the decision is very carefully considered. As I'm sure you've gathered by now, I want this thing to be secure, and that means placing as little trust as possible in the work of any single person, including myself -- hence my extensive documentation of this design process. If I'm taking risks, I want to be exceedingly clear about what they are and why I consider them to be justified. Consider this full transparency and full disclosure rolled into one.

A corollary from the world of competition math comes to mind. When you take the Putnam exam, they explicitly warn you against submitting, quote, "worthless material". I've always appreciated the bluntness of this dictum and tried to apply it to my life more generally. This project is no exception.

As I'd hope you've gathered by now, I am also completely open to feedback. If you're skeptical about any part of the system, I'd be happy to discuss it with you further. You can get in touch with me in the comments, on Twitter, by email, through Signal, or through any combination thereof.

RPC Queries

The design of many of these is based off of Kademlia and MLDHT. BEP-05 does a good job of discussing the design considerations at work there.

As mentioned above, we deviate somewhat from this protocol by not sending IDs in every message. Including that key simply isn't necessary when we have a connection-based model, because we can track things like IDs as persistent state associated with the connection. Same with the "token" key in the MLDHT version of announce_peer: it's necessary there but not here, since we're not vulnerable to the same type trivial message spoofing that MLDHT is.

We use the DHT to track multiple types of data. All of them are associated with (hash-generated, and thus evenly distributed) addresses in the DHT, and get_peers is used to traverse the network regardless of what data we're interested in retrieving. For each type of data, we define a get_* query and an announce_* query to store and retrieve it respectively.

Note that announce_peer is very resistant to DoS attacks that flood a peer with data to store, because any given IP address can only sign itself up once, meaning you'd need a huge number of IP addresses at your disposal if you wanted to overwhelm a node with bad data. Unfortunately, the same is not true for announce_signature and announce_raw. Any single peer could generate and submit a colossal amount of raw data or signature data.

To deal with this, an important design goal for system components using these features is to make sure that as little actual data as reasonably possible is put in the DHT. If we can stick to this guiding principle, we can safely assume that anyone submitting large amounts of data is doing so maliciously and can be automatically blacklisted.

A little more on get_info: if we're on TCP then it just makes sense, really, to distribute peer metadata in its own message rather than just packing it into every single message like MLDHT does with its "id" argument. We can track this metadata easily enough if we just request it once and, as soon as we get it, associate it with the connection.

Having an explicit get_info query allows us to efficiently share more data than just peer IDs. The search subsystem uses it to track Bloom filters, for instance, and we include an indicator of the maximum protocol version supported by a node to make sure nodes can figure out how to talk to each other. It is expected that various extensions to the protocol might add keys to this dictionary; if the dictionary starts getting large enough, it might make sense to start breaking it up into sub-dictionaries indexed by feature name to give these different features their own "namespaces". At this point, though, such measures would probably be overkill.

Signed Data Formats

The hardest thing about this part of the design is trying to come up with something to call these. They're essentially statements of various kinds, endorsed by the holder of a specific key. But calling them "statements" lacks kick, calling them "declarations" is tacky, calling them "proclamations" makes it sound like we're not sure what century we're in... and so on. For now we'll just call them "data", "signed data", etc. Better ideas welcome.

As for the actual data types, a few of them deserve individual discussion.

endorse_key

As said above,
This is used to signify that the signer trusts the signed key. One purpose for this data is to create an ad-hoc, semi-connected web of trust with several uses, as discussed in Theseus's introductory post
In addition to endorsing others' public keys, users who would rather use a dedicated public key for their Theseus operations (which would be understandable from the point of view of compartmentalization, and because this key may generate many signatures over its lifespan) may use endorse_key to indicate and, well, endorse that dedicated key.

Some might wonder why we don't follow a pre-established standard format for key signatures. If we did this, people could import (say) their extensive lists of PGP key signatures that they've spent years curating, and have the associated web of trust data ready to go. But is this actually what we want? The social function of key-signing is endorsement of identity -- vouching that a certain person owns a certain key. The function of key-signing in Theseus is to recommend that anyone who wants to help store content from the signing key should also store content from the signed key -- so that, for instance, one researcher trusted by a large part of the network could endorse their academic peers and thus encourage the system to start storing these peers' documents as well. You don't necessarily want to make those endorsements for everyone whose public key you've ever signed.

revoke_signature

The main noteworthy features of this is the ability to revoke signatures individually or in batches, and the use of cryptographic hashes. Using hashes ensures that we can countermand a previously issued signature without specifying what was originally signed. A signature previously issued in error might be embarrassing to the issuer, and so having a revocation scheme which avoids explicitly recording the mistake for all time is an attractive feature. Recording revocations in batches also lowers the amount of memory required of a DHT node to store the data.

cite_keyserver

This data type helps us address a difficult problem which until now I've been sort of skirting around. This is the issue of where we're getting trusted keys from in the first place.

We've talked at length about what to do with these keys once we've got them, but where do they come from? Well, people could input them manually -- which would be a huge pain. We could also get them from some kind of trusted source. We don't want to rely on a single such source, because this would undercut the principle of decentralization around which Theseus is designed. But allowing arbitrary sources to be cited and leaving it up to individual nodes to decide which sources to trust (perhaps with the assistance of a suggested list) seems like a more redundant, and thus more reliable, model.

I might discuss this in more depth in future. For now, a quick one-paragraph sketch: Peers could choose to trust (say) the key for any .edu email address registered in the PGP Global Directory, or any identity registered with a specific dedicated key server hosted by trusted Theseus users, or so on. Most key servers don't make it easy to scrape lists of their users, though, and even if we did have such a list, most of the users on it would have nothing to do with Theseus. cite_keyserver helps us dodge this issue: people with keys on a trustworthy key server can create cite_keyserver records, and people looking to build a list of trusted keys can periodically query a random, roughly evenly distributed set of DHT nodes to look for cite_keyserver records in the results (it might even make sense to combine this operation with the periodic random peer ID lookups used for network size estimation). These records can then be verified to get a working list of trusted keys.

Metadata Files

Since these files are meant to be human-readable and the data in them might have to be parsed by a web interface, JSON is preferred over bencode for these files. Since JSON is a fully plaintext file format and these records contain a number of repeated elements, it can be expected that they will compress quite well. As such, we compress them before sending them over the wire. The hope is that not only will this result in a more efficient use of bandwidth but it may also increase traffic opacity after encryption. Speaking of which...

Traffic Analysis

As observed several times above, many of the protocol's features bear strong resemblances to each other; for instance, call signatures are very similar across the get_* and announce_* subsystems, and peer traversal by search_nodes is very similar to traversal by get_peers. This is true not only for the (approximate) sizes of the messages but for the specific patterns of connections and messages established for each. The idea is that making these different operations follow similar traffic patterns makes passive analysis of protocol traffic even less rewarding. Messages which are close but not quite equal in length could even be padded pre-encryption, ensuring uniform message size on the wire and make things even more opaque.

All of this serves no real purpose aside from making life more difficult for passive observers at the internet-backbone level -- but that's something that I would argue is, in and of itself, an important goal. If the last couple decades have taught us anything, it's that ISPs and government agencies cannot be trusted to respect the privacy of the citizens they serve. If they refuse to stop surveilling us, then we have no choice but to try and make their jobs as hard as possible. All of this is self-defense.