Sunday, June 11, 2017

Resisting Man-in-the-Middle Attacks in P2P Networks

Theseus: A Robust System for Preserving and Sharing Research
Resisting Sybil Attacks in Distributed Hash Tables
Securely Finding Friends via DHT Dead Drops 
Distributed Search in Theseus
The State of Theseus, One Month In
Theseus Protocol v0.1 Overview 
Bloom Filter Parameters for Distributed Search 
Message Encryption in Theseus

In the previous post, Message Encryption in Theseus, I outlined how on top of robust encryption, Theseus can handle optional public-key authentication. This authentication gives us a way of defeating man-in-the-middle (MitM) attacks, but only if at least one of the communicating peers trusts a key possessed by the other. Since most interactions in peer-to-peer networks take place between total strangers, this limitation carries with it some unpleasant drawbacks.

One solution would be to have peers vouch for each other. For instance, when replying to a query with a list of peers, one could include not only the listed peers' contact info but also their public key fingerprints. Then when the querying peer connects to a new peer from this list, they could request the new peer's public key, check it against the provided fingerprint, authenticate using the key if it matches, or bail out or degrade their trust in the connection if the fingerprint and key don't match.

This seems like it would work, but at the cost of necessarily sharing public key identity information very promiscuously. If these are long-term identity keys we're talking about, then that's a big problem as far as anonymity, deniability, etc are concerned. As discussed in the previous post, it makes sense from a privacy angle to be reluctant to share long-term identity information with anyone you don't have a mutual trust relationship with.

That's no reason, though, why we couldn't add dedicated keys for this purpose. Since the term "ephemeral key" is already taken, let's call these, say, "transient keys".

A peer could generate a transient key immediately before first connecting to the network and have this key last only as long as their session does. They could share the key with anyone they connect to, but avoid explicitly associating it with their long-term cryptographic identity (e.g. by authenticating using both keys only with well-trusted remote peers, if ever). The only thing this key ever gets explicitly associated with is the peer's contact info. It would probably make sense to put an upper limit on the lifespan of such keys -- maybe six or twelve hours.

What is accomplished here? Man-in-the-middle attacks are made considerably more difficult and complex. What would it take to carry out a successful MitM attack with this system in place? That's a good and tricky question, the answer to which depends somewhat on the goal of the attack.

If the goal is to read and potentially alter all traffic sent to or from a specific peer, the MitM attack would have to intercept all encrypted traffic to or from that peer, swap out the keys used for all ephemeral ECDH handshakes, actively replace the advertised transient key with an attacker-controlled key every time it is sent, alter the contents of all re-handshake negotiation messages to reflect the attacker-controlled key, and actively MitM those extra handshakes as well.

Such an attack is not impossible, of course, but it would require considerable resources, and if a connection can be established with even one trusted peer then the attack can be detected. The attacker could of course attempt to close the trusted peers' connection prior to their authentication to prevent this -- but that would likely seem suspicious in and of itself.

Note that the impact of malicious nodes lying about other nodes' transient key fingerprints when returning contact info is minimal, since they don't really accomplish much that they couldn't've also accomplished by refusing to refer those nodes in the first place. Really, the worst case scenario is that the connecting node comes to believe that something malicious is taking place -- which, since they were just talking to a malicious node, is actually correct.

Friday, June 9, 2017

Message Encryption in Theseus

Theseus: A Robust System for Preserving and Sharing Research
Resisting Sybil Attacks in Distributed Hash Tables
Securely Finding Friends via DHT Dead Drops 
Distributed Search in Theseus
The State of Theseus, One Month In
Theseus Protocol v0.1 Overview 
Bloom Filter Parameters for Distributed Search 

Brief Summary

Theseus will use the Noise Protocol Framework.

This will allow all network traffic to be encrypted, with flexible authentication and many other desirable properties as detailed below.

This post explicitly obsoletes the Theseus Protocol v0.1 Overview. That overview deliberately lacked specifics on message encryption. It was expected that said specifics would necessitate some changes, and this has indeed proven to be the case. A v0.2 specification is forthcoming.


Prior to learning of the Noise protocol framework I was (with considerable trepidation) planning on designing a custom cryptographic protocol modeled closely on OTR and Signal but with modifications to fit our use case. Both of those protocols came close but ultimately fell just short of satisfying the rigorous and slightly unusual security parameters Theseus is aiming for.

Desired properties for the encrypted channel established by this protocol include the standard AE fare, namely confidentiality, integrity, and authentication. On top of these properties, we desire forward secrecy and strong resistance to surveillance.

By strong resistance to surveillance, I mean all of the above plus several more properties: traffic analysis should yield little to no information, with message lengths and traffic patterns leaking minimal data and protocol traffic indistinguishable from randomness; relatedly, the protocol must be as hard as possible to fingerprint; and lastly, the cryptographic identities of the participants must be hidden from passive observers.

Efficiency, simplicity of design, and avoidance of bloat are also all high priorities. The last and perhaps most troublesome requirement is that authentication be optional, since parties may variously desire mutual, one-way, or no authentication depending on context. The protocol should be able to handle all three (or four, depending on how you count) of these cases straightforwardly and securely.

Startlingly few protocols even come close to meeting these criteria; virtually none meet all of them. Note in particular that cryptographic mainstays like TLS clearly fall short, as do standard message-based protocols like OTR and Signal (albeit by much smaller margins).

As I've mentioned before, the conventional wisdom concerning crypto algorithms and protocols is: whatever you do, don't roll your own. Unable to find any existing protocol meeting all our desired security properties, I was on the verge of breaking this advice and rolling a custom protocol when I discovered the Noise Protocol Framework, which describes something remarkably close to (though considerably better-designed than) what I had in mind -- something which effortlessly provides every property listed above and a few more for good measure. I'm amazed that it isn't more well-known. See the Github wiki's list of properties and protocol comparisons. Having now reviewed the specification in full, I really could not be happier or more impressed by the work that's clearly gone into it.

Noise is less a concrete protocol and more a framework for assembling protocols by mixing and matching stand-alone components in robust ways. What follows are some specifics on how Noise will be used in Theseus.

Handshake and Message Encryption

As mentioned before, a primary goal is to avoid any features that could lead to easy protocol fingerprinting. Negotiating handshake parameters (such as e.g. ciphers) in the clear would obviously violate this design principle. What we'll opt for here is to specify a default handshake parameterization to first establish an encrypted channel, and allow users to optionally re-negotiate a second handshake once communicating within this channel. This allows us to encrypt absolutely everything aside from an initial exchange of ephemeral public keys.

The default handshake will be an exchange only of ephemeral keys, i.e. the Noise_NN pattern. Future handshakes negotiated within this channel may be of the patterns Noise_NK, Noise_KN, or Noise_KK, with an optional extra PSK potentially determined during negotiation and applied via the psk0 modifier.

The default ciphersuite will be Noise448, which specifies ChaCha20/Poly1305 for AE and the conservatively-defined Ed448-Goldilocks curve for elliptic curve operations. We'll have the default hash function be SHA512.

Thus the protocol name for Theseus's default Noise protocol is:


If peers wish to negotiate a second handshake, that handshake may be any of:


Later versions of the Theseus protocol may expand this list, in particular potentially expanding the list of permitted ciphersuites. The handshake re-negotiation is designed to gracefully support such extensions.

One last note: Noise specifies a max message size of 65535 bytes, with the handling of larger messages being considered an application concern. As such, Theseus will prepend total message length to all plaintexts as a 16-bit big-endian integer, as recommended in the framework specification.

Handshake Re-Negotiation

Once an encrypted channel has been established using ephemeral keys, peers may optionally use this channel to negotiate authentication. Once they agree on parameters like a handshake protocol name and which public keys they wish to use in authentication, they may discard their current CipherState objects and, within the same TCP connection, start from scratch executing a new handshake using the previously agreed-upon keys (and, optionally, a similarly agreed-upon PSK).

The challenge for re-negotiation is that two peers must agree on three things:

  1. What sort of authentication to use
    • That is, whether authentication should be one-way or mutual
    • Or in other words, whether to handshake with Noise_NK, Noise_KN, or Noise_KK.
  2. What public keys to use in the handshake
  3. Whether to include a PSK, and if so, what its value should be.
The difficulty is that some peers may understandably be reluctant to advertise to a total stranger the list of all public keys by which they are willing to authenticate. After all, doing so could have obvious and severe ramifications where anonymity is concerned. As such, a peer may be willing to authenticate using a given key if and only if the remote peer demonstrates prior knowledge of this key, and/or if the remote peer is also willing to authenticate using a specific key of their own.

Note that such a model would not completely obviate the abovementioned privacy concern -- what it does is reduce the issue from peers promiscuously advertising identity keys to those peers exposing what is essentially an oracle by which an attacker could test whether the remote peer is willing to authenticate using a single given key.

I see this as a minor concern, as abuse of the oracle should be very easy to detect, but it is still worth pointing out as both an informational note and an acknowledgement that there is perhaps room for further improvement here.

We'll define two protocol messages which together allow for re-negotiation without unnecessarily promiscuous disclosure of information: handshake_suggest and handshake_request.

Messages of the former type are purely informational and may be exchanged any number of times. Their purpose is to communicate re-handshake parameters which the sending party would find acceptable.

A message of the latter type specifies concrete re-handshake parameters. If the remote peer finds these parameters unacceptable, it may reply with an error code. A non-error response indicates that the remote peer accepts the re-handshake parameters.

As soon as acceptance is indicated, both peers should discard their existing protocol state, remembering only the agreed-upon parameters for the new handshake, and the new handshake should then be carried out, with the party who first sent the handshake_request message serving as the handshake initiator.

We'll defer on providing a precise specification of the parameters of handshake_suggest and handshake_request pending the v0.2 spec draft.

Miscellaneous Considerations

Public Key Disclosure

As discussed above, some parties may be reluctant to disclose their public keys to an unauthenticated third party. If both peers harbor this concern, a curious impasse is reached. This is a hard problem to crack, with a number of seemingly obvious solutions harboring non-obvious but fatal flaws. The idea of an approach involving zero-knowledge proofs seems promising but threatens to introduce to the protocol worrying levels of complexity for marginal benefit. This might be an interesting problem for a particularly sharp and motivated contributor to spend some time mulling over.

In-Order Message Delivery

Just as an aside, it's worth noting that the vanilla Noise framework works best in environments guaranteeing in-order delivery of messages. Out-of-order delivery could lead to legitimate messages being decrypted with the wrong key material, causing them to appear invalid. Other protocols such as Signal and OTR face a similar problem and solve it by various means such as order-specifying message headers. These approaches solve the problem but also introduce trivially-recordable metadata and make the protocol significantly easier to fingerprint -- both undesirable properties.

Applications using Noise in spaces where in-order delivery is not guaranteed typically seem to solve the problem by precomputing a certain number of keys and thus maintaining a sliding "window" within which they can gracefully handle out-of-order delivery, i.e. if you precompute five keys ahead then you can handle out-of-order delivery as long as the delivery order doesn't place any message more than five notches away from its spot in the intended message order.

I mention all this purely for informational purposes, as the Theseus control protocol -- the component of the system using Noise -- will operate over TCP, which guarantees in-order message delivery.

Man-In-The-Middle Attacks

Authentication is necessary whenever possible in order to defeat man-in-the-middle attacks, which can otherwise straightforwardly bypass Diffie-Hellman of both the traditional and elliptic-curve varieties. The initial handshake, in which only ephemeral keys are exchanged and therefore no cryptographic statements of identity are made by either party, is thus vulnerable to man-in-the-middle compromise by an active attacker. There isn't really any way around this, since we need to support connections between complete strangers (who obviously won't know or trust each other's public keys, and thus have no easy way of authenticating to one another).

Note that if either party authenticates to the other, the success of this authentication would necessarily indicate the absence of a man-in-the-middle attack. Note further that if an active MitM attacker were to interfere with the re-handshake negotiation, the most harm they could do would be to either cause the second handshake to fail or else prevent it from happening at all.

Of course, if an attacker is in control of a public key well-trusted by a given peer, then naturally the attacker can impersonate the identity associated with this key to this peer and carry out a man-in-the-middle attack in this way. Dealing with such a threat is well beyond the scope of this post (and if you have any ideas on how to do it, I'd gladly buy you a beer...)

Note that the authentication model described above makes no mention of public key roles, leaving itself open to supporting basically any trust model a given client may wish to adopt. This is intentional, for flexibility's sake. Consider it a form of future-proofing. I do have some ideas on a reasonable baseline method of managing public-key identities to get some of their benefits without needlessly compromising anonymity; this'll likely be the subject of my next post.

Explicit Connection-Ending

The Noise protocol specification suggests that an explicit way of indicating end-of-connection be included at the application level, to give applications a way of distinguishing between benign disconnects and attacker interference. This is a good idea which will be included in the v0.2 Theseus spec draft.


The intent is to implement Theseus in Python 3. It is therefore worth mentioning Python 3 does not yet have a library for Noise, though there is an ongoing project to implement one though bindings to the C reference implementation via ctypes. The library appears to be under active development, with Python 2.7 support available and Python 3 support in progress.

Protocol Fingerprinting & Info Leaks

One fantastic property of the Noise protocol is its resistance to fingerprinting. True to its name, virtually all protocol traffic is indistinguishable from random noise, although this varies somewhat depending on context: for instance, handshakes that lead with long-lived public identity keys provide identifiable data to a knowledgable adversary, and public keys either ephemeral or long-lived can be identified as such with reasonable confidence unless they are cleverly encoded using a scheme such as Elligator. These schemes tend to come with complications of their own, but that's a story for another time.

Note that even if message contents are indistinguishable from random noise, message sizes and traffic patterns still convey some data to the attacker. These are harder problems to solve.

Reducing the information conveyed by traffic patterns is a protocol design problem more than an encryption problem. The v0.1 protocol overview discussed this matter at some length, and the (relatively promising) ideas described therein are absolutely going to be carried forward in v0.2 and hopefully beyond.

The amount of information conveyed by message size can be reduced by padding our messages with random noise. This is discussed in brief under the Noise specification's section on "Application Responsibilities". One thing worth noting is that the Noise protocol specifies a maximum message size. This allows extremely security-focused users to pad all their messages to said size, thus eliminating all variation in message size. This size limit also means that messages larger than the size limit must be broken up. So even with an aggressive padding scheme, some size data will necessarily leak in the case of very large messages.

On this subject, the Noise framework specification says:
Length fields: Applications must handle any framing or additional length fields for Noise messages, considering that a Noise message may be up to 65535 bytes in length. If an explicit length field is needed, applications are recommended to add a 16-bit big-endian length field prior to each message.
65535 bytes is large enough that message fragmentation doesn't seem likely to cause problems in practice. It may come up in some situations e.g. when returning lots of search results or Bloom filters, but this is probably fine.

One last note: padding should use cryptographically secure random number generation in order to avoid identifiable patterns in the output (which would undercut the entire point of including padding in the first place). If this random number generation is delegated to the OS, the designer should take deliberate steps to avoid depleting system entropy. This a situation would likely only be encountered in extremely high-traffic edge cases, if ever, but it's still worth mentioning since entropy exhaustion can catastrophically undermine the security of ephemeral key generation. This issue varies so much by OS, though, that I think that's all I'll say on it for now.

Monday, May 8, 2017

Bloom Filter Parameters for Distributed Search

Theseus: A Robust System for Preserving and Sharing Research
Resisting Sybil Attacks in Distributed Hash Tables
Securely Finding Friends via DHT Dead Drops 
Distributed Search in Theseus
The State of Theseus, One Month In
Theseus Protocol v0.1 Overview 

Bloom filters are central to both the routing and content-indexing functions of Theseus's distributed search algorithm. A filter's false-positive probability increases monotonically as a function of the number of keys in the filter, as well as on two pre-set parameters: the filter size and the number of hash functions used. The expected number of bits set to 1 in the filter also increases monotonically as a function of n. Derivations of the equations governing these values can be found in section II.A of this paper, which also provides a treatment of the idea of compressing Bloom filters in certain situations, e.g. for network transfer or bulk storage. The question is asked of whether a benefit can be obtained from this compression, and the answer is that it of course depends on the sparsity of the filter.

The strategies suggested by that paper for parameterizing a Bloom filter with compression in mind involve significantly increasing the filter's uncompressed size in memory. This is a fairly unattractive cost.

"Correctly" optimizing a Bloom filter's parameters requires having a good estimate of the number of elements the filter will contain. If this value is expected to vary widely, as it likely would in Theseus, then such an optimization can't really be performed in any particularly rigorous way.

The cost of underestimating the number of elements is that your filter's false positive rate will be untenably high. The cost of overestimating is that you'll end up with a sparse, inefficient filter.

I observed in Resisting Sybil Attacks in Distributed Hash Tables that having honest peers operate multiple DHT nodes would help increase their resilience to horizontal Sybil attacks. Another benefit to operating multiple DHT nodes is that one could spread one's content library across them and have each node index only its own subset of the total library. So in the case of Theseus, if a peer has more data to share than they can fit in a single Bloom filter while maintaining an acceptable false-positive rate, they can simply partition the data to optimally-sized subsets & in this way sidestep the problem. Thus for Theseus the cost of underestimating the number of elements is actually very low.

The cost of overestimating the number of elements is also fairly low: sparse filters, as we just discussed, are amenable to compression. Thus we can compress them before transmission to keep from wasting bandwidth -- and if we're tight on working memory, we can compress them at rest as well. Making compression optional allows us to do all this without impacting performance for saturated filters.

So really, as long as our estimate isn't wildly off-base, the system should be able to gracefully handle outliers on either extreme. This means that there is in fact a wide range of parameterizations "good enough" for our purposes. How do we choose one over the others? Lacking a better motivator, we might as well make the choice based on convenience of implementation.

What's convenient? Well, having the filter size be a power of 2 would be convenient, because it'd make it easier to get indices from a cryptographic hash function. Having the 2's exponent be a multiple or 4 would also be quite nice, to make the hex constant for the bitmask more readable. Say 2¹⁶ bits, then. The next question is what we want the capacity of the filter to be -- or, equivalently, how many bits per element we want to allocate.

If we choose, say, 10 bits per element, our filter would cap out at 2¹⁶ / 10 ≈ 6554 elements. The optimal number of hash functions would be ln(2) * 10 = 6.93... ≈ 7, and the false positive rate at saturation would be (1-e^(-7/10))^7 = 0.0081... ≈ an 0.8% chance of a false positive.

6554 seems a little small, though -- especially considering that in all likelihood any given file in the system would have at least several unique filter elements associated with it. What if we try fewer bits per element?

Going down to 7 bits per element raises our cap to 2¹⁶ / 7 ≈ 9362 elements. ln(2) * 7 = 4.8... ≈ 5. The false positive chance increases to about 3.5%. These figures sound reasonable enough to me, so let's say our provisional Bloom filter parameters are 2¹⁶ bits and 5 hash functions.

As a reward for sticking with me through this short but dense post, have some pretty pictures. These show how various filter properties change as the filter fills up. Click the images to view them full-size. The code that generated them can be found on Github. Analysis is left as an exercise for the reader.

Thursday, April 20, 2017

Theseus Protocol v0.1 Overview

Theseus: A Robust System for Preserving and Sharing Research
Resisting Sybil Attacks in Distributed Hash Tables
Securely Finding Friends via DHT Dead Drops 
Distributed Search in Theseus
The State of Theseus, One Month In


Hope everyone's having a relaxing day. Things have been a bit quiet over here lately, as I've been busy sampling the boundless delights of graduate life. But fear not: progress on Theseus is still being made. I've been hard at work on the data storage formats described in the previous post's TODO list, and on the specifics of the Theseus protocol itself.

These elements of the design are all interrelated, so it seems appropriate to summarize and discuss them as a unit. That means we've got a lot to cover here, so we're going to break description and discussion into separate sections and make the description section deliberately exposition-light.

Theseus is growing more realizable by the day. Soon, interface design, not protocol details, will end up being the project's top priority. That'll be an exciting shift and I'm looking forward to finding people to collaborate with on it.

For now, though... Let's get technical.



The protocol is conceptualized as a set of bencoded RPC messages following the KRPC protocol format as described by the Mainline DHT implementation of Kademlia, with a custom set of RPCs. BEP-05 defines the KRPC format as follows:
A KRPC message is a single dictionary with two keys common to every message and additional keys depending on the type of message. Every message has a key "t" with a string value representing a transaction ID. This transaction ID is generated by the querying node and is echoed in the response, so responses may be correlated with multiple queries to the same node. The transaction ID should be encoded as a short string of binary numbers, typically 2 characters are enough as they cover 2^16 outstanding queries. The other key contained in every KRPC message is "y" with a single character value describing the type of message. The value of the "y" key is one of "q" for query, "r" for response, or "e" for error. 
Note that we follow the MLDHT KRPC protocol as regards message format, but not as regards message transport.

We define the following queries: find_node, get_peers, announce_peer, get_signatures, announce_signatures, get_raw, announce_raw, search_files, search_nodes, get_info, and the colorfully-named outlier socialist_millionaires_auth. Note the absence of MLDHT's ping query, which get_info renders redundant.

It is expected that from time to time additional queries will be added to this list in support of new features.


We deviate from the MLDHT specification in running the protocol over TCP rather than UDP and encapsulating all protocol messages within an encrypted channel established and maintained via a variant of the OTR protocol. The duration of TCP connections is left for individual clients to decide. It is noted that peers who communicate often or wish to authenticate to each other would likely see performance benefits from maintaining long-lived connections -- though it is also noted that connection duration is trivially observable by third parties, a fact which may be of significance to privacy-minded users.

Modern iterations of OTR includes provisions for authentication as part of the initial handshake. Authentication is not necessarily desired in our use case -- some nodes may wish to authenticate to each other, and others may not -- and so rather than handling authentication as part of the initial handshake, we break it out into a specialized, optional RPC message exchange which may be carried out at any point post-handshake.

Descriptions of RPC Query and Response Messages

Some of these queries are modeled on the BEP-05 queries, but the resemblance is limited. Most notably, the fact that we are discarding MLDHT's connectionless UDP transport in favor of persistent TCP connections means that we can relax MLDHT's requirement of including "id" keys in every query, since it's now straightforward to track identities by associating IDs with connections. Another difference is that per the anti-Sybil measures discussed in Resisting Sybil Attacks in Distributed Hash Tables, where node IDs do appear they must be accompanied by hash preimages.

It is expected that extensions to any of these queries will, more likely than not, take the form of extra dictionary keys. As such, any implementation of this protocol should be capable of gracefully handling (i.e. silently ignoring) the presence of unexpected keys in any of these messages, as long as the expected parts of the message are correct.


This is analogous to BEP-5's find_node query, but lacks an "id" key.

arguments: {"target" : "<id of target node>"}

response: {"nodes" : "<compact node info>"}


Also analogous to the BEP-5 definition. In addition to omitting the "id" key, the "token" key is omitted. This key is a security measure which is appropriate over UDP but unnecessary over properly encrypted TCP, as discussed below.

arguments: {"info_hash" : "<20-byte infohash of target data>"}

response: {"values" : ["<peer 1 info string>", "<peer 2 info string>"]}

or: {"nodes" : "<compact node info>"}


For this query, which is the last of those directly derived from BEP-05's MLDHT specification, we specify a new optional key, "sybil". "sybil" keys to an integer value of 1 or 0 depending on whether the sending node believes a vertical Sybil attack is taking place at the write address. If "sybil" is present and nonzero, the receiving node may attempt to verify the claim and subsequently increase its timeout for stored data.

We also omit "id" and "token", for the same reasons as above. Note that this leaves the response dictionary empty. This empty response should, of course, still be sent, to acknowledge query receipt.

arguments: {"implied_port": <0 or 1>, "info_hash" : "<20-byte infohash of target torrent>", "port" : <port number>, "sybil" : <0 or 1>}

response: {}


This is much like the get_peers query, but instead of returning a list of peers, it returns a list of 2-tuples containing various bytestrings and their signatures.

If the query is a miss, i.e. if the node does not have any stored signatures, it replies with its best peers, mimicking the response to a find_node query just like the alternate response for get_peers does.

The "key_fingerprint" argument is optional. If it is omitted, all signatures stored at the queried node are to be returned. This is perhaps unexpected. The motivation is that all signed data is meant to be essentially public-record between peers in the protocol, and in fact there are certain situations in which trawling for data is valuable -- for instance to collect cite_keyserver entries, as discussed below.

arguments: {"key_fingerprint" : "<20-byte fingerprint of signing key>"}

response: {signatures" : [["message_one", "signature_one"], ... , ["message_n", "signature_n"]]}

or: {nodes" : "<compact node info>"}


This is used to publish new signed data associated with a given key. The key's address in the DHT is determined by a 160-bit key signature. When a node receives these submissions, it should, before storing the signatures, check that they are valid. It will also want to check how close the key signature is to the node's own IP, to know how long to cache it. An optional "sybil" argument may be included, which operates just like it does in announce_peer.

Note that signatures need not be announced by someone holding the private key that generated them -- once they're in the network, anyone can re-publish them.

arguments: {"signing_key" : "<public key>", "signatures" : [["message_one", "signature_one"], ... , ["message_n", "signature_n"]], "sybil" : <0 or 1>}

response: {}


Similar in operation to get_peers or get_signatures, get_raw is used to retrieve raw binary data blobs stored at the DHT node (as used by systems like the one described in Securely Finding Friends via DHT Dead Drops). The data is returned as a bencoded list of binary strings.

arguments: {"address" : "<20-byte address>"}

response: {data" : ["<first bytestring>", ...]}

or: {nodes" : "<compact node info>"}


Used to store the raw binary data retrieved by get_raw. Like the other announce_* queries, announce_raw includes an optional "sybil" argument.

arguments: {"address" : "<20-byte address>", "data" : "<bytestring>", "sybil" : <0 or 1>}

response: {}


This query operates as described in Distributed Search in Theseus. It takes as an argument a search string. The remote peer checks its own local files for this string in any way it sees fit, and returns a list of file descriptions including, among other things, titles and magnet links.

If the remote peer has hits, it returns a (compressed) JSON list containing the full metadata for each matching file, along with a string specifying the compression algorithm used. "compression" : "none" is valid, and may even be appropriate when the result string is small enough for the overhead of compression to negate its benefits.

If the remote peer doesn't have any hits, its response should instead contain a "nodes" key suggesting other nodes to query. The list of nodes technically can be empty (though it probably shouldn't be).

The format of the query string is left unspecified, at least for now. The intention is that this could even just contain whatever the user has typed into a search box. The question of how to actually carry out the lookup is left for the remote peer to answer.

arguments: {"query_string" : "string containing terms to be searched for"}

response: {"compression" : "compression algorithm name", "data" : "<compressed bytestring for JSON-format file metadata>"}

or: {"nodes" : "<compact node info>"}


This also works as described in Distributed Search in Theseus. It has a "target" key which maps to a (compressed) Bloom filter bitmask.

A "compression" key is included, specifying the algorithm used to compress the bitmask. It may be set to "none" (though that is not recommended). The default algorithm is left undefined at this point -- expect a future post on this subject (and on related Bloom filter parameters).

arguments: {"compression" : "algorithm name", "target" : "<compressed bitmask bytestring>"}

response: {"nodes" : "<compact node info>"}


Used to ask a remote peer to describe themself to the querying node. The reply contains a dictionary encoding information such as the remote node's ID, their local content Bloom filter, a protocol version or details on specific features they do or don't support, and so on.

By default, all available data is returned. The querying peer may limit the data returned by including the optional "keys" argument in their query and providing a comprehensive list of keys desired. This prevents large data like Bloom filters from being transmitted unnecessarily. The querying peer may also report that its own info has changed (such as would happen when a node changes ID or when files are added to its cache) by including an optional "advertise" key.

Submitting the query with "keys" included and mapped to an empty list is allowed. The reply's "info" key should map to an empty dictionary.

Including "advertise" : {"id" : ["<querying node's id>", "<querying node's id preimage>"]} and "keys" : ["id"] allows two nodes to exchange ID information using only one query and response between them. This is functionally equivalent to MLDHT's ping query (hence ping's omission).

Note that the "values" associated with keys within the "info" dictionary may be arbitrary bencoded data, even though the example below only shows strings. It is perfectly fine to include a set of flags as a binary string, to include nested lists or dictionaries, etc.

A node may have as many info fields as it wants, but it should at the very minimum provide these: {"bloom" : "<search keyword bloom filter>", "id" : ["<node's id>", "<id hash preimage>"], "max_version" : "protocol version string"}

arguments: {"advertise" : {"sender_key_one" : "sender_value_one", ...}, "keys" : ["key_one", "key_two", ..., "key_n"]}

response: {"info" : {"key_one" : "value_one", "key_two" : "value_two", ... , "key_n" : "value_n"}}


Used when two nodes wish to authenticate to one another and ensure that an active MitM attack is not taking place on their communications channel. Motivation discussed below.

All significant data for the protocol is contained within the "data" dictionary to make it simpler to sign it all at once. This dictionary has three keys: "token", which keys to a persistent random token identifying the particular SMP exchange taking place -- necessary because this is a multi-message protocol; "step", which keys to an integer identifying the step in the protocol performed by the current message; and "values", which keys to a dictionary listing the actual information being sent in the course of performing the protocol.

Further details are omitted for now; we'll just say this is an adaptation of OTR's SMP protocol and leave it at that for the time being. We'll deal with this in depth in the forthcoming blog post on message encryption.

arguments: {"data" : {"token" : "<large bytestring>", "step" : <int>, "values" : { ... }}, "signature" : "<signature of dictionary keyed by 'data'>"}

response: {}

Error Codes

The BEP-05 specification of KRPC, lists four error codes. They are:

201: "Generic Error"

202: "Server Error"

203: "Protocol Error (i.e. malformed packet, or bad query parameters)"

204: "Method Unknown"

We'll define one more error code on top of these:

211: "Rate-limiting active"

It seems unlikely that any non-malicious node would generate enough traffic to trigger this error, but we'll include it just in case so such nodes can figure out why they've stopped getting responses. I'm numbering this code 21x instead of 20x in an optimistic attempt at avoiding collisions with any other KRPC extensions which might exist. BEP-05 gives no motivation for the numbering of the original error codes, so this is just sort of a best guess at how to properly extend things.

Types of Signed Data

There are a few different reasons for people to store signed data in the DHT. You might be stating your trust in a given public key, or identifying a magnet link for a metadata file summarizing the torrents you've tagged, or sharing a Diffie-Hellman public key integer, or even submitting a signed revocation notice invalidating the content of a previous signed message.

Consistency is important in all these for obvious reasons. Here are formats these signed data could use. Like the queries described above, it's possible that future versions of the protocol may make modifications to some of these data structures, and so implementations should be able to gracefully handle unexpectedly long lists or seemingly extraneous keys appearing in these data structures. Future extensions to the protocol should attempt to limit their changes to those two categories whenever possible, for the sake of preserving backward-compatibility.

Each message-plus-signature datum should be a bencoded data structure of the following format:

[["Theseus Protocol Signed System Data", {"arguments" : { ... }, "expires" : unix_timestamp, "type" : "name"}], signature : "<digital signature of previous list entries>"]

This consists of two nested 2-tuples. The outer 2-tuple contains the inner 2-tuple, then that 2-tuple's signature.

The first list entry within the inner 2-tuple is fixed and serves no purpose other than to tag the signed data as internal to the Theseus protocol, greatly reducing the signature's value for e.g. social engineering attacks (or really anything except its intended use).

The second entry within the inner 2-tuple provides details of the data. Its "expires" key is optional and, if included, should key to an integer representing a Unix timestamp past which nodes should refuse to store this information. An expiration date, essentially. The "type" key defines what this datum is intended to communicate, and "arguments" maps to a dictionary containing any extra required information.

As far as types of data are concerned, we define the following type names: endorse_key, endorse_metadata, endorse_dh, revoke_signature, cite_keyserver.


This is used to signify that the signer trusts the signed key. One purpose for this data is to create an ad-hoc, semi-connected web of trust with several uses, as discussed in Theseus's introductory post.

The "identity" key is optional, and may be used to associate e.g. an email address with the target_key.

The "relation" key is used to indicate the relation between the signed and signing keys. Users who wish to compartmentalize key use may specify "relation" : "self" to indicate that the signed keys should be trusted equally to the signing key. Keying to "peer" is the most common expected use case for establishing web-of-trust relationships. Additional values for "relation" may be defined in future.

arguments: {"identity" : "<optional additional identifying info>", "relation" : "self|peer", "target_key" : "<public key to endorse>", "type" : "name of public key format"}


This is used to specify a magnet link corresponding to a torrent for a Theseus torrent metadata file. Theseus nodes build their content indices from files distributed in this way.

arguments: {"magnet" : "<metadata file magnet link>"}


This is used to provide a Diffie-Hellman public key for use in secure friend-finding. A good client will want to change this value somewhat often, so it probably makes sense to have these entries expire pretty quickly (maybe after six hours, say).

arguments: {"dh_key" : <Diffie-Hellman public key integer>}


This is used to unconditionally countermand any previously signed datum or data. Revocations may be issued individually or in batches. A list of cryptographic hashes is provided. These should be the full hashes of whatever records are being invalidated. The hash function used should also be specified. For the forseeable future SHA512 should be exclusively used, but this argument is still included for the sake of future-proofing.

Note that including an expiration date in one's original published data is more reliable than planning on issuing revoke_signature at a future point. This functionality is only really intended for things like un-endorsing a compromised key or replacing an obsolete version of a metadata file -- things where it's just about impossible to know when you'll want to do them.

arguments: {"data_hashes" : ["<hash of signed data to revoke>", "<second data hash>", ...], "hash_function" : "SHA512"}


This is used to announce that a remote key server is willing to vouch for the identity of the given key. It includes the identity used to query for the key and the server to query. The motivation for this data type's inclusion is discussed at length below. Note that we don't need to include the key itself, since we can retrieve a copy of it from the remote server and use that copy to check the signature.

Nodes should feel encouraged to actually query the keyserver and validate the signature against the returned key before storing it. However, nodes may opt not to do so if they are resource-strapped or if they don't want to expose themselves by generating the required network traffic. Verifications, if performed, should be rate-limited, to prevent naive nodes from being compelled by malicious nodes to flood keyservers with traffic.

arguments: {"identity" : "<keyserver query string>", "key" : "<public key>", "server" : "<server URL>"}

Torrent Metadata File Format

The format for metadata files is very simple. It should follow JSON format. The top-level data object should be a dictionary with two keys: "version" and "data". "version" should map to a version string used to tell clients how to read the file. "data" should map to a list of files, each of which is represented by a dictionary containing key/value pairs detailing the file. Required keys are "title", "authors", and "magnet"; recommended keys include "subject""tags", and "abstract". An example with bare-bones metadata on one file:

    "version" : "1",
    "data" : [
            "title" : "A Field Study of Digital Forensics of Intrusions in the Electrical Power Grid",
            "authors" : ["Eli Sohl", "Curtis Fielding", "Tyler Hanlon", "Julian Rrushi", "Hassan Farhangi", "Clay Howey", "Kelly Carmichael", "Joey Dabell"],
            "magnet" : "<magnet link>",
            "subject" : "computer science",
            "tags" : ["computer science", "security", "power grid", "forensics"],
            "abstract" : "text of abstract goes here (omitted for brevity in this example)"

The Bloom filter for a node storing this file should include the first and last names of each author, as well as the following words: field, study, digital, forensics, intrusions, electrical, power, grid, computer, science, security.

One minor detail concerning the Bloom filters: any worthwhile hash function is going to be case-sensitive, so we might as well specify that all entries be converted to lowercase before hashing them into the Bloom filter to avoid e.g. a filter containing "Forensics" failing to register a hit in a search for "forensics".

Description Section Revision Log

4/20/2017: Published.
4/21/2017: Clarifications to metadata format description; structure slightly changed to support addition of "version".



Perhaps the most discussion-worthy aspect of the design is the decision to operate over an encrypted TCP channel, rather than over unencrypted UDP packets as e.g. MLDHT and parts of Tribler operate. We'll save a detailed discussion of this topic for the future post in which the specifics of the channel are described. For now, a few brief comments:

A primary goal is to make the protocol as hard to fingerprint as possible. Hence all messages being exchanged in an encrypted channel (which on the wire appears to be fully random). If all an observer sees is a Diffie-Hellman key exchange followed by a large amount of seemingly random noise, clearly something's going on, but it's nontrivial to figure out what. It could be Theseus -- or it could be anything else that encrypts all its communications. This has two advantages: first, nobody can tell for sure that you're using Theseus from traffic analysis; second, any signature-based firewall rules written to block Theseus traffic would necessarily also block traffic from many other things, which is likely a prohibitive level of collateral damage. Making Theseus hard to fingerprint makes it hard to censor without also accidentally censoring a whole lot of other things you didn't want to break.

Note, as an addendum, that of course users of any peer-to-peer regularly advertise their contact information to other peers -- meaning, of course, that if anyone wants to find out whether you're using Theseus, they can just maintain a presence on the network and see if your IP address ever shows up in e.g. peer lookups. There are ways around this, for instance if some peers are willing to act as proxies for others, which is a topic I'll be addressing in a future blog post on preserving user privacy. But it's important to recognize that these can only ever protect a proper subset of the peers on the network. So making the protocol hard to fingerprint does not prevent its users from ever being identified. What it does do is increase the effort required from a passive attack to an active attack -- and a sustained, involved, and only partly reliable one at that -- which is a significant win for us.

Another goal is to make it as hard as possible to interfere with normal operation of the protocol. As a case study, MLDHT has to go through some contortions to pull this off: for instance, the token management between "get_peers" and "announce_peer". "get_peers" queries return, along with their data, a one-time token which must be included in a subsequent announce_peer query. The purpose of this token is "to prevent malicious hosts from signing up other hosts for torrents."

You might not think this would be a problem in the first place, since the recipient of an announce_peer query just signs up the sending peer for the torrent -- but source addresses of UDP packets can be spoofed, and so a malicious peer could try to sign up others by sending spoofed queries. The token solves this by introducing state of which the malicious peer is (presumably) not aware. A different solution is to just encrypt the communications channel with a secret known only by the participants. An outside observer who tries to inject queries into this channel would fail: since they don't know the encryption key, their messages would decrypt as gibberish. This simplifies the protocol design process, since it greatly weakens the assumptions we must make about the powers available to our attackers.

Some explanation should probably also be given for the choice of OTR (or a lightly tweaked version thereof) for the encryption. Why isn't it enough to just agree on a key and use it for the duration of the communication?

The answer is that using a single long-lived key is probably enough -- but that I don't like dealing in "probably"s.

A little context: OTR was originally developed to protect social communication, which the original authors saw as having different and more stringent privacy requirements than communications between automated systems. I agree with their assessment of the privacy requirements for social communications. I think it's interesting, though, that people never seem to even consider extending these requirements to the automatic communications made by programs at our behest. Wouldn't it make sense to expect the same privacy properties of, say, a question asked to a friend and a question asked to a search engine?

Quoting from the original paper on OTR, "Off-the-Record Communication, or, Why Not To Use PGP":
The notion of an off-the-record conversation well-captures the semantics one intuitively wants from private communication: only the two parties involved are privy to the contents of the conversation; after the conversation is over, no one (not even the parties involved) can produce a transcript; and although the participants are assured of each other’s identities, neither they nor anyone else can prove this information to a third party.
We absolutely want these properties for private conversations -- but really, they all seem like they'd be nice to have in any communications which aren't explicitly intended to be part of a public record. It's less that we'd necessarily miss these properties' absence, and more that their presence can hardly hurt. Forward secrecy in particular is a very appealing property in that it somewhat lessens the impact of key compromise.

We should also say a few words about the decision to break authentication out into an RPC exchange. The reason authentication was introduced into OTR in the first place was to allow the detection of man-in-the-middle attacks of the sort to which vanilla Diffie-Hellman is vulnerable. This is only possible when the transacting nodes know some information about each other a priori -- typically a public key -- or else possess a shared secret established over a separate, trusted channel. This works well enough in social contexts, but in this peer-to-peer context, where nodes don't necessarily know anything about each other and in all likelihood have been introduced to each other by other nodes to whom they are otherwise total strangers, we can't make such assumptions about key distribution. Not everyone will want to authenticate -- indeed, not everyone will possess a key their remote peer would trust enough to accept authentication by in the first place.

As such, the unfortunate truth is that traditional defenses against man-in-the-middle attacks cannot be applied in this context. It is however noted that nine times out of ten, it makes more sense in terms of cost vs benefit for an active attacker to simply run malicious nodes in the network rather than MitMing the communications of existing nodes which don't trust each other to begin with.

That said, some users do possess keys with which they can mutually authenticate. For these users we have the Socialist Millionaires authentication queries, in which the peers carry out the Socialist Millionaires protocol to verify that they possess the same shared encryption secret and sign their messages to prevent tampering in-transit. This allows the minority of peer connections for which authentication is desired to carry out this authentication securely.

Note that schemes like secure friend-finding securely associate a node's contact info with its public key, meaning that if that node subsequently refuses to authenticate using this public key (which it previously, securely attested to owning), then this refusal itself may be taken as an indication that communications are being tampered with by an active MitM attacker.

One piece of popular wisdom among the security-minded is, "don't roll your own crypto". This is very good advice and some might say I'm flirting with breaking it here by making protocol tweaks. All I can say is that whenever I deviate from standard practice, the decision is very carefully considered. As I'm sure you've gathered by now, I want this thing to be secure, and that means placing as little trust as possible in the work of any single person, including myself -- hence my extensive documentation of this design process. If I'm taking risks, I want to be exceedingly clear about what they are and why I consider them to be justified. Consider this full transparency and full disclosure rolled into one.

A corollary from the world of competition math comes to mind. When you take the Putnam exam, they explicitly warn you against submitting, quote, "worthless material". I've always appreciated the bluntness of this dictum and tried to apply it to my life more generally. This project is no exception.

As I'd hope you've gathered by now, I am also completely open to feedback. If you're skeptical about any part of the system, I'd be happy to discuss it with you further. You can get in touch with me in the comments, on Twitter, by email, through Signal, or through any combination thereof.

RPC Queries

The design of many of these is based off of Kademlia and MLDHT. BEP-05 does a good job of discussing the design considerations at work there.

As mentioned above, we deviate somewhat from this protocol by not sending IDs in every message. Including that key simply isn't necessary when we have a connection-based model, because we can track things like IDs as persistent state associated with the connection. Same with the "token" key in the MLDHT version of announce_peer: it's necessary there but not here, since we're not vulnerable to the same type trivial message spoofing that MLDHT is.

We use the DHT to track multiple types of data. All of them are associated with (hash-generated, and thus evenly distributed) addresses in the DHT, and get_peers is used to traverse the network regardless of what data we're interested in retrieving. For each type of data, we define a get_* query and an announce_* query to store and retrieve it respectively.

Note that announce_peer is very resistant to DoS attacks that flood a peer with data to store, because any given IP address can only sign itself up once, meaning you'd need a huge number of IP addresses at your disposal if you wanted to overwhelm a node with bad data. Unfortunately, the same is not true for announce_signature and announce_raw. Any single peer could generate and submit a colossal amount of raw data or signature data.

To deal with this, an important design goal for system components using these features is to make sure that as little actual data as reasonably possible is put in the DHT. If we can stick to this guiding principle, we can safely assume that anyone submitting large amounts of data is doing so maliciously and can be automatically blacklisted.

A little more on get_info: if we're on TCP then it just makes sense, really, to distribute peer metadata in its own message rather than just packing it into every single message like MLDHT does with its "id" argument. We can track this metadata easily enough if we just request it once and, as soon as we get it, associate it with the connection.

Having an explicit get_info query allows us to efficiently share more data than just peer IDs. The search subsystem uses it to track Bloom filters, for instance, and we include an indicator of the maximum protocol version supported by a node to make sure nodes can figure out how to talk to each other. It is expected that various extensions to the protocol might add keys to this dictionary; if the dictionary starts getting large enough, it might make sense to start breaking it up into sub-dictionaries indexed by feature name to give these different features their own "namespaces". At this point, though, such measures would probably be overkill.

Signed Data Formats

The hardest thing about this part of the design is trying to come up with something to call these. They're essentially statements of various kinds, endorsed by the holder of a specific key. But calling them "statements" lacks kick, calling them "declarations" is tacky, calling them "proclamations" makes it sound like we're not sure what century we're in... and so on. For now we'll just call them "data", "signed data", etc. Better ideas welcome.

As for the actual data types, a few of them deserve individual discussion.


As said above,
This is used to signify that the signer trusts the signed key. One purpose for this data is to create an ad-hoc, semi-connected web of trust with several uses, as discussed in Theseus's introductory post
In addition to endorsing others' public keys, users who would rather use a dedicated public key for their Theseus operations (which would be understandable from the point of view of compartmentalization, and because this key may generate many signatures over its lifespan) may use endorse_key to indicate and, well, endorse that dedicated key.

Some might wonder why we don't follow a pre-established standard format for key signatures. If we did this, people could import (say) their extensive lists of PGP key signatures that they've spent years curating, and have the associated web of trust data ready to go. But is this actually what we want? The social function of key-signing is endorsement of identity -- vouching that a certain person owns a certain key. The function of key-signing in Theseus is to recommend that anyone who wants to help store content from the signing key should also store content from the signed key -- so that, for instance, one researcher trusted by a large part of the network could endorse their academic peers and thus encourage the system to start storing these peers' documents as well. You don't necessarily want to make those endorsements for everyone whose public key you've ever signed.


The main noteworthy features of this is the ability to revoke signatures individually or in batches, and the use of cryptographic hashes. Using hashes ensures that we can countermand a previously issued signature without specifying what was originally signed. A signature previously issued in error might be embarrassing to the issuer, and so having a revocation scheme which avoids explicitly recording the mistake for all time is an attractive feature. Recording revocations in batches also lowers the amount of memory required of a DHT node to store the data.


This data type helps us address a difficult problem which until now I've been sort of skirting around. This is the issue of where we're getting trusted keys from in the first place.

We've talked at length about what to do with these keys once we've got them, but where do they come from? Well, people could input them manually -- which would be a huge pain. We could also get them from some kind of trusted source. We don't want to rely on a single such source, because this would undercut the principle of decentralization around which Theseus is designed. But allowing arbitrary sources to be cited and leaving it up to individual nodes to decide which sources to trust (perhaps with the assistance of a suggested list) seems like a more redundant, and thus more reliable, model.

I might discuss this in more depth in future. For now, a quick one-paragraph sketch: Peers could choose to trust (say) the key for any .edu email address registered in the PGP Global Directory, or any identity registered with a specific dedicated key server hosted by trusted Theseus users, or so on. Most key servers don't make it easy to scrape lists of their users, though, and even if we did have such a list, most of the users on it would have nothing to do with Theseus. cite_keyserver helps us dodge this issue: people with keys on a trustworthy key server can create cite_keyserver records, and people looking to build a list of trusted keys can periodically query a random, roughly evenly distributed set of DHT nodes to look for cite_keyserver records in the results (it might even make sense to combine this operation with the periodic random peer ID lookups used for network size estimation). These records can then be verified to get a working list of trusted keys.

Metadata Files

Since these files are meant to be human-readable and the data in them might have to be parsed by a web interface, JSON is preferred over bencode for these files. Since JSON is a fully plaintext file format and these records contain a number of repeated elements, it can be expected that they will compress quite well. As such, we compress them before sending them over the wire. The hope is that not only will this result in a more efficient use of bandwidth but it may also increase traffic opacity after encryption. Speaking of which...

Traffic Analysis

As observed several times above, many of the protocol's features bear strong resemblances to each other; for instance, call signatures are very similar across the get_* and announce_* subsystems, and peer traversal by search_nodes is very similar to traversal by get_peers. This is true not only for the (approximate) sizes of the messages but for the specific patterns of connections and messages established for each. The idea is that making these different operations follow similar traffic patterns makes passive analysis of protocol traffic even less rewarding. Messages which are close but not quite equal in length could even be padded pre-encryption, ensuring uniform message size on the wire and make things even more opaque.

All of this serves no real purpose aside from making life more difficult for passive observers at the internet-backbone level -- but that's something that I would argue is, in and of itself, an important goal. If the last couple decades have taught us anything, it's that ISPs and government agencies cannot be trusted to respect the privacy of the citizens they serve. If they refuse to stop surveilling us, then we have no choice but to try and make their jobs as hard as possible. All of this is self-defense.

Wednesday, March 22, 2017

The State of Theseus, One Month In

For just over a month now I've been working nonstop on one part after another of my current pet project, Theseus. Things are coming along very well so far -- in fact, I've surprised myself with the amount of progress I've been able to make in such a short time.

That said, I have yet to write a single line of code. This is because I'm trying to nail down a complete specification of the system first. The system has many moving parts, some of which possess complex interdependencies, and it's important to try to get this stuff right the first time.

The goal of this post is to take inventory of what's been done so far and to list out everything that's still needed in order to reach a point where work on an implementation can begin.

If you'd like to collaborate on Theseus, I'm hoping this post will give you an idea of where your contributions would be particularly valuable. If you're realizing at this point that you're not even sure what Theseus is, I've got you covered: The beginning of my introductory post, Theseus: A Robust System for Preserving and Sharing Research, can take care of that. I'll be directly referencing that post throughout the discussion below.

Progress on the Promises

In that original blog post, I laid out five major promises that I believed Theseus could provide. It would not be an exaggeration to say that these promises have guided the entire design of the system, and I remain fully confident in its ability to deliver on every single one of them. Some of them, I've already described solutions for. Others, I've found solutions but have yet to write them up. There are also some outstanding questions that need to be answered, either through hard work or consultation with subject experts.

The promises are laid out and described in the original post. Below I'll list them in order and discuss the progress made on them thus far.

1. A Guarantee of Robustness in the Face of Censorship

The system is useless unless it is robust under adversarial conditions. The unfortunate truth is that both private publishers and the current administration of my own country could conceivably take issue with this project's goals. As such, it is only prudent to do everything in my power to make sure the project can withstand their utmost efforts against it.

I lack the subject knowledge to understand the legal channels through which the project might be attacked. My suspicion, as I've mentioned elsewhere, is that anyone trying to pursue such an avenue of attack would be facing an uphill battle, since all I'm creating is a tool through which researchers may share their work without fear of censorship. Hopefully we'll be able to avoid putting this suspicion of mine to the test.

Setting aside such concerns (at least for the time being, until I can get in touch with someone having the appropriate legal background), we move on to technical concerns. These, I am more confident with. The question here is whether technical attacks exist which could prevent Theseus from functioning properly, effectively censoring the system by either rendering it unusable or rendering certain of its content inaccessible.

Denial-of-service attacks are obviously possible against individual nodes. However, the more nodes we have in the network, the less viable these attacks become. Thus this threat can be mitigated simply by making Theseus useful enough to see wide adoption.

At the heart of Theseus is a distributed hash table (DHT). This DHT has a number of uses in the system, some of which I have yet to write about in detail. Attacks on this DHT would have serious consequences. The most potent class of attacks on DHTs are known as Sybil attacks, which in fact are the bane of virtually all peer-to-peer systems. In Resisting Sybil Attacks in Distributed Hash Tables, I describe several countermeasures to these attacks which, to the best of my knowledge, constitute original research. Through these, I expect Theseus will be able to achieve virtually unprecedented resilience in the face of Sybil attacks.

Another critical component of Theseus is its distributed search functionality. This functionality is described in Distributed Search in Theseus. It relies on information returned by peers to get its results; as such, it obviously assumes access to honest peers. Dishonest peers can always return bad search results which would be nontrivial to identify as bad. However, they cannot prevent honest peers from returning honest results. The question of distinguishing honest and dishonest results is a very difficult one. Some methods that might help include: lending more weight to results returned by many nodes, using basic spam-filtering techniques locally, carrying out DHT lookups for returned links and prioritizing links with large numbers of available peers, prioritizing nodes possessing trusted public keys when carrying out the search, and so on.

The challenge here is to try to minimize the impact of these measures on search speed. It might be appropriate to allow the user to specify how paranoid their client should be, so that the more costly of these measures end up only being deployed as reactive countermeasures when a user starts noticing that they're getting lots of bogus search results.

Interesting future work in this vein could include: Investigating the viability of the countermeasures listed above, trying to come up with new countermeasures to add to the list, and investigating the various ideas' implications where latency is concerned.

The only other aspect of Theseus where censorship is applicable (as far as I can see) is where its social features are concerned. These have yet to be fully fleshed out, but anonymity and deniability are top priorities in their design. The only work I've released so far on this topic is Securely Finding Friends via DHT Dead Drops.

2. Guarantees of Anonymity for Users Who Need Them

The Theseus control protocol -- which handles everything except file downloads -- is intended to be low-bandwidth and to operate over TCP. As such, it could potentially be routed through a proxy, a VPN, or even over the Tor network. As mentioned in my original post, routing large downloads through Tor is rightly frowned upon; however, using it for lower-bandwidth activity like this is simply using the network as intended.

For some users, these options might not be enough. Maybe their threat model precludes simple solutions like VPNs (there are situations where this is entirely reasonable). Maybe they live in a country where access to Tor is restricted. For these users, I'm currently considering including in the Theseus spec a poor-hacker's equivalent to the Tor network, much like what Tribler uses for anonymizing downloads. More on that in a future post.

That takes care of anonymity at the control layer. At the download layer, here is what I'm thinking.

Users who trust a VPN can, of course, use that VPN to anonymize their UDP download traffic.

Regardless of other factors, the Theseus client should default to forcing encryption on all downloads / uploads. It's probably wise to include an option to disable mandatory encryption, for users who know what they're doing. It's probably also wise to discourage this by tucking the option deep in a config menu.

Encrypting traffic should go a long way towards thwarting ISP-level surveillance, but the person you're downloading from still knows what you're downloading. Thus, malicious peers still can get more insight into network behavior than we would perhaps like them to. As such it might be prudent to implement functionality similar or identical to the Tribler download-anonymizing spec linked above. I doubt it would be possible to make Theseus and Tribler clients interoperable -- but if it is possible, it might be a good idea. This is something that needs to be researched more thoroughly. Getting a clear idea of how this would work, if it would work, is an absolutely critical prerequisite to nailing down a draft spec for Theseus.

3. The Ability to Quickly Search for Papers on Any Topic

As mentioned above, Distributed Search in Theseus deals with this exact topic and gives a detailed description of the solution I have in mind. Next steps for realizing this solution are also described therein. Practically all that's left is to decide on implementation details.

4. The Ability to Securely Share Your Own Research

I have a full, detailed idea of how this will work. It is described in some detail in the original post on Theseus. Expect more in an upcoming post -- in fact, writing that post might be my next priority.

5. The (Potential) Ability to Sidestep Thorny Copyright Issues

As stated in the original post:
I am not a lawyer. Nothing in this blog post is legal advice, because legal advice is something I am not qualified to give. However, from my lay understanding I am optimistic about this tool's potential for circumventing restrictive laws and promoting freedom of information in the sense advocated by greats like Aaron Swartz.
Decentralized services are very difficult to bring down, especially if they have nodes in multiple countries. Skeptical? Just ask the music industry.
These quotes continue to accurately reflect my views on this topic. If anyone with a working knowledge of modern copyright law would be interested in exploring this topic further, please get in touch.

The Big To-Do List

In the discussions above, I outlined some things that still need to be done. Here is a more comprehensive list of what I see as standing between what we have and what we need to start implementing (roughly ordered to place prerequisites ahead of their dependencies):

  • Brainstorm front-end ideas
    • There won't be a blog post on this until I have enough material to include some cool illustrations.
    • The priorities for the front-end are to make search intuitive and collaboration as painless as possible.
    • The goal is to make Theseus useful to as many people as possible, so outside input would be invaluable here.
      • What features would be useful to you?
      • What are some things that existing research tools get wrong?
      • Would you be interested in collaborating on the project's interface design?
  • Work out a universal format for storing data in the DHT
    • We're using it as a workhorse for several different aspects of the design
    • It needs to be flexible and support a variety of types of data, including...
      • For metadata collection:
        • Magnet links to files containing large amounts of document metadata
        • Cryptographic signatures for the above magnet links
        • (potentially:) The public key used to generate said signatures
      • For friend-finding:
        • Raw encrypted data which arrives sans context
        • Diffie-Hellman public-key integer
        • Cryptographic signature of above integer, possibly w/ public key
      • For actual torrenting:
        • Tracking lists of peers (some of which might actually be anonymous introduction points a la Tribler anonymous seeding)
        • [this is meant to mimic but operate disjointly from Mainline DHT's functionality, providing torrent-finding functionality to peers who only trust anonymized communications with other Theseus nodes]
    • The question of whether or not to include public keys in the records is a tricky one which depends on several factors. The question is subtler than it may initially appear. I'm currently leaning towards including them.
    • One major priority in designing this format is to try and keep it as flexible as possible to encourage forward- and backward-compatibility, since it is possible that the DHT could end up shared between clients running multiple versions of the eventual Theseus protocol.
    • I've got a lot on this topic in my notebooks, and I'll probably be writing a post about it soon.
  • Work out a universal format for metadata files
    • This involves deciding what kinds of data we want to track. A non-exhaustive list of ideas, not all of which necessarily merit inclusion:
      • Title
      • Author(s)
      • Author-specified tag list
      • User-specified tag list
      • Year of publication
      • Citations in paper (potentially w/ cross-references for other Theseus-accessible papers)
      • Citations of paper (potentially w/ same)
      • User-provided annotations
      • etc...
    • The considerations outlined above concerning backward- and forward-compatibility apply here as well.
  • Work out an appropriate set of parameters for the search feature's Bloom filters
    • There is potential here for interesting math, pretty graphs, etc...
  • Work out in more detail the protocol's features concerning anonymity, as discussed in brief above, and potentially write a post about these as well if there turns out to be enough to say.
    • Do we like Tribler's approach to anonymizing downloads? If so, can we plug into their system? Would they be OK with that? Would we even want to? Would we be better-off just using their protocol across our own separate overlay graph?
  • Nail down a rough draft of the Theseus control protocol (& potentially of the UDP anonymization system as well, if applicable)
  • Start implementing and testing!

This might look like a long list, but it's a hell of a lot shorter than it would've been a month ago. I'm looking forward to seeing what it looks like a month from now!