Thursday, April 20, 2017

Theseus Protocol v0.1 Overview

Theseus: A Robust System for Preserving and Sharing Research
Resisting Sybil Attacks in Distributed Hash Tables
Securely Finding Friends via DHT Dead Drops 
Distributed Search in Theseus
The State of Theseus, One Month In


Hope everyone's having a relaxing day. Things have been a bit quiet over here lately, as I've been busy sampling the boundless delights of graduate life. But fear not: progress on Theseus is still being made. I've been hard at work on the data storage formats described in the previous post's TODO list, and on the specifics of the Theseus protocol itself.

These elements of the design are all interrelated, so it seems appropriate to summarize and discuss them as a unit. That means we've got a lot to cover here, so we're going to break description and discussion into separate sections and make the description section deliberately exposition-light.

Theseus is growing more realizable by the day. Soon, interface design, not protocol details, will end up being the project's top priority. That'll be an exciting shift and I'm looking forward to finding people to collaborate with on it.

For now, though... Let's get technical.



The protocol is conceptualized as a set of bencoded RPC messages following the KRPC protocol format as described by the Mainline DHT implementation of Kademlia, with a custom set of RPCs. BEP-05 defines the KRPC format as follows:
A KRPC message is a single dictionary with two keys common to every message and additional keys depending on the type of message. Every message has a key "t" with a string value representing a transaction ID. This transaction ID is generated by the querying node and is echoed in the response, so responses may be correlated with multiple queries to the same node. The transaction ID should be encoded as a short string of binary numbers, typically 2 characters are enough as they cover 2^16 outstanding queries. The other key contained in every KRPC message is "y" with a single character value describing the type of message. The value of the "y" key is one of "q" for query, "r" for response, or "e" for error. 
Note that we follow the MLDHT KRPC protocol as regards message format, but not as regards message transport.

We define the following queries: find_node, get_peers, announce_peer, get_signatures, announce_signatures, get_raw, announce_raw, search_files, search_nodes, get_info, and the colorfully-named outlier socialist_millionaires_auth. Note the absence of MLDHT's ping query, which get_info renders redundant.

It is expected that from time to time additional queries will be added to this list in support of new features.


We deviate from the MLDHT specification in running the protocol over TCP rather than UDP and encapsulating all protocol messages within an encrypted channel established and maintained via a variant of the OTR protocol. The duration of TCP connections is left for individual clients to decide. It is noted that peers who communicate often or wish to authenticate to each other would likely see performance benefits from maintaining long-lived connections -- though it is also noted that connection duration is trivially observable by third parties, a fact which may be of significance to privacy-minded users.

Modern iterations of OTR includes provisions for authentication as part of the initial handshake. Authentication is not necessarily desired in our use case -- some nodes may wish to authenticate to each other, and others may not -- and so rather than handling authentication as part of the initial handshake, we break it out into a specialized, optional RPC message exchange which may be carried out at any point post-handshake.

Descriptions of RPC Query and Response Messages

Some of these queries are modeled on the BEP-05 queries, but the resemblance is limited. Most notably, the fact that we are discarding MLDHT's connectionless UDP transport in favor of persistent TCP connections means that we can relax MLDHT's requirement of including "id" keys in every query, since it's now straightforward to track identities by associating IDs with connections. Another difference is that per the anti-Sybil measures discussed in Resisting Sybil Attacks in Distributed Hash Tables, where node IDs do appear they must be accompanied by hash preimages.

It is expected that extensions to any of these queries will, more likely than not, take the form of extra dictionary keys. As such, any implementation of this protocol should be capable of gracefully handling (i.e. silently ignoring) the presence of unexpected keys in any of these messages, as long as the expected parts of the message are correct.


This is analogous to BEP-5's find_node query, but lacks an "id" key.

arguments: {"target" : "<id of target node>"}

response: {"nodes" : "<compact node info>"}


Also analogous to the BEP-5 definition. In addition to omitting the "id" key, the "token" key is omitted. This key is a security measure which is appropriate over UDP but unnecessary over properly encrypted TCP, as discussed below.

arguments: {"info_hash" : "<20-byte infohash of target data>"}

response: {"values" : ["<peer 1 info string>", "<peer 2 info string>"]}

or: {"nodes" : "<compact node info>"}


For this query, which is the last of those directly derived from BEP-05's MLDHT specification, we specify a new optional key, "sybil". "sybil" keys to an integer value of 1 or 0 depending on whether the sending node believes a vertical Sybil attack is taking place at the write address. If "sybil" is present and nonzero, the receiving node may attempt to verify the claim and subsequently increase its timeout for stored data.

We also omit "id" and "token", for the same reasons as above. Note that this leaves the response dictionary empty. This empty response should, of course, still be sent, to acknowledge query receipt.

arguments: {"implied_port": <0 or 1>, "info_hash" : "<20-byte infohash of target torrent>", "port" : <port number>, "sybil" : <0 or 1>}

response: {}


This is much like the get_peers query, but instead of returning a list of peers, it returns a list of 2-tuples containing various bytestrings and their signatures.

If the query is a miss, i.e. if the node does not have any stored signatures, it replies with its best peers, mimicking the response to a find_node query just like the alternate response for get_peers does.

The "key_fingerprint" argument is optional. If it is omitted, all signatures stored at the queried node are to be returned. This is perhaps unexpected. The motivation is that all signed data is meant to be essentially public-record between peers in the protocol, and in fact there are certain situations in which trawling for data is valuable -- for instance to collect cite_keyserver entries, as discussed below.

arguments: {"key_fingerprint" : "<20-byte fingerprint of signing key>"}

response: {signatures" : [["message_one", "signature_one"], ... , ["message_n", "signature_n"]]}

or: {nodes" : "<compact node info>"}


This is used to publish new signed data associated with a given key. The key's address in the DHT is determined by a 160-bit key signature. When a node receives these submissions, it should, before storing the signatures, check that they are valid. It will also want to check how close the key signature is to the node's own IP, to know how long to cache it. An optional "sybil" argument may be included, which operates just like it does in announce_peer.

Note that signatures need not be announced by someone holding the private key that generated them -- once they're in the network, anyone can re-publish them.

arguments: {"signing_key" : "<public key>", "signatures" : [["message_one", "signature_one"], ... , ["message_n", "signature_n"]], "sybil" : <0 or 1>}

response: {}


Similar in operation to get_peers or get_signatures, get_raw is used to retrieve raw binary data blobs stored at the DHT node (as used by systems like the one described in Securely Finding Friends via DHT Dead Drops). The data is returned as a bencoded list of binary strings.

arguments: {"address" : "<20-byte address>"}

response: {data" : ["<first bytestring>", ...]}

or: {nodes" : "<compact node info>"}


Used to store the raw binary data retrieved by get_raw. Like the other announce_* queries, announce_raw includes an optional "sybil" argument.

arguments: {"address" : "<20-byte address>", "data" : "<bytestring>", "sybil" : <0 or 1>}

response: {}


This query operates as described in Distributed Search in Theseus. It takes as an argument a search string. The remote peer checks its own local files for this string in any way it sees fit, and returns a list of file descriptions including, among other things, titles and magnet links.

If the remote peer has hits, it returns a (compressed) JSON list containing the full metadata for each matching file, along with a string specifying the compression algorithm used. "compression" : "none" is valid, and may even be appropriate when the result string is small enough for the overhead of compression to negate its benefits.

If the remote peer doesn't have any hits, its response should instead contain a "nodes" key suggesting other nodes to query. The list of nodes technically can be empty (though it probably shouldn't be).

The format of the query string is left unspecified, at least for now. The intention is that this could even just contain whatever the user has typed into a search box. The question of how to actually carry out the lookup is left for the remote peer to answer.

arguments: {"query_string" : "string containing terms to be searched for"}

response: {"compression" : "compression algorithm name", "data" : "<compressed bytestring for JSON-format file metadata>"}

or: {"nodes" : "<compact node info>"}


This also works as described in Distributed Search in Theseus. It has a "target" key which maps to a (compressed) Bloom filter bitmask.

A "compression" key is included, specifying the algorithm used to compress the bitmask. It may be set to "none" (though that is not recommended). The default algorithm is left undefined at this point -- expect a future post on this subject (and on related Bloom filter parameters).

arguments: {"compression" : "algorithm name", "target" : "<compressed bitmask bytestring>"}

response: {"nodes" : "<compact node info>"}


Used to ask a remote peer to describe themself to the querying node. The reply contains a dictionary encoding information such as the remote node's ID, their local content Bloom filter, a protocol version or details on specific features they do or don't support, and so on.

By default, all available data is returned. The querying peer may limit the data returned by including the optional "keys" argument in their query and providing a comprehensive list of keys desired. This prevents large data like Bloom filters from being transmitted unnecessarily. The querying peer may also report that its own info has changed (such as would happen when a node changes ID or when files are added to its cache) by including an optional "advertise" key.

Submitting the query with "keys" included and mapped to an empty list is allowed. The reply's "info" key should map to an empty dictionary.

Including "advertise" : {"id" : ["<querying node's id>", "<querying node's id preimage>"]} and "keys" : ["id"] allows two nodes to exchange ID information using only one query and response between them. This is functionally equivalent to MLDHT's ping query (hence ping's omission).

Note that the "values" associated with keys within the "info" dictionary may be arbitrary bencoded data, even though the example below only shows strings. It is perfectly fine to include a set of flags as a binary string, to include nested lists or dictionaries, etc.

A node may have as many info fields as it wants, but it should at the very minimum provide these: {"bloom" : "<search keyword bloom filter>", "id" : ["<node's id>", "<id hash preimage>"], "max_version" : "protocol version string"}

arguments: {"advertise" : {"sender_key_one" : "sender_value_one", ...}, "keys" : ["key_one", "key_two", ..., "key_n"]}

response: {"info" : {"key_one" : "value_one", "key_two" : "value_two", ... , "key_n" : "value_n"}}


Used when two nodes wish to authenticate to one another and ensure that an active MitM attack is not taking place on their communications channel. Motivation discussed below.

All significant data for the protocol is contained within the "data" dictionary to make it simpler to sign it all at once. This dictionary has three keys: "token", which keys to a persistent random token identifying the particular SMP exchange taking place -- necessary because this is a multi-message protocol; "step", which keys to an integer identifying the step in the protocol performed by the current message; and "values", which keys to a dictionary listing the actual information being sent in the course of performing the protocol.

Further details are omitted for now; we'll just say this is an adaptation of OTR's SMP protocol and leave it at that for the time being. We'll deal with this in depth in the forthcoming blog post on message encryption.

arguments: {"data" : {"token" : "<large bytestring>", "step" : <int>, "values" : { ... }}, "signature" : "<signature of dictionary keyed by 'data'>"}

response: {}

Error Codes

The BEP-05 specification of KRPC, lists four error codes. They are:

201: "Generic Error"

202: "Server Error"

203: "Protocol Error (i.e. malformed packet, or bad query parameters)"

204: "Method Unknown"

We'll define one more error code on top of these:

211: "Rate-limiting active"

It seems unlikely that any non-malicious node would generate enough traffic to trigger this error, but we'll include it just in case so such nodes can figure out why they've stopped getting responses. I'm numbering this code 21x instead of 20x in an optimistic attempt at avoiding collisions with any other KRPC extensions which might exist. BEP-05 gives no motivation for the numbering of the original error codes, so this is just sort of a best guess at how to properly extend things.

Types of Signed Data

There are a few different reasons for people to store signed data in the DHT. You might be stating your trust in a given public key, or identifying a magnet link for a metadata file summarizing the torrents you've tagged, or sharing a Diffie-Hellman public key integer, or even submitting a signed revocation notice invalidating the content of a previous signed message.

Consistency is important in all these for obvious reasons. Here are formats these signed data could use. Like the queries described above, it's possible that future versions of the protocol may make modifications to some of these data structures, and so implementations should be able to gracefully handle unexpectedly long lists or seemingly extraneous keys appearing in these data structures. Future extensions to the protocol should attempt to limit their changes to those two categories whenever possible, for the sake of preserving backward-compatibility.

Each message-plus-signature datum should be a bencoded data structure of the following format:

[["Theseus Protocol Signed System Data", {"arguments" : { ... }, "expires" : unix_timestamp, "type" : "name"}], signature : "<digital signature of previous list entries>"]

This consists of two nested 2-tuples. The outer 2-tuple contains the inner 2-tuple, then that 2-tuple's signature.

The first list entry within the inner 2-tuple is fixed and serves no purpose other than to tag the signed data as internal to the Theseus protocol, greatly reducing the signature's value for e.g. social engineering attacks (or really anything except its intended use).

The second entry within the inner 2-tuple provides details of the data. Its "expires" key is optional and, if included, should key to an integer representing a Unix timestamp past which nodes should refuse to store this information. An expiration date, essentially. The "type" key defines what this datum is intended to communicate, and "arguments" maps to a dictionary containing any extra required information.

As far as types of data are concerned, we define the following type names: endorse_key, endorse_metadata, endorse_dh, revoke_signature, cite_keyserver.


This is used to signify that the signer trusts the signed key. One purpose for this data is to create an ad-hoc, semi-connected web of trust with several uses, as discussed in Theseus's introductory post.

The "identity" key is optional, and may be used to associate e.g. an email address with the target_key.

The "relation" key is used to indicate the relation between the signed and signing keys. Users who wish to compartmentalize key use may specify "relation" : "self" to indicate that the signed keys should be trusted equally to the signing key. Keying to "peer" is the most common expected use case for establishing web-of-trust relationships. Additional values for "relation" may be defined in future.

arguments: {"identity" : "<optional additional identifying info>", "relation" : "self|peer", "target_key" : "<public key to endorse>", "type" : "name of public key format"}


This is used to specify a magnet link corresponding to a torrent for a Theseus torrent metadata file. Theseus nodes build their content indices from files distributed in this way.

arguments: {"magnet" : "<metadata file magnet link>"}


This is used to provide a Diffie-Hellman public key for use in secure friend-finding. A good client will want to change this value somewhat often, so it probably makes sense to have these entries expire pretty quickly (maybe after six hours, say).

arguments: {"dh_key" : <Diffie-Hellman public key integer>}


This is used to unconditionally countermand any previously signed datum or data. Revocations may be issued individually or in batches. A list of cryptographic hashes is provided. These should be the full hashes of whatever records are being invalidated. The hash function used should also be specified. For the forseeable future SHA512 should be exclusively used, but this argument is still included for the sake of future-proofing.

Note that including an expiration date in one's original published data is more reliable than planning on issuing revoke_signature at a future point. This functionality is only really intended for things like un-endorsing a compromised key or replacing an obsolete version of a metadata file -- things where it's just about impossible to know when you'll want to do them.

arguments: {"data_hashes" : ["<hash of signed data to revoke>", "<second data hash>", ...], "hash_function" : "SHA512"}


This is used to announce that a remote key server is willing to vouch for the identity of the given key. It includes the identity used to query for the key and the server to query. The motivation for this data type's inclusion is discussed at length below. Note that we don't need to include the key itself, since we can retrieve a copy of it from the remote server and use that copy to check the signature.

Nodes should feel encouraged to actually query the keyserver and validate the signature against the returned key before storing it. However, nodes may opt not to do so if they are resource-strapped or if they don't want to expose themselves by generating the required network traffic. Verifications, if performed, should be rate-limited, to prevent naive nodes from being compelled by malicious nodes to flood keyservers with traffic.

arguments: {"identity" : "<keyserver query string>", "key" : "<public key>", "server" : "<server URL>"}

Torrent Metadata File Format

The format for metadata files is very simple. It should follow JSON format. The top-level data object should be a dictionary with two keys: "version" and "data". "version" should map to a version string used to tell clients how to read the file. "data" should map to a list of files, each of which is represented by a dictionary containing key/value pairs detailing the file. Required keys are "title", "authors", and "magnet"; recommended keys include "subject""tags", and "abstract". An example with bare-bones metadata on one file:

    "version" : "1",
    "data" : [
            "title" : "A Field Study of Digital Forensics of Intrusions in the Electrical Power Grid",
            "authors" : ["Eli Sohl", "Curtis Fielding", "Tyler Hanlon", "Julian Rrushi", "Hassan Farhangi", "Clay Howey", "Kelly Carmichael", "Joey Dabell"],
            "magnet" : "<magnet link>",
            "subject" : "computer science",
            "tags" : ["computer science", "security", "power grid", "forensics"],
            "abstract" : "text of abstract goes here (omitted for brevity in this example)"

The Bloom filter for a node storing this file should include the first and last names of each author, as well as the following words: field, study, digital, forensics, intrusions, electrical, power, grid, computer, science, security.

One minor detail concerning the Bloom filters: any worthwhile hash function is going to be case-sensitive, so we might as well specify that all entries be converted to lowercase before hashing them into the Bloom filter to avoid e.g. a filter containing "Forensics" failing to register a hit in a search for "forensics".

Description Section Revision Log

4/20/2017: Published.
4/21/2017: Clarifications to metadata format description; structure slightly changed to support addition of "version".



Perhaps the most discussion-worthy aspect of the design is the decision to operate over an encrypted TCP channel, rather than over unencrypted UDP packets as e.g. MLDHT and parts of Tribler operate. We'll save a detailed discussion of this topic for the future post in which the specifics of the channel are described. For now, a few brief comments:

A primary goal is to make the protocol as hard to fingerprint as possible. Hence all messages being exchanged in an encrypted channel (which on the wire appears to be fully random). If all an observer sees is a Diffie-Hellman key exchange followed by a large amount of seemingly random noise, clearly something's going on, but it's nontrivial to figure out what. It could be Theseus -- or it could be anything else that encrypts all its communications. This has two advantages: first, nobody can tell for sure that you're using Theseus from traffic analysis; second, any signature-based firewall rules written to block Theseus traffic would necessarily also block traffic from many other things, which is likely a prohibitive level of collateral damage. Making Theseus hard to fingerprint makes it hard to censor without also accidentally censoring a whole lot of other things you didn't want to break.

Note, as an addendum, that of course users of any peer-to-peer regularly advertise their contact information to other peers -- meaning, of course, that if anyone wants to find out whether you're using Theseus, they can just maintain a presence on the network and see if your IP address ever shows up in e.g. peer lookups. There are ways around this, for instance if some peers are willing to act as proxies for others, which is a topic I'll be addressing in a future blog post on preserving user privacy. But it's important to recognize that these can only ever protect a proper subset of the peers on the network. So making the protocol hard to fingerprint does not prevent its users from ever being identified. What it does do is increase the effort required from a passive attack to an active attack -- and a sustained, involved, and only partly reliable one at that -- which is a significant win for us.

Another goal is to make it as hard as possible to interfere with normal operation of the protocol. As a case study, MLDHT has to go through some contortions to pull this off: for instance, the token management between "get_peers" and "announce_peer". "get_peers" queries return, along with their data, a one-time token which must be included in a subsequent announce_peer query. The purpose of this token is "to prevent malicious hosts from signing up other hosts for torrents."

You might not think this would be a problem in the first place, since the recipient of an announce_peer query just signs up the sending peer for the torrent -- but source addresses of UDP packets can be spoofed, and so a malicious peer could try to sign up others by sending spoofed queries. The token solves this by introducing state of which the malicious peer is (presumably) not aware. A different solution is to just encrypt the communications channel with a secret known only by the participants. An outside observer who tries to inject queries into this channel would fail: since they don't know the encryption key, their messages would decrypt as gibberish. This simplifies the protocol design process, since it greatly weakens the assumptions we must make about the powers available to our attackers.

Some explanation should probably also be given for the choice of OTR (or a lightly tweaked version thereof) for the encryption. Why isn't it enough to just agree on a key and use it for the duration of the communication?

The answer is that using a single long-lived key is probably enough -- but that I don't like dealing in "probably"s.

A little context: OTR was originally developed to protect social communication, which the original authors saw as having different and more stringent privacy requirements than communications between automated systems. I agree with their assessment of the privacy requirements for social communications. I think it's interesting, though, that people never seem to even consider extending these requirements to the automatic communications made by programs at our behest. Wouldn't it make sense to expect the same privacy properties of, say, a question asked to a friend and a question asked to a search engine?

Quoting from the original paper on OTR, "Off-the-Record Communication, or, Why Not To Use PGP":
The notion of an off-the-record conversation well-captures the semantics one intuitively wants from private communication: only the two parties involved are privy to the contents of the conversation; after the conversation is over, no one (not even the parties involved) can produce a transcript; and although the participants are assured of each other’s identities, neither they nor anyone else can prove this information to a third party.
We absolutely want these properties for private conversations -- but really, they all seem like they'd be nice to have in any communications which aren't explicitly intended to be part of a public record. It's less that we'd necessarily miss these properties' absence, and more that their presence can hardly hurt. Forward secrecy in particular is a very appealing property in that it somewhat lessens the impact of key compromise.

We should also say a few words about the decision to break authentication out into an RPC exchange. The reason authentication was introduced into OTR in the first place was to allow the detection of man-in-the-middle attacks of the sort to which vanilla Diffie-Hellman is vulnerable. This is only possible when the transacting nodes know some information about each other a priori -- typically a public key -- or else possess a shared secret established over a separate, trusted channel. This works well enough in social contexts, but in this peer-to-peer context, where nodes don't necessarily know anything about each other and in all likelihood have been introduced to each other by other nodes to whom they are otherwise total strangers, we can't make such assumptions about key distribution. Not everyone will want to authenticate -- indeed, not everyone will possess a key their remote peer would trust enough to accept authentication by in the first place.

As such, the unfortunate truth is that traditional defenses against man-in-the-middle attacks cannot be applied in this context. It is however noted that nine times out of ten, it makes more sense in terms of cost vs benefit for an active attacker to simply run malicious nodes in the network rather than MitMing the communications of existing nodes which don't trust each other to begin with.

That said, some users do possess keys with which they can mutually authenticate. For these users we have the Socialist Millionaires authentication queries, in which the peers carry out the Socialist Millionaires protocol to verify that they possess the same shared encryption secret and sign their messages to prevent tampering in-transit. This allows the minority of peer connections for which authentication is desired to carry out this authentication securely.

Note that schemes like secure friend-finding securely associate a node's contact info with its public key, meaning that if that node subsequently refuses to authenticate using this public key (which it previously, securely attested to owning), then this refusal itself may be taken as an indication that communications are being tampered with by an active MitM attacker.

One piece of popular wisdom among the security-minded is, "don't roll your own crypto". This is very good advice and some might say I'm flirting with breaking it here by making protocol tweaks. All I can say is that whenever I deviate from standard practice, the decision is very carefully considered. As I'm sure you've gathered by now, I want this thing to be secure, and that means placing as little trust as possible in the work of any single person, including myself -- hence my extensive documentation of this design process. If I'm taking risks, I want to be exceedingly clear about what they are and why I consider them to be justified. Consider this full transparency and full disclosure rolled into one.

A corollary from the world of competition math comes to mind. When you take the Putnam exam, they explicitly warn you against submitting, quote, "worthless material". I've always appreciated the bluntness of this dictum and tried to apply it to my life more generally. This project is no exception.

As I'd hope you've gathered by now, I am also completely open to feedback. If you're skeptical about any part of the system, I'd be happy to discuss it with you further. You can get in touch with me in the comments, on Twitter, by email, through Signal, or through any combination thereof.

RPC Queries

The design of many of these is based off of Kademlia and MLDHT. BEP-05 does a good job of discussing the design considerations at work there.

As mentioned above, we deviate somewhat from this protocol by not sending IDs in every message. Including that key simply isn't necessary when we have a connection-based model, because we can track things like IDs as persistent state associated with the connection. Same with the "token" key in the MLDHT version of announce_peer: it's necessary there but not here, since we're not vulnerable to the same type trivial message spoofing that MLDHT is.

We use the DHT to track multiple types of data. All of them are associated with (hash-generated, and thus evenly distributed) addresses in the DHT, and get_peers is used to traverse the network regardless of what data we're interested in retrieving. For each type of data, we define a get_* query and an announce_* query to store and retrieve it respectively.

Note that announce_peer is very resistant to DoS attacks that flood a peer with data to store, because any given IP address can only sign itself up once, meaning you'd need a huge number of IP addresses at your disposal if you wanted to overwhelm a node with bad data. Unfortunately, the same is not true for announce_signature and announce_raw. Any single peer could generate and submit a colossal amount of raw data or signature data.

To deal with this, an important design goal for system components using these features is to make sure that as little actual data as reasonably possible is put in the DHT. If we can stick to this guiding principle, we can safely assume that anyone submitting large amounts of data is doing so maliciously and can be automatically blacklisted.

A little more on get_info: if we're on TCP then it just makes sense, really, to distribute peer metadata in its own message rather than just packing it into every single message like MLDHT does with its "id" argument. We can track this metadata easily enough if we just request it once and, as soon as we get it, associate it with the connection.

Having an explicit get_info query allows us to efficiently share more data than just peer IDs. The search subsystem uses it to track Bloom filters, for instance, and we include an indicator of the maximum protocol version supported by a node to make sure nodes can figure out how to talk to each other. It is expected that various extensions to the protocol might add keys to this dictionary; if the dictionary starts getting large enough, it might make sense to start breaking it up into sub-dictionaries indexed by feature name to give these different features their own "namespaces". At this point, though, such measures would probably be overkill.

Signed Data Formats

The hardest thing about this part of the design is trying to come up with something to call these. They're essentially statements of various kinds, endorsed by the holder of a specific key. But calling them "statements" lacks kick, calling them "declarations" is tacky, calling them "proclamations" makes it sound like we're not sure what century we're in... and so on. For now we'll just call them "data", "signed data", etc. Better ideas welcome.

As for the actual data types, a few of them deserve individual discussion.


As said above,
This is used to signify that the signer trusts the signed key. One purpose for this data is to create an ad-hoc, semi-connected web of trust with several uses, as discussed in Theseus's introductory post
In addition to endorsing others' public keys, users who would rather use a dedicated public key for their Theseus operations (which would be understandable from the point of view of compartmentalization, and because this key may generate many signatures over its lifespan) may use endorse_key to indicate and, well, endorse that dedicated key.

Some might wonder why we don't follow a pre-established standard format for key signatures. If we did this, people could import (say) their extensive lists of PGP key signatures that they've spent years curating, and have the associated web of trust data ready to go. But is this actually what we want? The social function of key-signing is endorsement of identity -- vouching that a certain person owns a certain key. The function of key-signing in Theseus is to recommend that anyone who wants to help store content from the signing key should also store content from the signed key -- so that, for instance, one researcher trusted by a large part of the network could endorse their academic peers and thus encourage the system to start storing these peers' documents as well. You don't necessarily want to make those endorsements for everyone whose public key you've ever signed.


The main noteworthy features of this is the ability to revoke signatures individually or in batches, and the use of cryptographic hashes. Using hashes ensures that we can countermand a previously issued signature without specifying what was originally signed. A signature previously issued in error might be embarrassing to the issuer, and so having a revocation scheme which avoids explicitly recording the mistake for all time is an attractive feature. Recording revocations in batches also lowers the amount of memory required of a DHT node to store the data.


This data type helps us address a difficult problem which until now I've been sort of skirting around. This is the issue of where we're getting trusted keys from in the first place.

We've talked at length about what to do with these keys once we've got them, but where do they come from? Well, people could input them manually -- which would be a huge pain. We could also get them from some kind of trusted source. We don't want to rely on a single such source, because this would undercut the principle of decentralization around which Theseus is designed. But allowing arbitrary sources to be cited and leaving it up to individual nodes to decide which sources to trust (perhaps with the assistance of a suggested list) seems like a more redundant, and thus more reliable, model.

I might discuss this in more depth in future. For now, a quick one-paragraph sketch: Peers could choose to trust (say) the key for any .edu email address registered in the PGP Global Directory, or any identity registered with a specific dedicated key server hosted by trusted Theseus users, or so on. Most key servers don't make it easy to scrape lists of their users, though, and even if we did have such a list, most of the users on it would have nothing to do with Theseus. cite_keyserver helps us dodge this issue: people with keys on a trustworthy key server can create cite_keyserver records, and people looking to build a list of trusted keys can periodically query a random, roughly evenly distributed set of DHT nodes to look for cite_keyserver records in the results (it might even make sense to combine this operation with the periodic random peer ID lookups used for network size estimation). These records can then be verified to get a working list of trusted keys.

Metadata Files

Since these files are meant to be human-readable and the data in them might have to be parsed by a web interface, JSON is preferred over bencode for these files. Since JSON is a fully plaintext file format and these records contain a number of repeated elements, it can be expected that they will compress quite well. As such, we compress them before sending them over the wire. The hope is that not only will this result in a more efficient use of bandwidth but it may also increase traffic opacity after encryption. Speaking of which...

Traffic Analysis

As observed several times above, many of the protocol's features bear strong resemblances to each other; for instance, call signatures are very similar across the get_* and announce_* subsystems, and peer traversal by search_nodes is very similar to traversal by get_peers. This is true not only for the (approximate) sizes of the messages but for the specific patterns of connections and messages established for each. The idea is that making these different operations follow similar traffic patterns makes passive analysis of protocol traffic even less rewarding. Messages which are close but not quite equal in length could even be padded pre-encryption, ensuring uniform message size on the wire and make things even more opaque.

All of this serves no real purpose aside from making life more difficult for passive observers at the internet-backbone level -- but that's something that I would argue is, in and of itself, an important goal. If the last couple decades have taught us anything, it's that ISPs and government agencies cannot be trusted to respect the privacy of the citizens they serve. If they refuse to stop surveilling us, then we have no choice but to try and make their jobs as hard as possible. All of this is self-defense.

Wednesday, March 22, 2017

The State of Theseus, One Month In

For just over a month now I've been working nonstop on one part after another of my current pet project, Theseus. Things are coming along very well so far -- in fact, I've surprised myself with the amount of progress I've been able to make in such a short time.

That said, I have yet to write a single line of code. This is because I'm trying to nail down a complete specification of the system first. The system has many moving parts, some of which possess complex interdependencies, and it's important to try to get this stuff right the first time.

The goal of this post is to take inventory of what's been done so far and to list out everything that's still needed in order to reach a point where work on an implementation can begin.

If you'd like to collaborate on Theseus, I'm hoping this post will give you an idea of where your contributions would be particularly valuable. If you're realizing at this point that you're not even sure what Theseus is, I've got you covered: The beginning of my introductory post, Theseus: A Robust System for Preserving and Sharing Research, can take care of that. I'll be directly referencing that post throughout the discussion below.

Progress on the Promises

In that original blog post, I laid out five major promises that I believed Theseus could provide. It would not be an exaggeration to say that these promises have guided the entire design of the system, and I remain fully confident in its ability to deliver on every single one of them. Some of them, I've already described solutions for. Others, I've found solutions but have yet to write them up. There are also some outstanding questions that need to be answered, either through hard work or consultation with subject experts.

The promises are laid out and described in the original post. Below I'll list them in order and discuss the progress made on them thus far.

1. A Guarantee of Robustness in the Face of Censorship

The system is useless unless it is robust under adversarial conditions. The unfortunate truth is that both private publishers and the current administration of my own country could conceivably take issue with this project's goals. As such, it is only prudent to do everything in my power to make sure the project can withstand their utmost efforts against it.

I lack the subject knowledge to understand the legal channels through which the project might be attacked. My suspicion, as I've mentioned elsewhere, is that anyone trying to pursue such an avenue of attack would be facing an uphill battle, since all I'm creating is a tool through which researchers may share their work without fear of censorship. Hopefully we'll be able to avoid putting this suspicion of mine to the test.

Setting aside such concerns (at least for the time being, until I can get in touch with someone having the appropriate legal background), we move on to technical concerns. These, I am more confident with. The question here is whether technical attacks exist which could prevent Theseus from functioning properly, effectively censoring the system by either rendering it unusable or rendering certain of its content inaccessible.

Denial-of-service attacks are obviously possible against individual nodes. However, the more nodes we have in the network, the less viable these attacks become. Thus this threat can be mitigated simply by making Theseus useful enough to see wide adoption.

At the heart of Theseus is a distributed hash table (DHT). This DHT has a number of uses in the system, some of which I have yet to write about in detail. Attacks on this DHT would have serious consequences. The most potent class of attacks on DHTs are known as Sybil attacks, which in fact are the bane of virtually all peer-to-peer systems. In Resisting Sybil Attacks in Distributed Hash Tables, I describe several countermeasures to these attacks which, to the best of my knowledge, constitute original research. Through these, I expect Theseus will be able to achieve virtually unprecedented resilience in the face of Sybil attacks.

Another critical component of Theseus is its distributed search functionality. This functionality is described in Distributed Search in Theseus. It relies on information returned by peers to get its results; as such, it obviously assumes access to honest peers. Dishonest peers can always return bad search results which would be nontrivial to identify as bad. However, they cannot prevent honest peers from returning honest results. The question of distinguishing honest and dishonest results is a very difficult one. Some methods that might help include: lending more weight to results returned by many nodes, using basic spam-filtering techniques locally, carrying out DHT lookups for returned links and prioritizing links with large numbers of available peers, prioritizing nodes possessing trusted public keys when carrying out the search, and so on.

The challenge here is to try to minimize the impact of these measures on search speed. It might be appropriate to allow the user to specify how paranoid their client should be, so that the more costly of these measures end up only being deployed as reactive countermeasures when a user starts noticing that they're getting lots of bogus search results.

Interesting future work in this vein could include: Investigating the viability of the countermeasures listed above, trying to come up with new countermeasures to add to the list, and investigating the various ideas' implications where latency is concerned.

The only other aspect of Theseus where censorship is applicable (as far as I can see) is where its social features are concerned. These have yet to be fully fleshed out, but anonymity and deniability are top priorities in their design. The only work I've released so far on this topic is Securely Finding Friends via DHT Dead Drops.

2. Guarantees of Anonymity for Users Who Need Them

The Theseus control protocol -- which handles everything except file downloads -- is intended to be low-bandwidth and to operate over TCP. As such, it could potentially be routed through a proxy, a VPN, or even over the Tor network. As mentioned in my original post, routing large downloads through Tor is rightly frowned upon; however, using it for lower-bandwidth activity like this is simply using the network as intended.

For some users, these options might not be enough. Maybe their threat model precludes simple solutions like VPNs (there are situations where this is entirely reasonable). Maybe they live in a country where access to Tor is restricted. For these users, I'm currently considering including in the Theseus spec a poor-hacker's equivalent to the Tor network, much like what Tribler uses for anonymizing downloads. More on that in a future post.

That takes care of anonymity at the control layer. At the download layer, here is what I'm thinking.

Users who trust a VPN can, of course, use that VPN to anonymize their UDP download traffic.

Regardless of other factors, the Theseus client should default to forcing encryption on all downloads / uploads. It's probably wise to include an option to disable mandatory encryption, for users who know what they're doing. It's probably also wise to discourage this by tucking the option deep in a config menu.

Encrypting traffic should go a long way towards thwarting ISP-level surveillance, but the person you're downloading from still knows what you're downloading. Thus, malicious peers still can get more insight into network behavior than we would perhaps like them to. As such it might be prudent to implement functionality similar or identical to the Tribler download-anonymizing spec linked above. I doubt it would be possible to make Theseus and Tribler clients interoperable -- but if it is possible, it might be a good idea. This is something that needs to be researched more thoroughly. Getting a clear idea of how this would work, if it would work, is an absolutely critical prerequisite to nailing down a draft spec for Theseus.

3. The Ability to Quickly Search for Papers on Any Topic

As mentioned above, Distributed Search in Theseus deals with this exact topic and gives a detailed description of the solution I have in mind. Next steps for realizing this solution are also described therein. Practically all that's left is to decide on implementation details.

4. The Ability to Securely Share Your Own Research

I have a full, detailed idea of how this will work. It is described in some detail in the original post on Theseus. Expect more in an upcoming post -- in fact, writing that post might be my next priority.

5. The (Potential) Ability to Sidestep Thorny Copyright Issues

As stated in the original post:
I am not a lawyer. Nothing in this blog post is legal advice, because legal advice is something I am not qualified to give. However, from my lay understanding I am optimistic about this tool's potential for circumventing restrictive laws and promoting freedom of information in the sense advocated by greats like Aaron Swartz.
Decentralized services are very difficult to bring down, especially if they have nodes in multiple countries. Skeptical? Just ask the music industry.
These quotes continue to accurately reflect my views on this topic. If anyone with a working knowledge of modern copyright law would be interested in exploring this topic further, please get in touch.

The Big To-Do List

In the discussions above, I outlined some things that still need to be done. Here is a more comprehensive list of what I see as standing between what we have and what we need to start implementing (roughly ordered to place prerequisites ahead of their dependencies):

  • Brainstorm front-end ideas
    • There won't be a blog post on this until I have enough material to include some cool illustrations.
    • The priorities for the front-end are to make search intuitive and collaboration as painless as possible.
    • The goal is to make Theseus useful to as many people as possible, so outside input would be invaluable here.
      • What features would be useful to you?
      • What are some things that existing research tools get wrong?
      • Would you be interested in collaborating on the project's interface design?
  • Work out a universal format for storing data in the DHT
    • We're using it as a workhorse for several different aspects of the design
    • It needs to be flexible and support a variety of types of data, including...
      • For metadata collection:
        • Magnet links to files containing large amounts of document metadata
        • Cryptographic signatures for the above magnet links
        • (potentially:) The public key used to generate said signatures
      • For friend-finding:
        • Raw encrypted data which arrives sans context
        • Diffie-Hellman public-key integer
        • Cryptographic signature of above integer, possibly w/ public key
      • For actual torrenting:
        • Tracking lists of peers (some of which might actually be anonymous introduction points a la Tribler anonymous seeding)
        • [this is meant to mimic but operate disjointly from Mainline DHT's functionality, providing torrent-finding functionality to peers who only trust anonymized communications with other Theseus nodes]
    • The question of whether or not to include public keys in the records is a tricky one which depends on several factors. The question is subtler than it may initially appear. I'm currently leaning towards including them.
    • One major priority in designing this format is to try and keep it as flexible as possible to encourage forward- and backward-compatibility, since it is possible that the DHT could end up shared between clients running multiple versions of the eventual Theseus protocol.
    • I've got a lot on this topic in my notebooks, and I'll probably be writing a post about it soon.
  • Work out a universal format for metadata files
    • This involves deciding what kinds of data we want to track. A non-exhaustive list of ideas, not all of which necessarily merit inclusion:
      • Title
      • Author(s)
      • Author-specified tag list
      • User-specified tag list
      • Year of publication
      • Citations in paper (potentially w/ cross-references for other Theseus-accessible papers)
      • Citations of paper (potentially w/ same)
      • User-provided annotations
      • etc...
    • The considerations outlined above concerning backward- and forward-compatibility apply here as well.
  • Work out an appropriate set of parameters for the search feature's Bloom filters
    • There is potential here for interesting math, pretty graphs, etc...
  • Work out in more detail the protocol's features concerning anonymity, as discussed in brief above, and potentially write a post about these as well if there turns out to be enough to say.
    • Do we like Tribler's approach to anonymizing downloads? If so, can we plug into their system? Would they be OK with that? Would we even want to? Would we be better-off just using their protocol across our own separate overlay graph?
  • Nail down a rough draft of the Theseus control protocol (& potentially of the UDP anonymization system as well, if applicable)
  • Start implementing and testing!

This might look like a long list, but it's a hell of a lot shorter than it would've been a month ago. I'm looking forward to seeing what it looks like a month from now!

Tuesday, March 21, 2017

Distributed Search in Theseus

Theseus: A Robust System for Preserving and Sharing Research
Resisting Sybil Attacks in Distributed Hash Tables
Securely Finding Friends via DHT Dead Drops 

Search algorithms in peer-to-peer networks have a long, storied, and rather unhappy history. Search sounds like a straightforward problem, but in fact it turns out to be alarmingly difficult even in non-adversarial settings. In realistic models which assume the presence of some hostile peers, the situation is even worse.

It has taken me a while to sit down and write this post. For that, there are two reasons. First off, I've been pretty busy: on top of keeping up with my job, last week was my last finals week before graduating with my Bachelor's, yesterday was commencement, and next weekend is PRCCDC. So I've been spread somewhat thin. The second reason is that search is a very hard problem, and I've been holding off until I'm confident I've got something worth sharing. Luckily, after making some major headway in the last couple weeks, I think I have just that.

Before we go any further, let me just say one thing: The design I'm proposing is simple enough that (like much else that I've written about on this blog) I'd be surprised if I'm the first to get to it. However, I have yet to find such prior work, and if anyone could point me towards it I'd be incredibly grateful.

For the sake of providing context, we'll take a moment to discuss previous attempts at peer-to-peer search before delving into what I'm proposing.

Prior Work

RFC 4981 is a whirlwind tour of major work on this topic up to '07. There seems to have been surprisingly little work since then (perhaps due to the difficulty of the problem, which RFC 4981 indirectly demonstrates through its length and density). Most of the strategies outlined are either uncomfortably simple, uncomfortably complex, or some unholy amalgam of both. Also, little attention seems to have been paid how these various approaches perform in adversarial contexts -- perhaps because everyone already knows that the findings would not be pretty.

One thing worth mentioning is that the obvious approach to p2p search, which was implemented by Gnutella, fails miserably. Their strategy was to simply define a "search" message which explicitly describes what you're looking for. Then, whenever you wanted to do a search, you would send these messages to as many of your peers as possible. This works alright for small networks, but to say it scales poorly would be a dramatic understatement. It can be seen that total bandwidth used by the peer swarm grows exponentially with network size. The details are discussed at some length in RFC 4981; their ultimate conclusion is that scalable systems unavoidably require some sort of routing structure to reduce overall network load.

In surveying the rest of the survey, I found nothing that lined up with all our requirements. In the time since that survey's publication, there have been some more recent developments. None that I know of have fully met our requirements, but a couple of them perhaps merit brief discussion before we move on.

In '08, a protocol called Cubit was released which took on the problem of inexact keyword search using the concept of edit distance. This work made enough of a splash at the time to land a writeup in Ars Technica. The idea, roughly: Nodes in a p2p overlay graph were transparently associated with specific search terms, and when you wanted to carry out a search, you would query not only the node corresponding to a given word but also the nodes whose words were closest to it by the edit distance metric (which the overlay structure ensures are proximal in the routing graph as well). One stated goal of the protocol is to gracefully handle misspellings in search queries. (I'm not sure why those can't just be taken care of by a spell checker.) The protocol seems, unfortunately, to suffer from serious lack of resilience in the presence of hostile peers.

Around the same time, the Tribler torrent client's team was developing (and continues to develop) a different approach to a similar problem. Specifically, their goal is efficient, low-bandwidth exact-match keyword search for torrent files. The idea is to have everyone in the network share their download histories, and to let peers with similar histories be close in the overlay graph. When you carry out a keyword search, the peers closest to you are queried first, the idea being that since their tastes resemble yours, maybe they've already looked for and found what you're looking for now. This scheme yields some nice properties; however, they come at an obvious and massive cost as far as user privacy is concerned. The Tribler team deserves credit for acknowledging this problem, as well as that their protocols operate in an adversarial environment. They even go so far as to suggest appropriate countermeasures for some malicious behavior. Unfortunately, their countermeasures are generally based on reputation-management systems. My reservations about such systems are discussed in previous posts.

So, neither of these systems quite meet our criteria. This, again, underlines how challenging the problem of p2p search really is. What I'm proposing here is a system which is (arguably) simpler than those suggested above, and which excels in some of their weaker areas, in particular by possessing several very nice privacy-related properties. It also allows for significant routing redundancy, which is desirable under adversarial conditions and which helps prevent hotspots in the network.

The Idea

Fun with Bloom Filters

The approach is centered around Bloom filters, which have many useful properties:
  1. At the minor cost of a small but nonzero false positive rate, they allow large content indices to be stored with remarkable space efficiency. This space efficiency makes them easy to fit on the wire, and also makes them amenable to storage en masse.
  2. They are inherently one-way, which has desirable privacy properties. Any given Bloom filter can be generated by a number of different input key sets, and going only by the filter there is no way to tell which set was used. Given a key, you can test its membership in a Bloom filter, but the possibility of false positives means that deniability is maintained even if a hit is found.
  3. They have meaningful behavior under bitwise logical operations. Relatedly, the number of elements in a Bloom filter or in the union or intersection of two Bloom filters can be straightforwardly estimated.
  4. Querying a Bloom filter for a key involves checking the filter against a certain bit mask derived from the key in a way that (for good hash functions) is infeasible to reverse. A user may thus search for a key by generating a bit mask and sending it to remote peers, who can check if their known filters contain it without even really knowing the key they're searching for. The mapping from keys to masks is many-to-one, meaning that attempts to invert it cannot ever achieve perfect confidence in their results.
  5. There has been interesting research into optimizing Bloom filters for network transfer, with excellent results. The work is presented here (PDF) and summarized in section 2.6 of this survey (PDF). The main finding is that while ordinary Bloom filters don't compress well, it's possible to create sparser Bloom filters which are optimized for compression and end up with a structure that takes up more space in memory but less space on the wire, for a given false positive rate, than the corresponding ordinary Bloom filter. This trade-off is in some cases very appealing.


We start with the assumption that everyone has, in local storage, metadata for a certain number of files. For this post, we're not yet going to worry about where it comes from. The problem we're trying to solve here is: if everyone has a certain amount of metadata but there's no shared record of who has what, how can we efficiently implement keyword search across everyone's metadata?

The answer, as I see it, is for everyone to communicate an approximation of what they've got using Bloom filters, and to use some simple tricks to build an efficient and highly redundant routing system out of those filters.

More specifically, everyone takes the names of (or other tag data for) every paper they have metadata on, hashes everything that's not a stop word into their Bloom filter, then shares this filter with other peers upon request. The linguistically perverse might be inclined to call this filter a peer's metametadata. Everyone stores up-to-date Bloom filters from all the peers in their routing tables. On top of that, whenever a Bloom filter is received, we compute the value |F_r| + |F_r AND F_l|, where |F| represents the Hamming weight of a filter, and F_r and F_l are the remote and local nodes' Bloom filters respectively. We record in a separate list the N nodes for which this sum is largest, out of all the nodes we've received Bloom filters for. Like peers in routing table buckets, the peers on this list are periodically pinged to check their freshness.

Keeping this list helps us make sure that we know at least a few nodes with well-populated large Bloom filters, which is useful since we have no built-in guarantees about the filters of the peers in our routing table. It has the further interesting property of promoting a degree of clustering in the network, since the second term in the summand favors nodes whose Bloom filters F_r have many bits in common with the local filter F_l.

It may be appropriate to place a dynamic upper limit on |F_r|, to weed out filters for which the false positive rate would be prohibitively large and to prevent malicious nodes from trivially poisoning everyone's filter lists with all-1 filters. It may also make sense to restrict this list to only include peers not already in the routing table, to avoid redundancy. If the clustering effect is found to be insufficiently weak to prevent hotspots from forming, we may want to include a constant coefficient like |F_r| + 2*|F_r AND F_l|.

Note that these are all decisions which individual nodes can make for themselves, internally, without taking input from or even informing the network. This means that different clients -- or successive versions of the same client -- can try different strategies without harming the overall network (except for the potential harm that would result from a client choosing a bad strategy and making itself less useful than it could otherwise be).


To carry out a search, we define a couple queries: one to request data from a remote node, and one to find nodes whose Bloom filters match the data you want to request. We'll describe them at a high level here and go into more detail below.

The first query is simple. Let's call it "search_files". You send a request for some data to the remote node, which either sends back what you wanted or apologizes and says it doesn't have what you're looking for.

The second query is simple as well but has certain subtleties. We'll call it "search_peers". The idea is that if your immediate neighbors don't have the data you need, you can ask them to help you find other nodes who might. You do this by sending the Bloom filter masks describing your query to the peers who are closest to having a match, i.e. who have most of the mask's bits set. These nodes, in their filter lists, keep track of other nodes with large Bloom filters having large intersections with their own. Thus, they are likely to be able to point you in the direction of nodes that match your search even better than they themselves do. Applying these query iteratively should allow a user to find very many peers.

The concept behind these queries is somewhat similar in spirit to the design of Kademlia, and proofs of Kademlia's proper function suggest broad strokes for several heuristic arguments for this system's efficacy as well.

Bit Masks

Say you have three keys: K_1, K_2, K_3. You have a Bloom filter B that you want to check the keys' membership in. Running our k hash functions on K_1 gives us a bit mask with (up to) k bits set in it. Call this mask M_1, and define M_2 and M_3 analogously.

To check for K_1's membership in B (in Python syntax), we test whether M_1 == B & M_1. To check for K_1 through K_3 at once, the naive way is to test whether (M_1 == B & M_1) and (M_2 == B & M_2) and (M_3 == B & M_3). The same logic can be implemented more simply as M_1 | M_2 | M_3 == B & (M_1 | M_2 | M_3).

Thus to search for a set of keywords K_1 ... K_n, we can simply compute the combined bit mask K_1 | K_2 | ... | K_n and send this combined bit mask as our argument to the query. This conserves bandwidth and increases opacity as to the actual terms being searched for.

This allows us to efficiently use AND logic in our search queries. We can implement OR conditions by conducting multiple queries, and implement NOT conditions through local filtering of results. Part of the goal of this design is to recognize which problems it is appropriate to solve at the protocol level and which problems are better handled in other parts of an application, and to use this recognition to design a protocol which does everything it needs to and nothing more.

Specifics (or Lack Thereof)

One weird thing about this system is that in its current iteration it requires constant, universal Bloom filter parameters across the entire network. This is tricky because filter parameters are typically optimized as functions of two variables: the desired uncompressed filter size, and the expected number of elements in the filter. Since the number of filter elements is likely to vary widely from peer to peer, coming up with good values here is a nontrivial task. More accurately: it's easy to do, but hard to do right.

When I can spare the time -- probably next week -- I'll sit down with a notebook, a Python terminal, and something green that's illegal to take across state lines, and work at this until I come up with some good arguments for what an optimal parameterization might be and have some good simulations to back them up. Another question to be answered as part of that analysis is whether the value of compressed Bloom filters outweighs the increased cost of computation they entail. This all will involve a lot of fun but fairly involved math. Expect gory details.

Monday, February 27, 2017

Securely Finding Friends via DHT Dead Drops

Theseus: A Robust System for Preserving and Sharing Research  
Resisting Sybil Attacks in Distributed Hash Tables

This won't be a long one -- I just wanted to take a few spare moments to jot down an idea for how friends in a Theseus-like network who know each other only by public key but don't have each other's network contact info could first get in touch.

I believe it's possible to do this without anyone else finding that contact info, deducing the social connection, or even knowing that either involved party is trying to get in touch with anyone. Those guarantees would be tricky to establish through naive use of public-key cryptography, but I believe this Diffie-Hellman based scheme can offer all that and more without breaking a sweat.

This solution strikes me as being simple and elegant enough that it's almost certainly been discovered before. As always, if anyone can point out prior work, please do.

In addition to basic cryptographic primitives, the scheme requires a couple nontrivial building blocks. I assume the existence of a distributed hash table like the one being designed for Theseus. I also assume that the DHT can be accessed anonymously. This second assumption is of course easier said than done, but we'll address that issue in its own blog post in due time.

This idea allows for two parties who only know each other's public keys to establish communication without any personally-identifying metainfo leaking to third parties. No centralized authority of any kind is required. Communicating public keys to each other is a bit of a pain but, again, we're sweeping that under the rug for now. Everything in due time.

How it starts is: if you're on the Theseus network and you have a public key that you're using with Theseus, you probably have some file signatures associated with the key published to the DHT. You're going to publish something else along with them: a very large integer signed by your key.

If your friend also has a public key associated with Theseus, they'll share a very large integer signed by their key as well.

These integers are each chosen to constitute one half of a Diffie-Hellman key exchange. Behind them lie personal secrets used to derive them. Put any two of them together, and someone knowing at least one of the personal secrets can derive a shared secret. So, if everyone who wants to be contacted by public key publishes such an integer, then any two such people can arrive at a shared secret without ever directly communicating.

(as an aside, if the public keys are for an elliptic-curve cryptosystem then ECDH could be used to do away with the need to publish traditional DH keys at all, since the public keys essentially also become the DH keys. This may or may not be desirable.)

It's perhaps worth mentioning that using the same public Diffie-Hellman integer in multiple key exchanges does nothing to weaken it. Also, while we're mentioning things, this requires system-wide constants for a DH modulus and base, which is no problem.

Okay, so this allows two arbitrary people who have exchanged public keys to reach a shared secret, but since the value of this secret is essentially random, they still haven't managed to communicate. They don't even have each other's contact info. What next?
The idea here is to set up a dead drop. Specifically:
  • Calculate the shared secret
  • XOR together cryptographic hashes of the two public keys
  • Compute an HMAC of the resulting value, keyed by the shared secret
    • The first 160 bits of this are the address for your dead drop
  • Take your contact info and encrypt it with the shared secret
  • Store the encrypted value at the dead drop address
The encryption key is derived from Diffie-Hellman, rather than from the public keys (which are only used to certify the trustworthiness of the DH integers), meaning that this system achieves forward secrecy even if users' private keys are later compromised.

One might be inclined to just directly hash the concatenation of the pubkey XOR and the shared secret instead of using an HMAC, but such a simple approach can have unexpected consequences, for instance vulnerability to length extension attacks. The use of an HMAC is a conservative measure intended to increase resilience to such arcanae.

The particularly conservative might, prior to storage, doubly encrypt their contact info: first with the shared secret, then with their contact's public key. This would complicate brute-force attacks on the shared secret somewhat, at the cost of increasing overhead and complicating the algorithm. That measure is not taken here because the benefit of it seems small enough to be outweighed by the cost.

The system as described seems to me fairly robust. To an outside observer, the dead drop address and the stored value convey no personal information. However, to someone who knows the shared secret and both public keys, the drop is easy to locate and decrypt. So as long as users can somehow anonymize themselves while accessing the data, this system has all the properties we want.