Wednesday, March 22, 2017

The State of Theseus, One Month In

For just over a month now I've been working nonstop on one part after another of my current pet project, Theseus. Things are coming along very well so far -- in fact, I've surprised myself with the amount of progress I've been able to make in such a short time.

That said, I have yet to write a single line of code. This is because I'm trying to nail down a complete specification of the system first. The system has many moving parts, some of which possess complex interdependencies, and it's important to try to get this stuff right the first time.

The goal of this post is to take inventory of what's been done so far and to list out everything that's still needed in order to reach a point where work on an implementation can begin.

If you'd like to collaborate on Theseus, I'm hoping this post will give you an idea of where your contributions would be particularly valuable. If you're realizing at this point that you're not even sure what Theseus is, I've got you covered: The beginning of my introductory post, Theseus: A Robust System for Preserving and Sharing Research, can take care of that. I'll be directly referencing that post throughout the discussion below.

Progress on the Promises

In that original blog post, I laid out five major promises that I believed Theseus could provide. It would not be an exaggeration to say that these promises have guided the entire design of the system, and I remain fully confident in its ability to deliver on every single one of them. Some of them, I've already described solutions for. Others, I've found solutions but have yet to write them up. There are also some outstanding questions that need to be answered, either through hard work or consultation with subject experts.

The promises are laid out and described in the original post. Below I'll list them in order and discuss the progress made on them thus far.

1. A Guarantee of Robustness in the Face of Censorship

The system is useless unless it is robust under adversarial conditions. The unfortunate truth is that both private publishers and the current administration of my own country could conceivably take issue with this project's goals. As such, it is only prudent to do everything in my power to make sure the project can withstand their utmost efforts against it.

I lack the subject knowledge to understand the legal channels through which the project might be attacked. My suspicion, as I've mentioned elsewhere, is that anyone trying to pursue such an avenue of attack would be facing an uphill battle, since all I'm creating is a tool through which researchers may share their work without fear of censorship. Hopefully we'll be able to avoid putting this suspicion of mine to the test.

Setting aside such concerns (at least for the time being, until I can get in touch with someone having the appropriate legal background), we move on to technical concerns, which I am more confident about. The question here is whether technical attacks exist which could prevent Theseus from functioning properly, effectively censoring the system by either rendering it unusable or rendering certain of its content inaccessible.

Denial-of-service attacks are obviously possible against individual nodes. However, the more nodes we have in the network, the less viable these attacks become. Thus this threat can be mitigated simply by making Theseus useful enough to see wide adoption.

At the heart of Theseus is a distributed hash table (DHT). This DHT has a number of uses in the system, some of which I have yet to write about in detail. Attacks on this DHT would have serious consequences. The most potent class of attacks on DHTs is the Sybil attack, which is the bane of virtually all peer-to-peer systems. In Resisting Sybil Attacks in Distributed Hash Tables, I describe several countermeasures to these attacks which, to the best of my knowledge, constitute original research. Through these, I expect Theseus will be able to achieve virtually unprecedented resilience in the face of Sybil attacks.

Another critical component of Theseus is its distributed search functionality. This functionality is described in Distributed Search in Theseus. It relies on information returned by peers to get its results; as such, it obviously assumes access to honest peers. Dishonest peers can always return bad search results which would be nontrivial to identify as bad. However, they cannot prevent honest peers from returning honest results. The question of distinguishing honest and dishonest results is a very difficult one. Some methods that might help include: lending more weight to results returned by many nodes, using basic spam-filtering techniques locally, carrying out DHT lookups for returned links and prioritizing links with large numbers of available peers, prioritizing nodes possessing trusted public keys when carrying out the search, and so on.
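The cheapest of these measures, favoring results that many independent peers agree on, can be sketched in a few lines. This is a hypothetical illustration rather than protocol code; the peer names and magnet-link strings are placeholders:

```python
from collections import Counter

def rank_by_agreement(results_by_peer):
    """Order search results so that links reported by many
    independent peers come first."""
    counts = Counter()
    for peer, links in results_by_peer.items():
        for link in set(links):  # each peer gets at most one vote per link
            counts[link] += 1
    return [link for link, _ in counts.most_common()]

# Toy data: three peers answer the same query.
results = {
    "peer_a": ["magnet:?A", "magnet:?B"],
    "peer_b": ["magnet:?A"],
    "peer_c": ["magnet:?A", "magnet:?C"],
}
ranked = rank_by_agreement(results)  # "magnet:?A" ranks first
```

A real client would combine this signal with the others listed above; agreement alone can be gamed by a flood of colluding peers, which is exactly why the DHT-level Sybil countermeasures matter here too.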

The challenge here is to try to minimize the impact of these measures on search speed. It might be appropriate to allow the user to specify how paranoid their client should be, so that the more costly of these measures end up only being deployed as reactive countermeasures when a user starts noticing that they're getting lots of bogus search results.

Interesting future work in this vein could include: Investigating the viability of the countermeasures listed above, trying to come up with new countermeasures to add to the list, and investigating the various ideas' implications where latency is concerned.

The only other aspect of Theseus where censorship is applicable (as far as I can see) is where its social features are concerned. These have yet to be fully fleshed out, but anonymity and deniability are top priorities in their design. The only work I've released so far on this topic is Securely Finding Friends via DHT Dead Drops.

2. Guarantees of Anonymity for Users Who Need Them

The Theseus control protocol -- which handles everything except file downloads -- is intended to be low-bandwidth and to operate over TCP. As such, it could potentially be routed through a proxy, a VPN, or even over the Tor network. As mentioned in my original post, routing large downloads through Tor is rightly frowned upon; however, using it for lower-bandwidth activity like this is simply using the network as intended.

For some users, these options might not be enough. Maybe their threat model precludes simple solutions like VPNs (there are situations where this is entirely reasonable). Maybe they live in a country where access to Tor is restricted. For these users, I'm currently considering including in the Theseus spec a poor-hacker's equivalent to the Tor network, much like what Tribler uses for anonymizing downloads. More on that in a future post.

That takes care of anonymity at the control layer. At the download layer, here is what I'm thinking.

Users who trust a VPN can, of course, use that VPN to anonymize their UDP download traffic.

Regardless of other factors, the Theseus client should default to forcing encryption on all downloads / uploads. It's probably wise to include an option to disable mandatory encryption, for users who know what they're doing. It's probably also wise to discourage this by tucking the option deep in a config menu.

Encrypting traffic should go a long way towards thwarting ISP-level surveillance, but the person you're downloading from still knows what you're downloading. Thus, malicious peers still can get more insight into network behavior than we would perhaps like them to. As such it might be prudent to implement functionality similar or identical to the Tribler download-anonymizing spec mentioned above. I doubt it would be possible to make Theseus and Tribler clients interoperable -- but if it is possible, it might be a good idea. This is something that needs to be researched more thoroughly. Getting a clear idea of how this would work, if it would work, is an absolutely critical prerequisite to nailing down a draft spec for Theseus.

3. The Ability to Quickly Search for Papers on Any Topic

As mentioned above, Distributed Search in Theseus deals with this exact topic and gives a detailed description of the solution I have in mind. Next steps for realizing this solution are also described therein. Practically all that's left is to decide on implementation details.

4. The Ability to Securely Share Your Own Research

I have a full, detailed idea of how this will work. It is described in some detail in the original post on Theseus. Expect more in an upcoming post -- in fact, writing that post might be my next priority.

5. The (Potential) Ability to Sidestep Thorny Copyright Issues

As stated in the original post:
I am not a lawyer. Nothing in this blog post is legal advice, because legal advice is something I am not qualified to give. However, from my lay understanding I am optimistic about this tool's potential for circumventing restrictive laws and promoting freedom of information in the sense advocated by greats like Aaron Swartz.
and:
Decentralized services are very difficult to bring down, especially if they have nodes in multiple countries. Skeptical? Just ask the music industry.
These quotes continue to accurately reflect my views on this topic. If anyone with a working knowledge of modern copyright law would be interested in exploring this topic further, please get in touch.

The Big To-Do List

In the discussions above, I outlined some things that still need to be done. Here is a more comprehensive list of what I see as standing between what we have and what we need to start implementing (roughly ordered to place prerequisites ahead of their dependencies):

  • Brainstorm front-end ideas
    • There won't be a blog post on this until I have enough material to include some cool illustrations.
    • The priorities for the front-end are to make search intuitive and collaboration as painless as possible.
    • The goal is to make Theseus useful to as many people as possible, so outside input would be invaluable here.
      • What features would be useful to you?
      • What are some things that existing research tools get wrong?
      • Would you be interested in collaborating on the project's interface design?
  • Work out a universal format for storing data in the DHT
    • We're using it as a workhorse for several different aspects of the design
    • It needs to be flexible and support a variety of types of data, including...
      • For metadata collection:
        • Magnet links to files containing large amounts of document metadata
        • Cryptographic signatures for the above magnet links
        • (potentially:) The public key used to generate said signatures
      • For friend-finding:
        • Raw encrypted data which arrives sans context
        • Diffie-Hellman public-key integer
        • Cryptographic signature of above integer, possibly w/ public key
      • For actual torrenting:
        • Tracking lists of peers (some of which might actually be anonymous introduction points a la Tribler anonymous seeding)
        • [this is meant to mimic but operate disjointly from Mainline DHT's functionality, providing torrent-finding functionality to peers who only trust anonymized communications with other Theseus nodes]
    • The question of whether or not to include public keys in the records is a tricky one which depends on several factors. The question is subtler than it may initially appear. I'm currently leaning towards including them.
    • One major priority in designing this format is to try and keep it as flexible as possible to encourage forward- and backward-compatibility, since it is possible that the DHT could end up shared between clients running multiple versions of the eventual Theseus protocol.
    • I've got a lot on this topic in my notebooks, and I'll probably be writing a post about it soon.
  • Work out a universal format for metadata files
    • This involves deciding what kinds of data we want to track. A non-exhaustive list of ideas, not all of which necessarily merit inclusion:
      • Title
      • Author(s)
      • Author-specified tag list
      • User-specified tag list
      • Year of publication
      • Citations in paper (potentially w/ cross-references for other Theseus-accessible papers)
      • Citations of paper (potentially w/ same)
      • User-provided annotations
      • etc...
    • The considerations outlined above concerning backward- and forward-compatibility apply here as well.
  • Work out an appropriate set of parameters for the search feature's Bloom filters
    • There is potential here for interesting math, pretty graphs, etc...
  • Work out in more detail the protocol's features concerning anonymity, as discussed in brief above, and potentially write a post about these as well if there turns out to be enough to say.
    • Do we like Tribler's approach to anonymizing downloads? If so, can we plug into their system? Would they be OK with that? Would we even want to? Would we be better-off just using their protocol across our own separate overlay graph?
  • Nail down a rough draft of the Theseus control protocol (& potentially of the UDP anonymization system as well, if applicable)
  • Start implementing and testing!

This might look like a long list, but it's a hell of a lot shorter than it would've been a month ago. I'm looking forward to seeing what it looks like a month from now!

Tuesday, March 21, 2017

Distributed Search in Theseus

Previously:
Theseus: A Robust System for Preserving and Sharing Research
Resisting Sybil Attacks in Distributed Hash Tables
Securely Finding Friends via DHT Dead Drops 

Search algorithms in peer-to-peer networks have a long, storied, and rather unhappy history. Search sounds like a straightforward problem, but in fact it turns out to be alarmingly difficult even in non-adversarial settings. In realistic models which assume the presence of some hostile peers, the situation is even worse.

It has taken me a while to sit down and write this post. For that, there are two reasons. First off, I've been pretty busy: on top of keeping up with my job, last week was my last finals week before graduating with my Bachelor's, yesterday was commencement, and next weekend is PRCCDC. So I've been spread somewhat thin. The second reason is that search is a very hard problem, and I've been holding off until I'm confident I've got something worth sharing. Luckily, after making some major headway in the last couple weeks, I think I have just that.

Before we go any further, let me just say one thing: The design I'm proposing is simple enough that (like much else that I've written about on this blog) I'd be surprised if I'm the first to get to it. However, I have yet to find such prior work, and if anyone could point me towards it I'd be incredibly grateful.

For the sake of providing context, we'll take a moment to discuss previous attempts at peer-to-peer search before delving into what I'm proposing.

Prior Work

RFC 4981 is a whirlwind tour of major work on this topic up to '07. There seems to have been surprisingly little work since then (perhaps due to the difficulty of the problem, which RFC 4981 indirectly demonstrates through its length and density). Most of the strategies outlined are either uncomfortably simple, uncomfortably complex, or some unholy amalgam of both. Also, little attention seems to have been paid to how these various approaches perform in adversarial contexts -- perhaps because everyone already knows that the findings would not be pretty.

One thing worth mentioning is that the obvious approach to p2p search, which was implemented by Gnutella, fails miserably. Their strategy was to simply define a "search" message which explicitly describes what you're looking for. Then, whenever you wanted to do a search, you would flood these messages out to as many of your peers as possible. This works tolerably for small networks, but to say it scales poorly would be a dramatic understatement: each search floods a large fraction of the network, and the number of searching nodes grows with the network as well, so the swarm's total bandwidth usage grows roughly quadratically with network size. The details are discussed at some length in RFC 4981; its ultimate conclusion is that scalable systems unavoidably require some sort of routing structure to reduce overall network load.

Surveying the rest of the RFC, I found nothing that lined up with all our requirements. In the time since its publication, there have been some more recent developments. None that I know of fully meet our requirements, but a couple of them merit brief discussion before we move on.

In '08, a protocol called Cubit was released which took on the problem of inexact keyword search using the concept of edit distance. This work made enough of a splash at the time to land a writeup in Ars Technica. The idea, roughly: Nodes in a p2p overlay graph were transparently associated with specific search terms, and when you wanted to carry out a search, you would query not only the node corresponding to a given word but also the nodes whose words were closest to it by the edit distance metric (which the overlay structure ensures are proximal in the routing graph as well). One stated goal of the protocol is to gracefully handle misspellings in search queries. (I'm not sure why those can't just be taken care of by a spell checker.) The protocol seems, unfortunately, to suffer from a serious lack of resilience in the presence of hostile peers.

Around the same time, the Tribler torrent client's team was developing (and continues to develop) a different approach to a similar problem. Specifically, their goal is efficient, low-bandwidth exact-match keyword search for torrent files. The idea is to have everyone in the network share their download histories, and to let peers with similar histories be close in the overlay graph. When you carry out a keyword search, the peers closest to you are queried first, the idea being that since their tastes resemble yours, maybe they've already looked for and found what you're looking for now. This scheme yields some nice properties; however, they come at an obvious and massive cost as far as user privacy is concerned. The Tribler team deserves credit for acknowledging this problem, and for recognizing that their protocols operate in an adversarial environment. They even go so far as to suggest appropriate countermeasures for some malicious behavior. Unfortunately, their countermeasures are generally based on reputation-management systems. My reservations about such systems are discussed in previous posts.

So, neither of these systems quite meet our criteria. This, again, underlines how challenging the problem of p2p search really is. What I'm proposing here is a system which is (arguably) simpler than those suggested above, and which excels in some of their weaker areas, in particular by possessing several very nice privacy-related properties. It also allows for significant routing redundancy, which is desirable under adversarial conditions and which helps prevent hotspots in the network.

The Idea

Fun with Bloom Filters

The approach is centered around Bloom filters, which have many useful properties:
  1. At the minor cost of a small but nonzero false positive rate, they allow large content indices to be stored with remarkable space efficiency. This space efficiency makes them easy to fit on the wire, and also makes them amenable to storage en masse.
  2. They are inherently one-way, which has desirable privacy properties. Any given Bloom filter can be generated by a number of different input key sets, and going only by the filter there is no way to tell which set was used. Given a key, you can test its membership in a Bloom filter, but the possibility of false positives means that deniability is maintained even if a hit is found.
  3. They have meaningful behavior under bitwise logical operations. Relatedly, the number of elements in a Bloom filter or in the union or intersection of two Bloom filters can be straightforwardly estimated.
  4. Querying a Bloom filter for a key involves checking the filter against a certain bit mask derived from the key in a way that (for good hash functions) is infeasible to reverse. A user may thus search for a key by generating a bit mask and sending it to remote peers, who can check if their known filters contain it without even really knowing the key they're searching for. The mapping from keys to masks is many-to-one, meaning that attempts to invert it cannot ever achieve perfect confidence in their results.
  5. There has been interesting research into optimizing Bloom filters for network transfer, with excellent results. The work is presented here (PDF) and summarized in section 2.6 of this survey (PDF). The main finding is that while ordinary Bloom filters don't compress well, it's possible to create sparser Bloom filters which are optimized for compression and end up with a structure that takes up more space in memory but less space on the wire, for a given false positive rate, than the corresponding ordinary Bloom filter. This trade-off is in some cases very appealing.
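To make the properties above concrete, here is a minimal Bloom filter sketch that uses a Python integer as the bitfield. The parameters and the hash construction are illustrative only, not a proposed Theseus standard:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over an integer bitfield."""
    def __init__(self, m=256, k=4):
        self.m, self.k, self.bits = m, k, 0

    def mask(self, key):
        """Derive up to k bit positions from the key. One-way and
        many-to-one, per properties 2 and 4 above."""
        mask = 0
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            mask |= 1 << (int.from_bytes(h, "big") % self.m)
        return mask

    def add(self, key):
        self.bits |= self.mask(key)

    def __contains__(self, key):
        m = self.mask(key)
        # May yield false positives, never false negatives.
        return self.bits & m == m

bf = BloomFilter()
bf.add("distributed hash table")
assert "distributed hash table" in bf
```

Property 4 falls out directly: a remote peer can be handed `mask(key)` alone and evaluate the membership test without ever seeing the key itself.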

Metametadata

We start with the assumption that everyone has, in local storage, metadata for a certain number of files. For this post, we're not yet going to worry about where it comes from. The problem we're trying to solve here is: if everyone has a certain amount of metadata but there's no shared record of who has what, how can we efficiently implement keyword search across everyone's metadata?

The answer, as I see it, is for everyone to communicate an approximation of what they've got using Bloom filters, and to use some simple tricks to build an efficient and highly redundant routing system out of those filters.

More specifically, everyone takes the names of (or other tag data for) every paper they have metadata on, hashes everything that's not a stop word into their Bloom filter, then shares this filter with other peers upon request. The linguistically perverse might be inclined to call this filter a peer's metametadata. Everyone stores up-to-date Bloom filters from all the peers in their routing tables. On top of that, whenever a Bloom filter is received, we compute the value |F_r| + |F_r AND F_l|, where |F| represents the Hamming weight of a filter, and F_r and F_l are the remote and local nodes' Bloom filters respectively. We record in a separate list the N nodes for which this sum is largest, out of all the nodes we've received Bloom filters for. Like peers in routing table buckets, the peers on this list are periodically pinged to check their freshness.

Keeping this list helps us make sure that we know at least a few nodes with well-populated Bloom filters, which is useful since we have no built-in guarantees about the filters of the peers in our routing table. It has the further interesting property of promoting a degree of clustering in the network, since the second term in the sum favors nodes whose Bloom filters F_r have many bits in common with the local filter F_l.

It may be appropriate to place a dynamic upper limit on |F_r|, to weed out filters for which the false positive rate would be prohibitively large and to prevent malicious nodes from trivially poisoning everyone's filter lists with all-1 filters. It may also make sense to restrict this list to only include peers not already in the routing table, to avoid redundancy. If the clustering effect turns out to be too weak to prevent hotspots from forming, we may want to weight the intersection term more heavily, e.g. |F_r| + 2*|F_r AND F_l|.

Note that these are all decisions which individual nodes can make for themselves, internally, without taking input from or even informing the network. This means that different clients -- or successive versions of the same client -- can try different strategies without harming the overall network (except for the potential harm that would result from a client choosing a bad strategy and making itself less useful than it could otherwise be).
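As a concrete sketch of this bookkeeping (again treating Bloom filters as Python integer bitfields, with hypothetical peer names), the scoring and list maintenance might look like:

```python
def popcount(x):
    """Hamming weight of an integer bitfield."""
    return bin(x).count("1")

def filter_score(f_remote, f_local):
    """The score described above: |F_r| + |F_r AND F_l|."""
    return popcount(f_remote) + popcount(f_remote & f_local)

def top_n_peers(f_local, remote_filters, n=8, max_weight=None):
    """Keep the N best-scoring peers, optionally discarding filters
    whose Hamming weight exceeds max_weight (the anti-poisoning
    limit discussed above)."""
    candidates = {
        peer: f for peer, f in remote_filters.items()
        if max_weight is None or popcount(f) <= max_weight
    }
    return sorted(candidates,
                  key=lambda p: filter_score(candidates[p], f_local),
                  reverse=True)[:n]

# Toy example: peer "c" advertises a suspiciously heavy (all-1s-ish)
# filter and is excluded by the weight limit.
best = top_n_peers(0b1111,
                   {"a": 0b1111, "b": 0b111100, "c": 0b11111111},
                   n=2, max_weight=6)  # ["a", "b"]
```

Since this logic is purely local, each client is free to tune n, the weight limit, or the scoring coefficient on its own, exactly as noted above.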

Search

To carry out a search, we define a couple queries: one to request data from a remote node, and one to find nodes whose Bloom filters match the data you want to request. We'll describe them at a high level here and go into more detail below.

The first query is simple. Let's call it "search_files". You send a request for some data to the remote node, which either sends back what you wanted or apologizes and says it doesn't have what you're looking for.

The second query is simple as well but has certain subtleties. We'll call it "search_peers". The idea is that if your immediate neighbors don't have the data you need, you can ask them to help you find other nodes who might. You do this by sending the Bloom filter masks describing your query to the peers who are closest to having a match, i.e. who have most of the mask's bits set. These nodes, in their filter lists, keep track of other nodes with large Bloom filters having large intersections with their own. Thus, they are likely to be able to point you in the direction of nodes that match your search even better than they themselves do. Applying this query iteratively should allow a user to discover a great many matching peers.

The concept behind these queries is similar in spirit to the design of Kademlia, and proofs of Kademlia's proper function suggest, in broad strokes, heuristic arguments for this system's efficacy as well.

Bit Masks

Say you have three keys: K_1, K_2, K_3. You have a Bloom filter B that you want to check the keys' membership in. Running our k hash functions on K_1 gives us a bit mask with (up to) k bits set in it. Call this mask M_1, and define M_2 and M_3 analogously.

To check for K_1's membership in B (in Python syntax), we test whether M_1 == B & M_1. To check for K_1 through K_3 at once, the naive way is to test whether (M_1 == B & M_1) and (M_2 == B & M_2) and (M_3 == B & M_3). The same logic can be implemented more simply as M_1 | M_2 | M_3 == B & (M_1 | M_2 | M_3).

Thus to search for a set of keywords K_1 ... K_n, we can simply compute the combined bit mask M_1 | M_2 | ... | M_n and send this combined bit mask as our argument to the query. This conserves bandwidth and increases opacity as to the actual terms being searched for.

This allows us to efficiently use AND logic in our search queries. We can implement OR conditions by conducting multiple queries, and implement NOT conditions through local filtering of results. Part of the goal of this design is to recognize which problems it is appropriate to solve at the protocol level and which problems are better handled in other parts of an application, and to use this recognition to design a protocol which does everything it needs to and nothing more.
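Putting the pieces together, a multi-keyword AND query might be assembled and checked as follows. The hash construction and parameters here are placeholders; the point is the mask algebra, which matches the logic above:

```python
import hashlib

M, K = 256, 4   # filter size and hash count; illustrative only

def mask(key):
    """Map a keyword to a bit mask with up to K bits set. Any shared
    scheme would do; this construction is a stand-in."""
    bits = 0
    for i in range(K):
        h = hashlib.sha256(f"{i}:{key}".encode()).digest()
        bits |= 1 << (int.from_bytes(h, "big") % M)
    return bits

def query_mask(keywords):
    """AND query: union the per-keyword masks into one opaque mask."""
    combined = 0
    for kw in keywords:
        combined |= mask(kw)
    return combined

def matches(bloom_bits, q):
    """A filter matches when every bit of the query mask is set."""
    return bloom_bits & q == q

# Build a toy filter containing two keywords, then run an AND query.
filt = mask("sybil") | mask("dht")
q = query_mask(["sybil", "dht"])
assert matches(filt, q)
```

OR is then just several such queries, and NOT is applied locally to whatever comes back, exactly as described above.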

Specifics (or Lack Thereof)

One weird thing about this system is that in its current iteration it requires constant, universal Bloom filter parameters across the entire network. This is tricky because filter parameters are typically optimized as functions of two variables: the desired uncompressed filter size, and the expected number of elements in the filter. Since the number of filter elements is likely to vary widely from peer to peer, coming up with good values here is a nontrivial task. More accurately: it's easy to do, but hard to do right.
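For reference, the textbook starting point for that analysis: given an expected element count n and a target false-positive rate p, the optimal uncompressed filter uses m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions. A quick calculator (the numbers plugged in below are arbitrary examples, not proposed network constants):

```python
from math import ceil, log

def bloom_parameters(n, p):
    """Optimal (uncompressed) Bloom filter parameters for n elements
    at false-positive rate p."""
    m = ceil(-n * log(p) / (log(2) ** 2))   # filter size in bits
    k = max(1, round((m / n) * log(2)))     # number of hash functions
    return m, k

# e.g. 10,000 metadata keywords at a 1% false-positive rate:
m, k = bloom_parameters(10_000, 0.01)  # -> (95851, 7), i.e. ~11.7 KiB
```

The hard part, as noted, is that n varies widely across peers while m and k must be network-wide constants, so any fixed parameterization will be mistuned for some peers. That's exactly where the simulations come in.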

When I can spare the time -- probably next week -- I'll sit down with a notebook, a Python terminal, and something green that's illegal to take across state lines, and work at this until I come up with some good arguments for what an optimal parameterization might be and have some good simulations to back them up. Another question to be answered as part of that analysis is whether the value of compressed Bloom filters outweighs the increased cost of computation they entail. This all will involve a lot of fun but fairly involved math. Expect gory details.