BEP:51
Title:DHT Infohash Indexing
Version: 889bd955b276e76fe40d926dcb09bf82502a6439
Last-Modified:Tue Jan 3 03:35:40 2017 +0100
Author: The 8472 <the8472.bep@infinite-source.de>
Status: Draft
Type:Standards Track
Content-Type:text/x-rst
Created:20-Dec-2016
Post-History:

Abstract

This extension enables DHT nodes to retrieve a sample of the infohashes that other nodes currently have in their storage.

Rationale

DHT indexing already is possible and done in practice by passively observing get_peers queries. But that approach is inefficient, favoring indexers with lots of unique IP addresses at their disposal. It also incentivizes bad behavior such as spoofing node IDs and attempting to pollute other nodes' routing tables [1].

With this extension a single node should be able to survey the entire DHT within a few hours without having to resort to non-compliant behavior.

Since it cannot be directly used to search for specific torrents it is not expected that average clients actually use this RPC, they only need to support replying to it. Instead the intended use is that a few specialized indexers in the network use it as building block to create and curate a database of available torrents and then make it available to endusers through other means, e.g. as a web service or through torrent feeds.

Sampling RPC

The new sample_infohashes remote procedure call requests that a remote node return a string of multiple concatenated infohashes (20 bytes each) for which it holds get_peers values.

When a node tracks more infohashes than would fit into a packet it should return a randomly sampled subset instead. After that size limit is reached nodes may occasionally precompute a subset and return that instead of calculating it upon each request. The recomputation interval is indicated by the interval field in the response. An indexer should not send a new sampling request to this node for that period of time. The fact that the precomputed set will not change within that interval should discourage them from doing that since they will not gain any new data. The the permissible range is between 0 and 21600 seconds (inclusive). Since the maximum is 6 hours an indexer that sweeps over the keyspace only once every 6 hours will not have to perform bookkeeping of backoff intervals.

The num field indicates how many infohash keys are currently in the node's storage. If the value is larger than the number of returned samples it indicates that the indexer may obtain additional samples after waiting out the interval.

Nodes supporting this extension should always include the samples field in the response, even when it is zero-length. This lets indexing nodes to distinguish nodes supporting this extension from those that respond to unknown query types which contain a target field [2].

The target and nodes/nodes6 mechanic for iterative lookups is included so that indexing nodes can perform a keyspace traversal with a single RPC per node by adjusting the target value for each RPC. It has no effect on the returned sample value.

message format

Request:

{
    "a":
    {
        "id": <20 byte id of sending node (string)>,
        "target": <20 byte ID for nodes>,
    },
    "t": <transaction-id (string)>,
    "y": "q",
    "q": "sample_infohashes"
}

Response:

{
    "r":
    {
        "id": <20 byte id of sending node (string)>,
        "interval": <the subset refresh interval in seconds (integer)>,
        "nodes": <nodes close to 'target'>,
        "num": <number of infohashes in storage (integer)>,
        "samples": <subset of stored infohashes, N × 20 bytes (string)>
    },
    "t": <transaction-id (string)>,
    "y": "r"
}

As usual, additional fields may be defined by other BEPs.

References

[1]Real-World Sybil Attacks in BitTorrent Mainline DHT, Liang Wang and Jussi Kangasharju (https://www.cl.cam.ac.uk/~lw525/publications/security.pdf)
[2]Libtorrent DHT extensions, Forward Compatibility (http://libtorrent.org/dht_extensions.html#forward-compatibility)