Thursday, July 6, 2017

How to Make a Python Module Subscriptable

This might not be a good idea -- but it is a cool one.


Say you've got a Python dictionary with some data in it. There are two ways (well, two simple ways) to access its contents: via subscripting, or via a .get method.
d = {'pears': 'good', 'peaches': 'meh', 'kumquats': 'yes please'}
d['kumquats'] # subscripting
d.get('pears') # .get

d.get(k) defaults to None if a key is not found, whereas d[k] throws a KeyError. When the key is present, though, these two ways of getting data are more or less equivalent.

The same goes for nested dictionaries:
d = {
    'eggs': {
        'breakfast': 'maybe',
        'lunch': 'i guess',
        'dinner': 'nah'
    },
    'sausage': {
        'breakfast': 'deffo',
        'lunch': 'yee',
        'dinner': 'why not'
    },
    'sushi': {
        'breakfast': 'wtf',
        'lunch': 'erryday',
        'dinner': 'sometimes'
    }
}
d['eggs']['lunch'] # this works
d.get('sushi').get('breakfast') # so does this
d.get('sausage')['breakfast'] # this too -- but it's mega ugly

The particularly astute reader might already have sense of where this is going.

I'm writing a module to manage and expose a JSON config file to a Python app, and internally it represents the config using native Python data types. To the rest of the world, access is mediated via a .get call:
class Config:
    ... # load config file data, etc

    def get(self, *args, **kwargs):
        return self.config.get(*args, **kwargs)

# config query singleton -- prevents unnecessarily re-reading conf file
_config = Config()
get = _config.get

The idea is that other code, after import config, can call config.get and interface with the module similarly to how you'd deal with a plain dictionary. But here's the catch: the config data includes nested dictionaries. Accessing these would involve either chaining .get calls or mixing .get and subscript notation -- either of which would lead to ugly code.

What to do? There are a couple options here.

The simple, easy option would be to find a different way of exposing the config file.

The other option would be to find a way to make the entire module subscriptable, and to make subscripting it return the correct results.

I'm not really opposed to either option, but I know which one sounds more fun.

Internally, Python handles subscripting by calling an instance's __getitem__ method. But just defining a function named __getitem__ at module level won't work:
user@development:/tmp/tests$ cat a.py 
import random

def __getitem__(self, key):
    return random.choice(("o shit lol", "whaa", "no way", "lmfao"))

user@development:/tmp/tests$ cat b.py 
import a

for _ in range(10):
    print(a[_])

user@development:/tmp/tests$ python3 b.py 
Traceback (most recent call last):
  File "b.py", line 4, in <module>
    print(a[_])
TypeError: 'module' object is not subscriptable

What's going wrong here? That's a simple question with a hard answer. The answer also varies somewhat depending on how far into the weeds of Python's object-oriented internals we want to wade. A few explanations, tuned to varying comfort levels:

Python doesn't make it easy for us to override 'magic' functions like __getitem__ at the module level, for that way madness lies.

Python wants __getitem__ to be a bound method, not just a function.

Specifically, __getitem__ must be bound to the object whose scope it's in (i.e. this module), something that there's no real pretty way of doing from inside the module body.

This last one explains why the following fails:
user@development:/tmp/tests$ cat a.py 
import random

class Foo():
    def __getitem__(self, key):
        return random.choice(("o shit lol", "whaa", "no way", "lmfao"))

_foo = Foo()
__getitem__ = _foo.__getitem__

user@development:/tmp/tests$ cat b.py 
import a

for _ in range(10):
    print(a[_])

user@development:/tmp/tests$ python3 b.py 
Traceback (most recent call last):
  File "b.py", line 4, in <module>
    print(a[_])
TypeError: 'module' object is not subscriptable

It's starting to look like our goal of making the module subscriptable would require some deep, dark wizardry, maybe even hooking on module import or something equally perverse. The good news is that we can contain the perversity to a reasonably small area. The solution is to have the module overwrite its own entry in the sys.modules registry, replacing itself with a tailor-made object possessing the desired bound method. Check it out:
user@development:/tmp/tests$ cat a.py 
import random
import sys
import types

class Foo(types.ModuleType):
    def __getitem__(self, key):
        return random.choice(("o shit lol", "whaa", "no way", "lmfao"))

sys.modules[__name__] = Foo("a")

user@development:/tmp/tests$ cat b.py 
import a

for _ in range(10):
    print(a[_])

user@development:/tmp/tests$ python3 b.py 
no way
lmfao
no way
o shit lol
no way
whaa
o shit lol
whaa
whaa
o shit lol

And there you have it! We use __name__ to guarantee we're indexing into sys.modules at the right place. We pass the name of the module ("a") to the object constructor, which was inherited from types.ModuleType and expects a module name.

So there you have it: that's how to make a module subscriptable. Similar tricks would probably work for making a module e.g. callable, or really giving it any combination of properties implemented through __magic__ functions.

The more you know.