Hash function for function in Python

Posted on Mon, 11 Sep 2017 in Python

A couple of weeks ago one of my colleagues ask if Python function is possible to use as a key in dicts? Yes, it is. So, every function has a hash. But how is it calculated? Based on function name? Based on function byte code? Actually, it calculates by transforming a pointer to a function object. However, it is not so easy to find it in the CPython code.

Navigating through C code is an interesting experience and it is not easy at all. There are many macroses that make this process extremely difficult. I am sure you can find out all function that is called if you ask a hash for a function object.

But let's do it a little bit different way. We will go through a couple of calls that are important to understand how CPython works, and then we'll talk about CPython internals a bit.

I'm going to use the master branch of the official CPython repo. You can clone it or explore it in a browser. The function object is described in funcobject.c. The most interesting part for us is that:

PyTypeObject PyFunction_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "function",
    sizeof(PyFunctionObject),
    0,
    (destructor)func_dealloc,                   /* tp_dealloc */
    0,                                          /* tp_print */
    0,                                          /* tp_getattr */
    0,                                          /* tp_setattr */
    0,                                          /* tp_reserved */
    (reprfunc)func_repr,                        /* tp_repr */
    0,                                          /* tp_as_number */
    0,                                          /* tp_as_sequence */
    0,                                          /* tp_as_mapping */
    0,                                          /* tp_hash */
    function_call,                              /* tp_call */
    0,                                          /* tp_str */
    0,                                          /* tp_getattro */
    0,                                          /* tp_setattro */
    0,                                          /* tp_as_buffer */
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_GC,    /* tp_flags */
    func_new__doc__,                            /* tp_doc */
    (traverseproc)func_traverse,                /* tp_traverse */
    0,                                          /* tp_clear */
    0,                                          /* tp_richcompare */
    offsetof(PyFunctionObject, func_weakreflist), /* tp_weaklistoffset */
    0,                                          /* tp_iter */
    0,                                          /* tp_iternext */
    0,                                          /* tp_methods */
    func_memberlist,                            /* tp_members */
    func_getsetlist,                            /* tp_getset */
    0,                                          /* tp_base */
    0,                                          /* tp_dict */
    func_descr_get,                             /* tp_descr_get */
    0,                                          /* tp_descr_set */
    offsetof(PyFunctionObject, func_dict),      /* tp_dictoffset */
    0,                                          /* tp_init */
    0,                                          /* tp_alloc */
    func_new,                                   /* tp_new */
};

This is the function type declaration. Each line market by comment tp_something is a pointer to a C function that does something specific for the particular type. However, in our case 0 is used for tp_hash. It means that some inherited function will be called.

It is not so easy to discover what function is used by default. There is only one place where PyFunction_Type structure is used - as a parameter for the PyObject_GC_New macros call inside PyFunction_NewWithQualName.

Although working with C code may be interesting, we will use a shortcut. I suggest looking through "CPython’s PyTypeObject" to find out how PyTypeObject is organized and works. This document says that by default tp_hash source is PyBaseObject_Type.tp_hash and the default value is _Py_HashPointer.

Now we can easily find this function:

Py_hash_t
_Py_HashPointer(void *p)
{
    Py_hash_t x;
    size_t y = (size_t)p;
    /* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
       excessive hash collisions for dicts and sets */
    y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));
    x = (Py_hash_t)y;
    if (x == -1)
        x = -2;
    return x;
}

As you can see, there are no things like qualified function name is used - just its pointer itself.

It was an interesting code discovery challenge. Of course, I've used a shortcut, but it still required some knowledge about CPython internals. One of the best ways to catch up it is watching the CPython Internals videos.