Hash function for function in Python
Posted on Mon, 11 Sep 2017 in Python
A couple of weeks ago one of my colleagues ask if Python function is possible to use as a key in dicts? Yes, it is. So, every function has a hash. But how is it calculated? Based on function name? Based on function byte code? Actually, it calculates by transforming a pointer to a function object. However, it is not so easy to find it in the CPython code.
Navigating through C code is an interesting experience and it is not easy at all. There are many macroses that make this process extremely difficult. I am sure you can find out all function that is called if you ask a hash for a function object.
But let's do it a little bit different way. We will go through a couple of calls that are important to understand how CPython works, and then we'll talk about CPython internals a bit.
I'm going to use the master branch of the official CPython repo. You can clone it or explore it in a browser. The function object is described in funcobject.c. The most interesting part for us is that:
PyTypeObject PyFunction_Type = {
PyVarObject_HEAD_INIT(&PyType_Type, 0)
"function",
sizeof(PyFunctionObject),
0,
(destructor)func_dealloc, /* tp_dealloc */
0, /* tp_print */
0, /* tp_getattr */
0, /* tp_setattr */
0, /* tp_reserved */
(reprfunc)func_repr, /* tp_repr */
0, /* tp_as_number */
0, /* tp_as_sequence */
0, /* tp_as_mapping */
0, /* tp_hash */
function_call, /* tp_call */
0, /* tp_str */
0, /* tp_getattro */
0, /* tp_setattro */
0, /* tp_as_buffer */
Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_GC, /* tp_flags */
func_new__doc__, /* tp_doc */
(traverseproc)func_traverse, /* tp_traverse */
0, /* tp_clear */
0, /* tp_richcompare */
offsetof(PyFunctionObject, func_weakreflist), /* tp_weaklistoffset */
0, /* tp_iter */
0, /* tp_iternext */
0, /* tp_methods */
func_memberlist, /* tp_members */
func_getsetlist, /* tp_getset */
0, /* tp_base */
0, /* tp_dict */
func_descr_get, /* tp_descr_get */
0, /* tp_descr_set */
offsetof(PyFunctionObject, func_dict), /* tp_dictoffset */
0, /* tp_init */
0, /* tp_alloc */
func_new, /* tp_new */
};
This is the function type declaration. Each line market by comment tp_something
is a pointer to a C function that does something specific for the particular type. However, in our case 0 is used for tp_hash. It means that some inherited function will be called.
It is not so easy to discover what function is used by default. There is only one place where PyFunction_Type
structure is used - as a parameter for the PyObject_GC_New
macros call inside PyFunction_NewWithQualName
.
Although working with C code may be interesting, we will use a shortcut. I suggest looking through "CPython’s PyTypeObject" to find out how PyTypeObject is organized and works. This document says that by default tp_hash
source is PyBaseObject_Type.tp_hash
and the default value is _Py_HashPointer
.
Now we can easily find this function:
Py_hash_t
_Py_HashPointer(void *p)
{
Py_hash_t x;
size_t y = (size_t)p;
/* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
excessive hash collisions for dicts and sets */
y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));
x = (Py_hash_t)y;
if (x == -1)
x = -2;
return x;
}
As you can see, there are no things like qualified function name is used - just its pointer itself.
It was an interesting code discovery challenge. Of course, I've used a shortcut, but it still required some knowledge about CPython internals. One of the best ways to catch up it is watching the CPython Internals videos.
Got a question? Hit me on Twitter: avkorablev