The Descriptor Protocol, and Python Black Magic

Late last night, I saw a very confusing tweet:

    class A:
        def b(self):
            pass

    if A.b is A.b:
        print("Python 3")
    else:
        print("Python 2")
— Jake VanderPlas (@jakevdp) April 26, 2016

Like any self-respecting programmer, the first thing I did was copy and paste that into a terminal, even though I knew exactly what to expect. My response was simple:

Since I graduated last summer, I have been writing lots of both Python 2 and 3. This snippet seemed like something I should understand well. However, I did not, so this post is an attempt to solve that. I was inspired by Julia Evans, and her campaign to share the things she learns, however incomplete her understanding might be.

This post assumes you have at least a basic understanding of Python and OOP. For a good overview of OOP in Python, I recommend Leonardo Giordani’s series which builds up nicely from simple concepts to the internals of Python classes (he also has one on Python 2.x, although I haven’t read it closely).

So, what is this black magic?

My first instinct was to check the behavior of the comparison itself. While == delegates to an object’s __eq__ method to check equality, the is keyword checks identity. So whenever A.b is A.b comes out False, the two lookups can’t have returned the same object in memory!
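To make the == versus is difference concrete, here is a minimal sketch for Python 3. The Point class is a made-up example of mine, not something from the tweet:

```python
class Point:
    """Hypothetical class, used only to contrast == with is."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        # The == operator ends up here: compare by value
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)


p = Point(1, 2)
q = Point(1, 2)

print(p == q)   # True: __eq__ says the values match
print(p is q)   # False: two separate objects in memory
print(p is p)   # True: the very same object
```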

# Python 2
>>> A.b
<unbound method A.b>
>>> hex(id(A.b))
'0x1006ebc80'
>>> hex(id(A.b))
'0x1006beb90'

# Python 3
>>> A.b
<function A.b at 0x101b75158>
>>> hex(id(A.b))
'0x101b75158'
>>> hex(id(A.b))
'0x101b75158'

As expected! The memory locations (as given by id) in Python 2 are different, causing the identity check to fail. Not so in 3. So far so good. But why do we get unbound method on one end and function on the other? How are these objects even stored internally? In most cases, Python uses a dictionary, accessible under __dict__, to store the local variables, or namespace, of an object (note that not all objects have a __dict__, but that is a different story). Let’s look up b in A:

# Python 2
>>> A.__dict__['b']
<function b at 0x1007a8398>
>>> type(A.__dict__['b'])
<type 'function'>
>>> type(A.b)
<type 'instancemethod'>

# Python 3
>>> A.__dict__['b']
<function A.b at 0x101b75158>
>>> type(A.__dict__['b'])
<class 'function'>
>>> type(A.b)
<class 'function'>

Huh? In 2 we get an instancemethod, while 3 spits out a function, yet the objects stored in the enclosing __dict__ are plain functions in both. How does this work? The difference comes from the Descriptor Protocol, which lets an object stored on a class decide, through its __get__ method, what an attribute access actually returns. In Python 2, the protocol sets in place a type distinction based on how the function object is accessed. In the Descriptor HowTo Guide, Raymond Hettinger explains:

# Python 2
>>> class D(object):
...     def f(self, x):
...         return x

>>> d = D()
>>> D.__dict__['f'] # Stored internally as a function
<function f at 0x00C45070>
>>> D.f # Get from a class becomes an unbound method
<unbound method D.f>
>>> d.f # Get from an instance becomes a bound method
<bound method D.f of <__main__.D object at 0x00B18C90>>

In 3, this distinction between bound and unbound doesn’t exist, but strangely, the docs for Python 3 are not up to date, so I can’t tell what the underlying behavior is. The same code clearly has a different output:

# Python 3
>>> class D(object):
...     def f(self, x):
...         return x

>>> d = D()
>>> D.__dict__['f'] # Stored internally as a function
<function D.f at 0x1014021e0>
>>> D.f # Get from a class becomes an unbound method... NOT!
<function D.f at 0x1014021e0>
>>> d.f # Get from an instance becomes a bound method
<bound method D.f of <__main__.D object at 0x10123cf28>>
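Functions are themselves descriptors: they define __get__, and attribute lookup calls it for us. As a rough sketch for Python 3, we can call __get__ by hand on the same D class and see both outcomes. The variable names here are mine, and this only approximates what the interpreter does during lookup:

```python
class D(object):
    def f(self, x):
        return x


d = D()
raw = D.__dict__['f']            # the plain function stored in the class

# Roughly what D.f does in Python 3: no instance is supplied,
# so the function is handed back unchanged
from_class = raw.__get__(None, D)
print(from_class is raw)              # True

# Roughly what d.f does: an instance is supplied, so a bound method is built
from_instance = raw.__get__(d, D)
print(type(from_instance).__name__)   # method
print(from_instance.__self__ is d)    # True
print(from_instance.__func__ is raw)  # True
print(from_instance(42))              # 42
```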

Also explained in the documentation is the fact that bound and unbound methods are backed by the same C implementation and differ only in the value of their im_self attribute, which is NULL when unbound. So I am guessing that in 2, each attribute access creates a new instancemethod object wrapping the function at runtime, regardless of whether it is bound or unbound, while in 3 this only happens when bound, given that unbound methods don’t exist. This would make sense, as the function’s __get__ must be executed each time you access the attribute.
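One way to picture that guess is a pure-Python stand-in for the Python 3 behavior. This is only a sketch: FunctionLike is a hypothetical class of mine, not how CPython implements functions, and it leans on types.MethodType to build the bound method:

```python
import types


class FunctionLike:
    """Hypothetical stand-in for how a Python 3 function binds on access."""

    def __init__(self, func):
        self.func = func

    def __get__(self, obj, objtype=None):
        if obj is None:
            # Accessed on the class: hand back the same object every time
            return self
        # Accessed on an instance: build a fresh bound method every time
        return types.MethodType(self.func, obj)


class A:
    b = FunctionLike(lambda self: "called b")


a = A()
print(A.b is A.b)   # True: no new object is created
print(a.b is a.b)   # False: each access builds a new bound method
print(a.b())        # called b
```

A Python 2 flavored version would wrap the function in both branches of __get__, which is why neither comparison would hold there.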

If that were the case, we would expect that accessing b on an instance of A would always return a different object, regardless of which Python runtime we’re on, as the result is always bound:

# Python 2&3
>>> a = A()
>>> a.b is a.b
False
>>> hex(id(a.b))
'0x1003bf988'
>>> hex(id(a.b))
'0x1003f1448'
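Identity is not the whole story, though. In Python 3, two bound methods built from the same function and the same instance still compare equal with ==, and both point back at the single function stored on the class. One caveat about the id calls: each temporary bound method can be freed as soon as id returns, so CPython may reuse the address and print the same value twice. Holding on to both objects, as in this sketch, avoids that:

```python
class A:
    def b(self):
        pass


a = A()
first = a.b    # keep both bound methods alive so their ids cannot be reused
second = a.b

print(first is second)            # False: two distinct method objects
print(first == second)            # True: same function, same instance
print(first.__func__ is A.b)      # True: both wrap the one function
print(second.__func__ is A.b)     # True
print(id(first) != id(second))    # True while both are alive
```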

So, the reason why A.b is A.b in Python 3, and not Python 2 is this whole bound/unbound story. Seems like the Descriptor Protocol is responsible for this sorcery! Magic is just technology we don’t understand, yet.

If you have more insight into the inner workings of this, I’d love to hear about it.

Update (4/26/16): Jake VanderPlas replied to my tweet, and pointed to a 2009 post by Guido describing the behavior. Apparently, the bound/unbound distinction was introduced as a way to achieve “first-class everything,” which methods didn’t quite fit into. Python 3’s undoing of unbound methods is just a further expression of the idea.

Update 2 (4/29/16): Today I received an email from Todd Jennings, who pointed me to the bug that tracks the out-of-date documentation for Python 3. Sadly, it is marked as still waiting.

Update 3 (8/22/16): After PyBay, Wesley Chun pointed out that the definition of A was that of a classic 2.x class, while the rest of the article used new-style classes. Changing the class definition to inherit from object (as in, class A(object):) doesn’t change the behavior that I describe above, for either Python 2.x or 3.x. To remain true to the original tweet, I have kept the class definition without explicit inheritance, but the distinction is important.
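For the Python 3 side of this, a quick check shows why the two spellings behave the same: every class there inherits from object whether or not it says so. The class names below are mine, and this sketch says nothing about classic classes in 2.x:

```python
class Implicit:
    def b(self):
        pass


class Explicit(object):
    def b(self):
        pass


print(Implicit.__bases__)          # (<class 'object'>,)
print(Explicit.__bases__)          # (<class 'object'>,)
print(Implicit.b is Implicit.b)    # True
print(Explicit.b is Explicit.b)    # True
```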


Image: “The Witch No. 1” by Baker, Joseph E. - Licensed under Public Domain, via Wikimedia Commons
