The other day at work, a couple of us stumbled across an interesting little problem: when should a Python 2 class implement __cmp__(), and when should it implement __eq__()? In other words, how the heck do I make my objects comparable?
It seems like a simple question, but it turns out there's a lot going on here under the surface. You have to know some set theory, and you have to know some language history. It also helps to have experience with other programming languages to lend some perspective and understanding.
Oh, and the question has become academic—or will soon, depending on where you are in your migration process—because __cmp__() has been removed in Python 3.
The problem arose in a class for representing RPMs, a software package format used by Red Hat Linux and many other Linux distributions. The interesting attributes of an RPM are
(name, version, release, arch)
(Actually, there's also the epoch, but I'm just going to ignore that in this discussion. It's not necessary to illuminate the central points.)
For example, a Fedora machine might have a package called python-2.7.3-3.fc16.x86_64 installed. That breaks down as
name = python version = 2.7.3 release = 3.fc16 arch = x86_64
The main reason you would want to compare RPMs is to know if it's appropriate to apply an upgrade. Compare the above package to python-2.7.2-5.fc17.x86_64: the version is lesser, so the whole package is lesser, so you don't do that upgrade. But python-2.7.3-4.x86_64 is a valid upgrade: the version is the same, but the release is greater.
Thus, any class modelling RPMs for the purpose of planning package upgrades really should implement comparison methods. But which ones?
Up to and including Python 2.0, there was only __cmp__(). If you wanted to make your objects comparable, either for sorting or for comparison operators like == and >, you had to implement __cmp__().
The __cmp__(self, other) interface is simple: you return a negative integer if self is less than other, zero if they are equal, and a positive integer if self is greater. The precedent for __cmp__() is found in C's standard library: the interface is defined by bsearch() and qsort(), and the canonical implementation is strcmp().
For example, here is the definition of C's qsort():
void qsort(void *base, size_t count, size_t size, int (*compar)(const void *, const void *));
In case you're not familiar with C syntax, that last argument means "compar is a pointer to a function taking two generic pointer arguments and returning an integer". Putting qsort() and strmp() together is a bit non-obvious: see this sample code.
In C, it's traditionally the job of whoever calls qsort() to implement an appropriate comparison function. In Python, that job moves to the class author:
class Fruit(object): def __init__(self, name): self.name = name def __cmp__(self, other): return cmp(self.name, other.name)
Note the handy built-in cmp(), which is analogous to C's strcmp() but—this being Python—polymorphic, meaning it can take any two objects.
In the case of Fruit, we have made an arbitrary decision to compare based on name. That's a perfectly valid design choice: it's clear, unambiguous, and easy to explain. There's certainly no other obvious criterion on which to decide whether an apple is "greater" or "lesser" than a blackberry; it's largely a meaningless question.
So back to the model class for software packages:
class RPM(object): def __init__(self, name, version, arch): self.name = name self.version = version self.arch = arch
I'm ignoring the release attribute from here on, since it's unnecessary to explain the problem. And similarly, I'll just sweep aside the whole dark art of comparing software versions (1.9 > 1.1, but 1.9 < 1.10) and pretend that version is an integer. That makes it easy to compare two RPM objects:
def __cmp__(self, other): return (cmp(self.name, other.name) or cmp(self.version, other.version))
Note the sneaky use of boolean short-circut evaluation here: the first cmp() returns zero if the names are equal. Zero is a false value, which forces or to evaluate and return its second operand, the comparision of version attributes. Thus, if two packages have different names, we compare them purely by name. If the names are equal, the comparison is down to version. Normally I dislike conflating integers and booleans like this, but I bet C's designers intended exactly this sort of trickery—it's deeply consistent with the spirit of C. So it's the obvious idiom for comparison in C, and Python __cmp__() implementations have inherited that idiom.
So far, so good. But you may have noticed I'm ignoring one of the instance attributes: arch. What if I have RPM objects ("foo", 3, "i386") and ("foo", 3, "ppc"). Are they equal? Is one greater or lesser than the other? In other words, is the set of all RPMs a total order or a partial order?
The problem with __cmp__() is that it only makes sense for totally ordered sets. Roughly speaking, a totally ordered set is one where you can pick any two elements A and B and always know whether A is equal to, greater than, or lesser than B: no ambiguity. (See Wikipedia if you want the whole, mathematically rigorous story.)
The canonical example of a totally ordered set is the integers: 3 == 3, 5 < 17, and 15321355 > -6. No questions, no discussions, no dithering. In a programming context, strings are another totally ordered set: pick any two strings, and I (or strcmp() in C) can immediately tell you which one is greater.
But the world is full of partially ordered sets, where some pairs (A, B) simply don't have an order relative to each other. For example, if your abstraction for software packages includes the hardware architecture, then you lose the nice clear total ordering that we had before. But you gain something extremely valuable: a more accurate abstraction.
In Python, the solution to this dilemma is the rich comparison operator methods added back in Python 2.1. Instead of a single method to address sorting and all six comparison operators, there's now a one-to-one correspondence between operator and method:
Operator Method == __eq__() != __ne__() > __gt__() >= __ge__() < __lt__() <= __le__()
The good news is that rich comparisons make it possible to implement partially ordered abstractions, e.g. software packages with an arch attribute. The bad news is that instead of one single method, now you have to implement six.
Equality is usually pretty easy, e.g.:
def __eq__(self, other): return ((self.name, self.version, self.arch) == (other.name, other.version, other.arch)) def __ne__(self, other): return not self == other
(It seems silly that you have to implement __ne__() when the obvious implementation is to invert __eq__(), as I have done here, but that's how it is.)
Using tuple comparison is a nice trick: in other languages, you would have to manually code the series of comparisons that happen automatically for you in Python.
Relative comparisons are where things get interesting. First, let's take a look at strict greater-than and less-than:
def __gt__(self, other): return (self.name, self.version) > (other.name, other.version) def __lt__(self, other): return (self.name, self.version) < (other.name, other.version)
Again, use of Python's tuple comparison saves you from the tedious job of comparing name, then version. (Try implementing this in Java to see what a big win tuples are.)
By omitting the arch attribute from this comparison, we've immediately made this class a partial ordering. Consider these two instances:
("foo", 3, "i386") ("foo", 3, "ppc")
Neither one is greater or lesser than the other: __gt__() and __lt__() both return false, however you do it, because
("foo", 3) == ("foo", 3)
But neither are those two instances equal: arch differs, so the whole value differs. They're not equal, and neither one is greater or lesser than the other. The existence of that pair of values (and many others like them) makes this a partial ordering.
For completeness, let's throw in the final two methods:
def __ge__(self, other): return (self > other) or (self == other) def __le__(self, other): return (self < other) or (self == other)
(Again, it might seem silly that you have to implement __ge__() and __le__() when the default implementation is perfectly obvious... but that's the way it is. Python 2.7 added total_ordering to the functools module, but the name is a hint that it might not be appropriate to use in a partial ordering. I have not investigated carefully.)
Wait a second: whatever happened to 'There Should Be Only One Way To Do It'? Isn't that one of the prime directives of Python language design? Well, yes, but backwards compatibility beats TSBOOWTDI any day. Removing __cmp__() in favour of the rich comparison methods would have broken tons of code. So, we have to live with two ways of doing it.
Python 3 solves the TSBOOWTDI by just going ahead and breaking all that old code: support for __cmp__() has been removed, as has the cmp() builtin method. So much for C heritage!
One obvious question: what happens if you implement both __cmp__() and one or more of the rich comparison methods? Conveniently, the Python Reference Manual answers this for us:
[...the rich comparison methods] are called for comparison operators in preference to __cmp__()
Anyways, it's bad style to implement both __cmp__() and rich comparison in the same class; it tells the world that you are confused and have not read this article. It might happen by accident in a deep class hierarchy, but don't do it deliberately.
The rich comparison methods added in Python 2.1 are more general. You can use them to model both totally ordered sets and partially ordered sets. So if you're not sure whether you're dealing with a total ordering, you might as well suck it up and implement __eq__() and friends from the beginning.
And in Python 3, you have no choice: __cmp__() has been removed from the language. Unless you are writing a quick one-off hack that will never ever be ported to Python 3, you should avoid implementing __cmp__() in new code.