Good practices in Python
========================

:date: 2013-01-20 17:34

This post is a collection of various facts about Python:

* common mistakes that I encounter frequently when reading code written by
  myself or other people.

* specific features of the Python language that are not very well known and
  that I think should be used more.

* general recommendations regarding Python.

Please note that I do not consider myself a Python expert, so it is possible
that the following text contains some inaccurate statements.

Also, due to its very nature, this post is rather unstructured. The table of
contents should help you jump directly to the part you are interested in.

.. contents::
   :local:

List comprehensions
-------------------

*List comprehensions* give Python users a very concise and powerful syntax to
build a list from another list (or any iterable object). The syntax is the
following:

.. code-block:: python

    result_list = [expression(item) for item in original_list if condition(item)]

which means that ``result_list`` will be a list containing
``expression(item)`` (an expression computed from ``item``) for each ``item``
element of ``original_list`` for which ``condition(item)`` (a boolean
expression involving ``item``) is ``True``. The boolean condition, which
allows you to filter the list, is optional.

For example, to compute the list of squares of elements in a list, instead
of:

.. code-block:: python

    l = [1, 2, 3]
    result = []
    for i in l:
        result.append(i*i)

which is inefficient because of the repeated calls to ``append``, one could
use a *list comprehension*:

.. code-block:: python

    l = [1, 2, 3]
    result = [i*i for i in l]
    # result = [1, 4, 9]

In addition to being shorter, the above code is also faster (around a 3x
improvement in this case, although the exact ratio will vary with the Python
version and the machine) because you build the list in one instruction.

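The speedup quoted above depends on the interpreter and on the size of the
list, so it is worth measuring on your own machine. Here is a minimal sketch
using the standard ``timeit`` module. The function names, the list size and
the number of repetitions are arbitrary choices made for this illustration,
and the code is written so that it also runs under Python 3:

.. code-block:: python

    import timeit

    def squares_loop(values):
        # Explicit loop with repeated calls to append
        result = []
        for i in values:
            result.append(i * i)
        return result

    def squares_comprehension(values):
        # Same computation written as a list comprehension
        return [i * i for i in values]

    values = list(range(1000))

    # Run each version 200 times and compare the total elapsed time
    loop_time = timeit.timeit(lambda: squares_loop(values), number=200)
    comp_time = timeit.timeit(lambda: squares_comprehension(values), number=200)
    print("loop: %.4fs, comprehension: %.4fs" % (loop_time, comp_time))

Both functions return the same list; only the elapsed times differ, and the
ratio you observe may well be different from the one mentioned above.
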
Another example, to compute the list of square roots of all non-negative
elements of a list (you would get into big trouble trying to compute the
square root of a negative element: ``math.sqrt`` raises a ``ValueError``):

.. code-block:: python

    from math import sqrt

    l = [4, -3, 9]
    result = [sqrt(i) for i in l if i >= 0]
    # result = [2.0, 3.0]

The same syntax also exists for dictionaries; this is called a *dict
comprehension* (very original, isn't it?). For example, to transform a list
of (name, phone number) pairs into a dictionary, for faster lookup:

.. code-block:: python

    l = [("Barthes", "+33 6 29 64 91 12"), ("Dumbo", "+001 650 472 4243")]
    d = {name: phone for (name, phone) in l}

You can get more details about *list comprehensions* in the dedicated section
of the official documentation.

The multiple faces of ``in``
----------------------------

The ``in`` keyword has several different meanings and makes Python code so
easy to write that people often forget to use it.

* ``in`` gives a universal syntax to iterate over iterable objects. For
  example, to iterate over a list, instead of:

  .. code-block:: python

      l = [1, 2, 3]
      for i in range(len(l)):
          print l[i]

  you could simply write:

  .. code-block:: python

      l = [1, 2, 3]
      for i in l:
          print i

  Similarly, to iterate over a dictionary, instead of:

  .. code-block:: python

      d = { ... }
      for key in d.keys():
          print d[key]

  you could write:

  .. code-block:: python

      d = { ... }
      for key in d:
          print d[key]

* ``in`` also allows you to test whether an element belongs to some
  structure: a list, a dictionary (in which case the keys are tested) or any
  other iterable object. On strings, it tests for the occurrence of a
  substring. For example:

  .. code-block:: python

      l = [line for line in open("server.log") if "Connected" in line]

  builds the list of lines from the file ``server.log`` containing
  ``Connected`` as a substring.

Manipulating lists with atomic instructions
-------------------------------------------

More generally, it is advised to avoid iterating over a list with an explicit
``for`` loop when a single instruction can do the job.
``for`` loops are slow in Python, and writing an operation over a list as a
single instruction lets Python run the iteration in optimized internal code.
*List comprehensions* often help in replacing an iteration by a single
instruction. Here are a few other functions which can be helpful in this
regard:

* ``join`` can be useful to format a list. For example, to print the words
  whose first letter is ``a`` in a list of words, instead of:

  .. code-block:: python

      l = [ ... ]
      result = ""
      for word in l:
          if word[0] == 'a':
              result += word + " "
      print result

  you could do:

  .. code-block:: python

      l = [ ... ]
      print " ".join([word for word in l if word[0] == 'a'])

* ``sum``, to sum the elements of a list.

* ``map``, to apply a given function to all elements in a list. For example,
  to reverse all the words in a list:

  .. code-block:: python

      l = ["Dumbo", "Polochon"]

      def reverse(word):
          return word[::-1]

      m = map(reverse, l)
      # m = ['obmuD', 'nohcoloP']

*Slices* are also very useful when it comes to manipulating lists (or
sublists) in blocks. Remember that if ``l`` is a list (or any other sequence,
such as a string or a tuple), ``l[begin:end:step]`` will extract all the
elements from index ``begin`` (included) to index ``end`` (excluded) with a
step of ``step`` (this last parameter being optional). If the ``begin``
parameter is omitted, it is given 0 as default value. Similarly, the default
value of ``end`` when unspecified is ``len(l)`` (the number of elements in
``l``). A negative value for ``begin`` or ``end`` is counted from the end of
the list. For example, to extract all the elements but the last one:

.. code-block:: python

    l = [1, 2, 3]
    m = l[:-1]
    # m = [1, 2]

Using a negative value for the ``step`` parameter can be useful to walk
through a sequence in reverse order, as shown in the example given above to
take the mirror image of a word:

.. code-block:: python

    word = "dumbo"
    drow = word[::-1]
    # drow = "obmud"

which compensates for the scandalous lack of a ``reverse`` method for strings
in Python.

Exceptions
----------

Exceptions provide a powerful tool found in many high-level programming
languages which is often under-used. They allow for a less defensive
programming style by handling errors *as they appear* instead of making tests
*beforehand* to prevent them from happening.

In Python, every time you try to execute an illegal operation (*e.g.* trying
to access an element outside a list's boundaries, dividing by zero, etc.),
instead of simply crashing the program, Python raises an exception which can
be caught, giving the programmer a last chance to fix the problem before the
program ultimately crashes.

The syntax to catch exceptions in Python is the following:

.. code-block:: python

    try:
        ....  # piece of code potentially raising the exception named Kaboum
    except Kaboum:
        ....  # piece of code to be executed if the above code raises Kaboum

For example, if a line of code contains a division by a number which is only
rarely equal to zero, instead of systematically checking that the number is
non-zero, it is more efficient to wrap the line in a ``try ... except
ZeroDivisionError:`` block to handle specifically the rare cases when the
number is zero. This is the well-known principle: *it is easier to ask for
forgiveness than permission*.

Another example: when trying to access an unbound key in a dictionary, Python
raises the ``KeyError`` exception. This exception can be used to initialize
the value associated with the unbound key. For example, to compute a
dictionary of word counts in a text, you can often find:

.. code-block:: python

    text = "..."
    result = {}
    for word in text.split():
        if word in result:
            result[word] += 1
        else:
            result[word] = 1

You could instead use the ``KeyError`` exception to your advantage to avoid
the systematic ``if`` test:

.. code-block:: python

    text = "..."
    result = {}
    for word in text.split():
        try:
            result[word] += 1
        except KeyError:
            result[word] = 1

The difference with the previous code is that *most of the time*, this code
will behave exactly as if the body of the ``for`` loop only contained the
instruction ``result[word] += 1``. This can give a noticeable speedup
compared to the first code, where a test was computed for each iteration of
the loop, as long as most words have already been seen: raising and catching
an exception is itself costly, so the gain disappears if the ``except``
branch is taken often.

See the dedicated page in the official documentation.

Values equivalent to ``True`` or ``False``
------------------------------------------

If ``test`` is a boolean variable (equal to ``True`` or ``False``), we know
that it is redundant to write:

.. code-block:: python

    if test == True:
        ...

instead of:

.. code-block:: python

    if test:
        ...

More generally, Python has automatic conversion rules from standard types to
booleans. This can be used to shorten the syntax in conditional tests:

* as in many programming languages, any non-zero integer (negative ones
  included) is converted to ``True`` and zero is converted to ``False``.

* a string is converted to ``False`` if and only if it is empty. For example,
  to test whether a string ``title`` is non-empty, you can simply write:

  .. code-block:: python

      if title:
          ...

  instead of:

  .. code-block:: python

      if len(title) > 0:
          ...

* the ``None`` value, a constant commonly used to represent the absence of a
  value, is converted to ``False``. To test that a variable ``var`` holds
  something other than ``None``, you can often write:

  .. code-block:: python

      if var:
          ...

  **Beware**, the above code will not allow you to distinguish the case where
  ``var`` is ``None`` from the case where ``var`` has a value which is
  converted to ``False`` by Python (for example zero, an empty string or an
  empty list). You need to be careful that this is really what you are trying
  to test; if you need to tell these cases apart, write ``if var is not
  None:`` instead.

Generators
----------

Generators provide an easy way to create iterator objects (objects over which
you can iterate) and can be created in several ways.

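To make "objects over which you can iterate" a bit more concrete, here is a
small sketch of how iteration works with the built-in ``iter`` and ``next``
functions, which is roughly what a ``for`` loop does behind the scenes. The
variable names are made up for this illustration:

.. code-block:: python

    numbers = [1, 2, 3]

    iterator = iter(numbers)      # ask the list for an iterator
    first = next(iterator)        # 1
    second = next(iterator)       # 2
    remaining = list(iterator)    # consumes what is left: [3]
    exhausted = list(iterator)    # []: the iterator is now used up

Generators are iterators of this kind: they can be traversed only once, and
each call to ``next`` computes the following element.
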
Generator expressions
~~~~~~~~~~~~~~~~~~~~~

*Generator expressions* have the same syntax as *list comprehensions*, except
that the square brackets are replaced by parentheses. Thus, the following
code:

.. code-block:: python

    l = [1, 2, 3]
    m = (str(i*i) for i in l)
    print '\n'.join(m)

would produce the exact same result had the second line been replaced by:

.. code-block:: python

    m = [str(i*i) for i in l]

The difference between the two codes is that in the case where ``m`` is
defined by a *list comprehension*, the list is entirely computed and stored
in memory when the variable ``m`` is defined. On the contrary, when ``m`` is
defined by a *generator expression*, the elements in ``m`` are generated on
the go *when needed*: only when trying to iterate over the variable ``m`` (as
induced by the call to the ``join`` function in the above example) are the
elements generated.

From the speed of execution point of view, both solutions are roughly
equivalent: in the end, each element in ``m`` will be computed once and only
once. From the memory usage point of view however, generators present a clear
advantage: because the elements are generated dynamically, one at a time,
the generator itself never holds more than one element in memory at the same
time. In cases when the list is too big to fit into memory, *generators*
could be the solution.

When using a *generator expression* as the sole argument of a function call,
Python allows you to drop one pair of parentheses to make the code more
readable. For example, in the following code:

.. code-block:: python

    l = [1, 2, 3]
    total = sum((i*i for i in l))

the second line can be replaced by:

.. code-block:: python

    total = sum(i*i for i in l)

Generator functions
~~~~~~~~~~~~~~~~~~~

A second way to define a *generator* is by writing a function using the
special keyword ``yield``. When called, this function will return an iterable
object whose behavior is the following: on each iteration step, the function
is executed until a ``yield`` instruction is hit.
The value following the ``yield`` keyword is returned and can be used during
the iteration step. The execution of the function is then frozen until the
next iteration step.

For example, let us define the following function:

.. code-block:: python

    def min_max(filename):
        with open(filename) as f:
            for line in f:
                l = map(int, line.split())
                yield min(l), max(l)

When called, this function will produce an iterable object. When iterating
over this object, at each iteration, one line of ``filename`` will be read,
and the minimum and maximum values of this line will be returned when the
``yield`` keyword is reached, freezing the execution of the function until
the next iteration.

Hence, the following code:

.. code-block:: python

    for (inf, sup) in min_max(filename):
        print (inf + sup)/2.

is exactly equivalent to:

.. code-block:: python

    with open(filename) as f:
        for line in f:
            l = map(int, line.split())
            inf, sup = min(l), max(l)
            print (inf + sup)/2.

but allows you to define separately the code which generates the sequence of
minimum and maximum values, and the code which makes use of the generated
elements.

Built-in functions
~~~~~~~~~~~~~~~~~~

Finally, some built-in functions in Python return objects which, like
generators, produce their elements lazily. This is the case of the ``xrange``
function, which can be used exactly like the ``range`` function in a loop.
The difference is that ``range`` computes a list of integers whereas
``xrange`` returns an object generating the elements on the go, one at a
time. A call to ``range(1000000000)`` might induce a memory error on your
machine (depending on your memory capacity), but you will be fine using
``xrange``, both calls being equivalent for iteration purposes. It is almost
always more suitable to use ``xrange`` instead of ``range``, and in Python
3.x ``xrange`` has even been renamed to ``range``.

Read more about generators in the official documentation.

Decorators
----------

*Decorators* provide a very powerful way to alter the behavior of a function
without redefining it.
The syntax is the following:

.. code-block:: python

    @logging
    def f(x):
        return x + 1

In the above example, we say that ``f`` has been *decorated* with
``logging``. ``logging`` must be a function taking another function as an
argument and returning the function which will replace it. The result of this
decoration is equivalent to this piece of code:

.. code-block:: python

    def f(x):
        return x + 1

    f = logging(f)

which means that by decorating ``f`` with ``logging``, ``f`` now behaves as
the composite function ``logging(f)``.

A simple decorator
~~~~~~~~~~~~~~~~~~

Imagine that we want the ``logging`` decorator to *log* the calls made to the
function it decorates, by printing them to the standard output. Such a
decorator could be written like this:

.. code-block:: python

    def logging(fun):
        def aux(*args, **kwargs):
            print "Calling", fun.__name__
            # return the result so that the decorated function still works
            return fun(*args, **kwargs)
        return aux

Because ``logging`` could be used to decorate any function, with an arbitrary
number of arguments and keyword arguments, it is necessary to use the generic
syntax ``aux(*args, **kwargs)``. This syntax stores all the positional
arguments passed to ``aux`` in a tuple named ``args`` and all the keyword
arguments in a dictionary named ``kwargs``.

Note that the exact same arguments are passed to ``fun`` and that its result
is returned, meaning that from the caller's perspective, ``aux`` and ``fun``
will behave similarly. The difference is that ``aux`` logs the call to the
standard output prior to doing the computation made in ``fun``: this is how
we expected the decorator to behave.

To be perfectly rigorous, the previous decorator should have been written
like this:

.. code-block:: python

    from functools import wraps

    def logging(fun):
        @wraps(fun)
        def aux(*args, **kwargs):
            print "Calling", fun.__name__
            return fun(*args, **kwargs)
        return aux

``aux`` is now itself decorated by the ``wraps`` decorator provided by the
``functools`` module. This decorator does some magic to ensure that ``aux``
behaves as closely as possible to ``fun``. Without this decorator, the
following code:

.. code-block:: python

    @logging
    def f(x):
        return x + 1

    print f.__name__

would print ``aux`` to the standard output, instead of the expected ``f``.
The ``wraps`` decorator ensures among other things that the ``__name__``
attribute is preserved throughout a decoration.

Let us further assume that you want to extend the ``logging`` decorator to
not only log the calls, but also keep track of how many times the function
has been called. You could be tempted to write something like:

.. code-block:: python

    from functools import wraps

    def logging(fun):
        a = 0

        @wraps(fun)
        def aux(*args, **kwargs):
            a = a + 1
            print "{0} has been called {1} times".format(fun.__name__, a)
            return fun(*args, **kwargs)
        return aux

However, if you apply this decorator to some function and then call it, you
will get an angry face from Python: an ``UnboundLocalError`` complaining that
the local variable ``a`` is referenced before assignment. The problem comes
from this line:

.. code-block:: python

    a = a + 1

Because ``a`` is assigned inside ``aux``, Python treats it as a variable
local to ``aux`` and ignores the ``a`` defined in ``logging``. As a
consequence, when reaching the ``a + 1`` part, the local ``a`` has no value
yet, causing the error.

This is a current limitation of Python 2: variables defined in an enclosing
function can be read from the inner function, but they cannot be rebound. A
standard way to circumvent this limitation is to use a mutable structure for
``a``: ``a`` itself cannot be redefined, but the structure it points to can
be modified. Using this, the previous example can be rewritten as:

.. code-block:: python

    from functools import wraps

    def logging(fun):
        a = [0]

        @wraps(fun)
        def aux(*args, **kwargs):
            a[0] = a[0] + 1
            print "{0} has been called {1} times".format(fun.__name__, a[0])
            return fun(*args, **kwargs)
        return aux

where ``a`` points to a list of length 1 storing the number of calls at its
first position.

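For readers on Python 3, the ``nonlocal`` keyword removes the need for the
list workaround. The following is a sketch of the same counting decorator
under that assumption: it only runs on Python 3 (hence the ``print``
function), and the names ``calls`` and ``increment`` are made up for this
illustration:

.. code-block:: python

    from functools import wraps

    def logging(fun):
        calls = 0

        @wraps(fun)
        def aux(*args, **kwargs):
            # Python 3 only: rebind the variable of the enclosing function
            nonlocal calls
            calls += 1
            print("{0} has been called {1} times".format(fun.__name__, calls))
            return fun(*args, **kwargs)
        return aux

    @logging
    def increment(x):
        return x + 1

Each decorated function gets its own counter, since ``calls`` is created anew
every time ``logging`` is applied.
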
Another example
~~~~~~~~~~~~~~~

A common example which is often used to illustrate decorators in Python is
memoization: when a function is computation-heavy but often called with the
same arguments, you can save a lot of time by caching past results returned
by the function.

This idea can be nicely implemented in Python using a decorator. The
decorator stores past results in a dictionary: when the decorated function is
called, the decorator performs a lookup in its dictionary to check whether
the function has already been called with the same argument. If the
dictionary already contains an entry for this argument, the associated value
is returned. Otherwise, the function is called and its result is stored in
the dictionary before being returned.

Here is how you could write such a decorator for a single-argument function:

.. code-block:: python

    from functools import wraps

    def memoize(fun):
        cache = {}

        @wraps(fun)
        def aux(x):
            if x in cache:
                return cache[x]
            else:
                a = fun(x)
                cache[x] = a
                return a
        return aux

Then if ``f`` is defined like this:

.. code-block:: python

    @memoize
    def f(x):
        ...  # very long and heavy computation

when calling ``f`` twice with the same argument, you will incur the
computation cost only during the first call, the second call being almost
instantaneous.

Classes with two methods
------------------------

Let us briefly recall how classes work in Python. A class is defined like
this:

.. code-block:: python

    class Cipher:

        def __init__(self, key):
            self.key = key

        def decrypt(self, message):
            return (message & self.key)

All the methods of a class take as their first argument the instance on which
the method is being called. By convention, this first argument is always
named ``self``. If ``a`` is an instance of ``Cipher``, the instruction
``a.decrypt(message)`` is equivalent to ``Cipher.decrypt(a, message)``.

The special function ``__init__`` is the class constructor and is called
every time an instance of the class is created. Its typical use is to
initialize some attributes of the instance.

An instance of the ``Cipher`` class can be created like this:

.. code-block:: python

    d = Cipher(key)

A flaw commonly found in code written by people coming from object-oriented
programming languages is to create classes for everything. This often leads
to classes containing only two methods, one being the ``__init__`` function.
This is the case in the class written above as an example.

Looking at this example a bit more closely, you can see that it is possible
to completely get rid of the class definition: a ``decrypt`` function taking
the key as an additional argument is sufficient:

.. code-block:: python

    def decrypt(key, message):
        return (message & key)

Some people could object that it still makes sense to use a class in the
example above, if we plan to extend the ``Cipher`` class in the future, for
example by adding an ``encrypt`` function. In my opinion, it is better to
start by writing your code as simply as possible. If you really need to
extend the code, then you can start restructuring it and group several
related functions in a class.

PEP 8
-----

When writing about good practices in Python, it is impossible not to mention
PEP 8. It is a set of recommendations on coding style in Python. These
recommendations are of course not absolute rules and should be taken as
advice. However, I noticed that following these recommendations generally
leads to greater code readability. Moreover, as many people who code in
Python also follow these recommendations, adopting them reduces the gap
between your code and code written by others: this will save you some time
when reading code.

Here are a few points extracted from PEP 8:

* you should follow English typographic rules: no space before a colon, no
  space before a comma, but a space after, etc.

* you should put spaces around operators like the equal sign, plus sign, etc.

* you should try to limit the length of the lines of code to a maximum of 79
  characters.

More details on the PEP 8 page.
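
As a small illustration of the first two points listed above, here is the
same function written twice. Both versions are valid Python and compute the
same thing; the function names are made up for this example:

.. code-block:: python

    # Harder to read: space before the comma and the colon, none around operators
    def mean_bad(a ,b) :
        return (a+b)/2.0

    # Following PEP 8: space after the comma, spaces around operators
    def mean(a, b):
        return (a + b) / 2.0

The difference looks minor on two lines, but it adds up quickly over a whole
file.
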