Developing a Pythonic mindset

1. Look up the version of Python you’re using

$ python --version
$ python -V
$ python3 --version
$ python3 -V

You can also check the version at runtime by inspecting values in the built-in sys module.

>>> import sys
>>> print(sys.version_info)
sys.version_info(major=3, minor=6, micro=15, releaselevel='final', serial=0)
>>> print(sys.version)
3.6.15 (default, Apr 20 2022, 13:05:41)
[GCC 5.4.0 20160609]
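
The same sys.version_info value can be used to guard a script against interpreters that are too old. The following is only a sketch: the MINIMUM_VERSION constant and the meets_minimum helper are hypothetical names, and the minimum of 3.6 is an assumption chosen because f-strings require it.

```python
import sys

# Hypothetical minimum version for this sketch; adjust to your project.
MINIMUM_VERSION = (3, 6)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    # Tuples compare element by element, so (3, 10) >= (3, 6) is True.
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum():
    required = '.'.join(str(part) for part in MINIMUM_VERSION)
    raise SystemExit(f'Python {required} or newer is required')
```

Comparing tuples avoids parsing the human-readable sys.version string.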

Python 2 was retired on January 1, 2020, at which point all bug fixes, security patches, and backports of features ceased. If you stay on Python 2, you are at a serious disadvantage because it is no longer officially maintained. Developers who depend heavily on a Python 2 codebase should consider migrating to Python 3 with tools such as 2to3 (a tool that shipped with Python) and six.

2. Follow the PEP 8 Style Guide

Whitespace is syntactically significant in Python, and Python programmers care a great deal about how whitespace affects the clarity of code. Follow these guidelines.

  • Indent with spaces, not tabs.
  • Use 4 spaces for each level of syntactically significant indentation.
  • Keep lines to 79 characters or fewer.
  • When a long expression continues onto additional lines, indent each continuation line 4 extra spaces beyond the normal indentation level.
  • In a file, separate functions and classes with two blank lines.
  • In a class, separate methods with a single blank line.
  • In a dictionary, put no whitespace between each key and its colon, and a single space before the value when it is on the same line.
  • Put exactly one space before and one space after the = operator in a variable assignment.
  • In a type annotation, put no space between the variable name and the colon, and one space before the type information.
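
A short sketch that applies the whitespace guidelines above. The names (unit_prices, order_total, Basket) are made up for illustration and do not come from PEP 8.

```python
from typing import Dict

# No space before the colon in an annotation or a dict entry, one after it.
unit_prices: Dict[str, float] = {'apple': 0.5, 'pear': 0.75}


def order_total(quantities: Dict[str, int]) -> float:
    # Four spaces per indentation level; one space on each side of '='.
    total = 0.0
    for name, quantity in quantities.items():
        total += unit_prices[name] * quantity
    return total


class Basket:
    def __init__(self) -> None:
        self.quantities: Dict[str, int] = {}

    def add(self, name: str, quantity: int = 1) -> None:
        # Continuation lines get 4 extra spaces of indentation.
        self.quantities[name] = (
            self.quantities.get(name, 0)
            + quantity)

    def total(self) -> float:
        return order_total(self.quantities)
```

Two blank lines separate the function from the class, and a single blank line separates the methods inside the class.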

PEP 8 suggests a distinct naming style for each part of the language, so that a name alone tells you what kind of thing it refers to when you read the code. Follow these naming guidelines.

  • Functions, variables, and attributes should be in lowercase_underscore format.
  • Protected instance attributes should be in _leading_underscore format.
  • Private instance attributes should be in __double_leading_underscore format.
  • Classes (including exceptions) should be in CapitalizedWord format.
  • Module-level constants should be in ALL_CAPS format.
  • Instance methods in classes should use self, which refers to the object, as the name of the first parameter.
  • Class methods should use cls, which refers to the class, as the name of the first parameter.
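
The naming guidelines above can be seen together in one small sketch. Every name here (MAX_RETRIES, RetryLimitError, DownloadTask, and its members) is hypothetical and exists only to show the conventions.

```python
MAX_RETRIES = 3  # Module-level constant: ALL_CAPS


class RetryLimitError(Exception):  # Exception class: CapitalizedWord
    pass


class DownloadTask:  # Class: CapitalizedWord
    def __init__(self, file_name):
        self.file_name = file_name  # Public attribute
        self._attempts = 0  # Protected attribute
        self.__token = 'secret'  # Private attribute (name-mangled)

    @classmethod
    def from_path(cls, path):  # Class method: first parameter is cls
        return cls(path.split('/')[-1])

    def record_attempt(self):  # Instance method: first parameter is self
        self._attempts += 1
        if self._attempts > MAX_RETRIES:
            raise RetryLimitError(self.file_name)
        return self._attempts
```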

The Zen of Python states that there should be one, and preferably only one, obvious way to do something. PEP 8 tries to codify this style in its guidance for expressions and statements.

  • Use inline negation (if a is not b) instead of negating a positive expression (if not a is b).
  • Don’t check for an empty container or sequence (such as [] or '') by comparing its length to zero (if len(somelist) == 0). Write if not somelist instead, because Python implicitly evaluates empty values as False.
  • The same applies to a non-empty container or sequence (such as [1] or 'hi'): don’t check its length, but write if somelist, because Python implicitly evaluates non-empty values as True.
  • Don’t cram if statements, for loops, while loops, and except compound statements onto a single line. Spread them over multiple lines for clarity.
  • If an expression doesn’t fit on one line, surround it with parentheses and add line breaks and indentation to make it easier to read. Prefer parentheses over the \ line-continuation character for multi-line expressions.
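
A minimal sketch of the expression guidelines above. The describe function and its return strings are invented for this example.

```python
def describe(items, sentinel=None):
    """Return a short description of a sequence, using PEP 8 idioms."""
    # Inline negation: 'is not' rather than 'not ... is ...'.
    if items is not sentinel and not items:
        # An empty container is falsy, so no len() comparison is needed.
        return 'empty'
    if items:
        # Parentheses, not backslashes, continue a long expression.
        return (
            f'{len(items)} item(s), '
            f'first is {items[0]!r}')
    return 'missing'
```

Each compound statement sits on its own lines, and the truthiness checks replace explicit length comparisons.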

PEP 8 gives the following guidance on importing and using modules.

  • Always put import statements (including from x import y) at the top of a file.
  • Always import modules by their absolute names, not by names relative to the current module’s own path. For example, to import the foo module from the bar package, write from bar import foo in full, not just import foo, even if the current module is inside the bar package.
  • If you must do a relative import, use the explicit syntax from . import foo.
  • Organize imports into three sections in this order: standard library modules, third-party modules, and your own modules. Within each section, list the imports in alphabetical order.

3. Understanding the difference between bytes and str

Python has two types for representing sequences of characters: bytes and str.

Instances of bytes contain raw, unsigned 8-bit values (often displayed in the ASCII encoding).

a = b'h\x65llo'
print(list(a))
print(a)

>>>
[104, 101, 108, 108, 111]
b'hello'

Instances of str contain Unicode code points that represent textual characters from human languages.

a = 'a\u0300 propos'
print(list(a))
print(a)

>>>
['a', '̀', ' ', 'p', 'r', 'o', 'p', 'o', 's']
à propos

These two types correspond to two common situations in Python code.

  • You need to operate on raw 8-bit sequences that contain text encoded as UTF-8 or another standard encoding.
  • You need to operate on general Unicode strings that have no specific encoding.

You will often need two helper functions to convert between these cases and to ensure that the type of an input value matches what your code expects.

The first helper function accepts bytes or str instances and returns str.

def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of str

print(repr(to_str(b'foo')))
print(repr(to_str('bar')))

>>>
'foo'
'bar'

The second helper function also accepts bytes or str instances, but it returns bytes.

def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of bytes

print(repr(to_bytes(b'foo')))
print(repr(to_bytes('bar')))

>>>
b'foo'
b'bar'

There are two big gotchas when working with raw 8-bit values and Unicode strings in Python.

The first is that bytes and str seem to work the same way, but their instances are not compatible with each other, so you must be deliberate about the types of the character sequences that you pass around.

With the + operator, you can add bytes to bytes and str to str.

print(b'one' + b'two')
print('one' + 'two')

>>>
b'onetwo'
onetwo

But you cannot add str instances to bytes instances.

b'one' + 'two'

>>>
Traceback ...
TypeError: can't concat str to bytes

It is also not possible to add bytes instances to str instances.

'one' + b'two'

>>>
Traceback ...
TypeError: can only concatenate str (not "bytes") to str

The binary comparison operators can compare bytes with bytes and str with str.

assert b'red' > b'blue'
assert 'red' > 'blue'

However, str instances cannot be compared with bytes instances.

assert 'red' > b'blue'

>>>
Traceback ...
TypeError: '>' not supported between instances of 'str' and 'bytes'

The reverse is also true, i.e. bytes instances cannot be compared with str instances.

assert b'blue' < 'red'

>>>
Traceback ...
TypeError: '<' not supported between instances of 'bytes' and 'str'

Comparing bytes and str instances for equality always evaluates to False, even when they contain exactly the same characters. In the following example, both values represent foo in ASCII.

print(b'foo' == 'foo')

>>>
False

The % operator works with format strings of either type: a bytes value can fill the %s in a bytes format string, and a str value can fill the %s in a str format string.

print(b'red %s' % b'blue')
print('red %s' % 'blue')

>>>
b'red blue'
red blue

If the format string is of type bytes, however, you can’t substitute a str instance for the %s, because Python doesn’t know which binary encoding to use.

print(b'red %s' % 'blue')

>>>
Traceback ...
TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

The reverse is possible: if the format string is of type str, you can substitute a bytes instance for the %s. However, the result is probably not what you want.

print('red %s' % b'blue')

>>>
red b'blue'

This code causes Python to call the __repr__ method on the bytes instance (see item 75) and substitute the result for the %s, which is why the output contains b'blue' rather than the plain text blue that you might expect.
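
One way to get the plain text into the result is to decode the bytes instance before formatting it. This sketch assumes that the bytes hold UTF-8 text; the variable names are illustrative only.

```python
color = b'blue'

# Formatting the bytes instance directly embeds its repr.
surprising = 'red %s' % color

# Decoding first (assuming the bytes are UTF-8 text) yields a str,
# so %s inserts the characters themselves.
intended = 'red %s' % color.decode('utf-8')

print(surprising)
print(intended)
```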

The second gotcha involves file handles, such as those returned by the built-in open function. These handles default to reading and writing Unicode strings rather than raw bytes. This can cause surprising failures, especially for programmers accustomed to Python 2. For example, the following attempt to write binary data to a file fails.

with open('data.bin', 'w') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')

>>>
Traceback ...
TypeError: write() argument must be str, not bytes

The exception occurs because the file was opened in write text mode ('w') instead of write binary mode ('wb'). When a file is in text mode, write operations expect str instances containing Unicode data, not bytes instances containing binary data. Changing the mode to 'wb' fixes the problem.

with open('data.bin', 'wb') as f:
    f.write(b'\xf1\xf2\xf3\xf4\xf5')

A similar problem exists when reading files. For example, the following attempt to read back the binary file written above fails; the mode needs to be 'rb' instead of 'r'.

with open('data.bin', 'r') as f:
    data = f.read()

>>>
Traceback ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: invalid continuation byte

Alternatively, you can pass the encoding parameter to open explicitly so that platform-specific defaults don’t surprise you. For example, if the binary data in the file is actually a string encoded as 'cp1252' (a legacy Windows encoding), you could write the following.

with open('data.bin', 'r', encoding='cp1252') as f:
    data = f.read()

assert data == 'ñòóôõ'

This time no exception is raised, but the returned string is very different from what you get by reading the raw bytes. The lesson is to check the default encoding on your system (by running python3 -c 'import locale; print(locale.getpreferredencoding())') and confirm that it matches your expectations. When in doubt, pass the encoding parameter to open explicitly.
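
The difference between the modes can be checked with a small round trip. This is a sketch under stated assumptions: the file name note.txt and the sample text are made up, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

text = 'à propos'

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'note.txt')  # Hypothetical file name

    # Text mode with an explicit encoding, independent of the platform.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Binary mode returns the raw encoded bytes.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode with the same encoding returns the original str.
    with open(path, 'r', encoding='utf-8') as f:
        decoded = f.read()

assert raw == text.encode('utf-8')
assert decoded == text
```

Passing the same explicit encoding on both the write and the read is what makes the round trip independent of the system default.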

4. Replacing C-style format strings and str.format with interpolated f-strings

The most common way to format strings in Python is with the % formatting operator.

a = 0b10111011
b = 0xc5f
print('Binary is %d, hex is %d' % (a, b))

>>>
Binary is 187, hex is 3167

C-style format strings, however, have four drawbacks in Python.

  1. If you change the type or order of the values in the tuple to the right of the % operator, the program may fail with a type conversion error. For example, this simple formatting expression works.

    key = 'my_var'
    value = 1.234
    formatted = '%-10s = %.2f' % (key, value)
    print(formatted)
    
    >>>
    my_var     = 1.23
    

    However, if the key and value are swapped, the program will throw an exception at runtime.

    reordered_tuple = '%-10s = %.2f' % (value, key)
    
    >>>
    Traceback ...
    TypeError: must be real number, not str
    

    The same error occurs if you leave the right side of % unchanged but swap the order of the two specifiers in the format string on the left.

  2. You often need to modify values slightly before formatting them into a string, and the resulting expression can become long and hard to read. The code below lists the contents of a kitchen pantry without adjusting any of the three values that are filled into the format string (the index i, the name item, and the quantity count).

    pantry = [
        ('avocados', 1.25),
        ('bananas', 2.5),
        ('cherries', 15),
    ]

    for i, (item, count) in enumerate(pantry):
        print('#%d: %-10s = %.2f' % (i, item, count))
    
    >>>
    #0: avocados   = 1.25
    #1: bananas    = 2.50
    #2: cherries   = 15.00
    

    To make the printed messages more useful, you may want to adjust the values slightly. The tuple to the right of the % operator then becomes so long that it has to be split across multiple lines, which hurts readability.

    pantry = [
        ('avocados', 1.25),
        ('bananas', 2.5),
        ('cherries', 15),
    ]

    for i, (item, count) in enumerate(pantry):
        print('#%d: %-10s = %d' % (
            i + 1,
            item.title(),
            round(count)))
    
    >>>
    #1: Avocados   = 1
    #2: Bananas    = 2
    #3: Cherries   = 15
    
  3. If you want to fill multiple positions in the format string with the same value, you must repeat the value multiple times in the tuple to the right of the % operator accordingly.

    template = '%s loves food. See %s cook.'
    name = 'Max'
    formatted = template % (name, name)
    print(formatted)
    
    >>>
    Max loves food. See Max cook.
    

    This is especially annoying and error-prone when you need to make the same small modification to every repeated value. For example, to fill in name.title() instead of name, you have to remember to change every occurrence. If you update some and miss others, the output becomes inconsistent.

    To address some of these problems, the % operator in Python also accepts a dict instead of a tuple. The keys of the dict are matched with format specifiers of the corresponding name; for example, %(key)s inserts, as a string (s), the value stored under the key named key. This solves the first drawback, because the order of the values on the right side no longer has to match the order of the specifiers on the left.

    key = 'my_var'
    value = 1.234
    old_way = '%-10s = %.2f' % (key, value)

    new_way = '%(key)-10s = %(value).2f' % {
        'key': key, 'value': value}  # Original

    reordered = '%(key)-10s = %(value).2f' % {
        'value': value, 'key': key}  # Swapped
    
    assert old_way == new_way == reordered
    

    Using a dict also solves the third drawback, the need to repeat a value that fills several format specifiers. The value appears only once to the right of the % operator.

    name = 'Max'
    template = '%s loves food. See %s cook.'
    before = template % (name, name)  # Tuple
    template = '%(name)s loves food. See %(name)s cook.'
    after = template % {'name': name}  # Dictionary
    assert before == after
    

    However, dict format strings make the second drawback worse. Each value now needs a key and a colon, so formatting expressions become even longer and visually noisier. Comparing the version without a dict to the version with one makes the cost clear.

    pantry = [
        ('avocados', 1.25),
        ('bananas', 2.5),
        ('cherries', 15),
    ]

    for i, (item, count) in enumerate(pantry):
        before = '#%d: %-10s = %d' % (
            i + 1,
            item.title(),
            round(count)
        )
        after = '#%(loop)d: %(item)-10s = %(count)d' % {
            'loop': i + 1,
            'item': item.title(),
            'count': round(count),
        }

        assert before == after
    
  4. Using a dict in a formatting expression makes the code more verbose. Each key must be written at least twice: once in the format specifier and once as a key in the dictionary. If the value comes from a variable with the same name as the key, the name appears a third time.

    soup = 'lentil'
    formatted = 'Today\'s soup is %(soup)s.' % {'soup': soup}
    print(formatted)
    
    >>>
    Today's soup is lentil.
    

    Besides the repeated key names, a dict makes formatting expressions long enough that they usually have to be split across multiple lines: the format string is concatenated from several lines, and the dictionary assigns one key per line.

    menu = {
        'soup': 'lentil',
        'oyster': 'kumamoto',
        'special': 'schnitzel',
    }

    template = ('Today\'s soup is %(soup)s, '
                'buy one get two %(oyster)s oysters, '
                'and our special entree is %(special)s.')
    formatted = template % menu
    print(formatted)
    
    >>>
    Today's soup is lentil, buy one get two kumamoto oysters, and our special entree is schnitzel.
    

    To understand what such a formatting expression produces, you have to jump back and forth between the format string and the dictionary, which makes bugs easy to overlook. Readability gets worse still if you need to rename a key, because the matching format specifier must be changed at the same time.

The built-in format function and the str class’s format method

Python 3 adds support for advanced string formatting that is more expressive than the old C-style format strings and does not rely on the % operator. For individual values, this functionality is available through the built-in format function.

a = 1234.5678
formatted = format(a, ',.2f')
print(formatted)
b = 'my string'
formatted = format(b, '^20s')
print('*', formatted, '*')

>>>
1,234.57
*      my string       *
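
The str class’s format method applies the same functionality to several values at once. Instead of C-style specifiers such as %d, you write {} placeholders, which by default are replaced in order by the positional arguments passed to format.
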
key = 'my_var'
value = 1.234

formatted = '{} = {}'.format(key, value)
print(formatted)

>>>
my_var = 1.234

Within each placeholder, you can write a colon followed by a format specifier that controls how the corresponding value is converted to a string. Enter help('FORMATTING') in the Python interpreter for the full rules of the format specifier mini-language used by str.format.

key = 'my_var'
value = 1.234

formatted = '{:<10} = {:.2f}'.format(key, value)
print(formatted)

>>>
my_var     = 1.23

Here is how it works: str.format passes each value it receives, together with the format specifier written inside the matching {} placeholder, to the built-in format function; for value in the example above, the call is format(value, '.2f'). The string that format returns then replaces the placeholder in the overall formatted string. Each class can customize this behavior by defining the special method __format__, which the format function uses when converting instances of that class to strings.
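
To illustrate the __format__ hook, here is a minimal sketch. The Celsius class and its ' deg C' suffix are hypothetical; the class simply delegates the numeric part of the work to the built-in format function.

```python
class Celsius:
    """Hypothetical temperature type used only for this sketch."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, format_spec):
        # Delegate the numeric formatting to float, then add a unit.
        # An empty spec (a plain '{}') uses float's default formatting.
        return format(self.degrees, format_spec) + ' deg C'


temperature = Celsius(21.456)
print(format(temperature, '.1f'))
print('{:.2f}'.format(temperature))
print('{}'.format(temperature))
```

Both the built-in format function and the str.format method end up calling the same __format__ method with whatever specifier follows the colon.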

With C-style format strings, the % character introduces a format specifier, so you must escape it by doubling it (%%) to output a literal percent sign. With str.format, braces must be escaped in the same way ({{ and }}).

print('%.2f%%' % 12.5)
print('{} replaces {{}}'.format(1.23))

>>>
12.50%
1.23 replaces {}

Within the braces, you can also specify the positional index of the argument passed to format that should fill the placeholder. This lets you reorder the placeholders in the format string without changing the order of the arguments, which solves the first drawback described above.

key = 'my_var'
value = 1.234

formatted = '{1} = {0}'.format(key, value)
print(formatted)

>>>
1.234 = my_var

The same positional index can appear in several placeholders, so a value does not have to be passed to format more than once. This solves the third drawback described above.

name = 'Max'
formatted = '{0} loves food. See {0} cook.'.format(name)
print(formatted)

>>>
Max loves food. See Max cook.

Unfortunately, the format method does nothing about the second drawback. Code that adjusts values before formatting them is just as hard to read as before. Comparing the two styles side by side shows little improvement.

pantry = [
    ('avocados', 1.25),
    ('bananas', 2.5),
    ('cherries', 15),
]

for i, (item, count) in enumerate(pantry):
    old_style = '#%d: %-10s = %d' % (
        i + 1,
        item.title(),
        round(count))
    new_style = '#{}: {:<10s} = {}'.format(
        i + 1,
        item.title(),
        round(count))
    assert old_style == new_style

The specifiers used with format do support some more advanced options, such as looking up a key in a dict, indexing into a sequence, and coercing a value to a Unicode or repr string. The following code combines these three features.

menu = {
    'soup': 'lentil',
    'oyster': 'kumamoto',
    'special': 'schnitzel',
}

formatted = 'First letter is {menu[oyster][0]!r}'.format(
    menu=menu)
print(formatted)

>>>
First letter is 'k'

These features still don’t solve the fourth drawback, the repetition of key names. Here is the same key-value formatting written with a C-style formatting expression and with the str.format method, for comparison.

old_template = (
    'Today\'s soup is %(soup)s, '
    'buy one get two %(oyster)s oysters, '
    'and our special entree is %(special)s.')
old_formatted = old_template % {
    'soup': 'lentil',
    'oyster': 'kumamoto',
    'special': 'schnitzel',
}

new_template = (
    'Today\'s soup is {soup}, '
    'buy one get two {oyster} oysters, '
    'and our special entree is {special}.')
new_formatted = new_template.format(
    soup='lentil',
    oyster='kumamoto',
    special='schnitzel',
)

assert old_formatted == new_formatted

The new style is slightly less noisy: it needs no dict, so the key names are not wrapped in quotes, and its specifiers are a little shorter. These advantages are modest, though. And although placeholders can access dictionary keys and list items, that covers only a small fraction of what Python expressions can do, so str.format still falls far short of the language’s full expressiveness.

Interpolated format strings

Python 3.6 added interpolated format strings (f-strings for short), which solve all of the problems described above. The syntax requires a format string to be prefixed with the letter f, much as byte strings are prefixed with b and raw (unescaped) strings with r.

F-strings take the expressiveness of format strings to the extreme and completely solve the fourth drawback, the redundancy of repeated key names. Instead of building a dict, as with C-style formatting expressions, or passing keyword arguments, as with str.format, you can refer directly to any name in the current Python scope inside the {} placeholders of an f-string.

key = 'my_var'
value = 1.234

formatted = f'{key} = {value}'
print(formatted)

>>>
my_var = 1.234

The format specifier mini-language supported by str.format is available after the colon inside an f-string placeholder, and, as with str.format, the ! conversions can coerce a value to a Unicode or repr string.

key = 'my_var'
value = 1.234

formatted = f'{key!r:<10} = {value:.2f}'
print(formatted)

>>>
'my_var'   = 1.23

For the same task, an f-string is always shorter than a C-style format string used with the % operator, and shorter than every variant of str.format. The following code lists the options from shortest to longest so that the amount of code to the right of each assignment is easy to compare.

key = 'my_var'
value = 1.234

f_string = f'{key:<10} = {value:.2f}'
c_tuple = '%-10s = %.2f' % (key, value)
str_args = '{:<10} = {:.2f}'.format(key, value)
str_kw = '{key:<10} = {value:.2f}'.format(key=key,
                                          value=value)
c_dict = '%(key)-10s = %(value).2f' % {'key': key,
                                       'value': value}
assert c_tuple == c_dict == f_string
assert str_args == str_kw == f_string

Inside the {} placeholders of an f-string you can write full Python expressions, which solves the second drawback described above: values can be adjusted concisely, in place. What takes several lines with C-style formatting or str.format often fits on a single line as an f-string.

pantry = [
    ('avocados', 1.25),
    ('bananas', 2.5),
    ('cherries', 15),
]

for i, (item, count) in enumerate(pantry):
    old_style = '#%d: %-10s = %d' % (
        i + 1,
        item.title(),
        round(count))
    new_style = '#{}: {:<10s} = {}'.format(
        i + 1,
        item.title(),
        round(count))
    f_string = f'#{i + 1}: {item.title():<10s} = {round(count)}'
    assert old_style == new_style == f_string

If it’s clearer, you can split an f-string across multiple lines by relying on the concatenation of adjacent string literals, as in C. The result is longer than the single-line version but still much cleaner than either of the other multi-line approaches.

pantry = [
    ('avocados', 1.25),
    ('bananas', 2.5),
    ('cherries', 15),
]

for i, (item, count) in enumerate(pantry):
    f_string = (
        f'# {i + 1}: '
        f'{item.title():<10s} = '
        f'{round(count)}'
    )
    print(f_string)

>>>
# 1: Avocados   = 1
# 2: Bananas    = 2
# 3: Cherries   = 15

Python expressions can also appear inside the format specifier itself. For example, the following code stores the number of digits to print after the decimal point in a variable named places and references it in {} within the format specifier, which is more flexible than hard-coding the number.

places = 3
number = 1.23456
print(f'My number is {number:.{places}f}')

>>>
My number is 1.235

Of the four string formatting options built into Python, f-strings express the most logic in the most concise and readable form, which makes them the best choice. Whenever you need to format values into a string, reach for an f-string first.

Summary

  1. C-style format strings that use the % operator suffer from a variety of gotchas and verbosity problems.
  2. The str.format method introduces some useful concepts in its format specifier mini-language, but it otherwise repeats the drawbacks of C-style format strings and should also be avoided.
  3. F-strings are a new syntax for formatting values into strings that solves the biggest problems of C-style format strings. They are succinct yet powerful because they allow arbitrary Python expressions to be embedded directly within format specifiers.

5. Replacing Complex Expressions with Helper Functions

Python’s concise syntax makes it easy to write a single expression that implements a lot of logic. For example, to split the query string of a URL into key-value pairs, you can use the parse_qs function. The following example parses the parameters of a query string into a dict that maps each parameter name to the list of string values supplied for it.

from urllib.parse import parse_qs
my_values = parse_qs('red=5&blue=0&green=',
                     keep_blank_values=True)
print(repr(my_values))

>>>
{'red': ['5'], 'blue': ['0'], 'green': ['']}

Query string parameters may have multiple values, a single value, or a blank value, or they may be missing entirely. Calling the get method on the result dictionary returns a different kind of value in each case, as the following three lookups show.

from urllib.parse import parse_qs
my_values = parse_qs('red=5&blue=0&green=',
                     keep_blank_values=True)

print('Red:         ', my_values.get('red'))
print('Green:       ', my_values.get('green'))
print('Opacity:     ', my_values.get('opacity'))

>>>
Red:          ['5']
Green:        ['']
Opacity:      None

It would be nice if both cases, missing arguments and arguments with blank values, could be treated as 0 by default. But such a small piece of logic doesn’t seem worth writing a special if statement or helper function for, so some people will just use Boolean expressions to implement it.

Boolean expressions make this easy to write, because Python treats empty strings, empty lists, and 0 as False when evaluating them. So you put the result of the get method to the left of the or operator and write 0 to the right. Whenever the left subexpression evaluates as False, the whole expression evaluates to the right subexpression, which is 0.

from urllib.parse import parse_qs
my_values = parse_qs('red=5&blue=0&green=',
                     keep_blank_values=True)

# For query string 'red=5&blue=0&green='
red = my_values.get('red', [''])[0] or 0
green = my_values.get('green', [''])[0] or 0
opacity = my_values.get('opacity', [''])[0] or 0
print(f'Red:     {red!r}')
print(f'Green:   {green!r}')
print(f'Opacity: {opacity!r}')

>>>
Red:     '5'
Green:   0
Opacity: 0

The expression looks awkward when written like this, and it doesn’t achieve exactly what we want, because we still have to make sure that the parsed parameter value is an integer that can directly participate in mathematical operations. So, we need to convert the string parsed by this expression into an integer by using the built-in int().

red = int(my_values.get('red', [''])[0] or 0)

This line is even harder to read than the previous one. Someone seeing it for the first time has to take the expression apart layer by layer to work out what it does, which takes a long time. Code should be short, but that does not mean it has to be crammed into a single line.

Python can implement ternary conditional expressions with if/else constructs, which are much clearer than what you just wrote, and keep the code short.

red_str = my_values.get('red', [''])
red = int(red_str[0]) if red_str[0] else 0

But in our case, this is still not as good as the full multi-line if/else structure, which is very easy to read, although it takes a few more lines to write.

green_str = my_values.get('green', [''])
if green_str[0]:
    green = int(green_str[0])
else:
    green = 0

If this logic is to be used repeatedly, it is better to write it as a helper function, and even if it is only used two or three times as in the example below, it is still worth doing so.

def get_first_int(values, key, default=0):
    found = values.get(key, [''])
    if found[0]:
        return int(found[0])
    return default
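
As a quick check of the helper, the sketch below repeats its definition and applies it to the same query string used throughout this item. The three calls cover a present value, a blank value, and a missing key.

```python
from urllib.parse import parse_qs


def get_first_int(values, key, default=0):
    found = values.get(key, [''])
    if found[0]:
        return int(found[0])
    return default


my_values = parse_qs('red=5&blue=0&green=', keep_blank_values=True)

red = get_first_int(my_values, 'red')          # present: '5' -> 5
green = get_first_int(my_values, 'green')      # blank value -> default
opacity = get_first_int(my_values, 'opacity')  # missing key -> default
print(red, green, opacity)
```

Each call site now reads as a single intention, and the fallback value can be changed per call through the default parameter.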

Summary

  1. Python’s syntax makes it easy to write single-line expressions that are overly complex and hard to understand.
  2. Complex expressions, especially those that need to be reused, should be moved into helper functions.
  3. Conditional expressions written with if/else are easier to understand than Boolean expressions written with or and and.

The key-value pairs of a dictionary can be collected into a tuple, which is an immutable, ordered sequence of values.

snack_calories = {
    'chips': 140,
    'popcorn': 80,
    'nuts': 190,
}
items = tuple(snack_calories.items())
print(items)

>>>
(('chips', 140), ('popcorn', 80), ('nuts', 190))

We can use integers as subscripts to access the corresponding elements inside a tuple.

item = ('Peanut butter', 'Jelly')
first = item[0]
second = item[1]
print(first, 'and', second)

>>>
Peanut butter and Jelly

Once a tuple is created, it is not possible to assign new values to its elements via subscripts.

pair = ('Chocolate', 'Peanut butter')
pair[0] = 'Honey'

>>>
Traceback ...
TypeError: 'tuple' object does not support item assignment

Python also has a syntax called unpacking, which assigns the elements of a tuple to multiple variables in a single statement.

item = ('Peanut butter', 'Jelly')
first, second = item  # Unpacking
print(first, 'and', second)

>>>
Peanut butter and Jelly

Assigning values by unpacking is clearer than accessing the elements of a tuple by subscript, and it usually takes less code. Besides a flat list of variables, the left side of the assignment can also mirror lists, sequences, and iterables nested to an arbitrary depth.

favorite_snacks = {
    'salty': ('pretzels', 100),
    'sweet': ('cookies', 180),
    'veggie': ('carrots', 20),
}

((type1, (name1, cals1)),
 (type2, (name2, cals2)),
 (type3, (name3, cals3))) = favorite_snacks.items()
print(f'Favorite {type1} is {name1} with {cals1} calories')
print(f'Favorite {type2} is {name2} with {cals2} calories')
print(f'Favorite {type3} is {name3} with {cals3} calories')

>>>
Favorite salty is pretzels with 100 calories
Favorite sweet is cookies with 180 calories
Favorite veggie is carrots with 20 calories

Unpacking also works in the target of a for loop, for example on the pairs produced by enumerate (see #7).

snacks = [('bacon', 350), ('donut', 240), ('muffin', 190)]

for rank, (name, calories) in enumerate(snacks, 1):
    print(f'#{rank}: {name} has {calories} calories')

>>>
#1: bacon has 350 calories
#2: donut has 240 calories
#3: muffin has 190 calories

This is the Pythonic way to write it: there is no need to index into the data layer by layer. It saves space and is easier to understand.

7. Replace range with enumerate whenever possible

Python’s built-in range function is good for iterating over a series of integers.

from random import randint

random_bits = 0
for i in range(32):
    if randint(0, 1):
        random_bits |= 1 << i
print(bin(random_bits))

>>>
0b100101010010000001011011010011

If you want to iterate over a data structure, such as a list of strings, you can loop over the sequence directly. There is no need to build a range of integers and use each one as a subscript to access the elements in turn.

flavor_list = ['vanilla', 'chocolate', 'pecan', 'strawberry']

for flavor in flavor_list:
    print(f'{flavor} is delicious')

>>>
vanilla is delicious
chocolate is delicious
pecan is delicious
strawberry is delicious

Of course, there are times when you need to know the position of the element you are currently working on in the list as you iterate through the list. For example, I write my favorite ice cream flavors in the flavor_list list, and when I print each flavor, I also want to indicate the ranking of the flavor in my mind. To do this, we can use the traditional range approach.

flavor_list = ['vanilla', 'chocolate', 'pecan', 'strawberry']

for i in range(len(flavor_list)):
    flavor = flavor_list[i]
    print(f'{i+1}: {flavor}')

>>>
1: vanilla
2: chocolate
3: pecan
4: strawberry

range(len(flavor_list)) is clumsy to write and can be replaced by enumerate, which wraps any iterator in a lazy generator (see #30).

flavor_list = ['vanilla', 'chocolate', 'pecan', 'strawberry']

it = enumerate(flavor_list)
print(next(it))
print(next(it))

>>>
(0, 'vanilla')
(1, 'chocolate')

Each pair that enumerate yields can be unpacked directly in a for statement.

flavor_list = ['vanilla', 'chocolate', 'pecan', 'strawberry']

for i, flavor in enumerate(flavor_list):
    print(f'{i + 1}: {flavor}')

>>>
1: vanilla
2: chocolate
3: pecan
4: strawberry

Alternatively, you can pass the starting number as the second argument of enumerate, so that you don’t have to adjust the counter each time you print it. In this example, the count starts from 1.

for i, flavor in enumerate(flavor_list, 1):
    print(f'{i}: {flavor}')
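
To make the effect of the start value concrete, here is a small self-contained sketch. It collects the formatted lines in a list (the name ranked is only for illustration) so that the result is easy to inspect.

```python
flavor_list = ['vanilla', 'chocolate', 'pecan', 'strawberry']

# With a start value of 1 the counter can be printed as-is,
# without adding 1 inside the f-string.
ranked = [f'{i}: {flavor}' for i, flavor in enumerate(flavor_list, 1)]
for line in ranked:
    print(line)
```

The first line printed is "1: vanilla" and the last is "4: strawberry".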

Summary

  1. The enumerate function iterates over an iterator with concise code and also gives the index of the current item.
  2. Instead of specifying a range of subscripts and then using the subscripts to access the sequence, you should iterate directly with the enumerate function.
  3. You can specify the starting number (default is 0) by the second parameter of enumerate.

8. Iterating over two iterators at once with the zip function

When writing Python code, it’s common to derive new lists from an existing list. A list comprehension such as the one below applies an expression to each element of the source list to produce a derived list (see #27).

names = ['Cecilia', 'Lise', 'Marie']
counts = [len(n) for n in names]
print(counts)

>>>
[7, 4, 5]

Each element of the derived list is related to the element at the same position in the source list. To iterate over both lists in parallel, you can loop over the length of the source list.

names = ['Cecilia', 'Lise', 'Marie']
counts = [len(n) for n in names]

longest_name = None
max_count = 0
for i in range(len(names)):
    count = counts[i]
    if count > max_count:
        longest_name = names[i]
        max_count = count

print(longest_name)

>>>
Cecilia

The problem with this approach is that the entire loop code looks messy. We want to access the elements of the names and counts lists via subscripts, so the loop variable i, which represents the subscript, must appear twice in the loop body, which makes the code less understandable. The enumerate implementation (see #7) is a little better, but still not ideal.

To make the code clearer, you can use Python’s built-in zip function. This function wraps two or more iterators into a lazy generator.

for name, count in zip(names, counts):
    if count > max_count:
        longest_name = name
        max_count = count

zip consumes the iterators it wraps one element at a time, so even if the source lists are long, the program won’t run out of memory and crash. However, be careful when the lists passed to zip differ in length. For example, suppose another name is appended to names but counts is not updated to match. zip stops as soon as its shortest input is exhausted, so the new name is silently skipped. The zip_longest function from the itertools module keeps going until the longest input is exhausted.

import itertools

names = ['Cecilia', 'Lise', 'Marie']
counts = [len(n) for n in names]

longest_name = None
max_count = 0
for name, count in zip(names, counts):
    if count > max_count:
        longest_name = name
        max_count = count

names.append('Rosalind')

for name, count in itertools.zip_longest(names, counts):
    print(f'{name}: {count}')

>>>
Cecilia: 7
Lise: 4
Marie: 5
Rosalind: None

When one of the lists runs out before the others, zip_longest fills in the missing values (here, the count for 'Rosalind') with whatever was passed as the fillvalue parameter, which defaults to None.
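
The difference between the two functions can be seen side by side in the following sketch. The counts list is written out by hand and deliberately left one entry short, and the fill value 0 is an arbitrary choice for illustration.

```python
import itertools

names = ['Cecilia', 'Lise', 'Marie', 'Rosalind']
counts = [7, 4, 5]  # deliberately one entry short

# zip stops at the shortest input, so 'Rosalind' is silently dropped.
short = list(zip(names, counts))

# zip_longest keeps going and pads the missing count with fillvalue.
padded = list(itertools.zip_longest(names, counts, fillvalue=0))

print(short)
print(padded)
```

If silent truncation would hide a bug, comparing the lengths of the inputs before looping is a simple safeguard.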

Summary

  1. The built-in zip function iterates over multiple iterators in parallel.
  2. zip creates a lazy generator that produces one tuple at a time, so input of any length is processed item by item.
  3. If the iterators differ in length, zip stops as soon as any one of them is exhausted. To iterate to the end of the longest one, use the zip_longest function from the built-in itertools module instead.

9. Don’t write else blocks after for and while loops

One feature of Python loops that most programming languages don’t support is the ability to put else blocks right after the entire loop structure.

for i in range(3):
    print('Loop', i)
else:
    print('Else block!')

>>>
Loop 0
Loop 1
Loop 2
Else block!

If the loop does not run to completion (i.e., it terminates early), the code in the else block is not executed. In other words, a break statement in the loop skips the else block. This is quite different from what the word else suggests.

for i in range(3):
    print('Loop', i)
    if i == 1:
        break
else:
    print('Else block!')

>>>
Loop 0
Loop 1

Another surprise is that looping over an empty sequence makes the program execute the else block immediately.

for x in []:
    print('Never runs')
else:
    print('For Else block!')

>>>
For Else block!

The same is true for while loops: if the condition is False on the first check, the else block runs immediately.

while False:
    print('Never runs')
else:
    print('While Else block!')

>>>
While Else block!

The else block is designed this way so that it can be used to implement search logic. For example, to determine whether two numbers are coprime (i.e., no number other than 1 divides them both), you can use this structure. The loop tries every candidate common divisor; if none is found, the loop runs to completion without breaking early, and the program then executes the else block.

a = 4
b = 9

for i in range(2, min(a, b) + 1):
    print('Testing', i)
    if a % i == 0 and b % i == 0:
        print('Not coprime')
        break
else:
    print('Coprime')

>>>
Testing 2
Testing 3
Testing 4
Coprime

In practice, a helper function would be used for this calculation instead. Such helper functions are commonly written in one of two ways.

  1. Return as soon as the condition is found to hold. If it never holds, the loop runs to completion and the function returns a default value at the end.

    def coprime(a, b):
        for i in range(2, min(a, b) + 1):
            if a % i == 0 and b % i == 0:
                return False
        return True

    assert coprime(4, 9)
    assert not coprime(3, 6)

  2. Use a result variable to record whether the condition was met during the loop. If it was, exit the loop early with break; if not, the loop runs to completion. Either way, the function returns the value of the variable at the end.

    def coprime_alternate(a, b):
        is_coprime = True
        for i in range(2, min(a, b) + 1):
            if a % i == 0 and b % i == 0:
                is_coprime = False
                break
        return is_coprime

    assert coprime_alternate(4, 9)
    assert not coprime_alternate(3, 6)


While the for/else or while/else construct itself allows for certain logical expressions, the confusion it brings has overshadowed its benefits.

Summary

  1. Python has a special syntax that allows you to put an else block right after the entire for or while loop.
  2. The else block executes only if the loop finishes without being exited early by break.
  3. Putting the else block immediately after the entire loop makes it less obvious what the code means, so avoid writing it that way.

10. Reducing Code Repetition with Assignment Expressions

Assignment expressions are a new syntax introduced in Python 3.8, and they use the walrus operator. This style of writing can solve some of the long-standing code repetition problems.

This expression is useful for performing an assignment where an ordinary assignment statement is not allowed, such as in the condition of an if statement. The value of an assignment expression is the value assigned to the identifier on the left of the walrus operator.
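
As a minimal illustration of that last point, the sketch below (which assumes Python 3.8 or later; the names values, n, and message are made up for the example) binds a name inside a condition and compares the resulting value in the same expression.

```python
# Python 3.8+ is required for the := operator.
values = [3, 1, 4, 1, 5]

# The parenthesized assignment expression binds n and also evaluates to
# the assigned value, so it can be compared directly.
if (n := len(values)) > 3:
    message = f'{n} items is more than 3'
else:
    message = f'only {n} items'

print(message)
```

The name n remains bound after the if statement, just as it would after an ordinary assignment.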

For example, suppose a juice bar keeps a basket of fresh fruit as ingredients, whose contents can be defined like this. When a customer orders lemonade, there must be at least one lemon available to make it.

fresh_fruit = {
    'apple': 10,
    'banana': 8,
    'lemon': 5,
}

def make_lemonade(count):
    pass

def out_of_stock():
    pass

count = fresh_fruit.get('lemon', 0)
if count:
    make_lemonade(count)
else:
    out_of_stock()

This code looks simple, but it is noisier than it needs to be: the count variable is defined above the whole if/else structure, yet only the if block uses it; the else block does not need it at all. Written this way, count looks like an important variable that both branches rely on, but it is not.

In Python, we often want to get a value, determine if it’s non-zero, and then execute a piece of code if it is. For this kind of use, we used to have to use tricks to avoid having variables like count repeatedly appear in our code, and these tricks sometimes made the code harder to understand (see the ones mentioned in #5). Python introduced assignment expressions to solve this problem. The following is written using the walrus operator instead.

if count := fresh_fruit.get('lemon', 0):
    make_lemonade(count)
else:
    out_of_stock()

The new code is much clearer to read, even though only one line is omitted, because the way it is written clearly shows that the count variable is only relevant to the if block. The assignment expression first assigns the value to the right of := to the count variable on the left, and then evaluates itself, i.e., treats the value of the variable as the value of the entire expression. Since the expression follows the if, the program will decide whether to execute the if block based on whether its value is non-zero. This assignment followed by a judgment is exactly what the walrus operator is trying to say.

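Making cider follows the same pattern, except that the code checks for at least four apples. The traditional way of writing it looks like this.

fresh_fruit = {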
    'apple': 10,
    'banana': 8,
    'lemon': 5,
}

def make_cider(count):
    pass

count = fresh_fruit.get('apple', 0)

if count >= 4:
    make_cider(count)
else:
    out_of_stock()

This code can be rewritten with the walrus operator in the same way.

if (count := fresh_fruit.get('apple', 0)) >= 4:
    make_cider(count)
else:
    out_of_stock()

As in the previous example, the modified code has one less line than the original. But this time, let’s also notice another phenomenon: the assignment expression itself is placed inside a pair of parentheses. Why do we do this? Because we want to compare the result of this expression with the value 4 in the if statement. The lemonade example didn’t have parentheses because the if/else was determined by the value of the assignment expression itself: as long as the value of the expression was not 0, the program went to the if branch. But not this time. This time, the assignment expression has to be placed inside a larger expression, so it has to be enclosed in parentheses. Of course, it’s better to try not to add parentheses when they are not necessary.

Another common source of this kind of repetition is when a variable must be given different values depending on some condition and is then immediately passed to a function call. For example, when a customer orders a banana smoothie, we need at least two bananas, which are sliced into pieces to make the smoothie. If there are not enough bananas, an OutofBananas exception is raised. The traditional way of writing this logic is shown below.

def slice_bananas(count):
    pass


class OutofBananas(Exception):
    pass


def make_smoothies(count):
    pass


fresh_fruit = {
    'apple': 10,
    'banana': 8,
    'lemon': 5,
}

pieces = 0
count = fresh_fruit.get('banana', 0)
if count >= 2:
    pieces = slice_bananas(count)

try:
    smoothies = make_smoothies(pieces)
except OutofBananas:
    out_of_stock()

Another common traditional form moves the pieces = 0 assignment from above the if statement into an else block.

count = fresh_fruit.get('banana', 0)
if count >= 2:
    pieces = slice_bananas(count)
else:
    pieces = 0

try:
    smoothies = make_smoothies(pieces)
except OutofBananas:
    out_of_stock()

This looks slightly odd, because the pieces variable gets its initial value in two places, the if branch and the else branch. Python’s scoping rules make this split definition valid (see #21), but it is awkward to read, so many people prefer the first form, which sets the initial value of pieces before the if statement.

Using the walrus operator instead saves one line and de-emphasizes the count variable, which now appears only in the if block. This makes it clearer that pieces is the focus of the code.

pieces = 0
if (count := fresh_fruit.get('banana', 0)) >= 2:
    pieces = slice_bananas(count)

try:
    smoothies = make_smoothies(pieces)
except OutofBananas:
    out_of_stock()

The walrus operator also improves the form that assigns pieces separately in the if and else branches, because count no longer has to be defined above the whole if/else structure.

if (count := fresh_fruit.get('banana', 0)) >= 2:
    pieces = slice_bananas(count)
else:
    pieces = 0

try:
    smoothies = make_smoothies(pieces)
except OutofBananas:
    out_of_stock()

Python newcomers often have trouble finding a good way to implement switch/case constructs. The closest approximations are if/else statements nested inside one another, or an if/elif/else chain.

For example, suppose we want to make drinks for guests automatically, in a fixed order of preference, so that they don’t have to order. The following logic first checks whether it can make a banana smoothie; if not, it tries apple cider; failing that, it makes lemonade.

count = fresh_fruit.get('banana', 0)
if count >= 2:
    pieces = slice_bananas(count)
    to_enjoy = make_smoothies(pieces)
else:
    count = fresh_fruit.get('apple', 0)
    if count >= 4:
        to_enjoy = make_cider(count)
    else:
        count = fresh_fruit.get('lemon', 0)
        if count:
            to_enjoy = make_lemonade(count)
        else:
            to_enjoy = 'Nothing'

This ugly way of writing is actually particularly common in Python code. Fortunately, we now have the walrus operator, which allows us to easily mimic something very close to the switch/case scenario.

if (count := fresh_fruit.get('banana', 0)) >= 2:
    pieces = slice_bananas(count)
    to_enjoy = make_smoothies(pieces)
elif (count := fresh_fruit.get('apple', 0)) >= 4:
    to_enjoy = make_cider(count)
elif count := fresh_fruit.get('lemon', 0):
    to_enjoy = make_lemonade(count)
else:
    to_enjoy = 'Nothing'

This version is only five lines shorter than the original, but it looks much clearer because the nesting depth and the number of indentation levels have been reduced. Whenever we encounter an unsightly structure like the one we just saw, we should consider whether we can write it with the walrus operator instead.

Another difficulty that Python newbies will encounter is the lack of do/while loop structures. For example, we want to make juice from new fruit and bottle it up until we run out of fruit. Here’s how to do it using a normal while loop.

def pick_fruit():
    pass


def make_juice(fruit, count):
    pass


bottles = []
fresh_fruit = pick_fruit()
while fresh_fruit:
    for fruit, count in fresh_fruit.items():
        batch = make_juice(fruit, count)
        bottles.extend(batch)
    fresh_fruit = pick_fruit()

Written this way, fresh_fruit = pick_fruit() appears twice: once before the while loop, to set the initial value of fresh_fruit, and again at the end of the loop body, to fetch the fruit to be processed in the next round. One way to avoid repeating this line is the loop-and-a-half pattern. It eliminates the repetition, but it undermines the while loop, which becomes an infinite loop that can only be exited with a break statement.

bottles = []
while True:
    fresh_fruit = pick_fruit()
    if not fresh_fruit:
        break
    for fruit, count in fresh_fruit.items():
        batch = make_juice(fruit, count)
        bottles.extend(batch)

With the walrus operator, there is no need to use the loop-and-a-half pattern; we can assign a value to the fresh_fruit variable at the beginning of each loop and decide whether to continue the loop based on the value of the variable. This is easy to read, so it should be the preferred solution.

bottles = []
while fresh_fruit := pick_fruit():
    for fruit, count in fresh_fruit.items():
        batch = make_juice(fruit, count)
        bottles.extend(batch)

Assignment expressions can reduce repetitive code in other situations as well (see #29). In short, whenever the same expression or assignment appears several times within a group of lines, consider using an assignment expression to simplify the code.

Summary

  1. Assignment expressions assign a value to a variable by the walrus operator (:=) and make the value the result of this expression, so we can use this feature to reduce the code. If the assignment expression is part of a larger expression, you’ll have to enclose it in a pair of parentheses.
  2. Python doesn’t support switch/case and do/while constructs, but you can use assignment expressions to clearly model this logic.

Lists and dictionaries

11. Learning to Slice Sequences

Python has syntax for slicing sequences, which makes it easy to access a subset of the original sequence. The simplest use is slicing the built-in list, str, and bytes types; in fact, any class that implements the special methods __getitem__ and __setitem__ can be sliced (see #43).
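
To show what supporting slices through __getitem__ can look like, here is a minimal sketch. The Playlist class and its track names are hypothetical and exist only for this illustration; the sketch covers reading only, so assigning to a slice would additionally need __setitem__.

```python
class Playlist:
    """Hypothetical container used only to illustrate slicing support."""

    def __init__(self, tracks):
        self._tracks = list(tracks)

    def __getitem__(self, index):
        # For obj[1:3], Python passes a slice object here; for obj[0], an int.
        if isinstance(index, slice):
            return Playlist(self._tracks[index])
        return self._tracks[index]

    def __len__(self):
        return len(self._tracks)


p = Playlist(['intro', 'verse', 'chorus', 'outro'])
middle = p[1:3]
print(len(middle), middle[0], middle[1])
```

Returning a new Playlist for a slice mirrors how slicing a list yields a list; returning a plain list would be an equally valid design choice.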

The basic form is somelist[start:end], which takes the elements from index start up to, but not including, index end.

a = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print('Middle two:   ', a[3:5])
print('All but ends: ', a[1:7])

>>>
Middle two:    ['d', 'e']
All but ends:  ['b', 'c', 'd', 'e', 'f', 'g']

If the list is cut from the beginning, then the subscript 0 to the left of the colon should be omitted to make it look clearer.

assert a[:5] == a[0:5]

If the slice runs to the end of the list, the subscript to the right of the colon should be omitted, because it is redundant.

assert a[5:] == a[5:len(a)]

A negative subscript counts backward from the end of the list. The following slices should be clear even to someone seeing this code for the first time.

a = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print(a[:]) # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print(a[:5]) # ['a', 'b', 'c', 'd', 'e']
print(a[:-1]) # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(a[4:]) # ['e', 'f', 'g', 'h']
print(a[-3:]) # ['f', 'g', 'h']
print(a[2:5]) # ['c', 'd', 'e']
print(a[2:-1]) # ['c', 'd', 'e', 'f', 'g']
print(a[-3:-1]) # ['f', 'g']

If the start or end of a slice lies beyond the boundaries of the list, the nonexistent elements are silently ignored. This makes it easy to cap the length of an input sequence, e.g.

first_twenty_items = a[:20]
last_twenty_items = a[-20:]
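
Applied to the eight-element list used in this item, the behaviour looks as follows. The name beyond is introduced here only to show a slice that starts past the end.

```python
a = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

first_twenty_items = a[:20]   # only 8 exist, so all 8 are returned
last_twenty_items = a[-20:]   # same: the start is clamped to the beginning
beyond = a[20:]               # nothing at or after index 20 -> empty list

print(len(first_twenty_items), len(last_twenty_items), beyond)
```

None of these slices raises an exception, even though index 20 does not exist in the list.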

Slice subscripts may go out of bounds, but subscripts used to access a single element may not; doing so raises an exception.

a[20]

>>>
Traceback ...
IndexError: list index out of range

The result of slicing a list is a completely new list. Replacing an element in it does not affect the corresponding position in the original list, which keeps its old value.

a = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
b = a[3:]
print('Before:', b)
b[1] = 99
print('After: ', b)
print('No change:', a)

>>>
Before: ['d', 'e', 'f', 'g', 'h']
After:  ['d', 99, 'f', 'g', 'h']
No change: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

A slice can appear on the left of an assignment, meaning that the elements of the original list in that range are replaced by the values on the right. Unlike unpacking (e.g., a, b = c[:2]; see #6), where the number of variables on the left must match the number of values on the right, slice assignment does not require the two sides to have the same length. Elements before and after the slice range are preserved, but the length of the list may change. In the following example, the right-hand side provides only 3 values while the slice on the left covers 5, so the list becomes two elements shorter.

a = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print('Before ', a)
a[2:7] = [99, 22, 14]
print('After  ', a)

>>>
Before  ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
After   ['a', 'b', 99, 22, 14, 'h']

The following code will make the list longer because the number of elements to the right of the assignment symbol is greater than the number of elements covered by that slice on the left.

a = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print('Before ', a)
a[2:3] = [47, 11]
print('After  ', a)

>>>
Before  ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
After   ['a', 'b', 47, 11, 'd', 'e', 'f', 'g', 'h']

Slices that are left empty at the start and end positions, and that appear to the right of the assignment symbol, indicate that a copy of this list is made, so that the new list produced has the same content as the original list, but with a different identity.

b = a[:]
assert b == a and b is not a

Putting the slice without the start/stop subscript to the left of the assignment symbol means replacing the entire contents of the left list with a copy of the list on the right (note that the left list still retains its original identity and no new list will be assigned).

a = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
b = a
c = a[:]
print('Before a', a)
print('Before b', b)
print('Before c', c)
a[:] = [101, 102, 103]
assert b is a and c is not a  # b is still the same list object as a
print('After a ', a)  # Now has different contents
print('After b ', b)  # Same list, so same contents as a
print('After c ', c)  # Separate copy, so it keeps the old contents

>>>
Before a ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
Before b ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
Before c ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
After a  [101, 102, 103]
After b  [101, 102, 103]
After c  ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

Summary

  1. Keep slices as simple as possible: omit the start subscript 0 when slicing from the beginning, and omit the end subscript when slicing to the end of the sequence.
  2. Slices tolerate start or end subscripts that go out of bounds, so it is easy to express “take up to this many elements from the beginning” (e.g. a[:20]) or “from the end” (e.g. a[-20:]) without worrying about whether the sequence really has that many elements.
  3. Assigning to a slice replaces the elements in that range of the original list with the values on the right of the assignment, even if the lengths differ, so the length of the list may change.

12. Don’t specify both a start/end subscript and a step in a slice

In addition to basic slicing (see Item 11), Python has a special syntax for the stride of a slice, somelist[start:end:stride]. It takes every nth element, so the elements in odd and even positions (counting from one) can easily be picked out with x[::2] and x[1::2] respectively.

x = ['red', 'orange', 'yellow', 'green', 'blue', 'purple']
odds = x[::2]
evens = x[1::2]
print(odds)
print(evens)

>>>
['red', 'yellow', 'blue']
['orange', 'green', 'purple']

Stride syntax often causes unexpected behavior that can introduce bugs. For example, a common Python trick for reversing a byte string is to slice it with a stride of -1.

x = b'mongoose'
y = x[::-1]
print(y)

>>>
b'esoognom'

Unicode strings can be reversed in the same way (see Item 3).

x = '多喝水'
y = x[::-1]
print(y)

>>>
水喝多

However, once such a string has been encoded as UTF-8 bytes, this trick no longer works, because reversing the bytes breaks up the multi-byte sequences.

x = '多喝水'
x = x.encode('utf-8')
y = x[::-1]
z = y.decode('utf-8')

>>>
Traceback ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: invalid start byte
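
If the data arrives as UTF-8 bytes and still needs to be reversed, one possible approach (a sketch, not something the original text prescribes) is to decode first, reverse the characters, and then encode again, so the multi-byte sequences stay intact.

```python
text = '多喝水'
encoded = text.encode('utf-8')

# Decode to str, reverse the characters, then encode again.
reversed_encoded = encoded.decode('utf-8')[::-1].encode('utf-8')
print(reversed_encoded.decode('utf-8'))  # 水喝多
```

This reverses code points, so text containing combining characters may still come out garbled; it only avoids the invalid-byte problem shown above.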

Does it make sense to use negative numbers other than -1 as step values? Consider the following example.

x = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
print(x[::2])  # ['a', 'c', 'e', 'g']
print(x[::-2])  # ['h', 'f', 'd', 'b']

In the above example, ::2 means select every second element, moving forward from the beginning. ::-2 is a little trickier: it means select every second element, moving backward from the end.

Using start/end indexes together with a stride makes slices hard to understand. Three values inside the square brackets are crowded and hard to read, and once a stride is specified (especially a negative one), we have to think carefully about whether the slice runs from front to back or from back to front.
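
To make the difficulty concrete, here is a small sketch (the variable names are illustrative) that combines start/end indexes with a stride. It is worth trying to predict each result before reading the comment.

```python
x = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

forward_from_2 = x[2::2]       # ['c', 'e', 'g']
backward_from_end = x[-2::-2]  # ['g', 'e', 'c', 'a']
backward_bounded = x[-2:2:-2]  # ['g', 'e']
empty = x[2:2:-2]              # []

for result in (forward_from_2, backward_from_end, backward_bounded, empty):
    print(result)
```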

To avoid this problem, I recommend not combining start/end indexes with a stride in the same slice. If you must specify a stride, prefer a positive value and leave both the start and end indexes blank. If you really need both a stride and start/end indexes, consider splitting the work into two assignments: one to stride and one to slice.

y = x[::2]  # ['a', 'c', 'e', 'g']
z = y[1:-1]  # ['c', 'e']

Striding and then slicing as we just did makes the program create an extra shallow copy of the data. So the first operation should be the one that shrinks the resulting list the most. If the program can’t afford the time or memory for two steps, you can use the islice function from the built-in itertools module instead (see Item 36), which is clearer to use because it doesn’t permit negative values for the start position, end position, or stride.
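
As a rough sketch of the islice alternative, the snippet below produces the same ['c', 'e'] result as the two-step version without building the intermediate list. It assumes the end position is known up front, since islice has no equivalent of a negative index.

```python
from itertools import islice

x = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

# islice(iterable, start, stop, step): take indexes 2 and 4 lazily.
middle = list(islice(x, 2, 6, 2))
print(middle)  # ['c', 'e']

# Negative values are rejected outright rather than silently reinterpreted.
try:
    islice(x, None, None, -1)
except ValueError as exc:
    error = exc
    print('Rejected:', exc)
```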

Summary

  1. Specifying the start index, end index, and stride of a slice all at once can be very hard to understand.
  2. If you want to specify a stride, omit the start and end indexes, and prefer a positive stride over a negative one.
  3. Avoid writing the start position, end position, and stride together in a single slice. If you need all three values, use two operations (one to stride and one to slice), or use the islice function from the itertools built-in module instead.

13. Capture multiple elements by unpacking with an asterisk, not slicing

One limitation of basic unpacking (see Item 6) is that you must know the length of the sequence being unpacked in advance. For example, suppose that for a car sale we keep the age of each car in a list, sorted from oldest to youngest. Trying to take the two oldest cars with basic unpacking raises an exception at runtime.

car_ages = [0, 9, 4, 8, 7, 20, 19, 1, 6, 15]
car_ages_descending = sorted(car_ages, reverse=True)
oldest, second_oldest = car_ages_descending

>>>
Traceback ...
ValueError: too many values to unpack (expected 2)

Python newcomers often deal with this problem by indexing and slicing (see Item 11). For example, you can explicitly index the oldest and second-oldest cars, and then put the rest in another list.

car_ages = [0, 9, 4, 8, 7, 20, 19, 1, 6, 15]
car_ages_descending = sorted(car_ages, reverse=True)
oldest = car_ages_descending[0]
second_oldest = car_ages_descending[1]
others = car_ages_descending[2:]
print(oldest, second_oldest, others)

>>>
20 19 [15, 9, 8, 7, 6, 4, 1, 0]

This works, but all the indexing and slicing is visually noisy. Splitting a sequence into several subsets this way is also error-prone, because off-by-one mistakes are easy to make. For example, if you change the boundary on one line but forget to update the others, you will get wrong results.

A better solution is a starred expression, which lets one part of an unpacking assignment receive all the values that the ordinary variables don’t pick up. Here is the same code rewritten with starred unpacking, this time without indexes or slices.

car_ages = [0, 9, 4, 8, 7, 20, 19, 1, 6, 15]
car_ages_descending = sorted(car_ages, reverse=True)
oldest, second_oldest, *others = car_ages_descending
print(oldest, second_oldest, others)

>>>
20 19 [15, 9, 8, 7, 6, 4, 1, 0]

This is short and easy to read, and it is not error-prone, because it does not require us to remember to update the other subscripts simultaneously after modifying one of them.

This expression with an asterisk can appear at any position, so it can capture any segment of elements in the sequence.

car_ages = [0, 9, 4, 8, 7, 20, 19, 1, 6, 15]
car_ages_descending = sorted(car_ages, reverse=True)
oldest, *others, youngest = car_ages_descending
print(oldest, youngest, others)
*others, second_youngest, youngest = car_ages_descending
print(youngest, second_youngest, others)

>>>
20 0 [19, 15, 9, 8, 7, 6, 4, 1]
0 1 [20, 19, 15, 9, 8, 7, 6, 4]

However, a starred expression must be accompanied by at least one ordinary target variable; otherwise Python raises a SyntaxError. For example, you can’t use a starred expression on its own, as below.

car_ages = [0, 9, 4, 8, 7, 20, 19, 1, 6, 15]
car_ages_descending = sorted(car_ages, reverse=True)
*others = car_ages_descending

>>>
SyntaxError: starred assignment target must be in a list or tuple

In addition, a starred expression can appear at most once at any single level of an unpacking pattern.

first, *middle, *second_middle, last = [1, 2, 3, 4]

>>>
SyntaxError: two starred expressions in assignment

If the structure being unpacked has several levels, a starred expression can appear at each level, as long as each one belongs to a different part of the structure. This is not recommended (see Item 19 for a similar recommendation); the example is here only to help you understand what kind of unpacking starred expressions can achieve.

car_inventory = {
    'Downtown': ('Silver Shadow', 'Pinto', 'DMC'),
    'Airport': ('Skyline', 'Viper', 'Gremlin', 'Nova'),
}

((loc1, (best1, *rest1)),
 (loc2, (best2, *rest2))) = car_inventory.items()
print(f'Best at {loc1} is {best1}, {len(rest1)} others')
print(f'Best at {loc2} is {best2}, {len(rest2)} others')

>>>
Best at Downtown is Silver Shadow, 2 others
Best at Airport is Skyline, 3 others

A starred expression always produces a list instance. If no elements of the sequence are left over, the list is empty. This is especially useful when you know in advance that the sequence being processed has at least N elements.

short_list = [1, 2]
first, second, *rest = short_list
print(first, second, rest)

>>>
1 2 []
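
The "at least N elements" condition matters. The following sketch (variable names are illustrative) shows that the starred part may be empty, but the ordinary variables must still each receive a value.

```python
# The starred part can be empty.
only, *leftover = [1]
print(only, leftover)  # 1 []

# The ordinary variables cannot: too few values raises ValueError.
try:
    first, second, *rest = [1]
except ValueError as exc:
    error = exc
    print('Unpacking failed:', exc)
```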

Unpacking also works on arbitrary iterators, although for plain multiple assignment that offers little advantage. For example, I could build an iterator over a range of two values and unpack it into the variables first and second, but it would be simpler to assign from a static list of the same shape (e.g., [1, 2]).

it = iter(range(1, 3))
first, second = it
print(f'{first} and {second}')

>>>
1 and 2

The real benefit of unpacking an iterator comes with starred expressions, which make it clearer how the iterator’s values are split up. For example, here is a generator that yields one row at a time from a CSV file containing a whole week’s worth of car orders.

def generate_csv():
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    # ...

We could use subscripts and slices to handle the results given by this generator, but this would take many lines of code to write and look confusing.

def generate_csv():
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    # ...


all_csv_rows = list(generate_csv())
header = all_csv_rows[0]
rows = all_csv_rows[1:]
print('CSV Header:', header)
print('Row count:', len(rows))

>>>
CSV Header: ('Date', 'Make', 'Model', 'Year', 'Price')
Row count: 4

Using the unpacking operation with an asterisk, we can put the first row (the table header) in a separate header variable and combine the rest of the contents given by the iterator into a rows variable. This is much clearer to write.

def generate_csv():
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    # ...


it = generate_csv()
header, *rows = it
print('CSV Header:', header)
print('Row count:', len(rows))

>>>
CSV Header: ('Date', 'Make', 'Model', 'Year', 'Price')
Row count: 4

Because the starred part always becomes a list, unpacking an iterator this way may consume all of the computer’s memory and crash the program. Use starred unpacking on an iterator only when you have good reason to believe that the result will fit in memory (for another approach, see Item 31).
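
When the rows might not fit in memory, one possible alternative (a sketch, and not necessarily the approach the referenced item describes) is to take the header with next and then process the remaining rows one at a time. The generator below is a stand-in that yields the same placeholder rows as the listing above.

```python
def generate_csv():
    # Placeholder rows, mirroring the listing in the text.
    yield ('Date', 'Make', 'Model', 'Year', 'Price')
    for _ in range(4):
        yield ('Date', 'Make', 'Model', 'Year', 'Price')


it = generate_csv()
header = next(it)   # consume only the header row
row_count = 0
for row in it:      # remaining rows are handled one at a time
    row_count += 1

print('CSV Header:', header)
print('Row count:', row_count)
```

The trade-off is that the rows are no longer available as a list afterwards, so any per-row work has to happen inside the loop.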

Summary

  1. When unpacking a data structure into variables, you can use a starred expression to capture, in a list, all the values that the ordinary variables don’t match.
  2. A starred expression can appear at any position on the left side of the assignment, and it always produces a list containing zero or more values.
  3. Starred unpacking is the clearer way to divide a list into non-overlapping pieces; indexing and slicing are more error-prone.

14. Sort by complex criteria using the key parameter of the sort method

The built-in list type provides a sort method for ordering the elements of a list instance by a variety of criteria. By default, sort orders the elements in natural ascending order. For example, if the elements are all integers, it sorts them from smallest to largest.

numbers = [93, 86, 11, 68, 70]
numbers.sort()
print(numbers)

>>>
[11, 68, 70, 86, 93]

Objects of your own classes are another matter. Here a Tool class is defined to represent various construction tools, with a __repr__ method so that instances print as readable strings.

Written this way, a list of Tool objects cannot be sorted with the sort method, because the Tool class does not define the special methods that sort needs in order to compare instances.

class Tool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __repr__(self):
        return f'Tool({self.name!r}, {self.weight})'


tools = [
    Tool('level', 3.5),
    Tool('hammer', 1.25),
    Tool('screwdriver', 0.5),
    Tool('chisel', 0.25),
]

tools.sort()

>>>
Traceback ...
TypeError: '<' not supported between instances of 'Tool' and 'Tool'
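
One way to make such objects sortable is to define the __lt__ special method, which is the comparison that sort relies on. The sketch below assumes, purely for illustration, that tools have a natural ordering by name; the text does not prescribe any such ordering.

```python
class Tool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __repr__(self):
        return f'Tool({self.name!r}, {self.weight})'

    def __lt__(self, other):
        # Assumption for this sketch: tools order naturally by name.
        if not isinstance(other, Tool):
            return NotImplemented
        return self.name < other.name


tools = [
    Tool('level', 3.5),
    Tool('hammer', 1.25),
    Tool('screwdriver', 0.5),
    Tool('chisel', 0.25),
]

tools.sort()  # works now that Tool defines __lt__
print(tools)
```

A class rarely has one obviously correct ordering, though, which is why passing a key function to sort is usually the more flexible choice.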

The lambda function below is passed as the key parameter of the sort method, which lets us sort the Tool objects alphabetically by name.

class Tool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __repr__(self):
        return f'Tool({self.name!r}, {self.weight})'


tools = [
    Tool('level', 3.5),
    Tool('hammer', 1.25),
    Tool('screwdriver', 0.5),
    Tool('chisel', 0.25),
]

print('Unsorted:', repr(tools))
tools.sort(key=lambda x: x.name)
print('\nSorted:', tools)

>>>
Unsorted: [Tool('level', 3.5), Tool('hammer', 1.25), Tool('screwdriver', 0.5), Tool('chisel', 0.25)]

Sorted: [Tool('chisel', 0.25), Tool('hammer', 1.25), Tool('level', 3.5), Tool('screwdriver', 0.5)]
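
The key function is also handy for transforming built-in values before they are compared. As a small sketch (the list of places is made up for illustration), lowercasing each string makes the sort case-insensitive, whereas the natural order places all capital letters before lowercase ones.

```python
places = ['home', 'work', 'New York', 'Paris']

case_sensitive = sorted(places)
case_insensitive = sorted(places, key=lambda x: x.lower())

print('Case sensitive:  ', case_sensitive)
print('Case insensitive:', case_insensitive)
```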

Sometimes we need to sort by more than one criterion. This can be done with tuples, which compare element by element: if the first elements of two tuples are equal, the second elements are compared, and if those are also equal, the comparison continues with the next elements, and so on.

class Tool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __repr__(self):
        return f'Tool({self.name!r}, {self.weight})'


power_tools = [
    Tool('drill', 4),
    Tool('circular saw', 5),
    Tool('jackhammer', 40),
    Tool('sander', 4),
]

power_tools.sort(key=lambda x: (x.weight, x.name))
print(power_tools)

power_tools.sort(key=lambda x: (x.weight, x.name), reverse=True)
print(power_tools)

>>>
[Tool('drill', 4), Tool('sander', 4), Tool('circular saw', 5), Tool('jackhammer', 40)]
[Tool('jackhammer', 40), Tool('circular saw', 5), Tool('sander', 4), Tool('drill', 4)]

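The limitation of this approach is that every criterion in the tuple returned by the key function is compared in the same direction (all ascending, or all descending when reverse=True is passed). For numeric criteria there is a workaround: negate the value with the unary minus operator inside the tuple, so that this criterion sorts in the opposite direction while the others keep theirs. Here weight is sorted in descending order and name in ascending order.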

power_tools.sort(key=lambda x: (-x.weight, x.name))
print(power_tools)

However, not every type supports unary negation; str, for example, does not. In that case, the sort method can be called several times to get a different direction for each criterion, relying on the fact that sort uses a stable sorting algorithm: elements that compare equal under the key keep their existing relative order.

power_tools.sort(key=lambda x: x.name)  # Name ascending
power_tools.sort(key=lambda x: x.weight, reverse=True)  # Weight descending
print(power_tools)

This works for any number of criteria, and each criterion can have its own direction rather than all ascending or all descending. Just apply them in reverse order of priority, so that the most important criterion is sorted in the final pass. In the example above, the primary criterion is weight descending and the secondary criterion is name ascending, so the list is sorted first by name ascending and then by weight descending.

Although both approaches produce the same result, a single call to sort is easier to read than several. So when sorting by multiple criteria in different directions, prefer a key function that returns a tuple, negating the relevant values inside it. Fall back to multiple calls to sort only as a last resort.
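
The two snippets above are shown without output, so here is a self-contained sketch that runs both approaches on the same power_tools data and confirms that they agree. The names one_pass and two_pass are introduced only for this comparison.

```python
class Tool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __repr__(self):
        return f'Tool({self.name!r}, {self.weight})'


power_tools = [
    Tool('drill', 4),
    Tool('circular saw', 5),
    Tool('jackhammer', 40),
    Tool('sander', 4),
]

# Approach 1: one pass, negating the numeric criterion.
one_pass = sorted(power_tools, key=lambda x: (-x.weight, x.name))

# Approach 2: two stable passes, least important criterion first.
two_pass = sorted(power_tools, key=lambda x: x.name)    # name ascending
two_pass.sort(key=lambda x: x.weight, reverse=True)     # weight descending

print(one_pass)
print(two_pass)
```

Both lines print the jackhammer first, then the circular saw, then the drill and the sander (equal weight, ordered by name).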

Summary

  1. The sort method of a list can order elements of built-in types such as strings, integers, and tuples by their natural ordering.
  2. Objects of other classes can also be sorted with sort if they define a natural ordering through special methods, but such objects are uncommon.
  3. You can pass a helper function as the key parameter of sort, and the elements will be ordered by the values that function returns instead of by the elements themselves.
  4. To sort by several criteria, have the key function return a tuple of them. For types that support the unary negation operator, you can negate an individual criterion to reverse the sort direction for that criterion alone.
  5. If a criterion does not support unary negation, you can call sort multiple times, specifying the key function and the reverse argument separately for each call. Handle the least important criterion in the first pass and work up to the most important criterion in the last pass.

15. Don’t rely too much on the order you use when adding entries to a dictionary

In Python 3.5 and earlier, the order you see when iterating through a dictionary (dict) seems to be arbitrary, and not necessarily in the same order as when you added the key-value pairs to the dictionary in the first place.

# Python 3.5
baby_names = {
    'cat': 'kitten',
    'dog': 'puppy',
}

print(baby_names)

>>>
{'dog': 'puppy', 'cat': 'kitten'}

This happened because the dictionary type was previously implemented as a hash table that combined the built-in hash function with a random seed assigned each time the Python interpreter started. As a result, key-value pairs were not necessarily stored in the order in which they were added, and the order could differ from one run of the program to the next.

Starting with Python 3.6, the dictionary retains the order in which these key-value pairs were added, and the Python 3.7 language specification formalizes this rule. So, in newer versions of Python, it’s always possible to iterate through these key-value pairs in the same order in which they were created.

baby_names = {
    'cat': 'kitten',
    'dog': 'puppy',
}

print(baby_names)

>>>
{'cat': 'kitten', 'dog': 'puppy'}

In Python 3.5 and earlier, the dict methods that depend on iteration order (including keys, values, items, and popitem) gave no guarantee about that order, so their results appeared random.

baby_names = {
    'cat': 'kitten',
    'dog': 'puppy',
}

# Python 3.5
print(list(baby_names.keys()))
print(list(baby_names.values()))
print(list(baby_names.items()))
print(baby_names.popitem())  # Randomly chooses an item

>>>
['dog', 'cat']
['puppy', 'kitten']
[('dog', 'puppy'), ('cat', 'kitten')]
('dog', 'puppy')

In newer versions of Python, these methods follow the order in which the key-value pairs were inserted.

baby_names = {
    'cat': 'kitten',
    'dog': 'puppy',
}

print(list(baby_names.keys()))
print(list(baby_names.values()))
print(list(baby_names.items()))
print(baby_names.popitem())  # Last item inserted

>>>
['cat', 'dog']
['kitten', 'puppy']
[('cat', 'kitten'), ('dog', 'puppy')]
('dog', 'puppy')

This change has had a number of implications for features of Python that rely on dictionary types and their implementation details.

Keyword arguments to functions (including the catch-all **kwargs parameter; see Item 23) used to arrive in seemingly random order, which made function calls harder to debug.

# Python 3.5
def my_func(**kwargs):
    for key, value in kwargs.items():
        print('%s = %s' % (key, value))


my_func(goose='gosling', kangaroo='joey')

>>>
kangaroo = joey
goose = gosling

Now, keyword arguments always keep the order in which they were specified in the function call.

def my_func(**kwargs):
    for key, value in kwargs.items():
        print('%s = %s' % (key, value))


my_func(goose='gosling', kangaroo='joey')

>>>
goose = gosling
kangaroo = joey

Also, classes make use of dictionaries to hold some of the data that an instance of that class has. In earlier versions of Python, the fields in an object looked as if they appeared in random order.

class MyClass:
    def __init__(self):
        self.alligator = 'hatchling'
        self.elephant = 'calf'


a = MyClass()
for key, value in a.__dict__.items():
    print('%s = %s' % (key, value))

>>>
# Python 3.5
elephant = calf
alligator = hatchling

# Python > 3.5
alligator = hatchling
elephant = calf

The Python language specification now requires that dictionaries preserve the order in which key-value pairs are added. So, we can use this feature to implement some functionality, and we can incorporate it into our own APIs for classes and functions.

In fact, the built-in collections module has long provided a dictionary type that preserves insertion order, called OrderedDict. Its behavior is similar to that of the standard dict type (since Python 3.7), but its performance characteristics are quite different. If you insert and pop key-value pairs very frequently (for example, to implement a least-recently-used cache), OrderedDict may be a better fit than the standard dict type (see Item 70 for how to determine whether you should switch).
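
To give a feel for the least-recently-used use case, here is a minimal sketch built on OrderedDict. The LRUCache class and its get and put methods are hypothetical names invented for this illustration; only move_to_end and popitem(last=False) are part of the OrderedDict API.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical minimal least-recently-used cache, for illustration."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used


cache = LRUCache(capacity=2)
cache.put('cat', 'kitten')
cache.put('dog', 'puppy')
cache.get('cat')               # 'cat' becomes most recently used
cache.put('goose', 'gosling')  # evicts 'dog'
print(list(cache.data))        # ['cat', 'goose']
```

A plain dict has no direct counterpart to move_to_end or to popping from the front, which is part of why OrderedDict suits this access pattern.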

Even so, when handling dictionaries you can’t always assume that every dictionary-like object preserves insertion order. Python is not a statically typed language, and most code relies on duck typing (an object is used according to the behavior it supports, regardless of its place in the class hierarchy). This can lead to surprising problems.

For example, now we want to write a program that counts the popularity of various critters. We can set up a dictionary that associates each animal with the number of votes it gets.

votes = {
    'otter': 1281,
    'polar bear': 587,
    'fox': 863,
}

Now define a function to process the voting data. The caller passes in an empty dictionary, and the function fills it with each animal and its rank. Such a dictionary could serve as the data model behind a user interface (UI) element.

def populate_ranks(votes, ranks):
    names = list(votes.keys())
    names.sort(key=votes.get, reverse=True)
    for i, name in enumerate(names, 1):
        ranks[name] = i

We also need a function that reports which animal won the contest. It assumes that populate_ranks writes key-value pairs into the dictionary in ascending rank order, so that the first key in the dictionary is the top-ranked animal.

def get_winner(ranks):
    return next(iter(ranks))

Here’s how to verify the functions you just designed and see if they achieve the desired results.

votes = {
    'otter': 1281,
    'polar bear': 587,
    'fox': 863,
}


def populate_ranks(votes, ranks):
    names = list(votes.keys())
    names.sort(key=votes.get, reverse=True)
    for i, name in enumerate(names, 1):
        ranks[name] = i


def get_winner(ranks):
    return next(iter(ranks))


ranks = {}
populate_ranks(votes, ranks)
print(ranks)
winner = get_winner(ranks)
print(winner)

>>>
{'otter': 1, 'fox': 2, 'polar bear': 3}
otter

The result is correct. But suppose the requirements change and the UI should now display the results in alphabetical order rather than in rank order. To achieve this, we define a dictionary-like class using the built-in collections.abc module. The class offers the same interface as a dictionary but iterates over its contents in alphabetical order of the keys.

from collections.abc import MutableMapping


class SortedDict(MutableMapping):
    def __init__(self):
        self.data = {}

    def __getitem__(self, key):
        return self.data[key]

    def __setitem__(self, key, value):
        self.data[key] = value

    def __delitem__(self, key):
        del self.data[key]

    def __iter__(self):
        keys = list(self.data.keys())
        keys.sort()
        for key in keys:
            yield key

    def __len__(self):
        return len(self.data)


votes = {
    'otter': 1281,
    'polar bear': 587,
    'fox': 863,
}


def populate_ranks(votes, ranks):
    names = list(votes.keys())
    names.sort(key=votes.get, reverse=True)
    for i, name in enumerate(names, 1):
        ranks[name] = i


def get_winner(ranks):
    return next(iter(ranks))


sorted_ranks = SortedDict()
populate_ranks(votes, sorted_ranks)
print(sorted_ranks.data)
winner = get_winner(sorted_ranks)
print(winner)

>>>
{'otter': 1, 'fox': 2, 'polar bear': 3}
fox

This is not the expected result. Why? The get_winner function assumes that iterating over the dictionary yields keys in the order in which populate_ranks inserted them. This time we passed a SortedDict instance instead of a standard dict instance, so that assumption no longer holds, and the function returns the key that comes first alphabetically, 'fox'. There are three ways to fix this.

  1. Reimplement the get_winner function so that it no longer assumes that the ranks dictionary iterates in any particular order. This is the most conservative and robust solution.

    def get_winner(ranks):
        for name, rank in ranks.items():
            if rank == 1:
                return name
    
    winner = get_winner(sorted_ranks)
    print(winner)
    
    >>>
    otter
    
  2. Add a check at the top of the function to verify that ranks is a standard dict as expected, and raise an exception if it is not. This approach is likely to perform better than the more conservative one above.

    def get_winner(ranks):
        if not isinstance(ranks, dict):
            raise TypeError('must provide a dict instance')
        return next(iter(ranks))
    
    get_winner(sorted_ranks)
    
    >>>
    Traceback ...
    TypeError: must provide a dict instance
    
  3. Use type annotations to require that the value passed to get_winner is a real dict instance and not merely a MutableMapping that behaves like a standard dictionary (see Item 90). Then run the mypy tool in strict mode on the annotated code.

    from typing import Dict, MutableMapping
    
    def populate_ranks(votes: Dict[str, int],
                       ranks: Dict[str, int]) -> None:
        names = list(votes.keys())
        names.sort(key=votes.get, reverse=True)
        for i, name in enumerate(names, 1):
            ranks[name] = i
    
    def get_winner(ranks: Dict[str, int]) -> str:
        return next(iter(ranks))
    
    class SortedDict(MutableMapping[str, int]):
        def __init__(self):
            self.data = {}
    
        def __getitem__(self, key):
            return self.data[key]
    
        def __setitem__(self, key, value):
            self.data[key] = value
    
        def __delitem__(self, key):
            del self.data[key]
    
        def __iter__(self):
            keys = list(self.data.keys())
            keys.sort()
            for key in keys:
                yield key
    
        def __len__(self):
            return len(self.data)
    
    votes = {
        'otter': 1281,
        'polar bear': 587,
        'fox': 863,
    }
    
    sorted_ranks = SortedDict()
    populate_ranks(votes, sorted_ranks)
    print(sorted_ranks.data)
    winner = get_winner(sorted_ranks)
    print(winner)
    
    $ python3 -m mypy --strict test.py
    test.py:42: error: Argument 2 to "populate_ranks" has incompatible type "SortedDict"; expected "Dict[str, int]"
    test.py:44: error: Argument 1 to "get_winner" has incompatible type "SortedDict"; expected "Dict[str, int]"
    

    mypy detects the type mismatch and flags the incorrect usage: the functions require a dict but are given a MutableMapping. This solution provides static type safety without any cost in runtime performance.

Summary

  1. Starting with Python 3.7, you can rely on iteration over a standard dictionary following the order in which the key-value pairs were inserted.
  2. In Python code it is easy to define objects that look like standard dictionaries but aren’t dict instances. For such objects, you can’t assume that iteration order matches insertion order.
  3. To guard against dictionary-like objects being mistaken for standard dictionaries, there are three approaches to consider.
    1. Write code that does not rely on insertion order.
    2. Explicitly check at runtime that the object is a standard dict.
    3. Add type annotations to the code and run static analysis.
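
These approaches can also be combined. The sketch below is one possible variation, not something taken from the original text: it annotates the parameter as a general Mapping so that any dictionary-like object is accepted, and it looks for rank 1 explicitly instead of trusting iteration order. Returning None when no rank 1 exists is an assumption made for this example.

```python
from typing import Mapping, Optional


def get_winner(ranks: Mapping[str, int]) -> Optional[str]:
    # Look for rank 1 explicitly instead of trusting iteration order.
    for name, rank in ranks.items():
        if rank == 1:
            return name
    return None  # assumption: no winner when nothing has rank 1


# Works even when the winner is not the first key.
shuffled_ranks = {'fox': 2, 'otter': 1, 'polar bear': 3}
print(get_winner(shuffled_ranks))  # otter
```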