Background

Two nights ago, the online system suddenly broke down, and after immediately opening the online error log, I only got a bunch of meaningless program call stack (traceback) output, so the team members were caught in a long and blind problem solving process. The problem was fortunately solved, but I couldn’t figure out why the call stack printed in the log was meaningless when, as a rule of thumb, it should print the call stack during the exception generation process. After subsequent source code analysis and troubleshooting, I realized it was actually due to an old piece of code in the project using monkey patch, which is what this article is trying to discuss.

What is a monkey patch

Monkey patching is a type of programming used to modify (add, change, delete, etc.) the behavior of system software at runtime. There is a wide range of monkey patch applications in dynamic languages, such as Ruby’s open class feature that supports extending class definitions and even replacing method implementations at runtime, Python’s methods or functions that can be replaced at runtime, and other languages like JavaScript that can also apply monkey patches.

Monkey patching is a double-edged sword

Monkey patches are flexible enough to allow complete separation of the patch code from the application code, while allowing the application code to be invoked in a way that remains constant. From the perspective of the application code, it calls the original defined method or function of a module; from the perspective of the called method or function, the presence of the monkey patch is transparent to it, as shown in the following demo in Python.

Let’s start with a minimalist example and say hello to a wonderful world.

1
2
3
4
5
def greet():
    print("Hello World!")

if __name__ == "__main__":
    greet()

If the above script is executed, the result will be

1
2
$ python demo.py
Hello World!

This is easy. Next, let’s say we do a monkey patch: we expand the original greet behavior and now print the current time in addition to the message.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
from datetime import datetime
def greet():
    print("Hello World!")

# monkey patch
original_greet = greet
def greet_with_time():
    original_greet()
    print(datetime.now())
greet = greet_with_time  # replace the implementation
# monkey patch

if __name__ == "__main__":
    greet() # 这里的调用和原来没有变化

Run it and get the following results.

1
2
3
$ python demo.py
Hello World!
2019-09-21 23:40:42.575782

We get the expected result!

From the code analysis, we added a new function greet_with_time, which calls the original greet function, then prints the current time, and finally replaces the greet function by assigning the function to a variable. The final call to the greet function does not require any changes, thus achieving the same call to the greet function but with a very different behavior.

The above demo is just a simplified code for the sake of space, the monkey patch code in a real project is always in a separate module or file. Imagine a large, complex project where your code is flooded with monkey patches, it would be a disastrous challenge to analyze the behavior of the system and troubleshoot problems.

Now that we have some understanding of monkey patches, let’s look at an example I encountered in a real project.

A bunch of meaningless stack information

I reproduced locally the exception we encountered as I mentioned at the beginning, and here are the stack messages consistent with the online environment.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
2019-09-19 17:30:11.103|CRITICAL|138:140147476383488|log.py:282|log.log|Task command.celery.crontab_task.some_task[538ddb72-89b0-45fe-811e-107202dc665b] INTERNAL ERROR: AttributeError("'long' object has no attribute 'insert'",)
Traceback (most recent call last):
  File "/usr/local/bin/celery", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/celery/__main__.py", line 30, in main
    main()
  ...... 限于篇幅,这里省略很多无意义的内容
  File "/usr/local/lib/python2.7/dist-packages/celery/worker/job.py", line 384, in on_success
    return self.on_failure(ret_value)
  File "/usr/local/lib/python2.7/dist-packages/celery/worker/job.py", line 443, in on_failure
    self._log_error(exc_info, send_failed_event=send_failed_event)
  File "/usr/local/lib/python2.7/dist-packages/celery/worker/job.py", line 511, in _log_error
    'internal': internal}})
  File "/usr/local/lib/python2.7/dist-packages/celery/utils/log.py", line 282, in log
    return Logger.log(self, *args, **kwargs)
None

From this stack message, it actually prints the stack of the Logger.log function called, in which there is no code at all to see the word .insert, which has nothing to do with AttributeError("'long' object has no attribute 'insert'",). Such a stack of information is basically the same with and without. So, I went on to explore more through the editor and the source code.

The first thing I did was to use the stack above to analyze what went wrong, so I looked at the code at celery/worker/job.py:504-511 first.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
        context = {
            'hostname': self.hostname,
            'id': self.id,
            'name': self.name,
            'exc': exception,
            'traceback': traceback,
            'args': sargs,
            'kwargs': skwargs,
            'description': description,
        }

        logger.log(severity, format.strip(), context,
                   exc_info=exc_info,
                   extra={'data': {'id': self.id,
                                   'name': self.name,
                                   'args': sargs,
                                   'kwargs': skwargs,
                                   'hostname': self.hostname,
                                   'internal': internal}})

The logger.log method is called here (the source of the logger can be analyzed in Celery’s code, but is not the focus of this article, so I won’t expand on it) and two important messages are passed in via the context object: exception and traceback. Further reading of the logger.log source code confirmed that the core of this piece of log printing relies on a call to the traceback.print_exception function.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
    def formatException(self, ei):
        """
        Format and return the specified exception information as a string.

        This default implementation just uses
        traceback.print_exception()
        """
        sio = io.StringIO()
        tb = ei[2]
        traceback.print_exception(ei[0], ei[1], tb, None, sio)

So, I went back to the code at celery/worker/job.py:504-511 and inserted two codes that print error stack information before logger.log.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
# context  = ...

################################################################
import traceback as _traceback
# Method 1: like what logger.log does
_traceback.print_exception(*exc_info)

# Method 2: use `format_exception` instead
print(''.join(_traceback.format_exception(*exc_info)))
################################################################

logger.log(....

After restarting celery and executing the asynchronous task, the first error stack is identical to the one I posted earlier, which is understandable since the call to the print_exception function here is the core implementation in logger.log. The call to format_exception gives me the real meaningful error stack information.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 283, in trace_task
    uuid, retval, SUCCESS, request=task_request,
  File "/usr/local/lib/python2.7/dist-packages/celery/backends/base.py", line 271, in store_result
    request=request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/backends/base.py", line 505, in _store_result
    self.set(self.get_key_for_task(task_id), self.encode(meta))
  File "/usr/local/lib/python2.7/dist-packages/celery/backends/redis.py", line 161, in set
    return self.ensure(self._set, (key, value), **retry_policy)
  File "/usr/local/lib/python2.7/dist-packages/celery/backends/redis.py", line 150, in ensure
    **retry_policy
  File "/usr/local/lib/python2.7/dist-packages/kombu/utils/__init__.py", line 246, in retry_over_time
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/backends/redis.py", line 170, in _set
    pipe.execute()
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2879, in execute
    return execute(conn, stack, raise_on_error)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2785, in _execute_transaction
    response.insert(i, e)
AttributeError: 'long' object has no attribute 'insert'

Now it’s clear, this is where the exception in this code is really coming from!

But then the question arises, why do print_exce ption and format_exception give different stack information? I went to the official documentation with a lot of questions, but the confusion got worse.

traceback.format_exception(etype, value, tb[, limit]) Format a stack trace and the exception information. The arguments have the same meaning as the corresponding arguments to print_exception(). The return value is a list of strings, each ending in a newline and some containing internal newlines. When these lines are concatenated and printed, exactly the same text is printed as does print_exception().

Focus on the last sentence, the official Python documentation says that both functions output the same error stack (EXACTLY the same text)!

Spot the monkey patch

Actually, the real process of troubleshooting the problem took me a lot of time. I hadn’t been thinking about the monkey patch, but I finally found the culprit on the subway on my way out to a friend’s dinner date, when I looked at the project code on my company’s GitLab on my phone.

1
2
3
4
5
6
7
8
9
def _patch_print_exception():
import traceback

def custom_print_exception(etype, value, tb, limit=None, file=None):
    exc_info = sys.exc_info()
    stack = traceback.extract_stack()
    # ... omit other source codes 

traceback.print_exception = custom_print_exception

From the patch code, the patch directly overwrites the original code, and the implementation also directly and brutally ignores the several exception message parameters passed in! That’s why it’s such a big oops, with unrelated exception stack information! (╯‵□′)╯︵┻┻

Tips for troubleshooting monkey patches

There are certainly pros and cons to this type of programming technique, monkey patches, and using them necessarily requires extra caution. However, it is not necessary to stay away from them, the focus is on mastering the necessary troubleshooting skills, and I will go back to this lesson to find some methods that may help.

1. Check the method or function information through the function or method itself properties

As you know, all Python objects have a bunch of built-in properties, and functions are no exception.

1
2
3
# django shell
In [1]: traceback.print_exception.func_code
Out[1]: <code object custom_print_exception at 0x109e9f030, file "/Users/boy/work_area/project/project-source/lib/common/logger.py", line 295>

A quick look shows that the real code of this function is actually the patch code of the project!

2. Inspecting with the inspect package

Python provides a lot of toolkits, and inspect is naturally one of them, and it can be used to do runtime checks on almost any type.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
# django shell
In [1]: import inspect
In [2]: inspect.getfile(traceback.print_exception)
Out[2]: '/Users/boy/work_area/project/project-source/lib/common/logger.py'

In [3]: inspect.getsource(traceback.print_exception)
Out[3]: '\tdef custom_print_exception(etype, value, tb, limit=None, file=None): ......\n'

In [4]: print inspect.getsource(traceback.print_exception)
Out[4]: def custom_print_exception(etype, value, tb, limit=None, file=None):disable=redefined-builtin
        if file is None:
            file = sys.stderr
        exc_info = sys.exc_info()
        stack = traceback.extract_stack()
        ...

In short, if you encounter code behavior does not match the expected but can not correspond with the official documentation or official source code, then it is possible that the dependent method or function was monkey patched, and the quickest way to confirm the monkey patch is to check the actual definition of the function or method called in the first place, that is, to apply the above method!

Off-topic remarks

I’ve encountered the monkey patch trap when doing Ruby development, and there is a similar approach in Ruby.

1
2
3
file, line = A.new.method(:foo).source_location
puts "Method foo is defined in #{file}, line #{line}"
# => "Method foo is defined in temp.rb, line 2"