Tracing memory leaks in Python
A couple of weeks back I started rewriting one of our core service layers, refactoring a bunch of code and abstracting it into a more reusable form. This involved touching some MySQL cursor code, and in the process I accidentally introduced a memory leak. Unfortunately we found the bug about an hour after pushing to production: the service leaked memory, and within roughly an hour the server started swapping. I started investigating this behavior and was able to reproduce it on my local machine after writing a quick load emulator with Python threads.
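The load emulator was nothing fancy. A minimal sketch of that kind of script, assuming a hypothetical hit_service() helper that fires one request at the service under test:

#!python
import threading
import time

def hit_service():
    # Hypothetical placeholder: issue a single request against the service under test.
    pass

def worker(requests=1000, delay=0.01):
    # Each thread hammers the service in a tight loop with a small pause.
    for _ in range(requests):
        hit_service()
        time.sleep(delay)

# A couple of dozen threads is usually enough to make a per-request leak visible in top.
threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()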
Running top showed that the RSS memory for the process starts around 21 MB and grows to 46 MB in 30 minutes. In the past I have used the Python object graph module objgraph to track the types of the various objects in memory, so I quickly placed a small snippet of code to print the counts of objects by type.
#!python
import objgraph
objgraph.show_most_common_types()
dict                 6230
tuple                5834
function              740
So I got the sense that there were way more dicts in memory than expected. The dict count did drop every few seconds, though, which meant the Python gc was doing its job. Meanwhile I reverted production to its previous stable build, ran that same stable build locally, and saw that the numbers of dicts and tuples were about the same as in the leaking build. So I knew the problem was something else.
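A handy way to watch those counts move while the load test runs is to dump them on a timer. Something along these lines works (a sketch, not the original snippet):

#!python
import threading
import time
import objgraph

def watch_object_counts(interval=10, limit=5):
    """Print the most common object types every `interval` seconds."""
    def loop():
        while True:
            objgraph.show_most_common_types(limit=limit)
            time.sleep(interval)
    t = threading.Thread(target=loop)
    t.daemon = True    # don't keep the process alive just for this logger
    t.start()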
I did a little more reading on the web and found that heapy might be able to help, but I failed to install guppy/heapy on my Mac: it constantly complained No module named gobject. I gave up on that direction and looked more into objgraph. Then I found this cool function which shows which data structures are growing.
#!python
import objgraph
objgraph.show_growth(limit=3)
tuple                3432       +23
dict                 2430       +18
function              630        +2
This function prints the count of each data structure and by how much it grew since the last call. That is cool, but I found the same data structures growing constantly. I left the test running for a couple of minutes while I rushed to fetch my laundry. When I came back there was a new entry in the terminal.
tuple                4743       +23
dict                 3703       +18
DictCursor            734        +1
Bingo, that tells me something: somehow the DictCursor objects were not being garbage collected. Now, how to trace down why these objects were not being garbage collected? The function which creates and destroys them is this utility function, get_mysql_cursor.
#!python
from types import MethodType

def fetchall_items(self):
    """Cursor returns a list of tuples of items
    Ex: [(3434,), (544,), (4545,)]
    """
    return [x for x in self.fetchall()]

def get_mysql_cursor(host):
    # returns a pooled connection to host
    conn = get_mysql_connection_for_host(host)
    cursor = conn.cursor(DictCursor)
    try:
        # patch the cursor to add a simple helper
        cursor.fetchall_items = MethodType(fetchall_items, cursor)
        yield cursor
    except Exception, err:
        import traceback
        conn.refresh()
        logger.error("mysql query failed on %s with %s" % (host, traceback.format_exc()))
    finally:
        cursor.close()
        del cursor
        del conn
The function invokes del on the cursor and the connection once it is done with them; the connection's __del__ method is hacked to return the connection back to the pool. After a little more digging I found this little trick in the objgraph docs: objgraph can draw the references between objects, so we can see who is holding a reference to the cursor object.
#!python
import objgraph
objgraph.show_refs([DictCursor], filename='sample-graph.png')
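objgraph can also work the other way round: grab the live instances by type name and draw the chains of objects that point at them. A sketch of that variant:

#!python
import objgraph

# Pick a few of the leaked cursors that are still alive in memory and
# render everything that holds a reference to them.
leaked = objgraph.by_type('DictCursor')[:3]
objgraph.show_backrefs(leaked, max_depth=5, filename='cursor-backrefs.png')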
For these graph calls to work I had to install the graphviz and xdot modules.
#!bash
$ sudo pip install graphviz xdot
This cracked it: sample-graph.png showed the objects in memory and how they reference each other.
You can clearly see that DictCursor has a back reference from the bound method fetchall_items, which is what kept the gc from collecting the cursor: the cursor holds the patched method and the method holds the cursor, so the two keep each other alive. Yes, I was patching the cursor with this utility function at runtime, even though the function fetchall_items lives in the global scope of the module.
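The pattern is easy to reproduce outside of the MySQL code. Here is a tiny, self-contained sketch (using a stand-in FakeCursor, not the real driver class) showing that patching a bound method onto an instance creates a reference cycle which keeps the instance alive after the last external reference is dropped:

#!python
import gc
import weakref
from types import MethodType

class FakeCursor(object):
    """Stand-in for the real DB cursor, just for illustration."""
    def fetchall(self):
        return [(1,), (2,)]

def fetchall_items(self):
    return [x for x in self.fetchall()]

cursor = FakeCursor()
# The bound method references the cursor, and the cursor's __dict__
# references the bound method -- a reference cycle.
cursor.fetchall_items = MethodType(fetchall_items, cursor)
ref = weakref.ref(cursor)

del cursor
print(ref() is None)   # False: the cycle keeps the instance alive
gc.collect()
print(ref() is None)   # True here, because FakeCursor has no __del__; in
                       # Python 2 a cycle whose objects define __del__ (as
                       # real DB cursors typically do) is never collected
                       # and lands in gc.garbage instead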
#!python
try:
    # patch the cursor to add a simple helper
    cursor.fetchall_items = MethodType(fetchall_items, cursor)
    yield cursor
except Exception, err:
    import traceback
    conn.refresh()
    logger.error("mysql query failed on %s with %s" % (host, traceback.format_exc()))
finally:
    # remove the dynamic method to avoid the memory leak
    delattr(cursor, "fetchall_items")
    cursor.close()
    del cursor
    del conn
I wrote a quick fix to delete the patched method before closing and deleting the cursor, which seems to have fixed the problem.
Lesson learnt: it is not good practice to patch objects at run time. It leads to exactly this kind of scenario, where you spend a lot of time chasing a problem you don't understand. It took me around 9 hours to track this one down.
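If I were rewriting this today, one way to avoid the runtime patching entirely is to keep the helper as a plain function that takes the cursor as an argument. A quick sketch of that shape (not the production code):

#!python
def fetchall_items(cursor):
    """Plain helper: no patching, no bound method, no reference cycle."""
    return [x for x in cursor.fetchall()]

# Used wherever a cursor from get_mysql_cursor() is in hand, e.g.:
#   items = fetchall_items(cursor)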