A couple of weeks back I started rewriting one of our core service layers, refactoring a bunch of code and abstracting it into a more reusable form. This involved touching some MySQL cursor code, and in the process I accidentally introduced a memory leak. Unfortunately we found the bug only about an hour after we pushed to production: the service leaked memory, and it took roughly an hour for the server to start swapping. I started investigating and was able to reproduce the behavior on my local machine after writing a quick load emulator with Python threads.
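A load emulator for this kind of thing doesn't need to be fancy, just a bunch of threads hammering the service in a loop. Roughly like the sketch below, where call_service is a hypothetical stand-in for the real call into our service layer:

#!python
import threading

def call_service():
    # hypothetical stand-in for the actual request into the service layer
    pass

def hammer(requests=10000):
    for _ in range(requests):
        call_service()

threads = [threading.Thread(target=hammer) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()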

Running top showed that the RSS memory for the process starts around 21 MB and grows to 46 MB within 30 minutes. In the past I had used the Python object graph module objgraph to track the types of objects in memory, so I quickly dropped in a small snippet of code to print the count of each object type.

#!python
import objgraph
objgraph.show_most_common_types()

dict                       6230
tuple                      5834
function                   740

This gave me the sense that there were way more dicts in memory than expected, although the dict count dropped every few seconds, which meant the Python gc was doing its job. Meanwhile I reverted production to its previous stable build, ran that same build locally, and saw that the numbers of dicts and tuples were about the same. So I knew the problem was something else.
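One simple way to keep an eye on the counts while the load test runs is a sampling loop in a background thread, something along these lines (a sketch; the interval is arbitrary):

#!python
import threading
import time

import objgraph

def sample_counts(interval=10):
    # print the three most common object types every `interval` seconds
    while True:
        objgraph.show_most_common_types(limit=3)
        time.sleep(interval)

t = threading.Thread(target=sample_counts)
t.daemon = True   # don't keep the process alive just for the sampler
t.start()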

A little more reading on the web suggested that heapy might be able to help, but I failed to install guppy/heapy on my Mac; it constantly complained `No module named gobject`. I gave up on that direction and dug further into objgraph, where I found this cool function that reports which data structures are growing.

#!python
import objgraph
objgraph.show_growth(limit=3)

tuple              3432     +23
dict               2430     +18
function           630      +2

This function prints the count of each object type and how much it grew since the last call. That is cool, but at first I saw the same data structures growing steadily. I left the test running for a couple of minutes while I rushed to fetch my laundry, and when I came back there was a new entry in the terminal.

tuple              4743     +23
dict               3703     +18
DictCursor          734     +1

Bingo, that told me something: somehow the DictCursor objects were not being garbage collected. OK, now how to trace down why these objects are not collected. The function that creates and destroys the cursors is this utility function, get_mysql_cursor.

#!python
from contextlib import contextmanager
from types import MethodType

from MySQLdb.cursors import DictCursor


def fetchall_items(self):
    """Cursor helper that flattens a list of single-item tuples.
        Ex:   [(3434,), (544,), (4545,)] -> [3434, 544, 4545]
    """
    return [x[0] for x in self.fetchall()]


@contextmanager
def get_mysql_cursor(host):
    # returns pooled connection to host
    conn = get_mysql_connection_for_host(host)
    cursor = conn.cursor(DictCursor)
    try:
        # patch the cursor to add a simple helper
        cursor.fetchall_items = MethodType(fetchall_items, cursor)
        yield cursor
    except Exception:
        import traceback
        conn.refresh()
        logger.error("mysql query failed on %s with %s" % (host, traceback.format_exc()))
    finally:
        cursor.close()
        del cursor
        del conn

The function actually invokes del once it is done with the cursor and connection, and the __del__ method of the connection is hacked to return the connection back to the pool.
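The pooling piece looks roughly like the sketch below; this is an illustration with a plain list as the pool, not our actual code:

#!python
pool = []

class PooledConnection(object):
    """Illustration: instead of closing on __del__, the connection
    hands its underlying raw connection back to the pool for reuse."""

    def __init__(self, raw_conn):
        self._raw = raw_conn

    def __del__(self):
        # return the raw connection for reuse rather than closing it
        pool.append(self._raw)

One caveat worth remembering with this kind of hack: in CPython 2, an object that defines __del__ and ends up inside a reference cycle is never collected; the gc parks the whole cycle in gc.garbage instead. After a little more digging I found a neat trick in the objgraph docs: objgraph can draw the references between objects, so you can see exactly who is holding a reference to the cursor object.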

#!python
import objgraph
# grab a few of the still-alive DictCursor objects and graph who references them
cursors = objgraph.by_type('DictCursor')
objgraph.show_backrefs(cursors[:3], filename='sample-graph.png')

For this to work I had to install the graphviz and xdot modules (objgraph shells out to graphviz's dot tool to render the image).

#!bash
$ sudo pip install graphviz xdot

This cracked it. sample-graph.png showed the objects in memory and how they reference each other.

You can clearly see that the DictCursor has a back reference from the bound method fetchall_items, which is what kept the gc from collecting the cursor. Yes, I was patching the cursor with this utility function at runtime, and the bound method stored on the cursor holds a reference back to the cursor itself, forming a reference cycle.
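The cycle is easy to reproduce in isolation. This standalone sketch (with a fake cursor class, not the real DictCursor) shows the bound method pointing back at the very instance it is attached to:

#!python
from types import MethodType

class FakeCursor(object):
    def fetchall(self):
        return [(1,), (2,)]

def fetchall_items(self):
    return [x[0] for x in self.fetchall()]

c = FakeCursor()
# same patching trick as in get_mysql_cursor
c.fetchall_items = MethodType(fetchall_items, c)

# the bound method's __self__ is the cursor itself:
# instance -> attribute -> bound method -> instance, a reference cycle
print(c.fetchall_items.__self__ is c)   # True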

#!python
try:
    # patch the cursor to add a simple helper
    cursor.fetchall_items = MethodType(fetchall_items, cursor)
    yield cursor
except Exception:
    import traceback
    conn.refresh()
    logger.error("mysql query failed on %s with %s" % (host, traceback.format_exc()))
finally:
    # remove the dynamic method to break the cycle before dropping the cursor
    delattr(cursor, "fetchall_items")
    cursor.close()
    del cursor
    del conn

I wrote a quick fix that deletes the patched method before deleting the cursor, which seems to have fixed the problem.

Lesson Learnt: patching objects at runtime is not good practice. It leads to scenarios like this, where you spend a lot of time chasing a problem you don't understand. It took me around 9 hours to crack this one.
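In hindsight, the same convenience is available without patching anything. For example, a plain module-level helper that takes the cursor as an argument leaves no reference behind:

#!python
def fetchall_items(cursor):
    # same helper as a plain function: no bound method, no reference
    # back to the cursor, nothing for the gc to trip over
    return [x[0] for x in cursor.fetchall()]

# usage:
#   with get_mysql_cursor(host) as cursor:
#       items = fetchall_items(cursor)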