A couple of weeks back I started rewriting one of our core service layers, refactoring a bunch of code and abstracting it into a more reusable fashion. This involved touching some MySQL cursor code, and I accidentally introduced a memory leak in the process. Unfortunately we found the bug about an hour after we pushed to production: the service leaked memory, and it took about an hour for the server to start swapping. I started investigating this behavior and was able to reproduce it on my local machine after writing a quick load emulator with Python threads.
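The emulator itself was nothing fancy: a bunch of threads hammering the service in a loop. A minimal sketch of the idea, where hit_service is a hypothetical stand-in for one request to the service:

```python
import threading

def hit_service():
    # Hypothetical stand-in for one request; the real emulator issued an
    # actual query against the refactored service layer.
    pass

def worker(n_requests, counter, lock):
    for _ in range(n_requests):
        hit_service()
        with lock:
            counter[0] += 1

def run_load(n_threads=20, n_requests=500):
    # Hammer the service from many threads at once; returns the total
    # number of requests issued so progress can be logged.
    counter, lock = [0], threading.Lock()
    threads = [threading.Thread(target=worker, args=(n_requests, counter, lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Leaving something like this running while watching RSS in top was enough to reproduce the growth.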

Running top showed that RSS memory for the process started around 21 MB and grew to 46 MB in 30 minutes. In the past I had used objgraph, a Python object-graph module, to track the types of objects in a Python process. I quickly placed a small snippet of code to print the counts of objects in memory by type.

    #!python
    import objgraph
    objgraph.show_most_common_types()

    dict                       6230
    tuple                      5834
    function                   740

So I got the sense that there were way more dicts in memory than there should be. The dict count did drop every few seconds, though, which meant the Python gc was doing its job. Meanwhile I reverted production to its previous stable build, ran that same stable build locally, and saw that the numbers of dicts and tuples were about the same. So I knew the problem was something else.

I did a little more reading on the web and found that heapy might be able to help me. But I failed to install guppy/heapy on my Mac; it constantly complained "No module named gobject". I gave up on this direction and looked more into objgraph. Then I found this cool function, which outputs the growing data structures.

    #!python
    import objgraph
    objgraph.show_growth(limit=3)

    tuple              3432     +23
    dict               2430     +18
    function           630      +2

This function prints the count of each data structure and how much it grew since the last call. That is cool, but I found the same data structures growing constantly. I left the test running for a couple of minutes while I rushed to fetch my laundry. When I got back, there was a new entry in the terminal.

    tuple              4743     +23
    dict               3703     +18
    DictCursor          734     +1
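As an aside, show_growth is essentially a per-type census of live objects diffed against the previous call. If you ever need the same trick without objgraph, a stdlib-only sketch of the idea:

```python
import gc
from collections import Counter

_last_census = Counter()

def object_growth(limit=3):
    # Count live (gc-tracked) objects by type name and report which types
    # grew since the previous call: the same idea objgraph.show_growth uses.
    global _last_census
    census = Counter(type(o).__name__ for o in gc.get_objects())
    growth = [(name, census[name], census[name] - _last_census[name])
              for name in census if census[name] > _last_census[name]]
    _last_census = census
    growth.sort(key=lambda row: row[2], reverse=True)
    return growth[:limit]
```

Calling object_growth() periodically from the load test gives the same kind of output as above.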

Bingo, that tells me something. So somehow the DictCursor objects were not getting garbage collected. OK, now how to trace down why these objects are not garbage collected? The function which creates and destroys them is this utility function, get_mysql_cursor.

    #!python
    from contextlib import contextmanager
    from types import MethodType

    def fetchall_items(self):
        """Flatten the cursor's single-column rows into a plain list.
            Ex:   [(3434,), (544,), (4545,)]  ->  [3434, 544, 4545]
        """
        return [x[0] for x in self.fetchall()]


    @contextmanager
    def get_mysql_cursor(host):
        # get_mysql_connection_for_host returns a pooled connection to host
        conn = get_mysql_connection_for_host(host)
        cursor = conn.cursor(DictCursor)
        try:
            # patch the cursor to add a simple helper
            cursor.fetchall_items = MethodType(fetchall_items, cursor)
            yield cursor
        except Exception:
            import traceback
            conn.refresh()
            logger.error("mysql query failed on %s with %s" % (host, traceback.format_exc()))
        finally:
            cursor.close()
            del cursor
            del conn
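To make the monkey patch concrete: MethodType(fetchall_items, cursor) binds the module-level function to that one cursor instance, so cursor.fetchall_items() works like a normal method. A tiny demonstration, with FakeCursor as a hypothetical stand-in for the real DictCursor:

```python
from types import MethodType

class FakeCursor(object):
    # Hypothetical stand-in for the real DictCursor.
    def fetchall(self):
        return [(3434,), (544,), (4545,)]

def fetchall_items(self):
    return [x[0] for x in self.fetchall()]

cur = FakeCursor()
# Bind the helper to this one instance, exactly as get_mysql_cursor does.
cur.fetchall_items = MethodType(fetchall_items, cur)
print(cur.fetchall_items())   # [3434, 544, 4545]
```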

The function invokes del on the cursor and the connection once it is done with them; the connection's __del__ method is hacked to return the connection back to the pool. After a little more digging I found this trick in the objgraph docs: objgraph can draw the references between objects as a graph, so we can see who is holding a reference to the cursor object.

    #!python
    import objgraph
    objgraph.show_backrefs(objgraph.by_type('DictCursor')[:3], filename='sample-graph.png')

For this to work I had to install the graphviz and xdot modules.

    #!bash
    $ sudo pip install graphviz xdot

This cracked it. The sample-graph.png showed the objects in memory and how they reference each other.

Object Graph References

You can clearly see that the DictCursor has a back reference from the instance method fetchall_items, which is what keeps the gc from collecting the cursor. Yes, I was patching the cursor with this utility function at runtime, and the function fetchall_items lives in the global scope of the module. Storing the bound method on the instance creates a reference cycle: the cursor's __dict__ holds the bound method, and the bound method holds the cursor back. Reference counting alone can never free such a cycle, and in Python 2 the cycle collector refuses to collect cycles that contain objects with a __del__ method, so cursors patched this way just accumulated.
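The leak is easy to reproduce in isolation. In the sketch below (Cursor and helper are minimal stand-ins, not the real classes), reference counting alone cannot free the patched object after del; only the cycle collector can. Under Python 2, that collector skipped cycles containing objects with a __del__ method, which these database cursors appear to have given their cleanup-on-delete behavior, so such cursors were never reclaimed at all:

```python
import gc
import weakref
from types import MethodType

class Cursor(object):
    # Minimal stand-in for the patched DictCursor.
    pass

def helper(self):
    return self

gc.disable()  # keep the automatic cycle collector out of the way
cur = Cursor()
cur.helper = MethodType(helper, cur)  # cur -> bound method -> cur: a cycle
probe = weakref.ref(cur)

del cur
assert probe() is not None  # refcounting alone cannot free the cycle

gc.collect()                # the cycle collector can (in the absence of __del__)
gc.enable()
assert probe() is None
```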

    #!python
    try:
        # patch the cursor to add a simple helper
        cursor.fetchall_items = MethodType(fetchall_items, cursor)
        yield cursor
    except Exception:
        import traceback
        conn.refresh()
        logger.error("mysql query failed on %s with %s" % (host, traceback.format_exc()))
    finally:
        # remove the patched method so the cursor is no longer part of a cycle
        delattr(cursor, "fetchall_items")
        cursor.close()
        del cursor
        del conn

I wrote a quick fix to delete the patched method before deleting the cursor, and that fixed the problem.

Lesson learnt: it is not good practice to patch objects at runtime. It leads to exactly this kind of scenario, where you spend hours hunting a problem you do not understand. It took me around 9 hours to crack this one.
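In hindsight, the helper never needed to be patched onto instances at all: defining it on a cursor subclass gives every cursor the method through the class, creates no per-instance reference cycle, and leaves nothing to clean up in the finally block. A sketch, with DictCursorStub as a hypothetical stand-in for the real cursor base class:

```python
class DictCursorStub(object):
    # Hypothetical stand-in for the real DictCursor; real code would
    # subclass the actual cursor class instead.
    def fetchall(self):
        return [(3434,), (544,), (4545,)]

class ItemsCursor(DictCursorStub):
    def fetchall_items(self):
        # Same helper, but defined on the class: it is looked up per call
        # and never stored on the instance, so no reference cycle forms.
        return [x[0] for x in self.fetchall()]

print(ItemsCursor().fetchall_items())   # [3434, 544, 4545]
```

The subclass would then be handed to conn.cursor(...) inside get_mysql_cursor instead of patching each cursor.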