Visualizing the Heap on Embedded Systems

Debugging a memory leak can seem trivial compared to debugging heap fragmentation. Faced with a fragmentation problem recently, I decided I really needed to visualize what the heap looked like before I could figure out how to fix it.

Many embedded systems avoid dynamic memory allocation entirely, in part to avoid just this kind of problem. But the larger your system is, the more likely you are to want dynamic allocation to make life easier and more efficient. If your system is large enough, you may end up running embedded Linux and have all the power that entails. But if you’re stuck in the middle, you have an embedded system running some kind of RTOS, using malloc, and not nearly enough memory to do much useful debugging on target. In my case, we do have some tools for memory tracking and enough memory to theoretically use them, but they didn’t do what I wanted.

I was left uninspired after a number of Google searches, so I decided to blaze my own path and develop my own tool. There are three primary steps to visualizing heap data in this approach:

  1. Instrument ([mc]|re)alloc and free calls
  2. Process raw data
  3. Visualize processed data

I had previously worked on the first two steps, and saved my work, but it was really a one-off hack. This time I decided to automate the entire process.

Instrument ([mc]|re)alloc and free calls

The first step involves modifying your code to run on target with sufficient instrumentation to record memory allocation.

Instrument memory allocation calls

The first thing you need to do is modify ([mc]|re)alloc and free calls so that they can do extra processing. If you’ve got the source code to your malloc and free, then you’re done; you can modify it directly. The same applies if you consistently use a malloc wrapper in your code. In my case, neither was true, so I had to do more work.

There are at least two options:

  1. Replace all calls with wrapped calls
  2. Link in replacement versions of ([mc]|re)alloc and free

The second option seemed like more than I wanted to deal with: these functions are not weakly linked, so replacing them would require some linker magic. So, I opted for option 1.

I wrote a simple Python script to replace all calls to malloc, calloc and free with instrumented versions: imalloc, icalloc and ifree. realloc is not used in our codebase. The script is not a lexer, but a few simple regular expressions were sufficient to catch all instances of these calls (and the ones inside comments, for good measure).

import fileinput
import fnmatch
import os
import re
import sys

malloc_re = re.compile(r'\bmalloc[(](.*?)[)]')
calloc_re = re.compile(r'\bcalloc[(](.*?)[)]')
free_re = re.compile(r'\bfree[(](.*?)[)]')

for root, dirs, files in os.walk('.'):
    candidates = fnmatch.filter(files, '*.c')
    for name in candidates:
        sys.stderr.write(name + '\n')  # progress report, kept off stdout

        for line in fileinput.input(os.path.join(root, name), inplace=True):

            line = re.sub(malloc_re, r'imalloc(\1)', line)
            line = re.sub(calloc_re, r'icalloc(\1)', line)
            line = re.sub(free_re, r'ifree(\1)', line)

            # In inplace mode, stdout is redirected into the file, so each
            # (possibly modified) line must be written back out.
            sys.stdout.write(line)


My litmus test for this script was whether the code linked. It did.
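Beyond linking, a quick sanity check is to scan the rewritten sources for calls that escaped the rename. A sketch in Python (leftover_calls is just an illustrative helper, not part of the actual tooling):

```python
import re

# Matches malloc/calloc/free that are NOT already prefixed with 'i' (or any
# other identifier character), so imalloc/icalloc/ifree are ignored.
raw_call_re = re.compile(r'(?<![A-Za-z0-9_])(malloc|calloc|free)\s*\(')

def leftover_calls(text):
    """Return any raw allocator calls still present in a source string."""
    return [m.group(1) for m in raw_call_re.finditer(text)]

print(leftover_calls("ptr = imalloc(12); ifree(ptr);"))  # instrumented: []
print(leftover_calls("ptr = malloc(12); free(ptr);"))    # ['malloc', 'free']
```

Running this over every .c file after the rewrite should come back empty.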

Writing imalloc

The next step is highly platform-dependent and will vary with your needs and restrictions. The basic idea is to write wrapper functions that record the parameters of each call. So, first, just call the appropriate real function inside each wrapper. Next, decide how you want to collect your data. You could store it in memory, but that’s almost certainly not practical given the constraints mentioned in this article. I decided to push the data out through a UART running at 115200 baud. You can potentially use any external interface, though choosing one that requires malloc would likely be a bad idea. I chose the UART simply because it was the first thing that came to mind, but it in fact has a number of desirable features:

  1. Relatively fast (compared to some other choices, like the JTAG debugger messaging interface)
  2. Extremely simple (doesn’t use malloc, simple interface requires little stack use)
  3. Already bound to stdio in this application

The second point is important to keep in mind. You may call malloc all over the place in your code, in the context of different threads. In my case, some of the threads are huge, with 8K of stack space, but others are tiny (a couple hundred bytes) with little room to spare. It is safest to keep the instrumentation as lightweight as possible so you don’t blow the stack. I was overly optimistic the first time I tried this and just called printf directly. I ended up settling on an implementation that seems to work well enough:

static char str[64];
sprintf(str, "m,%8.8x,%8.8x\n", (unsigned)size, (unsigned)ptr);

This generates a line of CSV with the type of call (malloc), the size allocated, and the address of the allocation. A similar implementation for free does the same thing, but leaves the size field blank since it is unknown. My very first attempt put str on the stack, but that blew up in some of the tighter threads. Hopefully you’ll spot the big mistake I made here more quickly than I did. In my meager defense, when I initially did this, I was only interested in the initial mallocs at startup.

Being lazy, I stuck this code in an existing C file and turned off the annoying stop on warnings compiler flag we had enabled so the new wrapper functions would link. I could have been more thorough and either modified stdlib.h or included a new header file, but I didn’t.

This ran fine until multiple threads started calling malloc and my analysis script started reporting errors. Of course, my wrapper function was missing the global lock that malloc uses for thread safety. I added global lock calls around the wrappers to create the final code:

void *imalloc(size_t size) {
    void *ptr;
    static char str[64];   /* static: too big for the smallest thread stacks */

    heap_lock();           /* the same global lock malloc uses; the actual
                              call is platform-specific */
    ptr = malloc(size);
    sprintf(str, "m,%8.8x,%8.8x\n", (unsigned)size, (unsigned)ptr);
    uart_write(str);       /* placeholder for however your UART/stdio
                              actually pushes the record out */
    heap_unlock();

    return ptr;
}
Running the instrumented code on target produces a nice long list of CSV, which I can record indefinitely with a terminal program attached to the UART. The records look like this:

m,0000000c,50000000
m,00000020,50000010
f,,50000000

We’ve now acquired all the data we need from the target. By itself, it’s not very insightful, but once it is processed we can learn a great deal from it.

Processing the Raw Data

The raw data contains only the information known from the parameters of the functions called. Thus, the free lines don’t include the size of the allocation being freed. This can be taken care of easily enough with a small script and a linear search backward through the data. For example, given

f,,50000000

we don’t know how much was allocated at 0x50000000. However, if we step backward through the list, we can look for the first allocation line we find with the same address:

m,0000000c,50000000

We now know the allocation size (12 bytes) and can fill in the missing value and move on to the next entry.
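That backward search can be sketched in a few lines of Python (backfill_size is a hypothetical helper; records follow the m/c/f CSV format from the instrumentation step):

```python
def backfill_size(records, free_index):
    """Walk backward from a free record to the matching allocation's size.

    records is a list of [type, size_hex, address] rows; free rows carry an
    empty size field. Returns the size in bytes, or None if no allocation
    with that address precedes the free.
    """
    addr = records[free_index][2]
    for row in reversed(records[:free_index]):
        if row[0] in ('m', 'c') and row[2] == addr:
            return int(row[1], 16)
    return None  # no matching allocation was recorded

records = [
    ['m', '0000000c', '50000000'],
    ['m', '00000020', '50000010'],
    ['f', '',         '50000000'],
]
print(backfill_size(records, 2))  # the 12-byte allocation at 0x50000000
```

The processing script below achieves the same result with a dictionary of outstanding allocations, which avoids rescanning the whole list for every free.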

My data was interlaced with other debug trace, so I passed it through a line filter first. Then, I wrote another script to backfill the sizes in the free entries:
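The line filter can be as simple as keeping only lines that look like heap records; a minimal sketch (the exact pattern depends on what else your debug trace emits):

```python
import re

# Heap records: m/c/f, an optional hex size, and an 8-digit hex address.
record_re = re.compile(r'^[mcf],[0-9a-fA-F]*,[0-9a-fA-F]{8}$')

def filter_lines(lines):
    """Keep only the heap-tracking records, dropping other debug trace."""
    return [ln for ln in lines if record_re.match(ln.strip())]

trace = [
    'booting rtos...',
    'm,0000000c,50000000',
    'task foo started',
    'f,,50000000',
]
for ln in filter_lines(trace):
    print(ln)
```

With the extraneous trace stripped out, the backfill script that follows can assume clean CSV input.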

import csv
import sys

mem = csv.reader(open(sys.argv[1]), delimiter=',')

mallocd = {}  # address -> the row that allocated it
total = 0     # running total of live bytes, handy for sanity checks
num = 0       # record count, useful when tracking down bad data

for row in mem:
    if row[0] == 'm' or row[0] == 'c':
        # Convert the hex size from the target into decimal
        row[1] = str(int(row[1], 16))
        mallocd[row[2]] = row
        total += int(row[1])
        print(','.join(row))
    elif row[0] == 'f':
        try:
            # Backfill the free with the (negated) size of the matching
            # allocation
            row[1] = str(-int(mallocd[row[2]][1]))
            del mallocd[row[2]]
        except KeyError:
            # No matching allocation was recorded; see the caveats below
            row[1] = str(0)
        total += int(row[1])
        print(','.join(row))
    num += 1

The script parses the raw data as CSV and, for each free entry, finds the malloc or calloc with the matching address (a dictionary of outstanding allocations stands in for the backward linear search). If all goes well, a backfilled version of the CSV file will print to stdout. However, it is important to ensure that your data are valid, and the script above does very little validation. I initially added some extra debugging statements to ensure that everything matched up. There are a number of problems that can occur (and did occur for me):

  • Not all allocations are recorded. This could happen if your IO device (a UART in this case) is not initialized before the first allocation occurs.
  • Allocations don’t match deallocations. This can happen because of missing output (IO not functional during certain periods) or if you get overzealous trying to save memory like I did. When I allocated the string array as a static function variable, it worked great until the RTOS was started and threads started overwriting the data. Don’t forget to acquire a lock!
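Both failure modes are easy to flag with a small consistency pass over the data; here is a sketch under the same CSV assumptions (check_consistency is illustrative, not part of the scripts above):

```python
def check_consistency(records):
    """Report frees of unknown addresses and double-allocated live addresses."""
    live = set()      # addresses currently allocated
    problems = []
    for i, (kind, size, addr) in enumerate(records):
        if kind in ('m', 'c'):
            if addr in live:
                problems.append((i, 'address already live: ' + addr))
            live.add(addr)
        elif kind == 'f':
            if addr not in live:
                problems.append((i, 'free of unknown address: ' + addr))
            live.discard(addr)
    return problems

records = [
    ('m', '0000000c', '50000000'),
    ('f', '', '50000000'),
    ('f', '', '50000040'),  # never allocated: IO was probably down
]
print(check_consistency(records))  # flags the stray free at index 2
```

Anything this reports means the raw capture, not the backfill, needs another look.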

What’s Next

In a subsequent article, I’ll describe how to visualize the processed data.


One Comment

  1. Posted January 28, 2011 at 10:02 am

    Thanks for this. I’ve been wrestling with these issues myself with PS3 and Android game development. I am going to try a scheme that involves having memory addresses on the y axis and t on the x axis. Before I’d been using a simple coloured grid of memory.

    Hopefully this will work.
