on garbage collection

part 2: what is garbage collection

generated at Wed Jul 24 12:02:17 UTC 2019

This is a series of blog posts where we talk about garbage collection: why do it, what is it, and how to implement a simple one.

Last time I've discussed about reference counting and escape analysis as simple forms of automatic memory management. The two forms of management are suitable for simple acyclic data structures, or cyclic data structures where one can determine the parent–child relation.

For complex graphs with cyclic references, it's really hard to use reference counting without encountering memory leaks, this is where garbage collection comes in.

Garbage collection: zero owners, garbage collector does deallocation

Where as in a reference counted model where we track all ownership using a reference count, in garbage collected model, we don't track ownership at all, at least not statically. We can not determine if an object is being owned without runtime information.

While the garbage collector does not care about who owns what, it does care about which unused objects it can deallocate. How does it determine this?

A simple mark–sweep garbage collector

We first give it a set of known roots, objects that we are certain are reachable, we are currently using it, and we don't want it to be deallocated. The roots might be objects referenced by local variables (on the stack) or by global variables.

Next, we wait. Until the heap is x percent full, we can proceed to allocate memory as usual. Only when it reaches that threshold do we start doing garbage collection.

We reached that point when the heap is at its limit, the garbage collection algorithm triggers.

Let's see how it performs on this hypothetical graph of nodes. We will represent known reachable nodes with o and unreachable nodes with x.

Nodes are connected with a -. 2 nodes connected with a = means they have a cyclic reference.

         x
        / \
       x   x
      / \ /
     x   x
            x=x
    x - x

The garbage collection algorithm triggers, and a GC cycle starts. In every GC cycle we do:

  1. Reset all nodes to be unreachable
             x
            / \
           x   x
          / \ /
         x   x
                x=x
        x - x
    
  2. Mark the known root nodes as reachable
             o
            / \
           x   x
          / \ /
         x   x
                x=x
        x - x
    
  3. Recursively walk down the known root nodes, marking it as reachable
             o
            / \
           o   o
          / \ /
         o   o
                x=x
        x - x
    
  4. Sweep up all the nodes that are unreachable
             o
            / \
           o   o
          / \ /
         o   o
    

What you've seen there is a precise mark and sweep garbage collector in action. Precise, meaning we know the exact location of its child. In contrast to this is a conservative style garbage collector, where it has to literally scan the memory of the nodes in order to find its children.

This and derivative garbage collectors stop the world, you can not do anything else when a GC cycle occurs, or else memory corruption would happen.

Although garbage collection, implemented correctly, takes out the pain of managing object relationships, each GC cycle is costly, leading to infamous garbage collection pauses found in many interpreted languages using this paradigm of management. Not that suitable for high latency applications.

An incremental tri-color garbage collector

One solution we could do to lower GC pause times, is to add another state to a GC node:

  1. an unreachable state denoted by x
  2. a reachable state denoted by o
  3. an intermediate state denoted by ?: we know that the node is reachable, however we don't know whether its children are or not

For the garbage collector to be stable and not free out memory that is in use, we'll have to uphold the following invariants:

We must maintain this invariance because the garbage collector is incremental, a GC cycle doesn't fully walk through all of the nodes.

The algorithm looks like this:

  1. Reset all nodes to be unreachable
             x
            / \
           x   x
          / \ /
         x   x
                x=x
        x - x
    
  2. Mark the known root nodes as reachable, and mark its child as intermediate
             o
            / \
           ?   ?
          / \ /
         x   x
                x=x
        x - x
    
  3. Repeatedly walk down the intermediate nodes, marking itself as reachable, its children as intermediates. This will be done per GC cycle.
             o
            / \
           o   o
          / \ /
         ?   ?
                x=x
        x - x
    
             o
            / \
           o   o
          / \ /
         o   o
                x=x
        x - x
    
  4. Once there are no intermediate nodes left, sweep up all the nodes that are unreachable
             o
            / \
           o   o
          / \ /
         o   o
    

What if you want to allocate in the middle of the marking phase? You can allocate nodes as normally, however you must maintain the invariance by either:

In either case, the invariance is namintained and the GC can continue its business.

Beyond simple garbage collectors

Many high performance language interpreters like v8, the JVM,... provide much better garbage collectors than one discussed here.

There are concurrent garbage collectors which performs each GC cycle on a separate thread, only stopping the world when the roots are scanned.

There are parallel garbage collectors which splits their marking/sweeping workloads across multiple cores.

There are even collectors that physically move objects, defragmenting allocated memory so as to increase memory usage efficiency.

You should check out luajit's wiki and v8's development blog to see how those developers implement much more complex GC algorithms than discussed here.

Next time, I'll show you how to implement a simple mark and sweep garbage collector in C++. See you then.