Simulation companion to esp32-controller — 6-DOF quadcopter physics, cascaded PID with anti-windup, 3D visualisation. micrograd.c

project

#Simulation companion to esp32-controller — 6-DOF quadcopter physics, cascaded PID with anti-windup, 3D visualisation. micrograd.c

A C implementation of a scalar-valued autograd engine and a small neural network library built on top of it, inspired by Andrej Karpathy’s micrograd.

GitHub repo: micrograd.c
Last update: 2026/05/21

Motivation

Karpathy’s micrograd is a beautifully minimal Python library — about 150 lines total — that implements backpropagation over a dynamically-built computation graph of scalar values. The point is pedagogical: strip machine learning down until the essential idea is fully visible, with no framework abstractions in the way. It’s one of the clearest possible illustrations of how neural networks actually work.

I wanted to reimplement it in C. Not because C is a better fit for machine learning (it isn’t), but because doing it at a lower level forces you to be explicit about things Python quietly handles for you: heap allocation, pointer lifetimes, object ownership, and the structure of a computation DAG when there is no garbage collector to fall back on.

The result is micrograd.c, a faithful port of the concept with a few additions: leaky ReLU, Adam instead of plain SGD, and clean memory management.

Comparison with micrograd.py

The two implementations share the same core idea: a Value wraps a scalar and records, at creation time, how to propagate gradients backwards through it. Everything else flows from that.

In Python, Value is a class. The _prev set and _backward callable are instance attributes:

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad out._backward = _backward
        return out

The closure over self and other inside _backward is what makes this so clean in Python: the backward function has direct access to the input nodes via lexical capture, without needing to be passed anything.

In C, there are no closures. The backward function is a plain function pointer that receives only the output node:

typedef struct Value {
    double data;
    double grad;
    struct Value* prev[2];          // replaces _prev set — binary ops only
    char op[10];                    // replaces _op string
    void (*backward)(struct Value*); // replaces _backward callable
} Value;

Instead of capturing inputs via a closure, the backward function retrieves them from out->prev[]. This works cleanly for binary operations — the struct always has exactly two slots. It requires being explicit about an assumption micrograd.py handles implicitly: every operation has at most two inputs. Unary operations like relu only use prev[0]; prev[1] is left NULL.

The add_backward example in C directly mirrors the Python version structurally, but the input references are resolved from the struct rather than captured:

// Python: self.grad += out.grad; other.grad += out.grad
void add_backward(Value* out) {
    out->prev[0]->grad += out->grad;
    out->prev[1]->grad += out->grad;
}

The mul_backward shows the same pattern with the product rule — each input’s gradient is the other input’s value times the upstream gradient:

// Python: self.grad += other.data * out.grad; other.grad += self.data * out.grad
void mul_backward(Value* out) {
    out->prev[0]->grad += out->prev[1]->data * out->grad;
    out->prev[1]->grad += out->prev[0]->data * out->grad;
}

The topological sort for backward() is likewise conceptually identical — depth-first search that visits each node exactly once, reversed to get root-to-leaf order — but the Python version builds a list using recursion and a visited set, while the C version uses an explicit stack to avoid blowing the call stack on large graphs, and a flat array instead of a Python set for the visited tracking:

# Python micrograd
def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = 1
    for node in reversed(topo):
        node._backward()

// C micrograd.c — iterative DFS, explicit stack
void build_topo(Value* v, Value** topo, int* topo_size,
                Value** visited, int* visited_size, int cap) {
    Value** stack = malloc(cap * sizeof(Value*));
    int stack_size = 0;
    stack[stack_size++] = v;
    while (stack_size > 0) {
        Value* node = stack[--stack_size];
        // linear scan dedup (visited is small in practice)
        int is_visited = 0;
        for (int i = 0; i < *visited_size; i++)
            if (visited[i] == node) { is_visited = 1; break; }
        if (is_visited) continue;
        visited[(*visited_size)++] = node;
        for (int i = 1; i >= 0; i--)
            if (node->prev[i] != NULL)
                stack[stack_size++] = node->prev[i];
    }
    for (int i = *visited_size - 1; i >= 0; i--)
        topo[(*topo_size)++] = visited[i];
    free(stack);
}

The one place the C version meaningfully diverges is memory management, which Python never has to think about at all. More on that below.

The Autograd Engine

Operations and Their Gradients

Every operation allocates a new Value on the heap, sets its prev[] pointers to its inputs, and registers its backward function. The full set:

Operation	Forward	Backward rule
`add(a, b)`	$a + b$	$\frac{\partial L}{\partial a} \mathrel{+}= \frac{\partial L}{\partial \text{out}}$ , same for $b$
`mul(a, b)`	$a \cdot b$	$\frac{\partial L}{\partial a} \mathrel{+}= b \cdot \frac{\partial L}{\partial \text{out}}$
`power(a, n)`	$a^n$	$\frac{\partial L}{\partial a} \mathrel{+}= n \cdot a^{n-1} \cdot \frac{\partial L}{\partial \text{out}}$
`relu(a)`	$\max(0.01a,\; a)$	$\frac{\partial L}{\partial a} \mathrel{+}= (out > 0\ ?\ 1 : 0.01) \cdot \frac{\partial L}{\partial \text{out}}$
`neg(a)`	$-a$	via `mul(a, -1)`
`sub(a, b)`	$a - b$	via `add(a, neg(b))`
`truediv(a, b)`	$a / b$	via `mul(a, power(b, -1))`

The gradient accumulates with += rather than = because a node can be a prev[] of multiple nodes — for example a weight used in several neurons. Each path through the graph contributes to the gradient independently, and summing them is exactly what the multivariable chain rule requires:

$\frac{\partial L}{\partial w} = \sum_{\text{paths through } w} \frac{\partial L}{\partial \text{path}} \cdot \frac{\partial \text{path}}{\partial w}$

relu is leaky rather than standard — for negative inputs it returns $0.01 \cdot x$ rather than $0$ , so the gradient is $0.01$ rather than $0$ . This prevents neurons from getting stuck with a permanently zero gradient when their pre-activation is strongly negative (the “dying ReLU” problem).

Why Topological Order Matters

The backward pass must process nodes from the loss back towards the inputs — the gradient of a node cannot be computed until all nodes that use it as an input have already been processed and accumulated their contribution into its grad. Topological order guarantees this.

For a concrete example: given $L = (a + b) \cdot c$ , define $d = a + b$ , $L = d \cdot c$ . The correct backward sequence is $L \to d \to a$ and $L \to c$ , not $a \to d \to L$ . Reversing the DFS post-order gives this ordering automatically.

The backward() function first does a counting pass to size its working arrays, then sorts, then propagates:

void backward(Value* v) {
    int n   = count_nodes(v);
    int cap = (n < 16 ? 16 : n) * 2;  // 2x for DFS stack headroom
    Value** topo    = malloc(cap * sizeof(Value*));
    Value** visited = malloc(cap * sizeof(Value*));
    int topo_size = 0, visited_size = 0;
    build_topo(v, topo, &topo_size, visited, &visited_size, cap);
    v->grad = 1.0;
    for (int i = topo_size - 1; i >= 0; i--)
        if (topo[i]->backward != NULL)
            topo[i]->backward(topo[i]);
    free(topo);
    free(visited);
}

Setting v->grad = 1.0 before the loop seeds the pass: $\frac{\partial L}{\partial L} = 1$ by definition.

The count_nodes pre-pass is not optional. The DFS stack can temporarily hold more entries than there are unique nodes, because a node that is reachable by multiple paths gets pushed once per incoming edge before the dedup check runs. Without pre-sizing, the stack allocation is just a guess that may be too small, silently writing past the end of the heap allocation and corrupting adjacent memory.

The Neural Network

Neuron

The Neuron struct owns its weights and bias — allocated once in neuron_init, freed in neuron_free. Calling a neuron builds a fresh sub-graph of transient intermediate nodes:

Value* neuron_call(Neuron *neuron, Value **x) {
    Value **products = malloc(neuron->n_inputs * sizeof(Value*));
    for (int i = 0; i < neuron->n_inputs; i++)
        products[i] = mul(neuron->w[i], x[i]);
    Value *act = products[0];
    for (int i = 1; i < neuron->n_inputs; i++)
        act = add(act, products[i]);
    act = add(act, neuron->b);
    if (neuron->config.nonlin) act = relu(act);
    free(products);   // free the temp array, not the Value nodes
    return act;
}

This computes $\text{out} = \text{ReLU}\!\left(\displaystyle\sum_{i} w_i x_i + b\right)$ . The mul nodes have the weights as prev[0] — the weights are leaves of this sub-graph, connected but persistent. free(products) frees only the temporary pointer array used to collect intermediate nodes during summation, not the Value nodes themselves.

Layer and MLP

A Layer is an array of neurons all receiving the same input vector. The MLP chains layers, freeing each intermediate output array as it moves forward:

Value** mlp_call(MLP *mlp, Value **x) {
    Value **current_x = x;
    for (int i = 0; i < mlp->n_layers; i++) {
        Value **layer_out = layer_call(&mlp->layers[i], current_x);
        if (i > 0) free(current_x); // free the array, not the Values
        current_x = layer_out;
    }
    return current_x;
}

The final layer uses nonlin = 0 — no activation — leaving a raw scalar output for the loss to operate on.

Parameter count for a network with input size $n_0$ and layer sizes $n_1, n_2, \ldots, n_L$ :

$P = \sum_{l=1}^{L} n_l \cdot (n_{l-1} + 1)$

The $+1$ is the bias. For the 3→4→4→1 network used in testing:

$P = 4 \cdot (3+1) + 4 \cdot (4+1) + 1 \cdot (4+1) = 16 + 20 + 5 = 41$

Training

Loss Function

The training uses mean squared error:

$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2$

Built as a computation graph:

Value* total_loss = create_value(0.0);
for (int i = 0; i < 4; i++) {
    Value** output = mlp_call(&mlp, inputs[i]);
    Value*  sq     = power(sub(output[0], targets[i]), 2.0);
    total_loss     = add(total_loss, sq);
    free(output);
}
Value* avg_loss = truediv(total_loss, create_value(4.0));
backward(avg_loss);

Adam Optimiser

micrograd.py uses vanilla SGD: p.data -= lr * p.grad. This version uses Adam, which adapts the effective learning rate per parameter by tracking gradient statistics.

Adam maintains exponential moving averages of the gradient $g_t$ and its square:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$ $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$

At early timesteps $m_t$ and $v_t$ are biased towards zero because they were initialised at zero. The bias-corrected estimates are:

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

The parameter update divides by the square root of the second moment — effectively normalising the step size by a per-parameter measure of gradient variance:

$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$

Parameters with consistently large gradients get a smaller effective step; parameters with noisy or infrequent gradients get a larger one. With $\alpha = 0.02$ , $\beta_1 = 0.9$ , $\beta_2 = 0.999$ , $\varepsilon = 10^{-8}$ , the 3→4→4→1 network reaches an average MSE below $0.002$ within 200 epochs on the 4-sample toy dataset.

Memory Management

This is where porting to C diverges most sharply from the Python original. Python’s garbage collector makes the lifetime of computation graph nodes a non-issue — once there are no more references, memory is reclaimed automatically. In C, every malloc must be paired with a free, and the DAG structure of the computation graph makes this non-trivial.

Ownership Model

Every Value node in the graph falls into one of three categories with distinct ownership:

MLP parameters — weights and biases — are owned by Neuron structs, allocated in neuron_init, and freed by mlp_free. Nothing else should ever free them.

Input and target leaf Values are allocated by the caller and freed by the caller explicitly, after all epochs are done.

Transient intermediate nodes — every mul, add, relu, power result produced during a forward pass — belong to one epoch’s computation graph. They must be freed after backward() runs, before the next epoch begins.

The complication: the transient nodes have prev[] pointers that point into the parameter nodes. A naive free_value that recursively follows prev[] will reach the weights and biases and free them, corrupting the MLP. On the next epoch, mlp_zero_grad writes to those freed blocks — a use-after-free that produces silent corruption before eventually segfaulting.

`free_graph_safe`

The solution is a graph walker that accepts an exclusion set: nodes in the set are treated as persistent boundaries and are neither freed nor recursed into.

static void fgs_helper(Value* v,
                        Value** excl, int n_excl,
                        Value** visited, int* vs, int cap) {
    if (v == NULL) return;
    if (in_set(v, excl, n_excl)) return;  // persistent node — stop here
    if (in_set(v, visited, *vs)) return;  // already freed in this walk
    visited[(*vs)++] = v;
    fgs_helper(v->prev[0], excl, n_excl, visited, vs, cap);
    fgs_helper(v->prev[1], excl, n_excl, visited, vs, cap);
    free(v);
}

The exclusion set is built once before the training loop. It contains both the MLP parameters and the input/target Values — because those are also prev[] leaves of the forward graph, and also must not be freed mid-training:

Value** excl = build_excl(params, n_params, inputs, 4, targets, 4, &n_excl);

for (int epoch = 0; epoch < 200; epoch++) {
    // ... build graph, run backward ...
    free_graph_safe(avg_loss, excl, n_excl);
    // params still alive; next epoch can proceed safely
}

After 200 epochs and all the test functions: zero errors, zero leaks under Valgrind.

Project Structure

.
├── engine.h / engine.c     # Value struct, all operations, backward()
├── nn.h / nn.c             # Neuron, Layer, MLP
├── test_engine.c           # Unit tests: add, mul, power, relu, combined
├── test_nn.c               # Tests: init, forward pass, backward, training loop
└── Makefile

make
./test_engine
./test_nn

Demo

References

Andrej Karpathy — micrograd, the Python original this is ported from.
Diederik P. Kingma, Jimmy Ba — Adam: A Method for Stochastic Optimization (2014).