project
#Simulation companion to esp32-controller — 6-DOF quadcopter physics, cascaded PID with anti-windup, 3D visualisation. micrograd.c
A C implementation of a scalar-valued autograd engine and a small neural network library built on top of it, inspired by Andrej Karpathy’s micrograd.
GitHub repo:
micrograd.c
Last update: 2026/05/21
Motivation
Karpathy’s micrograd is a beautifully minimal Python library — about 150 lines total — that implements backpropagation over a dynamically-built computation graph of scalar values. The point is pedagogical: strip machine learning down until the essential idea is fully visible, with no framework abstractions in the way. It’s one of the clearest possible illustrations of how neural networks actually work.
I wanted to reimplement it in C. Not because C is a better fit for machine learning (it isn’t), but because doing it at a lower level forces you to be explicit about things Python quietly handles for you: heap allocation, pointer lifetimes, object ownership, and the structure of a computation DAG when there is no garbage collector to fall back on.
The result is micrograd.c, a faithful port of the
concept with a few additions: leaky ReLU, Adam instead of plain SGD, and
clean memory management.
Comparison with micrograd.py
The two implementations share the same core idea: a
Value wraps a scalar and records, at creation time, how to
propagate gradients backwards through it. Everything else flows from
that.
In Python, Value is a class. The
_prev set and _backward callable are instance
attributes:
class Value:
def __init__(self, data, _children=(), _op=''):
self.data = data
self.grad = 0
self._backward = lambda: None
self._prev = set(_children)
self._op = _op
def __add__(self, other):
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad += out.grad
other.grad += out.grad out._backward = _backward
return out
The closure over self and other inside
_backward is what makes this so clean in Python: the
backward function has direct access to the input nodes via lexical
capture, without needing to be passed anything.
In C, there are no closures. The backward function is a plain function pointer that receives only the output node:
typedef struct Value {
double data;
double grad;
struct Value* prev[2]; // replaces _prev set — binary ops only
char op[10]; // replaces _op string
void (*backward)(struct Value*); // replaces _backward callable
} Value;
Instead of capturing inputs via a closure, the backward function
retrieves them from out->prev[]. This works cleanly for
binary operations — the struct always has exactly two slots. It requires
being explicit about an assumption micrograd.py handles implicitly:
every operation has at most two inputs. Unary operations like
relu only use prev[0]; prev[1] is
left NULL.
The add_backward example in C directly mirrors the
Python version structurally, but the input references are resolved from
the struct rather than captured:
// Python: self.grad += out.grad; other.grad += out.grad
void add_backward(Value* out) {
out->prev[0]->grad += out->grad;
out->prev[1]->grad += out->grad;
}
The mul_backward shows the same pattern with the product
rule — each input’s gradient is the other input’s value times the
upstream gradient:
// Python: self.grad += other.data * out.grad; other.grad += self.data * out.grad
void mul_backward(Value* out) {
out->prev[0]->grad += out->prev[1]->data * out->grad;
out->prev[1]->grad += out->prev[0]->data * out->grad;
}
The topological sort for backward() is likewise
conceptually identical — depth-first search that visits each node
exactly once, reversed to get root-to-leaf order — but the Python
version builds a list using recursion and a visited set,
while the C version uses an explicit stack to avoid blowing the call
stack on large graphs, and a flat array instead of a Python set for the
visited tracking:
# Python micrograd
def backward(self):
topo = []
visited = set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev:
build_topo(child)
topo.append(v)
build_topo(self)
self.grad = 1
for node in reversed(topo):
node._backward()
// C micrograd.c — iterative DFS, explicit stack
void build_topo(Value* v, Value** topo, int* topo_size,
Value** visited, int* visited_size, int cap) {
Value** stack = malloc(cap * sizeof(Value*));
int stack_size = 0;
stack[stack_size++] = v;
while (stack_size > 0) {
Value* node = stack[--stack_size];
// linear scan dedup (visited is small in practice)
int is_visited = 0;
for (int i = 0; i < *visited_size; i++)
if (visited[i] == node) { is_visited = 1; break; }
if (is_visited) continue;
visited[(*visited_size)++] = node;
for (int i = 1; i >= 0; i--)
if (node->prev[i] != NULL)
stack[stack_size++] = node->prev[i];
}
for (int i = *visited_size - 1; i >= 0; i--)
topo[(*topo_size)++] = visited[i];
free(stack);
}
The one place the C version meaningfully diverges is memory management, which Python never has to think about at all. More on that below.
The Autograd Engine
Operations and Their Gradients
Every operation allocates a new Value on the heap, sets
its prev[] pointers to its inputs, and registers its
backward function. The full set:
| Operation | Forward | Backward rule |
|---|---|---|
add(a, b) |
, same for | |
mul(a, b) |
||
power(a, n) |
||
relu(a) |
||
neg(a) |
via mul(a, -1) |
|
sub(a, b) |
via add(a, neg(b)) |
|
truediv(a, b) |
via mul(a, power(b, -1)) |
The gradient accumulates with += rather than
= because a node can be a prev[] of multiple
nodes — for example a weight used in several neurons. Each path through
the graph contributes to the gradient independently, and summing them is
exactly what the multivariable chain rule requires:
relu is leaky rather than standard — for negative inputs
it returns
rather than
,
so the gradient is
rather than
.
This prevents neurons from getting stuck with a permanently zero
gradient when their pre-activation is strongly negative (the “dying
ReLU” problem).
Why Topological Order Matters
The backward pass must process nodes from the loss back towards the
inputs — the gradient of a node cannot be computed until all nodes that
use it as an input have already been processed and accumulated their
contribution into its grad. Topological order guarantees
this.
For a concrete example: given , define , . The correct backward sequence is and , not . Reversing the DFS post-order gives this ordering automatically.
The backward() function first does a counting pass to
size its working arrays, then sorts, then propagates:
void backward(Value* v) {
int n = count_nodes(v);
int cap = (n < 16 ? 16 : n) * 2; // 2x for DFS stack headroom
Value** topo = malloc(cap * sizeof(Value*));
Value** visited = malloc(cap * sizeof(Value*));
int topo_size = 0, visited_size = 0;
build_topo(v, topo, &topo_size, visited, &visited_size, cap);
v->grad = 1.0;
for (int i = topo_size - 1; i >= 0; i--)
if (topo[i]->backward != NULL)
topo[i]->backward(topo[i]);
free(topo);
free(visited);
}
Setting v->grad = 1.0 before the loop seeds the pass:
by definition.
The count_nodes pre-pass is not optional. The DFS stack
can temporarily hold more entries than there are unique nodes, because a
node that is reachable by multiple paths gets pushed once per incoming
edge before the dedup check runs. Without pre-sizing, the stack
allocation is just a guess that may be too small, silently writing past
the end of the heap allocation and corrupting adjacent memory.
The Neural Network
Neuron
The Neuron struct owns its weights and bias — allocated
once in neuron_init, freed in neuron_free.
Calling a neuron builds a fresh sub-graph of transient intermediate
nodes:
Value* neuron_call(Neuron *neuron, Value **x) {
Value **products = malloc(neuron->n_inputs * sizeof(Value*));
for (int i = 0; i < neuron->n_inputs; i++)
products[i] = mul(neuron->w[i], x[i]);
Value *act = products[0];
for (int i = 1; i < neuron->n_inputs; i++)
act = add(act, products[i]);
act = add(act, neuron->b);
if (neuron->config.nonlin) act = relu(act);
free(products); // free the temp array, not the Value nodes
return act;
}
This computes
.
The mul nodes have the weights as prev[0] —
the weights are leaves of this sub-graph, connected but persistent.
free(products) frees only the temporary pointer array used
to collect intermediate nodes during summation, not the
Value nodes themselves.
Layer and MLP
A Layer is an array of neurons all receiving the same
input vector. The MLP chains layers, freeing each
intermediate output array as it moves forward:
Value** mlp_call(MLP *mlp, Value **x) {
Value **current_x = x;
for (int i = 0; i < mlp->n_layers; i++) {
Value **layer_out = layer_call(&mlp->layers[i], current_x);
if (i > 0) free(current_x); // free the array, not the Values
current_x = layer_out;
}
return current_x;
}
The final layer uses nonlin = 0 — no activation —
leaving a raw scalar output for the loss to operate on.
Parameter count for a network with input size and layer sizes :
The is the bias. For the 3→4→4→1 network used in testing:
Training
Loss Function
The training uses mean squared error:
Built as a computation graph:
Value* total_loss = create_value(0.0);
for (int i = 0; i < 4; i++) {
Value** output = mlp_call(&mlp, inputs[i]);
Value* sq = power(sub(output[0], targets[i]), 2.0);
total_loss = add(total_loss, sq);
free(output);
}
Value* avg_loss = truediv(total_loss, create_value(4.0));
backward(avg_loss);
Adam Optimiser
micrograd.py uses vanilla SGD: p.data -= lr * p.grad.
This version uses Adam, which adapts the effective learning rate per
parameter by tracking gradient statistics.
Adam maintains exponential moving averages of the gradient and its square:
At early timesteps and are biased towards zero because they were initialised at zero. The bias-corrected estimates are:
The parameter update divides by the square root of the second moment — effectively normalising the step size by a per-parameter measure of gradient variance:
Parameters with consistently large gradients get a smaller effective step; parameters with noisy or infrequent gradients get a larger one. With , , , , the 3→4→4→1 network reaches an average MSE below within 200 epochs on the 4-sample toy dataset.
Memory Management
This is where porting to C diverges most sharply from the Python
original. Python’s garbage collector makes the lifetime of computation
graph nodes a non-issue — once there are no more references, memory is
reclaimed automatically. In C, every malloc must be paired
with a free, and the DAG structure of the computation graph
makes this non-trivial.
Ownership Model
Every Value node in the graph falls into one of three
categories with distinct ownership:
MLP parameters — weights and biases — are owned by
Neuron structs, allocated in neuron_init, and
freed by mlp_free. Nothing else should ever
free them.
Input and target leaf Values are allocated by the caller and freed by the caller explicitly, after all epochs are done.
Transient intermediate nodes — every
mul, add, relu,
power result produced during a forward pass — belong to one
epoch’s computation graph. They must be freed after
backward() runs, before the next epoch begins.
The complication: the transient nodes have prev[]
pointers that point into the parameter nodes. A naive
free_value that recursively follows prev[]
will reach the weights and biases and free them, corrupting the MLP. On
the next epoch, mlp_zero_grad writes to those freed blocks
— a use-after-free that produces silent corruption before eventually
segfaulting.
free_graph_safe
The solution is a graph walker that accepts an exclusion set: nodes in the set are treated as persistent boundaries and are neither freed nor recursed into.
static void fgs_helper(Value* v,
Value** excl, int n_excl,
Value** visited, int* vs, int cap) {
if (v == NULL) return;
if (in_set(v, excl, n_excl)) return; // persistent node — stop here
if (in_set(v, visited, *vs)) return; // already freed in this walk
visited[(*vs)++] = v;
fgs_helper(v->prev[0], excl, n_excl, visited, vs, cap);
fgs_helper(v->prev[1], excl, n_excl, visited, vs, cap);
free(v);
}
The exclusion set is built once before the training loop. It contains
both the MLP parameters and the input/target Values —
because those are also prev[] leaves of the forward graph,
and also must not be freed mid-training:
Value** excl = build_excl(params, n_params, inputs, 4, targets, 4, &n_excl);
for (int epoch = 0; epoch < 200; epoch++) {
// ... build graph, run backward ...
free_graph_safe(avg_loss, excl, n_excl);
// params still alive; next epoch can proceed safely
}
After 200 epochs and all the test functions: zero errors, zero leaks under Valgrind.
Project Structure
.
├── engine.h / engine.c # Value struct, all operations, backward()
├── nn.h / nn.c # Neuron, Layer, MLP
├── test_engine.c # Unit tests: add, mul, power, relu, combined
├── test_nn.c # Tests: init, forward pass, backward, training loop
└── Makefile
make
./test_engine
./test_nn
Demo

References
- Andrej Karpathy — micrograd, the Python original this is ported from.
- Diederik P. Kingma, Jimmy Ba — Adam: A Method for Stochastic Optimization (2014).