Almost 4 AM... really time to sleep!
I realized that the grids I've been using have a maximum of 4096 samples (the equivalent of a 64x64 area). It's a value I set myself, but it seems like a good one, and somewhat cache friendly.
In light of that, I think I can proceed to assign primitives to all the buckets they touch (or that their bounding boxes touch) and then have a recursive split decide which portions go in which bucket.
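As a rough illustration of the bucket assignment idea, here is a minimal sketch in Python (not the renderer's actual code). It assumes screen-space bounding boxes given as (x_min, y_min, x_max, y_max) in pixels, and a regular grid of square buckets. The names `buckets_touched` and `assign_primitives` are made up for this example.

```python
import math
from collections import defaultdict

def buckets_touched(bbox, bucket_size, buckets_x, buckets_y):
    """Return the (bx, by) indices of every bucket a bounding box overlaps.

    bbox is (x_min, y_min, x_max, y_max) in pixels. The test is
    conservative: a box that only touches a bucket edge still counts.
    """
    x_min, y_min, x_max, y_max = bbox
    # Clamp the index range to the grid, so off-screen boxes yield nothing.
    bx0 = max(math.floor(x_min / bucket_size), 0)
    by0 = max(math.floor(y_min / bucket_size), 0)
    bx1 = min(math.floor(x_max / bucket_size), buckets_x - 1)
    by1 = min(math.floor(y_max / bucket_size), buckets_y - 1)
    return [(bx, by)
            for by in range(by0, by1 + 1)
            for bx in range(bx0, bx1 + 1)]

def assign_primitives(bboxes, bucket_size, buckets_x, buckets_y):
    """Map each bucket index to the list of primitive ids whose bbox touches it."""
    buckets = defaultdict(list)
    for prim_id, bbox in enumerate(bboxes):
        for key in buckets_touched(bbox, bucket_size, buckets_x, buckets_y):
            buckets[key].append(prim_id)
    return buckets
```

A primitive that straddles a bucket edge ends up in both buckets' lists, which is the point: a later split step can then decide which portion belongs where.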
The subdivision is still rough, and eventually a bucket will have to enforce strict bounds, as each bucket will have its own small frame buffer. This is important for parallelizing work at the thread level: one thread per bucket or so, while generally avoiding inter-bucket data sharing.
Here is what the teapot looks like so far, with bucket edges drawn, and then selectively rendering one bucket at a time.
In this case, one bucket is 128x128 pixels, and splits are applied recursively until a section of a primitive can be rendered in 4096 samples or fewer (with an estimated density of one sample per pixel... but the estimate is currently not so good, as it assumes grids of samples to be generally square).
Eventually, samples that fall outside the bucket will be discarded before being shaded (but after being displaced!). This should be better than using micro-polygons with 4 shared-memory vertices each and with reference counters.
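To make the stopping rule concrete, here is a simplified sketch in Python. It is an assumption-heavy stand-in: a real split would work on the primitive's parametric domain, while this one just halves a screen-space rectangle (x, y, width, height) along its longer side. `estimate_samples` and `split_section` are hypothetical names, and the estimate simply assumes one sample per pixel over the rectangle.

```python
import math

MAX_SAMPLES = 4096  # grid size limit mentioned in the post

def estimate_samples(width, height):
    """Estimate the sample count at one sample per pixel."""
    return math.ceil(width) * math.ceil(height)

def split_section(section, out):
    """Recursively halve a section until it fits in MAX_SAMPLES samples.

    section is (x, y, width, height) in pixels; finished pieces are
    appended to the list `out`.
    """
    x, y, w, h = section
    if estimate_samples(w, h) <= MAX_SAMPLES:
        out.append(section)
        return
    if w >= h:
        # Split along the longer (horizontal) side.
        half = w / 2
        split_section((x, y, half, h), out)
        split_section((x + half, y, w - half, h), out)
    else:
        half = h / 2
        split_section((x, y, w, half), out)
        split_section((x, y + half, w, h - half), out)

pieces = []
split_section((0, 0, 128, 128), pieces)
```

With these numbers, a section covering a whole 128x128 bucket ends up as four 64x64 pieces, each exactly at the 4096 sample limit.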
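The ordering matters here (displace, then cull, then shade), so here is a small sketch of it in Python. Everything in it is illustrative: samples are plain (x, y) tuples, and `displace` and `shade` are caller-supplied placeholder functions, not the renderer's real stages.

```python
def inside_bucket(sample, bucket):
    """True if a sample's (x, y) lies within the bucket's strict bounds.

    bucket is (x0, y0, x1, y1) in pixels, with the max edges exclusive.
    """
    x, y = sample
    x0, y0, x1, y1 = bucket
    return x0 <= x < x1 and y0 <= y < y1

def process_grid(samples, bucket, displace, shade):
    """Displace every sample, drop those outside the bucket, shade the rest."""
    # Displacement comes first because it can move a sample across a
    # bucket edge, in either direction.
    displaced = [displace(s) for s in samples]
    # Cull before shading, so no shading work is spent on discarded samples.
    kept = [s for s in displaced if inside_bucket(s, bucket)]
    return [shade(s) for s in kept]
```

Since each bucket only keeps the samples inside its own bounds, nothing needs to be shared or reference counted across bucket edges.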
Once again, this is all possible because I've forced myself to ignore the current lack of real-time performance... make it work first, see about possible real-time later 8)