PRAXIS - A Maximally GPU-Optimized Digital Organoid

26 Jun, 2025

With the recent uptick in neuromorphic computing and increasing demand for recurrence and statefulness in LLMs, brain-inspired computing architectures are gaining renewed interest. Yet the field faces a fundamental dilemma: how do we reconcile biological realism with the constraints of modern computing hardware?

The neuromorphic computing community has largely bet on custom silicon - spending billions to design and manufacture specialized chips that can efficiently handle spike-based, asynchronous neural networks. But this approach overlooks a critical opportunity: what if algorithmic modifications could achieve brain-like computing capabilities using the massive GPU infrastructure already deployed for deep learning?

The GPU-first philosophy

Deep learning succeeded by abandoning biological fidelity. Neurons became matrix multiplications, spikes became floating-point activations, and complex dendritic trees simplified into linear transformations. This pragmatism unlocked unprecedented scale and capability.

We can propose taking the same pragmatic approach to neuromorphic computing. Rather than forcing biology’s asynchronous, spike-based computation onto synchronous hardware, we can ask: which biological principles truly matter for intelligent learning, and how can we preserve them while maximizing GPU efficiency?

This writing introduces PRAXIS (Practical Recurrent Additive eXpander Intelligence System) — a new framework for building “digital organoids” that captures the essential learning dynamics of biological neural networks while running efficiently on standard GPU hardware. By “digital organoid,” we mean a computational system that preserves the self-organizing, adaptive learning properties of biological neural tissue while being optimized for digital substrates.

Core principles

PRAXIS makes deliberate tradeoffs, prioritizing:

Learning capability over biological accuracy: we preserve the computational principles that enable flexible, continual learning while discarding implementation details that don’t contribute to intelligence
GPU optimization over energy efficiency: unlike biological brains optimized for 20 watt operation, we target the parallel matrix operations that modern accelerators excel at
Mathematical elegance over neural realism: we exploit mathematical properties like Gaussian distributions and expander graphs that have no biological analog but offer tremendous computational advantages
Practical deployment over theoretical purity: every design decision prioritizes real world usability on existing hardware, specifically commodity GPUs

The problems with traditional neuromorphic computing

Traditional approaches to digital brain simulation suffer from a fundamental mismatch between biological operating principles and modern computing architectures. We can first examine why naive brain emulations fail on today’s hardware.

Asynchronous spikes vs. synchronous hardware

Biological neurons communicate through discrete, asynchronous spikes — each neuron fires independently based on its internal dynamics. This event-driven computation is elegant in wetware but catastrophic for GPUs, which achieve their tremendous throughput through lockstep parallel execution. Every spike requires branching logic, breaking the SIMD (Single Instruction, Multiple Data) paradigm that makes GPUs fast.

Information bottleneck

A single spike carries exactly one bit of information: fired or not fired. To convey analog values, biological neurons must integrate spike trains over time, requiring hundreds of computational steps to transmit what a floating point number represents instantly. This temporal coding is massively inefficient on hardware designed for dense matrix operations.

Sparse, irregular connectivity

Real neural networks follow log-normal distributions — most neurons have few connections while a small subset are highly connected hubs. This irregular topology requires sparse matrix operations and indirect memory access patterns that devastate GPU performance. Worse, the number of synapses grows quadratically with neuron count, creating a memory wall that limits practical network sizes.

Biological baggage

Traditional neuromorphic systems faithfully reproduce biological constraints that exist for metabolic, not computational reasons:

Ion channel dynamics: complex differential equations modeling Na+ / K+ pumps
Neurotransmitter kinetics: detailed chemical signaling cascades
Spatial constraints: 3D positioning and connection distances that matter in physical brains but not in silicon

Added together, these features consume enormous computational resources while contributing little to learning capability.

The scaling catastrophe

The combination of these factors creates a perfect bottleneck: as networks grow, they become exponentially slower and more memory intensive. A biologically realistic simulation of even a small cortical column can bring a GPU cluster to its knees, while a transformer model with billions of parameters runs smoothly on a single accelerator.

This is why PRAXIS takes a radically different approach — preserving the computational principles that make brains powerful learners while reimagining their implementation for modern hardware.

Core adaptations for GPU efficiency

To bridge the gap between brain-like computation and GPU architecture, PRAXIS implements several key adaptations. Each represents a deliberate tradeoff — sacrificing biological fidelity for massive gains in computational efficiency while preserving the essential dynamics that enable learning.

1. Synchronous tick-based execution

Instead of asynchronous, event-driven spikes, PRAXIS operates on synchronized global time steps where all neurons update simultaneously. This aligns perfectly with GPUs, where thousands of cores execute the same instruction in parallel. The entire network state can be updated with a single matrix multiplication rather than complex event queues and conditional logic.

While some learning mechanisms that depend on exact spike sequences (fine grained spike timing and precise temporal coding) are lost, the essential learning dynamics remain intact. After all, artificial neural networks have demonstrated powerful learning with synchronous updates. The brain’s asynchrony may be more about biological constraints than computational necessity.

2. Rate-coded analog values

Rather than binary spikes, each neuron outputs a continuous value representing its expected spike rate over the time window of a single computational tick — essentially converting discrete events into analog streams. A single analog value (e.g. signed 8 bit integers) can encode what would require dozens of binary spikes, dramatically increasing information density per compute cycle. Matrix operations on continuous values map directly to tensor cores and benefit from decades of GPU optimization.

Losses on learning mechanisms that rely on individual spike timing precision do occur, but the core principle of importance — neurons with higher activation influencing their targets more strongly - is preserved. Research shows that most neural computation relies on firing rates rather than precise spike timing, especially for learning over longer timescales.

The key insight

These adaptations share a common philosophy: the medium is not the message. Whether information flows through discrete spikes or continuous values, whether updates happen asynchronously or in lockstep — these are implementation details. What matters is preserving the computational principles that enable flexible, powerful learning.

By making these pragmatic choices, PRAXIS can leverage the full power of modern GPU hardware while maintaining the essential dynamics that make biological neural networks such capable learners. The next adaptations take this philosophy even further, reimagining fundamental aspects of neural computation for the realities of silicon.

The Gaussian distribution revolution

While synchronous execution and rate coding have been explored in prior literature, PRAXIS introduces a genuinely novel insight that unlocks dramatic efficiency gains: enforcing Gaussian distributions throughout the network.

The problem with log-normal networks

Biological neural networks — and most artificial networks that use multiplicative operations — naturally develop log-normal distributions. This happens because neural activations typically result from multiplying many factors together (synaptic weights, gating mechanisms, normalization terms), and the product of many positive random variables tends toward a log-normal distribution.

This creates two critical problems:

Memory inefficiency: log-normal distributions have long tails, requiring wide numerical ranges to capture rare but important values - most bits are wasted representing extremely large or small values that rarely occur.
Connectivity explosion: to reliably sum to a target value from log-normally distributed inputs, you need many more connections to average out the high variance - this drives the quadratic growth in synapses that plagues biological simulations.

The Gaussian solution

PRAXIS actively maintains all neuron activations and synaptic weights in a Gaussian (normal) distribution centered at zero with unit variance. This seemingly simple constraint gifts profound implications:

1. Extreme compression without information loss

With values guaranteed to follow predictable ranges, we can confidently use signed 8 bit integers instead of 32 bit floats. Since 99.7% of values fall within ±3 standard deviations, we lose virtually no information while achieving 4 times the memory compression. This isn’t just quantization — it’s quantization with mathematical guarantees.

2. Natural sparsity through concentration of measure

An even more promising gain is that: when inputs follow a Gaussian distribution, you need only k√N connections per neuron (where k ≈ 2-4) rather than N to capture most relevant information. This is due to concentration of measure phenomena in high dimensions. For example, with 1000 neurons, each needs only ~32 connections instead of 1000 — a 30 times reduction in memory and computation. This further reduces computing and (more importantly) memory consumption.

3. Inherent stability

Gaussian distributions are closed under addition — adding Gaussian variables yields another Gaussian. This makes the network naturally stable, preventing the gradient explosions and vanishing that plague deep networks. No careful initialization schemes or gradient clippings are needed.

Implementation: the additive constraint

Maintaining Gaussian distributions requires a fundamental shift: all operations must be additive, not multiplicative. Traditional weight updates such as w = w * gradient are forbidden. Instead, PRAXIS uses pure addition: w = w + delta.

There however exists a fundamental limitation of such systems alone: additive-only networks are not Turing complete. A feedforward network using only addition and linear operations cannot compute arbitrary functions.

In order to achieve Turing-completeness, we must add: 1) recurrence (feedback of the previous hidden state into the next step), and 2) at least one non-polynomial activation function. Such recurrent nets with smooth activations were proven to be Turing complete by Siegelmann & Sontag.

Adding in recurrence

PRAXIS solves this by introducing recurrence in the network. Each neuron maintains a hidden state that feeds back into its next computation. Combined with at least one non-polynomial activation function (specifically, a custom saturating tanh function defined later), this restores Turing completeness while preserving the Gaussian property.

Remarkably, this recurrent state doesn’t just restore computational power - it enables further sophistication of learning through eligibility traces, allowing the network to bridge temporal gaps between causes and effects. The hidden values necessary for recurrence naturally provide the signals needed to compute eligibility traces.

Eligibility traces are essentially a “short-term memory” of recent neural coincidences which can wait for a global signal (e.g. a reward or disincentive) before deciding how the synaptic weight is updated. A biological example of such global signals is dopamine, acting as a signaler to tell neurons which recent activities were worthwhile. This allows an easy expansion to make use of three-factor learning rules - where synaptic updates consider presynaptic activity, postsynaptic activity, and a global modulation signal. This closely mirrors how real brains are thought to make learning progress.

This marriage of Gaussian distributions and recurrent dynamics isn’t just a technical hack. It’s a fundamentally different way of thinking about neural computation — one optimized for the realities of digital hardware rather than the constraints of biology.

Combined together, PRAXIS implements learning through a biologically plausible mechanism combining three factors:

Local synaptic traces: each connection tracks recent coactivation of its connected neurons
Global modulation: a single scalar signal (like dopamine in the brain) indicates whether recent actions were beneficial
Additive updates: weights change through addition rather than multiplication, maintaining the Gaussian distribution

This approach, coined as Additive STDP with Eligibility Traces (A-STDP-ET), allows PRAXIS to learn from various signals — whether from supervised learning, reinforcement learning, or self-supervised objectives—while keeping all computations local and biologically realistic.

Breaking free from physical topology: 0-dimensionalism

The tyranny of space in biological brains

Biological neural networks are prisoners of physics. Neurons occupy physical space, axons must traverse real distances, and connections require metabolic energy to maintain. These constraints fundamentally shape brain architecture:

Distance costs: connecting distant neurons requires long axons that are metabolically expensive and suffer signal degradation. This forces brains to favor local connections, creating modular, hierarchical structures
Volume limitations: the cubic scaling of brain volume versus the quadratic scaling of surface area creates the famous “wiring problem”—why brains are folded and why most connections must be local
Layered architectures: The human cortex’s six-layer structure isn’t computationally optimal — it’s a compromise between connection efficiency and physical constraints. Birds evolved different solutions with nuclear rather than layered organizations, showing these structures are accidents of evolutionary history, not computational necessities

Digital spatial freedom

In silicon, these constraints vanish. There’s no metabolic cost to a long distance connection, no signal degradation over distance, no physical volume to optimize. A synapse between adjacent neurons costs exactly the same as one spanning the entire network.

Yet most neural network architectures slavishly copy biological structures — layers, local connectivity, hierarchical organization. PRAXIS asks: what’s the optimal connectivity pattern when physics doesn’t matter?.

Enter expander graphs

The answer comes from pure mathematics: expander graphs. These structures achieve seemingly impossible efficiency:

Each neuron connects to only k√N others (for N total neurons, where k ≈ 2-4)
Any two neurons are separated by at most ln(N) connections
Information spreads uniformly — no bottlenecks or isolated clusters
The entire topology can be computed algorithmically—no need to store connections

Think of it as the “small world” phenomenon perfected. For example, in a network of a million neurons, each connects to just 1,000 others, yet information travels between any pair in under 20 hops.

Ramanujan graphs: optimal expanders

PRAXIS specifically uses Ramanujan graphs, which are mathematically proven to be optimal expanders. The construction is elegant:

Choose a prime number p ≈ √N for the degree
Use modular arithmetic with primitive roots to deterministically generate connections
Each neuron’s connections can be computed on the fly from its index

No connection matrices to store. No irregular memory access patterns. Just pure mathematical relationships that GPUs can compute in parallel.

The advantages of topology-free design

This radical connectivity pattern delivers multiple benefits:

Uniform information flow: no architectural bottlenecks or preferred pathways - the network has no built in biases about information flow
Scalable by design: connection count grows as O(N^1.5) rather than O(N^2), making million-neuron (or even billion-neuron) networks practical
Memory efficient: connections are computed, not stored - the entire topology fits in a few bytes of parameters
Naturally parallel: each neuron’s connections are independent, enabling perfect GPU parallelization

By abandoning the physical constraints that shape biological brains, PRAXIS achieves a connectivity pattern that’s not just different — it’s mathematically optimal for information processing in digital substrates.

Implementation architecture

With the core principles established, we can now examine how PRAXIS actually computes. The implementation elegantly combines all our design choices into a unified system that runs efficiently on GPUs while maintaining brain-like learning dynamics.

1. Neuron dynamics

Each neuron in PRAXIS maintains a simple state that evolves over discrete time steps. Unlike traditional artificial neurons that instantly compute outputs, PRAXIS neurons have temporal dynamics inspired by biological membrane potentials:

membrane_potential[j] = leak_rate * previous_activation[j]                                       
                      + sum(weight[i,j] * previous_activation[i] for i in pre_synaptic_neurons)  
                      + sum(input_weight[k,j] * external_input[k])                               
                      + bias[j]                                                                  
                                                                                                 
activation[j] = sat_tanh(membrane_potential[j])

Where the key design choices are:

Leak rate: creates memory / momentum between time steps, allowing temporal integration
Saturating tanh: nearly linear near zero (preserving Gaussian properties) but saturates at ±2 to prevent runaway activations
Interpretation: activation[j] represents expected spikes per computational tick, bounded between -2 and +2 to maintain stable dynamics

And the key parameters being:

weight[i,j]: connection strength from neuron i to neuron j
previous_activation: neuron outputs from the previous tick
external_input: sensory or task inputs
bias: per-neuron threshold adjustment

The saturating activation function is crucial:

sat_tanh(x) = 2·tanh(x/2)

This maintains near-linear dynamics in the normal operating range while preventing extreme values.

2. Efficient connectivity via Ramanujan graphs

Rather than storing massive connection matrices, PRAXIS generates connectivity on the fly using mathematical properties:

For a network with N neurons:

Each neuron has degree d = k√N (where k ≈ 2-4)
Choose a prime p that is close to d
For each neuron v, connect to neighbors using standard Ramanujan graph construction algorithms
Double the pattern with a second generator set for redundancy

Benefits:

Deterministic: connections computed from neuron indices
Memory efficient: no connection matrix to store
GPU friendly: connections are contiguous blocks enabling coalesced memory access
Optimal expansion: diameter ≤ ⌈log_p N⌉

3. Learning via Additive STDP with Eligibility Traces (A-STDP-ET)

PRAXIS implements a biologically-inspired learning rule that combines local correlation detection with global modulation signals — similar to how dopamine modulates learning in the brain.

Phase 1: local correlation tracking (per every tick)

Each synapse maintains an eligibility trace that tracks correlation between presynaptic and postsynaptic activity, with the trace decaying over time:

eligibility[i,j] = decay_rate * eligibility[i,j] + pre_neuron[i] * post_neuron[j]

This creates a “memory” of which connections were recently active together. The decay rate should be high enough (e.g., 0.9+) to maintain temporal information across multiple ticks.

Phase 2: weight updates with global modulation

Actual weight changes combine local correlation with global teaching signals:

weight[i,j] = weight[i,j] + learning_term - decay_term

Where:

learning_term = learning_rate * modulator * eligibility[i,j]
decay_term = weight_decay * weight[i,j]

From where:

learning_rate: use small step size values for stable learning
modulator: global teaching signal broadcast to all synapses
eligibility[i,j]: local correlation we tracked

The decay_term prevents weights from growing indefinitely, where:

weight_decay: speed of weight decay, with gentle regularization to prevent weight explosion(s)
weight[i,j]: synaptic weight between presynaptic neuron i and postsynaptic neuron j

The global modulator depends on the learning scenario:

Supervised: error gradient (backpropagated and averaged)
Reinforcement: reward - expected_reward (like dopamine)
Self-supervised: reconstruction error or surprise signal

Phase 3: maintain Gaussian distribution

After updates, weights are clipped to maintain our crucial Gaussian property:

weight[i,j] = clip(weight[i,j], -3 * weight_stddev, +3 * weight_stddev)

This hard boundary prevents outlier weights, maintains int8 quantization quality, and keeps the network stable, all while preserving 99.7% of the distribution.

Note: specific parameter values require empirical tuning. This writing focuses on key insights of algorithmic structure that maintains Gaussian distributions while enabling brain-like learning, not the exact constants.

Why this design?

Purely additive: no multiplication by current weight -> maintains Gaussian distribution
Three-factor rule: local correlation (times) global signal -> biologically plausible
Delayed credit: eligibility traces bridge time between cause and effect
Universal: same mechanism works for supervised, RL, and self-supervised learning

4. Homeostatic regulation

To maintain stability without explicit normalization layers, PRAXIS implements two biological-inspired homeostatic mechanisms:

Variance control

Monitor and adaptively control activation variance to maintain stable dynamics:

Track the running variance of neuron activations and scale weights if needed:

activation_variance = momentum * activation_variance + (1 - momentum) * variance(current_activations)  
                                                                                                       
if activation_variance > 1.2:                                                                          
    all_weights_to_neuron[j] = all_weights_to_neuron[j] / sqrt(activation_variance)

This uses a very slow momentum (close to 1) to avoid disrupting learned patterns. The example variance threshold of 1.2 is projected to allow slight fluctuation above unit variance while preventing runaway dynamics, but further calibrations would reveal more fitting values.

This scales entire columns of the weight matrix uniformly, preserving relative strengths while controlling magnitude.

Firing rate homeostasis

Each neuron adjusts its bias to maintain balanced activation:

mean_rate[j] = exponential_moving_average(activation[j])    
bias[j] -= homeostasis_rate * (mean_rate[j] - target_rate)

The exponential moving average tracks the neuron’s average activation over recent history. The homeostasis mechanism gently corrects biases to keep neurons centered around zero activation, preventing them from becoming permanently silent or saturated.

5. Quantization strategy

PRAXIS’s Gaussian constraints enable aggressive quantization without information loss:

Activations: bounded at ±2 and quantized to signed int8 with scale factor 2/127 - since 95% of a standard normal distribution falls within ±2σ, this captures virtually all activation values with 8-bit precision.
Weights: bounded at ±3 and quantized to signed int8 with scale factor 3/127 - this wider range accommodates 99.7% of the Gaussian distribution, ensuring even rare but important weight values are preserved.
Eligibility traces: Use μ-law int8 encoding for non-uniform quantization - this provides more precision near zero where most eligibility values concentrate, efficiently capturing the temporal correlation dynamics.

During computation, dot products accumulate in int32 to prevent overflow, then dequantize once per matrix multiplication. This approach maximizes compute density—leveraging int8 tensor cores—while maintaining numerical stability throughout the network.

The unified system

These components all work together in synergy:

Gaussian distributions enable extreme quantization
Ramanujan connectivity provides optimal information flow
Eligibility traces bridge temporal gaps
Homeostasis maintains stability without explicit normalization
Everything mapping to efficient GPU operations

The result is a system that learns like a brain but computes like a modern neural network — the best of both worlds.

Scaling to extreme networks

PRAXIS’s design truly shines at massive scale. While traditional approaches hit memory walls with N^2 connectivity, PRAXIS needs only N^1.5 connections - the difference between impossible and practical for billion-neuron networks.

The numbers are compelling: A billion-neuron PRAXIS network requires approximately 32TB with int8 weights (10^9 neurons × √(10^9) connections × 1 byte ≈ 32TB), fitting on a single modern GPU cluster. A traditional fully connected approach would need 1000x more memory.

Three key scaling advantages exist for PRAXIS:

Ramanujan connectivity computed on the fly (zero storage)
Gaussian distributions enable safe int8 quantization at any scale
Perfect GPU parallelism with synchronous updates

The main scaling challenge: global synchrony. Updating billions of neurons per tick pushes memory bandwidth limits. Future work might explore hierarchical time scales or sparse temporal updates.

If the theoretical promise holds, PRAXIS could be the first practical path to human-brain-scale networks on existing hardware - no exotic neuromorphic chips required.

Neuromorphic computing, but NOW

PRAXIS represents a fundamental rethinking of neuromorphic computing by bridging the gap between digital organoids and practical deployment. Where the organoid paradigm promises brain-like learning, PRAXIS delivers it on modern GPUs at scale. As a digital organoid, PRAXIS captures the self-organizing, adaptive properties of biological neural networks while being engineered for silicon substrates.

PRAXIS dissolves the false choice between biological realism and computational efficiency: we don’t need biological fidelity to capture biological learning dynamics. It achieves what traditional approaches cannot:

10-100x efficiency gains through Gaussian distributions and sparse Ramanujan connectivity
Native GPU acceleration via synchronous execution and dense matrix operations
Stable, scalable learning through additive dynamics and mathematical homeostasis
Practical deployment using standard deep learning infrastructure

This isn’t just another neural network architecture — it’s proof that brain-inspired computing can be both biologically meaningful and computationally practical. As we push toward more capable AI, PRAXIS shows that the path forward isn’t choosing between brains or machines, but building systems that learn like one while computing like the other.

With PRAXIS, new possibilities are opened in the field of AI:

Continual learning without catastrophic forgetting
Real-time adaptation to changing environments
Unified learning across supervised, reinforcement, and self-supervised signals
Scalability to billions of neurons on existing hardware

The digital organoid revolution doesn’t require waiting for neuromorphic hardware. It’s here, it’s practical, and it runs on the GPUs you already have. The future of neuromorphic computing isn’t in perfect neural simulation. It’s in pragmatic systems that capture what matters while exploiting every advantage our digital substrate provides. PRAXIS lights the way forward.