~/shanegraffiti.com/research/em-nesy
Shane Graffiti Inc. — Semantic Adversarial Research Division — 2026

EXPECTATION
MAXIMIZATION
FOR NEUROSYMBOLIC
LEARNING

Neurosymbolic (NeSy) models integrate neural networks with symbolic reasoning for robust and interpretable AI. State-of-the-art NeSy models require the symbolic component to be expressed differentiably, which complicates the use of approximate inference. EM-NeSy recasts probabilistic NeSy learning as an instance of the Expectation-Maximization algorithm. In the E-step, the posterior over neurally predicted symbols conditioned on the label is computed via probabilistic inference. In the M-step, neural parameters are updated based on this posterior using gradient descent solely through the neural component — unlocking the full potential of EM for NeSy learning, with no differentiability requirements on the symbolic side.

Division Semantic Adversarial Research
Domain Neurosymbolic AI / EM Learning
Published arXiv 2606.14463 — 2026
Key Result 2.6× faster, 60% less memory
Expectation-Maximization Neurosymbolic Learning Probabilistic Inference Latent Variable Models Belief Propagation ABC Sampling Hard EM Stochastic EM Posterior Marginals Symbolic Inference Visual Sudoku MNISTAdd Warcraft Path-Planning Expectation-Maximization Neurosymbolic Learning Probabilistic Inference Latent Variable Models Belief Propagation ABC Sampling Hard EM Stochastic EM Posterior Marginals Symbolic Inference Visual Sudoku MNISTAdd Warcraft Path-Planning
§ 3.0 The EM-NeSy Algorithm

Standard NeSy learning backpropagates gradients through both the symbolic and neural components. EM-NeSy breaks this dependency. The symbolic component no longer needs to be differentiable — it simply needs to compute a posterior. That posterior becomes the training signal for the neural component.

E-Step — Expectation
Posterior
Computation
The symbolic component computes the posterior distribution over the neurally predicted latent variables conditioned on the observed label. Any inference engine — exact or approximate, differentiable or not — can execute this step. It acts as a probabilistic corrector: abducing labels for the latent variables from the neural prior and the ground truth.
iterate
M-Step — Maximization
Neural
Parameter Update
Gradient descent updates the neural component parameters using the cross-entropy loss between the neural prior and the posterior computed in the E-step. Backpropagation runs only through the neural component — the symbolic component is treated as a constant. No differentiability requirement. No gradient checkpointing overhead.
Loss function — EM-NeSy LEM(θ) = − Σz ℓ(z) · log pNe(z | x, θ) // ℓ(z) = posterior p(z | y, x, θ) — treated as constant in the M-step // gradient flows only through the neural component pNe θ L(θ) = ∇θ LEM(θ) // Theorem 3.1: gradients are identical under exact inference
§ 2.0 Core Concepts

NeSy learning is reframed as a latent variable model. The subsymbolic input X is observed. The symbolic output Z — the neural component's prediction — is latent. The target Y is observed and inferred from Z by the symbolic component encoding background knowledge.

Latent Variable Model
The NeSy predictor is interpreted as p(y|x,θ) = Σ pNe(z|x,θ) · pSy(y|z). The neural component acts as a conditional prior over the latent symbolic variables Z. The symbolic component predicts the target from those latent variables via probabilistic inference.
Probabilistic Corrector
In EM-NeSy, the symbolic component shifts role: instead of predicting label likelihood, it computes the posterior of the latent variables given the label. This is abductive reasoning — inferring what the symbolic state must have been to produce the observed output.
Soft vs. Hard Labels
Abduced labels can be soft — full posterior probability distributions — or hard, meaning MAP assignments: ℓ(z) = I(z = argmax p(z'|y,x,θ)). Soft EM recovers the standard NeSy gradient exactly. Hard EM trades accuracy for speed but can converge to poor local optima on complex tasks.
Multi-M EM
After computing the posterior once in the E-step, multiple gradient steps can be taken in the M-step against the same posterior before re-computing. Each additional backpropagation saves one expensive posterior re-computation, and produces more stable convergence than simply increasing the learning rate.
Stochastic / Online EM
Rather than computing the Q-function over the entire dataset, the E-step processes a single training pair or mini-batch. The M-step performs a partial parameter update scaled by the learning rate. This makes EM-NeSy naturally suited to large-scale and streaming data settings — the standard NeSy setup doesn't accommodate this easily.
Inference Engine Agnosticism
Because the M-step only needs posterior marginals — not gradients of the inference computation — any inference engine works: exact arithmetic circuits, loopy belief propagation, ABC sampling, rejection sampling, Gibbs sampling, MAP solvers. No differentiability required. Plug in whatever is fastest for the task.
§ 4.0 Inference Schemes

The E-step is inference-engine-agnostic. Each approach has tradeoffs. EM-NeSy doesn't require a new approach per method — the framework stays fixed while the inference engine is swapped.

Method
How it works in EM-NeSy
Tradeoff
Exact / Arithmetic Circuit
Knowledge compilation offline. Posterior marginals obtained via a single backward pass through the compiled circuit. EM-NeSy and standard NeSy produce identical gradients — no differentiability advantage here, but Multi-M EM can accelerate convergence.
Exact — slow to compile
Belief Propagation
Sum-product BP computes all posterior marginals during inference — directly usable in the M-step with zero additional computation. Exact on trees (MNISTAdd), approximate on loopy graphs (9×9 Sudoku). Avoids backpropagation through iterative updates entirely.
Fast — loopy graphs degrade
ABC Sampling
Samples from the neural prior pNe(·|x,θ). Each sample weighted by the fraction of symbolic constraints it satisfies — making it tractable even when the exact likelihood is degenerate or intractable (e.g. valid 9×9 Sudoku). No MAP solver required.
Flexible — high variance
Rejection Sampling
Draws from the neural prior, accepts only samples consistent with the label. Recovers REINFORCE up to a 1/p(y|x,θ) scaling factor — EM-NeSy targets the log-likelihood gradient; REINFORCE targets the likelihood gradient. Struggles in large or sparse sampling spaces.
Simple — sparse spaces fail
Hard EM / MAP
Instead of full posteriors, only the MAP assignment is used: ℓ(z) = I(z = argmax posterior). Reduces memory and compute further but introduces bias — the gradient signal ignores all non-MAP configurations. Often converges to poor local optima on complex combinatorial tasks.
Fastest — biased signal
§ 6.0 Experimental Results

Three benchmarks. All combine visual perception with symbolic reasoning under weak supervision — no direct annotation on intermediate symbols, only the final label.

2.6×
Faster training vs BP-std at 100 digits
60%
Peak memory reduction (90→35 GB)
SOTA
ABC-EM on Warcraft 12×12 (98.8%)
T/O
Competitors at 15+ digit MNISTAdd
MNISTAdd — accuracy % / training time (min)
Method4 digits acc%15 digits acc%100 digits acc%4d time15d time100d time
A-NeSI93.2855.88overflow25.7557.55
EXAL90.7162.626.674.1742.44372.20
DeepStochLog (exact)92.70T/OT/OT/OT/O
BP-std (exact)92.1672.7911.602.217.9336.26
BP-EM (ours)92.5374.859.201.331.958.51
Visual Sudoku — accuracy % / training time (min)
Method4×4 acc%9×9 acc%4×4 time9×9 time
A-NeSI87.2059.2046.0873.33
Scallop75.00T/O1.52T/O
SL (exact)86.70T/OT/O
BP-std87.800.5038.23
BP-EM (ours)89.700.5021.23
ABC-EM (ours)86.3053.301.408.24
Warcraft Path-Planning — accuracy % / training time (min)
Method12×12 acc%30×30 acc%12×12 time30×30 time
A-NeSI98.9667.57439.101596.51
I-MLE95.3493.4026.77227.82
EXAL94.1980.8511.184.3
SPL (exact)78.20T/OT/O
ABC-EM (ours)98.8069.607.268.37
§ 7.0 What EM-NeSy Unlocks

Standard NeSy models require inference-specific modifications to stay differentiable under approximate inference. EM-NeSy eliminates that requirement entirely, providing one unified framework that handles any inference method without modification.

01
Any inference engine, zero rewiring
Belief propagation, ABC sampling, rejection sampling, Gibbs, MAP solvers — all work as drop-in E-step implementations. No gradient estimators. No reparameterization tricks. No backprop through iterative updates. The framework stays fixed; the engine swaps.
02
Scalability where exact methods time out
Knowledge compilation fails at 15 digits on MNISTAdd. EM-NeSy with belief propagation scales to 100 digits and runs 4× faster with 60% less peak memory. ABC sampling solves 9×9 Sudoku where loopy BP fails — without requiring a MAP solver or a valid-Sudoku sampler.
03
Exact equivalence under exact inference
Theorem 3.1 proves that under exact inference, the gradient of the standard NeSy loss and the EM-NeSy loss are identical. EM-NeSy is not an approximation of the standard approach — it recovers it exactly while removing the computational overhead of differentiating through the symbolic component.
04
EM variants for free
Because EM-NeSy is a genuine instance of the EM algorithm, all EM variants apply without modification: Generalized EM relaxes the M-step, Stochastic EM handles streaming data, Hard EM reduces cost via MAP inference, Multi-M EM accelerates convergence through multiple M-steps per posterior computation.
~/conclusion
$ query: what does EM-NeSy change // NeSy learning is an instance of the EM algorithm. // The symbolic component no longer needs to be differentiable. // Any inference engine becomes a valid E-step. $ query: what does this cost // Under exact inference: nothing. Gradients are identical. // Under approximate inference: variance scales with inference quality. // Trade accuracy for scale. Tune the inference engine, not the framework. $ query: what is the actual result // 2.6× faster. 60% less memory. Scales where competitors time out. // One framework. Any inference engine. No differentiability required.

THE
SYMBOLIC
COMPONENT
DOESN'T
NEED TO
BACKPROP.