~/shanegraffiti.com/research/em-nesy

Shane Graffiti Inc. — Semantic Adversarial Research Division — 2026

EXPECTATION
MAXIMIZATION
FOR NEUROSYMBOLIC
LEARNING

Neurosymbolic (NeSy) models integrate neural networks with symbolic reasoning for robust and interpretable AI. State-of-the-art NeSy models require the symbolic component to be expressed differentiably, which complicates the use of approximate inference. EM-NeSy recasts probabilistic NeSy learning as an instance of the Expectation-Maximization algorithm. In the E-step, the posterior over neurally predicted symbols conditioned on the label is computed via probabilistic inference. In the M-step, neural parameters are updated based on this posterior using gradient descent solely through the neural component — unlocking the full potential of EM for NeSy learning, with no differentiability requirements on the symbolic side.

Division Semantic Adversarial Research

Domain Neurosymbolic AI / EM Learning

Published arXiv 2606.14463 — 2026

Key Result 2.6× faster, 60% less memory

§ 3.0 The EM-NeSy Algorithm

Standard NeSy learning backpropagates gradients through both the symbolic and neural components. EM-NeSy breaks this dependency. The symbolic component no longer needs to be differentiable — it simply needs to compute a posterior. That posterior becomes the training signal for the neural component.

E-Step — Expectation

Posterior
Computation

The symbolic component computes the posterior distribution over the neurally predicted latent variables conditioned on the observed label. Any inference engine — exact or approximate, differentiable or not — can execute this step. It acts as a probabilistic corrector: abducing labels for the latent variables from the neural prior and the ground truth.

⇄ iterate

M-Step — Maximization

Neural
Parameter Update

Gradient descent updates the neural component parameters using the cross-entropy loss between the neural prior and the posterior computed in the E-step. Backpropagation runs only through the neural component — the symbolic component is treated as a constant. No differentiability requirement. No gradient checkpointing overhead.

Loss function — EM-NeSy L_EM(θ) = − Σ_z ℓ(z) · log p_Ne(z | x, θ) // ℓ(z) = posterior p(z | y, x, θ) — treated as constant in the M-step // gradient flows only through the neural component p_Ne ∇_θ L(θ) = ∇_θ L_EM(θ) // Theorem 3.1: gradients are identical under exact inference

§ 2.0 Core Concepts

NeSy learning is reframed as a latent variable model. The subsymbolic input X is observed. The symbolic output Z — the neural component's prediction — is latent. The target Y is observed and inferred from Z by the symbolic component encoding background knowledge.

Latent Variable Model

The NeSy predictor is interpreted as p(y|x,θ) = Σ p_Ne(z|x,θ) · p_Sy(y|z). The neural component acts as a conditional prior over the latent symbolic variables Z. The symbolic component predicts the target from those latent variables via probabilistic inference.

Probabilistic Corrector

In EM-NeSy, the symbolic component shifts role: instead of predicting label likelihood, it computes the posterior of the latent variables given the label. This is abductive reasoning — inferring what the symbolic state must have been to produce the observed output.

Soft vs. Hard Labels

Abduced labels can be soft — full posterior probability distributions — or hard, meaning MAP assignments: ℓ(z) = I(z = argmax p(z'|y,x,θ)). Soft EM recovers the standard NeSy gradient exactly. Hard EM trades accuracy for speed but can converge to poor local optima on complex tasks.

Multi-M EM

After computing the posterior once in the E-step, multiple gradient steps can be taken in the M-step against the same posterior before re-computing. Each additional backpropagation saves one expensive posterior re-computation, and produces more stable convergence than simply increasing the learning rate.

Stochastic / Online EM

Rather than computing the Q-function over the entire dataset, the E-step processes a single training pair or mini-batch. The M-step performs a partial parameter update scaled by the learning rate. This makes EM-NeSy naturally suited to large-scale and streaming data settings — the standard NeSy setup doesn't accommodate this easily.

Inference Engine Agnosticism

Because the M-step only needs posterior marginals — not gradients of the inference computation — any inference engine works: exact arithmetic circuits, loopy belief propagation, ABC sampling, rejection sampling, Gibbs sampling, MAP solvers. No differentiability required. Plug in whatever is fastest for the task.

§ 4.0 Inference Schemes

The E-step is inference-engine-agnostic. Each approach has tradeoffs. EM-NeSy doesn't require a new approach per method — the framework stays fixed while the inference engine is swapped.

Exact / Arithmetic Circuit

Knowledge compilation offline. Posterior marginals obtained via a single backward pass through the compiled circuit. EM-NeSy and standard NeSy produce identical gradients — no differentiability advantage here, but Multi-M EM can accelerate convergence.

Exact — slow to compile

Belief Propagation

Sum-product BP computes all posterior marginals during inference — directly usable in the M-step with zero additional computation. Exact on trees (MNISTAdd), approximate on loopy graphs (9×9 Sudoku). Avoids backpropagation through iterative updates entirely.

Fast — loopy graphs degrade

ABC Sampling

Samples from the neural prior pNe(·|x,θ). Each sample weighted by the fraction of symbolic constraints it satisfies — making it tractable even when the exact likelihood is degenerate or intractable (e.g. valid 9×9 Sudoku). No MAP solver required.

Flexible — high variance

Rejection Sampling

Draws from the neural prior, accepts only samples consistent with the label. Recovers REINFORCE up to a 1/p(y|x,θ) scaling factor — EM-NeSy targets the log-likelihood gradient; REINFORCE targets the likelihood gradient. Struggles in large or sparse sampling spaces.

Simple — sparse spaces fail

Hard EM / MAP

Instead of full posteriors, only the MAP assignment is used: ℓ(z) = I(z = argmax posterior). Reduces memory and compute further but introduces bias — the gradient signal ignores all non-MAP configurations. Often converges to poor local optima on complex combinatorial tasks.

Fastest — biased signal

§ 6.0 Experimental Results

Three benchmarks. All combine visual perception with symbolic reasoning under weak supervision — no direct annotation on intermediate symbols, only the final label.

2.6×

Faster training vs BP-std at 100 digits

60%

Peak memory reduction (90→35 GB)

SOTA

ABC-EM on Warcraft 12×12 (98.8%)

T/O

Competitors at 15+ digit MNISTAdd

MNISTAdd — accuracy % / training time (min)

Method	4 digits acc%	15 digits acc%	100 digits acc%	4d time	15d time	100d time
A-NeSI	93.28	55.88	overflow	25.75	57.55	—
EXAL	90.71	62.62	6.67	4.17	42.44	372.20
DeepStochLog (exact)	92.70	T/O	T/O	—	T/O	T/O
BP-std (exact)	92.16	72.79	11.60	2.21	7.93	36.26
BP-EM (ours)	92.53	74.85	9.20	1.33	1.95	8.51

Visual Sudoku — accuracy % / training time (min)

Method	4×4 acc%	9×9 acc%	4×4 time	9×9 time
A-NeSI	87.20	59.20	46.08	73.33
Scallop	75.00	T/O	1.52	T/O
SL (exact)	86.70	T/O	—	T/O
BP-std	87.80	0.50	38.23	—
BP-EM (ours)	89.70	0.50	21.23	—
ABC-EM (ours)	86.30	53.30	1.40	8.24

Warcraft Path-Planning — accuracy % / training time (min)

Method	12×12 acc%	30×30 acc%	12×12 time	30×30 time
A-NeSI	98.96	67.57	439.10	1596.51
I-MLE	95.34	93.40	26.77	227.82
EXAL	94.19	80.85	11.1	84.3
SPL (exact)	78.20	T/O	—	T/O
ABC-EM (ours)	98.80	69.60	7.26	8.37

§ 7.0 What EM-NeSy Unlocks

Standard NeSy models require inference-specific modifications to stay differentiable under approximate inference. EM-NeSy eliminates that requirement entirely, providing one unified framework that handles any inference method without modification.

01

Any inference engine, zero rewiring

Belief propagation, ABC sampling, rejection sampling, Gibbs, MAP solvers — all work as drop-in E-step implementations. No gradient estimators. No reparameterization tricks. No backprop through iterative updates. The framework stays fixed; the engine swaps.

02

Scalability where exact methods time out

Knowledge compilation fails at 15 digits on MNISTAdd. EM-NeSy with belief propagation scales to 100 digits and runs 4× faster with 60% less peak memory. ABC sampling solves 9×9 Sudoku where loopy BP fails — without requiring a MAP solver or a valid-Sudoku sampler.

03

Exact equivalence under exact inference

Theorem 3.1 proves that under exact inference, the gradient of the standard NeSy loss and the EM-NeSy loss are identical. EM-NeSy is not an approximation of the standard approach — it recovers it exactly while removing the computational overhead of differentiating through the symbolic component.

04

EM variants for free

Because EM-NeSy is a genuine instance of the EM algorithm, all EM variants apply without modification: Generalized EM relaxes the M-step, Stochastic EM handles streaming data, Hard EM reduces cost via MAP inference, Multi-M EM accelerates convergence through multiple M-steps per posterior computation.

~/conclusion

$ query: what does EM-NeSy change // NeSy learning is an instance of the EM algorithm. // The symbolic component no longer needs to be differentiable. // Any inference engine becomes a valid E-step. $ query: what does this cost // Under exact inference: nothing. Gradients are identical. // Under approximate inference: variance scales with inference quality. // Trade accuracy for scale. Tune the inference engine, not the framework. $ query: what is the actual result // 2.6× faster. 60% less memory. Scales where competitors time out. // One framework. Any inference engine. No differentiability required.

THE
SYMBOLIC
COMPONENT
DOESN'T
NEED TO
BACKPROP.

EXPECTATION MAXIMIZATION FOR NEUROSYMBOLIC LEARNING

EXPECTATION
MAXIMIZATION
FOR NEUROSYMBOLIC
LEARNING