~/shanegraffiti.com/research/genai-industrial-cv

Shane Graffiti Inc. — Semantic Adversarial Research Division — 2026

GENAI
DATA
FOR INDUSTRIAL
VISION

Industrial computer vision needs data before it can build trust, and trust before users will tolerate the imperfections that come with early data. GenAI promises to break that deadlock — but the domain gap between human-centric generative models and featureless industrial parts runs deeper than expected. This review tests three GenAI strategies (personalization, augmentation, CAD synthesis) against MVIP, a 308-class dataset of used car components, and finds that the gap is linguistic as much as visual: the models know "rusty" as a descriptor of dogs, not generators.

Division: Semantic Adversarial Research

Domain: GenAI · Industrial CV · Data Aug.

Published: arXiv 2606.14578, 2026

Key Result: +6.1pp Top-1 with ID data; ID vs near-OoD confidence gap widened from 7pp to 11pp

§ 1.0 The Chicken-and-Egg Dilemma

Industrial AI needs data before it can behave predictably. Users need predictable behavior before they trust AI. And trust is required before users will tolerate the uncertainty of early data collection. The cycle stalls before it starts, and when early models disappoint, the lost trust is almost impossible to recover.

Active learning can ramp up data incrementally, but the performance dip during ramp-up is itself the trust-killer. GenAI offers a different entry point: generate plausible training data before any production deployment, so the first version the user sees already has enough coverage to behave reasonably.

The Dilemma

No data → weak model → user distrust → no deployment → no production data. Active learning iterates but the early dip breaks trust before the loop can close.

Domain Gap

GenAI models train on human-centric internet data. Industrial objects — used car generators, starters, reverse-logistics parts — appear rarely, are described differently, and have visual properties the models have never prioritized.

The Language Problem

MVIP tags objects as "dirty", "shiny", "rusty", "edgy", "round". These adjectives exist in generative model vocabularies, but there they are mapped to animals and faces, not to metal surfaces with industrial wear patterns.

MVIP Challenge

308 classes. Used car components captured from multiple perspectives. Many classes are visually near-identical even for expert human workers. High fidelity is not a preference — it is a requirement for correct annotation.

§ 2.0 Three GenAI Strategies

Each strategy occupies a different point on the fidelity–diversity tradeoff. None is universally better — each has a specific failure mode in the industrial context.

Strategy A · Personalization

Perfusion

Finetunes a text-to-image model on 4 images of an object to learn a new concept token (e.g. generator*). Novel images are then generated by prompting with that token. Results are sensitive to exact prompt phrasing: "an image of a generator*" works; "image of a generator*" produces markedly worse results.

1000 steps · 4 images · perspective-locked · quality peaks near 600 steps, at the loss minimum
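Because the exact sentence structure matters, one way to keep an automated pipeline on the wording that worked is to build every prompt from a single fixed template. The sketch below assumes that approach; `KNOWN_GOOD_TEMPLATE` and `build_prompt` are hypothetical names and are not part of Perfusion.

```python
# Hypothetical helper: not part of Perfusion or any released pipeline.
KNOWN_GOOD_TEMPLATE = "an image of a {token}"  # phrasing reported to work in this review


def build_prompt(concept: str) -> str:
    """Return the known-good prompt for a learned concept token.

    The trailing '*' marks a learned concept, following the notation
    used in this review (e.g. generator*).
    """
    concept = concept.strip()
    if not concept or " " in concept:
        raise ValueError("concept must be a single non-empty word")
    token = concept if concept.endswith("*") else concept + "*"
    return KNOWN_GOOD_TEMPLATE.format(token=token)


print(build_prompt("generator"))  # an image of a generator*
```

Keeping the template in one constant means any phrasing change is a deliberate edit rather than an accidental variation.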

Strategy B · Augmentation

DA-Fusion

Applies latent-space diffusion noise to existing images rather than generating new ones. An intensity factor controls how far the augmentation departs from the original. Low intensity: valid supervised training data. High intensity: useful near-OoD data for contrastive pretraining.

20% intensity → in-distribution · 50% intensity → near-OoD · no masking required
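To give a feel for why 20% and 50% intensity behave so differently, the sketch below computes how much of the original latent survives forward diffusion noising. It rests on assumptions that are not stated in this review: a 1000-step DDPM schedule with linear betas from 1e-4 to 0.02, and intensity read as the fraction of that schedule at which the source image is noised. DA-Fusion's actual schedule and parameterization may differ.

```python
import math

# Assumptions (not from the review): a 1000-step DDPM schedule with linear
# betas from 1e-4 to 0.02, and intensity read as the fraction of that
# schedule at which the source image is noised.
NUM_STEPS = 1000
BETA_START, BETA_END = 1e-4, 0.02


def signal_fraction(intensity: float) -> float:
    """Return sqrt(alpha_bar_t): the weight on the original latent x_0
    in x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be in [0, 1]")
    t = int(round(intensity * NUM_STEPS))
    alpha_bar = 1.0
    for step in range(t):
        beta = BETA_START + (BETA_END - BETA_START) * step / (NUM_STEPS - 1)
        alpha_bar *= 1.0 - beta
    return math.sqrt(alpha_bar)


for intensity in (0.2, 0.5):
    print(f"intensity {intensity:.0%}: signal fraction {signal_fraction(intensity):.2f}")
```

Under these assumptions the weight on the original image falls quickly as intensity grows, which is consistent with low intensity preserving identity and high intensity drifting toward near-OoD.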

Strategy C · CAD Synthesis

NVDIFFRECMC

Reconstructs CAD, texture, and surface normals from a turntable image array. Feeds a Blender simulation pipeline that renders novel perspectives under randomized lighting. Fails on plain featureless surfaces — a very common characteristic in industrial parts.

Blender pipeline · any angle · fails on textureless / featureless objects
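The "any angle, randomized lighting" step can be pictured as sampling a camera position and a light strength per render. The sketch below only samples those parameters; handing them to Blender is not shown, and the function name, radius, and light range are illustrative assumptions rather than values from the review.

```python
import math
import random


def sample_render_setup(rng, radius=1.0):
    """Sample one camera position on a sphere plus a light strength.

    All ranges are illustrative assumptions, not values from the review.
    """
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    # Sample the cosine of the polar angle uniformly so that camera
    # positions are spread evenly over the sphere's surface.
    cos_polar = rng.uniform(-1.0, 1.0)
    sin_polar = math.sqrt(1.0 - cos_polar ** 2)
    camera = (
        radius * sin_polar * math.cos(azimuth),
        radius * sin_polar * math.sin(azimuth),
        radius * cos_polar,
    )
    light_strength = rng.uniform(0.5, 2.0)  # hypothetical relative intensity
    return {"camera": camera, "look_at": (0.0, 0.0, 0.0), "light": light_strength}


rng = random.Random(0)  # fixed seed so the sampled setups are reproducible
setups = [sample_render_setup(rng) for _ in range(5)]
for setup in setups:
    print(setup)
```

Sampling over the full sphere includes views from below, which a turntable capture alone cannot provide.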

§ 3.0 Personalization — What Breaks

Perfusion generates recognizable objects. The fidelity is not sufficient for MVIP's fine-grained classification challenge, but it is sufficient for pretraining. The most important breakage is linguistic.

Failure mode	What happens	Verdict
Prompt sensitivity	"An image of a generator*" works. "Image of a generator*" and "generator* shown in a photo" degrade substantially. Exact sentence structure is load-bearing, which is incompatible with automated pipelines that vary phrasing.	Pipeline risk
Adjective domain gap	Adding MVIP tags like "old" or "dirty" to the prompt worsens results. Chaining a learned adjective concept (Shiny*) to a learned object concept (generator*) causes large fidelity drops: the model maps Shiny to a white cat, not a metallic surface.	Concept mismatch
Perspective lock	Training images must maintain a near-constant viewpoint. Multi-side coverage requires separate finetunes per side: motor-left, motor-right, motor-top. Each finetune costs 1000 steps.	Scale cost
Pretraining value	Despite classification fidelity limits, Perfusion generates diverse visual distributions useful for unsupervised and self-supervised encoder pretraining, widening the latent space before the labeled data ever arrives.	Pretraining ✓
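The scale cost of the perspective lock is simple arithmetic: one finetune per side, 1000 steps each. A minimal sketch, assuming a hypothetical `finetune_plan` helper and the side names from the example above:

```python
# Hypothetical planning helper; the side names follow the review's example.
STEPS_PER_FINETUNE = 1000  # per-finetune cost quoted in the review


def finetune_plan(object_name: str, sides: list[str]) -> dict[str, int]:
    """Map one concept name per side to its training-step budget."""
    return {f"{object_name}-{side}": STEPS_PER_FINETUNE for side in sides}


plan = finetune_plan("motor", ["left", "right", "top"])
total_steps = sum(plan.values())
print(plan)
print("total steps:", total_steps)  # total steps: 3000
```

Three sides of one object already cost 3000 steps, and the budget grows linearly with every additional side and every additional object.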

§ 4.0 Augmentation — The Intensity Split

DA-Fusion's single parameter — diffusion intensity — produces two qualitatively different outputs that serve different training purposes. The split is clean and deliberate.

Intensity as a routing decision

intensity ≤ 20% → in-distribution (ID) training data
    // minor texture changes · annotation-preserving · valid supervised learning signal
intensity ≥ 50% → near out-of-distribution (near-OoD) pretraining data
    // blurred classification boundary · NOT for supervised labels
    // valuable for supervised contrastive learning: keeps similar classes close in latent space
    // also useful: anomaly / defect injection when combined with object masking
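The routing rule can be written as a small function. This is a sketch: the two thresholds come from the review, but the review does not say what happens between 20% and 50%, so the middle band is flagged rather than routed. The function name and return labels are hypothetical.

```python
def route_augmentation(intensity: float) -> str:
    """Route a DA-Fusion output by its diffusion intensity (0.0 to 1.0).

    Thresholds come from the review; the behaviour between them is not
    specified there, so this sketch flags that band for manual review.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be in [0, 1]")
    if intensity <= 0.20:
        return "supervised"      # in-distribution: keep the original label
    if intensity >= 0.50:
        return "contrastive"     # near-OoD: pretraining only, no supervised label
    return "unspecified"         # assumption: hold back until validated


for value in (0.10, 0.20, 0.35, 0.50):
    print(value, route_augmentation(value))
```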

01

Slight augmentation (ID)

High fidelity. Minor texture and semantic drift. Suitable as labeled training data — the small changes reduce overfitting on minor artifacts without altering the classification signal. Object identity and annotation integrity preserved.

02

Heavy augmentation (near-OoD)

Lower fidelity. Severe texture and semantic changes. Classification boundary ambiguous — cannot be used as supervised labels. Strong utility for contrastive pretraining, where the goal is not correct classification but correct latent-space clustering of related objects.

03

Masking-guided defect injection

DA-Fusion supports semantic masking to restrict where diffusion changes are applied. This enables automatic placement and annotation of anomalies or surface defects — a direct path to anomaly detection training data without manual labeling.
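The reason masking yields annotations for free is that the mask that limits the change also locates it. The sketch below illustrates this with plain pixel-space compositing on nested lists. That is a deliberate simplification and not how DA-Fusion applies masks inside the diffusion process; both function names are hypothetical.

```python
def composite_with_mask(original, augmented, mask):
    """Take augmented pixels where mask is 1 and original pixels elsewhere."""
    return [
        [aug if m else orig for orig, aug, m in zip(o_row, a_row, m_row)]
        for o_row, a_row, m_row in zip(original, augmented, mask)
    ]


def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the masked region, or None."""
    points = [(x, y) for y, row in enumerate(mask) for x, m in enumerate(row) if m]
    if not points:
        return None
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


# Toy 3x4 grayscale example; values are illustrative only.
original = [[10, 10, 10, 10], [10, 10, 10, 10], [10, 10, 10, 10]]
augmented = [[99, 99, 99, 99], [99, 99, 99, 99], [99, 99, 99, 99]]
mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]

result = composite_with_mask(original, augmented, mask)
bbox = mask_to_bbox(mask)
print(result[1], bbox)  # [10, 99, 99, 10] (1, 1, 2, 1)
```

The bounding box derived from the mask can serve directly as the defect annotation for the composited image.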

§ 5.0 Experimental Results

ResNet18 trained from scratch on all 308 MVIP classes. Encoder pretrained with supervised contrastive learning, then frozen. Final classification layer finetuned on labeled data. OoD data is measured for its confidence effect, not accuracy — because confidence calibration is the trust signal users see.
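For readers unfamiliar with the pretraining objective, the sketch below implements a supervised contrastive loss in plain Python on toy 2-D embeddings. It follows the common formulation in which same-label samples are positives and all other batch samples form the denominator. The temperature of 0.1 and the toy data are assumptions; the review does not state the exact variant or hyperparameters used.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over L2-normalised embeddings.

    For each anchor, same-label samples are positives and every other
    sample in the batch is in the denominator. Anchors without any
    positive are skipped.
    """
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalise(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n) if a != i
        }
        peak = max(logits.values())  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D embeddings: two tight clusters.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(points, ["a", "a", "b", "b"])
mixed = supcon_loss(points, ["a", "b", "a", "b"])
print(aligned < mixed)  # True
```

The loss is low when labels match the clusters and high when they cut across them, which is the property that keeps similar classes close in latent space.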

+6.1pp

Top-1 accuracy gain with ID data

11pp

Confidence gap (ID vs near-OoD) with ID data, up from 7pp at baseline

−29pp

OoD data drops avg. ID confidence

308

MVIP classes trained end-to-end

Ablation — DA-Fusion ID and OoD dataset extensions to MVIP

Dataset	Top-1 Accuracy	Avg. ID Confidence	Avg. near-OoD Confidence	Conf. Δ (ID − near-OoD)
MVIP (baseline)	71.4%	69%	62%	7pp
+ ID Data (DA-Fusion 20%)	77.5%	76%	65%	11pp
+ OoD Data (DA-Fusion 50%)	75.3%	40%	35%	5pp

The OoD result is not a failure; it is a feature. Training on near-OoD data pulls average confidence down to 40% on ID inputs and 35% on near-OoD inputs, which curbs overconfidence and makes the model safer to deploy: a user sees low confidence and defers to a human. The confidence gap (Δ) between ID and near-OoD data is the operational signal. ID data widens it from 7pp to 11pp, giving users a clearer uncertain/confident split, whereas OoD data narrows it to 5pp.
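The headline numbers follow directly from the ablation table. The sketch below recomputes them; the dictionary keys and the `summarise` helper are hypothetical, while the values are copied from the table.

```python
# Values copied from the ablation table (percent).
RESULTS = {
    "MVIP (baseline)": {"top1": 71.4, "id_conf": 69, "ood_conf": 62},
    "+ ID Data": {"top1": 77.5, "id_conf": 76, "ood_conf": 65},
    "+ OoD Data": {"top1": 75.3, "id_conf": 40, "ood_conf": 35},
}


def summarise(row, baseline):
    """Return the confidence gap and the changes relative to the baseline."""
    return {
        "gap": row["id_conf"] - row["ood_conf"],
        "top1_gain": round(row["top1"] - baseline["top1"], 1),
        "id_conf_change": row["id_conf"] - baseline["id_conf"],
    }


baseline = RESULTS["MVIP (baseline)"]
for name, row in RESULTS.items():
    print(name, summarise(row, baseline))
```

This reproduces the +6.1pp Top-1 gain and the 11pp gap for ID data, and the 29pp drop in average ID confidence for OoD data.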

§ 6.0 CAD Synthesis — Where It Fails

NVDIFFRECMC reconstructs textured 3D models directly from image arrays. When it works, the results are photorealistic and simulation-ready. When it fails, it fails silently — producing geometry that looks plausible but is metrically wrong.

Condition	What happens	Status
Textured surfaces with features	High-fidelity CAD and texture reconstruction. Surface stains, bolts, casting marks, and other key features guide depth inference correctly. Synthetic renders are well-suited for supervised training data.	Works ✓
Plain / featureless surfaces	Depth inference breaks down without surface features to anchor correspondence. Object shape is reconstructed incorrectly. Common in industrial parts: smooth cast housings, flat flanges, uniform painted surfaces.	Fails ✗
Imprecise object masking	MVIP's automatic masks are sometimes imprecise. NVDIFFRECMC is sensitive to mask quality during reconstruction; bad masks produce garbled geometry and texture hallucinations.	Fails ✗
Bottom occlusion	Turntable capture occludes the object bottom. Multiple capture sessions required per object. A robotic arm would automate this, but adds setup cost and removes the simplicity advantage of the turntable approach.	Workaround needed

~/conclusion

$ query: what does GenAI change for industrial CV
// The chicken-and-egg cycle can be interrupted before deployment.
// ID data from DA-Fusion raises accuracy and widens conf. gap.
// OoD data reduces overconfidence, which makes models safer to ship.

$ query: what is the actual domain gap
// Not just visual. The language is wrong too.
// "Rusty" maps to a red dog, not an oxidized generator housing.
// Industrial adjectives need their own concept finetunes.

$ query: which method should you use
// DA-Fusion ID data for supervised training. Highest accuracy gain.
// DA-Fusion OoD for contrastive pretraining. Confidence calibration.
// Perfusion for encoder pretraining diversity. Not for labels.
// NVDIFFRECMC only when surfaces have features to reconstruct from.

RUSTY
IS NOT
A DOG.
TEACH IT.
