~/shanegraffiti.com/research/genai-industrial-cv
Shane Graffiti Inc. — Semantic Adversarial Research Division — 2026

GENAI DATA FOR INDUSTRIAL VISION

Industrial computer vision needs data before it can build trust, and trust before users will tolerate the imperfections that come with early data. GenAI promises to break that deadlock — but the domain gap between human-centric generative models and featureless industrial parts runs deeper than expected. This review tests three GenAI strategies (personalization, augmentation, CAD synthesis) against MVIP, a 308-class dataset of used car components, and finds that the gap is linguistic as much as visual: the models know "rusty" as a descriptor of dogs, not generators.

Division: Semantic Adversarial Research
Domain: GenAI · Industrial CV · Data Aug.
Published: arXiv 2606.14578, 2026
Key Result: ID data gives +6.1pp Top-1 and an 11pp confidence gap (up from 7pp at baseline)
GenAI Data Generation · Image Personalization · DA-Fusion · Perfusion · NVDIFFRECMC · Domain Gap · MVIP Dataset · Latent Diffusion · Out-of-Distribution · Contrastive Learning · CAD Synthesis · Reverse Logistics
§ 1.0 The Chicken-and-Egg Dilemma

Industrial AI needs data before it can behave predictably. Users need predictability before they trust AI. And trust is needed before users will tolerate the uncertainty of early data collection. The cycle stalls before it starts, and when early models disappoint, the lost trust is almost impossible to recover.

Active learning can ramp up data incrementally, but the performance dip during ramp-up is itself the trust-killer. GenAI offers a different entry point: generate plausible training data before any production deployment, so the first version the user sees already has enough coverage to behave reasonably.

The Dilemma
No data → weak model → user distrust → no deployment → no production data. Active learning iterates but the early dip breaks trust before the loop can close.
Domain Gap
GenAI models train on human-centric internet data. Industrial objects — used car generators, starters, reverse-logistics parts — appear rarely, are described differently, and have visual properties the models have never prioritized.
The Language Problem
MVIP tags objects as "dirty", "shiny", "rusty", "edgy", "round". These adjectives exist in generative model vocabularies, but there they are tied to animals and faces, not to metal surfaces with industrial wear patterns.
MVIP Challenge
308 classes. Used car components captured from multiple perspectives. Many classes are visually near-identical even for expert human workers. High fidelity is not a preference — it is a requirement for correct annotation.
§ 2.0 Three GenAI Strategies

Each strategy occupies a different point on the fidelity–diversity tradeoff. None is universally better — each has a specific failure mode in the industrial context.

Strategy A · Personalization
Perfusion
Finetunes a text-to-image model on 4 images of an object to learn a new concept token (e.g. generator*). Novel images are then generated by prompting with that token. Output is sensitive to exact prompt phrasing: "an image of a generator*" works, while "image of a generator*" produces markedly worse results.
1000 steps · 4 images · perspective-locked · peaks at the 600-step loss minimum
Strategy B · Augmentation
DA-Fusion
Applies latent-space diffusion noise to existing images rather than generating new ones. An intensity factor controls how far the augmentation departs from the original. Low intensity: valid supervised training data. High intensity: useful near-OoD data for contrastive pretraining.
20% intensity → in-distribution · 50% intensity → near-OoD · no masking required
Strategy C · CAD Synthesis
NVDIFFRECMC
Reconstructs CAD, texture, and surface normals from a turntable image array. Feeds a Blender simulation pipeline that renders novel perspectives under randomized lighting. Fails on plain featureless surfaces — a very common characteristic in industrial parts.
Blender pipeline · any angle · fails on textureless / featureless objects
§ 3.0 Personalization — What Breaks

Perfusion generates recognizable objects. The fidelity is not sufficient for MVIP's fine-grained classification challenge, but it is sufficient for pretraining. The most important breakage is linguistic.

| Failure mode | What happens | Verdict |
| --- | --- | --- |
| Prompt sensitivity | "An image of a generator*" works. "Image of a generator*" and "generator* shown in a photo" degrade substantially. Exact sentence structure is load-bearing, which is incompatible with automated pipelines that vary phrasing. | Pipeline risk |
| Adjective domain gap | Adding MVIP tags like "old" or "dirty" to the prompt worsens results. Chaining a learned adjective concept (Shiny*) to a learned object concept (generator*) causes large fidelity drops: the model maps Shiny to a white cat, not a metallic surface. | Concept mismatch |
| Perspective lock | Training images must maintain a near-constant viewpoint. Multi-side coverage requires separate finetunes per side: motor-left, motor-right, motor-top. Each finetune costs 1000 steps. | Scale cost |
| Pretraining value | Despite classification fidelity limits, Perfusion generates diverse visual distributions useful for unsupervised and self-supervised encoder pretraining, widening the latent space before the labeled data ever arrives. | Pretraining ✓ |
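Because exact phrasing is load-bearing and each viewpoint needs its own finetune, an automated pipeline would probably want to pin one prompt template and plan jobs per side. The sketch below is one way to do that, not Perfusion's API: `FinetuneJob`, `plan_finetunes`, and the `object-side*` token naming are hypothetical, while the template and the 1000-step cost come from the text above.

```python
from dataclasses import dataclass

# The one phrasing the text reports as working; other phrasings degraded output.
PROMPT_TEMPLATE = "an image of a {token}"
STEPS_PER_FINETUNE = 1000  # per-side finetune cost stated in the text


@dataclass(frozen=True)
class FinetuneJob:
    token: str   # hypothetical naming scheme for the learned concept token
    prompt: str  # always built from the pinned template, never rephrased
    steps: int


def plan_finetunes(object_name: str, sides: list[str]) -> list[FinetuneJob]:
    """Create one finetune job per viewpoint, all sharing the pinned prompt."""
    jobs = []
    for side in sides:
        token = f"{object_name}-{side}*"
        prompt = PROMPT_TEMPLATE.format(token=token)
        jobs.append(FinetuneJob(token, prompt, STEPS_PER_FINETUNE))
    return jobs


jobs = plan_finetunes("motor", ["left", "right", "top"])
total_steps = sum(job.steps for job in jobs)
```

For the three-sided motor example this plans 3000 finetuning steps in total, which is the scale cost the table refers to.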
§ 4.0 Augmentation — The Intensity Split

DA-Fusion's single parameter — diffusion intensity — produces two qualitatively different outputs that serve different training purposes. The split is clean and deliberate.
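To make the intensity parameter concrete, here is a minimal sketch of how an intensity could translate into the amount of noise mixed into an image before denoising. It assumes intensity is used the way img2img-style diffusion pipelines use a strength value (the fraction of the noise schedule the image is pushed through) and it assumes a linear beta schedule with 1000 steps. Neither assumption is confirmed by the source, and `signal_and_noise_scale` is a hypothetical helper.

```python
import math


def signal_and_noise_scale(intensity: float, num_steps: int = 1000,
                           beta_start: float = 1e-4, beta_end: float = 0.02):
    """Return (start_step, signal_scale, noise_scale) for a given intensity.

    Standard forward diffusion: x_t = signal_scale * x_0 + noise_scale * eps,
    with signal_scale = sqrt(alpha_bar_t) and noise_scale = sqrt(1 - alpha_bar_t).
    Assumption: a linear beta schedule, and intensity selects t as a fraction
    of the schedule.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be in [0, 1]")
    start_step = int(round(intensity * num_steps))
    alpha_bar = 1.0
    for t in range(start_step):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        alpha_bar *= 1.0 - beta
    return start_step, math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)


low = signal_and_noise_scale(0.20)   # slight augmentation
high = signal_and_noise_scale(0.50)  # heavy augmentation
```

Under these assumptions the 20% setting keeps far more of the original image signal than the 50% setting, which matches the qualitative split between annotation-preserving and near-OoD outputs.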

Intensity as a routing decision
intensity ≤ 20% → in-distribution (ID) training data
  // minor texture changes · annotation-preserving · valid supervised learning signal
intensity ≥ 50% → near out-of-distribution (near-OoD) pretraining data
  // blurred classification boundary · NOT for supervised labels
  // valuable for supervised contrastive learning: keeps similar classes close in latent space
  // also useful: anomaly / defect injection when combined with object masking
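The routing rule can be written as a small function. This is a sketch: the two thresholds are the ones stated above, but the source says nothing about intensities between 20% and 50%, so that band is returned as "unassigned" here as an explicit assumption. The function and label names are hypothetical.

```python
ID_MAX = 0.20        # "intensity ≤ 20%" from the text
NEAR_OOD_MIN = 0.50  # "intensity ≥ 50%" from the text


def route_augmentation(intensity: float) -> str:
    """Map a diffusion intensity in [0, 1] to the training stage it can feed."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be in [0, 1]")
    if intensity <= ID_MAX:
        return "supervised"               # labels stay valid
    if intensity >= NEAR_OOD_MIN:
        return "contrastive_pretraining"  # never used as supervised labels
    return "unassigned"                   # assumption: the source does not cover this band


routes = {i: route_augmentation(i) for i in (0.10, 0.20, 0.35, 0.50, 0.80)}
```

Keeping the middle band out of both pools is a conservative choice; a real pipeline might instead inspect those samples before deciding.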
01
Slight augmentation (ID)
High fidelity. Minor texture and semantic drift. Suitable as labeled training data — the small changes reduce overfitting on minor artifacts without altering the classification signal. Object identity and annotation integrity preserved.
02
Heavy augmentation (near-OoD)
Lower fidelity. Severe texture and semantic changes. Classification boundary ambiguous — cannot be used as supervised labels. Strong utility for contrastive pretraining, where the goal is not correct classification but correct latent-space clustering of related objects.
03
Masking-guided defect injection
DA-Fusion supports semantic masking to restrict where diffusion changes are applied. This enables automatic placement and annotation of anomalies or surface defects — a direct path to anomaly detection training data without manual labeling.
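The masking idea can be illustrated without any diffusion model: changes are kept only inside the mask, and the same mask yields the annotation for free. The sketch below uses nested lists as stand-in grayscale images and is not DA-Fusion's interface; `composite_with_mask` and `mask_bounding_box` are hypothetical helpers, and a real pipeline may apply the mask in latent space instead of pixel space.

```python
def composite_with_mask(original, augmented, mask):
    """Keep original pixels where mask is 0 and augmented pixels where it is 1."""
    return [
        [aug if m else orig for orig, aug, m in zip(o_row, a_row, m_row)]
        for o_row, a_row, m_row in zip(original, augmented, mask)
    ]


def mask_bounding_box(mask):
    """Return (row_min, col_min, row_max, col_max) of the masked region, or None."""
    coords = [(r, c) for r, row in enumerate(mask) for c, m in enumerate(row) if m]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)


# Toy 3x4 example: 0 = untouched surface, 9 = heavily augmented pixel.
original = [[0] * 4 for _ in range(3)]
augmented = [[9] * 4 for _ in range(3)]
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]

patched = composite_with_mask(original, augmented, mask)
defect_box = mask_bounding_box(mask)  # doubles as the defect annotation
```

The bounding box derived from the mask is what removes the manual labeling step: wherever the change was allowed is, by construction, where the defect is.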
§ 5.0 Experimental Results

A ResNet18 is trained from scratch on all 308 MVIP classes. The encoder is pretrained with supervised contrastive learning and then frozen; the final classification layer is finetuned on labeled data. OoD data is evaluated for its effect on confidence rather than accuracy, because confidence calibration is the trust signal users see.
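For readers unfamiliar with the pretraining objective, the following is a minimal pure-Python sketch of a supervised contrastive loss in its commonly used form: each anchor is pulled toward same-class embeddings and pushed away from all others. The temperature of 0.1 and the toy 2-D embeddings are assumptions, and the source does not say which loss variant or hyperparameters were used.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss, averaged over anchors with at least one positive."""
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        # Log of the softmax denominator, computed stably.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(sum(math.exp(s - max_sim) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_clustered = supcon_loss(points, [0, 0, 1, 1])  # labels agree with geometry
loss_mixed = supcon_loss(points, [0, 1, 0, 1])      # labels cut across geometry
```

The loss is small when same-class embeddings already sit together and large when they do not, which is the latent-space clustering the frozen encoder is meant to carry into classification.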

+6.1pp
Top-1 accuracy gain with ID data
11pp
Confidence gap (ID vs near-OoD) with ID data, up from 7pp at baseline
−29pp
OoD data drops avg. ID confidence
308
MVIP classes trained end-to-end
Ablation — DA-Fusion ID and OoD dataset extensions to MVIP
| Dataset | Top-1 Accuracy | Avg. ID Confidence | Avg. near-OoD Confidence | Conf. Δ |
| --- | --- | --- | --- | --- |
| MVIP (baseline) | 71.4% | 69% | 62% | 7pp |
| + ID Data (DA-Fusion 20%) | 77.5% | 76% | 65% | 11pp |
| + OoD Data (DA-Fusion 50%) | 75.3% | 40% | 35% | 5pp |
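The derived figures in the ablation (the confidence gap per row and the changes against the baseline) can be checked with a few lines of arithmetic. The numbers below are copied from the table; the dictionary layout and the `summarize` helper are only illustrative.

```python
# Values copied from the ablation table (percent).
ABLATION = {
    "MVIP (baseline)": {"top1": 71.4, "id_conf": 69, "ood_conf": 62},
    "+ ID Data (DA-Fusion 20%)": {"top1": 77.5, "id_conf": 76, "ood_conf": 65},
    "+ OoD Data (DA-Fusion 50%)": {"top1": 75.3, "id_conf": 40, "ood_conf": 35},
}


def summarize(table, baseline="MVIP (baseline)"):
    """Compute the confidence gap and baseline deltas, all in percentage points."""
    base = table[baseline]
    summary = {}
    for name, row in table.items():
        summary[name] = {
            "conf_gap_pp": row["id_conf"] - row["ood_conf"],
            "top1_delta_pp": round(row["top1"] - base["top1"], 1),
            "id_conf_delta_pp": row["id_conf"] - base["id_conf"],
        }
    return summary


summary = summarize(ABLATION)
```

This reproduces the headline figures: +6.1pp Top-1 and an 11pp gap with ID data, and a 29pp drop in average ID confidence with OoD data.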

The OoD result is not a failure; it is a feature. Training with near-OoD data pulls average confidence on near-OoD inputs down from 62% to 35%, which is what makes the model safer to deploy: a user sees low confidence and defers to a human. The cost is that average ID confidence also falls, from 69% to 40%, and the gap narrows to 5pp. The confidence gap (Δ) between ID and near-OoD inputs is the operational signal: ID data widens it from 7pp to 11pp, giving users a clearer uncertain/confident split.
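The "low confidence, defer to a human" behavior amounts to a threshold rule on the model's confidence. The sketch below shows the idea only; the 0.5 threshold, the `decide` function, and the action names are hypothetical, and a deployed threshold would have to be tuned against the confidence levels of the specific model.

```python
def decide(prediction: str, confidence: float, threshold: float = 0.5) -> dict:
    """Accept a prediction or hand it to a human, based on model confidence.

    Assumption: confidence is the top-class probability in [0, 1]; the default
    threshold is a placeholder, not a value from the source.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    action = "accept" if confidence >= threshold else "defer_to_human"
    return {"prediction": prediction, "confidence": confidence, "action": action}


confident_case = decide("generator", 0.76)  # near the avg. ID confidence with ID data
uncertain_case = decide("generator", 0.35)  # near the avg. near-OoD confidence with OoD data
```

A wider confidence gap leaves more room to place such a threshold between typical ID and near-OoD confidences.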

§ 6.0 CAD Synthesis — Where It Fails

NVDIFFRECMC reconstructs textured 3D models directly from image arrays. When it works, the results are photorealistic and simulation-ready. When it fails, it fails silently — producing geometry that looks plausible but is metrically wrong.

| Condition | What happens | Status |
| --- | --- | --- |
| Textured surfaces with features | High-fidelity CAD and texture reconstruction. Surface stains, bolts, casting marks, and other key features guide depth inference correctly. Synthetic renders are well-suited for supervised training data. | Works ✓ |
| Plain / featureless surfaces | Depth inference breaks down without surface features to anchor correspondence. Object shape is reconstructed incorrectly. Common in industrial parts: smooth cast housings, flat flanges, uniform painted surfaces. | Fails ✗ |
| Imprecise object masking | MVIP's automatic masks are sometimes imprecise. NVDIFFRECMC is sensitive to mask quality during reconstruction: bad masks produce garbled geometry and texture hallucinations. | Fails ✗ |
| Bottom occlusion | Turntable capture occludes the object bottom. Multiple capture sessions required per object. A robotic arm would automate this, but adds setup cost and removes the simplicity advantage of the turntable approach. | Workaround needed |
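When reconstruction does succeed, the Blender stage described in § 2.0 renders novel perspectives under randomized lighting. A minimal sketch of the sampling step might look as follows. It only generates camera positions and light settings and does not call Blender; the function name, the camera radius, and the light power range are all assumptions chosen for illustration.

```python
import math
import random


def sample_render_configs(n: int, seed: int = 0, radius: float = 1.5):
    """Sample n camera positions on a sphere plus randomized light settings.

    Assumptions: the object sits at the origin, the camera looks at it from a
    fixed distance, and light power is drawn from a placeholder range.
    """
    rng = random.Random(seed)
    configs = []
    for _ in range(n):
        azimuth = rng.uniform(0.0, 2.0 * math.pi)
        # asin of a uniform value gives directions spread evenly over the
        # sphere, including views from below that a turntable cannot capture.
        elevation = math.asin(rng.uniform(-1.0, 1.0))
        camera_xyz = (
            radius * math.cos(elevation) * math.cos(azimuth),
            radius * math.cos(elevation) * math.sin(azimuth),
            radius * math.sin(elevation),
        )
        configs.append({
            "camera_xyz": camera_xyz,
            "light_azimuth": rng.uniform(0.0, 2.0 * math.pi),
            "light_power": rng.uniform(200.0, 1500.0),  # placeholder range
        })
    return configs


render_configs = sample_render_configs(8, seed=42)
```

Seeding the generator keeps a synthetic dataset reproducible, so a render that exposes a reconstruction error can be regenerated and inspected.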
~/conclusion
$ query: what does GenAI change for industrial CV
// The chicken-and-egg cycle can be interrupted before deployment.
// ID data from DA-Fusion raises accuracy and widens conf. gap.
// OoD data reduces overconfidence and makes models safer to ship.

$ query: what is the actual domain gap
// Not just visual. The language is wrong too.
// "Rusty" maps to a red dog, not an oxidized generator housing.
// Industrial adjectives need their own concept finetunes.

$ query: which method should you use
// DA-Fusion ID data for supervised training. Highest accuracy gain.
// DA-Fusion OoD for contrastive pretraining. Confidence calibration.
// Perfusion for encoder pretraining diversity. Not for labels.
// NVDIFFRECMC only when surfaces have features to reconstruct from.

RUSTY IS NOT A DOG.
TEACH IT.