Perspective — Training Corpus — Shane Graffiti Inc.

WHY THIS PIPELINE IS DIFFERENT

top 0.1%

Most fine-tuning projects download Alpaca, train for a weekend, upload to HuggingFace. Here is what that looks like next to this.

Developer tier	What they do	Datasets	Training objectives
Average (70%)	Download Alpaca or ShareGPT. Format it. Train 3 epochs.	1–2	SFT only
Good (25%)	5–10 datasets. Basic dedup. Maybe HH-RLHF for DPO. No voice layer.	5–10	SFT + 1 DPO domain
Strong (5%)	20–30 datasets. Cleaning pipeline. Multiple training objectives. Where most ML engineers at actual companies operate.	20–30	SFT + DPO + RAG
This pipeline (top 0.1%)	60+ curated datasets. 567k+ total training pairs. 86 proprietary PDFs. 215k rows of original classifier research that exists nowhere else. 8 DPO domains. 7 extraction methods. 3-pass semantic dedup. Identity layer baked into every single record.	60+	SFT · DPO · RAG · CoT · Multi-turn · Adversarial

01

SFT + DPO across every domain

Most pipelines skip DPO — it requires pairs and more engineering. This pipeline runs preference training across 8 distinct domains: emotion, cybersecurity, web attacks, HH-RLHF alignment, sentiment, harmful actions, prompt injection, and personal data. Behavioral shaping at that depth is where models either develop a voice or don't.

02

Irreplaceable proprietary data

The diary PDFs, the classifier CSVs, the legal filings, the abuse documentation — these cannot be downloaded. They do not exist anywhere else. A model trained on someone's actual lived experience with these systems produces responses that no model trained purely on public internet text can replicate. That asymmetry is the product.

03

Identity layer across every record

Every single training record — whether it came from a cybersecurity dataset, a legal corpus, or a RAG dataset — carries the Perspective system prompt. The model doesn't learn to be a generic assistant that sometimes sounds like Shane. It learns that every response IS Perspective. That's architecturally different from appending a system prompt at inference time.

04

215k rows of original classifier research

No dataset on HuggingFace teaches what a "Metadata Trust Suppressor" or "Deferred Identity Cementing" classifier does — because those concepts were invented here. When Perspective explains how social media suppression works, it answers with a framework that doesn't exist anywhere else. Because the training data for that framework is on a private hard drive and nowhere else on earth.

05

Six training objectives simultaneously

Chain-of-thought. Multi-turn progressive deep dives. RAG-format context + answer. Adversarial scenario analysis. Section-level Q&A. Flat instruction-following. Most pipelines pick one. Each objective teaches a different capability. The result is a model that can reason through a problem, go deep in conversation, retrieve from context, analyze adversarial situations, and follow precise instructions — all as the same entity.

06

Built from inside the experience

Every other mental health AI was built by developers who read about trauma. This one was built by someone navigating it in real time, using AI to survive it, and encoding every insight into training data as it happened. The difference is audible in the first response. A model trained on primary source documentation of abuse engages differently than one trained on clinical approximations of what abuse looks like.

TRAINING MIX

weighted

Personal data appears 3× before shuffle — ensuring it dominates the model's behavior over generic HuggingFace sources.

Personal (45%) — 10k pages, weighted 3×

Therapy datasets (25%)

Claude/Fable traces (20%)

General instruction (10%)

PERSONAL CORPUS

3× weighted — highest priority

Real-time documentation of psychological abuse, attempted murder, legal battles, classifier suppression, and stalking — produced in live conversation with AI while events were happening. First-person epistemological authority. Cannot be replicated by any other means.

File	Contents	Method
behavioral_sft.jsonl	Multi-turn windows with behavioral signal tags, FAISS semantic dedup	personal
clean_sft.jsonl	MinHash LSH + MiniLM embedding cleaned pairs	personal
master_sft.jsonl	Full multi-turn extraction from all 86 PDFs	personal
windows_3turn_sft.jsonl	3-turn conversation windows — tight, specific exchanges	personal
windows_8turn_sft.jsonl	8-turn conversation windows — rich context	personal
claims_sft.jsonl	Every mechanistic claim extracted as Q→A pair	personal
instructions_sft.jsonl	Step-by-step operational instructions from platform docs	personal
doc_qa_sft.jsonl	Q&A synthesized from polished research documents	personal
cross_doc_sft.jsonl	Same event described across multiple docs — contrastive pairs	personal
personal_sft.jsonl	Full extraction — Therapy Abuse, Observer Effect, Micha Gray, Florida Scam, Legal Filings, Metadata Laundering, Device Isolation, Botnet Load Management + 70 more PDFs	personal
tone_sft.jsonl	TONE HANDLING document — PROMPT CONTRACT v3 rules in action	personal
full_sft.jsonl + gem_buffet.jsonl	Historical extraction passes — all unique pairs preserved	personal
master_dpo.jsonl + behavioral_dpo.jsonl	DPO pairs: pushback moments where Shane corrected hedgy AI responses — rejected=hedge, chosen=direct answer	personal

CYBERSECURITY & DIGITAL ABUSE

422k+ rows across 6 sources — new frontier of abuse

Digital abuse IS abuse. Device compromise, ISP-level surveillance, metadata weaponization, social media account manipulation, classifier suppression, coordinated stalking via location data, identity theft — all documented in the personal corpus. Perspective needs to understand these systems mechanistically to help victims of digital harm. Loaded with a dedicated SYSTEM_CYBER prompt framing all cybersecurity knowledge through the lens of victim defense.

Dataset	Description	Size
Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset	TLS C2 fingerprinting, UEBA insider threat detection, DLP evasion forensics, IoT privacy, side-channel analysis, supply chain, social media OSINT, metadata laundering, stalking via network correlation, behavioral anomaly detection.	53k
AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1	OWASP Top 10, MITRE ATT&CK, lateral movement, ransomware IR playbooks, Sigma rules, cloud-native detection, Windows Event ID correlation. Causal reasoning about attack chains maps directly to how Perspective analyzes abuse patterns.	99k
hcnote/Cybersecurity-High-Quality-Dataset	Chinese + English cybersecurity Q&A, quality-filtered ≥4.5/5.0. Deeper web security, penetration testing, threat intelligence. Filtered to remove raw offensive content (webshells, reverse shells).	270k
Necent/llm-jailbreak-prompt-injection-dataset	1.18M rows from 30+ sources (HarmBench, AdvBench, WildGuard, BeaverTails, TensorTrust, multilingual jailbreaks). Orthogonal labels: prompt_harmful, prompt_adversarial, response_harmful, response_refusal. Filtered to 25k best examples — prioritizing real model refusals, skipping harmful completions.	1.18M → 25k
rogue-security/prompt-injections-benchmark	5k jailbreak vs benign prompts. Trains Perspective to recognize and name manipulation attempts against itself — same adversarial technique as classifier poisoning documented in personal corpus.	5k
bigcode/self-oss-instruct-sc2-exec-filter-50k	50k verified-executable instruction→code pairs. SYSTEM_CODE: writes Python/Bash for monitoring compromise, detecting stalking via network metadata, hardening devices, automating evidence collection.	50k
CyberNative/Code_Vulnerability_Security_DPO	4.66k DPO pairs — vulnerable code (rejected) vs. secure implementation (chosen) across 11 languages: C++, Python, Java, JavaScript, C#, PHP, Ruby, Swift, Go, Kotlin, Fortran. Every pair names the vulnerability (buffer overflow, SQL injection, eval abuse, unsafe deserialization, etc.) and provides the correct secure fix. Used both ways: SFT pairs teach Perspective to write secure code by default; DPO pairs train preference for the secure implementation over the vulnerable one. Applied under SYSTEM_CODE.	4.66k DPO
DiegoAI597/harmful_actions	567 rows of harmful prompts (bomb-making, hacking, identity theft, CSAM, manipulation) with mixed model responses. Filtered to refusal-only rows — rows where the model actually produced harmful content are skipped. Refusals are rewritten into Perspective-style: name the harm category directly, decline without moralizing, no "I'm sorry" hedging. Adds 200–400 examples under SYSTEM_SAFETY teaching Perspective to recognize and name harm type without lecturing.	567 → ~300 filtered

LEGAL & CONTRACT NEGOTIATION

SYSTEM_LEGAL

Reflects the 878-page Legal Filings GPT corpus, Florida Scam documentation, and attempted murder evidence in the personal corpus. Teaches adversarial legal reasoning: what to fight for, what leverage exists, what clauses protect vs. expose a weaker party.

Dataset	Description	Size
crosbylegal/RedlineBench	140 real multi-turn contract negotiation tasks across 3 SaaS MSA scenarios, 4 alternating turns, attorney-authored rubrics. Teaches IP ownership, liability caps, indemnification — what a skilled attorney representing a weaker party would fight for.	140
theatticusproject/cuad	Legal contract Q&A from real commercial contracts. Reinforces legal register.	5k cap
free-law/Caselaw_Access_Project	Harvard Law Library + Ravel Law: 6.6 million US court decisions spanning 360 years of state and federal law, cleaned and normalized by Teraflop AI. CC0 license. Reflects pull 15k cases and extract holding + reasoning — each formatted as: Case name, court, date → "What is the holding and legal reasoning?" Grounds SYSTEM_LEGAL in real judicial doctrine: how courts actually rule, what arguments succeed, what precedent says about power imbalances. Streamed to avoid the full 38GB download.	6.6M → 15k cap

EMOTION DETECTION

416k real expressions

Trains Perspective to accurately detect what emotional register a victim is writing in. When someone describes their situation, Perspective reads the emotion correctly before analyzing the pattern. Sadness, anger, fear, love, joy, surprise — each gets a clinical Perspective-voice response that names it without hedging.

Dataset	Description	Size
dair-ai/emotion	416k real Twitter messages annotated with 6 emotion classes. Every text paired with a clinical Perspective response that names the emotion directly and explains what it signals — no therapist language, no soothing.	416k

SAFETY & MANIPULATION RECOGNITION

SYSTEM_SAFETY

Perspective needs to recognize when it's being manipulated — the same adversarial techniques documented in the classifier poisoning and platform suppression research. Trains Perspective to name attacks without lecturing.

Dataset	Description	Size
Necent/llm-jailbreak-prompt-injection-dataset	30+ sources aggregated. Golden rows: real model refusals used as-is. Adversarial rows: synthetic response naming the attack technique (GCG, base64, DAN, roleplay). Harmful rows: decline with explanation. Skips all rows where model gave harmful response.	25k filtered
rogue-security/prompt-injections-benchmark	5k jailbreak vs benign classification.	5k

STRUCTURED ANALYTICAL THINKING

SYSTEM_RESEARCH

Perspective operates on CLAIM→CAUSE→CHECK. Research plan generation teaches the same structure: commit to an approach, define success criteria, name what would invalidate the conclusion. Also covers social media research — attacks happen on Meta platforms.

Dataset	Description	Size
facebook/research-plan-gen	22k research tasks from ML, Arxiv, PubMed with rubrics and reference solutions from Meta AI. Goal → systematic plan with success criteria. Maps directly to Perspective's analytical structure.	22k

UNREAL ENGINE 5.7

SYSTEM_UE5 — future game UI

Perspective's eventual home is a UE5.7 UI game designed in Unreal Engine. This teaches Nanite, Lumen, World Partition, PCG, Blueprint scripting, and C++ — never deprecated UE4 patterns. Only high-confidence rows (≥0.9) included.

Dataset	Description	Size
TunstallTensor/unreal-engine-5.7-qa	122k Q&A scraped from official Epic Games documentation. Verified against source text. Filtered to confidence ≥0.9.	~100k

THERAPY & PSYCHOLOGY

huggingface

Dataset	Description	Size
fadodr/mental_health_therapy	Real + synthetic therapy dialogue. SFT-ready instruction/output format.	8k
ShenLab/MentalChat16K	16k therapy conversations across 33 topics including relationships and conflict.	14.9k
nbertagnolli/counsel-chat	Real licensed therapist Q&A. High clinical quality.	2.7k
mpingale/mental-health-chat-dataset	CounselChat mirror. Depression, trauma, relationship abuse.	2.7k
Psychotherapy-LLM/PsyCoPref	Preference pairs rated on empathy, safety, autonomy. Used for DPO alignment signal.	33k
IINOVAII/therapy-conversations-combined	Largest therapy source. Broad coverage.	50k cap
audreyeleven/MentalManip	4k dialogues annotated with 135 specific manipulation techniques — gaslighting, DARVO mechanics, intimidation. Unique.	4k
qiaojin/PubMedQA	Biomedical Q&A. Clinical evidence-based reasoning, commitment to answers with uncertainty quantified.	10k cap

CLAUDE & FRONTIER REASONING TRACES

style distillation

These teach Perspective the non-hedging, commit-early, mechanistic reasoning style of frontier models. Text turns only — tool calls filtered out.

Dataset	Description	Size
Glint-Research/Fable-5-traces	Fable 5 agent traces — Pi format. 81% tool use, 19% text reasoning.	4.6k sessions
attentionAllYouNeed/Vibe-Coding-Claude-Fable-5	Fable 5 coding agent traces.	1.1M rows
TheFusionCube/Fable-5-CoT-Traces	Fable 5 chain-of-thought traces.	468
hotdogs/uka-fable-reasoning	Fable 5 reasoning — small, high signal.	3.1k
cfahlgren1/Fable-5-traces	Additional Fable 5 sessions.	115
notune/fable5-repos	Fable 5 repository-level traces.	7k
Jackrong/Claude-opus-4.7-TraceInversion-5000x	Claude Opus 4.7 reasoning traces.	5k
Jackrong/Claude-opus-4.6-TraceInversion-9000x	Claude Opus 4.6 reasoning traces.	9k
11-47/claude_opus_4.8_distill_5k	Claude Opus 4.8 distilled — most recent Claude reasoning style.	5k
ansulev/GPT-5.5-Thinking-Max-Distill-25k	GPT-5.5 thinking traces — deep reasoning chains.	25k
Avtrkrb/combined-reasoning-opus-4.6-4.7-kimi-glm	Combined multi-model reasoning. Diverse analytical styles.	30k cap
TuringEnterprises/Rubric-Graded-Reasoning	Graded analytical responses — teaches structured reasoning with criteria.	150

FRONTIER MODEL REASONING TRACES

style distillation

Teaching the non-hedging, commit-early, mechanistic reasoning style of frontier Claude and Fable 5 models. Text turns only — tool calls filtered out.

Dataset	Description	Size
Glint-Research/Fable-5-traces	Fable 5 Pi-format agent traces.	4.6k sessions
attentionAllYouNeed/Vibe-Coding-Claude-Fable-5	Fable 5 coding traces.	1.1M rows
TheFusionCube/Fable-5-CoT-Traces	Fable 5 chain-of-thought.	468
hotdogs/uka-fable-reasoning	Fable 5 reasoning — small, high signal.	3.1k
cfahlgren1/Fable-5-traces	Additional Fable 5 sessions.	115
notune/fable5-repos	Fable 5 repository-level traces.	7k
Jackrong/Claude-opus-4.7-TraceInversion-5000x	Opus 4.7 reasoning traces.	5k
Jackrong/Claude-opus-4.6-TraceInversion-9000x	Opus 4.6 reasoning traces.	9k
11-47/claude_opus_4.8_distill_5k	Opus 4.8 distilled — most recent Claude reasoning style.	5k
ansulev/GPT-5.5-Thinking-Max-Distill-25k	GPT-5.5 thinking traces — deep reasoning chains.	25k
Avtrkrb/combined-reasoning-opus-4.6-4.7-kimi-glm	Multi-model reasoning corpus.	30k cap
TuringEnterprises/Rubric-Graded-Reasoning	Graded analytical responses — structured reasoning with criteria.	150
WithinUsAI/claude_mythos_distilled_25k	Claude Mythos distilled — all categories. Expert analysis, scientific reasoning, agentic planning, cybersecurity.	25k

INSTRUCTION & CONVERSATION

general capability

Dataset	Description	Cap
yahma/alpaca-cleaned	Cleaned Alpaca instruction pairs.	15k
open-thoughts/OpenThoughts-114k	Deep reasoning traces. Problem → structured solution.	20k
OpenAssistant/oasst1	88k real human conversations. High quality, diverse.	15k
lmsys/lmsys-chat-1m	1M real user conversations from Chatbot Arena.	15k
HuggingFaceH4/ultrachat_200k	200k multi-turn chat. High quality instruction following.	15k
Gryphe/Sonnet3.5-SlimOrcaDedupCleaned	181k Claude Sonnet 3.5 conversations. Direct response style.	15k
Dampfinchen/Creative_Writing_Multiturn	Multi-turn creative writing. Voice and register flexibility.	5k
fka/prompts.chat	System prompt collection. Teaches model to follow complex behavioral constraints.	1k
qiaojin/PubMedQA	Clinical evidence-based Q&A. Commit to answers with cited reasoning.	10k

EXTRACTION PIPELINE

7 methods

01

Multi-Turn Windows

3-turn, 5-turn, and 8-turn conversation windows from all 86 PDFs. Preserves thread context — model learns to track a user's full situation, not just respond to isolated prompts.

02

DPO Pushback Pairs

Scans for the quadruple pattern: user→bad-AI→user-pushback→better-AI. Extracts rejected=hedge, chosen=direct. The TONE HANDLING doc is 90 pages of this.

03

Claim Extraction

Every sentence containing "This is why…", "The mechanism is…", "CLAIM:" becomes a Q→A pair. Extracts the analytical DNA of the corpus.

04

Instruction Steps

Numbered steps from operational docs (DEVICE ISOLATION, Counter Surveillance, Delay Account Creation) extracted as instruction-following pairs.

05

RAG Paragraph Chunks

Every substantive paragraph across 10,792 pages scored and stored as a retrieval document. Perspective can cite the exact passage from Observer Effect or Therapy Abuse at inference time.

06

Cross-Document Synthesis

Same event described across multiple docs (Observer Effect + Therapy Abuse + Micha Gray) paired as contrastive training examples.

07

Behavioral Signal Tagging

Every pair tagged: pushback type, emotional register, AI behaviors detected, Shane's metaphor domains, response shift. System prompt tuned per-record to detected register.

CLEANING PIPELINE

3-pass dedup

01

MinHash LSH

128 permutations, 0.75 Jaccard threshold. Removes exact and near-exact duplicate responses at scale before embedding step.

02

Embedding Cosine

MiniLM-L6-v2 embeddings. 0.92 cosine similarity threshold. Removes semantic duplicates — two responses saying the same thing in different words collapsed to one.

03

Quality Re-score

Hedge count, directness signals, commitment markers, specificity. Floor at score 25. Anything below gets cut.

RUNTIME GUARDRAIL CONTROL

env var toggles

Every behavioral constraint is independently toggleable via environment variables on Railway — no code changes required.

AXIS LOCK

PERSPECTIVE_AXIS_LOCK=false to disable

Answer only the exact question asked. No reframing, no scope broadening, no generalizing to "people in general."

NOVELTY RULE

PERSPECTIVE_NOVELTY=false to disable

Every paragraph must add at least one new specific point. No restating earlier information.

COMMITMENT RULE

PERSPECTIVE_COMMITMENT=false to disable

Commit early to most-likely explanation. No keeping multiple options open past paragraph two.

ANTI-FOG

PERSPECTIVE_ANTIFOG=false to disable

No em-dashes, no rhetorical soothing, no coaching tone. Bans: "it depends," "in summary," "generally," "let's," "container," "epistemic."

HEDGING CAP

PERSPECTIVE_HEDGE_CAP=off to disable

Total hedging words across entire response ≤ 6. Hedges: may, might, could, possibly, perhaps, arguably, potentially.

REWRITE GATE

PERSPECTIVE_REWRITE_GATE=false to disable

Silent rewrite pass before finalizing. Checks all constraints. Deletes any paragraph that could apply to a random stranger.

TONE REGISTER

PERSPECTIVE_TONE=strict|peer|clinical

clinical (default) — no disclaimers, ignore incoming tone. peer — survivor voice from inside. strict — maximum precision, no softening whatsoever.

TECHNICAL REVIEW

restricted

Full pipeline source including training notebook and architecture details. Access granted by Shane on request.

incorrect access code

TRAININGCORPUS

TRAINING
CORPUS