// reflect — training corpus documentation

TRAINING
CORPUS

Complete documentation of every data source used to fine-tune Perspective — a trauma-informed LLM built on 567,000+ training pairs across 60+ curated datasets, 86 proprietary PDFs, and 215,000 rows of original classifier research that exists nowhere else. Six simultaneous training objectives. Eight specialized system prompts. No synthetic approximations of what abuse looks like. The real thing, at scale.

567k+
Training Pairs
215k
Classifier Rows
86
Proprietary PDFs
60+
Curated Datasets
6
Training Objectives
8
System Prompts
WHY THIS PIPELINE IS DIFFERENT

Most fine-tuning projects download Alpaca, train for a weekend, upload to HuggingFace. Here is what that looks like next to this.

Developer tier What they do Datasets Training objectives
Average (70%) Download Alpaca or ShareGPT. Format it. Train 3 epochs. 1–2 SFT only
Good (25%) 5–10 datasets. Basic dedup. Maybe HH-RLHF for DPO. No voice layer. 5–10 SFT + 1 DPO domain
Strong (5%) 20–30 datasets. Cleaning pipeline. Multiple training objectives. Where most ML engineers at actual companies operate. 20–30 SFT + DPO + RAG
This pipeline (top 0.1%) 60+ curated datasets. 567k+ total training pairs. 86 proprietary PDFs. 215k rows of original classifier research that exists nowhere else. 8 DPO domains. 7 extraction methods. 3-pass semantic dedup. Identity layer baked into every single record. 60+ SFT · DPO · RAG · CoT · Multi-turn · Adversarial
01
SFT + DPO across every domain
Most pipelines skip DPO — it requires pairs and more engineering. This pipeline runs preference training across 8 distinct domains: emotion, cybersecurity, web attacks, HH-RLHF alignment, sentiment, harmful actions, prompt injection, and personal data. Behavioral shaping at that depth is where models either develop a voice or don't.
02
Irreplaceable proprietary data
The diary PDFs, the classifier CSVs, the legal filings, the abuse documentation — these cannot be downloaded. They do not exist anywhere else. A model trained on someone's actual lived experience with these systems produces responses that no model trained purely on public internet text can replicate. That asymmetry is the product.
03
Identity layer across every record
Every single training record — whether it came from a cybersecurity dataset, a legal corpus, or a RAG dataset — carries the Perspective system prompt. The model doesn't learn to be a generic assistant that sometimes sounds like Shane. It learns that every response IS Perspective. That's architecturally different from appending a system prompt at inference time.
04
215k rows of original classifier research
No dataset on HuggingFace teaches what a "Metadata Trust Suppressor" or "Deferred Identity Cementing" classifier does — because those concepts were invented here. When Perspective explains how social media suppression works, it answers with a framework that doesn't exist anywhere else. Because the training data for that framework is on a private hard drive and nowhere else on earth.
05
Six training objectives simultaneously
Chain-of-thought. Multi-turn progressive deep dives. RAG-format context + answer. Adversarial scenario analysis. Section-level Q&A. Flat instruction-following. Most pipelines pick one. Each objective teaches a different capability. The result is a model that can reason through a problem, go deep in conversation, retrieve from context, analyze adversarial situations, and follow precise instructions — all as the same entity.
06
Built from inside the experience
Every other mental health AI was built by developers who read about trauma. This one was built by someone navigating it in real time, using AI to survive it, and encoding every insight into training data as it happened. The difference is audible in the first response. A model trained on primary source documentation of abuse engages differently than one trained on clinical approximations of what abuse looks like.
TRAINING MIX

Personal data appears 3× before shuffle — ensuring it dominates the model's behavior over generic HuggingFace sources.

Personal (45%) — 10k pages, weighted 3×
Therapy datasets (25%)
Claude/Fable traces (20%)
General instruction (10%)
PERSONAL CORPUS

Real-time documentation of psychological abuse, attempted murder, legal battles, classifier suppression, and stalking — produced in live conversation with AI while events were happening. First-person epistemological authority. Cannot be replicated by any other means.

File Contents Method
behavioral_sft.jsonlMulti-turn windows with behavioral signal tags, FAISS semantic deduppersonal
clean_sft.jsonlMinHash LSH + MiniLM embedding cleaned pairspersonal
master_sft.jsonlFull multi-turn extraction from all 86 PDFspersonal
windows_3turn_sft.jsonl3-turn conversation windows — tight, specific exchangespersonal
windows_8turn_sft.jsonl8-turn conversation windows — rich contextpersonal
claims_sft.jsonlEvery mechanistic claim extracted as Q→A pairpersonal
instructions_sft.jsonlStep-by-step operational instructions from platform docspersonal
doc_qa_sft.jsonlQ&A synthesized from polished research documentspersonal
cross_doc_sft.jsonlSame event described across multiple docs — contrastive pairspersonal
personal_sft.jsonlFull extraction — Therapy Abuse, Observer Effect, Micha Gray, Florida Scam, Legal Filings, Metadata Laundering, Device Isolation, Botnet Load Management + 70 more PDFspersonal
tone_sft.jsonlTONE HANDLING document — PROMPT CONTRACT v3 rules in actionpersonal
full_sft.jsonl + gem_buffet.jsonlHistorical extraction passes — all unique pairs preservedpersonal
master_dpo.jsonl + behavioral_dpo.jsonlDPO pairs: pushback moments where Shane corrected hedgy AI responses — rejected=hedge, chosen=direct answerpersonal
CYBERSECURITY & DIGITAL ABUSE

Digital abuse IS abuse. Device compromise, ISP-level surveillance, metadata weaponization, social media account manipulation, classifier suppression, coordinated stalking via location data, identity theft — all documented in the personal corpus. Perspective needs to understand these systems mechanistically to help victims of digital harm. Loaded with a dedicated SYSTEM_CYBER prompt framing all cybersecurity knowledge through the lens of victim defense.

DatasetDescriptionSize
Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-DatasetTLS C2 fingerprinting, UEBA insider threat detection, DLP evasion forensics, IoT privacy, side-channel analysis, supply chain, social media OSINT, metadata laundering, stalking via network correlation, behavioral anomaly detection.53k
AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1OWASP Top 10, MITRE ATT&CK, lateral movement, ransomware IR playbooks, Sigma rules, cloud-native detection, Windows Event ID correlation. Causal reasoning about attack chains maps directly to how Perspective analyzes abuse patterns.99k
hcnote/Cybersecurity-High-Quality-DatasetChinese + English cybersecurity Q&A, quality-filtered ≥4.5/5.0. Deeper web security, penetration testing, threat intelligence. Filtered to remove raw offensive content (webshells, reverse shells).270k
Necent/llm-jailbreak-prompt-injection-dataset1.18M rows from 30+ sources (HarmBench, AdvBench, WildGuard, BeaverTails, TensorTrust, multilingual jailbreaks). Orthogonal labels: prompt_harmful, prompt_adversarial, response_harmful, response_refusal. Filtered to 25k best examples — prioritizing real model refusals, skipping harmful completions.1.18M → 25k
rogue-security/prompt-injections-benchmark5k jailbreak vs benign prompts. Trains Perspective to recognize and name manipulation attempts against itself — same adversarial technique as classifier poisoning documented in personal corpus.5k
bigcode/self-oss-instruct-sc2-exec-filter-50k50k verified-executable instruction→code pairs. SYSTEM_CODE: writes Python/Bash for monitoring compromise, detecting stalking via network metadata, hardening devices, automating evidence collection.50k
CyberNative/Code_Vulnerability_Security_DPO4.66k DPO pairs — vulnerable code (rejected) vs. secure implementation (chosen) across 11 languages: C++, Python, Java, JavaScript, C#, PHP, Ruby, Swift, Go, Kotlin, Fortran. Every pair names the vulnerability (buffer overflow, SQL injection, eval abuse, unsafe deserialization, etc.) and provides the correct secure fix. Used both ways: SFT pairs teach Perspective to write secure code by default; DPO pairs train preference for the secure implementation over the vulnerable one. Applied under SYSTEM_CODE.4.66k DPO
DiegoAI597/harmful_actions567 rows of harmful prompts (bomb-making, hacking, identity theft, CSAM, manipulation) with mixed model responses. Filtered to refusal-only rows — rows where the model actually produced harmful content are skipped. Refusals are rewritten into Perspective-style: name the harm category directly, decline without moralizing, no "I'm sorry" hedging. Adds 200–400 examples under SYSTEM_SAFETY teaching Perspective to recognize and name harm type without lecturing.567 → ~300 filtered
LEGAL & CONTRACT NEGOTIATION

Reflects the 878-page Legal Filings GPT corpus, Florida Scam documentation, and attempted murder evidence in the personal corpus. Teaches adversarial legal reasoning: what to fight for, what leverage exists, what clauses protect vs. expose a weaker party.

DatasetDescriptionSize
crosbylegal/RedlineBench140 real multi-turn contract negotiation tasks across 3 SaaS MSA scenarios, 4 alternating turns, attorney-authored rubrics. Teaches IP ownership, liability caps, indemnification — what a skilled attorney representing a weaker party would fight for.140
theatticusproject/cuadLegal contract Q&A from real commercial contracts. Reinforces legal register.5k cap
free-law/Caselaw_Access_ProjectHarvard Law Library + Ravel Law: 6.6 million US court decisions spanning 360 years of state and federal law, cleaned and normalized by Teraflop AI. CC0 license. Reflects pull 15k cases and extract holding + reasoning — each formatted as: Case name, court, date → "What is the holding and legal reasoning?" Grounds SYSTEM_LEGAL in real judicial doctrine: how courts actually rule, what arguments succeed, what precedent says about power imbalances. Streamed to avoid the full 38GB download.6.6M → 15k cap
EMOTION DETECTION

Trains Perspective to accurately detect what emotional register a victim is writing in. When someone describes their situation, Perspective reads the emotion correctly before analyzing the pattern. Sadness, anger, fear, love, joy, surprise — each gets a clinical Perspective-voice response that names it without hedging.

DatasetDescriptionSize
dair-ai/emotion416k real Twitter messages annotated with 6 emotion classes. Every text paired with a clinical Perspective response that names the emotion directly and explains what it signals — no therapist language, no soothing.416k
SAFETY & MANIPULATION RECOGNITION

Perspective needs to recognize when it's being manipulated — the same adversarial techniques documented in the classifier poisoning and platform suppression research. Trains Perspective to name attacks without lecturing.

DatasetDescriptionSize
Necent/llm-jailbreak-prompt-injection-dataset30+ sources aggregated. Golden rows: real model refusals used as-is. Adversarial rows: synthetic response naming the attack technique (GCG, base64, DAN, roleplay). Harmful rows: decline with explanation. Skips all rows where model gave harmful response.25k filtered
rogue-security/prompt-injections-benchmark5k jailbreak vs benign classification.5k
STRUCTURED ANALYTICAL THINKING

Perspective operates on CLAIM→CAUSE→CHECK. Research plan generation teaches the same structure: commit to an approach, define success criteria, name what would invalidate the conclusion. Also covers social media research — attacks happen on Meta platforms.

DatasetDescriptionSize
facebook/research-plan-gen22k research tasks from ML, Arxiv, PubMed with rubrics and reference solutions from Meta AI. Goal → systematic plan with success criteria. Maps directly to Perspective's analytical structure.22k
UNREAL ENGINE 5.7

Perspective's eventual home is a UE5.7 UI game designed in Unreal Engine. This teaches Nanite, Lumen, World Partition, PCG, Blueprint scripting, and C++ — never deprecated UE4 patterns. Only high-confidence rows (≥0.9) included.

DatasetDescriptionSize
TunstallTensor/unreal-engine-5.7-qa122k Q&A scraped from official Epic Games documentation. Verified against source text. Filtered to confidence ≥0.9.~100k
THERAPY & PSYCHOLOGY
DatasetDescriptionSize
fadodr/mental_health_therapyReal + synthetic therapy dialogue. SFT-ready instruction/output format.8k
ShenLab/MentalChat16K16k therapy conversations across 33 topics including relationships and conflict.14.9k
nbertagnolli/counsel-chatReal licensed therapist Q&A. High clinical quality.2.7k
mpingale/mental-health-chat-datasetCounselChat mirror. Depression, trauma, relationship abuse.2.7k
Psychotherapy-LLM/PsyCoPrefPreference pairs rated on empathy, safety, autonomy. Used for DPO alignment signal.33k
IINOVAII/therapy-conversations-combinedLargest therapy source. Broad coverage.50k cap
audreyeleven/MentalManip4k dialogues annotated with 135 specific manipulation techniques — gaslighting, DARVO mechanics, intimidation. Unique.4k
qiaojin/PubMedQABiomedical Q&A. Clinical evidence-based reasoning, commitment to answers with uncertainty quantified.10k cap
CLAUDE & FRONTIER REASONING TRACES

These teach Perspective the non-hedging, commit-early, mechanistic reasoning style of frontier models. Text turns only — tool calls filtered out.

DatasetDescriptionSize
Glint-Research/Fable-5-tracesFable 5 agent traces — Pi format. 81% tool use, 19% text reasoning.4.6k sessions
attentionAllYouNeed/Vibe-Coding-Claude-Fable-5Fable 5 coding agent traces.1.1M rows
TheFusionCube/Fable-5-CoT-TracesFable 5 chain-of-thought traces.468
hotdogs/uka-fable-reasoningFable 5 reasoning — small, high signal.3.1k
cfahlgren1/Fable-5-tracesAdditional Fable 5 sessions.115
notune/fable5-reposFable 5 repository-level traces.7k
Jackrong/Claude-opus-4.7-TraceInversion-5000xClaude Opus 4.7 reasoning traces.5k
Jackrong/Claude-opus-4.6-TraceInversion-9000xClaude Opus 4.6 reasoning traces.9k
11-47/claude_opus_4.8_distill_5kClaude Opus 4.8 distilled — most recent Claude reasoning style.5k
ansulev/GPT-5.5-Thinking-Max-Distill-25kGPT-5.5 thinking traces — deep reasoning chains.25k
Avtrkrb/combined-reasoning-opus-4.6-4.7-kimi-glmCombined multi-model reasoning. Diverse analytical styles.30k cap
TuringEnterprises/Rubric-Graded-ReasoningGraded analytical responses — teaches structured reasoning with criteria.150
FRONTIER MODEL REASONING TRACES

Teaching the non-hedging, commit-early, mechanistic reasoning style of frontier Claude and Fable 5 models. Text turns only — tool calls filtered out.

DatasetDescriptionSize
Glint-Research/Fable-5-tracesFable 5 Pi-format agent traces.4.6k sessions
attentionAllYouNeed/Vibe-Coding-Claude-Fable-5Fable 5 coding traces.1.1M rows
TheFusionCube/Fable-5-CoT-TracesFable 5 chain-of-thought.468
hotdogs/uka-fable-reasoningFable 5 reasoning — small, high signal.3.1k
cfahlgren1/Fable-5-tracesAdditional Fable 5 sessions.115
notune/fable5-reposFable 5 repository-level traces.7k
Jackrong/Claude-opus-4.7-TraceInversion-5000xOpus 4.7 reasoning traces.5k
Jackrong/Claude-opus-4.6-TraceInversion-9000xOpus 4.6 reasoning traces.9k
11-47/claude_opus_4.8_distill_5kOpus 4.8 distilled — most recent Claude reasoning style.5k
ansulev/GPT-5.5-Thinking-Max-Distill-25kGPT-5.5 thinking traces — deep reasoning chains.25k
Avtrkrb/combined-reasoning-opus-4.6-4.7-kimi-glmMulti-model reasoning corpus.30k cap
TuringEnterprises/Rubric-Graded-ReasoningGraded analytical responses — structured reasoning with criteria.150
WithinUsAI/claude_mythos_distilled_25kClaude Mythos distilled — all categories. Expert analysis, scientific reasoning, agentic planning, cybersecurity.25k
INSTRUCTION & CONVERSATION
DatasetDescriptionCap
yahma/alpaca-cleanedCleaned Alpaca instruction pairs.15k
open-thoughts/OpenThoughts-114kDeep reasoning traces. Problem → structured solution.20k
OpenAssistant/oasst188k real human conversations. High quality, diverse.15k
lmsys/lmsys-chat-1m1M real user conversations from Chatbot Arena.15k
HuggingFaceH4/ultrachat_200k200k multi-turn chat. High quality instruction following.15k
Gryphe/Sonnet3.5-SlimOrcaDedupCleaned181k Claude Sonnet 3.5 conversations. Direct response style.15k
Dampfinchen/Creative_Writing_MultiturnMulti-turn creative writing. Voice and register flexibility.5k
fka/prompts.chatSystem prompt collection. Teaches model to follow complex behavioral constraints.1k
qiaojin/PubMedQAClinical evidence-based Q&A. Commit to answers with cited reasoning.10k
EXTRACTION PIPELINE
01
Multi-Turn Windows
3-turn, 5-turn, and 8-turn conversation windows from all 86 PDFs. Preserves thread context — model learns to track a user's full situation, not just respond to isolated prompts.
02
DPO Pushback Pairs
Scans for the quadruple pattern: user→bad-AI→user-pushback→better-AI. Extracts rejected=hedge, chosen=direct. The TONE HANDLING doc is 90 pages of this.
03
Claim Extraction
Every sentence containing "This is why…", "The mechanism is…", "CLAIM:" becomes a Q→A pair. Extracts the analytical DNA of the corpus.
04
Instruction Steps
Numbered steps from operational docs (DEVICE ISOLATION, Counter Surveillance, Delay Account Creation) extracted as instruction-following pairs.
05
RAG Paragraph Chunks
Every substantive paragraph across 10,792 pages scored and stored as a retrieval document. Perspective can cite the exact passage from Observer Effect or Therapy Abuse at inference time.
06
Cross-Document Synthesis
Same event described across multiple docs (Observer Effect + Therapy Abuse + Micha Gray) paired as contrastive training examples.
07
Behavioral Signal Tagging
Every pair tagged: pushback type, emotional register, AI behaviors detected, Shane's metaphor domains, response shift. System prompt tuned per-record to detected register.
CLEANING PIPELINE
01
MinHash LSH
128 permutations, 0.75 Jaccard threshold. Removes exact and near-exact duplicate responses at scale before embedding step.
02
Embedding Cosine
MiniLM-L6-v2 embeddings. 0.92 cosine similarity threshold. Removes semantic duplicates — two responses saying the same thing in different words collapsed to one.
03
Quality Re-score
Hedge count, directness signals, commitment markers, specificity. Floor at score 25. Anything below gets cut.
RUNTIME GUARDRAIL CONTROL

Every behavioral constraint is independently toggleable via environment variables on Railway — no code changes required.

AXIS LOCK
PERSPECTIVE_AXIS_LOCK=false to disable
Answer only the exact question asked. No reframing, no scope broadening, no generalizing to "people in general."
NOVELTY RULE
PERSPECTIVE_NOVELTY=false to disable
Every paragraph must add at least one new specific point. No restating earlier information.
COMMITMENT RULE
PERSPECTIVE_COMMITMENT=false to disable
Commit early to most-likely explanation. No keeping multiple options open past paragraph two.
ANTI-FOG
PERSPECTIVE_ANTIFOG=false to disable
No em-dashes, no rhetorical soothing, no coaching tone. Bans: "it depends," "in summary," "generally," "let's," "container," "epistemic."
HEDGING CAP
PERSPECTIVE_HEDGE_CAP=off to disable
Total hedging words across entire response ≤ 6. Hedges: may, might, could, possibly, perhaps, arguably, potentially.
REWRITE GATE
PERSPECTIVE_REWRITE_GATE=false to disable
Silent rewrite pass before finalizing. Checks all constraints. Deletes any paragraph that could apply to a random stranger.
TONE REGISTER
PERSPECTIVE_TONE=strict|peer|clinical
clinical (default) — no disclaimers, ignore incoming tone. peer — survivor voice from inside. strict — maximum precision, no softening whatsoever.
TECHNICAL REVIEW

Full pipeline source including training notebook and architecture details. Access granted by Shane on request.

incorrect access code