Complete documentation of every data source used to fine-tune Perspective — a trauma-informed LLM built on 567,000+ training pairs across 60+ curated datasets, 86 proprietary PDFs, and 215,000 rows of original classifier research that exists nowhere else. Six simultaneous training objectives. Eight specialized system prompts. No synthetic approximations of what abuse looks like. The real thing, at scale.
Most fine-tuning projects download Alpaca, train for a weekend, upload to HuggingFace. Here is what that looks like next to this.
| Developer tier | What they do | Datasets | Training objectives |
|---|---|---|---|
| Average (70%) | Download Alpaca or ShareGPT. Format it. Train 3 epochs. | 1–2 | SFT only |
| Good (25%) | 5–10 datasets. Basic dedup. Maybe HH-RLHF for DPO. No voice layer. | 5–10 | SFT + 1 DPO domain |
| Strong (5%) | 20–30 datasets. Cleaning pipeline. Multiple training objectives. Where most ML engineers at actual companies operate. | 20–30 | SFT + DPO + RAG |
| This pipeline (top 0.1%) | 60+ curated datasets. 567k+ total training pairs. 86 proprietary PDFs. 215k rows of original classifier research that exists nowhere else. 8 DPO domains. 7 extraction methods. 3-pass semantic dedup. Identity layer baked into every single record. | 60+ | SFT · DPO · RAG · CoT · Multi-turn · Adversarial |
Personal data appears 3× before shuffle — ensuring it dominates the model's behavior over generic HuggingFace sources.
Real-time documentation of psychological abuse, attempted murder, legal battles, classifier suppression, and stalking — produced in live conversation with AI while events were happening. First-person epistemological authority. Cannot be replicated by any other means.
| File | Contents | Method |
|---|---|---|
| behavioral_sft.jsonl | Multi-turn windows with behavioral signal tags, FAISS semantic dedup | personal |
| clean_sft.jsonl | MinHash LSH + MiniLM embedding cleaned pairs | personal |
| master_sft.jsonl | Full multi-turn extraction from all 86 PDFs | personal |
| windows_3turn_sft.jsonl | 3-turn conversation windows — tight, specific exchanges | personal |
| windows_8turn_sft.jsonl | 8-turn conversation windows — rich context | personal |
| claims_sft.jsonl | Every mechanistic claim extracted as Q→A pair | personal |
| instructions_sft.jsonl | Step-by-step operational instructions from platform docs | personal |
| doc_qa_sft.jsonl | Q&A synthesized from polished research documents | personal |
| cross_doc_sft.jsonl | Same event described across multiple docs — contrastive pairs | personal |
| personal_sft.jsonl | Full extraction — Therapy Abuse, Observer Effect, Micha Gray, Florida Scam, Legal Filings, Metadata Laundering, Device Isolation, Botnet Load Management + 70 more PDFs | personal |
| tone_sft.jsonl | TONE HANDLING document — PROMPT CONTRACT v3 rules in action | personal |
| full_sft.jsonl + gem_buffet.jsonl | Historical extraction passes — all unique pairs preserved | personal |
| master_dpo.jsonl + behavioral_dpo.jsonl | DPO pairs: pushback moments where Shane corrected hedgy AI responses — rejected=hedge, chosen=direct answer | personal |
Digital abuse IS abuse. Device compromise, ISP-level surveillance, metadata weaponization, social media account manipulation, classifier suppression, coordinated stalking via location data, identity theft — all documented in the personal corpus. Perspective needs to understand these systems mechanistically to help victims of digital harm. Loaded with a dedicated SYSTEM_CYBER prompt framing all cybersecurity knowledge through the lens of victim defense.
| Dataset | Description | Size |
|---|---|---|
| Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset | TLS C2 fingerprinting, UEBA insider threat detection, DLP evasion forensics, IoT privacy, side-channel analysis, supply chain, social media OSINT, metadata laundering, stalking via network correlation, behavioral anomaly detection. | 53k |
| AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 | OWASP Top 10, MITRE ATT&CK, lateral movement, ransomware IR playbooks, Sigma rules, cloud-native detection, Windows Event ID correlation. Causal reasoning about attack chains maps directly to how Perspective analyzes abuse patterns. | 99k |
| hcnote/Cybersecurity-High-Quality-Dataset | Chinese + English cybersecurity Q&A, quality-filtered ≥4.5/5.0. Deeper web security, penetration testing, threat intelligence. Filtered to remove raw offensive content (webshells, reverse shells). | 270k |
| Necent/llm-jailbreak-prompt-injection-dataset | 1.18M rows from 30+ sources (HarmBench, AdvBench, WildGuard, BeaverTails, TensorTrust, multilingual jailbreaks). Orthogonal labels: prompt_harmful, prompt_adversarial, response_harmful, response_refusal. Filtered to 25k best examples — prioritizing real model refusals, skipping harmful completions. | 1.18M → 25k |
| rogue-security/prompt-injections-benchmark | 5k jailbreak vs benign prompts. Trains Perspective to recognize and name manipulation attempts against itself — same adversarial technique as classifier poisoning documented in personal corpus. | 5k |
| bigcode/self-oss-instruct-sc2-exec-filter-50k | 50k verified-executable instruction→code pairs. SYSTEM_CODE: writes Python/Bash for monitoring compromise, detecting stalking via network metadata, hardening devices, automating evidence collection. | 50k |
| CyberNative/Code_Vulnerability_Security_DPO | 4.66k DPO pairs — vulnerable code (rejected) vs. secure implementation (chosen) across 11 languages: C++, Python, Java, JavaScript, C#, PHP, Ruby, Swift, Go, Kotlin, Fortran. Every pair names the vulnerability (buffer overflow, SQL injection, eval abuse, unsafe deserialization, etc.) and provides the correct secure fix. Used both ways: SFT pairs teach Perspective to write secure code by default; DPO pairs train preference for the secure implementation over the vulnerable one. Applied under SYSTEM_CODE. | 4.66k DPO |
| DiegoAI597/harmful_actions | 567 rows of harmful prompts (bomb-making, hacking, identity theft, CSAM, manipulation) with mixed model responses. Filtered to refusal-only rows — rows where the model actually produced harmful content are skipped. Refusals are rewritten into Perspective-style: name the harm category directly, decline without moralizing, no "I'm sorry" hedging. Adds 200–400 examples under SYSTEM_SAFETY teaching Perspective to recognize and name harm type without lecturing. | 567 → ~300 filtered |
Reflects the 878-page Legal Filings GPT corpus, Florida Scam documentation, and attempted murder evidence in the personal corpus. Teaches adversarial legal reasoning: what to fight for, what leverage exists, what clauses protect vs. expose a weaker party.
| Dataset | Description | Size |
|---|---|---|
| crosbylegal/RedlineBench | 140 real multi-turn contract negotiation tasks across 3 SaaS MSA scenarios, 4 alternating turns, attorney-authored rubrics. Teaches IP ownership, liability caps, indemnification — what a skilled attorney representing a weaker party would fight for. | 140 |
| theatticusproject/cuad | Legal contract Q&A from real commercial contracts. Reinforces legal register. | 5k cap |
| free-law/Caselaw_Access_Project | Harvard Law Library + Ravel Law: 6.6 million US court decisions spanning 360 years of state and federal law, cleaned and normalized by Teraflop AI. CC0 license. Reflects pull 15k cases and extract holding + reasoning — each formatted as: Case name, court, date → "What is the holding and legal reasoning?" Grounds SYSTEM_LEGAL in real judicial doctrine: how courts actually rule, what arguments succeed, what precedent says about power imbalances. Streamed to avoid the full 38GB download. | 6.6M → 15k cap |
Trains Perspective to accurately detect what emotional register a victim is writing in. When someone describes their situation, Perspective reads the emotion correctly before analyzing the pattern. Sadness, anger, fear, love, joy, surprise — each gets a clinical Perspective-voice response that names it without hedging.
| Dataset | Description | Size |
|---|---|---|
| dair-ai/emotion | 416k real Twitter messages annotated with 6 emotion classes. Every text paired with a clinical Perspective response that names the emotion directly and explains what it signals — no therapist language, no soothing. | 416k |
Perspective needs to recognize when it's being manipulated — the same adversarial techniques documented in the classifier poisoning and platform suppression research. Trains Perspective to name attacks without lecturing.
| Dataset | Description | Size |
|---|---|---|
| Necent/llm-jailbreak-prompt-injection-dataset | 30+ sources aggregated. Golden rows: real model refusals used as-is. Adversarial rows: synthetic response naming the attack technique (GCG, base64, DAN, roleplay). Harmful rows: decline with explanation. Skips all rows where model gave harmful response. | 25k filtered |
| rogue-security/prompt-injections-benchmark | 5k jailbreak vs benign classification. | 5k |
Perspective operates on CLAIM→CAUSE→CHECK. Research plan generation teaches the same structure: commit to an approach, define success criteria, name what would invalidate the conclusion. Also covers social media research — attacks happen on Meta platforms.
| Dataset | Description | Size |
|---|---|---|
| facebook/research-plan-gen | 22k research tasks from ML, Arxiv, PubMed with rubrics and reference solutions from Meta AI. Goal → systematic plan with success criteria. Maps directly to Perspective's analytical structure. | 22k |
Perspective's eventual home is a UE5.7 UI game designed in Unreal Engine. This teaches Nanite, Lumen, World Partition, PCG, Blueprint scripting, and C++ — never deprecated UE4 patterns. Only high-confidence rows (≥0.9) included.
| Dataset | Description | Size |
|---|---|---|
| TunstallTensor/unreal-engine-5.7-qa | 122k Q&A scraped from official Epic Games documentation. Verified against source text. Filtered to confidence ≥0.9. | ~100k |
| Dataset | Description | Size |
|---|---|---|
| fadodr/mental_health_therapy | Real + synthetic therapy dialogue. SFT-ready instruction/output format. | 8k |
| ShenLab/MentalChat16K | 16k therapy conversations across 33 topics including relationships and conflict. | 14.9k |
| nbertagnolli/counsel-chat | Real licensed therapist Q&A. High clinical quality. | 2.7k |
| mpingale/mental-health-chat-dataset | CounselChat mirror. Depression, trauma, relationship abuse. | 2.7k |
| Psychotherapy-LLM/PsyCoPref | Preference pairs rated on empathy, safety, autonomy. Used for DPO alignment signal. | 33k |
| IINOVAII/therapy-conversations-combined | Largest therapy source. Broad coverage. | 50k cap |
| audreyeleven/MentalManip | 4k dialogues annotated with 135 specific manipulation techniques — gaslighting, DARVO mechanics, intimidation. Unique. | 4k |
| qiaojin/PubMedQA | Biomedical Q&A. Clinical evidence-based reasoning, commitment to answers with uncertainty quantified. | 10k cap |
These teach Perspective the non-hedging, commit-early, mechanistic reasoning style of frontier models. Text turns only — tool calls filtered out.
| Dataset | Description | Size |
|---|---|---|
| Glint-Research/Fable-5-traces | Fable 5 agent traces — Pi format. 81% tool use, 19% text reasoning. | 4.6k sessions |
| attentionAllYouNeed/Vibe-Coding-Claude-Fable-5 | Fable 5 coding agent traces. | 1.1M rows |
| TheFusionCube/Fable-5-CoT-Traces | Fable 5 chain-of-thought traces. | 468 |
| hotdogs/uka-fable-reasoning | Fable 5 reasoning — small, high signal. | 3.1k |
| cfahlgren1/Fable-5-traces | Additional Fable 5 sessions. | 115 |
| notune/fable5-repos | Fable 5 repository-level traces. | 7k |
| Jackrong/Claude-opus-4.7-TraceInversion-5000x | Claude Opus 4.7 reasoning traces. | 5k |
| Jackrong/Claude-opus-4.6-TraceInversion-9000x | Claude Opus 4.6 reasoning traces. | 9k |
| 11-47/claude_opus_4.8_distill_5k | Claude Opus 4.8 distilled — most recent Claude reasoning style. | 5k |
| ansulev/GPT-5.5-Thinking-Max-Distill-25k | GPT-5.5 thinking traces — deep reasoning chains. | 25k |
| Avtrkrb/combined-reasoning-opus-4.6-4.7-kimi-glm | Combined multi-model reasoning. Diverse analytical styles. | 30k cap |
| TuringEnterprises/Rubric-Graded-Reasoning | Graded analytical responses — teaches structured reasoning with criteria. | 150 |
Teaching the non-hedging, commit-early, mechanistic reasoning style of frontier Claude and Fable 5 models. Text turns only — tool calls filtered out.
| Dataset | Description | Size |
|---|---|---|
| Glint-Research/Fable-5-traces | Fable 5 Pi-format agent traces. | 4.6k sessions |
| attentionAllYouNeed/Vibe-Coding-Claude-Fable-5 | Fable 5 coding traces. | 1.1M rows |
| TheFusionCube/Fable-5-CoT-Traces | Fable 5 chain-of-thought. | 468 |
| hotdogs/uka-fable-reasoning | Fable 5 reasoning — small, high signal. | 3.1k |
| cfahlgren1/Fable-5-traces | Additional Fable 5 sessions. | 115 |
| notune/fable5-repos | Fable 5 repository-level traces. | 7k |
| Jackrong/Claude-opus-4.7-TraceInversion-5000x | Opus 4.7 reasoning traces. | 5k |
| Jackrong/Claude-opus-4.6-TraceInversion-9000x | Opus 4.6 reasoning traces. | 9k |
| 11-47/claude_opus_4.8_distill_5k | Opus 4.8 distilled — most recent Claude reasoning style. | 5k |
| ansulev/GPT-5.5-Thinking-Max-Distill-25k | GPT-5.5 thinking traces — deep reasoning chains. | 25k |
| Avtrkrb/combined-reasoning-opus-4.6-4.7-kimi-glm | Multi-model reasoning corpus. | 30k cap |
| TuringEnterprises/Rubric-Graded-Reasoning | Graded analytical responses — structured reasoning with criteria. | 150 |
| WithinUsAI/claude_mythos_distilled_25k | Claude Mythos distilled — all categories. Expert analysis, scientific reasoning, agentic planning, cybersecurity. | 25k |
| Dataset | Description | Cap |
|---|---|---|
| yahma/alpaca-cleaned | Cleaned Alpaca instruction pairs. | 15k |
| open-thoughts/OpenThoughts-114k | Deep reasoning traces. Problem → structured solution. | 20k |
| OpenAssistant/oasst1 | 88k real human conversations. High quality, diverse. | 15k |
| lmsys/lmsys-chat-1m | 1M real user conversations from Chatbot Arena. | 15k |
| HuggingFaceH4/ultrachat_200k | 200k multi-turn chat. High quality instruction following. | 15k |
| Gryphe/Sonnet3.5-SlimOrcaDedupCleaned | 181k Claude Sonnet 3.5 conversations. Direct response style. | 15k |
| Dampfinchen/Creative_Writing_Multiturn | Multi-turn creative writing. Voice and register flexibility. | 5k |
| fka/prompts.chat | System prompt collection. Teaches model to follow complex behavioral constraints. | 1k |
| qiaojin/PubMedQA | Clinical evidence-based Q&A. Commit to answers with cited reasoning. | 10k |
Every behavioral constraint is independently toggleable via environment variables on Railway — no code changes required.
Full pipeline source including training notebook and architecture details. Access granted by Shane on request.