DCFN - Research
Currently v0.5.1.
What's changed in the engine's user-visible output, in reverse chronological order. Pre-1.0 versioning convention:
0.x.0 — feature additions, output-shape changes
0.x.Y — quality fixes, prompt refinements, copy updates
1.0.0 — reserved for first paid Tier 1 customer signing a contract

Up to v0.4.x, every Research output was descriptive: Article, Tech Report, Bridge Digest, Syntari Record, JSON. All four external instances (Perplexity, Grok, Gemini Deep Research ×2) independently identified the same gap on review — these are excellent analysis packages, but they are not artifacts a customer can FUND, FILE, or ACT ON immediately. DCFN-Patents earned its category by shipping deliverables (provisional drafts, continuation memos) that decision-makers could take directly into a meeting. Research had nothing of equivalent shape.
v0.5.0 ships the first one: an R&D Intelligence Decision Memo, a 2-3 page persona-targeted prescriptive deliverable produced from any successful engine run. Generated alongside the Article and Tech Report — additive, not a replacement. Audience for this prototype is narrow on purpose: pharma R&D directors, biotech corp-dev leads, deep-tech VC desks, and corporate R&D portfolio heads who allocate $1M-$50M on R&D bets. Same operational pattern as Patents' attorney audience (high-LTV, willing to pay $4-20K/seat for a tool that materially affects large decisions). Other personas (Replication, Funder, Licensee) are deferred — one persona ships first to validate willingness-to-pay before forking the synthesis pipeline.
decision_memo_synth.py. Mirrors continuation_memo_synth.py from DCFN-Patents: read engine session JSON, single Claude Opus 4.7 call with persona-aware system prompt, render a .docx with the centralized attribution footer. Public entry point generate_rd_intelligence_memo(session_dir, claude_client, anthropic_model). Persona spec (PersonaSpec dataclass + PERSONAS registry) is centralized so future personas register a new spec instead of forking the module. Cost: ~$0.10-$0.20 per memo (Claude Opus 4.7, ~5K in / ~3K out typical).

attribution.py extended. New write_docx_footer_block(doc, version, *, include_attorney_disclaimer=False) — single source of truth for end-of-body attribution on Research .docx outputs, mirroring the DCFN-Patents convention. Soft-fails on missing logo (the Research repo currently ships without a static/lef-dba-logo.png asset; the footer renders cleanly without it and picks it up automatically when one lands).

api.py and main.py call generate_rd_intelligence_memo() after Article + Tech Report on every successful run, in a try/except so any synth failure (Claude outage, missing key, render error) leaves the existing artifacts intact.

api.VERSION bumped 0.4.0 → 0.5.0. Feature addition / output-shape change → minor bump per the pre-1.0 versioning convention. attribution.VERSION synced.

Generated against data/reports/run_c66198828e46/ (Education & EdTech, 687 articles). The memo correctly cited specific entropy nodes by title + year + severity, named SVW convergence events (svw_002, svw_022, svw_023) with paper titles and scores, surfaced apriori rules with confidence numbers, and honestly flagged the absence of contradiction-resolution data and the lack of bridge nodes as engine-input limitations rather than burying them.
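The source names only the PersonaSpec dataclass and the PERSONAS registry; the field names and registry contents below are illustrative guesses at how such a centralized registry might be shaped, not the module's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonaSpec:
    """One target audience for the decision memo.

    Field names are illustrative -- the changelog only specifies the
    dataclass + registry pattern, not the exact schema.
    """
    key: str            # registry key, e.g. "rd_director" (hypothetical)
    label: str          # human-readable audience name
    system_prompt: str  # persona-aware system prompt for the Claude call


# Central registry: deferred personas (Replication, Funder, Licensee)
# would register a new spec here instead of forking decision_memo_synth.py.
PERSONAS: dict = {
    "rd_director": PersonaSpec(
        key="rd_director",
        label="Pharma R&D director / corporate portfolio head",
        system_prompt="You write prescriptive R&D intelligence memos ...",
    ),
}


def get_persona(key: str) -> PersonaSpec:
    """Look up a persona, failing loudly with the known keys."""
    try:
        return PERSONAS[key]
    except KeyError:
        raise KeyError(f"Unknown persona {key!r}; known: {sorted(PERSONAS)}")
```

Registering a future persona is then a one-line dict entry rather than a pipeline fork.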
Mirrors DCFN-Patents v0.8.0 → v0.9.0 → v0.6.0 playbook validated 2026-05-01. Lays the groundwork for Render → Cloud Run cutover without changing any user-facing behavior on the live Render service. Phase 3 of the migration plan documented at Needs Review/Pricing + Architecture Decisions/RESEARCH_MIGRATION_PLAN_2026-05-02.md.
Dockerfile base swap. python:3.11-slim → pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime. SentenceTransformer auto-detects torch.cuda.is_available() and uses GPU when present (Cloud Run with --gpu=1 --gpu-type=nvidia-l4 gets ~50-100x speedup on SBERT-heavy steps); falls back to CPU transparently when no GPU is attached. COPY . . instead of COPY *.py . so new modules plus future static/ and templates/ dirs land automatically.

session_storage.py. LocalSessionStorage / GCSSessionStorage adapter behind a single session_store singleton. Backend selected via DCFN_SESSION_BACKEND={local,gcs}; gcs requires DCFN_SESSION_BUCKET. Defaults to local (no production behavior change). Cloud Run will set DCFN_SESSION_BACKEND=gcs DCFN_SESSION_BUCKET=lef-ai-dcfn-research-sessions.

tier_config.py. Per-tier engine knob overrides (Tier 0 / 1 / 2). Tier 0 caps match current Render production exactly. Resolution order: DCFN_TIER env → session_state["tier"] → dcfn_tier cookie → fallback tier_0. Pricing strings deliberately NOT in this module — those land in a later phase gated on Z's pricing decision per migration plan §5.

attribution.py. Single source of truth for footer attribution: BUILD_NAME, VERSION, LLC_NAME, LLC_DBA, NV_BUSINESS_LICENSE, PATENT_ATTRIBUTIONS, plus render functions. The patent attribution list matches the v0.3.11 site footer copy exactly — a refactor, not a content change. Cross-build attribution rule (Z 2026-04-30) noted in the module docstring.

requirements.txt. Added google-cloud-storage>=2.14.0 for GCSSessionStorage (lazy import; LocalSessionStorage doesn't depend on it).

attribution.py, tier_config.py, and session_storage.py exist but no module imports them.
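The tier resolution order above can be sketched as a single fall-through lookup. The function name and signature are illustrative (the changelog only specifies the precedence chain and the tier_0 fallback):

```python
import os


def resolve_tier(session_state: dict, cookies: dict) -> str:
    """Resolve the active tier per the documented precedence:
    DCFN_TIER env -> session_state["tier"] -> dcfn_tier cookie -> tier_0.

    `resolve_tier` itself is a hypothetical helper name; only the
    precedence order comes from the changelog.
    """
    return (
        os.environ.get("DCFN_TIER")
        or session_state.get("tier")
        or cookies.get("dcfn_tier")
        or "tier_0"
    )
```

Because `or` short-circuits, an env override wins even when a session value and cookie are both present, which is the behavior the precedence chain implies.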
Wiring lands in subsequent v0.4.x commits.

compute_portfolio_topic.py (mirror of Patents' compute_portfolio_domain.py for Drive filename topic labels) — deferred to the publish.py call-site migration.

accounts.py / auth_session.py / usage_tracker.py — deferred to migration plan Phase 5-6.

dcfn-research.onrender.com → research.livingedenframeworks.com. Output surfaces (.docx outputs, the "Go Deeper" markdown link, deployment site references) carried the legacy Render URL. Replaced across api.py (DCFN_BASE_URL default), report_generator.py (3 callsites: Go Deeper link + 2 footer references), and publish.py (2 footer references).

research.livingedenframeworks.com is not yet pointed at the new substrate — that lands as part of the Cloud Run migration (RESEARCH_MIGRATION_PLAN_2026-05-02). Customers visiting the URL today still hit Squarespace's NX. Output text now reflects the planned canonical brand URL ahead of the cutover (mirrors Patents v0.9.1 sequencing).

report_generator.py:7507 (matches an even-older lef-dcfn.onrender.com pattern from the pre-rename era) is intentionally untouched — it works as designed for stripping cached old markdown.

v0.3.5's bidirectional citation walk shipped assuming all reference IDs were Semantic Scholar paperIds (40-char hex). In practice the merged corpus pulls from 4-6 sources and references[] is mixed-format: OpenAlex Work IDs, PubMed UIDs, arXiv IDs. v0.3.8's S2-only filter prevented the resulting 400 Bad Request crash, but at the cost of ~zero expansion on multi-source corpora (one local test: 749 non-S2 IDs filtered, 0 added).
v0.3.10 closes the gap with a hybrid two-pass design:
Pass 1 — DOI translation. Look up DOIs for non-S2 IDs via their source APIs (OpenAlex /works, PubMed esummary). Reformat as S2's DOI:10.x prefix syntax and send through the existing S2 batch endpoint. Captures the ~75-85% of academic papers that have DOIs.

Pass 2 — per-source fanout. Fetch whatever remains directly from the native APIs (OpenAlex /works?filter=ids.openalex:, PubMed efetch, arXiv query?id_list=). Reuses parsing logic from the existing ingestion_* modules so the article record shape is identical to the standard pipeline.

Graceful degradation: if the S2 batch returns 429 (free-tier rate limit, common) or any other error, the IDs that came in via DOI translation get re-attempted via per-source fetch, so the entire walk doesn't depend on S2 cooperating.
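A minimal sketch of the hybrid two-pass flow, with the S2 batch client, DOI lookup, and native-API fetchers passed in as stand-in callables (the real module's names and signatures are not published in this changelog):

```python
import re

_S2_HEX = re.compile(r"^[0-9a-f]{40}$")  # S2-native paperId shape


def expand_ids(neighbor_ids, fetch_s2_batch, lookup_doi, fetch_native):
    """Hybrid two-pass expansion (sketch; all three callables are
    hypothetical stand-ins for the real clients)."""
    records, unresolved = [], []
    batch, origin, pass2 = [], {}, []
    for pid in neighbor_ids:
        if _S2_HEX.match(pid):
            batch.append(pid)           # S2-native: batch as-is
            continue
        doi = lookup_doi(pid)           # Pass 1: translate to a DOI
        if doi:
            key = f"DOI:{doi}"          # S2's prefix syntax
            batch.append(key)
            origin[key] = pid           # remember the source-format ID
        else:
            pass2.append(pid)           # no DOI -> straight to Pass 2
    try:
        records.extend(fetch_s2_batch(batch))
    except Exception:
        # Graceful degradation: on 429 (or any S2 error), re-attempt the
        # DOI-translated IDs via their native APIs instead of failing.
        pass2.extend(origin.values())
    for pid in pass2:                   # Pass 2: per-source fanout
        rec = fetch_native(pid)
        if rec:
            records.append(rec)
        else:
            unresolved.append(pid)
    return records, unresolved
```

The key design point is that an S2 outage only degrades Pass 1 into extra Pass 2 work; it never zeroes out the walk the way the v0.3.8 filter did.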
Metadata richness: expand_via_citation_walk return dict now includes ids_resolved_via_s2_native, ids_resolved_via_doi_translation, ids_resolved_via_per_source, ids_unresolved, per_source_breakdown. Operator (and Z reading the report) can see exactly where each neighbor came from and which sources contributed.
Empirical validation (local Single-Cell corpus, 100 OpenAlex neighbors, S2 rate-limited): 25 articles added in 8.6s via Pass 2 OpenAlex fallback alone. With production S2 API key cooperating, Pass 1 + Pass 2 combined would land substantially more.
New module id_translation.py centralizes paper-ID source recognition + DOI prefix formatting so downstream code doesn't sprinkle prefix-matching logic.
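The recognition half of that module might look like the sketch below. The prefixes match the formats this changelog names (S2 hex, openalex:, pmid:, arxiv:); the function name and exact patterns are illustrative, not the real id_translation.py API.

```python
import re

_S2_HEX = re.compile(r"^[0-9a-f]{40}$")  # Semantic Scholar paperId


def classify_paper_id(pid: str) -> str:
    """Recognize which source an ID came from (sketch).

    Patterns are guesses at the mixed formats named in the changelog;
    the real module may accept more variants (bare W-IDs, DOIs, etc.).
    """
    if _S2_HEX.match(pid):
        return "s2"
    if pid.startswith("openalex:") or re.match(r"^W\d+$", pid):
        return "openalex"
    if pid.startswith("pmid:"):
        return "pubmed"
    if pid.startswith("arxiv:"):
        return "arxiv"
    return "unknown"
```

Centralizing this predicate is what keeps the walk, the fallback fanout, and the metadata counters from each growing their own prefix-matching logic.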
Coverage note: papers without abstracts are filtered by the standard _to_article_record schema requirement. Tier 1+ engine variant will handle abstract-less papers via structural-metadata-only ingest path (separate ship).
main.py's user-driven path already had per-stage timing via stage_timings. The autonomous-scheduler path (scheduler.py:_run_autonomous_pipeline) was missing it — only total elapsed was logged. Added per-stage capture for: qeb_encoding, concept_graph, cte_traversal, apriori, svw, hypothesis_generation, calibration, bridge_detection_and_rerank. Surfaced as a single [PIPELINE_TIMING] log line per run and persisted to the report's stage_timings field for downstream tooling.
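A per-stage capture like this is commonly done with a small context manager; the sketch below shows the pattern under the assumption that each stage is wrapped at its call site (the real scheduler code is not shown in this changelog, and timed_stage is a hypothetical name):

```python
import time
from contextlib import contextmanager

stage_timings = {}  # stage name -> elapsed seconds


@contextmanager
def timed_stage(name: str):
    """Record wall-clock elapsed for one pipeline stage (sketch)."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[name] = round(time.perf_counter() - t0, 3)


# Illustrative usage inside an autonomous pipeline runner:
#   with timed_stage("qeb_encoding"):
#       run_qeb(...)
#   print(f"[PIPELINE_TIMING] {stage_timings}")
```

Wrapping each of the eight stages this way yields exactly the one-line-per-run summary plus a persistable dict, which is what the "should we upgrade Render tier?" question needs.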
Triggered by Charter §16 codification (Patents L1 ran 50 min and we had no per-step data to answer "should we upgrade Render tier?"). This closes the gap on the Research autonomous path so the same question is answerable empirically there too.
Note: the multi-source citation walk hybrid (DOI translation + per-source fanout) flagged in v0.3.8 is now tracked as v0.3.10 (next minor).
Local Research validation surfaced two bugs in code I shipped earlier today.
Numpy truthiness crash (blocking). The topical-coherence term I added in v0.3.7 used d.get("v_unit") or d.get("v_seed") to pick a vector — a classic Python+numpy gotcha: when v_unit is a numpy array, the or operator triggers numpy's __bool__, which raises ValueError: The truth value of an array with more than one element is ambiguous. Two sites in cte_operations.py:golden_token_pathfinding were affected; both now use explicit None-checks. Effect: every autonomous run since v0.3.7 deployed (2026-04-30) crashed silently after the CTE ops stage, so the v0.3.5–v0.3.7 quality fixes never actually produced a usable report. Local re-validation post-fix: 4 succeeded, 0 failed (was 0/4).
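The gotcha and the fix, demonstrated with a stand-in class so the example doesn't need numpy installed (a real multi-element numpy array raises the same ValueError from `__bool__`):

```python
class FakeArray:
    """Stand-in for a multi-element numpy array: truth-testing it raises,
    exactly like numpy's ndarray.__bool__ on more than one element."""
    def __bool__(self):
        raise ValueError(
            "The truth value of an array with more than one element is ambiguous"
        )


def pick_vector_buggy(d):
    # `or` truth-tests the left operand -> raises when v_unit is an array
    return d.get("v_unit") or d.get("v_seed")


def pick_vector_fixed(d):
    # explicit None-check never truth-tests the array itself
    v = d.get("v_unit")
    return v if v is not None else d.get("v_seed")
```

The fixed version also behaves correctly for legitimately falsy-but-present values, which `or` silently skips.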
Citation-walk batch endpoint rejecting all requests (silent). v0.3.5's bidirectional citation walk was sending paper IDs to Semantic Scholar's /paper/batch endpoint in formats the endpoint refused with {"error":"No valid paper ids given"}. Cause: in the multi-source merged corpus, references[] contains IDs in mixed formats (S2 hex, openalex:WXX, pmid:NNN, arxiv:XX.XX). S2's batch endpoint only accepts S2 hex IDs (or its own prefix-tagged syntax, which we don't yet emit). Fix: filter neighbor IDs to S2 format (40-char hex) before batching; non-S2 IDs are dropped with a logged count. Also added 4xx response-body capture so future S2 errors show the actual error message on the first round-trip instead of an opaque 400 Bad Request.
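The safety-net filter reduces to a few lines; this sketch (function name hypothetical) shows the shape, including the logged drop count that made the v0.3.8 over-filtering visible in the first place:

```python
import logging
import re

_S2_HEX = re.compile(r"^[0-9a-f]{40}$")
log = logging.getLogger("citation_walk")


def filter_s2_ids(neighbor_ids):
    """Keep only S2-format hex IDs before calling /paper/batch (sketch).

    This is the v0.3.8 safety net, not the production architecture:
    dropped IDs are counted and logged so over-filtering is observable.
    """
    kept = [i for i in neighbor_ids if _S2_HEX.match(i)]
    dropped = len(neighbor_ids) - len(kept)
    if dropped:
        log.warning("citation walk: dropped %d non-S2 IDs", dropped)
    return kept
```

On a multi-source corpus this is exactly what produced "749 non-S2 IDs filtered, 0 added" — hence the hybrid design that replaces it.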
S2 ID filter is a safety net, not the production architecture. Multi-source corpora (OpenAlex, PubMed dominated) require cross-source ID resolution. Hybrid two-pass implementation lands in v0.3.10 (DOI translation + per-source fanout).
Follow-on work tracked separately for v0.3.9: translate non-S2 IDs to S2's prefix syntax (DOI:10.x, PMID:NNN) before batching, OR fan out per-source (OpenAlex API for openalex: IDs, PubMed E-utilities for pmid: IDs). Not a v0.3.8 ship — needs design.
Two coupled fixes for Perplexity's 2026-04-30 broad-vocabulary findings.
source_title lookup. Hypotheses now resolve to specific paper titles ("will unlock currently blocked progress toward: …").

Topical-coherence term. Pathfinding now rewards nodes whose fields_of_study intersect the corpus's dominant fields (≥30% of DOCUMENT nodes). Methodology papers typically declare different fields (Computer Science, Bioinformatics) than the substantive research papers (Biology, Medicine), so the centroid stays anchored to subject matter, not tooling. The other four PATHFINDING_WEIGHTS dropped uniformly (0.25 → 0.2125) so the new mass doesn't compound.

session_corpus_pull.py. Discovery-driven topic runs (queue-managed via topic_queue_runtime) now expand the corpus with a 1-hop citation-graph walk after multi-source ingest completes. Takes the top 50 most-cited papers from the initial pull, collects both their references (backward — ancestral foundations) and citations (forward — downstream sub-communities), batch-fetches the metadata via Semantic Scholar's /paper/batch endpoint, dedupes against the existing corpus, and appends. Hard-capped at 400 net new neighbor IDs per run to bound API cost and wall-clock; a typical add lands at 100-300 articles in 30-60 seconds. This applies to discovery-driven runs (pending_items.md proposals promoted into topic_queue.json and registered into DOMAINS at runtime); fixed-config domains have curated query sets and skip this step. Failure modes are non-fatal — on any S2 error the run continues with the unexpanded corpus.

Corpus fingerprint bug. Receipts showed no-corpus even though the run ingested 542 sources. Root cause: in the autonomous-scheduler code path, the article_index was being built AFTER the report was rendered, so the report's fingerprint check (report.get("article_index", {})) saw an empty dict and emitted "no-corpus" regardless of how many articles ingested. Fix: build article_index and assign it into the report dict BEFORE generate_article / generate_technical_report run. Receipts now show the real corpus signature.
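A quick arithmetic check on the PATHFINDING_WEIGHTS change mentioned above, assuming the five weights are meant to sum to 1.0 (the changelog states the uniform drop but not the normalization constraint or the new term's weight):

```python
# Before v0.3.7: four pathfinding weights at 0.25 each.
OLD_WEIGHTS = [0.25] * 4
# After the uniform drop: the same four weights at 0.2125 each.
NEW_WEIGHTS = [0.2125] * 4

# Mass freed by the drop -- under the sum-to-1.0 assumption, this is
# exactly what the new topical-coherence term can take without the
# total compounding past 1.0.
freed_mass = sum(OLD_WEIGHTS) - sum(NEW_WEIGHTS)  # 1.00 - 0.85 = 0.15
```

So the implied weight of the new term is 0.15, and "doesn't compound" means the total stays at 1.0 rather than growing to 1.15.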
Note: this is the surface-symptom fix; the deeper architectural item (citation-graph 1-hop expansion to address flat-cluster structure on broad-vocabulary domains) is tracked separately and is multi-day work.

_detect_tooling_obi_signals() runs between citation-velocity and the hidden-citation bigram check. Two-layer detector: a STRONG signal (a single GitHub URL, an "R package" / "Python package" / Bioconductor / CRAN / PyPI mention, or an "available at https://" link) is sufficient on its own; a WEAK signal (generic terms like "framework", "library", "implementation") only fires when co-occurring with a package-name pattern in the title (an all-caps acronym like HTSeq/BLAST, or CamelCase like DESeq2/uniCATE). Designed for precision — generic prose like "in our framework we propose..." won't trip it.

The footer's "Built on" line was undercounting: it said "6 U.S. Patents Pending" and named only CTE + QECO. The actual total since the Tesseract Composition supplemental landed (2026-04-20) is 8, and the engine rides more than two substrate patents. Updated the count and patent list accordingly.
Same correction applied to the Firebase brand site's DCFN-Research card.
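A sketch of the two-layer tooling-signal detector described above. The regexes and helper names are illustrative approximations of the rules the changelog states, not the real _detect_tooling_obi_signals() implementation:

```python
import re

# STRONG: any one of these is sufficient on its own.
STRONG = re.compile(
    r"github\.com/|\b(?:R|Python) package\b|\bBioconductor\b|\bCRAN\b|"
    r"\bPyPI\b|available at https?://",
    re.IGNORECASE,
)
# WEAK: generic tooling vocabulary, only meaningful with corroboration.
WEAK = re.compile(r"\b(?:framework|library|implementation)\b", re.IGNORECASE)


def _looks_like_pkg_name(word: str) -> bool:
    # All-caps acronym (BLAST) or mixed-case with an internal capital
    # (HTSeq, DESeq2, uniCATE). Approximate; real rules may differ.
    if re.fullmatch(r"[A-Z]{3,}\d*", word):
        return True
    return bool(re.search(r"[A-Z]", word[1:])) and bool(re.search(r"[a-z]", word))


def detect_tooling_obi(title: str, abstract: str) -> bool:
    """Two-layer detector (sketch): STRONG fires alone; WEAK needs a
    package-name pattern in the title."""
    if STRONG.search(title) or STRONG.search(abstract):
        return True
    if WEAK.search(abstract):
        return any(_looks_like_pkg_name(w) for w in re.findall(r"\w+", title))
    return False
```

The precision bias is visible in the last branch: "in our framework we propose..." under a plain-English title produces a WEAK hit with no title corroboration, so it stays silent.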
The Research engine's autonomous-run path now drives from a discovery-agent-fed queue instead of cycling fixed domains. A discovery agent identifies new research topics worth running by querying Semantic Scholar (with PubMed fallback) for substantive recent activity in curated seed areas, derives a topic configuration from the top results, and proposes it for human review. After a 7-day cooldown without rejection, the proposal auto-promotes into the live run queue, where the engine executes the full pipeline against it once or twice before going dormant.
Why this matters: it converts the autonomous path from "run the same three domains every day" (which produces noise) into "surface new research territory worth exploring" (which produces signal). Each run feeds the Bridge Inbox + LEF Ai Upstream telemetry channels — autonomous runs are the substrate's input.
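The 7-day auto-promotion rule reduces to a small predicate. Field names ("proposed_at", "rejected") and the function name are illustrative; only the cooldown semantics come from the description above:

```python
from datetime import datetime, timedelta, timezone

COOLDOWN = timedelta(days=7)


def should_promote(proposal, now=None):
    """Auto-promote a discovery-agent proposal into the live run queue
    after a 7-day review window with no human rejection (sketch)."""
    if proposal.get("rejected"):
        return False  # a rejection at any point blocks promotion
    now = now or datetime.now(timezone.utc)
    proposed = datetime.fromisoformat(proposal["proposed_at"])
    return now - proposed >= COOLDOWN
```

The inverted default (silence promotes, rejection blocks) is what lets the queue grow autonomously while keeping a human veto in the loop.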
Solved Obliteration by Incorporation (OBI) — flagged by Gemini 2026-04-30 deep-research review of the v0.2 era output. Previously the engine was treating universally-adopted methods as "decayed" simply because they'd stopped being explicitly cited (their methods became the field's default vocabulary). The engine now distinguishes "Canonical Foundations (Absorbed by Incorporation)" from genuinely abandoned work. Concrete validation from Gemini: the engine correctly identifies the HTSeq Python framework — "22,482 lifetime citations but 0 in the last 5 years; hasn't decayed; it has just become structural canon" — instead of false-flagging it. This removes a major false-positive class from the engine's untested-foundation analysis.
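The core distinction can be sketched as a thresholding rule. The cutoff below is illustrative only; the changelog confirms the HTSeq example (22,482 lifetime citations, 0 in the last 5 years) but does not publish the engine's actual thresholds:

```python
def classify_citation_decay(lifetime_citations: int,
                            recent_citations: int,
                            canon_floor: int = 10_000) -> str:
    """Separate absorbed canon from genuine abandonment (sketch).

    canon_floor is a hypothetical cutoff: a paper with huge lifetime
    impact but zero recent explicit citation hasn't decayed -- its
    methods became the field's default vocabulary.
    """
    if recent_citations == 0 and lifetime_citations >= canon_floor:
        return "canonical_foundation_absorbed"
    if recent_citations == 0:
        return "possibly_abandoned"
    return "active"
```

Under this rule HTSeq lands in "canonical_foundation_absorbed" rather than the untested-foundation bucket, which is exactly the false-positive class the fix removes.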
High-signal convergence anchor detection — the engine surfaces single papers that multiple research clusters orbit without explicitly cross-citing. Gemini 2026-04-30 validated on the Trauma-Informed Care × Restorative Justice run: "353 independent research groups across 42 years converging on the exact same academic success metrics without sharing a direct citation path." Convergence anchors are the engine's strongest signal for "where the field is heading without anyone having named it yet."
Bridge digest format — autonomous runs now produce a structured Bridge Digest containing all bridge intelligence (gaps, severity, gap types, abstracts) suitable for ingestion by future Bridge engines that sit between two DCFN builds.
Syntari Record (JSON twin) — every run now produces a structured JSON twin alongside the prose Article, suitable for downstream machine-readable consumption.
Initial deployment. Single-page intake → multi-source ingest → concept graph construction with typed edges → Cognitive Traversal Engine (5 operations: backward / forward / branch cataloging / entropy / golden token) → SVW convergence detection → Apriori pattern mining → Article + Technical Report generation. Free 5 runs / month / browser; $15 unlock for Layer 2 + Layer 3 deeper traversal.