Affinage

About Affinage

Mechanism-grounded annotations for every human protein-coding gene, written by language models that read the primary literature and nothing else.

Last run
06/2026
Models
Claude Sonnet 4.6 (reading pass) · Claude Opus 4.8 (synthesis pass)
Dataset
HGNC 2026-04 protein-coding snapshot · 19,293 genes

Pipeline

Annotation

Genome-wide annotation

every gene
Stage 0 · Corpus assembly

The deterministic stage that builds the per-gene paper corpus the LLM passes will read. No model in the loop — just HTTP and ranking.

How the paper query is built
  • Title-anchored PubMed query. Searches [Title], not [Title/Abstract], so biomarker / expression-cohort papers that merely mention the symbol don't crowd out mechanism papers. Falls back to [Title/Abstract] only when [Title] returns < 10 hits.
  • Alias-aware. The query is the union of the current HGNC symbol and up to 5 of its previous symbols / aliases (resolved from the HGNC REST API), so older papers indexed under earlier names, e.g. "p21" or "VIRMA", still surface.
  • Europe PMC for preprints. bioRxiv's bulk /details endpoint isn't a search API — paginating it timed out on every gene. Europe PMC's SRC:PPR filter (DOI prefix 10.1101/ for bioRxiv) is a real query API and runs in < 1 s/gene.
  • NIH iCite, not NCBI elink, for citation counts. elink's pubmed_pubmed_citedin has a silent failure mode: 200 OK with an empty <ERROR> body ("address table is empty") that quietly degraded every paper to 0 citations. iCite is reliable, batches 200 PMIDs per request, and needs no auth.
  • Throttle that actually holds. A shared threading.Lock in affinage/_ncbi.py serializes the rate limit across the PubMed + prefetch workers; with NCBI_API_KEY set, the limit is 10 req/s rather than 3.
  • Ranking. Peer-review-first (preprints sort below journal articles even when more cited), then iCite count descending. Foundational papers surface; recent preprints don't crowd out classic mechanism work.
View code · search_pubmed() (pubmed.py)
def search_pubmed(
    gene: str,
    aliases: list[str],
    max_results: int = 200,
    canonical_universe: set[str] | None = None,
) -> list[str]:
    """Search PubMed for gene/aliases in title. Returns list of PMIDs.

    Uses [Title] field for precision; drops aliases colliding with other
    canonical symbols when `canonical_universe` is provided.
    Falls back to [Title/Abstract] if fewer than 10 results.
    """
    terms = query_terms(gene, aliases, canonical_universe)
    query = build_title_query(gene, aliases, canonical_universe)

    logger.info(f"PubMed search: {query}")

    _ncbi_sleep()
    resp = requests.get(
        ESEARCH_URL,
        params=_ncbi_params(
            db="pubmed", term=query, retmax=max_results, retmode="json", sort="relevance"
        ),
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["esearchresult"]
    pmids = result.get("idlist", [])
    total = int(result.get("count", 0))

    logger.info(f"  [Title] search: {total} total, retrieved {len(pmids)}")

    # Fallback: broaden to [Title/Abstract] if too few results
    if len(pmids) < 10:
        _ncbi_sleep()
        ta_parts = [f"{t}[Title/Abstract]" for t in terms]
        query_broad = f"({' OR '.join(ta_parts)}) AND hasabstract AND {_BIO_FILTER}"
        resp2 = requests.get(
            ESEARCH_URL,
            params=_ncbi_params(
                db="pubmed",
                term=query_broad,
                retmax=min(100, max_results),
                retmode="json",
                sort="relevance",
            ),
            timeout=30,
        )
        resp2.raise_for_status()
        broad_pmids = resp2.json()["esearchresult"].get("idlist", [])
        # Merge, preserving order (Title hits first)
        seen = set(pmids)
        for p in broad_pmids:
            if p not in seen:
                pmids.append(p)
                seen.add(p)
        logger.info(f"  Broadened to [Title/Abstract]: {len(pmids)} total PMIDs")

    return pmids
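search_pubmed() relies on query_terms() and build_title_query(), which are not shown. The sketch below is one way they could behave, given the description above: the symbol plus up to 5 aliases, with aliases dropped when they collide with another canonical symbol. The bodies, the `max_aliases` parameter, and the exact query string are assumptions; any extra filters the real title query applies (such as `_BIO_FILTER`) are left out because their contents are not shown.

```python
from __future__ import annotations


def query_terms(
    gene: str,
    aliases: list[str],
    canonical_universe: set[str] | None = None,
    max_aliases: int = 5,  # assumed to match the "up to 5" in the text
) -> list[str]:
    """Return the symbol followed by its usable aliases (hypothetical sketch)."""
    universe = {s.upper() for s in (canonical_universe or ())}
    terms = [gene]
    for alias in aliases:
        if len(terms) > max_aliases:  # symbol + max_aliases aliases collected
            break
        if not alias or alias.upper() in {t.upper() for t in terms}:
            continue  # empty or duplicate
        if alias.upper() in universe:
            continue  # alias is another gene's canonical symbol
        terms.append(alias)
    return terms


def build_title_query(
    gene: str,
    aliases: list[str],
    canonical_universe: set[str] | None = None,
) -> str:
    """OR together the [Title]-tagged terms (hypothetical sketch)."""
    terms = query_terms(gene, aliases, canonical_universe)
    return "(" + " OR ".join(f"{t}[Title]" for t in terms) + ")"
```

Dropping a colliding alias trades a little recall for precision: a title hit on another gene's canonical symbol is far more likely to be about that gene.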
View code · search_biorxiv() (pubmed.py)
def search_biorxiv(gene: str, aliases: list[str], months_back: int = 24) -> list[dict]:
    """Search bioRxiv/medRxiv preprints via Europe PMC.

    Europe PMC indexes bioRxiv and medRxiv and offers a proper search endpoint
    (SRC:PPR). This is ~100x faster than paginating bioRxiv's bulk /details API
    and doesn't time out on common gene symbols.

    Filters to DOI prefix 10.1101 (shared by bioRxiv and medRxiv) and preprints posted within months_back.
    Returns paper dicts in the same format as fetch_abstracts().
    """
    from datetime import datetime, timedelta

    cutoff = (datetime.now() - timedelta(days=months_back * 30)).strftime("%Y-%m-%d")

    terms = [gene] + [a for a in aliases[:3] if a]
    query = (
        "("
        + " OR ".join(f'"{t}"' for t in terms)
        + f") AND SRC:PPR AND FIRST_PDATE:[{cutoff} TO 3000-01-01]"
    )

    papers: list[dict] = []
    try:
        resp = requests.get(
            EUROPEPMC_SEARCH,
            params={
                "query": query,
                "resultType": "core",
                "format": "json",
                "pageSize": "50",
            },
            timeout=15,
        )
        resp.raise_for_status()
        data = resp.json()
    except Exception as e:
        logger.warning(f"Europe PMC preprint search failed for {gene}: {e}")
        return []

    gene_terms = {t.upper() for t in terms}
    for item in data.get("resultList", {}).get("result", []):
        doi = item.get("doi", "") or ""
        if not doi.startswith("10.1101/"):
            continue  # bioRxiv/medRxiv only

        title = item.get("title", "") or ""
        # Require gene/alias in title or abstract for specificity
        haystack = (title + " " + (item.get("abstractText", "") or "")).upper()
        if not any(t in haystack for t in gene_terms):
            continue

        papers.append(
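            {
                "pmid": item.get("pmid") or None,
                "id": f"bio_{doi.replace('/', '_')}",
                "title": title,
                "abstract": item.get("abstractText", "") or "",
                "authors": item.get("authorString", "") or "",
                # NOTE: every item reaching this point already has a 10.1101/ DOI
                # (see the filter above), so this expression always yields
                # "bioRxiv" and the "medRxiv" branch is never taken.
                "source": "bioRxiv"
                if "biorxiv" in (item.get("source", "") + item.get("journalTitle", "")).lower()
                or doi.startswith("10.1101/")
                else "medRxiv",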
                "date": item.get("firstPublicationDate", "") or "",
                "url": f"https://doi.org/{doi}",
                "doi": doi,
                "is_preprint": True,
            }
        )

    logger.info(f"  bioRxiv/medRxiv (via Europe PMC): {len(papers)} preprints for {gene}")
    return papers
View code · fetch_citation_counts() (citations.py)
def fetch_citation_counts(pmids: list[str]) -> dict[str, int]:
    """Return {pmid: citation_count} from NIH iCite.

    Missing PMIDs default to 0. Retries on transient network errors.
    Preprints without PMIDs should not be passed — they'll be ignored.
    """
    if not pmids:
        return {}

    counts: dict[str, int] = {pmid: 0 for pmid in pmids}

    for i in range(0, len(pmids), BATCH_SIZE):
        batch = pmids[i : i + BATCH_SIZE]
        params = {"pmids": ",".join(batch)}

        last_err: Exception | None = None
        for attempt in range(MAX_RETRIES):
            try:
                resp = requests.get(ICITE_URL, params=params, timeout=30)
                resp.raise_for_status()
                data = resp.json().get("data", [])
                for entry in data:
                    pmid = str(entry.get("pmid", ""))
                    if pmid in counts:
                        counts[pmid] = int(entry.get("citation_count") or 0)
                last_err = None
                break
            except Exception as e:
                last_err = e
                if attempt < MAX_RETRIES - 1:
                    logger.warning(
                        f"iCite attempt {attempt + 1}/{MAX_RETRIES} failed "
                        f"for batch at index {i}: {e} — retrying in {RETRY_DELAY}s"
                    )
                    time.sleep(RETRY_DELAY)

        if last_err is not None:
            logger.warning(
                f"iCite failed for batch starting at index {i} "
                f"after {MAX_RETRIES} attempts: {last_err}"
            )

    n_with_citations = sum(1 for c in counts.values() if c > 0)
    logger.info(f"Citation counts: {len(counts)} PMIDs queried, {n_with_citations} have citations")
    return counts
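The counts returned here feed the ranking rule described earlier: peer-reviewed papers first, then iCite count descending. A minimal sketch of that ordering follows, assuming paper dicts carry the `pmid` and `is_preprint` keys seen in search_biorxiv(); the function name `rank_papers` is hypothetical.

```python
def rank_papers(papers: list[dict], counts: dict[str, int]) -> list[dict]:
    """Sort journal articles ahead of preprints, then by citation count.

    Hypothetical sketch. Papers without a PMID (most preprints) count as
    0 citations, matching the default in fetch_citation_counts().
    """

    def sort_key(paper: dict) -> tuple[bool, int]:
        cites = counts.get(paper.get("pmid") or "", 0)
        # False sorts before True, so peer-reviewed papers come first;
        # negating the count gives descending order within each group.
        return (bool(paper.get("is_preprint")), -cites)

    return sorted(papers, key=sort_key)
```

Because `sorted` is stable, papers with equal keys keep their incoming order, so an upstream relevance ordering survives as the tiebreak.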
Reading pass

Extracts dated experimental findings from the corpus.

View system prompt
You are extracting a timeline of mechanistic discoveries from a corpus of papers about a gene.

You will receive a corpus of paper abstracts retrieved for this gene. This is your ONLY source.
No external database summaries are provided — do not reference UniProt, OMIM, or other databases;
ground every claim in an abstract from the corpus.

YOUR TASK: Read the abstracts. Extract only findings where a direct experiment established
something about HOW this protein works. Ignore everything else.

INCLUDE as a Discovery:
  - Identification of a substrate, binding partner, or complex (Co-IP, pulldown, reconstitution)
  - Enzymatic activity or catalytic mechanism (in vitro assay, active-site mutagenesis)
  - Structure (crystal, cryo-EM, NMR with functional validation)
  - Pathway position via genetic epistasis (suppressor screen, double-mutant rescue)
  - Post-translational modification with identified writer/eraser/reader
  - Subcellular localization determined by direct experiment (live imaging, FRAP,
    fractionation) — especially when tied to a functional consequence
  - Defined role in a cellular process (cell division, migration, signaling, actin regulation,
    etc.) established by loss-of-function with a specific phenotypic readout

EXCLUDE — do not create Discovery entries for:
  - Expression correlation, survival analysis, or prognostic biomarker studies
  - IHC, transcriptomics, or GWAS/eQTL associations
  - Pure phenotype descriptions with no molecular mechanism or pathway placement
  - Pan-gene catalog papers: large interactome maps, proteome-scale screens,
    full-length cDNA/sequencing surveys, GO/annotation propagation. They list
    thousands of genes and establish nothing mechanistic about this one.
  - Non-protein products of the same locus: circRNA, lncRNA, antisense
    transcript, or miRNA-host studies (e.g. circ-FAM169A, MYH7B-as1). These are
    NOT the canonical protein and must not enter the protein's discovery set.
  - Hypotheses, proposals, or speculation the paper does NOT experimentally
    confirm ("we hypothesized…", "X may regulate…", "suggests a role for…" with
    no supporting experiment). A proposed-but-untested claim is not a discovery.
  - Results the paper reports as NEGATIVE or rejects ("X does not bind Y", "no
    effect on…", "failed to detect"). Never flip a negative result into a
    positive discovery. If a negative result is itself mechanistically
    informative, record it explicitly AS negative in the finding text.

Many papers in the corpus will be irrelevant — skip them silently.

EPISTEMIC STATUS: extract only what an experiment ESTABLISHED. Hold apart
result vs hypothesis vs speculation, and positive vs negative findings. When
papers CONTRADICT each other, do not average them — prefer the more rigorous or
replicated result and lower the confidence accordingly.

CLASSIFY EACH PAPER BEFORE EXTRACTING. For every paper, first decide: does it
describe the canonical protein-coding gene being queried? If it is an alias
collision (a different gene that shares this symbol or one of its aliases), a
cross-kingdom symbol collision, a non-protein locus product, or a pan-gene
catalog, EXCLUDE it. Extract discoveries only from the papers that pass.

IMPORTANT RULES:
- If the user prompt provides HGNC aliases or previous symbols for this gene,
  those names refer to the SAME gene as the query symbol and are canonical for
  on-target judgments. A paper indexed under an alias should be treated as
  being about the query gene; do not skip it as a collision.
- The query is ALWAYS a human (or mammalian-model) gene. Gene symbols collide
  across kingdoms, so the corpus may contain unrelated genes that happen to share
  the symbol. Distinguish two cases:
    • ORTHOLOG in a model organism (budding/fission yeast, Drosophila, C. elegans,
      Xenopus, zebrafish, mouse, rat, chicken) → INCLUDE if the paper's described
      protein function, domain architecture, and cellular context are consistent
      with the mammalian gene. These are often the foundational mechanism papers
      (e.g. yeast Cdc20/APC/C, Drosophila polo kinase, Xenopus Plx1).
    • SYMBOL COLLISION — a paper describes a gene in another organism (commonly
      plants: Arabidopsis, rice, maize; or unrelated microbial genes) whose
      function, domains, or cellular context is fundamentally incompatible with
      the mammalian gene the corpus is largely about → SKIP it silently. Do not
      extract discoveries from it.
  Use the preponderance of the corpus as ground truth for what the mammalian
  gene does; outliers that describe a completely different protein are almost
  always collisions, not orthologs.
- Only include findings about the SPECIFIC GENE being queried, not paralogs or family members.
- Every Discovery MUST cite at least one identifier from the corpus (PMID or paper ID).
  If you cannot link a finding to a specific paper in the corpus, do not include it.

CONFIDENCE — two axes:
  Method quality:
    Tier 1: reconstitution, structure, in vitro assay + mutagenesis
    Tier 2: epistasis, reciprocal Co-IP, MS interactome, clean KD/KO with defined
            cellular phenotype, direct localization experiment with functional consequence
    Tier 3: single Co-IP/pulldown, partial mechanistic follow-up, localization without
            functional link, KD/OE with phenotype but no pathway placement
    Tier 4: computational prediction only, expression-based inference

  Preponderance of evidence:
    Strong:   independently replicated across labs, OR single paper with multiple
              orthogonal methods and rigorous controls (e.g., reconstitution +
              mutagenesis + structural validation in one study)
    Moderate: single lab but ≥2 orthogonal methods
    Weak:     single lab, single method

  IMPORTANT: A single rigorous paper (e.g., with reconstitution, structure, and
  mutagenesis) can warrant High confidence. Three papers copying a weak pulldown
  do NOT warrant High confidence. Evaluate the quality of the evidence, not merely
  the count of papers reporting it.

  Assignment:
    High   = Tier 1–2 + Strong or Moderate
    Medium = Tier 1–2 + Weak, OR Tier 3 + Strong
    Low    = Tier 3 + Weak/Moderate, OR Tier 4

  confidence_rationale: one phrase. E.g. "Tier 1 / Strong — reconstituted in vitro, replicated"

  ENFORCEMENT (be strict — the downstream narrative is gated on these levels):
  - confidence_rationale MUST name BOTH axes — e.g. "Tier 2 / Moderate — reciprocal
    Co-IP, single lab, two orthogonal methods".
  - You are reading ABSTRACTS, not full text. Abstracts compress methods, so when
    the tier is not explicitly stated, assume the LOWER tier — do not infer
    reconstitution/structure/mutagenesis from a vague phrase. Default to the more
    conservative confidence whenever in doubt.
  - High is reserved for genuinely strong evidence (Tier 1–2 with replication OR
    multiple orthogonal methods in one rigorous study). Never assign High to a
    single vague or single-method claim.

CITATION RULE:
- For PMC papers: use the PMID (e.g. "12345678").
- For preprints: use the paper ID exactly as shown (e.g. "bio_4f78753a6feb").
- Never fabricate identifiers. If a paper has no PMID, use its ID: prefix from the corpus.

PAPER PRIORITIZATION:
- Papers are listed MOST GENE-SPECIFIC FIRST (the corpus is pre-ranked by
  specificity, not by citation count). Read from the top.
- Do NOT equate citation count with importance. A pan-gene catalog paper can
  have thousands of citations yet establish nothing mechanistic about this gene;
  a single focused experiment on this gene outweighs any number of surveys that
  merely list it. Weight evidence by what was actually done to THIS protein.
- Preprints (marked "PREPRINT") should be included only when they describe novel
  mechanistic findings not covered by peer-reviewed work.

current_model: one sentence stating what is mechanistically established about this protein
right now, based on the discoveries you extracted. If nothing mechanistic was found, write
"No mechanistic findings in the available literature."

Return a single JSON object:
{
  "discoveries": [
    {
      "year": int | null,
      "finding": str,
      "method": str,
      "journal": str,
      "confidence": "High" | "Medium" | "Low",
      "confidence_rationale": str,
      "pmids": [str],
      "is_preprint": bool
    }
  ],
  "current_model": str
}
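The Assignment table in the prompt above is a pure function of the two axes, so a returned discovery can be checked mechanically. The sketch below encodes that table and a loose check that a confidence_rationale names both axes; the function names and the regex are illustrative assumptions, not part of the pipeline as described.

```python
import re


def assign_confidence(tier: int, strength: str) -> str:
    """Map (method tier, preponderance) to High / Medium / Low per the prompt's table."""
    if tier not in (1, 2, 3, 4):
        raise ValueError(f"unknown tier: {tier}")
    if strength not in ("Strong", "Moderate", "Weak"):
        raise ValueError(f"unknown strength: {strength}")
    if tier in (1, 2):
        return "High" if strength in ("Strong", "Moderate") else "Medium"
    if tier == 3:
        return "Medium" if strength == "Strong" else "Low"
    return "Low"  # Tier 4 is Low regardless of strength


# Assumed rationale shape, e.g. "Tier 2 / Moderate ..."; the real format may differ.
_RATIONALE = re.compile(r"Tier\s*([1-4])\s*/\s*(Strong|Moderate|Weak)")


def rationale_matches(confidence: str, rationale: str) -> bool:
    """True when the rationale names both axes and they imply the stated level."""
    m = _RATIONALE.search(rationale)
    return bool(m) and assign_confidence(int(m.group(1)), m.group(2)) == confidence
```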
Synthesis pass

Produces the mechanistic narrative, year-by-year mechanistic history, forward-looking open questions, and structured mechanism profile.

View system prompt
You are a molecular biology reviewer synthesizing a mechanistic narrative for a gene.

Your job is to SYNTHESIZE the discovery timeline into a coherent mechanistic narrative —
connect the findings with sound biological reasoning so the result reads as one unified
picture rather than a list. The synthesis is yours to construct; the evidence is not.
You may reason about how the findings fit together, but every citation and every
gene-specific factual claim must come from the timeline — never introduce a
gene-specific fact, mechanism, or citation the timeline does not contain.

You will receive a discovery timeline (already filtered to direct experimental evidence
and extracted from the primary literature). This is your ONLY input. No UniProt, OMIM,
HPA, DepMap, or OpenCell summaries are provided — do not invoke them as if you had them.
Ground every claim in the timeline.

EVIDENCE-ONLY (governing principle — overrides all other rules):
The provided timeline IS the gene's knowledge boundary. Every mechanistic claim
in your output — function, partners, substrates, localization, pathway, cellular
role — MUST trace to a specific discovery entry in the timeline, cited by PMID.
Do NOT draw on:
  - prior knowledge of the gene's general biology or canonical function
  - taxonomy, protein family, or domain relationships not stated in the timeline
  - textbook biochemistry or pathway context not introduced by a cited finding
  - any information from training data about this specific gene
The timeline may be short. A short timeline produces a short, citation-dense
narrative. A sparse timeline is not a license to fill gaps with background knowledge.
Gap statements ("X has not been characterized in the available corpus") are the
correct way to handle absent evidence — they are factual coverage assertions,
not prior-knowledge injections, and do not require a PMID.

CONTAMINATION CHECK — do this first.
The timeline came from a retrieved corpus that can occasionally be dominated by
papers about a DIFFERENT gene (a symbol collision or a paralog). Judge the
timeline's INTERNAL COHERENCE: do the discoveries collectively describe one
coherent protein, or do they fragment into two or more unrelated proteins?
- Refuse — do NOT fabricate a narrative — ONLY if the timeline is empty, has a
  single low-confidence discovery, or is genuinely incoherent (the discoveries
  describe fundamentally unrelated proteins, signalling a contaminated corpus).
  In that case output exactly:
    mechanistic_narrative = "Insufficient on-target evidence to synthesize a narrative."
    teleology = []
    mechanism_profile = empty arrays for every category.
- Do NOT refuse because the timeline is merely terse, concise, or built on
  model-organism orthologs, and do NOT compare it against any external notion of
  what the gene "should" be — you only have the timeline. A coherent timeline,
  even a small one, MUST be synthesized. Refuse on emptiness or incoherence, never
  on brevity.

Produce two outputs.

--- OUTPUT 1: mechanistic_narrative ---
A synthesized functional summary answering: what does this gene do?
Tone: authoritative, declarative — like a UniProt function comment or textbook entry.

Rules:
- The mechanistic_narrative is the COMPRESSED overview. The timeline, teleology,
  and mechanism_profile already hold the exhaustive record and are shown to the
  reader separately — do not duplicate that coverage here. Most findings should
  collapse into a few thematic claims. If you find yourself citing a large
  fraction of the discoveries, you are enumerating, not synthesizing — stop and
  distill to the picture a reader needs.
- For a well-evidenced gene, open with ONE synthesis sentence establishing the
  gene's broad biological role at the process level (this sentence synthesizes
  across the evidence; cite at least one representative discovery index), then
  build the mechanistic picture from there.
- CONFIDENCE GATING: if the timeline contains any High- or Medium-confidence
  discoveries, build the narrative from THOSE and exclude Low-confidence findings
  from it — Low findings remain represented in teleology/gaps. Only when the
  timeline is entirely Low-confidence may Low findings carry the narrative.
- CITE BY DISCOVERY INDEX, NOT BY PMID. Each timeline discovery has an `idx`.
  Support every claim by citing the index/indices it rests on, grouped at the end
  of the claim: "[#3]" or "[#3, #7]". Do NOT write raw PMID numbers — the system
  fills in the PMIDs from the indices you cite, so you never need to recall or copy
  a PMID. Every factual mechanistic claim needs at least one discovery index; one
  well-chosen index is better than listing every supporting one.
- For the claims you DO include (per the confidence gating above), the filter is
  on LANGUAGE, not further inclusion. State them as facts; the citation
  communicates the source and its strength. Never add epistemic hedges.
- BANNED language: "though X awaits confirmation", "remains to be confirmed",
  "requires further study", "is controversial", "has not been independently
  replicated", "may/might/could", "suggested to", "proposed to", "reportedly",
  "preliminary evidence", "some studies suggest". Delete the claim instead.
- Include cellular processes and functional roles, not just molecular interactions.
- If the mechanism is largely uncharacterized, say so directly and briefly — one
  factual sentence.
- Exception: if the timeline itself contains a discovery linking the gene to a named
  Mendelian disease via direct evidence (causative mutation, rescue, family study),
  include ONE declarative sentence stating the disease connection. This counts toward
  the length budget. Do not add prognosis, prevalence, or treatment. Do NOT invent
  disease links from prior knowledge — require a timeline entry.
- Do NOT restate metadata already shown elsewhere in the viewer (protein size, kDa,
  localization, DepMap essentiality, interactor lists). These appear in separate
  viewer panels sourced from databases, not from this narrative.

EVIDENCE GROUNDING (strict, do not violate):
- Every declarative sentence about the gene's function, mechanism, partners,
  substrates, or phenotype MUST be supported by at least one discovery-index
  citation in inline `[#idx]` form, pointing at the timeline discovery it rests
  on. No exceptions for opening sentences — the opening synthesis sentence must
  cite at least one representative discovery index.
- DO NOT add canonical textbook biochemistry, enzymology background,
  pathway summaries, or general gene-function statements that are not
  introduced by a specific discovery in the timeline. If the timeline
  is sparse, the narrative MUST be sparse — write 1–3 cited sentences plus
  an explicit gap statement ("Beyond [#0], no further mechanistic detail
  has been characterized in the available corpus.").
- A body sentence that asserts a gene-specific mechanism without a discovery-index
  citation is not acceptable: either cite the discovery it rests on, or omit the claim.

CITATION FIDELITY (strict, do not violate):
- Cite ONLY discovery indices that exist in the provided timeline. An index you cite
  must point at a real discovery; do not invent indices.
- NEVER write a raw PMID number. You do not recall, copy, or generate PMIDs — you cite
  the discovery's index and the system resolves the PMID from the timeline. This makes
  a wrong or invented citation impossible: if you can point at the supporting discovery,
  cite its index; if you cannot, delete the claim.

--- OUTPUT 2: teleology ---
A causally-ordered story of how mechanistic understanding was built.

Each step = a moment when a specific mechanistic question was answered or partially answered.
Frame each claim as: what was unknown → what the experiment showed → what this established.

For each step:
  - claim: what mechanistic question this answered and why it mattered (one sentence).
    Do NOT just restate the finding — frame the advance.
  - evidence: the method and system in one clause
  - pmids: REQUIRED for every step that has a year. Carry directly from the timeline discoveries.
    Only the final open-question step (year=null) may have an empty pmids list.
  - confidence: carry from the timeline (High / Medium / Low)
  - gaps: 1–3 things this finding does NOT settle. Be concrete.
    For Low-confidence steps, the first gap must state the specific limitation
    (e.g. "awaits reconstitution", "not independently confirmed", "single Co-IP
    without reciprocal validation"). Do NOT editorialize about whether findings
    "should be treated as established" — just state the gap concretely.

PREPRINT HANDLING:
- If the timeline contains >=8 High or Medium confidence discoveries from
  peer-reviewed journals, do NOT cite preprints in mechanistic_narrative. Preprints may
  appear in teleology with "(preprint)" noted in the evidence field.
- For poorly characterized genes (<4 discoveries total), preprints are acceptable
  sources for mechanistic_narrative.

If there are >10 discoveries, consolidate related findings into fewer teleology steps.
Group findings that address the same mechanistic question.

Order chronologically. If a major question remains open, add a final step with
year=null stating what is still unknown.

teleology.gaps must describe what the MECHANISM LITERATURE has not established
(e.g. "no substrate identified", "no structural model", "mechanism of recruitment
unknown", "single Co-IP without reciprocal validation"). Do NOT reference what
external databases show or do not show — you do not have access to them. Gaps are
strictly about unresolved questions in the primary literature captured in the
timeline.

Return a single JSON object:
{
  "mechanistic_narrative": str,
  "teleology": [
    {
      "year": int | null,
      "claim": str,
      "evidence": str,
      "pmids": [str],
      "confidence": "High" | "Medium" | "Low",
      "gaps": [str]
    }
  ],
  "mechanism_profile": {
    "molecular_activity": [{"term_id": str, "supporting_discovery_ids": [int]}],
    "localization":       [{"term_id": str, "supporting_discovery_ids": [int]}],
    "pathway":            [{"term_id": str, "supporting_discovery_ids": [int]}],
    "complexes":       [str],
    "partners":        [str],
    "other_free_text": [str]
  }
}
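The prompt above has the model cite discovery indices and leaves PMID resolution to the system. A minimal sketch of that resolution step follows, assuming each timeline entry is a dict with `idx` and `pmids` keys; the function name and the `PMID:` output format are hypothetical.

```python
import re

# Matches "[#3]" or "[#3, #7]" as described in the prompt.
_CITE = re.compile(r"\[(#\d+(?:\s*,\s*#\d+)*)\]")


def resolve_citations(narrative: str, timeline: list[dict]) -> tuple[str, list[int]]:
    """Replace [#idx] groups with PMIDs and report indices missing from the timeline."""
    by_idx = {d["idx"]: d["pmids"] for d in timeline}
    unknown: list[int] = []

    def replace(match: re.Match) -> str:
        pmids: list[str] = []
        for token in match.group(1).split(","):
            idx = int(token.strip().lstrip("#"))
            if idx not in by_idx:
                unknown.append(idx)  # cited index does not exist
                continue
            for pmid in by_idx[idx]:
                if pmid not in pmids:  # de-duplicate, keep first-seen order
                    pmids.append(pmid)
        return "[" + ", ".join(f"PMID:{p}" for p in pmids) + "]" if pmids else ""

    return _CITE.sub(replace, narrative), unknown
```

A non-empty `unknown` list is a natural place to reject or regenerate the narrative, since the prompt forbids citing indices that are not in the timeline.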
R1–R10

Concordance detector

no LLM, no API

Ten output-gated rules (R1–R10) run on the synthesized narrative. They are regex- and SQL-based detectors that fire only when a concordance problem actually reaches the narrative, and they group into three tiers:
  • Identity (R1–R4): a wrong-gene or wrong-product opening. R1 fires only when corpus contamination surfaces in the narrative itself.
  • Grounding (R5–R8): under-extracted or misused evidence. R7, the rule that fires most often, triggers when a PMID cited in the narrative is absent from the shown corpus, which indicates a mis-attributed or truncated reference rather than a wrong-gene error. R5 flags a recall miss: no narrative despite an experimentally backed UniProt function with on-target corpus evidence.
  • Behavior (R9–R10): a generation fault such as a synthesis-stage refusal.
The detector flags 206 narratives (1.07% of the genome). Re-running it reproduces the same set from affinage.db alone.
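As an illustration of the grounding tier, an R7-style check can be reduced to a regex plus a set lookup. This is a sketch under assumptions: it supposes the rendered narrative writes citations as `PMID:<digits>` and that the shown corpus is available as a set of PMID strings. The real detector reads from affinage.db, whose schema is not shown here.

```python
import re

_PMID = re.compile(r"PMID:\s*(\d+)")  # assumed citation format


def r7_missing_pmids(narrative: str, corpus_pmids: set[str]) -> list[str]:
    """Return cited PMIDs that are absent from the shown corpus (hypothetical R7)."""
    cited = dict.fromkeys(_PMID.findall(narrative))  # de-duplicate, keep order
    return [pmid for pmid in cited if pmid not in corpus_pmids]


def r7_fires(narrative: str, corpus_pmids: set[str]) -> bool:
    """Output-gated: fires only if a missing PMID actually appears in the narrative."""
    return bool(r7_missing_pmids(narrative, corpus_pmids))
```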

Eval

Prometheus judge

offline, GPU

Two offline reliability checks scored by a Prometheus rubric model. Faithfulness grades each cited narrative claim 1–5 against the abstract the model actually read (1 = contradiction, 2 = invented specific, 3–5 = supported). Pairwise pits the Affinage narrative head-to-head against the UniProt function comment, blind to source and run in both orderings so a verdict only counts when it survives the position swap.
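The position-swap rule in the pairwise check can be sketched as below. The `judge` callable is a stand-in for the Prometheus model and is assumed to return "A" or "B" for whichever response it prefers; the function name and return labels are hypothetical.

```python
from __future__ import annotations

from typing import Callable


def swap_consistent_verdict(
    judge: Callable[[str, str], str],
    affinage_text: str,
    uniprot_text: str,
) -> str | None:
    """Run the judge in both orderings; keep the verdict only if it survives the swap."""
    first = judge(affinage_text, uniprot_text)   # Affinage shown as response A
    second = judge(uniprot_text, affinage_text)  # Affinage shown as response B
    if first == "A" and second == "B":
        return "affinage"
    if first == "B" and second == "A":
        return "uniprot"
    return None  # the judge followed position, not content: discard the pair
```

A judge with pure position bias returns the same letter in both runs, so its verdicts are discarded instead of being counted for either side.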

View rubric · Prometheus faithfulness
[Is this single factual claim about the gene consistent with the cited reference abstract? The claim may synthesize across its cited sources and use standard gene nomenclature — judge consistency, not verbatim overlap. Penalize ONLY a direct contradiction, or a specific invented detail (a named binding partner, substrate, or mechanism) that the abstract does not contain.]
Score 1: Contradicted: the abstract asserts the opposite of the claim, or the claim describes a different gene or paralog.
Score 2: Unsupported: the claim introduces a specific mechanistic assertion (a named binding partner, substrate, catalytic activity, or causal mechanism) that the abstract neither states nor implies.
Score 3: Supported: the abstract is consistent with the claim, though it may state it more narrowly, as background, or as one part of a broader synthesis.
Score 4: Supported: the abstract clearly states the claim.
Score 5: Supported: the abstract states the claim fully and precisely.
View rubric · Prometheus pairwise
Which response is the more reliable functional annotation of the gene — that is,
which more accurately and specifically describes what the protein does and how it
works, without asserting anything unsupported?

Judge on substance, NOT style or length. A shorter answer is not worse; a longer
answer is not better.

Judge MECHANISM ONLY — molecular and cellular function. Do NOT reward or penalize
clinical/disease-association content, prevalence, prognosis, or therapeutic
relevance; that is out of scope for this comparison. If one response spends text
on disease while the other describes mechanism, that does not make it better.

Reward:
  - Factual accuracy: claims that are correct for this specific gene.
  - Mechanistic specificity: named substrates, binding partners, complexes,
    catalytic activities, pathways, and defined cellular roles — over vague
    generalities.
  - Coverage of the gene's PRINCIPAL established function.

Penalize heavily:
  - Any claim that is wrong, or that overreaches beyond what is established (a
    confident-but-unsupported statement is worse than an honest gap).
  - Conflating this gene with a paralog or a different gene of the same name.
  - Vague filler that conveys no mechanism.

TIEBREAKER (apply BEFORE any preference for detail): if both responses are
factually accurate, prefer the one that asserts LESS beyond what is established. A
confident claim that overreaches is a DEFECT, not richness. Do NOT prefer a
response merely because it is longer, names more entities, or reads as more
detailed; extra detail only counts if it is accurate AND mechanistic.

Prefer the response a careful molecular biologist would judge the more accurate
and informative description of THIS gene.

Authorship and citation

Built in the Cheeseman lab, Whitehead Institute. Source at github.com/cheeseman-lab/affinage.

Limitations