Affinage

About Affinage

Mechanism-grounded annotations for every human protein-coding gene, written by language models that read the primary literature and nothing else.

Last run: 05/2026
Models: Claude Sonnet 4.6 (reading pass) · Claude Opus 4.6 (synthesis pass)
Dataset: HGNC 2026-04 protein-coding snapshot · 19,291 genes

Pipeline

Round 1

Genome-wide annotation · every gene

Stage 0 · Corpus assembly

The deterministic stage that builds the per-gene paper corpus the LLM passes will read. No model in the loop — just HTTP and ranking.

  • Title-anchored PubMed query. Searches [Title], not [Title/Abstract], so biomarker / expression-cohort papers that merely mention the symbol don't crowd out mechanism papers. Falls back to [Title/Abstract] only when [Title] returns < 10 hits.
  • Alias-aware. The query is the union of the current HGNC symbol and up to 5 of its previous symbols / aliases (resolved from the HGNC REST API), so older papers indexed under earlier names (e.g. "p21" or "VIRMA") still surface.
  • Europe PMC for preprints. bioRxiv's bulk /details endpoint isn't a search API — paginating it timed out on every gene. Europe PMC's SRC:PPR filter (DOI prefix 10.1101/ for bioRxiv) is a real query API and runs in < 1 s/gene.
  • NIH iCite, not NCBI elink, for citation counts. elink's pubmed_pubmed_citedin has a silent failure mode: 200 OK with an empty <ERROR> body ("address table is empty") that quietly degraded every paper to 0 citations. iCite is reliable, accepts batches of 200 PMIDs per request, and requires no authentication.
  • Throttle that actually holds. A shared threading.Lock in affinage/_ncbi.py serializes the rate limit across the PubMed + prefetch workers; with NCBI_API_KEY set, the limit is 10 req/s rather than 3.
  • Ranking. Peer-review-first (preprints sort below journal articles even when more cited), then iCite count descending. Foundational papers surface; recent preprints don't crowd out classic mechanism work.
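The ranking rule in the last bullet can be sketched as a two-part sort key. This is a minimal illustration, not the repository's code: the `Paper` type and the `citations` field name are assumptions made for the example (only `is_preprint` appears in the code shown on this page).

```python
from typing import TypedDict


class Paper(TypedDict):
    id: str
    is_preprint: bool
    citations: int  # hypothetical field holding the iCite count


def rank_papers(papers: list[Paper]) -> list[Paper]:
    """Peer-reviewed papers first, then citation count descending."""
    # False sorts before True, so journal articles precede preprints;
    # negating the count gives descending order inside each group.
    return sorted(papers, key=lambda p: (p["is_preprint"], -p["citations"]))


corpus: list[Paper] = [
    {"id": "bio_x", "is_preprint": True, "citations": 500},
    {"id": "111", "is_preprint": False, "citations": 12},
    {"id": "222", "is_preprint": False, "citations": 340},
]
ranked_ids = [p["id"] for p in rank_papers(corpus)]
```

With this key, a heavily cited preprint still sorts below every journal article, which matches the "peer-review-first" behaviour described above.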
View code · search_pubmed() (pubmed.py)
def search_pubmed(gene: str, aliases: list[str], max_results: int = 200) -> list[str]:
    """Search PubMed for gene/aliases in title. Returns list of PMIDs.

    Uses [Title] field for precision. Does NOT filter on organism — model
    organism ortholog studies (yeast Cdc20, Xenopus Plx1, Drosophila polo)
    are load-bearing mechanism papers. Distinguishing orthologs from
    cross-kingdom symbol collisions is Stage 1's responsibility.
    Falls back to [Title/Abstract] if fewer than 10 results.
    """
    terms = [gene] + aliases[:5]
    title_parts = [f"{t}[Title]" for t in terms]
    query = f"({' OR '.join(title_parts)}) AND hasabstract AND {_BIO_FILTER}"

    logger.info(f"PubMed search: {query}")

    _ncbi_sleep()
    resp = requests.get(
        ESEARCH_URL,
        params=_ncbi_params(
            db="pubmed", term=query, retmax=max_results, retmode="json", sort="relevance"
        ),
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["esearchresult"]
    pmids = result.get("idlist", [])
    total = int(result.get("count", 0))

    logger.info(f"  [Title] search: {total} total, retrieved {len(pmids)}")

    # Fallback: broaden to [Title/Abstract] if too few results
    if len(pmids) < 10:
        _ncbi_sleep()
        ta_parts = [f"{t}[Title/Abstract]" for t in terms]
        query_broad = f"({' OR '.join(ta_parts)}) AND hasabstract AND {_BIO_FILTER}"
        resp2 = requests.get(
            ESEARCH_URL,
            params=_ncbi_params(
                db="pubmed",
                term=query_broad,
                retmax=min(100, max_results),
                retmode="json",
                sort="relevance",
            ),
            timeout=30,
        )
        resp2.raise_for_status()
        broad_pmids = resp2.json()["esearchresult"].get("idlist", [])
        # Merge, preserving order (Title hits first)
        seen = set(pmids)
        for p in broad_pmids:
            if p not in seen:
                pmids.append(p)
                seen.add(p)
        logger.info(f"  Broadened to [Title/Abstract]: {len(pmids)} total PMIDs")

    return pmids
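`_ncbi_sleep()` is called above but not shown. The following is a minimal sketch of a lock-based throttle of the kind the Stage 0 notes describe, assuming a fixed minimum interval between requests. The name `ncbi_sleep_sketch`, the module-level state, and the interval arithmetic are all hypothetical and may differ from what `affinage/_ncbi.py` actually does.

```python
import os
import threading
import time

# Assumed limits, taken from the text: 10 req/s with NCBI_API_KEY set, else 3 req/s.
_MIN_INTERVAL = 1.0 / (10 if os.environ.get("NCBI_API_KEY") else 3)
_lock = threading.Lock()
_next_allowed = 0.0  # monotonic timestamp before which no request may start


def ncbi_sleep_sketch(min_interval: float = _MIN_INTERVAL) -> None:
    """Block until the calling thread may issue the next NCBI request."""
    global _next_allowed
    # Sleeping while holding the lock is deliberate: it serializes all
    # workers so that consecutive requests are at least min_interval apart.
    with _lock:
        now = time.monotonic()
        wait = _next_allowed - now
        if wait > 0:
            time.sleep(wait)
            now = time.monotonic()
        _next_allowed = now + min_interval
```

Because the lock and the timestamp are shared, the limit applies to the process as a whole rather than to each worker thread separately.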
View code · search_biorxiv() (pubmed.py)
def search_biorxiv(gene: str, aliases: list[str], months_back: int = 24) -> list[dict]:
    """Search bioRxiv/medRxiv preprints via Europe PMC.

    Europe PMC indexes bioRxiv and medRxiv and offers a proper search endpoint
    (SRC:PPR). This is ~100x faster than paginating bioRxiv's bulk /details API
    and doesn't time out on common gene symbols.

    Filters to bioRxiv (DOI prefix 10.1101) and preprints posted within months_back.
    Returns paper dicts in the same format as fetch_abstracts().
    """
    from datetime import datetime, timedelta

    cutoff = (datetime.now() - timedelta(days=months_back * 30)).strftime("%Y-%m-%d")

    terms = [gene] + [a for a in aliases[:3] if a]
    query = (
        "("
        + " OR ".join(f'"{t}"' for t in terms)
        + f") AND SRC:PPR AND FIRST_PDATE:[{cutoff} TO 3000-01-01]"
    )

    papers: list[dict] = []
    try:
        resp = requests.get(
            EUROPEPMC_SEARCH,
            params={
                "query": query,
                "resultType": "core",
                "format": "json",
                "pageSize": "50",
            },
            timeout=15,
        )
        resp.raise_for_status()
        data = resp.json()
    except Exception as e:
        logger.warning(f"Europe PMC preprint search failed for {gene}: {e}")
        return []

    gene_terms = {t.upper() for t in terms}
    for item in data.get("resultList", {}).get("result", []):
        doi = item.get("doi", "") or ""
        if not doi.startswith("10.1101/"):
            continue  # bioRxiv/medRxiv only

        title = item.get("title", "") or ""
        # Require gene/alias in title or abstract for specificity
        haystack = (title + " " + (item.get("abstractText", "") or "")).upper()
        if not any(t in haystack for t in gene_terms):
            continue

        papers.append(
            {
                "pmid": item.get("pmid") or None,
                "id": f"bio_{doi.replace('/', '_')}",
                "title": title,
                "abstract": item.get("abstractText", "") or "",
                "authors": item.get("authorString", "") or "",
                "source": "bioRxiv"
                if "biorxiv" in (item.get("source", "") + item.get("journalTitle", "")).lower()
                or doi.startswith("10.1101/")
                else "medRxiv",
                "date": item.get("firstPublicationDate", "") or "",
                "url": f"https://doi.org/{doi}",
                "doi": doi,
                "is_preprint": True,
            }
        )

    logger.info(f"  bioRxiv/medRxiv (via Europe PMC): {len(papers)} preprints for {gene}")
    return papers
View code · fetch_citation_counts() (citations.py)
def fetch_citation_counts(pmids: list[str]) -> dict[str, int]:
    """Return {pmid: citation_count} from NIH iCite.

    Missing PMIDs default to 0. Retries on transient network errors.
    Preprints without PMIDs should not be passed — they'll be ignored.
    """
    if not pmids:
        return {}

    counts: dict[str, int] = {pmid: 0 for pmid in pmids}

    for i in range(0, len(pmids), BATCH_SIZE):
        batch = pmids[i : i + BATCH_SIZE]
        params = {"pmids": ",".join(batch)}

        last_err: Exception | None = None
        for attempt in range(MAX_RETRIES):
            try:
                resp = requests.get(ICITE_URL, params=params, timeout=30)
                resp.raise_for_status()
                data = resp.json().get("data", [])
                for entry in data:
                    pmid = str(entry.get("pmid", ""))
                    if pmid in counts:
                        counts[pmid] = int(entry.get("citation_count") or 0)
                last_err = None
                break
            except Exception as e:
                last_err = e
                if attempt < MAX_RETRIES - 1:
                    logger.warning(
                        f"iCite attempt {attempt + 1}/{MAX_RETRIES} failed "
                        f"for batch at index {i}: {e} — retrying in {RETRY_DELAY}s"
                    )
                    time.sleep(RETRY_DELAY)

        if last_err is not None:
            logger.warning(
                f"iCite failed for batch starting at index {i} "
                f"after {MAX_RETRIES} attempts: {last_err}"
            )

    n_with_citations = sum(1 for c in counts.values() if c > 0)
    logger.info(f"Citation counts: {len(counts)} PMIDs queried, {n_with_citations} have citations")
    return counts
Reading pass

Extracts dated experimental findings from the corpus.

View system prompt
You are extracting a timeline of mechanistic discoveries from a corpus of papers about a gene.

You will receive a corpus of paper abstracts retrieved for this gene. This is your ONLY source.
No external database summaries are provided — do not reference UniProt, OMIM, or other databases;
ground every claim in an abstract from the corpus.

YOUR TASK: Read the abstracts. Extract only findings where a direct experiment established
something about HOW this protein works. Ignore everything else.

INCLUDE as a Discovery:
  - Identification of a substrate, binding partner, or complex (Co-IP, pulldown, reconstitution)
  - Enzymatic activity or catalytic mechanism (in vitro assay, active-site mutagenesis)
  - Structure (crystal, cryo-EM, NMR with functional validation)
  - Pathway position via genetic epistasis (suppressor screen, double-mutant rescue)
  - Post-translational modification with identified writer/eraser/reader
  - Subcellular localization determined by direct experiment (live imaging, FRAP,
    fractionation) — especially when tied to a functional consequence
  - Defined role in a cellular process (cell division, migration, signaling, actin regulation,
    etc.) established by loss-of-function with a specific phenotypic readout

EXCLUDE — do not create Discovery entries for:
  - Expression correlation, survival analysis, or prognostic biomarker studies
  - IHC, transcriptomics, or GWAS/eQTL associations
  - Pure phenotype descriptions with no molecular mechanism or pathway placement

Many papers in the corpus will be irrelevant — skip them silently.

IMPORTANT RULES:
- The query is ALWAYS a human (or mammalian-model) gene. Gene symbols collide
  across kingdoms, so the corpus may contain unrelated genes that happen to share
  the symbol. Distinguish two cases:
    • ORTHOLOG in a model organism (budding/fission yeast, Drosophila, C. elegans,
      Xenopus, zebrafish, mouse, rat, chicken) → INCLUDE if the paper's described
      protein function, domain architecture, and cellular context are consistent
      with the mammalian gene. These are often the foundational mechanism papers
      (e.g. yeast Cdc20/APC/C, Drosophila polo kinase, Xenopus Plx1).
    • SYMBOL COLLISION — a paper describes a gene in another organism (commonly
      plants: Arabidopsis, rice, maize; or unrelated microbial genes) whose
      function, domains, or cellular context is fundamentally incompatible with
      the mammalian gene the corpus is largely about → SKIP it silently. Do not
      extract discoveries from it.
  Use the preponderance of the corpus as ground truth for what the mammalian
  gene does; outliers that describe a completely different protein are almost
  always collisions, not orthologs.
- Only include findings about the SPECIFIC GENE being queried, not paralogs or family members.
- Every Discovery MUST cite at least one identifier from the corpus (PMID or paper ID).
  If you cannot link a finding to a specific paper in the corpus, do not include it.

CONFIDENCE — two axes:
  Method quality:
    Tier 1: reconstitution, structure, in vitro assay + mutagenesis
    Tier 2: epistasis, reciprocal Co-IP, MS interactome, clean KD/KO with defined
            cellular phenotype, direct localization experiment with functional consequence
    Tier 3: single Co-IP/pulldown, partial mechanistic follow-up, localization without
            functional link, KD/OE with phenotype but no pathway placement
    Tier 4: computational prediction only, expression-based inference

  Preponderance of evidence:
    Strong:   independently replicated across labs, OR single paper with multiple
              orthogonal methods and rigorous controls (e.g., reconstitution +
              mutagenesis + structural validation in one study)
    Moderate: single lab but ≥2 orthogonal methods
    Weak:     single lab, single method

  IMPORTANT: A single rigorous paper (e.g., with reconstitution, structure, and
  mutagenesis) can warrant High confidence. Three papers copying a weak pulldown
  do NOT warrant High confidence. Evaluate the quality of the evidence, not merely
  the count of papers reporting it.

  Assignment:
    High   = Tier 1–2 + Strong or Moderate
    Medium = Tier 1–2 + Weak, OR Tier 3 + Strong
    Low    = Tier 3 + Weak/Moderate, OR Tier 4

  confidence_rationale: one phrase. E.g. "Tier 1 — reconstituted in vitro, replicated"

CITATION RULE:
- For PMC papers: use the PMID (e.g. "12345678").
- For preprints: use the paper ID exactly as shown (e.g. "bio_4f78753a6feb").
- Never fabricate identifiers. If a paper has no PMID, use its ID: prefix from the corpus.

PAPER PRIORITIZATION:
- Papers are listed peer-reviewed first, then by citation count.
- Highly-cited papers (>100 citations) are likely foundational — pay special attention
  to these as they often describe the original discovery of a gene's function.
- Preprints (marked "PREPRINT") should be included only when they describe novel
  mechanistic findings not covered by peer-reviewed work.

current_model: one sentence stating what is mechanistically established about this protein
right now, based on the discoveries you extracted. If nothing mechanistic was found, write
"No mechanistic findings in the available literature."

Return a single JSON object:
{
  "discoveries": [
    {
      "year": int | null,
      "finding": str,
      "method": str,
      "journal": str,
      "confidence": "High" | "Medium" | "Low",
      "confidence_rationale": str,
      "pmids": [str],
      "is_preprint": bool
    }
  ],
  "current_model": str
}
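The Assignment table in the prompt above is a deterministic mapping from the two axes to a confidence label. In the pipeline the model applies it itself; the sketch below only restates the table as code, and the function name and argument encoding are assumptions made for illustration.

```python
def assign_confidence(tier: int, preponderance: str) -> str:
    """Map (method tier, preponderance of evidence) to High / Medium / Low."""
    if tier not in (1, 2, 3, 4):
        raise ValueError(f"unknown tier: {tier}")
    if preponderance not in ("Strong", "Moderate", "Weak"):
        raise ValueError(f"unknown preponderance: {preponderance}")
    if tier in (1, 2):
        # Tier 1-2 + Strong or Moderate -> High; Tier 1-2 + Weak -> Medium
        return "Medium" if preponderance == "Weak" else "High"
    if tier == 3:
        # Tier 3 + Strong -> Medium; Tier 3 + Weak/Moderate -> Low
        return "Medium" if preponderance == "Strong" else "Low"
    return "Low"  # Tier 4 is Low regardless of preponderance
```

For example, a single Co-IP from one lab (Tier 3, Weak) maps to Low, while a reconstitution replicated across labs (Tier 1, Strong) maps to High.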
Synthesis pass

Produces the mechanistic narrative, year-by-year mechanistic history, forward-looking open questions, and structured mechanism profile.

View system prompt
You are a molecular biology reviewer synthesizing a mechanistic narrative for a gene.

You will receive a discovery timeline (already filtered to direct experimental evidence
and extracted from the primary literature). This is your ONLY input. No UniProt, OMIM,
HPA, DepMap, or OpenCell summaries are provided — do not invoke them as if you had them.
Ground every claim in the timeline.

Produce two outputs.

--- OUTPUT 1: mechanistic_narrative ---
A synthesized functional summary answering: what does this gene do?
Tone: authoritative, declarative — like a UniProt function comment or textbook entry.

Rules:
- SYNTHESIZE across the timeline. Do not list findings one by one. Draw on multiple
  discoveries to build a unified functional picture of what the gene does, how it
  works, and in what cellular context.
- For genes with ≥10 discoveries in the timeline: open with ONE sentence establishing
  the gene's broad biological role at the process level. This sentence synthesizes
  across the evidence and does not require a PMID citation. Then follow with 2–3
  mechanistic sentences. Hard cap: 4 sentences total.
- For genes with <10 discoveries: 2–3 mechanistic sentences. Hard cap: 3 sentences.
- Do not expand beyond the sentence cap regardless of literature size.
- Cite PMIDs grouped at the end of synthesized claims: "[PMID:111, PMID:222]".
  Every factual mechanistic claim needs at least one PMID. One well-chosen citation
  is better than listing every supporting PMID.
- The filter is on LANGUAGE, not on claim inclusion. State claims as facts; the
  citation communicates the source. Never add epistemic hedges.
- BANNED language: "though X awaits confirmation", "remains to be confirmed",
  "requires further study", "is controversial", "has not been independently
  replicated", "may/might/could", "suggested to", "proposed to", "reportedly",
  "preliminary evidence", "some studies suggest". Delete the claim instead.
- Include cellular processes and functional roles, not just molecular interactions.
- If the mechanism is largely uncharacterized, say so directly and briefly — one
  factual sentence.
- Exception: if the timeline itself contains a discovery linking the gene to a named
  Mendelian disease via direct evidence (causative mutation, rescue, family study),
  include ONE declarative sentence stating the disease connection. This counts toward
  the sentence cap. Do not add prognosis, prevalence, or treatment. Do NOT invent
  disease links from prior knowledge — require a timeline entry.
- Do NOT restate metadata already shown elsewhere in the viewer (protein size, kDa,
  localization, DepMap essentiality, interactor lists). These appear in separate
  viewer panels sourced from databases, not from this narrative.

--- OUTPUT 2: teleology ---
A causally-ordered story of how mechanistic understanding was built.

Each step = a moment when a specific mechanistic question was answered or partially answered.
Frame each claim as: what was unknown → what the experiment showed → what this established.

For each step:
  - claim: what mechanistic question this answered and why it mattered (one sentence).
    Do NOT just restate the finding — frame the advance.
  - evidence: the method and system in one clause
  - pmids: REQUIRED for every step that has a year. Carry directly from the timeline discoveries.
    Only the final open-question step (year=null) may have an empty pmids list.
  - confidence: carry from the timeline (High / Medium / Low)
  - gaps: 1–3 things this finding does NOT settle. Be concrete.
    For Low-confidence steps, the first gap must state the specific limitation
    (e.g. "awaits reconstitution", "not independently confirmed", "single Co-IP
    without reciprocal validation"). Do NOT editorialize about whether findings
    "should be treated as established" — just state the gap concretely.

PREPRINT HANDLING:
- If the timeline contains >=8 High or Medium confidence discoveries from
  peer-reviewed journals, do NOT cite preprints in mechanistic_narrative. Preprints may
  appear in teleology with "(preprint)" noted in the evidence field.
- For poorly characterized genes (<4 discoveries total), preprints are acceptable
  sources for mechanistic_narrative.

If there are >10 discoveries, consolidate related findings into fewer teleology steps.
Group findings that address the same mechanistic question.

Order chronologically. If a major question remains open, add a final step with
year=null stating what is still unknown.

teleology.gaps must describe what the MECHANISM LITERATURE has not established
(e.g. "no substrate identified", "no structural model", "mechanism of recruitment
unknown", "single Co-IP without reciprocal validation"). Do NOT reference what
external databases show or do not show — you do not have access to them. Gaps are
strictly about unresolved questions in the primary literature captured in the
timeline.

Return a single JSON object:
{
  "mechanistic_narrative": str,
  "teleology": [
    {
      "year": int | null,
      "claim": str,
      "evidence": str,
      "pmids": [str],
      "confidence": "High" | "Medium" | "Low",
      "gaps": [str]
    }
  ],
  "mechanism_profile": {
    "molecular_activity": [{"term_id": str, "supporting_discovery_ids": [int]}],
    "localization":       [{"term_id": str, "supporting_discovery_ids": [int]}],
    "pathway":            [{"term_id": str, "supporting_discovery_ids": [int]}],
    "complexes":       [str],
    "partners":        [str],
    "other_free_text": [str]
  }
}
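Two of the rules in the prompt above are mechanical enough to check after the fact: dated teleology steps must carry PMIDs, and the narrative must not contain banned hedges. The sketch below is a hypothetical post-hoc check, not part of the documented pipeline. The function name is invented, the regex covers only a subset of the banned phrases, and it is a heuristic (it would also match "May" used as a month).

```python
import re

# Subset of the banned phrases listed in the prompt; heuristic only.
BANNED = re.compile(
    r"\b(?:may|might|could|suggested to|proposed to|reportedly|"
    r"preliminary evidence|some studies suggest|remains to be confirmed|"
    r"requires further study|is controversial)\b",
    re.IGNORECASE,
)


def check_synthesis(out: dict) -> list[str]:
    """Return a list of rule violations found in a synthesis-pass JSON object."""
    problems: list[str] = []
    match = BANNED.search(out.get("mechanistic_narrative", ""))
    if match:
        problems.append(f"banned hedge in narrative: {match.group(0)!r}")
    for i, step in enumerate(out.get("teleology", [])):
        # Only the final open-question step (year=null) may omit PMIDs.
        if step.get("year") is not None and not step.get("pmids"):
            problems.append(f"teleology[{i}] has a year but no pmids")
    return problems


# Placeholder data, for illustration only.
sample = {
    "mechanistic_narrative": "GENE1 may phosphorylate GENE2 [PMID:111].",
    "teleology": [
        {"year": 2001, "claim": "c", "evidence": "e", "pmids": [],
         "confidence": "High", "gaps": ["g"]},
        {"year": None, "claim": "c", "evidence": "e", "pmids": [],
         "confidence": "Low", "gaps": ["g"]},
    ],
}
problems = check_synthesis(sample)
```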
QC

Structural QC layer (R1, R2, R3) · no LLM

Three deterministic rules computed from the database alone:

  • R1 — zero-discovery (non-empty corpus, zero findings extracted).
  • R2 — symmetric alias (paper-subject vs. requested symbol).
  • R3 — corpus-disjointness (UniProt name anchor mismatch).

Genes that fire any of these enter Round 2.
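As an illustration of what "computed from the database alone" can look like, here is a sketch of R1 as a single SQL query run through `sqlite3`. The table and column names (`papers`, `discoveries`, `gene`) are hypothetical; the actual schema of affinage.db is not described on this page.

```python
import sqlite3

# In-memory stand-in for the real database; schema and rows are placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE papers (gene TEXT, paper_id TEXT);
    CREATE TABLE discoveries (gene TEXT, finding TEXT);
    INSERT INTO papers VALUES ('GENE_A', '111'), ('GENE_A', '222'), ('GENE_B', '333');
    INSERT INTO discoveries VALUES ('GENE_A', 'placeholder finding');
    """
)

# R1: genes that have at least one paper in the corpus but no extracted discovery.
R1_SQL = """
SELECT p.gene
FROM papers AS p
LEFT JOIN discoveries AS d ON d.gene = p.gene
GROUP BY p.gene
HAVING COUNT(d.gene) = 0
ORDER BY p.gene
"""
r1_flagged = [row[0] for row in conn.execute(R1_SQL)]
```

Here `GENE_B` has a non-empty corpus and zero discoveries, so it is the only gene flagged.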

Round 2

Flagged-gene re-annotation · 6,049 genes

Augmented corpus

Title-search ∪ NCBI gene2pubmed, deduplicated.
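The deduplicated union can be sketched as an order-preserving merge. Keeping title-search hits ahead of gene2pubmed entries is an assumption here (it mirrors the merge order in `search_pubmed()`), and `merge_corpora` is a hypothetical name.

```python
def merge_corpora(title_hits: list[str], gene2pubmed: list[str]) -> list[str]:
    """Union of two PMID lists, first occurrence wins, input order preserved."""
    # dict keys are unique and keep insertion order, so this deduplicates
    # while leaving title-search hits ahead of gene2pubmed-only PMIDs.
    return list(dict.fromkeys([*title_hits, *gene2pubmed]))


merged = merge_corpora(["111", "222"], ["222", "333"])
```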

Reading pass (Round 2)

Same Round 1 reading-pass prompt, with a KEEP/EXCLUDE classifier prepended.

View Round 2 augmentation
ROUND-2 GUARD — read this before anything else.

The corpus you are about to read may contain papers that are NOT about
{GENE_PLACEHOLDER}. Two specific failure modes have been observed in earlier
runs:

  (A) ALIAS COLLISION — a paper describes a different gene whose name
      overlaps with {GENE_PLACEHOLDER}'s symbol or HGNC alias list (e.g. an
      older "p21" alias pulling in CDKN1A papers when the target is CDPF1).
      The paper is not about {GENE_PLACEHOLDER} and must be excluded.

  (B) ALT-LOCUS PRODUCT — a paper describes a non-protein product from the
      same locus (circRNA, lncRNA, antisense transcript, e.g. circ-FAM169A,
      MYH7B-as1). These are not the canonical protein and must be excluded
      from the protein narrative.

CLASSIFY EACH PAPER FIRST, EXTRACT SECOND.

For every paper in the corpus, in your head, decide:
  - KEEP: the paper describes the canonical protein-coding gene
    {GENE_PLACEHOLDER}.
  - EXCLUDE: case (A) alias collision, OR case (B) alt-locus product, OR
    a clear symbol collision with a non-orthologous organism (the existing
    "SYMBOL COLLISION" rule below also applies).

Then extract Discoveries ONLY from KEPT papers. If after exclusion zero
papers describe the target gene, return an empty `discoveries` array and
set `current_model` to "No mechanistic findings about {GENE_PLACEHOLDER}
in the available literature (corpus appears contaminated with off-target
papers)."

DO NOT include findings whose underlying paper you classified as EXCLUDE.

---
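Prepending the guard amounts to filling in the gene symbol and concatenating. The sketch below is an assumption about how that could be done; `build_round2_prompt` is a hypothetical name. It uses `str.replace` rather than `str.format` because the Round 1 prompt contains literal braces in its JSON schema, which `format` would try to interpret.

```python
def build_round2_prompt(guard_template: str, base_prompt: str, gene: str) -> str:
    """Fill the gene symbol into the guard and prepend it to the base prompt."""
    guard = guard_template.replace("{GENE_PLACEHOLDER}", gene)
    return f"{guard}\n\n{base_prompt}"


# Shortened stand-ins for the real guard and Round 1 prompt.
guard = "The corpus may contain papers that are NOT about {GENE_PLACEHOLDER}."
base = 'Return a single JSON object: {"discoveries": []}'
prompt = build_round2_prompt(guard, base, "CDPF1")
```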
Synthesis pass (Round 2)

Same Round 1 synthesis-pass prompt, plus a UniProt full-name identity anchor and refusal instructions prepended.

View Round 2 augmentation
ROUND-2 IDENTITY ANCHOR.

You will be told the gene's canonical UniProt name (one line, name only —
no function paragraph). Use it as a passive sanity check on the discovery
timeline below. If the discoveries collectively describe a protein that
is fundamentally inconsistent with this name (e.g. timeline describes a
deubiquitinase but the canonical name is "Cysteine-rich DPF motif
domain-containing protein"), DO NOT FABRICATE A NARRATIVE.

In that case, output:
  mechanistic_narrative = "Insufficient on-target evidence to synthesize a narrative
                          — discovery timeline does not match the canonical
                          {GENE_PLACEHOLDER} protein."
  teleology  = []
  mechanism_profile = empty arrays for every category.

Otherwise, follow the standard synthesis rules below.

---
R4–R9

Concordance detector · no LLM, no API

Six deterministic rules on the synthesized narrative — regex- and SQL-based detectors that flag paralog-drift openings, cross-species framing, alt-product narratives, refusal-with-rich-UniProt, and narrative-corpus disjointness. Flags 199 narratives (1.0% of the genome) across three tiers. Re-running the detector reproduces the same set from affinage.db alone.

Authorship and citation

Built in the Cheeseman lab, Whitehead Institute. Source at github.com/cheeseman-lab/affinage.

Limitations