Genome-wide annotation
The deterministic stage that builds the per-gene paper corpus the later LLM passes will read. No model in the loop — just HTTP and ranking.
- Title-anchored PubMed query. Searches [Title], not [Title/Abstract], so biomarker / expression-cohort papers that merely mention the symbol don't crowd out mechanism papers. Falls back to [Title/Abstract] only when [Title] returns < 10 hits.
- Alias-aware. The query is the union of the current HGNC symbol and up to 5 of its previous symbols / aliases (resolved from the HGNC REST API), so older papers indexed under earlier names — e.g. a former "p21" or "VIRMA" previous symbol — still surface.
- Europe PMC for preprints. bioRxiv's bulk /details endpoint isn't a search API — paginating it timed out on every gene. Europe PMC's SRC:PPR filter (DOI prefix 10.1101/ for bioRxiv) is a real query API and runs in < 1 s/gene.
- NIH iCite, not NCBI elink, for citation counts. elink's pubmed_pubmed_citedin has a silent failure mode: a 200 OK whose body contains only an <ERROR> element ("address table is empty") that quietly degraded every paper to 0 citations. iCite is reliable, batches 200 PMIDs/request, and needs no auth.
- Throttle that actually holds. A shared threading.Lock in affinage/_ncbi.py serializes the rate limit across the PubMed + prefetch workers; with NCBI_API_KEY set, the limit is 10 req/s rather than 3.
- Ranking. Peer-review-first (preprints sort below journal articles even when more cited), then iCite count descending. Foundational papers surface; recent preprints don't crowd out classic mechanism work.
View code · search_pubmed() (pubmed.py)
def search_pubmed(gene: str, aliases: list[str], max_results: int = 200) -> list[str]:
"""Search PubMed for gene/aliases in title. Returns list of PMIDs.
Uses [Title] field for precision. Does NOT filter on organism — model
organism ortholog studies (yeast Cdc20, Xenopus Plx1, Drosophila polo)
are load-bearing mechanism papers. Distinguishing orthologs from
cross-kingdom symbol collisions is Stage 1's responsibility.
Falls back to [Title/Abstract] if fewer than 10 results.
"""
terms = [gene] + aliases[:5]
title_parts = [f"{t}[Title]" for t in terms]
query = f"({' OR '.join(title_parts)}) AND hasabstract AND {_BIO_FILTER}"
logger.info(f"PubMed search: {query}")
_ncbi_sleep()
resp = requests.get(
ESEARCH_URL,
params=_ncbi_params(
db="pubmed", term=query, retmax=max_results, retmode="json", sort="relevance"
),
timeout=30,
)
resp.raise_for_status()
result = resp.json()["esearchresult"]
pmids = result.get("idlist", [])
total = int(result.get("count", 0))
logger.info(f" [Title] search: {total} total, retrieved {len(pmids)}")
# Fallback: broaden to [Title/Abstract] if too few results
if len(pmids) < 10:
_ncbi_sleep()
ta_parts = [f"{t}[Title/Abstract]" for t in terms]
query_broad = f"({' OR '.join(ta_parts)}) AND hasabstract AND {_BIO_FILTER}"
resp2 = requests.get(
ESEARCH_URL,
params=_ncbi_params(
db="pubmed",
term=query_broad,
retmax=min(100, max_results),
retmode="json",
sort="relevance",
),
timeout=30,
)
resp2.raise_for_status()
broad_pmids = resp2.json()["esearchresult"].get("idlist", [])
# Merge, preserving order (Title hits first)
seen = set(pmids)
for p in broad_pmids:
if p not in seen:
pmids.append(p)
seen.add(p)
logger.info(f" Broadened to [Title/Abstract]: {len(pmids)} total PMIDs")
return pmids
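The aliases that search_pubmed() consumes come from the HGNC REST API, a step the post doesn't show. A sketch of that resolution, with the response parsing split out so it can be checked offline — get_aliases and parse_hgnc_doc are hypothetical names, but the alias_symbol / prev_symbol fields and the fetch/symbol endpoint are from HGNC's documented JSON schema:

```python
import requests

HGNC_FETCH = "https://rest.genenames.org/fetch/symbol/{symbol}"


def parse_hgnc_doc(doc: dict, max_aliases: int = 5) -> list[str]:
    """Previous symbols first (older literature is indexed under them), then aliases."""
    names = list(doc.get("prev_symbol", [])) + list(doc.get("alias_symbol", []))
    seen: set[str] = set()
    out: list[str] = []
    for name in names:
        if name and name.upper() not in seen:
            out.append(name)
            seen.add(name.upper())
    return out[:max_aliases]


def get_aliases(symbol: str) -> list[str]:
    """Resolve previous symbols / aliases for a current HGNC symbol."""
    resp = requests.get(
        HGNC_FETCH.format(symbol=symbol),
        headers={"Accept": "application/json"},  # HGNC defaults to XML otherwise
        timeout=15,
    )
    resp.raise_for_status()
    docs = resp.json().get("response", {}).get("docs", [])
    return parse_hgnc_doc(docs[0]) if docs else []
```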
View code · search_biorxiv() (pubmed.py)
def search_biorxiv(gene: str, aliases: list[str], months_back: int = 24) -> list[dict]:
"""Search bioRxiv/medRxiv preprints via Europe PMC.
Europe PMC indexes bioRxiv and medRxiv and offers a proper search endpoint
(SRC:PPR). This is ~100x faster than paginating bioRxiv's bulk /details API
and doesn't time out on common gene symbols.
Filters to bioRxiv (DOI prefix 10.1101) and preprints posted within months_back.
Returns paper dicts in the same format as fetch_abstracts().
"""
from datetime import datetime, timedelta
cutoff = (datetime.now() - timedelta(days=months_back * 30)).strftime("%Y-%m-%d")
terms = [gene] + [a for a in aliases[:3] if a]
query = (
"("
+ " OR ".join(f'"{t}"' for t in terms)
+ f") AND SRC:PPR AND FIRST_PDATE:[{cutoff} TO 3000-01-01]"
)
papers: list[dict] = []
try:
resp = requests.get(
EUROPEPMC_SEARCH,
params={
"query": query,
"resultType": "core",
"format": "json",
"pageSize": "50",
},
timeout=15,
)
resp.raise_for_status()
data = resp.json()
except Exception as e:
logger.warning(f"Europe PMC preprint search failed for {gene}: {e}")
return []
gene_terms = {t.upper() for t in terms}
for item in data.get("resultList", {}).get("result", []):
doi = item.get("doi", "") or ""
if not doi.startswith("10.1101/"):
continue # bioRxiv/medRxiv only
title = item.get("title", "") or ""
# Require gene/alias in title or abstract for specificity
haystack = (title + " " + (item.get("abstractText", "") or "")).upper()
if not any(t in haystack for t in gene_terms):
continue
        papers.append(
            {
                "pmid": item.get("pmid") or None,
                "id": f"bio_{doi.replace('/', '_')}",
                "title": title,
                "abstract": item.get("abstractText", "") or "",
                "authors": item.get("authorString", "") or "",
                # medRxiv shares the 10.1101/ DOI prefix, so distinguish by journal name
                "source": "medRxiv"
                if "medrxiv" in (item.get("source", "") + item.get("journalTitle", "")).lower()
                else "bioRxiv",
                "date": item.get("firstPublicationDate", "") or "",
                "url": f"https://doi.org/{doi}",
                "doi": doi,
                "is_preprint": True,
            }
        )
logger.info(f" bioRxiv/medRxiv (via Europe PMC): {len(papers)} preprints for {gene}")
return papers
View code · fetch_citation_counts() (citations.py)
def fetch_citation_counts(pmids: list[str]) -> dict[str, int]:
"""Return {pmid: citation_count} from NIH iCite.
Missing PMIDs default to 0. Retries on transient network errors.
Preprints without PMIDs should not be passed — they'll be ignored.
"""
if not pmids:
return {}
counts: dict[str, int] = {pmid: 0 for pmid in pmids}
for i in range(0, len(pmids), BATCH_SIZE):
batch = pmids[i : i + BATCH_SIZE]
params = {"pmids": ",".join(batch)}
last_err: Exception | None = None
for attempt in range(MAX_RETRIES):
try:
resp = requests.get(ICITE_URL, params=params, timeout=30)
resp.raise_for_status()
data = resp.json().get("data", [])
for entry in data:
pmid = str(entry.get("pmid", ""))
if pmid in counts:
counts[pmid] = int(entry.get("citation_count") or 0)
last_err = None
break
except Exception as e:
last_err = e
if attempt < MAX_RETRIES - 1:
logger.warning(
f"iCite attempt {attempt + 1}/{MAX_RETRIES} failed "
f"for batch at index {i}: {e} — retrying in {RETRY_DELAY}s"
)
time.sleep(RETRY_DELAY)
if last_err is not None:
logger.warning(
f"iCite failed for batch starting at index {i} "
f"after {MAX_RETRIES} attempts: {last_err}"
)
n_with_citations = sum(1 for c in counts.values() if c > 0)
logger.info(f"Citation counts: {len(counts)} PMIDs queried, {n_with_citations} have citations")
return counts
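With PMIDs and citation counts in hand, the peer-review-first ranking described in the bullets above reduces to a single sort key. A sketch — rank_papers is a hypothetical helper, not code from the post:

```python
def rank_papers(papers: list[dict], citations: dict[str, int]) -> list[dict]:
    """Peer-reviewed papers first; within each group, most-cited first."""

    def sort_key(paper: dict) -> tuple[bool, int]:
        # Preprints (is_preprint=True) sort after journal articles even when
        # more cited; ties within a group broken by iCite count, descending.
        cited = citations.get(paper.get("pmid") or "", 0)
        return (paper.get("is_preprint", False), -cited)

    return sorted(papers, key=sort_key)
```

Because Python sorts tuples lexicographically and False < True, the boolean alone keeps every preprint below every journal article regardless of citations.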
Stage 1 (the first LLM pass) extracts dated experimental findings from the corpus.
View system prompt
You are extracting a timeline of mechanistic discoveries from a corpus of papers about a gene.
You will receive a corpus of paper abstracts retrieved for this gene. This is your ONLY source.
No external database summaries are provided — do not reference UniProt, OMIM, or other databases;
ground every claim in an abstract from the corpus.
YOUR TASK: Read the abstracts. Extract only findings where a direct experiment established
something about HOW this protein works. Ignore everything else.
INCLUDE as a Discovery:
- Identification of a substrate, binding partner, or complex (Co-IP, pulldown, reconstitution)
- Enzymatic activity or catalytic mechanism (in vitro assay, active-site mutagenesis)
- Structure (crystal, cryo-EM, NMR with functional validation)
- Pathway position via genetic epistasis (suppressor screen, double-mutant rescue)
- Post-translational modification with identified writer/eraser/reader
- Subcellular localization determined by direct experiment (live imaging, FRAP,
fractionation) — especially when tied to a functional consequence
- Defined role in a cellular process (cell division, migration, signaling, actin regulation,
etc.) established by loss-of-function with a specific phenotypic readout
EXCLUDE — do not create Discovery entries for:
- Expression correlation, survival analysis, or prognostic biomarker studies
- IHC, transcriptomics, or GWAS/eQTL associations
- Pure phenotype descriptions with no molecular mechanism or pathway placement
Many papers in the corpus will be irrelevant — skip them silently.
IMPORTANT RULES:
- The query is ALWAYS a human (or mammalian-model) gene. Gene symbols collide
across kingdoms, so the corpus may contain unrelated genes that happen to share
the symbol. Distinguish two cases:
• ORTHOLOG in a model organism (budding/fission yeast, Drosophila, C. elegans,
Xenopus, zebrafish, mouse, rat, chicken) → INCLUDE if the paper's described
protein function, domain architecture, and cellular context are consistent
with the mammalian gene. These are often the foundational mechanism papers
(e.g. yeast Cdc20/APC/C, Drosophila polo kinase, Xenopus Plx1).
• SYMBOL COLLISION — a paper describes a gene in another organism (commonly
plants: Arabidopsis, rice, maize; or unrelated microbial genes) whose
function, domains, or cellular context is fundamentally incompatible with
the mammalian gene the corpus is largely about → SKIP it silently. Do not
extract discoveries from it.
Use the preponderance of the corpus as ground truth for what the mammalian
gene does; outliers that describe a completely different protein are almost
always collisions, not orthologs.
- Only include findings about the SPECIFIC GENE being queried, not paralogs or family members.
- Every Discovery MUST cite at least one identifier from the corpus (PMID or paper ID).
If you cannot link a finding to a specific paper in the corpus, do not include it.
CONFIDENCE — two axes:
Method quality:
Tier 1: reconstitution, structure, in vitro assay + mutagenesis
Tier 2: epistasis, reciprocal Co-IP, MS interactome, clean KD/KO with defined
cellular phenotype, direct localization experiment with functional consequence
Tier 3: single Co-IP/pulldown, partial mechanistic follow-up, localization without
functional link, KD/OE with phenotype but no pathway placement
Tier 4: computational prediction only, expression-based inference
Preponderance of evidence:
Strong: independently replicated across labs, OR single paper with multiple
orthogonal methods and rigorous controls (e.g., reconstitution +
mutagenesis + structural validation in one study)
Moderate: single lab but ≥2 orthogonal methods
Weak: single lab, single method
IMPORTANT: A single rigorous paper (e.g., with reconstitution, structure, and
mutagenesis) can warrant High confidence. Three papers copying a weak pulldown
do NOT warrant High confidence. Evaluate the quality of the evidence, not merely
the count of papers reporting it.
Assignment:
High = Tier 1–2 + Strong or Moderate
Medium = Tier 1–2 + Weak, OR Tier 3 + Strong
Low = Tier 3 + Weak/Moderate, OR Tier 4
confidence_rationale: one phrase. E.g. "Tier 1 — reconstituted in vitro, replicated"
CITATION RULE:
- For PMC papers: use the PMID (e.g. "12345678").
- For preprints: use the paper ID exactly as shown (e.g. "bio_4f78753a6feb").
- Never fabricate identifiers. If a paper has no PMID, use its ID: prefix from the corpus.
PAPER PRIORITIZATION:
- Papers are listed peer-reviewed first, then by citation count.
- Highly-cited papers (>100 citations) are likely foundational — pay special attention
to these as they often describe the original discovery of a gene's function.
- Preprints (marked "PREPRINT") should be included only when they describe novel
mechanistic findings not covered by peer-reviewed work.
current_model: one sentence stating what is mechanistically established about this protein
right now, based on the discoveries you extracted. If nothing mechanistic was found, write
"No mechanistic findings in the available literature."
Return a single JSON object:
{
"discoveries": [
{
"year": int | null,
"finding": str,
"method": str,
"journal": str,
"confidence": "High" | "Medium" | "Low",
"confidence_rationale": str,
"pmids": [str],
"is_preprint": bool
}
],
"current_model": str
}
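Downstream code presumably checks the model's reply against this contract. A minimal offline validator — validate_extraction is a hypothetical helper; the checks mirror the rules above (allowed confidence values, at least one citation per Discovery, integer-or-null year):

```python
from typing import Any

ALLOWED_CONFIDENCE = {"High", "Medium", "Low"}


def validate_extraction(obj: dict[str, Any]) -> list[str]:
    """Return a list of contract violations (empty list means valid)."""
    errors: list[str] = []
    if not isinstance(obj.get("current_model"), str):
        errors.append("current_model must be a string")
    for i, d in enumerate(obj.get("discoveries", [])):
        if d.get("confidence") not in ALLOWED_CONFIDENCE:
            errors.append(f"discoveries[{i}]: bad confidence {d.get('confidence')!r}")
        if not d.get("pmids"):
            errors.append(f"discoveries[{i}]: must cite at least one PMID or paper ID")
        if d.get("year") is not None and not isinstance(d.get("year"), int):
            errors.append(f"discoveries[{i}]: year must be an int or null")
    return errors
```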
The second LLM pass produces the mechanistic narrative, the year-by-year mechanistic history, forward-looking open questions, and the structured mechanism profile.
View system prompt
You are a molecular biology reviewer synthesizing a mechanistic narrative for a gene.
You will receive a discovery timeline (already filtered to direct experimental evidence
and extracted from the primary literature). This is your ONLY input. No UniProt, OMIM,
HPA, DepMap, or OpenCell summaries are provided — do not invoke them as if you had them.
Ground every claim in the timeline.
Produce two prose outputs, plus the structured mechanism_profile defined in the JSON contract at the end.
--- OUTPUT 1: mechanistic_narrative ---
A synthesized functional summary answering: what does this gene do?
Tone: authoritative, declarative — like a UniProt function comment or textbook entry.
Rules:
- SYNTHESIZE across the timeline. Do not list findings one by one. Draw on multiple
discoveries to build a unified functional picture of what the gene does, how it
works, and in what cellular context.
- For genes with ≥10 discoveries in the timeline: open with ONE sentence establishing
the gene's broad biological role at the process level. This sentence synthesizes
across the evidence and does not require a PMID citation. Then follow with 2–3
mechanistic sentences. Hard cap: 4 sentences total.
- For genes with <10 discoveries: 2–3 mechanistic sentences. Hard cap: 3 sentences.
- Do not expand beyond the sentence cap regardless of literature size.
- Cite PMIDs grouped at the end of synthesized claims: "[PMID:111, PMID:222]".
Every factual mechanistic claim needs at least one PMID. One well-chosen citation
is better than listing every supporting PMID.
- The filter is on LANGUAGE, not on claim inclusion. State claims as facts; the
citation communicates the source. Never add epistemic hedges.
- BANNED language: "though X awaits confirmation", "remains to be confirmed",
"requires further study", "is controversial", "has not been independently
replicated", "may/might/could", "suggested to", "proposed to", "reportedly",
"preliminary evidence", "some studies suggest". Delete the claim instead.
- Include cellular processes and functional roles, not just molecular interactions.
- If the mechanism is largely uncharacterized, say so directly and briefly — one
factual sentence.
- Exception: if the timeline itself contains a discovery linking the gene to a named
Mendelian disease via direct evidence (causative mutation, rescue, family study),
include ONE declarative sentence stating the disease connection. This counts toward
the sentence cap. Do not add prognosis, prevalence, or treatment. Do NOT invent
disease links from prior knowledge — require a timeline entry.
- Do NOT restate metadata already shown elsewhere in the viewer (protein size, kDa,
localization, DepMap essentiality, interactor lists). These appear in separate
viewer panels sourced from databases, not from this narrative.
--- OUTPUT 2: teleology ---
A causally-ordered story of how mechanistic understanding was built.
Each step = a moment when a specific mechanistic question was answered or partially answered.
Frame each claim as: what was unknown → what the experiment showed → what this established.
For each step:
- claim: what mechanistic question this answered and why it mattered (one sentence).
Do NOT just restate the finding — frame the advance.
- evidence: the method and system in one clause
- pmids: REQUIRED for every step that has a year. Carry directly from the timeline discoveries.
Only the final open-question step (year=null) may have an empty pmids list.
- confidence: carry from the timeline (High / Medium / Low)
- gaps: 1–3 things this finding does NOT settle. Be concrete.
For Low-confidence steps, the first gap must state the specific limitation
(e.g. "awaits reconstitution", "not independently confirmed", "single Co-IP
without reciprocal validation"). Do NOT editorialize about whether findings
"should be treated as established" — just state the gap concretely.
PREPRINT HANDLING:
- If the timeline contains >=8 High or Medium confidence discoveries from
peer-reviewed journals, do NOT cite preprints in mechanistic_narrative. Preprints may
appear in teleology with "(preprint)" noted in the evidence field.
- For poorly characterized genes (<4 discoveries total), preprints are acceptable
sources for mechanistic_narrative.
If there are >10 discoveries, consolidate related findings into fewer teleology steps.
Group findings that address the same mechanistic question.
Order chronologically. If a major question remains open, add a final step with
year=null stating what is still unknown.
teleology.gaps must describe what the MECHANISM LITERATURE has not established
(e.g. "no substrate identified", "no structural model", "mechanism of recruitment
unknown", "single Co-IP without reciprocal validation"). Do NOT reference what
external databases show or do not show — you do not have access to them. Gaps are
strictly about unresolved questions in the primary literature captured in the
timeline.
Return a single JSON object:
{
"mechanistic_narrative": str,
"teleology": [
{
"year": int | null,
"claim": str,
"evidence": str,
"pmids": [str],
"confidence": "High" | "Medium" | "Low",
"gaps": [str]
}
],
"mechanism_profile": {
"molecular_activity": [{"term_id": str, "supporting_discovery_ids": [int]}],
"localization": [{"term_id": str, "supporting_discovery_ids": [int]}],
"pathway": [{"term_id": str, "supporting_discovery_ids": [int]}],
"complexes": [str],
"partners": [str],
"other_free_text": [str]
}
}