Domain Research (dr): Comprehensive Report Tutorial¶
This notebook shows how to generate a comprehensive report from a user query using omicverse.llm.domain_research.
It covers the offline default synthesizer and an optional LLM-backed synthesizer.
Note: Live Web Retrieval Options
This tutorial now supports live web-backed retrieval in addition to offline demos.
- Use
vector_store="web"to auto-select a backend: Tavily ifTAVILY_API_KEYis set, otherwise DuckDuckGo. - Force a backend with
vector_store="web:tavily"or"web:duckduckgo". - Optional dependencies:
duckduckgo_searchandbeautifulsoup4improve DuckDuckGo results. - Set
RUN_WEB,RUN_WEB_FORCED, and/orRUN_WEB_LLMtoTruein the cells below to run those examples. - For LLM-backed synthesis, set
OPENAI_API_KEY. - Network access is required for web retrieval and online synthesis.
1) Quick start (offline)¶
- Implements a tiny in-memory vector store.
- Uses the default offline synthesizer to build a comprehensive report.
from omicverse.llm.domain_research import ResearchManager
class DemoStore:
def search(self, query):
class Doc:
def __init__(self, id, text, metadata=None):
self.id = id
self.text = text
self.metadata = metadata
return [Doc("1", f"Background on {query} and scRNA-seq workflows.")]
rm = ResearchManager(vector_store=DemoStore())
brief = rm.scope("Cell type annotation strategies in PBMCs")
findings = rm.research(brief)
report = rm.write(brief, findings)
print(report[:800])
The generated report includes:
- A top-level "Comprehensive Report" section (executive summary + objectives).
- Per-topic sections sourced from the vector store.
- Numbered citations with a "References" list appended.
2) Optional: ChromaDB-backed vector store¶
If you have chromadb and langchain_community installed, you can back the vector store with GPT4All embeddings.
This block is safe to run even if dependencies are missing — it will just skip.
try:
from omicverse.llm.domain_research import create_store, add_documents, query_store
docs = ["PBMC cell types include T cells, B cells, and monocytes.",
"Leiden clustering identifies cell communities in scRNA-seq."]
collection = create_store("demo_dr", persist_directory=None)
add_documents(collection, docs, ids=["d1", "d2"])
class ChromaStore:
def __init__(self, collection):
self.collection = collection
def search(self, query):
res = query_store(self.collection, query, n_results=2)
out = []
for id_, txt in zip(res.get("ids", [[""]])[0], res.get("documents", [[""]])[0]):
class Doc:
pass
d = Doc(); d.id = id_; d.text = txt; d.metadata = None
out.append(d)
return out
rm = ResearchManager(vector_store=ChromaStore(collection))
print(rm.run("PBMC annotation best practices")[:800])
except Exception as e:
print("ChromaDB example skipped:", e)
3) Optional: LLM-backed synthesizer (OpenAI-compatible)¶
Use a chat-completions API to synthesize an executive-style summary.
Set RUN_ONLINE=True and ensure OPENAI_API_KEY is available.
RUN_ONLINE = False
if RUN_ONLINE:
import os
from omicverse.llm.domain_research.write.synthesizer import PromptSynthesizer
api_key = os.getenv("OPENAI_API_KEY", "")
if not api_key:
raise RuntimeError("Missing OPENAI_API_KEY")
synth = PromptSynthesizer(model="gpt-4o-mini", base_url="https://api.openai.com/v1", api_key=api_key)
rm = ResearchManager(vector_store=DemoStore(), synthesizer=synth)
print(rm.run("Single-cell integration overview"))
else:
print("RUN_ONLINE is False — skipped external API calls.")
4) Optional: Live web search¶
Use a live web-backed retriever as the vector store.
- If
TAVILY_API_KEYis set,vector_store="web"uses Tavily; otherwise falls back to DuckDuckGo. - Alternatively, force a backend with
"web:tavily"or"web:duckduckgo".
RUN_WEB = False
if RUN_WEB:
from omicverse.llm.domain_research import ResearchManager
# Auto-selects Tavily if TAVILY_API_KEY is present, else DuckDuckGo
rm = ResearchManager(vector_store="web")
print(rm.run("Recent advances in single-cell RNA-seq batch correction")[:800])
else:
print("RUN_WEB is False — skipped live web retrieval.")
4a) Force web backend¶
- Force a specific backend via string flag:
"web:tavily"or"web:duckduckgo".
RUN_WEB_FORCED = False
if RUN_WEB_FORCED:
from omicverse.llm.domain_research import ResearchManager
# Force DuckDuckGo (works without API key)
rm = ResearchManager(vector_store="web:duckduckgo")
print(rm.run("Single-cell clustering best practices")[:800])
else:
print("RUN_WEB_FORCED is False — skipped forced backend.")
5) Live web + LLM synthesis¶
Combine live retrieval with an OpenAI-compatible synthesizer for the executive summary.
RUN_WEB_LLM = False
if RUN_WEB_LLM:
import os
from omicverse.llm.domain_research import ResearchManager
from omicverse.llm.domain_research.write.synthesizer import PromptSynthesizer
api_key = os.getenv("OPENAI_API_KEY", "")
if not api_key:
raise RuntimeError("Missing OPENAI_API_KEY")
synth = PromptSynthesizer(model="gpt-4o-mini", base_url="https://api.openai.com/v1", api_key=api_key)
rm = ResearchManager(vector_store="web", synthesizer=synth)
print(rm.run("scRNA-seq batch correction methods in 2024")[:800])
else:
print("RUN_WEB_LLM is False — skipped LLM synthesis with web retrieval.")