Critical Analysis: Search Tools - Limitations, Gaps, and Improvements
Date: November 2025
Purpose: Honest assessment of all search tools to identify what's working, what's broken, and what needs improvement WITHOUT horizontal sprawl.
Executive Summary
DeepBoner currently has 4 search tools:
- PubMed (NCBI E-utilities)
- ClinicalTrials.gov (API v2)
- Europe PMC (includes preprints)
- OpenAlex (citation-aware)
Overall Assessment: Tools are functional but have significant gaps in:
- Deduplication (PubMed ∩ Europe PMC ∩ OpenAlex = massive overlap)
- Full-text retrieval (only abstracts currently)
- Citation graph traversal (OpenAlex has data but we don't use it)
- Query optimization (basic synonym expansion, no MeSH term mapping)
Tool 1: PubMed (NCBI E-utilities)
File: src/tools/pubmed.py
What It Does Well
| Feature | Status | Notes |
|---|---|---|
| Rate limiting | ✅ | Shared limiter, respects 3/sec (no key) or 10/sec (with key) |
| Retry logic | ✅ | tenacity with exponential backoff |
| Query preprocessing | ✅ | Strips question words, expands synonyms |
| Abstract parsing | ✅ | Handles XML edge cases (dict vs list) |
Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|---|---|---|
| 10,000 result cap per query | Medium | Yes - use date ranges to paginate |
| Abstracts only (no full text) | High | No - full text requires PMC or publisher |
| No citation counts | Medium | Yes - cross-reference with OpenAlex |
| Rate limit (10/sec max) | Low | Already handled |
Current Implementation Gaps
```python
# GAP 1: No MeSH term expansion
# Current: expand_synonyms() uses hardcoded dict
# Better: Use NCBI's E-utilities to get MeSH terms for query

# GAP 2: No date filtering
# Current: Gets whatever PubMed returns (biased toward recent)
# Better: Add date range parameter for historical research

# GAP 3: No publication type filtering
# Current: Returns all types (reviews, case reports, RCTs)
# Better: Filter for RCTs and systematic reviews when appropriate
```
Priority Improvements
- HIGH: Add publication type filter (Reviews, RCTs, Meta-analyses)
- MEDIUM: Add date range parameter
- LOW: MeSH term expansion via E-utilities
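
The publication type and date range improvements can both be expressed as ESearch parameters. The sketch below is a minimal illustration, not the code in src/tools/pubmed.py: `build_esearch_params` and the `PUBLICATION_TYPES` keys are hypothetical names, while `[pt]`, `datetype`, `mindate`, and `maxdate` are standard E-utilities query features.

```python
from __future__ import annotations

from datetime import date

# Hypothetical short names mapped to PubMed publication-type ([pt]) tags.
PUBLICATION_TYPES = {
    "rct": '"randomized controlled trial"[pt]',
    "review": '"review"[pt]',
    "systematic_review": '"systematic review"[pt]',
    "meta_analysis": '"meta-analysis"[pt]',
}


def build_esearch_params(
    query: str,
    pub_types: list[str] | None = None,
    start: date | None = None,
    end: date | None = None,
    max_results: int = 20,
) -> dict[str, str]:
    """Build ESearch query parameters with optional type and date filters."""
    term = query
    if pub_types:
        # OR the requested types together, then AND them with the user query.
        type_clause = " OR ".join(PUBLICATION_TYPES[t] for t in pub_types)
        term = f"({query}) AND ({type_clause})"

    params = {
        "db": "pubmed",
        "term": term,
        "retmax": str(max_results),
        "retmode": "json",
    }
    if start or end:
        # ESearch expects mindate and maxdate together, so fill in the open side.
        params["datetype"] = "pdat"  # filter on publication date
        params["mindate"] = (start or date(1800, 1, 1)).strftime("%Y/%m/%d")
        params["maxdate"] = (end or date.today()).strftime("%Y/%m/%d")
    return params
```

Leaving `pub_types` and both dates unset reproduces today's unfiltered behavior, so the filters can be rolled out as opt-in parameters.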
Tool 2: ClinicalTrials.gov
File: src/tools/clinicaltrials.py
What It Does Well
| Feature | Status | Notes |
|---|---|---|
| API v2 usage | ✅ | Modern API, not deprecated v1 |
| Interventional filter | ✅ | Only gets drug/treatment studies |
| Status filter | ✅ | COMPLETED, ACTIVE, RECRUITING |
| httpx → requests workaround | ✅ | Bypasses WAF TLS fingerprint block |
Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|---|---|---|
| No results data | High | Yes - available via different endpoint |
| No outcome measures | High | Yes - add to FIELDS list |
| No adverse events | Medium | Yes - separate API call |
| Sparse drug mechanism data | Medium | No - not in API |
Current Implementation Gaps
```python
# GAP 1: Missing critical fields
FIELDS: ClassVar[list[str]] = [
    "NCTId",
    "BriefTitle",
    "Phase",
    "OverallStatus",
    "Condition",
    "InterventionName",
    "StartDate",
    "BriefSummary",
    # MISSING:
    # "PrimaryOutcome",
    # "SecondaryOutcome",
    # "ResultsFirstSubmitDate",
    # "StudyResults",  # Whether results are posted
]

# GAP 2: No results retrieval
# Many completed trials have posted results
# We could get actual efficacy data, not just trial existence

# GAP 3: No linked publications
# Trials often link to PubMed articles with results
# We could follow these links for richer evidence
```
Priority Improvements
- HIGH: Add outcome measures to FIELDS
- HIGH: Check for and retrieve posted results
- MEDIUM: Follow linked publications (NCT → PMID)
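
Once outcome fields are requested, the response still has to be flattened into something the evidence model can use. The sketch below assumes the API v2 study JSON layout (`protocolSection.outcomesModule.primaryOutcomes[].measure` and a top-level `hasResults` flag); `summarize_outcomes` is a hypothetical helper, and the layout should be checked against a live response before relying on it.

```python
from __future__ import annotations

from typing import Any


def summarize_outcomes(study: dict[str, Any]) -> dict[str, Any]:
    """Pull outcome measures and the posted-results flag out of one study record.

    Assumes the ClinicalTrials.gov API v2 JSON layout; every lookup falls back
    to an empty value so a sparse record does not raise.
    """
    protocol = study.get("protocolSection", {})
    ident = protocol.get("identificationModule", {})
    outcomes = protocol.get("outcomesModule", {})

    def measures(key: str) -> list[str]:
        # Each outcome is a dict; keep only the human-readable measure text.
        return [o.get("measure", "") for o in outcomes.get(key, [])]

    return {
        "nct_id": ident.get("nctId"),
        "has_results": bool(study.get("hasResults", False)),
        "primary": measures("primaryOutcomes"),
        "secondary": measures("secondaryOutcomes"),
    }
```

A `has_results` value of `True` is the signal to make the follow-up call for posted results rather than fetching them for every trial.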
Tool 3: Europe PMC
File: src/tools/europepmc.py
What It Does Well
| Feature | Status | Notes |
|---|---|---|
| Preprint coverage | ✅ | bioRxiv, medRxiv, ChemRxiv indexed |
| Preprint labeling | ✅ | [PREPRINT - Not peer-reviewed] marker |
| DOI/PMID fallback URLs | ✅ | Smart URL construction |
| Relevance scoring | ✅ | Preprints weighted lower (0.75 vs 0.9) |
Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|---|---|---|
| No full text for most articles | High | Partial - CC-licensed available after 14 days |
| Citation data limited | Medium | Only journal articles, not preprints |
| Preprint-publication linking gaps | Medium | ~50% of links missing per Crossref |
| License info sometimes missing | Low | Manual review required |
Current Implementation Gaps
```python
# GAP 1: No full-text retrieval
# Europe PMC has full text for many CC-licensed articles
# Could retrieve full text XML via separate endpoint

# GAP 2: Massive overlap with PubMed
# Europe PMC indexes all of PubMed/MEDLINE
# We're getting duplicates with no deduplication

# GAP 3: No citation network
# Europe PMC has "citedByCount" but we don't use it
# Could prioritize highly-cited preprints
```
Priority Improvements
- HIGH: Add deduplication with PubMed (by PMID)
- MEDIUM: Retrieve citation counts for ranking
- LOW: Full-text retrieval for CC-licensed articles
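
The first two improvements are small list operations on the raw search results. The sketch below assumes each Europe PMC result is a dict carrying optional `pmid` and `citedByCount` keys, as in the REST search JSON; `drop_known_pmids` and `rank_by_citations` are hypothetical helper names.

```python
from __future__ import annotations

from typing import Any


def drop_known_pmids(
    results: list[dict[str, Any]], known_pmids: set[str]
) -> list[dict[str, Any]]:
    """Remove results whose PMID was already returned by the PubMed tool.

    Results without a PMID (most preprints) are always kept.
    """
    return [r for r in results if r.get("pmid") not in known_pmids]


def rank_by_citations(results: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Sort by citedByCount, highest first, treating a missing count as zero.

    sorted() is stable, so results with equal counts keep their original
    (relevance) order.
    """
    return sorted(
        results,
        key=lambda r: int(r.get("citedByCount") or 0),
        reverse=True,
    )
```

Filtering before ranking keeps preprints, which rarely have a PMID, in the result set while removing the MEDLINE records PubMed already supplied.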
Tool 4: OpenAlex
File: src/tools/openalex.py
What It Does Well
| Feature | Status | Notes |
|---|---|---|
| Citation counts | ✅ | Sorted by cited_by_count:desc |
| Abstract reconstruction | ✅ | Handles inverted index format |
| Concept extraction | ✅ | Hierarchical classification |
| Open access detection | ✅ | is_oa and pdf_url |
| Polite pool | ✅ | mailto for 100k/day limit |
| Rich metadata | ✅ | Best metadata of all tools |
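
For readers unfamiliar with the inverted index format mentioned in the table: OpenAlex returns abstracts as a mapping from each word to the positions where it occurs, so the text has to be rebuilt by sorting on position. The sketch below illustrates the idea and is not the code in src/tools/openalex.py; `reconstruct_abstract` is a hypothetical name.

```python
from __future__ import annotations


def reconstruct_abstract(inverted_index: dict[str, list[int]] | None) -> str:
    """Rebuild plain text from an OpenAlex-style abstract_inverted_index."""
    if not inverted_index:
        # Many works have no abstract at all; return an empty string for those.
        return ""
    # Flatten {word: [positions]} into (position, word) pairs, then sort.
    positions = [
        (pos, word)
        for word, indexes in inverted_index.items()
        for pos in indexes
    ]
    return " ".join(word for _, word in sorted(positions))
```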
Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|---|---|---|
| Author truncation at 100 | Low | Only affects mega-author papers |
| No full text | High | No - OpenAlex is metadata only |
| Stale data (1-2 day lag) | Low | Acceptable for research |
Current Implementation Gaps
```python
# GAP 1: No citation graph traversal
# OpenAlex exposes citing works (the `cites:` filter) and reference lists (`referenced_works`)
# We could find seminal papers by following citation chains

# GAP 2: No related works
# OpenAlex has ML-powered "related_works" field
# Could expand search to similar papers

# GAP 3: No concept filtering
# OpenAlex has hierarchical concepts
# Could filter for specific domains (e.g., "Sexual health" concept)

# GAP 4: Overlap with PubMed
# OpenAlex indexes most of PubMed
# More duplicates without deduplication
```
Priority Improvements
- HIGH: Add citation graph traversal (find seminal papers)
- HIGH: Add deduplication with PubMed/Europe PMC
- MEDIUM: Use `related_works` for query expansion
- LOW: Concept-based filtering
Cross-Tool Issues
Issue 1: MASSIVE DUPLICATION
PubMed: 36M+ articles
Europe PMC: Indexes ALL of PubMed + preprints
OpenAlex: 250M+ works (includes PubMed)
Current behavior: All 3 return the same papers
Result: Duplicate evidence, wasted tokens, inflated counts
Solution: Deduplication by PMID/DOI
```python
# Proposed: Add to SearchHandler
def deduplicate_evidence(evidence_list: list[Evidence]) -> list[Evidence]:
    seen_ids: set[str] = set()
    unique: list[Evidence] = []
    for e in evidence_list:
        # Extract PMID or DOI from URL
        paper_id = extract_paper_id(e.citation.url)
        if paper_id not in seen_ids:
            seen_ids.add(paper_id)
            unique.append(e)
    return unique
```
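
The proposal depends on `extract_paper_id`, which is not defined yet. A minimal sketch follows, assuming citation URLs look like `https://pubmed.ncbi.nlm.nih.gov/<pmid>/` or `https://doi.org/<doi>`; the URL patterns and the `pmid:`/`doi:`/`url:` key prefixes are assumptions for illustration, not the project's existing conventions.

```python
import re

# Assumed URL shapes; adjust to whatever the tools actually emit.
_PMID_RE = re.compile(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)")
_DOI_RE = re.compile(r"(?:doi\.org/|doi:)(10\.\d{4,9}/\S+)", re.IGNORECASE)


def extract_paper_id(url: str) -> str:
    """Return a normalized key for a citation URL.

    Prefers PMID, then DOI, and falls back to the normalized URL itself so
    that records without either identifier are never merged by accident.
    """
    match = _PMID_RE.search(url)
    if match:
        return f"pmid:{match.group(1)}"
    match = _DOI_RE.search(url)
    if match:
        # DOIs are case-insensitive, so lowercase before comparing.
        return f"doi:{match.group(1).lower().rstrip('/')}"
    return f"url:{url.strip().lower().rstrip('/')}"
```

One caveat with URL-only matching: the same paper returned by PubMed under a PMID URL and by OpenAlex under a DOI URL still produces two different keys. Closing that gap needs a PMID-to-DOI mapping, for example from identifier fields that the sources return alongside the URL.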
Issue 2: NO FULL-TEXT RETRIEVAL
All tools return abstracts only. For deep research, this is limiting.
What's Actually Possible:
| Source | Full Text Access | How |
|---|---|---|
| PubMed Central (PMC) | Yes, for OA articles | Separate API: efetch with db=pmc |
| Europe PMC | Yes, CC-licensed after 14 days | /{id}/fullTextXML endpoint (id is a PMCID) |
| OpenAlex | No | Metadata only |
| Unpaywall | Yes, OA link discovery | Separate API |
Recommendation: Add PMC full-text retrieval for open access articles.
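
Both full-text routes in the table are keyed by PMCID, so the first step is building the request URLs. The sketch below only constructs URLs and makes no network calls; the helper names are hypothetical, and the Europe PMC path shape (`/{id}/fullTextXML` under the REST root) should be confirmed against the current API documentation.

```python
from __future__ import annotations

from urllib.parse import urlencode

EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
EUROPEPMC_REST = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def normalize_pmcid(pmcid: str) -> str:
    """Accept 'PMC123', 'pmc123', or '123' and return 'PMC123'."""
    digits = pmcid.strip().upper().removeprefix("PMC")
    if not digits.isdigit():
        raise ValueError(f"Not a PMCID: {pmcid!r}")
    return f"PMC{digits}"


def pmc_efetch_url(pmcid: str, api_key: str | None = None) -> str:
    """URL for NCBI efetch against db=pmc, using the numeric part of the ID."""
    params = {
        "db": "pmc",
        "id": normalize_pmcid(pmcid).removeprefix("PMC"),
        "retmode": "xml",
    }
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


def europepmc_fulltext_url(pmcid: str) -> str:
    """URL for the Europe PMC full-text XML of an open access article."""
    return f"{EUROPEPMC_REST}/{normalize_pmcid(pmcid)}/fullTextXML"
```

Either request can come back without a usable body for articles that are not open access, so callers should treat a missing full text as normal and fall back to the abstract.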
Issue 3: NO CITATION GRAPH
OpenAlex has rich citation data but we only use cited_by_count for sorting.
Untapped Capabilities:
- Citing works (the `cites:` filter, or a work's `cited_by_api_url`): Find papers that cite a key paper
- `referenced_works`: Find sources a paper cites
- `related_works`: ML-powered similar papers
Use Case: User asks about "testosterone therapy for HSDD". We find a seminal 2019 RCT. We could automatically find:
- Papers that cite it (newer evidence)
- Papers it cites (foundational research)
- Related papers (similar topics)
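
The three lookups in this use case map onto one filtered request plus two fields already present on the work record. The sketch below is illustrative only: `citing_works_url` and `neighbor_ids` are hypothetical helpers, and it assumes the OpenAlex `cites:` filter, the `per-page` and `mailto` parameters, and the `referenced_works` / `related_works` fields behave as described in the OpenAlex documentation.

```python
from __future__ import annotations

from typing import Any
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def citing_works_url(work_id: str, mailto: str, per_page: int = 25) -> str:
    """URL listing works that cite `work_id`, most-cited first."""
    params = {
        "filter": f"cites:{work_id}",
        "sort": "cited_by_count:desc",
        "per-page": per_page,
        "mailto": mailto,  # keeps requests in the polite pool
    }
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def _short_id(openalex_url: str) -> str:
    # 'https://openalex.org/W123' -> 'W123'
    return openalex_url.rsplit("/", 1)[-1]


def neighbor_ids(work: dict[str, Any]) -> dict[str, list[str]]:
    """Collect the outgoing references and related works of one work record."""
    return {
        "references": [_short_id(u) for u in work.get("referenced_works", [])],
        "related": [_short_id(u) for u in work.get("related_works", [])],
    }
```

Capping `per_page` and following only one hop from the seed paper keeps the traversal cheap; deeper walks multiply request counts quickly.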
What's NOT Possible (API Constraints)
| Feature | Why Not Possible |
|---|---|
| bioRxiv direct search | No keyword search API; the public API and RSS feed only list preprints by date or DOI |
| arXiv search | API exists but irrelevant for sexual health |
| PubMed full text | Requires publisher access or PMC |
| Real-time trial results | ClinicalTrials.gov results are static snapshots |
| Drug mechanism data | Not in any of the four current APIs - would need ChEMBL or DrugBank |
Recommended Improvements (Priority Order)
Phase 1: Fix Fundamentals (High ROI)
- Deduplication - Stop returning the same paper 3 times
- Outcome measures in ClinicalTrials - Get actual efficacy data
- Citation counts for every result - Rank by influence, not recency (PubMed has none, so cross-reference OpenAlex)
Phase 2: Depth Improvements (Medium ROI)
- PMC full-text retrieval - Get full papers for OA articles
- Citation graph traversal - Find seminal papers automatically
- Publication type filtering - Prioritize RCTs and meta-analyses
Phase 3: Quality Improvements (Lower ROI, Nice-to-Have)
- MeSH term expansion - Better PubMed queries
- Related works expansion - Use OpenAlex ML similarity
- Date range filtering - Historical vs recent research
Neo4j Integration (Future Consideration)
Question: Should we add Neo4j for citation graph storage?
Answer: Not yet. Here's why:
| Approach | Complexity | Value |
|---|---|---|
| OpenAlex API for citation traversal | Low | High |
| Neo4j for local citation graph | High | Medium (unless doing graph analytics) |
| Cron job to sync OpenAlex → Neo4j | Medium | Only if we need offline access |
Recommendation: Use OpenAlex API for citation traversal first. Only add Neo4j if:
- We need to do complex graph queries (PageRank on citations, community detection)
- We need offline access to citation data
- We're hitting OpenAlex rate limits
Summary: What's Broken vs What's Working
Working Well
- Basic search across all 4 sources
- Rate limiting and retry logic
- Query preprocessing
- Evidence model with citations
Needs Fixing (Current Scope)
- Deduplication (critical)
- Outcome measures in ClinicalTrials (critical)
- Citation-based ranking (important)
Future Enhancements (Out of Current Scope)
- Full-text retrieval
- Citation graph traversal
- Neo4j integration
- Drug mechanism data (would need new data sources)