Critical Analysis: Search Tools - Limitations, Gaps, and Improvements

Date: November 2025

Purpose: Honest assessment of all search tools to identify what's working, what's broken, and what needs improvement WITHOUT horizontal sprawl.


Executive Summary

DeepBoner currently has 4 search tools:

  1. PubMed (NCBI E-utilities)
  2. ClinicalTrials.gov (API v2)
  3. Europe PMC (includes preprints)
  4. OpenAlex (citation-aware)

Overall Assessment: Tools are functional but have significant gaps in:

  • Deduplication (PubMed ∩ Europe PMC ∩ OpenAlex = massive overlap)
  • Full-text retrieval (only abstracts currently)
  • Citation graph traversal (OpenAlex has data but we don't use it)
  • Query optimization (basic synonym expansion, no MeSH term mapping)

Tool 1: PubMed (NCBI E-utilities)

File: src/tools/pubmed.py

What It Does Well

| Feature | Status | Notes |
| --- | --- | --- |
| Rate limiting | ✅ | Shared limiter, respects 3/sec (no key) or 10/sec (with key) |
| Retry logic | ✅ | tenacity with exponential backoff |
| Query preprocessing | ✅ | Strips question words, expands synonyms |
| Abstract parsing | ✅ | Handles XML edge cases (dict vs list) |

Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
| --- | --- | --- |
| 10,000 result cap per query | Medium | Yes - use date ranges to paginate |
| Abstracts only (no full text) | High | No - full text requires PMC or publisher |
| No citation counts | Medium | Yes - cross-reference with OpenAlex |
| Rate limit (10/sec max) | Low | Already handled |

Current Implementation Gaps

# GAP 1: No MeSH term expansion
# Current: expand_synonyms() uses hardcoded dict
# Better: Use NCBI's E-utilities to get MeSH terms for query

# GAP 2: No date filtering
# Current: Gets whatever PubMed returns (biased toward recent)
# Better: Add date range parameter for historical research

# GAP 3: No publication type filtering
# Current: Returns all types (reviews, case reports, RCTs)
# Better: Filter for RCTs and systematic reviews when appropriate

Priority Improvements

  1. HIGH: Add publication type filter (Reviews, RCTs, Meta-analyses)
  2. MEDIUM: Add date range parameter
  3. LOW: MeSH term expansion via E-utilities
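Gaps 2 and 3 can both be closed at query-construction time. The following is a minimal sketch, not the current `pubmed.py` code: `build_esearch_params` and the `PUBLICATION_TYPE_TAGS` keys are hypothetical names. It assumes the standard ESearch parameters (`term`, `datetype`, `mindate`, `maxdate`) and PubMed's `[pt]` publication-type field tag. ESearch expects `mindate` and `maxdate` together, so the sketch fills in a wide default for whichever bound is missing.

```python
from typing import Optional, Sequence

# Hypothetical short names mapped to PubMed publication-type search tags.
PUBLICATION_TYPE_TAGS = {
    "rct": "randomized controlled trial[pt]",
    "systematic_review": "systematic review[pt]",
    "meta_analysis": "meta-analysis[pt]",
    "review": "review[pt]",
}


def build_esearch_params(
    query: str,
    pub_types: Sequence[str] = (),
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,
    retmax: int = 20,
) -> dict[str, str]:
    """Build ESearch query parameters with optional type and date filters."""
    term = f"({query})"
    if pub_types:
        # OR the requested publication types together, then AND with the query.
        tags = " OR ".join(PUBLICATION_TYPE_TAGS[p] for p in pub_types)
        term = f"{term} AND ({tags})"

    params = {"db": "pubmed", "term": term, "retmax": str(retmax), "retmode": "json"}

    if min_year is not None or max_year is not None:
        # Filter on publication date; both bounds are always sent together.
        params["datetype"] = "pdat"
        params["mindate"] = str(min_year if min_year is not None else 1800)
        params["maxdate"] = str(max_year if max_year is not None else 3000)
    return params
```

The returned dict can be passed as query parameters to the existing ESearch call, so the shared rate limiter and retry logic would not need to change.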

Tool 2: ClinicalTrials.gov

File: src/tools/clinicaltrials.py

What It Does Well

| Feature | Status | Notes |
| --- | --- | --- |
| API v2 usage | ✅ | Modern API, not deprecated v1 |
| Interventional filter | ✅ | Only gets drug/treatment studies |
| Status filter | ✅ | COMPLETED, ACTIVE, RECRUITING |
| httpx → requests workaround | ✅ | Bypasses WAF TLS fingerprint block |

Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
| --- | --- | --- |
| No results data | High | Yes - available via different endpoint |
| No outcome measures | High | Yes - add to FIELDS list |
| No adverse events | Medium | Yes - separate API call |
| Sparse drug mechanism data | Medium | No - not in API |

Current Implementation Gaps

# GAP 1: Missing critical fields
FIELDS: ClassVar[list[str]] = [
    "NCTId",
    "BriefTitle",
    "Phase",
    "OverallStatus",
    "Condition",
    "InterventionName",
    "StartDate",
    "BriefSummary",
    # MISSING:
    # "PrimaryOutcome",
    # "SecondaryOutcome",
    # "ResultsFirstSubmitDate",
    # "StudyResults",  # Whether results are posted
]

# GAP 2: No results retrieval
# Many completed trials have posted results
# We could get actual efficacy data, not just trial existence

# GAP 3: No linked publications
# Trials often link to PubMed articles with results
# We could follow these links for richer evidence

Priority Improvements

  1. HIGH: Add outcome measures to FIELDS
  2. HIGH: Check for and retrieve posted results
  3. MEDIUM: Follow linked publications (NCT → PMID)
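Once outcome fields are requested, they still have to be pulled out of the response. The sketch below assumes the nested JSON layout that API v2 study records appear to use (`protocolSection.outcomesModule.primaryOutcomes[].measure` and a top-level `hasResults` flag). Those key names should be verified against a live response before relying on them. `summarize_outcomes` is a hypothetical helper, not part of `clinicaltrials.py`.

```python
from typing import Any


def summarize_outcomes(study: dict[str, Any]) -> dict[str, Any]:
    """Extract outcome measures and the results flag from one study record.

    Key names are assumptions about the API v2 JSON layout; missing keys
    fall back to empty values instead of raising.
    """
    protocol = study.get("protocolSection", {})
    ident = protocol.get("identificationModule", {})
    outcomes = protocol.get("outcomesModule", {})

    def measures(key: str) -> list[str]:
        # Keep only entries that actually carry a measure description.
        return [o["measure"] for o in outcomes.get(key, []) if o.get("measure")]

    return {
        "nct_id": ident.get("nctId"),
        "primary_outcomes": measures("primaryOutcomes"),
        "secondary_outcomes": measures("secondaryOutcomes"),
        "has_results": bool(study.get("hasResults", False)),
    }
```

A record with `has_results` set to true is the signal to make the follow-up request for posted results. Records without it can be skipped.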

Tool 3: Europe PMC

File: src/tools/europepmc.py

What It Does Well

| Feature | Status | Notes |
| --- | --- | --- |
| Preprint coverage | ✅ | bioRxiv, medRxiv, ChemRxiv indexed |
| Preprint labeling | ✅ | [PREPRINT - Not peer-reviewed] marker |
| DOI/PMID fallback URLs | ✅ | Smart URL construction |
| Relevance scoring | ✅ | Preprints weighted lower (0.75 vs 0.9) |

Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
| --- | --- | --- |
| No full text for most articles | High | Partial - CC-licensed available after 14 days |
| Citation data limited | Medium | Only journal articles, not preprints |
| Preprint-publication linking gaps | Medium | ~50% of links missing per Crossref |
| License info sometimes missing | Low | Manual review required |

Current Implementation Gaps

# GAP 1: No full-text retrieval
# Europe PMC has full text for many CC-licensed articles
# Could retrieve full text XML via separate endpoint

# GAP 2: Massive overlap with PubMed
# Europe PMC indexes all of PubMed/MEDLINE
# We're getting duplicates with no deduplication

# GAP 3: No citation network
# Europe PMC has "citedByCount" but we don't use it
# Could prioritize highly-cited preprints

Priority Improvements

  1. HIGH: Add deduplication with PubMed (by PMID)
  2. MEDIUM: Retrieve citation counts for ranking
  3. LOW: Full-text retrieval for CC-licensed articles
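For the citation-count ranking item, one possible scoring rule combines the existing preprint weighting (0.75 vs 0.9) with `citedByCount`. This is a sketch under assumptions: the log damping is an illustrative choice rather than something the current code does, `rank_score` and `rank_results` are hypothetical names, and the sketch assumes preprints carry `source == "PPR"` in Europe PMC search results.

```python
import math
from typing import Any


def rank_score(
    result: dict[str, Any],
    preprint_weight: float = 0.75,
    article_weight: float = 0.9,
) -> float:
    """Score one Europe PMC result by source type and citation count."""
    base = preprint_weight if result.get("source") == "PPR" else article_weight
    cites = int(result.get("citedByCount") or 0)
    # log1p dampens very large counts so one mega-cited paper does not dominate.
    return base * (1.0 + math.log1p(cites))


def rank_results(results: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return results sorted from highest to lowest score."""
    return sorted(results, key=rank_score, reverse=True)
```

With this rule a well-cited preprint can outrank an uncited journal article, while a journal article with the same citation count still ranks above the preprint.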

Tool 4: OpenAlex

File: src/tools/openalex.py

What It Does Well

| Feature | Status | Notes |
| --- | --- | --- |
| Citation counts | ✅ | Sorted by cited_by_count:desc |
| Abstract reconstruction | ✅ | Handles inverted index format |
| Concept extraction | ✅ | Hierarchical classification |
| Open access detection | ✅ | is_oa and pdf_url |
| Polite pool | ✅ | mailto for 100k/day limit |
| Rich metadata | ✅ | Best metadata of all tools |
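For readers unfamiliar with the inverted index format mentioned above: OpenAlex returns abstracts in `abstract_inverted_index`, a mapping from each word to the positions where it occurs. The function below is an illustrative sketch of the reconstruction step, not a copy of the code in `openalex.py`.

```python
from typing import Optional


def reconstruct_abstract(inverted_index: Optional[dict[str, list[int]]]) -> str:
    """Rebuild plain text from a word -> positions mapping."""
    if not inverted_index:
        # OpenAlex returns null when no abstract is available.
        return ""
    # Flatten to (position, word) pairs, then order by position.
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)
```

A word that appears several times simply lists several positions, so sorting the flattened pairs restores the original word order.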

Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
| --- | --- | --- |
| Author truncation at 100 | Low | Only affects mega-author papers |
| No full text | High | No - OpenAlex is metadata only |
| Stale data (1-2 day lag) | Low | Acceptable for research |

Current Implementation Gaps

# GAP 1: No citation graph traversal
# OpenAlex exposes citing works (the `cites` filter / `cited_by_api_url`)
# and outgoing references (the `referenced_works` field)
# We could find seminal papers by following citation chains

# GAP 2: No related works
# OpenAlex has an algorithmically computed "related_works" field
# Could expand search to similar papers

# GAP 3: No concept filtering
# OpenAlex has hierarchical concepts
# Could filter for specific domains (e.g., "Sexual health" concept)

# GAP 4: Overlap with PubMed
# OpenAlex indexes most of PubMed
# More duplicates without deduplication

Priority Improvements

  1. HIGH: Add citation graph traversal (find seminal papers)
  2. HIGH: Add deduplication with PubMed/Europe PMC
  3. MEDIUM: Use related_works for query expansion
  4. LOW: Concept-based filtering

Cross-Tool Issues

Issue 1: MASSIVE DUPLICATION

PubMed: 36M+ articles
Europe PMC: Indexes ALL of PubMed + preprints
OpenAlex: 250M+ works (includes PubMed)

Current behavior: All 3 can return the same papers for a single query
Result: Duplicate evidence, wasted tokens, inflated counts

Solution: Deduplication by PMID/DOI

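# Proposed: Add to SearchHandler
def deduplicate_evidence(evidence_list: list[Evidence]) -> list[Evidence]:
    """Drop items whose PMID/DOI was already seen, keeping the first occurrence."""
    seen_ids: set[str] = set()
    unique: list[Evidence] = []
    for e in evidence_list:
        # Extract PMID or DOI from URL.
        # extract_paper_id is a helper still to be written; it is assumed
        # to return None when the URL carries no recognizable identifier.
        paper_id = extract_paper_id(e.citation.url)
        if paper_id is None:
            # Keep unidentifiable items instead of collapsing them into one.
            unique.append(e)
            continue
        if paper_id not in seen_ids:
            seen_ids.add(paper_id)
            unique.append(e)
    return unique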
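The proposal relies on an `extract_paper_id` helper that does not exist yet. One possible shape is sketched below. The URL patterns are assumptions about what the four tools emit (PubMed article pages, Europe PMC `MED` pages, and DOI links) and would need checking against real `citation.url` values.

```python
import re
from typing import Optional

# Assumed URL shapes; adjust to whatever the tools actually produce.
_PMID_RE = re.compile(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)")
_EPMC_MED_RE = re.compile(r"europepmc\.org/(?:abstract|article)/MED/(\d+)", re.IGNORECASE)
_DOI_RE = re.compile(r"(10\.\d{4,9}/[^\s?#]+)")


def extract_paper_id(url: str) -> Optional[str]:
    """Return a normalized 'pmid:...' or 'doi:...' key, or None if neither is found."""
    match = _PMID_RE.search(url) or _EPMC_MED_RE.search(url)
    if match:
        return f"pmid:{match.group(1)}"
    match = _DOI_RE.search(url)
    if match:
        # DOIs are case-insensitive, so lowercase them before comparing.
        return f"doi:{match.group(1).lower().rstrip('/')}"
    return None
```

URL-based keys have a known gap: the same paper reached through a PMID URL from PubMed and a DOI URL from OpenAlex produces two different keys. Closing that gap would require carrying both identifiers on the evidence record, since all three sources can report a DOI alongside their own ID.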

Issue 2: NO FULL-TEXT RETRIEVAL

All tools return abstracts only. For deep research, this is limiting.

What's Actually Possible:

| Source | Full Text Access | How |
| --- | --- | --- |
| PubMed Central (PMC) | Yes, for OA articles | Separate API: efetch with db=pmc |
| Europe PMC | Yes, CC-licensed after 14 days | /{id}/fullTextXML endpoint |
| OpenAlex | No | Metadata only |
| Unpaywall | Yes, OA link discovery | Separate API |

Recommendation: Add PMC full-text retrieval for open access articles.
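As a starting point for that recommendation, the two full-text routes in the table reduce to simple URL construction. This sketch only builds the request URLs. The function names are hypothetical, and fetching, rate limiting, and XML parsing are left to the existing tool infrastructure.

```python
from typing import Optional
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
EUROPE_PMC_REST = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def pmc_efetch_url(pmcid: str, api_key: Optional[str] = None) -> str:
    """Build an EFetch URL for the full-text XML of one PMC article."""
    # EFetch takes the numeric part of the PMC ID when db=pmc.
    numeric_id = pmcid.upper().removeprefix("PMC")
    params = {"db": "pmc", "id": numeric_id, "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return f"{EFETCH_URL}?{urlencode(params)}"


def europe_pmc_fulltext_url(pmcid: str) -> str:
    """Build the Europe PMC full-text XML URL for one PMC article."""
    pmcid = pmcid.upper()
    if not pmcid.startswith("PMC"):
        pmcid = f"PMC{pmcid}"
    return f"{EUROPE_PMC_REST}/{pmcid}/fullTextXML"
```

Both routes only return content for articles in the open access subset, so callers should expect a not-found response for everything else and fall back to the abstract.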

Issue 3: NO CITATION GRAPH

OpenAlex has rich citation data but we only use cited_by_count for sorting.

Untapped Capabilities:

  • cites filter (or cited_by_api_url): Find papers that cite a key paper
  • referenced_works: Find sources a paper cites
  • related_works: Algorithmically computed similar papers

Use Case: User asks about "testosterone therapy for HSDD". We find a seminal 2019 RCT. We could automatically find:

  • Papers that cite it (newer evidence)
  • Papers it cites (foundational research)
  • Related papers (similar topics)
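The three expansions in this use case map onto OpenAlex as follows: citing papers come from a `works` query with the `cites` filter, while cited and related papers are already listed on the work record as `referenced_works` and `related_works`. The sketch below shows that mapping. The helper names are hypothetical, and it builds URLs and ID lists only, without making requests.

```python
from typing import Any, Optional
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def short_id(work_id: str) -> str:
    """Turn 'https://openalex.org/W123' into 'W123'."""
    return work_id.rstrip("/").rsplit("/", 1)[-1]


def citing_works_url(work_id: str, mailto: Optional[str] = None, per_page: int = 25) -> str:
    """Build a query for works that cite the given work, most cited first."""
    params = {
        "filter": f"cites:{short_id(work_id)}",
        "sort": "cited_by_count:desc",
        "per-page": str(per_page),
    }
    if mailto:
        # Polite-pool identification, as the existing tool already does.
        params["mailto"] = mailto
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def expansion_ids(work: dict[str, Any], limit: int = 10) -> dict[str, list[str]]:
    """Collect IDs of referenced and related works from one work record."""
    return {
        "references": [short_id(w) for w in work.get("referenced_works", [])[:limit]],
        "related": [short_id(w) for w in work.get("related_works", [])[:limit]],
    }
```

Capping each expansion with `limit` keeps one hop of traversal from multiplying the result set, which matters given the token cost noted under Issue 1.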

What's NOT Possible (API Constraints)

| Feature | Why Not Possible |
| --- | --- |
| bioRxiv direct search | No keyword search API; the public API and RSS feeds only list content by date or DOI |
| arXiv search | API exists but irrelevant for sexual health |
| PubMed full text | Requires publisher access or PMC |
| Real-time trial results | ClinicalTrials.gov results are static snapshots |
| Drug mechanism data | Not in any of the current APIs - would need ChEMBL or DrugBank |

Recommended Improvements (Priority Order)

Phase 1: Fix Fundamentals (High ROI)

  1. Deduplication - Stop returning the same paper 3 times
  2. Outcome measures in ClinicalTrials - Get actual efficacy data
  3. Citation counts from all sources - Rank by influence, not recency

Phase 2: Depth Improvements (Medium ROI)

  1. PMC full-text retrieval - Get full papers for OA articles
  2. Citation graph traversal - Find seminal papers automatically
  3. Publication type filtering - Prioritize RCTs and meta-analyses

Phase 3: Quality Improvements (Lower ROI, Nice-to-Have)

  1. MeSH term expansion - Better PubMed queries
  2. Related works expansion - Use OpenAlex ML similarity
  3. Date range filtering - Historical vs recent research

Neo4j Integration (Future Consideration)

Question: Should we add Neo4j for citation graph storage?

Answer: Not yet. Here's why:

| Approach | Complexity | Value |
| --- | --- | --- |
| OpenAlex API for citation traversal | Low | High |
| Neo4j for local citation graph | High | Medium (unless doing graph analytics) |
| Cron job to sync OpenAlex → Neo4j | Medium | Only if we need offline access |

Recommendation: Use OpenAlex API for citation traversal first. Only add Neo4j if:

  1. We need to do complex graph queries (PageRank on citations, community detection)
  2. We need offline access to citation data
  3. We're hitting OpenAlex rate limits

Summary: What's Broken vs What's Working

Working Well

  • Basic search across all 4 sources
  • Rate limiting and retry logic
  • Query preprocessing
  • Evidence model with citations

Needs Fixing (Current Scope)

  • Deduplication (critical)
  • Outcome measures in ClinicalTrials (critical)
  • Citation-based ranking (important)

Future Enhancements (Out of Current Scope)

  • Full-text retrieval
  • Citation graph traversal
  • Neo4j integration
  • Drug mechanism data (would need new data sources)
