Critical Analysis: Search Tools - Limitations, Gaps, and Improvements

Date: November 2025

Purpose: Honest assessment of all search tools to identify what's working, what's broken, and what needs improvement WITHOUT horizontal sprawl.

Executive Summary

DeepBoner currently has 4 search tools:

PubMed (NCBI E-utilities)
ClinicalTrials.gov (API v2)
Europe PMC (includes preprints)
OpenAlex (citation-aware)

Overall Assessment: Tools are functional but have significant gaps in:

Deduplication (PubMed ∩ Europe PMC ∩ OpenAlex = massive overlap)
Full-text retrieval (only abstracts currently)
Citation graph traversal (OpenAlex has data but we don't use it)
Query optimization (basic synonym expansion, no MeSH term mapping)

Tool 1: PubMed (NCBI E-utilities)

File: src/tools/pubmed.py

What It Does Well

| Feature | Status | Notes |
| --- | --- | --- |
| Rate limiting | ✅ | Shared limiter, respects 3/sec (no key) or 10/sec (with key) |
| Retry logic | ✅ | tenacity with exponential backoff |
| Query preprocessing | ✅ | Strips question words, expands synonyms |
| Abstract parsing | ✅ | Handles XML edge cases (dict vs list) |

Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
| --- | --- | --- |
| 10,000 result cap per query | Medium | Yes - use date ranges to paginate |
| Abstracts only (no full text) | High | No - full text requires PMC or publisher |
| No citation counts | Medium | Yes - cross-reference with OpenAlex |
| Rate limit (10/sec max) | Low | Already handled |
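The date-range workaround for the 10,000 result cap can be sketched as below. This is a minimal illustration, not code from `src/tools/pubmed.py`: the helper name `year_windows` and the 5-year default are assumptions. `datetype`, `mindate`, and `maxdate` are standard ESearch parameters.

```python
from collections.abc import Iterator


def year_windows(first_year: int, last_year: int, step: int = 5) -> Iterator[dict[str, str]]:
    """Yield ESearch date parameters covering first_year..last_year in step-year slices.

    Each dict can be merged into the ESearch request parameters so that
    every slice stays under the 10,000 result cap (hypothetical helper).
    """
    for start in range(first_year, last_year + 1, step):
        end = min(start + step - 1, last_year)
        # datetype=pdat restricts the range to publication date
        yield {"datetype": "pdat", "mindate": str(start), "maxdate": str(end)}


windows = list(year_windows(2000, 2012, step=5))
for w in windows:
    print(w["mindate"], "-", w["maxdate"])
```

If a single slice still exceeds the cap, the same function can be called again on that slice with a smaller `step`.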

Current Implementation Gaps

# GAP 1: No MeSH term expansion
# Current: expand_synonyms() uses hardcoded dict
# Better: Use NCBI's E-utilities to get MeSH terms for query

# GAP 2: No date filtering
# Current: Gets whatever PubMed returns (biased toward recent)
# Better: Add date range parameter for historical research

# GAP 3: No publication type filtering
# Current: Returns all types (reviews, case reports, RCTs)
# Better: Filter for RCTs and systematic reviews when appropriate

Priority Improvements

HIGH: Add publication type filter (Reviews, RCTs, Meta-analyses)
MEDIUM: Add date range parameter
LOW: MeSH term expansion via E-utilities
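The first two improvements could be handled at the query-string level. The sketch below assumes a hypothetical helper `build_pubmed_term`; it relies only on PubMed's `[pt]` (publication type) and `[dp]` (date of publication) field tags. The fallback bounds `1800` and `3000` for open-ended ranges are assumptions chosen for illustration.

```python
from collections.abc import Sequence
from typing import Optional


def build_pubmed_term(
    query: str,
    pub_types: Sequence[str] = (),
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,
) -> str:
    """Combine a free-text query with publication type and date filters."""
    parts = [f"({query})"]
    if pub_types:
        # Any of the listed publication types is acceptable
        pt_clause = " OR ".join(f'"{pt}"[pt]' for pt in pub_types)
        parts.append(f"({pt_clause})")
    if min_year is not None or max_year is not None:
        # Placeholder bounds for open-ended ranges (assumption)
        lo = str(min_year) if min_year is not None else "1800"
        hi = str(max_year) if max_year is not None else "3000"
        parts.append(f"{lo}:{hi}[dp]")
    return " AND ".join(parts)


term = build_pubmed_term(
    "testosterone HSDD",
    pub_types=["Randomized Controlled Trial", "Meta-Analysis"],
    min_year=2015,
)
print(term)
```

Keeping the filters in the term string means the existing ESearch call would not need new request parameters.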

Tool 2: ClinicalTrials.gov

File: src/tools/clinicaltrials.py

What It Does Well

| Feature | Status | Notes |
| --- | --- | --- |
| API v2 usage | ✅ | Modern API, not deprecated v1 |
| Interventional filter | ✅ | Only gets drug/treatment studies |
| Status filter | ✅ | COMPLETED, ACTIVE, RECRUITING |
| httpx → requests workaround | ✅ | Bypasses WAF TLS fingerprint block |

Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
| --- | --- | --- |
| No results data | High | Yes - available via different endpoint |
| No outcome measures | High | Yes - add to FIELDS list |
| No adverse events | Medium | Yes - separate API call |
| Sparse drug mechanism data | Medium | No - not in API |

Current Implementation Gaps

# GAP 1: Missing critical fields
FIELDS: ClassVar[list[str]] = [
    "NCTId",
    "BriefTitle",
    "Phase",
    "OverallStatus",
    "Condition",
    "InterventionName",
    "StartDate",
    "BriefSummary",
    # MISSING:
    # "PrimaryOutcome",
    # "SecondaryOutcome",
    # "ResultsFirstSubmitDate",
    # "StudyResults",  # Whether results are posted
]

# GAP 2: No results retrieval
# Many completed trials have posted results
# We could get actual efficacy data, not just trial existence

# GAP 3: No linked publications
# Trials often link to PubMed articles with results
# We could follow these links for richer evidence

Priority Improvements

HIGH: Add outcome measures to FIELDS
HIGH: Check for and retrieve posted results
MEDIUM: Follow linked publications (NCT → PMID)
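A rough sketch of what extracting these fields might look like once a full study record is available. It assumes the API v2 JSON layout (`protocolSection.outcomesModule`, `protocolSection.referencesModule`, top-level `hasResults`). The function name `summarize_study` and the sample record, including its placeholder NCT ID and PMID, are hypothetical and not taken from `src/tools/clinicaltrials.py`.

```python
from typing import Any


def summarize_study(study: dict[str, Any]) -> dict[str, Any]:
    """Pull outcome measures, results flag, and linked PMIDs from one study record."""
    protocol = study.get("protocolSection", {})
    outcomes = protocol.get("outcomesModule", {})
    references = protocol.get("referencesModule", {}).get("references", [])
    return {
        "nct_id": protocol.get("identificationModule", {}).get("nctId"),
        # True when the sponsor has posted results for the trial
        "has_results": bool(study.get("hasResults", False)),
        "primary_outcomes": [o.get("measure", "") for o in outcomes.get("primaryOutcomes", [])],
        # NCT -> PMID links; references without a PMID are skipped
        "pmids": [r["pmid"] for r in references if r.get("pmid")],
    }


# Placeholder record for illustration only (not real trial data)
sample_study = {
    "hasResults": True,
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000"},
        "outcomesModule": {"primaryOutcomes": [{"measure": "Change in symptom score"}]},
        "referencesModule": {"references": [{"pmid": "00000000"}, {"citation": "No PMID"}]},
    },
}
summary = summarize_study(sample_study)
print(summary)
```

Every lookup uses `.get()` with a default because many records omit the outcomes or references modules entirely.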

Tool 3: Europe PMC

File: src/tools/europepmc.py

What It Does Well

| Feature | Status | Notes |
| --- | --- | --- |
| Preprint coverage | ✅ | bioRxiv, medRxiv, ChemRxiv indexed |
| Preprint labeling | ✅ | `[PREPRINT - Not peer-reviewed]` marker |
| DOI/PMID fallback URLs | ✅ | Smart URL construction |
| Relevance scoring | ✅ | Preprints weighted lower (0.75 vs 0.9) |

Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
| --- | --- | --- |
| No full text for most articles | High | Partial - CC-licensed available after 14 days |
| Citation data limited | Medium | Only journal articles, not preprints |
| Preprint-publication linking gaps | Medium | ~50% of links missing per Crossref |
| License info sometimes missing | Low | Manual review required |

Current Implementation Gaps

# GAP 1: No full-text retrieval
# Europe PMC has full text for many CC-licensed articles
# Could retrieve full text XML via separate endpoint

# GAP 2: Massive overlap with PubMed
# Europe PMC indexes all of PubMed/MEDLINE
# We're getting duplicates with no deduplication

# GAP 3: No citation network
# Europe PMC has "citedByCount" but we don't use it
# Could prioritize highly-cited preprints

Priority Improvements

HIGH: Add deduplication with PubMed (by PMID)
MEDIUM: Retrieve citation counts for ranking
LOW: Full-text retrieval for CC-licensed articles
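The citation ranking and full-text items could start from the fields Europe PMC already returns in search results. The sketch below assumes each result dict carries `citedByCount`, `pmcid`, and `isOpenAccess` (reported as `"Y"` or `"N"`). The function names and sample records are hypothetical.

```python
from typing import Any, Optional

EPMC_REST = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def full_text_xml_url(result: dict[str, Any]) -> Optional[str]:
    """Return the full-text XML URL for an open access result, else None."""
    pmcid = result.get("pmcid")
    if not pmcid or result.get("isOpenAccess") != "Y":
        return None
    return f"{EPMC_REST}/{pmcid}/fullTextXML"


def rank_by_citations(results: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Sort results so the most-cited come first; missing counts are treated as 0."""
    return sorted(results, key=lambda r: int(r.get("citedByCount", 0)), reverse=True)


# Placeholder records for illustration only
sample_results = [
    {"id": "A", "pmcid": "PMC0000001", "isOpenAccess": "Y", "citedByCount": 3},
    {"id": "B", "isOpenAccess": "N", "citedByCount": 40},
]
ranked = rank_by_citations(sample_results)
print([r["id"] for r in ranked])
print(full_text_xml_url(sample_results[0]))
```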

Tool 4: OpenAlex

File: src/tools/openalex.py

What It Does Well

| Feature | Status | Notes |
| --- | --- | --- |
| Citation counts | ✅ | Sorted by `cited_by_count:desc` |
| Abstract reconstruction | ✅ | Handles inverted index format |
| Concept extraction | ✅ | Hierarchical classification |
| Open access detection | ✅ | `is_oa` and `pdf_url` |
| Polite pool | ✅ | `mailto` parameter set; 100k requests/day limit |
| Rich metadata | ✅ | Best metadata of all tools |

Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
| --- | --- | --- |
| Author truncation at 100 | Low | Only affects mega-author papers |
| No full text | High | No - OpenAlex is metadata only |
| Stale data (1-2 day lag) | Low | Acceptable for research |
Current Implementation Gaps

# GAP 1: No citation graph traversal
# OpenAlex exposes `referenced_works` on each work, plus `cites:` and
# `cited_by:` filters on the works endpoint
# We could find seminal papers by following citation chains

# GAP 2: No related works
# OpenAlex has an algorithmically computed "related_works" field
# Could expand search to similar papers

# GAP 3: No concept filtering
# OpenAlex has hierarchical concepts
# Could filter for specific domains (e.g., "Sexual health" concept)

# GAP 4: Overlap with PubMed
# OpenAlex indexes most of PubMed
# More duplicates without deduplication

Priority Improvements

HIGH: Add citation graph traversal (find seminal papers)
HIGH: Add deduplication with PubMed/Europe PMC
MEDIUM: Use related_works for query expansion
LOW: Concept-based filtering

Cross-Tool Issues

Issue 1: MASSIVE DUPLICATION

PubMed: 36M+ articles
Europe PMC: Indexes ALL of PubMed + preprints
OpenAlex: 250M+ works (includes PubMed)

Current behavior: All 3 can return the same papers for a single query
Result: Duplicate evidence, wasted tokens, inflated counts

Solution: Deduplication by PMID/DOI

# Proposed: Add to SearchHandler
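def deduplicate_evidence(evidence_list: list[Evidence]) -> list[Evidence]:
    """Keep the first occurrence of each paper, keyed on its PMID or DOI."""
    seen_ids: set[str] = set()
    unique: list[Evidence] = []
    for e in evidence_list:
        # extract_paper_id: helper expected to return a PMID/DOI string, or None
        paper_id = extract_paper_id(e.citation.url)
        if paper_id is None:
            # No recognizable identifier: keep the item instead of
            # collapsing every unidentified result into a single entry
            unique.append(e)
        elif paper_id not in seen_ids:
            seen_ids.add(paper_id)
            unique.append(e)
    return unique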
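The proposal depends on an `extract_paper_id` helper that is not shown. One possible sketch follows, assuming citation URLs take the form `https://pubmed.ncbi.nlm.nih.gov/<pmid>/` or `https://doi.org/<doi>`. The regular expressions and the `pmid:`/`doi:` prefixes are assumptions, not the project's actual implementation.

```python
import re
from typing import Optional

# URL shapes assumed for illustration; other hosts return None
_PMID_RE = re.compile(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)")
_DOI_RE = re.compile(r"doi\.org/(10\.\d{4,9}/\S+)", re.IGNORECASE)


def extract_paper_id(url: str) -> Optional[str]:
    """Return a normalized 'pmid:...' or 'doi:...' key, or None if neither is found."""
    match = _PMID_RE.search(url)
    if match:
        return f"pmid:{match.group(1)}"
    match = _DOI_RE.search(url)
    if match:
        # DOIs are case-insensitive, so lowercase them before comparing
        return f"doi:{match.group(1).lower().rstrip('/')}"
    return None


print(extract_paper_id("https://pubmed.ncbi.nlm.nih.gov/12345678/"))
print(extract_paper_id("https://doi.org/10.1000/ABC.123"))
print(extract_paper_id("https://example.org/article/42"))
```

One caveat with URL-only keys: the same paper can arrive with a PMID URL from PubMed and a DOI URL from OpenAlex, and those two keys will not match. Catching that case would need both identifiers stored on each evidence item, or a PMID to DOI lookup.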

Issue 2: NO FULL-TEXT RETRIEVAL

All tools return abstracts only. For deep research, this is limiting.

What's Actually Possible:

| Source | Full Text Access | How |
| --- | --- | --- |
| PubMed Central (PMC) | Yes, for OA articles | Separate API: `efetch` with `db=pmc` |
| Europe PMC | Yes, CC-licensed after 14 days | `/{id}/fullTextXML` endpoint |
| OpenAlex | No | Metadata only |
| Unpaywall | Yes, OA link discovery | Separate API |

Recommendation: Add PMC full-text retrieval for open access articles.
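As a starting point, the PMC request could be built as in this sketch. It assumes the standard E-utilities `efetch.fcgi` endpoint with `db=pmc` and passes the numeric part of the PMCID as the `id`. The helper name `pmc_efetch_url` and the sample ID are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def pmc_efetch_url(pmcid: str, api_key: Optional[str] = None) -> str:
    """Build an EFetch URL that requests the full-text XML of one PMC article."""
    params = {
        "db": "pmc",
        # Send the numeric ID without the "PMC" prefix
        "id": pmcid.upper().removeprefix("PMC"),
        "retmode": "xml",
    }
    if api_key:
        # An API key raises the rate limit from 3/sec to 10/sec
        params["api_key"] = api_key
    return f"{EFETCH}?{urlencode(params)}"


url = pmc_efetch_url("PMC0000001")
print(url)
```

The request would go through the same shared rate limiter as the PubMed tool, since both hit E-utilities.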

Issue 3: NO CITATION GRAPH

OpenAlex has rich citation data but we only use cited_by_count for sorting.

Untapped Capabilities:

cites filter (or a work's cited_by_api_url): Find papers that cite a key paper
referenced_works (or the cited_by filter): Find sources a paper cites
related_works: Algorithmically computed similar papers

Use Case: User asks about "testosterone therapy for HSDD". We find a seminal 2019 RCT. We could automatically find:

Papers that cite it (newer evidence)
Papers it cites (foundational research)
Related papers (similar topics)
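The first two lookups map onto OpenAlex works filters. This sketch only builds the request URLs and assumes the `cites:` filter (works citing the given work) and the `cited_by:` filter (works the given work cites). The function name `citation_query_urls`, the work ID, and the email address are placeholders.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def citation_query_urls(work_id: str, mailto: str, per_page: int = 25) -> dict[str, str]:
    """Return URLs for the papers citing work_id and the papers work_id cites."""

    def build(filter_expr: str) -> str:
        params = {
            "filter": filter_expr,
            "sort": "cited_by_count:desc",  # most influential first
            "per-page": per_page,
            "mailto": mailto,  # polite pool
        }
        return f"{OPENALEX_WORKS}?{urlencode(params)}"

    return {
        "citing": build(f"cites:{work_id}"),       # newer evidence
        "cited": build(f"cited_by:{work_id}"),     # foundational research
    }


# Placeholder work ID and address for illustration only
urls = citation_query_urls("W0000000000", mailto="team@example.org")
print(urls["citing"])
print(urls["cited"])
```

Related papers need no extra query: the `related_works` list on the seed work already holds their OpenAlex IDs.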

What's NOT Possible (API Constraints)

| Feature | Why Not Possible |
| --- | --- |
| bioRxiv direct search | No keyword search API; content can only be listed by date range or DOI |
| arXiv search | API exists but irrelevant for sexual health |
| PubMed full text | Requires publisher access or PMC |
| Real-time trial results | ClinicalTrials.gov results are static snapshots |
| Drug mechanism data | Not in any of the four current APIs - would need ChEMBL or DrugBank |

Recommended Improvements (Priority Order)

Phase 1: Fix Fundamentals (High ROI)

Deduplication - Stop returning the same paper 3 times
Outcome measures in ClinicalTrials - Get actual efficacy data
Citation counts from all sources - Rank by influence, not recency

Phase 2: Depth Improvements (Medium ROI)

PMC full-text retrieval - Get full papers for OA articles
Citation graph traversal - Find seminal papers automatically
Publication type filtering - Prioritize RCTs and meta-analyses

Phase 3: Quality Improvements (Lower ROI, Nice-to-Have)

MeSH term expansion - Better PubMed queries
Related works expansion - Use OpenAlex related_works similarity
Date range filtering - Historical vs recent research

Neo4j Integration (Future Consideration)

Question: Should we add Neo4j for citation graph storage?

Answer: Not yet. Here's why:

| Approach | Complexity | Value |
| --- | --- | --- |
| OpenAlex API for citation traversal | Low | High |
| Neo4j for local citation graph | High | Medium (unless doing graph analytics) |
| Cron job to sync OpenAlex → Neo4j | Medium | Only if we need offline access |

Recommendation: Use OpenAlex API for citation traversal first. Only add Neo4j if:

We need to do complex graph queries (PageRank on citations, community detection)
We need offline access to citation data
We're hitting OpenAlex rate limits
