Scraper Architecture
Coroutine-based concurrent fetchers for scholarly sources, compiled to WASM.
Sources
- PubMed (ESearch + EFetch; two-step flow sketched below)
- DOAJ (JSON API)
- SpringerOpen (HTML parsing)
- IMEJ (HTML parsing)
- RSS (generic feeds)
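
PubMed is the only two-step source here: ESearch resolves a query to PMIDs, then EFetch pulls the full records. A minimal sketch of that flow, assuming a Ktor HttpClient; the crude regex PMID extraction stands in for real JSON parsing of the ESearch response:

```kotlin
import io.ktor.client.HttpClient
import io.ktor.client.request.get
import io.ktor.client.statement.bodyAsText

// ESearch resolves the query to PMIDs; EFetch pulls the full records.
// `term` is assumed to be pre-sanitized and URL-encoded (see Pipeline).
suspend fun fetchPubMed(client: HttpClient, term: String): String {
    val base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
    val searchJson = client
        .get("$base/esearch.fcgi?db=pubmed&term=$term&retmode=json")
        .bodyAsText()
    // Crude: grabs every quoted integer in the ESearch response.
    // A real parser over esearchresult.idlist is assumed upstream.
    val pmids = Regex("\"(\\d+)\"").findAll(searchJson).map { it.groupValues[1] }
    return client
        .get("$base/efetch.fcgi?db=pubmed&id=${pmids.joinToString(",")}&retmode=xml")
        .bodyAsText()
}
```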
Pipeline
- Build query (keywords, sanitization)
- Dispatch concurrent requests (coroutineScope + async, one coroutine per source; see the sketch after this list)
- Parse & normalize (authors, abstract, link)
- Create article object
- Return unified list
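
A minimal sketch of the dispatch-and-merge step, assuming one suspend fetcher per source; the Article shape here mirrors the fields under "ocean" in the example output below:

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

data class Article(
    val title: String,
    val abstract: String,
    val author: String,
    val source: String,
    val url: String,
)

// One child coroutine per source; coroutineScope waits for all of them,
// then the per-source lists merge into one unified list.
suspend fun fetchAll(
    query: String,
    fetchers: List<suspend (String) -> List<Article>>,
): List<Article> = coroutineScope {
    fetchers
        .map { fetch -> async { fetch(query) } }
        .awaitAll()
        .flatten()
}
```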
Normalization
- formatAuthors(): unify name list
- stripHTML(): remove tags from abstracts
- normalizeLink(): canonical URL (all three sketched below)
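
Plausible one-liner shapes for the three helpers; the names come from this doc, but the bodies (and the choice of what "canonical" means for a URL) are assumptions:

```kotlin
// Join trimmed, non-empty names with ", " -> "A. Smith, B. Johnson".
fun formatAuthors(names: List<String>): String =
    names.map { it.trim() }.filter { it.isNotEmpty() }.joinToString(", ")

// Remove anything tag-shaped; fine for abstracts, not a general HTML parser.
fun stripHTML(text: String): String =
    text.replace(Regex("<[^>]*>"), "").trim()

// One possible canonical form: strip whitespace, query string, and fragment.
fun normalizeLink(url: String): String =
    url.trim().substringBefore("?").substringBefore("#")
```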
Example Output
```json
{
  "id": "000000000000002345678901",
  "title": "Circadian Effects on Immunity",
  "description": "Research examining the relationship between circadian rhythms and immune function",
  "tags": ["immunology", "circadian", "health"],
  "ocean": {
    "title": "Circadian Regulation of Immune Function",
    "abstract": "Full abstract text from source...",
    "author": "A. Smith, B. Johnson",
    "source": "PubMed",
    "url": "https://pubmed.ncbi.nlm.nih.gov/12345678/"
  },
  "created_at": "2025-12-07T10:00:00Z"
}
```

Note: Scrapers return raw data, which is then transformed and posted to Mantle2 with proper 24-digit ID assignment.
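
A sketch of that transform, reusing the Article type from the pipeline sketch above. The zero-padded 24-digit ID and the truncated description are assumptions read off the example, not a documented scheme:

```kotlin
import kotlinx.datetime.Clock

data class MantleDocument(
    val id: String,
    val title: String,
    val description: String,
    val tags: List<String>,
    val ocean: Article,
    val createdAt: String,
)

fun toMantleDocument(article: Article, seq: Long, tags: List<String>): MantleDocument =
    MantleDocument(
        id = seq.toString().padStart(24, '0'),    // 24-digit ID (assumed zero-padded scheme)
        title = article.title,
        description = article.abstract.take(160), // short teaser (assumed derivation)
        tags = tags,
        ocean = article,
        createdAt = Clock.System.now().toString(), // ISO-8601 UTC timestamp
    )
```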
Error Handling
- Individual source failures are tolerated; the other sources still return results
- Timeout -> source excluded from the merged list (soft fail; see the sketch below)
- Malformed HTML is skipped with a log marker
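
A minimal soft-fail wrapper under assumed numbers (the per-source timeout budget and log format are not specified in this doc):

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.TimeoutCancellationException
import kotlinx.coroutines.withTimeout

suspend fun <T> fetchOrEmpty(source: String, block: suspend () -> List<T>): List<T> =
    try {
        withTimeout(10_000) { block() }                          // assumed 10s budget
    } catch (e: TimeoutCancellationException) {
        println("[scraper][$source] timeout, source excluded")   // soft fail
        emptyList()
    } catch (e: CancellationException) {
        throw e                                                  // never swallow real cancellation
    } catch (e: Exception) {
        println("[scraper][$source] skipped: ${e.message}")      // e.g. malformed HTML
        emptyList()
    }
```

Wrapping each per-source fetch in something like fetchOrEmpty keeps one bad source from cancelling its siblings inside coroutineScope, which is what lets the other sources still return.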
Performance
- Parallel fetch per source; merge results after awaitAll
- Chunk large result sets before reranking
- Reuse HTTP client instances (both sketched below)
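
Two of these points in one sketch: a single shared client created once at module scope, and a chunked rerank where the batch size and the caller-supplied reranker are assumptions:

```kotlin
import io.ktor.client.HttpClient

// Created once and reused: avoids re-paying engine and connection-pool setup per fetch.
val sharedClient: HttpClient = HttpClient()

// Rerank in bounded batches to cap memory and latency per rerank call.
fun rerankInChunks(
    articles: List<Article>,
    chunkSize: Int = 50,                       // assumed batch size
    rerank: (List<Article>) -> List<Article>,  // caller-supplied reranker
): List<Article> =
    articles.chunked(chunkSize).flatMap(rerank)
```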