Scraper Architecture

Coroutine-based concurrent fetchers for scholarly sources, compiled to WASM.

Sources

  • PubMed (ESearch + EFetch)
  • DOAJ (JSON API)
  • SpringerOpen (HTML parsing)
  • IMEJ (HTML parsing)
  • RSS (generic feeds)
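
Each scraper can sit behind a common contract so the pipeline treats all five sources uniformly. A minimal sketch, assuming Kotlin; the `Source` interface and `RawArticle` type are illustrative names, not confirmed from the codebase:

```kotlin
// Hypothetical common contract for all scrapers; names are illustrative.
interface Source {
    val name: String                                  // e.g. "PubMed", "DOAJ"
    suspend fun fetch(query: String): List<RawArticle>
}

// Raw, un-normalized record as parsed from a single source.
data class RawArticle(
    val title: String,
    val abstract: String,
    val authors: List<String>,
    val link: String,
    val source: String,
)
```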

Pipeline

  1. Build query (keywords, sanitization)
  2. Dispatch concurrent requests (coroutineScope + launch; see the sketch after this list)
  3. Parse & normalize (authors, abstract, link)
  4. Create article object
  5. Return unified list
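
A sketch of steps 2 and 5, using the illustrative `Source` interface above. The pipeline mentions launch, but async is shown here so per-source results can be merged with awaitAll, as the Performance section notes:

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

// Fan out one request per source, then merge into a single unified list.
suspend fun fetchAll(sources: List<Source>, query: String): List<RawArticle> =
    coroutineScope {
        sources
            .map { source -> async { source.fetch(query) } } // dispatch concurrently
            .awaitAll()                                      // merge after awaitAll
            .flatten()
    }
```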

Normalization

  • formatAuthors(): unify name list
  • stripHTML(): remove tags from abstracts
  • normalizeLink(): canonical URL
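
Plausible implementations of the three helpers; the signatures and exact rules are assumptions, not the actual code:

```kotlin
// Join a parsed author list into one display string, e.g. "A. Smith, B. Johnson".
fun formatAuthors(authors: List<String>): String =
    authors.map { it.trim() }.filter { it.isNotEmpty() }.joinToString(", ")

// Strip HTML tags from abstracts with a conservative regex; a real parser
// would be safer for nested markup, but tags in abstracts are usually flat.
fun stripHTML(text: String): String =
    text.replace(Regex("<[^>]*>"), "").trim()

// Reduce a URL to a canonical form: prefer https, drop fragments and
// trailing slashes. (Illustrative; the real canonicalization may differ.)
fun normalizeLink(url: String): String =
    url.trim()
        .replaceFirst(Regex("^http://"), "https://")
        .substringBefore('#')
        .trimEnd('/')
```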

Example Output

```json
{
  "id": "000000000000002345678901",
  "title": "Circadian Effects on Immunity",
  "description": "Research examining the relationship between circadian rhythms and immune function",
  "tags": ["immunology", "circadian", "health"],
  "ocean": {
    "title": "Circadian Regulation of Immune Function",
    "abstract": "Full abstract text from source...",
    "author": "A. Smith, B. Johnson",
    "source": "PubMed",
    "url": "https://pubmed.ncbi.nlm.nih.gov/12345678/"
  },
  "created_at": "2025-12-07T10:00:00Z"
}
```

Note: Scrapers return raw data, which is then transformed and posted to Mantle2, where each article is assigned its 24-digit ID.
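
For illustration, the unified shape could be modeled with data classes mirroring the example above; any names beyond the JSON fields are assumptions:

```kotlin
// Normalized per-source payload, mirroring the "ocean" object above.
data class OceanRecord(
    val title: String,
    val abstract: String,
    val author: String,   // already joined via formatAuthors()
    val source: String,   // e.g. "PubMed"
    val url: String,      // already passed through normalizeLink()
)

// Top-level article as posted to Mantle2; the 24-digit id is assigned
// downstream, not by the scrapers themselves.
data class Article(
    val id: String,
    val title: String,
    val description: String,
    val tags: List<String>,
    val ocean: OceanRecord,
    val createdAt: String, // ISO-8601, e.g. "2025-12-07T10:00:00Z"
)
```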

Error Handling

  • Individual source failure tolerated; others still return
  • Timeout -> exclude source (soft fail)
  • Malformed HTML skipped with log marker
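
A sketch of this soft-fail policy with kotlinx.coroutines: a timed-out or failing source contributes an empty list instead of aborting the merge. The timeout value is illustrative:

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.withTimeoutOrNull

// Wrap one source fetch so failures degrade to an empty result and the
// remaining sources still return.
suspend fun fetchSafely(source: Source, query: String): List<RawArticle> =
    try {
        withTimeoutOrNull(10_000L) { source.fetch(query) }
            ?: emptyList<RawArticle>().also {
                println("WARN: ${source.name} timed out; source excluded") // soft fail
            }
    } catch (e: CancellationException) {
        throw e // never swallow cooperative cancellation
    } catch (e: Exception) {
        println("WARN: ${source.name} failed: ${e.message}") // log marker
        emptyList()
    }
```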

Performance

  • Parallel fetch; merge after awaitAll
  • Chunk large result sets before rerank
  • Reuse HTTP client instances
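
A sketch of the last two points. Ktor is an assumption here (any HTTP client that supports instance reuse works), and the rerank step is passed in as a placeholder:

```kotlin
import io.ktor.client.HttpClient

// One shared client per process: connection pooling amortizes TLS and
// socket setup across all scrapers instead of paying it per request.
// (Assumes a Ktor engine is available on the target platform.)
val sharedClient: HttpClient = HttpClient()

// Rerank in fixed-size chunks so large merged result sets don't blow up
// memory or per-call latency. Chunk size is illustrative; the rerank
// function itself is supplied by the caller.
fun rerankChunked(
    articles: List<RawArticle>,
    chunkSize: Int = 100,
    rerank: (List<RawArticle>) -> List<RawArticle>,
): List<RawArticle> =
    articles.chunked(chunkSize).flatMap(rerank)
```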