Scraper Architecture

Coroutine-based concurrent fetchers for scholarly sources, compiled to WASM.

Sources

  • PubMed (ESearch + EFetch)
  • DOAJ (JSON API)
  • SpringerOpen (HTML parsing)
  • IMEJ (HTML parsing)
  • RSS (generic feeds)
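
Each scraper can sit behind a common contract so the pipeline treats all five sources uniformly. A minimal sketch, assuming Kotlin; the `Source` interface and `RawArticle` type are illustrative names, not confirmed from the codebase:

```kotlin
// Hypothetical common contract for all scrapers; names are illustrative.
interface Source {
    val name: String                                  // e.g. "PubMed", "DOAJ"
    suspend fun fetch(query: String): List<RawArticle>
}

// Raw, un-normalized record as parsed from a single source.
data class RawArticle(
    val title: String,
    val abstract: String,
    val authors: List<String>,
    val link: String,
    val source: String,
)
```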

Pipeline

  1. Build query (keywords, sanitization)
  2. Dispatch concurrent requests (coroutineScope + launch; see the sketch after this list)
  3. Parse & normalize (authors, abstract, link)
  4. Create article object
  5. Return unified list
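
A sketch of steps 2 and 5, using the illustrative `Source` interface above. The pipeline mentions launch, but async is shown here so per-source results can be merged with awaitAll, as the Performance section notes:

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

// Fan out one request per source, then merge into a single unified list.
suspend fun fetchAll(sources: List<Source>, query: String): List<RawArticle> =
    coroutineScope {
        sources
            .map { source -> async { source.fetch(query) } } // dispatch concurrently
            .awaitAll()                                      // merge after awaitAll
            .flatten()
    }
```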

Normalization

  • formatAuthors(): unify name list
  • stripHTML(): remove tags from abstracts
  • normalizeLink(): canonical URL
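
Plausible implementations of the three helpers; the signatures and exact rules are assumptions, not the actual code:

```kotlin
// Join a parsed author list into one display string, e.g. "A. Smith, B. Johnson".
fun formatAuthors(authors: List<String>): String =
    authors.map { it.trim() }.filter { it.isNotEmpty() }.joinToString(", ")

// Strip HTML tags from abstracts with a conservative regex; a real parser
// would be safer for nested markup, but tags in abstracts are usually flat.
fun stripHTML(text: String): String =
    text.replace(Regex("<[^>]*>"), "").trim()

// Reduce a URL to a canonical form: prefer https, drop fragments and
// trailing slashes. (Illustrative; the real canonicalization may differ.)
fun normalizeLink(url: String): String =
    url.trim()
        .replaceFirst(Regex("^http://"), "https://")
        .substringBefore('#')
        .trimEnd('/')
```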

Example Output

```json
{
  "id": "000000000000002345678901",
  "title": "Circadian Effects on Immunity",
  "description": "Research examining the relationship between circadian rhythms and immune function",
  "tags": ["immunology", "circadian", "health"],
  "ocean": {
    "title": "Circadian Regulation of Immune Function",
    "abstract": "Full abstract text from source...",
    "author": "A. Smith, B. Johnson",
    "source": "PubMed",
    "url": "https://pubmed.ncbi.nlm.nih.gov/12345678/"
  },
  "created_at": "2025-12-07T10:00:00Z"
}
```

Note: Scrapers return raw data, which is then transformed and posted to Mantle2, where each article is assigned its 24-digit ID.
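
For illustration, the unified shape could be modeled with data classes mirroring the example above; any names beyond the JSON fields are assumptions:

```kotlin
// Normalized per-source payload, mirroring the "ocean" object above.
data class OceanRecord(
    val title: String,
    val abstract: String,
    val author: String,   // already joined via formatAuthors()
    val source: String,   // e.g. "PubMed"
    val url: String,      // already passed through normalizeLink()
)

// Top-level article as posted to Mantle2; the 24-digit id is assigned
// downstream, not by the scrapers themselves.
data class Article(
    val id: String,
    val title: String,
    val description: String,
    val tags: List<String>,
    val ocean: OceanRecord,
    val createdAt: String, // ISO-8601, e.g. "2025-12-07T10:00:00Z"
)
```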

Error Handling

  • Individual source failure tolerated; others still return
  • Timeout -> exclude source (soft fail)
  • Malformed HTML skipped with log marker
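
A sketch of this soft-fail policy with kotlinx.coroutines: a timed-out or failing source contributes an empty list instead of aborting the merge. The timeout value is illustrative:

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.withTimeoutOrNull

// Wrap one source fetch so failures degrade to an empty result and the
// remaining sources still return.
suspend fun fetchSafely(source: Source, query: String): List<RawArticle> =
    try {
        withTimeoutOrNull(10_000L) { source.fetch(query) }
            ?: emptyList<RawArticle>().also {
                println("WARN: ${source.name} timed out; source excluded") // soft fail
            }
    } catch (e: CancellationException) {
        throw e // never swallow cooperative cancellation
    } catch (e: Exception) {
        println("WARN: ${source.name} failed: ${e.message}") // log marker
        emptyList()
    }
```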

Performance

  • Parallel fetch; merge after awaitAll
  • Chunk large result sets before rerank
  • Reuse HTTP client instances
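
A sketch of the last two points. Ktor is an assumption here (any HTTP client that supports instance reuse works), and the rerank step is passed in as a placeholder:

```kotlin
import io.ktor.client.HttpClient

// One shared client per process: connection pooling amortizes TLS and
// socket setup across all scrapers instead of paying it per request.
// (Assumes a Ktor engine is available on the target platform.)
val sharedClient: HttpClient = HttpClient()

// Rerank in fixed-size chunks so large merged result sets don't blow up
// memory or per-call latency. Chunk size is illustrative; the rerank
// function itself is supplied by the caller.
fun rerankChunked(
    articles: List<RawArticle>,
    chunkSize: Int = 100,
    rerank: (List<RawArticle>) -> List<RawArticle>,
): List<RawArticle> =
    articles.chunked(chunkSize).flatMap(rerank)
```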