Originally from Jinhua, Enpu is a software engineer currently based in Seattle. He has spent the last few years building the backend components behind enterprise software and, more recently, sed.i, a tool for his own reading and curation. A graduate of Allegheny College with a double major in Computer Science and Classical Piano, he is currently completing the master's program in Information Systems at the University of Washington.

Previously at ScienceLogic, he worked on a webhook service, metric ingestion and post-processing pipelines, and multi-tenant analytics components for IT observability software.

Outside of work, he managed dv, an independent record store and venue in Suzhou focused on vinyl, natural wine, and experimental music, and co-owned a community café in Pittsburgh whose work led to the establishment of a city-wide crisis support program for front-line retail and food-service workers.

Enpu also plays records for dance parties. He has performed at venues across China and Japan; these days you can catch him behind the decks at Otherworld in Seattle every now and then, and hear his latest mix on Apple Music.

GitHub · LinkedIn · Email · CV
  • sed.i — System Architecture Wiki

    [Figure: sed.i Landing Page]

    1. What the app is

    sed.i is, at its core, a read-it-later queue built around saving, searching, and retrieving reading content from across the internet. A user pastes a URL and the system saves it, extracts the full article text and metadata in the background, generates an embedding for semantic search, auto-tags it, and presents it in a clean reader with features for managing the reading process, as well as highlights and cross-article idea connections. It is also tightly integrated with MCP, so a reader can interact with their reading context directly from an LLM client such as Claude.

    The core user loop:

    1. Save a URL from the browser or extension
    2. System extracts content asynchronously
    3. User reads in-app with highlights and notes
    4. Search across the knowledge base semantically, by keyword, or by filter operators
    5. Recommendations surface what to read next
    6. An LLM agent (via MCP) can query and write to the library

    Secondary features: vinyl record collection (Crates), public profiles, list-based markdown editor, and a hosted MCP server with OAuth so AI agents can access the library.

    Browser / Chrome Extension
      └─> Next.js 14 (Vercel)
            └─> FastAPI (Railway)
                  ├─> PostgreSQL + pgvector   ← main data store + vector search
                  ├─> Redis                   ← Celery broker + embedding cache + OAuth codes
                  └─> Celery Workers (Railway)
                        ├─> URL fetch + metadata (requests + BeautifulSoup)
                        ├─> Article text extraction (trafilatura)
                        ├─> PDF layout extraction (PyMuPDF + YOLO/ONNX)
                        ├─> Embedding generation (OpenAI text-embedding-3-small)
                        ├─> Auto-tagging (pgvector similarity + gpt-4o-mini fallback)
                        ├─> AI summarization (OpenAI)
                        └─> Discogs metadata (vinyl)
    
    LLM Agents (Claude Desktop, etc.)
      └─> MCP stdio (local) or MCP HTTP (hosted, OAuth 2.1 + PKCE)
            └─> FastAPI /mcp-transport
    

    2. Data model

    Content Items

    Column | Purpose | Design note
    user_id | Ownership | Every query filters by this. Cross-user leakage is structurally impossible.
    processing_status | Pipeline state | pending → processing → completed / failed. Frontend shows reduced-opacity cards while extraction runs.
    processing_error | Failure detail | Normalized: 401/403/paywall errors are classified as "source access issues", not parser bugs. Items can be processing_status='completed' but have a processing_error (partial extraction — something was extracted but less than expected).
    full_text | Raw HTML | Stored as HTML, not plain text. The reader renders it; search uses a plain-text derivative.
    search_vector | tsvector | Maintained by a PostgreSQL trigger. Dual-dictionary (english + simple) so both stemmed words and acronyms index correctly.
    embedding | Vector(1536) | Populated by Celery after extraction. Null until then — all search paths degrade gracefully.
    read_position | 0.0–1.0 | Scroll progress. Auto-marks is_read=True when ≥ 0.9.
    deleted_at | Soft delete | Rows are never hard-deleted. Enables undo, audit trails, and safe Celery task execution.
    tags / auto_tags | Two-layer tagging | auto_tags are AI suggestions pending review; tags are user-confirmed.
    submitted_via | Source tracking | 'web', 'extension', 'api', 'email' — useful for analytics and pipeline routing decisions.

    Highlights

    A highlight stores character offsets (start_offset, end_offset) into full_text. It also gets an OpenAI embedding — this is what powers cross-article idea connections.

    Writing - WIP

    One draft per list per user. Stores markdown content, optional title, and word count. Auto-created on first PATCH so the frontend can always use a single endpoint without checking for existence first.


    3. Authentication and sessions

    JWT design

    Login returns a signed JWT (sub=email, exp=now+N_minutes). Every request passes it as Authorization: Bearer <token>. FastAPI's get_current_active_user dependency decodes and verifies the token on every request.

    Why JWT over server sessions? JWTs are stateless — any FastAPI instance can verify a token without shared session storage. With sessions you'd need a shared Redis session store; with JWTs the signing key alone is sufficient. This matters because Railway can run multiple web instances.
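
    A minimal sketch of what that token/dependency pair might look like (PyJWT, OAuth2PasswordBearer, and the HS256 algorithm are assumptions for illustration; the actual helpers in the codebase may differ):

    from datetime import datetime, timedelta, timezone

    import jwt  # PyJWT
    from fastapi import Depends, HTTPException, status
    from fastapi.security import OAuth2PasswordBearer

    SECRET_KEY = "change-me"  # loaded from the environment in practice
    ALGORITHM = "HS256"       # assumed signing algorithm
    oauth2_scheme = OAuth2PasswordBearer(tokenUrl="auth/login")

    def create_access_token(email: str, minutes: int = 30) -> str:
        # sub=email, exp=now+N_minutes, as described above
        payload = {"sub": email, "exp": datetime.now(timezone.utc) + timedelta(minutes=minutes)}
        return jwt.encode(payload, SECRET_KEY, algorithm=ALGORITHM)

    def get_current_active_user(token: str = Depends(oauth2_scheme)) -> str:
        # Decodes and verifies the token on every request; any instance holding SECRET_KEY can do this
        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
        except jwt.PyJWTError:
            raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid token")
        return payload["sub"]  # the real dependency would load and return the User row here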

    Trade-off: JWTs can't be revoked before expiry. If a user changes their password, an old token is still valid until it expires. Acceptable for a personal reading app.

    Token storage: localStorage['token'] in the browser. XSS-accessible — a known gap, documented in §20.

    Email verification and password reset

    Both use a VerificationToken table:

    Field | Notes
    token | secrets.token_urlsafe(32) — 43 chars of URL-safe base64
    token_type | "email_verification" (24hr TTL) or "password_reset" (1hr TTL)
    is_used | Prevents replay — marked true on first use
    expires_at | Enforced at validation time

    Password reset invalidates any existing unused reset tokens for the user before creating a new one — prevents confusion from multiple concurrent reset links.

    Emails are sent via Resend (HTTP API) as Celery background tasks.


    4. Content ingestion pipeline

    A URL save triggers a multi-stage Celery chain. The API returns 201 before any of this runs.
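
    A minimal sketch of how such a chain can be wired with Celery primitives (task names are the ones in the diagram below; the actual code may dispatch the later stages from inside extract_metadata rather than from the API handler):

    from celery import chain

    from app.tasks import extract_metadata, generate_embedding, generate_tags  # assumed module path

    def enqueue_ingestion(item_id: int) -> None:
        """Called after the row is created; the HTTP handler has already returned 201."""
        chain(
            extract_metadata.s(item_id),     # stage 1: fetch URL, extract text + metadata
            generate_embedding.si(item_id),  # stage 2: OpenAI embedding
            generate_tags.si(item_id),       # stage 3: auto-tagging
        ).delay()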

    Stage 1: extract_metadata

    fetch URL (requests, follow redirects)
      ├─> detect Content-Type → PDF path or HTML path
      ├─> parse OG / Twitter Card / JSON-LD metadata (BeautifulSoup)
      ├─> detect content_type (article / pdf / video / tweet / unknown)
      ├─> thumbnail fallback chain: OG → Twitter → JSON-LD → link[rel=image_src] → first in-content image
      ├─> run trafilatura (output_format="xml", include_tables/images/links=True)
      │     └─> xml_to_html() converts structured XML → clean HTML
      │         fallback: if <100 chars, convert plain text → <p> tags
      ├─> compute word_count, reading_time_minutes = max(1, round(word_count / 200))
      ├─> run paywall/access heuristics → set processing_error if restricted
      ├─> set processing_status = 'completed'
      └─> chain: generate_embedding.delay() → generate_tags.delay()
    

    trafilatura's XML output preserves structural information (heading levels, table cells, list items) that plain HTML or text loses. The xml_to_html() conversion maps these back to semantic HTML elements the reader can render.

    Paywall detection — 7 heuristics

    Items can reach processing_status='completed' with a processing_error set, meaning: we extracted something, but it's probably not the full article. The _detect_limited_extraction_reason() function applies these checks in order:

    Heuristic | Signal
    Schema markers | JSON-LD isAccessibleForFree: false, content_tier meta tags (paid, premium, subscriber, metered)
    Restriction text | Keywords like "paywall", "subscriber-only", "access options" in page text or DOM class names
    Bot blocking | Status codes 403/429, "cloudflare", "captcha", "blocks bots" in error text
    Coverage ratio | Extracted text / source text < 0.4 AND < 5 paragraphs AND < 1800 chars
    Teaser overlap | OG description found verbatim in extracted text AND extracted < 1200 chars AND < 5 paragraphs
    Media-only | Figure/img present + 0 paragraphs + < 900 chars (image galleries, paywalled articles showing only art)
    Structural truncation | Source has 6+ paragraphs, extracted ≤ 4 AND ratio < 0.4 AND < 1800 chars
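
    As an illustration, the coverage-ratio heuristic amounts to roughly this check (thresholds are copied from the table; this is a sketch, not the actual _detect_limited_extraction_reason code):

    def looks_truncated(extracted_text: str, source_text: str, paragraph_count: int) -> bool:
        """Coverage-ratio signal: far less text extracted than the source page contains."""
        if not source_text:
            return False
        ratio = len(extracted_text) / len(source_text)
        return ratio < 0.4 and paragraph_count < 5 and len(extracted_text) < 1800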

    These map to user-facing categories via ingestErrors.ts on the frontend: blocked, unauthorized, source_access, network, partial, unknown. Each category has its own badge text, severity, and reader fallback message.

    Extension fast path

    The Chrome extension runs Mozilla Readability client-side (in the browser, on the page being viewed) and sends pre_extracted_html in the POST body. The backend:

    1. Calls _clean_extension_html() to strip duplicate title/description/thumbnail elements that Readability might preserve
    2. Sets processing_status='completed' immediately
    3. Still fires extract_metadata.delay() for missing metadata fields and to trigger the embedding chain

    Result: extension-saved articles are fully readable in seconds. The extension also sets pre_extracted_access_restricted=True if detectAccessRestriction() fires — this immediately sets processing_error before Celery even runs.

    Retry semantics

    max_retries=3 on extraction tasks. Each failure saves a processing_error and eventually sets processing_status='failed'. The frontend shows failed items with a specific failure badge based on the error category.


    5. PDF extraction

    PDF extraction is a separate 3-stage pipeline handled by PyMuPDF (fitz) and an ONNX/YOLO layout model.

    Stage 1: Pre-scan

    Runs on every page before any model inference. Pure text analysis — fast and free.

    Per page:

    • Column detection: Measures X-distribution of text blocks. A split at 0.42/0.58 page width indicates two-column layout.
    • Header/footer bands: Detected via horizontal rules from page.get_drawings() — narrow lines (< 3pt height, > 30% page width) in the top 10% or bottom 20% of the page. Falls back to repeating-text detection if no rules are found.
    • Page number rects: Margin blocks matching \d[\d\s]* (max 6 chars) — excluded from reading bounds.
    • Ambiguity scoring: Two-column confidence scored by how evenly text is distributed left/right.

    Document-level:

    • Repeating header/footer texts: Blocks appearing on 3+ pages (at least 40% of all pages), normalized with re.sub(r"\b\d+\b", "#", ...) to match page numbers with different values. These are filtered from output.
    • Column mode: Majority vote across all pages.

    Stage 2: Layout detection

    Three implementations available (configured by environment):

    Implementation | Cost | Speed | Accuracy
    pymupdf_layout (ONNX model) | Free | ~200ms/page | Good
    yolo (torch, local) | Free after download | Variable | Good
    gpt4o_mini_vision | ~$0.03/page | Network-bound | Best

    The YOLO/ONNX approach detects reading regions, figures, tables, and headers as bounding boxes. These are then used to extract text in reading order (column-aware, skipping headers/footers/page numbers).

    Stage 3: Post-processing

    After layout extraction produces raw HTML:

    Step | What
    Title | First <h1> or <h2> → stored as item.title
    Author line removal | Short <p> blocks (< 300 chars) between title and abstract heading → removed (arXiv author/affiliation lines)
    Abstract (arXiv style) | Standalone <h1>ABSTRACT</h1> + following <p> → stored as item.description, removed from body
    Abstract (journal style) | Inline <p>ABSTRACT: text...</p> → same treatment
    Figure thumbnail | First <div class="figure-block"><img src="data:..."/> → stored as item.thumbnail_url, removed from body
    Confidence badge | Injected as <meta> tag; parsed by the Reader

    Why pre-scan first instead of running the model on every page? The pre-scan identifies structural features (columns, headers, footers) that inform how model outputs are interpreted. It also identifies pages that don't need model inference (single-column, clean layout). Running the model on every page without context leads to misrouted reading order in two-column academic papers.


    6. The reader

    The reader is a Reader shell component wrapping ReaderArticle.

    Rendering pipeline

    The stored HTML goes through several transforms before display:

    1. sanitizeContentHtml — strips ephemeral UI elements that might have been saved accidentally
    2. stripDocumentWrappers — removes <html>/<body> wrappers from PDF-extracted content
    3. addHeadingAnchors — generates deduplicated IDs on headings for table-of-contents links
    4. toBionic — bolds the first ~50% of each word; optional toggle in reading settings.

    Scroll position persistence

    read_position (0.0–1.0) is updated via a debounced PATCH /content/{id} as the user scrolls. On open, the reader scrolls to the saved position. Auto-marks as read at ≥ 0.9.

    Reading settings and SSR hydration

    All settings (theme, typography, bionic reading, feature visibility toggles) are stored in ReadingSettingsContext, persisted in localStorage['sedi-reading-settings']. The context initializes from defaults on the server (for SSR) and loads saved values in useLayoutEffect (client-only, after hydration). Components that depend on these settings return null until the hydrated flag is true — this prevents a flash of wrong font/theme on initial load.

    Highlights in the reader

    Highlights are stored with character offsets (start_offset, end_offset) into full_text. When the reader renders, it injects highlight spans by finding character ranges.


    7. Hybrid search

    A single GET /search/semantic?query=... endpoint handles four search strategies, routed by a classifier that runs in <1ms.

    The classifier (app/core/search_router.py)

    Priority-ordered rules — first match wins:

    Query shape | Route | Example
    Contains author:, tag:, site:, after:, before:, is: | SQL filter | author:Paul Graham after:2025-01-01
    Quoted phrase | tsvector keyword | "attention is all you need"
    Looks like a domain | SQL site filter | substack.com
    Matches user's known author name | SQL filter | (user has articles by "Simon Willison")
    Matches user's known tag | SQL filter | music
    Ends with ? or starts with question word | pgvector semantic | how does attention work?
    ≤ 4 words, no question words | tsvector keyword | llm, react hooks
    Everything else | keyword + semantic fused with RRF | building products with AI

    get_user_search_context() preloads the user's known authors and tags before classification — one SQL query, called once per search request.
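
    A compressed sketch of that routing order (rule order matches the table; the regexes and question-word list are illustrative, not copied from search_router.py):

    import re

    FILTER_OPS = re.compile(r"\b(author|tag|site|after|before|is):", re.IGNORECASE)
    QUESTION_WORDS = {"how", "what", "why", "when", "where", "who", "which", "does", "is", "can"}

    def route_query(query: str, known_authors: set[str], known_tags: set[str]) -> str:
        q = query.strip()
        ql = q.lower()
        if FILTER_OPS.search(q):
            return "filter"
        if '"' in q:
            return "keyword"
        if re.fullmatch(r"[\w-]+(\.[\w-]+)+", q):       # looks like a domain
            return "filter"
        if ql in known_authors or ql in known_tags:
            return "filter"
        if q.endswith("?") or (ql.split() and ql.split()[0] in QUESTION_WORDS):
            return "semantic"
        if len(q.split()) <= 4:
            return "keyword"
        return "hybrid_rrf"                             # keyword + semantic fused with RRF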

    The three engines

    SQL filter path: Pure SQLAlchemy query, no vector math, no full-text index. Fastest. Tag filter uses EXISTS (SELECT 1 FROM unnest(tags) t WHERE t ILIKE :pat) — case-insensitive partial match across the Postgres array.

    Keyword path: PostgreSQL tsvector full-text search via ts_rank_cd. The search_vector column uses a dual-dictionary trigger:

    • english dictionary: stems words (running → run, articles → article)
    • simple dictionary: stores tokens as-is (LLMs → llms, RAG → rag, API → api)
    • Combined with || so both dictionaries contribute to ranking
    • Single-token queries use prefix matching (to_tsquery('simple', 'llm:*')) so llm matches llms
    • ts_rank_cd with normalization flag 32 caps scores at 0–1
    • If keyword returns 0 results (misspelling, partial word), falls back to semantic silently

    Semantic path: Embeds the query via OpenAI, runs cosine similarity against content_items.embedding in pgvector. Falls back to keyword if embedding fails or no embeddings exist yet.

    Embedding cache

    Query embeddings are cached in Redis (qemb:{sha256(query)[:16]}, 1hr TTL). Searching "llm" five times calls OpenAI once.
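
    A sketch of that read-through cache with the key scheme above (client construction is an assumption; error handling omitted):

    import hashlib
    import json

    import redis
    from openai import OpenAI

    r = redis.Redis()
    client = OpenAI()

    def get_query_embedding(query: str) -> list[float]:
        key = f"qemb:{hashlib.sha256(query.encode()).hexdigest()[:16]}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        emb = client.embeddings.create(model="text-embedding-3-small", input=query).data[0].embedding
        r.setex(key, 3600, json.dumps(emb))  # 1-hour TTL
        return emb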

    Reciprocal Rank Fusion (RRF)

    When two or more engines run, ranked lists merge with RRF:

    score(item) = Σ  1 / (60 + rank_in_list)
                  all lists containing item
    

    The constant 60 (from the original RRF paper) dampens the benefit of high rank — a rank-1 result doesn't dominate completely. RRF scores are not percentages — they're only meaningful for sorting. The frontend does not display them.
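
    In code, the fusion is a few lines (a sketch; inputs are ranked lists of item IDs):

    def rrf_fuse(ranked_lists: list[list[int]], k: int = 60) -> list[int]:
        """Merge ranked lists with Reciprocal Rank Fusion; output is sorted by fused score."""
        scores: dict[int, float] = {}
        for ranking in ranked_lists:
            for rank, item_id in enumerate(ranking, start=1):
                scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)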

    Three-way fusion (mode=full): Used by the SearchModal. Filter + keyword + semantic all run simultaneously. Keyword + semantic are fused first, then fused with filter results. This maximizes recall — every result from every engine is considered.

    SearchBar (navbar): mode=auto — classifier picks the cheapest path. Returns 5 results inline. No OpenAI call for keyword/filter queries.

    SearchModal (Cmd+K): mode=full — always runs all three engines. Returns 10 results per page with prev/next pagination. Date preset chips: last 7 days / 30 days / 3 months / last year / custom range. Custom range uses cross-validated min/max on the date inputs to prevent invalid ranges.


    8. Auto-tagging

    Goal: Suggest relevant tags without spending money on every article.

    Two-pass strategy

    Pass 1 — free (pgvector similarity):

    • Find the user's already-tagged content with cosine distance < 0.25 (very similar articles)
    • If ≥ 2 tags appear across ≥ 2 similar articles → auto-accept those tags immediately

    Pass 2 — cheap LLM (gpt-4o-mini):

    • Only runs if pass 1 finds nothing (new library, first article in a topic area)
    • Sends title + description + first 800 words of plain text to gpt-4o-mini
    • Parses JSON array from response

    Why this design over always calling GPT? At 1000 articles, calling gpt-4o-mini for every article costs ~$0.20/month. But at 10,000 articles with a well-tagged library, pass 1 handles ~80% of new articles for free. The cost curve flattens as the library grows — the system gets cheaper per article over time.
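
    A sketch of the pass-1 vote, assuming neighbors with cosine distance < 0.25 have already been fetched via pgvector (the helper name is hypothetical; thresholds mirror the bullets above):

    from collections import Counter

    def vote_tags(similar_items: list[dict], min_articles: int = 2) -> list[str]:
        """similar_items: [{'tags': [...]}, ...] for the user's already-tagged nearest neighbors."""
        counts = Counter(tag for item in similar_items for tag in set(item["tags"]))
        accepted = [tag for tag, n in counts.items() if n >= min_articles]
        # Auto-accept only when >= 2 tags recur across >= 2 similar articles; otherwise fall through to gpt-4o-mini
        return accepted if len(accepted) >= 2 else []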


    9. Highlights and idea connections

    Embeddings on highlights

    After a highlight is created, generate_embedding embeds the highlight text (same OpenAI model, same Celery task used for articles). The 1536-dim vector is stored in highlights.embedding.

    Cross-article connections

    GET /search/connections/{highlight_id}:

    SELECT h.id, h.text, h.color, ci.title,
           (1 - (h.embedding <=> CAST(:source AS vector))) as similarity
    FROM highlights h
    JOIN content_items ci ON h.content_item_id = ci.id
    WHERE h.user_id = :uid
      AND h.content_item_id != :source_article_id
      AND h.embedding IS NOT NULL
      AND LENGTH(h.text) >= 20
      AND (1 - (h.embedding <=> :source)) >= :threshold
    ORDER BY h.embedding <=> :source
    LIMIT :limit
    

    The LENGTH(h.text) >= 20 filter excludes trivially short highlights that produce noisy embeddings.

    GET /search/connections/article/{content_id} — runs the above for every highlight in an article, groups by connected article, and sorts by total similarity score. This powers the "Connections" tab in the reader.


    10. Lists and drafts

    Lists

    Many-to-many join table (content_list_membership) between lists and content items. Key design decisions:

    • added_by on the join table — preserves who added an item (future sharing feature)
    • List soft deletion doesn't cascade-delete items — items just lose membership
    • GET /lists/{id}/content joins content_items and filters deleted_at IS NULL — items independently soft-deleted don't appear

    Drafts

    Each list can have one markdown draft (one per list per user). The draft is a writing space for notes, summaries, or essays about the list's content.

    API:

    • GET /lists/{id}/draft — 404 if no draft exists yet
    • POST /lists/{id}/draft — explicit creation (409 if already exists)
    • PATCH /lists/{id}/draft — auto-creates if not exists; this is the autosave target
    • DELETE /lists/{id}/draft — removes the draft

    Split-pane reader in list view

    /lists/[id] renders a split-pane layout: list on the left, ReaderArticle on the right. ReaderArticle accepts an embedded prop that switches from window.scroll to a container scroll, so the scroll position and reading progress bar work correctly inside the panel.


    11. MCP server — LLM agent interface

    The MCP (Model Context Protocol) server exposes the library to LLM agents. It has two deployment modes.

    Phase 1: Local stdio (Claude Desktop)

    Launched as a subprocess by Claude Desktop:

    {
      "mcpServers": {
        "sedi": {
          "command": "poetry",
          "args": ["run", "python", "-m", "app.mcp.server"],
          "cwd": "/absolute/path/to/backend",
          "env": { "TOKEN": "<your-sedi-jwt>" }
        }
      }
    }
    

    Phase 2: Hosted HTTP (OAuth 2.1 + PKCE)

    Mounted as an ASGI app at /mcp-transport on the main FastAPI server. Fully compliant OAuth 2.1 with PKCE (S256 only — no plain challenge method).

    OAuth endpoints:

    Endpoint | Standard | Purpose
    /.well-known/oauth-authorization-server | RFC 8414 | Discovery metadata
    /.well-known/oauth-protected-resource | RFC 9728 | Resource server metadata
    register | RFC 7591 | Dynamic client registration
    authorize GET | | Login form (HTML)
    authorize POST | | Credential check + auth code issuance
    token | RFC 6749 | Code exchange + refresh token grant

    Client validation: Static allowlist from MCP_OAUTH_CLIENTS_JSON env var (JSON dict of client_id → [redirect_uri...]). If the env var is empty, any dynamically registered client is accepted. Non-matching client_id or redirect_uri → 400.

    Security: OAuth login form HTML-escapes all reflected parameters (client_id, redirect_uri, state, code_challenge) to prevent reflected XSS. CSRF embedded in OAuth state parameter.

    PKCE? PKCE prevents authorization code interception attacks. Even if an attacker captures the auth code in transit, they can't exchange it without the code verifier that was never sent over the network.
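
    For reference, the S256 relationship between verifier and challenge (standard PKCE, not code lifted from the repo):

    import base64
    import hashlib
    import secrets

    def make_pkce_pair() -> tuple[str, str]:
        verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
        challenge = base64.urlsafe_b64encode(hashlib.sha256(verifier.encode()).digest()).rstrip(b"=").decode()
        return verifier, challenge  # challenge goes to /authorize, verifier goes to /token

    def verify_s256(verifier: str, challenge: str) -> bool:
        expected = base64.urlsafe_b64encode(hashlib.sha256(verifier.encode()).digest()).rstrip(b"=").decode()
        return secrets.compare_digest(expected, challenge)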

    MCP tools (15 total)

    Read tools:

    Tool | What it does
    list_lists() | All reading lists with item counts
    get_list_content(list_id, include_full_text, limit) | Articles in a list. full_text capped at 32,000 chars to prevent LLM context overflow. Max 200 items.
    get_content_item(item_id, include_full_text) | Single article metadata or full content
    search_content(query, limit) | Hybrid search (same engine as the app). Max 50 results.
    find_similar(item_id, limit, threshold) | Cosine similarity search. Default threshold 0.5.
    get_highlights(item_id?, list_id?) | Three modes: all highlights for an article, all highlights in a list, or all highlights across the library (max 100)
    get_draft(list_id) | Writing draft for a list, or null if none
    get_reading_stats() | {total_items, read_count, unread_count, archived_count}
    summarize_list(list_id, style, max_items) | AI summary. Styles: overview, themes, gaps, timeline. gaps style includes draft content to identify what's missing. Cached in-process by (user_id, list_id, content_hash, style).

    Write tools:

    Tool | What it does
    add_content(url) | Save URL, triggers background extraction
    update_draft(list_id, content, title?) | Create or update list draft
    create_list(name, description?) | New reading list
    add_to_list(list_id, item_id) | Add content item to a list

    An LLM agent calling get_list_content with a large list could easily overflow its context window with full article text. The 32k cap prevents this while still giving the agent substantial content to work with. Agents needing full text should call get_content_item per article.


    12. Public profiles

    Users can expose their queue and/or crates publicly via /[username].

    Visibility model

    Three independent toggles:

    • user.is_public — profile visible at all
    • user.is_queue_public — queue items visible
    • user.is_crates_public — vinyl records visible

    Individual items additionally have is_public — a user can have a public queue but hide specific articles.

    No-auth API routes

    /public/u/{username}/... routes have no get_current_active_user dependency. They check is_public flags and return 403 if the profile is private. These are the only unauthenticated API routes in the system (besides /auth).

    Guest reading limit

    The public reader tracks reads in localStorage per profile owner. After 3 reads from the same profile, it shows a signup prompt. Entirely client-side — a conversion nudge, not server-enforced access control.


    13. Crates — vinyl record collection

    A second vertical for managing a vinyl record collection. Users paste a Discogs URL; Celery fetches metadata from the Discogs API asynchronously.

    vinyl_records table (separate from content_items)

    The metadata structure (tracklist JSON, label, catalog number, year, artist) doesn't fit the content_items schema. A separate table is cleaner than adding 10 nullable columns to a content items table that doesn't need them.

    Column | Notes
    discogs_release_id | Integer parsed from URL, used for the Discogs API call
    tracklist | JSONB: [{position, title, duration}, ...]
    videos | JSONB: [{title, uri, duration}, ...] — Discogs links + user-added YouTube links
    status | 'collection', 'wantlist', 'library' — not a boolean wantlist column
    processing_status | Same pending → completed/failed pattern as content items

    Music playback

    PlayerContext manages a queue of QueueTrack[] objects derived from vinyl_records.videos. YouTubePlayer is an invisible div hosting the YouTube IFrame API — plays videos sequentially. Queue persists to localStorage['sedi-player'] across navigation.

    Why YouTube IFrame API instead of an audio element? Discogs stores YouTube links as listening sources. The IFrame API gives programmatic play/pause/next without needing a raw audio URL. The trade-off: requires the video to be embeddable (some YouTube videos disable embedding).


    14. Chrome extension

    MV3 extension. Three components in a chain:

    content.js (injected into every page)

    Runs Mozilla Readability on the page to extract clean article HTML. Also runs detectAccessRestriction() — checks JSON-LD isAccessibleForFree, content_tier meta tags, and known paywall DOM selectors (subscription modal class names, etc.).

    Shows a preview before saving: word count, read time, author, publish date, access-restriction signal. User clicks "Save" → sends payload to service worker.

    service_worker.js

    Calls POST /content with pre_extracted_html, pre_extracted_access_restricted, and all metadata fields. Maps accessRestricted: true → pre_extracted_access_restricted: true, which triggers immediate processing_error flagging on the backend before Celery runs.


    15. Rate limiting

    RateLimitMiddleware applies to POST /content only. Sliding window algorithm:

    • Each user (JWT user ID) has a deque of request timestamps
    • On each request: pop timestamps older than the window; if len(deque) < max_requests, allow and append
    • Limits: 10 requests / 60 seconds AND 50 requests / 3600 seconds
    • 429 response includes Retry-After header and CORS headers

    Known limitation: State is in-memory per process. Multiple Railway instances would each have independent counters — a user could exceed the limit across instances. Production fix: Redis-backed INCR + EXPIRE. Documented, not yet implemented.
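
    A sketch of what that Redis-backed fix could look like (fixed-window INCR + EXPIRE for brevity; the current middleware uses a sliding window instead, and the key scheme here is an assumption):

    import redis

    r = redis.Redis()

    def allow_request(user_id: str, limit: int = 10, window_seconds: int = 60) -> bool:
        """Counter shared across instances: INCR the per-user key, set the TTL on first hit."""
        key = f"ratelimit:{user_id}:{window_seconds}"
        count = r.incr(key)
        if count == 1:
            r.expire(key, window_seconds)
        return count <= limit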


    16. Security posture

    Current protections

    Area | Implementation
    Passwords | bcrypt via passlib
    API auth | JWT signed with SECRET_KEY, expiry enforced on every request
    Cross-user isolation | Every DB query filters by user_id = current_user.id
    CORS | Origin allowlist from ALLOWED_ORIGINS env var
    Rate limiting | Sliding window on POST /content
    XSS in OAuth | HTML-escaped reflected parameters in MCP OAuth login page
    MCP OAuth | Strict client_id + redirect_uri allowlist; PKCE S256 only; CSRF in state
    Error sanitization | No internal query details or stack traces in API responses

    Known gaps (documented, not hidden)

    Risk | Current state | Production fix
    XSS via article HTML | dangerouslySetInnerHTML in reader without sanitization | DOMPurify or sandboxed iframe
    SSRF | Backend fetches user-provided URLs without IP validation | Block internal ranges (169.254.x.x, 10.x.x.x, 127.x.x.x)
    Token storage | localStorage is XSS-accessible | httpOnly cookies + CSRF tokens
    Rate limit in-memory | No cross-instance enforcement | Redis-backed INCR + EXPIRE

    The gaps are accepted trade-offs for a personal-use app at current scale, not oversights. Any engineer picking up this codebase can see exactly what needs hardening before a public launch.


    17. Key design decisions


    Q: Why Celery + Redis instead of FastAPI background tasks?

    FastAPI's BackgroundTasks run in the same process. A slow extraction blocks other requests. Celery runs in a separate worker process — failures, retries, and slow jobs don't affect API latency. Redis is already provisioned for caching and OAuth state, so Celery adds no infrastructure cost.


    Q: Why store article HTML instead of plain text?

    The reader needs structure — headings, code blocks, images, lists. Plain text loses all of that. The trade-off is storage size and XSS risk. XSS is on the known-gaps list (DOMPurify is the fix).


    Q: Why not just use Elasticsearch or Typesense for search?

    pgvector + tsvector together cover keyword, semantic, and filter search in one database. No sync lag, no second datastore, no additional cost. RRF fusion is the same technique production systems use (Elasticsearch 8.x ships it natively). The only thing missing is ANN indexing (HNSW) for billion-scale vector search — not relevant here.


    Q: How does the hybrid search classifier work?

    It's a priority-ordered set of regex heuristics and user data lookups that runs in <1ms with no LLM call. Explicit operators first, then domain detection, then the user's known authors/tags (loaded from DB before classification), then question detection, then word count threshold (≤4 words = keyword, more = hybrid). mode=full bypasses the classifier entirely and runs all three engines simultaneously.


    Q: Why dual-dictionary tsvector instead of just the English dictionary?

    The English stemmer treats "LLMs" as a token it doesn't know how to stem — it comes out as llms. A query for llm (stemmed via English) doesn't match llms. The simple dictionary stores tokens as-is, so llms is indexed and prefix matching llm:* catches it. Running both dictionaries means you get stemming benefits (run/running match) AND acronym accuracy.


    Q: How does the MCP server handle auth in two different modes?

    Phase 1 (stdio): JWT passed as environment variable, decoded by app/mcp/auth.py on every tool call. Simple but requires the user to copy a token from their browser.

    Phase 2 (HTTP): Full OAuth 2.1 + PKCE flow. The MCP client redirects to the sed.i login form, user authenticates, gets an auth code, exchanges it for a sed.i JWT. The end result is the same JWT the regular API uses — no separate token type, no extra verification logic.


    Q: Why character offsets for highlights instead of DOM ranges?

    DOM ranges (XPath, CSS selectors) break when HTML structure changes — e.g. bionic reading transforms, re-extraction, or adding heading anchors. Character offsets into the source full_text string are stable as long as that string doesn't change. This is also why re-extraction after a successful extraction isn't triggered automatically.


    Q: Why JSONB for reading_patterns instead of a separate analytics table?

    The recommendation engine reads one row per request. A separate table would require joining content_items across potentially thousands of rows to compute rolling stats at request time. JSONB on the user row is a deliberate denormalization — stats are maintained incrementally (each article completion updates the rolling window) rather than computed on read.


    Q: Why soft deletes everywhere?

    Hard deletes are irreversible and can crash background Celery tasks holding a reference to a deleted ID. Soft deletes let users undo, preserve audit trails, and keep background tasks safe. The trade-off: every query must include AND deleted_at IS NULL. Missing this is a data leak — it's enforced at the ORM layer in shared query helpers.
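
    A sketch of such a shared helper (SQLAlchemy; the function, model, and session names are illustrative, not the actual query helpers):

    from sqlalchemy import select
    from sqlalchemy.orm import Session

    def live_items_for_user(db: Session, model, user_id: int):
        """Base query used everywhere: ownership filter plus the soft-delete guard."""
        stmt = select(model).where(
            model.user_id == user_id,
            model.deleted_at.is_(None),
        )
        return db.execute(stmt).scalars().all()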


    Q: How does the extension fast path work, and why?

    The extension runs Mozilla Readability client-side (in the browser, on the page being viewed) and sends pre-extracted HTML. The backend skips trafilatura, sets processing_status='completed' immediately, and still fires Celery for embeddings and any missing metadata. Result: extension-saved articles are fully readable in seconds. The extension also detects paywalls client-side (JSON-LD signals, DOM selectors) and signals this with pre_extracted_access_restricted=True — the backend sets processing_error before Celery even runs.


    Q: How does the PDF extraction pipeline differ from article extraction?

    PDFs go through a 3-stage pipeline: (1) a free pre-scan per page that detects column layout, header/footer bands, and page numbers using PyMuPDF text block positions; (2) layout model inference (YOLO/ONNX by default, GPT-4o-mini vision as a higher-accuracy option) to identify reading regions, figures, and tables; (3) post-processing that extracts the title from the first heading, removes author lines from arXiv-style front matter, detects and removes abstracts (stored as description), and extracts the first figure as a thumbnail. The pre-scan informs how model outputs are interpreted, especially for two-column academic papers where reading order matters.


    Q: What can an LLM agent do with the MCP server?

    Read: browse lists, read full article content (capped at 32k chars to prevent context overflow), search the library with the same hybrid engine the app uses, find articles similar to a given one, read all highlights, get reading stats, and generate AI summaries of a list in different styles (overview, themes, gaps, timeline). Write: save new URLs, create lists, add articles to lists, and create/update writing drafts. The gaps summary style cross-references the list's draft to identify what's missing — useful for research writing workflows.


  • Prediction Whale - An Audit of Prediction Market Concentration

    The concept of wisdom of crowds came up a lot in my machine learning courses. The idea is straightforward: decisions made by aggregating information across a group tend to be better than those made by any single member. Prediction markets like Kalshi and Polymarket are built on this premise — they let a broad audience trade on the probability of specific outcomes, and treat the resulting prices as a more accurate read on public sentiment than polls or expert opinion. CNN now cites Kalshi odds during election coverage. Financial desks treat them as a legitimate forecasting signal. The platforms are growing fast, with billions in volume and serious institutional attention, all without charging a transaction fee, which raises an obvious question about what they're actually selling, and to whom.

    The two platforms are less similar than they appear. Kalshi is CFTC-regulated, operates in USD, and requires verified US accounts, with each account tied to a confirmed identity. Polymarket runs on Polygon, settles in USDC, and at the time this project was conducted had no KYC requirement and was pseudonymous, operating outside the US following a 2022 CFTC enforcement action. The groups behind each platform reflect this split: Kalshi is the regulated, institutional-adjacent one; Polymarket is crypto-native. Counterintuitively, that made Polymarket the more auditable of the two: every trade, wallet, and timestamp is on-chain and publicly accessible. Kalshi is accountable in the legal sense but opaque in the data sense; if you want to know who's actually moving its prices, you have to ask Kalshi.

    That asymmetry opened up a few directions worth investigating:

    • If both markets claim to represent public wisdom, odds on the same event should converge. A persistent gap suggests the crowds are different, or one market is less honest about what it's doing, or both.
    • The difference in user base (crypto-native versus regulated US accounts) might mean each market reflects a genuinely different population, equally representative but measuring different things.
    • More centrally: are these odds actually set by a distributed crowd, or by a small number of large wallets? If a Kalshi probability ticks up sharply and CNN reports it as public sentiment, that's a real information-flow problem regardless of whether anyone intended it.
    • Do large traders cause retail participants to change their behavior? A retail trader might have a clear view, see a large wallet moving the other direction, and follow — even if that position has nothing to do with the actual probability of the outcome.

    For an ML course in my grad program, that last question is what this project set out to answer.

    Getting the data

    My starting point was to collect markets that exist on both platforms. If both markets are pricing the same event, you can ask whether the odds converge or diverge, and why.

    The first obstacle was that Polymarket and Kalshi don't organize their markets the same way. Each platform nests binary sub-markets ("Will Trump nominate Judy Shelton?") inside categorical event groupings ("Who will Trump nominate as Fed Chair?"). Matching at the sub-market level is difficult because titles are too platform-specific to align. Matching at the event level worked much better: embedding event titles with a sentence-transformer model from Hugging Face and computing cosine similarity across 9.8 million candidate pairs found 479 matches above the 0.75 threshold and 8 perfect 1.0 matches. Six pairs made the final cut after manual review.
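
    A sketch of that matching step (the specific model name here is an assumption; the 0.75 threshold is the one used above):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

    def match_events(kalshi_titles: list[str], polymarket_titles: list[str], threshold: float = 0.75):
        k_emb = model.encode(kalshi_titles, convert_to_tensor=True, normalize_embeddings=True)
        p_emb = model.encode(polymarket_titles, convert_to_tensor=True, normalize_embeddings=True)
        sims = util.cos_sim(k_emb, p_emb)            # one score per (Kalshi event, Polymarket event) pair
        pairs = [(i, j, float(sims[i, j]))
                 for i in range(sims.shape[0])
                 for j in range(sims.shape[1])
                 if sims[i, j] >= threshold]
        return sorted(pairs, key=lambda t: t[2], reverse=True)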

    Market | Type | Notes
    Champions League Winner | Sports | $1B volume; insider-risk structure (lineup leaks, injury timing)
    US Presidential Election | Political | High volume, high public attention
    Congressional Control | Political |
    Fed Rate Decision | Policy | Single binary outcome
    Trump Cabinet Nominations (×2) | Political | Multiple sub-markets per event

    Running the analysis on a single event type would have told a single-event story, so the diversity across categories mattered. Due to time and compute constraints, trades were capped at 2,000 per sub-market, sampled in reverse chronological order. Each matched event contains many sub-markets. For example, the Champions League event alone has dozens, one per team. The 237,791 total Polymarket trades come from roughly 125 sub-markets at up to 2,000 each, alongside 160,000 Kalshi trades. One clear drawback is that the reverse-chronological sampling skews the dataset toward end-of-market activity, when prices have largely settled and when large wallets are likely most active.

    This sampling choice shaped what the analysis could and couldn't answer. End-of-market behavior is when volume concentration is most visible, which worked in our favor for the behavioral clustering. But it also meant the cross-platform price comparison — the original goal — came back empty. Two markets pricing the same event should converge over time, and testing that properly requires a more extended trade history, not the most-recent-N approach the API forces on you. That limitation is worth stating upfront, because the analysis that follows is primarily about Polymarket's internal structure rather than a head-to-head platform comparison.

    There's also an asymmetry in what each platform's API could give us. Polymarket's blockchain architecture exposes wallet addresses for every trade, making wallet-level behavioral analysis possible. Kalshi's public API returns price, size, and timestamp, but no user identifier. Every trade is anonymous. The entire behavioral analysis depends on being able to group trades by the account that made them, so while the project started as a two-platform comparison, the behavioral work is necessarily Polymarket-only. This is its own kind of irony: the unregulated, pseudonymous platform is the one that's actually auditable from the outside.

    Behavioral fingerprints

    Rather than relying only on the raw metrics the APIs return, the methodological bet of this project is that trading behavior — how someone trades, not just how much — is informative enough to characterize who's in the market and what they're doing.

    Eleven features per wallet, organized around four dimensions:

    Dimension | Features | What it's asking
    Scale | total_volume, avg_trade_size | How much does this wallet trade?
    Activity | num_trades, trade_freq_per_hour | How often does it trade?
    Timing | timing_entropy, activity_span | When and how predictably does it trade?
    Direction & breadth | buy_ratio, market_diversification, volume_concentration, win_rate, position_size_variance | Does it show systematic bias, and across how many markets?

    After standardization, all features contribute roughly equal variance, so no single dimension drowns out the others. The more interesting finding from the correlation matrix is that timing entropy is nearly orthogonal to volume features, which means the dataset captures two genuinely independent behavioral dimensions. How much someone trades and how they distribute that activity across time don't necessarily go together: a wallet that concentrates all its activity in a single overnight session looks completely different from one that trades across every hour of the day, even at identical total volume. That distinction turned out to be one of the cleaner separators between the normal clusters and the wallets that didn't fit anywhere.

    Who is actually in this market

    We applied clustering to answer this question, and deliberately chose DBSCAN over something like K-means because it lets the density of the feature space determine both the number and shape of clusters, which is more appropriate for an exploratory question like this one. It also doesn't assume spherical clusters, which matters when behavioral groups can take irregular shapes in high-dimensional space. With a parameter sweep optimizing for silhouette score, the final parameters (eps = 0.3, min_samples = 10) produced 104 clusters and 4,446 noise points, with a silhouette score of 0.7017.
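
    A sketch of the clustering step with those final parameters (feature-matrix construction is assumed; whether the reported silhouette excludes noise points is also an assumption here):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    def cluster_wallets(features: np.ndarray):
        """features: one row per wallet, eleven behavioral columns."""
        X = StandardScaler().fit_transform(features)
        labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
        clustered = labels != -1                     # -1 marks DBSCAN noise points
        score = float("nan")
        if len(set(labels[clustered])) > 1:
            score = silhouette_score(X[clustered], labels[clustered])
        return labels, score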

    • 5 clusters at 1,000+ wallets — the dominant retail archetypes; clusters 4 and 5 alone cover ~32,000 wallets
    • 23 clusters at 100–999 — solid mid-size behavioral groups
    • 75 clusters at 10–99 — niche behavioral styles, small but not arbitrary; min_samples = 10 means each required at least 10 wallets to form

    The long tail of small clusters reflects genuine behavioral heterogeneity in a large dataset — traders who share enough features to be grouped together but don't fit the dominant archetypes. What the cluster taxonomy captures well is the dominant archetypes; what matters more for the analysis is the noise-point group, whose behavioral profile holds regardless of how the normal clusters are divided.

    A note on what noise means here: DBSCAN labels a point as noise not because it's anomalous in a statistical sense, but because it lacks enough nearby neighbors in feature space to form or join a dense region. These wallets aren't outliers by definition; they just didn't cluster. What makes the group interesting is what their features actually show once you look at them directly.

    [Figure: Who Is Actually Moving This Market?]

    The cluster size distribution shows the largest normal cluster at 16,309 wallets — one-time retail speculators who bet once and never return. The trading frequency distribution on a log scale makes the separation visible: the mass of wallets sits near 0–5 trades per hour, while a thin tail extends past 175. That tail is not retail.

    The noise-point (suspicious) wallets average $49,790 per wallet in volume — roughly 30× the typical retail cluster — with timing entropy of 0.89 versus near-zero for retail. Retail wallets make one or two trades, usually at the same hour, usually all on one side. The high-frequency group makes 28 trades on average, spread across multiple markets, at hours that don't cluster.

    [Figure: Trader Behavior Archetypes]

    By count, Buy & Hold dominates; the high-frequency wallets are a small bar. Flip to average volume on a log scale, and that bar towers over everything else by orders of magnitude. The buy/sell bias chart shows that Whale and Buy & Hold wallets are 100% buyers, while the high-frequency group leans 64% buy — some directional conviction, but without the pure-accumulation signature of retail.

    [Figure: Detecting Structural Manipulators vs. Retail Participants]

    The timing entropy histogram is the clearest separator. The normal population spikes sharply near zero — nearly all retail wallets make one-off bets at a single hour and stop. The noise-point wallets spread across a wide entropy range centered around 0.89. The buy ratio distributions show something subtler: normal wallets pile up at 0 and 1 (pure sellers and pure buyers), while the high-frequency group shows a flatter distribution centered around 0.6–0.7, which is a different behavioral mode from anything in the 104 normal clusters.

    Those 4,446 wallets — 7% of participants — account for 69.8% of all trading volume.

    Concentration and network structure

    [Figure: How Concentrated Is the Trading Power?]

    The Lorenz curve hugs the bottom-left corner before shooting nearly vertically at the far right, which is the classic extreme-inequality shape. The Gini coefficient of 0.958 puts these markets toward the far end of concentration: the top 1% of wallets control 66.5% of volume, while the bottom 90% control 5%. No single wallet individually dominates — the top four each hold about 1.7% of total volume — so the concentration is in a group, not a monopolist.
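
    For reference, the Gini coefficient behind that curve can be computed directly from per-wallet volumes (a standard formulation, not the project's exact code):

    import numpy as np

    def gini(volumes: np.ndarray) -> float:
        """Gini of a non-negative array: 0 = perfect equality, ~1 = maximal concentration."""
        v = np.sort(volumes.astype(float))
        n = v.size
        cum = np.cumsum(v)
        # 1 minus twice the area under the Lorenz curve, via the trapezoid rule
        return float((n + 1 - 2 * np.sum(cum) / cum[-1]) / n)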

    The clustering characterizes wallets in isolation: each wallet's behavioral profile is computed from its own trades, with no reference to what other wallets were doing at the same time. That's useful for identifying who's in the market, but it can't answer whether the high-volume group is acting in concert. I also wanted to understand whether they're moving together, which is what the network analysis is for.

    The co-trading graph was built from shared-side activity: two wallets get an edge if they traded on the same side within the same one-hour window on the same market. Same-side co-trading in a tight window is a necessary condition for coordination, though not sufficient. It could also capture traders independently reaching the same conclusion at the same time.
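
    A sketch of that edge rule (column names are assumptions based on the description above):

    from itertools import combinations

    import networkx as nx
    import pandas as pd

    def build_cotrading_graph(trades: pd.DataFrame) -> nx.Graph:
        """trades columns: wallet, market_id, side, timestamp (datetime64)."""
        g = nx.Graph()
        trades = trades.assign(hour=trades["timestamp"].dt.floor("h"))
        for _, group in trades.groupby(["market_id", "side", "hour"]):
            for a, b in combinations(group["wallet"].unique(), 2):
                g.add_edge(a, b)   # same side, same market, same one-hour window
        return g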

    Network metric | Value
    Nodes | 63,793
    Edges | 4,127,204
    Graph density | 0.0020
    Communities (Louvain) | 1,261
    Modularity | 0.833
    Freeman centralization | ~0

    A density of 0.002 means the graph is sparse: most wallet pairs never co-trade. The modularity of 0.833 is high relative to typical real-world networks (which score 0.3–0.7), meaning the communities are genuinely coherent: groups of wallets that trade together far more than chance would predict, with sparse connections between groups. Freeman centralization near zero means no single wallet acts as a hub connecting the rest of the network.

    That's the decentralization finding: diffuse, no coordination center, no dominant node. But network decentralization is not the same as capital equality. The trading network looks dispersed. The money isn't. Those two things can coexist, and here they do.

    As a sanity check, DBSCAN cluster membership was compared against Louvain community membership. The two methods measure different things — behavioral feature similarity versus temporal co-trading structure — so perfect alignment would be suspicious, but no alignment would suggest they're picking up completely unrelated signals. The largest network community had 35.8% of its members from DBSCAN cluster 5 (the buy-and-hold retail cluster), which is meaningfully above that cluster's base rate in the population (~25%). Two different lenses, picking up related but distinct structure.

    Are these markets being manipulated?

    Surowiecki's framework for crowd wisdom requires four conditions: diversity, independence, decentralization, and aggregation. Each has a measurable proxy, though translating a qualitative framework into numbers involves judgment calls — the sub-scores below are meant to be indicative rather than precise.

    Condition | Finding | Sub-score
    Diversity | 69.8% of volume controlled by 7% of wallets | 30.2 / 100
    Independence | High-frequency wallets show no directional coordination with each other | 93.0 / 100
    Decentralization | Network Freeman centralization ≈ 0; 1,261 communities | 100.0 / 100
    Aggregation | 1,261 communities actively forming prices across markets | 83.3 / 100

    On independence: the high-frequency wallets' buy ratio standard deviation is 0.32, essentially the same as retail's 0.34, which means they aren't agreeing with each other on direction any more than retail traders do. If 4,446 wallets were coordinating, you'd expect their bets to cluster toward one side across markets. They don't — the directional spread looks like independent actors, not a coordinated group.

    On decentralization: as covered in the network section above, the trading graph is structurally dispersed. The caveat worth repeating is that this doesn't resolve the capital concentration finding. A decentralized network where one subset holds 70% of the capital is still a concentration problem — the decentralization just means that subset isn't acting as a single coordinated entity.

    Diversity is the condition that fails, and it's the most consequential one. The volume-weighted price in a capital-heavy market will reflect the convictions of the large wallets more than the small ones, regardless of whether those large wallets are coordinating. That's just how capital weighting works.

    The most direct test of active manipulation was temporal causality: do the high-frequency wallets trade before retail, moving prices and inducing the crowd to follow?

    [Figure: Does Suspicious Wallet Activity Lead Retail Activity?]

    Lead-lag cross-correlation across 124 markets in 15-minute bins. If the high-frequency wallets were systematically front-running retail, the red bars (high-frequency leads) would be clearly taller than the blue. The profile is nearly symmetric — directional asymmetry of 0.008, p = 0.531, not significant. Worth noting that the reverse-chronological sampling means this test is weighted toward end-of-market behavior, when large wallets are most active and retail has often already tapered off. A cleaner version of this test would require full trade history across the market lifecycle. The result is suggestive, not definitive.
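
    A sketch of the per-market computation (15-minute bins, signed lags; the aggregation across the 124 markets and the significance test are omitted, and variable names are illustrative):

    import pandas as pd

    def lead_lag_profile(suspicious: pd.Series, retail: pd.Series, max_lag: int = 8) -> dict[int, float]:
        """suspicious/retail: trade timestamps. Positive lag = suspicious activity leads retail."""
        s = suspicious.dt.floor("15min").value_counts().sort_index()
        r = retail.dt.floor("15min").value_counts().sort_index()
        idx = s.index.union(r.index)
        s, r = s.reindex(idx, fill_value=0), r.reindex(idx, fill_value=0)
        return {lag: float(s.corr(r.shift(-lag))) for lag in range(-max_lag, max_lag + 1)}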

    [Figure: Wisdom of Crowds Score]

    Three of the four conditions come out strong; diversity is the outlier. The composite score of 62.1/100 places these markets in the mixed / expert-weighted range — not a failed market, but not a democratic crowd either. The three-mode summary: herding LOW, minority price-setting HIGH, causal manipulation LOW.

    What this means

    We can say that prediction markets are capital-weighted systems, not democratic aggregators. Whether that makes them less accurate is a separate question — a highly informed professional (or insider) with $5 million in a market may be a better signal than 10,000 retail traders each putting in $20. But it does mean the "public believes X" framing is wrong. The more accurate citation is "prediction markets currently price X at Y%", not "the public gives X a Y% chance." That distinction matters when the number ends up in a CNN headline.

    Timing adds another layer: even a high-scoring market warrants more caution in the final week before resolution, when the temporal data suggests large wallets are most active and late-stage concentration is highest. The January 2026 volume spike in the dataset is a clear example — trade count had been rising for months, but volume exploded late. That's not broader retail participation; it's large wallets moving aggressively on approach to resolution.

    The cross-platform comparison that was the original goal of this project remains the real gap. Structural inequality in Polymarket is now measurable. Whether Kalshi shows the same patterns is unknown, because Kalshi doesn't expose wallet-level data. The platform with less external oversight is paradoxically the more auditable one: regulators can presumably see Kalshi's account-level activity, outside researchers can't, while on Polymarket the reverse is true. That's the asymmetry that made this analysis possible and also the reason it's incomplete.

    What's still open is the harder question underneath the concentration finding: are the high-frequency wallets signal or noise? If they're informed professionals with genuinely better models, their outsized influence might make prices more accurate, not less. If they're actors with reasons to move prices unrelated to predicting outcomes (to profit from derivatives positions, to move public perception ahead of a decision) then the concentration is a real problem that the independent, decentralized structure of the surrounding network can't fix. Distinguishing between those two interpretations isn't possible from behavioral data alone. Answering the identity question (who these wallets actually belong to) would require regulatory access to Kalshi-style account records, which outside researchers don't have on Polymarket's pseudonymous chain. Answering the accuracy question (whether markets dominated by high-frequency wallets predict outcomes better or worse than distributed ones) is technically doable from public data, but would need full price histories across hundreds of resolved markets, not the end-of-market sample this project collected. That's the natural next step.
