Originally from Jinhua, Enpu is a software engineer currently based in Seattle. He has spent the last few years building the backend components behind enterprise software and, more recently, sed.i, a tool for his own reading and curation. A graduate of Allegheny College with a double major in Computer Science and Classical Piano, he is currently completing the master's program in Information Systems at the University of Washington.

Previously at ScienceLogic, he worked on a webhook service, metric ingestion and post-processing pipelines, and multi-tenant analytics components for IT observability software.

Outside of work, he managed dv, an independent record store and venue in Suzhou focused on vinyl, natural wine, and experimental music, and co-owned a community café in Pittsburgh whose work led to the establishment of a city-wide crisis support program for front-line retail and food-service workers.

Enpu also plays records for dance parties. He has performed at venues across China and Japan; these days you can catch him behind the decks at Otherworld in Seattle every now and then, and hear his latest mix on Apple Music.

GitHub · LinkedIn · Email · CV
  • sed.i — System Architecture Wiki

    [Figure: sed.i Landing Page]

    1. What the app is

    sed.i is, at its core, a read-it-later queue built around saving, searching, and retrieving reading content from across the internet. A user pastes a URL and the system saves it, extracts the full article text and metadata in the background, generates an embedding for semantic search, auto-tags it, and presents it in a clean reader with features for managing the reading process, as well as highlights and cross-article idea connections. It is also tightly integrated with MCP, so a reader can interact with their reading context directly from an LLM client such as Claude.

    The core user loop:

    1. Save a URL from the browser or extension
    2. System extracts content asynchronously
    3. User reads in-app with highlights and notes
    4. Search across the knowledge base semantically, by keyword, or by filter operators
    5. Recommendations surface what to read next
    6. An LLM agent (via MCP) can query and write to the library

    Secondary features: vinyl record collection (Crates), public profiles, list-based markdown editor, and a hosted MCP server with OAuth so AI agents can access the library.

    Browser / Chrome Extension
      └─> Next.js 14 (Vercel)
            └─> FastAPI (Railway)
                  ├─> PostgreSQL + pgvector   ← main data store + vector search
                  ├─> Redis                   ← Celery broker + embedding cache + OAuth codes
                  └─> Celery Workers (Railway)
                        ├─> URL fetch + metadata (requests + BeautifulSoup)
                        ├─> Article text extraction (trafilatura)
                        ├─> PDF layout extraction (PyMuPDF + YOLO/ONNX)
                        ├─> Embedding generation (OpenAI text-embedding-3-small)
                        ├─> Auto-tagging (pgvector similarity + gpt-4o-mini fallback)
                        ├─> AI summarization (OpenAI)
                        └─> Discogs metadata (vinyl)
    
    LLM Agents (Claude Desktop, etc.)
      └─> MCP stdio (local) or MCP HTTP (hosted, OAuth 2.1 + PKCE)
            └─> FastAPI /mcp-transport
    

    2. Data model

    Content Items

    Column | Purpose | Design note
    user_id | Ownership | Every query filters by this. Cross-user leakage is structurally impossible.
    processing_status | Pipeline state | pending → processing → completed / failed. Frontend shows reduced-opacity cards while extraction runs.
    processing_error | Failure detail | Normalized: 401/403/paywall errors are classified as "source access issues", not parser bugs. Items can be processing_status='completed' but have a processing_error (partial extraction — something was extracted but less than expected).
    full_text | Raw HTML | Stored as HTML, not plain text. The reader renders it; search uses a plain-text derivative.
    search_vector | tsvector | Maintained by a PostgreSQL trigger. Dual-dictionary (english + simple) so both stemmed words and acronyms index correctly.
    embedding | Vector(1536) | Populated by Celery after extraction. Null until then — all search paths degrade gracefully.
    read_position | 0.0–1.0 | Scroll progress. Auto-marks is_read=True when ≥ 0.9.
    deleted_at | Soft delete | Rows are never hard-deleted. Enables undo, audit trails, and safe Celery task execution.
    tags / auto_tags | Two-layer tagging | auto_tags are AI suggestions pending review; tags are user-confirmed.
    submitted_via | Source tracking | 'web', 'extension', 'api', 'email' — useful for analytics and pipeline routing decisions.

    Highlights

    A highlight stores character offsets (start_offset, end_offset) into full_text. It also gets an OpenAI embedding — this is what powers cross-article idea connections.

    Writing - WIP

    One draft per list per user. Stores markdown content, optional title, and word count. Auto-created on first PATCH so the frontend can always use a single endpoint without checking for existence first.


    3. Authentication and sessions

    JWT design

    Login returns a signed JWT (sub=email, exp=now+N_minutes). Every request passes it as Authorization: Bearer <token>. FastAPI's get_current_active_user dependency decodes and verifies the token on every request.

    Why JWT over server sessions? JWTs are stateless — any FastAPI instance can verify a token without shared session storage. With sessions you'd need a shared Redis session store; with JWTs the signing key alone is sufficient. This matters because Railway can run multiple web instances.
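
    A minimal sketch of what that token/dependency pair might look like (PyJWT, OAuth2PasswordBearer, and the HS256 algorithm are assumptions for illustration; the actual helpers in the codebase may differ):

    from datetime import datetime, timedelta, timezone

    import jwt  # PyJWT
    from fastapi import Depends, HTTPException, status
    from fastapi.security import OAuth2PasswordBearer

    SECRET_KEY = "change-me"  # loaded from the environment in practice
    ALGORITHM = "HS256"       # assumed signing algorithm
    oauth2_scheme = OAuth2PasswordBearer(tokenUrl="auth/login")

    def create_access_token(email: str, minutes: int = 30) -> str:
        # sub=email, exp=now+N_minutes, as described above
        payload = {"sub": email, "exp": datetime.now(timezone.utc) + timedelta(minutes=minutes)}
        return jwt.encode(payload, SECRET_KEY, algorithm=ALGORITHM)

    def get_current_active_user(token: str = Depends(oauth2_scheme)) -> str:
        # Decodes and verifies the token on every request; any instance holding SECRET_KEY can do this
        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
        except jwt.PyJWTError:
            raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid token")
        return payload["sub"]  # the real dependency would load and return the User row here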

    Trade-off: JWTs can't be revoked before expiry. If a user changes their password, an old token is still valid until it expires. Acceptable for a personal reading app.

    Token storage: localStorage['token'] in the browser. XSS-accessible — a known gap, documented in §20.

    Email verification and password reset

    Both use a VerificationToken table:

    Field | Notes
    token | secrets.token_urlsafe(32) — 43 chars of URL-safe base64
    token_type | "email_verification" (24hr TTL) or "password_reset" (1hr TTL)
    is_used | Prevents replay — marked true on first use
    expires_at | Enforced at validation time

    Password reset invalidates any existing unused reset tokens for the user before creating a new one — prevents confusion from multiple concurrent reset links.

    Emails are sent via Resend (HTTP API) as Celery background tasks.


    4. Content ingestion pipeline

    A URL save triggers a multi-stage Celery chain. The API returns 201 before any of this runs.
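
    A minimal sketch of how such a chain can be wired with Celery primitives (task names are the ones in the diagram below; the actual code may dispatch the later stages from inside extract_metadata rather than from the API handler):

    from celery import chain

    from app.tasks import extract_metadata, generate_embedding, generate_tags  # assumed module path

    def enqueue_ingestion(item_id: int) -> None:
        """Called after the row is created; the HTTP handler has already returned 201."""
        chain(
            extract_metadata.s(item_id),     # stage 1: fetch URL, extract text + metadata
            generate_embedding.si(item_id),  # stage 2: OpenAI embedding
            generate_tags.si(item_id),       # stage 3: auto-tagging
        ).delay()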

    Stage 1: extract_metadata

    fetch URL (requests, follow redirects)
      ├─> detect Content-Type → PDF path or HTML path
      ├─> parse OG / Twitter Card / JSON-LD metadata (BeautifulSoup)
      ├─> detect content_type (article / pdf / video / tweet / unknown)
      ├─> thumbnail fallback chain: OG → Twitter → JSON-LD → link[rel=image_src] → first in-content image
      ├─> run trafilatura (output_format="xml", include_tables/images/links=True)
      │     └─> xml_to_html() converts structured XML → clean HTML
      │         fallback: if <100 chars, convert plain text → <p> tags
      ├─> compute word_count, reading_time_minutes = max(1, round(word_count / 200))
      ├─> run paywall/access heuristics → set processing_error if restricted
      ├─> set processing_status = 'completed'
      └─> chain: generate_embedding.delay() → generate_tags.delay()
    

    trafilatura's XML output preserves structural information (heading levels, table cells, list items) that plain HTML or text loses. The xml_to_html() conversion maps these back to semantic HTML elements the reader can render.

    Paywall detection — 7 heuristics

    Items can reach processing_status='completed' with a processing_error set, meaning: we extracted something, but it's probably not the full article. The _detect_limited_extraction_reason() function applies these checks in order:

    Heuristic | Signal
    Schema markers | JSON-LD isAccessibleForFree: false, content_tier meta tags (paid, premium, subscriber, metered)
    Restriction text | Keywords like "paywall", "subscriber-only", "access options" in page text or DOM class names
    Bot blocking | Status codes 403/429, "cloudflare", "captcha", "blocks bots" in error text
    Coverage ratio | Extracted text / source text < 0.4 AND < 5 paragraphs AND < 1800 chars
    Teaser overlap | OG description found verbatim in extracted text AND extracted < 1200 chars AND < 5 paragraphs
    Media-only | Figure/img present + 0 paragraphs + < 900 chars (image galleries, paywalled articles showing only art)
    Structural truncation | Source has 6+ paragraphs, extracted ≤ 4 AND ratio < 0.4 AND < 1800 chars
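
    As an illustration, the coverage-ratio heuristic amounts to roughly this check (thresholds are copied from the table; this is a sketch, not the actual _detect_limited_extraction_reason code):

    def looks_truncated(extracted_text: str, source_text: str, paragraph_count: int) -> bool:
        """Coverage-ratio signal: far less text extracted than the source page contains."""
        if not source_text:
            return False
        ratio = len(extracted_text) / len(source_text)
        return ratio < 0.4 and paragraph_count < 5 and len(extracted_text) < 1800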

    These map to user-facing categories via ingestErrors.ts on the frontend: blocked, unauthorized, source_access, network, partial, unknown. Each category has its own badge text, severity, and reader fallback message.

    Extension fast path

    The Chrome extension runs Mozilla Readability client-side (in the browser, on the page being viewed) and sends pre_extracted_html in the POST body. The backend:

    1. Calls _clean_extension_html() to strip duplicate title/description/thumbnail elements that Readability might preserve
    2. Sets processing_status='completed' immediately
    3. Still fires extract_metadata.delay() for missing metadata fields and to trigger the embedding chain

    Result: extension-saved articles are fully readable in seconds. The extension also sets pre_extracted_access_restricted=True if detectAccessRestriction() fires — this immediately sets processing_error before Celery even runs.

    Retry semantics

    max_retries=3 on extraction tasks. Each failure saves a processing_error and eventually sets processing_status='failed'. The frontend shows failed items with a specific failure badge based on the error category.


    5. PDF extraction

    PDF extraction is a separate 3-stage pipeline handled by PyMuPDF (fitz) and an ONNX/YOLO layout model.

    Stage 1: Pre-scan

    Runs on every page before any model inference. Pure text analysis — fast and free.

    Per page:

    • Column detection: Measures X-distribution of text blocks. A split at 0.42/0.58 page width indicates two-column layout.
    • Header/footer bands: Detected via horizontal rules from page.get_drawings() — narrow lines (< 3pt height, > 30% page width) in the top 10% or bottom 20% of the page. Falls back to repeating-text detection if no rules are found.
    • Page number rects: Margin blocks matching \d[\d\s]* (max 6 chars) — excluded from reading bounds.
    • Ambiguity scoring: Two-column confidence scored by how evenly text is distributed left/right.

    Document-level:

    • Repeating header/footer texts: Blocks appearing on 3+ pages (at least 40% of all pages), normalized with re.sub(r"\b\d+\b", "#", ...) to match page numbers with different values. These are filtered from output.
    • Column mode: Majority vote across all pages.

    Stage 2: Layout detection

    Three implementations available (configured by environment):

    Implementation | Cost | Speed | Accuracy
    pymupdf_layout (ONNX model) | Free | ~200ms/page | Good
    yolo (torch, local) | Free after download | Variable | Good
    gpt4o_mini_vision | ~$0.03/page | Network-bound | Best

    The YOLO/ONNX approach detects reading regions, figures, tables, and headers as bounding boxes. These are then used to extract text in reading order (column-aware, skipping headers/footers/page numbers).

    Stage 3: Post-processing

    After layout extraction produces raw HTML:

    Step | What
    Title | First <h1> or <h2> → stored as item.title
    Author line removal | Short <p> blocks (< 300 chars) between title and abstract heading → removed (arXiv author/affiliation lines)
    Abstract (arXiv style) | Standalone <h1>ABSTRACT</h1> + following <p> → stored as item.description, removed from body
    Abstract (journal style) | Inline <p>ABSTRACT: text...</p> → same treatment
    Figure thumbnail | First <div class="figure-block"><img src="data:..."/> → stored as item.thumbnail_url, removed from body
    Confidence badge | Injected as <meta> tag; parsed by the Reader

    Why pre-scan first instead of running the model on every page? The pre-scan identifies structural features (columns, headers, footers) that inform how model outputs are interpreted. It also identifies pages that don't need model inference (single-column, clean layout). Running the model on every page without context leads to misrouted reading order in two-column academic papers.


    6. The reader

    The reader is a Reader shell component wrapping ReaderArticle.

    Rendering pipeline

    The stored HTML goes through several transforms before display:

    1. sanitizeContentHtml — strips ephemeral UI elements that might have been saved accidentally
    2. stripDocumentWrappers — removes <html>/<body> wrappers from PDF-extracted content
    3. addHeadingAnchors — generates deduplicated IDs on headings for table-of-contents links
    4. toBionic — bolds the first ~50% of each word; optional toggle in reading settings.

    Scroll position persistence

    read_position (0.0–1.0) is updated via a debounced PATCH /content/{id} as the user scrolls. On open, the reader scrolls to the saved position. Auto-marks as read at ≥ 0.9.

    Reading settings and SSR hydration

    All settings (theme, typography, bionic reading, feature visibility toggles) are stored in ReadingSettingsContext, persisted in localStorage['sedi-reading-settings']. The context initializes from defaults on the server (for SSR) and loads saved values in useLayoutEffect (client-only, after hydration). Components that depend on these settings return null until the hydrated flag is true — this prevents a flash of wrong font/theme on initial load.

    Highlights in the reader

    Highlights are stored with character offsets (start_offset, end_offset) into full_text. When the reader renders, it injects highlight spans by finding character ranges.


    7. Hybrid search

    A single GET /search/semantic?query=... endpoint handles four search strategies, routed by a classifier that runs in <1ms.

    The classifier (app/core/search_router.py)

    Priority-ordered rules — first match wins:

    Query shape | Route | Example
    Contains author:, tag:, site:, after:, before:, is: | SQL filter | author:Paul Graham after:2025-01-01
    Quoted phrase | tsvector keyword | "attention is all you need"
    Looks like a domain | SQL site filter | substack.com
    Matches user's known author name | SQL filter | (user has articles by "Simon Willison")
    Matches user's known tag | SQL filter | music
    Ends with ? or starts with question word | pgvector semantic | how does attention work?
    ≤ 4 words, no question words | tsvector keyword | llm, react hooks
    Everything else | keyword + semantic fused with RRF | building products with AI

    get_user_search_context() preloads the user's known authors and tags before classification — one SQL query, called once per search request.
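
    A compressed sketch of that routing order (rule order matches the table; the regexes and question-word list are illustrative, not copied from search_router.py):

    import re

    FILTER_OPS = re.compile(r"\b(author|tag|site|after|before|is):", re.IGNORECASE)
    QUESTION_WORDS = {"how", "what", "why", "when", "where", "who", "which", "does", "is", "can"}

    def route_query(query: str, known_authors: set[str], known_tags: set[str]) -> str:
        q = query.strip()
        ql = q.lower()
        if FILTER_OPS.search(q):
            return "filter"
        if '"' in q:
            return "keyword"
        if re.fullmatch(r"[\w-]+(\.[\w-]+)+", q):       # looks like a domain
            return "filter"
        if ql in known_authors or ql in known_tags:
            return "filter"
        if q.endswith("?") or (ql.split() and ql.split()[0] in QUESTION_WORDS):
            return "semantic"
        if len(q.split()) <= 4:
            return "keyword"
        return "hybrid_rrf"                             # keyword + semantic fused with RRF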

    The three engines

    SQL filter path: Pure SQLAlchemy query, no vector math, no full-text index. Fastest. Tag filter uses EXISTS (SELECT 1 FROM unnest(tags) t WHERE t ILIKE :pat) — case-insensitive partial match across the Postgres array.

    Keyword path: PostgreSQL tsvector full-text search via ts_rank_cd. The search_vector column uses a dual-dictionary trigger:

    • english dictionary: stems words (running → run, articles → article)
    • simple dictionary: stores tokens as-is (LLMs → llms, RAG → rag, API → api)
    • Combined with || so both dictionaries contribute to ranking
    • Single-token queries use prefix matching (to_tsquery('simple', 'llm:*')) so llm matches llms
    • ts_rank_cd with normalization flag 32 caps scores at 0–1
    • If keyword returns 0 results (misspelling, partial word), falls back to semantic silently

    Semantic path: Embeds the query via OpenAI, runs cosine similarity against content_items.embedding in pgvector. Falls back to keyword if embedding fails or no embeddings exist yet.

    Embedding cache

    Query embeddings are cached in Redis (qemb:{sha256(query)[:16]}, 1hr TTL). Searching "llm" five times calls OpenAI once.
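
    A sketch of that read-through cache with the key scheme above (client construction is an assumption; error handling omitted):

    import hashlib
    import json

    import redis
    from openai import OpenAI

    r = redis.Redis()
    client = OpenAI()

    def get_query_embedding(query: str) -> list[float]:
        key = f"qemb:{hashlib.sha256(query.encode()).hexdigest()[:16]}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        emb = client.embeddings.create(model="text-embedding-3-small", input=query).data[0].embedding
        r.setex(key, 3600, json.dumps(emb))  # 1-hour TTL
        return emb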

    Reciprocal Rank Fusion (RRF)

    When two or more engines run, ranked lists merge with RRF:

    score(item) = Σ  1 / (60 + rank_in_list)
                  all lists containing item
    

    The constant 60 (from the original RRF paper) dampens the benefit of high rank — a rank-1 result doesn't dominate completely. RRF scores are not percentages — they're only meaningful for sorting. The frontend does not display them.
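
    In code, the fusion is a few lines (a sketch; inputs are ranked lists of item IDs):

    def rrf_fuse(ranked_lists: list[list[int]], k: int = 60) -> list[int]:
        """Merge ranked lists with Reciprocal Rank Fusion; output is sorted by fused score."""
        scores: dict[int, float] = {}
        for ranking in ranked_lists:
            for rank, item_id in enumerate(ranking, start=1):
                scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)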

    Three-way fusion (mode=full): Used by the SearchModal. Filter + keyword + semantic all run simultaneously. Keyword + semantic are fused first, then fused with filter results. This maximizes recall — every result from every engine is considered.

    SearchBar (navbar): mode=auto — classifier picks the cheapest path. Returns 5 results inline. No OpenAI call for keyword/filter queries.

    SearchModal (Cmd+K): mode=full — always runs all three engines. Returns 10 results per page with prev/next pagination. Date preset chips: last 7 days / 30 days / 3 months / last year / custom range. Custom range uses cross-validated min/max on the date inputs to prevent invalid ranges.


    8. Auto-tagging

    Goal: Suggest relevant tags without spending money on every article.

    Two-pass strategy

    Pass 1 — free (pgvector similarity):

    • Find the user's already-tagged content with cosine distance < 0.25 (very similar articles)
    • If ≥ 2 tags appear across ≥ 2 similar articles → auto-accept those tags immediately

    Pass 2 — cheap LLM (gpt-4o-mini):

    • Only runs if pass 1 finds nothing (new library, first article in a topic area)
    • Sends title + description + first 800 words of plain text to gpt-4o-mini
    • Parses JSON array from response

    Why this design over always calling GPT? At 1000 articles, calling gpt-4o-mini for every article costs ~$0.20/month. But at 10,000 articles with a well-tagged library, pass 1 handles ~80% of new articles for free. The cost curve flattens as the library grows — the system gets cheaper per article over time.
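
    A sketch of the pass-1 vote, assuming neighbors with cosine distance < 0.25 have already been fetched via pgvector (the helper name is hypothetical; thresholds mirror the bullets above):

    from collections import Counter

    def vote_tags(similar_items: list[dict], min_articles: int = 2) -> list[str]:
        """similar_items: [{'tags': [...]}, ...] for the user's already-tagged nearest neighbors."""
        counts = Counter(tag for item in similar_items for tag in set(item["tags"]))
        accepted = [tag for tag, n in counts.items() if n >= min_articles]
        # Auto-accept only when >= 2 tags recur across >= 2 similar articles; otherwise fall through to gpt-4o-mini
        return accepted if len(accepted) >= 2 else []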


    9. Highlights and idea connections

    Embeddings on highlights

    After a highlight is created, generate_embedding embeds the highlight text (same OpenAI model, same Celery task used for articles). The 1536-dim vector is stored in highlights.embedding.

    Cross-article connections

    GET /search/connections/{highlight_id}:

    SELECT h.id, h.text, h.color, ci.title,
           (1 - (h.embedding <=> CAST(:source AS vector))) as similarity
    FROM highlights h
    JOIN content_items ci ON h.content_item_id = ci.id
    WHERE h.user_id = :uid
      AND h.content_item_id != :source_article_id
      AND h.embedding IS NOT NULL
      AND LENGTH(h.text) >= 20
      AND (1 - (h.embedding <=> :source)) >= :threshold
    ORDER BY h.embedding <=> :source
    LIMIT :limit
    

    The LENGTH(h.text) >= 20 filter excludes trivially short highlights that produce noisy embeddings.

    GET /search/connections/article/{content_id} — runs the above for every highlight in an article, groups by connected article, and sorts by total similarity score. This powers the "Connections" tab in the reader.


    10. Lists and drafts

    Lists

    Many-to-many join table (content_list_membership) between lists and content items. Key design decisions:

    • added_by on the join table — preserves who added an item (future sharing feature)
    • List soft deletion doesn't cascade-delete items — items just lose membership
    • GET /lists/{id}/content joins content_items and filters deleted_at IS NULL — items independently soft-deleted don't appear

    Drafts

    Each list can have one markdown draft (one per list per user). The draft is a writing space for notes, summaries, or essays about the list's content.

    API:

    • GET /lists/{id}/draft — 404 if no draft exists yet
    • POST /lists/{id}/draft — explicit creation (409 if already exists)
    • PATCH /lists/{id}/draft — auto-creates if not exists; this is the autosave target
    • DELETE /lists/{id}/draft — removes the draft

    Split-pane reader in list view

    /lists/[id] renders a split-pane layout: list on the left, ReaderArticle on the right. ReaderArticle accepts an embedded prop that switches from window.scroll to a container scroll, so the scroll position and reading progress bar work correctly inside the panel.


    11. MCP server — LLM agent interface

    The MCP (Model Context Protocol) server exposes the library to LLM agents. It has two deployment modes.

    Phase 1: Local stdio (Claude Desktop)

    Launched as a subprocess by Claude Desktop:

    {
      "mcpServers": {
        "sedi": {
          "command": "poetry",
          "args": ["run", "python", "-m", "app.mcp.server"],
          "cwd": "/absolute/path/to/backend",
          "env": { "TOKEN": "<your-sedi-jwt>" }
        }
      }
    }
    

    Phase 2: Hosted HTTP (OAuth 2.1 + PKCE)

    Mounted as an ASGI app at /mcp-transport on the main FastAPI server. Fully compliant OAuth 2.1 with PKCE (S256 only — no plain challenge method).

    OAuth endpoints:

    Endpoint | Standard | Purpose
    /.well-known/oauth-authorization-server | RFC 8414 | Discovery metadata
    /.well-known/oauth-protected-resource | RFC 9728 | Resource server metadata
    register | RFC 7591 | Dynamic client registration
    authorize GET | | Login form (HTML)
    authorize POST | | Credential check + auth code issuance
    token | RFC 6749 | Code exchange + refresh token grant

    Client validation: Static allowlist from MCP_OAUTH_CLIENTS_JSON env var (JSON dict of client_id → [redirect_uri...]). If the env var is empty, any dynamically registered client is accepted. Non-matching client_id or redirect_uri → 400.

    Security: OAuth login form HTML-escapes all reflected parameters (client_id, redirect_uri, state, code_challenge) to prevent reflected XSS. CSRF embedded in OAuth state parameter.

    PKCE? PKCE prevents authorization code interception attacks. Even if an attacker captures the auth code in transit, they can't exchange it without the code verifier that was never sent over the network.
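
    For reference, the S256 relationship between verifier and challenge (standard PKCE, not code lifted from the repo):

    import base64
    import hashlib
    import secrets

    def make_pkce_pair() -> tuple[str, str]:
        verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
        challenge = base64.urlsafe_b64encode(hashlib.sha256(verifier.encode()).digest()).rstrip(b"=").decode()
        return verifier, challenge  # challenge goes to /authorize, verifier goes to /token

    def verify_s256(verifier: str, challenge: str) -> bool:
        expected = base64.urlsafe_b64encode(hashlib.sha256(verifier.encode()).digest()).rstrip(b"=").decode()
        return secrets.compare_digest(expected, challenge)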

    MCP tools (15 total)

    Read tools:

    Tool | What it does
    list_lists() | All reading lists with item counts
    get_list_content(list_id, include_full_text, limit) | Articles in a list. full_text capped at 32,000 chars to prevent LLM context overflow. Max 200 items.
    get_content_item(item_id, include_full_text) | Single article metadata or full content
    search_content(query, limit) | Hybrid search (same engine as the app). Max 50 results.
    find_similar(item_id, limit, threshold) | Cosine similarity search. Default threshold 0.5.
    get_highlights(item_id?, list_id?) | Three modes: all highlights for an article, all highlights in a list, or all highlights across the library (max 100)
    get_draft(list_id) | Writing draft for a list, or null if none
    get_reading_stats() | {total_items, read_count, unread_count, archived_count}
    summarize_list(list_id, style, max_items) | AI summary. Styles: overview, themes, gaps, timeline. gaps style includes draft content to identify what's missing. Cached in-process by (user_id, list_id, content_hash, style).

    Write tools:

    Tool | What it does
    add_content(url) | Save URL, triggers background extraction
    update_draft(list_id, content, title?) | Create or update list draft
    create_list(name, description?) | New reading list
    add_to_list(list_id, item_id) | Add content item to a list

    An LLM agent calling get_list_content with a large list could easily overflow its context window with full article text. The 32k cap prevents this while still giving the agent substantial content to work with. Agents needing full text should call get_content_item per article.


    12. Public profiles

    Users can expose their queue and/or crates publicly via /[username].

    Visibility model

    Three independent toggles:

    • user.is_public — profile visible at all
    • user.is_queue_public — queue items visible
    • user.is_crates_public — vinyl records visible

    Individual items additionally have is_public — a user can have a public queue but hide specific articles.

    No-auth API routes

    /public/u/{username}/... routes have no get_current_active_user dependency. They check is_public flags and return 403 if the profile is private. These are the only unauthenticated API routes in the system (besides /auth).

    Guest reading limit

    The public reader tracks reads in localStorage per profile owner. After 3 reads from the same profile, it shows a signup prompt. Entirely client-side — a conversion nudge, not server-enforced access control.


    13. Crates — vinyl record collection

    A second vertical for managing a vinyl record collection. Users paste a Discogs URL; Celery fetches metadata from the Discogs API asynchronously.

    vinyl_records table (separate from content_items)

    The metadata structure (tracklist JSON, label, catalog number, year, artist) doesn't fit the content_items schema. A separate table is cleaner than adding 10 nullable columns to a content items table that doesn't need them.

    Column | Notes
    discogs_release_id | Integer parsed from URL, used for the Discogs API call
    tracklist | JSONB: [{position, title, duration}, ...]
    videos | JSONB: [{title, uri, duration}, ...] — Discogs links + user-added YouTube links
    status | 'collection', 'wantlist', 'library' — not a boolean wantlist column
    processing_status | Same pending → completed/failed pattern as content items

    Music playback

    PlayerContext manages a queue of QueueTrack[] objects derived from vinyl_records.videos. YouTubePlayer is an invisible div hosting the YouTube IFrame API — plays videos sequentially. Queue persists to localStorage['sedi-player'] across navigation.

    Why YouTube IFrame API instead of an audio element? Discogs stores YouTube links as listening sources. The IFrame API gives programmatic play/pause/next without needing a raw audio URL. The trade-off: requires the video to be embeddable (some YouTube videos disable embedding).


    14. Chrome extension

    MV3 extension. Three components in a chain:

    content.js (injected into every page)

    Runs Mozilla Readability on the page to extract clean article HTML. Also runs detectAccessRestriction() — checks JSON-LD isAccessibleForFree, content_tier meta tags, and known paywall DOM selectors (subscription modal class names, etc.).

    Shows a preview before saving: word count, read time, author, publish date, access-restriction signal. User clicks "Save" → sends payload to service worker.

    service_worker.js

    Calls POST /content with pre_extracted_html, pre_extracted_access_restricted, and all metadata fields. Maps accessRestricted: true → pre_extracted_access_restricted: true, which triggers immediate processing_error flagging on the backend before Celery runs.


    15. Rate limiting

    RateLimitMiddleware applies to POST /content only. Sliding window algorithm:

    • Each user (JWT user ID) has a deque of request timestamps
    • On each request: pop timestamps older than the window; if len(deque) < max_requests, allow and append
    • Limits: 10 requests / 60 seconds AND 50 requests / 3600 seconds
    • 429 response includes Retry-After header and CORS headers

    Known limitation: State is in-memory per process. Multiple Railway instances would each have independent counters — a user could exceed the limit across instances. Production fix: Redis-backed INCR + EXPIRE. Documented, not yet implemented.
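
    A sketch of what that Redis-backed fix could look like (fixed-window INCR + EXPIRE for brevity; the current middleware uses a sliding window instead, and the key scheme here is an assumption):

    import redis

    r = redis.Redis()

    def allow_request(user_id: str, limit: int = 10, window_seconds: int = 60) -> bool:
        """Counter shared across instances: INCR the per-user key, set the TTL on first hit."""
        key = f"ratelimit:{user_id}:{window_seconds}"
        count = r.incr(key)
        if count == 1:
            r.expire(key, window_seconds)
        return count <= limit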


    16. Security posture

    Current protections

    Area | Implementation
    Passwords | bcrypt via passlib
    API auth | JWT signed with SECRET_KEY, expiry enforced on every request
    Cross-user isolation | Every DB query filters by user_id = current_user.id
    CORS | Origin allowlist from ALLOWED_ORIGINS env var
    Rate limiting | Sliding window on POST /content
    XSS in OAuth | HTML-escaped reflected parameters in MCP OAuth login page
    MCP OAuth | Strict client_id + redirect_uri allowlist; PKCE S256 only; CSRF in state
    Error sanitization | No internal query details or stack traces in API responses

    Known gaps (documented, not hidden)

    Risk | Current state | Production fix
    XSS via article HTML | dangerouslySetInnerHTML in reader without sanitization | DOMPurify or sandboxed iframe
    SSRF | Backend fetches user-provided URLs without IP validation | Block internal ranges (169.254.x.x, 10.x.x.x, 127.x.x.x)
    Token storage | localStorage is XSS-accessible | httpOnly cookies + CSRF tokens
    Rate limit in-memory | No cross-instance enforcement | Redis-backed INCR + EXPIRE

    The gaps are accepted trade-offs for a personal-use app at current scale, not oversights. Any engineer picking up this codebase can see exactly what needs hardening before a public launch.


    17. Key design decisions


    Q: Why Celery + Redis instead of FastAPI background tasks?

    FastAPI's BackgroundTasks run in the same process. A slow extraction blocks other requests. Celery runs in a separate worker process — failures, retries, and slow jobs don't affect API latency. Redis is already provisioned for caching and OAuth state, so Celery adds no infrastructure cost.


    Q: Why store article HTML instead of plain text?

    The reader needs structure — headings, code blocks, images, lists. Plain text loses all of that. The trade-off is storage size and XSS risk. XSS is on the known-gaps list (DOMPurify is the fix).


    Q: Why not just use Elasticsearch or Typesense for search?

    pgvector + tsvector together cover keyword, semantic, and filter search in one database. No sync lag, no second datastore, no additional cost. RRF fusion is the same technique production systems use (Elasticsearch 8.x ships it natively). The only thing missing is ANN indexing (HNSW) for billion-scale vector search — not relevant here.


    Q: How does the hybrid search classifier work?

    It's a priority-ordered set of regex heuristics and user data lookups that runs in <1ms with no LLM call. Explicit operators first, then domain detection, then the user's known authors/tags (loaded from DB before classification), then question detection, then word count threshold (≤4 words = keyword, more = hybrid). mode=full bypasses the classifier entirely and runs all three engines simultaneously.


    Q: Why dual-dictionary tsvector instead of just the English dictionary?

    The English stemmer treats "LLMs" as a token it doesn't know how to stem — it comes out as llms. A query for llm (stemmed via English) doesn't match llms. The simple dictionary stores tokens as-is, so llms is indexed and prefix matching llm:* catches it. Running both dictionaries means you get stemming benefits (run/running match) AND acronym accuracy.


    Q: How does the MCP server handle auth in two different modes?

    Phase 1 (stdio): JWT passed as environment variable, decoded by app/mcp/auth.py on every tool call. Simple but requires the user to copy a token from their browser.

    Phase 2 (HTTP): Full OAuth 2.1 + PKCE flow. The MCP client redirects to the sed.i login form, user authenticates, gets an auth code, exchanges it for a sed.i JWT. The end result is the same JWT the regular API uses — no separate token type, no extra verification logic.


    Q: Why character offsets for highlights instead of DOM ranges?

    DOM ranges (XPath, CSS selectors) break when HTML structure changes — e.g. bionic reading transforms, re-extraction, or adding heading anchors. Character offsets into the source full_text string are stable as long as that string doesn't change. This is also why re-extraction after a successful extraction isn't triggered automatically.


    Q: Why JSONB for reading_patterns instead of a separate analytics table?

    The recommendation engine reads one row per request. A separate table would require joining content_items across potentially thousands of rows to compute rolling stats at request time. JSONB on the user row is a deliberate denormalization — stats are maintained incrementally (each article completion updates the rolling window) rather than computed on read.


    Q: Why soft deletes everywhere?

    Hard deletes are irreversible and can crash background Celery tasks holding a reference to a deleted ID. Soft deletes let users undo, preserve audit trails, and keep background tasks safe. The trade-off: every query must include AND deleted_at IS NULL. Missing this is a data leak — it's enforced at the ORM layer in shared query helpers.
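
    A sketch of such a shared helper (SQLAlchemy; the function, model, and session names are illustrative, not the actual query helpers):

    from sqlalchemy import select
    from sqlalchemy.orm import Session

    def live_items_for_user(db: Session, model, user_id: int):
        """Base query used everywhere: ownership filter plus the soft-delete guard."""
        stmt = select(model).where(
            model.user_id == user_id,
            model.deleted_at.is_(None),
        )
        return db.execute(stmt).scalars().all()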


    Q: How does the extension fast path work, and why?

    The extension runs Mozilla Readability client-side (in the browser, on the page being viewed) and sends pre-extracted HTML. The backend skips trafilatura, sets processing_status='completed' immediately, and still fires Celery for embeddings and any missing metadata. Result: extension-saved articles are fully readable in seconds. The extension also detects paywalls client-side (JSON-LD signals, DOM selectors) and signals this with pre_extracted_access_restricted=True — the backend sets processing_error before Celery even runs.


    Q: How does the PDF extraction pipeline differ from article extraction?

    PDFs go through a 3-stage pipeline: (1) a free pre-scan per page that detects column layout, header/footer bands, and page numbers using PyMuPDF text block positions; (2) layout model inference (YOLO/ONNX by default, GPT-4o-mini vision as a higher-accuracy option) to identify reading regions, figures, and tables; (3) post-processing that extracts the title from the first heading, removes author lines from arXiv-style front matter, detects and removes abstracts (stored as description), and extracts the first figure as a thumbnail. The pre-scan informs how model outputs are interpreted, especially for two-column academic papers where reading order matters.


    Q: What can an LLM agent do with the MCP server?

    Read: browse lists, read full article content (capped at 32k chars to prevent context overflow), search the library with the same hybrid engine the app uses, find articles similar to a given one, read all highlights, get reading stats, and generate AI summaries of a list in different styles (overview, themes, gaps, timeline). Write: save new URLs, create lists, add articles to lists, and create/update writing drafts. The gaps summary style cross-references the list's draft to identify what's missing — useful for research writing workflows.


  • Prediction Whale - An Audit of Prediction Market Concentration

    The concept of wisdom of crowds came up a lot in my machine learning courses. The idea is straightforward: decisions made by aggregating information across a group tend to be better than those made by any single member. Prediction markets like Kalshi and Polymarket are built on this premise — they let a broad audience trade on the probability of specific outcomes, and treat the resulting prices as a more accurate read on public sentiment than polls or expert opinion. CNN now cites Kalshi odds during election coverage. Financial desks treat them as a legitimate forecasting signal. The platforms are growing fast, with billions in volume and serious institutional attention, all without charging a transaction fee, which raises an obvious question about what they're actually selling, and to whom.

    The two platforms are less similar than they appear. Kalshi is CFTC-regulated, operates in USD, and requires verified US accounts, with each account tied to a confirmed identity. Polymarket runs on Polygon, settles in USDC, and at the time this project was conducted had no KYC requirement and was pseudonymous, operating outside the US following a 2022 CFTC enforcement action. The groups behind each platform reflect this split: Kalshi is the regulated, institutional-adjacent one; Polymarket is crypto-native. Counterintuitively, that made Polymarket the more auditable of the two: every trade, wallet, and timestamp is on-chain and publicly accessible. Kalshi is accountable in the legal sense but opaque in the data sense; if you want to know who's actually moving its prices, you have to ask Kalshi.

    That asymmetry opened up a few directions worth investigating:

    • If both markets claim to represent public wisdom, odds on the same event should converge. A persistent gap suggests the crowds are different, or one market is less honest about what it's doing, or both.
    • The difference in user base (crypto-native versus regulated US accounts) might mean each market reflects a genuinely different population, equally representative but measuring different things.
    • More centrally: are these odds actually set by a distributed crowd, or by a small number of large wallets? If a Kalshi probability ticks up sharply and CNN reports it as public sentiment, that's a real information-flow problem regardless of whether anyone intended it.
    • Do large traders cause retail participants to change their behavior? A retail trader might have a clear view, see a large wallet moving the other direction, and follow — even if that position has nothing to do with the actual probability of the outcome.

    For an ML course in my grad program, that last question is what this project set out to answer.

    Getting the data

    My starting point was to collect markets that exist on both platforms. If both markets are pricing the same event, you can ask whether the odds converge or diverge, and why.

    The first obstacle was that Polymarket and Kalshi don't organize their markets the same way. Each platform nests binary sub-markets ("Will Trump nominate Judy Shelton?") inside categorical event groupings ("Who will Trump nominate as Fed Chair?"). Matching at the sub-market level is difficult because titles are too platform-specific to align. Matching at the event level worked much better: embedding event titles with a sentence-transformer model from Hugging Face and computing cosine similarity across 9.8 million candidate pairs found 479 matches above the 0.75 threshold and 8 perfect 1.0 matches. Six pairs made the final cut after manual review.
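
    A sketch of that matching step (the specific model name here is an assumption; the 0.75 threshold is the one used above):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

    def match_events(kalshi_titles: list[str], polymarket_titles: list[str], threshold: float = 0.75):
        k_emb = model.encode(kalshi_titles, convert_to_tensor=True, normalize_embeddings=True)
        p_emb = model.encode(polymarket_titles, convert_to_tensor=True, normalize_embeddings=True)
        sims = util.cos_sim(k_emb, p_emb)            # one score per (Kalshi event, Polymarket event) pair
        pairs = [(i, j, float(sims[i, j]))
                 for i in range(sims.shape[0])
                 for j in range(sims.shape[1])
                 if sims[i, j] >= threshold]
        return sorted(pairs, key=lambda t: t[2], reverse=True)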

    Market | Type | Notes
    Champions League Winner | Sports | $1B volume; insider-risk structure (lineup leaks, injury timing)
    US Presidential Election | Political | High volume, high public attention
    Congressional Control | Political |
    Fed Rate Decision | Policy | Single binary outcome
    Trump Cabinet Nominations (×2) | Political | Multiple sub-markets per event

    Running the analysis on a single event type would have told a single-event story, so the diversity across categories mattered. Due to time and compute constraints, trades were capped at 2,000 per sub-market, sampled in reverse chronological order. Each matched event contains many sub-markets. For example, the Champions League event alone has dozens, one per team. The 237,791 total Polymarket trades come from roughly 125 sub-markets at up to 2,000 each, alongside 160,000 Kalshi trades. One clear drawback is that the reverse-chronological sampling skews the dataset toward end-of-market activity, when prices have largely settled and when large wallets are likely most active.

    This sampling choice shaped what the analysis could and couldn't answer. End-of-market behavior is when volume concentration is most visible, which worked in our favor for the behavioral clustering. But it also meant the cross-platform price comparison — the original goal — came back empty. Two markets pricing the same event should converge over time, and testing that properly requires a more extended trade history, not the most-recent-N approach the API forces on you. That limitation is worth stating upfront, because the analysis that follows is primarily about Polymarket's internal structure rather than a head-to-head platform comparison.

    There's also an asymmetry in what each platform's API could give us. Polymarket's blockchain architecture exposes wallet addresses for every trade, making wallet-level behavioral analysis possible. Kalshi's public API returns price, size, and timestamp, but no user identifier. Every trade is anonymous. The entire behavioral analysis depends on being able to group trades by the account that made them, so while the project started as a two-platform comparison, the behavioral work is necessarily Polymarket-only. This is its own kind of irony: the unregulated, pseudonymous platform is the one that's actually auditable from the outside.

    Behavioral fingerprints

    Rather than relying only on the raw metrics the APIs return, the methodological bet of this project is that trading behavior — how someone trades, not just how much — is informative enough to characterize who's in the market and what they're doing.

    Eleven features per wallet, organized around four dimensions:

    Dimension | Features | What it's asking
    Scale | total_volume, avg_trade_size | How much does this wallet trade?
    Activity | num_trades, trade_freq_per_hour | How often does it trade?
    Timing | timing_entropy, activity_span | When and how predictably does it trade?
    Direction & breadth | buy_ratio, market_diversification, volume_concentration, win_rate, position_size_variance | Does it show systematic bias, and across how many markets?

    After standardization, all features contribute roughly equal variance, so no single dimension drowns out the others. The more interesting finding from the correlation matrix is that timing entropy is nearly orthogonal to volume features, which means the dataset captures two genuinely independent behavioral dimensions. How much someone trades and how they distribute that activity across time don't necessarily go together: a wallet that concentrates all its activity in a single overnight session looks completely different from one that trades across every hour of the day, even at identical total volume. That distinction turned out to be one of the cleaner separators between the normal clusters and the wallets that didn't fit anywhere.

    Who is actually in this market

    We applied clustering to answer this question, and deliberately chose DBSCAN over something like K-means because it lets the density of the feature space determine both the number and shape of clusters, which is more appropriate for an exploratory question like this one. It also doesn't assume spherical clusters, which matters when behavioral groups can take irregular shapes in high-dimensional space. With a parameter sweep optimizing for silhouette score, the final parameters (eps = 0.3, min_samples = 10) produced 104 clusters and 4,446 noise points, with a silhouette score of 0.7017.
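
    A sketch of the clustering step with those final parameters (feature-matrix construction is assumed; whether the reported silhouette excludes noise points is also an assumption here):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    def cluster_wallets(features: np.ndarray):
        """features: one row per wallet, eleven behavioral columns."""
        X = StandardScaler().fit_transform(features)
        labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
        clustered = labels != -1                     # -1 marks DBSCAN noise points
        score = float("nan")
        if len(set(labels[clustered])) > 1:
            score = silhouette_score(X[clustered], labels[clustered])
        return labels, score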

    • 5 clusters at 1,000+ wallets — the dominant retail archetypes; clusters 4 and 5 alone cover ~32,000 wallets
    • 23 clusters at 100–999 — solid mid-size behavioral groups
    • 75 clusters at 10–99 — niche behavioral styles, small but not arbitrary; min_samples = 10 means each required at least 10 wallets to form

    The long tail of small clusters reflects genuine behavioral heterogeneity in a large dataset — traders who share enough features to be grouped together but don't fit the dominant archetypes. What the cluster taxonomy captures well is the dominant archetypes; what matters more for the analysis is the noise-point group, whose behavioral profile holds regardless of how the normal clusters are divided.

    A note on what noise means here: DBSCAN labels a point as noise not because it's anomalous in a statistical sense, but because it lacks enough nearby neighbors in feature space to form or join a dense region. These wallets aren't outliers by definition; they just didn't cluster. What makes the group interesting is what their features actually show once you look at them directly.

    [Figure: Who Is Actually Moving This Market?]

    The cluster size distribution shows the largest normal cluster at 16,309 wallets — one-time retail speculators who bet once and never return. The trading frequency distribution on a log scale makes the separation visible: the mass of wallets sits near 0–5 trades per hour, while a thin tail extends past 175. That tail is not retail.

    The noise-point (suspicious) wallets average $49,790 per wallet in volume — roughly 30× the typical retail cluster — with timing entropy of 0.89 versus near-zero for retail. Retail wallets make one or two trades, usually at the same hour, usually all on one side. The high-frequency group makes 28 trades on average, spread across multiple markets, at hours that don't cluster.

    [Figure: Trader Behavior Archetypes]

    By count, Buy & Hold dominates; the high-frequency wallets are a small bar. Flip to average volume on a log scale, and that bar towers over everything else by orders of magnitude. The buy/sell bias chart shows that Whale and Buy & Hold wallets are 100% buyers, while the high-frequency group leans 64% buy — some directional conviction, but without the pure-accumulation signature of retail.

    [Figure: Detecting Structural Manipulators vs. Retail Participants]

    The timing entropy histogram is the clearest separator. The normal population spikes sharply near zero — nearly all retail wallets make one-off bets at a single hour and stop. The noise-point wallets spread across a wide entropy range centered around 0.89. The buy ratio distributions show something subtler: normal wallets pile up at 0 and 1 (pure sellers and pure buyers), while the high-frequency group shows a flatter distribution centered around 0.6–0.7, which is a different behavioral mode from anything in the 104 normal clusters.

    Those 4,446 wallets — 7% of participants — account for 69.8% of all trading volume.

    Concentration and network structure

    [Figure: How Concentrated Is the Trading Power?]

    The Lorenz curve hugs the bottom-left corner before shooting nearly vertically at the far right, which is the classic extreme-inequality shape. The Gini coefficient of 0.958 puts these markets toward the far end of concentration: the top 1% of wallets control 66.5% of volume, while the bottom 90% control 5%. No single wallet individually dominates — the top four each hold about 1.7% of total volume — so the concentration is in a group, not a monopolist.
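
    For reference, the Gini coefficient behind that curve can be computed directly from per-wallet volumes (a standard formulation, not the project's exact code):

    import numpy as np

    def gini(volumes: np.ndarray) -> float:
        """Gini of a non-negative array: 0 = perfect equality, ~1 = maximal concentration."""
        v = np.sort(volumes.astype(float))
        n = v.size
        cum = np.cumsum(v)
        # 1 minus twice the area under the Lorenz curve, via the trapezoid rule
        return float((n + 1 - 2 * np.sum(cum) / cum[-1]) / n)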

    The clustering characterizes wallets in isolation: each wallet's behavioral profile is computed from its own trades, with no reference to what other wallets were doing at the same time. That's useful for identifying who's in the market, but it can't answer whether the high-volume group is acting in concert. I also wanted to understand whether they're moving together, which is what the network analysis is for.

    The co-trading graph was built from shared-side activity: two wallets get an edge if they traded on the same side within the same one-hour window on the same market. Same-side co-trading in a tight window is a necessary condition for coordination, though not sufficient. It could also capture traders independently reaching the same conclusion at the same time.
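
    A sketch of that edge rule (column names are assumptions based on the description above):

    from itertools import combinations

    import networkx as nx
    import pandas as pd

    def build_cotrading_graph(trades: pd.DataFrame) -> nx.Graph:
        """trades columns: wallet, market_id, side, timestamp (datetime64)."""
        g = nx.Graph()
        trades = trades.assign(hour=trades["timestamp"].dt.floor("h"))
        for _, group in trades.groupby(["market_id", "side", "hour"]):
            for a, b in combinations(group["wallet"].unique(), 2):
                g.add_edge(a, b)   # same side, same market, same one-hour window
        return g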

    Network metric | Value
    Nodes | 63,793
    Edges | 4,127,204
    Graph density | 0.0020
    Communities (Louvain) | 1,261
    Modularity | 0.833
    Freeman centralization | ~0

    A density of 0.002 means the graph is sparse: most wallet pairs never co-trade. The modularity of 0.833 is high relative to typical real-world networks (which score 0.3–0.7), meaning the communities are genuinely coherent: groups of wallets that trade together far more than chance would predict, with sparse connections between groups. Freeman centralization near zero means no single wallet acts as a hub connecting the rest of the network.

    That's the decentralization finding: diffuse, no coordination center, no dominant node. But network decentralization is not the same as capital equality. The trading network looks dispersed. The money isn't. Those two things can coexist, and here they do.

    As a sanity check, DBSCAN cluster membership was compared against Louvain community membership. The two methods measure different things — behavioral feature similarity versus temporal co-trading structure — so perfect alignment would be suspicious, but no alignment would suggest they're picking up completely unrelated signals. The largest network community had 35.8% of its members from DBSCAN cluster 5 (the buy-and-hold retail cluster), which is meaningfully above that cluster's base rate in the population (~25%). Two different lenses, picking up related but distinct structure.

    Are these markets being manipulated?

    Surowiecki's framework for crowd wisdom requires four conditions: diversity, independence, decentralization, and aggregation. Each has a measurable proxy, though translating a qualitative framework into numbers involves judgment calls — the sub-scores below are meant to be indicative rather than precise.

    Condition | Finding | Sub-score
    Diversity | 69.8% of volume controlled by 7% of wallets | 30.2 / 100
    Independence | High-frequency wallets show no directional coordination with each other | 93.0 / 100
    Decentralization | Network Freeman centralization ≈ 0; 1,261 communities | 100.0 / 100
    Aggregation | 1,261 communities actively forming prices across markets | 83.3 / 100

    On independence: the high-frequency wallets' buy ratio standard deviation is 0.32, essentially the same as retail's 0.34, which means they aren't agreeing with each other on direction any more than retail traders do. If 4,446 wallets were coordinating, you'd expect their bets to cluster toward one side across markets. They don't — the directional spread looks like independent actors, not a coordinated group.

    On decentralization: as covered in the network section above, the trading graph is structurally dispersed. The caveat worth repeating is that this doesn't resolve the capital concentration finding. A decentralized network where one subset holds 70% of the capital is still a concentration problem — the decentralization just means that subset isn't acting as a single coordinated entity.

    Diversity is the condition that fails, and it's the most consequential one. The volume-weighted price in a capital-heavy market will reflect the convictions of the large wallets more than the small ones, regardless of whether those large wallets are coordinating. That's just how capital weighting works.

    The most direct test of active manipulation was temporal causality: do the high-frequency wallets trade before retail, moving prices and inducing the crowd to follow?

    [Figure: Does Suspicious Wallet Activity Lead Retail Activity?]

    Lead-lag cross-correlation across 124 markets in 15-minute bins. If the high-frequency wallets were systematically front-running retail, the red bars (high-frequency leads) would be clearly taller than the blue. The profile is nearly symmetric — directional asymmetry of 0.008, p = 0.531, not significant. Worth noting that the reverse-chronological sampling means this test is weighted toward end-of-market behavior, when large wallets are most active and retail has often already tapered off. A cleaner version of this test would require full trade history across the market lifecycle. The result is suggestive, not definitive.
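
    A sketch of the per-market computation (15-minute bins, signed lags; the aggregation across the 124 markets and the significance test are omitted, and variable names are illustrative):

    import pandas as pd

    def lead_lag_profile(suspicious: pd.Series, retail: pd.Series, max_lag: int = 8) -> dict[int, float]:
        """suspicious/retail: trade timestamps. Positive lag = suspicious activity leads retail."""
        s = suspicious.dt.floor("15min").value_counts().sort_index()
        r = retail.dt.floor("15min").value_counts().sort_index()
        idx = s.index.union(r.index)
        s, r = s.reindex(idx, fill_value=0), r.reindex(idx, fill_value=0)
        return {lag: float(s.corr(r.shift(-lag))) for lag in range(-max_lag, max_lag + 1)}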

    [Figure: Wisdom of Crowds Score]

    Three of the four conditions come out strong; diversity is the outlier. The composite score of 62.1/100 places these markets in the mixed / expert-weighted range — not a failed market, but not a democratic crowd either. The three-mode summary: herding LOW, minority price-setting HIGH, causal manipulation LOW.

    What this means

    We can say that prediction markets are capital-weighted systems, not democratic aggregators. Whether that makes them less accurate is a separate question — a highly informed professional (or insider) with $5 million in a market may be a better signal than 10,000 retail traders each putting in $20. But it does mean the "public believes X" framing is wrong. The more accurate citation is "prediction markets currently price X at Y%", not "the public gives X a Y% chance." That distinction matters when the number ends up in a CNN headline.

    Timing adds another layer: even a high-scoring market warrants more caution in the final week before resolution, when the temporal data suggests large wallets are most active and late-stage concentration is highest. The January 2026 volume spike in the dataset is a clear example — trade count had been rising for months, but volume exploded late. That's not broader retail participation; it's large wallets moving aggressively on approach to resolution.

    The cross-platform comparison that was the original goal of this project remains the real gap. Structural inequality in Polymarket is now measurable. Whether Kalshi shows the same patterns is unknown, because Kalshi doesn't expose wallet-level data. The platform with less external oversight is paradoxically the more auditable one: regulators can presumably see Kalshi's account-level activity, outside researchers can't, while on Polymarket the reverse is true. That's the asymmetry that made this analysis possible and also the reason it's incomplete.

    What's still open is the harder question underneath the concentration finding: are the high-frequency wallets signal or noise? If they're informed professionals with genuinely better models, their outsized influence might make prices more accurate, not less. If they're actors with reasons to move prices unrelated to predicting outcomes (to profit from derivatives positions, to move public perception ahead of a decision) then the concentration is a real problem that the independent, decentralized structure of the surrounding network can't fix. Distinguishing between those two interpretations isn't possible from behavioral data alone. Answering the identity question (who these wallets actually belong to) would require regulatory access to Kalshi-style account records, which outside researchers don't have on Polymarket's pseudonymous chain. Answering the accuracy question (whether markets dominated by high-frequency wallets predict outcomes better or worse than distributed ones) is technically doable from public data, but would need full price histories across hundreds of resolved markets, not the end-of-market sample this project collected. That's the natural next step.
