Originally from Jinhua, Enpu is a software engineer currently based in Seattle. He has spent the last few years building the backend components behind enterprise software and, more recently, sed.i, a tool for his own personal reading and curation. A graduate of Allegheny College with double majors in Computer Science and Classical Piano, he is currently completing a master's in information systems at the University of Washington.
Previously at ScienceLogic, he worked on the webhook service, metric ingestion and post-processing pipelines, and multi-tenant analytics components for IT observability software.
Outside of work, he managed dv, an independent record store and venue in Suzhou focused on vinyl, natural wine, and experimental music, and co-owned a community café in Pittsburgh whose work led to the establishment of a city-wide crisis support program for front-end retail and food service workers.
Enpu also plays records for dance parties. He has performed at many venues in China and Japan; these days you can catch him behind the decks at Otherworld in Seattle every now and then, and find his latest mix on Apple Music.
sed.i — System Architecture Wiki
sed.i Landing Page

1. What the app is
sed.i is, at its core, a read-it-later queue built around saving, searching, and retrieving reading content from across the internet. A user pastes a URL and the system saves it, extracts the full article text and metadata in the background, generates an embedding for semantic search, auto-tags it, and presents it in a clean reader with features to manage reading progress, along with highlights and cross-article idea connections. It is also tightly integrated with MCP, so a reader can interact with their reading context directly from an LLM client such as Claude.
The core user loop:
- Save a URL from the browser or extension
- System extracts content asynchronously
- User reads in-app with highlights and notes
- Search across the knowledge base semantically, by keyword, or by filter operators
- Recommendations surface what to read next
- An LLM agent (via MCP) can query and write to the library
Secondary features: vinyl record collection (Crates), public profiles, list-based markdown editor, and a hosted MCP server with OAuth so AI agents can access the library.
```
Browser / Chrome Extension
 └─> Next.js 14 (Vercel)
      └─> FastAPI (Railway)
           ├─> PostgreSQL + pgvector  ← main data store + vector search
           ├─> Redis                  ← Celery broker + embedding cache + OAuth codes
           └─> Celery Workers (Railway)
                ├─> URL fetch + metadata (requests + BeautifulSoup)
                ├─> Article text extraction (trafilatura)
                ├─> PDF layout extraction (PyMuPDF + YOLO/ONNX)
                ├─> Embedding generation (OpenAI text-embedding-3-small)
                ├─> Auto-tagging (pgvector similarity + gpt-4o-mini fallback)
                ├─> AI summarization (OpenAI)
                └─> Discogs metadata (vinyl)

LLM Agents (Claude Desktop, etc.)
 └─> MCP stdio (local) or MCP HTTP (hosted, OAuth 2.1 + PKCE)
      └─> FastAPI /mcp-transport
```
2. Data model
Content Items
| Column | Purpose | Design note |
|---|---|---|
| `user_id` | Ownership | Every query filters by this. Cross-user leakage is structurally impossible. |
| `processing_status` | Pipeline state | `pending` → `processing` → `completed` / `failed`. Frontend shows reduced-opacity cards while extraction runs. |
| `processing_error` | Failure detail | Normalized: 401/403/paywall errors are classified as "source access issues", not parser bugs. Items can be `processing_status='completed'` but have a `processing_error` (partial extraction — something was extracted but less than expected). |
| `full_text` | Raw HTML | Stored as HTML, not plain text. The reader renders it; search uses a plain-text derivative. |
| `search_vector` | tsvector | Maintained by a PostgreSQL trigger. Dual-dictionary (english + simple) so both stemmed words and acronyms index correctly. |
| `embedding` | Vector(1536) | Populated by Celery after extraction. Null until then — all search paths degrade gracefully. |
| `read_position` | 0.0–1.0 | Scroll progress. Auto-marks `is_read=True` when ≥ 0.9. |
| `deleted_at` | Soft delete | Rows are never hard-deleted. Enables undo, audit trails, and safe Celery task execution. |
| `tags` / `auto_tags` | Two-layer tagging | `auto_tags` are AI suggestions pending review; `tags` are user-confirmed. |
| `submitted_via` | Source tracking | `'web'`, `'extension'`, `'api'`, `'email'` — useful for analytics and pipeline routing decisions. |

Highlights
A highlight stores character offsets (`start_offset`, `end_offset`) into `full_text`. It also gets an OpenAI embedding — this is what powers cross-article idea connections.

Writing - WIP
One draft per list per user. Stores markdown content, optional title, and word count. Auto-created on first `PATCH` so the frontend can always use a single endpoint without checking for existence first.
3. Authentication and sessions
JWT design
Login returns a signed JWT (`sub=email`, `exp=now+N_minutes`). Every request passes it as `Authorization: Bearer <token>`. FastAPI's `get_current_active_user` dependency decodes and verifies the token on every request.

Why JWT over server sessions? JWTs are stateless — any FastAPI instance can verify a token without shared session storage. With sessions you'd need a shared Redis session store; with JWTs the signing key alone is sufficient. This matters because Railway can run multiple web instances.
Trade-off: JWTs can't be revoked before expiry. If a user changes their password, an old token is still valid until it expires. Acceptable for a personal reading app.
Token storage: `localStorage['token']` in the browser. XSS-accessible — a known gap, documented in §20.

Email verification and password reset
Both use a `VerificationToken` table:

| Field | Notes |
|---|---|
| `token` | `secrets.token_urlsafe(32)` — 43 chars of URL-safe base64 |
| `token_type` | `"email_verification"` (24hr TTL) or `"password_reset"` (1hr TTL) |
| `is_used` | Prevents replay — marked true on first use |
| `expires_at` | Enforced at validation time |

Password reset invalidates any existing unused reset tokens for the user before creating a new one — prevents confusion from multiple concurrent reset links.
Emails are sent via Resend (HTTP API) as Celery background tasks.
4. Content ingestion pipeline
A URL save triggers a multi-stage Celery chain. The API returns 201 before any of this runs.
Stage 1: `extract_metadata`

```
fetch URL (requests, follow redirects)
 ├─> detect Content-Type → PDF path or HTML path
 ├─> parse OG / Twitter Card / JSON-LD metadata (BeautifulSoup)
 ├─> detect content_type (article / pdf / video / tweet / unknown)
 ├─> thumbnail fallback chain: OG → Twitter → JSON-LD → link[rel=image_src] → first in-content image
 ├─> run trafilatura (output_format="xml", include_tables/images/links=True)
 │    └─> xml_to_html() converts structured XML → clean HTML
 │        fallback: if <100 chars, convert plain text → <p> tags
 ├─> compute word_count, reading_time_minutes = max(1, round(word_count / 200))
 ├─> run paywall/access heuristics → set processing_error if restricted
 ├─> set processing_status = 'completed'
 └─> chain: generate_embedding.delay() → generate_tags.delay()
```

trafilatura's XML output preserves structural information (heading levels, table cells, list items) that plain HTML or text loses. The `xml_to_html()` conversion maps these back to semantic HTML elements the reader can render.

Paywall detection — 7 heuristics
Items can reach `processing_status='completed'` with a `processing_error` set, meaning: we extracted something, but it's probably not the full article. The `_detect_limited_extraction_reason()` function applies these checks in order:

| Heuristic | Signal |
|---|---|
| Schema markers | JSON-LD `isAccessibleForFree: false`, `content_tier` meta tags (`paid`, `premium`, `subscriber`, `metered`) |
| Restriction text | Keywords like "paywall", "subscriber-only", "access options" in page text or DOM class names |
| Bot blocking | Status codes 403/429, "cloudflare", "captcha", "blocks bots" in error text |
| Coverage ratio | Extracted text / source text < 0.4 AND < 5 paragraphs AND < 1800 chars |
| Teaser overlap | OG description found verbatim in extracted text AND extracted < 1200 chars AND < 5 paragraphs |
| Media-only | Figure/img present + 0 paragraphs + < 900 chars (image galleries, paywalled articles showing only art) |
| Structural truncation | Source has 6+ paragraphs, extracted ≤ 4 AND ratio < 0.4 AND < 1800 chars |

These map to user-facing categories via `ingestErrors.ts` on the frontend: `blocked`, `unauthorized`, `source_access`, `network`, `partial`, `unknown`. Each category has its own badge text, severity, and reader fallback message.

Extension fast path
The Chrome extension runs Mozilla Readability client-side (in the browser, on the page being viewed) and sends `pre_extracted_html` in the POST body. The backend:

- Calls `_clean_extension_html()` to strip duplicate title/description/thumbnail elements that Readability might preserve
- Sets `processing_status='completed'` immediately
- Still fires `extract_metadata.delay()` for missing metadata fields and to trigger the embedding chain

Result: extension-saved articles are fully readable in seconds. The extension also sets `pre_extracted_access_restricted=True` if `detectAccessRestriction()` fires — this immediately sets `processing_error` before Celery even runs.

Retry semantics
`max_retries=3` on extraction tasks. Each failure saves a `processing_error` and eventually sets `processing_status='failed'`. The frontend shows failed items with a specific failure badge based on the error category.
5. PDF extraction
PDF extraction is a separate 3-stage pipeline handled by PyMuPDF (fitz) and an ONNX/YOLO layout model.
Stage 1: Pre-scan
Runs on every page before any model inference. Pure text analysis — fast and free.
Per page:
- Column detection: Measures X-distribution of text blocks. A split at 0.42/0.58 page width indicates two-column layout.
- Header/footer bands: Detected via horizontal rules from `page.get_drawings()` — narrow lines (< 3pt height, > 30% page width) in the top 10% or bottom 20% of the page. Falls back to repeating-text detection if no rules are found.
- Page number rects: Margin blocks matching `\d[\d\s]*` (max 6 chars) — excluded from reading bounds.
- Ambiguity scoring: Two-column confidence scored by how evenly text is distributed left/right.
Document-level:
- Repeating header/footer texts: Blocks appearing on 3+ pages (at least 40% of all pages), normalized with `re.sub(r"\b\d+\b", "#", ...)` to match page numbers with different values. These are filtered from output.
- Column mode: Majority vote across all pages.
Stage 2: Layout detection
Three implementations available (configured by environment):
| Implementation | Cost | Speed | Accuracy |
|---|---|---|---|
| `pymupdf_layout` (ONNX model) | Free | ~200ms/page | Good |
| `yolo` (torch, local) | Free after download | Variable | Good |
| `gpt4o_mini_vision` | ~$0.03/page | Network-bound | Best |

The YOLO/ONNX approach detects reading regions, figures, tables, and headers as bounding boxes. These are then used to extract text in reading order (column-aware, skipping headers/footers/page numbers).
Stage 3: Post-processing
After layout extraction produces raw HTML:
| Step | What |
|---|---|
| Title | First `<h1>` or `<h2>` → stored as `item.title` |
| Author line removal | Short `<p>` blocks (< 300 chars) between title and abstract heading → removed (arXiv author/affiliation lines) |
| Abstract (arXiv style) | Standalone `<h1>ABSTRACT</h1>` + following `<p>` → stored as `item.description`, removed from body |
| Abstract (journal style) | Inline `<p>ABSTRACT: text...</p>` → same treatment |
| Figure thumbnail | First `<div class="figure-block"><img src="data:..."/>` → stored as `item.thumbnail_url`, removed from body |
| Confidence badge | Injected as `<meta>` tag; parsed by the Reader |

Why pre-scan first instead of running the model on every page? The pre-scan identifies structural features (columns, headers, footers) that inform how model outputs are interpreted. It also identifies pages that don't need model inference (single-column, clean layout). Running the model on every page without context leads to misrouted reading order in two-column academic papers.
6. The reader
The reader is a `Reader` shell component wrapping `ReaderArticle`.

Rendering pipeline
The stored HTML goes through several transforms before display:
- `sanitizeContentHtml` — strips ephemeral UI elements that might have been saved accidentally
- `stripDocumentWrappers` — removes `<html>`/`<body>` wrappers from PDF-extracted content
- `addHeadingAnchors` — generates deduplicated IDs on headings for table-of-contents links
- `toBionic` (optional) — bolds the first ~50% of each word; toggled in reading settings
Scroll position persistence
`read_position` (0.0–1.0) is updated via a debounced `PATCH /content/{id}` as the user scrolls. On open, the reader scrolls to the saved position. Auto-marks as read at ≥ 0.9.

Reading settings and SSR hydration
All settings (theme, typography, bionic reading, feature visibility toggles) are stored in `ReadingSettingsContext`, persisted in `localStorage['sedi-reading-settings']`. The context initializes from defaults on the server (for SSR) and loads saved values in `useLayoutEffect` (client-only, after hydration). Components that depend on these settings return `null` until the `hydrated` flag is true — this prevents a flash of wrong font/theme on initial load.

Highlights in the reader
Highlights are stored with character offsets (`start_offset`, `end_offset`) into `full_text`. When the reader renders, it injects highlight spans by finding character ranges.
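A minimal sketch of offset-based injection, assuming offsets index into the exact string being rendered. (The real reader maps offsets into `full_text` HTML, which means walking text nodes rather than slicing one string, but the back-to-front insertion trick is the same.)

```python
def inject_highlights(text: str, highlights: list[dict]) -> str:
    """Wrap each [start_offset, end_offset) range in a <mark> span.

    Process ranges from the end of the string backwards so that inserting
    markup never shifts the offsets still waiting to be applied.
    """
    out = text
    for h in sorted(highlights, key=lambda h: h["start_offset"], reverse=True):
        s, e = h["start_offset"], h["end_offset"]
        out = (out[:s]
               + f'<mark class="hl-{h["color"]}">' + out[s:e] + "</mark>"
               + out[e:])
    return out
```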
7. Hybrid search
A single `GET /search/semantic?query=...` endpoint handles four search strategies, routed by a classifier that runs in <1ms.

The classifier (`app/core/search_router.py`)

Priority-ordered rules — first match wins:
| Query shape | Route | Example |
|---|---|---|
| Contains `author:`, `tag:`, `site:`, `after:`, `before:`, `is:` | SQL filter | `author:Paul Graham after:2025-01-01` |
| Quoted phrase | tsvector keyword | `"attention is all you need"` |
| Looks like a domain | SQL site filter | `substack.com` |
| Matches user's known author name | SQL filter | (user has articles by "Simon Willison") |
| Matches user's known tag | SQL filter | `music` |
| Ends with `?` or starts with question word | pgvector semantic | `how does attention work?` |
| ≤ 4 words, no question words | tsvector keyword | `llm`, `react hooks` |
| Everything else | keyword + semantic fused with RRF | `building products with AI` |

`get_user_search_context()` preloads the user's known authors and tags before classification — one SQL query, called once per search request.

The three engines
SQL filter path: Pure SQLAlchemy query, no vector math, no full-text index. Fastest. Tag filter uses `EXISTS (SELECT 1 FROM unnest(tags) t WHERE t ILIKE :pat)` — case-insensitive partial match across the Postgres array.

Keyword path: PostgreSQL `tsvector` full-text search via `ts_rank_cd`. The `search_vector` column uses a dual-dictionary trigger:

- `english` dictionary: stems words (running → run, articles → article)
- `simple` dictionary: stores tokens as-is (LLMs → llms, RAG → rag, API → api)
- Combined with `||` so both dictionaries contribute to ranking
- Single-token queries use prefix matching (`to_tsquery('simple', 'llm:*')`) so `llm` matches `llms`
- `ts_rank_cd` with normalization flag 32 caps scores at 0–1
- If keyword returns 0 results (misspelling, partial word), falls back to semantic silently
Semantic path: Embeds the query via OpenAI, runs cosine similarity against `content_items.embedding` in pgvector. Falls back to keyword if embedding fails or no embeddings exist yet.

Embedding cache

Query embeddings are cached in Redis (`qemb:{sha256(query)[:16]}`, 1hr TTL). Searching "llm" five times calls OpenAI once.

Reciprocal Rank Fusion (RRF)
When two or more engines run, ranked lists merge with RRF:
```
score(item) = Σ over lists containing item of  1 / (60 + rank_in_list)
```

The constant 60 (from the original RRF paper) dampens the benefit of high rank — a rank-1 result doesn't dominate completely. RRF scores are not percentages — they're only meaningful for sorting. The frontend does not display them.
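The fusion step is small enough to show in full. A sketch assuming each engine returns an ordered list of item ids:

```python
def rrf_fuse(ranked_lists: list[list[int]], k: int = 60) -> list[int]:
    """Merge ranked result lists with Reciprocal Rank Fusion.

    score(item) = sum over lists of 1 / (k + rank), with rank starting at 1.
    """
    scores: dict[int, float] = {}
    for ranked in ranked_lists:
        for rank, item_id in enumerate(ranked, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

An item ranked first in both the keyword and semantic lists scores 2/61 ≈ 0.033; an item ranked first in only one list scores 1/61. Fused rank therefore favors agreement between engines without letting any single engine dominate.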
Three-way fusion (`mode=full`): Used by the SearchModal. Filter + keyword + semantic all run simultaneously. Keyword + semantic are fused first, then fused with filter results. This maximizes recall — every result from every engine is considered.

SearchModal vs SearchBar
SearchBar (navbar): `mode=auto` — classifier picks the cheapest path. Returns 5 results inline. No OpenAI call for keyword/filter queries.

SearchModal (Cmd+K): `mode=full` — always runs all three engines. Returns 10 results per page with prev/next pagination. Date preset chips: last 7 days / 30 days / 3 months / last year / custom range. Custom range uses cross-validated `min`/`max` on the date inputs to prevent invalid ranges.
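The priority-ordered routing described in this section can be sketched as a chain of cheap checks. The rule order mirrors the table above, but the regexes, operator handling, and question-word list here are illustrative stand-ins, not the real `app/core/search_router.py`:

```python
import re

OPERATORS = ("author:", "tag:", "site:", "after:", "before:", "is:")
QUESTION_WORDS = {"how", "what", "why", "when", "where", "who",
                  "which", "does", "is", "are", "can"}


def classify(query: str, known_authors: set[str], known_tags: set[str]) -> str:
    """Route a query to 'filter', 'keyword', 'semantic', or 'hybrid'."""
    q = query.strip()
    lower = q.lower()
    if any(op in lower for op in OPERATORS):          # explicit operators win
        return "filter"
    if re.search(r'"[^"]+"', q):                      # quoted phrase
        return "keyword"
    if re.fullmatch(r"[\w-]+(\.[\w-]+)+", q):         # looks like a domain
        return "filter"
    if lower in known_authors or lower in known_tags:
        return "filter"
    words = lower.split()
    if q.endswith("?") or (words and words[0] in QUESTION_WORDS):
        return "semantic"
    if len(words) <= 4:
        return "keyword"
    return "hybrid"                                   # keyword + semantic, RRF-fused
```

First match wins, and every check is string work or a set lookup, which is how the real router stays under a millisecond with no LLM call.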
8. Auto-tagging
Goal: Suggest relevant tags without spending money on every article.
Two-pass strategy
Pass 1 — free (pgvector similarity):
- Find the user's already-tagged content with cosine distance < 0.25 (very similar articles)
- If ≥ 2 tags appear across ≥ 2 similar articles → auto-accept those tags immediately
Pass 2 — cheap LLM (gpt-4o-mini):
- Only runs if pass 1 finds nothing (new library, first article in a topic area)
- Sends title + description + first 800 words of plain text to gpt-4o-mini
- Parses JSON array from response
Why this design over always calling GPT? At 1000 articles, calling gpt-4o-mini for every article costs ~$0.20/month. But at 10,000 articles with a well-tagged library, pass 1 handles ~80% of new articles for free. The cost curve flattens as the library grows — the system gets cheaper per article over time.
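Pass 1's acceptance rule reduces to a vote count over neighbor articles. A sketch assuming the pgvector query (cosine distance < 0.25) has already returned each similar article's tag list:

```python
from collections import Counter


def pass1_tags(neighbor_tags: list[list[str]], min_articles: int = 2) -> list[str]:
    """neighbor_tags: one tag list per similar article.

    Auto-accept a tag when it appears on at least `min_articles` neighbors,
    and only accept when at least two tags qualify; otherwise return []
    and let pass 2 (the gpt-4o-mini call) handle the article.
    """
    # set(tags) so a tag counts once per article, not once per occurrence
    counts = Counter(tag for tags in neighbor_tags for tag in set(tags))
    accepted = [tag for tag, n in counts.items() if n >= min_articles]
    return accepted if len(accepted) >= 2 else []
```

An empty return is the signal to spend money: a sparse or new library falls through to the LLM, while a well-tagged one resolves most saves here for free.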
9. Highlights and idea connections
Embeddings on highlights
After a highlight is created, `generate_embedding` embeds the highlight text (same OpenAI model, same Celery task used for articles). The 1536-dim vector is stored in `highlights.embedding`.

Cross-article connections
`GET /search/connections/{highlight_id}`:

```sql
SELECT h.id, h.text, h.color, ci.title,
       (1 - (h.embedding <=> CAST(:source AS vector))) AS similarity
FROM highlights h
JOIN content_items ci ON h.content_item_id = ci.id
WHERE h.user_id = :uid
  AND h.content_item_id != :source_article_id
  AND h.embedding IS NOT NULL
  AND LENGTH(h.text) >= 20
  AND (1 - (h.embedding <=> :source)) >= :threshold
ORDER BY h.embedding <=> :source
LIMIT :limit
```
The `LENGTH(h.text) >= 20` filter excludes trivially short highlights that produce noisy embeddings.

`GET /search/connections/article/{content_id}` — runs the above for every highlight in an article, groups by connected article, and sorts by total similarity score. This powers the "Connections" tab in the reader.
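For reference, `<=>` is pgvector's cosine distance operator; the similarity and thresholding logic in that query is equivalent to this plain-Python sketch:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1 - cosine distance: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def connections(source: list[float], candidates: dict[int, list[float]],
                threshold: float = 0.5, limit: int = 10) -> list[tuple[int, float]]:
    """Rank candidate highlight embeddings by similarity to the source."""
    scored = [(hid, cosine_similarity(source, emb))
              for hid, emb in candidates.items()]
    return sorted((s for s in scored if s[1] >= threshold),
                  key=lambda s: s[1], reverse=True)[:limit]
```

The database does this in one indexed pass, of course; the sketch is only to make the ordering and threshold semantics concrete.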
10. Lists and drafts
Lists
Many-to-many join table (`content_list_membership`) between lists and content items. Key design decisions:

- `added_by` on the join table — preserves who added an item (future sharing feature)
- List soft deletion doesn't cascade-delete items — items just lose membership
- `GET /lists/{id}/content` joins `content_items` and filters `deleted_at IS NULL` — items independently soft-deleted don't appear
Drafts
Each list can have one markdown draft (one per list per user). The draft is a writing space for notes, summaries, or essays about the list's content.
API:
- `GET /lists/{id}/draft` — 404 if no draft exists yet
- `POST /lists/{id}/draft` — explicit creation (409 if already exists)
- `PATCH /lists/{id}/draft` — auto-creates if not exists; this is the autosave target
- `DELETE /lists/{id}/draft` — removes the draft
Split-pane reader in list view
`/lists/[id]` renders a split-pane layout: list on the left, `ReaderArticle` on the right. `ReaderArticle` accepts an `embedded` prop that switches from `window.scroll` to a container scroll, so the scroll position and reading progress bar work correctly inside the panel.
11. MCP server — LLM agent interface
The MCP (Model Context Protocol) server exposes the library to LLM agents. It has two deployment modes.
Phase 1: Local stdio (Claude Desktop)
Launched as a subprocess by Claude Desktop:
```json
{
  "mcpServers": {
    "sedi": {
      "command": "poetry",
      "args": ["run", "python", "-m", "app.mcp.server"],
      "cwd": "/absolute/path/to/backend",
      "env": { "TOKEN": "<your-sedi-jwt>" }
    }
  }
}
```

Phase 2: Hosted HTTP (OAuth 2.1 + PKCE)
Mounted as an ASGI app at `/mcp-transport` on the main FastAPI server. Fully compliant OAuth 2.1 with PKCE (S256 only — no plain challenge method).

OAuth endpoints:
| Endpoint | Standard | Purpose |
|---|---|---|
| `/.well-known/oauth-authorization-server` | RFC 8414 | Discovery metadata |
| `/.well-known/oauth-protected-resource` | RFC 9728 | Resource server metadata |
| `register` | RFC 7591 | Dynamic client registration |
| `authorize` (GET) | — | Login form (HTML) |
| `authorize` (POST) | — | Credential check + auth code issuance |
| `token` | RFC 6749 | Code exchange + refresh token grant |

Client validation: Static allowlist from the `MCP_OAUTH_CLIENTS_JSON` env var (JSON dict of `client_id → [redirect_uri...]`). If the env var is empty, any dynamically registered client is accepted. Non-matching `client_id` or `redirect_uri` → 400.

Security: The OAuth login form HTML-escapes all reflected parameters (`client_id`, `redirect_uri`, `state`, `code_challenge`) to prevent reflected XSS. A CSRF token is embedded in the OAuth `state` parameter.

Why PKCE? PKCE prevents authorization code interception attacks. Even if an attacker captures the auth code in transit, they can't exchange it without the code verifier; only the verifier's SHA-256 challenge ever appears in the authorization redirect, and the verifier itself is presented over TLS only at the token exchange.
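The verifier/challenge relationship from RFC 7636 (S256 method) is compact enough to sketch with the standard library; function names here are illustrative, not the app's:

```python
import base64
import hashlib
import secrets


def make_verifier() -> str:
    """Client: random high-entropy verifier, kept locally, never in the redirect."""
    return secrets.token_urlsafe(32)


def s256_challenge(verifier: str) -> str:
    """Client: sent with the authorize request (code_challenge_method=S256)."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def server_check(stored_challenge: str, presented_verifier: str) -> bool:
    """Server, at token exchange: recompute the challenge and compare."""
    return secrets.compare_digest(stored_challenge,
                                  s256_challenge(presented_verifier))
```

An interceptor may see the challenge (it passed through the browser), but SHA-256 can't be inverted to recover the verifier, so a stolen auth code alone fails the token exchange.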
MCP tools (15 total)
Read tools:
| Tool | What it does |
|---|---|
| `list_lists()` | All reading lists with item counts |
| `get_list_content(list_id, include_full_text, limit)` | Articles in a list. `full_text` capped at 32,000 chars to prevent LLM context overflow. Max 200 items. |
| `get_content_item(item_id, include_full_text)` | Single article metadata or full content |
| `search_content(query, limit)` | Hybrid search (same engine as the app). Max 50 results. |
| `find_similar(item_id, limit, threshold)` | Cosine similarity search. Default threshold 0.5. |
| `get_highlights(item_id?, list_id?)` | Three modes: all highlights for an article, all highlights in a list, or all highlights across the library (max 100) |
| `get_draft(list_id)` | Writing draft for a list, or null if none |
| `get_reading_stats()` | `{total_items, read_count, unread_count, archived_count}` |
| `summarize_list(list_id, style, max_items)` | AI summary. Styles: `overview`, `themes`, `gaps`, `timeline`. `gaps` style includes draft content to identify what's missing. Cached in-process by (user_id, list_id, content_hash, style). |

Write tools:
| Tool | What it does |
|---|---|
| `add_content(url)` | Save URL, triggers background extraction |
| `update_draft(list_id, content, title?)` | Create or update list draft |
| `create_list(name, description?)` | New reading list |
| `add_to_list(list_id, item_id)` | Add content item to a list |

An LLM agent calling `get_list_content` with a large list could easily overflow its context window with full article text. The 32k cap prevents this while still giving the agent substantial content to work with. Agents needing full text should call `get_content_item` per article.
12. Public profiles
Users can expose their queue and/or crates publicly via `/[username]`.

Visibility model
Three independent toggles:
- `user.is_public` — profile visible at all
- `user.is_queue_public` — queue items visible
- `user.is_crates_public` — vinyl records visible
Individual items additionally have `is_public` — a user can have a public queue but hide specific articles.

No-auth API routes
`/public/u/{username}/...` routes have no `get_current_active_user` dependency. They check `is_public` flags and return 403 if the profile is private. These are the only unauthenticated API routes in the system (besides `/auth`).

Guest reading limit
The public reader tracks reads in `localStorage` per profile owner. After 3 reads from the same profile, it shows a signup prompt. Entirely client-side — a conversion nudge, not server-enforced access control.
13. Crates — vinyl record collection
A second vertical for managing a vinyl record collection. Users paste a Discogs URL; Celery fetches metadata from the Discogs API asynchronously.
`vinyl_records` table (separate from `content_items`)

The metadata structure (tracklist JSON, label, catalog number, year, artist) doesn't fit the `content_items` schema. A separate table is cleaner than adding 10 nullable columns to a content items table that doesn't need them.

| Column | Notes |
|---|---|
| `discogs_release_id` | Integer parsed from URL, used for the Discogs API call |
| `tracklist` | JSONB: `[{position, title, duration}, ...]` |
| `videos` | JSONB: `[{title, uri, duration}, ...]` — Discogs links + user-added YouTube links |
| `status` | `'collection'`, `'wantlist'`, `'library'` — not a boolean `wantlist` column |
| `processing_status` | Same `pending → completed/failed` pattern as content items |

Music playback
`PlayerContext` manages a queue of `QueueTrack[]` objects derived from `vinyl_records.videos`. `YouTubePlayer` is an invisible div hosting the YouTube IFrame API — plays videos sequentially. Queue persists to `localStorage['sedi-player']` across navigation.

Why the YouTube IFrame API instead of an audio element? Discogs stores YouTube links as listening sources. The IFrame API gives programmatic play/pause/next without needing a raw audio URL. The trade-off: it requires the video to be embeddable (some YouTube videos disable embedding).
14. Chrome extension
MV3 extension. Three components in a chain:
`content.js` (injected into every page)

Runs Mozilla Readability on the page to extract clean article HTML. Also runs `detectAccessRestriction()` — checks JSON-LD `isAccessibleForFree`, `content_tier` meta tags, and known paywall DOM selectors (subscription modal class names, etc.).

`popup.js`

Shows a preview before saving: word count, read time, author, publish date, access-restriction signal. User clicks "Save" → sends payload to service worker.

`service_worker.js`

Calls `POST /content` with `pre_extracted_html`, `pre_extracted_access_restricted`, and all metadata fields. Maps `accessRestricted: true` → `pre_extracted_access_restricted: true`, which triggers immediate `processing_error` flagging on the backend before Celery runs.
15. Rate limiting
`RateLimitMiddleware` applies to `POST /content` only. Sliding window algorithm:

- Each user (JWT user ID) has a `deque` of request timestamps
- On each request: pop timestamps older than the window; if `len(deque) < max_requests`, allow and append
- Limits: 10 requests / 60 seconds AND 50 requests / 3600 seconds
- 429 response includes `Retry-After` header and CORS headers
Known limitation: State is in-memory per process. Multiple Railway instances would each have independent counters — a user could exceed the limit across instances. Production fix: Redis-backed `INCR` + `EXPIRE`. Documented, not yet implemented.
16. Security posture
Current protections
| Area | Implementation |
|---|---|
| Passwords | bcrypt via passlib |
| API auth | JWT signed with `SECRET_KEY`, expiry enforced on every request |
| Cross-user isolation | Every DB query filters by `user_id = current_user.id` |
| CORS | Origin allowlist from `ALLOWED_ORIGINS` env var |
| Rate limiting | Sliding window on `POST /content` |
| XSS in OAuth | HTML-escaped reflected parameters in MCP OAuth login page |
| MCP OAuth | Strict `client_id` + `redirect_uri` allowlist; PKCE S256 only; CSRF in `state` |
| Error sanitization | No internal query details or stack traces in API responses |

Known gaps (documented, not hidden)

| Risk | Current state | Production fix |
|---|---|---|
| XSS via article HTML | `dangerouslySetInnerHTML` in reader without sanitization | DOMPurify or sandboxed iframe |
| SSRF | Backend fetches user-provided URLs without IP validation | Block internal ranges (169.254.x.x, 10.x.x.x, 127.x.x.x) |
| Token storage | `localStorage` is XSS-accessible | httpOnly cookies + CSRF tokens |
| Rate limit in-memory | No cross-instance enforcement | Redis-backed `INCR` + `EXPIRE` |

The gaps are accepted trade-offs for a personal-use app at current scale, not oversights. Any engineer picking up this codebase can see exactly what needs hardening before a public launch.
17. Key design decisions
Q: Why Celery + Redis instead of FastAPI background tasks?
FastAPI's `BackgroundTasks` run in the same process. A slow extraction blocks other requests. Celery runs in a separate worker process — failures, retries, and slow jobs don't affect API latency. Redis is already provisioned for caching and OAuth state, so Celery adds no infrastructure cost.
Q: Why store article HTML instead of plain text?
The reader needs structure — headings, code blocks, images, lists. Plain text loses all of that. The trade-off is storage size and XSS risk. XSS is on the known-gaps list (DOMPurify is the fix).
Q: Why not just use Elasticsearch or Typesense for search?
pgvector + tsvector together cover keyword, semantic, and filter search in one database. No sync lag, no second datastore, no additional cost. RRF fusion is the same technique production systems use; Elasticsearch 8.x ships it natively. The only thing missing is ANN indexing (HNSW) for billion-scale vector search — not relevant here.
Q: How does the hybrid search classifier work?
It's a priority-ordered set of regex heuristics and user data lookups that runs in <1ms with no LLM call. Explicit operators first, then domain detection, then the user's known authors/tags (loaded from DB before classification), then question detection, then word count threshold (≤4 words = keyword, more = hybrid).
`mode=full` bypasses the classifier entirely and runs all three engines simultaneously.
Q: Why dual-dictionary tsvector instead of just the English dictionary?
The English stemmer treats "LLMs" as a token it doesn't know how to stem — it comes out as `llms`. A query for `llm` (stemmed via English) doesn't match `llms`. The simple dictionary stores tokens as-is, so `llms` is indexed and prefix matching `llm:*` catches it. Running both dictionaries means you get stemming benefits (run/running match) AND acronym accuracy.
Q: How does the MCP server handle auth in two different modes?
Phase 1 (stdio): JWT passed as an environment variable, decoded by `app/mcp/auth.py` on every tool call. Simple, but requires the user to copy a token from their browser.

Phase 2 (HTTP): Full OAuth 2.1 + PKCE flow. The MCP client redirects to the sed.i login form, the user authenticates, gets an auth code, and exchanges it for a sed.i JWT. The end result is the same JWT the regular API uses — no separate token type, no extra verification logic.
Q: Why character offsets for highlights instead of DOM ranges?
DOM ranges (XPath, CSS selectors) break when HTML structure changes — e.g. bionic reading transforms, re-extraction, or adding heading anchors. Character offsets into the source `full_text` string are stable as long as that string doesn't change. This is also why re-extraction after a successful extraction isn't triggered automatically.
Q: Why JSONB for `reading_patterns` instead of a separate analytics table?

The recommendation engine reads one row per request. A separate table would require joining `content_items` across potentially thousands of rows to compute rolling stats at request time. JSONB on the user row is a deliberate denormalization — stats are maintained incrementally (each article completion updates the rolling window) rather than computed on read.
Q: Why soft deletes everywhere?
Hard deletes are irreversible and can crash background Celery tasks holding a reference to a deleted ID. Soft deletes let users undo, preserve audit trails, and keep background tasks safe. The trade-off: every query must include `AND deleted_at IS NULL`. Missing this is a data leak — it's enforced at the ORM layer in shared query helpers.
Q: How does the extension fast path work, and why?
The extension runs Mozilla Readability client-side (in the browser, on the page being viewed) and sends pre-extracted HTML. The backend skips trafilatura, sets `processing_status='completed'` immediately, and still fires Celery for embeddings and any missing metadata. Result: extension-saved articles are fully readable in seconds. The extension also detects paywalls client-side (JSON-LD signals, DOM selectors) and signals this with `pre_extracted_access_restricted=True` — the backend sets `processing_error` before Celery even runs.
Q: How does the PDF extraction pipeline differ from article extraction?
PDFs go through a 3-stage pipeline: (1) a free pre-scan per page that detects column layout, header/footer bands, and page numbers using PyMuPDF text block positions; (2) layout model inference (YOLO/ONNX by default, GPT-4o-mini vision as a higher-accuracy option) to identify reading regions, figures, and tables; (3) post-processing that extracts the title from the first heading, removes author lines from arXiv-style front matter, detects and removes abstracts (stored as `description`), and extracts the first figure as a thumbnail. The pre-scan informs how model outputs are interpreted, especially for two-column academic papers where reading order matters.
Q: What can an LLM agent do with the MCP server?
Read: browse lists, read full article content (capped at 32k chars to prevent context overflow), search the library with the same hybrid engine the app uses, find articles similar to a given one, read all highlights, get reading stats, and generate AI summaries of a list in different styles (overview, themes, gaps, timeline). Write: save new URLs, create lists, add articles to lists, and create/update writing drafts. The `gaps` summary style cross-references the list's draft to identify what's missing — useful for research writing workflows.

Prediction Whale - An Audit of Prediction Market Concentration
The concept of wisdom of crowds came up a lot in my machine learning courses. The idea is straightforward: decisions made by aggregating information across a group tend to be better than those made by any single member. Prediction markets like Kalshi and Polymarket are built on this premise — they let a broad audience trade on the probability of specific outcomes, and treat the resulting prices as a more accurate read on public sentiment than polls or expert opinion. CNN now cites Kalshi odds during election coverage. Financial desks treat them as a legitimate forecasting signal. The platforms are growing fast, with billions in volume and serious institutional attention, all without charging a transaction fee, which raises an obvious question about what they're actually selling, and to whom.
The two platforms are less similar than they appear. Kalshi is CFTC-regulated, operates in USD, and requires verified US accounts, with each account tied to a confirmed identity. Polymarket runs on Polygon, settles in USDC, and at the time this project was conducted had no KYC requirement and was pseudonymous, operating outside the US following a 2022 CFTC enforcement action. The groups behind each platform reflect this split: Kalshi is the regulated, institutional-adjacent one; Polymarket is crypto-native. Counterintuitively, that made Polymarket the more auditable of the two: every trade, wallet, and timestamp is on-chain and publicly accessible. Kalshi is accountable in the legal sense but opaque in the data sense; if you want to know who's actually moving its prices, you have to ask Kalshi.
That asymmetry opened up a few directions worth investigating:
- If both markets claim to represent public wisdom, odds on the same event should converge. A persistent gap suggests the crowds are different, or one market is less honest about what it's doing, or both.
- The difference in user base (crypto-native versus regulated US accounts) might mean each market reflects a genuinely different population, equally representative but measuring different things.
- More centrally: are these odds actually set by a distributed crowd, or by a small number of large wallets? If a Kalshi probability ticks up sharply and CNN reports it as public sentiment, that's a real information-flow problem regardless of whether anyone intended it.
- Do large traders cause retail participants to change their behavior? A retail trader might have a clear view, see a large wallet moving the other direction, and follow — even if that position has nothing to do with the actual probability of the outcome.
For an ML course in my grad program, that last question is what this project set out to answer.
Getting the data
The starting point was to collect markets that exist on both platforms. If both are pricing the same event, you can ask whether the odds converge or diverge, and why.
The first obstacle was that Polymarket and Kalshi don't organize their markets the same way. Each platform nests binary sub-markets ("Will Trump nominate Judy Shelton?") inside categorical event groupings ("Who will Trump nominate as Fed Chair?"). Matching at the sub-market level is difficult because titles are too platform-specific to align. Matching at the event level worked much better: embedding event titles with a sentence-transformer model from Hugging Face and computing cosine similarity across 9.8 million candidate pairs found 479 matches above the 0.75 threshold and 8 perfect 1.0 matches. Six pairs made the final cut after manual review.
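The matching step reduces to a similarity matrix over embedding vectors. A minimal sketch, assuming the vectors come from a sentence-transformer over event titles (here they are toy 3-dimensional embeddings, and 0.75 is the threshold the project used):

```python
import numpy as np

def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Row-normalize both sides, then a single matmul gives all pairwise
    # cosine similarities: shape (n_polymarket, n_kalshi).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy embeddings standing in for encoded event titles.
poly = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
kalshi = np.array([[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])

sims = cosine_matrix(poly, kalshi)
matches = np.argwhere(sims >= 0.75)  # candidate (poly_idx, kalshi_idx) pairs
```

At 9.8 million candidate pairs the full matrix is still only a few thousand vectors per side, so the brute-force matmul is viable without approximate nearest-neighbor indexing.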
| Market | Type | Notes |
| --- | --- | --- |
| Champions League Winner | Sports | $1B volume; insider-risk structure (lineup leaks, injury timing) |
| US Presidential Election | Political | High volume, high public attention |
| Congressional Control | Political | — |
| Fed Rate Decision | Policy | Single binary outcome |
| Trump Cabinet Nominations (×2) | Political | Multiple sub-markets per event |

Running the analysis on a single event type would have told a single-event story, so the diversity across categories mattered. Due to time and compute constraints, trades were capped at 2,000 per sub-market, sampled in reverse chronological order. Each matched event contains many sub-markets; the Champions League event alone has dozens, one per team. The 237,791 total Polymarket trades come from roughly 125 sub-markets at up to 2,000 each, alongside 160,000 Kalshi trades. One clear drawback is that the reverse-chronological sampling skews the dataset toward end-of-market activity, when prices have largely settled and, plausibly, when large wallets tend to be most active.
This sampling choice shaped what the analysis could and couldn't answer. End-of-market behavior is when volume concentration is most visible, which worked in our favor for the behavioral clustering. But it also meant the cross-platform price comparison — the original goal — came back empty. Two markets pricing the same event should converge over time, and testing that properly requires a more extended trade history, not the most-recent-N approach the API forces on you. That limitation is worth stating upfront, because the analysis that follows is primarily about Polymarket's internal structure rather than a head-to-head platform comparison.
There's also an asymmetry in what each platform's API could give us. Polymarket's blockchain architecture exposes wallet addresses for every trade, making wallet-level behavioral analysis possible. Kalshi's public API returns price, size, and timestamp, but no user identifier. Every trade is anonymous. The entire behavioral analysis depends on being able to group trades by the account that made them, so while the project started as a two-platform comparison, the behavioral work is necessarily Polymarket-only. This is its own kind of irony: the unregulated, pseudonymous platform is the one that's actually auditable from the outside.
Behavioral fingerprints
Rather than relying on the raw metrics the APIs return, the methodological bet of this project is that trading behavior — how someone trades, not just how much — is informative enough to characterize who's in the market and what they're doing.
Eleven features per wallet, organized around four dimensions:
| Dimension | Features | What it's asking |
| --- | --- | --- |
| Scale | `total_volume`, `avg_trade_size` | How much does this wallet trade? |
| Activity | `num_trades`, `trade_freq_per_hour` | How often does it trade? |
| Timing | `timing_entropy`, `activity_span` | When and how predictably does it trade? |
| Direction & breadth | `buy_ratio`, `market_diversification`, `volume_concentration`, `win_rate`, `position_size_variance` | Does it show systematic bias, and across how many markets? |

After standardization, all features contribute roughly equal variance, so no single dimension drowns out the others. The more interesting finding from the correlation matrix is that timing entropy is nearly orthogonal to volume features, which means the dataset is capturing two genuinely independent behavioral dimensions. How much someone trades and how they distribute that activity across time don't necessarily go together: a wallet that concentrates all its activity in a single overnight session looks completely different from one that trades across every hour of the day, even at identical total volume. That distinction turned out to be one of the cleaner separators between the normal clusters and the wallets that didn't fit anywhere.
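To make the timing dimension concrete, here is one way a `timing_entropy`-style feature can be computed: normalized Shannon entropy over the hours of day at which a wallet traded. The exact definition the project used may differ, so treat this as an illustration of the idea rather than the project's formula.

```python
import numpy as np

def timing_entropy(trade_hours) -> float:
    # Distribution of a wallet's trades over the 24 hours of the day,
    # summarized as Shannon entropy normalized to [0, 1]:
    # 0 = all trades in one hour, 1 = perfectly uniform across hours.
    counts = np.bincount(np.asarray(trade_hours) % 24, minlength=24)
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty hours; 0 * log(0) is defined as 0
    return float(-(p * np.log(p)).sum() / np.log(24))

retail = timing_entropy([14, 14, 14])         # one-off bettor, single hour
roundclock = timing_entropy(list(range(24)))  # trades every hour of the day
```

Under this definition the retail spike near zero and the noise-point group centered around 0.89 read naturally: the high-frequency wallets are active across most of the day rather than in a single session.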
Who is actually in this market
We applied clustering to answer this question, and deliberately chose DBSCAN over something like K-means because it lets the density of the feature space determine both the number and shape of clusters, which is more appropriate for an exploratory question like this one. It also doesn't assume spherical clusters, which matters when behavioral groups take irregular shapes in high-dimensional space. A parameter sweep optimizing for silhouette score settled on eps = 0.3 and min_samples = 10, producing 104 clusters and 4,446 noise points with a silhouette score of 0.7017.
- 5 clusters at 1,000+ wallets — the dominant retail archetypes; clusters 4 and 5 alone cover ~32,000 wallets
- 23 clusters at 100–999 — solid mid-size behavioral groups
- 75 clusters at 10–99 — niche behavioral styles, small but not arbitrary; min_samples = 10 means each required at least 10 wallets to form
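The property that makes DBSCAN the right fit here is visible on even a toy dataset: dense groups become clusters, and a point with too few neighbors is labeled noise (-1) rather than forced into the nearest cluster. A minimal sketch with scikit-learn, using the project's eps and min_samples on synthetic 2-D data (the real feature space is 11-dimensional and standardized):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.05, size=(30, 2))  # a tight retail-like group
loner = np.array([[5.0, 5.0]])               # a wallet with no neighbors
X = np.vstack([dense, loner])

# eps and min_samples match the project's swept parameters.
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
n_noise = int((labels == -1).sum())  # the isolated point is labeled noise
```

K-means with k=2 on the same data would happily assign the loner its own centroid or fold it into the dense group; DBSCAN instead reports it as unclustered, which is exactly the behavior the noise-point analysis relies on.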
The long tail of small clusters reflects genuine behavioral heterogeneity in a large dataset — traders who share enough features to be grouped together but don't fit the dominant archetypes. What the cluster taxonomy captures well is the dominant archetypes; what matters more for the analysis is the noise-point group, whose behavioral profile holds regardless of how the normal clusters are divided.
A note on what noise means here: DBSCAN labels a point as noise not because it's anomalous in a statistical sense, but because it lacks enough nearby neighbors in feature space to form or join a dense region. These wallets aren't outliers by definition; they just didn't cluster. What makes the group interesting is what their features show once you look at them directly.
Figure: Who Is Actually Moving This Market?
The cluster size distribution shows the largest normal cluster at 16,309 wallets — one-time retail speculators who bet once and never return. The trading frequency distribution on a log scale makes the separation visible: the mass of wallets sits near 0–5 trades per hour, while a thin tail extends past 175. That tail is not retail.
The noise-point (suspicious) wallets average $49,790 per wallet in volume — roughly 30× the typical retail cluster — with timing entropy of 0.89 versus near-zero for retail. Retail wallets make one or two trades, usually at the same hour, usually all on one side. The high-frequency group makes 28 trades on average, spread across multiple markets, at hours that don't cluster.
Figure: Trader Behavior Archetypes
By count, Buy & Hold dominates; the high-frequency wallets are a small bar. Flip to average volume on a log scale, and that bar towers over everything else by orders of magnitude. The buy/sell bias chart shows that Whale and Buy & Hold wallets are 100% buyers, while the high-frequency group leans 64% buy — some directional conviction, but without the pure-accumulation signature of retail.
Figure: Detecting Structural Manipulators vs. Retail Participants
The timing entropy histogram is the clearest separator. The normal population spikes sharply near zero — nearly all retail wallets make one-off bets at a single hour and stop. The noise-point wallets spread across a wide entropy range centered around 0.89. The buy ratio distributions show something subtler: normal wallets pile up at 0 and 1 (pure sellers and pure buyers), while the high-frequency group shows a flatter distribution centered around 0.6–0.7, which is a different behavioral mode from anything in the 104 normal clusters.
Those 4,446 wallets — 7% of participants — account for 69.8% of all trading volume.
Concentration and network structure
Figure: How Concentrated Is the Trading Power?
The Lorenz curve hugs the bottom-left corner before shooting nearly vertically at the far right, which is the classic extreme-inequality shape. The Gini coefficient of 0.958 puts these markets toward the far end of concentration: the top 1% of wallets control 66.5% of volume, while the bottom 90% control 5%. No single wallet individually dominates — the top four each hold about 1.7% of total volume — so the concentration is in a group, not a monopolist.
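For readers who want the Gini number to be concrete: it follows directly from per-wallet volumes via the standard sorted-values formulation. The numbers below are toy values, not the project's data.

```python
import numpy as np

def gini(volumes) -> float:
    # Standard closed form over sorted values:
    # G = sum_i (2i - n - 1) * x_i / (n * sum(x)), i = 1..n ascending.
    x = np.sort(np.asarray(volumes, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    return float((2 * i - n - 1).dot(x) / (n * x.sum()))

equal = gini([10, 10, 10, 10])                    # perfect equality -> 0.0
skewed = gini([1, 1, 1, 1, 1, 1, 1, 1, 1, 991])  # one whale holds ~99%
```

A Gini of 0.958 over wallet volumes says the real distribution is even more lopsided than the ten-wallet toy example above, which already scores 0.891.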
The clustering characterizes wallets in isolation: each wallet's behavioral profile is computed from its own trades, with no reference to what other wallets were doing at the same time. That's useful for identifying who's in the market, but it can't answer whether the high-volume group is acting in concert. Testing whether they're moving together is what the network analysis is for.
The co-trading graph was built from shared-side activity: two wallets get an edge if they traded on the same side within the same one-hour window on the same market. Same-side co-trading in a tight window is a necessary condition for coordination, though not sufficient. It could also capture traders independently reaching the same conclusion at the same time.
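The edge rule is simple enough to sketch directly. Trades here are hypothetical `(wallet, market, side, unix_ts)` tuples, and bucketing by `ts // 3600` is one reasonable reading of "within the same one-hour window"; the project's exact windowing may differ.

```python
from collections import defaultdict
from itertools import combinations

def co_trading_edges(trades):
    # Group trades by (market, side, hour window); every wallet pair
    # sharing a bucket gets an undirected edge.
    buckets = defaultdict(set)
    for wallet, market, side, ts in trades:
        buckets[(market, side, ts // 3600)].add(wallet)
    edges = set()
    for wallets in buckets.values():
        for a, b in combinations(sorted(wallets), 2):
            edges.add((a, b))
    return edges

trades = [
    ("w1", "fed-rate", "yes", 1000),
    ("w2", "fed-rate", "yes", 2000),  # same side, same hour as w1 -> edge
    ("w3", "fed-rate", "no", 1500),   # opposite side -> no edge
]
edges = co_trading_edges(trades)
```

Note that per-bucket pair enumeration is quadratic in bucket size, which is how 63,793 nodes can produce over four million edges: a few crowded windows generate most of them.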
| Network metric | Value |
| --- | --- |
| Nodes | 63,793 |
| Edges | 4,127,204 |
| Graph density | 0.0020 |
| Communities (Louvain) | 1,261 |
| Modularity | 0.833 |
| Freeman centralization | ~0 |

A density of 0.002 means the graph is sparse: most wallet pairs never co-trade. The modularity of 0.833 is high relative to typical real-world networks (which score 0.3–0.7), meaning the communities are genuinely coherent: groups of wallets that trade together far more than chance would predict, with sparse connections between groups. Freeman centralization near zero means no single wallet acts as a hub connecting the rest of the network.
That's the decentralization finding: diffuse, no coordination center, no dominant node. But network decentralization is not the same as capital equality. The trading network looks dispersed. The money isn't. Those two things can coexist, and here they do.
As a sanity check, DBSCAN cluster membership was compared against Louvain community membership. The two methods measure different things — behavioral feature similarity versus temporal co-trading structure — so perfect alignment would be suspicious, but no alignment would suggest they're picking up completely unrelated signals. The largest network community had 35.8% of its members from DBSCAN cluster 5 (the buy-and-hold retail cluster), which is meaningfully above that cluster's base rate in the population (~25%). Two different lenses, picking up related but distinct structure.
Are these markets being manipulated?
Surowiecki's framework for crowd wisdom requires four conditions: diversity, independence, decentralization, and aggregation. Each has a measurable proxy, though translating a qualitative framework into numbers involves judgment calls — the sub-scores below are meant to be indicative rather than precise.
| Condition | Finding | Sub-score |
| --- | --- | --- |
| Diversity | 69.8% of volume controlled by 7% of wallets | 30.2 / 100 |
| Independence | High-frequency wallets show no directional coordination with each other | 93.0 / 100 |
| Decentralization | Network Freeman centralization ≈ 0; 1,261 communities | 100.0 / 100 |
| Aggregation | 1,261 communities actively forming prices across markets | 83.3 / 100 |

On independence: the high-frequency wallets' buy ratio standard deviation is 0.32, essentially the same as retail's 0.34, which means they aren't agreeing with each other on direction any more than retail traders do. If 4,446 wallets were coordinating, you'd expect their bets to cluster toward one side across markets. They don't — the directional spread looks like independent actors, not a coordinated group.
On decentralization: as covered in the network section above, the trading graph is structurally dispersed. The caveat worth repeating is that this doesn't resolve the capital concentration finding. A decentralized network where one subset holds 70% of the capital is still a concentration problem — the decentralization just means that subset isn't acting as a single coordinated entity.
Diversity is the condition that fails, and it's the most consequential one. The volume-weighted price in a capital-heavy market will reflect the convictions of the large wallets more than the small ones, regardless of whether those large wallets are coordinating. That's just how capital weighting works.
The most direct test of active manipulation was temporal causality: do the high-frequency wallets trade before retail, moving prices and inducing the crowd to follow?
Figure: Does Suspicious Wallet Activity Lead Retail Activity?
Lead-lag cross-correlation across 124 markets in 15-minute bins. If the high-frequency wallets were systematically front-running retail, the red bars (high-frequency leads) would be clearly taller than the blue. The profile is nearly symmetric — directional asymmetry of 0.008, p = 0.531, not significant. Worth noting that the reverse-chronological sampling means this test is weighted toward end-of-market behavior, when large wallets are most active and retail has often already tapered off. A cleaner version of this test would require full trade history across the market lifecycle. The result is suggestive, not definitive.
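The shape of a lead-lag test like this can be sketched in a few lines: cross-correlate the binned high-frequency and retail volume series and compare correlation mass at positive lags (high-frequency leads) against negative lags. The series below are synthetic, with retail deliberately constructed to echo high-frequency activity two bins later, and the asymmetry statistic is a simplified version of whatever the project computed.

```python
import numpy as np

def directional_asymmetry(hf, retail, max_lag=4):
    # Standardize both series, then sum correlations where hf leads
    # retail versus where retail leads hf; ~0 means neither side leads.
    hf = (hf - hf.mean()) / hf.std()
    retail = (retail - retail.mean()) / retail.std()
    lead = sum(np.corrcoef(hf[:-k], retail[k:])[0, 1]
               for k in range(1, max_lag + 1))
    lag = sum(np.corrcoef(hf[k:], retail[:-k])[0, 1]
              for k in range(1, max_lag + 1))
    return (lead - lag) / max_lag

rng = np.random.default_rng(1)
hf = rng.poisson(5, 200).astype(float)               # high-frequency volume bins
retail = np.roll(hf, 2) + rng.normal(0, 0.1, 200)    # retail echoes hf 2 bins later
asym = directional_asymmetry(hf, retail)             # clearly positive here
```

On the real data this statistic came out at 0.008 (p = 0.531): neither side systematically trades first, which is the no-front-running finding.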
Figure: Wisdom of Crowds Score
Three of the four conditions come out strong; diversity is the outlier. The composite score of 62.1/100 places these markets in the mixed / expert-weighted range — not a failed market, but not a democratic crowd either. The three-mode summary: herding LOW, minority price-setting HIGH, causal manipulation LOW.
What this means
We can say that prediction markets are capital-weighted systems, not democratic aggregators. Whether that makes them less accurate is a separate question — a highly informed professional (or insider) with $5 million in a market may be a better signal than 10,000 retail traders each putting in $20. But it does mean the "public believes X" framing is wrong. The more accurate citation is "prediction markets currently price X at Y%", not "the public gives X a Y% chance." That distinction matters when the number ends up in a CNN headline.
Timing adds another layer: even a high-scoring market warrants more caution in the final week before resolution, when the temporal data suggests large wallets are most active and late-stage concentration is highest. The January 2026 volume spike in the dataset is a clear example — trade count had been rising for months, but volume exploded late. That's not broader retail participation; it's large wallets moving aggressively on approach to resolution.
The cross-platform comparison that was the original goal of this project remains the real gap. Structural inequality in Polymarket is now measurable. Whether Kalshi shows the same patterns is unknown, because Kalshi doesn't expose wallet-level data. The platform with less external oversight is paradoxically the more auditable one: regulators can presumably see Kalshi's account-level activity, outside researchers can't, while on Polymarket the reverse is true. That's the asymmetry that made this analysis possible and also the reason it's incomplete.
What's still open is the harder question underneath the concentration finding: are the high-frequency wallets signal or noise? If they're informed professionals with genuinely better models, their outsized influence might make prices more accurate, not less. If they're actors with reasons to move prices unrelated to predicting outcomes (to profit from derivatives positions, to move public perception ahead of a decision) then the concentration is a real problem that the independent, decentralized structure of the surrounding network can't fix. Distinguishing between those two interpretations isn't possible from behavioral data alone. Answering the identity question (who these wallets actually belong to) would require regulatory access to Kalshi-style account records, which outside researchers don't have on Polymarket's pseudonymous chain. Answering the accuracy question (whether markets dominated by high-frequency wallets predict outcomes better or worse than distributed ones) is technically doable from public data, but would need full price histories across hundreds of resolved markets, not the end-of-market sample this project collected. That's the natural next step.