Document Classifier & Semantic Chunker

Data Engineering & AI

A high-performance data preparation platform designed to optimize unstructured text for Retrieval-Augmented Generation (RAG) pipelines. Unlike standard text splitters that cut content at fixed positions regardless of meaning, this application employs a Semantic Chunking Engine to preserve context and coherence.

[Screenshot: Document Classifier Playground]

Key Capabilities

🧠 Intelligent Semantic Engine

The core of the application is a custom-built chunking algorithm that goes beyond simple character limits.

  • Context-Aware Splitting: Automatically detects semantic boundaries (sentences, paragraphs) so that chunks read as natural, coherent units.
  • Gap-Proof Overlap: Uses an "Output-First" measurement strategy, measuring overlap against the chunk actually emitted rather than the raw input, to guarantee a strict token overlap and prevent data loss between chunks.
  • Variable Sizing: Dynamically adjusts chunk sizes to follow the content's flow rather than forcing rigid limits.
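The overlap strategy above can be sketched in a few lines. This is a minimal illustration, not the engine's actual implementation: it approximates tokens by whitespace splitting (a real pipeline would use the embedding model's tokenizer), and the key idea is that the overlap is taken from the chunk that was actually emitted, so the next chunk is guaranteed to repeat it verbatim.

```typescript
// Sentence-aware chunking with output-first overlap (illustrative sketch).

function splitSentences(text: string): string[] {
  // Naive sentence boundary detection on ., ! or ? followed by whitespace.
  return text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
}

function countTokens(text: string): number {
  // Assumption: whitespace tokenization stands in for a real tokenizer.
  return text.split(/\s+/).filter(Boolean).length;
}

function lastTokens(text: string, n: number): string {
  return text.split(/\s+/).filter(Boolean).slice(-n).join(" ");
}

function chunk(text: string, maxTokens: number, overlap: number): string[] {
  const sentences = splitSentences(text);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (countTokens(candidate) > maxTokens && current) {
      chunks.push(current);
      // Output-first: measure the overlap against the emitted chunk,
      // so the next chunk repeats exactly those tokens — no gaps.
      current = `${lastTokens(current, overlap)} ${sentence}`;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Because the overlap is re-measured from the output rather than estimated from the input, every chunk boundary is covered even when sentence lengths vary.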

🛡️ Enterprise-Grade Text Processing

  • Automated Sanitization: Strips markdown formatting, screenshot references, and system artifacts (e.g., Notion export debris) on ingestion.
  • PII Redaction: Built-in privacy filters detect and remove email addresses and sensitive URLs before storage.
  • Multi-Dimensional Classification: Tags content by Persona, Industry, Funnel Stage, and Angle for precise retrieval filtering.
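A sanitization pass of this kind can be sketched as a chain of replacements. The patterns below are simplified examples, not the application's actual filter set; a production redactor would use a more robust PII detector.

```typescript
// Illustrative sanitization: strip markdown artifacts, then redact emails.

const MD_IMAGE = /!\[[^\]]*\]\([^)]*\)/g; // ![alt](url) screenshot refs
const MD_EMPHASIS = /[*_]{1,3}([^*_]+)[*_]{1,3}/g; // **bold**, _italic_
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.-]+/g; // simplified email pattern

function sanitize(raw: string): string {
  return raw
    .replace(MD_IMAGE, "") // drop screenshot references
    .replace(MD_EMPHASIS, "$1") // unwrap emphasis markers
    // Redact emails last, so the replacement token's underscore is not
    // mistaken for markdown emphasis by the pass above.
    .replace(EMAIL, "[REDACTED_EMAIL]")
    .replace(/\s{2,}/g, " ") // collapse leftover whitespace
    .trim();
}
```

Ordering matters: formatting is stripped before redaction so that the redaction token itself is never re-processed.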

🏢 Multi-Tenant Architecture

Built for scale, the system supports Runtime Database Provisioning, allowing for distinct, isolated databases for every client or project. This ensures strict data segregation and compliance with enterprise security standards.
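Per-tenant routing like this is typically a lazy client cache keyed by tenant. The sketch below is a simplified stand-in: `createClient` represents a real ORM factory (e.g. instantiating a Prisma client against the tenant's database URL), and the URL naming convention is a hypothetical example, not the system's actual scheme.

```typescript
// Sketch of runtime per-tenant database routing with a client cache.

type Client = { url: string };

const clients = new Map<string, Client>();

function createClient(url: string): Client {
  // Placeholder: a real implementation would open a pooled DB connection
  // (e.g. a Prisma client configured with this datasource URL).
  return { url };
}

function clientForTenant(tenantId: string): Client {
  // Assumption: one database per tenant, named by a deterministic
  // convention. The host/user below are illustrative only.
  const url = `postgresql://app@db:5432/tenant_${tenantId}`;
  let client = clients.get(tenantId);
  if (!client) {
    client = createClient(url);
    clients.set(tenantId, client);
  }
  return client;
}
```

Caching by tenant ID keeps connection counts bounded while guaranteeing that queries for one client can never reach another client's database.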

👁️ Visual Verification Workflow

A dedicated "Human-in-the-Loop" interface allows users to:
  • Visually inspect generated chunks in real-time.
  • Manually adjust split points with a drag-and-drop style editor.
  • Verify metadata classification before committing to long-term storage.

[Screenshot: Document Classifier Chunk Editor]

The Technical Challenge

Standard RAG pipelines often suffer from "context fragmentation": a vital piece of information is split across two chunks, leaving it invisible to the LLM at retrieval time.

This application solves that by treating text as a semantic stream rather than a string of characters, ensuring that every chunk stands on its own as a retrieval unit.
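The failure mode is easy to reproduce. The toy splitter below (an illustration, not any production splitter) cuts on a fixed character width, so a single fact can end up with its subject in one chunk and its value in the next, and neither chunk retrieves well on its own.

```typescript
// Toy demonstration of context fragmentation with a fixed-width splitter.

function naiveSplit(text: string, width: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += width) {
    chunks.push(text.slice(i, i + width));
  }
  return chunks;
}

const fact = "The refund window is 30 days from the purchase date.";
// With a 30-character window the sentence is cut mid-word: "refund"
// lands in the first chunk, "purchase date" in the second.
const fragments = naiveSplit(fact, 30);
```

A sentence-aware splitter would keep the whole statement in one chunk, so a query like "what is the refund window?" can match the complete fact.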

Tech Stack

  • Frontend: Next.js 14+ (App Router)
  • Database: PostgreSQL with pgvector (Dockerized)
  • ORM: Prisma with Dynamic Schema Management
  • Styling: Tailwind CSS + Radix UI