Project Catalogue
Three systems in detail — what was built, why, and what I learned from being wrong.
Each project below is presented as a self-contained mini-paper: Motivation, Approach, Implementation, and Outcome & Lessons. The systems are ordered by maturity rather than recency. Code links are in the margin where licensable. Stack badges sit beneath each title.
A.1 Kneen — an open-source cited-RAG document chat
Motivation
Most document-chat tools fail in two related ways: they cite confidently when the answer is not in the document, and they cite vaguely ("see chapter 4") when it is. Kneen was built to test a simple claim — that if every generated claim is required to point to a specific span in a specific page, the failure modes become diagnosable rather than mysterious.
Approach
A single Postgres instance serves as both the relational store (users, documents, sessions) and the vector store (via the pgvector extension). Avoiding a separate vector DB removes an entire failure surface: each embedding is stored in the same row as its chunk text, under the same primary key, so the two cannot drift apart.
Retrieval is hybrid: dense via cosine over halfvec(768), lexical via Postgres's built-in ts_rank, each list 20 results deep, fused with RRF (Eq. 1). Under-specified queries are routed through HyDE first [4]: the LLM hallucinates an answer, we embed it, and we search with that embedding. The hallucinated text is discarded; only its vector is kept.
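The fusion step can be sketched in a few lines. This is an illustration rather than Kneen's actual code: the function name `rrf_fuse`, the chunk ids, and the constant k = 60 (a commonly used value for RRF) are assumptions; the text only fixes the depth of 20 and the use of RRF.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, depth=20):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking[:depth], start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense = ["c7", "c2", "c9"]    # hypothetical cosine ranking
lexical = ["c2", "c4", "c7"]  # hypothetical ts_rank ranking
fused = rrf_fuse([dense, lexical])
```

A chunk that appears in both lists (here `c2` and `c7`) outranks one that appears in only one, which is the property that makes RRF a reasonable fit for mixing dense and lexical scores that are not on a common scale.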
The client opens an EventSource connection to /ask; the orchestrator interleaves retrieval and generation, and tokens are streamed back paired with the source-page anchor they came from. The Coverage map is the most-asked-about feature in user interviews: it shows which chunks have been queried, and surfaces "Questions You Haven't Asked Yet" by inverting that signal.
Implementation
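As a rough sketch of the streaming side: an EventSource client expects Server-Sent Events, where each message is a `data:` line followed by a blank line. The payload shape below (a `token` plus an `anchor` with `page` and `span`) and both function names are assumptions made for illustration; the source does not describe the wire format.

```python
import json


def sse_event(token, page, span):
    """Format one token and its source anchor as a Server-Sent Events message."""
    payload = {"token": token, "anchor": {"page": page, "span": span}}
    # An SSE message is one or more "data:" lines terminated by a blank line.
    return f"data: {json.dumps(payload)}\n\n"


def stream_answer(triples):
    """Yield SSE messages for (token, page, span) triples as they are produced."""
    for token, page, span in triples:
        yield sse_event(token, page, span)


# Hypothetical tokens and anchors.
events = list(stream_answer([("Kidneys", 4, [120, 127]), (" filter", 4, [128, 135])]))
```

Pairing the anchor with every token, rather than sending citations at the end, is what lets the front end highlight the source span while the answer is still arriving.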
The orchestrator is small — under 600 lines of Python — because the heavy lifting is delegated to Postgres. The Coverage map deserves its own paragraph: every time a chunk is retrieved, a counter is incremented; the map renders this counter as opacity in a chunk-grid (Fig. 2, right). The "unasked questions" feature inverts the signal — the LLM is asked to summarise the bottom-decile chunks, then to propose questions whose answers live there. It turns out users want this far more than they want the chat itself.
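The counter-and-invert logic is simple enough to show in memory. The sketch below is an assumption-laden stand-in (Kneen keeps its state in Postgres, and the class name, method names, and linear opacity scaling are all hypothetical); it only illustrates the counter, the opacity mapping, and the bottom-decile selection that feeds the "unasked questions" prompt.

```python
from collections import Counter


class CoverageMap:
    """Track how often each chunk is retrieved (hypothetical in-memory version)."""

    def __init__(self, chunk_ids):
        self.hits = Counter({cid: 0 for cid in chunk_ids})

    def record(self, retrieved_ids):
        # One increment per retrieval of a chunk.
        self.hits.update(retrieved_ids)

    def opacity(self):
        # Scale counts to [0, 1] so the most-retrieved chunk is fully opaque.
        top = max(self.hits.values()) or 1
        return {cid: n / top for cid, n in self.hits.items()}

    def bottom_decile(self):
        # The least-retrieved 10% of chunks (at least one), ties broken by id.
        ordered = sorted(self.hits, key=lambda cid: (self.hits[cid], cid))
        return ordered[: max(1, len(ordered) // 10)]


cov = CoverageMap([f"c{i}" for i in range(20)])
cov.record(["c0", "c1", "c0"])
cov.record(["c0", "c5"])
unasked = cov.bottom_decile()  # chunks to summarise for the "unasked questions" prompt
```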
Outcome & Lessons
- The citation is the product. Stripping out citations during a usability test caused trust scores to drop by half overnight, even though the underlying answers were identical.
- Single-store discipline is worth it. A separate vector DB would have shaved two days off the prototype and added six months of drift bugs.
- The "questions you haven't asked" feature is a sleeper hit. Worth more user delight than any of the retrieval improvements that took ten times the work.
A.2 Kidnex — renal-CT triage with patient-facing reasoning
Motivation
Kidney pathology — cysts, tumours, stones — presents very differently on CT, and the diagnostic gap between radiologist and patient holding the report is enormous. The classifier was the easy half; the hard half was producing patient-facing text that was useful without being authoritative.
Approach
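A pre-trained ResNet-50 backbone [3] is fine-tuned end-to-end on the 4-class problem with the standard cross-entropy loss, $\mathcal{L} = -\sum_{c=1}^{4} y_c \log \hat{p}_c$, where $y_c$ is the one-hot label and $\hat{p}_c$ the softmax probability for class $c$.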
After classification, the top-1 class and the calibrated probability $p_{\mathrm{top}}$ are passed to an LLM with a tightly-templated prompt, one of four conditional branches keyed on the decile of $p_{\mathrm{top}}$. The LLM never sees the scan; it sees a structured object, which keeps it factual.
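A minimal sketch of the branching, under stated assumptions: the source says there are four branches keyed on the decile of the top probability but does not give the cut points or branch names, so the thresholds, the labels (`confident`, `likely`, `uncertain`, `refer`), and the field names below are placeholders.

```python
def decile(p_top):
    """Map a probability in [0, 1] to a decile index 0..9."""
    return min(int(p_top * 10), 9)


def pick_branch(p_top):
    """Choose one of four prompt branches from the decile of p_top.

    The cut points are illustrative; the source does not give them.
    """
    d = decile(p_top)
    if d >= 9:
        return "confident"
    if d >= 7:
        return "likely"
    if d >= 5:
        return "uncertain"
    return "refer"


def build_prompt_input(label, p_top):
    # The LLM receives only this structured object, never the scan.
    return {"label": label, "p_top": round(p_top, 3), "branch": pick_branch(p_top)}


example = build_prompt_input("cyst", 0.93)
```

Because the branch is chosen in code before the LLM is called, the model can only phrase the result; it cannot soften or strengthen the confidence band on its own.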
Outcome & Lessons
- 96.4% top-1 accuracy on the held-out test set; the residual 3.6% sits almost entirely on the cyst↔tumour boundary, exactly where a confidence threshold is most useful.
- The LLM's role is narrower than I expected. It writes the explanation; it does not influence the label. Treating it as a renderer rather than a reasoner was the right call.
- Calibration matters more than accuracy when the downstream consumer is a non-expert. A 96% model that lies about its own confidence is worse than an 88% model that does not.
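One way to put a number on "lies about its own confidence" is expected calibration error (ECE). The sketch below is a generic illustration, not the evaluation used for Kidnex: the source does not say how calibration was measured, and the toy predictions are invented for the example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


outcomes = [True, True, True, False]  # toy data: 3 of 4 predictions correct
ece_overconfident = expected_calibration_error([0.99] * 4, outcomes)
ece_honest = expected_calibration_error([0.75] * 4, outcomes)
```

Both toy models are right three times out of four, but only the one reporting 0.75 tells a patient-facing renderer how much to trust the label.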
A.3 Pothole measurement — CSIR · CRRI internship
Motivation
Road authorities triage repair work by pothole area, but the current measurement loop is a person, a measuring tape, and a clipboard. The brief from CRRI was direct: given a single phone photo of a pothole with a reference object, return its area in cm².
Approach
Classical pipeline: CLAHE → bilateral filter → adaptive threshold → contour extraction. The reference object's known size sets the pixel-to-mm scale; pothole contour area is then trivially convertible.
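```python
import cv2

# CLAHE parameters are not given in the text; OpenCV's defaults are assumed.
clahe = cv2.createCLAHE()


def measure(img, ref_obj_mm):
    """Return the pothole area in mm² from a BGR photo with a square marker."""
    g = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    g = clahe.apply(g)                     # even out lighting
    g = cv2.bilateralFilter(g, 9, 75, 75)  # denoise while keeping edges
    # Threshold type, block size and constant are elided in the original.
    th = cv2.adaptiveThreshold(g, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, ...)
    cnts, _ = cv2.findContours(th, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    ref = largest_rectangular(cnts)                 # the marker (helper not shown)
    pothole = largest_irregular(cnts, exclude=ref)  # helper not shown
    # A square marker of side ref_obj_mm has a perimeter of 4 * ref_obj_mm.
    px_per_mm = cv2.arcLength(ref, True) / (4 * ref_obj_mm)
    # Divide the result by 100 to get the cm² the brief asks for.
    return cv2.contourArea(pothole) / (px_per_mm ** 2)
```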
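The scale conversion is worth working through once, since area scales with the square of the linear factor. The helper below is a standalone sketch with invented numbers; it assumes a square marker and a camera held roughly perpendicular to the road, so that one pixel-to-mm factor holds across the image.

```python
def area_cm2(contour_area_px, ref_perimeter_px, ref_side_mm):
    """Convert a contour area in pixels² to cm² using a square reference marker."""
    px_per_mm = ref_perimeter_px / (4 * ref_side_mm)  # linear scale
    area_mm2 = contour_area_px / px_per_mm ** 2       # area scales with the square
    return area_mm2 / 100.0                           # 100 mm² per cm²


# Hypothetical numbers: a 100 mm square marker spanning 200 px per side,
# and a pothole contour covering 180,000 px².
example_area = area_cm2(180_000, ref_perimeter_px=800, ref_side_mm=100)
```

With 2 px per mm, 180,000 px² is 45,000 mm², or 450 cm².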
Outcome & Lessons
- 8.7% mean absolute percentage error on a hand-measured set of 120 potholes, against the brief's bar of 15%.
- Classical CV is underrated for narrow problems. No GPU, no training set, no model drift; just a deterministic pipeline that ran on a five-year-old laptop in the field.
- Lighting is the entire problem. Half of the engineering time went into the CLAHE + bilateral combination; the contour step was an afternoon.
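For reference, the error metric quoted above can be computed as follows. The function name and the three sample areas are made up for illustration; they are not data from the 120-pothole set.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent, against hand-measured areas."""
    errors = [abs(p - m) / m for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical areas in cm², not data from the study.
hand_measured = [400.0, 250.0, 800.0]
pipeline_out = [440.0, 225.0, 800.0]
score = mape(hand_measured, pipeline_out)
```

Because each error is divided by the hand-measured area, small potholes weigh as much as large ones, which suits a triage setting where every entry on the list matters.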
A.4 Selected work in progress
Two systems currently under active development at eInfochips · AI-Studio. NDAs limit detail; high-level descriptions only.
| System | Goal | Notable choice |
|---|---|---|
| Embedded-device test-case RAG | Generate firmware-level test cases from datasheets & errata. | Scenario-shaped chunking; co-indexes errata and historical bug reports. |
| Website test-case RAG (improvements) | Lift engineer-accept rate of generated cases on web targets. | HyDE for under-specified user stories; reranker over RRF top-50. |