From c80a52a241d0cb236349f105b2851fa3738d51cc Mon Sep 17 00:00:00 2001
From: Shinwoo PARK
Date: Mon, 11 May 2026 12:30:36 +0900
Subject: [PATCH] docs: add IdentityDB overview, usage, and extractor guides

---
 Extractors.md      | 145 ++++++++++++++++++++++++++++++++++++
 Getting-Started.md | 178 +++++++++++++++++++++++++++++++++++++++++++++
 Home.md            |  76 +++++++++++++++++++
 _Sidebar.md        |   6 ++
 4 files changed, 405 insertions(+)
 create mode 100644 Extractors.md
 create mode 100644 Getting-Started.md
 create mode 100644 _Sidebar.md

diff --git a/Extractors.md b/Extractors.md
new file mode 100644
index 0000000..c902739
--- /dev/null
+++ b/Extractors.md
@@ -0,0 +1,145 @@
+# Extractors
+
+IdentityDB keeps extraction pluggable so you can decide how smart or how deterministic ingestion should be.
+
+Today the package ships with two main extractor patterns:
+
+- `NaiveExtractor` — deterministic, rule-based, local
+- `LlmFactExtractor` — provider-agnostic adapter for external or local LLMs
+
+## `NaiveExtractor`
+
+`NaiveExtractor` is the simplest built-in extractor. It does not use an LLM.
+Instead, it scans the input with a small set of rules and emits obvious topics.
+
+### What it looks for
+
+Current behavior is intentionally narrow and predictable:
+
+- the standalone token `I`
+- **4-digit years** such as `2025`
+- **capitalized tokens** such as `TypeScript`, `Bun`, or `JavaScript`
+
+### How it labels topics
+
+- a 4-digit year becomes:
+  - `category: 'temporal'`
+  - `granularity: 'concrete'`
+  - `role: 'time'`
+- `I` becomes:
+  - `category: 'entity'`
+  - `granularity: 'concrete'`
+  - `role: 'subject'`
+- other capitalized tokens become:
+  - `category: 'entity'`
+  - `granularity: 'concrete'`
+  - `role: 'object'`
+
+### Example
+
+Input:
+
+```text
+I have worked with TypeScript since 2025.
+```
+
+Typical extracted result:
+
+```ts
+{
+  statement: 'I have worked with TypeScript since 2025.',
+  topics: [
+    { name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' },
+    { name: 'TypeScript', category: 'entity', granularity: 'concrete', role: 'object' },
+    { name: '2025', category: 'temporal', granularity: 'concrete', role: 'time' },
+  ],
+}
+```
+
+### Why `NaiveExtractor` exists
+
+It is useful because it is:
+
+- **deterministic** — the same input gives the same output
+- **fast** — no API calls or model latency
+- **cheap** — no model cost
+- **good for tests** — easy to reason about in fixtures and regression tests
+- **good for demos** — great when you want a minimal local example
+
+### Limitations
+
+`NaiveExtractor` is intentionally not smart. It does not truly understand language.
+That means it can:
+
+- miss lowercase concepts
+- miss multi-word concepts that do not begin with capitals
+- over-trust capitalization
+- fail to infer nuanced categories or roles
+- produce weaker results on messy conversational text
+
+Use it when you want predictability, not deep understanding.
+
+## `LlmFactExtractor`
+
+`LlmFactExtractor` keeps the same `FactExtractor` contract but delegates extraction to a text-generating model.
+
+```ts
+import { LlmFactExtractor } from 'identitydb';
+
+const extractor = new LlmFactExtractor({
+  model: {
+    async generateText(prompt) {
+      return callYourFavoriteLlm(prompt);
+    },
+  },
+  instructions: 'Prefer technology, product, and time topics over generic nouns.',
+});
+```
+
+### Why this adapter exists
+
+The goal is to make LLM extraction possible **without coupling IdentityDB to a specific SDK**.
+The adapter only expects a model object with:
+
+```ts
+generateText(prompt: string): Promise<string>
+```
+
+That means you can bridge:
+
+- hosted APIs
+- local inference servers
+- wrappers around OpenAI-compatible APIs
+- wrappers around Anthropic-style APIs
+- your own orchestrator layer
+
+### Output requirements
+
+The model is expected to return **JSON only**.
+The adapter validates the response before IdentityDB writes anything to the database.
+It also tolerates some common formatting noise, such as a fenced ` ```json ` block around the payload.
+
+### When to choose which extractor
+
+Use `NaiveExtractor` when:
+
+- you want deterministic tests
+- you want a zero-dependency local example
+- your input format is controlled and simple
+- you do not want model cost or latency
+
+Use `LlmFactExtractor` when:
+
+- the text is messy, ambiguous, or conversational
+- you need better topic selection
+- you want richer descriptions, categories, or metadata
+- you are building a real ingestion pipeline instead of a toy example
+
+## Recommended strategy
+
+A practical development strategy is:
+
+1. Start with `NaiveExtractor`
+2. Build the surrounding ingestion flow and tests
+3. Swap in `LlmFactExtractor` when you need better recall and better structure
+4. Keep `NaiveExtractor` around for fixtures, regression tests, and offline demos

diff --git a/Getting-Started.md b/Getting-Started.md
new file mode 100644
index 0000000..b012057
--- /dev/null
+++ b/Getting-Started.md
@@ -0,0 +1,178 @@
+# Getting Started
+
+This page shows the concrete workflow for using IdentityDB as a structured memory layer.
+
+## 1. Connect to a database
+
+IdentityDB supports SQLite, PostgreSQL, MySQL, and MariaDB through Kysely-backed adapters.
+
+### In-memory SQLite example
+
+```ts
+import { IdentityDB } from 'identitydb';
+
+const db = await IdentityDB.connect({
+  client: 'sqlite',
+  filename: ':memory:',
+});
+```
+
+## 2. Initialize the schema
+
+```ts
+await db.initialize();
+```
+
+This creates the tables IdentityDB needs:
+
+- `topics`
+- `facts`
+- `fact_topics`
+- `topic_relations`
+- `topic_aliases`
+- `fact_embeddings`
+
+## 3. Add structured facts directly
+
+Use `addFact()` when your application already knows the topics it wants to attach.
+
+```ts
+await db.addFact({
+  statement: 'TypeScript is a programming language.',
+  topics: [
+    {
+      name: 'TypeScript',
+      category: 'entity',
+      granularity: 'concrete',
+    },
+    {
+      name: 'programming language',
+      category: 'concept',
+      granularity: 'abstract',
+    },
+  ],
+});
+```
+
+## 4. Model topic hierarchy explicitly
+
+Use `linkTopics()` when you want hierarchy to be explicit rather than inferred.
+
+```ts
+await db.linkTopics({
+  parentName: 'programming language',
+  childName: 'TypeScript',
+});
+
+const children = await db.getTopicChildren('programming language');
+const lineage = await db.getTopicLineage('TypeScript');
+```
+
+This is useful for reasoning such as:
+
+- `TypeScript` is a kind of `programming language`
+- `Bun` is a kind of `runtime`
+- `PostgreSQL` is a kind of `database`
+
+## 5. Add aliases for canonical topic resolution
+
+```ts
+await db.addTopicAlias('TypeScript', 'TS');
+
+const canonicalTopic = await db.getTopicByName('TS', { includeFacts: true });
+```
+
+This keeps one canonical topic row while still allowing alternate spellings or shorthand forms.
+
+## 6. Ingest free-form text through an extractor
+
+When your application starts from raw text, use `ingestStatement()`.
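Both bundled extractors satisfy the same pluggable contract, so you can also supply your own. The sketch below is illustrative only: the method name `extract` and the exact `FactExtractor` signature are assumptions rather than the package's confirmed API, so check the exported types before relying on it. It mirrors the extracted-result shape shown on the [Extractors](Extractors) page.

```ts
// Hypothetical custom extractor: treats #hashtags as concrete entity topics.
// NOTE: the method name `extract` is an assumption about the FactExtractor
// contract — verify it against the actual identitydb type definitions.
const hashtagExtractor = {
  async extract(statement: string) {
    // Collect every #tag token from the raw statement.
    const tags = statement.match(/#(\w+)/g) ?? [];
    return {
      statement,
      topics: tags.map((tag) => ({
        name: tag.slice(1), // drop the leading '#'
        category: 'entity' as const,
        granularity: 'concrete' as const,
        role: 'object' as const,
      })),
    };
  },
};
```

If the real interface differs, the same idea still applies: any object that turns a statement into a `{ statement, topics }` payload can play the extractor role during ingestion.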
+
+### Deterministic local example
+
+```ts
+import { NaiveExtractor } from 'identitydb';
+
+await db.ingestStatement('I have worked with TypeScript since 2025.', {
+  extractor: new NaiveExtractor(),
+});
+```
+
+### LLM-backed example
+
+```ts
+import { LlmFactExtractor } from 'identitydb';
+
+const extractor = new LlmFactExtractor({
+  model: {
+    async generateText(prompt) {
+      return callYourFavoriteLlm(prompt);
+    },
+  },
+  instructions: 'Prefer technology, product, and time topics over generic nouns.',
+});
+
+await db.ingestStatement('I have worked with Bun and TypeScript since 2025.', {
+  extractor,
+});
+```
+
+See [Extractors](Extractors) for a deeper explanation of the trade-offs.
+
+## 7. Add semantic search
+
+IdentityDB keeps semantic search provider-agnostic through an `EmbeddingProvider` interface.
+
+```ts
+import type { EmbeddingProvider } from 'identitydb';
+
+const provider: EmbeddingProvider = {
+  model: 'example-embedding-v1',
+  dimensions: 3,
+  async embed(input) {
+    if (input.toLowerCase().includes('typescript')) {
+      return [1, 0, 0];
+    }
+
+    return [0, 1, 0];
+  },
+};
+
+await db.indexFactEmbeddings({ provider });
+
+const matches = await db.searchFacts({
+  query: 'TypeScript experience',
+  provider,
+  limit: 5,
+});
+```
+
+## 8. Enable duplicate-aware ingestion
+
+If you also provide an embedding provider during ingestion, IdentityDB can check whether a semantically similar fact already exists.
+
+```ts
+await db.ingestStatement('Bun makes TypeScript tooling fast.', {
+  extractor: new NaiveExtractor(),
+  embeddingProvider: provider,
+  duplicateThreshold: 0.95,
+});
+```
+
+If a close enough match already exists, IdentityDB can return the existing fact instead of writing a duplicate.
+
+## 9. Close the connection
+
+```ts
+await db.close();
+```
+
+## Practical workflow recommendation
+
+A good default integration pattern is:
+
+1. Start with SQLite in development
+2. Use `NaiveExtractor` for tests and deterministic local examples
+3. Introduce `LlmFactExtractor` when you need better topic extraction from messy natural language
+4. Add embeddings only when you actually need semantic retrieval or duplicate detection
+5. Move to PostgreSQL or MySQL/MariaDB later without changing the high-level API

diff --git a/Home.md b/Home.md
index e69de29..e41f77a 100644
--- a/Home.md
+++ b/Home.md
@@ -0,0 +1,76 @@
+# IdentityDB
+
+IdentityDB exists to make **AI memory explicit, queryable, portable, and evolvable**.
+
+Most AI applications start by stuffing raw text into prompts, vector stores, or ad-hoc JSON blobs. That works for demos, but it becomes fragile when you need to answer questions like:
+
+- What facts do we know about a person, product, or project?
+- Which topics are connected by the same statement?
+- Can we distinguish canonical concepts from aliases such as `TypeScript` and `TS`?
+- Can we preserve memory across SQLite locally and PostgreSQL or MySQL in production?
+- Can we mix deterministic extraction, LLM-backed extraction, and semantic search without locking into one vendor?
+
+IdentityDB is designed as the answer to those problems.
+
+## Why IdentityDB exists
+
+IdentityDB turns memory into a relational graph with a stable application API:
+
+- **Topics** are named nodes such as `TypeScript`, `Bun`, `2025`, or `programming language`
+- **Facts** are statements such as `I have worked with TypeScript since 2025.`
+- **Fact-topic links** connect one fact to many topics, which lets a single statement become a graph edge between concepts
+- **Topic relations** model explicit hierarchy such as `programming language -> TypeScript`
+- **Topic aliases** model canonicalization such as `TS -> TypeScript`
+- **Fact embeddings** enable provider-agnostic semantic search and duplicate detection
+
+This gives you a memory system that is easier to inspect than a black-box vector index and easier to evolve than hard-coded prompt state.
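The graph structure described above can be pictured with plain data. The following is only an illustrative sketch of the relational shape — the type names and row layouts are invented for this example and are not IdentityDB's actual schema types:

```ts
// Illustrative shapes only — invented for this sketch, not IdentityDB's real row types.
interface TopicRow { name: string }
interface FactRow { statement: string }

const topics: TopicRow[] = [{ name: 'I' }, { name: 'TypeScript' }, { name: '2025' }];
const facts: FactRow[] = [{ statement: 'I have worked with TypeScript since 2025.' }];

// One fact-topic link per mentioned topic: a single statement becomes
// an edge between every topic it touches.
const factTopics = topics.map((t) => ({ fact: facts[0].statement, topic: t.name }));

// Explicit hierarchy and canonicalization, mirroring the ideas behind
// the `topic_relations` and `topic_aliases` tables.
const topicRelations = [{ parent: 'programming language', child: 'TypeScript' }];
const topicAliases = [{ alias: 'TS', canonical: 'TypeScript' }];

// "Which topics does this statement connect?" becomes a plain query over links.
const connected = factTopics
  .filter((link) => link.fact === facts[0].statement)
  .map((link) => link.topic);
```

Because every piece of memory is an ordinary row like these, the whole graph stays inspectable with ordinary queries rather than opaque vector lookups.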
+
+## What the package can do today
+
+- Connect to **SQLite, PostgreSQL, MySQL, and MariaDB**
+- Initialize the required schema automatically
+- Add facts and topics directly through a typed API
+- Ingest free-form text through pluggable extractors
+- Resolve aliases to canonical topics
+- Traverse parent/child topic relationships
+- Index facts with embeddings for semantic retrieval
+- Reuse an existing fact when semantic duplicate detection says a new statement is effectively the same memory
+
+## Core idea in one example
+
+The fact:
+
+```text
+I have worked with TypeScript since 2025.
+```
+
+can connect all of these topics at once:
+
+- `I`
+- `TypeScript`
+- `2025`
+
+That means IdentityDB can answer more than plain keyword lookup. It can tell you:
+
+- which facts connect `TypeScript` and `2025`
+- which topics are related to `TypeScript`
+- which alias should resolve to the same canonical topic
+- which facts are semantically similar even if the wording changes
+
+## Recommended reading order
+
+- [Getting Started](Getting-Started) — installation, initialization, and concrete examples
+- [Extractors](Extractors) — when to use `NaiveExtractor` vs `LlmFactExtractor`
+
+## Repository
+
+- Source repository: [p-sw/IdentityDB](https://git.psw.kr/p-sw/IdentityDB)
+
+## Current direction
+
+IdentityDB is still in active MVP expansion, but the current shape is already useful for:
+
+- structured long-term memory for agents
+- knowledge capture from conversations
+- portable memory graphs across databases
+- inspectable semantic memory systems

diff --git a/_Sidebar.md b/_Sidebar.md
new file mode 100644
index 0000000..409222e
--- /dev/null
+++ b/_Sidebar.md
@@ -0,0 +1,6 @@
+## Navigation
+
+- [Home](Home)
+- [Getting Started](Getting-Started)
+- [Extractors](Extractors)
+- [Repository](https://git.psw.kr/p-sw/IdentityDB)