docs: add IdentityDB overview, usage, and extractor guides
# Extractors

IdentityDB keeps extraction pluggable so you can decide how smart or how deterministic ingestion should be.

Today the package ships with two main extractor patterns:

- `NaiveExtractor` — deterministic, rule-based, local
- `LlmFactExtractor` — provider-agnostic adapter for external or local LLMs

## `NaiveExtractor`

`NaiveExtractor` is the simplest built-in extractor. It does not use an LLM.
Instead, it scans the input with a small set of rules and emits obvious topics.

### What it looks for

Current behavior is intentionally narrow and predictable:

- the standalone token `I`
- **4-digit years** such as `2025`
- **capitalized tokens** such as `TypeScript`, `Bun`, or `JavaScript`

### How it labels topics

- a 4-digit year becomes:
  - `category: 'temporal'`
  - `granularity: 'concrete'`
  - `role: 'time'`
- `I` becomes:
  - `category: 'entity'`
  - `granularity: 'concrete'`
  - `role: 'subject'`
- other capitalized tokens become:
  - `category: 'entity'`
  - `granularity: 'concrete'`
  - `role: 'object'`

### Example

Input:

```text
I have worked with TypeScript since 2025.
```

Typical extracted result:

```ts
{
  statement: 'I have worked with TypeScript since 2025.',
  topics: [
    { name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' },
    { name: 'TypeScript', category: 'entity', granularity: 'concrete', role: 'object' },
    { name: '2025', category: 'temporal', granularity: 'concrete', role: 'time' },
  ],
}
```
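
The rules above are simple enough to sketch directly. The following is a hypothetical TypeScript rendering of that rule set (`naiveExtract` is an illustrative name, not the package's actual source):

```ts
// Hypothetical sketch of the rule set described above, not the real
// NaiveExtractor implementation: standalone 'I', 4-digit years, and
// other capitalized tokens each get a fixed label.
type Topic = {
  name: string;
  category: 'entity' | 'temporal';
  granularity: 'concrete';
  role: 'subject' | 'object' | 'time';
};

function naiveExtract(statement: string): { statement: string; topics: Topic[] } {
  // Split on whitespace and strip trailing punctuation from each token.
  const tokens = statement.split(/\s+/).map((t) => t.replace(/[.,!?;:]+$/, ''));
  const topics: Topic[] = [];

  for (const token of tokens) {
    if (token === 'I') {
      topics.push({ name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' });
    } else if (/^\d{4}$/.test(token)) {
      topics.push({ name: token, category: 'temporal', granularity: 'concrete', role: 'time' });
    } else if (/^[A-Z]/.test(token)) {
      topics.push({ name: token, category: 'entity', granularity: 'concrete', role: 'object' });
    }
  }

  return { statement, topics };
}
```

Running this over the example sentence produces the same three topics shown above; everything lowercase is ignored, which is exactly the predictability (and the blind spot) described in the limitations below.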

### Why `NaiveExtractor` exists

It is useful because it is:

- **deterministic** — the same input gives the same output
- **fast** — no API calls or model latency
- **cheap** — no model cost
- **good for tests** — easy to reason about in fixtures and regression tests
- **good for demos** — great when you want a minimal local example

### Limitations

`NaiveExtractor` is intentionally not smart. It does not truly understand language.
That means it can:

- miss lowercase concepts
- miss multi-word concepts that do not begin with capitals
- over-trust capitalization
- fail to infer nuanced categories or roles
- produce weaker results on messy conversational text

Use it when you want predictability, not deep understanding.

## `LlmFactExtractor`

`LlmFactExtractor` keeps the same `FactExtractor` contract but delegates extraction to a text-generating model.

```ts
import { LlmFactExtractor } from 'identitydb';

const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});
```

### Why this adapter exists

The goal is to make LLM extraction possible **without coupling IdentityDB to a specific SDK**.
The adapter only expects a model object with:

```ts
generateText(prompt: string): Promise<string>
```

That means you can bridge:

- hosted APIs
- local inference servers
- wrappers around OpenAI-compatible APIs
- wrappers around Anthropic-style APIs
- your own orchestrator layer

### Output requirements

The model is expected to return **JSON only**.
The adapter validates the response before IdentityDB writes anything to the database.
It also tolerates some common formatting noise, such as a fenced `` ```json `` block around the payload.
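
Tolerating that noise amounts to removing an optional fence before parsing. A minimal sketch of the idea, assuming nothing about the adapter's real validation logic:

```ts
// Hypothetical sketch: strip an optional fenced block (with or without
// a "json" language tag) around the payload, then parse it. The real
// adapter's validation is more thorough; this only shows the idea.
function parseModelJson(raw: string): unknown {
  let text = raw.trim();
  const fenced = text.match(/^```(?:json)?\s*\n([\s\S]*?)\n```$/);
  if (fenced) {
    text = fenced[1];
  }
  return JSON.parse(text);
}
```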

### When to choose which extractor

Use `NaiveExtractor` when:

- you want deterministic tests
- you want a zero-dependency local example
- your input format is controlled and simple
- you do not want model cost or latency

Use `LlmFactExtractor` when:

- the text is messy, ambiguous, or conversational
- you need better topic selection
- you want richer descriptions, categories, or metadata
- you are building a real ingestion pipeline instead of a toy example

## Recommended strategy

A practical development strategy is:

1. Start with `NaiveExtractor`
2. Build the surrounding ingestion flow and tests
3. Swap in `LlmFactExtractor` when you need better recall and better structure
4. Keep `NaiveExtractor` around for fixtures, regression tests, and offline demos

# Getting Started

This page shows the concrete workflow for using IdentityDB as a structured memory layer.

## 1. Connect to a database

IdentityDB supports SQLite, PostgreSQL, MySQL, and MariaDB through Kysely-backed adapters.

### In-memory SQLite example

```ts
import { IdentityDB } from 'identitydb';

const db = await IdentityDB.connect({
  client: 'sqlite',
  filename: ':memory:',
});
```

## 2. Initialize the schema

```ts
await db.initialize();
```

This creates the tables IdentityDB needs:

- `topics`
- `facts`
- `fact_topics`
- `topic_relations`
- `topic_aliases`
- `fact_embeddings`

## 3. Add structured facts directly

Use `addFact()` when your application already knows the topics it wants to attach.

```ts
await db.addFact({
  statement: 'TypeScript is a programming language.',
  topics: [
    {
      name: 'TypeScript',
      category: 'entity',
      granularity: 'concrete',
    },
    {
      name: 'programming language',
      category: 'concept',
      granularity: 'abstract',
    },
  ],
});
```

## 4. Model topic hierarchy explicitly

Use `linkTopics()` when you want hierarchy to be explicit rather than inferred.

```ts
await db.linkTopics({
  parentName: 'programming language',
  childName: 'TypeScript',
});

const children = await db.getTopicChildren('programming language');
const lineage = await db.getTopicLineage('TypeScript');
```

This is useful for reasoning such as:

- `TypeScript` is a kind of `programming language`
- `Bun` is a kind of `runtime`
- `PostgreSQL` is a kind of `database`
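
Lineage lookups of this kind boil down to walking child-to-parent links until the chain ends. A minimal in-memory sketch of the idea (plain TypeScript over a `Map`, not IdentityDB's actual query against `topic_relations`):

```ts
// Hypothetical sketch: explicit parent links, walked upward to build
// a lineage chain. IdentityDB stores these links relationally instead.
const parentOf = new Map<string, string>([
  ['TypeScript', 'programming language'],
  ['Bun', 'runtime'],
]);

function lineage(topic: string): string[] {
  const chain = [topic];
  let current = topic;
  while (parentOf.has(current)) {
    current = parentOf.get(current)!;
    chain.push(current);
  }
  return chain;
}
```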

## 5. Add aliases for canonical topic resolution

```ts
await db.addTopicAlias('TypeScript', 'TS');

const canonicalTopic = await db.getTopicByName('TS', { includeFacts: true });
```

This keeps one canonical topic row while still allowing alternate spellings or shorthand forms.
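
Conceptually, alias resolution is a lookup that falls back to the name itself. A tiny hypothetical sketch of that behavior:

```ts
// Hypothetical sketch: an alias maps to a canonical name; anything
// without an alias entry is already canonical. IdentityDB does the
// equivalent against the topic_aliases table.
const aliasToCanonical = new Map<string, string>([['TS', 'TypeScript']]);

function canonicalName(name: string): string {
  return aliasToCanonical.get(name) ?? name;
}
```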

## 6. Ingest free-form text through an extractor

When your application starts from raw text, use `ingestStatement()`.

### Deterministic local example

```ts
import { NaiveExtractor } from 'identitydb';

await db.ingestStatement('I have worked with TypeScript since 2025.', {
  extractor: new NaiveExtractor(),
});
```

### LLM-backed example

```ts
import { LlmFactExtractor } from 'identitydb';

const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});

await db.ingestStatement('I have worked with Bun and TypeScript since 2025.', {
  extractor,
});
```

See [Extractors](Extractors) for a deeper explanation of the trade-offs.

## 7. Add semantic search

IdentityDB keeps semantic search provider-agnostic through an `EmbeddingProvider` interface.

```ts
import type { EmbeddingProvider } from 'identitydb';

const provider: EmbeddingProvider = {
  model: 'example-embedding-v1',
  dimensions: 3,
  async embed(input) {
    if (input.toLowerCase().includes('typescript')) {
      return [1, 0, 0];
    }

    return [0, 1, 0];
  },
};

await db.indexFactEmbeddings({ provider });

const matches = await db.searchFacts({
  query: 'TypeScript experience',
  provider,
  limit: 5,
});
```

## 8. Enable duplicate-aware ingestion

If you also provide an embedding provider during ingestion, IdentityDB can check whether a semantically similar fact already exists.

```ts
await db.ingestStatement('Bun makes TypeScript tooling fast.', {
  extractor: new NaiveExtractor(),
  embeddingProvider: provider,
  duplicateThreshold: 0.95,
});
```

If a close enough match already exists, IdentityDB can return the existing fact instead of writing a duplicate.
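
Duplicate checks of this kind typically reduce to comparing embedding vectors with cosine similarity against the threshold. A self-contained sketch of the assumed mechanics (not IdentityDB's exact scoring code):

```ts
// Hypothetical sketch: cosine similarity between two embedding vectors,
// compared against a duplicate threshold such as 0.95. Identical
// vectors score 1.0; orthogonal vectors score 0.0.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function isDuplicate(a: number[], b: number[], threshold = 0.95): boolean {
  return cosineSimilarity(a, b) >= threshold;
}
```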

## 9. Close the connection

```ts
await db.close();
```

## Practical workflow recommendation

A good default integration pattern is:

1. Start with SQLite in development
2. Use `NaiveExtractor` for tests and deterministic local examples
3. Introduce `LlmFactExtractor` when you need better topic extraction from messy natural language
4. Add embeddings only when you actually need semantic retrieval or duplicate detection
5. Move to PostgreSQL or MySQL/MariaDB later without changing the high-level API

# IdentityDB

IdentityDB exists to make **AI memory explicit, queryable, portable, and evolvable**.

Most AI applications start by stuffing raw text into prompts, vector stores, or ad-hoc JSON blobs. That works for demos, but it becomes fragile when you need to answer questions like:

- What facts do we know about a person, product, or project?
- Which topics are connected by the same statement?
- Can we distinguish canonical concepts from aliases such as `TypeScript` and `TS`?
- Can we keep the same memory when developing on SQLite locally and running PostgreSQL or MySQL in production?
- Can we mix deterministic extraction, LLM-backed extraction, and semantic search without locking into one vendor?

IdentityDB is designed as the answer to those problems.

## Why IdentityDB exists

IdentityDB turns memory into a relational graph with a stable application API:

- **Topics** are named nodes such as `TypeScript`, `Bun`, `2025`, or `programming language`
- **Facts** are statements such as `I have worked with TypeScript since 2025.`
- **Fact-topic links** connect one fact to many topics, which lets a single statement become a graph edge between concepts
- **Topic relations** model explicit hierarchy such as `programming language -> TypeScript`
- **Topic aliases** model canonicalization such as `TS -> TypeScript`
- **Fact embeddings** enable provider-agnostic semantic search and duplicate detection

This gives you a memory system that is easier to inspect than a black-box vector index and easier to evolve than hard-coded prompt state.

## What the package can do today

- Connect to **SQLite, PostgreSQL, MySQL, and MariaDB**
- Initialize the required schema automatically
- Add facts and topics directly through a typed API
- Ingest free-form text through pluggable extractors
- Resolve aliases to canonical topics
- Traverse parent/child topic relationships
- Index facts with embeddings for semantic retrieval
- Reuse an existing fact when semantic duplicate detection says a new statement is effectively the same memory

## Core idea in one example

The fact:

```text
I have worked with TypeScript since 2025.
```

can connect all of these topics at once:

- `I`
- `TypeScript`
- `2025`

That means IdentityDB can answer more than plain keyword lookup. It can tell you:

- which facts connect `TypeScript` and `2025`
- which topics are related to `TypeScript`
- which alias should resolve to the same canonical topic
- which facts are semantically similar even if the wording changes
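
In graph terms, each fact acts as an edge connecting every topic it mentions, so "which facts connect two topics" is an intersection over fact-topic links. A minimal in-memory sketch of that query (illustrative shapes, not the package's API):

```ts
// Hypothetical sketch: facts carry topic lists, and two topics are
// connected whenever some fact mentions both. IdentityDB answers the
// same question through the fact_topics join table.
type Fact = { statement: string; topics: string[] };

const facts: Fact[] = [
  { statement: 'I have worked with TypeScript since 2025.', topics: ['I', 'TypeScript', '2025'] },
  { statement: 'Bun runs TypeScript natively.', topics: ['Bun', 'TypeScript'] },
];

function factsConnecting(a: string, b: string): Fact[] {
  return facts.filter((f) => f.topics.includes(a) && f.topics.includes(b));
}
```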
## Recommended reading order

- [Getting Started](Getting-Started) — installation, initialization, and concrete examples
- [Extractors](Extractors) — when to use `NaiveExtractor` vs `LlmFactExtractor`

## Repository

- Source repository: [p-sw/IdentityDB](https://git.psw.kr/p-sw/IdentityDB)

## Current direction

IdentityDB is still in active MVP expansion, but the current shape is already useful for:

- structured long-term memory for agents
- knowledge capture from conversations
- portable memory graphs across databases
- inspectable semantic memory systems

## Navigation

- [Home](Home)
- [Getting Started](Getting-Started)
- [Extractors](Extractors)
- [Repository](https://git.psw.kr/p-sw/IdentityDB)