docs: add IdentityDB overview, usage, and extractor guides

2026-05-11 12:30:36 +09:00
parent d75fd6fecc
commit c80a52a241
4 changed files with 405 additions and 0 deletions

145
Extractors.md Normal file

@@ -0,0 +1,145 @@
# Extractors
IdentityDB keeps extraction pluggable so you can decide how smart or how deterministic ingestion should be.
Today the package ships with two main extractor patterns:
- `NaiveExtractor` — deterministic, rule-based, local
- `LlmFactExtractor` — provider-agnostic adapter for external or local LLMs
## `NaiveExtractor`
`NaiveExtractor` is the simplest built-in extractor. It does not use an LLM.
Instead, it scans the input with a small set of rules and emits a topic for each pattern it recognizes.
### What it looks for
Current behavior is intentionally narrow and predictable:
- the standalone token `I`
- **4-digit years** such as `2025`
- **capitalized tokens** such as `TypeScript`, `Bun`, or `JavaScript`
### How it labels topics
- a 4-digit year becomes:
- `category: 'temporal'`
- `granularity: 'concrete'`
- `role: 'time'`
- `I` becomes:
- `category: 'entity'`
- `granularity: 'concrete'`
- `role: 'subject'`
- other capitalized tokens become:
- `category: 'entity'`
- `granularity: 'concrete'`
- `role: 'object'`
### Example
Input:
```text
I have worked with TypeScript since 2025.
```
Typical extracted result:
```ts
{
  statement: 'I have worked with TypeScript since 2025.',
  topics: [
    { name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' },
    { name: 'TypeScript', category: 'entity', granularity: 'concrete', role: 'object' },
    { name: '2025', category: 'temporal', granularity: 'concrete', role: 'time' },
  ],
}
```
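The three rules above can be sketched in a few lines. This is a hypothetical re-implementation for illustration only, not the library's actual source:

```typescript
type Topic = {
  name: string;
  category: 'entity' | 'temporal';
  granularity: 'concrete';
  role: 'subject' | 'object' | 'time';
};

// Classify each token with the same three rules described above:
// the standalone token `I`, 4-digit years, and capitalized tokens.
function naiveTopics(statement: string): Topic[] {
  const topics: Topic[] = [];
  for (const raw of statement.split(/\s+/)) {
    const token = raw.replace(/[^\w]/g, ''); // strip punctuation
    if (token === 'I') {
      topics.push({ name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' });
    } else if (/^\d{4}$/.test(token)) {
      topics.push({ name: token, category: 'temporal', granularity: 'concrete', role: 'time' });
    } else if (/^[A-Z]/.test(token)) {
      topics.push({ name: token, category: 'entity', granularity: 'concrete', role: 'object' });
    }
  }
  return topics;
}
```

Running this over the example sentence yields the same three topics shown above, which is the whole point: the behavior is small enough to hold in your head.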
### Why `NaiveExtractor` exists
It is useful because it is:
- **deterministic** — the same input gives the same output
- **fast** — no API calls or model latency
- **cheap** — no model cost
- **good for tests** — easy to reason about in fixtures and regression tests
- **good for demos** — great when you want a minimal local example
### Limitations
`NaiveExtractor` is intentionally not smart. It does not truly understand language.
That means it can:
- miss lowercase concepts
- miss multi-word concepts that do not begin with capitals
- over-trust capitalization
- fail to infer nuanced categories or roles
- produce weaker results on messy conversational text
Use it when you want predictability, not deep understanding.
## `LlmFactExtractor`
`LlmFactExtractor` keeps the same `FactExtractor` contract but delegates extraction to a text-generating model.
```ts
import { LlmFactExtractor } from 'identitydb';
const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});
```
### Why this adapter exists
The goal is to make LLM extraction possible **without coupling IdentityDB to a specific SDK**.
The adapter only expects a model object with:
```ts
generateText(prompt: string): Promise<string>
```
That means you can bridge:
- hosted APIs
- local inference servers
- wrappers around OpenAI-compatible APIs
- wrappers around Anthropic-style APIs
- your own orchestrator layer
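Because the contract is a single method, a bridge is usually only a few lines. Here is an illustrative sketch that adapts an arbitrary async completion function into the expected shape (the names `TextModel` and `asTextModel` are hypothetical helpers, not part of the package):

```typescript
// The minimal model shape LlmFactExtractor expects.
interface TextModel {
  generateText(prompt: string): Promise<string>;
}

// Adapt any async string-completion function into that shape.
function asTextModel(complete: (prompt: string) => Promise<string>): TextModel {
  return {
    async generateText(prompt) {
      return complete(prompt);
    },
  };
}
```

With a helper like this, swapping providers means swapping the `complete` function, while the extractor configuration stays unchanged.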
### Output requirements
The model is expected to return **JSON only**.
The adapter validates the response before IdentityDB writes anything to the database.
It also tolerates some common formatting noise, such as a fenced ` ```json ` block around the payload.
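That tolerance can be pictured as a small normalization step before parsing. This is an illustrative sketch of the idea, not the adapter's actual code:

```typescript
// Strip an optional ```json fence around the payload, then parse it.
function parseModelJson(raw: string): unknown {
  const fenced = raw.trim().match(/^`{3}(?:json)?\s*([\s\S]*?)\s*`{3}$/);
  const payload = fenced ? fenced[1] : raw.trim();
  return JSON.parse(payload);
}
```

A response that is not valid JSON after this normalization would be rejected before anything reaches the database.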
### When to choose which extractor
Use `NaiveExtractor` when:
- you want deterministic tests
- you want a zero-dependency local example
- your input format is controlled and simple
- you do not want model cost or latency
Use `LlmFactExtractor` when:
- the text is messy, ambiguous, or conversational
- you need better topic selection
- you want richer descriptions, categories, or metadata
- you are building a real ingestion pipeline instead of a toy example
## Recommended strategy
A practical development strategy is:
1. Start with `NaiveExtractor`
2. Build the surrounding ingestion flow and tests
3. Swap in `LlmFactExtractor` when you need better recall and better structure
4. Keep `NaiveExtractor` around for fixtures, regression tests, and offline demos

178
Getting-Started.md Normal file

@@ -0,0 +1,178 @@
# Getting Started
This page shows the concrete workflow for using IdentityDB as a structured memory layer.
## 1. Connect to a database
IdentityDB supports SQLite, PostgreSQL, MySQL, and MariaDB through Kysely-backed adapters.
### In-memory SQLite example
```ts
import { IdentityDB } from 'identitydb';
const db = await IdentityDB.connect({
  client: 'sqlite',
  filename: ':memory:',
});
```
## 2. Initialize the schema
```ts
await db.initialize();
```
This creates the tables IdentityDB needs:
- `topics`
- `facts`
- `fact_topics`
- `topic_relations`
- `topic_aliases`
- `fact_embeddings`
## 3. Add structured facts directly
Use `addFact()` when your application already knows the topics it wants to attach.
```ts
await db.addFact({
  statement: 'TypeScript is a programming language.',
  topics: [
    {
      name: 'TypeScript',
      category: 'entity',
      granularity: 'concrete',
    },
    {
      name: 'programming language',
      category: 'concept',
      granularity: 'abstract',
    },
  ],
});
```
## 4. Model topic hierarchy explicitly
Use `linkTopics()` when you want hierarchy to be explicit rather than inferred.
```ts
await db.linkTopics({
  parentName: 'programming language',
  childName: 'TypeScript',
});

const children = await db.getTopicChildren('programming language');
const lineage = await db.getTopicLineage('TypeScript');
```
This is useful for reasoning such as:
- `TypeScript` is a kind of `programming language`
- `Bun` is a kind of `runtime`
- `PostgreSQL` is a kind of `database`
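The kind of reasoning these relations enable can be sketched with a plain child-to-parent map, independent of IdentityDB's storage (this is illustration only, mirroring what `topic_relations` rows represent):

```typescript
// child -> parent, mirroring topic_relations rows.
const parents = new Map<string, string>([
  ['TypeScript', 'programming language'],
  ['Bun', 'runtime'],
  ['PostgreSQL', 'database'],
]);

// Walk upward to produce a lineage like getTopicLineage().
function lineage(topic: string): string[] {
  const chain = [topic];
  let current = topic;
  while (parents.has(current)) {
    current = parents.get(current)!;
    chain.push(current);
  }
  return chain;
}
```

The explicit links are what make the walk possible: nothing is inferred, so the hierarchy is exactly what you stored.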
## 5. Add aliases for canonical topic resolution
```ts
await db.addTopicAlias('TypeScript', 'TS');
const canonicalTopic = await db.getTopicByName('TS', { includeFacts: true });
```
This keeps one canonical topic row while still allowing alternate spellings or shorthand forms.
## 6. Ingest free-form text through an extractor
When your application starts from raw text, use `ingestStatement()`.
### Deterministic local example
```ts
import { NaiveExtractor } from 'identitydb';
await db.ingestStatement('I have worked with TypeScript since 2025.', {
  extractor: new NaiveExtractor(),
});
```
### LLM-backed example
```ts
import { LlmFactExtractor } from 'identitydb';
const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});

await db.ingestStatement('I have worked with Bun and TypeScript since 2025.', {
  extractor,
});
```
See [Extractors](Extractors) for a deeper explanation of the trade-offs.
## 7. Add semantic search
IdentityDB keeps semantic search provider-agnostic through an `EmbeddingProvider` interface.
```ts
import type { EmbeddingProvider } from 'identitydb';
const provider: EmbeddingProvider = {
  model: 'example-embedding-v1',
  dimensions: 3,
  async embed(input) {
    if (input.toLowerCase().includes('typescript')) {
      return [1, 0, 0];
    }
    return [0, 1, 0];
  },
};

await db.indexFactEmbeddings({ provider });

const matches = await db.searchFacts({
  query: 'TypeScript experience',
  provider,
  limit: 5,
});
```
## 8. Enable duplicate-aware ingestion
If you also provide an embedding provider during ingestion, IdentityDB can check whether a semantically similar fact already exists.
```ts
await db.ingestStatement('Bun makes TypeScript tooling fast.', {
  extractor: new NaiveExtractor(),
  embeddingProvider: provider,
  duplicateThreshold: 0.95,
});
```
If a close enough match already exists, IdentityDB can return the existing fact instead of writing a duplicate.
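The comparison behind that threshold can be sketched as a similarity check between embedding vectors. The sketch below assumes cosine similarity; the metric IdentityDB actually uses is an implementation detail, so treat this as illustration:

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// A new statement counts as a duplicate when its similarity
// to an existing fact's embedding meets the threshold.
function isDuplicate(candidate: number[], existing: number[], threshold = 0.95): boolean {
  return cosine(candidate, existing) >= threshold;
}
```

Raising the threshold makes ingestion stricter about what counts as "the same memory"; lowering it merges more aggressively.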
## 9. Close the connection
```ts
await db.close();
```
## Practical workflow recommendation
A good default integration pattern is:
1. Start with SQLite in development
2. Use `NaiveExtractor` for tests and deterministic local examples
3. Introduce `LlmFactExtractor` when you need better topic extraction from messy natural language
4. Add embeddings only when you actually need semantic retrieval or duplicate detection
5. Move to PostgreSQL or MySQL/MariaDB later without changing the high-level API

76
Home.md

@@ -0,0 +1,76 @@
# IdentityDB
IdentityDB exists to make **AI memory explicit, queryable, portable, and evolvable**.
Most AI applications start by stuffing raw text into prompts, vector stores, or ad-hoc JSON blobs. That works for demos, but it becomes fragile when you need to answer questions like:
- What facts do we know about a person, product, or project?
- Which topics are connected by the same statement?
- Can we distinguish canonical concepts from aliases such as `TypeScript` and `TS`?
- Can we carry the same memory from SQLite in development to PostgreSQL or MySQL in production?
- Can we mix deterministic extraction, LLM-backed extraction, and semantic search without locking into one vendor?
IdentityDB is designed as the answer to those problems.
## Why IdentityDB exists
IdentityDB turns memory into a relational graph with a stable application API:
- **Topics** are named nodes such as `TypeScript`, `Bun`, `2025`, or `programming language`
- **Facts** are statements such as `I have worked with TypeScript since 2025.`
- **Fact-topic links** connect one fact to many topics, which lets a single statement become a graph edge between concepts
- **Topic relations** model explicit hierarchy such as `programming language -> TypeScript`
- **Topic aliases** model canonicalization such as `TS -> TypeScript`
- **Fact embeddings** enable provider-agnostic semantic search and duplicate detection
This gives you a memory system that is easier to inspect than a black-box vector index and easier to evolve than hard-coded prompt state.
## What the package can do today
- Connect to **SQLite, PostgreSQL, MySQL, and MariaDB**
- Initialize the required schema automatically
- Add facts and topics directly through a typed API
- Ingest free-form text through pluggable extractors
- Resolve aliases to canonical topics
- Traverse parent/child topic relationships
- Index facts with embeddings for semantic retrieval
- Reuse an existing fact when semantic duplicate detection says a new statement is effectively the same memory
## Core idea in one example
The fact:
```text
I have worked with TypeScript since 2025.
```
can connect all of these topics at once:
- `I`
- `TypeScript`
- `2025`
That means IdentityDB can answer questions that plain keyword lookup cannot. It can tell you:
- which facts connect `TypeScript` and `2025`
- which topics are related to `TypeScript`
- which alias should resolve to the same canonical topic
- which facts are semantically similar even if the wording changes
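The graph structure behind this can be pictured with a plain in-memory table of fact-topic links. This sketch mirrors what `fact_topics` rows represent and is for illustration only:

```typescript
// Each fact links to several topics, mirroring fact_topics rows.
const factTopics: Array<{ fact: string; topics: string[] }> = [
  { fact: 'I have worked with TypeScript since 2025.', topics: ['I', 'TypeScript', '2025'] },
  { fact: 'Bun runs TypeScript natively.', topics: ['Bun', 'TypeScript'] },
];

// Find the facts that connect two topics at once.
function factsConnecting(a: string, b: string): string[] {
  return factTopics
    .filter((row) => row.topics.includes(a) && row.topics.includes(b))
    .map((row) => row.fact);
}
```

Because one statement links to many topics, a single fact is effectively an edge between every pair of its topics, which is what makes queries like "which facts connect `TypeScript` and `2025`" cheap to answer.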
## Recommended reading order
- [Getting Started](Getting-Started) — installation, initialization, and concrete examples
- [Extractors](Extractors) — when to use `NaiveExtractor` vs `LlmFactExtractor`
## Repository
- Source repository: [p-sw/IdentityDB](https://git.psw.kr/p-sw/IdentityDB)
## Current direction
IdentityDB is still an actively expanding MVP, but its current shape is already useful for:
- structured long-term memory for agents
- knowledge capture from conversations
- portable memory graphs across databases
- inspectable semantic memory systems

6
_Sidebar.md Normal file

@@ -0,0 +1,6 @@
## Navigation
- [Home](Home)
- [Getting Started](Getting-Started)
- [Extractors](Extractors)
- [Repository](https://git.psw.kr/p-sw/IdentityDB)