docs: add IdentityDB overview, usage, and extractor guides

2026-05-11 12:30:36 +09:00
parent d75fd6fecc
commit c80a52a241
4 changed files with 405 additions and 0 deletions

145
Extractors.md Normal file

@@ -0,0 +1,145 @@
# Extractors
IdentityDB keeps extraction pluggable so you can decide how smart or how deterministic ingestion should be.
Today the package ships with two main extractor patterns:
- `NaiveExtractor` — deterministic, rule-based, local
- `LlmFactExtractor` — provider-agnostic adapter for external or local LLMs
## `NaiveExtractor`
`NaiveExtractor` is the simplest built-in extractor. It does not use an LLM.
Instead, it scans the input with a small set of rules and emits a topic for each pattern it recognizes.
### What it looks for
Current behavior is intentionally narrow and predictable:
- the standalone token `I`
- **4-digit years** such as `2025`
- **capitalized tokens** such as `TypeScript`, `Bun`, or `JavaScript`
### How it labels topics
- a 4-digit year becomes:
- `category: 'temporal'`
- `granularity: 'concrete'`
- `role: 'time'`
- `I` becomes:
- `category: 'entity'`
- `granularity: 'concrete'`
- `role: 'subject'`
- other capitalized tokens become:
- `category: 'entity'`
- `granularity: 'concrete'`
- `role: 'object'`
### Example
Input:
```text
I have worked with TypeScript since 2025.
```
Typical extracted result:
```ts
{
  statement: 'I have worked with TypeScript since 2025.',
  topics: [
    { name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' },
    { name: 'TypeScript', category: 'entity', granularity: 'concrete', role: 'object' },
    { name: '2025', category: 'temporal', granularity: 'concrete', role: 'time' },
  ],
}
```
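The three rules above can be sketched in a few lines. This is a hypothetical re-implementation for illustration only, not the library's actual source:

```typescript
type Topic = {
  name: string;
  category: 'entity' | 'temporal';
  granularity: 'concrete';
  role: 'subject' | 'object' | 'time';
};

// Classify each token with the same three rules described above:
// the standalone token `I`, 4-digit years, and capitalized tokens.
function naiveTopics(statement: string): Topic[] {
  const topics: Topic[] = [];
  for (const raw of statement.split(/\s+/)) {
    const token = raw.replace(/[^\w]/g, ''); // strip punctuation
    if (token === 'I') {
      topics.push({ name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' });
    } else if (/^\d{4}$/.test(token)) {
      topics.push({ name: token, category: 'temporal', granularity: 'concrete', role: 'time' });
    } else if (/^[A-Z]/.test(token)) {
      topics.push({ name: token, category: 'entity', granularity: 'concrete', role: 'object' });
    }
  }
  return topics;
}
```

Running this over the example sentence yields the same three topics shown above, which is the whole point: the behavior is small enough to hold in your head.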
### Why `NaiveExtractor` exists
It is useful because it is:
- **deterministic** — the same input gives the same output
- **fast** — no API calls or model latency
- **cheap** — no model cost
- **good for tests** — easy to reason about in fixtures and regression tests
- **good for demos** — great when you want a minimal local example
### Limitations
`NaiveExtractor` is intentionally not smart. It does not truly understand language.
That means it can:
- miss lowercase concepts
- miss multi-word concepts that do not begin with capitals
- over-trust capitalization
- fail to infer nuanced categories or roles
- produce weaker results on messy conversational text
Use it when you want predictability, not deep understanding.
## `LlmFactExtractor`
`LlmFactExtractor` keeps the same `FactExtractor` contract but delegates extraction to a text-generating model.
```ts
import { LlmFactExtractor } from 'identitydb';
const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});
```
### Why this adapter exists
The goal is to make LLM extraction possible **without coupling IdentityDB to a specific SDK**.
The adapter only expects a model object with:
```ts
generateText(prompt: string): Promise<string>
```
That means you can bridge:
- hosted APIs
- local inference servers
- wrappers around OpenAI-compatible APIs
- wrappers around Anthropic-style APIs
- your own orchestrator layer
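Because the contract is a single method, a bridge is usually only a few lines. Here is an illustrative sketch that adapts an arbitrary async completion function into the expected shape (the names `TextModel` and `asTextModel` are hypothetical helpers, not part of the package):

```typescript
// The minimal model shape LlmFactExtractor expects.
interface TextModel {
  generateText(prompt: string): Promise<string>;
}

// Adapt any async string-completion function into that shape.
function asTextModel(complete: (prompt: string) => Promise<string>): TextModel {
  return {
    async generateText(prompt) {
      return complete(prompt);
    },
  };
}
```

With a helper like this, swapping providers means swapping the `complete` function, while the extractor configuration stays unchanged.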
### Output requirements
The model is expected to return **JSON only**.
The adapter validates the response before IdentityDB writes anything to the database.
It also tolerates some common formatting noise, such as a fenced ` ```json ` block around the payload.
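That tolerance can be pictured as a small normalization step before parsing. This is an illustrative sketch of the idea, not the adapter's actual code:

```typescript
// Strip an optional ```json fence around the payload, then parse it.
function parseModelJson(raw: string): unknown {
  const fenced = raw.trim().match(/^`{3}(?:json)?\s*([\s\S]*?)\s*`{3}$/);
  const payload = fenced ? fenced[1] : raw.trim();
  return JSON.parse(payload);
}
```

A response that is not valid JSON after this normalization would be rejected before anything reaches the database.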
### When to choose which extractor
Use `NaiveExtractor` when:
- you want deterministic tests
- you want a zero-dependency local example
- your input format is controlled and simple
- you do not want model cost or latency
Use `LlmFactExtractor` when:
- the text is messy, ambiguous, or conversational
- you need better topic selection
- you want richer descriptions, categories, or metadata
- you are building a real ingestion pipeline instead of a toy example
## Recommended strategy
A practical development strategy is:
1. Start with `NaiveExtractor`
2. Build the surrounding ingestion flow and tests
3. Swap in `LlmFactExtractor` when you need better recall and better structure
4. Keep `NaiveExtractor` around for fixtures, regression tests, and offline demos

178
Getting-Started.md Normal file

@@ -0,0 +1,178 @@
# Getting Started
This page shows the concrete workflow for using IdentityDB as a structured memory layer.
## 1. Connect to a database
IdentityDB supports SQLite, PostgreSQL, MySQL, and MariaDB through Kysely-backed adapters.
### In-memory SQLite example
```ts
import { IdentityDB } from 'identitydb';
const db = await IdentityDB.connect({
  client: 'sqlite',
  filename: ':memory:',
});
```
## 2. Initialize the schema
```ts
await db.initialize();
```
This creates the tables IdentityDB needs:
- `topics`
- `facts`
- `fact_topics`
- `topic_relations`
- `topic_aliases`
- `fact_embeddings`
## 3. Add structured facts directly
Use `addFact()` when your application already knows the topics it wants to attach.
```ts
await db.addFact({
  statement: 'TypeScript is a programming language.',
  topics: [
    {
      name: 'TypeScript',
      category: 'entity',
      granularity: 'concrete',
    },
    {
      name: 'programming language',
      category: 'concept',
      granularity: 'abstract',
    },
  ],
});
```
## 4. Model topic hierarchy explicitly
Use `linkTopics()` when you want hierarchy to be explicit rather than inferred.
```ts
await db.linkTopics({
  parentName: 'programming language',
  childName: 'TypeScript',
});

const children = await db.getTopicChildren('programming language');
const lineage = await db.getTopicLineage('TypeScript');
```
This is useful for reasoning such as:
- `TypeScript` is a kind of `programming language`
- `Bun` is a kind of `runtime`
- `PostgreSQL` is a kind of `database`
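The kind of reasoning these relations enable can be sketched with a plain child-to-parent map, independent of IdentityDB's storage (this is illustration only, mirroring what `topic_relations` rows represent):

```typescript
// child -> parent, mirroring topic_relations rows.
const parents = new Map<string, string>([
  ['TypeScript', 'programming language'],
  ['Bun', 'runtime'],
  ['PostgreSQL', 'database'],
]);

// Walk upward to produce a lineage like getTopicLineage().
function lineage(topic: string): string[] {
  const chain = [topic];
  let current = topic;
  while (parents.has(current)) {
    current = parents.get(current)!;
    chain.push(current);
  }
  return chain;
}
```

The explicit links are what make the walk possible: nothing is inferred, so the hierarchy is exactly what you stored.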
## 5. Add aliases for canonical topic resolution
```ts
await db.addTopicAlias('TypeScript', 'TS');
const canonicalTopic = await db.getTopicByName('TS', { includeFacts: true });
```
This keeps one canonical topic row while still allowing alternate spellings or shorthand forms.
## 6. Ingest free-form text through an extractor
When your application starts from raw text, use `ingestStatement()`.
### Deterministic local example
```ts
import { NaiveExtractor } from 'identitydb';
await db.ingestStatement('I have worked with TypeScript since 2025.', {
  extractor: new NaiveExtractor(),
});
```
### LLM-backed example
```ts
import { LlmFactExtractor } from 'identitydb';
const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});

await db.ingestStatement('I have worked with Bun and TypeScript since 2025.', {
  extractor,
});
```
See [Extractors](Extractors) for a deeper explanation of the trade-offs.
## 7. Add semantic search
IdentityDB keeps semantic search provider-agnostic through an `EmbeddingProvider` interface.
```ts
import type { EmbeddingProvider } from 'identitydb';
const provider: EmbeddingProvider = {
  model: 'example-embedding-v1',
  dimensions: 3,
  async embed(input) {
    if (input.toLowerCase().includes('typescript')) {
      return [1, 0, 0];
    }
    return [0, 1, 0];
  },
};

await db.indexFactEmbeddings({ provider });

const matches = await db.searchFacts({
  query: 'TypeScript experience',
  provider,
  limit: 5,
});
```
## 8. Enable duplicate-aware ingestion
If you also provide an embedding provider during ingestion, IdentityDB can check whether a semantically similar fact already exists.
```ts
await db.ingestStatement('Bun makes TypeScript tooling fast.', {
  extractor: new NaiveExtractor(),
  embeddingProvider: provider,
  duplicateThreshold: 0.95,
});
```
If a close enough match already exists, IdentityDB can return the existing fact instead of writing a duplicate.
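The comparison behind that threshold can be sketched as a similarity check between embedding vectors. The sketch below assumes cosine similarity; the metric IdentityDB actually uses is an implementation detail, so treat this as illustration:

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// A new statement counts as a duplicate when its similarity
// to an existing fact's embedding meets the threshold.
function isDuplicate(candidate: number[], existing: number[], threshold = 0.95): boolean {
  return cosine(candidate, existing) >= threshold;
}
```

Raising the threshold makes ingestion stricter about what counts as "the same memory"; lowering it merges more aggressively.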
## 9. Close the connection
```ts
await db.close();
```
## Practical workflow recommendation
A good default integration pattern is:
1. Start with SQLite in development
2. Use `NaiveExtractor` for tests and deterministic local examples
3. Introduce `LlmFactExtractor` when you need better topic extraction from messy natural language
4. Add embeddings only when you actually need semantic retrieval or duplicate detection
5. Move to PostgreSQL or MySQL/MariaDB later without changing the high-level API

76
Home.md

@@ -0,0 +1,76 @@
# IdentityDB
IdentityDB exists to make **AI memory explicit, queryable, portable, and evolvable**.
Most AI applications start by stuffing raw text into prompts, vector stores, or ad-hoc JSON blobs. That works for demos, but it becomes fragile when you need to answer questions like:
- What facts do we know about a person, product, or project?
- Which topics are connected by the same statement?
- Can we distinguish canonical concepts from aliases such as `TypeScript` and `TS`?
- Can we carry the same memory from SQLite in development to PostgreSQL or MySQL in production?
- Can we mix deterministic extraction, LLM-backed extraction, and semantic search without locking into one vendor?
IdentityDB is designed as the answer to those problems.
## Why IdentityDB exists
IdentityDB turns memory into a relational graph with a stable application API:
- **Topics** are named nodes such as `TypeScript`, `Bun`, `2025`, or `programming language`
- **Facts** are statements such as `I have worked with TypeScript since 2025.`
- **Fact-topic links** connect one fact to many topics, which lets a single statement become a graph edge between concepts
- **Topic relations** model explicit hierarchy such as `programming language -> TypeScript`
- **Topic aliases** model canonicalization such as `TS -> TypeScript`
- **Fact embeddings** enable provider-agnostic semantic search and duplicate detection
This gives you a memory system that is easier to inspect than a black-box vector index and easier to evolve than hard-coded prompt state.
## What the package can do today
- Connect to **SQLite, PostgreSQL, MySQL, and MariaDB**
- Initialize the required schema automatically
- Add facts and topics directly through a typed API
- Ingest free-form text through pluggable extractors
- Resolve aliases to canonical topics
- Traverse parent/child topic relationships
- Index facts with embeddings for semantic retrieval
- Reuse an existing fact when semantic duplicate detection says a new statement is effectively the same memory
## Core idea in one example
The fact:
```text
I have worked with TypeScript since 2025.
```
can connect all of these topics at once:
- `I`
- `TypeScript`
- `2025`
That means IdentityDB can answer questions that plain keyword lookup cannot. It can tell you:
- which facts connect `TypeScript` and `2025`
- which topics are related to `TypeScript`
- which alias should resolve to the same canonical topic
- which facts are semantically similar even if the wording changes
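The graph structure behind this can be pictured with a plain in-memory table of fact-topic links. This sketch mirrors what `fact_topics` rows represent and is for illustration only:

```typescript
// Each fact links to several topics, mirroring fact_topics rows.
const factTopics: Array<{ fact: string; topics: string[] }> = [
  { fact: 'I have worked with TypeScript since 2025.', topics: ['I', 'TypeScript', '2025'] },
  { fact: 'Bun runs TypeScript natively.', topics: ['Bun', 'TypeScript'] },
];

// Find the facts that connect two topics at once.
function factsConnecting(a: string, b: string): string[] {
  return factTopics
    .filter((row) => row.topics.includes(a) && row.topics.includes(b))
    .map((row) => row.fact);
}
```

Because one statement links to many topics, a single fact is effectively an edge between every pair of its topics, which is what makes queries like "which facts connect `TypeScript` and `2025`" cheap to answer.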
## Recommended reading order
- [Getting Started](Getting-Started) — installation, initialization, and concrete examples
- [Extractors](Extractors) — when to use `NaiveExtractor` vs `LlmFactExtractor`
## Repository
- Source repository: [p-sw/IdentityDB](https://git.psw.kr/p-sw/IdentityDB)
## Current direction
IdentityDB is still an actively expanding MVP, but its current shape is already useful for:
- structured long-term memory for agents
- knowledge capture from conversations
- portable memory graphs across databases
- inspectable semantic memory systems

6
_Sidebar.md Normal file

@@ -0,0 +1,6 @@
## Navigation
- [Home](Home)
- [Getting Started](Getting-Started)
- [Extractors](Extractors)
- [Repository](https://git.psw.kr/p-sw/IdentityDB)