docs: add IdentityDB overview, usage, and extractor guides
`Extractors.md` (new file, 145 lines)
# Extractors

IdentityDB keeps extraction pluggable so you can decide how smart or how deterministic ingestion should be.

Today the package ships with two main extractor patterns:

- `NaiveExtractor` — deterministic, rule-based, local
- `LlmFactExtractor` — provider-agnostic adapter for external or local LLMs

## `NaiveExtractor`

`NaiveExtractor` is the simplest built-in extractor. It does not use an LLM. Instead, it scans the input with a small set of rules and emits obvious topics.

### What it looks for

Current behavior is intentionally narrow and predictable:

- the standalone token `I`
- **4-digit years** such as `2025`
- **capitalized tokens** such as `TypeScript`, `Bun`, or `JavaScript`

### How it labels topics

- a 4-digit year becomes:
  - `category: 'temporal'`
  - `granularity: 'concrete'`
  - `role: 'time'`
- `I` becomes:
  - `category: 'entity'`
  - `granularity: 'concrete'`
  - `role: 'subject'`
- other capitalized tokens become:
  - `category: 'entity'`
  - `granularity: 'concrete'`
  - `role: 'object'`

### Example

Input:

```text
I have worked with TypeScript since 2025.
```

Typical extracted result:

```ts
{
  statement: 'I have worked with TypeScript since 2025.',
  topics: [
    { name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' },
    { name: 'TypeScript', category: 'entity', granularity: 'concrete', role: 'object' },
    { name: '2025', category: 'temporal', granularity: 'concrete', role: 'time' },
  ],
}
```
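The rule set above is small enough to sketch end to end. The following is an illustrative reimplementation of the same three rules, not the package's actual source; the function name `extractTopics` and the exact tokenization are assumptions made for this example:

```typescript
// Illustrative sketch of NaiveExtractor-style rules (not the real source).
type Topic = { name: string; category: string; granularity: string; role: string };

function extractTopics(statement: string): Topic[] {
  const topics: Topic[] = [];
  const seen = new Set<string>();
  for (const raw of statement.split(/\s+/)) {
    const token = raw.replace(/[.,!?;:]+$/, ''); // strip trailing punctuation
    if (!token || seen.has(token)) continue;
    seen.add(token);
    if (token === 'I') {
      // standalone `I` is treated as the subject entity
      topics.push({ name: token, category: 'entity', granularity: 'concrete', role: 'subject' });
    } else if (/^\d{4}$/.test(token)) {
      // 4-digit years become concrete temporal topics
      topics.push({ name: token, category: 'temporal', granularity: 'concrete', role: 'time' });
    } else if (/^[A-Z]/.test(token)) {
      // other capitalized tokens become object entities
      topics.push({ name: token, category: 'entity', granularity: 'concrete', role: 'object' });
    }
  }
  return topics;
}
```

Running it on the example input yields the same three topics shown above, which is exactly the kind of predictability the real extractor aims for.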
### Why `NaiveExtractor` exists

It is useful because it is:

- **deterministic** — the same input gives the same output
- **fast** — no API calls or model latency
- **cheap** — no model cost
- **good for tests** — easy to reason about in fixtures and regression tests
- **good for demos** — great when you want a minimal local example

### Limitations

`NaiveExtractor` is intentionally not smart. It does not truly understand language. That means it can:

- miss lowercase concepts
- miss multi-word concepts that do not begin with capitals
- over-trust capitalization
- fail to infer nuanced categories or roles
- produce weaker results on messy conversational text

Use it when you want predictability, not deep understanding.

## `LlmFactExtractor`

`LlmFactExtractor` keeps the same `FactExtractor` contract but delegates extraction to a text-generating model.

```ts
import { LlmFactExtractor } from 'identitydb';

const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});
```

### Why this adapter exists

The goal is to make LLM extraction possible **without coupling IdentityDB to a specific SDK**. The adapter only expects a model object with:

```ts
generateText(prompt: string): Promise<string>
```

That means you can bridge:

- hosted APIs
- local inference servers
- wrappers around OpenAI-compatible APIs
- wrappers around Anthropic-style APIs
- your own orchestrator layer
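As one concrete bridge, here is a hypothetical factory that adapts an OpenAI-compatible chat endpoint to the `generateText` shape above. The base URL, model name, and response shape follow the common `/v1/chat/completions` convention; treat the whole thing as a sketch to adapt, not as part of IdentityDB:

```typescript
// Hypothetical adapter: wraps an OpenAI-compatible /v1/chat/completions
// endpoint behind the minimal { generateText } contract. Not part of IdentityDB.
function makeOpenAiCompatibleModel(baseUrl: string, modelName: string) {
  return {
    async generateText(prompt: string): Promise<string> {
      const res = await fetch(`${baseUrl}/chat/completions`, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({
          model: modelName,
          messages: [{ role: 'user', content: prompt }],
        }),
      });
      const data = await res.json();
      // Standard OpenAI-style response shape: first choice's message content.
      return data.choices[0].message.content;
    },
  };
}
```

The result can be passed directly as the `model` option of `LlmFactExtractor`.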
### Output requirements

The model is expected to return **JSON only**. The adapter validates the response before IdentityDB writes anything to the database. It also tolerates some common formatting noise, such as a fenced ` ```json ` block around the payload.
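A minimal version of that fence-stripping recovery looks like this. This is an illustrative sketch: the adapter's real validation is stricter, and the helper name `parseModelJson` is made up for this example:

```typescript
// Illustrative sketch: strip an optional ```json fence before parsing.
// The real adapter's validation of the payload shape is more involved.
function parseModelJson(raw: string): unknown {
  const trimmed = raw.trim();
  // Match an optional fenced block (``` or ```json) wrapping the payload.
  const fenced = trimmed.match(/^```(?:json)?\s*([\s\S]*?)\s*```$/);
  return JSON.parse(fenced ? fenced[1] : trimmed);
}
```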
### When to choose which extractor

Use `NaiveExtractor` when:

- you want deterministic tests
- you want a zero-dependency local example
- your input format is controlled and simple
- you do not want model cost or latency

Use `LlmFactExtractor` when:

- the text is messy, ambiguous, or conversational
- you need better topic selection
- you want richer descriptions, categories, or metadata
- you are building a real ingestion pipeline instead of a toy example

## Recommended strategy

A practical development strategy is:

1. Start with `NaiveExtractor`
2. Build the surrounding ingestion flow and tests
3. Swap in `LlmFactExtractor` when you need better recall and better structure
4. Keep `NaiveExtractor` around for fixtures, regression tests, and offline demos
`Getting-Started.md` (new file, 178 lines)
# Getting Started

This page shows the concrete workflow for using IdentityDB as a structured memory layer.

## 1. Connect to a database

IdentityDB supports SQLite, PostgreSQL, MySQL, and MariaDB through Kysely-backed adapters.

### In-memory SQLite example

```ts
import { IdentityDB } from 'identitydb';

const db = await IdentityDB.connect({
  client: 'sqlite',
  filename: ':memory:',
});
```

## 2. Initialize the schema

```ts
await db.initialize();
```

This creates the tables IdentityDB needs:

- `topics`
- `facts`
- `fact_topics`
- `topic_relations`
- `topic_aliases`
- `fact_embeddings`

## 3. Add structured facts directly

Use `addFact()` when your application already knows the topics it wants to attach.

```ts
await db.addFact({
  statement: 'TypeScript is a programming language.',
  topics: [
    {
      name: 'TypeScript',
      category: 'entity',
      granularity: 'concrete',
    },
    {
      name: 'programming language',
      category: 'concept',
      granularity: 'abstract',
    },
  ],
});
```

## 4. Model topic hierarchy explicitly

Use `linkTopics()` when you want hierarchy to be explicit rather than inferred.

```ts
await db.linkTopics({
  parentName: 'programming language',
  childName: 'TypeScript',
});

const children = await db.getTopicChildren('programming language');
const lineage = await db.getTopicLineage('TypeScript');
```

This is useful for reasoning such as:

- `TypeScript` is a kind of `programming language`
- `Bun` is a kind of `runtime`
- `PostgreSQL` is a kind of `database`

## 5. Add aliases for canonical topic resolution

```ts
await db.addTopicAlias('TypeScript', 'TS');

const canonicalTopic = await db.getTopicByName('TS', { includeFacts: true });
```

This keeps one canonical topic row while still allowing alternate spellings or shorthand forms.

## 6. Ingest free-form text through an extractor

When your application starts from raw text, use `ingestStatement()`.

### Deterministic local example

```ts
import { NaiveExtractor } from 'identitydb';

await db.ingestStatement('I have worked with TypeScript since 2025.', {
  extractor: new NaiveExtractor(),
});
```

### LLM-backed example

```ts
import { LlmFactExtractor } from 'identitydb';

const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});

await db.ingestStatement('I have worked with Bun and TypeScript since 2025.', {
  extractor,
});
```

See [Extractors](Extractors) for a deeper explanation of the trade-offs.

## 7. Add semantic search

IdentityDB keeps semantic search provider-agnostic through an `EmbeddingProvider` interface.

```ts
import type { EmbeddingProvider } from 'identitydb';

const provider: EmbeddingProvider = {
  model: 'example-embedding-v1',
  dimensions: 3,
  async embed(input) {
    if (input.toLowerCase().includes('typescript')) {
      return [1, 0, 0];
    }

    return [0, 1, 0];
  },
};

await db.indexFactEmbeddings({ provider });

const matches = await db.searchFacts({
  query: 'TypeScript experience',
  provider,
  limit: 5,
});
```
## 8. Enable duplicate-aware ingestion

If you also provide an embedding provider during ingestion, IdentityDB can check whether a semantically similar fact already exists.

```ts
await db.ingestStatement('Bun makes TypeScript tooling fast.', {
  extractor: new NaiveExtractor(),
  embeddingProvider: provider,
  duplicateThreshold: 0.95,
});
```

If a close enough match already exists, IdentityDB can return the existing fact instead of writing a duplicate.
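The documentation does not state how similarity is computed; assuming `duplicateThreshold` is a cosine-similarity cutoff over the fact embeddings, the comparison behind it can be sketched as:

```typescript
// Illustrative cosine similarity between two embedding vectors. Whether
// IdentityDB uses exactly this metric for duplicateThreshold is an assumption.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];   // accumulate the dot product
    normA += a[i] * a[i]; // and each vector's squared magnitude
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Under this assumption, the toy provider above embeds every statement mentioning "typescript" to the same vector, so such statements score 1.0 against each other and a 0.95 threshold would treat them as duplicates.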
## 9. Close the connection

```ts
await db.close();
```

## Practical workflow recommendation

A good default integration pattern is:

1. Start with SQLite in development
2. Use `NaiveExtractor` for tests and deterministic local examples
3. Introduce `LlmFactExtractor` when you need better topic extraction from messy natural language
4. Add embeddings only when you actually need semantic retrieval or duplicate detection
5. Move to PostgreSQL or MySQL/MariaDB later without changing the high-level API
`Home.md` (76 lines)
# IdentityDB

IdentityDB exists to make **AI memory explicit, queryable, portable, and evolvable**.

Most AI applications start by stuffing raw text into prompts, vector stores, or ad-hoc JSON blobs. That works for demos, but it becomes fragile when you need to answer questions like:

- What facts do we know about a person, product, or project?
- Which topics are connected by the same statement?
- Can we distinguish canonical concepts from aliases such as `TypeScript` and `TS`?
- Can we preserve memory across SQLite locally and PostgreSQL or MySQL in production?
- Can we mix deterministic extraction, LLM-backed extraction, and semantic search without locking into one vendor?

IdentityDB is designed as the answer to those problems.

## Why IdentityDB exists

IdentityDB turns memory into a relational graph with a stable application API:

- **Topics** are named nodes such as `TypeScript`, `Bun`, `2025`, or `programming language`
- **Facts** are statements such as `I have worked with TypeScript since 2025.`
- **Fact-topic links** connect one fact to many topics, which lets a single statement become a graph edge between concepts
- **Topic relations** model explicit hierarchy such as `programming language -> TypeScript`
- **Topic aliases** model canonicalization such as `TS -> TypeScript`
- **Fact embeddings** enable provider-agnostic semantic search and duplicate detection

This gives you a memory system that is easier to inspect than a black-box vector index and easier to evolve than hard-coded prompt state.

## What the package can do today

- Connect to **SQLite, PostgreSQL, MySQL, and MariaDB**
- Initialize the required schema automatically
- Add facts and topics directly through a typed API
- Ingest free-form text through pluggable extractors
- Resolve aliases to canonical topics
- Traverse parent/child topic relationships
- Index facts with embeddings for semantic retrieval
- Reuse an existing fact when semantic duplicate detection says a new statement is effectively the same memory

## Core idea in one example

The fact:

```text
I have worked with TypeScript since 2025.
```

can connect all of these topics at once:

- `I`
- `TypeScript`
- `2025`

That means IdentityDB can answer more than plain keyword lookup. It can tell you:

- which facts connect `TypeScript` and `2025`
- which topics are related to `TypeScript`
- which alias should resolve to the same canonical topic
- which facts are semantically similar even if the wording changes

## Recommended reading order

- [Getting Started](Getting-Started) — installation, initialization, and concrete examples
- [Extractors](Extractors) — when to use `NaiveExtractor` vs `LlmFactExtractor`

## Repository

- Source repository: [p-sw/IdentityDB](https://git.psw.kr/p-sw/IdentityDB)

## Current direction

IdentityDB is still in active MVP expansion, but the current shape is already useful for:

- structured long-term memory for agents
- knowledge capture from conversations
- portable memory graphs across databases
- inspectable semantic memory systems
`_Sidebar.md` (new file, 6 lines)
## Navigation

- [Home](Home)
- [Getting Started](Getting-Started)
- [Extractors](Extractors)
- [Repository](https://git.psw.kr/p-sw/IdentityDB)