docs: add IdentityDB overview, usage, and extractor guides
# Extractors

IdentityDB keeps extraction pluggable so you can decide how smart or how deterministic ingestion should be.

Today the package ships with two main extractor patterns:

- `NaiveExtractor` — deterministic, rule-based, local
- `LlmFactExtractor` — provider-agnostic adapter for external or local LLMs

## `NaiveExtractor`

`NaiveExtractor` is the simplest built-in extractor. It does not use an LLM.
Instead, it scans the input with a small set of rules and emits obvious topics.

### What it looks for

Current behavior is intentionally narrow and predictable:

- the standalone token `I`
- **4-digit years** such as `2025`
- **capitalized tokens** such as `TypeScript`, `Bun`, or `JavaScript`

### How it labels topics

- a 4-digit year becomes:
  - `category: 'temporal'`
  - `granularity: 'concrete'`
  - `role: 'time'`
- `I` becomes:
  - `category: 'entity'`
  - `granularity: 'concrete'`
  - `role: 'subject'`
- other capitalized tokens become:
  - `category: 'entity'`
  - `granularity: 'concrete'`
  - `role: 'object'`

### Example

Input:

```text
I have worked with TypeScript since 2025.
```

Typical extracted result:

```ts
{
  statement: 'I have worked with TypeScript since 2025.',
  topics: [
    { name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' },
    { name: 'TypeScript', category: 'entity', granularity: 'concrete', role: 'object' },
    { name: '2025', category: 'temporal', granularity: 'concrete', role: 'time' },
  ],
}
```
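
The rules above are simple enough to sketch directly. The following is a hypothetical TypeScript rendering of that rule set (`naiveExtract` is an illustrative name, not the package's actual source):

```ts
// Hypothetical sketch of the rule set described above, not the real
// NaiveExtractor implementation: standalone 'I', 4-digit years, and
// other capitalized tokens each get a fixed label.
type Topic = {
  name: string;
  category: 'entity' | 'temporal';
  granularity: 'concrete';
  role: 'subject' | 'object' | 'time';
};

function naiveExtract(statement: string): { statement: string; topics: Topic[] } {
  // Split on whitespace and strip trailing punctuation from each token.
  const tokens = statement.split(/\s+/).map((t) => t.replace(/[.,!?;:]+$/, ''));
  const topics: Topic[] = [];

  for (const token of tokens) {
    if (token === 'I') {
      topics.push({ name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' });
    } else if (/^\d{4}$/.test(token)) {
      topics.push({ name: token, category: 'temporal', granularity: 'concrete', role: 'time' });
    } else if (/^[A-Z]/.test(token)) {
      topics.push({ name: token, category: 'entity', granularity: 'concrete', role: 'object' });
    }
  }

  return { statement, topics };
}
```

Running this over the example sentence produces the same three topics shown above; everything lowercase is ignored, which is exactly the predictability (and the blind spot) described in the limitations below.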

### Why `NaiveExtractor` exists

It is useful because it is:

- **deterministic** — the same input gives the same output
- **fast** — no API calls or model latency
- **cheap** — no model cost
- **good for tests** — easy to reason about in fixtures and regression tests
- **good for demos** — great when you want a minimal local example

### Limitations

`NaiveExtractor` is intentionally not smart. It does not truly understand language.
That means it can:

- miss lowercase concepts
- miss multi-word concepts that do not begin with capitals
- over-trust capitalization
- fail to infer nuanced categories or roles
- produce weaker results on messy conversational text

Use it when you want predictability, not deep understanding.

## `LlmFactExtractor`

`LlmFactExtractor` keeps the same `FactExtractor` contract but delegates extraction to a text-generating model.

```ts
import { LlmFactExtractor } from 'identitydb';

const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});
```

### Why this adapter exists

The goal is to make LLM extraction possible **without coupling IdentityDB to a specific SDK**.
The adapter only expects a model object with:

```ts
generateText(prompt: string): Promise<string>
```

That means you can bridge:

- hosted APIs
- local inference servers
- wrappers around OpenAI-compatible APIs
- wrappers around Anthropic-style APIs
- your own orchestrator layer

### Output requirements

The model is expected to return **JSON only**.
The adapter validates the response before IdentityDB writes anything to the database.
It also tolerates some common formatting noise, such as a fenced `` ```json `` block around the payload.
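
Tolerating that noise amounts to removing an optional fence before parsing. A minimal sketch of the idea, assuming nothing about the adapter's real validation logic:

```ts
// Hypothetical sketch: strip an optional fenced block (with or without
// a "json" language tag) around the payload, then parse it. The real
// adapter's validation is more thorough; this only shows the idea.
function parseModelJson(raw: string): unknown {
  let text = raw.trim();
  const fenced = text.match(/^```(?:json)?\s*\n([\s\S]*?)\n```$/);
  if (fenced) {
    text = fenced[1];
  }
  return JSON.parse(text);
}
```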

### When to choose which extractor

Use `NaiveExtractor` when:

- you want deterministic tests
- you want a zero-dependency local example
- your input format is controlled and simple
- you do not want model cost or latency

Use `LlmFactExtractor` when:

- the text is messy, ambiguous, or conversational
- you need better topic selection
- you want richer descriptions, categories, or metadata
- you are building a real ingestion pipeline instead of a toy example

## Recommended strategy

A practical development strategy is:

1. Start with `NaiveExtractor`
2. Build the surrounding ingestion flow and tests
3. Swap in `LlmFactExtractor` when you need better recall and better structure
4. Keep `NaiveExtractor` around for fixtures, regression tests, and offline demos

# Getting Started

This page shows the concrete workflow for using IdentityDB as a structured memory layer.

## 1. Connect to a database

IdentityDB supports SQLite, PostgreSQL, MySQL, and MariaDB through Kysely-backed adapters.

### In-memory SQLite example

```ts
import { IdentityDB } from 'identitydb';

const db = await IdentityDB.connect({
  client: 'sqlite',
  filename: ':memory:',
});
```

## 2. Initialize the schema

```ts
await db.initialize();
```

This creates the tables IdentityDB needs:

- `topics`
- `facts`
- `fact_topics`
- `topic_relations`
- `topic_aliases`
- `fact_embeddings`

## 3. Add structured facts directly

Use `addFact()` when your application already knows the topics it wants to attach.

```ts
await db.addFact({
  statement: 'TypeScript is a programming language.',
  topics: [
    {
      name: 'TypeScript',
      category: 'entity',
      granularity: 'concrete',
    },
    {
      name: 'programming language',
      category: 'concept',
      granularity: 'abstract',
    },
  ],
});
```

## 4. Model topic hierarchy explicitly

Use `linkTopics()` when you want hierarchy to be explicit rather than inferred.

```ts
await db.linkTopics({
  parentName: 'programming language',
  childName: 'TypeScript',
});

const children = await db.getTopicChildren('programming language');
const lineage = await db.getTopicLineage('TypeScript');
```

This is useful for reasoning such as:

- `TypeScript` is a kind of `programming language`
- `Bun` is a kind of `runtime`
- `PostgreSQL` is a kind of `database`
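
Lineage lookups of this kind boil down to walking child-to-parent links until the chain ends. A minimal in-memory sketch of the idea (plain TypeScript over a `Map`, not IdentityDB's actual query against `topic_relations`):

```ts
// Hypothetical sketch: explicit parent links, walked upward to build
// a lineage chain. IdentityDB stores these links relationally instead.
const parentOf = new Map<string, string>([
  ['TypeScript', 'programming language'],
  ['Bun', 'runtime'],
]);

function lineage(topic: string): string[] {
  const chain = [topic];
  let current = topic;
  while (parentOf.has(current)) {
    current = parentOf.get(current)!;
    chain.push(current);
  }
  return chain;
}
```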

## 5. Add aliases for canonical topic resolution

```ts
await db.addTopicAlias('TypeScript', 'TS');

const canonicalTopic = await db.getTopicByName('TS', { includeFacts: true });
```

This keeps one canonical topic row while still allowing alternate spellings or shorthand forms.
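
Conceptually, alias resolution is a lookup that falls back to the name itself. A tiny hypothetical sketch of that behavior:

```ts
// Hypothetical sketch: an alias maps to a canonical name; anything
// without an alias entry is already canonical. IdentityDB does the
// equivalent against the topic_aliases table.
const aliasToCanonical = new Map<string, string>([['TS', 'TypeScript']]);

function canonicalName(name: string): string {
  return aliasToCanonical.get(name) ?? name;
}
```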

## 6. Ingest free-form text through an extractor

When your application starts from raw text, use `ingestStatement()`.

### Deterministic local example

```ts
import { NaiveExtractor } from 'identitydb';

await db.ingestStatement('I have worked with TypeScript since 2025.', {
  extractor: new NaiveExtractor(),
});
```

### LLM-backed example

```ts
import { LlmFactExtractor } from 'identitydb';

const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});

await db.ingestStatement('I have worked with Bun and TypeScript since 2025.', {
  extractor,
});
```

See [Extractors](Extractors) for a deeper explanation of the trade-offs.

## 7. Add semantic search

IdentityDB keeps semantic search provider-agnostic through an `EmbeddingProvider` interface.

```ts
import type { EmbeddingProvider } from 'identitydb';

const provider: EmbeddingProvider = {
  model: 'example-embedding-v1',
  dimensions: 3,
  async embed(input) {
    if (input.toLowerCase().includes('typescript')) {
      return [1, 0, 0];
    }

    return [0, 1, 0];
  },
};

await db.indexFactEmbeddings({ provider });

const matches = await db.searchFacts({
  query: 'TypeScript experience',
  provider,
  limit: 5,
});
```

## 8. Enable duplicate-aware ingestion

If you also provide an embedding provider during ingestion, IdentityDB can check whether a semantically similar fact already exists.

```ts
await db.ingestStatement('Bun makes TypeScript tooling fast.', {
  extractor: new NaiveExtractor(),
  embeddingProvider: provider,
  duplicateThreshold: 0.95,
});
```

If a close enough match already exists, IdentityDB can return the existing fact instead of writing a duplicate.
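
Duplicate checks of this kind typically reduce to comparing embedding vectors with cosine similarity against the threshold. A self-contained sketch of the assumed mechanics (not IdentityDB's exact scoring code):

```ts
// Hypothetical sketch: cosine similarity between two embedding vectors,
// compared against a duplicate threshold such as 0.95. Identical
// vectors score 1.0; orthogonal vectors score 0.0.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function isDuplicate(a: number[], b: number[], threshold = 0.95): boolean {
  return cosineSimilarity(a, b) >= threshold;
}
```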

## 9. Close the connection

```ts
await db.close();
```

## Practical workflow recommendation

A good default integration pattern is:

1. Start with SQLite in development
2. Use `NaiveExtractor` for tests and deterministic local examples
3. Introduce `LlmFactExtractor` when you need better topic extraction from messy natural language
4. Add embeddings only when you actually need semantic retrieval or duplicate detection
5. Move to PostgreSQL or MySQL/MariaDB later without changing the high-level API

# IdentityDB

IdentityDB exists to make **AI memory explicit, queryable, portable, and evolvable**.

Most AI applications start by stuffing raw text into prompts, vector stores, or ad-hoc JSON blobs. That works for demos, but it becomes fragile when you need to answer questions like:

- What facts do we know about a person, product, or project?
- Which topics are connected by the same statement?
- Can we distinguish canonical concepts from aliases such as `TypeScript` and `TS`?
- Can we keep the same memory when developing on SQLite locally and running PostgreSQL or MySQL in production?
- Can we mix deterministic extraction, LLM-backed extraction, and semantic search without locking into one vendor?

IdentityDB is designed as the answer to those problems.

## Why IdentityDB exists

IdentityDB turns memory into a relational graph with a stable application API:

- **Topics** are named nodes such as `TypeScript`, `Bun`, `2025`, or `programming language`
- **Facts** are statements such as `I have worked with TypeScript since 2025.`
- **Fact-topic links** connect one fact to many topics, which lets a single statement become a graph edge between concepts
- **Topic relations** model explicit hierarchy such as `programming language -> TypeScript`
- **Topic aliases** model canonicalization such as `TS -> TypeScript`
- **Fact embeddings** enable provider-agnostic semantic search and duplicate detection

This gives you a memory system that is easier to inspect than a black-box vector index and easier to evolve than hard-coded prompt state.

## What the package can do today

- Connect to **SQLite, PostgreSQL, MySQL, and MariaDB**
- Initialize the required schema automatically
- Add facts and topics directly through a typed API
- Ingest free-form text through pluggable extractors
- Resolve aliases to canonical topics
- Traverse parent/child topic relationships
- Index facts with embeddings for semantic retrieval
- Reuse an existing fact when semantic duplicate detection says a new statement is effectively the same memory

## Core idea in one example

The fact:

```text
I have worked with TypeScript since 2025.
```

can connect all of these topics at once:

- `I`
- `TypeScript`
- `2025`

That means IdentityDB can answer more than plain keyword lookup. It can tell you:

- which facts connect `TypeScript` and `2025`
- which topics are related to `TypeScript`
- which alias should resolve to the same canonical topic
- which facts are semantically similar even if the wording changes
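
In graph terms, each fact acts as an edge connecting every topic it mentions, so "which facts connect two topics" is an intersection over fact-topic links. A minimal in-memory sketch of that query (illustrative shapes, not the package's API):

```ts
// Hypothetical sketch: facts carry topic lists, and two topics are
// connected whenever some fact mentions both. IdentityDB answers the
// same question through the fact_topics join table.
type Fact = { statement: string; topics: string[] };

const facts: Fact[] = [
  { statement: 'I have worked with TypeScript since 2025.', topics: ['I', 'TypeScript', '2025'] },
  { statement: 'Bun runs TypeScript natively.', topics: ['Bun', 'TypeScript'] },
];

function factsConnecting(a: string, b: string): Fact[] {
  return facts.filter((f) => f.topics.includes(a) && f.topics.includes(b));
}
```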
## Recommended reading order

- [Getting Started](Getting-Started) — installation, initialization, and concrete examples
- [Extractors](Extractors) — when to use `NaiveExtractor` vs `LlmFactExtractor`

## Repository

- Source repository: [p-sw/IdentityDB](https://git.psw.kr/p-sw/IdentityDB)

## Current direction

IdentityDB is still in active MVP expansion, but the current shape is already useful for:

- structured long-term memory for agents
- knowledge capture from conversations
- portable memory graphs across databases
- inspectable semantic memory systems

## Navigation

- [Home](Home)
- [Getting Started](Getting-Started)
- [Extractors](Extractors)
- [Repository](https://git.psw.kr/p-sw/IdentityDB)