Extractors
IdentityDB keeps extraction pluggable so you can decide how smart or how deterministic ingestion should be.
Today the package ships with two main extractor patterns:
- `NaiveExtractor`, a deterministic, rule-based, local extractor
- `LlmFactExtractor`, a provider-agnostic adapter for external or local LLMs
NaiveExtractor
`NaiveExtractor` is the simplest built-in extractor. It does not use an LLM.
Instead, it scans the input with a small set of rules and emits obvious topics.
What it looks for
Current behavior is intentionally narrow and predictable:
- the standalone token `I`
- 4-digit years such as `2025`
- capitalized tokens such as `TypeScript`, `Bun`, or `JavaScript`
How it labels topics
- a 4-digit year becomes `category: 'temporal'`, `granularity: 'concrete'`, `role: 'time'`
- `I` becomes `category: 'entity'`, `granularity: 'concrete'`, `role: 'subject'`
- other capitalized tokens become `category: 'entity'`, `granularity: 'concrete'`, `role: 'object'`
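To make the rules concrete, here is a rough sketch of a scan in this spirit. It is illustrative only, not the package's actual source; the `Topic` shape is inferred from the example below.

```ts
// Illustrative only: a rule-based scan in the spirit of NaiveExtractor,
// not the package's actual implementation.
type Topic = {
  name: string;
  category: 'entity' | 'temporal';
  granularity: 'concrete';
  role: 'subject' | 'object' | 'time';
};

function scanTopics(statement: string): Topic[] {
  const topics: Topic[] = [];
  for (const token of statement.split(/\s+/)) {
    const word = token.replace(/[^\w]/g, ''); // drop surrounding punctuation
    if (word === 'I') {
      topics.push({ name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' });
    } else if (/^\d{4}$/.test(word)) {
      topics.push({ name: word, category: 'temporal', granularity: 'concrete', role: 'time' });
    } else if (/^[A-Z]/.test(word)) {
      topics.push({ name: word, category: 'entity', granularity: 'concrete', role: 'object' });
    }
  }
  return topics;
}
```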
Example
Input:
```
I have worked with TypeScript since 2025.
```
Typical extracted result:
```
{
  statement: 'I have worked with TypeScript since 2025.',
  topics: [
    { name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' },
    { name: 'TypeScript', category: 'entity', granularity: 'concrete', role: 'object' },
    { name: '2025', category: 'temporal', granularity: 'concrete', role: 'time' },
  ],
}
```
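Wiring that up looks roughly like the snippet below. Note that `extract` is a hypothetical method name standing in for the `FactExtractor` contract, so check the actual interface before copying this.

```ts
import { NaiveExtractor } from 'identitydb';

const extractor = new NaiveExtractor();

// `extract` is a hypothetical method name here; verify it against
// the real FactExtractor interface.
const fact = await extractor.extract('I have worked with TypeScript since 2025.');
console.log(fact.topics.map((t) => t.name)); // ['I', 'TypeScript', '2025']
```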
Why NaiveExtractor exists
It is useful because it is:
- deterministic — the same input gives the same output
- fast — no API calls or model latency
- cheap — no model cost
- good for tests — easy to reason about in fixtures and regression tests
- good for demos — great when you want a minimal local example
Limitations
`NaiveExtractor` is intentionally not smart. It does not truly understand language.
That means it can:
- miss lowercase concepts
- miss multi-word concepts that do not begin with capitals
- over-trust capitalization
- fail to infer nuanced categories or roles
- produce weaker results on messy conversational text
Use it when you want predictability, not deep understanding.
LlmFactExtractor
`LlmFactExtractor` keeps the same `FactExtractor` contract but delegates extraction to a text-generating model.
```ts
import { LlmFactExtractor } from 'identitydb';

const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});
```
Why this adapter exists
The goal is to make LLM extraction possible without coupling IdentityDB to a specific SDK. The adapter only expects a `model` object with:

```ts
generateText(prompt: string): Promise<string>
```
That means you can bridge:
- hosted APIs
- local inference servers
- wrappers around OpenAI-compatible APIs
- wrappers around Anthropic-style APIs
- your own orchestrator layer
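As a concrete example, a hosted OpenAI-compatible endpoint can be bridged with plain `fetch`. This is a minimal sketch: the endpoint URL, model name, environment variable, and instructions text are assumptions to adapt to your provider.

```ts
import { LlmFactExtractor } from 'identitydb';

// Minimal sketch: bridge an OpenAI-compatible chat completions endpoint.
// The URL, model name, and env var are assumptions, not IdentityDB defaults.
const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt: string): Promise<string> {
      const res = await fetch('https://api.openai.com/v1/chat/completions', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        },
        body: JSON.stringify({
          model: 'gpt-4o-mini',
          messages: [{ role: 'user', content: prompt }],
        }),
      });
      if (!res.ok) throw new Error(`LLM request failed: ${res.status}`);
      const data = await res.json();
      return data.choices[0].message.content;
    },
  },
  instructions: 'Extract concrete topics from the statement.',
});
```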
Output requirements
The model is expected to return JSON only.
The adapter validates the response before IdentityDB writes anything to the database.
It also tolerates some common formatting noise, such as a fenced ```json block around the payload.
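The adapter's actual validation logic is internal, but the tolerance described above amounts to something like this sketch: strip an optional fenced block, then parse strictly.

```ts
// Illustrative sketch of fence-tolerant parsing, not the adapter's source.
function parseModelJson(raw: string): unknown {
  // Strip an optional ```json ... ``` fence around the payload.
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)\s*```/);
  const payload = fenced ? fenced[1] : raw.trim();
  return JSON.parse(payload); // throws on anything that is not valid JSON
}
```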
When to choose which extractor
Use `NaiveExtractor` when:
- you want deterministic tests
- you want a zero-dependency local example
- your input format is controlled and simple
- you do not want model cost or latency
Use `LlmFactExtractor` when:
- the text is messy, ambiguous, or conversational
- you need better topic selection
- you want richer descriptions, categories, or metadata
- you are building a real ingestion pipeline instead of a toy example
Recommended strategy
A practical development strategy is:
- Start with `NaiveExtractor`
- Build the surrounding ingestion flow and tests
- Swap in `LlmFactExtractor` when you need better recall and better structure (see the sketch below)
- Keep `NaiveExtractor` around for fixtures, regression tests, and offline demos
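One way to wire the swap, sketched with a hypothetical environment flag and the same placeholder model call used above:

```ts
import { NaiveExtractor, LlmFactExtractor } from 'identitydb';

// Placeholder from the earlier example; supply your own client.
declare function callYourFavoriteLlm(prompt: string): Promise<string>;

// Hypothetical wiring: USE_LLM_EXTRACTOR is a flag you would define,
// not something IdentityDB reads itself.
function buildExtractor() {
  if (process.env.USE_LLM_EXTRACTOR === '1') {
    return new LlmFactExtractor({
      model: { generateText: (prompt: string) => callYourFavoriteLlm(prompt) },
    });
  }
  return new NaiveExtractor(); // deterministic fallback for tests and demos
}
```

Because both extractors honor the same `FactExtractor` contract, nothing downstream of `buildExtractor` has to change when you flip the flag.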