Extractors
Shinwoo PARK edited this page 2026-05-11 12:30:36 +09:00


IdentityDB keeps extraction pluggable so you can decide how smart or how deterministic ingestion should be.

Today the package ships with two main extractor patterns:

  • NaiveExtractor — deterministic, rule-based, local
  • LlmFactExtractor — provider-agnostic adapter for external or local LLMs

NaiveExtractor

NaiveExtractor is the simplest built-in extractor. It does not use an LLM. Instead, it scans the input with a small set of rules and emits obvious topics.

What it looks for

Current behavior is intentionally narrow and predictable:

  • the standalone token I
  • 4-digit years such as 2025
  • capitalized tokens such as TypeScript, Bun, or JavaScript

How it labels topics

  • a 4-digit year becomes:
    • category: 'temporal'
    • granularity: 'concrete'
    • role: 'time'
  • I becomes:
    • category: 'entity'
    • granularity: 'concrete'
    • role: 'subject'
  • other capitalized tokens become:
    • category: 'entity'
    • granularity: 'concrete'
    • role: 'object'

Example

Input:

I have worked with TypeScript since 2025.

Typical extracted result:

{
  statement: 'I have worked with TypeScript since 2025.',
  topics: [
    { name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' },
    { name: 'TypeScript', category: 'entity', granularity: 'concrete', role: 'object' },
    { name: '2025', category: 'temporal', granularity: 'concrete', role: 'time' },
  ],
}
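For reference, the rules above are simple enough to sketch in a few lines. This is an illustrative reimplementation, not IdentityDB's actual source; the `Topic` shape and the `naiveExtract` name are inferred from the example output above.

```typescript
// Illustrative sketch of NaiveExtractor's rules -- not the library's real code.
type Topic = {
  name: string;
  category: 'entity' | 'temporal';
  granularity: 'concrete';
  role: 'subject' | 'object' | 'time';
};

function naiveExtract(statement: string): { statement: string; topics: Topic[] } {
  const topics: Topic[] = [];
  for (const token of statement.split(/\s+/)) {
    const word = token.replace(/[^\w]/g, ''); // strip surrounding punctuation
    if (!word) continue;
    if (word === 'I') {
      // the standalone token I -> subject
      topics.push({ name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' });
    } else if (/^\d{4}$/.test(word)) {
      // a 4-digit year -> temporal / time
      topics.push({ name: word, category: 'temporal', granularity: 'concrete', role: 'time' });
    } else if (/^[A-Z]/.test(word)) {
      // other capitalized tokens -> entity / object
      topics.push({ name: word, category: 'entity', granularity: 'concrete', role: 'object' });
    }
  }
  return { statement, topics };
}
```

Because the rules are pure string checks, the same input always yields the same topics, which is exactly what makes this style of extractor useful in fixtures.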

Why NaiveExtractor exists

It is useful because it is:

  • deterministic — the same input gives the same output
  • fast — no API calls or model latency
  • cheap — no model cost
  • good for tests — easy to reason about in fixtures and regression tests
  • good for demos — great when you want a minimal local example

Limitations

NaiveExtractor is intentionally not smart. It does not truly understand language. That means it can:

  • miss lowercase concepts
  • miss multi-word concepts that do not begin with capitals
  • over-trust capitalization
  • fail to infer nuanced categories or roles
  • produce weaker results on messy conversational text

Use it when you want predictability, not deep understanding.

LlmFactExtractor

LlmFactExtractor keeps the same FactExtractor contract but delegates extraction to a text-generating model.

import { LlmFactExtractor } from 'identitydb';

const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});

Why this adapter exists

The goal is to make LLM extraction possible without coupling IdentityDB to a specific SDK. The adapter only expects a model object with:

generateText(prompt: string): Promise<string>

That means you can bridge:

  • hosted APIs
  • local inference servers
  • wrappers around OpenAI-compatible APIs
  • wrappers around Anthropic-style APIs
  • your own orchestrator layer
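As a concrete sketch, an OpenAI-compatible endpoint could be bridged like this. The URL, model name, and response shape are assumptions about your own server, not part of IdentityDB; only the `generateText(prompt)` signature is what the adapter requires.

```typescript
// Hypothetical bridge to an OpenAI-compatible chat endpoint.
// Endpoint URL and model name are placeholders for your own setup.
const model = {
  async generateText(prompt: string): Promise<string> {
    const res = await fetch('http://localhost:8080/v1/chat/completions', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'your-model-name',
        messages: [{ role: 'user', content: prompt }],
      }),
    });
    const data: any = await res.json();
    // Standard OpenAI-style response shape; adjust for your server.
    return data.choices[0].message.content;
  },
};
```

Anything that can turn a prompt string into a string of model output fits behind this one method.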

Output requirements

The model is expected to return JSON only. The adapter validates the response before IdentityDB writes anything to the database. It also tolerates some common formatting noise, such as a fenced ```json block around the payload.
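The kind of tolerance described above can be sketched as follows; `parseModelJson` is a hypothetical helper for illustration, not the adapter's real internals.

```typescript
// Hypothetical helper: strip an optional fenced ```json block, then parse.
// A malformed payload still fails loudly via JSON.parse.
function parseModelJson(raw: string): unknown {
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)\s*```/);
  const payload = fenced ? fenced[1] : raw.trim();
  return JSON.parse(payload);
}
```

Validating before writing means a model that returns prose, or broken JSON, is rejected at the adapter boundary rather than polluting the database.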

When to choose which extractor

Use NaiveExtractor when:

  • you want deterministic tests
  • you want a zero-dependency local example
  • your input format is controlled and simple
  • you do not want model cost or latency

Use LlmFactExtractor when:

  • the text is messy, ambiguous, or conversational
  • you need better topic selection
  • you want richer descriptions, categories, or metadata
  • you are building a real ingestion pipeline instead of a toy example

A practical development strategy is:

  1. Start with NaiveExtractor
  2. Build the surrounding ingestion flow and tests
  3. Swap in LlmFactExtractor when you need better recall and better structure
  4. Keep NaiveExtractor around for fixtures, regression tests, and offline demos
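The swap in step 3 works because both extractors satisfy the same FactExtractor contract, so the ingestion pipeline can be parameterized by extractor. The real contract lives in 'identitydb'; the minimal local interface, the stub extractor, and the `extract` method name below are assumptions for illustration only.

```typescript
// Sketch of the swap strategy with a stand-in for the FactExtractor contract.
interface Fact {
  statement: string;
  topics: { name: string }[];
}

interface FactExtractorLike {
  extract(text: string): Promise<Fact>;
}

// Stub mimicking NaiveExtractor's capitalized-token rule, for the sketch.
class StubNaive implements FactExtractorLike {
  async extract(text: string): Promise<Fact> {
    const names = text.split(/\s+/).filter(w => /^[A-Z]/.test(w));
    return { statement: text, topics: names.map(name => ({ name })) };
  }
}

// The pipeline never cares which extractor it was handed.
async function ingest(text: string, extractor: FactExtractorLike): Promise<Fact> {
  return extractor.extract(text);
}
```

Swapping extractors then touches one constructor call at the edge of the pipeline, leaving the rest of the ingestion flow and its tests untouched.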