Extractors
Shinwoo PARK edited this page 2026-05-11 12:30:36 +09:00


IdentityDB keeps extraction pluggable so you can decide how smart or how deterministic ingestion should be.

Today the package ships with two main extractor patterns:

  • NaiveExtractor — deterministic, rule-based, local
  • LlmFactExtractor — provider-agnostic adapter for external or local LLMs

NaiveExtractor

NaiveExtractor is the simplest built-in extractor. It does not use an LLM. Instead, it scans the input with a small set of rules and emits obvious topics.

What it looks for

Current behavior is intentionally narrow and predictable:

  • the standalone token I
  • 4-digit years such as 2025
  • capitalized tokens such as TypeScript, Bun, or JavaScript

How it labels topics

  • a 4-digit year becomes:
    • category: 'temporal'
    • granularity: 'concrete'
    • role: 'time'
  • I becomes:
    • category: 'entity'
    • granularity: 'concrete'
    • role: 'subject'
  • other capitalized tokens become:
    • category: 'entity'
    • granularity: 'concrete'
    • role: 'object'

Example

Input:

I have worked with TypeScript since 2025.

Typical extracted result:

{
  statement: 'I have worked with TypeScript since 2025.',
  topics: [
    { name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' },
    { name: 'TypeScript', category: 'entity', granularity: 'concrete', role: 'object' },
    { name: '2025', category: 'temporal', granularity: 'concrete', role: 'time' },
  ],
}
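For reference, the rules above are simple enough to sketch in a few lines. This is an illustrative reimplementation, not IdentityDB's actual source; the `Topic` shape and the `naiveExtract` name are inferred from the example output above.

```typescript
// Illustrative sketch of NaiveExtractor's rules -- not the library's real code.
type Topic = {
  name: string;
  category: 'entity' | 'temporal';
  granularity: 'concrete';
  role: 'subject' | 'object' | 'time';
};

function naiveExtract(statement: string): { statement: string; topics: Topic[] } {
  const topics: Topic[] = [];
  for (const token of statement.split(/\s+/)) {
    const word = token.replace(/[^\w]/g, ''); // strip surrounding punctuation
    if (!word) continue;
    if (word === 'I') {
      // the standalone token I -> subject
      topics.push({ name: 'I', category: 'entity', granularity: 'concrete', role: 'subject' });
    } else if (/^\d{4}$/.test(word)) {
      // a 4-digit year -> temporal / time
      topics.push({ name: word, category: 'temporal', granularity: 'concrete', role: 'time' });
    } else if (/^[A-Z]/.test(word)) {
      // other capitalized tokens -> entity / object
      topics.push({ name: word, category: 'entity', granularity: 'concrete', role: 'object' });
    }
  }
  return { statement, topics };
}
```

Because the rules are pure string checks, the same input always yields the same topics, which is exactly what makes this style of extractor useful in fixtures.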

Why NaiveExtractor exists

It is useful because it is:

  • deterministic — the same input gives the same output
  • fast — no API calls or model latency
  • cheap — no model cost
  • good for tests — easy to reason about in fixtures and regression tests
  • good for demos — great when you want a minimal local example

Limitations

NaiveExtractor is intentionally not smart. It does not truly understand language. That means it can:

  • miss lowercase concepts
  • miss multi-word concepts that do not begin with capitals
  • over-trust capitalization
  • fail to infer nuanced categories or roles
  • produce weaker results on messy conversational text

Use it when you want predictability, not deep understanding.

LlmFactExtractor

LlmFactExtractor keeps the same FactExtractor contract but delegates extraction to a text-generating model.

import { LlmFactExtractor } from 'identitydb';

const extractor = new LlmFactExtractor({
  model: {
    async generateText(prompt) {
      return callYourFavoriteLlm(prompt);
    },
  },
  instructions: 'Prefer technology, product, and time topics over generic nouns.',
});

Why this adapter exists

The goal is to make LLM extraction possible without coupling IdentityDB to a specific SDK. The adapter only expects a model object with:

generateText(prompt: string): Promise<string>

That means you can bridge:

  • hosted APIs
  • local inference servers
  • wrappers around OpenAI-compatible APIs
  • wrappers around Anthropic-style APIs
  • your own orchestrator layer
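As a concrete sketch, an OpenAI-compatible endpoint could be bridged like this. The URL, model name, and response shape are assumptions about your own server, not part of IdentityDB; only the `generateText(prompt)` signature is what the adapter requires.

```typescript
// Hypothetical bridge to an OpenAI-compatible chat endpoint.
// Endpoint URL and model name are placeholders for your own setup.
const model = {
  async generateText(prompt: string): Promise<string> {
    const res = await fetch('http://localhost:8080/v1/chat/completions', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'your-model-name',
        messages: [{ role: 'user', content: prompt }],
      }),
    });
    const data: any = await res.json();
    // Standard OpenAI-style response shape; adjust for your server.
    return data.choices[0].message.content;
  },
};
```

Anything that can turn a prompt string into a string of model output fits behind this one method.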

Output requirements

The model is expected to return JSON only. The adapter validates the response before IdentityDB writes anything to the database. It also tolerates some common formatting noise, such as a fenced ```json block around the payload.
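The kind of tolerance described above can be sketched as follows; `parseModelJson` is a hypothetical helper for illustration, not the adapter's real internals.

```typescript
// Hypothetical helper: strip an optional fenced ```json block, then parse.
// A malformed payload still fails loudly via JSON.parse.
function parseModelJson(raw: string): unknown {
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)\s*```/);
  const payload = fenced ? fenced[1] : raw.trim();
  return JSON.parse(payload);
}
```

Validating before writing means a model that returns prose, or broken JSON, is rejected at the adapter boundary rather than polluting the database.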

When to choose which extractor

Use NaiveExtractor when:

  • you want deterministic tests
  • you want a zero-dependency local example
  • your input format is controlled and simple
  • you do not want model cost or latency

Use LlmFactExtractor when:

  • the text is messy, ambiguous, or conversational
  • you need better topic selection
  • you want richer descriptions, categories, or metadata
  • you are building a real ingestion pipeline instead of a toy example

A practical development strategy is:

  1. Start with NaiveExtractor
  2. Build the surrounding ingestion flow and tests
  3. Swap in LlmFactExtractor when you need better recall and better structure
  4. Keep NaiveExtractor around for fixtures, regression tests, and offline demos
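The swap in step 3 works because both extractors satisfy the same FactExtractor contract, so the ingestion pipeline can be parameterized by extractor. The real contract lives in 'identitydb'; the minimal local interface, the stub extractor, and the `extract` method name below are assumptions for illustration only.

```typescript
// Sketch of the swap strategy with a stand-in for the FactExtractor contract.
interface Fact {
  statement: string;
  topics: { name: string }[];
}

interface FactExtractorLike {
  extract(text: string): Promise<Fact>;
}

// Stub mimicking NaiveExtractor's capitalized-token rule, for the sketch.
class StubNaive implements FactExtractorLike {
  async extract(text: string): Promise<Fact> {
    const names = text.split(/\s+/).filter(w => /^[A-Z]/.test(w));
    return { statement: text, topics: names.map(name => ({ name })) };
  }
}

// The pipeline never cares which extractor it was handed.
async function ingest(text: string, extractor: FactExtractorLike): Promise<Fact> {
  return extractor.extract(text);
}
```

Swapping extractors then touches one constructor call at the edge of the pipeline, leaving the rest of the ingestion flow and its tests untouched.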