Overview

The clinical data processing pipeline in HoneyBee handles clinical text end-to-end: it ingests documents (PDF, scanned images, EHR exports), extracts named entities, detects context (negation, uncertainty), links entities to medical ontologies (SNOMED CT, UMLS, BioPortal), builds patient timelines, generates biomedical embeddings, and exports FHIR R4 bundles. The ClinicalProcessor class exposes every step as a standalone method, so you can compose custom pipelines or run the full chain with a single call to HoneyBee.process_clinical().

Clinical Processing Pipeline

Key Features

  • Multi-backend PDF/image/EHR ingestion (Unstructured, PyMuPDF, PyPDF2, Tesseract OCR fallback)
  • Named entity recognition with 4 backends (transformer, scispacy, medspacy, medcat)
  • Context detection — negation, uncertainty, historical, and family flags
  • Proximity-based relationship extraction between entities
  • Temporal timeline construction with date parsing and entity linkage
  • SNOMED CT / UMLS / BioPortal ontology linking via Snowstorm
  • 7 biomedical embedding model presets (BioClinicalBERT, PubMedBERT, BioBERT, SciBERT, GatorTron, Clinical-T5, sentence-transformers)
  • FHIR R4 transaction bundle export
  • High-level HoneyBee.process_clinical() one-call API

Quick Start

Install HoneyBee and run the full clinical pipeline on a sample note:

quickstart.py
from honeybee import HoneyBee

hb = HoneyBee()

clinical_note = """Patient: Jane Doe, 58-year-old postmenopausal female.
Date of Visit: 03/15/2024

Chief Complaint: Follow-up evaluation for recently diagnosed breast cancer.

History of Present Illness:
Patient was diagnosed with invasive ductal carcinoma of the left breast on 01/10/2024
following an abnormal screening mammogram on 12/05/2023. Core needle biopsy confirmed
Grade 2 moderately differentiated adenocarcinoma. Tumor measures 2.3 cm on imaging.
Staging workup including CT chest/abdomen/pelvis and bone scan completed on 02/01/2024
showed no evidence of distant metastasis. Final pathologic staging: T2N1M0, Stage IIB.

Biomarker Profile:
ER: positive (95%), PR: positive (80%), HER2: negative (IHC 1+), Ki-67: 22%.
Genomic testing (Oncotype DX) returned a recurrence score of 18 on 02/20/2024.

Past Medical History: Hypertension, Type 2 diabetes mellitus, hyperlipidemia.

Medications:
Tamoxifen 20 mg daily, Metformin 1000 mg bid, Lisinopril 10 mg daily,
Atorvastatin 40 mg daily, Ondansetron 8 mg prn.

Allergies: No known drug allergies.

Laboratory Results (03/15/2024):
Hemoglobin: 11.2 g/dL, White Blood Cell Count: 5.8 K/uL, Platelet Count: 210 K/uL,
Creatinine: 0.9 mg/dL, Albumin: 3.8 g/dL, Calcium: 9.4 mg/dL.

Assessment and Plan:
1. Invasive ductal carcinoma, left breast, T2N1M0, Stage IIB, ER+/PR+/HER2-.
   - Oncotype DX recurrence score 18 supports adjuvant chemotherapy followed by
     endocrine therapy.
   - Plan: Begin dose-dense AC-T (doxorubicin/cyclophosphamide x4 cycles followed by
     paclitaxel x4 cycles) starting 04/01/2024.
   - Refer to radiation oncology for post-chemotherapy radiation planning.
   - Continue tamoxifen; transition to aromatase inhibitor after chemotherapy.
2. Pre-chemotherapy cardiac evaluation: Echocardiogram scheduled for 03/25/2024.
3. Follow-up in 2 weeks for lab work and treatment initiation.
"""

result = hb.process_clinical(text=clinical_note)

print(f"Entities:      {len(result.entities)}")
print(f"Relationships: {len(result.relationships)}")
print(f"Timeline:      {len(result.timeline)} events")
Output
Entities:      63
Relationships: 36
Timeline:      8 events

Document Ingestion

HoneyBee can ingest clinical PDFs directly — it tries multiple extraction backends (Unstructured, PyMuPDF, PyPDF2) and falls back to Tesseract OCR for scanned documents. Here we download a sample pathology report from the honeybee-samples dataset and process it end-to-end.

document_ingestion.py
from huggingface_hub import hf_hub_download

# Download sample clinical PDF from HuggingFace
pdf_path = hf_hub_download(
    repo_id="Lab-Rasool/honeybee-samples",
    filename="sample.PDF",
    repo_type="dataset",
)
print(f"Downloaded: {pdf_path}")

# Process the PDF through the full clinical pipeline
pdf_result = hb.process_clinical(document_path=pdf_path)

# Show extraction metadata
meta = pdf_result.document.metadata
print(f"Extraction method: {meta.get('extraction_method', 'unknown')}")
print(f"Pages: {meta.get('num_pages', 'N/A')}")
print(f"Text length: {len(pdf_result.document.text):,} chars")
print(f"Entities: {len(pdf_result.entities)}")
print(f"Relationships: {len(pdf_result.relationships)}")
Output
Downloaded: .../datasets--Lab-Rasool--honeybee-samples/.../sample.PDF

Extraction method: pypdf2
Pages: 1
Text length: 1,470 chars
Entities: 21
Relationships: 4
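
The backend chain itself follows a simple pattern: try each extractor in order and return the first non-empty result. A minimal sketch of that pattern — the extractor callables below are stand-ins for illustration, not HoneyBee's actual backend functions:

fallback_sketch.py
```python
def extract_with_fallback(path, extractors):
    """Try each (name, fn) extractor in order; return the first non-empty text."""
    for name, extract in extractors:
        try:
            text = extract(path)
            if text and text.strip():
                return name, text
        except Exception:
            continue  # backend unavailable or failed; try the next one
    return None, ""

def failing_backend(path):
    # Stand-in for a backend that is not installed or cannot parse the file
    raise RuntimeError("backend not installed")

backends = [
    ("unstructured", failing_backend),
    ("pypdf2", lambda p: "GROSS DESCRIPTION: 2.3 cm mass, left breast."),
]
method, text = extract_with_fallback("report.pdf", backends)
print(method)  # pypdf2
```

In HoneyBee the last link in the chain is Tesseract OCR, which handles scanned documents that yield no embedded text.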

Extracted Entities

Each entity is a ClinicalEntity with typed spans (condition, medication, procedure, measurement, temporal, anatomy, dosage), a confidence score, context flags (is_negated, is_uncertain, is_historical, is_family), and optional ontology_codes.

extracted_entities.py
from collections import Counter

entities = result.entities
type_counts = Counter(e.type for e in entities)

# Show entity type distribution
for etype, count in type_counts.most_common():
    print(f"  {etype:<14s}: {count}")

# Top-15 entities
print(f"\n{'Text':>35s}  {'Type':<12s}  {'Conf':>5s}")
print("-" * 58)
for e in entities[:15]:
    print(f"{e.text[:35]:>35s}  {e.type:<12s}  {e.confidence:>5.2f}")
Output
  procedure      : 24
  medication     : 14
  measurement    : 10
  condition      : 7
  temporal       : 4
  dosage         : 4

                               Text  Type          Conf
----------------------------------------------------------
                        58-year-old  temporal       0.88
                    ductal carcinoma  condition      1.00
                            Grade 2  measurement    0.98
                              Tumor  condition      1.00
                                HER  procedure      0.99
                         Lisinopril  medication     0.97
                       chemotherapy  medication     1.00
                 radiation oncology  procedure      0.87
                       chemotherapy  medication     0.92
                       chemotherapy  procedure      0.83

Annotated Clinical Note

The same entities shown inline in the original text — each highlight corresponds to an entity type detected by the NER pipeline.

Patient: Jane Doe, 58-year-old postmenopausal female.
Date of Visit: 03/15/2024

Chief Complaint: Follow-up evaluation for recently diagnosed breast cancer.

History of Present Illness:
Patient was diagnosed with invasive ductal carcinoma of the left breast on 01/10/2024 following an abnormal screening mammogram on 12/05/2023. Core needle biopsy confirmed Grade 2 moderately differentiated adenocarcinoma. Tumor measures 2.3 cm on imaging. Staging workup including CT chest/abdomen/pelvis and bone scan completed on 02/01/2024 showed no evidence of distant metastasis. Final pathologic staging: T2N1M0, Stage IIB.

Biomarker Profile:
ER: positive (95%), PR: positive (80%), HER2: negative (IHC 1+), Ki-67: 22%. Genomic testing (Oncotype DX) returned a recurrence score of 18 on 02/20/2024.

Past Medical History: Hypertension, Type 2 diabetes mellitus, hyperlipidemia.

Medications:
Tamoxifen 20 mg daily, Metformin 1000 mg bid, Lisinopril 10 mg daily, Atorvastatin 40 mg daily, Ondansetron 8 mg prn.

Allergies: No known drug allergies.

Laboratory Results (03/15/2024):
Hemoglobin: 11.2 g/dL, White Blood Cell Count: 5.8 K/uL, Platelet Count: 210 K/uL, Creatinine: 0.9 mg/dL, Albumin: 3.8 g/dL, Calcium: 9.4 mg/dL.

Assessment and Plan:
1. Invasive ductal carcinoma, left breast, T2N1M0, Stage IIB, ER+/PR+/HER2-.
   - Oncotype DX recurrence score 18 supports adjuvant chemotherapy followed by endocrine therapy.
   - Plan: Begin dose-dense AC-T (doxorubicin/cyclophosphamide x4 cycles followed by paclitaxel x4 cycles) starting 04/01/2024.
   - Refer to radiation oncology for post-chemotherapy radiation planning.
   - Continue tamoxifen; transition to aromatase inhibitor after chemotherapy.
2. Pre-chemotherapy cardiac evaluation: Echocardiogram scheduled for 03/25/2024.
3. Follow-up in 2 weeks for lab work and treatment initiation.

Highlight legend: condition, dosage, measurement, medication, procedure, temporal
Generate this visualization in Python

Reproduce the annotated note above from a process_clinical() result in a Jupyter notebook:

annotated_note.py
from html import escape
from IPython.display import HTML, display

entities = result.entities

# Colour per entity type (alphabetical order)
palette = {
    "condition": "#a8d8ea", "dosage": "#ffcfdf",
    "measurement": "#fefdca", "medication": "#c3bef7",
    "procedure": "#cadefc", "temporal": "#f7d794",
}

# Resolve overlapping spans — keep earlier / longer
sorted_ents = sorted(entities, key=lambda e: (e.start, -(e.end - e.start)))
keep, last_end = [], -1
for e in sorted_ents:
    if e.start >= last_end:
        keep.append(e)
        last_end = e.end

# Build annotated HTML
parts, prev = [], 0
for e in keep:
    if e.start > prev:
        parts.append(escape(clinical_note[prev:e.start]))
    bg = palette.get(e.type, "#d5dbdb")
    tip = f"{e.type} (conf {e.confidence:.2f})"
    parts.append(
        f'<span style="background:{bg};padding:1px 3px;border-radius:3px"'
        f' title="{tip}">{escape(clinical_note[e.start:e.end])}</span>'
    )
    prev = e.end
if prev < len(clinical_note):
    parts.append(escape(clinical_note[prev:]))

# Legend
legend = " &nbsp; ".join(
    f'<span style="background:{c};padding:2px 8px;border-radius:3px">{t}</span>'
    for t, c in palette.items()
)

display(HTML(
    f'<div style="font-family:monospace;font-size:13px;line-height:1.8;'
    f'white-space:pre-wrap;max-width:900px">{"".join(parts)}</div>'
    f'<div style="margin-top:12px">{legend}</div>'
))

Cancer-Specific Entities

The clinical note describes breast cancer with biomarkers (ER+, PR+, HER2−), staging (T2N1M0, Stage IIB), and tumor type (invasive ductal carcinoma). The default transformer NER backend maps these to generic types. We can filter for oncology-relevant entities by matching known cancer terms.

cancer_entities.py
# Define oncology keyword categories
CANCER_KEYWORDS = {
    "tumor": ["carcinoma", "adenocarcinoma", "cancer", "tumor",
              "ductal", "invasive", "glioblastoma"],
    "biomarker": ["ER", "PR", "HER2", "Ki-67", "EGFR", "ALK",
                  "PD-L1", "Oncotype", "GFAP", "MIB-1"],
    "staging": ["T2N1M0", "Stage", "Grade", "metastasis"],
    "treatment": ["doxorubicin", "cyclophosphamide", "paclitaxel",
                  "tamoxifen", "pembrolizumab", "chemotherapy", "radiation"],
}

# Filter entities
print(f"{'Entity Text':>35s}  {'NER Type':<12s}  {'Onc. Category':<14s}")
print("-" * 65)
for e in entities:
    for cat, keywords in CANCER_KEYWORDS.items():
        if any(kw.lower() in e.text.lower() for kw in keywords):
            print(f"{e.text[:35]:>35s}  {e.type:<12s}  {cat:<14s}")
            break
Output
                        Entity Text  NER Type      Onc. Category
-----------------------------------------------------------------
                   ductal carcinoma  condition     tumor
                            Grade 2  measurement   staging
                              Tumor  condition     tumor
                       chemotherapy  medication    biomarker
                 radiation oncology  procedure     treatment
                       chemotherapy  medication    biomarker
                       chemotherapy  procedure     biomarker

Tip: Naive substring matching mislabels some rows above — "chemotherapy" lands in the biomarker category because it contains "her" (from the HER2 keyword). Use word-boundary matching, or configure the scispacy or medspacy backend for dedicated tumor, biomarker, and staging entity types.

Context Attributes

Each entity carries four boolean flags that capture how it appears in clinical narrative:

Flag            Meaning             Example
is_negated      Explicitly denied   "no evidence of distant metastasis"
is_uncertain    Hedging language    "possible pneumonia"
is_historical   Past event          "history of hypertension"
is_family       Family member       "mother had breast cancer"

The default transformer backend does not perform context detection. Enable the medspacy backend (pip install medspacy) for automatic detection via ConText:

context_attributes.py
# Default transformer backend — no context flags set
sample = entities[0]
print(f"Entity: {sample.text!r}")
print(f"  is_negated:    {sample.is_negated}")
print(f"  is_uncertain:  {sample.is_uncertain}")
print(f"  is_historical: {sample.is_historical}")
print(f"  is_family:     {sample.is_family}")

# Enable context detection with medspacy
hb = HoneyBee(config={"clinical": {"ner": {"backends": ["medspacy"]}}})
Output
Entity: '58-year-old'
  is_negated:    False
  is_uncertain:  False
  is_historical: False
  is_family:     False
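
To see the kind of logic ConText adds, here is a toy illustration of window-based trigger matching — a deliberately simplified stand-in, not medspaCy's actual implementation:

context_sketch.py
```python
# Trigger phrases scanned for in the text preceding an entity (illustrative lists)
NEGATION_TRIGGERS = ["no evidence of", "denies", "negative for"]
FAMILY_TRIGGERS = ["mother", "father", "family history of"]

def detect_context(sentence, entity_text, window=60):
    """Flag an entity by scanning the window of text preceding it for triggers."""
    idx = sentence.lower().find(entity_text.lower())
    if idx == -1:
        return {"is_negated": False, "is_family": False}
    left = sentence.lower()[max(0, idx - window):idx]
    return {
        "is_negated": any(t in left for t in NEGATION_TRIGGERS),
        "is_family": any(t in left for t in FAMILY_TRIGGERS),
    }

print(detect_context("Staging showed no evidence of distant metastasis.", "metastasis"))
# {'is_negated': True, 'is_family': False}
print(detect_context("Mother had breast cancer.", "breast cancer"))
# {'is_negated': False, 'is_family': True}
```

The real ConText algorithm additionally handles trigger termination, bidirectional scope, and a much larger trigger lexicon.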

Entity Relationships

The processor detects proximity-based relationships between entities (medication–dosage, condition–treatment, tumor–staging, etc.).

relationships.py
print(f"{'Source':>30s}  {'Relation':<18s}  Target")
print("-" * 80)
for rel in result.relationships[:12]:
    src = rel.get("source_text", entities[rel["source_idx"]].text)[:30]
    tgt = rel.get("target_text", entities[rel["target_idx"]].text)[:30]
    print(f"{src:>30s}  {rel['type']:<18s}  {tgt}")
Output
                        Source  Relation            Target
--------------------------------------------------------------------------------
              ductal carcinoma  temporal_relation   /2024
                         /2024  temporal_relation   screening
                          mamm  has_result          Grade 2
                 needle biopsy  has_result          Grade 2
                 needle biopsy  investigated_by     adenocarcino
                          3 cm  has_result          CT
                            CT  temporal_relation   02/01/2024
                          scan  temporal_relation   02/01/2024
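
Proximity-based extraction pairs entities whose spans fall within a character window and assigns a relation label from the pair of entity types. A minimal sketch of the idea — the pairing rules here are illustrative, not HoneyBee's exact rule table:

proximity_sketch.py
```python
# (type_a, type_b) -> relation label; frozenset makes the lookup order-insensitive
PAIR_RULES = {
    frozenset(["medication", "dosage"]): "has_dosage",
    frozenset(["condition", "medication"]): "treated_with",
    frozenset(["procedure", "temporal"]): "temporal_relation",
}

def extract_relationships(entities, max_gap=50):
    """Link entity pairs whose spans lie within max_gap characters of each other."""
    rels = []
    for i, a in enumerate(entities):
        for j, b in enumerate(entities[i + 1:], start=i + 1):
            if b["start"] - a["end"] > max_gap:
                break  # entities are span-sorted, so all later ones are farther
            rel = PAIR_RULES.get(frozenset([a["type"], b["type"]]))
            if rel:
                rels.append({"source_idx": i, "target_idx": j, "type": rel})
    return rels

ents = [
    {"text": "Tamoxifen", "type": "medication", "start": 0, "end": 9},
    {"text": "20 mg", "type": "dosage", "start": 10, "end": 15},
    {"text": "daily", "type": "temporal", "start": 16, "end": 21},
]
print(extract_relationships(ents))
# [{'source_idx': 0, 'target_idx': 1, 'type': 'has_dosage'}]
```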

Patient Timeline

Dates are extracted from the text and linked to nearby clinical entities to build a chronological patient timeline.

timeline.py
timeline = result.timeline

print(f"{'Date Text':>15s}  {'Parsed':>12s}  Related Entities")
print("-" * 75)
for ev in timeline:
    related = [entities[i].text[:25] for i in ev.related_entities
               if i < len(entities)]
    label = ", ".join(related[:3])
    if len(related) > 3:
        label += f" (+{len(related) - 3})"
    print(f"{ev.date_text:>15s}  {str(ev.date)[:10]:>12s}  {label}")
Output
      Date Text        Parsed  Related Entities
---------------------------------------------------------------------------
     12/05/2023    2023-12-05  ductal carcinoma, screening, mamm (+8)
     01/10/2024    2024-01-10  ductal carcinoma, screening, mamm (+5)
     02/01/2024    2024-02-01  adenocarcino, Tumor, 2 (+8)
     02/20/2024    2024-02-20  positive, positive, HER (+13)
     03/15/2024    2024-03-15  ductal carcinoma
     03/15/2024    2024-03-15  ductal carcinoma
     03/25/2024    2024-03-25  tam, tase inhibitor, chemotherapy (+1)
     04/01/2024    2024-04-01  AC, do, or (+5)

The timeline can be visualized as a scatter plot using matplotlib, with events positioned chronologically and labeled with their most relevant clinical entities.
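
For example, with matplotlib — hypothetical (date, label) pairs stand in for result.timeline so the snippet runs standalone:

timeline_plot.py
```python
from datetime import datetime
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt

# Hypothetical event data standing in for result.timeline
events = [
    (datetime(2023, 12, 5), "screening mammogram"),
    (datetime(2024, 1, 10), "diagnosis: ductal carcinoma"),
    (datetime(2024, 2, 1), "staging CT + bone scan"),
    (datetime(2024, 4, 1), "chemotherapy start"),
]

dates = [d for d, _ in events]
fig, ax = plt.subplots(figsize=(9, 2.5))
ax.scatter(dates, [0] * len(dates), zorder=3)
for d, label in events:
    ax.annotate(label, (d, 0), xytext=(0, 10), textcoords="offset points",
                rotation=30, fontsize=8)
ax.get_yaxis().set_visible(False)
ax.set_title("Patient timeline")
fig.tight_layout()
fig.savefig("timeline.png", dpi=150)
```

With a real result, the (date, label) pairs come from ev.date and the related entity texts shown in the table above.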

FHIR Export

Convert the processing result into a FHIR R4 transaction Bundle. Entity types map to FHIR resources: conditions to Condition, medications to MedicationStatement, measurements to Observation.

fhir_export.py
import json
from collections import Counter

bundle = hb.clinical_processor.to_fhir(result, patient_id="patient-001")

entries = bundle.get("entry", [])
resource_types = Counter(e["resource"]["resourceType"] for e in entries)
print(f"Bundle: {bundle['resourceType']}  ({len(entries)} entries)")
for rtype, count in resource_types.most_common():
    print(f"  {rtype:<25s}: {count}")

print(f"\n--- Sample Condition resource ---")
print(json.dumps(entries[0]["resource"], indent=2))
Output
Bundle: Bundle  (32 entries)
  MedicationStatement      : 14
  Observation              : 10
  Condition                : 7
  DiagnosticReport         : 1

--- Sample Condition resource ---
{
  "resourceType": "Condition",
  "code": {
    "text": "ductal carcinoma"
  },
  "clinicalStatus": {
    "coding": [
      {
        "system": "http://terminology.hl7.org/CodeSystem/condition-clinical",
        "code": "active"
      }
    ]
  },
  "subject": {
    "reference": "Patient/patient-001"
  }
}
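
The per-entity mapping is mechanical. A simplified sketch of what to_fhir() plausibly does for each entity (resource skeletons only — not the full exporter, and the rule table is an assumption based on the output above):

fhir_mapping_sketch.py
```python
# Entity type -> FHIR R4 resource type, per the mapping described above
TYPE_TO_RESOURCE = {
    "condition": "Condition",
    "medication": "MedicationStatement",
    "measurement": "Observation",
}

def entity_to_fhir(entity, patient_id):
    """Build a minimal FHIR R4 transaction-bundle entry for one entity."""
    rtype = TYPE_TO_RESOURCE.get(entity["type"])
    if rtype is None:
        return None  # entity type has no FHIR mapping
    resource = {
        "resourceType": rtype,
        "code": {"text": entity["text"]},
        "subject": {"reference": f"Patient/{patient_id}"},
    }
    # Transaction bundles pair each resource with a request directive
    return {"resource": resource, "request": {"method": "POST", "url": rtype}}

entry = entity_to_fhir({"text": "ductal carcinoma", "type": "condition"}, "patient-001")
print(entry["resource"]["resourceType"])  # Condition
```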

Ontology Linking

HoneyBee can link extracted entities to standard medical ontologies (SNOMED CT, RxNorm, LOINC, ICD-10). The Snowstorm backend queries the free SNOMED International terminology server — no API key needed. Each entity gains an ontology_codes list of OntologyCode(system, code, display, source_api) objects.

ontology_linking.py
# Re-process with Snowstorm ontology linking enabled
hb_onto = HoneyBee(config={"clinical": {"ontology": {"backends": ["snowstorm"]}}})
result_onto = hb_onto.process_clinical(text=clinical_note)

coded = [e for e in result_onto.entities if e.ontology_codes]

if coded:
    print(f"{'Entity':>30s}  {'System':<12s}  {'Code':<14s}  Display")
    print("-" * 90)
    for e in coded[:15]:
        oc = e.ontology_codes[0]
        print(f"{e.text[:30]:>30s}  {oc.system:<12s}  {oc.code:<14s}  {oc.display[:35]}")
    print(f"\n{len(coded)} / {len(result_onto.entities)} entities linked")
else:
    print("Snowstorm returned 0 codes (public server may be rate-limiting).")

Note: The public Snowstorm server enforces rate limits. If many entities time out, try again later or point to a local Snowstorm instance:

local_snowstorm.py
hb = HoneyBee(config={"clinical": {"ontology": {
    "backends": ["snowstorm"],
    "snowstorm_base_url": "http://localhost:8080/snowstorm/snomed-ct",
}}})

Biomedical Embeddings

Generate dense vector representations with pre-configured biomedical language models:

Available Models

Preset                 Description
bioclinicalbert        Clinical BERT trained on MIMIC-III notes (default)
pubmedbert             BERT trained on PubMed abstracts
biobert                BioBERT for biomedical text mining
scibert                SciBERT trained on scientific publications
gatortron              Clinical foundation model (gated)
clinicalt5             T5 fine-tuned on PubMed Central
sentence-transformers  Lightweight general-purpose sentence embeddings

Generate Embeddings

embeddings.py
import numpy as np

# Single model
emb = hb.generate_embeddings(clinical_note, modality="clinical",
                              model_name="bioclinicalbert")
print(f"bioclinicalbert: shape={emb.shape}, norm={np.linalg.norm(emb):.2f}")

# Multi-model comparison
for name in ["bioclinicalbert", "pubmedbert", "biobert", "scibert"]:
    e = hb.generate_embeddings(clinical_note, modality="clinical", model_name=name)
    print(f"{name:>20s}: shape={e.shape}, norm={np.linalg.norm(e):.2f}")
Output
bioclinicalbert: shape=(1, 768), norm=10.75
     bioclinicalbert: shape=(1, 768), norm=10.75
          pubmedbert: shape=(1, 768), norm=13.51
             biobert: shape=(1, 768), norm=9.34
             scibert: shape=(1, 768), norm=11.22

Embedding Similarity

Compare how different clinical text segments relate to each other using cosine similarity:

similarity.py
import numpy as np

segments = [
    "invasive ductal carcinoma Grade 2",
    "breast cancer T2N1M0 Stage IIB",
    "ER positive PR positive HER2 negative",
    "tamoxifen 20 mg daily",
    "doxorubicin cyclophosphamide paclitaxel",
    "hemoglobin 11.2 albumin 3.8",
    "non-small cell lung cancer EGFR positive",
    "carboplatin paclitaxel pembrolizumab",
]

seg_emb = hb.generate_embeddings(segments, modality="clinical",
                                  model_name="bioclinicalbert")
norms = np.linalg.norm(seg_emb, axis=1, keepdims=True)
norms = np.where(norms == 0, 1, norms)
similarity = (seg_emb / norms) @ (seg_emb / norms).T

# Print top-5 most similar pairs (excluding self-similarity)
pairs = []
for i in range(len(segments)):
    for j in range(i + 1, len(segments)):
        pairs.append((similarity[i, j], segments[i][:30], segments[j][:30]))
pairs.sort(reverse=True)

print("Top-5 most similar segment pairs:")
for sim, a, b in pairs[:5]:
    print(f"  {sim:.3f}  {a}  <->  {b}")

A cosine similarity heatmap can be generated with matplotlib to visualize the full pairwise similarity matrix across all segments.
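
For example — a synthetic embedding matrix stands in for seg_emb so the snippet is self-contained:

similarity_heatmap.py
```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
vecs = rng.normal(size=(8, 768))  # stand-in for seg_emb from the snippet above
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = unit @ unit.T  # full pairwise cosine similarity matrix

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(sim, vmin=-1, vmax=1, cmap="RdBu_r")
ax.set_xticks(range(8))
ax.set_yticks(range(8))
fig.colorbar(im, ax=ax, label="cosine similarity")
ax.set_title("Pairwise segment similarity")
fig.tight_layout()
fig.savefig("similarity_heatmap.png", dpi=150)
```

With real data, replace vecs with seg_emb and label the ticks with the segment texts.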

Batch Processing

Process multiple notes and compare aggregate statistics. For file-based workflows, use hb.process_clinical_batch(input_dir=..., file_pattern="*.pdf").

batch_processing.py
comparison_note = """Patient: John Smith, 65-year-old male.
Diagnosed with stage IIIA non-small cell lung cancer (adenocarcinoma) on 06/15/2024.
EGFR: positive, ALK: negative, PD-L1: positive (75%).
Started on carboplatin 450 mg iv plus paclitaxel 200 mg iv q3w on 07/01/2024.
Pembrolizumab 200 mg iv q3w added as immunotherapy.
CT scan on 09/15/2024 showed partial response.
"""

result2 = hb.process_clinical(text=comparison_note)

# Side-by-side summary statistics
for label, r in [("Jane Doe (breast)", result), ("John Smith (lung)", result2)]:
    stats = hb.clinical_processor.get_summary_statistics(r)
    print(f"--- {label} ---")
    print(f"  Text length:    {stats['text_length']:,} chars")
    print(f"  Entities:       {stats['num_entities']}")
    print(f"  Entity types:   {dict(stats['entity_types'])}")
    print(f"  Relationships:  {stats['num_relationships']}")
    print(f"  Timeline:       {stats['num_timeline_events']} events")
    print()
Output
--- Jane Doe (breast) ---
  Text length:    1,853 chars
  Entities:       63
  Entity types:   {'temporal': 4, 'condition': 7, 'procedure': 24, 'measurement': 10, 'medication': 14, 'dosage': 4}
  Relationships:  36
  Timeline:       8 events

--- John Smith (lung) ---
  Text length:    355 chars
  Entities:       27
  Entity types:   {'temporal': 5, 'condition': 2, 'procedure': 5, 'measurement': 4, 'medication': 7, 'dosage': 4}
  Relationships:  44
  Timeline:       3 events

Output Structure

result.to_dict() serializes the full pipeline output to a JSON-compatible dictionary — the canonical format for saving results, feeding downstream models, or exporting to external systems.

output_structure.py
import json

output = result.to_dict()

print("Top-level keys:", list(output.keys()))
print(f"  text:          {len(output['text']):,} chars")
print(f"  sections:      {len(output['sections'])} sections")
print(f"  metadata:      {output['metadata']}")
print(f"  entities:      {len(output['entities'])} items")
print(f"  relationships: {len(output['relationships'])} items")
print(f"  timeline:      {len(output['timeline'])} items")
Output
Top-level keys: ['text', 'sections', 'metadata', 'entities', 'relationships', 'timeline']
  text:          1,853 chars
  sections:      0 sections
  metadata:      {'source_type': 'text', 'document_type': 'unknown'}
  entities:      63 items
  relationships: 36 items
  timeline:      8 items

Entity Schema

entity_schema.json
{
  "text": "58-year-old",
  "type": "temporal",
  "start": 19,
  "end": 30,
  "confidence": 0.88,
  "properties": {
    "backend": "transformer",
    "label": "Age"
  },
  "is_negated": false,
  "is_uncertain": false,
  "is_historical": false,
  "is_family": false,
  "ontology_codes": []
}
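
Because serialized entities are plain dicts, downstream filtering needs no HoneyBee classes. For example, keeping only affirmed (non-negated, non-family) medication mentions from a saved result — the sample records below are illustrative:

filter_entities.py
```python
# Illustrative serialized entities, matching the schema above
serialized = [
    {"text": "Tamoxifen", "type": "medication", "confidence": 0.97,
     "is_negated": False, "is_family": False},
    {"text": "penicillin", "type": "medication", "confidence": 0.91,
     "is_negated": True, "is_family": False},  # e.g. "no allergy to penicillin"
    {"text": "Grade 2", "type": "measurement", "confidence": 0.98,
     "is_negated": False, "is_family": False},
]

affirmed_meds = [
    e for e in serialized
    if e["type"] == "medication"
    and not e["is_negated"] and not e["is_family"]
]
print([e["text"] for e in affirmed_meds])  # ['Tamoxifen']
```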

Timeline Event Schema

timeline_event_schema.json
{
  "date": "2023-12-05T00:00:00",
  "date_text": "12/05/2023",
  "sentence": "Patient was diagnosed with invasive ductal carcinoma...",
  "related_entities": [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
}

Multimodal Integration

integrate_embeddings() concatenates embeddings from different modalities into a single fused vector:

multimodal_integration.py
import numpy as np

clinical_emb = hb.generate_embeddings(clinical_note, modality="clinical",
                                       model_name="bioclinicalbert")
pathology_emb = np.random.randn(1, 1536)   # e.g., UNI2 output
radiology_emb = np.random.randn(1, 768)    # e.g., RAD-DINO output

integrated = hb.integrate_embeddings([clinical_emb, pathology_emb, radiology_emb])

print(f"Clinical:   {clinical_emb.shape}")
print(f"Pathology:  {pathology_emb.shape}")
print(f"Radiology:  {radiology_emb.shape}")
print(f"Integrated: {integrated.shape}")
Output
Clinical:   (1, 768)
Pathology:  (1, 1536)
Radiology:  (1, 768)
Integrated: (1, 3072)
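
The fused shape is consistent with a feature-axis concatenation (768 + 1536 + 768 = 3072); the plain numpy equivalent of that fusion is:

concat_equivalent.py
```python
import numpy as np

# Placeholder embeddings with the shapes from the output above
clinical = np.zeros((1, 768))
pathology = np.zeros((1, 1536))
radiology = np.zeros((1, 768))

fused = np.concatenate([clinical, pathology, radiology], axis=1)
print(fused.shape)  # (1, 3072)
```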

Configuration Reference

NER Backends

Backend      Install                       Notes
transformer  included                      Default; HuggingFace token-classification model
scispacy     pip install scispacy + model  Biomedical NER via spaCy
medspacy     pip install medspacy          Adds negation/uncertainty context detection

ner_config.py
hb = HoneyBee(config={"clinical": {"ner": {"backends": ["transformer", "medspacy"]}}})

Ontology Resolution

Backend    Requires
snowstorm  Free SNOMED CT server (no key needed)
umls       UMLS_API_KEY env var or config
bioportal  BIOPORTAL_API_KEY env var or config

ontology_config.py
hb = HoneyBee(config={"clinical": {"ontology": {"backends": ["snowstorm"]}}})

API Embeddings

Use any provider supported by litellm instead of local models:

api_embeddings.py
hb = HoneyBee(config={"clinical": {"embeddings": {
    "mode": "api",
    "model": "ollama/nomic-embed-text",
    "api_base": "http://localhost:11434",
}}})

Performance Considerations

When processing clinical datasets, consider the following:

  • Batch processing: Use process_clinical_batch() for large document collections
  • GPU acceleration: Embedding models benefit from device="cuda" when available
  • Ontology caching: Snowstorm results are cached with LRU (1,000 entries by default) to reduce repeated lookups
  • Rate limiting: The public Snowstorm server enforces rate limits; use a local instance for high-volume processing
  • Multiple NER backends: Combining backends (e.g., transformer + medspacy) increases entity coverage but adds latency
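
The caching pattern behind the third bullet is easy to reproduce for any custom lookup with functools.lru_cache — the lookup function below is a stand-in, not HoneyBee's actual Snowstorm client:

lru_cache_sketch.py
```python
from functools import lru_cache

calls = {"n": 0}  # count how many lookups actually execute

@lru_cache(maxsize=1000)  # mirror the 1,000-entry default mentioned above
def lookup_snomed(term: str):
    """Stand-in for a Snowstorm concept search; a real client would hit the API."""
    calls["n"] += 1
    return {"term": term, "code": "000000"}  # placeholder code

for term in ["carcinoma", "tamoxifen", "carcinoma", "carcinoma"]:
    lookup_snomed(term)

print(calls["n"], lookup_snomed.cache_info().hits)  # 2 real lookups, 2 cache hits
```

Repeated terms — very common in clinical corpora — never leave the process, which is what keeps high-volume ontology linking within the public server's rate limits.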