Overview
The clinical data processing pipeline in HoneyBee handles a wide range of clinical data, including electronic health records (EHRs), clinical notes, pathology reports, and other textual medical documents. The pipeline extracts clinical text, processes it, and generates embeddings for downstream machine learning applications.

Key Features
- Support for multiple input formats (PDF, scanned images, EHR exports)
- OCR capabilities for digitizing scanned documents with medical terminology verification
- Integration with specialized biomedical language models (Bio-ClinicalBERT, PubMedBERT, GatorTron, Clinical-T5)
- Comprehensive clinical entity recognition and normalization
- Integration with medical ontologies (SNOMED-CT, RxNorm, LOINC, ICD-O-3)
- Temporal information extraction for patient timelines
- Cancer-specific entity extractors for oncology use cases
- Configurable processing pipelines and output options
Basic Usage
The ClinicalProcessor provides a unified interface for processing clinical documents:
```python
from honeybee.processors import ClinicalProcessor

# Initialize the clinical processor with default configuration
processor = ClinicalProcessor()

# Process a clinical document (PDF, image, or EHR export)
result = processor.process("path/to/clinical_document.pdf")

# Access extracted text
text = result["text"]

# Access extracted entities
entities = result["entities"]

# Access the temporal timeline
timeline = result["temporal_timeline"]
```
Custom Configuration
Configure the processor for specific use cases and document types:
```python
from honeybee.processors import ClinicalProcessor

# Custom configuration
config = {
    "document_processor": {
        "use_ocr": True,
        "use_ehr": True
    },
    "tokenization": {
        "model": "gatortron",  # Options: bioclinicalbert, pubmedbert, gatortron, clinicalt5
        "max_length": 512,
        "segment_strategy": "sentence",  # sentence, paragraph, fixed
        "long_document_strategy": "sliding_window"  # sliding_window, hierarchical, important_segments, summarize
    },
    "entity_recognition": {
        "use_rules": True,
        "use_spacy": True,
        "use_deep_learning": False,
        "cancer_specific_extraction": True,
        "temporal_extraction": True,
        "ontologies": ["snomed_ct", "rxnorm", "loinc"]
    },
    "processing_pipeline": ["document", "tokenization", "entity_recognition"],
    "output": {
        "include_raw_text": True,
        "include_tokens": True,
        "include_entities": True,
        "include_document_structure": True,
        "include_temporal_timeline": True
    }
}

# Initialize with custom configuration
processor = ClinicalProcessor(config=config)

# Process a document and save the output
result = processor.process("document.pdf", save_output=True)
```
Processing Raw Text
Process clinical text directly without file input:
```python
from honeybee.processors import ClinicalProcessor

processor = ClinicalProcessor()

# Process raw clinical text
clinical_text = """
Patient presents with stage III non-small cell lung cancer.
EGFR mutation positive. Started on erlotinib 150mg daily.
Partial response observed after 3 months of treatment.
"""

result = processor.process_text(
    text=clinical_text,
    document_type="progress_note"
)

# Extract entities
for entity in result["entities"]:
    print(f"Entity: {entity['text']}")
    print(f"Type: {entity['type']}")
    print(f"Properties: {entity['properties']}")
    print("---")
```
Batch Processing
Process multiple clinical documents efficiently:
```python
from honeybee.processors import ClinicalProcessor

processor = ClinicalProcessor()

# Process all PDF files in a directory
results = processor.process_batch(
    input_dir="path/to/clinical_documents",
    file_pattern="*.pdf",
    save_output=True,
    output_dir="path/to/output"
)

# Analyze batch results
total_docs = len(results)
total_entities = sum(len(r.get("entities", [])) for r in results)
print(f"Processed {total_docs} documents")
print(f"Extracted {total_entities} total entities")
```
Advanced Tokenization
Handle long documents with various tokenization strategies:
```python
from honeybee.processors import ClinicalProcessor

# Configure for long-document processing
config = {
    "tokenization": {
        "model": "gatortron",
        "max_length": 512,
        "segment_strategy": "paragraph",
        "long_document_strategy": "hierarchical",  # Preserves document structure
        "stride": 128
    }
}

processor = ClinicalProcessor(config=config)

# Process a long operative report
result = processor.process("long_operative_report.pdf")

# Access tokenization details
tokenization = result["tokenization"]
print(f"Number of tokens: {len(tokenization['tokens'])}")
print(f"Number of segments: {tokenization['num_segments']}")

if tokenization.get("hierarchical"):
    print("Document sections:")
    for section in tokenization["sections"]:
        print(f"- {section['name']}: {len(section['segment_indices'])} segments")
```
Cancer-Specific Entity Extraction
Extract oncology-specific entities with specialized extractors:
```python
from honeybee.processors import ClinicalProcessor

# Enable cancer-specific extraction
config = {
    "entity_recognition": {
        "cancer_specific_extraction": True,
        "temporal_extraction": True,
        "ontologies": ["snomed_ct", "rxnorm"]
    }
}

processor = ClinicalProcessor(config=config)

pathology_text = """
Invasive ductal carcinoma, Grade 2, measuring 2.1 cm.
ER positive (90%), PR positive (75%), HER2 negative.
T2N0M0 stage IIA. Margins clear.
"""

result = processor.process_text(pathology_text, "pathology_report")

# Filter entities by type
tumors = [e for e in result["entities"] if e["type"] == "tumor"]
biomarkers = [e for e in result["entities"] if e["type"] == "biomarker"]
staging = [e for e in result["entities"] if e["type"] == "staging"]

print(f"Found {len(tumors)} tumor entities")
print(f"Found {len(biomarkers)} biomarker entities")
print(f"Found {len(staging)} staging entities")

# Examine biomarker details
for biomarker in biomarkers:
    props = biomarker["properties"]
    print(f"Biomarker: {props['name']}")
    print(f"Status: {props['status']}")
    if "percentage" in props:
        print(f"Percentage: {props['percentage']}%")
```
Temporal Timeline Construction
Extract and organize temporal information for patient timelines:
```python
from honeybee.processors import ClinicalProcessor

processor = ClinicalProcessor()

clinical_note = """
Patient diagnosed with breast cancer on 03/15/2023.
Started neoadjuvant chemotherapy in April 2023.
Surgery performed on 08/20/2023.
Currently on adjuvant hormone therapy as of September 2023.
"""

result = processor.process_text(clinical_note, "consultation_note")

# Access the temporal timeline
timeline = result["temporal_timeline"]
print("Patient Timeline:")
for event in timeline:
    print(f"Date: {event['temporal_text']}")
    if event["normalized_date"]:
        print(f"Normalized: {event['normalized_date']}")
    # Show related entities
    related_entities = [result["entities"][i] for i in event["related_entities"]]
    for entity in related_entities:
        print(f"  - {entity['type']}: {entity['text']}")
    print("---")
```
Entity Normalization and Ontology Linking
Normalize entities and link to standard medical ontologies:
```python
from honeybee.processors import ClinicalProcessor

# Configure with specific ontologies
config = {
    "entity_recognition": {
        "ontologies": ["snomed_ct", "rxnorm", "loinc"],
        "abbreviation_expansion": True,
        "term_disambiguation": True
    }
}

processor = ClinicalProcessor(config=config)

text = "Pt started on tamoxifen 20mg daily for breast ca."
result = processor.process_text(text)

# Examine normalized entities
for entity in result["entities"]:
    props = entity["properties"]
    print(f"Original text: {entity['text']}")
    print(f"Type: {entity['type']}")
    # Check for abbreviation expansion
    if "expanded" in props:
        print(f"Expanded: {props['expanded']}")
    # Check for ontology links
    if "ontology_links" in props:
        for link in props["ontology_links"]:
            print(f"Ontology: {link['ontology']}")
            print(f"Concept: {link['concept_name']} ({link['concept_id']})")
    print("---")
```
Supported File Formats
The clinical processor supports various input formats; a format-routing sketch follows the list below:
- Scanned document and image formats: PDF, PNG, JPG, JPEG, TIFF, BMP (processed via OCR)
- EHR formats: XML, JSON, CSV, XLSX (structured data processing)
- Document types: Operative reports, pathology reports, consultation notes, progress notes, discharge summaries
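To make the format handling concrete, here is a minimal routing sketch. The `route_input` helper and its two categories are hypothetical illustrations of the list above, not part of the HoneyBee API; `ClinicalProcessor.process` performs this kind of dispatch internally.

```python
from pathlib import Path

# Hypothetical helper mirroring the supported formats listed above;
# ClinicalProcessor.process handles this dispatch itself.
OCR_FORMATS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
EHR_FORMATS = {".xml", ".json", ".csv", ".xlsx"}

def route_input(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix in OCR_FORMATS:
        return "ocr"  # digitize via OCR, then run the text pipeline
    if suffix in EHR_FORMATS:
        return "ehr"  # parse the structured EHR export
    raise ValueError(f"Unsupported input format: {suffix}")

print(route_input("pathology_report.pdf"))  # -> ocr
print(route_input("patient_export.json"))   # -> ehr
```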
Supported Biomedical Models
Choose from the following specialized biomedical language models; an embedding sketch follows the list:
- bioclinicalbert: Bio-ClinicalBERT for clinical text
- pubmedbert: PubMedBERT for biomedical literature
- gatortron: GatorTron for clinical notes (default)
- clinicalt5: Clinical-T5 for text generation tasks
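For context, the sketch below embeds clinical text directly with one of these backbones via Hugging Face Transformers, using the Bio-ClinicalBERT checkpoint from the references. The mean-pooling step is a common convention for producing a single document vector, not necessarily what HoneyBee's own tokenization pipeline does internally.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Bio-ClinicalBERT checkpoint (see References)
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "Patient presents with stage III non-small cell lung cancer."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into one document-level vector
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(embedding.shape)  # torch.Size([768])
```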
Performance Considerations
When processing large clinical datasets, consider the following:
- Use batch processing for large document collections
- Configure appropriate tokenization strategies for long documents
- Enable GPU acceleration when available for deep learning models
- Use sliding window or hierarchical tokenization for very long documents (see the sliding-window sketch after this list)
- Consider memory-efficient processing for large datasets
- Adjust entity recognition components based on performance requirements
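To make the sliding-window strategy concrete, here is a minimal, self-contained sketch of windowed segmentation over a long token sequence. It assumes, following the common Hugging Face convention, that `stride` is the overlap between consecutive windows; the `sliding_windows` helper is illustrative and not part of the HoneyBee API.

```python
def sliding_windows(token_ids, max_length=512, stride=128):
    """Split a long token sequence into overlapping windows.

    Assumes `stride` is the overlap between consecutive windows,
    so each window starts (max_length - stride) tokens after the
    previous one.
    """
    step = max_length - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
    return windows

# 1,300 pseudo-tokens -> four windows of <= 512 tokens, overlapping by 128
ids = list(range(1300))
for i, window in enumerate(sliding_windows(ids)):
    print(f"window {i}: tokens {window[0]}..{window[-1]} ({len(window)} tokens)")
```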
Output Structure
The processor returns a comprehensive results dictionary; a traversal example follows the structure below:
```python
{
    "file_path": "path/to/document.pdf",
    "file_name": "document.pdf",
    "processing_timestamp": "2023-05-30T10:30:00",
    "text": "Patient presents with...",
    "document_structure": {
        "sections": [...],
        "headers": [...]
    },
    "tokenization": {
        "tokens": [...],
        "token_ids": [...],
        "segment_mapping": [...],
        "num_segments": 5
    },
    "entities": [
        {
            "text": "breast cancer",
            "type": "condition",
            "start": 25,
            "end": 38,
            "properties": {
                "ontology_links": [...],
                "source": "rule-based"
            }
        }
    ],
    "entity_relationships": [
        {
            "source": 0,
            "target": 1,
            "type": "treats"
        }
    ],
    "temporal_timeline": [...]
}
```
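Entity relationships appear to reference entities by their position in the `entities` list, as the integer `source` and `target` fields suggest; under that assumption, they can be resolved back to text like this:

```python
# Resolve each relationship back to the entity text it connects,
# assuming source/target are indices into result["entities"]
entities = result["entities"]
for rel in result.get("entity_relationships", []):
    source = entities[rel["source"]]
    target = entities[rel["target"]]
    print(f"{source['text']} --[{rel['type']}]--> {target['text']}")
```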
References
- GatorTron: https://arxiv.org/abs/2301.04619
- Bio-ClinicalBERT: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT
- PubMedBERT: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
- Clinical-T5: https://huggingface.co/healx/gpt-t5-clinical