Overview
The clinical data processing pipeline in HoneyBee handles a wide range of clinical data, including electronic health records (EHRs), clinical notes, pathology reports, and other textual medical documents. The pipeline extracts clinical text, processes it, and generates embeddings for downstream machine learning applications.

Key Features
- Support for multiple input formats (PDF, scanned images, EHR exports)
- OCR capabilities for digitizing scanned documents with medical terminology verification
- Integration with specialized biomedical language models (Bio-ClinicalBERT, PubMedBERT, GatorTron, Clinical-T5)
- Comprehensive clinical entity recognition and normalization
- Integration with medical ontologies (SNOMED-CT, RxNorm, LOINC, ICD-O-3)
- Temporal information extraction for patient timelines
- Cancer-specific entity extractors for oncology use cases
- Configurable processing pipelines and output options
Basic Usage
The ClinicalProcessor provides a unified interface for processing clinical documents:
```python
from honeybee.processors import ClinicalProcessor

# Initialize the clinical processor with default configuration
processor = ClinicalProcessor()

# Process a clinical document (PDF, image, or EHR export)
result = processor.process("path/to/clinical_document.pdf")

# Access extracted text
text = result["text"]

# Access extracted entities
entities = result["entities"]

# Access the temporal timeline
timeline = result["temporal_timeline"]
```
Custom Configuration
Configure the processor for specific use cases and document types:
```python
from honeybee.processors import ClinicalProcessor

# Custom configuration
config = {
    "document_processor": {
        "use_ocr": True,
        "use_ehr": True
    },
    "tokenization": {
        "model": "gatortron",  # Options: bioclinicalbert, pubmedbert, gatortron, clinicalt5
        "max_length": 512,
        "segment_strategy": "sentence",  # sentence, paragraph, fixed
        "long_document_strategy": "sliding_window"  # sliding_window, hierarchical, important_segments, summarize
    },
    "entity_recognition": {
        "use_rules": True,
        "use_spacy": True,
        "use_deep_learning": False,
        "cancer_specific_extraction": True,
        "temporal_extraction": True,
        "ontologies": ["snomed_ct", "rxnorm", "loinc"]
    },
    "processing_pipeline": ["document", "tokenization", "entity_recognition"],
    "output": {
        "include_raw_text": True,
        "include_tokens": True,
        "include_entities": True,
        "include_document_structure": True,
        "include_temporal_timeline": True
    }
}

# Initialize with custom configuration
processor = ClinicalProcessor(config=config)

# Process a document and save the output
result = processor.process("document.pdf", save_output=True)
```
Processing Raw Text
Process clinical text directly without file input:
```python
from honeybee.processors import ClinicalProcessor

processor = ClinicalProcessor()

# Process raw clinical text
clinical_text = """
Patient presents with stage III non-small cell lung cancer.
EGFR mutation positive. Started on erlotinib 150mg daily.
Partial response observed after 3 months of treatment.
"""

result = processor.process_text(
    text=clinical_text,
    document_type="progress_note"
)

# Extract entities
for entity in result["entities"]:
    print(f"Entity: {entity['text']}")
    print(f"Type: {entity['type']}")
    print(f"Properties: {entity['properties']}")
    print("---")
```
Batch Processing
Process multiple clinical documents efficiently:
```python
from honeybee.processors import ClinicalProcessor

processor = ClinicalProcessor()

# Process all PDF files in a directory
results = processor.process_batch(
    input_dir="path/to/clinical_documents",
    file_pattern="*.pdf",
    save_output=True,
    output_dir="path/to/output"
)

# Analyze batch results
total_docs = len(results)
total_entities = sum(len(r.get("entities", [])) for r in results)
print(f"Processed {total_docs} documents")
print(f"Extracted {total_entities} total entities")
```
Advanced Tokenization
Handle long documents with various tokenization strategies:
```python
from honeybee.processors import ClinicalProcessor

# Configure for long-document processing
config = {
    "tokenization": {
        "model": "gatortron",
        "max_length": 512,
        "segment_strategy": "paragraph",
        "long_document_strategy": "hierarchical",  # Preserves document structure
        "stride": 128
    }
}

processor = ClinicalProcessor(config=config)

# Process a long operative report
result = processor.process("long_operative_report.pdf")

# Access tokenization details
tokenization = result["tokenization"]
print(f"Number of tokens: {len(tokenization['tokens'])}")
print(f"Number of segments: {tokenization['num_segments']}")

if tokenization.get("hierarchical"):
    print("Document sections:")
    for section in tokenization["sections"]:
        print(f"- {section['name']}: {len(section['segment_indices'])} segments")
```
Cancer-Specific Entity Extraction
Extract oncology-specific entities with specialized extractors:
```python
from honeybee.processors import ClinicalProcessor

# Enable cancer-specific extraction
config = {
    "entity_recognition": {
        "cancer_specific_extraction": True,
        "temporal_extraction": True,
        "ontologies": ["snomed_ct", "rxnorm"]
    }
}

processor = ClinicalProcessor(config=config)

pathology_text = """
Invasive ductal carcinoma, Grade 2, measuring 2.1 cm.
ER positive (90%), PR positive (75%), HER2 negative.
T2N0M0 stage IIA. Margins clear.
"""

result = processor.process_text(pathology_text, "pathology_report")

# Filter entities by type
tumors = [e for e in result["entities"] if e["type"] == "tumor"]
biomarkers = [e for e in result["entities"] if e["type"] == "biomarker"]
staging = [e for e in result["entities"] if e["type"] == "staging"]

print(f"Found {len(tumors)} tumor entities")
print(f"Found {len(biomarkers)} biomarker entities")
print(f"Found {len(staging)} staging entities")

# Examine biomarker details
for biomarker in biomarkers:
    props = biomarker["properties"]
    print(f"Biomarker: {props['name']}")
    print(f"Status: {props['status']}")
    if "percentage" in props:
        print(f"Percentage: {props['percentage']}%")
```
Temporal Timeline Construction
Extract and organize temporal information for patient timelines:
```python
from honeybee.processors import ClinicalProcessor

processor = ClinicalProcessor()

clinical_note = """
Patient diagnosed with breast cancer on 03/15/2023.
Started neoadjuvant chemotherapy in April 2023.
Surgery performed on 08/20/2023.
Currently on adjuvant hormone therapy as of September 2023.
"""

result = processor.process_text(clinical_note, "consultation_note")

# Access the temporal timeline
timeline = result["temporal_timeline"]
print("Patient Timeline:")
for event in timeline:
    print(f"Date: {event['temporal_text']}")
    if event["normalized_date"]:
        print(f"Normalized: {event['normalized_date']}")
    # Show related entities
    related_entities = [result["entities"][i] for i in event["related_entities"]]
    for entity in related_entities:
        print(f"  - {entity['type']}: {entity['text']}")
    print("---")
```
Entity Normalization and Ontology Linking
Normalize entities and link to standard medical ontologies:
```python
from honeybee.processors import ClinicalProcessor

# Configure with specific ontologies
config = {
    "entity_recognition": {
        "ontologies": ["snomed_ct", "rxnorm", "loinc"],
        "abbreviation_expansion": True,
        "term_disambiguation": True
    }
}

processor = ClinicalProcessor(config=config)

text = "Pt started on tamoxifen 20mg daily for breast ca."
result = processor.process_text(text)

# Examine normalized entities
for entity in result["entities"]:
    props = entity["properties"]
    print(f"Original text: {entity['text']}")
    print(f"Type: {entity['type']}")
    # Check for abbreviation expansion
    if "expanded" in props:
        print(f"Expanded: {props['expanded']}")
    # Check for ontology links
    if "ontology_links" in props:
        for link in props["ontology_links"]:
            print(f"Ontology: {link['ontology']}")
            print(f"Concept: {link['concept_name']} ({link['concept_id']})")
    print("---")
```
Supported File Formats
The clinical processor supports various input formats; a format-routing sketch follows the list below:
- Scanned document and image formats: PDF, PNG, JPG, JPEG, TIFF, BMP (processed via OCR)
- EHR formats: XML, JSON, CSV, XLSX (structured data processing)
- Document types: Operative reports, pathology reports, consultation notes, progress notes, discharge summaries
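To make the format handling concrete, here is a minimal routing sketch. The `route_input` helper and its two categories are hypothetical illustrations of the list above, not part of the HoneyBee API; `ClinicalProcessor.process` performs this kind of dispatch internally.

```python
from pathlib import Path

# Hypothetical helper mirroring the supported formats listed above;
# ClinicalProcessor.process handles this dispatch itself.
OCR_FORMATS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
EHR_FORMATS = {".xml", ".json", ".csv", ".xlsx"}

def route_input(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix in OCR_FORMATS:
        return "ocr"  # digitize via OCR, then run the text pipeline
    if suffix in EHR_FORMATS:
        return "ehr"  # parse the structured EHR export
    raise ValueError(f"Unsupported input format: {suffix}")

print(route_input("pathology_report.pdf"))  # -> ocr
print(route_input("patient_export.json"))   # -> ehr
```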
Supported Biomedical Models
Choose from the following specialized biomedical language models; an embedding sketch follows the list:
- bioclinicalbert: Bio-ClinicalBERT for clinical text
- pubmedbert: PubMedBERT for biomedical literature
- gatortron: GatorTron for clinical notes (default)
- clinicalt5: Clinical-T5 for text generation tasks
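For context, the sketch below embeds clinical text directly with one of these backbones via Hugging Face Transformers, using the Bio-ClinicalBERT checkpoint from the references. The mean-pooling step is a common convention for producing a single document vector, not necessarily what HoneyBee's own tokenization pipeline does internally.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Bio-ClinicalBERT checkpoint (see References)
model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "Patient presents with stage III non-small cell lung cancer."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into one document-level vector
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(embedding.shape)  # torch.Size([768])
```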
Performance Considerations
When processing large clinical datasets, consider the following:
- Use batch processing for large document collections
- Configure appropriate tokenization strategies for long documents
- Enable GPU acceleration when available for deep learning models
- Use sliding window or hierarchical tokenization for very long documents (see the sliding-window sketch after this list)
- Consider memory-efficient processing for large datasets
- Adjust entity recognition components based on performance requirements
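To make the sliding-window strategy concrete, here is a minimal, self-contained sketch of windowed segmentation over a long token sequence. It assumes, following the common Hugging Face convention, that `stride` is the overlap between consecutive windows; the `sliding_windows` helper is illustrative and not part of the HoneyBee API.

```python
def sliding_windows(token_ids, max_length=512, stride=128):
    """Split a long token sequence into overlapping windows.

    Assumes `stride` is the overlap between consecutive windows,
    so each window starts (max_length - stride) tokens after the
    previous one.
    """
    step = max_length - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
    return windows

# 1,300 pseudo-tokens -> four windows of <= 512 tokens, overlapping by 128
ids = list(range(1300))
for i, window in enumerate(sliding_windows(ids)):
    print(f"window {i}: tokens {window[0]}..{window[-1]} ({len(window)} tokens)")
```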
Output Structure
The processor returns a comprehensive results dictionary; a traversal example follows the structure below:
```python
{
    "file_path": "path/to/document.pdf",
    "file_name": "document.pdf",
    "processing_timestamp": "2023-05-30T10:30:00",
    "text": "Patient presents with...",
    "document_structure": {
        "sections": [...],
        "headers": [...]
    },
    "tokenization": {
        "tokens": [...],
        "token_ids": [...],
        "segment_mapping": [...],
        "num_segments": 5
    },
    "entities": [
        {
            "text": "breast cancer",
            "type": "condition",
            "start": 25,
            "end": 38,
            "properties": {
                "ontology_links": [...],
                "source": "rule-based"
            }
        }
    ],
    "entity_relationships": [
        {
            "source": 0,
            "target": 1,
            "type": "treats"
        }
    ],
    "temporal_timeline": [...]
}
```
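Entity relationships appear to reference entities by their position in the `entities` list, as the integer `source` and `target` fields suggest; under that assumption, they can be resolved back to text like this:

```python
# Resolve each relationship back to the entity text it connects,
# assuming source/target are indices into result["entities"]
entities = result["entities"]
for rel in result.get("entity_relationships", []):
    source = entities[rel["source"]]
    target = entities[rel["target"]]
    print(f"{source['text']} --[{rel['type']}]--> {target['text']}")
```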
References
- GatorTron: https://arxiv.org/abs/2301.04619
- Bio-ClinicalBERT: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT
- PubMedBERT: https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
- Clinical-T5: https://huggingface.co/healx/gpt-t5-clinical