Overview
The molecular processing pipeline in HoneyBee handles five types of molecular data: DNA methylation, gene expression, protein expression, DNA mutation, and miRNA expression. The pipeline preprocesses and integrates these molecular profiles, then generates embeddings from them for downstream machine learning applications.

Key Features
- Support for multiple molecular data types
- Comprehensive preprocessing pipelines
- Feature selection and dimensionality reduction
- Multi-modal integration of molecular data
- Integration with clinical features
- Embedding generation using specialized models
Data Acquisition and Preprocessing
HoneyBee uses publicly available datasets from repositories such as TCGA and UCSC Xena. Each modality is loaded from file and preprocessed with modality-specific options:
from honeybee.processors import MolecularProcessor
# Initialize the molecular processor
processor = MolecularProcessor()
# Load data from file
gene_expression = processor.load_data("path/to/gene_expression.csv")
methylation = processor.load_data("path/to/methylation.csv")
mutations = processor.load_data("path/to/mutations.csv")
# Basic preprocessing
processed_expression = processor.preprocess(
    gene_expression,
    modality="gene_expression",
    normalize=True,
    log_transform=True
)
processed_methylation = processor.preprocess(
    methylation,
    modality="methylation",
    normalize=True
)
processed_mutations = processor.preprocess(
    mutations,
    modality="mutation",
    binarize=True
)
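To make the three preprocessing options concrete, here is a minimal NumPy sketch of what `log_transform`, `normalize`, and `binarize` could correspond to. It is not HoneyBee's implementation: the function names, the choice of log2(x + 1), and per-feature z-scoring are assumptions made for illustration.

```python
import numpy as np

def log_transform(x):
    """log2(x + 1), a common transform for non-negative expression values (assumed here)."""
    return np.log2(np.asarray(x, dtype=float) + 1.0)

def zscore(x):
    """Standardize each feature (column) to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by 0
    return (x - mean) / std

def binarize(x):
    """Map any non-zero mutation count to 1 (mutated) and zero to 0 (wild type)."""
    return (np.asarray(x) != 0).astype(int)

# Toy data with hypothetical values: 3 samples x 2 features
expression = np.array([[0.0, 3.0], [1.0, 3.0], [7.0, 3.0]])
mutation_counts = np.array([[0, 2], [1, 0], [0, 0]])

logged = log_transform(expression)
normalized = zscore(logged)
mutated = binarize(mutation_counts)
```

Applying the log transform before standardization keeps a few highly expressed genes from dominating the per-feature variance.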
Feature Processing and Selection
Clean and optimize feature sets for analysis:
from honeybee.processors import MolecularProcessor
processor = MolecularProcessor()
gene_expression = processor.load_data("path/to/gene_expression.csv")
# Remove constant and duplicate features
cleaned_data = processor.remove_constant_features(gene_expression)
cleaned_data = processor.remove_duplicate_features(cleaned_data)
# Remove low-expression genes
filtered_data = processor.filter_low_expression(
    cleaned_data,
    threshold=7  # Equivalent to 127 FPKM on a log2(FPKM + 1) scale: log2(127 + 1) = 7
)
# Handle missing data
imputed_data = processor.impute_missing(filtered_data, method="mean")
# Remove collinear features
final_data = processor.remove_collinear_features(
    imputed_data,
    threshold=0.25  # Variance threshold
)
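The cleaning steps above can be sketched end to end in NumPy. This is an illustration under stated assumptions, not the library's code: `clean_features` is a hypothetical helper, low expression is judged by the per-gene mean on a log2(FPKM + 1) scale, and collinearity is handled with an absolute Pearson correlation cutoff (`max_abs_corr`), which is one common reading of the term and does not reuse the 0.25 value shown earlier.

```python
import numpy as np

def clean_features(x, names, min_mean=7.0, max_abs_corr=0.95):
    """Return (matrix, names) after the five cleaning steps. Hypothetical helper."""
    x = np.asarray(x, dtype=float)
    names = np.asarray(names)

    # 1. Drop constant features (NaNs ignored when measuring spread)
    keep = np.nanstd(x, axis=0) > 0
    x, names = x[:, keep], names[keep]

    # 2. Drop exact duplicate columns, keeping the first occurrence
    seen = set()
    keep = np.ones(x.shape[1], dtype=bool)
    for j in range(x.shape[1]):
        key = x[:, j].tobytes()
        keep[j] = key not in seen
        seen.add(key)
    x, names = x[:, keep], names[keep]

    # 3. Drop low-expression genes; 7 on a log2(FPKM + 1) scale is 2**7 - 1 = 127 FPKM
    keep = np.nanmean(x, axis=0) >= min_mean
    x, names = x[:, keep], names[keep]
    if x.shape[1] == 0:
        return x, []

    # 4. Mean-impute missing values per feature
    col_means = np.nanmean(x, axis=0)
    x = np.where(np.isnan(x), col_means, x)

    # 5. Greedily keep a feature only if it is not too correlated with any kept one
    corr = np.atleast_2d(np.corrcoef(x, rowvar=False))
    kept = []
    for j in range(x.shape[1]):
        if all(abs(corr[j, k]) <= max_abs_corr for k in kept):
            kept.append(j)
    return x[:, kept], names[kept].tolist()

# Toy matrix with hypothetical values: 4 samples x 6 genes
nan = np.nan
data = np.array([
    [8.0, 5.0, 8.0, 1.0, 9.0, 16.0],
    [9.0, 5.0, 9.0, 2.0, nan, 18.0],
    [10.0, 5.0, 10.0, 1.0, 7.0, 20.0],
    [11.0, 5.0, 11.0, 2.0, 8.0, 22.0],
])
cleaned, kept_names = clean_features(data, ["A", "B", "C", "D", "E", "F"])
```

In this toy input, B is constant, C duplicates A, D falls below the expression threshold, and F is a scaled copy of A, so only A and E survive, with the missing value in E replaced by its column mean.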
Feature Selection and Integration
Unify and integrate features across modalities:
from honeybee.processors import MolecularProcessor
processor = MolecularProcessor()
# Load preprocessed data for each modality
gene_expr = processor.load_data("path/to/processed_gene_expression.csv")
methylation = processor.load_data("path/to/processed_methylation.csv")
protein_expr = processor.load_data("path/to/processed_protein_expression.csv")
mutations = processor.load_data("path/to/processed_mutations.csv")
mirna_expr = processor.load_data("path/to/processed_mirna_expression.csv")
# Unify features within each modality (for pan-cancer analysis)
unified_gene_expr = processor.unify_features(gene_expr, modality="gene_expression")
unified_methylation = processor.unify_features(methylation, modality="methylation")
unified_protein_expr = processor.unify_features(protein_expr, modality="protein_expression")
unified_mutations = processor.unify_features(mutations, modality="mutation")
unified_mirna_expr = processor.unify_features(mirna_expr, modality="mirna_expression")
# Integrate modalities
integrated_data = processor.integrate_modalities([
    unified_gene_expr,
    unified_methylation,
    unified_protein_expr,
    unified_mutations,
    unified_mirna_expr
])
# Add clinical features
clinical_data = processor.load_data("path/to/clinical_data.csv")
final_dataset = processor.add_clinical_features(
    integrated_data,
    clinical_data,
    features=["age", "gender", "race", "cancer_stage"]
)
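One plausible way to integrate modalities is to keep only the samples present in every modality and concatenate their feature columns. The sketch below shows that idea with NumPy; the `integrate` function, the tuple layout, and the `modality:feature` naming scheme are illustrative assumptions rather than HoneyBee's actual behavior.

```python
import numpy as np

def integrate(modalities):
    """Align modalities on shared samples and concatenate their features.

    modalities: dict mapping a modality name to (sample_ids, feature_names, 2-D array).
    Returns (sample_ids, column_names, matrix). Hypothetical helper.
    """
    # Samples measured in every modality, in a stable sorted order
    common = sorted(set.intersection(*(set(s) for s, _, _ in modalities.values())))
    blocks, columns = [], []
    for name, (samples, features, values) in modalities.items():
        row_of = {sample: i for i, sample in enumerate(samples)}
        rows = [row_of[sample] for sample in common]
        blocks.append(np.asarray(values, dtype=float)[rows])
        # Prefix feature names so identical gene symbols stay distinguishable
        columns.extend(f"{name}:{feature}" for feature in features)
    return common, columns, np.hstack(blocks)

# Toy inputs with hypothetical sample IDs and values
modalities = {
    "gene_expression": (["P1", "P2", "P3"], ["TP53", "EGFR"],
                        [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    "mutation": (["P3", "P1"], ["KRAS"], [[1], [0]]),
}
samples, columns, matrix = integrate(modalities)
```

Here P2 has no mutation data and is dropped, and the mutation rows are reordered to match the expression rows before concatenation.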
Embedding Generation
Generate embeddings from molecular data using specialized models:
from honeybee.processors import MolecularProcessor
# Initialize processor with specific model
processor = MolecularProcessor(model="senmo") # Using the SeNMo model
# Load and preprocess integrated data
integrated_data = processor.load_data("path/to/integrated_molecular_data.csv")
preprocessed_data = processor.preprocess(integrated_data)
# Generate embeddings
embeddings = processor.generate_embeddings(preprocessed_data)
# Shape: (num_samples, embedding_dim); embedding_dim depends on the model
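To illustrate the `(num_samples, embedding_dim)` output contract without depending on a trained model, the sketch below uses PCA as a stand-in encoder. PCA is swapped in here purely for illustration and is not the SeNMo model; `pca_embed` and the random input are assumptions.

```python
import numpy as np

def pca_embed(x, embedding_dim):
    """Project samples onto the top principal components (stand-in for a learned encoder)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:embedding_dim].T

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 50))   # 10 samples, 50 molecular features (synthetic)
embeddings = pca_embed(features, embedding_dim=4)
print(embeddings.shape)
```

Whatever model produces the embeddings, downstream code can rely on one row per sample and a fixed number of columns.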
Complete Example
Full pipeline from data loading to embedding generation:
from honeybee.processors import MolecularProcessor
from honeybee import HoneyBee
# Initialize processor with specific model
processor = MolecularProcessor(model="senmo")
# Load data for multiple modalities
gene_expr = processor.load_data("path/to/gene_expression.csv")
methylation = processor.load_data("path/to/methylation.csv")
protein_expr = processor.load_data("path/to/protein_expression.csv")
mutations = processor.load_data("path/to/mutations.csv")
mirna_expr = processor.load_data("path/to/mirna_expression.csv")
# Preprocess each modality
processed_gene_expr = processor.preprocess(gene_expr, modality="gene_expression")
processed_methylation = processor.preprocess(methylation, modality="methylation")
processed_protein_expr = processor.preprocess(protein_expr, modality="protein_expression")
processed_mutations = processor.preprocess(mutations, modality="mutation")
processed_mirna_expr = processor.preprocess(mirna_expr, modality="mirna_expression")
# Integrate modalities
integrated_data = processor.integrate_modalities([
    processed_gene_expr,
    processed_methylation,
    processed_protein_expr,
    processed_mutations,
    processed_mirna_expr
])
# Add clinical features
clinical_data = processor.load_data("path/to/clinical_data.csv")
final_dataset = processor.add_clinical_features(
    integrated_data,
    clinical_data,
    features=["age", "gender", "race", "cancer_stage"]
)
# Generate molecular embeddings
molecular_embeddings = processor.generate_embeddings(final_dataset)
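# For multimodal integration with other data types.
# Note: clinical_embeddings and pathology_embeddings are not defined in this
# example; they must be generated separately before this step.
hb = HoneyBee()
multimodal_embeddings = hb.integrate_embeddings([
    molecular_embeddings,
    clinical_embeddings,   # Generated separately
    pathology_embeddings   # Generated separately
])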
# Use for downstream tasks
survival_prediction = hb.predict_survival(multimodal_embeddings)
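A simple baseline for combining per-modality embeddings is late fusion by concatenation. The sketch below assumes every embedding matrix already has the same patients in the same row order; `fuse_embeddings` and the L2 normalization step are illustrative choices, not a description of what `integrate_embeddings` does internally.

```python
import numpy as np

def fuse_embeddings(embeddings):
    """L2-normalize each modality's rows, then concatenate along the feature axis."""
    arrays = [np.asarray(e, dtype=float) for e in embeddings]
    if len({a.shape[0] for a in arrays}) != 1:
        raise ValueError("all embedding matrices must have the same number of rows")
    normalized = []
    for emb in arrays:
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # leave all-zero rows untouched
        normalized.append(emb / norms)
    return np.concatenate(normalized, axis=1)

# Toy embeddings for 2 patients (hypothetical values)
molecular = [[3.0, 4.0], [0.0, 0.0]]
clinical = [[2.0], [5.0]]
pathology = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
fused = fuse_embeddings([molecular, clinical, pathology])
```

Normalizing each block first keeps a modality with large embedding magnitudes from dominating distance-based downstream models.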
Advanced Usage: Integration with Pathway Knowledge
Incorporate biological pathway information into molecular analysis:
from honeybee.processors import MolecularProcessor
processor = MolecularProcessor()
gene_expr = processor.load_data("path/to/gene_expression.csv")
# Load pathway information
pathways = processor.load_pathways("path/to/pathway_database.gmt")
# Group genes by pathway
pathway_scores = processor.calculate_pathway_scores(
    gene_expr,
    pathways,
    method="ssgsea"  # Single-sample Gene Set Enrichment Analysis
)
# Generate pathway-based embeddings
pathway_embeddings = processor.generate_pathway_embeddings(pathway_scores)
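For intuition about pathway-level scoring, the sketch below parses GMT-formatted text (one gene set per line: name, description, then member genes, all tab-separated) and scores each pathway as the mean z-score of its member genes. The mean z-score is a deliberately simpler stand-in for ssGSEA, which uses a rank-based running-sum statistic; `parse_gmt`, `mean_zscore_scores`, and the toy gene sets are hypothetical.

```python
import io
import numpy as np

def parse_gmt(handle):
    """Read GMT lines into {pathway_name: [gene, ...]}, skipping malformed lines."""
    pathways = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        pathways[fields[0]] = [gene for gene in fields[2:] if gene]
    return pathways

def mean_zscore_scores(expr, genes, pathways):
    """Score each pathway per sample as the mean z-score of its measured genes."""
    expr = np.asarray(expr, dtype=float)
    std = expr.std(axis=0)
    std[std == 0] = 1.0  # constant genes contribute a z-score of 0
    z = (expr - expr.mean(axis=0)) / std
    column_of = {gene: j for j, gene in enumerate(genes)}
    names, scores = [], []
    for name, members in pathways.items():
        idx = [column_of[g] for g in members if g in column_of]
        if not idx:
            continue  # no member gene was measured
        names.append(name)
        scores.append(z[:, idx].mean(axis=1))
    matrix = np.column_stack(scores) if scores else np.empty((expr.shape[0], 0))
    return names, matrix

# Toy gene sets and expression values (hypothetical)
gmt_text = "PATHWAY_A\tdesc\tG1\tG2\nPATHWAY_B\tdesc\tG3\tG9\n"
pathways = parse_gmt(io.StringIO(gmt_text))
expr = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0]]
names, scores = mean_zscore_scores(expr, ["G1", "G2", "G3"], pathways)
```

The result has one row per sample and one column per pathway, which is the same shape a pathway-score matrix would have regardless of the scoring method.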
Performance Considerations
When processing large molecular datasets, consider the following:
- Use sparse matrix representations for mutation and methylation data
- Implement batch processing for large datasets
- Consider dimensionality reduction techniques before integration
- Optimize memory usage when working with multiple large datasets
- Parallelize preprocessing operations where possible
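Two of these points, batch processing and sparse storage, can be sketched with plain NumPy. The helpers below are illustrative assumptions, not part of HoneyBee: `streaming_mean_var` accumulates per-feature statistics one batch at a time so the full matrix never needs to be normalized in a single pass, and the coordinate arrays show how a mostly-zero mutation matrix can be stored as the positions of its non-zero entries.

```python
import numpy as np

def iter_batches(matrix, batch_size):
    """Yield consecutive row blocks of at most batch_size rows."""
    for start in range(0, matrix.shape[0], batch_size):
        yield matrix[start:start + batch_size]

def streaming_mean_var(batches):
    """Per-feature mean and population variance accumulated batch by batch.

    Uses running sums of x and x**2, which is simple but can lose precision
    when the mean is large relative to the spread.
    """
    count, total, total_sq = 0, 0.0, 0.0
    for batch in batches:
        batch = np.asarray(batch, dtype=float)
        count += batch.shape[0]
        total = total + batch.sum(axis=0)
        total_sq = total_sq + (batch ** 2).sum(axis=0)
    mean = total / count
    var = total_sq / count - mean ** 2
    return mean, var

rng = np.random.default_rng(0)
data = rng.normal(size=(103, 5))  # synthetic stand-in for a large matrix
mean, var = streaming_mean_var(iter_batches(data, batch_size=10))

# Coordinate (COO-style) storage for a binary mutation matrix
dense = np.array([[0, 1, 0], [0, 0, 0], [1, 0, 0]])
rows, cols = np.nonzero(dense)            # only the mutated positions are kept
rebuilt = np.zeros(dense.shape, dtype=int)
rebuilt[rows, cols] = 1
```

In practice the batches would come from a file reader rather than an in-memory array, and a dedicated sparse matrix library would replace the hand-rolled coordinate arrays.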