🎯 Project Overview
In the Emergency Department (ED), accurately predicting resource utilization (such as IV fluid administration) is critical for operational efficiency. Traditional models often ignore the information contained in unstructured patient narratives (Chief Complaints).
This project bridges that gap with a multimodal machine learning pipeline that integrates structured clinical variables with NLP-derived text features.
🛠 Methodology
Data Source
- Analyzed 13,115 patient records from the National Hospital Ambulatory Medical Care Survey (NHAMCS-ED).
- Input: Mixed data types including demographics (structured) and triage notes (unstructured).
The “Early Fusion” Strategy
I implemented an Early Fusion approach to combine distinct data modalities:
- Structured Pipeline: Processed clinical variables (vitals, age, history).
- NLP Pipeline: Experimented with three techniques to vectorize patient text:
  - Baseline: CountVectorizer (Bag-of-Words).
  - Static Embeddings: Word2Vec.
  - Transformer: Pre-trained GPT-2 embeddings.
- Modeling: Concatenated features were fed into Logistic Regression and Gradient Boosting Classifiers (GBC).
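The fusion step above can be sketched in a few lines. This is a minimal, library-free illustration of the idea, not the project's actual code: the toy records, the field names (`age`, `heart_rate`, `complaint`), and the helper functions are all hypothetical, and the hand-written counting stands in for CountVectorizer. Each patient ends up as one flat feature vector, with structured values first and word counts after, which a classifier such as Logistic Regression or GBC could then consume.

```python
import re
from collections import Counter

# Hypothetical toy records; field names and values are illustrative only,
# not taken from NHAMCS-ED.
records = [
    {"age": 67, "heart_rate": 112, "complaint": "chest pain, shortness of breath"},
    {"age": 24, "heart_rate": 88, "complaint": "nausea vomiting x2 days"},
    {"age": 45, "heart_rate": 95, "complaint": "abdominal pain nausea"},
]


def tokenize(text):
    # Lowercase and keep runs of two or more word characters,
    # similar in spirit to a default bag-of-words tokenizer.
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(texts):
    # Sorting gives every token a stable column index.
    tokens = {tok for text in texts for tok in tokenize(text)}
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def bag_of_words(text, vocab):
    # One count per vocabulary token, in column order.
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]


def fuse(record, vocab, structured_fields=("age", "heart_rate")):
    # Early fusion: concatenate structured values and text counts
    # into a single feature vector before any model sees them.
    structured = [float(record[f]) for f in structured_fields]
    text = [float(c) for c in bag_of_words(record["complaint"], vocab)]
    return structured + text


vocab = build_vocabulary(r["complaint"] for r in records)
X = [fuse(r, vocab) for r in records]
print(len(X), len(X[0]))  # 3 12
```

Because the modalities are merged at the feature level, swapping the text representation (counts, Word2Vec, GPT-2 embeddings) only changes the second half of each vector; the downstream model stays the same.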
💡 Key Technical Insight
“Simpler can be better.” Despite the popular trend of reaching for Large Language Models (LLMs) on every text task, my comparative analysis showed the opposite here:
CountVectorizer outperformed GPT-2 for this specific use case (AUC 0.786 vs 0.772).
Why? Emergency department narratives are typically short, telegraphic, and keyword-driven (e.g., “chest pain”, “nausea”). They lack the longer-range semantic structure that Transformers like GPT-2 are designed to capture. A frequency-based method such as CountVectorizer picks up these direct keyword signals without the noise that dense contextual embeddings can introduce on such short inputs.
🚀 Results & Impact
- Performance: The integrated GBC model (Structured + NLP) achieved the highest AUC of 0.786, significantly outperforming models using structured data alone.
- Clinical Value: Demonstrated that integrating free-text narratives provides a robust framework for improving clinical decision-making and resource allocation in the ED.
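For readers less familiar with the metric, the AUC values reported above can be read as the probability that a randomly chosen positive case (for example, a patient who received IV fluids) gets a higher predicted score than a randomly chosen negative case. The sketch below computes that directly from pairwise comparisons. The function name `roc_auc` and the toy labels and scores are made up for illustration and are not project results.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 positives, 3 negatives, one positive ranked below a negative.
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
auc = roc_auc(y_true, scores)
print(round(auc, 3))  # 0.889
```

The pairwise loop is quadratic and only meant to make the definition concrete; on a dataset of this project's size a library routine would be the practical choice.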