Motivation
Preliminary self-diagnosis is something most people attempt informally — searching symptoms online and trying to narrow down possible conditions. This project formalizes that process: given a natural language description of symptoms, can a CNN return the most probable medical condition?
Approach
Two preprocessing pipelines were compared on identical CNN architectures. The One-Hot pipeline tokenized and stemmed symptom descriptions into a 56×4210 binary word-stem matrix, with Witten-Bell smoothing used to generate synthetic samples for underrepresented conditions. The FastText pipeline trained unsupervised word embeddings on the corpus, using max/min pooling to produce fixed-length vectors. Training data was scraped from Mayo Clinic, UpToDate, Healthline, and NHS pages via custom DOM selectors.
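The two vectorization steps can be sketched as follows. This is a minimal, hypothetical reconstruction, not the project's code: the toy suffix-stripping stemmer stands in for a real stemmer, and `pool_embeddings` assumes pre-trained word vectors are supplied as a plain dict (the project trained FastText embeddings on its corpus).

```python
from typing import Dict, List, Tuple

def stem(word: str) -> str:
    # Naive suffix-stripping stemmer -- a stand-in for a real stemmer
    # (e.g. Porter), purely for illustration.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text: str) -> List[str]:
    return [stem(w) for w in text.lower().split()]

def one_hot_matrix(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    # Build the vocabulary of word stems, then a binary docs-by-stems matrix
    # (the project's matrix was 56 conditions x 4210 stems).
    vocab = sorted({s for d in docs for s in tokenize(d)})
    index = {s: i for i, s in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for s in tokenize(d):
            row[index[s]] = 1
        rows.append(row)
    return vocab, rows

def pool_embeddings(tokens: List[str], emb: Dict[str, List[float]]) -> List[float]:
    # FastText-style fixed-length vector: concatenate element-wise max and min
    # over the token embeddings. Assumes at least one token is in `emb`.
    vecs = [emb[t] for t in tokens if t in emb]
    dims = range(len(next(iter(emb.values()))))
    maxes = [max(v[d] for v in vecs) for d in dims]
    mins = [min(v[d] for v in vecs) for d in dims]
    return maxes + mins
```

Either path ends in a fixed-length numeric vector per description, which is what lets the same CNN architecture consume both representations.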
Results
The One-Hot pipeline significantly outperformed FastText:
- One-Hot: 93% accuracy and 90% recall across conditions, with perfect precision on migraines and tetanus
- FastText: 73% accuracy and 70% recall; the unsupervised embeddings lost semantic distinctions critical for differentiating similar symptom profiles
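For clarity on how the headline numbers are defined (the data below is illustrative only, not the project's predictions): accuracy is the fraction of correct predictions overall, while "recall across conditions" averages per-condition recall so rare conditions count equally.

```python
from collections import defaultdict
from typing import List

def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    # Fraction of descriptions assigned the correct condition.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_recall(y_true: List[str], y_pred: List[str]) -> float:
    # Per-condition recall (true positives / actual cases of the condition),
    # averaged over conditions with equal weight.
    tp, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        tp[t] += int(t == p)
    return sum(tp[c] / total[c] for c in total) / len(total)
```

Macro-averaging is the natural choice here because the scraped corpus is imbalanced across conditions, which is also why synthetic samples were generated for underrepresented ones.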
The result is counterintuitive — dense learned representations are usually preferred over sparse one-hot encodings. The small corpus meant FastText couldn’t learn meaningful embeddings, while one-hot preserved every lexical distinction the CNN needed.
Key Takeaway
Preprocessing choices can dominate model architecture choices when data is limited. The “modern” approach (learned embeddings) failed precisely because it needed more data than was available to learn useful representations. The simpler, higher-dimensional one-hot encoding preserved the information the model needed to discriminate between conditions.