Paper · Dalhousie University Course Project · 2022

Classification of Ailments Given Description of Symptoms

CNN-based medical condition prediction from natural language symptom descriptions

Overview

Develops a CNN for medical self-diagnosis that returns the most probable condition from natural language symptom descriptions. Compares One-Hot Encoding (a 56×4210 binary word-stem matrix with Witten-Bell synthetic augmentation) against unsupervised FastText embeddings on data scraped from Mayo Clinic, UpToDate, Healthline, and NHS. The One-Hot CNN achieved 93% accuracy and 90% recall, with perfect precision on migraines and tetanus; FastText reached only 73% accuracy, suggesting that sparse representations can outperform learned embeddings on small medical corpora.

Motivation

Preliminary self-diagnosis is something most people attempt informally — searching symptoms online and trying to narrow down possible conditions. This project formalizes that process: given a natural language description of symptoms, can a CNN return the most probable medical condition?

Approach

Two preprocessing pipelines were compared on identical CNN architectures. The One-Hot pipeline tokenized and stemmed symptom descriptions into a 56×4210 binary word-stem matrix, with Witten-Bell smoothing used to generate synthetic samples for underrepresented conditions. The FastText pipeline trained unsupervised word embeddings on the corpus, using max/min pooling to produce fixed-length document vectors. Training data was scraped from Mayo Clinic, UpToDate, Healthline, and NHS via custom DOM selectors.
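The one-hot side of the pipeline can be sketched as follows. This is a minimal illustration, not the project's code: the toy symptom descriptions are invented, and a simplified suffix-stripping `stem` function stands in for a full stemmer such as NLTK's `PorterStemmer`.

```python
import re

def stem(word):
    # Simplified suffix stripping; a stand-in for a full Porter stemmer
    # (the real pipeline would use e.g. nltk.stem.PorterStemmer).
    for suffix in ("ing", "ness", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    # Lowercase, split on non-letters, and stem each token.
    return [stem(t) for t in re.findall(r"[a-z]+", text.lower())]

def one_hot_matrix(descriptions):
    """Build a binary description x word-stem matrix (one row per description)."""
    vocab = sorted({s for d in descriptions for s in tokenize(d)})
    index = {s: i for i, s in enumerate(vocab)}
    matrix = []
    for d in descriptions:
        row = [0] * len(vocab)
        for s in tokenize(d):
            row[index[s]] = 1  # presence/absence only, no counts
        matrix.append(row)
    return matrix, vocab

# Hypothetical symptom descriptions standing in for the scraped corpus.
docs = [
    "throbbing headache with nausea and light sensitivity",
    "muscle stiffness and jaw cramping after a wound",
]
matrix, vocab = one_hot_matrix(docs)
```

On the real corpus this yields the 56×4210 matrix described above: 56 rows of scraped condition descriptions over 4210 distinct word stems.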

Results

The One-Hot pipeline significantly outperformed FastText:

  • One-Hot accuracy: 93% with 90% recall across conditions
  • Perfect precision on migraines and tetanus
  • FastText accuracy: 73% with 70% recall — the unsupervised embeddings lost semantic distinctions critical for differentiating similar symptom profiles

The result is counterintuitive — dense learned representations are usually preferred over sparse one-hot encodings. The small corpus meant FastText couldn’t learn meaningful embeddings, while one-hot preserved every lexical distinction the CNN needed.

Key Takeaway

Preprocessing choices can dominate model architecture choices when data is limited. The “modern” approach (learned embeddings) failed precisely because it needed more data than was available to learn useful representations. The simpler, higher-dimensional one-hot encoding preserved the information the model needed to discriminate between conditions.

Features

Symptom-to-Diagnosis CNN

Predicts medical conditions from natural language symptom descriptions with 93% accuracy.

One-Hot Encoding

Binary 56×4210 word-stem matrix, with Witten-Bell smoothing used to generate synthetic samples for augmentation.
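One plausible reading of the Witten-Bell augmentation step is a smoothed bigram language model used to sample synthetic symptom-like token sequences for sparse classes. The sketch below is an assumption about the mechanism, not the project's code, and uses the standard interpolated Witten-Bell estimate: P(w|h) = (c(h,w) + T(h)·P_uni(w)) / (c(h) + T(h)), where T(h) is the number of distinct words seen after history h.

```python
import random
from collections import Counter, defaultdict

def witten_bell_bigram(sentences):
    """Interpolated Witten-Bell smoothed bigram probability P(w | h)."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = defaultdict(Counter)
    for s in sentences:
        for h, w in zip(s, s[1:]):
            bigrams[h][w] += 1
    total = sum(unigrams.values())

    def prob(w, h):
        c_h = sum(bigrams[h].values())   # count of history h
        t_h = len(bigrams[h])            # distinct continuations of h
        p_uni = unigrams[w] / total
        if c_h == 0:
            return p_uni                 # unseen history: back off to unigram
        return (bigrams[h][w] + t_h * p_uni) / (c_h + t_h)

    return prob, unigrams

def sample_sentence(prob, unigrams, start, length, rng):
    """Sample a synthetic token sequence from the smoothed model."""
    words = list(unigrams)
    out = [start]
    for _ in range(length - 1):
        weights = [prob(w, out[-1]) for w in words]
        out.append(rng.choices(words, weights=weights)[0])
    return out

# Toy tokenized descriptions standing in for an underrepresented class.
corpus = [
    ["severe", "headache", "with", "nausea"],
    ["throbbing", "headache", "and", "light", "sensitivity"],
]
prob, unigrams = witten_bell_bigram(corpus)
rng = random.Random(0)
synthetic = sample_sentence(prob, unigrams, "headache", 5, rng)
```

Because the interpolated form reserves mass T(h)/(c(h)+T(h)) for the unigram back-off, the sampler can produce word sequences never seen verbatim in the scraped descriptions.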

FastText Embeddings

Unsupervised word embeddings trained on scraped medical corpus for pipeline comparison.
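The max/min pooling step that turns a variable-length sequence of FastText word vectors into one fixed-length input can be sketched as below. The 3-dimensional vectors are toy values standing in for trained embeddings (the real ones would come from e.g. gensim's `FastText`).

```python
def max_min_pool(word_vectors):
    """Collapse a variable-length list of word vectors into one fixed-length
    vector by concatenating the dimension-wise max and min."""
    dims = len(word_vectors[0])
    maxes = [max(v[d] for v in word_vectors) for d in range(dims)]
    mins = [min(v[d] for v in word_vectors) for d in range(dims)]
    return maxes + mins  # length is always 2 * embedding dimension

# Toy 3-dimensional "embeddings" for a three-word symptom description.
vectors = [
    [0.2, -0.1, 0.5],
    [0.7, 0.3, -0.4],
    [-0.6, 0.0, 0.1],
]
doc_vector = max_min_pool(vectors)  # -> [0.7, 0.3, 0.5, -0.6, -0.1, -0.4]
```

The output length depends only on the embedding dimension, not on how many words the description contains, which is what lets descriptions of any length feed the same CNN input layer.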

Multi-Source Scraping

Training data scraped from Mayo Clinic, UpToDate, Healthline, and NHS.

Multi-Class Classification

Classifies migraine, depression, and tetanus, reporting per-class precision and recall.
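Per-class precision and recall, as reported above, can be computed directly from true/predicted label pairs. A minimal sketch with invented labels (in practice one would use e.g. scikit-learn's `classification_report`):

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Per-class precision and recall from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1       # correct prediction for class t
        else:
            fp[p] += 1       # predicted p, but true class was t
            fn[t] += 1       # missed an instance of class t
    classes = set(y_true) | set(y_pred)
    return {
        c: {
            "precision": tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0,
            "recall": tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0,
        }
        for c in classes
    }

# Hypothetical test-set labels, not the project's actual predictions.
y_true = ["migraine", "migraine", "tetanus", "depression", "depression"]
y_pred = ["migraine", "depression", "tetanus", "depression", "migraine"]
metrics = per_class_metrics(y_true, y_pred)
```

"Perfect precision on migraines and tetanus" means that whenever the model predicted one of those classes, it was correct; recall measures the converse, the fraction of each class's true instances the model found.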

Pipeline Comparison

Side-by-side evaluation showing that sparse one-hot representations can outperform learned embeddings on small corpora.

Tech Stack

Python
Keras
NLTK