md-classifier

A deep learning system combining transformers and CNNs to classify diseases from patient-described symptoms, achieving 90% recall through semantic embeddings and CNN feature extraction.

Jupyter Notebook · 2 · Updated Mar 16, 2026
Topics: cnn, convolutional-neural-networks, deep-learning, disease-classification, encodings, github-actions, healthcare, keras, machine-learning, medical-diagnosis, natural-language-processing, nlp, nltk, pandas, pip, python, research, transformers

Archived. This was a group project / final assessment for CSCI 4152/6509 — Natural Language Processing at Dalhousie University, part of the Artificial Intelligence & Intelligent Systems certificate (undergraduate/graduate mixed course). It is no longer actively maintained.

MD Classifier

A convolutional neural network (CNN) that predicts medical conditions from natural language symptom descriptions. Given a text input like “My head is hurting”, the model returns the most probable diagnosis (e.g., Migraine).
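As a rough illustration of the final step, the sketch below maps a vector of class probabilities to the most probable label. The label order, the `most_probable` helper, and the example scores are all assumptions made for illustration; the notebooks may index classes differently.

```python
# Hypothetical label order; the trained model may use a different one.
LABELS = ["Migraine", "Depression", "Tetanus"]


def most_probable(probabilities, labels=LABELS):
    """Return the label with the highest probability, plus that probability."""
    if len(probabilities) != len(labels):
        raise ValueError("expected one probability per label")
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]


# Made-up scores for illustration, not output from the trained model.
label, score = most_probable([0.81, 0.14, 0.05])
print(label, score)
```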

Built as a research project at Dalhousie University, the best-performing model reached 90% recall across three conditions: Migraine, Depression, and Tetanus.
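For readers less familiar with the metric, here is a minimal sketch of how recall and accuracy can be computed for a three-class problem. The README does not say how recall was averaged across conditions, so macro-averaging is an assumption here, and the labels below are toy data rather than project results.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions for a class / all samples of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls (assumed averaging scheme)."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: every Tetanus sample is missed, which hurts macro recall more than accuracy.
y_true = ["Migraine", "Migraine", "Migraine", "Migraine", "Depression", "Tetanus"]
y_pred = ["Migraine", "Migraine", "Migraine", "Migraine", "Depression", "Migraine"]

print(per_class_recall(y_true, y_pred))
print(round(macro_recall(y_true, y_pred), 3), round(accuracy(y_true, y_pred), 3))
```

With imbalanced classes the two numbers can diverge noticeably, which is one reason to report both.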

How It Works

Two CNN implementations are compared, each with a different text preprocessing strategy:

| Approach | Preprocessing | Accuracy | Recall |
|---|---|---|---|
| One-Hot Encoding | Tokenize, stem, generate synthetic samples via Witten-Bell distribution, encode as binary vectors | 93% | 90% |
| FastText Embeddings | Tokenize sentences, train unsupervised word embeddings, pad via max/min pooling | 73% | 70% |
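The two preprocessing strategies can be sketched in a few lines of plain Python. This is only one possible reading of the table, not the notebooks' code: `crude_stem` is a simplistic stand-in for a real stemmer such as NLTK's, "binary vectors" is interpreted as a word-presence vector over the vocabulary, and "max/min pooling" is interpreted as concatenating the element-wise maximum and minimum of the word vectors to get a fixed-length input. Witten-Bell sample generation and FastText training are omitted.

```python
import re


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, keep alphabetic runs, and stem each token."""
    return [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())]


def build_vocab(texts):
    """Map every stemmed token in the corpus to a stable index."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}


def one_hot_encode(text, vocab):
    """Binary vector with a 1 for each vocabulary word present in the text."""
    vector = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vector[vocab[tok]] = 1
    return vector


def min_max_pool(word_vectors):
    """Concatenate element-wise max and min over word vectors (length 2 * dim)."""
    columns = list(zip(*word_vectors))
    return [max(c) for c in columns] + [min(c) for c in columns]


corpus = ["My head is hurting", "Feeling sad and tired"]
vocab = build_vocab(corpus)
print(one_hot_encode("head hurts", vocab))

# Made-up 2-dimensional word vectors standing in for FastText embeddings.
print(min_max_pool([[0.1, -0.2], [0.4, 0.0], [-0.3, 0.5]]))
```

Whatever the sentence length, `min_max_pool` returns a vector of the same size, which is what a fixed-shape CNN input layer needs.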

Training data is scraped from medical encyclopedias (Mayo Clinic, UpToDate, Healthline, NHS, etc.) and processed through custom pipelines in the Jupyter notebooks.

Documents

Project Structure

src/
  one-hot-encoding.ipynb   # One-Hot + CNN implementation
  fast-text.ipynb          # FastText + CNN implementation
resources/data/
  sources/                 # CSV files with scraping targets (URLs + DOM selectors)
  targets/                 # Processed training text per condition
docs/                      # LaTeX source for paper and proposal
scripts/
  compile-latex.sh         # Build PDFs from LaTeX
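To make the role of `resources/data/sources/` more concrete, the sketch below loads a CSV of scraping targets and groups them by condition. The column names (`condition`, `url`, `selector`), the example rows, and the `load_sources` helper are hypothetical; the actual CSV layout in the repository may differ, and no page is fetched here.

```python
import csv
import io

# Hypothetical file contents; real files live under resources/data/sources/.
SAMPLE = """condition,url,selector
Migraine,https://example.org/migraine,div.symptoms p
Depression,https://example.org/depression,section#symptoms li
"""


def load_sources(handle):
    """Group (url, selector) pairs by condition from an open CSV file."""
    sources = {}
    for row in csv.DictReader(handle):
        sources.setdefault(row["condition"], []).append((row["url"], row["selector"]))
    return sources


sources = load_sources(io.StringIO(SAMPLE))
for condition, entries in sources.items():
    print(condition, entries)
```

A scraper would then fetch each URL and keep only the text matched by its selector before writing the result to `resources/data/targets/`.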

Getting Started

Requirements: Python 3.7+, pip

# Create and activate a virtual environment
python -m virtualenv venv
source venv/bin/activate

# Install dependencies
pip install --no-deps -r requirements.txt

# Set up Jupyter kernel
pip install ipykernel
python -m ipykernel install --user --name=mdnlp

Then open either notebook in src/ (locally or via Google Colab) and run the cells sequentially. Each notebook handles data collection, preprocessing, model training, and evaluation end-to-end.

License

Apache 2.0