Text Mining

Course Number: DIDA 310
Semester: Summer 2025
Level: Undergraduate
An undergraduate-level text mining course that emphasizes using Python and standard libraries to perform basic data analysis, count words, classify text, and identify major trends in documents.

Overview

In this course, students will learn to use Python and standard libraries to perform basic data analysis, count words, classify text, and uncover major trends in documents. The course pays special attention to how natural language processing can serve social good and help prevent genocide and mass atrocities. To that end, we will examine datasets that include internet hate speech, truth and reconciliation commission proceedings, and historical records of targeted mass killings.

Office Hours

  • Wednesdays: 2:30 PM - 4:30 PM on Zoom.
  • Or by appointment

Accessibility

If you have a documented disability, anticipate needing any type of accommodation in this course, or have questions/concerns about access, please tell me as soon as possible. Reasonable accommodations will be made for all students with disabilities, but it is your responsibility to inform me early in the term. I strongly encourage you to register any disability with the Services for Students with Disabilities (SSD) Office.

AI Usage Policy

The use of Artificial Intelligence (AI) tools (such as ChatGPT, Copilot, or similar generative AI platforms) is strictly prohibited in this course. Assignments, projects, and all submitted work must reflect your own understanding and effort.

To enhance your learning and problem-solving skills, you are strongly encouraged to use the sample code and guidance provided in the course materials available on Brightspace.

If you encounter difficulties in coding tasks:

  • First, consult the course materials and lecture notes.

  • Second, you may use reputable online resources or the recommended course textbook. However, you must properly cite any external sources used.

  • If you are still unable to resolve the issue, please reach out to me via email or visit during my office hours for additional support.

This policy is designed to support your learning and ensure academic integrity. Violations may result in disciplinary action in accordance with the university’s academic conduct policies.

Course Content

Python for Data Science

  • Basic programming concepts: math, functions, variables, iteration, conditionals
  • Basic data structures: strings, lists, and data frames
  • Data cleaning (pandas)
  • Data visualization (matplotlib and seaborn)

Statistical and Rule-Based Text Analysis

  • Regular expressions
  • Word and sentence tokenization
  • Part-of-speech tagging and lemmatization (spaCy)
  • Word vectors and embeddings
  • Word counts and stop words
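
As a rough illustration of the tokenization, stop-word, and word-count topics above, here is a minimal sketch that uses only the Python standard library. The stop-word list, the function names, and the sample sentence are made up for this example; in class we will rely on fuller tools such as spaCy.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was"}

def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_words(text, stop_words=STOP_WORDS):
    """Count tokens that are not in the stop-word set."""
    return Counter(t for t in tokenize(text) if t not in stop_words)

# Made-up sample sentence, not taken from any course dataset.
sample = "The commission heard the testimony, and the testimony was recorded."
counts = count_words(sample)
print(counts.most_common(3))
```

A regular expression this simple ignores digits and non-English letters, which is one reason library tokenizers are usually preferred for real data.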

Machine Learning

  • Loss functions and minimization algorithms for the perceptron
  • Training, testing, and model evaluation
  • Classification algorithms (perceptron, SVM, LSTM RNN)
  • Word vectors and embeddings (GloVe)
  • Topic modeling (LDA, Top2Vec, UMAP)
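
To give a sense of the perceptron, training, and evaluation topics above, the sketch below trains a perceptron in plain Python. Its mistake-driven update is a stochastic gradient step on the perceptron loss max(0, -y(w·x + b)). The function names, the learning rate, and the toy feature vectors are assumptions made for this illustration, not course data or the code used in class.

```python
def predict(weights, bias, features):
    """Return +1 if the linear score is positive, otherwise -1."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else -1

def train_perceptron(examples, epochs=20, learning_rate=1.0):
    """Update weights only on misclassified examples (perceptron rule)."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in examples:
            if predict(weights, bias, features) != label:
                # Move the decision boundary toward the misclassified example.
                weights = [w + learning_rate * label * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * label
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias

# Hypothetical toy data: each vector holds counts for two keyword groups,
# and each label is +1 or -1.
train_set = [([3, 0], 1), ([2, 1], 1), ([0, 3], -1), ([1, 4], -1)]
test_set = [([4, 1], 1), ([0, 2], -1)]

weights, bias = train_perceptron(train_set)
accuracy = sum(predict(weights, bias, x) == y for x, y in test_set) / len(test_set)
print(weights, bias, accuracy)
```

Keeping the test examples out of training is what makes the accuracy figure a fair estimate of how the model handles unseen text.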

Course Materials

This class does not have a required textbook, but Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning is highly recommended.

Assignments

Students will complete:

  1. Weekly Assignments (80%): Four problem sets covering the topics listed for each assignment below.
  2. Final Project (20%): One project covering all topics from the class.

We will be using Google Colab in this class. Detailed information about how to use Google Colab will be available on Brightspace.

Assignments and Final Project

Assignment 1

Questions will include these topics:

  1. Python fundamentals: math, functions, conditionals, loops
  2. Data structures: strings, lists, dictionaries, DataFrames
  3. Pandas for text data analysis
  4. Basic data visualization with matplotlib/seaborn

Deadline: June 2

Dataset: Link

Assignment 1: Brightspace

Assignment 2

Questions will include these topics:

  1. Regular expressions, word/sentence tokenization
  2. Stop words, word counts, n-grams
  3. POS tagging, lemmatization (spaCy)
  4. TF-IDF transformation

Deadline: June 9

Dataset: Link

Assignment 2: Brightspace

Assignment 3

Questions will include these topics:

  1. Train-test split, evaluation metrics
  2. Perceptron: loss function, gradient descent
  3. Classification models: Perceptron, SVM
  4. Word embeddings (GloVe)

Deadline: June 16

Dataset: Link

Assignment 3: Brightspace

Assignment 4

Questions will include these topics:

  1. Topic modeling: LDA, Top2Vec, UMAP

Deadline: June 23

Dataset: Link

Assignment 4: Brightspace

Final Project

The final project will cover all topics covered in class.

Deadline: June 29

Dataset: Link

Final Project: Brightspace

Exercises

I will provide weekly exercises to help you prepare for each assignment. You can find all the exercise files on the class Brightspace page.