Text Mining
Overview
In this course, students will learn to use Python and standard libraries to perform basic data analysis, count words, classify text, and uncover major trends in documents. Special focus will be paid to how natural language processing can be used in the service of social good and the prevention of genocide and mass atrocities. We will do this by examining datasets that include internet hate speech, truth and reconciliation commission proceedings, and historical records of targeted mass killings.
Office Hours
- Wednesdays: 2:30 PM - 4:30 PM on Zoom.
- Or by appointment
Accessibility
If you have a documented disability, anticipate needing any type of accommodation in this course, or have questions/concerns about access, please tell me as soon as possible. Reasonable accommodations will be made for all students with disabilities, but it is your responsibility to inform me early in the term. I strongly encourage you to register any disability with the Services for Students with Disabilities (SSD) Office.
AI Usage Policy
The use of Artificial Intelligence (AI) tools (such as ChatGPT, Copilot, or similar generative AI platforms) is strictly prohibited in this course. Assignments, projects, and all submitted work must reflect your own understanding and effort.
To enhance your learning and problem-solving skills, you are strongly encouraged to use the sample code and guidance provided in the course materials available on Brightspace.
If you encounter difficulties in coding tasks:
- First, consult the course materials and lecture notes.
- Second, you may use reputable online resources or the recommended course textbook. However, you must properly cite any external sources used.
- If you are still unable to resolve the issue, please reach out to me via email or visit during my office hours for additional support.
This policy is designed to support your learning and ensure academic integrity. Violations may result in disciplinary action in accordance with the university’s academic conduct policies.
Course Content
Python for Data Science
- Basic programming concepts: math, functions, variables, iteration, conditionals
- Basic data structures: strings, lists, and data frames
- Data cleaning (pandas)
- Data visualization (matplotlib and seaborn)
Statistical and Rule-Based Text Analysis
- Regular expressions
- Word and sentence tokenization
- Part-of-speech tagging, lemmatization (spaCy)
- Word vectors and embeddings
- Word counts and stop words
Machine Learning
- Loss functions and minimization algorithms for the perceptron
- Training, testing, and model evaluation
- Classification algorithms (perceptron, SVM, LSTM RNN)
- Word vectors and embeddings (GloVe)
- Topic Modeling (LDA, Top2Vec, UMAP)
Course Materials
This class does not have a required textbook, but Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning is highly recommended.
Assignments
Students will complete:
- Weekly Assignments (80%): Four problem sets focusing on algorithm design and analysis.
- Final Project (20%): Two programming projects implementing and analyzing advanced algorithms.
We will be using Google Colab in this class. Detailed information about how to use Google Colab will be available on Brightspace.
Assignments and Final Project
Assignment 1
Questions will include these topics:
- Python fundamentals: math, functions, conditionals, loops
- Data structures: strings, lists, dictionaries, DataFrames
- Pandas for text data analysis
- Basic data visualization with matplotlib/seaborn
Deadline: June 2
Dataset: Link
Assignment 1: Brightspace
Assignment 2
Questions will include these topics:
- Regular expressions, word/sentence tokenization
- Stop words, word counts, n-grams
- POS tagging, lemmatization (spaCy)
- TF-IDF transformation
Deadline: June 9
Dataset: Link
Assignment 2: Brightspace
Assignment 3
- Train-test split, evaluation metrics
- Perceptron: loss function, gradient descent
- Classification models: Perceptron, SVM
- Word embeddings (GloVe)
Deadline: June 16
Dataset: Link
Assignment 3: Brightspace
Assignment 4
- Topic Modeling: LDA, Top2Vec, UMAP
Deadline: June 23
Dataset: Link
Assignment 4: Brightspace
Final Project
The final project will cover all topics covered in class.
Deadline: June 29
Dataset: Link
Final Project: Brightspace
Exercises
I will provide weekly exercises to help you prepare for each assignment. You can find all the exercise files on the class Brightspace page.