Language as Data

What are we trying to accomplish?

In this module, students will learn how human language can be transformed into structured data that machines can process, analyze, and learn from. Rather than relying on handcrafted rules, we begin treating text as a dataset—applying systematic preprocessing steps to convert raw language into consistent, machine-readable formats.

This module introduces the foundational Natural Language Processing (NLP) pipeline used in nearly all modern chatbot systems. Students will explore how text is normalized, tokenized, and transformed into feature representations such as Bag of Words and N-grams. These representations enable statistical reasoning over language and serve as the backbone for intent classification, retrieval-based chatbots, and machine learning models.
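The normalization and tokenization stages described above can be sketched with only the Python standard library (production systems typically use libraries such as NLTK or spaCy instead; the function names here are illustrative, not from any particular library):

```python
import re

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so surface variants
    # like "Hello," and "hello" map to the same token.
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", "", text)

def tokenize(text: str) -> list[str]:
    # Simple whitespace tokenization of the normalized text.
    return text.split()

raw = "Hello, Chatbot! How are YOU today?"
tokens = tokenize(normalize(raw))
print(tokens)  # ['hello', 'chatbot', 'how', 'are', 'you', 'today']
```

Running the same pipeline over every document in a corpus is what gives the "consistent, machine-readable format" the module emphasizes: two inputs that differ only in casing or punctuation produce identical token sequences.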

By the end of this module, students will understand how language transitions from unstructured text into numerical features—and why this transformation is essential before introducing machine learning or deep learning techniques.


Lectures & Assignments

Lectures

Assignments


TLOs (Terminal Learning Objectives)

  • Transform raw text into structured, machine-readable data
  • Implement a complete NLP preprocessing pipeline in Python
  • Convert processed text into numerical feature vectors
  • Apply Bag of Words, N-grams, and TF-IDF vectorization techniques
  • Explain how feature representations enable intent classification and retrieval
  • Prepare text data for downstream machine learning models
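To make the vectorization objectives concrete, here is a minimal, standard-library-only sketch of Bag of Words counting and N-gram extraction (real coursework would more likely use scikit-learn's `CountVectorizer`; these helper names are assumptions for illustration):

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[str]:
    # Slide a window of size n across the token list;
    # n=1 gives unigrams, n=2 bigrams, and so on.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens: list[str], vocabulary: list[str]) -> list[int]:
    # One count per vocabulary term; terms absent from this
    # document stay 0, which is why these vectors are sparse.
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

tokens = ["the", "bot", "greets", "the", "user"]
vocab = sorted(set(tokens))         # ['bot', 'greets', 'the', 'user']
print(bag_of_words(tokens, vocab))  # [1, 1, 2, 1]
print(ngrams(tokens, 2))  # ['the bot', 'bot greets', 'greets the', 'the user']
```

Note that the vocabulary is built from the corpus first and then held fixed, so every document maps to a vector of the same length — the property that lets downstream classifiers compare documents numerically.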

ELOs (Enabling Learning Objectives)

  • What it means to treat language as data rather than rules
  • How tokenization, normalization, and lemmatization affect downstream models
  • Why preprocessing consistency is critical for NLP systems
  • How vocabularies are constructed from corpora
  • The difference between unigrams, bigrams, and higher-order N-grams
  • How sparse feature vectors represent text numerically
  • Why raw word counts can be misleading in language tasks
  • How TF-IDF improves relevance in retrieval-based chatbots
  • The strengths and limitations of classical NLP representations
  • How vectorization bridges rule-based systems and machine learning
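The last few objectives — why raw counts mislead and how TF-IDF reweights them — can be illustrated with one common (unsmoothed) formulation, tf × log(N/df), implemented from scratch below; library implementations such as scikit-learn's `TfidfVectorizer` add smoothing and normalization on top of this idea:

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    # docs: a corpus given as lists of tokens.
    # Returns one {term: weight} mapping per document.
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # term frequency (normalized by doc length) times
            # inverse document frequency log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["chatbot", "greets", "user"],
        ["user", "asks", "question"]]
for w in tf_idf(docs):
    print(w)
```

Because "user" appears in every document, its IDF is log(2/2) = 0 and its weight drops to zero, while terms unique to one document keep positive weight — exactly the relevance behavior retrieval-based chatbots rely on when matching a query against stored responses.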