BUS 41201 is a course on Big Data and Applied AI. Students will learn the concepts underlying powerful prediction and text-generating systems (e.g., Google search or ChatGPT). The purpose is to learn how to explore and analyze large datasets, become adept at building models, and gain the understanding necessary for interpreting such models.
We will emphasize the analysis of text data in the context of both small and large language models (LLMs), which form the basis of popular text-generating systems (ChatGPT, BERT, Llama).
This course includes the key concepts and tools that data scientists find valuable in business environments, and it is also designed to act as a primer for continued study. It is not specifically an introduction to computer science or machine learning, nor a class on high-dimensional econometrics and statistics; rather, like a good data scientist, the class borrows from multiple disciplines.
Techniques covered include an advanced overview of linear and logistic regression, model choice and false discovery rates, information criteria and cross validation, regularized regression and the lasso, bagging and the bootstrap, experiments and causal estimation, multinomial and binary regression, classification, latent variable models, principal component analysis, topic models, decision trees and random forests, text analysis, language models, and natural language processing.
We learn both the basic underlying concepts and practical computational skills, including techniques for scalable analysis of distributed data. Heavy emphasis is placed on the analysis of actual datasets and on the development of application-specific methodology. Among other examples, we will consider consumer database mining, internet and social media tracking, asset pricing, network analysis, sports analytics, and text mining.
Each class is accompanied by at least two large real-world datasets (e.g., from marketing, healthcare, or finance). We will analyze these datasets together and develop intuition for data analysis without too much emphasis on coding.