Module Content

Part I: Classical Techniques (5 ECTS)
  • Least Squares Analysis
  • Principal Component Analysis
  • Cluster Analysis
  • Factor Analysis
Part II: Topological Data Analysis (5 ECTS)
  • Topological Preliminaries
  • Mapper Clustering
  • Persistent Homology
  • Fundamental Group

Module Coordinates

  • Lecturers: Graham Ellis & Emil Sköldberg
  • Lectures:
    Mon 10.00am, GE, ADB1020
    Tue 12.00pm, GE, ADB1020
    Wed 10.00am, ES, ADB1019 (or ADB1020)
    Fri 2.00pm, ES, ADB1019 (or IT206)
  • Tutorials: Wednesday and Friday lectures will often take the form of a tutorial, so no formal tutorials are scheduled.
  • Recommended texts: Part I is based on four chapters of the textbook Multivariate Analysis by Sir Maurice Kendall. (The numerical examples on regression are taken from Applied Linear Statistical Models by John Neter and William Wasserman.) Part II is based on the research paper An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists by Frédéric Chazal and Bertrand Michel.
  • Problem sheet: available here.
  • Module Website: Information and module documents will be posted to this site, which is linked from the Blackboard MA500 Geometric Foundations of Data Analysis pages. Blackboard will also be used for announcements and for posting grades.

Module Assessment

Part I will be assessed by a 2-hour written exam (52%) and three continuous assessment assignments (16% each).

Part II will be assessed by a 2-hour written exam (50%) and two continuous assessment assignments (25% each).

Each exam will consist of four questions, with full marks for four correct answers.

Each assignment will consist of a data analysis problem that needs to be tackled using the Python programming language, and submitted (by email to both lecturers) as a PDF document.

Supplementary Material and News

Clicker opinion polling may be used in some lectures.

Lecture Notes

Lecture 1
Began by explaining the terms "geometry", "data analysis", "statistics", "probability". Then explained how to find the line y = b0 + b1x that "best fits" a collection of data points (xi, yi) for i = 1, 2, ..., n. We took "best fit" to mean the line that minimizes the sum of the squares of the residuals ei = yi - b0 - b1xi.
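
A minimal Python sketch of this computation (the data points are made up for illustration; only NumPy is assumed):

    import numpy as np

    # illustrative data points (xi, yi)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    # closed-form least squares estimates for the line y = b0 + b1*x
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    # residuals ei = yi - b0 - b1*xi and their sum of squares
    e = y - b0 - b1 * x
    print(b0, b1, np.sum(e ** 2))
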
Lecture 2

Explained how to determine the best (in the least squares sense) plane y = b0 + b1x1 + ... + bp-1xp-1 that fits data points (yi, xi,1, ..., xi,p-1) in Rp for i = 1, 2, ..., n.

Also explained how to determine the best (in the least squares sense) polynomial of degree at most d that fits data points in R2.
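
A Python sketch of both fits (the data arrays are illustrative; np.linalg.lstsq and np.polyfit do the least squares work):

    import numpy as np

    # multiple regression: fit y = b0 + b1*x1 + b2*x2 (illustrative data)
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
    y = np.array([5.0, 4.1, 10.2, 9.8, 14.0])
    A = np.column_stack([np.ones(len(y)), X])   # design matrix with intercept column
    b, *_ = np.linalg.lstsq(A, y, rcond=None)   # b = (b0, b1, b2)

    # best polynomial of degree at most d = 2 through points in R2
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    z = np.array([1.1, 2.0, 4.9, 10.2, 16.8])
    coeffs = np.polyfit(x, z, deg=2)            # coefficients, highest degree first
    print(b, coeffs)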

Introduced the coefficient of determination
R2 = SSR/SSTO,
where SSR is the regression sum of squares (the sum of the squared deviations of the fitted values from the mean of the yi) and SSTO is the total sum of squares (the sum of the squared deviations of the yi from their mean). Typically R2 is close to 1 when the fit is good (i.e. when the fit "explains" a lot of the variation in the yi).

Next lecture we'll show that 0 <= R2 <= 1. We'll also give the formula for R2 in matrix notation.
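
Continuing the simple regression sketch from Lecture 1, R2 can be computed directly from this definition in Python:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x                      # fitted values

    SSR = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares
    SSTO = np.sum((y - y.mean()) ** 2)       # total sum of squares
    print(SSR / SSTO)                        # R2, close to 1 for a good fit
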
Lectures 3-24: summaries to be posted.