Module Content

Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering and communicating useful information, informing conclusions, and supporting decision-making. Data analysis involves many overlapping areas of study, such as descriptive statistics (concerned with summarizing information in a sample), statistical inference (concerned with inferences about a population based on properties of a sample), machine learning (concerned with performing tasks using algorithms that rely on patterns and inference), data mining (sometimes considered a subfield of machine learning and concerned with discovering patterns in large data sets), information visualization (concerned with the visual representations of abstract data to reinforce human cognition), and several other areas. See this interesting review for comments about the related term data science.

Geometry is concerned with questions of shape, size, relative position of figures, isometry, and the properties of space. It underlies much of data analysis, as can be seen from textbooks such as those by [Kendall], [Le Roux], [Kirby], [Tossdorff], [Hartmann], [Oudot], [Tierny], [Edelsbrunner], [Patrangenaru], [Biau], [Wichura], [Dryden]. The recent online textbook Mathematical Foundations for Data Analysis by Jeff M. Phillips again emphasizes the importance of a geometric understanding of techniques when applying them to data analysis.

This module focuses on some geometric methods used in data analysis. It covers the geometric and algorithmic aspects of these methods, as well as their implementation as Python code on Linux computers, and their application to a range of different types of data. The first half of the course emphasizes geometric aspects of classical techniques such as least squares fitting, principal component analysis, hierarchical clustering, nearest neighbour searching, and the Johnson-Lindenstrauss Theorem for dimensionality reduction. The second half of the course covers more recent techniques that have been developed over the last two or three decades, and emphasizes topological aspects as well as geometric aspects. The second half of the course makes use of R interfaces to Python Mapper and to efficient C++ routines for persistent homology.



Part I = CS4102: Classical Techniques (5 ECTS, first 24 lectures)
  • Least Squares Fitting
  • Principal Component Analysis
  • Hierarchical Clustering and Persistence
  • Nearest Neighbours and the Johnson–Lindenstrauss Theorem
Part II = CS4103: Topological Data Analysis (5 ECTS, second 24 lectures)
  • Topological Preliminaries
  • Mapper Clustering
  • Persistent Homology

Module Coordinates

Module Assessment

Part I will be assessed by a 2-hour written exam (52%) and three continuous assessment assignments (16% each).

Part II will be assessed by a 2-hour written exam (50%) and two continuous assessment assignments (25% each).

Each exam will consist of four questions, with full marks for four correct answers.

Each assignment will consist of a data analysis problem that needs to be tackled using the Python programming language or the R programming environment, and submitted (by email to both lecturers) as a PDF document.

Supplementary Material and News

Homework results are here.

Emil's Lecture Notes

These will be posted here.

Exam Details

Exam hints/help for MA500/CS4102/CS4103 exams can be found here and here.

Lecture Notes

(Click a lecture number to download the notes for that lecture.)

Lecture Summaries
1
Video of the lecture.

Talked about "dimension reduction" and, as an example, the paper "Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival" by Nicolau, Levine and Carlsson.

Then gave the definition and examples of a simplicial complex.
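The defining condition of an abstract simplicial complex (every nonempty subset of a simplex is again a simplex) is easy to check by machine. Here is a minimal Python sketch of that check; the function name is my own, purely for illustration.

```python
from itertools import combinations

def is_simplicial_complex(K):
    """Check the defining property of an abstract simplicial
    complex: every nonempty subset of a simplex is again a simplex."""
    K = {frozenset(s) for s in K}
    return all(frozenset(face) in K
               for s in K
               for r in range(1, len(s))
               for face in combinations(s, r))

# The hollow triangle: three vertices and three edges.
triangle = [{0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}]
print(is_simplicial_complex(triangle))   # True
print(is_simplicial_complex([{0, 1}]))   # False: the two vertices are missing
```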
2
Video of the lecture.

Explained what is meant by the geometric realization |K| of a simplicial complex K.

Then explained how, from any n×n matrix of pairwise distances between n objects, and any ε>0, we can construct a geometric realization |Kε| of a simplicial complex.

We considered a sample S of 250 readings x_i = (h_{i,0}, h_{i,2}, h_{i,4}) of water levels at Galway harbour, where h_{i,0} is the initial level when person i reaches the harbour, h_{i,2} is the level 2 hours later, and h_{i,4} is the level 4 hours after the initial reading. We constructed a 250×250 matrix of distances d_{i,j} = ||x_i - x_j|| and used this to construct a simplicial complex Kε for various thresholds ε > 0. The graph of one of these simplicial complexes was as follows.
This graph seems to capture the "cyclic nature" of tides!
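The construction of the 1-skeleton of Kε from a distance matrix can be sketched in a few lines of Python. This is a toy illustration with an invented 4×4 distance matrix, not the Galway tide data.

```python
import numpy as np

def one_skeleton(D, eps):
    """Edges of the simplicial complex K_eps built from an n-by-n
    matrix D of pairwise distances: vertices i and j are joined
    whenever d(i, j) <= eps."""
    n = D.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if D[i, j] <= eps]

# Toy example: four points spaced around a circle of circumference 4,
# with the shortest-arc distance between them.
D = np.array([[0, 1, 2, 1],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [1, 2, 1, 0]], dtype=float)

print(one_skeleton(D, 1.0))   # -> [(0, 1), (0, 3), (1, 2), (2, 3)], a 4-cycle
```

At the threshold ε = 1 only "adjacent" points are joined, so the graph is a cycle, capturing the circular shape of the sample.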
3
Video of the lecture.

Began with 750 points randomly selected from two disjoint quarters of a torus in R^3. Any linear transformation R^3 --> R^2 (obtained, say, from PCA or the Johnson-Lindenstrauss theorem) would lose geometric information. However, we saw that the geometric information seems to be retained when the points are mapped to the vertices of the clique complex Kε for various values of ε.

Introduced the notion of homotopy between two maps f, g : X --> Y. Introduced the notion of homotopy equivalence between two topological spaces. Proved that the circle is homotopy equivalent to the plane minus the origin.
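For the punctured plane R^2 \ {0}, the homotopy equivalence with the circle can be written down explicitly; here is a reminder in my own notation.

```latex
% Inclusion i and radial retraction r between the circle and the punctured plane:
\[
  i \colon S^1 \hookrightarrow \mathbb{R}^2\setminus\{0\}, \qquad
  r \colon \mathbb{R}^2\setminus\{0\} \to S^1, \quad r(x)=\frac{x}{\lVert x\rVert}.
\]
% Then r o i is the identity on S^1, and the straight-line homotopy
\[
  H(x,t) = (1-t)\,x + t\,\frac{x}{\lVert x\rVert}, \qquad t \in [0,1],
\]
% never passes through the origin (both terms are positive multiples of x),
% so i o r is homotopic to the identity on R^2 \ {0}.
```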
4
Video of the lecture.

Stated and illustrated Leray's Nerve Theorem. This provides theoretical motivation for studying clique complexes in data analysis.
5
Video of the lecture.

Explained how Leray's nerve theorem motivates the use of the clique complex Kε in data analysis.

Next lecture we'll see how these ideas lead to the Mapper clustering procedure for representing a matrix of distances between data points as a simplicial complex. The procedure was illustrated in this lecture by taking 1000 points in the plane, sampled at random from the image of the starfish. The following code produced the following 1-dimensional simplicial complex as a representation of the data.

gap> HapExample("1.3.5");
gap> #The example uses GraphViz software.
gap> file:=HapFile("data135.txt");
gap> Read(file);
gap> dx:=EuclideanApproximatedMetric;;  # metric on the data points
gap> dz:=EuclideanApproximatedMetric;;  # metric on the filter's codomain
gap> P:=30*[0..100];; P:=List(P, i->[i]);; r:=29;;  # overlapping cover of the codomain by balls of radius r about the points of P
gap> epsilon:=32;;  # distance threshold used by the clustering procedure
gap> cluster:=function(S)
local Y, P, C;
if Length(S)=0 then return S; fi;
Y:=VectorsToOneSkeleton(S,epsilon,dx);
P:=PiZero(Y);
C:=Classify([1..Length(S)],P[2]);
return List(C,x->S{x});
end;;
gap> L:=List(S,v->Maximum(List(S,y->dx(v,y))));;  # eccentricity of each data point
gap> n:=Position(L,Minimum(L));;  # index of a most central data point
gap> f:=function(x); return [dx(S[n],x)]; end;;  # filter: distance to that central point
gap> M:=Mapper(S,dx,f,dz,P,r,cluster);
gap> Display(GraphOfSimplicialComplex(M));



6
Video of the lecture.

Described the Mapper Algorithm for visualizing data.
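The Mapper idea can be sketched in a few dozen lines of Python: filter the points, cover the filter range with overlapping intervals, cluster each preimage, and join clusters that share points. This is my own toy implementation (not the HAP or Python Mapper code), with single-linkage clustering hard-wired in.

```python
import numpy as np

def mapper_graph(S, f, intervals, eps):
    """Toy Mapper.  S is an (n, d) array of points, f an array of
    filter values, intervals a list of overlapping (lo, hi) pairs
    covering the range of f, and eps the threshold for single-linkage
    clustering within each piece.  Returns the clusters (frozensets
    of point indices) and the edges joining clusters that overlap."""
    clusters = []
    for lo, hi in intervals:
        remaining = {i for i in range(len(S)) if lo <= f[i] <= hi}
        # Single-linkage clustering: connected components of the graph
        # joining points of the piece at distance <= eps.
        while remaining:
            comp = {remaining.pop()}
            grew = True
            while grew:
                grew = False
                for j in list(remaining):
                    if any(np.linalg.norm(S[j] - S[k]) <= eps for k in comp):
                        comp.add(j)
                        remaining.discard(j)
                        grew = True
            clusters.append(frozenset(comp))
    # One vertex per cluster; an edge whenever two clusters share a point.
    edges = [(a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))
             if clusters[a] & clusters[b]]
    return clusters, edges

# Two well-separated pairs of points on a line, filtered by x-coordinate.
S = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 0.0], [5.5, 0.0]])
f = S[:, 0]
intervals = [(-1.0, 1.0), (0.0, 6.0)]
clusters, edges = mapper_graph(S, f, intervals, 1.0)
print(len(clusters), len(edges))   # 3 clusters, 1 edge
```

The left pair is seen by both intervals, so its two clusters share points and are joined by an edge; the right pair gives an isolated vertex.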
7
Video of the lecture.

Recapped what we've done so far in topological data analysis, and then defined the n-th Betti number of a simplicial complex. Ended with an example in which I calculated the 0th and 1st Betti numbers of a simplicial complex.
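Over the rationals, Betti numbers reduce to ranks of boundary matrices: β_n = dim ker ∂_n − rank ∂_{n+1}. Here is a hand-rolled sketch for the hollow triangle, using NumPy's matrix rank; the matrix and layout are my own illustration.

```python
import numpy as np

# Betti numbers of the hollow triangle (3 vertices, 3 edges, no
# 2-simplices), computed over the rationals via
#     beta_n = dim ker(d_n) - rank(d_{n+1}).
# Columns of d1 are the boundaries of the edges [0,1], [0,2], [1,2];
# rows are the vertices 0, 1, 2.
d1 = np.array([[-1, -1,  0],
               [ 1,  0, -1],
               [ 0,  1,  1]])

rank_d1 = np.linalg.matrix_rank(d1)
n_vertices, n_edges = 3, 3

beta0 = n_vertices - rank_d1          # connected components: 1
beta1 = (n_edges - rank_d1) - 0       # independent loops: 1 (no 2-simplices)
print(beta0, beta1)                   # 1 1
```

The third column is the difference of the first two, so rank ∂_1 = 2, giving one component and one loop, as expected for a circle.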
8
Video of the lecture.

Introduced, in an informal way, the notion of persistence of 1-dimensional holes in a data set, and explained how this persistence can be represented using barcodes.

Ended with the definition of the degree n homology vector space H_n(K) of a simplicial complex K. For finite K this vector space has finite dimension equal to the Betti number β_n(K).
9
Video of the lecture.

Started with a computer illustration of how persistent homology can be used to place some topological structure on a data set which might help an expert in the application domain interpret the data.

Then gave the precise definition of persistent homology and persistent Betti numbers.

Finished by stating that the computation of persistent Betti numbers boils down to nothing more than column reduction of a matrix to semi-echelon form.

Warning: I think I kept saying "row reduction" and "row echelon form" at the end of the lecture where I meant to say "column reduction" and "column echelon form". Apologies for that!
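The column reduction can be made concrete. Below is a toy Python implementation of the standard reduction of a filtered boundary matrix over Z/2 (my own illustration, not the efficient C++ routines used in the module): columns are stored as sets of row indices and processed left to right, and the pairs of indices it returns are the finite bars of the barcode.

```python
def reduce_boundary(columns):
    """Standard persistence reduction over Z/2.  columns[j] is the set
    of row indices of column j of the filtered boundary matrix.  While
    a column's lowest 1 clashes with an earlier column's, add that
    earlier column (symmetric difference = addition mod 2).  A nonzero
    reduced column j with lowest 1 in row i gives the pair (i, j):
    the class born with simplex i dies with simplex j.  Simplices that
    are never paired carry essential (infinite-bar) classes."""
    low = {}        # lowest row index -> column that owns it
    pairs = []
    for j, col in enumerate(columns):
        col = set(col)
        while col and max(col) in low:
            col ^= columns[low[max(col)]]
        if col:
            columns[j] = col
            low[max(col)] = j
            pairs.append((max(col), j))
    return pairs

# Filtration of the triangle: vertices 0, 1, 2, then the edges
# [0,1] = 3, [0,2] = 4, [1,2] = 5, then the 2-simplex [0,1,2] = 6.
columns = [set(), set(), set(),       # vertices have empty boundary
           {0, 1}, {0, 2}, {1, 2},    # edge boundaries
           {3, 4, 5}]                 # boundary of the 2-simplex
bars = reduce_boundary(columns)
print(bars)   # [(1, 3), (2, 4), (5, 6)]
```

The two short H0 bars record components merging as edges arrive; the bar (5, 6) is the loop created by the last edge and filled in by the 2-simplex; vertex 0 stays unpaired, giving the one essential H0 class.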
10
Video of the lecture.

Went through an example of the use of persistent homology to analyze the space of natural images.

Finished off with a statement of a stability theorem. A rough, non-mathematical statement is: if a data set is changed by only a small amount, then the resulting barcode changes by only a small amount.
11
Tutorial session
12

13

14

15

16

17

18

19

20

21

22

23

24