Organisation

Time: Tuesday, 08.15h-11.30h
Room: U32
Credits: 4 SWS / 6 ECTS
Exam: Lab Exercises

Announcements

  • First lesson in term SS 20: Tuesday, 21.04.2020

Important notice for the summer term 2020: During the SARS-CoV-2-related restrictions in term SS 20, the course will initially be offered as a synchronous distance learning course. I will post the corresponding Zoom link to the group of users registered online in the Persönlicher Stundenplan before 20.04.2020. The timetable given in Starplan applies. If and when we are able to meet in lecture halls again, the room indicated in Starplan applies.

To keep Zoom breakout rooms persistent, you need to register for a free Zoom account. Then enter the email address of your Zoom account in the second column of this group-assignment spreadsheet.

Data Mining Lab: Contents

In this course, all student groups implement 6 different data mining and pattern recognition applications. A group consists of at most 3 students. Each application should be implemented within one session. The applications to be implemented are described in the subsections below.

For each of the 6 lab exercises:

  • a Jupyter notebook is provided, which contains the task description and questions.
  • students have to prepare before the exercise date. To focus the preparation, the Jupyter notebook of each exercise contains a list of preparation questions. Randomly selected students will be asked these questions at the start of each exercise.
  • the tasks formulated in the Jupyter notebook must be implemented in its code cells. Moreover, the questions must be answered in the Jupyter notebook.
  • Important: Even though it is not always explicitly stated, the obtained results must be discussed scientifically: try to explain the results, document what you find interesting, propose improvements, and so on. This discussion must also be included in the Jupyter notebook.
  • the completed Jupyter notebooks (as described in the previous items, including the answers to the preparation questions!) must be submitted to the lecturer. The due date for each notebook is immediately before the start of the next lab exercise. The Jupyter notebook (.ipynb), its .html representation and a link to download the entire project must be submitted.
  • Each exercise is marked. The final mark is the average over all 6 marks.
  • Unexcused absence yields a submark of 4.7.

Global Health Data Analysis:

In this exercise, data on global health and nutrition is analysed. In particular:

  • life-expectancy per country is visualized in a global map
  • correlations between nutrition facts, such as daily calorie consumption per capita, and life-expectancy are analysed
  • machine-learning models to predict life-expectancy from nutrition features are trained
  • countries are clustered according to their nutrition-development within the last 50 years

For this exercise two sessions are allocated.
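
The correlation analysis listed above can be sketched in a few lines. This is only a minimal illustration and not the notebook's solution: it computes the Pearson correlation coefficient with the standard library. The function name `pearson` and the sample values are made up for this example; with real data, a library routine from pandas or NumPy may be more convenient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative values, NOT real country data
calories_per_capita = [2000, 2400, 2800, 3200, 3600]
life_expectancy = [58.0, 64.0, 70.0, 75.0, 79.0]

r = pearson(calories_per_capita, life_expectancy)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 indicates a strong positive linear relationship. It says nothing about causation, which is exactly the kind of point the result discussion in the notebook should address.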

Recommender Systems:

Recommender systems are applied in e-commerce to generate customized recommendations. Well-known examples are the Amazon.com recommendations, which are either distributed by e-mail or presented on the Amazon web page after login. To generate these recommendations, the products which have already been purchased or reviewed by the user are taken into account. In this exercise the currently most popular algorithms for generating recommendations (Collaborative Filtering) are implemented, tested and analysed.
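
To make the idea of Collaborative Filtering concrete, here is a minimal sketch of one possible variant: user-based filtering with cosine similarity. The choice of variant and similarity measure, the rating data and all names (`ratings`, `cosine_similarity`, `recommend`) are assumptions made for this illustration; the lab notebook may use different measures and data.

```python
from math import sqrt

# Hypothetical ratings (user -> {item: rating}); all names are made up.
ratings = {
    "alice": {"book": 5, "lamp": 3, "mug": 4},
    "bob": {"book": 4, "lamp": 2, "mug": 5, "pen": 5},
    "carol": {"book": 1, "lamp": 5, "pen": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted ratings."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    # Weighted average: similar users contribute more to the predicted rating
    scores = {item: totals[item] / weights[item] for item in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


recommendations = recommend("alice", ratings)
print(recommendations)
```

The predicted rating for an unseen item is the average of the other users' ratings, weighted by how similar each of them is to the target user.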

Clustering of music files and automatic playlist generation:

In this exercise a collection of mp3-encoded music files is first transcoded to the .wav format. From the .wav files a comprehensive set of audio features is extracted. The corresponding feature vectors are then clustered, such that each cluster contains similar music files.
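
The clustering step can be sketched as follows. This is a minimal illustration under several assumptions: k-means is used as the clustering algorithm (the source does not name one), the feature vectors are made-up two-dimensional values rather than real audio features, and the file names and helper functions are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic start: the first point, then repeatedly the point
    farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        candidate = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(candidate))
    return centroids


def kmeans(points, k, max_iterations=100):
    """Plain k-means; returns (centroids, cluster index per point)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return centroids, labels


# Made-up feature vectors (hypothetical file names and values)
tracks = {
    "song_a.wav": [0.10, 0.20],
    "song_b.wav": [0.20, 0.10],
    "song_c.wav": [0.15, 0.15],
    "song_d.wav": [0.90, 0.80],
    "song_e.wav": [0.80, 0.90],
    "song_f.wav": [0.85, 0.85],
}

centroids, labels = kmeans(list(tracks.values()), k=2)

# Each cluster becomes one playlist
playlists = {}
for name, label in zip(tracks, labels):
    playlists.setdefault(label, []).append(name)
print(playlists)
```

Tracks whose feature vectors lie close together end up in the same playlist. With real audio features, the vectors usually need to be normalised first so that no single feature dominates the distance.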

Spam Filter:

A Naive Bayes classifier is implemented for filtering spam. It is also shown how to apply this algorithm to document classification in general.
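
A Naive Bayes text classifier can be sketched compactly. The version below is an illustrative assumption, not the notebook's reference solution: it uses word counts with add-one (Laplace) smoothing and log probabilities. The class name, the whitespace tokenisation and the training sentences are made up.

```python
from collections import Counter, defaultdict
from math import log


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def log_score(self, text, label):
        """log P(label) + sum of log P(word | label), add-one smoothed."""
        total_docs = sum(self.doc_counts.values())
        score = log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for word in text.lower().split():
            if word not in self.vocabulary:
                continue  # words never seen in training carry no evidence
            count = self.word_counts[label][word]
            score += log((count + 1) / (total_words + len(self.vocabulary)))
        return score

    def classify(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Made-up training documents
classifier = NaiveBayes()
classifier.train("win money now", "spam")
classifier.train("cheap money offer", "spam")
classifier.train("meeting agenda for monday", "ham")
classifier.train("lunch on monday", "ham")

print(classifier.classify("cheap money"))      # expected: spam
print(classifier.classify("monday meeting"))   # expected: ham
```

Because the labels are arbitrary strings, the same class works for general document classification by training it with more than two categories.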

Face Recognition:

In this exercise a program for face recognition is implemented. For a given set of training images (biometric face photos), Principal Component Analysis (PCA) is applied to calculate the space of eigenfaces. A photo to be recognized is then transformed into the space of eigenfaces, and the closest training photo is determined.
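
The eigenface procedure described above can be sketched on toy data. The following illustration makes several assumptions: the "photos" are made-up 4-pixel vectors, the leading eigenvectors of the covariance matrix are found by power iteration with deflation so that only the standard library is needed, and all function names are hypothetical. With real images, a numerical library such as NumPy is the more practical choice.

```python
import random
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_eigenvectors(matrix, k, iterations=200, seed=0):
    """Leading k eigenvectors of a symmetric positive semi-definite matrix."""
    rng = random.Random(seed)
    m = [row[:] for row in matrix]
    n = len(m)
    vectors = []
    for _ in range(k):
        v = [rng.random() for _ in range(n)]
        for _ in range(iterations):
            w = [dot(row, v) for row in m]
            norm = sqrt(dot(w, w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in m])
        vectors.append(v)
        # Deflation: remove the found component before the next search
        m = [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]
    return vectors


def train(faces, k):
    """Return mean face, eigenfaces and the projection of each training face."""
    names = list(faces)
    n, dim = len(names), len(faces[names[0]])
    mean = [sum(faces[name][i] for name in names) / n for i in range(dim)]
    centered = {name: [x - m for x, m in zip(faces[name], mean)] for name in names}
    cov = [[sum(centered[name][i] * centered[name][j] for name in names) / n
            for j in range(dim)] for i in range(dim)]
    eigenfaces = top_eigenvectors(cov, k)
    projections = {name: [dot(centered[name], e) for e in eigenfaces] for name in names}
    return mean, eigenfaces, projections


def recognise(photo, mean, eigenfaces, projections):
    """Project the photo into eigenface space and return the nearest training face."""
    centered = [x - m for x, m in zip(photo, mean)]
    coords = [dot(centered, e) for e in eigenfaces]
    return min(projections, key=lambda name: sum(
        (a - b) ** 2 for a, b in zip(coords, projections[name])))


# Made-up 4-pixel "photos"
faces = {
    "anna": [1.0, 0.9, 0.1, 0.0],
    "ben": [0.0, 0.1, 0.9, 1.0],
    "cleo": [0.9, 0.1, 0.9, 0.1],
}
mean, eigenfaces, projections = train(faces, k=2)
print(recognise([0.9, 1.0, 0.0, 0.1], mean, eigenfaces, projections))
```

The query photo is a slightly distorted version of the first training face, so the nearest neighbour in eigenface space should be that face.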

Traffic Sign Recognition with Deep Neural Networks:

In this exercise a Convolutional Neural Network (CNN) for the recognition of German traffic signs must be implemented, using TensorFlow and Keras.
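
In the lab the network is built from Keras layers such as Conv2D and MaxPooling2D. To show what such layers compute, the following sketch implements one convolution, a ReLU activation and a 2x2 max pooling in plain Python. It is an illustration under assumptions: the image is a made-up 6x6 pattern, the kernel is hand-picked rather than learned, and the function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation,
    which is what CNN layers commonly compute)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Set negative responses to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 block."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


# Made-up 6x6 image: dark left half, bright right half (a vertical edge)
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hand-picked vertical-edge kernel; in a trained CNN these weights are learned
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = relu(conv2d_valid(image, kernel))
pooled = max_pool_2x2(feature_map)
print(feature_map)
print(pooled)
```

The feature map responds only where the kernel overlaps the edge, and pooling halves the resolution while keeping the strongest responses. A CNN stacks many such layers and learns the kernel weights from the training images.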

Dates and Documents

The links in the table below refer to the exercise instruction notebooks. To work with the notebooks interactively, however, they must be downloaded. All notebooks and resources can be downloaded from the GitLab project. To execute the notebooks, Python and Jupyter must be installed. It is strongly recommended to install the Anaconda Python distribution, which contains not only Python and Jupyter but also nearly all packages required in this lab. See the Tipps&Tricks notebook in the GitLab repo of this course for further hints on setting up the development environment.
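
After installation, a quick check can reveal which packages are still missing. The snippet below is a small sketch using only the standard library; the package list in `REQUIRED` is an assumption for illustration, since the notebooks themselves determine what is actually needed.

```python
from importlib.util import find_spec

# Assumed package list (import names); adjust it to what the notebooks import.
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn", "tensorflow"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing packages:", missing_packages(REQUIRED) or "none")
```

Packages reported as missing can then be installed with the package manager of the chosen distribution.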

Date | Title | Document Links
21.04.2020 | Introduction, Organizational aspects |
28.04.2020 | Registration, Python Introduction, Environment Setup | Data Science Programming Course
05.05.2020 | Global Health Data | Global Health Data (.ipynb)
12.05.2020 | Global Health Data |
19.05.2020 | Collaborative Recommender Systems | Recommender Systems (.ipynb)
26.05.2020 | Collaborative Recommender Systems |
09.06.2020 | Music Clustering | Music Clustering (.ipynb)
16.06.2020 | Document Classification | Document Classification (.ipynb)
23.06.2020 | Face Recognition | Face Recognition (.ipynb)
30.06.2020 | Traffic Sign Recognition, Convolutional Neural Networks (CNNs) | Traffic Sign Classification (.ipynb)

Literature

  • Programming Collective Intelligence: Building Smart Web 2.0 Applications (23 August 2007) by Toby Segaran
  • Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems) (22 June 2005) by I. H. Witten, Eibe Frank
  • Natural Language Processing with Python (2009) by Steven Bird, Ewan Klein, Edward Loper