|Credits||2SWS/3ECTS for 113446b + 4SWS/6ECTS for 113446a = 6SWS/9ECTS|
Note: This lecture and the lecture Natural Language Processing together constitute the module 113446 Data Mining. It is not possible to attend and receive credit for only one of these lectures. However, the two lectures need not be attended in the same term. Each lecture is graded independently, and the module grade is the weighted average of the two grades. The weights are:
- 3⁄5 for lab exercise data mining and pattern recognition
- 2⁄5 for the NLP lecture.
First lesson in term SS 18: Monday, 26.03.2018
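As a worked example of the weighting described above, the module grade can be computed as follows. This is only a minimal sketch: the function name and the sample grades are illustrative, and any rounding to official grade steps is not specified here and is therefore left out.

```python
# Weights as stated in the module description.
LAB_WEIGHT = 3 / 5  # lab exercise data mining and pattern recognition
NLP_WEIGHT = 2 / 5  # NLP lecture


def module_grade(lab_grade: float, nlp_grade: float) -> float:
    """Weighted average of the two independently awarded grades."""
    return LAB_WEIGHT * lab_grade + NLP_WEIGHT * nlp_grade


# Illustrative grades only: 0.6 * 1.7 + 0.4 * 2.3 = 1.94
example = round(module_grade(1.7, 2.3), 2)
print(example)
```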
Data Mining Lab: Contents
In this course, every student group implements 6 different data mining and pattern recognition applications. A group consists of at most 3 students. Each application should be implemented within one session, unless more sessions are allocated to it. The applications to be implemented are described in the subsections below.
For each of the 6 lab exercises:
- A Jupyter notebook is provided, which contains the task description and questions.
- Students have to prepare before the exercise date. For focused preparation, the Jupyter notebook of each exercise contains a list of preparation questions. These questions will be asked before the exercise (short oral test).
- The tasks formulated in the Jupyter notebook must be implemented in its code cells. Moreover, the questions must be answered in the Jupyter notebook.
- Important: Even though it is not always explicitly stated, the obtained results must be discussed scientifically: try to explain the results, document what you find interesting, propose improvements, … This discussion must also be included in the Jupyter notebook.
- The completed Jupyter notebooks (as described in the previous items) must be submitted to the lecturer. Each notebook is due immediately before the start of the next lab exercise.
- Each exercise is marked. The final mark is the average of all 6 marks. Each mark is determined by the quality of the protocol and by the performance in the short oral test at the start of the exercise.
- Unexcused absence yields a submark of 4.7.
Data Analysis and Gender-Age Group Prediction of Mobile Users:
In this exercise, data of more than 60000 Chinese mobile users is analysed. The data has been published on Kaggle for the TalkingData Mobile User Demographics competition. The tasks in this exercise are:
- Calculate overall statistics such as distribution of mobile users over gender-age groups, distribution of used smartphone brands, distribution of app-category-usage
- Analyse single user behavior over time and location
- Preprocess data and extract meaningful features for …
- …prediction of gender and age of mobile users
For this exercise at least two sessions are allocated.
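The overall statistics listed above are essentially relative frequency counts. The following sketch shows the idea on a handful of made-up records; the field names, the brand names and the single age boundary of 30 are assumptions for illustration and do not reflect the actual layout of the Kaggle data.

```python
from collections import Counter

# Hypothetical toy records; the real data set is far larger and its
# column names and age groups may differ.
users = [
    {"gender": "M", "age": 24, "brand": "BrandA"},
    {"gender": "M", "age": 35, "brand": "BrandB"},
    {"gender": "F", "age": 28, "brand": "BrandA"},
    {"gender": "F", "age": 41, "brand": "BrandC"},
    {"gender": "M", "age": 22, "brand": "BrandA"},
]


def gender_age_group(user, boundary=30):
    """Combine gender and a coarse age bucket into one group label."""
    bucket = f"<{boundary}" if user["age"] < boundary else f"{boundary}+"
    return user["gender"] + bucket


def relative_frequencies(values):
    """Map each distinct value to its share of all observations."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


group_distribution = relative_frequencies(gender_age_group(u) for u in users)
brand_distribution = relative_frequencies(u["brand"] for u in users)
```

The same group labels can later serve as the target variable when features are extracted for the prediction task.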
Recommender systems are applied in e-commerce to generate customized recommendations. Well-known examples are the Amazon.com recommendations, which are either distributed by e-mail or presented on the Amazon web page after login. To generate these recommendations, the products which the user has already purchased or reviewed are taken into account. In this exercise the currently most popular algorithms (Collaborative Filtering) for generating recommendations are implemented, tested and analysed.
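A minimal sketch of one collaborative filtering variant (user-based, with cosine similarity over co-rated items) may help to fix the idea. The ratings, user names and function names are invented for illustration; the exercise notebook may use a different similarity measure (for example Pearson correlation) or an item-based scheme.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"book": 5.0, "lamp": 3.0, "phone": 4.0},
    "bob": {"book": 4.0, "lamp": 2.0, "phone": 5.0, "tent": 4.0},
    "carol": {"book": 1.0, "lamp": 5.0, "tent": 2.0},
}


def cosine_similarity(a, b):
    """Cosine similarity of two users, restricted to items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict(user, item, ratings):
    """Similarity-weighted average of the other users' ratings for item."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        numerator += sim * other_ratings[item]
        denominator += sim
    return numerator / denominator if denominator else None


def recommend(user, ratings, top_n=3):
    """Rank the items the user has not rated yet by predicted rating."""
    all_items = {item for r in ratings.values() for item in r}
    unseen = all_items - set(ratings[user])
    scored = [(predict(user, item, ratings), item) for item in unseen]
    scored = [(score, item) for score, item in scored if score is not None]
    return [item for score, item in sorted(scored, reverse=True)[:top_n]]


tent_prediction = predict("alice", "tent", ratings)
recommendations = recommend("alice", ratings)
```

Because alice's ratings resemble bob's more than carol's, the predicted rating for the tent is pulled towards bob's rating.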
Clustering of music files and automatic playlist generation:
In this exercise a collection of mp3-encoded music files is first transcoded to the .wav format. From the .wav files a comprehensive set of audio features is extracted. The corresponding feature vectors are then clustered, such that each cluster contains similar music files.
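The clustering step can be sketched as follows. The text above does not name a clustering algorithm, so plain k-means is used here purely as an assumption, and the six two-dimensional feature vectors are made-up stand-ins for real audio features. NumPy is assumed to be installed.

```python
import numpy as np


def standardize(features):
    """Scale every feature to zero mean and unit variance."""
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (features - features.mean(axis=0)) / std


def kmeans(features, k, n_iter=50, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        distances = np.linalg.norm(
            features[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        new_centroids = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]  # keep the old centroid of an empty cluster
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical feature vectors of six tracks (two features per track).
features = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1],
])
labels, centroids = kmeans(standardize(features), k=2)
```

Grouping the track indices by their label then yields one candidate playlist per cluster.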
A Naive Bayes classifier is implemented for filtering spam. It is also shown how to apply this algorithm to document classification in general.
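To illustrate the principle, here is a small multinomial Naive Bayes sketch with Laplace smoothing. The class name, the whitespace tokenizer and the four training messages are assumptions made for this example; the notebook's own implementation may differ in all of these.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated, lowercased words."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()  # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def log_posterior(self, text, label):
        """Unnormalised log P(label | text) = log prior + sum of log likelihoods."""
        total_docs = sum(self.doc_counts.values())
        log_prob = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denominator = total_words + self.alpha * len(self.vocabulary)
        for word in text.lower().split():
            count = self.word_counts[label][word]
            log_prob += math.log((count + self.alpha) / denominator)
        return log_prob

    def classify(self, text):
        return max(self.doc_counts, key=lambda lbl: self.log_posterior(text, lbl))


# Hypothetical training messages.
classifier = NaiveBayes()
classifier.train("win money now", "spam")
classifier.train("cheap money offer", "spam")
classifier.train("meeting at noon", "ham")
classifier.train("project meeting tomorrow", "ham")
```

Since the labels are arbitrary strings, the same class works for general document classification with more than two categories. Working with log probabilities avoids numerical underflow for long documents.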
Document Classification and Feature Extraction:
In this exercise a large number of RSS news feeds is collected. All articles coming from the different feeds are clustered using non-negative matrix factorisation. Essential features of each document cluster are extracted.
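The factorisation step can be sketched with the classic multiplicative update rules for the squared error. The tiny document-term matrix, the vocabulary and the choice of update rule are assumptions for illustration only; NumPy is assumed to be installed, and a library implementation may be used in the exercise instead.

```python
import numpy as np


def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorise a non-negative matrix V (docs x words) into W @ H.

    W (docs x topics) holds the topic weights of each document,
    H (topics x words) holds the word weights of each topic.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = V.shape
    W = rng.random((n_docs, n_topics)) + eps
    H = rng.random((n_topics, n_words)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep all entries non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def top_words(H, vocabulary, n=2):
    """The n highest-weighted words of every topic (row of H)."""
    return [[vocabulary[i] for i in np.argsort(row)[::-1][:n]] for row in H]


# Hypothetical word counts of four short articles.
vocabulary = ["goal", "match", "team", "election", "vote", "party"]
V = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 2, 3, 1],
], dtype=float)

W, H = nmf(V, n_topics=2)
clusters = W.argmax(axis=1)  # assign each article to its strongest topic
features_per_cluster = top_words(H, vocabulary)
```

Each article is assigned to the topic with the largest weight in its row of W, and the highest-weighted words in each row of H act as the characteristic features of that cluster.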
In this exercise a program for face recognition is implemented. For a given set of training images (biometric face photos), Principal Component Analysis (PCA) is applied to calculate the space of eigenfaces. A photo which has to be recognized is then transformed into the space of eigenfaces, and the closest training photo is determined.
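The procedure described above can be sketched in a few lines. Random vectors stand in for flattened face photos here, PCA is computed via a singular value decomposition, and the Euclidean distance is used to find the closest training photo; all of these are assumptions for illustration, as are the function names. NumPy is assumed to be installed.

```python
import numpy as np


def fit_eigenfaces(train_images, n_components):
    """train_images: array (n_images, n_pixels), one flattened image per row."""
    mean_face = train_images.mean(axis=0)
    centered = train_images - mean_face
    # The rows of vt are the principal axes (eigenfaces), ordered by
    # decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = vt[:n_components]
    train_coords = centered @ eigenfaces.T  # training photos in eigenface space
    return mean_face, eigenfaces, train_coords


def recognize(image, mean_face, eigenfaces, train_coords):
    """Index of the training photo closest to image in eigenface space."""
    coords = (image - mean_face) @ eigenfaces.T
    distances = np.linalg.norm(train_coords - coords, axis=1)
    return int(distances.argmin())


# Hypothetical data: 5 "photos" with 64 pixels each.
rng = np.random.default_rng(0)
train_images = rng.random((5, 64))
mean_face, eigenfaces, train_coords = fit_eigenfaces(train_images, n_components=4)

# A slightly noisy copy of training photo 3 should be matched to index 3.
query = train_images[3] + 0.01 * rng.standard_normal(64)
match = recognize(query, mean_face, eigenfaces, train_coords)
```

With real photos, all images must have the same size and be flattened to one row each before the mean face is subtracted.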
Dates and Documents
The links in the table below refer to the exercise instruction notebooks. However, to work with the notebooks interactively, they must be downloaded. All notebooks and resources can be downloaded from the GitLab project. To execute Jupyter notebooks, Python and Jupyter must be installed. It is strongly recommended to install the Anaconda Python distribution. This distribution contains not only Python and Jupyter but also nearly all packages which are required in this lab exercise.
|26.03.2018||Introduction, Organizational aspects|
|09.04.2018||Registration, Python Introduction, Environment Setup||Python (.html), Numpy (.ipynb), Matplotlib (.ipynb), Pandas (.ipynb), Exercises (.ipynb)|
|16.04.2018||Mobile Users Analysis and Gender-/Age-Prediction||Mobile User Analysis (.ipynb)|
|23.04.2018||Mobile Users Analysis and Gender-/Age-Prediction|
|30.04.2018||Collaborative Recommender Systems||Recommender Systems (.ipynb)|
|07.05.2018||Collaborative Recommender Systems|
|14.05.2018||Music Clustering||Music Clustering (.ipynb)|
|28.05.2018||Document Classification||Document Classification (.ipynb)|
|04.06.2018||Topic Extraction / Document Clustering||Topic Extraction (.ipynb)|
|11.06.2018||Face Recognition||Face Recognition (.ipynb)|