Lecture “Big Data Analytics”

Description

This lecture introduces theory and techniques for analyzing large volumes of data. Big data is usually created by experiments, observations or humans. Beyond its sheer volume, such data is characterized by the velocity at which it is produced, the variability of its structure, its often suboptimal quality and its inherent value.

We can gain knowledge by analyzing this data using techniques from statistics and machine learning. Global players such as Google and Facebook use these techniques for targeted advertising to optimize revenue. However, the techniques are equally applicable in a scientific context.

In the exercises, selected open source tools such as Apache Pig, Hive, Spark or Neo4j are used to reveal interesting properties of publicly available data sets. The exercises teach the languages R and Python and build upon them.
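
As a rough illustration of the kind of analysis practiced in the exercises, the following minimal Python sketch counts word frequencies in a map-reduce style. The sample lines and the function names `map_words` and `reduce_counts` are made up for illustration and are not taken from the course material.

```python
from collections import Counter
from itertools import chain
import re


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in a line of text."""
    return [(word, 1) for word in re.findall(r"[a-z0-9']+", line.lower())]


def reduce_counts(pairs):
    """Reduce step: sum the counts of all pairs that share the same word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


# Hypothetical sample input; a real exercise would read a data set from disk.
lines = [
    "Big data is big",
    "Data analytics needs data",
]

word_counts = reduce_counts(chain.from_iterable(map_words(l) for l in lines))
print(word_counts.most_common(2))  # [('data', 3), ('big', 2)]
```

Tools such as Hadoop or Spark apply the same two-step idea, but distribute the map and reduce work across many machines.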

Target Audience

The lecture is a “Wahlpflichtmodul/Vertiefung” (elective/specialization module) in the Master of Computer Science; interested students of other degree programs are also welcome – please contact the organizer.

Attendees are expected to have experience in at least one programming language (e.g., Java). Knowledge of Python, SQL and machine learning is helpful but not required.

Information about the course

Location		DKRZ, room 034
Time lecture		Friday 12:15 - 13:45
Time exercise		Friday 14:00 - 15:30
First meeting		Friday 2017-10-20 12:15
Mailing list		BD-1718
Language		English

Note that it is mandatory to subscribe to the mailing list.

Lecturer

Prof. Dr. Julian Kunkel

Schedule and material

2017-10-20 - Topic 1. Introduction
Slides – Slides Visual Analytics – Exercise
- Big Data Challenges and Characteristics, Analytical Workflows, Use Cases, Programming
- Exercise 1: Introduction to R and Python, data processing: CSV-files and raw text, basic data visualization
2017-10-27 - Room change: we meet at Bundesstraße 43 (Bioinformatik, ZBH), Room 16
- Topic 2. Data Models and Processing and Statistics: A Primer
  Slides (data models) – Slides (statistics) – Exercise
- Exercise 2: Data exploration, data cleaning/extraction using the Wikipedia dataset
2017-11-03 - Topic 3. Databases and Data-Warehouses
Slides – Exercise
- Exercise 3: Data Exploration (Diamond), Data Ingestion Using PostgreSQL, Database Schema: Relational & OLAP Cube
2017-11-10 - Topic 4. Map Reduce & Hadoop
Slides – Exercise
- Exercise 4: Data exploration (Chicago-Crime), Map-Reduce, visualization of Wikipedia Data
2017-11-17 - Guest talk by Dr. Philipp Neumann: Exascale Computing and Big Data
Slides – Exercise
- Exercise 5: Data exploration (Wikipedia), data cleaning (IMDb), JOIN using Map-Reduce
2017-11-24 - Topic 5. Machine Learning
Slides – Exercise
- Exercise 6: Data exploration (Weather data), spatial data in PostGIS, machine learning
2017-12-01 - Topic 6. Processing Relational Data with Hive
Slides – Exercise
- Exercise 7: Data exploration (IMDb Quotes), classification (Titanic), word frequencies/external scripts (Hive)
2017-12-08 - Topic 7. Graph Processing with Neo4j and REST APIs
Slides (Neo4j) – Slides (REST) – Exercise
- Exercise 8: Neo4j Cypher data model, queries, clustering Wikipedia, HDFS REST API
2017-12-15 - Topic 8. Columnar Access with HBase and Document Storage with MongoDB
Slides (HBase) – Slides (Mongo) – Exercise
- Exercise 9: Columnar model and import, classification of Wikipedia, document data model
2017-12-22 - Topic 12. Overview of Tools in the Hadoop Ecosystem
Slides
2018-01-12 - Topic 9. Data Flow Languages & Pig Latin and Performance Aspects
Slides (Pig) – Slides (Performance) – Exercise
- Exercise 10: Data flow programming, Pig examples, performance analysis
2018-01-19 - Topic 10. In-Memory Computation with Spark
Slides – Exercise
- Exercise 11: Spark basics, distance metrics, clustering, performance analysis
2018-01-26 - Topic 11. Stream Processing (Storm, Spark, Flink)
Slides – Exercise
- Exercise 12: Find movie quotes, streaming data model for crime data
2018-02-06 Examination 10:00 (60 minutes), DKRZ, R034
2018-02-27 Examination 10:00 (60 minutes), DKRZ, R023 (changed from R034)

Literature

Various R topics: course “Programmierung in R”
Book: Data Science for Dummies, Lillian Pierson, Wiley
Book: Big Data - Principles and best practices of scalable real-time data systems, Nathan Marz and James Warren, Manning
Hortonworks Data Platform: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/index.html
Introductions to programming languages
- Python: Interactive tutorial
- R: Interactive tutorial
- Java: Interactive tutorial
R Books:
- R Packages
- Advanced R
- ggplot2 book
- Machine Learning with R, Second Edition, Brett Lantz, 2015
Python Books:
- Python Data Science Handbook, 2016
Interesting tools:
- http://ipython.org/notebook.html Python Notebook, comparable to a lab notebook with experiment descriptions and results.
Cheat sheets:
- For various R packages
Resource for data science: https://www.kaggle.com/
