Lecture “Big Data Analytics”


This lecture introduces theory and techniques for analyzing large volumes of data. Big data is typically produced by experiments, observations, or humans. Besides sheer volume, such data can be characterized by the velocity at which it is produced, the variability of its structure, its often suboptimal quality, and its inherent value.

We can gain knowledge from this data by analyzing it with techniques from statistics and machine learning. Global players such as Google and Facebook use these techniques for targeted advertising to optimize revenue, but they are equally applicable in scientific contexts.

In the exercises, selected open-source tools such as Apache Pig, Hive, Spark, and Neo4j are used to reveal interesting properties of publicly available data sets. The exercises teach the languages R and Python and build upon them.
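To give a flavor of the first exercise tasks (CSV processing in Python), here is a minimal sketch using only the standard library; the station/temperature data set and column names are invented for illustration:

```python
import csv
import io

# In the exercises a real data set would be read from disk,
# e.g. with open("data.csv") as f; here we inline a tiny sample.
sample = io.StringIO(
    "station,temperature\n"
    "A,12.5\n"
    "A,13.1\n"
    "B,9.8\n"
)

# Aggregate the temperature readings per station.
totals, counts = {}, {}
for row in csv.DictReader(sample):
    station = row["station"]
    totals[station] = totals.get(station, 0.0) + float(row["temperature"])
    counts[station] = counts.get(station, 0) + 1

# Per-station mean temperature.
means = {s: totals[s] / counts[s] for s in totals}
print(means)
```

The same aggregation could be written in one line with R's `aggregate()` or pandas' `groupby()`; the exercises build up from such basics to larger data sets.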

Target Audience

The lecture is a “Wahlpflichtmodul/Vertiefung” (elective/specialization module) in the Master of Computer Science; interested students from other degree programs are also welcome – please contact the organizer.

Attendees are expected to have experience in a programming language (e.g., Java). Prior knowledge of Python, SQL, or machine learning is not required.

Information about the course

Location: DKRZ, room 034
Lecture time: Friday 12:15 – 13:45
Exercise time: Friday 14:00 – 15:30
First meeting: Friday, 2016-10-21, 12:15
Mailing list: BD-1617
Language: English

Note that it is mandatory to subscribe to the mailing list.


Schedule and material

  • 2016-10-21 - Topic 1. Introduction
    • Big Data Challenges and Characteristics, Analytical Workflows, Use Cases, Programming
    • Exercise 1: Introduction to R and Python, data processing: CSV-files and raw text, basic data visualization
  • 2016-10-28 - Topic 2. Data Models and Processing and Statistics: A Primer
    Slides (data models), Slides (statistics), Exercise
    • Exercise 2: Data exploration, data cleaning/extraction using the Wikipedia dataset
  • 2016-11-04 - Topic 3. Databases and Data-Warehouses
    • Exercise 3: Data Exploration (Diamond), Data Ingestion Using PostgreSQL, Database Schema: Relational & OLAP Cube
  • 2016-11-11 - Topic 4. Map Reduce & Hadoop
    • Exercise 4: Data exploration (Chicago-Crime), Map-Reduce, visualization of Wikipedia Data
  • 2016-11-18 - Guest talk by Dr. Philipp Neumann: Exascale Computing and Big Data
    • Exercise 5: Data exploration (Wikipedia), Data Cleaning IMDB, JOIN using Map-Reduce
  • 2016-11-25 - Topic 5. Machine Learning
    • Exercise 6: Data exploration (Weather data), spatial data in PostGIS, machine learning
  • 2016-12-02 - Topic 6. Processing Relational Data with Hive
    • Exercise 7: Data exploration (IMDb Quotes), classification (titanic), word frequencies/external scripts (Hive)
  • 2016-12-09 - Topic 7. Graph Processing with Neo4J and REST APIs
    Slides (Neo4j), Slides (REST), Exercise
    • Exercise 8: Neo4j Cypher data model, queries, clustering Wikipedia, HDFS REST API
  • 2016-12-16 - Topic 8. Columnar Access with HBase and Document Storage with MongoDB
    Slides (HBase), Slides (Mongo), Exercise
    • Exercise 9: HBase data model and import, classification of Wikipedia, document data model
  • 2016-12-23 - Exercises only (no lecture)
  • 2017-01-13 - Topic 9. Data Flow Languages & Pig Latin and Performance Aspects
    Slides (Pig), Slides (Performance), Exercise
    • Exercise 10: Data flow programming, Pig examples, performance analysis
  • 2017-01-20 - Topic 10. In-Memory Computation with Spark
    Slides, Exercise
    • Exercise 11: Spark basics, distance metrics, clustering, performance analysis
  • 2017-01-27 - Topic 11. Stream Processing (with Storm, Spark, Flink)
    Slides, Exercise
    • Exercise 12: Find movie quotes, streaming data model for crime data
  • 2017-02-03 - Topic 12. Overview of Tools in the Hadoop Ecosystem
    Slides
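The map-reduce model from Topic 4 underlies several of the tools on this schedule (Hadoop, Pig, Hive, Spark). As an illustrative sketch (not course material), the classic word count can be expressed in plain Python by spelling out the map, shuffle, and reduce phases over a toy input:

```python
from itertools import groupby
from operator import itemgetter

# Toy input; in Hadoop each line would be a record fed to a mapper.
lines = ["big data analytics", "big data tools"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: bring all pairs with the same key together.
# groupby requires its input to be sorted by the grouping key.
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# Reduce phase: sum the counts for each word.
counts = {word: sum(c for _, c in pairs) for word, pairs in grouped}
print(counts)  # {'analytics': 1, 'big': 2, 'data': 2, 'tools': 1}
```

In a real Hadoop job the map and reduce functions run distributed over many nodes and the framework performs the shuffle; the structure of the computation is the same.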