## Topics

The following list of topics contains references to papers. A seminar topic can comprise more than one paper. (This is the case if the papers are very short or if one of the papers mainly serves to give a better overview.)

If a topic has material for more than one participant, the parts are separated by a line containing `--`.

## Background

#### What is Big Data?

Literature search.

Starting point:

Undefined by data: a survey of Big Data definitions

Jonathan Stuart Ward, Adam Barker

#### Principal Component Analysis

A tutorial on principal component analysis

Jonathon Shlens

Principal component analysis

Jake Lever, Martin Krzywinski, Naomi Altman

#### Hash Functions

Hashing techniques: a survey and taxonomy

Lianhua Chi, Xingquan Zhu

--

Cuckoo filter: practically better than Bloom

Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher

#### Computational models: Streaming, Sketching, MapReduce

Sketching and streaming algorithms for processing massive data

Jelani Nelson

--

Google's MapReduce programming model - revisited

Ralf Lämmel

--

Evaluating MapReduce for multi-core and multiprocessor systems

Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis

#### Support Vector Machines

Support vector machine - a survey

Ashis Pradhan

Applications of support vector machines for pattern recognition: a survey

Hyeran Byun, Seong-Whan Lee

## Algorithms for Big Data

### Volume

#### Dimensionality Reduction

The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction

Kasper Green Larsen, Jelani Nelson

#### Bloom Filters

Network applications of Bloom filters: a survey

Andrei Broder, Michael Mitzenmacher

--

Overview paper (possibly needs some literature search).

An optimal Bloom filter replacement

Anna Pagh, Rasmus Pagh, S. Srinivasa Rao

--

Exact pattern matching with feed-forward Bloom filters

Iulian Moraru, David G. Andersen

#### Probabilistic Counting

Counting large numbers of events in small registers

Robert Morris

Approximate counting: a detailed analysis

Philippe Flajolet

#### Frequency Moments

Optimal approximations of the frequency moments of data streams

Piotr Indyk, David Woodruff

#### Graph Streaming

Graph stream algorithms: a survey

Andrew McGregor

#### Matrix Sketching

Simple and deterministic matrix sketching

Edo Liberty

### Velocity

#### Sublinear-time algorithms

Sublinear-time algorithms

Artur Czumaj, Christian Sohler

### Variety

#### Clustering

Theoretical analysis of the k-means algorithm - a survey

Johannes Blömer, Christiane Lammersen, Melanie Schmidt, Christian Sohler

--

Local search yields a PTAS for k-means in doubling metrics

Zachary Friggstad, Mohsen Rezapour, Mohammad R. Salavatipour

--

Heavy hitters via cluster-preserving clustering

Kasper Green Larsen, Jelani Nelson, Huy L. Nguyen, Mikkel Thorup

--

On Lloyd's algorithm: new theoretical insights for clustering in practice

Cheng Tang, Claire Monteleoni

--

The global k-means clustering algorithm

Aristidis Likas, Nikos Vlassis, Jakob J. Verbeek

## Applications of Algorithms for Big Data

### Biology

The application of principal component analysis to drug discovery and biomedical data

Alessandro Giuliani

### Further Fields: Physics, Astronomy, Economics, etc.

It is also possible to choose a literature search on the use of algorithms for Big Data in other fields.

Working on such a topic requires independent judgement of the quality of the material.

If several participants want to do such a topic, each has to choose a different scientific domain.

## Frameworks and Tools

#### MapReduce, Hadoop

Parallel data processing with MapReduce: a survey

Kyong-Ha Lee, Hyunsik Choi, Bongki Moon

#### Overview of Tools Used in Practice

Literature search on tools like

HAProxy, Elasticsearch, Logstash, Prometheus, Grafana