Topics
The following list of topics contains references to papers. A seminar topic may comprise more than one paper. (This is the case if the papers are very short or if one of the papers mainly serves to give a better overview.)
If a topic has material for more than one participant, the parts are separated by the marker --.
Background
What is Big Data?
Literature search.
Starting point:
Undefined by data: a survey of Big Data definitions
Jonathan Stuart Ward, Adam Barker
Principal Component Analysis
A tutorial on principal component analysis
Jonathon Shlens
Principal component analysis
Jake Lever, Martin Krzywinski, Naomi Altman
Hash Functions
Hashing techniques: a survey and taxonomy
Lianhua Chi, Xingquan Zhu
--
Cuckoo filter: practically better than Bloom
Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher
Computational models: Streaming, Sketching, MapReduce
Sketching and streaming algorithms for processing massive data
Jelani Nelson
--
Google's MapReduce programming model - revisited
Ralf Lämmel
--
Evaluating MapReduce for multi-core and multiprocessor systems
Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis
Support Vector Machines
Support vector machine - a survey
Ashis Pradhan
Applications of support vector machines for pattern recognition: a survey
Hyeran Byun, Seong-Whan Lee
Algorithms for Big Data
Volume
Dimensionality Reduction
The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction
Kasper Green Larsen, Jelani Nelson
Bloom Filters
Network applications of Bloom filters: a survey
Andrei Broder, Michael Mitzenmacher
--
Overview paper (possibly needs some literature search).
An optimal Bloom filter replacement
Anna Pagh, Rasmus Pagh, S. Srinivasa Rao
--
Exact pattern matching with feed-forward Bloom filters
Iulian Moraru, David G. Andersen
Probabilistic Counting
Counting large numbers of events in small registers
Robert Morris
Approximate counting: a detailed analysis
Philippe Flajolet
Frequency Moments
Optimal approximations of the frequency moments of data streams
Piotr Indyk, David Woodruff
Graph Streaming
Graph stream algorithms: a survey
Andrew McGregor
Matrix Sketching
Simple and deterministic matrix sketching
Edo Liberty
Velocity
Sublinear-time Algorithms
Sublinear-time algorithms
Artur Czumaj, Christian Sohler
Variety
Clustering
Theoretical analysis of the k-means algorithm - a survey
Johannes Blömer, Christiane Lammersen, Melanie Schmidt, Christian Sohler
--
Local search yields a PTAS for k-means in doubling metrics
Zachary Friggstad, Mohsen Rezapour, Mohammad R. Salavatipour
--
Heavy hitters via cluster-preserving clustering
Kasper Green Larsen, Jelani Nelson, Huy L. Nguyen, Mikkel Thorup
--
On Lloyd's algorithm: new theoretical insights for clustering in practice
Cheng Tang, Claire Monteleoni
--
The global k-means clustering algorithm
Aristidis Likas, Nikos Vlassis, Jakob J. Verbeek
Applications of Algorithms for Big Data
Biology
The application of principal component analysis to drug discovery and biomedical data
Alessandro Giuliani
Further Fields: Physics, Astronomy, Economics, etc.
Participants may also choose a literature search on the use of algorithms for Big Data in other fields.
Working on such a topic requires judging the quality of the material independently.
If several participants want to do such a topic, each has to choose a different scientific domain.
Frameworks and Tools
MapReduce, Hadoop
Parallel data processing with MapReduce: a survey
Kyong-Ha Lee, Hyunsik Choi, Bongki Moon
Overview of Tools Used in Practice
Literature search on tools such as
HAProxy, Elasticsearch, Logstash, Prometheus, Grafana