Research

Classification of Encrypted Network Traffic without Traffic Decryption Using Machine Learning

In 2015 I worked on a research partnership with an industrial partner, Sandvine Inc. I designed and implemented a classifier that separates encrypted Facebook traffic from non-Facebook traffic and identifies the type of content (such as video, audio, image, …) carried by the encrypted Facebook traffic. The only information available to the classifier is the packet headers and statistical features of the packets; no traffic decryption or deep packet inspection is performed. The generic framework has an online module that classifies encrypted traffic and an offline module that periodically updates the classifier. This kind of identification plays an important role in many areas, including traffic engineering, quality of service and security.
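As a rough illustration of what "statistical features" derived from packet headers can look like, the sketch below summarizes one flow without touching any payload. The `Packet` fields, the `flow_features` name and the particular feature set are assumptions chosen for the example, not the features used in the project.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Packet:
    timestamp: float  # arrival time in seconds
    size: int         # packet length in bytes, as reported in the header
    outbound: bool    # True for client-to-server packets


def flow_features(packets):
    """Summarize one flow using header-derived values only (no payload)."""
    if not packets:
        raise ValueError("flow must contain at least one packet")
    sizes = [p.size for p in packets]
    times = sorted(p.timestamp for p in packets)
    # Inter-arrival gaps between consecutive packets.
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "packet_count": len(packets),
        "mean_size": mean(sizes),
        "std_size": pstdev(sizes),
        "mean_gap": mean(gaps) if gaps else 0.0,
        "bytes_out": sum(p.size for p in packets if p.outbound),
        "bytes_in": sum(p.size for p in packets if not p.outbound),
        "duration": times[-1] - times[0],
    }


# Hypothetical three-packet flow: one small request, two large responses.
example_flow = [
    Packet(0.0, 100, True),
    Packet(0.5, 1500, False),
    Packet(1.0, 1500, False),
]
features = flow_features(example_flow)
```

A feature dictionary of this kind could be fed to any standard classifier; which classifier and which features work best is exactly what the online and offline modules would have to settle.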

Classification of Streaming Data Under Influence of Concept Change

In 2014, as a postdoctoral researcher, I continued my research on machine learning at Dalhousie University, this time with a focus on streaming data (one of the varieties of big data) under the influence of concept drift. The objective of this research is to use streaming algorithms to classify non-stationary streaming data in which the definition of the classes changes over time. Because of this, the traditional offline train-and-test methodology is not effective, and continuous learning has to be employed to track the concept changes. This situation arises in many businesses, for example in constantly changing customer preferences or subscription churn in membership-based services. Furthermore, we assume a limit on the number of data labels used, which we call the ‘label budget’. The idea, based on active learning, is to minimize the number of labels used in training by having the learning algorithm query the label of a data point only if it finds that point useful to the training process.
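The label budget idea can be sketched with a small online learner that asks for a label only when its prediction is uncertain and only while the fraction of labels requested stays within the budget. This is a minimal sketch under stated assumptions: the learner is plain online logistic regression, and the `BudgetedOnlineLearner` class, its uncertainty rule and its parameters are hypothetical stand-ins, not the streaming algorithm used in the research.

```python
import math
import random


class BudgetedOnlineLearner:
    """Online logistic regression that queries labels under a label budget."""

    def __init__(self, n_features, budget=0.1, margin=0.2, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.budget = budget  # maximum fraction of instances that may be labelled
        self.margin = margin  # query only if |p - 0.5| is below this value
        self.lr = lr
        self.seen = 0
        self.queried = 0

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def process(self, x, oracle):
        """Predict the class of x; query oracle(x) and update only if allowed."""
        self.seen += 1
        p = self.predict_proba(x)
        uncertain = abs(p - 0.5) < self.margin
        within_budget = (self.queried + 1) <= self.budget * self.seen
        if uncertain and within_budget:
            y = oracle(x)  # the only place a true label is consumed
            self.queried += 1
            err = y - p
            self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * err
        return 1 if p >= 0.5 else 0


# Hypothetical stream whose concept flips halfway through (concept drift).
rng = random.Random(0)
learner = BudgetedOnlineLearner(n_features=1, budget=0.1)
for t in range(1000):
    x = [rng.uniform(-1.0, 1.0)]
    if t < 500:
        oracle = lambda v: int(v[0] > 0)
    else:
        oracle = lambda v: int(v[0] <= 0)
    learner.process(x, oracle)
```

Because the model keeps updating whenever it is allowed to query, it can follow the flipped concept after the drift point, while `learner.queried` never exceeds the budgeted share of `learner.seen`.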

Symbiotic Evolutionary Subspace Clustering (S-ESC)

Application domains with large attribute spaces, such as genomics and text analysis, necessitate clustering algorithms with a higher complexity than traditional clustering algorithms. More sophisticated approaches are required to cope with the increasing dimensionality and cardinality of these data sets. Subspace clustering, a generalization of traditional clustering, identifies the attribute support for each cluster as well as the location and number of clusters. In the most general case, attributes associated with each cluster could be unique.

The proposed algorithm, Symbiotic Evolutionary Subspace Clustering (S-ESC), borrows from ‘symbiosis’ in the sense that each clustering solution is defined in terms of a host (a single member of the host population) and a number of co-evolved cluster centroids (or symbionts in an independent symbiont population). Symbionts define clusters and therefore attribute subspaces, whereas hosts define sets of clusters to constitute a non-degenerate solution. The symbiotic representation of S-ESC is the key to making it scalable to high-dimensional data sets, while an integrated subsampling process makes it scalable to large-scale data sets. A bi-objective evolutionary method is proposed to identify the unique attribute support of each cluster while detecting its data instances.
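The host and symbiont representation can be pictured with a small data-structure sketch: each symbiont is a centroid defined only on its own attribute subset, and a host lists the symbionts that together form one clustering. The distance function (mean squared difference over the symbiont's attributes) and all names here are assumptions made for illustration; they are not the S-ESC distance measure, objectives or variation operators.

```python
from dataclasses import dataclass


@dataclass
class Symbiont:
    """A cluster centroid defined only on the attributes it supports."""
    center: dict  # attribute index -> coordinate


@dataclass
class Host:
    """A candidate clustering: indices into the shared symbiont population."""
    members: list


def subspace_distance(point, symbiont):
    # Mean squared difference over the symbiont's own attributes, so that
    # centroids with different numbers of attributes stay comparable
    # (an assumption of this sketch).
    diffs = [(point[a] - c) ** 2 for a, c in symbiont.center.items()]
    return sum(diffs) / len(diffs)


def assign(points, host, symbionts):
    """Label each point with the index of the nearest symbiont in the host."""
    labels = []
    for p in points:
        best = min(host.members, key=lambda i: subspace_distance(p, symbionts[i]))
        labels.append(best)
    return labels


# Hypothetical symbiont population over three attributes.
symbionts = [
    Symbiont({0: 0.0}),          # cluster supported by attribute 0 only
    Symbiont({1: 5.0, 2: 5.0}),  # cluster supported by attributes 1 and 2
    Symbiont({0: 9.0}),          # not used by the host below
]
host = Host(members=[0, 1])
points = [(0.1, 9.0, -3.0), (7.0, 5.1, 4.9)]
labels = assign(points, host, symbionts)
```

Since hosts only store indices, several hosts can share the same symbiont, and a clustering with more or fewer clusters is just a host with a longer or shorter member list.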

Benchmarking is performed on a well-known test suite of subspace clustering data sets against four well-known comparator algorithms from both the full-dimensional and subspace clustering literature (EM, MINECLUS, PROCLUS and STATPC) as well as a generic genetic algorithm-based subspace clustering method. The performance of S-ESC was found to be robust across a wide cross-section of data set properties with a common parameterization used throughout. This was not the case for the comparator algorithms: their performance could be sensitive to a particular data distribution, or parameter sweeps might be necessary to reach comparable performance.

A comparison is also made against a non-symbiotic genetic algorithm, in which each individual represents the complete set of clusters comprising a subspace clustering solution. Benchmarking indicates that the proposed symbiotic framework is again superior. For a list of publications about S-ESC, please see the Publication page.