Over the past years the use of anonymization services has gained significant relevance as more users are interested in protecting their data and privacy on the internet. One of the most popular ways to achieve this result is Tor. The anonymity and untraceability that Tor provides, however, can also be used by ill-intentioned users who try to take advantage of bypassing security control and policies. The Cybersecurity and Infrastructure Security Agency (CISA) mentions two methods of recognizing Tor traffic in the enterprise: indicator- or behavior-based analysis. The first one uses log analysis and lists of Tor exit nodes to identify the suspicious activity while the latter inspects patterns in TCP and UDP ports, DNS queries and inspecting the payload of the packets. In this paper, we propose a different approach using white-box machine learning models such as decision trees and Random Forest. On the one hand, our classifier achieves accuracy levels above 95%. On the other hand, our approach is the first one to allow understanding the importance of each traffic feature in the classification. Our results demonstrate that the TCP window size, the frame size and time related traffic features can be used to identify Tor traffic. In this paper we will describe a Machine Learning methodology used to identify Tor network traffic utilizing decision trees C5.0 and Random Forest. We followed a white-box approach and accomplished accuracy of over 95% in the prediction in both models. We also present an analysis of the importance of the top predictor variables.
Available at: https://doi.org/10.1109/LATINCOM50620.2020.9282317