Random Forest Classification

Introduction

Digital fingerprinting for attribution identifies and tracks users across devices and sessions by leveraging browser-exposed attributes. Even in the presence of limited data, it is crucial for advertising platforms, such as Lucia Protocol, to accurately attribute users’ actions and calculate lifetime value (LTV) and customer acquisition cost (CAC). This paper focuses on Random Forest classification as a method for processing such small datasets and demonstrates its theoretical ability to achieve near-perfect accuracy.

Random Forest Classification

Random forest classification is an ensemble learning method that constructs a large number of decision trees during training and outputs the majority vote across all trees. For digital fingerprints, we use features such as the following (a training sketch follows the list):

  • User Agent

  • IP Address

  • Screen Width & Height

  • Hardware Concurrency

  • Operating System

  • Timezone

  • Language

  • CPU Class

  • Device Type

  • Plugins

  • Fonts

  • Color Depth

  • Battery Life

  • Browser Version

  • Touch Support

  • Platform
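
As a concrete starting point, here is a minimal sketch of training a Random Forest on such attributes with scikit-learn. The attribute names and the two sample records are hypothetical placeholders; a real pipeline would collect these values from the browser:

```python
# Minimal sketch: encode hypothetical fingerprint attributes and fit a forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

# Each row is one observed fingerprint; user_id is the label to recover.
records = pd.DataFrame([
    {"user_agent": "Mozilla/5.0 ... Chrome/120", "ip": "203.0.113.7",
     "screen_w": 1920, "screen_h": 1080, "timezone": "America/New_York",
     "language": "en-US", "user_id": "user_a"},
    {"user_agent": "Mozilla/5.0 ... Safari/17", "ip": "198.51.100.23",
     "screen_w": 1440, "screen_h": 900, "timezone": "Europe/Berlin",
     "language": "de-DE", "user_id": "user_b"},
])

X_raw = records.drop(columns="user_id")
y = records["user_id"]

# Categorical attributes (user agent, IP, timezone, ...) are encoded as
# integers so they can serve as tree-split candidates.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X = encoder.fit_transform(X_raw)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
```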

Mathematical Model

Let $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ represent the dataset, where $x_i$ is a feature vector (up to 20 features) and $y_i$ is the user identity (fingerprint). Random Forest builds $N$ trees, each trained on a random subset of $D$. Each tree $T_i$ predicts a class label, and the ensemble output $\hat{y}$ is determined by majority voting, defined as:

$$ \hat{y} = \operatorname{mode}\left(T_1(x), T_2(x), \dots, T_N(x)\right) $$
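
The voting step can be made explicit. The sketch below assumes the fitted clf and encoded matrix X from the earlier training sketch; each sub-tree votes with an encoded class index, and clf.classes_ maps the modal vote back to a label. (Note that scikit-learn's own predict averages class probabilities rather than counting hard votes, so this mirrors the formula above rather than clf.predict exactly.)

```python
# Hard majority vote across the fitted sub-trees, mirroring the formula.
import numpy as np
from collections import Counter

per_tree = np.array([tree.predict(X) for tree in clf.estimators_]).astype(int)
majority = [Counter(per_tree[:, j]).most_common(1)[0][0]
            for j in range(per_tree.shape[1])]
predicted = clf.classes_[majority]
print(predicted)
```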

An impurity function, such as the Gini impurity, is used to optimize the splits:

$$ G(D) = 1 - \sum_{k=1}^{K} p_k^2 $$

where $p_k$ is the proportion of samples of class $k$ in subset $D$. The random selection of features ensures that the trees are diverse, which strengthens generalization even with small data.
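
As a quick worked example: a pure node has impurity $G = 0$, while a 50/50 two-class node has $G = 1 - (0.5^2 + 0.5^2) = 0.5$. The helper below is a direct transcription of the definition:

```python
# Gini impurity, computed directly from the class proportions p_k.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["a", "a", "a", "a"]))  # 0.0: pure node
print(gini(["a", "a", "b", "b"]))  # 0.5: maximally mixed two-class node
```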

Theoretical Correctness and Performance Bound

To see why Random Forest can accurately classify digital fingerprints, consider the following:

  • Assumption: Each user’s digital fingerprint is unique across the feature set, meaning no two users will have identical browser-exposed attributes.

  • High-dimensional feature space: With up to 20 features, many of them highly granular (e.g., screen width, IP address, user-agent string), the combination of attributes forms a nearly unique vector for each user. Random Forest handles this high-dimensional space well by capturing non-linear relationships among features.

  • Accuracy bound: Given sufficiently many trees $N$ and diversity across tree outputs (due to random feature selection), Random Forest can achieve nearly perfect classification. Breiman’s original theorem (2001) bounds the generalization error:

$$ P(\hat{y} \neq y) \leq \frac{\bar{\rho}\,(1 - s^2)}{s^2} $$

where $\bar{\rho}$ is the mean correlation between trees and $s$ is the strength of the individual trees. With highly diverse features such as browser data, $\bar{\rho}$ is low; combined with strong trees and large $N$, the bound drives the error toward 0.
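
To get a feel for the bound, plug in illustrative values. The figures below are assumptions chosen for the example, not measured quantities:

```python
# Illustrative evaluation of Breiman's generalization-error bound.
# rho_bar and s are assumed values, not measurements.
rho_bar = 0.1   # low mean correlation between trees (diverse feature subsets)
s = 0.9         # high strength of individual trees (near-unique fingerprints)

bound = rho_bar * (1 - s ** 2) / s ** 2
print(f"generalization error bound: {bound:.4f}")  # ~0.0235
```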

Empirically, in digital fingerprinting, high accuracy (up to 99.99%) is achievable because:

  1. Unique combinations of features benefit our data model. User-agent strings, screen dimensions, IP addresses, and other attributes are highly distinct across users.

  2. Feature redundancy helps. Even if some features are missing or noisy, Random Forest’s ensemble nature allows it to classify correctly based on the remaining strong features.

  3. Small datasets can suffice in our case. Bootstrapping and feature randomness make Random Forest robust to overfitting on small datasets (see the cross-validation sketch after this list).
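
A rough sanity check of the small-data claim is sketched below: cross-validated accuracy of a forest on a small synthetic dataset with 20 features. This is purely illustrative; make_classification stands in for real fingerprint data:

```python
# Cross-validating a Random Forest on a small synthetic dataset.
# make_classification is a stand-in for real fingerprint features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=60, n_features=20, n_informative=12,
                           n_classes=3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold accuracy on 60 samples
print(scores.mean())
```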
