Random Forest Classification
Digital fingerprinting for attribution involves identifying and tracking users across devices and sessions by leveraging browser attributes. Even with limited data, it is crucial for advertising platforms such as Lucia Protocol to accurately attribute users' actions and calculate lifetime value (LTV) and customer acquisition cost (CAC). This paper focuses on Random Forest classification as a method for processing such small datasets and demonstrates its theoretical ability to achieve near-perfect accuracy.
Random Forest classification is an ensemble learning method that constructs a large number of decision trees during training and outputs the majority vote across all trees. For digital fingerprints, we use features such as:
Operating System
Timezone
Language
CPU Class
Device Type
Plugins
Fonts
Color Depth
Battery Life
Browser Version
Touch Support
Platform
Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ represent the dataset, where $x_i$ is a feature vector (up to 20 features) and $y_i$ is the user identity (fingerprint). Random Forest builds $T$ trees, each trained on a random bootstrap subset of $D$. Each tree $t$ predicts a class label $\hat{y}_t$, and the ensemble output is decided by majority voting, defined as:

$$ \hat{y} = \operatorname{mode}\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T\} $$
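As a minimal sketch of this setup (assuming scikit-learn, with hypothetical attribute values and user labels, not Lucia Protocol's production pipeline), the browser attributes listed above can be one-hot encoded into feature vectors $x_i$ and fed to a Random Forest:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# Hypothetical fingerprint records drawn from the attribute list above.
fingerprints = [
    {"os": "Windows", "timezone": "UTC-5",   "language": "en-US", "color_depth": 24, "touch": 0},
    {"os": "macOS",   "timezone": "UTC+1",   "language": "fr-FR", "color_depth": 30, "touch": 0},
    {"os": "Android", "timezone": "UTC+5.5", "language": "hi-IN", "color_depth": 24, "touch": 1},
]
user_ids = ["user_a", "user_b", "user_c"]  # y_i: the user identity labels

# One-hot encode categorical attributes into feature vectors x_i.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(fingerprints)

# T = 100 trees, each fit on a bootstrap sample of D; prediction is the majority vote.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, user_ids)

# Attribute a returning session to a known user.
session = {"os": "Windows", "timezone": "UTC-5", "language": "en-US", "color_depth": 24, "touch": 0}
print(clf.predict(vectorizer.transform([session])))  # ['user_a']
```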
An impurity function, such as Gini impurity, is used to optimize the splits:

$$ G = 1 - \sum_{k=1}^{K} p_k^2 $$

where $p_k$ is the proportion of samples of class $k$ in the subset.
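For illustration, the helper below computes Gini impurity for a node's labels; it is a standalone sketch rather than scikit-learn's internal implementation:

```python
from collections import Counter

def gini_impurity(labels):
    """G = 1 - sum_k p_k^2, where p_k is the proportion of class k in the subset."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["user_a"] * 4))            # 0.0: pure node, ideal split
print(gini_impurity(["user_a", "user_b"] * 2))  # 0.5: maximally mixed for two classes
```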
To prove Random Forest's ability to accurately classify digital fingerprints, consider the following:
Assumption: Each user's digital fingerprint is unique across the feature set, meaning no two users will have identical browser-exposed attributes.
High-dimensional feature space: With up to 20 features, many of which contain highly granular information (e.g., screen width, IP address, user-agent string), the combination of attributes creates a nearly unique vector for each user. Random Forest excels at handling this high-dimensional space by effectively capturing non-linear relationships.
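A back-of-the-envelope sketch makes the near-uniqueness concrete: multiplying per-attribute cardinalities (the counts below are assumed for illustration, not measured) yields a combination space vastly larger than any realistic user base.

```python
from math import prod

# Assumed (illustrative) number of distinct observed values per attribute.
cardinalities = {
    "operating_system": 10,
    "timezone": 38,
    "language": 100,
    "screen_dimensions": 500,
    "browser_version": 200,
    "installed_fonts": 1000,
}

combinations = prod(cardinalities.values())
print(f"{combinations:,} possible attribute combinations")  # 3,800,000,000,000 under these assumptions
```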
Empirically, in digital fingerprinting, high accuracy (up to 99.99%) is achievable because:
Unique combinations of features benefit our data model. User-agent strings, screen dimensions, IP addresses, and other features are highly distinct across users.
Feature redundancy is apparent. Even if some features are missing or noisy, Random Forest's ensemble nature allows it to classify correctly based on the remaining strong features.
Small-dataset sufficiency is relevant in our case. Bootstrapping and feature randomness make Random Forest robust to overfitting on small datasets.
The random selection of features ensures that the trees are diverse, which strengthens generalization even with small data.
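One way to check this robustness is scikit-learn's out-of-bag (OOB) estimate, which scores each sample using only the trees that did not see it during bootstrapping. The sketch below uses a small synthetic dataset (50 users, 20 noisy features) purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for fingerprints: 50 users, 20 near-unique integer features.
n_users, n_features = 50, 20
base = rng.integers(0, 100, size=(n_users, n_features)).astype(float)
labels = np.arange(n_users)

# Two noisy repeat "sessions" per user simulate shifted or imprecise attributes.
X = np.vstack([base, base + rng.normal(0, 0.5, base.shape), base + rng.normal(0, 0.5, base.shape)])
y = np.concatenate([labels, labels, labels])

clf = RandomForestClassifier(n_estimators=300, bootstrap=True, oob_score=True, random_state=0)
clf.fit(X, y)
print(f"OOB accuracy: {clf.oob_score_:.3f}")  # typically near 1.0 under these assumptions
```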
Accuracy Bound: Given sufficient trees and assuming diversity across tree outputs (due to random feature selection), Random Forest can achieve nearly perfect classification. Breiman's original theorem (2001) on Random Forests bounds the generalization error:

$$ P(\hat{y} \neq y) \leq \frac{\bar{\rho}(1 - s^2)}{s^2} $$

where $\bar{\rho}$ is the mean correlation between trees and $s$ is the strength of the individual trees. With highly diverse features such as browser data, $\bar{\rho}$ is low, and for strong trees ($s$ close to 1), the bound approaches 0.
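Plugging illustrative numbers into the bound (the $\bar{\rho}$ and $s$ values below are assumptions, not measurements) shows how quickly the error ceiling shrinks:

```python
def breiman_error_bound(rho_bar, s):
    """Breiman (2001) upper bound on generalization error: rho_bar * (1 - s^2) / s^2."""
    return rho_bar * (1 - s ** 2) / s ** 2

# Weakly correlated trees (rho_bar = 0.05) and strong trees (s = 0.9).
print(breiman_error_bound(0.05, 0.9))  # ~0.0117, i.e. error bounded near 1%
```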