Random Forest Classification

Introduction

Digital fingerprinting for attribution identifies and tracks users across devices and sessions by leveraging browser-exposed attributes. Even in the presence of limited data, it is crucial for advertising platforms, such as Lucia Protocol, to accurately attribute users’ actions and calculate lifetime value (LTV) and customer acquisition cost (CAC). This paper focuses on Random Forest classification as a method for processing such small datasets and demonstrates its theoretical ability to achieve near-perfect accuracy.

Random Forest Classification

Random forest classification is an ensemble learning method that constructs a large number of decision trees during training and outputs the majority vote across all trees. For digital fingerprints, we use features such as the following (a training sketch follows the list):

  • User Agent

  • IP Address

  • Screen Width & Height

  • Hardware Concurrency

  • Operating System

  • Timezone

  • Language

  • CPU Class

  • Device Type

  • Plugins

  • Fonts

  • Color Depth

  • Battery Life

  • Browser Version

  • Touch Support

  • Platform
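
As a concrete starting point, here is a minimal sketch of training a Random Forest on such attributes with scikit-learn. The attribute names and the two sample records are hypothetical placeholders; a real pipeline would collect these values from the browser:

```python
# Minimal sketch: encode hypothetical fingerprint attributes and fit a forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

# Each row is one observed fingerprint; user_id is the label to recover.
records = pd.DataFrame([
    {"user_agent": "Mozilla/5.0 ... Chrome/120", "ip": "203.0.113.7",
     "screen_w": 1920, "screen_h": 1080, "timezone": "America/New_York",
     "language": "en-US", "user_id": "user_a"},
    {"user_agent": "Mozilla/5.0 ... Safari/17", "ip": "198.51.100.23",
     "screen_w": 1440, "screen_h": 900, "timezone": "Europe/Berlin",
     "language": "de-DE", "user_id": "user_b"},
])

X_raw = records.drop(columns="user_id")
y = records["user_id"]

# Categorical attributes (user agent, IP, timezone, ...) are encoded as
# integers so they can serve as tree-split candidates.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X = encoder.fit_transform(X_raw)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
```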

Mathematical Model

Let $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ represent the dataset, where $x_i$ is a feature vector (up to 20 features) and $y_i$ is the user identity (fingerprint). Random Forest builds $N$ trees, each trained on a random subset of $D$. Each tree $T_i$ predicts a class label, and the ensemble output $\hat{y}$ is determined by majority voting, defined as:

$$ \hat{y} = \operatorname{mode}\left(T_1(x), T_2(x), \dots, T_N(x)\right) $$
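
The voting step can be made explicit. The sketch below assumes the fitted clf and encoded matrix X from the earlier training sketch; each sub-tree votes with an encoded class index, and clf.classes_ maps the modal vote back to a label. (Note that scikit-learn's own predict averages class probabilities rather than counting hard votes, so this mirrors the formula above rather than clf.predict exactly.)

```python
# Hard majority vote across the fitted sub-trees, mirroring the formula.
import numpy as np
from collections import Counter

per_tree = np.array([tree.predict(X) for tree in clf.estimators_]).astype(int)
majority = [Counter(per_tree[:, j]).most_common(1)[0][0]
            for j in range(per_tree.shape[1])]
predicted = clf.classes_[majority]
print(predicted)
```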

An impurity function, such as the Gini impurity, is used to optimize the splits:

$$ G(D) = 1 - \sum_{k=1}^{K} p_k^2 $$

where $p_k$ is the proportion of samples of class $k$ in subset $D$. The random selection of features ensures that the trees are diverse, which strengthens generalization even with small data.
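
As a quick worked example: a pure node has impurity $G = 0$, while a 50/50 two-class node has $G = 1 - (0.5^2 + 0.5^2) = 0.5$. The helper below is a direct transcription of the definition:

```python
# Gini impurity, computed directly from the class proportions p_k.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["a", "a", "a", "a"]))  # 0.0: pure node
print(gini(["a", "a", "b", "b"]))  # 0.5: maximally mixed two-class node
```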

Theoretical Correctness and Performance Bound

To see why Random Forest can accurately classify digital fingerprints, consider the following:

  • Assumption: Each user’s digital fingerprint is unique across the feature set, meaning no two users will have identical browser-exposed attributes.

  • High-dimensional feature space: With up to 20 features, many of them highly granular (e.g., screen width, IP address, user-agent string), the combination of attributes forms a nearly unique vector for each user. Random Forest handles this high-dimensional space well by capturing non-linear relationships among features.

  • Accuracy bound: Given sufficiently many trees $N$ and diversity across tree outputs (due to random feature selection), Random Forest can achieve nearly perfect classification. Breiman’s original theorem (2001) bounds the generalization error:

$$ P(\hat{y} \neq y) \leq \frac{\bar{\rho}\,(1 - s^2)}{s^2} $$

where $\bar{\rho}$ is the mean correlation between trees and $s$ is the strength of the individual trees. With highly diverse features such as browser data, $\bar{\rho}$ is low; combined with strong trees and large $N$, the bound drives the error toward 0.
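
To get a feel for the bound, plug in illustrative values. The figures below are assumptions chosen for the example, not measured quantities:

```python
# Illustrative evaluation of Breiman's generalization-error bound.
# rho_bar and s are assumed values, not measurements.
rho_bar = 0.1   # low mean correlation between trees (diverse feature subsets)
s = 0.9         # high strength of individual trees (near-unique fingerprints)

bound = rho_bar * (1 - s ** 2) / s ** 2
print(f"generalization error bound: {bound:.4f}")  # ~0.0235
```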

Empirically, in digital fingerprinting, high accuracy (up to 99.99%) is achievable because:

  1. Unique combinations of features benefit our data model. User-agent strings, screen dimensions, IP addresses, and other attributes are highly distinct across users.

  2. Feature redundancy helps. Even if some features are missing or noisy, Random Forest’s ensemble nature allows it to classify correctly based on the remaining strong features.

  3. Small datasets can suffice in our case. Bootstrapping and feature randomness make Random Forest robust to overfitting on small datasets (see the cross-validation sketch after this list).
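
A rough sanity check of the small-data claim is sketched below: cross-validated accuracy of a forest on a small synthetic dataset with 20 features. This is purely illustrative; make_classification stands in for real fingerprint data:

```python
# Cross-validating a Random Forest on a small synthetic dataset.
# make_classification is a stand-in for real fingerprint features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=60, n_features=20, n_informative=12,
                           n_classes=3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold accuracy on 60 samples
print(scores.mean())
```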
