October 29, 2024

Fortress of Fair Play - Part 2


Graph based Model for Stopping Frauds at Games24x7

In our previous article, Fortress of Fair Play: Stopping Frauds at Games24x7, we highlighted the importance of fair play in real money gaming and the necessity of protecting players from fraudulent activities. We also discussed the architecture of the Risk-Rule Engine (RRE), a robust fraud detection framework that monitors player activity and triggers specific rules in response to unusual behavior.

In this article, we outline the development of the graph-based ML model used for real-time inference in RRE to flag entire networks of suspected fraudsters. We explain the rationale for selecting a graph database, feature extraction from the database, how we mitigated data leakage during development, and the user-level score prediction process after integrating an explainability layer, along with the additional capability of identifying and flagging fraud chains.

The Problem

The existing version of RRE relied on a model built on real-time relational DB features to score users. These features capture user activity on the platform and static user information such as geolocation and IP to produce fraud probability scores. Since the model prediction occurs late in the user journey, a drawback is that the model has low recall.

Additionally, there were rules that checked for linkages serving as evidence, but these were constrained by the capabilities of relational databases, making it difficult to look beyond one hop (Figure 1) or across multiple types of linkages.

Figure 1: Example rule - single linkage type (lack of strong evidence for fraud)

Besides the rules' lack of real-time capability, a significant limitation of the relational DB model and rules was their inability to represent intricate user-user relationships. This prevented capturing complete fraud chains and led to loss of bonus amount through abuse of offers and prize money. Moreover, the rules lacked flexibility for fraud detection, meaning they had to be changed very frequently based on emerging fraud patterns.

Previously, we utilized graph databases in real time to check for IP linkages when fraudulent players try to sit at the same gameplay table, as described in Prevent Fraud and Collusion — The Graph Way. However, this approach also needs improvement, as it does not capture the different types of interconnections that form fraud chains.

The Solution

Mitigating Fraud Risks with Graph based Fraud Detection System

To address the limitations of the existing setup, the challenge was to develop a real-time solution that could examine complex linkages forming a network within the knowledge graph and provide high-precision fraud scores for all users to aid agent investigation. Hence, a real-time graph-based fraud detection model was considered and evaluated.

Figure 2: Using Knowledge Graph to find complex linkages

Fraudulent activities involve complex networks of users interconnected through linkages such as location, IP addresses, etc. Traditional methods often struggle to capture these relationships effectively, as they require multiple joins and heavy computation to traverse multiple hops. A knowledge graph-based approach provides an efficient way to model such complex user connections and offers advantages such as:

  • Effective querying beyond one hop
  • Easy scalability for large-scale datasets
  • Interpretability in feature extraction
  • Smooth integration with machine learning techniques

Solution Framework

The figure below represents the architecture utilized for development of the enhanced real-time fraud detection system:

Figure 3: Model Development Process

Graph Database Creation

We designed the schema of the graph database based on domain knowledge and business understanding. The chosen knowledge graph schema is robust, as it needs to be augmented far less frequently than rules, which must be changed often in response to new fraud patterns.

We used Neo4j to build the graph database for model development, since it provides good visualization and graph querying capabilities. It uses the Cypher query language to create/insert nodes, establish relationships between nodes and extract information from the graph DB. Below are snippets for node and edge creation using Cypher:

Node creation -

CALL apoc.periodic.iterate(
   "LOAD CSV WITH HEADERS FROM 'path/file_nodetype1.csv' AS row RETURN row",
   "CALL apoc.create.node(['NodeType1'], {Property1: row.property1})
    YIELD node RETURN node",
   {batchSize: 10000, parallel: true}
);

CALL apoc.periodic.iterate(
   "LOAD CSV WITH HEADERS FROM 'path/file_nodetype2.csv' AS row RETURN row",
   "CALL apoc.create.node(['NodeType2'], {Property2: row.property2})
    YIELD node RETURN node",
   {batchSize: 10000, parallel: true}
);

The above code creates two types of nodes, 'NodeType1' and 'NodeType2', having properties Property1 and Property2 respectively.

Edge creation -

CALL apoc.periodic.iterate(
   "LOAD CSV WITH HEADERS FROM 'path/file_edgedata.csv' AS row RETURN row",
   "MATCH (n1:NodeType1 {Property1: row.property1})
    USING INDEX n1:NodeType1(Property1)
    MATCH (n2:NodeType2 {Property2: row.property2})
    USING INDEX n2:NodeType2(Property2)
    CALL apoc.create.relationship(n1, 'LINKAGE1', {EdgeProperty: row.edgeproperty}, n2)
    YIELD rel
    RETURN rel",
   {batchSize: 10000, parallel: true}
);

The above code creates 'LINKAGE1' edges, with property EdgeProperty, between NodeType1 and NodeType2 nodes matched on properties Property1 and Property2 respectively.

Feature Creation using Graph Traversal

Once the graph database has been created, we compute features using graph traversal from each user node to extract:

  • Distinct Node Type Count
  • Distinct User Count for Distinct Node Types

Below is a snippet used for feature creation using traversal of graph DB:

MATCH (n1:User)
WHERE n1.Property1 = 'property_value'
WITH n1
MATCH (n1)-[r:LINKAGE1|LINKAGE2]->(n2:NodeType1|NodeType2)
WITH DISTINCT n1, n2
MATCH (n2)<-[r1:LINKAGE1|LINKAGE2]-(n3:User)
WITH DISTINCT n1 AS user, labels(n2)[0] AS linkage_name,
     n2 AS linked_node, n3 AS linked_user
RETURN DISTINCT user.Property1 AS primary_user, linkage_name,
       count(DISTINCT linked_node) AS node_count,
       count(DISTINCT linked_user.Property1) AS linked_user_count,
       collect(DISTINCT linked_user.Property1) AS linked_user_list

The above snippet traverses the graph two hops away from the user node and returns the count of non-user nodes connected to the user, the number of users linked to those non-user nodes, and the list of those linked users.
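
For illustration, below is a minimal Python sketch of how such a traversal query could be executed with the official Neo4j Python driver and its output pivoted into per-user features. The connection details, credentials, property names and query text are placeholders, not the production pipeline.

# Minimal sketch: run a traversal query with the Neo4j Python driver and pivot
# the results into per-user features. Endpoint, credentials and property names
# are hypothetical placeholders.
from neo4j import GraphDatabase
import pandas as pd

URI = "bolt://localhost:7687"          # hypothetical endpoint
AUTH = ("neo4j", "password")           # hypothetical credentials

TRAVERSAL_QUERY = """
MATCH (n1:User)
WHERE n1.Property1 = $user_id
MATCH (n1)-[:LINKAGE1|LINKAGE2]->(n2:NodeType1|NodeType2)
WITH DISTINCT n1, n2
MATCH (n2)<-[:LINKAGE1|LINKAGE2]-(n3:User)
RETURN n1.Property1 AS primary_user, labels(n2)[0] AS linkage_name,
       count(DISTINCT n2) AS node_count,
       count(DISTINCT n3.Property1) AS linked_user_count,
       collect(DISTINCT n3.Property1) AS linked_user_list
"""

def extract_features(user_id: str) -> pd.DataFrame:
    """Return one row per user with one column per (linkage type, metric) pair."""
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        records, _, _ = driver.execute_query(TRAVERSAL_QUERY, user_id=user_id)
    df = pd.DataFrame([r.data() for r in records])
    if df.empty:
        return pd.DataFrame({"primary_user": [user_id]})
    # Pivot so that each linkage type contributes a node_count and a user_count column
    wide = df.pivot(index="primary_user", columns="linkage_name",
                    values=["node_count", "linked_user_count"])
    wide.columns = [f"{label}_{metric}" for metric, label in wide.columns]
    return wide.reset_index()

features = extract_features("user_123")   # hypothetical user id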

Model Development and Evaluation

We have explored various methods to develop our models as mentioned below. These methods were assessed for their usefulness, interpretability, and performance metrics.

  • Model 1: Relational DB user activity features for a supervised ML model
  • Model 2: Interpretable single-hop (user and node count) features for the ML model
  • Model 3: Interpretable multi-hop (user and node count) features for the ML model
  • Model 4: The above approach after integrating PageRank-based features

The above-mentioned features were combined with previously fraud-tagged user data to train and evaluate binary classification ML models such as Logistic Regression and XGBoost. Model evaluation and selection was performed based on AUC, Precision, Recall and F1 Score.
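
As a rough illustration of this step, the sketch below trains and evaluates an XGBoost binary classifier on the graph features; the file name, column names, hyperparameters and decision threshold are illustrative placeholders rather than the production configuration.

# Minimal sketch: train and evaluate a binary fraud classifier on graph features.
# Data source, column names and hyperparameters are illustrative placeholders.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

data = pd.read_csv("graph_features_with_labels.csv")   # hypothetical training file
feature_cols = [c for c in data.columns if c not in ("primary_user", "is_fraud")]
X_train, X_test, y_train, y_test = train_test_split(
    data[feature_cols], data["is_fraud"], test_size=0.2,
    stratify=data["is_fraud"], random_state=42)

model = xgb.XGBClassifier(
    n_estimators=300, max_depth=4, learning_rate=0.05, eval_metric="auc")
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]
preds = (scores >= 0.5).astype(int)    # threshold would be tuned for precision in practice
print("AUC      :", roc_auc_score(y_test, scores))
print("Precision:", precision_score(y_test, preds))
print("Recall   :", recall_score(y_test, preds))
print("F1       :", f1_score(y_test, preds))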

Figure 4: Test AUC-ROC Curves (False Positive Rate 0% - 2.5%)  

We chose Model 3 (without PageRank features) over Model 4 (with PageRank features) for the reasons mentioned below:

  • Both models have very similar performance metrics
  • The PageRank features have some temporal leakage, i.e. they contain some future data since they were computed at a daily level

Moreover, the candidate multi-hop-feature XGBoost model was also compared with supervised GNN-based models on performance metrics such as precision, recall and AUC, and was subsequently chosen over the GNNs on the basis of its comparable performance, along with its utility, interpretability, ability to capture secondary users, and absence of temporal data leakage during model training.

Model Interpretability Layer

It has consistently been noted that fraud tends to happen in groups or clusters. This prior knowledge has been incorporated into the XGBoost model by applying monotonic constraints to enforce a positive directional relationship between the input features and target variable. This is because higher values of input features signify more interconnected users, indicating a greater likelihood of fraud.
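
A minimal sketch of how such monotonic constraints can be specified in XGBoost is shown below, continuing the earlier training sketch; the assumption that every feature is positively constrained is illustrative.

# Minimal sketch: enforce a monotonically increasing relationship between every
# linkage-count feature and the predicted fraud probability. A value of +1 per
# feature means the prediction can only increase as that feature grows.
import xgboost as xgb

monotone = "(" + ",".join(["1"] * len(feature_cols)) + ")"   # e.g. "(1,1,1,1)"
constrained_model = xgb.XGBClassifier(
    n_estimators=300, max_depth=4, learning_rate=0.05,
    monotone_constraints=monotone, eval_metric="auc")
constrained_model.fit(X_train, y_train)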

Further, to aid the investigation process, sensitivity analysis is carried out for users to generate top-n contributing factors/reasons as an explanation for the observed probability score.
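
The exact sensitivity procedure is not detailed here; one simple perturbation-based variant is sketched below, continuing the earlier training sketch. Each feature is reset to a baseline value and the resulting drop in predicted probability is used to rank the top contributing reasons.

# Minimal sketch of a perturbation-based sensitivity analysis: for one user,
# reset each feature to a baseline (here 0, i.e. "no linkage") and rank features
# by how much the predicted fraud probability drops.
import pandas as pd

def top_reasons(model, user_row: pd.DataFrame, n: int = 3) -> pd.Series:
    base_score = model.predict_proba(user_row)[0, 1]
    drops = {}
    for col in user_row.columns:
        perturbed = user_row.copy()
        perturbed[col] = 0                     # baseline: linkage feature removed
        drops[col] = base_score - model.predict_proba(perturbed)[0, 1]
    return pd.Series(drops).sort_values(ascending=False).head(n)

# Example: explain the highest-scoring user in the test set
c_scores = constrained_model.predict_proba(X_test)[:, 1]
suspect = X_test.iloc[[c_scores.argmax()]]
print(top_reasons(constrained_model, suspect))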

Figure 5: Sample Reason Details for High Probability User

Additional Capabilities of the Graph based Fraud Model

In our context, the primary users are the starting points, or user nodes, from which graph traversal begins. Any user connected to the primary users during graph traversal is considered a secondary user.

Primary users with high probability scores are connected to multiple secondary users, constituting a fraud chain. The graph database's capability is utilized to capture secondary users based on linkages as evidence. These secondary user IDs are extracted as a list during graph traversal and flagged along with any primary user that has a high fraud probability score. This provides a mechanism to stop entire fraud chains instantly.

Figure 6: Graph Traversal - Primary and Secondary Users
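
A minimal sketch of how the traversal output and the model score could be combined to flag an entire chain is shown below; the threshold value and record layout are illustrative assumptions.

# Minimal sketch: flag an entire fraud chain when a primary user's score crosses
# the operating threshold. Threshold and output layout are hypothetical.
FRAUD_THRESHOLD = 0.9     # hypothetical operating point

def flag_fraud_chain(primary_user: str, score: float, linked_user_list: list[str]) -> dict:
    """Return the user IDs to flag: the primary user plus all secondary users."""
    if score < FRAUD_THRESHOLD:
        return {"primary_user": primary_user, "flagged_users": []}
    return {"primary_user": primary_user,
            "flagged_users": [primary_user, *linked_user_list]}

# Example: a high-scoring user drags the whole linked chain into review/blocking
print(flag_fraud_chain("user_123", 0.96, ["user_456", "user_789"]))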

Additionally, the system has been structured to assess users at multiple touchpoints rather than just one, expanding coverage throughout their journey. This broader coverage increases the system's ability to detect potential fraudulent behavior across various stages of the user journey, enhancing overall fraud detection effectiveness.

Real Time Model Performance Evaluation

The model's performance is continuously monitored using a dedicated pipeline that tracks key metrics including AUC, Precision, Recall, and Population Stability Index (PSI). The model has achieved an AUC of approximately 80% and currently operates at a score threshold that yields precision and recall of approximately 70% and 35% respectively.
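
For reference, a minimal sketch of a PSI computation, one of the drift metrics mentioned above, is shown below; the binning scheme and the placeholder score distributions are illustrative.

# Minimal sketch: Population Stability Index between a baseline (training-time)
# score distribution and the current production score distribution.
# PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%).
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.linspace(0.0, 1.0, bins + 1)        # scores are probabilities in [0, 1]
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)         # avoid log(0) / division by zero
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Example: compare baseline scores with current production scores (placeholder data)
baseline_scores = np.random.beta(2, 8, 10000)
current_scores = np.random.beta(2, 7, 10000)
print("PSI:", population_stability_index(baseline_scores, current_scores))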

Business Usage of Enhanced Fraud Detection System

Model predictions generated in real time are consumed by the business to take quick, informed action to block fraudulent users. This usage has led to approximately a 27% reduction in bonus amount going to fraudsters, along with timely identification and risk mitigation of new fraud chains.

Acknowledgements

I am sincerely thankful to Tridib Mukherjee, Chief Data Science and AI Officer, Games24x7, for his constant support, and to Sachin Kumar, Associate Director of Data Science, Games24x7, for his mentorship during the development of the Graph Based Fraud Detection models. This project has been a valuable learning journey.

About the author

Parth Bhargava works as a Data Scientist with Games24x7 and has been associated with various projects related to fraud detection, customer service and acquisition marketing. He has a B.Tech in Mechanical Engineering from IIT BHU Varanasi and a PGDBA from IIM Calcutta, ISI Kolkata and IIT Kharagpur.

LinkedIn Profile - https://www.linkedin.com/in/parth-bhargava-215753a7

Link to the previous blogs

  1. Fortress of Fair Play: Stopping Frauds at Games24x7
  2. Prevent Fraud and Collusion — The Graph Way