Dota 2 Match Analysis: Unleashing the Potential of LightGBM Machine Learning Techniques

EXECUTIVE SUMMARY

Dota 2, a leading multiplayer online battle arena (MOBA) game, engages two teams of five players in a strategic contest to destroy their opponent’s “Ancient” structure while safeguarding their own. With its prominent presence in the esports domain, insights into the determinants of match outcomes hold significant value.

In each match, Radiant and Dire teams consist of five players who adopt specialized roles based on their chosen heroes. The game map comprises diverse elements, including team bases, lanes, shops, and Roshan’s lair. Players focus on hero upgrades, item acquisitions, and enemy base destruction to achieve victory.

This study applies machine learning algorithms to the collected data, establishes baseline scores, and tunes model performance. The tuned LightGBM classifier reaches 69.41% accuracy, and the Bayesian-optimized configuration reaches a cross-validated AUC of about 0.79. The dataset was reduced because of time and computational constraints, which limits the study’s scope.

Future research can investigate additional models and optimization approaches to augment or supplement existing findings, further advancing our understanding and predictive capabilities for Dota 2 match outcomes.

HIGHLIGHTS

  1. Showcased the application of Bayesian Optimization and its impact on enhancing accuracy levels.
  2. Illustrated the implementation of Light Gradient Boosting alongside hyperparameter tuning.
  3. Explored the utilization of various Classification Models for diverse predictions.
  4. Identified gold as the most influential predictor in determining match outcomes.
  5. Emphasized the importance of Feature Engineering in optimizing model performance and efficiency.

METHODOLOGY

The overarching methodology of this project focuses on employing various machine learning models, particularly LightGBM, to accurately predict Dota 2 match outcomes. The following steps outline the process:

  • Data Retrieval: Acquiring the relevant dataset for analysis.
  • Data Cleaning: Ensuring data quality by removing inconsistencies and inaccuracies.
  • Exploratory Data Analysis: Investigating data patterns, relationships, and trends to gain insights.
  • Data Preprocessing: Transforming and preparing the data for machine learning models.
  • ML Models Simulation: Implementing and comparing the performance of various machine learning models.
  • Hyperparameter Optimization: Fine-tuning model parameters to enhance predictive accuracy and efficiency.

Import Libraries

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import HTML, display

# Machine learning models and algorithms
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Ridge, Lasso, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC, SVC, SVR
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              AdaBoostClassifier,
                              ExtraTreesClassifier,
                              VotingClassifier)
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
import lightgbm
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Machine learning tools and utilities
from mltools import *
from bayes_opt import BayesianOptimization

# Data preprocessing and preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

Data Retrieval

The data used in this project is sourced from a Kaggle dataset [5]. The OpenDota API, which could have been an alternative, is no longer a viable option due to its restrictions on non-premium subscribers and the removal of free access to the API.

Data Loading

# loading data
train_data = pd.read_csv('./train_features.csv', index_col='match_id_hash')
test_data = pd.read_csv('./test_features.csv', index_col='match_id_hash')
train_y = pd.read_csv('./train_targets.csv', index_col='match_id_hash')

colors = ['#d9534f', '#5cb85c']
# Compute the target value counts
target_counts = train_y['radiant_win'].value_counts()

# Create a plot using seaborn
plt.figure(figsize=(6, 4))
sns.set_style('whitegrid')
ax = sns.barplot(x=target_counts.index,
                 y=target_counts.values, palette=colors)
ax.set_title('Win/Lose Distribution')
ax.set_xlabel('Target')
ax.set_ylabel('Count')
ax.set_xticklabels(['Lose', 'Win'])

# Save the plot as a PNG file
plt.savefig("images/balance.png")

# Create an HTML img tag to display the image
img_tag = ('<img src="images/balance.png" alt="balance" '
           'style="display:block; margin-left:auto; '
           'margin-right:auto; width:80%;">')

# display the HTML <img> tag
display(HTML(img_tag))
plt.close()

sns.pairplot(train_data.iloc[:,:4], hue='lobby_type')

Data Description

num_rows = train_data.shape[0]
num_cols = train_data.shape[1]
html_table = train_data.head().to_html()
html_table_with_info = f"{html_table} \n <p>Number of Rows: {num_rows}<br>Number of Columns: {num_cols}</p>"

# Print the HTML table
print(html_table_with_info)

[Output: the first five rows of train_data, indexed by match_id_hash. The columns are game_time, game_mode, lobby_type, objectives_len, and chat_len, followed by the same 24 per-player columns for each Radiant player (prefixes r1 to r5) and each Dire player (prefixes d1 to d5): hero_id, kills, deaths, assists, denies, gold, lh, xp, health, max_health, max_mana, level, x, y, stuns, creeps_stacked, camps_stacked, rune_pickups, firstblood_claimed, teamfight_participation, towers_killed, roshans_killed, obs_placed, sen_placed.]

Number of Rows: 39675
Number of Columns: 245

The dataset comprises 39,675 rows and 245 columns (features) describing Dota 2 matches. Each match involves 10 players in two teams of five. Every player has 24 features, for a total of 240 player-specific columns; the remaining five columns (game_time, game_mode, lobby_type, objectives_len, and chat_len) describe the match as a whole.
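
The column count can be checked with a short sketch. It assumes the naming pattern visible in the table header (prefixes r1 to r5 and d1 to d5 for the ten players, plus five match-level columns). The stat names are copied from that header, and nothing is read from the actual CSV files.

```python
# Match-level columns and per-player stat names, copied from the header
# of the table shown above (no file is read here).
match_cols = ["game_time", "game_mode", "lobby_type",
              "objectives_len", "chat_len"]
player_stats = [
    "hero_id", "kills", "deaths", "assists", "denies", "gold", "lh", "xp",
    "health", "max_health", "max_mana", "level", "x", "y", "stuns",
    "creeps_stacked", "camps_stacked", "rune_pickups", "firstblood_claimed",
    "teamfight_participation", "towers_killed", "roshans_killed",
    "obs_placed", "sen_placed",
]

# r1..r5 and d1..d5 are the ten player prefixes.
players = [f"{team}{slot}" for team in ("r", "d") for slot in range(1, 6)]
player_cols = [f"{p}_{s}" for p in players for s in player_stats]

all_cols = match_cols + player_cols
print(len(players), len(player_stats), len(player_cols), len(all_cols))
# prints: 10 24 240 245
```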

A detailed description of each player-specific feature can be found in the provided reference table [6].

Feature | Description
hero_id | ID of the player’s hero (int64). Heroes are the essential element of Dota 2, as the course of the match depends on their intervention. During a match, two opposing teams each select five out of 117 heroes, which accumulate experience and gold to grow stronger and gain new abilities in order to destroy the opponent’s Ancient. Most heroes have a distinct role that defines how they affect the battlefield, though many heroes can perform multiple roles. A hero’s appearance can be modified with equipment.
kills | Number of players killed (int64).
deaths | Number of deaths of the player (int64).
gold | Amount of gold (int64). Gold is the currency used to buy items or instantly revive your hero. Gold can be earned from killing heroes, creeps, or buildings.
xp | Experience points (int64). Experience is an element heroes can gather by killing enemy units, or by being present as enemy units get killed. On its own, experience does nothing, but when accumulated, it increases the hero’s level, so that they grow more powerful.
lh | Number of last hits (int64). Last-hitting is a technique where you (or a creep under your control) get the ‘last hit’ on a neutral creep, enemy lane creep, or enemy hero. The hero that dealt the killing blow to the enemy unit is awarded a bounty.
denies | Number of denies (int64). Denying is the act of preventing enemy heroes from getting the last hit on a friendly unit by last-hitting the unit oneself. Enemies earn reduced experience if the denied unit is not controlled by a player, and no experience if it is a player-controlled unit. Enemies gain no gold from any denied unit.
assists | Number of assists (int64). Allied heroes within a 1300 radius of a killed enemy, including the killer, receive experience and reliable gold if they assisted in the kill. To qualify for an assist, the allied hero merely has to be within the given radius of the dying enemy hero.
health | Health points (int64). Health represents the life force of a unit. When a unit’s current health reaches 0, it dies. Every hero has a base health pool of 200. This value exists for all heroes and cannot be altered, so a hero’s maximum health cannot drop below 200.
max_health | Hero’s maximum health pool (int64).
max_mana | Hero’s maximum mana pool (float64). Mana represents the magic power of a unit. It is used as a cost for the majority of active and even some passive abilities. Every hero has a base mana pool of 75, while most non-hero units only have a set mana pool if they have abilities that require mana, with a few exceptions. These values cannot be altered, so a hero’s maximum mana cannot drop below 75.
level | Level of the player’s hero (int64). Each hero begins at level 1, with one free ability point to spend. Heroes level up by acquiring certain amounts of experience. Upon leveling up, the hero’s attributes increase by fixed amounts (unique for each hero), which makes them more powerful overall. Heroes may also gain more ability points by leveling up, allowing them to learn new spells or to improve an already learned spell. Heroes can gain a total of 24 levels, so level 25 is the highest level a hero can reach.
x | Player’s X coordinate (int64).
y | Player’s Y coordinate (int64).
stuns | Total stun duration of all stuns (float64). Stun is a status effect that completely locks down affected units, disabling almost all of their capabilities.
creeps_stacked | Number of stacked creeps (int64). Creep stacking is the process of drawing neutral creeps away from their camps in order to increase the number of units in an area. When neutral creeps are pulled beyond their camp boundaries, the game generates a new set of creeps for the player to interact with in addition to any remaining creeps. This is very time efficient, since it effectively increases the amount of gold available to a team.
camps_stacked | Number of stacked camps (int64).
rune_pickups | Number of runes picked up (int64).
firstblood_claimed | Appears to be a boolean flag for whether the player claimed first blood (int64); the source does not document it.
teamfight_participation | Appears to be the player’s team fight participation rate (float64); the source does not document it.
towers_killed | Number of towers killed/destroyed (int64). Towers are the main line of defense for both teams, attacking any non-neutral enemy that gets within their range. Both factions have all three lanes guarded by three towers each. Additionally, each faction’s Ancient has two towers, resulting in a total of 11 towers per faction. Towers come in 4 different tiers.
roshans_killed | Number of Roshans killed (int64). Roshan is the most powerful neutral creep in Dota 2. It is the first unit to spawn, right as the match is loaded. During the early to mid game, he easily outmatches almost every hero in one-on-one combat. Very few heroes can take him on alone during the mid game. Even in the late game, many heroes struggle to fight him one on one, since Roshan grows stronger as time passes.
obs_placed | Number of observer wards placed by a player (int64). An Observer Ward is an invisible watcher that gives ground vision in a 1600 radius to your team. It lasts 6 minutes.
sen_placed | Number of sentry wards placed by a player (int64). A Sentry Ward is an invisible watcher that grants True Sight, the ability to see invisible enemy units and wards, to any existing allied vision within a radius. It lasts 6 minutes.

Data Cleaning

train_data.info()
train_data.describe()
train_data.isnull().sum()
# drop_duplicates returns a new DataFrame, so assign the result back
train_data = train_data.drop_duplicates()

Exploratory Data Analysis

# Create a countplot with a custom color palette
sns.countplot(data=train_data, x='lobby_type', order=train_data['lobby_type'].value_counts().index, palette='Set2');

# Add a title to the plot
plt.title('Counts of games in lobby type');

# Show the plot
plt.show()

# Create a countplot with the 'Set2' color palette for game modes
sns.countplot(data=train_data, x='game_mode',
              order=train_data['game_mode'].value_counts().index, palette='Set2')

# Add a title to the plot
plt.title('Counts of games in different modes')

# Show the plot
plt.show()

# Filter the dataset to only include the most common game mode
most_common_game_mode = train_data['game_mode'].value_counts().idxmax()
filtered_train_data = train_data[train_data['game_mode'] == most_common_game_mode]

Insight

  • Lobby types and game modes follow different rules, so the analysis keeps only the most frequent lobby type and game mode. This yields a more homogeneous dataset and reduces the variation that mixing several lobby types and game modes would introduce, at the cost of a smaller sample.
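
The selection rule described above can be sketched without hardcoding the category codes. This is a minimal illustration on hypothetical toy records (the `matches` list and the `most_common` helper are made up for this example), using only the standard library. In pandas, `value_counts().idxmax()`, as used in the notebook, plays the same role.

```python
from collections import Counter

# Hypothetical toy records standing in for rows of train_data.
matches = [
    {"match_id": "m1", "lobby_type": 7, "game_mode": 22},
    {"match_id": "m2", "lobby_type": 7, "game_mode": 22},
    {"match_id": "m3", "lobby_type": 7, "game_mode": 4},
    {"match_id": "m4", "lobby_type": 0, "game_mode": 22},
    {"match_id": "m5", "lobby_type": 0, "game_mode": 23},
]


def most_common(records, key):
    """Return the most frequent value of `key` across the records."""
    return Counter(r[key] for r in records).most_common(1)[0][0]


# Keep the most frequent lobby type first, then the most frequent
# game mode within that lobby type.
top_lobby = most_common(matches, "lobby_type")
kept = [r for r in matches if r["lobby_type"] == top_lobby]
top_mode = most_common(kept, "game_mode")
kept = [r for r in kept if r["game_mode"] == top_mode]

print(top_lobby, top_mode, [r["match_id"] for r in kept])
```

Filtering in two steps mirrors the order used in the preprocessing code: the game mode is chosen among matches that already passed the lobby-type filter.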

Data Preprocessing

Feature Transformation

# remove lobby_type with lower counts
train_y['lobby_type'] = train_data['lobby_type']
train_data = train_data[train_data['lobby_type'] == 7]
test_data = test_data[test_data['lobby_type'] == 7]
train_y = train_y[train_y['lobby_type'] == 7]

# remove game_mode with lower counts
train_y['game_mode'] = train_data['game_mode']
train_data = train_data[train_data['game_mode'] == 22]
test_data = test_data[test_data['game_mode'] == 22]
train_y = train_y[train_y['game_mode'] == 22]
# drop lobby_type and game_mode for train_y
train_y = train_y.drop(columns=['lobby_type', 'game_mode'])
# mapping win and lose values
train_y = train_y['radiant_win'].map({True: 1, False:0})

# Count wins per outcome, ordered by label (0 = Dire win, 1 = Radiant win)
# so that each bar lines up with its tick label
win_counts = train_y.value_counts().sort_index()

# Create a bar plot with the 'Set2' color palette for Dire and Radiant wins
sns.barplot(x=win_counts.index, y=win_counts.values, palette='Set2')

# Set the x-axis tick labels to 'Dire Win' and 'Radiant Win'
plt.xticks([0, 1], labels=['Dire Win', 'Radiant Win'])

# Add labels to the x- and y-axes and a title to the plot
plt.xlabel('Dire & Radiant')
plt.ylabel('Win Counts')
plt.title('Dire vs Radiant Win Counts')

# Show the plot
plt.show()

Feature Engineering

# Combine the individual player features into team-based features by
# summing each statistic over the five players of a team
# Keep the full stat name (e.g. 'max_health', not just 'max') so that
# 'r_max_health' and 'r_max_mana' do not overwrite each other
player_stats = [c.split("_", 1)[1] for c in train_data.columns
                if c.startswith("r1_")]

for team in ("r", "d"):
    for stat in player_stats:
        player_cols = [f"{team}{i}_{stat}" for i in range(1, 6)]
        train_data[f"{team}_{stat}"] = train_data[player_cols].sum(axis=1)
        test_data[f"{team}_{stat}"] = test_data[player_cols].sum(axis=1)

        # Drop the individual player features once they are aggregated
        train_data.drop(columns=player_cols, inplace=True)
        test_data.drop(columns=player_cols, inplace=True)
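
One detail in deriving team-level column names is worth a quick check. Several stat names contain underscores (for example max_health and max_mana), so taking only the second underscore-separated token maps both to the same team column. The sketch below uses a hypothetical handful of column names to show the difference between splitting on every underscore and splitting once.

```python
# A few per-player column names following the dataset's naming pattern.
cols = ["r1_kills", "r1_max_health", "r1_max_mana", "r1_creeps_stacked"]

# Splitting on every underscore keeps only the first token of the stat name.
lossy_names = [c.split("_")[1] for c in cols]
# Splitting once keeps the whole stat name.
full_names = [c.split("_", 1)[1] for c in cols]

# Map each derived team column to the player column that defined it last;
# with the lossy names, max_mana overwrites max_health.
team_lossy = {"r_" + c.split("_")[1]: c for c in cols}
team_full = {"r_" + c.split("_", 1)[1]: c for c in cols}

print(lossy_names)
print(full_names)
print(len(team_lossy), len(team_full))
```

The features selected later (kills, deaths, assists, denies, lh, gold) have single-token names, so they are the same under either split.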

Feature Selection

# what features to be used
to_load = (['r_kills', 'r_deaths', 'r_assists', 'r_denies', 'r_lh', 'r_gold',
            'd_kills', 'd_deaths', 'd_assists', 'd_denies', 'd_lh', 'd_gold'])
train_X = train_data[to_load]
test_X = test_data[to_load]

# Use half of the data points to keep the runtime manageable
feature_names = train_X.columns
train_sample = train_X.shape[0] // 2
test_sample = test_X.shape[0] // 2

# Draw the samples; select train_y by the sampled train_X index so that
# features and targets stay aligned
train_X = train_X.sample(train_sample, random_state=3)
train_y = train_y.loc[train_X.index]
test_X = test_X.sample(test_sample, random_state=3)
train_X

Data Scaling

# Scaling: fit the scaler on the training data only, then apply the same
# transformation to the test data
scaler = StandardScaler()
train_X_scaled = scaler.fit_transform(train_X)
test_X_scaled = scaler.transform(test_X)
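
A common convention is to learn the scaling statistics from the training set only and reuse them for the test set, so that both sets are expressed on the same scale. The sketch below illustrates this with hypothetical gold totals (not taken from the dataset) and the standard library. It uses the population standard deviation, which is what StandardScaler computes.

```python
from statistics import mean, pstdev

# Hypothetical team gold totals, not taken from the dataset.
train_gold = [1000.0, 2000.0, 3000.0]
test_gold = [2000.0, 4000.0]

# Learn the scaling statistics from the training data only.
mu = mean(train_gold)
sigma = pstdev(train_gold)


def scale(values):
    """Standardize values with the training mean and standard deviation."""
    return [(v - mu) / sigma for v in values]


train_scaled = scale(train_gold)
test_scaled = scale(test_gold)

# For contrast: refitting on the test set puts it on a different scale.
refit_scaled = [(v - mean(test_gold)) / pstdev(test_gold) for v in test_gold]

print(test_scaled)
print(refit_scaled)
```

With the training statistics, a test value of 2000 maps to 0 because it equals the training mean. After refitting on the test set, the same value maps to -1, so a model trained on the first scale would misread it.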

Final Data

# scaled data
train_X_scaled # for train value
test_X_scaled # for test value

# original data
train_X # for train value
test_X # for test value

# output to csv
train_X.to_csv('train_X.csv')
test_X.to_csv('test_X.csv')
train_y.to_csv('train_y.csv')

RESULTS AND DISCUSSION

Auto ML Simulation

# Select methods
methods = ['kNN', 'Logistic (L1)', 'Logistic (L2)', 'Decision Tree',
           'RF Classifier', 'GB Classifier', 'XGB Classifier',
           'AdaBoost DT', 'LightGBM Classifier', 'CatBoost Classifier']

# Perform training and testing
ml_models = MLModels.run_classifier(
    train_X_scaled, train_y, feature_names, task='C',
    use_methods=methods, n_trials=2, tree_rs=3, test_size=0.20,
    n_neighbors=list(range(1, 3)),
    C=[1e-1, 1],
    max_depth=[5, 10])
res = MLModels.summarize(ml_models, feature_names,
                         show_plot=True, show_top=True)

Feature Importance

for model_name in methods[3:]:
    try:
        ax = ml_models[model_name].plot_feature_importance(feature_names)
        ax.set_title(model_name, fontsize=16, weight="bold")
    except Exception as e:
        print(model_name, e)

Light GBM Classifier Simulation

# parameter ranges narrowed based on earlier trials
tune_model(train_X_scaled, train_y, 'Classification', 'LightGBM Classifier',
           params={'max_depth': [25, 50],
                   'n_estimators': [150, 200],
                   'learning_rate': [0.1, 0.2]},
           n_trials=2, tree_rs=3)

Model | Accuracy | Best Parameter
LightGBM Classifier | 69.41% | {'max_depth': 25, 'n_estimators': 200, 'learning_rate': 0.1}

Classifier | LGBMClassifier(max_depth=25, n_estimators=200, random_state=3)
acc | 69.41
std | 0.0013953488372092648

Bayesian Optimization Simulation

In this project, Bayesian optimization (via the BayesianOptimization package [7]) tunes three hyperparameters of the LightGBM classifier: num_leaves, max_depth, and min_data_in_leaf. The aim is to show how a guided search over hyperparameter combinations can improve a model’s score; here the objective is the mean cross-validated AUC.

The function to be optimized

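def lgb_cv(num_leaves, max_depth, min_data_in_leaf):
    """Return the mean 5-fold cross-validated AUC for one parameter set."""
    # The optimizer proposes floats, so integer parameters are cast here
    params = {
        "num_leaves": int(num_leaves),
        "max_depth": int(max_depth),
        "learning_rate": 0.5,
        'min_data_in_leaf': int(min_data_in_leaf),
        "force_col_wise": True,
        'verbose': -1,
        "metric" : "auc",
        "objective" : "binary",
    }

    lgtrain = lightgbm.Dataset(train_X_scaled, train_y)
    # Note: this call follows the LightGBM 3.x API; newer releases may
    # expect the lightgbm.early_stopping callback instead of the
    # early_stopping_rounds argument
    cv_result = lightgbm.cv(params,
                            lgtrain,
                            200,
                            early_stopping_rounds=200,
                            stratified=True,
                            nfold=5)
    # AUC of the final boosting round, averaged over the folds
    return cv_result['auc-mean'][-1]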
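
The objective above returns AUC rather than accuracy, so its values are not directly comparable with the accuracy reported for the tuned classifier earlier. As a rough illustration of the difference, the sketch below computes both metrics on hypothetical labels and scores (not taken from the dataset). The helper names `auc` and `accuracy` are made up for this example.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of correct predictions after thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical outcomes (1 = Radiant win) and predicted win probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.45, 0.3, 0.2]

print(round(auc(labels, scores), 4), round(accuracy(labels, scores), 4))
# prints: 0.8889 0.8333
```

AUC depends only on how the scores rank the matches, while accuracy also depends on the chosen threshold, which is why the two numbers differ on the same predictions.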

The optimizer function

def bayesian_optimizer(init_points, num_iter, **args):
    lgb_BO = BayesianOptimization(lgb_cv, {'num_leaves': (100, 200),
                                           'max_depth': (25, 50),
                                           'min_data_in_leaf': (50, 200)
                                           })
    lgb_BO.maximize(init_points=init_points, n_iter=num_iter, **args)
    return lgb_BO

results = bayesian_optimizer(10,10)

iter | target | max_depth | min_data_in_leaf | num_leaves
1 | 0.7859 | 49.98 | 128.7 | 115.4
2 | 0.7796 | 42.94 | 68.67 | 190.1
3 | 0.7784 | 28.36 | 57.48 | 107.0
4 | 0.7869 | 30.55 | 134.1 | 150.3
5 | 0.7892 | 36.73 | 162.4 | 187.3
6 | 0.7866 | 31.26 | 195.5 | 165.7
7 | 0.7804 | 37.51 | 75.45 | 175.9
8 | 0.7848 | 45.02 | 123.7 | 153.1
9 | 0.7842 | 33.06 | 124.2 | 109.2
10 | 0.7884 | 39.39 | 164.4 | 142.8
11 | 0.7882 | 39.39 | 165.1 | 143.5
12 | 0.7881 | 50.0 | 179.0 | 100.0
13 | 0.7895 | 50.0 | 200.0 | 200.0
14 | 0.7892 | 25.0 | 186.1 | 200.0
15 | 0.7882 | 25.0 | 165.9 | 113.4
16 | 0.7876 | 25.0 | 138.8 | 200.0
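
The optimizer’s `max` attribute, used in the next section, returns the trial with the highest target. The sketch below reproduces that selection on the logged values and applies the same `int()` casts the notebook uses. The tuple layout is an assumption made for illustration.

```python
# (iter, target, max_depth, min_data_in_leaf, num_leaves), copied from the
# optimization log above.
trials = [
    (1, 0.7859, 49.98, 128.7, 115.4), (2, 0.7796, 42.94, 68.67, 190.1),
    (3, 0.7784, 28.36, 57.48, 107.0), (4, 0.7869, 30.55, 134.1, 150.3),
    (5, 0.7892, 36.73, 162.4, 187.3), (6, 0.7866, 31.26, 195.5, 165.7),
    (7, 0.7804, 37.51, 75.45, 175.9), (8, 0.7848, 45.02, 123.7, 153.1),
    (9, 0.7842, 33.06, 124.2, 109.2), (10, 0.7884, 39.39, 164.4, 142.8),
    (11, 0.7882, 39.39, 165.1, 143.5), (12, 0.7881, 50.0, 179.0, 100.0),
    (13, 0.7895, 50.0, 200.0, 200.0), (14, 0.7892, 25.0, 186.1, 200.0),
    (15, 0.7882, 25.0, 165.9, 113.4), (16, 0.7876, 25.0, 138.8, 200.0),
]

# Pick the trial with the highest target (mean cross-validated AUC).
best = max(trials, key=lambda t: t[1])

# int() truncates toward zero, so a suggestion of 49.98 would become 49.
best_params = {
    "max_depth": int(best[2]),
    "min_data_in_leaf": int(best[3]),
    "num_leaves": int(best[4]),
}
print(best[0], best[1], best_params)
```

In this log the best trial sits at the upper bound of all three search ranges, and the targets of the top trials differ only in the third or fourth decimal place.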

Train Iterations with the optimized parameters

def lgb_train(num_leaves, max_depth,  min_data_in_leaf):
    params = {
        "num_leaves": int(num_leaves),
        "max_depth": int(max_depth),
        "learning_rate": 0.5,
        'min_data_in_leaf': int(min_data_in_leaf),
        "force_col_wise": True,
        'verbose': -1,
        "metric": "auc",
        "objective": "binary",
    }

    x_train, x_val, y_train, y_val = train_test_split(
        train_X_scaled, train_y, test_size=0.2, random_state=3)
    lgtrain = lightgbm.Dataset(x_train, y_train)
    lgvalid = lightgbm.Dataset(x_val, y_val)
    model = (lightgbm.train(params, lgtrain, 200, valid_sets=[lgvalid],
                            early_stopping_rounds=200, verbose_eval=False))
    prediction_val = model.predict(
        test_X_scaled, num_iteration=model.best_iteration)
    return prediction_val, model

Optimize Simulation Results

# 5 runs of the prediction model and get mean values.
optimized_params = results.max['params']
prediction_val1, _ = lgb_train(**optimized_params)
prediction_val2, _ = lgb_train(**optimized_params)
prediction_val3, _ = lgb_train(**optimized_params)
prediction_val4, _ = lgb_train(**optimized_params)
prediction_val5, model = lgb_train(**optimized_params)
y_pred = ((prediction_val1 + prediction_val2 +
           prediction_val3 + prediction_val4 +
           prediction_val5)/5)
df_result = pd.DataFrame(
    {'Radiant_Win_Probability': y_pred})
df_result.sort_values(by='Radiant_Win_Probability', ascending=False).head()

Match ID Hash | Radiant Win Probability
895 | 0.993668
1617 | 0.990840
1752 | 0.989719
245 | 0.987856
2742 | 0.986903

feature_importance = (pd.DataFrame({'feature': train_X.columns,
                                   'importance': model.feature_importance()})
                      .sort_values('importance', ascending=False))
plt.figure(figsize=(8, 5))
sns.barplot(x=feature_importance.importance,
            y=feature_importance.feature, palette=("Blues_d"))
plt.show()

CONCLUSION AND RECOMMENDATION

Because every Dota 2 match ends with exactly one winning and one losing team, outcome prediction is a natural binary classification problem. Real-time prediction during a game is harder because it requires up-to-date match state at each point in time. In this project, we compared multiple models to find the most suitable approach for predicting Dota 2 outcomes, then tuned hyperparameters to improve performance. Bayesian optimization was the preferred tuning method because it is efficient when each evaluation is expensive, as with cross-validated LightGBM training.

Bayesian optimization may offer little benefit for small datasets or simple models, but it becomes valuable for very large datasets, where an exhaustive grid search may not be computationally feasible. In those settings it makes the hyperparameter search more efficient.
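
To put rough numbers on this, the sketch below counts how many model fits an exhaustive integer grid over the notebook’s three search ranges would need, compared with the 10 initial points plus 10 iterations requested from the optimizer. Treating every integer in each range as a grid point is an assumption made for illustration; a practical grid would usually be coarser.

```python
from math import prod

# Search ranges from the optimizer setup, treated as inclusive integer grids.
ranges = {
    "num_leaves": (100, 200),
    "max_depth": (25, 50),
    "min_data_in_leaf": (50, 200),
}
nfold = 5  # folds per evaluation, as in the cross-validation objective

# Number of parameter combinations in the full grid.
grid_points = prod(high - low + 1 for low, high in ranges.values())

# Evaluations requested from the optimizer: 10 initial points + 10 iterations.
bayes_evals = 10 + 10

print(grid_points, grid_points * nfold, bayes_evals * nfold)
# prints: 396526 1982630 100
```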

To further enhance prediction accuracy, we recommend the use of time-series machine learning models to provide real-time forecasting during the game. Such models can take into account the changing dynamics of the game and provide more accurate predictions.

In conclusion, this notebook offers insights into Dota 2 as a growing esport and demonstrates how different machine learning models can be used to predict match outcomes. Hyperparameter optimization techniques such as Bayesian optimization, together with time-series models for real-time prediction, are promising ways to improve prediction accuracy and deepen our understanding of the game.

REFERENCES

[1] Dota 2. (n.d.). https://www.dota2.com/home

[2] Staff, T. G. H. (2022, September 7). What Makes Dota 2 So Successful. The Game Haus. https://thegamehaus.com/dota/what-makes-dota-2-so-successful/2022/04/02/

[3] mlcourse.ai: Dota 2 Winner Prediction Kaggle. (n.d.). https://www.kaggle.com/competitions/mlcourse-dota2-win-prediction/overview

[4] dcneil. (2013, October 9). Dota 2 5v5 - Red vs Blue. DeviantArt. https://www.deviantart.com/dcneil/art/Dota-2-5v5-Red-vs-Blue-406091855

[5] mlcourse.ai: Dota 2 Winner Prediction Kaggle. (n.d.-b). https://www.kaggle.com/competitions/mlcourse-dota2-win-prediction/data

[6] Dota 2 Wiki. (n.d.). https://dota2.fandom.com/wiki/Dota_2_Wiki

[7] fmfn. (n.d.). BayesianOptimization: A Python implementation of global optimization with Gaussian processes. GitHub. https://github.com/fmfn/BayesianOptimization

[8] Natsume, Y. (2022, April 30). Bayesian Optimization with Python - Towards Data Science. Medium. https://towardsdatascience.com/bayesian-optimization-with-python-85c66df711ec

  • Note: the mltools module by Prof. Leodegario U. Lorenzo II and the feedback from our mentor, Prof. Gilian Uy, and the other professors greatly improved this notebook.