
The 'Backyard' Chronicles

An in-depth dive into the technical details and development journey

What is the ‘Backyard’?

In Guilty Gear lore, the ‘Backyard’ is described as:

…kind of a metaphysical command prompt. It’s a realm of pure information, and all of that information makes up our collective reality - TheGamer

However, Guilty Gear lore is far more complicated than this entire project and is a rabbit hole delved into only by the most dedicated or deranged Guilty Gear enthusiasts. For our purposes, the Backyard is a suite of tools that I created to visually analyse matches from Guilty Gear -Strive- and attempt to further understand the factors involved in winning. It is the culmination of over a year’s worth of work and, as far as I know, it is the first of its kind, at least within fighting games. I am incredibly excited to share it with the community. In this blog, I would like to show the journey I undertook to deliver it and the pitfalls along the way. In a follow-up blog, I will reflect on the findings of the ‘Backyard’ and my opinion on its long-term viability. Before we begin, let me give a brief overview of the entire project.

Guilty Gear -Strive-

Guilty Gear -Strive- Two enter, only one may leave

For the uninitiated (and, frankly, missing out) who are not familiar with the game in question: according to the Core-A fighting game family tree, Guilty Gear -Strive- is an archetypal (traditional), 2D (2.5D), hold-back-to-block, anime (air-dasher) fighting game.

Fighting Game Tree Credit: Core-A gaming

What? You’re still lost? Ok, let’s try again. Guilty Gear -Strive- is a 2D 1v1 fighting game where two players fight until one of their characters is knocked out or time runs out. A character is knocked out once their health reaches 0. The standard format is best-of-3, requiring a player to win two rounds to win a game. Besides health, all characters have access to the resources burst and tension, which allow them to perform additional powerful moves. We will dive deeper into specific mechanics as they come up.

Note: Throughout the project you may see the word ‘set’ or ‘match’ used interchangeably with ‘game’. Technically this is incorrect, as a ‘set’ or ‘match’ is a group of ‘games’, but I didn’t realise this error until quite late in development. In the future this will be cleaned up, but for now ‘set’, ‘match’ and ‘game’ are all synonymous. {: .prompt-warning }

Backyard-Observer

Backyard-Observer’s main responsibility is training and utilising YOLOv8 vision models for metrics gathering, i.e. keeping track of the values of the numerous resources within the game.

Metrics Gathering Backyard Observer visually classifying a game

Based on the metrics gathered, it is also capable of predicting and visualising the winner of the current match in real-time, although this feature is still a proof-of-concept.

Predictions ‘Real-Time’ predictions

Check out the README.md if you’d like to know more.

Backyard-Insight

Backyard-Insight is a dashboard hosted at https://backyard-insight.info/ that visualises the metrics gathered by Backyard-Observer. The main dashboard features a graph that shows the likelihood of winning for each player at any point during the game. Above the graph is a facsimile of Guilty Gear -Strive-’s UI; as you hover over the graph, this updates to reflect the state of the game when the prediction was made. This can also serve as a historical record of the game. Additionally, if the character Asuka is present, information about his ‘spells’ is also displayed; I will go into further detail about this later on. The ‘Game Stats’ tab shows a comparison of some resources, such as burst and tension usage, between the two players.

Backyard-Insight Nitro vs Tatuma at EVO Grand Finals

Under ‘Player Stats’, a couple of tables of statistics are shown for each player. The top table shows the statistics, such as burst and tension usage, in each round or game. The bottom table is an aggregate of these statistics showing a player’s average usage per round or game, win or loss. By default it shows an aggregate across all matches, however it can be grouped by character, tournament or tournament round.

Notebooks

A series of Jupyter Notebooks via Google Colab were utilised for data exploration, data cleaning and training the machine learning models used for the win predictors. I will go over the interesting findings from these notebooks but feel free to read through them in full if you want to know more.

Origin

The idea was conceived, as all good ideas are, in frustration. As I bounced back and forth between Floor 8 and Floor 9 of the ‘Tower’, the rank system of Guilty Gear -Strive-, and struggled to climb the floors, I brainstormed ways I could improve, ignoring the ‘Training Room’ and ‘Replay Centre’ menu options. Could I make a tool that compared my gameplay against that of top players and determined what actions influence winning? Maybe I could try machine learning; there are a multitude of libraries and resources. How hard could it be?1

In all seriousness, I was also inspired by the sheer amount of stats traditional sports have and how that improves the viewing experience. DeepStrike was just starting to gain recognition, and its ability to autonomously track and present data using computer vision for combat sports, the closest traditional-sport parallel to fighting games, was fascinating. At the time I didn’t have a lot of expertise in the area, but I wanted to see if I could provide a similar experience for e-sports.

Investigation and Exploration

Initially I wanted to track every single action a character performed in-match (movement, attacking, blocking, etc.) and see the difference in actions taken by low- and high-ranked players. In my case, that character would be my main, Testament. To help me get started with the computer vision, I first followed this incredibly helpful tutorial by Learn Code by Gaming.

It used template matching, a technique where you try to find a template image within a larger input image by sliding the template over the input and comparing the similarity2. Adapting this to my use case: would it be possible to take the image of every move from dustloop (the wiki of Guilty Gear -Strive-) as a template and accurately determine whether the move was used in a frame of a video? To start off, I would try Testament’s far slash.
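As a rough illustration, the core of this approach is only a few lines of OpenCV. The file names here are placeholders and the threshold matches the one used below:

```python
import cv2

frame = cv2.imread("frame.png")                   # a single frame from a video
template = cv2.imread("testament_far_slash.png")  # move image taken from dustloop

# Slide the template across the frame and score the similarity at every position
result = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)
print(f"Best match confidence: {max_val}")

# Draw a rectangle around the best match if it clears the confidence threshold
if max_val > 0.85:
    h, w = template.shape[:2]
    cv2.rectangle(frame, max_loc, (max_loc[0] + w, max_loc[1] + h), (0, 255, 0), 2)
```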

Testament's Far Slash Testament’s far slash from dustloop

Testament template matching Testament’s far slash found within a frame of a video with template matching

Success! A green rectangle is drawn around a match with confidence over 0.85. In the example above the output was Best match confidence: 0.8679274916648865. Let’s try with a video now.

Testament template matching Many false positives…

Testament false positives template matching Those are some funny looking far slashes

That’s not good; it is completely unreliable. Clearly template matching is not going to cut it, so let’s try a more robust technique: feature detection and matching, which detects and extracts features in the template image and attempts to match them in the input image. Typical features consist of edges, corners or other interest points; in our case, the edge of Testament’s scythe and the edges of their cloak are potential features. Unfortunately, this also turned out to be ineffective.
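For reference, a minimal sketch of this approach using OpenCV’s ORB detector and a brute-force matcher (the exact detector and parameters I used may have differed; file names are placeholders):

```python
import cv2

template = cv2.imread("testament_far_slash.png", cv2.IMREAD_GRAYSCALE)
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Detect interest points (corners, edges) and compute descriptors for both images
orb = cv2.ORB_create()
kp_template, des_template = orb.detectAndCompute(template, None)
kp_frame, des_frame = orb.detectAndCompute(frame, None)

# Match template descriptors against the frame, keeping only the strongest matches
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_template, des_frame), key=lambda m: m.distance)

# Visualise the top matches for inspection
vis = cv2.drawMatches(template, kp_template, frame, kp_frame, matches[:20], None)
cv2.imwrite("matches.png", vis)
```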

Testament template matching Feature detection and matching

The feature detection on the template looks fine, but the feature matching is all over the place. It may have been possible to tweak some parameters to make this work more consistently, however there was another looming issue: this was only a single frame from a single move. Testament has five attacks in each of the standing, crouching and jumping positions, plus command normals, special moves and overdrives, totalling ~60 moves. That would mean, for each frame, at minimum 60 rounds of feature matching would need to be performed, and that’s not accounting for the fact that moves happen over multiple frames; at 60 fps (frames per second) this quickly becomes hundreds of images. Performance was going to be an issue, and with less than promising accuracy, efforts were better placed elsewhere.

When in Doubt - YOLO

As I was doing some research I stumbled upon this video:

Credit: River’s Educational Channel

In summary, they attempt to create a bot for the FPS (First Person Shooter) game Valorant with computer vision. YOLO (You Only Look Once) from Ultralytics was used for object detection. Usually, training vision models requires vast amounts of labelled and annotated images. YOLO utilises transfer learning to solve this: by providing a model pre-trained on a large, general dataset, the heavy lifting has already been done; the model already understands images and only requires fine-tuning for specific classes, greatly reducing the training time. To fine-tune the model, a dataset of appropriately labelled and annotated images is needed.

In the video the YOLO model is able to detect opponents on screen in real-time, though it failed to detect characters that were extremely close-up. This is not a common occurrence under normal gameplay and therefore there were no examples of it provided in the dataset. The issue with Valorant, a 3D tactical shooter, is that objects can be viewed from any angle or distance and finding numerous examples at these varying angles and distances can be tough.

In a 2D fighter such as Guilty Gear -Strive-, the camera is at a fixed angle and distance from the characters, only panning left and right, which alleviates this issue entirely. YOLO seemed like an ideal choice for my purposes; all that is needed is a dataset. Unfortunately, there is no existing dataset for Guilty Gear -Strive-, so I would need to obtain frames from gameplay footage and annotate them myself using the tool labelimg.
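Once frames are annotated, fine-tuning with Ultralytics takes only a few lines. A minimal sketch, where the dataset config name and training parameters are placeholders rather than the exact values I used:

```python
from ultralytics import YOLO

# Start from a checkpoint pre-trained on a general dataset and fine-tune it on
# the annotated Guilty Gear frames described by a dataset YAML.
model = YOLO("yolov8n.pt")
model.train(data="ggstrive.yaml", epochs=100, imgsz=640)

# Run the fine-tuned model on a new frame
results = model.predict("match_frame.png", conf=0.5)
```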

Data Annotation - My Nightmare

Testament annotation My life for weeks…

I wanted to track a few key objects and actions:

Testament
All actions associated with Testament, including movement, attacks and blocking

Testament moves A small sample of Testament’s actions

Objects associated with Testament
This includes the crow, succubus and the skull projectile from Grave Reaper

Testament projectiles Top Left: Succubus, Bottom Left: Crow, Right: Skull from Grave Reaper
Opponent
Differentiating between Testament and any other character. Testament also has a unique mechanic, ‘Stain’, that they can apply to their opponent, which is visually indicated by a purple aura.

Testament stain state Opponent with stain state

System Mechanics
All Bursts and Roman Cancels

Testament Burst and RCs Left: Blue Psych Burst Right: Blue Roman cancel

Each of these actions is defined as a class for YOLO, and there are over 60 of them to annotate and track. After a week, I was able to annotate around 200 images and start training the vision model. However, early results were not very promising.

Testament initial try At least it tracks the crow…

It’s very quick but it’s struggling to really identify Testament at all. Time to annotate more data…

In retrospect I have regrets about how I handled this. For me, data annotation is a truly tedious, menial task, and it diminished my optimism. The poor results created a negative feedback loop as I would begrudgingly: annotate hundreds of images; train the model; be frustrated by any flaws; and dread having to start the process again. Better structure and project planning could have helped here. I was hyper-focused on perfecting the vision model; although it is an immensely important component, there were other aspects that could have been worked on. I knew vaguely that the final product consisted of training a vision model -> metric gathering -> visualisation. With more concrete definitions for each part, I could have tried to create a vertical slice. Instead of trying to create a vision model that could distinguish all of Testament’s actions, I could have created one that only accurately classified a single move and built the rest of the systems with that. This is similar to how software teams ‘mock’ systems that aren’t built yet. It would have given a realistic goal to aim for, and the bulk of the annotation work could have been deferred to a point when the project was more mature, where the impact of the tedious task would have been reflected in the entire system, resulting in a more positive feedback loop.

The lack of planning also hurt, given that if you don’t plan out all the classes you want to track, adding them in later can be painful. Multiple times I had to go back over hundreds of already annotated images, and annotate them again because there was a new class that needed to be added.

Besides annotation, the main breakthrough for accuracy was discovering the different model sizes. Up until this point I had been using the smallest size, YOLOv8n, with only 2.6M params. Params are a measure of the number of weights and biases in a model, basically the number of components that can be tweaked; the higher the number, the larger and more complicated the model. Increasing this to the largest class, YOLOv8x with 68.2M params, improved the quality greatly.

Testament final model Tracks most of Testament’s actions correctly

This does come with some consequences: there is an impact on performance in both training and inference. While this is an improvement, it’s still not quite accurate enough. I could keep annotating more images, however I was starting to run into an issue.

Testament final model Class Distribution of annotated images

The top left of the graph shows the distribution of classes. As you can see, there is a huge imbalance in the number of examples of each class. This is expected to a degree: the two classes with huge spikes are ‘Testament’s Crow’ and the ‘Opponent’, which are in almost every example, but there are moves that rarely turn up. I had been obtaining frames by splitting up videos of high-level Testament matches from YouTube, and there is an inherent selection bias present in these matches, as players gravitate towards stronger moves, causing the unequal distribution.

Regardless, with this model I attempted to make some visualisations. If I track the order of every single move a character does throughout the match, it can be represented as a directed network graph, where every move is a node and the edges are weighted to show how often the source move led into the destination move.
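As a sketch of the idea, with a hypothetical sequence of detected moves, such a graph can be built with networkx:

```python
from collections import Counter

import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical sequence of moves in the order the vision model detected them
moves = ["far_slash", "grave_reaper", "crow_summon", "far_slash", "grave_reaper"]

# Count how often each move led directly into the next one
transitions = Counter(zip(moves, moves[1:]))

# Every move is a node; edge weights are how often the source led to the destination
graph = nx.DiGraph()
for (source, destination), count in transitions.items():
    graph.add_edge(source, destination, weight=count)

nx.draw(graph, with_labels=True)
plt.show()
```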

Testament move chart I did say ‘attempt’

I was not happy with the model, nor with the complete mess of the resulting visualisation. I could keep annotating images and hope that the accuracy improved to a point I was satisfied with, however I realised something. Earlier on I had stated my initial goal:

Could I make a tool that compared my gameplay against that of top players and determine what actions influence winning?

Yet I had completely ignored the most fundamental part: who wins a match?

Starting Over

This is the beginning of Backyard Insight and Observer as they are now. Instead of looking at the actions of a character on the screen, track the values of the different UI elements such as gauges and system texts. Those values can then be fed into a machine learning model that will predict the eventual game winner. The effects of different gauges on winning could then be examined: how important are Tension and Burst at different health values? The eventual goal would be to use this model to analyse tournament matches and provide a prediction throughout the match in a manner similar to chess, where modern chess engines inspect the strength of each player’s position.

chess Bar on the left showing the strength of player positions

To achieve this the following steps would need to be completed:

  1. Annotate and create a dataset for a vision model;
  2. Train the vision model;
  3. Use the vision model to create a dataset for a predictive machine learning model;
  4. Train the predictive model;
  5. Run the vision model on tournament matches to create a tournament dataset;
  6. Use the predictive model on the tournament dataset; and
  7. Create a dashboard from the prediction and tournament dataset.

Determining Classes to Track

Before the start of annotation, classes for the vision model needed to be determined. It wasn’t clear exactly what would be needed down the line so as much as possible was tracked. This includes:

ggstrive glossary

Health
Also known as Life Gauge, the amount of remaining health that a character has, once this reaches 0 the character loses the round
Damaged
If a player is currently being damaged, the health bar will have a red segment on it.
Burst
Used primarily for Psych Burst, a defensive option that consumes 100% Burst to interrupt and knock back an opponent. Starts at 100% at the beginning of a match and is the only gauge to be carried over between rounds. Season 2 added ‘Wild Assault’ and ‘Deflect Shield’ which both consume 50% Burst
R.I.S.C.
A gauge that builds as a player blocks attacks. It depletes when a player is hit and they will take additional damage depending on how full the bar was
Tension
The main ‘resource’ in Guilty Gear. Can be spent on a variety of offensive and defensive options. Most actions consume 50% tension which is marked by a ‘gear’ icon on the bar.
Round Count
Represented by the hearts above the health bar; they become grey and broken when a round is lost

ggstrive_system_text Source: Dustloop3

Counter
Hitting an opponent before their recovery results in a powerful ‘Counter’ hit state
Reversal
Performing an action immediately as the player recovers. Can be potent if combined with invincible moves
Just/IB
IB stands for Instant Block. If a player blocks within two frames of their opponent’s attack, large ‘JUST’ text appears on screen and they are rewarded with an advantageous state
Punish
If a player attacks their opponent during their recovery, the word ‘Punish’ will flash on their screen

ggstrive_system_text

Round Start
Each round starts with the same ‘Let’s Rock’ graphic.

ggstrive_system_text

Round End
Both ‘Slash’ and ‘Perfect’ screens can indicate the end of a round. Additionally, the Player 1 and Player 2 text is also tracked

My Recurring Nightmare - Data Annotation

I knew this would involve making another YOLOv8 vision model, but it should be easier this time. The gauges are UI elements that are on screen most of the time, which keeps the class distribution consistent. Each gauge is distinct in colour, location and appearance, and takes up a significant portion of the screen. This should mean fewer annotated frames are needed overall. Although the system texts are more situational, there is still an abundance of examples for most of them in a typical match.

ggstrive_system_text Class distribution from initial training

The class distribution is much better, leading to good results early on. After the initial training I implemented an iterative process: train a vision model -> use the model to create labels on new frames -> adjust any incorrect labels -> train the vision model with the new frames. This allowed me to annotate frames at a much higher rate.
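The pre-labelling half of that loop looks roughly like this; the paths and confidence threshold are assumptions, and the generated label files are then corrected by hand before the next training run:

```python
from pathlib import Path

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # weights from the previous training run

for frame_path in Path("new_frames").glob("*.png"):
    result = model.predict(str(frame_path), conf=0.5)[0]
    # Write the predictions in YOLO txt format so they can be reviewed and
    # corrected in an annotation tool before the next round of training
    lines = []
    for box in result.boxes:
        class_id = int(box.cls)
        x, y, w, h = box.xywhn[0].tolist()  # normalised centre x/y, width, height
        lines.append(f"{class_id} {x} {y} {w} {h}")
    frame_path.with_suffix(".txt").write_text("\n".join(lines))
```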

ggstrive_bar_confusion_matrix Confusion Matrix during a later training

The confusion matrix shows the actual class against the predicted class. An ideal matrix would only have points along the diagonal, as that is where the actual class and predicted class intersect. In this model there were a few minor errors, however they are primarily ‘background’, which is used when no prediction is made. This is not a big deal, as the videos run at 60fps (frames per second) and missing the value of a gauge on a single frame is not the end of the world.

At Last… Data Collection

The next task was to use this vision model to record a match.

Considerations

Data Format:
The data format would need to be able to represent every variable over the course of a match. This can be represented as a table, where each column is a variable and each row is a specific point in time, saved to a CSV file.
time,p1_name,p2_name,p1_health,p2_health,p1_tension,p2_tension,...
...
17.4,aba,goldlewis,0.99863,0.10833,0.3231,0.48698,..
17.5,aba,goldlewis,0.99883,0.18119,0.32424,0.49,..
17.6,aba,goldlewis,0.99857,0.22916,0.3226,0.49,..
17.7,aba,goldlewis,0.99842,0.28257,0.33599,0.49,..
17.8,aba,goldlewis,0.99913,0.33827,0.34219,0.49,..
17.9,aba,goldlewis,0.99888,0.33886,0.38146,0.49,..
Variables:
There are two types of variables that need to be determined before training a predictive machine learning model: features, or input variables, and targets, or output variables. The features are variables that will predict or explain the target, otherwise known as the outcome. The machine learning model will try to find the relationship between the different features that best determines the target. In this case, the features are relatively straightforward: they mainly consist of the value of each gauge, repeated for both players, and the number of times an action with system text occurs, e.g. counter hit, reversal…

Features

| Column Name | Type | Description |
| --- | --- | --- |
| time | float | Time elapsed in the current round |
| p1_name | string | Name of the character in the P1 position |
| p2_name | string | Name of the character in the P2 position |
| p1_health | float | Percentage of health remaining for P1, between 0 and 1 |
| p2_health | float | Percentage of health remaining for P2, between 0 and 1 |
| p1_tension | float | Percentage of tension filled for P1, between 0 and 1 |
| p2_tension | float | Percentage of tension filled for P2, between 0 and 1 |
| p1_burst | float | Percentage of burst filled for P1, between 0 and 1 |
| p2_burst | float | Percentage of burst filled for P2, between 0 and 1 |
| p1_risc | float | Percentage of R.I.S.C. filled for P1, between 0 and 1 |
| p2_risc | float | Percentage of R.I.S.C. filled for P2, between 0 and 1 |
| p1_round_count | int | Number of rounds won by P1, i.e. number of hearts lost by P2 |
| p2_round_count | int | Number of rounds won by P2, i.e. number of hearts lost by P1 |
| p1_counter | int | Number of ‘Counters’ performed by P1 |
| p2_counter | int | Number of ‘Counters’ performed by P2 |
| p1_just | int | Number of ‘Just’/IBs performed by P1 |
| p2_just | int | Number of ‘Just’/IBs performed by P2 |
| p1_punish | int | Number of Punishes performed by P1 |
| p2_punish | int | Number of Punishes performed by P2 |
| p1_reversal | int | Number of Reversals performed by P1 |
| p2_reversal | int | Number of Reversals performed by P2 |
| p1_curr_damaged | boolean | If P1 is currently being damaged |
| p2_curr_damaged | boolean | If P2 is currently being damaged |

Targets

The standard format for Guilty Gear -Strive- (and most other fighting games) is played over a best-of-3 ‘rounds’, known as a ‘game’, although referred to as ‘set/match’ in the project. While most gauges reset between rounds, critically, ‘Burst’ is carried over throughout the game, i.e. if used in round 1 it may not be available for round 2. This necessitates 2 target variables, Round Win and Set Win, and therefore most likely 2 machine learning models.

| Column Name | Type | Description |
| --- | --- | --- |
| p1_round_win | boolean | If P1 wins this round |
| p1_set_win | boolean | If P1 wins this match/set |

Recording Data

To finally begin recording data, the YOLOv8 output needs to be interpreted into the CSV data format detailed above. The pseudo-code for the logic looks like this:

```text
read YOLOv8 frame
if 'round_start' in frame: # Looking for the round start graphic "Let's Rock"
  wait until 'round_start' disappears # This will be the first frame of the actual round
  update 'round_count' # Look at the number of hearts vs broken hearts in the frame for each player
  initialise new 'round'
# 'Slash' is the round end graphic, time to determine the winner
elif 'slash' in frame:
  # Determine winner from health values
  if p1_prev_health > p2_prev_health:
        winner = P1
  else:
        winner = P2
  round_history['round_winner'] = winner
  set_history.append(round_history)
  # round_count + 1 will be equivalent to the number of rounds won at the end of a round
  if 'round_count' + 1 == 'max_round_count':
    set_history['set_winner'] = winner
    record_set_to_csv(set_history)
else:
  for each 'feature':
    if 'feature' in frame:
      current_round['feature'] = current frame value
    else:
      current_round['feature'] = previous frame value
  round_history.append(current_round)
```

This accounts for all fields in the CSV besides p1_name and p2_name, which are manually set for each video via the filename. The full data collection could now begin. The videos were sourced from GGST: High Level Gameplay, totalling ~238 videos with a roughly equal character distribution. I started the process late at night, intending for it to complete overnight. It kicked off processing ‘aba’ games, slowly making its way down the list of videos in alphabetical order as I went to bed. Waking up, energised to finally see the results, it must be done or close to it. Checking the output, it was currently processing ‘asuka’… It was still on ‘a’; this might take a while.

I had severely underestimated how long this would take. At the current rate it might take more than a week. The average video length was ~10 minutes, and with 238 videos that equates to around 40 hours of video to process. However, the data collection runs slower than ‘real-time’, i.e. 60 frames per second. The main bottleneck was the GPU, a 2080 with only 8GB of VRAM, which the YOLO model needs. An easy win would be to skip frames, as it’s not necessary to process every single one, but what would be the appropriate number of frames to skip? Fighting game rounds are not long, most not lasting longer than a minute, and losing too much fidelity would be significantly detrimental to accuracy. In the end, every 6th frame, or every 0.1 seconds, seemed to be the right balance of speed and fidelity. While the GPU would still be a bottleneck, looking back at the code, every action besides ‘read YOLOv8 frame’ uses the CPU. As the code runs sequentially/synchronously, any process using the CPU must wait until the GPU is done and vice versa, resulting in idle resources. Multi-processing can solve this: by running multiple processes in parallel and assigning a video to each, one process can use the GPU while the others use the CPU (a sketch of this setup is below). With this, processing all the videos took ~2 days, roughly the real-time length of the footage. Time to actually do something with the data.
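A simplified sketch of that setup, with placeholder paths and an arbitrarily chosen worker count:

```python
import multiprocessing as mp

import cv2
from ultralytics import YOLO

FRAME_SKIP = 6  # every 6th frame, i.e. every 0.1 seconds at 60fps


def process_video(video_path: str) -> None:
    # Each worker loads its own copy of the model and works through one video;
    # while one process waits on the GPU, the others can use the CPU.
    model = YOLO("best.pt")
    capture = cv2.VideoCapture(video_path)
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % FRAME_SKIP == 0:
            detections = model.predict(frame, verbose=False)
            # ...interpret the detections and append a row to the CSV (omitted)...
        frame_index += 1
    capture.release()


if __name__ == "__main__":
    videos = ["aba_vs_goldlewis.mp4", "asuka_vs_sol.mp4"]  # placeholder file names
    with mp.Pool(processes=4) as pool:
        pool.map(process_video, videos)
```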

Training the predictors

While the data exploration and final outcomes are covered here (ggstrive.ipynb), along with a greater explanation of the data format and eventual model, I want to talk about the decisions and discoveries made along the way. There were two models that needed to be created: the ‘round predictor’ and the ‘set predictor’. Let’s first look at the ‘round predictor’.

Feature Selection

During data collection many metrics were tracked that could be used, however not all features are equal. It’s usually a good idea to select the features that will be most ‘impactful’ for predictions. What is considered impactful?

Data exploration

Doing some preliminary graphing and comparing the different players’ gauges, a clear trend is revealed.

Data Exploration Shockingly the player with more health is more likely going to win

The x-axis is the value of the gauge for P1; likewise, the y-axis is the value of the gauge for P2. Blue indicates a P1 win and red indicates a P2 win, so there is more blue as P1 approaches 1.0 on the x-axis and vice versa. It’s clear that health will be an important feature and will be selected, however let’s see if there are any other trends we can gather.

Data Exploration Data Exploration Data Exploration P1 vs P2 for various gauges

The tension graph is separated into 4 quadrants; this is due to implementation details of how tension was recorded visually, however it becomes a useful side-effect when reviewing the graphs. Any action that uses tension requires the player to have at least 0.5 (50%), therefore the top half and right half of the graph represent when P2 and P1 can utilise their tension respectively. The top-right and bottom-left quadrants are both equal states: both players either can or cannot use tension. However, in the top left, where only P2 can use tension, there are more red dots, indicating more P2 wins, and vice versa for the bottom right and P1 wins.

The Burst and R.I.S.C. graphs seem to have no obvious trend. On top of that, R.I.S.C. has the vast majority of its data points around 0. This is unsurprising for those familiar with the game, however it does call into question whether it is a useful feature.

Of the gauges, health is absolutely necessary, with tension also looking useful, but burst and R.I.S.C. need to be investigated further.

Statistical Tests

While doing some research I came across aziztitu’s football match predictor, a project that trained a model to predict football matches, with a helpful write-up attached. In it they detail the statistical tests that they performed, including Chi^2, VIF for collinearity, and variance.

Chi^2 Test
Chi^2 is a statistical test that evaluates how likely it is that two categorical variables are independent by looking for statistically significant differences between ‘observed’ and ‘expected’ frequencies. For our case, we’re looking to see if the observed frequencies deviate from the expected frequencies for a feature in a win compared to a loss. If the observed and expected frequencies are too similar, it indicates that the feature does not meaningfully inform us about the target, as its distribution can most likely be attributed to randomness. Conversely, if the frequencies deviate greatly between a win and a loss, the results cannot be explained by randomness alone and the target is more likely dependent on the feature, making it a useful predictor. The test is only useful for variables that are categorical or frequencies, so the gauges were not evaluated with it (although it could have been possible if the gauge values were ‘binned’, i.e. put into buckets/groups of values). This left the frequency of system texts and currently damaged to be evaluated.
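A sketch of how such a test can be run over the count-style features, here using scikit-learn’s chi2 (the notebook may differ in the exact implementation; the CSV path is a placeholder):

```python
import pandas as pd
from sklearn.feature_selection import chi2

df = pd.read_csv("rounds.csv")  # placeholder path for the collected round data
count_features = ["p1_counter", "p2_counter", "p1_just", "p2_just",
                  "p1_punish", "p2_punish", "p1_reversal", "p2_reversal",
                  "p1_curr_damaged", "p2_curr_damaged"]

# chi2 returns a statistic and a p-value for each feature against the target
stats, p_values = chi2(df[count_features], df["p1_round_win"])
for name, stat, p in zip(count_features, stats, p_values):
    print(f"{name}: chi2={stat:.2f}, p={p:.4f}")
```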

chi^2 test Results of Chi^2 test analysis

The damaged feature is quite significant, as is counter hit, however the rest don’t look too impactful. This intuitively makes sense: the player that is damaged more is likely going to lose, and being ‘counter’ hit usually results in serious damage. As for punish, since the data collection was performed on high-level play, players at that skill level don’t put themselves in a position to be punished as often; if it were performed on ‘low-mid’ level players this would likely be a much more significant feature. ‘Just’ is rare in general, requiring an instant block (a technique with a 2-frame input window), so there just weren’t a lot of examples of it. Reversal seems not to matter; it only appears if a player performs a special move directly out of block stun or a knockdown, which is only useful in very specific situations.

Collinearity
Using VIF (Variance Inflation Factor), each feature is checked for collinearity, that is, how correlated two different features are. If two features are too similar, there is little reason to use both.

VIF test output for features

Although p2_health and p1_burst are in the ‘removed features’ section, it wouldn’t make sense to remove a feature for only a single player, so these features were kept. However, time is highly collinear and will be removed.
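VIF itself is straightforward to compute with statsmodels; a sketch assuming the same round DataFrame as above:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("rounds.csv")  # placeholder path for the collected round data
gauges = df[["time", "p1_health", "p2_health", "p1_tension", "p2_tension",
             "p1_burst", "p2_burst", "p1_risc", "p2_risc"]]

# A VIF well above ~10 is a common rule of thumb for problematic collinearity
vif = pd.DataFrame({
    "feature": gauges.columns,
    "VIF": [variance_inflation_factor(gauges.values, i) for i in range(gauges.shape[1])],
})
print(vif)
```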

Characters
Character matchups are an essential part of any fighting game and should be essential to making an accurate prediction, so p1_name and p2_name should be selected features. However, after observing the results of some trained models, the impact of the character choices seemed too significant. The likely reason for this is simply that there wasn’t enough variety in the matches observed. Guilty Gear -Strive- has 28 characters, totalling 784 possible matchups. There were around 238 videos used, mostly covering only a single matchup each, and of the matchups that were represented it was usually the same handful of players playing them. For example, the majority of I-No matches in the training set were played by Daru, a notable I-No specialist. The training set has a few samples of I-No against Sol, played by Mocchi. With so few examples in the dataset, would the results be indicative of the characters, I-No vs Sol, or the players, Daru vs Mocchi? Looking even more broadly, are any results with I-No representative of the character or of Daru’s play? The goal was to see how character choice could influence the result, however with the small sample size it was difficult to separate character from player, and therefore p1_name and p2_name were dropped.

The final feature list for the round predictor looks like:

| Column Name | Type |
| --- | --- |
| p1_health | float |
| p2_health | float |
| p1_tension | float |
| p2_tension | float |
| p1_burst | float |
| p2_burst | float |
| p1_counter | int |
| p2_counter | int |
| p1_curr_damaged | boolean |
| p2_curr_damaged | boolean |

Training the round predictor

The models Gaussian Naive Bayes, Logistic Regression, Random Forest Classifier, Decision Tree Classifier and Multi-Layered Perceptron were trained and compared by their accuracy. Initially the results looked very promising:

```text
gb fit
Accuracy: 0.6087470540909009
Precision: 0.6214640836619235
Recall: 0.6041636153352761
F1 Score: 0.6126917462675101
lr fit
Accuracy: 0.7034352339782644
Precision: 0.7025487766358092
Recall: 0.7301680294543967
F1 Score: 0.7160921874458254
rfc fit
Accuracy: 0.990182458809012
Precision: 0.9903052826111105
Recall: 0.9905303508568138
F1 Score: 0.9904178039475113
dtc fit
Accuracy: 0.6769395141157875
Precision: 0.67378067619532
Recall: 0.7159155761839586
F1 Score: 0.6942093715718539
```

While all the results were fine, there was a clear winner: 99% accuracy with the Random Forest Classifier. This was beyond my wildest expectations for this project, a near perfect predictor. Was it really possible to predict matches this accurately, with only these few variables? Unfortunately, if it sounds too good to be true, it probably is.

Overfitting

A lesson for myself, be careful of code that you copy. When learning how to create the dataset I took a bunch of code from various sources. There was one line I glossed over though:

```python
#Split the data
round_x_train, round_x_test, round_y_train, round_y_test = train_test_split(
    round_x, round_y, test_size=0.33, random_state=125
)
```

It seemed innocent enough: it splits the data into training and testing sets, but random_state=125 shuffled the data. The split should look like this:

Training Test Split

With training and test data being distinct matches. However with the shuffle the test data was instead instances embedded in training data:

Training Test Split

This caused overfitting to occur. For example, take these 3 data points:

| | time | p1_health | p2_health | p1_tension | p2_tension | p1_burst | p2_burst | p1_counter | p2_counter | p1_curr_damaged | p2_curr_damaged | p1_round_win |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training | 10.2s | 78.7 | 65.9 | 50.0 | 22.0 | 100 | 11.9 | 2 | 1 | True | False | True |
| Test | 10.3s | 76.0 | 65.9 | 50.5 | 23.0 | 100 | 12.2 | 2 | 1 | True | False | True |
| Training | 10.4s | 75.5 | 65.9 | 50.1 | 24.0 | 100 | 12.5 | 2 | 1 | True | False | True |

The test data point is situated in between the training data points; all three have similar values and the same target, a player 1 win. The point of the model is to find the relationship between all the features that can predict the ‘target’. It does this iteratively, constantly adjusting values, evaluating against the test data and then adjusting again. Since the training and evaluation are performed on essentially the same match, it becomes overconfident that the matches within the dataset are indicative of all matches. Eventually the model converges to almost match the dataset exactly, resulting in 99% accuracy. It seems the Random Forest Classifier was especially adept at this, being able to build a multitude of decision trees to match these specific data points. This is an almost textbook example of overfitting, and it would be more correct to say that the model is recalling the matches in the dataset rather than trying to predict the outcome. It is not indicative of how it will perform when viewing matches outside the dataset, essentially making it useless. Simply removing random_state=125 solved the overfitting. The Random Forest Classifier drops to 65% accuracy: still a good result, but I could no longer believe that I had made a perfect model.
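For what it’s worth, another way to guarantee the ‘distinct matches’ split illustrated above is to split on a match identifier rather than on individual rows. A sketch, assuming the round_x/round_y data from the split above and a hypothetical match_id column:

```python
from sklearn.model_selection import GroupShuffleSplit

# Keep every row from a given match entirely in either the training set or the
# test set, so the model is never evaluated on frames from matches it has seen.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33)
train_idx, test_idx = next(splitter.split(round_x, round_y, groups=df["match_id"]))

round_x_train, round_x_test = round_x.iloc[train_idx], round_x.iloc[test_idx]
round_y_train, round_y_test = round_y.iloc[train_idx], round_y.iloc[test_idx]
```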

Training the set predictor

This is mainly more of the same; I would highly recommend reading the notebook above if you’re interested in more details, however there is one notable thing to talk about. Determining the features was a little more challenging. Burst and the round count were the only features that persist between rounds, however these features alone would not be able to determine the winner of a set; it would need to know what is happening in the current round. It didn’t seem right to use the same features as the round predictor, now with an added round count. Instead, the prediction from the round predictor was fed into the set predictor.

The final features for the set predictor are:

  • p1_burst
  • p2_burst
  • p1_round_count
  • p2_round_count
  • current_round_pred
    • This is the result of the round predictor in the current game state

Burst is fed into both predictors, but its impact on a round and on a set is different.
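Roughly, building the set-predictor features looks like this, where round_model stands in for the trained round predictor and set_df is a hypothetical DataFrame of per-frame rows that also carries the round counts and set outcome:

```python
round_features = ["p1_health", "p2_health", "p1_tension", "p2_tension",
                  "p1_burst", "p2_burst", "p1_counter", "p2_counter",
                  "p1_curr_damaged", "p2_curr_damaged"]

set_x = set_df[["p1_burst", "p2_burst", "p1_round_count", "p2_round_count"]].copy()
# Probability that P1 wins the current round, as estimated by the round predictor
set_x["current_round_pred"] = round_model.predict_proba(set_df[round_features])[:, 1]
set_y = set_df["p1_set_win"]
```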

After some hyper-parameter tuning, the Multi-Layered Perceptron was chosen for both the round and set predictors moving forward. It’s time to start looking at tournament matches…

Asuka Arc

Although the results looked good, I wasn’t fully satisfied with them. Especially after removing the character matchups from the features, the model seemed very generic, only looking at a few UI elements. I wanted to see if there was something more character-specific I could dive into, and then I saw this:

Credit: Sajam

Sajam talks about the difficulty in commentating Asuka due to the sheer complexity of his spell mechanic. Asuka has 26 different unique spells and can hold up to 4 at any point. After using a spell he can draw more from a pool of 30 spells. Asuka’s spells can be thought of as cards: the 4 on screen are the ‘hand’ that he plays, and the pool is his deck. Using the techniques I’ve learnt so far, would there be a way to determine the quality of the currently held spells? I can’t believe I scope-creeped myself…

Synthetic Frames

Investigating Asuka’s spells requires another vision model, but I refuse to draw more boxes. There is the additional issue that some spells are used much more often than others, which would lead to a class distribution imbalance if I were to manually annotate game footage.

Asuka frame Asuka and his spell in game

Asuka’s spells are UI elements that appear in the same place on screen at all times. Synthetic frames that mock Asuka’s spell UI can be created by taking any frame of the game, even ones without Asuka, and overlaying random spells onto it in the correct location. Since overlaying the spells is done programmatically, the corresponding labels can be created for the YOLOv8 training, thereby avoiding any need to annotate manually.

Asuka Synthetic frame Asuka Synthetic Frame, spells placed on screen with correct label

Using this technique it was possible to create thousands of annotated frames almost instantaneously for training. This vastly sped up the process of training the model, taking only a few days rather than the weeks and months of the previous models.
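A sketch of the frame generation; the directories, slot coordinates and icon sizes here are placeholders, not the values used in the project:

```python
import random
from pathlib import Path

from PIL import Image

SPELL_DIR = Path("spell_icons")        # one cropped icon per spell class
FRAME_DIR = Path("background_frames")  # arbitrary gameplay frames, Asuka not required
OUTPUT_DIR = Path("synthetic")
SLOT_POSITIONS = [(100, 650), (160, 650), (220, 650), (280, 650)]  # made-up pixel coords

OUTPUT_DIR.mkdir(exist_ok=True)
spell_paths = sorted(SPELL_DIR.glob("*.png"))
class_ids = {path.stem: i for i, path in enumerate(spell_paths)}

for frame_path in FRAME_DIR.glob("*.png"):
    frame = Image.open(frame_path).convert("RGBA")
    labels = []
    for x, y in SLOT_POSITIONS:
        spell_path = random.choice(spell_paths)
        spell = Image.open(spell_path).convert("RGBA")
        frame.paste(spell, (x, y), spell)
        # YOLO label format: class id, then normalised centre x/y, width and height
        cx = (x + spell.width / 2) / frame.width
        cy = (y + spell.height / 2) / frame.height
        labels.append(f"{class_ids[spell_path.stem]} {cx} {cy} "
                      f"{spell.width / frame.width} {spell.height / frame.height}")
    frame.convert("RGB").save(OUTPUT_DIR / frame_path.name)
    (OUTPUT_DIR / f"{frame_path.stem}.txt").write_text("\n".join(labels))
```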

Training the Asuka Spell Model

Full details of the training and model exploration are shown in ggstrive_asuka.ipynb if you are interested. There are some interesting obstacles I want to highlight here.

Data Format
The data obtained from the vision model and translated into the CSV looked like this:

| asuka_spell_1 | asuka_spell_2 | asuka_spell_3 | asuka_spell_4 |
| --- | --- | --- | --- |
| howling_metron | howling_metron | go_to_marker | bookmark_random_import |

Much like before, each row represents a point in time during the match and each column represents the spells as shown on screen at that moment, from left to right. The models don’t know how to interpret strings, so they need to be encoded via a one-hot encoder, effectively turning each column into a binary vector of its possible values. The above example would become:

| asuka_spell_1_howling_metron | asuka_spell_2_howling_metron | asuka_spell_3_go_to_marker | asuka_spell_4_bookmark_random_import |
| --- | --- | --- | --- |
| 1, 0…0 | 1, 0…0 | 1, 0…0 | 1, 0…0 |

This inflates the number of columns from 4 to 4 * the number of spells. However, the model should not care about the order of the spells as, functionally, the hand above in any other order is identical; it doesn’t matter which specific column any individual spell is in. To alleviate this issue, a custom encoder was created that converted the above format into:

| howling_metron | go_to_marker | bookmark_random_import |
| --- | --- | --- |
| 2, 0…0 | 1, 0…0 | 1, 0…0 |

Each spell has its own column and the value becomes the number of spells seen on that frame.
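A sketch of that encoding with pandas, using the example hand above (the real encoder covers the full spell list):

```python
import pandas as pd

raw = pd.DataFrame({
    "asuka_spell_1": ["howling_metron"],
    "asuka_spell_2": ["howling_metron"],
    "asuka_spell_3": ["go_to_marker"],
    "asuka_spell_4": ["bookmark_random_import"],
})
spell_columns = ["asuka_spell_1", "asuka_spell_2", "asuka_spell_3", "asuka_spell_4"]

# One column per spell, with the value being how many copies are in the hand;
# slot order no longer matters.
counts = (raw[spell_columns]
          .apply(lambda row: row.value_counts(), axis=1)
          .fillna(0)
          .astype(int))
print(counts.iloc[0].to_dict())
# howling_metron -> 2, go_to_marker -> 1, bookmark_random_import -> 1
```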

Under Sampler
Initial training of the models produced some interesting results:
```text
gb fit
Accuracy: 0.6259795457564086
Precision: 0.7165239328155082
Recall: 0.7209271257765012
F1 Score: 0.7187187853765732
lr fit
Accuracy: 0.6673307654845708
Precision: 0.6738147405715351
Recall: 0.9654665686994857
F1 Score: 0.7936961177310417
rfc fit
Accuracy: 0.6576349227431708
Precision: 0.6926024481106972
Recall: 0.8692806091777436
F1 Score: 0.7709487278220432
dtc fit
Accuracy: 0.6675964050117325
Precision: 0.6778683445350112
Recall: 0.949903146082426
F1 Score: 0.79115438108484
mlp fit
Accuracy: 0.6702085270288219
Precision: 0.6817435005315551
Recall: 0.9423552200921782
F1 Score: 0.7911397728865834
```

These are some strange results. The clue is in the consistently high Recall value. Recall is a measure of how well the model finds all instances of the positive class.

\[\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}\]

The higher the Recall the lower the False Negative value. For that to happen either the model was extremely accurate or it rarely predicted a negative instance. For the latter to be true while maintaining a high accuracy, the test data must be overwhelmingly positive instances.

```text
#Test Data
asuka_win
True	14971
False	7616
```

This turns out to be the case: there is a significant class imbalance towards the positive class. The trend also holds for the dataset as a whole, which introduces a bias to the predictions that should be avoided. It can be circumvented by utilising a Random Undersampler that removes instances of the majority class.
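With the imbalanced-learn library this is only a couple of lines; a sketch assuming the Asuka feature matrix and labels from the notebook:

```python
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop rows from the majority class (Asuka wins) until the two
# classes are roughly the same size
undersampler = RandomUnderSampler()
asuka_x_balanced, asuka_y_balanced = undersampler.fit_resample(asuka_x, asuka_y)
print(asuka_y_balanced.value_counts())
```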

```text
#Test Data after Random Undersampler
True	7664
False	7600
```

The training results then fall within expectations.

```text
gb fit
Accuracy: 0.5914570230607966
Precision: 0.5843205574912892
Recall: 0.6543964620187305
F1 Score: 0.6173763651981838
lr fit
Accuracy: 0.5870020964360587
Precision: 0.5888318356867779
Recall: 0.5966441207075962
F1 Score: 0.5927122367230908
rfc fit
Accuracy: 0.5818265199161425
Precision: 0.5951582324631763
Recall: 0.5308272632674298
F1 Score: 0.561155036094878
dtc fit
Accuracy: 0.5721960167714885
Precision: 0.5572700296735905
Recall: 0.7328303850156087
F1 Score: 0.6331048432408136
mlp fit
Accuracy: 0.5906053459119497
Precision: 0.6004187020237265
Recall: 0.559573361082206
F1 Score: 0.5792769137547971
svc fit
Accuracy: 0.6072458071278826
Precision: 0.6116312804958459
Recall: 0.6032778355879292
F1 Score: 0.6074258398271233
```
Training and results
For the most part, training was similar to how it was performed before, although an SVC (Support Vector Classifier) was also evaluated. The accuracy for all models was on the lower side, ~60%, but that’s not unexpected, as looking only at spells was never going to be incredibly accurate. In the end the Multi-Layered Perceptron once again gave the best results. The model, when given a hand of spells, outputs its confidence, represented as a percentage, that Asuka will win. Generally this range is about 40% -> 60%. In isolation it’s not intuitive to judge the quality of a hand compared to others. Instead, all possible hands were evaluated by the model, ranked, and a percentile created. This has the added benefit that precomputing all the hands allows the results to be cached, which gives much faster lookups and skips the need to run the model at all afterwards.
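A sketch of the precomputation, assuming duplicates are allowed in a hand, using spell_model to stand in for the trained Multi-Layered Perceptron and the count encoding described earlier (the spell list here is truncated):

```python
from itertools import combinations_with_replacement

# Truncated placeholder; the real list covers all of Asuka's unique spells
SPELLS = sorted(["howling_metron", "go_to_marker", "bookmark_random_import"])


def encode(hand):
    # Count-encode a hand the same way as the training data
    return [hand.count(spell) for spell in SPELLS]


# Score every possible 4-spell hand once, rank them, and cache the percentiles
hands = list(combinations_with_replacement(SPELLS, 4))
scores = [spell_model.predict_proba([encode(hand)])[0][1] for hand in hands]
ranked = sorted(zip(hands, scores), key=lambda pair: pair[1])
percentiles = {hand: rank / len(ranked) for rank, (hand, _) in enumerate(ranked)}

# Lookups at runtime are dictionary accesses; the model never has to be run again
hand = tuple(sorted(["howling_metron", "howling_metron", "go_to_marker", "bookmark_random_import"]))
print(percentiles[hand])
```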

Evaluating Tournaments

If you remember, at the very beginning of this a list of tasks was defined:

  1. Annotate and create a dataset for a vision model
  2. Train the vision model
  3. Use the vision model to create a dataset for a predictive machine learning model
  4. Train the predictive model
  5. Run the vision model on tournament matches to create a tournament dataset
  6. Use the predictive model on the tournament dataset
  7. Create a dashboard from the prediction and tournament dataset

At this stage, steps 1-5 have been completed, with a small detour to create a model for Asuka’s spells as well. Analysing the tournament matches should be simple enough with the tools that have already been created. The vision and predictive models should work exactly like they did during the data collection phase, as long as the tournament matches are similar to the videos used in the dataset.

Oh no…

Did you catch that? If we go back to the logic used in the data collection:

```text
read YOLOv8 frame
if 'round_start' in frame: # Looking for the round start graphic "Let's Rock"
  wait until 'round_start' disappears # This will be the first frame of the actual round
  update 'round_count' # Look at the number of hearts vs broken hearts in the frame for each player
  initialise new 'round'
...
```

The round started, but where was the round start graphic? The videos that had been analysed up until this point were recorded in-game replays of raw match footage, a controlled environment concerned only with the gameplay. However, a tournament production is more dynamic; there are other aspects to focus on, such as the players, commentators or crowd reactions, and showing the ‘round start’ graphic is not necessarily a priority. This ends up being the case for the end-of-round ‘Slash’ graphic as well, often cutting to player reactions to the match result. The solution was to build a fail-safe condition based on the round count/hearts. The round count only changes if a player wins a round, when a heart at the top goes grey, or if a player wins the entire set and all hearts return to red:

```text
read YOLOv8 frame
if 'round_start' in frame: # Looking for the round start graphic "Let's Rock"
  wait until 'round_start' disappears # This will be the first frame of the actual round
  update 'round_count' # Look at the number of hearts vs broken hearts in the frame for each player
  initialise new 'round'
if current round count > previous round count:
  initialise new 'round'
if current round count == 0:
  initialise new 'set'
```

It sounds simple, but the actual execution is much more involved. If you’re interested, the code is open source and can be found here.

Data Cleaning

One of the downsides of using a vision-based model for reading gauges is that sometimes, for a small number of frames, the model would ‘hallucinate’ and see the gauge expand or shrink without any actual movement. To combat these random outliers, a rolling median was applied to all the gauges in the tournament matches. There was a balance that needed to be struck with the size of the window: a large window would catch outliers that lasted for more than a few frames, however there would be a loss of fidelity, as it would delay reporting the actual movement of the gauge. In the end a window size of 5 seemed to be optimal, although some outliers still exist.
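A sketch of the smoothing with pandas, assuming a DataFrame df of tournament rows and a centred window (whether the window is centred or trailing is an implementation detail):

```python
import pandas as pd

gauge_columns = ["p1_health", "p2_health", "p1_tension", "p2_tension",
                 "p1_burst", "p2_burst", "p1_risc", "p2_risc"]

# Replace each gauge value with the median of a 5-sample window, which smooths
# out single-frame hallucinations like the spike shown below
df[gauge_columns] = (df[gauge_columns]
                     .rolling(window=5, center=True, min_periods=1)
                     .median())
```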

Raw data

| time | 20.1 | 20.2 | 20.3 | 20.4 | 20.5 | 20.6 | 20.7 | 20.8 | 20.9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| p1_health | 50.0 | 50.0 | 75.5 | 75.5 | 50.0 | 45.5 | 45.5 | 45.5 | 45.5 |

After cleaning

| time | 20.1 | 20.2 | 20.3 | 20.4 | 20.5 | 20.6 | 20.7 | 20.8 | 20.9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| p1_health | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 45.5 | 45.5 | 45.5 | 45.5 |

Should this have been applied to the training dataset during the predictive model training? Probably, but the thought was that, with enough data, any inaccurate outliers would be outweighed by correct data. However, if I continue working on this, it will be applied to new datasets.

More Stats!

While the main point of collecting the data was to create a running prediction, it also is a history of the match. Using this history it is possible to extract other interesting stats from each match such as:

  • Bursts used
  • Burst Gauge used
  • Tension used
  • Time spent in the lead according to the predictor

Furthermore, these stats were grouped and aggregated by player to show a breakdown of stats over many games. These are created in ggstrive-Tournament.ipynb, as a pre-processing step to speed up the eventual dashboard.
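The aggregation itself is a standard pandas groupby; a sketch with hypothetical column names rather than the exact schema used by the notebook:

```python
import pandas as pd

stats = pd.read_csv("match_stats.csv")  # placeholder: one row per player per game

# Aggregate per-game stats into per-player averages for the dashboard
player_stats = stats.groupby("player").agg(
    avg_bursts=("bursts_used", "mean"),
    avg_tension=("tension_used", "mean"),
    avg_time_in_lead=("time_in_lead", "mean"),
    games=("game_id", "nunique"),
)
print(player_stats)
```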

Almost there, Creating Visualisations

The dashboard hosted at https://backyard-insight.info/, with the source code at backyard-insight, is a Plotly Dash web app created in Python with a MongoDB Atlas backend, hosted on Render. As someone with limited frontend experience, it was the easiest way to build something quickly. The bulk of the work was learning some CSS to replicate the UI elements of Guilty Gear -Strive- for the prediction graph. Unfortunately it is not ‘responsive’, i.e. it does not resize itself well to smaller screens such as mobiles or tablets. The limitation here, besides my own skill, is that the Plotly graph component itself does not work well with limited screen size.

Conclusions

This was quite a journey. I appreciate everyone who made it through the entire blog, and I look forward to any questions you may have. In the follow-up blog I’ll reflect on the outcomes of the ‘Backyard’ as well as talk about its potential future.

Footnotes

  1. https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect ↩︎

  2. https://docs.opencv.org/4.x/d4/dc6/tutorial_py_template_matching.html ↩︎

  3. https://www.dustloop.com/ ↩︎

This post is licensed under CC BY 4.0 by the author.
