
How to do basic football data analyst stuff — Technical article

Submitted by: Confaderal | 6th February 2024


A computer programmer is using Rstudio to analyze football or soccer matches

Ever wondered what a football data analyst's work actually involves? This article strays away from our beloved Hougang United for a moment to take a basic dive into the data-collecting work done by football data analysts. It is by no means a comprehensive job scope, just a glimpse into the work done.


If you've played Football Manager and wondered how those graphs, bar charts and scatterplots came to be, then you've come to the right place.


I'll try to keep it as straightforward and simple as possible for the layman, so that we can focus on the fun part without going into too much detail. It would really help, though, if you have at least some basic skills in the software we are going to go through, namely the video editing software and the programming IDE.



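Software needed (for your basic football data analyst session): a video editor (I use DaVinci Resolve, which is free to use), a YouTube account, the event-tagging website used in step 3, the R language, and the RStudio IDE.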



Also, I will be posting a sample CSV file so that you can load it into the programming IDE and see the data come to life. You can use this if you're not keen on doing steps 1-3. It is from a 2022 match between LCS and our Hougang United! (Due to some requests, you can also download the CSV and all of the project source code from my GitHub repository!)


Without further ado, here we go!




1. FIRST STEP: Getting the full match video and "cleaning" it

The first thing we're going to do is clean up the video of the full match. We do this because we want the timecodes for specific parts of the match to fall in line with the actual timing of the match.


You know how a full match video usually shows the pre-game, where the players are shaking hands with each other? Yeah, we don't want that, so we take that part out. We also don't want anything that happens after the first-half whistle is blown, so we cut straight from the end of the first half to the start of the second half.


Do take note, however, that the first half rarely ends at exactly 45 mins, and that's okay. If the first half ends at 48 or 49 minutes, just be sure to take this into account when you read the timecodes of second-half events in your final data. For example, if the first half ends at 49 mins and Kracjek scores in the second half at 70 mins on your data sheet, you shave off 4 mins from the timecode (49 - 45 = 4). So Kracjek actually scored at 66 mins! (This should match the on-screen timer of the match.)
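Since that adjustment is just simple arithmetic, here is a minimal sketch of it in R. The function name and its arguments are my own invention for illustration, not part of any package, and it assumes your timecodes are in whole minutes.

```r
# Hypothetical helper: convert a minute on the cleaned video into the
# minute shown on the match clock.
to_match_minute <- function(video_minute, first_half_end, half) {
  # Minutes of first-half stoppage time that pushed the second half later.
  stoppage <- pmax(first_half_end - 45, 0)
  # Only second-half events need to be shifted back.
  ifelse(half == 2, video_minute - stoppage, video_minute)
}

# First half ended at 49 mins, goal tagged at 70 mins in the second half.
to_match_minute(70, first_half_end = 49, half = 2)
# [1] 66
```

First-half events are left alone, so a tag at 47 mins simply corresponds to 45+2 on the match clock.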


Let's begin! Go ahead and put your full unedited match video into your video editing software of choice. I'm going to use DaVinci Resolve (free to use, listed above).


Open up DaVinci Resolve and drag the video of the full match into the NLE (non-linear editor). Right at the bottom of the software there is a row of tabs. Click on "Edit mode" (pictured below, circled in red, fig1a).


fig1a. Using DaVinci Resolve to modify the full match video.

This is the screen where you do the hard work: (a) trim the start of the video to the moment the ball is kicked off, (b) cut out the interval between the end of the first half and the start of the second half, and finally (c) cut out everything after the final whistle is blown. You can use the razor tool: [Ctrl/Cmd + B].


Once that is done, you need to add a timecode. You do this by clicking on "Workspace" in the top menu and then clicking "Data burn-in" (pictured below, circled in red, fig1b).


fig1b. Creating a timecode generator for our video

A window will then pop up. Check the "Record Timecode" checkbox as pictured below (fig1c), then close the pop-up window.


fig1c. Click on the Record timecode checkbox and a timecode will appear.

Now you're done! You just need to export the video so that you can upload it to YouTube (as an unlisted video). To export, go to the bottom tab, click on "Delivery mode", modify the video settings if needed, and hit "Add to render queue" (pictured below, circled in red, fig1d). Do remember to always set your resolution to 1920x1080 or higher so that you can clearly see the numbers on the players' jerseys or their faces. I named my file "Fullmatch_LCSvHGU" and set the location to my desktop.


fig1d. Change the settings of the video as needed and add to render queue.

After doing so, the video will appear in your render queue in the panel at the top-right. Click "Render all" (as pictured below, fig1e). Your video will "play" along while your PC chugs out the file to the export location you chose. In my case, that's my desktop.


fig1e. Render all will export out the video to your desired destination.

Now, onto the next step!




2. SECOND STEP: Uploading to YouTube

This step is pretty self-explanatory. You can use your own YouTube account to upload. The only requirement is that it needs to be an "unlisted" video. Once it is uploaded, copy the URL of the video.


fig2a. Ensure that the video is 'Unlisted'. Otherwise YouTube will take down your video.


On to the next step!




3. THIRD STEP: Event-tagging the action

This step requires patience and meticulous work. Basically, "event-tagging" is where every pass, shot and cross is recorded along with the time in the game at which it happened. After you have tagged all the events in the full match, you can "export a CSV" file, which gives you something like an Excel sheet full of the happenings in the game. This CSV file is important for the next step, so do take some time to make sure the data is as accurate as possible.


Open the event-tagging website. The first thing to do is to change the video URL to that of the YouTube video which you painstakingly uploaded a while ago (pictured below, fig3a).


fig3a. Enter the YouTube URL of your unlisted video.


Next, click on "Edit tags" and change "Home" to the home team and "Away" to the away team. Then assign the names (and numbers) of the respective players to the tags (pictured below, circled in red, fig3b).


fig3b. Click "edit tags" and begin to change the data according to your video.

Next, edit the tags for the different types of actions that will happen on the pitch as well, e.g. shot, goal, foul, set piece, pass etc. In my example below, I've listed my actions as:

-Successful pass

-Shot

-Thru pass

-Cross

-Throw

-Stray pass

-Set Piece

-Foul

-Interception

-Dribble attempt

-Corner

-Goal

-Assist

-Possession Lost

-Offside


fig3c. When the tags are purple, you can edit them. Once done, click "Edit tags" again to lock your tags.

Remember to click on "Edit tags" again so that the information is locked. Now we can move on to actually tagging the video! Play the video in the top-left panel and tag each action by first clicking on the player's name and then on the action that he did. Then "draw" it on the mini-field in the bottom-right panel.


Your event will automatically appear in the top-right panel (as pictured below, fig3d). If I click on an event in the top-right panel, it is highlighted in green, and on the mini-field below I can see the pass from its start point to its end point in the form of purple circles. Also, when you click on an event, the YouTube video in the top-left panel will "rewind" or "go forward" to the point where the event was made. Try it! It's magic.


fig3d. You can see the "events" that you've tagged by simply clicking on the action made.

Once you have finished event-tagging the full match, the hard work is over and you're on to the very fun part, where you can see the fruits of your labor come to life. Click on "Export CSV" (pictured below, circled in red, fig3e).


fig3e. Once done, click on Export CSV to get a CSV file with all the event tag information that you've entered.

As a result, a CSV file will be churned out for you. You can use Microsoft Excel to open the file, and it should look something like the picture below (fig3f). In my case, I added a few extra parameters from column I onwards to beef up my data based on what I planned to do. I also added a "Half" parameter so that I can separate the first half from the second half.


fig3f. This photo shows an example of the CSV. It displays action-by-action information on the goings-on in the match.
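If you'd like a first look at the exported CSV in R rather than Excel, here is a minimal sketch. The rows are made up, and the column names (team, player, event, half) are my assumptions; your export will have its own headers, so check them first.

```r
# Made-up rows standing in for the exported file. The column names are
# assumptions, so compare them with the header row of your own CSV.
events <- read.csv(text = "team,player,event,half
LCS(H),Player A,Successful pass,1
LCS(H),Player B,Shot,1
HGU(A),Player C,Cross,1
HGU(A),Player D,Shot,2
HGU(A),Player C,Successful pass,2")

# With a real file you would point read.csv() at the file instead of text.
# Show the columns and their types.
str(events)

# Count every event type for each team.
table(events$team, events$event)

# Count the events in each half.
table(events$half)
```

Counting like this is a quick way to spot tagging mistakes, such as a misspelled team or event name, before you move on to the graphs.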

Once this is done and you have your CSV, it's time to move to the next and final step!



4. FINAL STEP: Using RStudio to get graphs

This step is the most rewarding and fun. It will clear up any misconceptions you had about a player and display the statistics pretty clearly for you to see.


(A) As listed above, the first thing you want to do is download the R language: https://cran.rstudio.com/

Download the latest version for Windows, Mac or Linux.

fig4a. Install R language

(B) Next, install your programming IDE. We're going to use RStudio, so go ahead and download it from here: https://posit.co/download/rstudio-desktop/ . Once downloaded, install it as per usual.


fig4b. Wait for the installer to finish installing R studio

fig4c. Use these default settings when prompted during the installation.


(C) Now that you've installed it, go ahead and open the software to get a sense of what it does. It may look daunting at first, but you'll get used to it! The IDE displays four panels: the Source editor, the Console, the Workspace/History browser, and the Plots panel (as pictured below).


Grossly simplifying things: the Source panel is where you type in your R code. The Workspace/History panel lists the data and objects that your code has created, and its History tab lists the commands that you have already run. The Console panel displays useful information about the code you run from the Source panel, for example which line of your code has an error. You can also use it to install the add-on library packages that your code will need later on. Lastly, the Plots panel shows the output of your source code, so all the graphics, bar charts and scatterplots show up here. (Be sure to click on the "Plots" tab. In my screenshot below, fig4d, the tab was on "Files". Sorry.)

fig4d. This shows the different panels in the RStudio IDE.

If you open up RStudio and don't see the Source panel, don't worry. Just click on the "box icon" at the top as shown below (fig4e).


fig4e. When you first open the IDE, if you don't see the Source panel, just click this "box" circled in red.

(D) Next, you have to install the add-on library packages to make your charts work as intended.

You have to:

  • Install the 'dplyr' package within the IDE console (install.packages('dplyr'))

  • Also install 'plotly' (install.packages('plotly'))

  • Also install 'ggplot2' (install.packages('ggplot2'))

  • And any other project-specific library packages that you see listed at the top of the source code in the Source panel (e.g. ggsoccer, ggthemes etc., as pictured below). Don't worry, the IDE will prompt you to install them if you don't already have them.


fig4f. When you open a source file, the IDE will tell you if a certain package is not installed. Don't worry.

To install a library package, go to the Console panel and type install.packages(). Inside the parentheses, put the name of the package that you want to install, wrapped in quotation marks (example below with ggplot2, fig4g).


fig4g. Install the packages by typing in the console panel.

After a while, the installation will finish and it'll look something like the picture below (fig4h).


fig4h. The console panel will chug through your installation as shown.
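If you'd rather not type install.packages() once per package, here is a minimal sketch that checks which of the packages named above are missing and installs only those. The variable names are my own. The install is guarded by interactive(), so it only runs when you are working inside RStudio or an R console.

```r
# Packages named in this article; add any others that are listed at the
# top of the script you want to run (e.g. ggsoccer, ggthemes).
needed <- c("dplyr", "plotly", "ggplot2")

# TRUE for every package that is already installed.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names that still need installing.
to_install <- needed[!installed]

# interactive() is TRUE inside RStudio, so the download only happens there.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}

# Show what was missing (empty if everything was already installed).
to_install
```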

Once that's done, you're ready to code. (E) Since we're beginners here, we're not actually going to program anything from scratch. Just go ahead and look at the glossary below for a sample of each type of graph, which I've already prepared. All you need to do is open it in RStudio and make some small changes:


  1. Install all the library packages listed in the first few lines of the source code.

  2. Take note that a hashtag (#) at the start of a line marks a programmer's note (a comment). It is not run as part of the code.

  3. Swap the variables inside my source code to fit your data.

  4. Change the file path so that it points to the CSV file that you created using the event tagger.


When changing the file path of the CSV file, do take note that different operating systems use different syntax. On Windows, use double backslashes in the file path (R also accepts forward slashes there). On a Mac, use forward slashes. (Please see the picture below, fig4i.) It is also easier to put the CSV file on your desktop... for now... as a beginner.


fig4i. Remember that different operating systems use different slash syntax in the file path.
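Here is a small sketch of what those file paths look like in R. The user name, folders and the file name LCSvHGU.csv are made up for illustration, so swap in your own.

```r
# Hypothetical paths: replace the user name and file name with your own.
windows_path <- "C:\\Users\\yourname\\Desktop\\LCSvHGU.csv"
mac_path <- "/Users/yourname/Desktop/LCSvHGU.csv"

# file.path() joins the pieces with forward slashes, which R accepts on
# Windows, Mac and Linux alike. "~" stands for your home folder.
portable_path <- file.path("~", "Desktop", "LCSvHGU.csv")
portable_path

# Only try to read the file if it is really there.
if (file.exists(portable_path)) {
  events <- read.csv(portable_path)
  head(events)
}
```

The backslash has to be doubled on Windows because a single backslash inside an R string starts an escape sequence.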

Now all that's left is to change the variables in my source code, and then run it! Make sure you click on the "Source" button in the Source panel; otherwise, the software might run just one line of your code. The Source button is circled in red below (fig4j).


fig4j. Be mindful of pressing the correct button to run the code: the "Source" button.

Once you click the Source button, the Plots panel will show you your graph (as shown below, fig4k).


fig4k. Here's what the final output would look like.
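If you want to check that the Source button and the Plots panel work before opening one of my projects, you could paste this tiny script into the Source panel and run it. The numbers are made up and have nothing to do with the real match.

```r
# Made-up event counts, just to confirm that the Plots panel works.
event_counts <- c(Shot = 12, Cross = 9, Foul = 15, Corner = 6)

# Draw a simple bar chart; it should appear in the Plots panel.
barplot(event_counts,
        main = "Events in a match (made-up numbers)",
        ylab = "Count",
        col = "steelblue")
```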

How to change the variables?

In the most basic terms, we need to change the values used in the code so that they match what you have in your CSV. An example would be changing the Home and Away teams, because each project could feature different teams. The values you need to change (I'll call them variables) are usually shown in green in RStudio. See below for how the CSV and the source code in the IDE correspond.


fig4l. It would be good if you understood some R programming. But if not, try to scrutinize the code to see where the variables are. Usually they're in green.

In this "Shot_Analysis" project (on the right), we can see that the teams are "HGU(A)" and "LCS(H)". This is the same as in the CSV column (on the left), and the spelling is exactly the same. So if you've created a CSV where the two teams are "AUS(A)" and "KOR(H)", do change the variables in the project source code (circled in red). Changing the teams is A MUST, as different matches involve different teams (unless you're going to be analyzing another Hougang vs LCS match in 2024 or something).


Similarly, since this is a "Shot_Analysis" project, what we're measuring is the "Shot" events in the "event" column (circled in red). If, let's say, you aim to do a "Cross_Analysis", you'd have to change 'Shot' to 'Cross'. But it is advisable to leave this part alone and not change the source code except for the teams.


Additionally, some of the projects use the "half" column to select only the first or second half of the match. Once again, don't change anything unless you know what you are doing.
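To make the idea concrete, here is a rough sketch of how such variables can sit at the top of a script. This is not the actual Shot_Analysis source code: the column names and rows are made up, and only the team labels and the 'Shot' and 'Cross' values come from the example above.

```r
# The values a beginner would normally change. They must match the
# spelling in the CSV exactly.
home_team <- "LCS(H)"
away_team <- "HGU(A)"
event_type <- "Shot"   # e.g. "Cross" for a cross analysis

# Made-up rows; the column names team, event and half are assumptions.
events <- read.csv(text = "team,event,half
LCS(H),Shot,1
LCS(H),Cross,1
HGU(A),Shot,1
HGU(A),Shot,2
HGU(A),Foul,2")

# Keep only the chosen event type for the two teams.
chosen <- events[events$event == event_type &
                   events$team %in% c(home_team, away_team), ]

# Number of chosen events per team (rows) and half (columns).
table(chosen$team, chosen$half)
```

If a team name in the script is spelled even slightly differently from the CSV, the filter quietly returns no rows for that team, which is why the spelling matters so much.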


So do keep a lookout for variables. Should you experience any problems, do leave a comment below and we'll try to help you out.




5. Glossary of different types of graphs

First of all, you can download the sample CSV file I made of the Hougang vs LCS match in 2022 here. All the files are also in my GitHub repository, since some readers requested it.


Here are the different types of graphs and the source code attached to each.



A graph from Rstudio showing football data

Type: Shot Analysis


Description: Shots and attempted shots along with all the details.






 

A graph from Rstudio showing football data

Type: Fouls


Description: Fouls committed and the part of the pitch where each was committed, by which team and during which half.




 

A graph from Rstudio showing football data

Type: Set pieces


Description: Where the set pieces are won, on which part of the pitch and by which team. Also shows the half in which each set piece took place.



 

A graph from Rstudio showing football data

Type: Crosses (1st half)


Description: Crosses made, the direction of each cross and whether it was successful or not. Also includes which part of the pitch the cross was made from. First half only.





A graph from Rstudio showing football data

Type: Crosses (2nd half)


Description: Crosses made, the direction of each cross and whether it was successful or not. Also includes which part of the pitch the cross was made from. Second half only.
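For a feel of how a direction-and-success graph like this can be drawn, here is a rough sketch using only base R graphics. It is not the source code of my project: the coordinates, column names and the 0-100 pitch scale are all made up.

```r
# Made-up crosses: where each one started and ended, and whether it
# reached a teammate. Column names and the 0-100 scale are assumptions.
crosses <- data.frame(
  x_start = c(80, 85, 78), y_start = c(10, 90, 15),
  x_end   = c(92, 90, 95), y_end   = c(45, 55, 50),
  successful = c(TRUE, FALSE, TRUE)
)

# Draw an empty "pitch" and its outline.
plot(c(0, 100), c(0, 100), type = "n",
     xlab = "Pitch length", ylab = "Pitch width",
     main = "Crosses (made-up data)")
rect(0, 0, 100, 100)

# One arrow per cross: green if successful, red if not.
arrows(crosses$x_start, crosses$y_start, crosses$x_end, crosses$y_end,
       col = ifelse(crosses$successful, "darkgreen", "red"), length = 0.1)
legend("topleft", legend = c("Successful", "Unsuccessful"),
       col = c("darkgreen", "red"), lty = 1)

# Share of crosses that were successful.
mean(crosses$successful)
```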




 


A graph from Rstudio showing football data

Type: Interceptions (1st half)


Description: A heat map of the interceptions made in the match, showing the area of the pitch and a total count of the interceptions. First half only.




A graph from Rstudio showing football data

Type: Interceptions (2nd half)


Description: A heat map of the interceptions made in the match, showing the area of the pitch and a total count of the interceptions. Second half only.
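Underneath, a heat map is just a count of events per area of the pitch. Here is a minimal sketch of that counting step in base R. The locations, the 0-100 scale and the zone names are my assumptions, not what the event tagger or my project actually uses.

```r
# Made-up interception locations on a 0-100 pitch scale (an assumption;
# your event tagger may use different units or column names).
interceptions <- data.frame(
  x = c(12, 18, 35, 40, 44, 61, 66, 70, 88),
  y = c(20, 75, 50, 55, 10, 48, 52, 90, 30)
)

# Split the pitch into 3 zones along its length and 3 across its width.
thirds <- c(0, 100 / 3, 200 / 3, 100)
x_zone <- cut(interceptions$x, breaks = thirds,
              labels = c("defensive", "middle", "attacking"),
              include.lowest = TRUE)
y_zone <- cut(interceptions$y, breaks = thirds,
              labels = c("left", "centre", "right"),
              include.lowest = TRUE)

# The count in each cell is what a heat map colours in.
zone_counts <- table(x_zone, y_zone)
zone_counts

# Total number of interceptions.
sum(zone_counts)
```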




 

A graph from Rstudio showing football data

Type: Throw-ins (1st half)


Description: A graph showing which part of the pitch the throw-ins were taken from. First half only.






A graph from Rstudio showing football data

Type: Throw-ins (2nd half)


Description: A graph showing which part of the pitch the throw-ins were taken from. Second half only.





 

A graph from Rstudio showing football data

Type: Passing heatmap (Team A, 1st half)


Description: A heat map for passes. First half only and for one team, HGU.




A graph from Rstudio showing football data

Type: Passing heatmap (Team A, 2nd half)


Description: A heat map for passes. Second half only and for one team, HGU.




A graph from Rstudio showing football data

Type: Passing heatmap (Team B, 1st half)


Description: A heat map for passes. First half only and for one team, LCS.




A graph from Rstudio showing football data

Type: Passing heatmap (Team B, 2nd half)


Description: A heat map for passes. Second half only and for one team, LCS.


 


A graph from Rstudio showing football data

Type: Possession lost (Team A, 1st half)


Description: A graph showing the percentage of possession lost and at which part of the pitch. First half only, HGU.




A graph from Rstudio showing football data

Type: Possession lost (Team A, 2nd half)


Description: A graph showing the percentage of possession lost and at which part of the pitch. Second half only, HGU.




A graph from Rstudio showing football data

Type: Possession lost (Team B, 1st half)


Description: A graph showing the percentage of possession lost and at which part of the pitch. First half only, LCS.




A graph from Rstudio showing football data

Type: Possession lost (Team B, 2nd half)


Description: A graph showing the percentage of possession lost and at which part of the pitch. Second half only, LCS.
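The percentages in graphs like these come from filtering by team and half and then working out each area's share. Here is a minimal sketch of that calculation with made-up rows. The column names and the 0-100 scale are assumptions, and this is not the source code of my project.

```r
# Made-up "Possession Lost" events; x is the position along the pitch
# on an assumed 0-100 scale.
lost <- data.frame(
  team = c("HGU(A)", "HGU(A)", "HGU(A)", "HGU(A)", "HGU(A)", "LCS(H)"),
  half = c(1, 1, 1, 1, 2, 1),
  x    = c(20, 25, 50, 80, 85, 40)
)

# One team and one half at a time, as in the graphs above.
one_half <- lost[lost$team == "HGU(A)" & lost$half == 1, ]

# Which third of the pitch each loss happened in.
third <- cut(one_half$x, breaks = c(0, 100 / 3, 200 / 3, 100),
             labels = c("defensive", "middle", "attacking"),
             include.lowest = TRUE)

# Percentage of this team's losses in each third.
pct_lost <- 100 * prop.table(table(third))
pct_lost
```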




 

In summary

Hope that sharing this knowledge helps Hougang fans, and perhaps other football fans, get an idea of the work done by data analytics staff in preparation for a game.


Please do read up on how to use the software listed above to sharpen your understanding. Sooner or later, you'll be analyzing more games on your own and will understand how to tweak different source code to suit different needs. All the best!


P.S. don't be afraid to leave a comment below if you're stuck! I am NOT a data scientist or analyst or whatever, just an enthusiast. So don't be afraid. The community can help! 🐆

 
