Cross Examination of the Slovak State SOC Competition Results and the Analysis of the Judges' Bias
Backstory
In the fall of 2024, I decided to participate in the 2024/2025 round of SOC, organized by an institute run by the Ministry of Education of the Slovak Republic. My category was Psychology, Sociology, and Pedagogy (number 17); you can find my paper here. With this paper, I won first place in the county round and advanced to the state round together with the girl who took 2nd place.
I had been made aware of the bias in the state round, but I wanted to see for myself how it would really turn out. My colleague from Bratislava and I didn't win anything despite having good papers, and by far the best paper only won 4th place (shoutout to Sofia Petrovičová).
I was quite outraged by the results, so I decided to see whether there was any statistical evidence to back up the claims of bias.
Statistics
This project is being written in 2025 and I have only collected data from 2017 to 2025 (inclusive), so if you're reading this a couple of years later, it may no longer be up to date.
We conducted two statistical tests:
- to see whether some counties won more than others
- and if they won, whether some counties won better places
First test
The following data represents how many times a county won (any place):
BA - 75
TN - 80
TT - 76
NR - 94
BB - 83
ZA - 141
PO - 103
KE - 113
We also needed to adjust for population, since some counties have more inhabitants than others and are therefore expected to win more. The population data were extracted from this source (accessed in May 2025; it holds the statistical data as of 31.12.2024).
The unit we used is rather cursed, but understandable once you realize that pure wins per capita would yield very small numbers. We therefore used micro-wins per capita, i.e. wins per million inhabitants: we just multiplied everything by one million. This is the adjusted data (rounded for the purposes of this README):
BA - 102
TN - 141
TT - 134
NR - 141
BB - 136
ZA - 206
PO - 127
KE - 145
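The conversion described above can be sketched in a few lines of Python. The function name `wins_per_million` and the county in the usage line are illustrative assumptions; the real county populations come from the linked source and are not reproduced here.

```python
def wins_per_million(wins: int, population: int) -> float:
    """Convert a raw win count into wins per million inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return wins / population * 1_000_000


# Hypothetical county (NOT real data): 50 wins among 400,000 inhabitants.
example = wins_per_million(50, 400_000)
print(round(example))  # 125
```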
It is evident that Bratislava county has by far the fewest micro-wins per capita, which already hints at a possible bias. The expected (average) value for each county was 142 micro-wins per capita.
We used a chi-square test to compare the observed and expected values and received the following results:
Chi-Square = 42.1935
p = 0.00000048
Since p is far below the common threshold of 0.05, we reject the null hypothesis and accept the alternative hypothesis, which states that some counties win more often than others. Our claims of bias can therefore be supported by statistical evidence.
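For readers who want to retrace the first test, here is a minimal pure-Python sketch of a chi-square goodness-of-fit test against a uniform expectation. It is not the original analysis script: it uses the rounded values from the table above, so it gives a statistic of about 42.6 rather than exactly 42.1935. The function names are illustrative, and the p-value helper relies on the closed-form tail of the chi-square distribution for odd degrees of freedom (8 counties give 7 degrees of freedom).

```python
import math

# Rounded wins per million inhabitants, copied from the table above.
observed = {"BA": 102, "TN": 141, "TT": 134, "NR": 141,
            "BB": 136, "ZA": 206, "PO": 127, "KE": 145}


def chi_square_uniform(values):
    """Return (statistic, expected) for a uniform-expectation chi-square test."""
    expected = sum(values) / len(values)
    stat = sum((v - expected) ** 2 / expected for v in values)
    return stat, expected


def chi2_sf_odd_df(x, df):
    """Upper-tail probability of a chi-square variable; valid for odd df only."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only covers odd degrees of freedom")
    z = math.sqrt(x)
    tail = math.erfc(z / math.sqrt(2))            # two-sided normal tail
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    series, term = 0.0, z                         # term = x^((2r-1)/2) / (1*3*...*(2r-1))
    for r in range(1, (df - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)
    return tail + 2 * density * series


stat, expected = chi_square_uniform(list(observed.values()))
p_value = chi2_sf_odd_df(stat, df=len(observed) - 1)
print(round(expected, 1))  # 141.5
print(round(stat, 2))      # 42.61
print(p_value < 0.05)      # True
```

In practice, `scipy.stats.chisquare` would do the same job in one call; the hand-written version is only meant to make the arithmetic visible.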
Second test
For the purposes of the second test, we had to reshape the dataset so that rows represent counties and columns their placements. An example (not real data) could look like this:
BA | 5 | 1 | 4 | ...
TN | 2 | 2 | 3 | ...
TT ...
Then, we performed a Kruskal-Wallis test (since the data didn't fit a normal distribution) to compare the counties with each other and see whether some counties won higher places than others (once they actually won). The results were as follows:
H = 6.2036
p = 0.51619147
Since p is far above 0.05, there is no statistical evidence to support the alternative hypothesis, and we fail to reject the null hypothesis: once counties win, there is no detectable difference in placements between them.
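The Kruskal-Wallis statistic itself is easy to compute by hand, which may help when checking the numbers. The sketch below is an illustrative pure-Python version with the standard tie correction, not the script used for the analysis. The input reuses only the visible part of the made-up example rows above (not real data), and the function name is an assumption.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a dict of name -> non-empty list of values.

    Assumes at least two distinct values overall (otherwise the tie
    correction is zero and H is undefined).
    """
    pooled = [v for values in groups.values() for v in values]
    n_total = len(pooled)
    counts = Counter(pooled)

    # Average rank per distinct value: tied values share the mean of their positions.
    rank_of, position = {}, 0
    for value in sorted(counts):
        ties = counts[value]
        rank_of[value] = position + (ties + 1) / 2
        position += ties

    # H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
    rank_term = sum(
        sum(rank_of[v] for v in values) ** 2 / len(values)
        for values in groups.values()
    )
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

    # Placements 1..5 repeat constantly, so the tie correction matters here.
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n_total ** 3 - n_total)
    return h / correction


# Illustrative placements only (NOT real data).
placements = {"BA": [5, 1, 4], "TN": [2, 2, 3]}
print(round(kruskal_wallis_h(placements), 4))  # 0.4412
```

With all eight counties, H is compared against a chi-square distribution with 7 degrees of freedom to obtain the p-value; `scipy.stats.kruskal` performs both steps in one call.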
Conclusion
We have demonstrated that bias is very much present, especially against Bratislava county and in favor of Žilina county. More data could shed even more light on the situation, but unfortunately, the digitized records only start in 2017, and we can't really get our hands on earlier data. This topic should be reexamined every couple of years once new data is available.
Data & Dataset
The raw documents from the Institute are all saved in data.zip in the following format: rYEAR-CAT.EXT (e.g. r2023-06.pdf for the 6th category of 2023). These aren't of much use: they're very inconsistent and downright terrible to work with, so I don't recommend handling them unless you know what you're doing. They were scraped from the official website of the Institute.
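If you do need to sort through the raw files, the naming scheme can be decoded with a small regular expression. This is only a sketch: the four-digit year and two-digit category are inferred from the single example above, and `parse_raw_name` is a hypothetical helper, not something shipped in this repository.

```python
import re

# Assumed pattern: r<4-digit year>-<2-digit category>.<extension>
RAW_NAME = re.compile(r"^r(?P<year>\d{4})-(?P<cat>\d{2})\.(?P<ext>\w+)$")


def parse_raw_name(filename):
    """Split a raw document name into (year, category, extension)."""
    match = RAW_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return int(match["year"]), int(match["cat"]), match["ext"]


print(parse_raw_name("r2023-06.pdf"))  # (2023, 6, 'pdf')
```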
The labeled dataset can be found in dataset.txt, where each line follows the format YEAR CAT WIN1,WIN2,WIN3,WIN4,WIN5 (e.g. 2023 06 ZA,NR,PO,BB,KE). This is what was actually used for the analysis, since it can be parsed easily. It contains 153 entries (17 categories * 9 years = 153) and a total of 765 samples (county-placement pairs, 5 per entry).
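As a rough illustration of how little parsing the format needs, the sketch below turns dataset lines into a mapping from county to the list of placements it achieved. The function names are assumptions made for this example and do not necessarily match the original analysis code.

```python
from collections import defaultdict


def parse_line(line):
    """Parse 'YEAR CAT WIN1,...,WIN5' into (year, category, [counties in placement order])."""
    year, category, winners = line.split()
    return int(year), int(category), winners.split(",")


def placements_by_county(lines):
    """Map each county code to the list of placements (1 = first place) it achieved."""
    placements = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        _, _, winners = parse_line(line)
        for place, county in enumerate(winners, start=1):
            placements[county].append(place)
    return dict(placements)


sample = ["2023 06 ZA,NR,PO,BB,KE"]
print(placements_by_county(sample))
# {'ZA': [1], 'NR': [2], 'PO': [3], 'BB': [4], 'KE': [5]}
```

To run it on the real file, pass the open file object, e.g. `placements_by_county(open("dataset.txt", encoding="utf-8"))`. The length of each list gives the raw win count used in the first test, and the lists themselves are the per-county samples for the second test.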
Credits
I would like to thank my mentor, Mgr. Martina Šandor, who has been of great help with my SOC paper, and Adam Kováč, who manually labeled the data and created the dataset.
License
This project is licensed under the GNU GPLv3.