📝 Updates README.md and adds images

Daniel Svitan 2025-02-22 17:47:15 +01:00
parent 2f3c547b55
commit cd3755f167
4 changed files with 89 additions and 17 deletions

2
.gitignore vendored

@@ -15,6 +15,8 @@ paper/
*.jasp
*.pth
*.png
!structure.png
!example-graph.png
*.drawio
*.tar.gz

104
README.md

@@ -1,22 +1,92 @@
# Hello!
Welcome! You either don't know what the hell this is or know exactly what the hell this is. Either way, here's a quick
explanation:
I decided to write a special paper at my high school - why? It's kind of a competition: a bunch of students submit their
SOC papers and the best one wins. They're actually graded by a whole committee, and it's a big deal and whatever.
Anyway, I'm here because it's fun, not because I want to win (obviously I want to win, but that's not why I decided to do the SOC).
I will eventually (probably Feb 2025) publish the paper, which I will also send to everyone who participated in the
survey (seriously, thanks to everyone who did). Here I want to keep all the scripts I used while writing the paper; it's mostly
stuff for cleaning, analyzing, and graphing the dataset.
I will also probably send out a link to this repo along with the paper, so if you got here from that link, welcome!
Thanks so much for participating in the survey. You can check out the scripts and the other markdowns (to be added)
where I explain what I did in a slightly more friendly way.
And if you're interested in the dataset, no, I'm not publishing it, sorry (mom said no).
Welcome to the technical repository for my 2024/2025 SOC paper.
This is where I keep all my scripts, scientific tests, algorithms, and graphing programs.
Let me walk you through how it works.
### Dataset
The documentation for the dataset structure can be found [here](https://github.com/Streamer272/soc-2024/blob/main/DATASET.md)
The documentation for the dataset structure can be found in [DATASET.md](DATASET.md);
this part is only interesting for the nerds.
### Distribution
This is probably the easiest part of this whole thing: it's basically just making charts and computing percentages.
Say you have 12 male and 15 female respondents - what is the distribution? It's quite simple:
`(number of elements in a group) / (number of elements in the dataset)`
So in this case, the distribution of male respondents would be `12 / (12 + 15) = ~44%`, and female `15 / (12 + 15) = ~56%`.
Now that we know this, we can make a pretty pie graph! The script that does all of this is [distribution.py](distribution.py).
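Here's a minimal sketch of the same idea (hypothetical data, not the actual [distribution.py](distribution.py)):
```python
from collections import Counter

import matplotlib.pyplot as plt

# hypothetical responses; the real script reads them from the cleaned dataset
responses = ["male"] * 12 + ["female"] * 15

counts = Counter(responses)
total = sum(counts.values())

# distribution = (number of elements in a group) / (number of elements in the dataset)
for group, count in counts.items():
    print(f"{group}: {count / total:.0%}")

# the same counts make a pretty pie graph
plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%.0f%%")
plt.show()
```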
### Analysis and scientific tests
This is where stuff gets interesting. The script that does all the heavy lifting is [analyze.py](analyze.py);
on top of it sit the specialized analysis scripts, like [analyze_sex.py](analyze_sex.py)
(which, surprisingly, only analyzes sex).
[analyze_sex.py](analyze_sex.py) only picks its data out of the dataset and passes it down to [analyze.py](analyze.py) to do all the analyses, where the following things happen:
1. the received data is put into groups, each receiving an assigned letter (A, B, C, etc.); groups of insufficient size are removed
2. if fewer than 2 groups remain, the analysis aborts
3. the [Kruskal-Wallis test](https://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_test) is performed and the `F` (technically the Kruskal-Wallis `H` statistic) and `p` values are received
4. if `p` is greater than 0.05, the difference between the groups is not statistically significant and the analysis aborts
5. the post-hoc [Dunn test](https://www.statology.org/dunns-test/) is performed and its `p` values are saved
6. a result table is created with a row for each pair of groups: their [rank-biserial correlation](https://www.statisticshowto.com/rank-biserial-correlation/), difference in medians, difference in means, and post-hoc `p` value
Problem solved!
If the difference is statistically insignificant,
or we don't have sufficient data to perform a statistical test, the analysis aborts;
otherwise, we get our `F` value, `p` value, and the result table, which could look something like this
(a sketch of the whole pipeline follows the table):
| Group 1 | Group 2 | Effect size | Difference of means | Difference of medians | Post-hoc p-value |
|---------|---------|-------------|---------------------|-----------------------|------------------|
| A | B | 0.0440 | 0.4198 | 0.0000 | 0.0497 |
| A | C | 0.0399 | 0.2723 | 0.0000 | 0.5239 |
| B | C | -0.0084 | -0.1475 | 0.0000 | 0.3706 |
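For the curious, a hedged sketch of that pipeline, assuming [scipy](https://scipy.org/) and the scikit-posthocs package (the actual logic lives in [analyze.py](analyze.py)):
```python
from scipy.stats import kruskal, mannwhitneyu
import scikit_posthocs as sp

# hypothetical groups of grades; analyze.py builds these from the dataset
groups = {"A": [1.0, 1.5, 2.0, 1.2], "B": [2.0, 2.5, 1.8, 2.2], "C": [1.9, 2.1, 2.4, 1.7]}

# step 2: abort if fewer than 2 groups remain
if len(groups) < 2:
    raise SystemExit("not enough groups")

# step 3: Kruskal-Wallis test, yielding the test statistic and p value
f, p = kruskal(*groups.values())

# step 4: abort if the difference is not statistically significant
if p > 0.05:
    raise SystemExit("difference is not statistically significant")

# step 5: post-hoc Dunn test, one p value per pair of groups
dunn_p = sp.posthoc_dunn(list(groups.values()))

# step 6 (shown for one pair): rank-biserial correlation from the Mann-Whitney U
u, _ = mannwhitneyu(groups["A"], groups["B"])
effect_size = 1 - 2 * u / (len(groups["A"]) * len(groups["B"]))
```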
### Graphing
Once the analysis is complete, successful or not, we can graph the data. We mostly use violin plots,
which are quite easy to understand and interpret. It goes like this:
1. the window is split into four subplots: top left for the average grade, top right for the math grade, bottom left for the Slovak grade, and bottom right for the English grade
2. all groups get added to each subplot as violin plots, so for sex, each subplot would contain a violin plot for males and a violin plot for females
3. the `F` and `p` values get added to the top left corner of each subplot
4. the legend gets added to the top right corner of each subplot and the axes are labeled
5. each violin plot contains five pieces of valuable information:
   1. the shaded background that shows the distribution of the data
   2. the gray line that represents the data between the first and third quartiles
   3. the red mean line with the mean value on the left
   4. the green median line with the median value on the right
   5. the minimum and maximum bounds
6. labels get added to each violin plot
Quite complicated, right? A ton of data packed into one small image, which could look something like this:
![example graph](example-graph.png)
It can be overwhelming to look at at first, but once you understand what's going on, it's quite intuitive. Anyway,
the function that does all this is also saved in [analyze.py](analyze.py); a rough sketch of the layout is below.
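Here's a rough matplotlib sketch of that layout (made-up data and labels, not the actual function from [analyze.py](analyze.py)):
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# hypothetical grades per group; the real data comes from the dataset
groups = {"males": rng.normal(2.0, 0.5, 50), "females": rng.normal(2.4, 0.6, 50)}
subjects = ["average grade", "math grade", "slovak grade", "english grade"]

# the window is split into four subplots
fig, axes = plt.subplots(2, 2)
for ax, subject in zip(axes.flat, subjects):
    # every group gets a violin plot in every subplot
    ax.violinplot(list(groups.values()), showmeans=True, showmedians=True)
    ax.set_title(subject)
    ax.set_xticks(range(1, len(groups) + 1), labels=list(groups.keys()))
    # the F and p values go in the top left corner (placeholder values here)
    ax.text(0.02, 0.95, "F=4.2 p=0.04", transform=ax.transAxes, va="top")
plt.tight_layout()
plt.show()
```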
### Neural network
Ah, AI stuff!
Well, it didn't work in the end because of the abysmally small amount of data, but the structure and
the training process are still here and can be looked at.
The script that trains the neural network is [train_nn.py](train_nn.py) (yes, I am very creative, I am aware). It uses
the [pytorch](https://pytorch.org/) library to do all the math stuff that goes on behind the scenes, but the important
part is the structure of the neural network, right here:
![structure of a neural network](structure.png)
Of course, we have to use the `.npy` file format to load the data into our program, so how do we convert the `.csv`
data provided by Google Forms into a `.npy`?
The answer lies in [clean.py](clean.py), but I'm not going to go
into how it all works; it's just cleaning the data.
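If you're curious anyway, the core of such a conversion looks something like this (a sketch with an assumed file name and column handling, not the actual [clean.py](clean.py)):
```python
import numpy as np
import pandas as pd

# load the raw Google Forms export (hypothetical file name)
df = pd.read_csv("responses.csv")

# the real cleaning also encodes answers as numbers and drops invalid rows;
# here we just drop missing values and keep the numeric columns
df = df.dropna().select_dtypes("number")

# save the numeric matrix as .npy so the training script can np.load() it
np.save("dataset.npy", df.to_numpy(dtype=np.float32))
```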
The whole training thing is pretty complicated, so if you don't know anything about neural networks, just forget about
it and attribute it to magic, but if you do, read through [train_nn.py](train_nn.py);
it's pretty clean and readable code. A generic sketch of what such a training loop looks like is below.
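For reference, a typical PyTorch training loop looks roughly like this (the layer sizes, loss, and data layout are assumptions, not the actual structure from [train_nn.py](train_nn.py) or [structure.png](structure.png)):
```python
import numpy as np
import torch
from torch import nn

# load the cleaned dataset; assumed layout: all columns except the last are inputs
data = torch.from_numpy(np.load("dataset.npy"))
x, y = data[:, :-1], data[:, -1:]

# a small fully connected network with assumed layer sizes
model = nn.Sequential(
    nn.Linear(x.shape[1], 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # backpropagation
    optimizer.step()             # weight update
```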

BIN
example-graph.png Normal file

Binary file not shown.

Size: 128 KiB

BIN
structure.png Normal file

Binary file not shown.

Size: 37 KiB