diff --git a/.gitignore b/.gitignore
index 62c14be..3579dcf 100644
--- a/.gitignore
+++ b/.gitignore
@@ -15,6 +15,8 @@
 paper/
 *.jasp
 *.pth
 *.png
+!structure.png
+!example-graph.png
 *.drawio
 *.tar.gz
diff --git a/README.md b/README.md
index 3833c42..96d34d4 100644
--- a/README.md
+++ b/README.md
@@ -1,22 +1,92 @@
 # Hello!

-Welcome, you either don't know what the hell this is or know exactly what the hell this is. Either way, here's a quick
-explanation:
-
-I decided to write a special paper at my highschool - why? It's kind of a competition, a bunch of students submit their
-SOC paper and the best one wins. They're actually graded by a whole ahh comittee and it's a big deal and whatever.
-Anyway, I'm here cuz it's fun, not cuz I wanna win (obviously I wanna win but that's not why I decided to do the SOC)
-
-I will eventually (probably Feb 2025) publish the paper, which I will also send to everyone who participated in the
-survey (fr thanks to everyone who did). Here I wanna keep all the scripts I used while writing the paper, it's mostly
-stuff for cleaning, analyzing, and graphing the dataset
-
-I will also probably send out a link to this repo along with the paper, so if you got here from that link, welcome!
-Thanks so much for participating in the survey, you can check out the scripts and the other markdowns (to be added)
-where I explain what I did in a slightly more friendly way
-
-And if you're interested in the dataset, no I'm not publishing it, sorry (mom said no)
+Welcome to the technical repository for my 2024/2025 SOC paper!
+This is where I keep all my scripts, scientific tests, algorithms, and graphing programs.
+Let me walk you through how it works.

 ### Dataset
-The documentation for the dataset structure can be found [here](https://github.com/Streamer272/soc-2024/blob/main/DATASET.md)
\ No newline at end of file
+The documentation for the dataset structure can be found in [DATASET.md](DATASET.md);
+this part is only interesting for the nerds.
+
+### Distribution
+
+This is probably the easiest part of the whole thing: it's basically just making charts and computing percentages.
+Say you have 12 male and 15 female respondents, what is the distribution? It's quite simple:
+
+`(number of elements in a group) / (number of elements in the dataset)`
+
+So in this case, the distribution of male respondents would be `12 / (12 + 15) = ~44%`, and female
+`15 / (12 + 15) = ~56%`. Now that we know this, we can make a pretty pie graph! The script that does all of this is
+[distribution.py](distribution.py); a small sketch of the idea follows below.
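+
+Just to illustrate the idea (this is a sketch, not the actual [distribution.py](distribution.py); the file name and
+the `sex` column are made up for the example):
+
+```python
+# Minimal sketch of the distribution logic; file and column names are
+# illustrative, not the real dataset schema.
+import matplotlib.pyplot as plt
+import pandas as pd
+
+df = pd.read_csv("dataset.csv")    # hypothetical cleaned dataset
+
+counts = df["sex"].value_counts()  # e.g. male: 12, female: 15
+distribution = counts / len(df)    # (elements in a group) / (elements in the dataset)
+print(distribution)                # e.g. female ~0.56, male ~0.44
+
+# and the pretty pie graph, one slice per group
+plt.pie(counts, labels=counts.index, autopct="%1.0f%%")
+plt.title("Respondents by sex")
+plt.savefig("sex-distribution.png")
+```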
+
+### Analysis and scientific tests
+
+This is where stuff gets interesting. The script that does all the heavy lifting is [analyze.py](analyze.py);
+on top of it sit the specialized analysis scripts, like [analyze_sex.py](analyze_sex.py)
+(which, surprisingly, only analyzes sex).
+
+[analyze_sex.py](analyze_sex.py) only picks its data out of the dataset and passes it down to
+[analyze.py](analyze.py), which does all the analyses. The following things happen there:
+
+1. the received data is put into groups, each receiving an assigned letter (A, B, C, etc.); groups of insufficient size are removed
+2. if there are fewer than 2 groups left, the analysis aborts
+3. the [Kruskal-Wallis test](https://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_test) is performed, yielding the `H` statistic and a `p` value
+4. if `p` is greater than 0.05, the difference between those groups is not statistically significant and the analysis aborts
+5. the post-hoc [Dunn test](https://www.statology.org/dunns-test/) is performed and its `p` values are saved
+6. a result table is created with a comparison of each pair of groups: their [rank-biserial correlation](https://www.statisticshowto.com/rank-biserial-correlation/) (the effect size), difference in means, difference in medians, and post-hoc `p` value
+
+Problem solved! If the difference is statistically insignificant, or we don't have sufficient data to perform a
+statistical test, the analysis aborts; otherwise, we get our `H` value, `p` value, and the result table, which could
+look something like this (a code sketch of the whole pipeline follows after the Graphing section below):
+
+| Group 1 | Group 2 | Effect size | Difference in means | Difference in medians | Post-hoc p-value |
+|---------|---------|-------------|---------------------|-----------------------|------------------|
+| A       | B       | 0.0440      | 0.4198              | 0.0000                | 0.0497           |
+| A       | C       | 0.0399      | 0.2723              | 0.0000                | 0.5239           |
+| B       | C       | -0.0084     | -0.1475             | 0.0000                | 0.3706           |
+
+### Graphing
+
+Once the analysis is complete, successful or not, we can graph the data. We mostly use violin plots,
+which are quite easy to understand and interpret. It goes like this:
+
+1. the window is split into four subplots: top left for the average grade, top right for the math grade, bottom left for the Slovak grade, and bottom right for the English grade
+2. all groups get added to each subplot as violin plots, so for sex, each subplot would contain a violin plot for males and a violin plot for females
+3. the `H` and `p` values get added to the top left corner of each subplot
+4. the legend gets added to the top right corner of each subplot and the axes are labeled
+5. each violin plot contains five pieces of valuable information:
+   1. the shaded background that shows the distribution of the data
+   2. the gray line that represents the data between the first and third quartile
+   3. the red mean line, with the mean value on the left
+   4. the green median line, with the median value on the right
+   5. the minimum and maximum bounds
+6. labels get added to each violin plot
+
+Quite complicated, right? A ton of data packed into one small image, which could look something like this:
+
+![example graph](example-graph.png)
+
+It can be overwhelming to look at at first, but once you understand what's going on, it's quite intuitive. Anyway,
+the function that does all this is also saved in [analyze.py](analyze.py), and a rough plotting sketch follows below.
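+
+Here's the promised minimal sketch of the analysis pipeline (steps 1-6 from the analysis section). To be clear, this
+is an illustration, not the actual [analyze.py](analyze.py): the minimum group size, the exact libraries (scipy and
+scikit-posthocs), and the rank-biserial formula are assumptions based on the description above.
+
+```python
+# Sketch of analysis steps 1-6; thresholds and libraries are assumed,
+# not necessarily what analyze.py really uses.
+from itertools import combinations
+from statistics import mean, median
+
+import scikit_posthocs as sp
+from scipy import stats
+
+MIN_GROUP_SIZE = 5  # assumed "insufficient size" threshold
+ALPHA = 0.05
+
+
+def analyze(groups: dict[str, list[float]]):
+    # 1. drop groups of insufficient size
+    groups = {k: v for k, v in groups.items() if len(v) >= MIN_GROUP_SIZE}
+    # 2. abort if fewer than 2 groups remain
+    if len(groups) < 2:
+        return None
+    names, data = list(groups), list(groups.values())
+    # 3. Kruskal-Wallis test -> H statistic and p value
+    h, p = stats.kruskal(*data)
+    # 4. abort if the difference is not statistically significant
+    if p > ALPHA:
+        return None
+    # 5. post-hoc Dunn test -> pairwise p values as a DataFrame
+    dunn = sp.posthoc_dunn(data)
+    # 6. result table: one row per pair of groups
+    table = []
+    for i, j in combinations(range(len(names)), 2):
+        a, b = data[i], data[j]
+        u, _ = stats.mannwhitneyu(a, b, alternative="two-sided")
+        effect = 2 * u / (len(a) * len(b)) - 1  # one common rank-biserial formula
+        table.append({
+            "Group 1": names[i],
+            "Group 2": names[j],
+            "Effect size": effect,
+            "Difference in means": mean(a) - mean(b),
+            "Difference in medians": median(a) - median(b),
+            "Post-hoc p-value": dunn.iloc[i, j],
+        })
+    return h, p, table
+
+
+# made-up grade lists, just to show the shape of the input
+result = analyze({
+    "A": [1, 1, 2, 2, 3],
+    "B": [3, 3, 4, 4, 5],
+    "C": [2, 2, 3, 3, 4],
+})
+```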
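+
+And here's a rough matplotlib sketch of a single subplot from the graphing steps above. Again, it only mimics the
+description (the grade data and the `H`/`p` values are made up, and the real figure has four such subplots):
+
+```python
+# Rough sketch of one violin subplot as described above; styling mirrors
+# the description, not the exact code in analyze.py.
+import matplotlib.pyplot as plt
+import numpy as np
+
+rng = np.random.default_rng(0)
+groups = {"male": rng.normal(2.0, 0.8, 40), "female": rng.normal(1.7, 0.6, 50)}
+positions = list(range(1, len(groups) + 1))
+
+fig, ax = plt.subplots()
+# shaded distributions plus minimum/maximum bounds
+ax.violinplot(list(groups.values()), positions=positions, showextrema=True)
+for pos, values in zip(positions, groups.values()):
+    q1, med, q3 = np.percentile(values, [25, 50, 75])
+    ax.vlines(pos, q1, q3, color="gray", lw=4)             # first-to-third quartile
+    ax.hlines(values.mean(), pos - 0.2, pos, color="red")  # mean line
+    ax.hlines(med, pos, pos + 0.2, color="green")          # median line
+    ax.text(pos - 0.25, values.mean(), f"{values.mean():.2f}", ha="right", color="red")
+    ax.text(pos + 0.25, med, f"{med:.2f}", ha="left", color="green")
+
+ax.set_xticks(positions, list(groups.keys()))
+ax.set_title("average grade")  # one of the four subplots in the real figure
+ax.text(0.02, 0.95, "H = 4.2, p = 0.04", transform=ax.transAxes)  # made-up values
+plt.savefig("violin-sketch.png")
+```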
+
+### Neural network
+
+Ah! AI stuff! Well, it didn't work in the end because of the abysmal amount of data, but the structure and
+the training process are still here and can be looked at.
+
+The script that trains the neural network is [train_nn.py](train_nn.py) (yes, I am very creative, I am aware). It uses
+the [pytorch](https://pytorch.org/) library to do all the math stuff that goes on behind the scenes, but the important
+part is the structure of the neural network, right here:
+
+![structure of a neural network](structure.png)
+
+Of course, we have to use the `.npy` file format to load the data into our program, so how do we convert the `.csv`
+data provided by Google Forms into a `.npy`? The answer lies in [clean.py](clean.py), but I'm not going to go
+into how it all works; it's just cleaning the data.
+
+The whole training thing is pretty complicated, so if you don't know anything about neural networks, just forget about
+it and attribute it to magic. But if you do, read through [train_nn.py](train_nn.py);
+it's pretty clean and readable code.
diff --git a/example-graph.png b/example-graph.png
new file mode 100644
index 0000000..c7ad7d1
Binary files /dev/null and b/example-graph.png differ
diff --git a/structure.png b/structure.png
new file mode 100644
index 0000000..968090a
Binary files /dev/null and b/structure.png differ