The Data
Discord lets you request a copy of your data, which gives you access to pretty much everything Discord can and will keep track of: basically every action you've ever taken on your account, every picture, every user you've ever been friends with, etc. (Deleted messages do get deleted, though. They really are gone forever, thankfully.) When you download your message data, you get a bunch of .json files, separated by channel.
My Data
Who said you needed to be a big corp to peer at other people's data? Now you, too, can take a looksie at every word I've ever said on Discord! (Note: probably because the dataset is too small, my word frequency doesn't actually follow Zipf's law. I thought it would, but I guess not.)
Google Sheets
Code (BASH)
Getting message contents from json files
I'm sure there's a quicker way to do this but I don't care, this works
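# start fresh; -f keeps rm quiet if output.txt does not exist yet
rm -f ./output.txt
# each channel is its own directory containing a messages.json
for file in ./*/; do
    while IFS= read -r line; do
        # pull out whatever sits between the quotes after "Contents":
        message="$(echo "$line" | sed -n 's/.*"Contents": "\([^"]*\)".*/\1/p')"
        if [[ -n "$message" ]]; then
            echo "$message" >> ./output.txt
        fi
    done < "${file}messages.json"
    echo "finished with $file"
done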
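To see what the sed expression actually grabs, here is a minimal sketch that wraps it in a function and runs it on two made-up lines. The name `extract_contents` and the sample lines are assumptions for illustration; only the "Contents" field comes from the script above, and real exports may be laid out differently. One limitation worth knowing: because the pattern stops at the first double quote, a message containing an escaped quote gets cut off at that point.

```bash
#!/usr/bin/env bash
# extract_contents (hypothetical name): read JSON text on stdin and print the
# value of each "Contents" field, one per line, skipping empty messages.
extract_contents() {
    sed -n 's/.*"Contents": "\([^"]*\)".*/\1/p' | grep -v '^$'
}

# Hypothetical sample lines, not copied from a real export.
printf '%s\n' \
    '{"ID": 1, "Contents": "hello there", "Attachments": ""}' \
    '{"ID": 2, "Contents": "", "Attachments": ""}' |
    extract_contents
```

Since sed reads the whole stream itself, something like `for file in ./*/; do extract_contents < "${file}messages.json"; done > output.txt` should give the same result as the loop above without starting a new sed for every line.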
Cleaning up the messages
This takes the data and reduces it down to being purely lowercase letters. Takes a super long time but it can be run with GNU parallel with no problems which makes it go significantly quicker.
# RUN USING '$ cat ./output.txt | parallel -j 12 ./cleaner.sh'
line="$1"
line=$(echo "$line" | iconv -f UTF-8 -t ASCII//TRANSLIT)
line=$(echo "$line" | sed 's/[^a-zA-Z ]//g') # remove all non-letter characters
line=$(echo "$line" | xargs) # normalize whitespace
line=$(echo "$line" | tr '[:upper:]' '[:lower:]') # normalize case
if [[ -n "$line" ]]; then
    echo "$line" >> cleaned_messages.txt
fi
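The script above starts several processes for every single message, which is probably where most of the time goes. A possible alternative, sketched here and not benchmarked, is to push the whole file through one pipeline. `clean_stream` is a hypothetical name, and the `xargs` whitespace step is swapped for `tr -s` plus a small `sed` trim, which should behave the same once only letters and spaces are left. The `iconv` step is left out of the function and shown in the usage comment.

```bash
#!/usr/bin/env bash
# clean_stream (hypothetical name): read messages on stdin, one per line, and
# print them reduced to lowercase letters separated by single spaces.
clean_stream() {
    LC_ALL=C sed 's/[^a-zA-Z ]//g' |   # remove all non-letter characters
        tr -s ' ' |                     # squeeze runs of spaces into one
        sed 's/^ //; s/ $//' |          # trim the leftover edge space
        tr '[:upper:]' '[:lower:]' |    # normalize case
        grep -v '^$'                    # drop lines that ended up empty
}

# Intended usage, with the same transliteration step as the script above:
#   iconv -f UTF-8 -t ASCII//TRANSLIT < output.txt | clean_stream > cleaned_messages.txt

# Small demo on inline text.
printf 'Hello,   World!\n\n  42 \n' | clean_stream
```

The demo line with only digits disappears entirely, the same way the `[[ -n "$line" ]]` check drops it in the per-line script.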
Making a .csv for frequency
Pretty self-explanatory. Counts the frequency of each word (any characters separated by whitespace) and writes it to a csv [word,frequency], which can be processed with pretty much anything that supports .csv files.
(tr ' ' '\n' | sort | uniq -c | awk '{print $2","$1}' > frequency.csv) <./cleaned_messages.txt
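The one-liner above writes the rows in alphabetical order. For eyeballing something like Zipf's law it can be handier to have the most frequent words first, so here is a sketch of the same count with an extra sort by frequency. `word_frequency_csv` is a hypothetical name, and `LC_ALL=C` is only there to make the ordering predictable.

```bash
#!/usr/bin/env bash
# word_frequency_csv (hypothetical name): read cleaned messages on stdin and
# print "word,frequency" rows, most frequent first, ties in alphabetical order.
word_frequency_csv() {
    tr -s ' ' '\n' |                     # one word per line
        LC_ALL=C sort |                  # group identical words together
        uniq -c |                        # count each group
        LC_ALL=C sort -k1,1nr -k2,2 |    # highest count first, then by word
        awk '{print $2","$1}'            # reorder to word,frequency
}

# Small demo on inline text.
printf 'a b a\nc a b\n' | word_frequency_csv
```

Running it as `word_frequency_csv < cleaned_messages.txt > frequency.csv` should produce the same rows as before, just ranked.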