/ TRANSATLANTICISM / Projects / Discord Word-use Analysis

The Data

Discord allows you to request your data from them, which gives you access to pretty much everything that Discord can and will keep track of: basically every action you've ever taken on your account, every picture, every user you've ever been friends with, etc. (Deleted messages do get deleted, though. They really are gone forever, thankfully.) When you download your message data, you get a bunch of .json files, separated by channel.

My Data

Who said you needed to be a big corp to peer at other people's data! Now you, too, can take a looksie at every word I've ever said on Discord! (Note: probably because it's too small a dataset, my word frequencies don't actually follow Zipf's law. I thought they would, but I guess not.)
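One quick way to eyeball Zipf's law from a word-frequency CSV like the one built at the bottom of this page: under Zipf, rank × frequency stays roughly constant, so wild swings in that product mean the law doesn't hold. A sketch with toy stand-in data (the real input would be the frequency.csv produced below):

```shell
cd "$(mktemp -d)"                                # disposable demo directory
printf 'the,30\nof,15\na,10\ncat,2\n' > frequency.csv   # toy stand-in data

# Sort by count (descending), then print rank, word, count, and rank*count.
# The first three toy rows are perfectly Zipfian (product stays at 30);
# the last one isn't.
sort -t, -k2,2nr frequency.csv \
    | awk -F, '{rank++; print rank "," $1 "," $2 "," rank * $2}'
```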

Google Sheets

Code (BASH)

Getting message contents from json files

I'm sure there's a quicker way to do this, but I don't care; this works.

# -f so this doesn't error out when output.txt doesn't exist yet
rm -f ./output.txt
for file in ./*/; do
    while IFS= read -r line; do
        # pull the value of the "Contents" field out of each JSON line
        message="$(echo "$line" | sed -n 's/.*"Contents": "\([^"]*\)".*/\1/p' )"
        # skip empty messages (attachments, embeds, etc.)
        if [[ -n "$message" ]]; then
            echo "$message" >> ./output.txt
        fi
    done < "${file}messages.json"
    echo "finished with $file"
done
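For what it's worth, the quicker way probably looks like dropping the per-line while loop entirely: sed can process each file in one pass. A sketch of that idea, with a hypothetical demo_channel directory standing in for a real export (the `\{1,\}` in the pattern requires at least one character, which replicates the empty-message check):

```shell
cd "$(mktemp -d)"        # work somewhere disposable for this demo
mkdir -p demo_channel    # stand-in for one exported channel directory
printf '{"ID": 1, "Contents": "hello world"}\n{"ID": 2, "Contents": ""}\n' > demo_channel/messages.json

rm -f ./output.txt
for file in ./*/; do
    # one sed pass per file instead of one per line
    sed -n 's/.*"Contents": "\([^"]\{1,\}\)".*/\1/p' "${file}messages.json" >> ./output.txt
done
cat ./output.txt
```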
    

Cleaning up the messages

This takes the data and reduces it down to purely lowercase letters. It takes a super long time, but it can be run under GNU parallel with no problems, which makes it go significantly quicker.

#!/bin/bash
# RUN USING '$ cat ./output.txt | parallel -j 12 ./cleaner.sh'

line="$1"
line=$(echo "$line" | iconv -f UTF-8 -t ASCII//TRANSLIT) # transliterate non-ASCII characters
line=$(echo "$line" | sed 's/[^a-zA-Z ]//g') # remove all non-letter characters
line=$(echo "$line" | xargs ) # normalize whitespace
line=$(echo "$line" | tr '[:upper:]' '[:lower:]') # normalize case
if [[ -n "$line" ]]; then
    echo "$line" >> cleaned_messages.txt
fi
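If you don't need parallel, the same four transformations can be chained as one pipeline over the whole file. A sketch with toy input standing in for the real output.txt; note the per-line xargs step is replaced here with a sed trim/squeeze, since xargs over a whole stream would join all the lines together:

```shell
cd "$(mktemp -d)"                                             # disposable demo directory
printf 'Hello,,, World!!\n   spaced   out  \n' > output.txt   # toy stand-in input

# transliterate, strip non-letters, lowercase, then trim/squeeze
# whitespace and drop lines that end up empty
iconv -f UTF-8 -t ASCII//TRANSLIT < output.txt \
    | sed 's/[^a-zA-Z ]//g' \
    | tr '[:upper:]' '[:lower:]' \
    | sed 's/^ *//; s/ *$//; s/  */ /g; /^$/d' > cleaned_messages.txt
cat cleaned_messages.txt
```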
    

Making a .csv for frequency

Pretty self-explanatory. Counts the frequency of each word (any run of characters separated by whitespace) and outputs it to a csv [word,frequency], which can be processed by pretty much anything that supports .csv files.

tr ' ' '\n' < ./cleaned_messages.txt | sort | uniq -c | awk '{print $2","$1}' > frequency.csv
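A quick sanity check on the resulting CSV: sorting numerically and descending on the count column shows the most common words first. A sketch with toy input standing in for the real cleaned_messages.txt:

```shell
cd "$(mktemp -d)"                                        # disposable demo directory
printf 'the the the a a cat\n' > cleaned_messages.txt    # toy stand-in input

# same frequency-counting pipeline as above
tr ' ' '\n' < ./cleaned_messages.txt | sort | uniq -c | awk '{print $2","$1}' > frequency.csv

# most common words first: -t, splits on commas, -k2,2nr sorts
# field 2 numerically (n) in reverse (r)
sort -t, -k2,2nr frequency.csv | head
```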