Guessing German genders using statistics

Originally published at https://mejuto.co on May 25, 2020.

German has three grammatical genders: masculine, feminine and neutral, marked by the articles der, die and das. You need to memorize the gender of each noun, since it affects other parts of the sentence. The accusative and dative forms, for instance, can differ according to the noun’s gender.

A quick example, with accusative:

Er trinkt ein Bier: Bier is neutral.

Er trinkt einen Kaffee: Kaffee is masculine. Notice the -en.

There is no escape: you need to learn — or guess — the noun’s gender.

It gets easier with a combination of time and some rule-of-thumb laws, or heuristics.

A heuristic is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect or rational, but which is nevertheless sufficient for reaching an immediate, short-term goal.
Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution. Heuristics can be mental shortcuts that ease the cognitive load of making a decision.

Mental shortcuts to ease the cognitive load… that sounds useful for learning a language! (All those pesky cases, vocabulary and pronunciation already take up a lot of my cognitive load.) Let’s see how good those heuristics are, and whether we can come up with easier ones that apply in most cases.

The data

We have at our disposal a list of the 500,000 most frequent words in German and their genders. The file is a CSV without a header and begins like this:

500,000 most frequent German words: source
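A minimal sketch of reading such a file with Python's standard `csv` module. The column layout (word first, then a gender code `m`, `f` or `n`) and the sample rows are assumptions for illustration; the real file may be laid out differently.

```python
import csv
import io

# Assumed layout: word, gender code (m, f or n), no header row.
# These sample rows are illustrative, not taken from the real file.
SAMPLE = "Bier,n\nKaffee,m\nZeitung,f\n"


def load_nouns(handle):
    """Return a list of (word, gender) tuples from a headerless CSV."""
    nouns = []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        word, gender = row[0].strip(), row[1].strip().lower()
        nouns.append((word, gender))
    return nouns


nouns = load_nouns(io.StringIO(SAMPLE))
print(nouns)
```

For a file on disk, the same function can be fed an open handle, for example `open("nouns.csv", encoding="utf-8", newline="")`, where the file name is hypothetical.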

Textbook heuristics: the rules of the game

We want a rule that:

can be applied mechanically, by looking only at the letters of the word. Rules based on meaning (about countries, names of ships and the like) do not qualify: we do not want to teach a program what is a ship and what is not, and we want a simple rule, not more things to memorize!
does not come with exceptions to learn. We want the simplest mechanical rule that guesses right most often.

Distribution of genders

Gender is a category, not a number, so we cannot rely on averages to summarize it (in fact, averages are often, like pie charts, a bad way to analyze data). It is better to look at the distribution of the data.

Based on the 500,000 most common German words
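A distribution like this one can be computed by counting how often each gender appears. The sketch below uses a handful of illustrative words rather than the real list, and assumes the data is held as `(word, gender)` pairs.

```python
from collections import Counter

# Illustrative (word, gender) pairs; the real list is far longer.
nouns = [("Bier", "n"), ("Kaffee", "m"), ("Zeitung", "f"), ("Lampe", "f")]


def gender_distribution(pairs):
    """Return {gender: share of words} for a list of (word, gender) pairs."""
    counts = Counter(gender for _, gender in pairs)
    total = sum(counts.values())
    return {gender: count / total for gender, count in counts.items()}


distribution = gender_distribution(nouns)
for gender, share in sorted(distribution.items(), key=lambda kv: -kv[1]):
    print(f"{gender}: {share:.0%}")
```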

Let’s apply the heuristic rules of our grammar book to our data and see how accurate they are. We will guess the gender according to the textbook rules:

Some rules are pretty short (ending in a single letter). Are there internal contradictions?

We can see that some words match several rules at the same time! Some examples of multiple matches:

-in can be both feminine and neutral. According to the textbook heuristics, any word ending in -in is probably feminine. But it is also probably neutral (see table above), since the exact same ending appears under both genders! We will ignore this -in ending rule.
-itis (feminine ending) and -s (masculine ending): any word ending in -itis matches both.

Whenever there are multiple matches, we will keep only the longest one (in number of characters). Once we apply this, we get rid of the cases with multiple matches.
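The longest-match rule can be sketched in a few lines. The `RULES` dictionary below is a hypothetical subset of endings chosen for illustration, not the full textbook table, and `guess_gender` is a name made up for this example.

```python
# Hypothetical subset of endings, for illustration only.
RULES = {"itis": "f", "ung": "f", "s": "m", "chen": "n"}


def guess_gender(word, rules=RULES):
    """Return the gender of the longest matching ending, or None if no rule matches."""
    word = word.lower()
    matches = [ending for ending in rules if word.endswith(ending)]
    if not matches:
        return None  # the word falls into the "no match" group
    return rules[max(matches, key=len)]


print(guess_gender("Arthritis"))  # matches -itis and -s; the longer -itis is used
print(guess_gender("Haus"))       # only -s matches
print(guess_gender("Tisch"))      # no rule matches
```

Two different endings of the same length cannot both match the same word, so picking the longest match is never ambiguous.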

We can see now that the distribution is similar to our first chart. There are more feminine nouns, followed by masculine and neutral. That huge last bar is a problem, though: a lot of the words did not match any rule!

No match

Our only problem now is we still have a lot of words without any heuristic we can apply. This is not good for our cognitive load, remember? We will deal with this in a second, but first, how accurate are the results that do match?

Accuracy

How accurate is the classification of the words that matched a rule? Let’s find out:

This will be our baseline for, you guessed it, our quest to find a better rule.
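One way to score a set of rules is sketched below. It separates three numbers that are easy to mix up: how many words matched any rule, how many of the matched words were guessed correctly, and how many of all words were guessed correctly. The metric names, the toy guesser and the sample words are assumptions for illustration; the article does not state which exact definition it uses.

```python
def evaluate(nouns, guess):
    """Score a guessing function against (word, gender) pairs."""
    matched = correct = 0
    for word, gender in nouns:
        predicted = guess(word)
        if predicted is None:
            continue  # no rule applied to this word
        matched += 1
        if predicted == gender:
            correct += 1
    total = len(nouns)
    return {
        "coverage": matched / total if total else 0.0,
        "accuracy_on_matched": correct / matched if matched else 0.0,
        "correct_overall": correct / total if total else 0.0,
    }


# Illustrative data and a toy guesser: -ung is feminine, -chen is neutral.
nouns = [("Zeitung", "f"), ("Mädchen", "n"), ("Sprung", "m"), ("Tisch", "m")]


def toy_guess(word):
    if word.endswith("ung"):
        return "f"
    if word.endswith("chen"):
        return "n"
    return None


scores = evaluate(nouns, toy_guess)
print(scores)
```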

Groups of n characters

Is there a group of n characters in a noun that corresponds to a particular gender?

We will test every permutation of 2 to 4 letters and check which words start with, end in, or contain that string.

All 4-character permutations of the 30 letters a b c d e f g h i j k l m n o p q r s t u v w x y z ß ü ö ä means we are dealing with P(n, r) = n! / (n − r)! = 30! / (30 − 4)! = 30 · 29 · 28 · 27 = 657,720 permutations.
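The count can be checked with Python's standard library (`math.perm` needs Python 3.8 or later). The sketch also totals the candidates for lengths 2 to 4 and generates the first few 2-letter groups.

```python
import math
from itertools import islice, permutations

ALPHABET = "abcdefghijklmnopqrstuvwxyzßüöä"  # the 30 letters listed above

# Permutations without repeated letters: P(30, r) = 30! / (30 - r)!
counts = {r: math.perm(len(ALPHABET), r) for r in (2, 3, 4)}
print(counts)
print(sum(counts.values()))

# The first few 2-letter candidates, generated lazily.
first_groups = ["".join(p) for p in islice(permutations(ALPHABET, 2), 3)]
print(first_groups)
```

Note that a permutation never repeats a letter, so a group such as `ss` or `nn` is not among these candidates.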

We will discard groups that match less than 1% of the time, to avoid having a huge list of values. We are looking at words that start with, end in, or contain each group of characters.
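A rough sketch of this scoring step follows. It takes a different route from the one described here: instead of testing every permutation against every word, it counts the character groups that actually occur in the words, which also means groups with repeated letters are included. The function name, the row layout and the reading of the 1% cut-off (share of all words matched by the group) are assumptions.

```python
from collections import Counter, defaultdict


def score_groups(nouns, min_len=2, max_len=4, min_share=0.01):
    """Return rows of (group, match_type, gender, percent_matched, num_matched)."""
    # counts[(group, match_type)][gender] = number of words with that match
    counts = defaultdict(Counter)
    for word, gender in nouns:
        word = word.lower()
        seen = set()  # count each (group, match_type) once per word
        for size in range(min_len, min(max_len, len(word)) + 1):
            seen.add((word[:size], "start"))
            seen.add((word[-size:], "end"))
            for i in range(len(word) - size + 1):
                seen.add((word[i:i + size], "contain"))
        for key in seen:
            counts[key][gender] += 1

    rows = []
    for (group, match_type), by_gender in counts.items():
        num_matched = sum(by_gender.values())
        if num_matched / len(nouns) < min_share:
            continue  # assumed reading of the 1% cut-off
        gender, hits = by_gender.most_common(1)[0]  # dominant gender for the group
        rows.append((group, match_type, gender, hits / num_matched, num_matched))
    # Best percentage first, then most matches; group and type break ties.
    return sorted(rows, key=lambda r: (-r[3], -r[4], r[0], r[1]))


# Illustrative words only; a high cut-off keeps the output short.
nouns = [("Zeitung", "f"), ("Wohnung", "f"), ("Sprung", "m"), ("Mädchen", "n")]
rows = score_groups(nouns, min_share=0.5)
for row in rows:
    print(row)
```

In this toy sample the ending `ung` matches three words, two of them feminine, so it is reported as a feminine rule that is right two times out of three.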

An inefficient implementation, and 21 hours later the program has finished. What are the results?

Removing rule inconsistencies

Some rules match all genders. We remove these duplicates, which leaves us with the data we need to start looking for useful rules.

Here is the start of the table, ordered by Gender matched, Percent matched, Match type and Num matched:

Here we hope to find our rules.

The sequence Value matched can appear at the end (`end`) or anywhere (`contain`) in the word.

In the previous textbook heuristics table we had a number of rules per gender:

Masculine: 40 rules
Feminine: 15 rules
Neutral: 13 rules
Total (40 + 15 + 13) = 68 rules

A new set of rules

In our quest to reduce cognitive load we are looking for better rules than the ones you have seen. They are pretty accurate. Can we reduce the number of rules to memorize? Armed with the big table of character groups, we can start looking.

First try: best 20 rules based on ending

Let’s try to find the best rules among all the possible permutations. As a first exploration we will focus on “ends in” rules, with a maximum of 20 rules.

The 20 best “ends in” rules. If a word ends in the value, we classify it as that gender. No masculine rule made it to the top 20.

Results best 20

**2% improvement** over the textbook. We can also see how each gender (`m`, `f` and `n`) performs individually with the textbook rules vs ours.

Both the graph and the numbers give us the same insight: our first guess, with 20 rules, already performs 2% better than the baseline!

In search of even better rules

Until now we checked only 20 rules. Why 20? For comparison, the textbook had (as seen before):

Masculine: 40 rules
Feminine: 15 rules
Neutral: 13 rules
Total (40 + 15 + 13) = 68 rules

20 was just a guess to explore the data. What is the optimal number of “ends in” rules that gives us the greatest accuracy? Let's find out.

Best rules based on begins, contains or ending

Until now we checked only 20 rules, where the rule is “word ends in” -ung, -er, etc., mimicking the rules from the textbook. “Ends in” is probably easier to remember, but we do not need to limit ourselves to this kind of rule. It might be that “contains” (anywhere in the word) produces an easier-to-remember rule with the same or better accuracy.

We will run the best rules from 1 rule to 68 (the textbook number of rules) and see how accurate they are.
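The sweep can be sketched as a loop over the number of rules kept. The ranking and the words below are hypothetical and only illustrate the mechanics; `accuracy_with_top_rules` is a name made up for this example, and it scores “ends in” rules only, counting a word as correct when the longest matching ending gives the right gender.

```python
def accuracy_with_top_rules(nouns, ranked_rules, n_rules):
    """Share of all words guessed correctly using the first n_rules ending rules."""
    rules = dict(ranked_rules[:n_rules])  # ending -> gender
    correct = 0
    for word, gender in nouns:
        word = word.lower()
        matches = [ending for ending in rules if word.endswith(ending)]
        if matches and rules[max(matches, key=len)] == gender:
            correct += 1
    return correct / len(nouns)


# Hypothetical ranking, best rule first; not the article's actual table.
ranked_rules = [("ung", "f"), ("chen", "n"), ("e", "f")]
nouns = [("Zeitung", "f"), ("Mädchen", "n"), ("Lampe", "f"), ("Tisch", "m")]

curve = [
    accuracy_with_top_rules(nouns, ranked_rules, n)
    for n in range(1, len(ranked_rules) + 1)
]
print(curve)
```

With the real ranking, the same loop would run from 1 to 68 and the resulting curve could be compared against the textbook baseline.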

Results from 1 to 68 rules

We take the best rules, from 1 rule to 68 rules, and measure their accuracy against the baseline (“Ours % match” vs “Textbook % match”).

Below we can see this data in a more intuitive way:

The letters at the end of a word (`ending`) predict the gender better than in other positions (`contain`)

We can see clearly:

Textbook rules match 27% of the time.
Ending rules are more accurate than contain rules.
With just 22 rules we can achieve the same accuracy as with textbook's 68 rules.

Or, alternatively:

With 68 rules (same as the textbook) we can increase the accuracy from 27% to 45%.

The new rules

In the previous section we calculated the different tradeoffs between accuracy and number of rules to remember. How many rules are you willing to memorize?

If you only remember one rule

You cannot be bothered, tempus fugit. If you can only learn one rule, learn this one.

To be 20% accurate: 10 rules

Even a broken clock is right twice a day. If you are happy with a set of rules that is correct 20% of the time, use these 10 rules:

Same accuracy, less to memorize (recommended)

To guess correctly as many words as with the textbook rules (27%), you will have to memorize a bit less: instead of the 68 rules from before, only the 22 below:

Maximum accuracy: 68 rules

Only the best is good enough. For you, perfection is not the enemy of the good, but the only good. Then study these 68 rules - the same number as in our textbook [1] - for a correct guess 45% of the time, versus a now pale 27%.