Recognizing the language in a Natural Language Text with NLanguageRecognizer

Continuing my trend of writing about language processing, today I want to discuss about identifying the language of a body of text. This is an interesting task we can do thanks, once again, to Apple’s investment in APIs linked to machine learning.

Today we will explore the NLLanguageRecognizer object. Introduced in iOS 12, this class can do a lot of language recognizing, from detecting the “dominant language” of a string, to all the possible languages.

Introducing NLLanguageRecognizer

Don’t try to use one instance of this object through multiple threads.

This class is actually very easy to use. It has very few methods, and the easiest task to perform is one static method away. You need to import NaturalLanguage to use it.

Quickly Recognizing a Language in a String.

If all you need to do is to quickly recognize a language in a string, you can use the static dominantLanguage(for:) method. This method takes the string to recognize and returns an optional NLLanguage object, which contains the language itself. If the string cannot recognize a language at all, it will be nil.

var stringToRecognize = "This is an awesome string."

if let lang = NLLanguageRecognizer.dominantLanguage(for: stringToRecognize) {
  print(lang.rawValue) // prints "en"
}

And the fun part is, because the method can return the dominant language, you can mix multiple languages together and it will return the one with most presence.

var stringToRecognize = "This is an awesome string. Cuando yo estaba por ahí en las calles decidí preguntar el significado de la vida"

if let lang = NLLanguageRecognizer.dominantLanguage(for: stringToRecognize) {
  print(lang.rawValue) // prints "es"
}

The above example uses a string with both English and Spanish. Because Spanish is the dominant language, it prints “es”.

Advanced Usage

That’s probably a bad title, because using the other features o this class is not complicated at all.

First, we can detect all the languages in a string. Although this will not be accurate all the time, an instance of this class offers the languageHypotheses(withMaximum) method, which tries to return all the languages found in a string. The return type is a dictionary of type [NLLanguage: Double]. The double is the probability of each language. The withMaximum parameter is the maximum number of languages to return.

To use an instance instead of the static methods of this class, you have to call the processString method, which takes a string and returns nothing. After you call this method, NSLanguageRecognizers will have its dominantLanguage property filled. You will also be able to use languageHypotheses(withMaximum) method. Calling processString is essential to do anything interesting with this class. And like you are able to tell, everything happens in the same thread, so remember not to use one instance concurrently.

The following example gets all the possible languages in the string:

var stringToRecognize = "This is an awesome string. Cuando yo estaba por ahí en las calles decidí preguntar el significado de la vida"

let langRecognizer = NLLanguageRecognizer()
langRecognizer.processString(stringToRecognize)
for (lang, perc) in langRecognizer.languageHypotheses(withMaximum: 10) {
  print("Probability of \(lang.rawValue): \(perc)")
}

It will output something like the following:

Probability of de: 0.0011527120368555188
Probability of sk: 0.0013781085144728422
Probability of hu: 0.001778516685590148
Probability of it: 0.003761883592233062
Probability of pt: 0.01308358833193779
Probability of nl: 0.0018757604993879795
Probability of ro: 0.0038427249528467655
Probability of hr: 0.0007617694209329784
Probability of en: 0.09818074852228165
Probability of es: 0.8707313537597656

You should try to give the withMaximum parameter a more reasonable value if you have any idea of what the dominant languages are going to be. We can observe that English and Spanish have the bigger percentages.

You can also guide the recognizer by specifying the languageHints and languageConstraints properties. I wasn’t able to find much use for languageHints because it takes a dictionary similar to the one returned by languageHypotheses(withMaxium:), but by using languageConstraints you can limit the languages you want to recognize.

If we add the following code to the piece of code above

langRecognizer.languageConstraints = [.english, .spanish]

It will print:

Probability of es: 0.8986690044403076
Probability of fr: 0.0
Probability of hr: 0.0
Probability of da: 0.0
Probability of en: 0.10133091360330582
Probability of cs: 0.0
Probability of fi: 0.0
Probability of de: 0.0
Probability of hu: 0.0

And it would be good to assign withMaxium to 2 here, as we know we only want to recognize two languages.

Conclusion

Recognizing a language in iOS is as easy as using the NLLanguageRecognizer API introduced in iOS 12 and calling a few lines of code. The system will do its best to determine the dominant language or all the possible languages in a string, and you can use this information for natural-language apps.