How Deep Learning Actually Learns About Language
Recent years have witnessed rapid progress in the field of Artificial Intelligence (AI). This progress happened primarily as a result of advances in machine learning, a sub-field of AI, and in particular deep learning. Deep learning is a class of machine learning methods loosely inspired by information processing in the human brain. Just as human brains have neurons connected to other neurons, scientists can design an artificial network of neurons connected to one another (called an artificial neural network). Similar to the human brain, each neuron in this artificial network can receive, process, and transmit information to other neurons. This network can then learn from data. Data can be images, video, text, or other forms of signal. Neurons in the artificial network are arranged in layers, and since there can be many of these layers (sometimes more than 100), we call the network deep. This is where the name “deep learning” comes from.
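For readers curious what a neuron and a layer look like in code, here is a minimal sketch. The sizes, the random weights, and the choice of non-linearity are arbitrary and chosen only to illustrate information flowing through layers; a real deep network would stack many more layers and would learn its weights from data rather than use random ones.

```python
# A minimal sketch of the idea above: an artificial "neuron" takes numbers in,
# combines them with weights, and passes the result on to the next layer.
# Stacking many such layers is what makes the network "deep".
import numpy as np

def layer(inputs, weights, biases):
    """One layer of neurons: a weighted sum followed by a simple non-linearity."""
    return np.maximum(0, inputs @ weights + biases)  # ReLU activation

rng = np.random.default_rng(42)
x = rng.normal(size=4)                               # a tiny input signal (4 numbers)
h = layer(x, rng.normal(size=(4, 8)), np.zeros(8))   # first layer: 8 neurons
y = layer(h, rng.normal(size=(8, 2)), np.zeros(2))   # second layer: 2 neurons
print(y)  # the network's output; training would adjust the weights to make this useful
```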
These deep neural networks can now be used in a fast-growing number of applications, from disease diagnosis and drug discovery to self-driving cars and smart agriculture. At the time of COVID-19, progress in and use of AI and deep learning are more relevant than ever. The reason is that, as the pandemic hit, our dependence on technology skyrocketed. Take, for example, how language learners turned to apps instead of being in a physical classroom. Or how college students are making use of services such as automatic captioning of videos recorded by their instructors as a way to facilitate notetaking from lectures. Or the use of machine translation for online learning by millions of self-learners on a single website such as YouTube. All these activities are possible thanks to applications of deep learning and natural language processing (NLP). NLP is the field focused on teaching machines to understand and generate human language, and it too was revolutionized by deep learning. Dialog systems, such as Apple’s Siri or Amazon Echo, in which the machine interacts with humans by taking in some input such as a question and providing an answer, are an example application of NLP. Almost all commercial NLP applications are now based entirely on deep learning. But how do computers learn human language?
How Computers Learn Human Language
To answer this question, let’s take machine translation as an example. Traditionally, scientists would depend on word lists and dictionaries compiled by humans to translate between a pair of languages. We talk about a source language (say English) and a target language (say French). A text that needs to be translated from the source into the target can be split into individual sentences. We can use the dictionary to translate each word in the source language into its equivalent in the target language. Our dictionary may not be complete, however, and so some words in the sentence we are translating will not appear in it. We can try to solve this problem by continuously adding new words and their translations to the dictionary. This method, however, quickly becomes problematic since a single word in the source language can mean a whole phrase in the target. For example, the English word “please” translates into “s’il vous plaît” in French. A simple solution would be to update our dictionary to map a single word to a phrase and vice versa. But we then face yet another problem: The same word can mean different things when it occurs in different contexts. For example, the English word “bank” can mean either a “financial institution” or the bank of a “river”. To solve this new problem, we can try to take into account the context in which a word is used. For example, if the words “cash” and “clerk” occur close in the text to the word “bank”, we can assume that the intended meaning is the “financial institution”. But this is still not perfect, since the context itself can be incomplete or ambiguous. We would then need to think about new solutions. And so on. In short, solving machine translation this way would require far too many rules.
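To see why the dictionary approach runs into trouble, here is a toy word-for-word translator in Python. The dictionary entries and example sentences are invented for illustration and are nowhere near a real bilingual lexicon.

```python
# A toy word-for-word translator illustrating the dictionary approach above.
# The entries below are illustrative only, not a real bilingual dictionary.
en_fr = {
    "the": "le",                  # but French articles depend on the noun's gender ("le" vs. "la")
    "bank": "banque",             # the "river bank" sense should be "rive" instead
    "please": "s'il vous plaît",  # one English word maps to a whole French phrase
    "is": "est",
    "closed": "fermée",
}

def translate(sentence):
    # Translate each word on its own; words missing from the dictionary pass through unchanged.
    return " ".join(en_fr.get(word, word) for word in sentence.lower().split())

print(translate("The bank is closed"))      # "le banque est fermée": wrong article ("la"), and which "bank" did we mean?
print(translate("Please open the window"))  # "s'il vous plaît open le window": unknown words stay in English
```

Each flaw in the output (the wrong article, the ambiguous “bank”, the untranslated words) would demand yet another rule, which is exactly how the rule count explodes.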
Even though scientists tried hard, and many such rules and techniques were developed at IBM in the 1990s, the task of translation remained a challenging one. Since humans can be very creative with how they use language, the rules were never sufficient, and the resulting translations remained far from fluent. Looking at a piece of translated text, you would definitely know it was translated from another language. This is not a good sign, since when skilled humans translate text, the translation comes out looking natural. Likewise, we would want computers to produce natural translations. Trying to make machine translation better by finding and applying ever more rules does not work, and humans cannot simply list all the rules since new rules will always be needed. And even if humans could find all the rules for a pair of languages, we would still need to come up with a whole new set of rules to translate a new pair, say English to Spanish. With the number of world languages today estimated at more than 7,000, it becomes immediately clear how laborious the process can be.
Filling Buckets
Faced with this difficult problem, perhaps one idea would be to come up with the translation rules themselves automatically. Jumping several years ahead, scientists started thinking about words in a language not as atomic symbols, but as meaning containers that have associations with other words. Thinking about each word as a container allows us to give the computer a full bucket of information about each word. Scientists use the mathematical concept of a vector to refer to such a container, but the idea remains the same. Instead of telling the computer this sentence has the word “queen” and it translates into the word “reine” in French, we can give it the whole bucketful of information about the word. We can, for example, say the word “queen” is related to words such as “king”, “royal”, and “female”. If we have a large text such as English Wikipedia, we can ask the computer to look at all occurrences of the word “queen” as well as all the contexts it occurs in. While the computer is doing that, we can give it an empty bucket (or a random vector) to fill in with information about the word. It will do this by using an artificial network of connected neurons, and we will say it is filling these buckets (i.e., it is learning the right set of weights for each dimension in the vector). If we are flexible enough, we can think about phrases as just words stuck together, and we can learn their meanings in the same way. And sentences are really groups of words, and so we can learn the meaning of a sentence via a group of buckets (let’s call that group of buckets a matrix). Once we have done this, we would have a powerful basis to build our translation system on. Learning about language becomes a business of bucket filling. The right type of bucket filling. Well, there is a lot of math going on, such as weighted summing and matrix multiplication, but we do not need to worry about it since computers are good at that.
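Here is a minimal sketch of the bucket (vector) idea, using simple co-occurrence counts on a tiny made-up corpus instead of a neural network. Real systems such as word2vec learn these vectors from billions of words, but the intuition is the same: words that occur in similar contexts end up with similar buckets.

```python
# A toy version of "filling buckets": each word's vector is built from counts
# of the words that appear near it. The corpus and window size are made up
# purely for illustration.
import numpy as np

corpus = [
    "the queen spoke to the king at the royal palace",
    "the king and the queen are royal",
    "she deposited cash at the bank with the clerk",
    "the clerk at the bank counted the cash",
]

# Every distinct word gets one dimension of the bucket.
words = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(words)}

# Fill each word's bucket by counting which words occur within 2 positions of it.
vectors = np.zeros((len(words), len(words)))
for line in corpus:
    tokens = line.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - 2), min(len(tokens), i + 3)):
            if i != j:
                vectors[index[w], index[tokens[j]]] += 1

def similarity(a, b):
    """Cosine similarity: how much two buckets point in the same direction."""
    va, vb = vectors[index[a]], vectors[index[b]]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9)

print(similarity("queen", "king"))  # higher: "queen" and "king" share contexts
print(similarity("queen", "cash"))  # lower: their contexts differ
```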
Back to Translation, or Using the Buckets
Now that we have taught computers how to understand a sentence, we can try to map one or more buckets from a source language to one or more buckets from a target language. Since these buckets are rich with details about words and phrases, our job of mapping between languages should be easier. Translation is hard, but we have now read all of English Wikipedia and have much more information about the words and phrases in each sentence. Instead of going back to using dictionaries, we can ask humans to translate large amounts of text (thousands or millions of sentences) and use these data to teach the computer to translate. Every time the machine sees a sentence in English and its French translation, it will learn correlations between words and phrases across the two languages without us needing to provide the rules about these correlations. How does it do that? Well, that is the whole point of deep learning. It is able to learn the rules on its own, so long as we give it the human-translated data. It does this as an artificial neural network, with groups of neurons working together (in what are called architectures) over many iterations. Every time the network makes a mistake, it will know it did because we gave it the correct human translation to learn from. And so, when it fails to produce the correct translation, it will try to do better next time by reducing its own rate of errors. Even if it reduces errors only very slightly each time, it will still improve because it can run for thousands or millions of iterations. In short, given sufficiently large human-translated data, the network can learn to produce very good translations. When translating a given word or phrase, it knows what words or phrases in the source it should pay attention to. It also considers what it has already translated thus far. So, ultimately, the network is able to produce fluent translations. To appreciate how good machine translation can be, let’s read the story of “The Two Ducks and the Turtle” below. It is translated from Arabic into English, using Google’s public translation service.
There were two ducks and a turtle living with them in a spring of water, and when the spring of water dried up, the two ducks decided to leave to look for another place where there was water, and they informed the turtle of their departure. The turtle said: “And I cannot live without water, please take me with you.” Then the two ducks said: “But we fly while you crawl, so how will you come with us?”; The two ducks had an idea and said, “We will take you with us, but on the condition that you do not talk to anyone, and do not open your mouth during your journey with us.” The turtle was amazed at first, then she agreed, and told them: “How am I going to go with you?”; One of the ducks brought a wooden stick, and she said to the turtle: “Bite on the middle of it, and we will carry it from both ends with the beak, and so you will come with us.” (Google Translate, February 10, 2021. The Arabic source is at the end of this piece.)
You will notice that the story reads well. Since this text is intended for young children, we probably could improve the translation by choosing simpler words in some places (for example, we could use “told” instead of “informed”). But the translation is good enough. And we can even train the computer to simplify the text, without needing to intervene manually.
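To make the “learning by reducing errors” idea above a little more concrete, here is a minimal, self-contained sketch in Python. It uses a made-up toy task (mapping random input vectors to target vectors) rather than real translation data, and a single set of weights rather than a full neural architecture; the dimensions and learning rate are arbitrary choices for illustration.

```python
# A toy demonstration of learning by reducing errors a little at a time.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8))      # 100 toy "source" examples, each a vector of 8 numbers
true_W = rng.normal(size=(8, 8))   # the mapping the network is supposed to discover
y = x @ true_W                     # the "correct answers" (standing in for human translations)

W = rng.normal(size=(8, 8))        # the network's own weights, initially random
learning_rate = 0.01

for step in range(2001):
    prediction = x @ W                 # the network's current attempt
    error = prediction - y             # how far it is from the correct answer
    loss = np.mean(error ** 2)         # a single number measuring how wrong it is
    gradient = x.T @ error / len(x)    # the direction in which the error grows
    W -= learning_rate * gradient      # nudge the weights the other way: slightly less wrong
    if step % 500 == 0:
        print(f"step {step}: average error {loss:.4f}")
```

The same principle, repeated over millions of iterations with far larger networks and real human-translated sentences, is what underlies the training of translation systems like the one quoted above.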
Why Does It Matter?
Machine translation using deep learning is usually called “neural machine translation” (NMT). Current NMT technology is quite good for languages where we have human-translated data for computers to learn from. The technology can be used to alleviate a lack of human resources or curricular content. For example, it can be used to rapidly convert curriculum text passages from one language to another, create exercises automatically, and provide access to services that are otherwise hard to staff. This could help diversify curricula and nurture a culture of appreciation for different social groups, by allowing learners to learn about those groups through their languages. This improvement in literacy can have positive social and economic impacts, and it is the right thing to do for our children as well as adults.
Arabic source for “The Two Ducks and the Turtle” story:
كان هناك بطتين وسلحفاة تعيش معهما في عين ماء، وعندما جفت عين الماء، قررت البطتان الرحيل للبحث عن مكان آخر يوجد به ماء، فأخبرتا السلحفاة برحيلهما؛ فقالت السلحفاة: ”وأنا لا أستطيع العيش بدون الماء، أرجوكم خذوني معكما“، فقالت البطتان: ”ولكننا نطير وأنت تزحفين فكيف ستأتي معنا؟“؛ ففكرت البطتان في فكرة وقالتا: ”سنأخذك معنا ولكن بشرط ألا تتحدثين مع أحد، ولا تفتحين فمكي خلال رحلتك معنا“. استغربت السلحفاة في بادئ الأمر ومن ثم وافقت، وقالت لهما: ”كيف سأذهب معكما؟“؛ فأحضرت إحدى البطتين عودا من الخشب، وقالت للسلحفاة: ”قومي بالعض على منتصفه، ونحن سوف نقوم بحمله من طرفيه بالمنقار، وهكذا “سوف تأتي معنا.