How is it that programs like OpenAI’s GPT-3 neural network can answer multiple choice questions, or write a poem in a particular style, despite never being programmed for those specific tasks?
This may be because human language has statistical properties that prompt a neural network to expect the unexpected, according to new research from DeepMind, Google’s AI unit.
Natural language, when viewed from the perspective of statistics, has properties that are “non-uniform.” For example, a single word can stand for a number of things, a quality known as “polysemy”: the word “bank” can mean a place where you put money or a rising mound of earth. And words that sound alike can stand for different things, known as homonyms, such as “here” and “hear.”
Those qualities of language are the focus of a paper posted this month on arXiv by DeepMind scientists Stephanie C.Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aditya Singh, Pierre H. Richemond, Jay McClelland, and Felix Hill, titled “Data Distributional Properties Drive Emergent Few-shot Learning in Transformers.”
Also: What is GPT-3? Everything your business needs to know about OpenAI’s breakthrough AI language program
The authors began by asking how programs such as GPT-3 can solve tasks in which they are presented with kinds of questions for which they have not been explicitly trained, a phenomenon known as “few-shot learning.”
For example, GPT-3 can answer multiple-choice questions without ever being explicitly programmed to answer that form of question, simply by being prompted by a human user who types an example of a multiple-choice question-and-answer pair.
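In code, that kind of few-shot prompt can be sketched as follows. This is a minimal illustration only; the example questions and the `build_few_shot_prompt` helper are hypothetical, not from the paper or from any GPT-3 API.

```python
# Minimal sketch of few-shot prompting: no multiple-choice training is
# involved; a single worked example is simply placed ahead of the new
# question in the model's input text. All names here are illustrative.

def build_few_shot_prompt(examples, query):
    """Concatenate solved question/answer pairs, then the unanswered query."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

demo = [("Which is a mammal? (a) trout (b) whale (c) gecko", "(b) whale")]
prompt = build_few_shot_prompt(
    demo, "Which is a bird? (a) bat (b) eel (c) crow"
)
print(prompt)
```

The model is then expected to continue the text after the final “A:”, imitating the pattern established by the worked example.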
“Large transformer-based language models are able to perform few-shot learning (also known as in-context learning) without having been explicitly trained for it,” they write, referring to Google’s wildly popular Transformer neural net, the basis of GPT-3 and of Google’s BERT language program.
As they explain, “We hypothesized that specific distributional properties of natural language might be driving this emergent phenomenon.”
The authors speculate that such large language model programs behave like another kind of machine learning program, known as meta-learning. Meta-learning programs, which DeepMind has explored in recent years, work by being able to model patterns of data that span different data sets. Such programs are trained to model not a single data distribution but a distribution of data sets, as explained in prior research by team member Adam Santoro.
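That distinction can be sketched in a few lines of Python. This is a toy illustration of the idea only, assuming hypothetical data sets each defined by its own item-to-label mapping; it is not the setup from Santoro’s research.

```python
import random

# Toy sketch of meta-learning over a distribution of data sets: each
# "episode" samples a whole data set (here, a fresh item->label mapping),
# so the learner must adapt from context instead of memorizing one fixed
# mapping. All names and sizes are illustrative.

def make_dataset(items, rng):
    """A 'data set' here is just its own random item->label mapping."""
    labels = list(range(len(items)))
    rng.shuffle(labels)
    return dict(zip(items, labels))

def sample_episode(items, rng):
    dataset = make_dataset(items, rng)       # new data set every episode
    support = list(dataset.items())          # in-context examples
    query = rng.choice(items)                # answerable only from context
    return support, query, dataset[query]

rng = random.Random(0)
support, query, target = sample_episode(["circle", "square", "star"], rng)
```

Because the mapping changes every episode, a learner that merely memorizes labels scores at chance; it is forced to use the support examples in context.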
Also: OpenAI’s gigantic GPT-3 hints at the limits of language models for AI
The key idea here is that of different data sets. The non-uniformities of language, they speculate, such as polysemy and the “long tail” of language (the fact that speech consists largely of words that are used with relatively low frequency), each resemble a separate data distribution.
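The “long tail” the authors refer to can be simulated with a Zipfian rank-frequency distribution. A minimal sketch, with an illustrative vocabulary size and exponent rather than the paper’s exact training distribution:

```python
import random

# Sketch of a long-tailed (Zipfian) word distribution: the weight of the
# word at rank r is proportional to 1/r, so a handful of words dominate
# and most words are rare. Vocabulary size and exponent are illustrative.

def zipf_weights(vocab_size, exponent=1.0):
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]

vocab = [f"word_{i}" for i in range(1000)]
weights = zipf_weights(len(vocab))

rng = random.Random(0)
corpus = rng.choices(vocab, weights=weights, k=10_000)
# The top-ranked word appears orders of magnitude more often than a word
# deep in the tail, even though the tail contains most of the vocabulary.
```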
In fact, language, they write, falls somewhere in between supervised training data, with its regular patterns, and meta-learning, with its varied data sets:
As in supervised training, items (words) recur, and item-label mappings (such as word meanings) are somewhat fixed. At the same time, the long-tailed distribution ensures that there are rare words that recur only infrequently across context windows, but may burst (appear multiple times) within a context window. We can also see synonyms, homonyms, and polysemy as weaker versions of the completely unfixed item-label mappings used in few-shot meta-training, where the mappings change every episode.
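The “burstiness” described in the quote (a rare word shows up in few context windows, but appears several times when it does) can be sketched like this, with hypothetical window and burst parameters:

```python
import random

# Sketch of "burstiness": a rare word enters only a small fraction of
# context windows, but when it enters one it appears several times in a
# row. Window length, burst length, and the 10% entry rate are illustrative.

def make_window(common_words, rare_word, window_len, burst_len, rng):
    window = [rng.choice(common_words) for _ in range(window_len)]
    if rng.random() < 0.1:                         # rare word enters few windows
        start = rng.randrange(window_len - burst_len + 1)
        for i in range(start, start + burst_len):  # ...but bursts when it does
            window[i] = rare_word
    return window

rng = random.Random(0)
windows = [
    make_window(["the", "a", "of", "to"], "axolotl", 10, 3, rng)
    for _ in range(200)
]
```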
To test the hypothesis, Chan and colleagues take a surprising approach: they don’t actually work with language tasks. Instead, they train a transformer neural net to solve a visual task, called Omniglot, introduced in 2016 by scholars at NYU, Carnegie Mellon, and MIT. Omniglot challenges a program to assign the correct taxonomic label to 1,623 handwritten character glyphs.
In the case of Chan and team’s work, they turn the labeled Omniglot challenge into a few-shot task by randomly shuffling the labels of the glyphs, so that the neural net is learning anew with each “episode”:
Unlike in training, where labels were fixed across all sequences, the labels for these two image classes were randomly reassigned for each sequence […] Since the labels were randomly reassigned for each sequence, the model must use the context in the current sequence in order to make a label prediction for the query image (a 2-way classification problem). Unless otherwise stated, few-shot learning was always evaluated on holdout image classes that were never seen in training.
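A sketch of the kind of evaluation sequence the quote describes, with labels reassigned per sequence so the answer is recoverable only from in-sequence context. The class names are placeholders, not actual Omniglot glyphs:

```python
import random

# Sketch of a 2-way few-shot evaluation sequence: the two holdout image
# classes get freshly shuffled labels for every sequence, so the model
# must read the correct label off the in-sequence context pairs.

def make_eval_sequence(class_a, class_b, rng):
    labels = [0, 1]
    rng.shuffle(labels)                             # reassigned per sequence
    assignment = {class_a: labels[0], class_b: labels[1]}
    context = [(c, assignment[c])
               for c in (class_a, class_b, class_a, class_b)]
    query_class = rng.choice([class_a, class_b])
    return context, query_class, assignment[query_class]

rng = random.Random(1)
context, query_class, target = make_eval_sequence("glyph_x", "glyph_y", rng)
```

Across many such sequences, “glyph_x” is label 0 half the time and label 1 the other half, so no fixed mapping can be memorized.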
In this way, the authors manipulate the visual data, the glyphs, to capture the non-uniform properties of language. “At training time, we arrange the Omniglot images and labels into sequences with various language-inspired distributional properties,” they write. For example, they approximate the quality of polysemy by incrementally increasing the number of class labels that can be assigned to a given glyph.
“In evaluation, we then assess whether those properties give rise to few-shot learning abilities.”
What they found is that the neural network gets better at few-shot learning as they multiply the number of labels for a given glyph. “We see that increasing this ‘polysemy factor’ (the number of labels assigned to each word) also increases few-shot learning,” Chan and colleagues note.
“In other words, making the generalization problem harder actually made few-shot learning emerge more strongly.”
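As a sketch of what the “polysemy factor” manipulation looks like in code, with illustrative class names and a factor of 3 rather than the paper’s exact values:

```python
import random

# Sketch of the "polysemy factor": each glyph class is given several valid
# labels, and one of them is drawn at random each time the glyph appears,
# mimicking a word with multiple meanings. All values are illustrative.

def assign_polysemous_labels(classes, polysemy_factor, rng):
    """Give every class `polysemy_factor` distinct labels from a shared pool."""
    pool = list(range(len(classes) * polysemy_factor))
    rng.shuffle(pool)
    return {cls: pool[i * polysemy_factor:(i + 1) * polysemy_factor]
            for i, cls in enumerate(classes)}

def observed_label(cls, label_table, rng):
    return rng.choice(label_table[cls])   # any of the class's labels is valid

rng = random.Random(0)
table = assign_polysemous_labels(["glyph_a", "glyph_b"], 3, rng)
label = observed_label("glyph_a", table, rng)
```

Raising the factor makes any single glyph-to-label association less reliable, which is what makes memorization harder and, per the finding above, in-context learning more attractive to the model.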
Also, there is something about the particular structure of the transformer neural net that helps it achieve few-shot learning, Chan and colleagues found. They test “a vanilla recurrent neural network,” they write, and find that such a network never gains the few-shot ability.
“Transformers show a significantly higher bias towards few-shot learning than recurrent models.”
The authors conclude that both the properties of the data, such as the long tail of language, and the nature of the neural net, such as the transformer’s structure, matter. It is not one or the other but both.
The authors suggest several avenues to explore in the future. One is a connection to human cognition, since infants demonstrate what appears to be few-shot learning.
For example, infants rapidly learn the statistical properties of language. Could these distributional features help infants acquire the capacity for rapid learning, or serve as useful pre-training for later learning? And could similar non-uniform distributions in other domains of experience, such as vision, also play a role in that development?
It should be noted that the present work is not a test of language at all. Instead, it aims to simulate the presumed statistical properties of language by recreating their non-uniformity in visual data, the Omniglot images.
The authors do not address whether the translation from one modality to another has any bearing on the significance of their work. Instead, they write that they expect to extend their work to more aspects of language.
“The above results suggest exciting lines of future research,” they write, including, “How do these data distributional properties interact with reinforcement learning vs. supervised losses? How might the results differ in experiments that use language and language modeling more directly, for example using symbolic inputs, training on next-token or masked-token prediction, and having the meaning of words determined by their context?”