Exclusive TDS interview
This Google Scientist teaches AI to build better AI
AI visionary on her cutting-edge research at Google Brain, how Deep Reinforcement Learning works, and more.
Jan 23 · 9 min read
Interviewer: Haebichan Jung, Project Lead at TowardsDataScience.com.
Interviewee: Azalia Mirhoseini, Ph.D., Tech Lead and Senior Research Scientist at Google Brain, and one of MIT Technology Review's "35 Innovators Under 35."
This interview took place at the Toronto Machine Learning Summit (TMLS) in November 2019.
For more TDS-only interviews, please check out here:
Can you tell us about your professional background?
I got my Master's and Ph.D. in electrical and computer engineering at Rice University, where I worked on algorithmic and hardware/software co-design for large-scale data analytics and Machine Learning models. Towards the end of my Ph.D., when Deep Learning took off, I switched my focus to Deep Learning.
Afterwards, I quite accidentally saw a flyer for the Google Brain Residency Program. The goal was for people with backgrounds other than Deep/Machine Learning to become researchers in the field. I've been there for 3.5 years now.
On a slight tangent, something very common that I notice in the Machine Learning community is that a lot of people are excited about research in computer vision, NLP, robotics, Machine Learning, and games.
I, by contrast, was very excited about doing ML for systems and computer design, because these are the biggest enablers of Machine Learning. A lot of the success of Deep Learning is due to the better hardware and systems we have right now that we didn't have 10 years ago.
Can you give us some examples on the role of systems/chips in AI’s progress?
Of course. There are software frameworks like TensorFlow and PyTorch that are powerful and easy for people to get into for training their own DL models. The other important aspect is the chips and the hardware, like GPUs and TPUs.
TPUs, or Tensor Processing Units, are very powerful hardware for running ML algorithms that we didn't have before. These kinds of systems enable Deep Learning to be what it is today.
Also, if you look at the trend of A.I., we really need significantly better computer hardware to keep up with the computational demands of A.I. So the way I think about it is to use Machine Learning itself to help A.I. and design the next generation of systems and chips.
Is developing these systems your primary responsibility at Google Brain?
In 2018, my colleague Anna Goldie and I founded the Machine Learning for Systems Team at Google Brain. The main focus of our team is to use AI to design and optimize the next generation of systems and chips. The type of research we focus on is mostly Deep Reinforcement Learning algorithms: sequential decision-making optimization methods that enable us to solve large-scale optimization problems.
What do you mean by sequential decision making process?
To give you an example, in a robotics task, a robot wants to get to a target. The robot is located on a 2-D grid and wants to move around. The sequence of decisions consists of where the robot moves at each step (left, right, straight ahead, around barriers, etc.).
So there is a series of decisions that you take in sequence. You take one action, which is to go right, and then you take the next action. Your actions are conditioned on the previous actions you've taken.
Your end goal is to take the series of actions such that you reach the target as fast as possible, with the minimum number of moves. A sequential decision-making task has to do with how you can optimize a series of decisions such that you reach your target reward function (whatever that is) in an optimized way.
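To make the grid example concrete, here is a minimal sketch of such an environment in Python. The class and method names are illustrative, not from any particular library or from the research discussed here:

```python
import random

class GridWorld:
    """The agent starts at (0, 0) and tries to reach the target cell."""
    MOVES = [(0, 1), (0, -1), (-1, 0), (1, 0)]  # up, down, left, right

    def __init__(self, size=5, target=(4, 4)):
        self.size, self.target = size, target
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dx, dy = self.MOVES[action]
        # Clamp the move so the agent stays on the grid (the "barriers").
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        done = self.pos == self.target
        # -1 per move rewards reaching the target in as few steps as possible.
        reward = 10.0 if done else -1.0
        return self.pos, reward, done

# A random walk through the environment; a learned policy would do better.
env, done = GridWorld(), False
env.reset()
while not done:
    state, reward, done = env.step(random.randrange(4))
```

Each call to `step` is one decision in the sequence, and the per-move penalty is what makes "as few moves as possible" the optimal behavior.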
How does optimization happen in that process?
I'll explain in the context of policy optimization (reinforcement learning). You can imagine you have a Neural Net model that represents your policy. In the beginning:
- The model is initialized with random weights. It doesn't know anything about your environment. But gradually:
- The model takes the current state of your problem and outputs a probability distribution over the actions that you have. Let's say you have 4 actions (up, down, left, right). You take one of these actions and measure how much closer you are to your target, via a reward function.
- You now have this intermediate reward, and you take the next action, and so on, until you reach your target.
- The end reward, together with the collection of intermediate rewards, can be used as feedback to go back and update the parameters of the Neural Net that represents your policy.
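Putting those steps together: below is a minimal REINFORCE-style policy-gradient sketch in PyTorch, assuming an environment with the `reset`/`step` interface of the GridWorld sketched above. It illustrates the loop described in this answer, not the specific algorithms used at Google Brain:

```python
import torch
import torch.nn as nn

# Step 1: a policy network initialized with random weights. The 2-D input
# is the (x, y) grid position; the 4 outputs are logits over the actions.
policy = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def run_episode(env):
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(torch.tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)  # step 2:
        action = dist.sample()          # distribution over actions
        state, reward, done = env.step(action.item())          # step 3
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    # Step 4: use the episode's total reward as feedback to update the
    # parameters of the policy network (plain REINFORCE, no baseline).
    loss = -sum(rewards) * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```

Repeated over many episodes, the gradient step raises the probability of action sequences that earned high reward, which is the "feedback" loop described above.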
What does the term policy mean here?
Policy here means the reinforcement learning model. The reason it's called a "policy" is that it takes an input state and maps it to a set of actions: it predicts an action given a state. That's why it's called a policy.
Can you clarify how optimization happens without a target variable and on just reward functions alone?
Yes, there are no labels here, but we still have the input state of the problem, and in the end, we want to optimize for the reward function. In a lot of these policy optimization algorithms, what we are doing is training a policy that optimizes the expected reward given the distribution of actions that it predicts for a given state.
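In standard notation (a textbook formulation, not taken verbatim from the interview), the policy π_θ is trained to maximize the expected reward, and the policy-gradient theorem gives the update direction:

```latex
J(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[ R(s, a) \right],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[ R(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]
```

Raising the log-probability of actions in proportion to the reward they earn is exactly the feedback step described above; it requires no labels, only rewards.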
Switching gears, can you tell us about your other exciting Deep Learning research: the Sparsely-Gated Mixture-of-Experts Layer?
This was one of the first projects I did when I joined Google Brain, with a group of excellent colleagues. The idea behind this layer is that if you look at Deep Learning models (Transformers, convolutions, LSTMs), most of these are relatively dense.
Dense means that each input example goes through the whole network, from the beginning to the end. We process each input with the same amount of computation, across all input examples.
The idea behind this work is that we can have a union of experts, where each expert is a Neural Network itself, and these experts can specialize in different types of data within your training dataset. So as you pass an example through the model, the example goes through certain paths of this Neural Network all the way to the end, but not through everything.
There are many advantages to this model. First, we can have models with a lot of capacity that can learn from massive data. This means we can have a lot of parameters. One of the models we built had billions of parameters and was trained on billions of data points.
The beauty of this is in its simplicity, because an example only sees a small portion of the model. We can have, say, 1,024 experts, and an example goes through only 4 or 8 of them. So we have this large-capacity model, but the amount of compute applied to each sample is still very small. Yet this big model can collectively learn from the large amount of data we have, with a large number of parameters it can use to encode that knowledge.
So to repeat, the Sparsely-Gated Mixture-of-Experts (MOE) layer is:
a layer that is integrated with a global model, where you have a whole bunch of experts (i.e., Neural Nets) that are investigating different parts of the data. And the reason yours is called Sparsely-Gated is that you aren't looking at every expert, just a few, yet you're still able to arrive at similar results as with all of the different experts?
Or better results! The reason it's named Sparsely-Gated is that we train what we call a Gater together with these experts. The input to the expert layer goes to the Gater first, and the Gater decides which experts should process this input. Only a sparse number of those experts receive any input from the Gater. Think of the Gater as routing input examples through a small number of experts; that's why it's called a Sparse Gater.
And yes, as you say, we have this expert layer with the Gater, and you can embed it into a Deep Learning model, like an LSTM. Surprisingly, not only can we get results faster and achieve the same accuracy, we can achieve better results. Part of it was because we could process more data, and the other part was the regularization effect of the Sparsely-Gated MOE, which makes the model generalize better to unseen test data.
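As a rough illustration of the routing idea, here is a simplified sketch of a sparsely-gated MOE layer in PyTorch. The published layer also uses noisy top-k gating, load-balancing losses, and efficient batched dispatch; none of that is shown here:

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Top-k gated mixture of experts: each example visits only k experts."""
    def __init__(self, d_model=64, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)   # the "Gater"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                        # x: (batch, d_model)
        scores = self.gate(x)                    # one score per expert
        topk, idx = scores.topk(self.k, dim=-1)  # keep only the k best
        weights = torch.softmax(topk, dim=-1)    # renormalize over top-k
        out = torch.zeros_like(x)
        # Loops written for clarity, not speed: each example is processed
        # by only its k selected experts, never by all n_experts.
        for b in range(x.size(0)):
            for i in range(self.k):
                e = idx[b, i].item()
                out[b] += weights[b, i] * self.experts[e](x[b])
        return out
```

Only `k` of the `n_experts` networks run per example, so the parameter count can grow with the number of experts while the per-example compute stays roughly constant.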
Can you elaborate on the regularization?
Dropout is an example of regularization. You can think of the MOE as, in some sense, similar to dropout, except it's used in a more structured way, where we also get the benefit of sparsity for computational efficiency.
So we have all these experts, and each time the Gater looks at whatever activation it got for an example and passes it to a sparse number of them. Whereas in a dense layer, the whole model processes every example, here we are dropping a large number of experts by design.
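The analogy can be made concrete with a small illustrative comparison (not code from the paper): dropout zeroes individual activations at random, while the top-k gate in the sketch above drops whole experts in a learned, structured way.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8)
print(nn.Dropout(p=0.5)(x))         # random individual units zeroed (train mode)

scores = torch.randn(1, 4)          # gate logits for 4 experts
topk, idx = scores.topk(2, dim=-1)  # only 2 experts survive;
print(idx)                          # the other 2 are "dropped" by design
```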
What’s exciting about this research is that you guys solved the problem that conditional computation was trying to solve but couldn’t. Can you tell us more about this?
To explain, the reason this is conditional computation is that, unlike most other Deep Learning models, where the input goes through the whole network, here we condition the input's path on a Gater model trained jointly with the rest of the network.
We condition it through the Gater to restrict it to certain paths of the Neural Network. That way, our experts or modules become specialized in different parts of the training data, which helps them become better at processing and evaluating that data.
Can you tell us more about Google Brain? How are the teams structured and what kind of work do people do?
It's a great team, with a large number of great researchers and engineers who work very well together. We focus on important and really hard problems. It's just a lot of fun working with this team and on the types of problems that we do.
I would say we have a good academic culture. We are very active in publications and top-tier conferences. We really encourage our researchers to publish and to collaborate with people outside.
At the same time, a lot of exciting work has come out of Google Brain. An example is Google Translate, which is based on the LSTM/Seq2Seq approach, which is really amazing. It pretty much revolutionized the way NLP machine translation is done.
Do you get to work with Geoffrey Hinton at all?
Actually, my paper on Mixture-of-Experts was with Geoffrey Hinton. I feel very lucky to be able to work with him. He's an amazing person, such a pleasure just to be around, and the most humble person at the same time.
Final Question from TDS Audience: What is the future of AI? AI as a tool or AI as a product? What is the more likely scenario in the future?
I'm very positive about AI. I believe that AI will help everyone. Sure, in industry they might have access to compute and data that others might not. At the same time, I feel there is a lot of work and research that can be done even without a lot of data, because you can innovate around the constraints that you have. We see this all the time in academia.
So I think there are opportunities in both academia and industry to thrive in this era of AI. And if we do something great, for example self-driving cars, I personally believe it's going to be great for everyone, freeing up more of their time to pursue other things and innovate in other areas.
For the full interview, check out the video here: