>>Hi, everyone. I’m very happy to introduce Wei Wen from Duke University, who will give us a research presentation on Efficient and Scalable Deep Learning. Many of us work on deep learning or in areas related to it, and nowadays we love to train large models like GPT. Here, Wei is going to share a unique perspective on deep learning, focusing on its efficiency: in particular, on model compression, distributed training, and AutoML. Wei also has rich industry experience: he interned with MSR here, as well as at Google Brain and Facebook AI. Now let’s learn from Wei.>>Thank you [inaudible]. Hi everyone, I’m Wei from Duke. Thanks for coming. I’m really glad to be here, since I did two internships here and I’m trying to get myself upgraded. The topic today is Efficient and Scalable

Deep Learning and Beyond. The general trend in deep learning is that we can always get better performance by training larger models, given that we have a lot of data. Here is a figure covering the majority of computer vision models; the x-axis is computation cost and the y-axis is accuracy. In general, if we can afford a larger model, we get better performance. Strictly, this holds within one specific neural architecture family: for ResNet, we get better performance with deeper networks, and it is similar for the Inception models. In natural language processing, we see a similar trend. Here is an example of language modeling on WikiText: if we build a larger model, we get a better perplexity. So we are still curious how much lower we can go if we keep building larger models for NLP problems. The question, then, is: why don’t we always build a larger model? There are obstacles. The first obstacle is on the training side: training a larger model is very slow. When we have a new model we want to evaluate, checking whether it is good or not takes a long training time before we get feedback, and usually we have to evaluate a lot of models before we find a good one. So if models are too large, they stretch our research and production cycles. The second obstacle is on the inference side. After we build a model, we eventually want to use it, but if the model is too large, inference is very slow. It becomes very challenging to deploy such models to applications with very limited compute or memory resources, like the Microsoft HoloLens. So my research, in general,

aims to make training faster and inference faster, so that we can build larger models and also deliver those models to real industry applications. Here is the outline. I’ll first introduce my previous research on making training faster, specifically in distributed training systems, and then my research on sparse neural networks for making inference faster. Finally, I will introduce my future research. Let’s go to the first part: how can we make distributed training systems faster? I’ll focus on one work I did on ternary gradients to reduce communication in distributed deep learning. Here is a little background on distributed deep learning. In synchronous SGD, we have a central parameter server. We first send the parameters to multiple machines and train the model in parallel, with each machine using different data. After each machine finishes its local computation, we synchronize the gradients back to the central parameter server. That finishes one iteration, and we keep doing this again and again. This is good because having a lot of machines means we can train faster. But there is a problem: the communication bottleneck. In general, with more machines we can always reduce the computation time, but more machines also mean more synchronization over the network, so the communication time increases. The total time therefore saturates at some point; you cannot scale beyond it, and that limits the scalability of distributed deep learning systems. My research goal is to reduce the communication time and make distributed training more scalable. The idea is simple. In a distributed training system, we can slightly change the communication pattern so that we only communicate gradients over the network: each worker computes gradients and sends them to the central parameter server, which averages the gradients and sends them back. So we only communicate gradients over the network. Usually, gradients are communicated in 32-bit floating-point precision.
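One synchronous round of the scheme just described can be sketched in a few lines. This is an illustrative toy, not the talk’s actual system: the least-squares loss, the function names, and the shard shapes are all my own assumptions.

```python
# Toy sketch of one synchronous-SGD round with a central parameter server.
# Every worker holds an identical copy of w; only gradients are exchanged.
import numpy as np

def worker_gradient(w, x, y):
    # Each worker computes the gradient of a least-squares loss
    # 0.5 * ||x @ w - y||^2 on its own shard of data (an assumed loss).
    return x.T @ (x @ w - y)

def sync_sgd_step(w, shards, lr=0.1):
    # The "parameter server" averages the workers' gradients and applies
    # one update, so every copy of w stays identical after the round.
    grads = [worker_gradient(w, x, y) for x, y in shards]
    avg = sum(grads) / len(grads)
    return w - lr * avg

rng = np.random.default_rng(0)
w = np.zeros(3)
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]
w = sync_sgd_step(w, shards)
```

Because the averaged gradient equals the full-batch gradient over all shards, the n copies stay in lockstep, exactly as discussed in the Q&A below.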

That is 32 bits per element.>>[inaudible] the notation?>>The notation?>>The notation — what’s the G in red and what’s the G in blue?>>In this notation, W is the weight and G is the gradient. Each worker first computes a gradient on its own data and sends it back to the parameter server; the server averages the gradients from all the workers and sends the averaged gradient back, and then each worker is updated by the averaged gradient.>>So you assume that the model has n copies?>>Yes.>>Then all the models are always the same. So they are [inaudible] by the same [inaudible]?>>Yes: the initialization is the same and the updates are the same, so they will always be the same.>>The distributed trend in [inaudible]. The model is in the copy.>>The model is in each worker, and they use the same initialization seed, so they will always be the same. This way we only communicate the gradients, and let me explain why communicating only the gradients over the network is beneficial. Usually each element takes 32 bits. In this work, we apply quantization: we reduce the precision

to only three levels, that is, only three discrete values. We call these ternary gradients — TernGrad, in short. If we can do this successfully, we get at least a 16-fold reduction in communication. But it is very challenging, because we lose a lot of precision in the training process.>>As [inaudible] research I did as [inaudible].>>Yes, I’m aware of that. The basic idea is simple. Before I go into the details, let’s take one step back. In supervised learning in general, we want to minimize the average loss over all the training samples, and the weights can be updated by the gradient over the whole dataset; but n is usually very large, so the computation is very costly. In deep learning, we usually use a stochastic version: we randomly sample from the dataset and use the sampled gradient to estimate the original batch gradient. One reason this works well is that the expectation of the sampled gradient is the original batch gradient, so it is unbiased. If so, why don’t we quantize in a way such that, after quantization, we still keep the expectation? The motivation here is to ternarize the floating-point gradient while keeping its expectation equal to the original gradient, so it is still an unbiased gradient. That’s the motivation. How can we do it? It’s simple, actually. This is the floating-point gradient; we first take the sign of all the gradients. Then we have a scalar, which is usually very small.>>[inaudible]>>No, it’s not learned: s_t is the maximum absolute value over all the gradients. We have those two parts, and then an element-wise multiplication with a binary code, which is either one or zero. That part is the random variable; it basically follows a Bernoulli distribution. For each element, the probability of being one is just the absolute value of that gradient over the maximum scalar here.>>What is k?>>k is the index of the gradients; we have k gradients.>>The coordinate.>>Yes, the coordinate, the index. g_t is the whole gradient vector, and it is element-wise multiplied by a vector; k indexes that vector, which is the binary code, as here. Let me walk through an example.
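The ternarization just described can be sketched as follows. This is a hedged illustration, not the paper’s code; the gradient values are made up, and `ternarize` is a hypothetical helper.

```python
# Stochastic ternarization: each element becomes s_t * sign(g) * b, where
# s_t = max|g| and b is Bernoulli with P(b=1) = |g| / s_t, so the
# expectation of the quantized gradient equals the original gradient.
import numpy as np

def ternarize(g, rng):
    s = np.abs(g).max()              # scalar s_t
    p = np.abs(g) / s                # per-element Bernoulli probability
    b = rng.random(g.shape) < p      # binary code b_t
    return s * np.sign(g) * b        # values in {-s_t, 0, +s_t}

rng = np.random.default_rng(0)
g = np.array([0.3, -1.2, 0.6, -0.9])
# Averaging many independent ternarizations recovers g (unbiasedness).
est = np.mean([ternarize(g, rng) for _ in range(20000)], axis=0)
```

Each output needs only two bits per element plus the one shared scalar, which is where the communication saving comes from.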

Say our floating-point gradient has these values, and the maximum absolute value is 1.2. We take the sign of all the gradients, and then we form the Bernoulli distribution with the probability of each element being one: for this element, the probability of being one is just 0.3 over 1.2. It’s simple. Then we draw from the Bernoulli distribution with these probabilities, get a sample of the binary code, and multiply all the parts together. If we do it this way — here is a little bit of math — the expectation of the quantized gradient is just the expectation of the original one. So it is unbiased: we keep the expectation, though we do increase the variance a little; I’ll go into more detail later about how we reduce the variance. We then only need two bits per element plus one single floating-point value, which significantly reduces the communication volume.>>Can you explain the [inaudible]? For example, why do we need b_t?>>The reason we need b_t

is that we want to draw from a Bernoulli distribution. Each value is eventually just one or zero, so it’s binary, and overall there are only three possible values here, so we can encode it in very low precision.>>Assuming they are zero, one, and minus one.>>Zero, minus s_t, and s_t; there is a small scalar here.>>Excuse me, how did you get to the last line? From the bottom — the final line — how did you get from this to that?>>This here?>>Yes.>>Here, we take the expectation over the two sources of randomness: the expectation over z, and then the expectation over b given z.>>Yeah, that’s it.>>This term has no dependence on z, so it drops out.>>Yes.>>Then this one is just b, so we get it there.>>But how is the expected value of b equal to the expected value of g? I don’t follow; I think I’m missing something.>>This one?>>Yes.>>Exactly, I just want to see your simplification here. How do you simplify this?>>The probability. The [inaudible] probability.>>Yes.>>Just there: the probability is just g_tk over s_t.>>Yes.>>I didn’t see how you got g_t->>Yes.>>The original g_tk back.>>Yes.>>It’s basically a random variable. I won’t go into the details, but it’s just the product: this part is just the data, so we only care about b_t, and the expectation of b_t is just its probability — which is this one — and multiplying this one by this one reduces to that one. We proved the convergence of TernGrad. This is the basic assumption used to prove the convergence of standard SGD — it’s standard, not ours. To prove the convergence of TernGrad, we do need a somewhat stronger assumption on the gradient bound: in standard SGD the L2 norm should be bounded, but now we require the product of the max norm and the L1 norm to be bounded. This value is always larger than that value, so TernGrad does need a somewhat stronger bound. But we propose some tricks to bring those two bounds closer to each other. One is layer-wise ternarization: we ternarize layer by layer, because different layers have different gradient distributions. We also do gradient clipping to limit the range of the gradients. The details are in the paper.
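Those two tricks can be sketched roughly as below. This is my own illustrative sketch: the clipping threshold of 2.5 standard deviations is an assumption for the demo, not necessarily the paper’s value.

```python
# Two variance-control tricks: clip gradients to shrink the gap between
# the max norm and the L2 norm, and ternarize each layer with its own
# scalar s_t, since gradient magnitudes differ a lot across layers.
import numpy as np

def clip(g, c=2.5):
    # Assumed threshold: c standard deviations of this layer's gradient.
    t = c * g.std()
    return np.clip(g, -t, t)

def ternarize(g, rng):
    s = np.abs(g).max()
    b = rng.random(g.shape) < np.abs(g) / s
    return s * np.sign(g) * b

def terngrad_layerwise(layer_grads, rng):
    # Each layer gets its own scalar instead of one global s_t.
    return [ternarize(clip(g), rng) for g in layer_grads]

rng = np.random.default_rng(0)
layers = [0.01 * rng.normal(size=100), rng.normal(size=50)]
tern = terngrad_layerwise(layers, rng)
```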

For evaluation, we evaluated AlexNet on ImageNet. On AlexNet there is low accuracy loss, even using only three discrete values in the gradients. In some cases, we even observe higher accuracy: when the batch size is very large, the variance is too small to learn well, and the variance from the quantization helps exploration. Here is the convergence curve compared with standard SGD; the convergence is also the same. We also evaluated GoogLeNet, where we observe some small loss, but on average it is two percent. One thing I want to emphasize is that we didn’t do any hyperparameter tuning: all the hyperparameters are taken from standard SGD. We use the same learning rate, the same batch size, the same total epochs — but we could get better accuracy if we tuned them.>>Have you ever tried this: in the beginning you use your TernGrad to train, but at the end — for example, after you have decreased the learning rate twice, for that last small stretch — you return to full-precision gradients. Does it recover the original accuracy, or does it still get the better accuracy?>>We didn’t try exactly that, but we tried something similar: for the first epochs we used TernGrad, and for the last epoch we used full precision. It is similar. It’s the same.>>So you mean its accuracy is similar to TernGrad’s, not similar to->>Right.>>The original SGD?>>Right. I didn’t try the one you said; probably that won’t help.>>Okay.>>So that’s all about convergence, and we can

reduce the communication. In practice, how much faster does it make training? Here is the speed over the number of GPUs in the distributed system; the solid bars are standard floating-point SGD, and the shaded bars are ours. We can always get a speedup. In general, TernGrad gives a higher speedup when the communication time occupies a higher ratio of the total — that’s obvious, because we reduce the communication. So it gives a higher speedup if we use more machines, since more machines mean a higher communication ratio; or if the communication bandwidth is very low, so we get more benefit on a low-end network like Ethernet versus InfiniBand; or if we train a neural network that has more parameters relative to its computation. Personally, I think it will give more benefit on NLP problems than on CNNs, because NLP models have more parameters relative to computation. It can also give more benefit in GPU-based distributed systems, because GPUs compute faster, so communication is more of a bottleneck. TernGrad is in production: it was adopted by Facebook AI Infra to reduce the communication bottleneck there, and it was evaluated on the ads-ranking model, which has zero tolerance for accuracy loss — they cannot tolerate any accuracy loss, because otherwise that means a lot of money. It is also available in PyTorch. This finishes my first part, on making training faster. I can take one question if you have one.>>I assume that you tested the [inaudible].>>Yes, we tested both momentum and Adam.>>What’s the result?>>It’s similar.>>Similar.>>Yes.>>So can you give an intuition why you can compress the gradient so much and the final accuracy is still roughly the same?>>Because of the variance: we have higher variance, but we keep the expectation. We keep the expectation while only increasing the variance. Yeah.>>Okay.>>So let me move to the second part of my research,

on making inference faster: inference acceleration. I did a bit more research along this line. I even tried to map neural networks onto a chip that only supports spiking neural networks, and I tried clustering the neurons of sparse neural networks to reduce wiring congestion in circuit design, to be more efficient. But I won’t go into too much detail; I will focus on two pieces of research here on sparse deep neural networks. By the way, one of these works was published when I was an intern here; my mentor was [inaudible]. When people talk about sparse neural networks, they usually refer to a neural network in which a lot of connections have been removed. This can significantly reduce the storage size of a given neural network, and if we can customize the hardware for the specific neural network, we can get a good speedup. However, when we measure the speed on a general platform — CPU or GPU — the story is totally different. Here is a sparse model with 95 percent sparsity, but when we measure the speed on the CPU or the GPU, the speedup is very limited: in many cases the speedup is 1, meaning the speed is unchanged, and in some cases it’s even worse. Why? Because the sparsity is non-structured. The non-zeros are randomly distributed, but the hardware is customized for regular computation, and the non-structured pattern breaks the regularity that hardware parallelism [inaudible]. You get very poor data locality, so you get a very trivial speedup. It’s better on a CPU platform: this is the speedup of a random sparse neural network over sparsity, but it’s not as good as the theoretical speedup. For example, here the sparsity is 90 percent but the speedup is only two times, because of the same issue I mentioned on the previous slide. So, to make it more efficient, we think we should use structured sparsity instead of random, non-structured sparsity. Here is an example showing the advantage of structured sparsity versus the non-structured kind. When I say structured sparsity, I mean that many rows and columns are entirely zero, so we can just remove those zero rows and columns and compress the weight matrix into a small dense one. Because it is dense and small, it can be computed much faster.
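The compression step just described can be sketched concretely. This is an illustrative toy; the helper names and the example matrix are my own assumptions.

```python
# Why structured sparsity pays off: if whole rows and columns of a weight
# matrix are zero, we can drop them and multiply with a small dense block
# instead of a large sparse matrix.
import numpy as np

def compress(W):
    rows = np.any(W != 0, axis=1)        # keep rows with any non-zero
    cols = np.any(W != 0, axis=0)        # keep columns with any non-zero
    return W[np.ix_(rows, cols)], rows, cols

def matmul_compressed(x, small, rows, cols):
    # y = W @ x computed with the packed dense block; zero rows of W
    # simply contribute zero outputs.
    y = np.zeros(rows.shape[0])
    y[rows] = small @ x[cols]
    return y

W = np.zeros((6, 8))
W[[1, 4], :3] = 1.0                      # only 2 rows and 3 columns survive
small, rows, cols = compress(W)
```

The packed 2x3 dense multiply replaces the 6x8 sparse one while producing the same output, which is why the measured speedup follows the structured sparsity.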

So what is a structurally sparse deep neural network? It means connections or weights are removed group by group, not one by one. In terms of the neural architecture, it means we remove one dense structure: one neuron, or one layer, or one filter in a convolutional layer, or one hidden state in a recurrent neural network. From the perspective of the weight matrix, a structurally sparse deep neural network means we remove weights block by block: one group can be a rectangular block, one row, one column, or even the whole matrix. So it’s quite flexible. How can we achieve it? How can we learn structured sparsity? It’s simple again: group lasso regularization is all you need. Group lasso regularization was proposed earlier and is very effective for learning structured sparsity. How does it work? We first split the weights into groups — here is an example where we split them into two groups. Then we add a group lasso regularization term on each group, which is basically the vector length of the group. Here is the group lasso on those two groups. Then we sum all the group lasso terms into one single regularizer and add it to our data loss function, such as cross-entropy. We then learn everything end-to-end using stochastic gradient descent.>>What’s your criterion to split

the weights into groups?>>Good question. It depends on what structure you want to learn.>>So this is structure-dependent [inaudible].>>Yes, it depends on what structure you want to learn. Say you want to remove filters in a convolutional neural network; then one group is all the weights in one filter. If we want to remove one row of the weight matrix, then one group is all the weights in one row. So it depends on what structure you want to learn. We refer to our method as SSL; I’ll use that name a lot. There are more rigorous proofs of why group lasso regularization can learn structured sparsity, but let me explain it intuitively. Here is a group of weights, viewed as a vector. When we do gradient descent, it is updated by the regular gradient, which comes from the data loss, plus one additional gradient. This additional gradient is basically a unit vector going against the direction of the weight vector. During training, it iteratively squeezes the length of the vector, and eventually, if it can, the whole group goes to zero, removing all of its weights. Many groups can be pushed to zeros, and then we have learned a structured sparsity pattern.>>Sorry, just a quick one: you said many groups will be pushed to zero. Is the whole group pushed to zero?>>The whole group.>>The whole group.>>Many whole groups, yes.
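The regularizer and the extra gradient just described can be sketched like this. It is an illustrative toy, with `lam` standing in for the regularization strength and one group per row of the weight matrix.

```python
# Group lasso: penalize the sum of the groups' L2 norms. Its gradient for
# a non-zero group is lam * w / ||w||, a fixed-length vector that the SGD
# update subtracts, steadily squeezing the whole group toward zero.
import numpy as np

def group_lasso(groups, lam):
    # sum over groups of lam * ||w_g||_2, added to the data loss
    return lam * sum(np.linalg.norm(g) for g in groups)

def group_lasso_grad(w, lam, eps=1e-12):
    # eps avoids division by zero once a group has collapsed
    return lam * w / (np.linalg.norm(w) + eps)

rows = [np.array([3.0, 4.0]), np.array([0.0, 0.0])]  # one group per row
penalty = group_lasso(rows, lam=0.1)                 # 0.1 * (5 + 0) = 0.5
g = group_lasso_grad(rows[0], lam=0.1)               # 0.1 * [0.6, 0.8]
```

Note that the extra gradient has the same length (lam) no matter how small the group gets, which is why it can drive entire groups exactly to zero rather than merely shrinking them.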

So here is a comparison on AlexNet — don’t laugh at me, AlexNet. It was the state of the art, and it is fair for a comparison between non-structured sparsity and structured sparsity. When I say non-structured sparsity, I mean this random sparsity; by structured sparsity, I mean removing rows and columns. There is a lot of information here, so let’s break it down. The bars are the speedups on CPU and GPU, and the lines are the sparsity across the five layers; the orange color corresponds to non-structured sparsity, and the green one corresponds to our approach. You can see that the green bars are taller than the orange ones, which means we deliver a better practical speedup.>>Can you explain why the speedups of the layers are different?>>Because they reach different sparsity.>>Is this related to [inaudible]>>Yes. In general, the shallow layers are less redundant, because they usually have fewer filters, while the deeper layers have more filters and their features are also sparser, so you can achieve a higher sparsity in the deeper layers.>>Do you think this sparsity can guide the design of the network?>>Yes; it would suggest that we don’t need that many filters in the deeper layers in this case.>>Excuse me, what are the numbers along the x-axis — the layers?>>This number?>>Yes.>>The index of the convolutional layers.>>Okay.>>Conv one, conv two, and so on.>>So what’s the meaning of the column line and the row line?>>This one?>>No, in the figure you have four lines. Which is which?>>Okay, this one is the sparsity in terms of how many weights are removed.>>Okay.>>This one is how many rows are removed, and this one is how many columns.>>Do you lose any accuracy by doing that?>>The loss is two percent, and the accuracy is the same for both: they both lose two percent.>>Two percent for both?>>Yes, both.>>For the baseline, did you use the original paper’s [inaudible] and then retrain?>>Not exactly; we used L1 regularization ourselves, and the sparsity is higher than in the original paper. So that was AlexNet. As I said, what

structure we want to learn depends on how we split the groups. Here, if we want to remove layers, then one group is all the weights in one layer; if we remove one group, we remove all of those weights, so one layer can be removed.>>Does this mean that you can remove one layer of the whole network?>>Yes, that means we can reduce the depth.>>Okay. Does it mean the features in both layers interact as an important [inaudible]?>>Yes.>>Okay.>>Then the information can go through the shortcuts. Here is an experiment on [inaudible]: on ResNet-32 we can reduce the number of layers by 14, but we get similar accuracy. There is a lot of redundancy in deep neural networks. Again, we can generalize this to LSTMs: what structure we want to learn depends on how we split the weights. In this case, we can reduce the hidden size. All the white strips here are the structures associated with one hidden state; if we want to remove one hidden state, we have to remove all the structures in the white strips. Here is a sophisticated formula, but let me summarize it: to remove one hidden state, we have to remove two rows and four columns in the weight matrices of the LSTM. So we group all of those weights into one group, and by removing many groups, we can reduce the hidden size.
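Collecting every weight tied to one hidden state into a single group can be sketched as below. The gate-column layout is my own illustrative assumption for a basic LSTM, not the exact indexing from the talk’s slides.

```python
# One group per hidden state k: the four gate columns that produce unit k
# (one per gate), plus the recurrent row fed by unit k's output. Zeroing
# this whole group lets us drop unit k and shrink the hidden size.
import numpy as np

def hidden_state_group(W_x, W_h, k, K):
    # Assumed layout: W_x is (input_dim, 4K), W_h is (K, 4K), and gate g
    # for hidden unit k is column g*K + k.
    gate_cols = [g * K + k for g in range(4)]
    return np.concatenate([
        W_x[:, gate_cols].ravel(),   # columns producing unit k's gates
        W_h[:, gate_cols].ravel(),
        W_h[k, :].ravel(),           # row fed by unit k's output
    ])

K, D = 5, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(D, 4 * K))
W_h = rng.normal(size=(K, 4 * K))
group = hidden_state_group(W_x, W_h, k=2, K=K)
```

Group lasso applied over such groups is what lets the regularizer remove whole hidden states rather than scattered weights.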

>>Can I ask the same question [inaudible]? Different models use different definitions of the groups.>>Yes.>>So are there any heuristics for the different groups?>>It depends on what structure you want to remove. In this case, we have to predefine it: we say we want to remove hidden states, and then we find all the weights associated with one hidden state.>>The most [inaudible] total which structure we ought to [inaudible]>>In that case, I would just split the weights into many small blocks — say, eight by eight — and let it learn; then it will learn which structures are important.>>Well put. Have you ever tried different heuristics — say, grouping the parameters this way and that way and comparing them — first of all on ResNet, since your original goal there was to reduce the layers? Here you assume from the start: okay, I want to remove the hidden structures.>>I would say we can put the regularization for all the kinds of structure we want to learn into one single loss and let it learn, but that would have more hyperparameters.>>So in this case, if you train them at the beginning, you do not know the structure. You specify, maybe, the hidden size — the hidden dimension — to be one thousand. You add this [inaudible] to learn the sparse structure, and then you find: okay, I can remove 200. So finally it’s 800 — 800 hidden neurons. Will you retrain the model with these 800 neurons again, or will you just use the previous one?>>In CNNs, fine-tuning helps, but in RNNs we find it doesn’t help a lot — the results are quite similar — so we didn’t do that.>>Also, for the CNN just now, I missed one part. What do we mean by remove

row and remove column? Because a filter is more like width times height times channels, I’m not really understanding the meaning of a row and the meaning of a column.>>That depends on the lower-level implementation. In general, in Caffe, one row is basically one 3D filter. In that case, we are trying to squeeze the size of the matrix, viewed from the perspective of the computation. So one row is a filter.>>So can I understand one row as a channel?>>One row is a filter; one column is more sophisticated. But that’s from the, yeah.>>When you remove a row, it’s like removing one channel out of this hidden state, right?>>Oh, you’re going back to the LSTM?>>Yeah, I’m talking about the LSTM in this case, like the white row and then the [inaudible]>>So all of them will be removed.>>Yes, I know. So when you remove that row in that case, it looks like a column and a row in each of these matrices. It’s like, instead of 1,000 dimensions for the hidden h value, it makes it 999, right?>>Yes.>>That’s the case?>>Yes.>>So would training an LSTM with a size of 999 give the same result?>>I’ll get there.>>Or is there something specific about that?>>I’ll get there, yes. I’m sorry — this one is better, but I’ll get there.>>No, I think I understand the idea. What you’re going to get to is that you can’t target the small dimension right off the bat; instead, you do this so the [inaudible] has the ability to actually find the right structure, or the maximum sparsity. Okay. So is it a correct understanding to say that you are actually searching for the right dimensions of the model — that you have a method to find the right size for the model?>>That’s one benefit.>>If you use sparsity in this case and I just want to find the right hidden dimension, this is the right way to do it.>>Yes — and the speed is faster.>>Because in this case, this kind of facility is very simple: you just say, okay, I first use a very large hidden size, and once I’ve solved that, I find that, okay, 980 is a good number. So I think the question is very much whether to fine-tune it, or to first pick the discovered sparsity number, 980, and then train from the beginning.>>Yes, I’ll get there. Let’s go

to this slide and answer your question. We first used this baseline, which originally has 1,500 units in its hidden size. Using our approach, we keep the original perplexity but reduce the size, so we get a significant speedup. And to echo your question — can we train a smaller model and get the same performance? — we did the experiment here: we trained an LSTM with the same, smaller hidden size from scratch, but its perplexity is much worse than pruning down.>>Yeah, okay. The reason I’m bringing this up is that there is another method to reduce the size that has nothing to do with sparsity — nothing to do with removing rows or columns: doing singular value decomposition after training a big matrix. The idea is that if you train the small model from the beginning, you will never find the same result as if you train the full big model and then do singular value decomposition.>>Exactly.>>And then do some fine-tuning after that.>>Yes, a similar thing.>>So it’s along the same line.>>Yes, it’s similar here. Basically, a larger model gives you more exploration.>>Exactly.>>So you find a better one.>>That’s exactly what I wanted to get to.>>Yes, agreed. Okay. So here is the structure we learned. It’s very regular: we remove a lot of rows and columns, and for inference we just use this small model.>>So after you find the structured sparsity, can you pack it into a regular matrix?>>Yes.>>Pack it back?>>Because it’s regular, yes. That’s the benefit of structured sparsity. We basically create a smaller LSTM, use the non-zero weights to initialize it, and get the same performance. So in general, for the LSTM we can reduce the size while maintaining the same perplexity, or we can make a trade-off: if we reduce the sparsity regularization, we get a slightly larger model but can reduce the perplexity, and it’s still better than training a smaller model from scratch.>>Why do you get a better test perplexity while also reducing the parameters?>>You mean why do we?>>No — similar. The second one, the green line, right? It has fewer parameters than the last one but still a better perplexity.>>Yes.>>I’m pretty sure if you use these parameters, you can get a better test perplexity.>>Yeah, this is test perplexity.>>Why does it get better?>>This is the benefit of SSL.>>Okay.>>Yeah. So basically, we can get a smaller model which

has a better performance.>>[inaudible] exactly — did you change any hyperparameter?>>I think the only hyperparameter that changed is the dropout ratio: because we add regularization from the sparsity regularizer, we don’t need as much regularization from dropout. That’s the only one.>>So you change the hyperparameters and you can get a better perplexity — is that how it works? I mean the green line: how did you achieve that green line exactly?>>This one?>>Yeah.>>We use SSL and then we change the lambda of the regularization from the->>So you change the lambda. Okay.>>Yeah.>>Is this PTB?>>Yes.>>That’s the dataset.>>Yes.>>Because it all depends on how you do the regularization. It can only be [inaudible]>>I’m not sure about that conclusion.>>Okay. If you look at the training [inaudible], are they similar?>>Yes.>>Even for the last one, where you train a smaller model?>>For this one?>>No, for the last.>>This one?>>Yeah.>>I can’t exactly remember the numbers, so pardon me; we care more about test.>>So what’s the exact meaning of those two pictures in the last slide?>>They show the structured pattern we learn: the white regions are zeros, and the blue dots are non-zeros.>>And what’s the meaning of LSTM 1 and LSTM 2? Do you [inaudible]>>Those are the two layers: layer one and layer two. So here, SSL gets a better trade-off between performance and model size. We also did experiments on the Recurrent Highway Network. We start from this baseline and then use SSL to reduce the size; then we either reduce the perplexity, or we end up with a smaller model; or we can keep the perplexity but with a much smaller model. From this trend we can see that if we start from a large model and prune it down, we get benefits — we can make a better trade-off. So the implication is that we should start from a redundant model and sparsify it. That is how SSL works for a very large model, but we were also curious how it works for a very small, compact model. So we did experiments on the BiDAF model, which is a small model with only 2.7 million parameters and a hidden size of only 100. In that case, because the model is originally very compact, we cannot keep the performance while reducing the size, but we can still make a good trade-off with SSL: for example, we can reduce the size to less than one million parameters while dropping only two percent of the F1 score.>>Which task is this?>>It’s question answering, on SQuAD.>>Okay.>>So this concludes

my previous research. To summarize: we did research to make distributed training

more scalable using stochastic quantization of the

gradients, and it is effective, as evaluated

in Facebook’s AI production systems. So we can shorten the

research cycle and the production cycle, because we can

make models train faster. With sparse neural networks, we can enable more ubiquitous

AI on edge devices like cell phones, self-driving cars, or VR devices, where the

computation budget is very limited. We also observe performance

gains if we start from a very large model and then

prune it down, sparsify it down.>>So before you continue

to the future directions. On the first one, you said that you believe TernGrad works

because you preserve the mean; actually you had more variance.>>More variance.>>More variance. No bias

but more variance, and that extra variance helped

you in some situations. What is your understanding

of why sparsity works? I mean, how would you explain why this method works better than, for example, any other method

out there for exploring the correct size of the

model? Why does it work?>>Why does it work?
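For reference, the mechanism behind SSL is a group Lasso penalty applied to structured groups of weights. Here is a minimal NumPy sketch, with an illustrative row grouping rather than the exact groups used in the talk:

```python
import numpy as np

def group_lasso_penalty(W, axis=1):
    """Group Lasso: sum of the L2 norms of weight groups (here, rows).

    Unlike plain L1, minimizing this drives entire groups to zero at
    once, which yields hardware-friendly structured sparsity."""
    return float(np.sum(np.sqrt(np.sum(W * W, axis=axis))))

def removable_groups(W, axis=1, tol=1e-8):
    """Indices of groups whose norm is (near) zero: structures that
    can be pruned from the network without changing its output."""
    norms = np.sqrt(np.sum(W * W, axis=axis))
    return np.where(norms < tol)[0]

# Toy example: rows 1 and 3 have been regularized to exactly zero,
# so those two output neurons can be removed from the architecture.
W = np.array([[0.5, -0.2, 0.1],
              [0.0,  0.0, 0.0],
              [0.3,  0.4, -0.1],
              [0.0,  0.0, 0.0]])
print(group_lasso_penalty(W))   # sum of the two non-zero row norms
print(removable_groups(W))      # indices of the all-zero rows
```

Training minimizes the task loss plus this penalty, so whole rows (neurons, filters, and so on, depending on the grouping) are driven to exactly zero and can then be removed.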

First, the reason is that deep models need to be redundant, so we can prune them. The second reason is that, compared

with non-structured sparsity, a structured pattern is more efficient.>>I want you to specifically tell

me if you have an understanding of why going and searching in a

bigger space and then pruning it down is better than just

going directly [inaudible]? What I’m trying to say

is that maybe we just don’t have a good tool to

directly find the right model. Or do you have an understanding that searching in a bigger space is much better, and that pruning

down gives us no loss?>>Exactly.>>I just want to see

your understanding.>>So my understanding is, if

you start from a larger model, you have more directions to explore. Then the pruning process will find which direction is

the right direction. So you have a larger space, which means you have more exploration in the model parameter space. Then sparsification is the process

of trying to find the good one.>>Does the same concept of high-variance gradients

apply here too? Like the fact that you don’t have

some of these rows or columns, does it mean the gradients are

zero and therefore you have higher variance in the

gradients?>>From the perspective of

exploration, they’re similar.>>That’s what I’m

trying to get at. I mean, can I get the same message from the first one and

the second one too?>>Yes. The first one is

exploration in one space.>>In the precision of the gradients, and the second one is the

variance inside the gradient.>>The second one is in the

model space, yes. So let’s go to the future. How can we go beyond my research, and what do I want to do in the future? And how is my research

related to recent research? So one related work is the lottery ticket hypothesis, which was actually a best

paper this year. The lottery ticket hypothesis says that, for a randomly initialized,

dense feed-forward neural network, there is a sub-network, which

is a sparse neural network, also referred

to as a winning ticket. If you train this sub-network

from scratch, you can reach similar accuracy within a similar

number of iterations. They only hypothesize its existence; we don’t know how to find it. If you just pick

a random sparse neural network, the accuracy will be very bad. So in our recent research, we found that SSL can possibly identify these sub-networks at an early stage of

the training process. In our experiments we find

that during the SSL learning process, SSL reaches very high sparsity at a very early stage

of the training process, and those zeroed structures

never come back. If so, that means we can just

remove those useless structures, and we gradually remove those structures one by one and

then finally converge to a smaller neural network, which probably is

the winning ticket. This gives us

about 40 percent training time reduction

for ResNet on ImageNet. This paper was a Best Student Paper finalist at

Supercomputing this year.>>From my memory, there is a big difference between your SSL and this lottery

ticket hypothesis. Because in that hypothesis, they find the sparse structure, treat that sparsity as an oracle, and then

train that sparse network, and the accuracy is recovered. But in your case, what you’re arguing is that if you are given that sparsity oracle and

just train from the beginning, you cannot reach the

same level of accuracy, because you didn’t

use the exploration power of first

exploring a very large, dense space and then

gradually going down to this sparse pattern.>>No.>>Actually. Sorry.>>Go ahead.>>The lottery ticket story is that you can recover good

accuracy in the smaller network if you start with the

same initialization as you had before pruning. This is a way of preserving the residue of the initialization

after pruning as well.>>So does that mean that in

your summarization slides, in that last point, you

just said that it’s better to start from

a dense, large network and then use SSL to find

the sparse network, and that gives you high accuracy. So you showed that if you

are directly given the sparse structure and

train it from the beginning, you cannot reach the same accuracy. But have you ever tried, just like the lottery ticket

hypothesis, to remember your initialization and then use the same random initialization

to train this sparse model, to exactly reproduce

[inaudible]?>>Yes. So that means we can use SSL to identify

the winning tickets. That’s the relationship here. In the paper, they just do the exploration on

non-structured sparsity patterns. So we’re thinking SSL

can probably identify the winning tickets.>>Okay. Can it also identify

structured winning tickets?>>Yes. Exactly.>>But have you ever rerun it, remembering the random initialization of the [inaudible] with the

sparsity pattern again?>>Not yet, it’s an open question. But I think we will try;

that’s future work.>>[inaudible]. Because there is no

way you can get around that. I think the experiment

will show you get the same results.>>So I want to understand

what the connection is between SSL and the

lottery ticket hypothesis, whether they are the same

thing or they are different.>>Yes.>>That’s a very interesting

future research. So the second work related to

my previous research is AutoML: automated machine learning, using a machine to

design machine learning models. As the community made

a lot of progress on AutoML, they found that it’s much more

efficient to do it this way: they design a very

large model which has all the operation options enabled

in a single one-shot model. They coined it the one-shot

model because it has all the options enabled. The goal is to pick

out the optimal path. Experiments show this method

is very efficient for AutoML. SSL is basically trying

to remove structures, so a second open question is: can we use SSL for AutoML? So another research direction inspired by my previous research

is on scaling up NLP. Going back to my statement at the very beginning, the trend of

deep learning is that if we can train a larger model, we can always get

better performance. This is even more true in NLP

for unsupervised learning. Here is the ablation

study for BERT: if they can train a larger model,

they get better performance. But personally, I think NLP is

entering the era of very large models, and it’s very hard

to further scale up because of the computation

and memory costs. So we need more scalable

training methods and more scalable base

models to scale up. One direction we can go is to design more scalable

training methods. Basically that means, given

an architecture, can we make the training faster? We could use TernGrad, SSL, but what more?>>SSL does not make training faster.>>It can.>>You just mentioned one [inaudible]>>But you still have to have

the entire architecture to reduce, unless you

reduce it during training.>>We just save the checkpoint

and create a new, smaller one.>>Okay. But if you want to

truly search over the original model and find

that, you still have to keep the original size as the oracle.>>But our experiments show zeroed

structures rarely come back. So we just remove them and

train a smaller model gradually.>>Okay.>>Yeah. So the second direction is to

design more scalable neural models. We should design fast, accurate, and compact base models and scale them up. I also believe that AutoML will play a very important

role in designing compact, small NLP models. This brings me to another future research direction I

want to pursue on AutoML. This figure also echoes

my previous statement. These are computer vision models. In recent years, people generally

design more scalable models: smaller models with higher accuracy, which are then

scaled up. And these models were designed by AutoML, not by humans. So I believe AutoML will

be very important in designing compact NLP

models in the future. But there are two problems in AutoML. The first problem is that AutoML has to try many models,

so it’s very inefficient. The second problem

is that AutoML relies on human-designed operations and

only searches over their combinations. It cannot invent any

new architectures. Those are the problems of AutoML. In the past half year, I did some research on the first problem, to make

the search more sample-efficient. The x-axis is the

number of networks we have to train to reach

some test accuracy. This one is a random method, this one is the state-of-the-art regularized evolution, and my

recent research gets here. So now we can find the best model

within hundreds of samples; we only have to train

hundreds of networks to find the optimum, and that will be made

public next month. With that, I conclude my talk,

and I’m open to questions.>>A question about TernGrad. So apparently you transmit

two bits for each gradient component, but you only

use three [levels].>>Yes, we waste a bit.>>So have you found a better or more efficient way

to use the four bits fully to make it more accurate,

or anything like that?>>Yes. If we use two

bits in real production, we definitely should use four levels; we don’t want to waste it. But in research, we are more interested in how aggressively we can

reduce the precision. So in that paper, we

use three levels. But in production,

as I mentioned, they used eight bits. So that’s the trade-off.>>Okay. I’ll ask one. So let’s say I want to

apply SSL to transformers, and I would like to hear

your suggestions. I understand that SSL has the flexibility that you

can design the groups, just as you did

for the LSTM. Let’s say I want to design some

groups for transformers. My goals can be: one, I want to get some

interpretation. Sorry.>>Interpretability?>>Yes, and the second one is that I want to reduce the computation

cost. I know that these could

be different goals. Can you say something about how I can design groups so that

I can achieve these goals?>>Simultaneously or?>>No, not simultaneously. Simultaneously, it would be hard.>>So, to reduce the size of the model: let’s say we can use it to reduce the hidden sizes in, say, BERT, or we can reduce

the number of heads. So that’s basically how

we split the groups. For the first on the

interpretability, so usually, we say simpler

model is more interpretable. So if we use SSL, we reduce the size of the model then that will be more interpretable.>>For transformers, one interesting thing is

the attention mechanism. If I can apply it perfectly

to transformers, let’s say, if I can guess something

like for a given token, this token attends only a few part of the other tokens in a sentence. This could be something interesting

for people to interpret.>>Then, I think we can use

a mask parameter to mask their attention and then

regularize sparsity. If the corresponding

weights in the max is zero, that means that attention is useless.>>I suggest that we
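The learnable attention mask just described could look roughly like this. This is a hypothetical NumPy sketch; the log-gate trick and all names are illustrative choices, not the speaker’s method:

```python
import numpy as np

def masked_attention(scores, mask_logits):
    """Attention with a learnable mask: each score gets a trainable
    sigmoid gate; a gate near 0 sends the score toward -inf, so that
    attention weight is driven to ~0 (that connection is 'useless')."""
    gate = 1.0 / (1.0 + np.exp(-mask_logits))      # gate in (0, 1)
    masked = scores + np.log(gate + 1e-12)         # log-gate masking
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)       # row-wise softmax

def sparsity_penalty(mask_logits, lam=0.01):
    """L1 penalty on the gates, added to the task loss so that
    training prunes attention connections that do not help."""
    gate = 1.0 / (1.0 + np.exp(-mask_logits))
    return lam * float(np.sum(gate))

# Toy example: one query token over four tokens; the last two gates
# have been pushed far negative, i.e. those attention links are pruned.
scores = np.array([[2.0, 1.0, 0.5, -1.0]])
logits = np.array([[5.0, 5.0, -12.0, -12.0]])
attn = masked_attention(scores, logits)
print(attn)  # almost all attention mass on the first two tokens
```

Near-zero gates would then be read off as the interpretable "this token does not attend there" pattern discussed above.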

can learn the mask?>>Yes.>>I think in the current

implementations, the attention mask is predefined.>>I mean, not that mask. I mean a mask over

the whole attention; that’s a parameter we can learn.>>Thank you.>>I think this is

similar to models like BERT that are pretrained [inaudible]. Basically, you train a model on a

pretraining task. If you apply this during pretraining, some components

should be easier to compress. Because, if you notice, sometimes some layers

are just not important. You want to compress your [inaudible] module into the

general module, [inaudible]. How do you compare using [inaudible]>>So for unsupervised learning, people usually use

language modeling.>>Yeah.>>So I have shown the

effectiveness of SSL on language modeling.

That’s one thing. Another thing is how we can separate the sparsification for

unsupervised learning and for a specific task. Given a specific task, we can sparsify the

parameters for that specific task.>>Going back to the

unsupervised learning model: basically, you still

test on the same tasks. You said after you compress the

language model, can you in fact fine-tune this language model for

other tasks [inaudible]?>>[inaudible].>>Okay.>>So a quick question. I’m still not familiar with

the AutoML literature. But, for example, one of the biggest things we’ve seen in terms of

shrinking models is that you

end up with parameter sharing. How would that fit in? So first of all, I guess

there are a couple of questions. One is, we probably don’t have the right architectures

in the first place. But related to that

is that we may not have the right set of actions to

put into your framework. So for example, can I tell

an AutoML framework: yeah, you should be searching over all possible sets of parameters that should be shared and

not actually duplicated? Or what are the kinds of

limits here, because we just don’t actually even know what the building blocks are that

we should be searching over.>>Yes. That’s the second

problem I mentioned in AutoML: it cannot invent any new architectures. I think we could do it, but the search space

would be very large, so it would take a very long time. For example, could we

learn a convolutional neural network from a

fully connected neural network? Theoretically, yes. But the search space

is just too large. So I think that would be

the trade-off.>>Did you have thoughts

on how we bridge that? In the sense that, is there a way to use

AutoML to basically say: hey human, please help

me out a little bit; I’m seeing some sparsity over here; maybe it’s nothing, maybe there’s something here. Then you could go in and you could do this in a

more iterative fashion. I think the thing that bothers

me about the formulation, both for structured

sparsity and for AutoML, is this notion that a priori we have already created the

right basic architectures, because I just don’t buy that

assumption in the first place. I think all of our models

are probably very wrong. So it’s great that we

can make them better. But then, is there a way to

extend the framework so that you have a more iterative process

of improving these things?>>Yes. So one suggestion is we

keep the AutoML framework, but we humans can propose

some new architectures; we add them into the search space, and AutoML will automatically decide where to

put the new architecture. In that sense, AutoML will help. But I think if AutoML and

human experts work together, we will get much more benefit.>>More questions? If not, let’s thank our speaker again.>>Thank you.
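As a reference for the TernGrad questions above, the stochastic ternarization of gradients can be sketched as follows. This is a simplified single-tensor NumPy illustration under the assumption of a global scaling factor; the actual method also involves layer-wise scaling and gradient clipping, and this is not the production implementation:

```python
import numpy as np

def ternarize(grad, rng):
    """Stochastically quantize a gradient to three levels {-s, 0, +s},
    where s = max|grad| (two bits per component, only three values used).

    Each component keeps its sign with probability |g_i| / s and becomes
    0 otherwise, so E[ternarize(g)] = g: no bias, but higher variance."""
    s = np.max(np.abs(grad))
    if s == 0.0:
        return np.zeros_like(grad)
    keep = rng.random(grad.shape) < np.abs(grad) / s
    return s * np.sign(grad) * keep

rng = np.random.default_rng(0)
g = np.array([0.8, -0.2, 0.05, -0.6])

# Unbiasedness check: averaging many independent ternarizations
# converges back to the original gradient.
avg = np.mean([ternarize(g, rng) for _ in range(20000)], axis=0)
print(avg)  # close to [0.8, -0.2, 0.05, -0.6]
```

Each worker then only needs to communicate the per-component sign pattern plus the single scalar s, which is where the bandwidth savings in distributed training come from.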
