Tesla AI Day 2022

science-fiction minus the fiction


01 Oct 2022

AI Day is a recruitment event aimed at engineers: Tesla shares its progress on the AI front to get people excited about joining.

Since 2020, Tesla has topped the list of the most attractive companies for US engineering students (ahead of SpaceX, Lockheed Martin, Google, Boeing, NASA, Apple, Microsoft and Amazon).

It also acts as PR for Tesla and provides a better understanding of the technical progress & roadmap for investors, clients, fans... and competitors!

It is a three-hour-long, quite technical presentation.

Even though a lot of it flies over my head, I find it fascinating to get a glimpse "under the hood" at how these innovative technologies get engineered and built.

What always amazes me with Tesla is how open they are with their engineering - from having open-sourced their patents years ago, to sharing their engineering work (typically one of the most closely guarded trade secrets at companies that build things) in granular detail.

Introducing: Optimus

This could be a defining moment in history.

"Tesla could make a meaningful contribution to AGI"

  • production planned at high volume (millions of units)
  • price target of <$20k
  • on "Elon time", Optimus will get to market in 3-5 years (so probably 5-8 years).

This would mean the same cost as one person's low-wage salary for a year, for a robot that will, over time, be much more productive.

An economy is roughly capita times production value per capita - what does an economy look like when there is no limitation on capita? 🤔 🤯
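Elon's "capita times productivity per capita" framing from the talk can be put into a toy calculation. All numbers here are invented purely for illustration - they are not Tesla figures:

```python
# Toy model of the "economy = capita x productivity per capita" framing
# from the talk. All numbers are illustrative assumptions, not Tesla figures.

def economic_output(capita: int, productivity_per_capita: float) -> float:
    """Total output = number of productive entities x output per entity."""
    return capita * productivity_per_capita

# A human-only economy: capita is bounded by the working population.
humans_only = economic_output(capita=1_000_000, productivity_per_capita=50_000)
print(humans_only)  # 50000000000

# Add robots: capita is no longer bounded by population, so output
# scales with however many robots can be manufactured.
with_robots = economic_output(capita=1_000_000 + 10_000_000,
                              productivity_per_capita=50_000)
print(with_robots / humans_only)  # 11.0
```

The point of the quote is that once capita is effectively unbounded, the left factor has no ceiling, so total output doesn't either.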

"The potential for Optimus is, I think, appreciated by very few people"

"Join Tesla and help make it a reality and bring it to fruition at scale, such that it can help millions of people.
The potential, like I said, really boggles the mind, because you have to ask: what is an economy?
An economy is sort of productive entities times their productivity - capita times productivity per capita. At the point at which there is not a limitation on capita, it's not clear what an economy even means. At that point an economy becomes quasi-infinite. This means a future of abundance, a future where there is no poverty, where you can have whatever you want in terms of products and services. It really is a fundamental transformation of civilization as we know it.
Obviously we want to make sure that transformation is a positive one, and safe. But that's also why I think Tesla as an entity doing this - being a single class of stock, publicly traded, owned by the public - is very important and should not be overlooked. I think this is essential, because then if the public doesn't like what Tesla is doing, the public can buy shares in Tesla and vote differently. This is a big deal. It's very important that I can't just do what I want - sometimes people think that, but it's not true. So it's very important that the corporate entity that makes this happen is something that the public can properly influence, and I think the Tesla structure is ideal for that."
Elon Musk

Self-driving cars have a potential for 10x economic output.

Optimus has potential for 100x economic output!

Test use cases showed:

  • moving boxes and objects on the factory floor ai-day-2022/221003-020227-tesla-ai-day-2022-0014- (same software as Tesla FSD)
    ai-day-2022/221003-020240-tesla-ai-day-2022-0015- (actual workstation in one of the Tesla factories)
  • bringing packages to office workers ai-day-2022/221003-020150-tesla-ai-day-2022-0013-
  • watering flowers ai-day-2022/221003-020113-tesla-ai-day-2022-0010-

Using semi-off-the-shelf actuators at the moment; a custom design is in the works.
Working on optimising the cost & scalability of the actuators.

Opposable thumbs: can operate tools.

"we've also designed it using the same discipline that we use in designing the car which is to design it for manufacturing such that it's possible to make the robot in high volume at low cost with high reliability"



  • what amazes me is the pace of innovation: from concept to working prototype in under a year (6-8 months, they said).


  • a weight similar to a human's means no extra weight restrictions/limitations in spaces built for people.

Latest generation


Orange are actuators, blue are electrical systems.

Cost and efficiency are the focus.

Part count and power consumption will be optimised/minimised.


  • the 2.3 kWh battery will be good for one full day of work
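As a back-of-envelope sanity check on that claim (the pack is 2.3 kWh per the transcript below; the 8-hour shift length is my own assumption, not Tesla's):

```python
# Rough average power budget implied by a 2.3 kWh pack lasting
# "about a full day's worth of work" (capacity figure from the presentation).
# The shift length is an assumption for illustration.

PACK_CAPACITY_WH = 2300  # 2.3 kWh
SHIFT_HOURS = 8          # assumed working-day length

average_power_w = PACK_CAPACITY_WH / SHIFT_HOURS
print(f"{average_power_w:.1f} W average draw")  # 287.5 W average draw
```

Under those assumptions the robot has to average under ~300 W, which is why the presentation stresses minimising idle power consumption.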


  • the bot's brain is in the torso, leveraging Tesla FSD hardware and software


The models used for crash-test simulations are extremely complex and accurate. The same models are used for Optimus.


"we're just bags of soggy jelly and bones thrown in" 😂







  • the red axis denotes the optimum
  • a "commonality study" to minimise the number of different actuators


  • a single actuator is able to lift a 500 kg piano.



  • biologically inspired design, because the world around us is built for human ergonomics. Adapt the robot to its environment, not vice versa, so the robot can interact with the world of humans, no matter what.

If you are interested in the technical details, I encourage you to watch the whole presentation - it's quite fascinating.


It was possible to get from last year's concept to a functioning version so quickly because of the years spent by the FSD team: a robot on legs vs a robot on wheels.
It uses the same "occupancy network" as the Tesla cars.



Full Self-Driving




FSD Beta could technically be made available worldwide by the end of the year.
The hurdle will be local regulatory approvals.

Metric to optimise against: how many miles in full autonomy between necessary interventions.
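That metric is straightforward to compute from drive logs. Here is a minimal sketch; the log structure below is hypothetical, invented purely for illustration (it is not Tesla's actual telemetry format):

```python
# Minimal sketch of the "miles of full autonomy between necessary
# interventions" metric. The Drive record is a hypothetical log format.

from dataclasses import dataclass

@dataclass
class Drive:
    miles: float        # miles driven in full autonomy
    interventions: int  # necessary driver interventions during the drive

def miles_per_intervention(drives: list[Drive]) -> float:
    """Fleet-level autonomous miles per necessary intervention."""
    total_miles = sum(d.miles for d in drives)
    total_interventions = sum(d.interventions for d in drives)
    if total_interventions == 0:
        return float("inf")  # no interventions observed yet
    return total_miles / total_interventions

drives = [Drive(120.0, 1), Drive(300.0, 2), Drive(80.0, 0)]
print(miles_per_intervention(drives))  # 500 miles / 3 interventions
```

Aggregating across the fleet before dividing (rather than averaging per-drive ratios) is the natural choice here, since drives with zero interventions would otherwise be undefined.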

DOJO: In-House Supercomputer

400,000 video instantiations per second!

Invented a "language of lanes" to handle the logic of complicated 3D representations of lanes and their interrelations.
The framework could be extrapolated to a "language of walking paths" for Optimus.





Full transcript generated with Whisper !ai/whisper:

Tesla AI Day 2022.m4a
transcribed: 2022-10-02 21:19 | english

1 0:14:00,000 --> 0:14:02,000 Oh

2 0:14:30,000 --> 0:14:37,200 All right, welcome everybody give everyone a moment to

3 0:14:39,840 --> 0:14:41,840 Get back in the audience and

4 0:14:43,760 --> 0:14:49,920 All right, great welcome to Tesla AI day 2022

5 0:14:49,920 --> 0:14:56,920 We've got some really exciting things to show you I think you'll be pretty impressed I do want to set some expectations with respect to our

6 0:15:08,680 --> 0:15:14,600 Optimist robot as as you know last year was just a person in a robot suit

7 0:15:14,600 --> 0:15:22,600 But we've knocked we've come a long way and it's I think we you know compared to that it's gonna be very impressive

8 0:15:23,720 --> 0:15:25,720 and

9 0:15:26,280 --> 0:15:28,040 We're gonna talk about

10 0:15:28,040 --> 0:15:32,720 The advancements in AI for full self-driving as well as how they apply to

11 0:15:33,240 --> 0:15:38,600 More generally to real-world AI problems like a humanoid robot and even going beyond that

12 0:15:38,600 --> 0:15:43,640 I think there's some potential that what we're doing here at Tesla could

13 0:15:44,480 --> 0:15:48,320 make a meaningful contribution to AGI and

14 0:15:49,400 --> 0:15:51,860 And I think actually Tesla's a good

15 0:15:52,680 --> 0:15:58,400 entity to do it from a governance standpoint because we're a publicly traded company with one class of

16 0:15:58,920 --> 0:16:04,580 Stock and that means that the public controls Tesla, and I think that's actually a good thing

17 0:16:04,580 --> 0:16:07,380 So if I go crazy you can fire me this is important

18 0:16:08,740 --> 0:16:10,740 Maybe I'm not crazy. All right

19 0:16:11,380 --> 0:16:13,380 so

20 0:16:14,340 --> 0:16:19,620 Yeah, so we're going to talk a lot about our progress in AI autopilot as well as progress in

21 0:16:20,340 --> 0:16:27,380 with Dojo and then we're gonna bring the team out and do a long Q&A so you can ask tough questions

22 0:16:29,140 --> 0:16:32,260 Whatever you'd like existential questions technical questions

23 0:16:32,260 --> 0:16:37,060 But we want to have as much time for Q&A as possible

24 0:16:37,540 --> 0:16:40,100 So, let's see with that

25 0:16:41,380 --> 0:16:42,340 Because

26 0:16:42,340 --> 0:16:47,700 Hey guys, I'm Milana work on autopilot and it is my book and I'm Lizzie

27 0:16:48,580 --> 0:16:51,780 Mechanical engineer on the project as well. Okay

28 0:16:53,060 --> 0:16:57,620 So should we should we bring out the bot before we do that we have one?

29 0:16:57,620 --> 0:17:02,660 One little bonus tip for the day. This is actually the first time we try this robot without any

30 0:17:03,300 --> 0:17:05,300 backup support cranes

31 0:17:05,540 --> 0:17:08,740 Mechanical mechanisms no cables nothing. Yeah

32 0:17:08,740 --> 0:17:28,020 I want to do it with you guys tonight. That is the first time. Let's see. You ready? Let's go

33 0:17:38,740 --> 0:17:40,980 So

34 0:18:08,820 --> 0:18:12,820 I think the bot got some moves here

35 0:18:24,740 --> 0:18:29,700 So this is essentially the same full self-driving computer that runs in your tesla cars by the way

36 0:18:29,700 --> 0:18:37,400 So this is literally the first time the robot has operated without a tether was on stage tonight

37 0:18:59,700 --> 0:19:01,700 So

38 0:19:14,580 --> 0:19:19,220 So the robot can actually do a lot more than we just showed you we just didn't want it to fall on its face

39 0:19:20,500 --> 0:19:25,860 So we'll we'll show you some videos now of the robot doing a bunch of other things

40 0:19:25,860 --> 0:19:32,420 Um, yeah, which are less risky. Yeah, we should close the screen guys

41 0:19:34,420 --> 0:19:36,420 Yeah

42 0:19:40,900 --> 0:19:46,660 Yeah, we wanted to show a little bit more what we've done over the past few months with the bot and just walking around and dancing on stage

43 0:19:49,700 --> 0:19:50,900 Just humble beginnings

44 0:19:50,900 --> 0:19:56,260 But you can see the autopilot neural networks running as it's just retrained for the bot

45 0:19:56,900 --> 0:19:58,900 Directly on that on that new platform

46 0:19:59,620 --> 0:20:03,540 That's my watering can yeah when you when you see a rendered view, that's that's the robot

47 0:20:03,780 --> 0:20:08,740 What's the that's the world the robot sees so it's it's very clearly identifying objects

48 0:20:09,300 --> 0:20:11,860 Like this is the object it should pick up picking it up

49 0:20:12,500 --> 0:20:13,700 um

50 0:20:13,700 --> 0:20:15,700 Yeah

51 0:20:15,700 --> 0:20:21,300 So we use the same process as we did for the pilot to connect data and train neural networks that we then deploy on the robot

52 0:20:22,020 --> 0:20:25,460 That's an example that illustrates the upper body a little bit more

53 0:20:28,660 --> 0:20:32,740 Something that will like try to nail down in a few months over the next few months, I would say

54 0:20:33,460 --> 0:20:35,060 to perfection

55 0:20:35,060 --> 0:20:39,060 This is really an actual station in the fremont factory as well that it's working at

56 0:20:39,060 --> 0:20:45,060 Yep, so

57 0:20:54,180 --> 0:20:57,700 And that's not the only thing we have to show today, right? Yeah, absolutely. So

58 0:20:58,180 --> 0:20:59,140 um

59 0:20:59,140 --> 0:21:02,500 that what you saw was what we call bumble see that's our

60 0:21:03,620 --> 0:21:06,420 uh sort of rough development robot using

61 0:21:06,420 --> 0:21:08,420 Semi off-the-shelf actuators

62 0:21:08,980 --> 0:21:14,420 Um, but we actually uh have gone a step further than that already the team's done an incredible job

63 0:21:14,980 --> 0:21:20,660 Um, and we actually have an optimist bot with uh fully tesla designed and built actuators

64 0:21:21,460 --> 0:21:25,220 um battery pack uh control system everything um

65 0:21:25,780 --> 0:21:30,420 It it wasn't quite ready to walk, but I think it will walk in a few weeks

66 0:21:30,420 --> 0:21:37,700 Um, but we wanted to show you the robot, uh, the the something that's actually fairly close to what will go into production

67 0:21:38,420 --> 0:21:42,180 And um and show you all the things it can do so let's bring it out

68 0:21:42,180 --> 0:21:58,180 Do it

69 0:22:12,180 --> 0:22:14,180 So

70 0:22:33,380 --> 0:22:37,620 So here you're seeing optimists with uh, these are the

71 0:22:37,620 --> 0:22:43,780 The with the degrees of freedom that we expect to have in optimist production unit one

72 0:22:44,340 --> 0:22:47,860 Which is the ability to move all the fingers independently move the

73 0:22:48,900 --> 0:22:51,060 To have the thumb have two degrees of freedom

74 0:22:51,700 --> 0:22:53,620 So it has opposable thumbs

75 0:22:53,620 --> 0:22:59,380 And uh both left and right hand so it's able to operate tools and do useful things our goal is to make

76 0:23:00,660 --> 0:23:04,580 a useful humanoid robot as quickly as possible and

77 0:23:04,580 --> 0:23:10,100 Uh, we've also designed it using the same discipline that we use in designing the car

78 0:23:10,180 --> 0:23:13,080 Which is to say to to design it for manufacturing

79 0:23:14,020 --> 0:23:17,620 Such that it's possible to make the robot at in high volume

80 0:23:18,340 --> 0:23:20,580 At low cost with high reliability

81 0:23:21,300 --> 0:23:27,000 So that that's incredibly important. I mean you've all seen very impressive humanoid robot demonstrations

82 0:23:28,020 --> 0:23:30,100 And that that's great. But what are they missing?

83 0:23:30,100 --> 0:23:37,300 Um, they're missing a brain that they don't have the intelligence to navigate the world by themselves

84 0:23:37,700 --> 0:23:39,700 And they're they're also very expensive

85 0:23:40,340 --> 0:23:42,340 and made in low volume

86 0:23:42,340 --> 0:23:43,460 whereas

87 0:23:43,460 --> 0:23:49,860 This is the optimist is designed to be an extremely capable robot but made in very high volume probably

88 0:23:50,420 --> 0:23:52,260 ultimately millions of units

89 0:23:52,260 --> 0:23:55,940 Um, and it is expected to cost much less than a car

90 0:23:55,940 --> 0:24:00,740 So uh, I would say probably less than twenty thousand dollars would be my guess

91 0:24:06,980 --> 0:24:12,740 The potential for optimists is I think appreciated by very few people

92 0:24:16,980 --> 0:24:19,380 As usual tesla demos are coming in hot

93 0:24:20,740 --> 0:24:22,740 So

94 0:24:22,740 --> 0:24:25,380 So, okay, that's good. That's good. Um

95 0:24:26,180 --> 0:24:27,380 Yeah

96 0:24:27,380 --> 0:24:32,100 Uh, the i'm the team's put in put in and the team has put in an incredible amount of work

97 0:24:32,580 --> 0:24:37,540 Uh, it's uh working days, you know, seven days a week running the 3am oil

98 0:24:38,100 --> 0:24:43,780 That to to get to the demonstration today. Um, super proud of what they've done is they've really done done a great job

99 0:24:43,780 --> 0:24:52,980 I just like to give a hand to the whole optimist team

100 0:24:56,900 --> 0:25:02,980 So, you know that now there's still a lot of work to be done to refine optimists and

101 0:25:03,620 --> 0:25:06,580 Improve it obviously this is just optimist version one

102 0:25:06,580 --> 0:25:14,660 Um, and that's really why we're holding this event which is to convince some of the most talented people in the world like you guys

103 0:25:15,140 --> 0:25:16,340 um

104 0:25:16,340 --> 0:25:17,380 to

105 0:25:17,380 --> 0:25:22,820 Join tesla and help make it a reality and bring it to fruition at scale

106 0:25:23,620 --> 0:25:25,300 Such that it can help

107 0:25:25,300 --> 0:25:26,980 millions of people

108 0:25:26,980 --> 0:25:30,340 um, and the the and the potential like I said is is really

109 0:25:30,340 --> 0:25:35,860 Boggles the mind because you have to say like what what is an economy an economy is?

110 0:25:36,580 --> 0:25:37,700 uh

111 0:25:37,700 --> 0:25:39,700 sort of productive

112 0:25:39,700 --> 0:25:42,820 entities times the productivity uh capita times

113 0:25:43,380 --> 0:25:44,420 output

114 0:25:44,420 --> 0:25:48,500 Productivity per capita at the point at which there is not a limitation on capita

115 0:25:49,220 --> 0:25:54,100 The it's not clear what an economy even means at that point. It an economy becomes quasi infinite

116 0:25:54,980 --> 0:25:56,100 um

117 0:25:56,100 --> 0:25:58,100 so

118 0:25:58,100 --> 0:26:02,740 What what you know take into fruition in the hopefully benign scenario?

119 0:26:04,420 --> 0:26:05,940 the

120 0:26:05,940 --> 0:26:10,260 this means a future of abundance a future where

121 0:26:12,260 --> 0:26:18,760 There is no poverty where people you can have whatever you want in terms of products and services

122 0:26:18,760 --> 0:26:27,320 Um it really is a a fundamental transformation of civilization as we know it

123 0:26:28,680 --> 0:26:30,040 um

124 0:26:30,040 --> 0:26:33,800 Obviously, we want to make sure that transformation is a positive one and um

125 0:26:35,000 --> 0:26:36,600 safe

126 0:26:36,600 --> 0:26:38,600 And but but that's also why I think

127 0:26:39,320 --> 0:26:45,400 tesla as an entity doing this being a single class of stock publicly traded owned by the public

128 0:26:46,200 --> 0:26:48,200 Um is very important

129 0:26:48,200 --> 0:26:50,200 Um and should not be overlooked

130 0:26:50,360 --> 0:26:57,960 I think this is essential because then if the public doesn't like what tesla is doing the public can buy shares in tesla and vote

131 0:26:58,500 --> 0:27:00,200 differently

132 0:27:00,200 --> 0:27:02,200 This is a big deal. Um

133 0:27:03,000 --> 0:27:05,720 Like it's very important that that I can't just do what I want

134 0:27:06,360 --> 0:27:08,920 You know sometimes people think that but it's not true

135 0:27:09,480 --> 0:27:10,680 um

136 0:27:10,680 --> 0:27:12,680 so um

137 0:27:13,720 --> 0:27:15,720 You know that it's very important that the

138 0:27:15,720 --> 0:27:21,400 the corporate entity that has that makes this happen is something that the public can

139 0:27:22,120 --> 0:27:24,120 properly influence

140 0:27:24,120 --> 0:27:25,240 um

141 0:27:25,240 --> 0:27:28,200 And so I think the tesla structure is is is ideal for that

142 0:27:29,240 --> 0:27:31,240 um

143 0:27:32,760 --> 0:27:39,080 And like I said that you know self-driving cars will certainly have a tremendous impact on the world

144 0:27:39,720 --> 0:27:41,800 um, I think they will improve

145 0:27:41,800 --> 0:27:45,000 the productivity of transport by at least

146 0:27:46,120 --> 0:27:49,880 A half order of magnitude perhaps an order of magnitude perhaps more

147 0:27:51,000 --> 0:27:52,680 um

148 0:27:52,680 --> 0:27:54,680 Optimist I think

149 0:27:54,920 --> 0:27:56,920 has

150 0:27:57,400 --> 0:28:03,880 Maybe a two order of magnitude uh potential improvement in uh economic output

151 0:28:05,160 --> 0:28:09,240 Like like it's not clear. It's not clear what the limit actually even is

152 0:28:09,240 --> 0:28:11,240 um

153 0:28:11,800 --> 0:28:13,800 So

154 0:28:14,040 --> 0:28:17,320 But we need to do this in the right way we need to do it carefully and safely

155 0:28:17,960 --> 0:28:21,800 and ensure that the outcome is one that is beneficial to

156 0:28:22,580 --> 0:28:26,040 uh civilization and and one that humanity wants

157 0:28:27,240 --> 0:28:30,040 Uh can't this is extremely important obviously

158 0:28:30,920 --> 0:28:32,920 so um

159 0:28:34,440 --> 0:28:36,440 And I hope you will consider

160 0:28:36,680 --> 0:28:38,360 uh joining

161 0:28:38,360 --> 0:28:40,360 tesla to uh

162 0:28:40,920 --> 0:28:42,920 achieve those goals

163 0:28:43,160 --> 0:28:44,120 um

164 0:28:44,120 --> 0:28:49,880 It tells us we're we're we really care about doing the right thing here or aspire to do the right thing and and really not

165 0:28:51,000 --> 0:28:53,000 Pave the road to hell with with good intentions

166 0:28:53,240 --> 0:28:55,800 And I think the road is road to hell is mostly paved with bad intentions

167 0:28:55,800 --> 0:28:57,880 But every now and again, there's a good intention in there

168 0:28:58,440 --> 0:29:03,400 So we want to do the right thing. Um, so, you know consider joining us and helping make it happen

169 0:29:04,760 --> 0:29:07,480 With that let's uh, we want to the next phase

170 0:29:07,480 --> 0:29:09,480 Please right on. Thank you

171 0:29:15,960 --> 0:29:19,640 All right, so you've seen a couple robots today, let's do a quick timeline recap

172 0:29:20,200 --> 0:29:24,760 So last year we unveiled the tesla bot concept, but a concept doesn't get us very far

173 0:29:25,160 --> 0:29:30,680 We knew we needed a real development and integration platform to get real life learnings as quickly as possible

174 0:29:31,240 --> 0:29:36,280 So that robot that came out and did the little routine for you guys. We had that within six months built

175 0:29:36,280 --> 0:29:40,760 working on software integration hardware upgrades over the months since then

176 0:29:41,240 --> 0:29:45,160 But in parallel, we've also been designing the next generation this one over here

177 0:29:46,520 --> 0:29:51,720 So this guy is rooted in the the foundation of sort of the vehicle design process

178 0:29:51,720 --> 0:29:54,840 You know, we're leveraging all of those learnings that we already have

179 0:29:55,960 --> 0:29:58,200 Obviously, there's a lot that's changed since last year

180 0:29:58,200 --> 0:30:00,440 But there's a few things that are still the same you'll notice

181 0:30:00,440 --> 0:30:04,040 We still have this really detailed focus on the true human form

182 0:30:04,040 --> 0:30:07,800 We think that matters for a few reasons, but it's fun

183 0:30:07,800 --> 0:30:11,000 We spend a lot of time thinking about how amazing the human body is

184 0:30:11,720 --> 0:30:13,720 We have this incredible range of motion

185 0:30:14,280 --> 0:30:16,280 Typically really amazing strength

186 0:30:17,080 --> 0:30:22,680 A fun exercise is if you put your fingertip on the chair in front of you, you'll notice that there's a huge

187 0:30:23,480 --> 0:30:28,200 Range of motion that you have in your shoulder and your elbow, for example without moving your fingertip

188 0:30:28,200 --> 0:30:30,200 You can move those joints all over the place

189 0:30:30,200 --> 0:30:34,200 But the robot, you know, its main function is to do real useful work

190 0:30:34,200 --> 0:30:38,200 And it maybe doesn't necessarily need all of those degrees of freedom right away

191 0:30:38,200 --> 0:30:42,200 So we've stripped it down to a minimum sort of 28 fundamental degrees of freedom

192 0:30:42,200 --> 0:30:44,200 And then of course our hands in addition to that

193 0:30:46,200 --> 0:30:50,200 Humans are also pretty efficient at some things and not so efficient in other times

194 0:30:50,200 --> 0:30:56,200 So for example, we can eat a small amount of food to sustain ourselves for several hours. That's great

195 0:30:56,200 --> 0:31:02,200 But when we're just kind of sitting around, no offense, but we're kind of inefficient. We're just sort of burning energy

196 0:31:02,200 --> 0:31:06,200 So on the robot platform what we're going to do is we're going to minimize that idle power consumption

197 0:31:06,200 --> 0:31:08,200 Drop it as low as possible

198 0:31:08,200 --> 0:31:14,200 And that way we can just flip a switch and immediately the robot turns into something that does useful work

199 0:31:16,200 --> 0:31:20,200 So let's talk about this latest generation in some detail, shall we?

200 0:31:20,200 --> 0:31:24,200 So on the screen here, you'll see in orange our actuators, which we'll get to in a little bit

201 0:31:24,200 --> 0:31:26,200 And in blue our electrical system

202 0:31:28,200 --> 0:31:33,200 So now that we have our sort of human-based research and we have our first development platform

203 0:31:33,200 --> 0:31:37,200 We have both research and execution to draw from for this design

204 0:31:37,200 --> 0:31:40,200 Again, we're using that vehicle design foundation

205 0:31:40,200 --> 0:31:46,200 So we're taking it from concept through design and analysis and then build and validation

206 0:31:46,200 --> 0:31:50,200 Along the way, we're going to optimize for things like cost and efficiency

207 0:31:50,200 --> 0:31:54,200 Because those are critical metrics to take this product to scale eventually

208 0:31:54,200 --> 0:31:56,200 How are we going to do that?

209 0:31:56,200 --> 0:32:01,200 Well, we're going to reduce our part count and our power consumption of every element possible

210 0:32:01,200 --> 0:32:05,200 We're going to do things like reduce the sensing and the wiring at our extremities

211 0:32:05,200 --> 0:32:11,200 You can imagine a lot of mass in your hands and feet is going to be quite difficult and power consumptive to move around

212 0:32:11,200 --> 0:32:18,200 And we're going to centralize both our power distribution and our compute to the physical center of the platform

213 0:32:18,200 --> 0:32:23,200 So in the middle of our torso, actually it is the torso, we have our battery pack

214 0:32:23,200 --> 0:32:28,200 This is sized at 2.3 kilowatt hours, which is perfect for about a full day's worth of work

215 0:32:28,200 --> 0:32:36,200 What's really unique about this battery pack is it has all of the battery electronics integrated into a single PCB within the pack

216 0:32:36,200 --> 0:32:45,200 So that means everything from sensing to fusing, charge management and power distribution is all in one place

217 0:32:45,200 --> 0:32:54,200 We're also leveraging both our vehicle products and our energy products to roll all of those key features into this battery

218 0:32:54,200 --> 0:33:02,200 So that's streamlined manufacturing, really efficient and simple cooling methods, battery management and also safety

219 0:33:02,200 --> 0:33:08,200 And of course we can leverage Tesla's existing infrastructure and supply chain to make it

220 0:33:08,200 --> 0:33:15,200 So going on to sort of our brain, it's not in the head, but it's pretty close

221 0:33:15,200 --> 0:33:19,200 Also in our torso we have our central computer

222 0:33:19,200 --> 0:33:24,200 So as you know, Tesla already ships full self-driving computers in every vehicle we produce

223 0:33:24,200 --> 0:33:30,200 We want to leverage both the autopilot hardware and the software for the humanoid platform

224 0:33:30,200 --> 0:33:35,200 But because it's different in requirements and in form factor, we're going to change a few things first

225 0:33:35,200 --> 0:33:45,200 So we still are going to do everything that a human brain does, processing vision data, making split-second decisions based on multiple sensory inputs

226 0:33:45,200 --> 0:33:53,200 And also communications, so to support communications it's equipped with wireless connectivity as well as audio support

227 0:33:53,200 --> 0:34:00,200 And then it also has hardware level security features, which are important to protect both the robot and the people around the robot

228 0:34:00,200 --> 0:34:07,200 So now that we have our sort of core, we're going to need some limbs on this guy

229 0:34:07,200 --> 0:34:12,200 And we'd love to show you a little bit about our actuators and our fully functional hands as well

230 0:34:12,200 --> 0:34:18,200 But before we do that, I'd like to introduce Malcolm, who's going to speak a little bit about our structural foundation for the robot

231 0:34:18,200 --> 0:34:26,200 Thank you, Jiji

232 0:34:26,200 --> 0:34:33,200 Tesla have the capabilities to analyze highly complex systems

233 0:34:33,200 --> 0:34:36,200 They don't get much more complex than a crash

234 0:34:36,200 --> 0:34:41,200 You can see here a simulated crash from model 3 superimposed on top of the actual physical crash

235 0:34:41,200 --> 0:34:44,200 It's actually incredible how accurate it is

236 0:34:44,200 --> 0:34:47,200 Just to give you an idea of the complexity of this model

237 0:34:47,200 --> 0:34:53,200 It includes every nut, bolt and washer, every spot weld, and it has 35 million degrees of freedom

238 0:34:53,200 --> 0:34:55,200 Quite amazing

239 0:34:55,200 --> 0:35:01,200 And it's true to say that if we didn't have models like this, we wouldn't be able to make the safest cars in the world

240 0:35:01,200 --> 0:35:09,200 So can we utilize our capabilities and our methods from the automotive side to influence a robot?

241 0:35:09,200 --> 0:35:16,200 Well, we can make a model, and since we have crash software, we're using the same software here, we can make it fall down

242 0:35:16,200 --> 0:35:23,200 The purpose of this is to make sure that if it falls down, ideally it doesn't, but it's superficial damage

243 0:35:23,200 --> 0:35:26,200 We don't want it to, for example, break its gearbox and its arms

244 0:35:26,200 --> 0:35:31,200 That's equivalent of a dislocated shoulder of a robot, difficult and expensive to fix

245 0:35:31,200 --> 0:35:38,200 So we want it to dust itself off, get on with the job it's being given

246 0:35:38,200 --> 0:35:47,200 We can also take the same model, and we can drive the actuators using the inputs from a previously solved model, bringing it to life

247 0:35:47,200 --> 0:35:51,200 So this is producing the motions for the tasks we want the robot to do

248 0:35:51,200 --> 0:35:55,200 These tasks are picking up boxes, turning, squatting, walking upstairs

249 0:35:55,200 --> 0:35:58,200 Whatever the set of tasks are, we can place the model

250 0:35:58,200 --> 0:36:00,200 This is showing just simple walking

251 0:36:00,200 --> 0:36:08,200 We can create the stresses in all the components that helps us to optimize the components

252 0:36:08,200 --> 0:36:10,200 These are not dancing robots

253 0:36:10,200 --> 0:36:14,200 These are actually the modal behavior, the first five modes of the robot

254 0:36:14,200 --> 0:36:22,200 Typically, when people make robots, they make sure the first mode is up around the top single figure, up towards 10 hertz

255 0:36:22,200 --> 0:36:26,200 The reason we do this is to make the controls of walking easier

256 0:36:26,200 --> 0:36:30,200 It's very difficult to walk if you can't guarantee where your foot is wobbling around

257 0:36:30,200 --> 0:36:34,200 That's okay to make one robot, we want to make thousands, maybe millions

258 0:36:34,200 --> 0:36:37,200 We haven't got the luxury of making them from carbon fiber, titanium

259 0:36:37,200 --> 0:36:41,200 We want to make them from plastic, things are not quite as stiff

260 0:36:41,200 --> 0:36:46,200 So we can't have these high targets, I call them dumb targets

261 0:36:46,200 --> 0:36:49,200 We've got to make them work at lower targets

262 0:36:49,200 --> 0:36:51,200 So is that going to work?

263 0:36:51,200 --> 0:36:57,200 Well, if you think about it, sorry about this, but we're just bags of soggy, jelly and bones thrown in

264 0:36:57,200 --> 0:37:02,200 We're not high frequency, if I stand on my leg, I don't vibrate at 10 hertz

265 0:37:02,200 --> 0:37:08,200 People operate at low frequency, so we know the robot actually can, it just makes controls harder

266 0:37:08,200 --> 0:37:14,200 So we take the information from this, the modal data and the stiffness and feed that into the control system

267 0:37:14,200 --> 0:37:16,200 That allows it to walk

268 0:37:18,200 --> 0:37:21,200 Just changing tack slightly, looking at the knee

269 0:37:21,200 --> 0:37:27,200 We can take some inspiration from biology and we can look to see what the mechanical advantage of the knee is

270 0:37:27,200 --> 0:37:33,200 It turns out it actually represents quite similar to four-bar link, and that's quite non-linear

271 0:37:33,200 --> 0:37:41,200 That's not surprising really, because if you think when you bend your leg down, the torque on your knee is much more when it's bent than it is when it's straight

272 0:37:41,200 --> 0:37:48,200 So you'd expect a non-linear function, and in fact the biology is non-linear, this matches it quite accurately

273 0:37:50,200 --> 0:37:56,200 So that's the representation, the four-bar link is obviously not physically four-bar link, as I said the characteristics are similar

274 0:37:56,200 --> 0:38:00,200 But me bending down, that's not very scientific, let's be a bit more scientific

275 0:38:00,200 --> 0:38:09,200 We've played all the tasks through this graph, and this is showing picking things up, walking, squatting, the tasks I said we did for the stress analysis

276 0:38:09,200 --> 0:38:16,200 And that's the torque seen at the knee against the knee bend on the horizontal axis

277 0:38:16,200 --> 0:38:20,200 This is showing the requirement for the knee to do all these tasks

278 0:38:20,200 --> 0:38:31,200 And then put a curve through it, surfing over the top of the peaks, and that's saying this is what's required to make the robot do these tasks

279 0:38:31,200 --> 0:38:42,200 So if we look at the four-bar link, that's actually the green curve, and it's saying that the non-linearity of the four-bar link has actually linearized the characteristic of the force

280 0:38:42,200 --> 0:38:50,200 What that really says is that's lowered the force, that's what makes the actuator have the lowest possible force, which is the most efficient, we want to burn energy up slowly

281 0:38:50,200 --> 0:39:00,200 What's the blue curve? Well the blue curve is actually if we didn't have a four-bar link, we just had an arm sticking out of my leg here with an actuator on it, a simple two-bar link

282 0:39:00,200 --> 0:39:08,200 That's the best we could do with a simple two-bar link, and it shows that that would create much more force in the actuator, which would not be efficient

283 0:39:08,200 --> 0:39:21,200 So what does that look like in practice? Well, as you'll see, it's very tightly packaged in the knee, you'll see it go transparent in a second, you'll see the four-bar link there, it's operating on the actuator

284 0:39:21,200 --> 0:39:25,200 This determines the force and the displacements on the actuator
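To get an intuition for why the four-bar linkage helps, here is a minimal sketch, with invented lever-arm and torque numbers (not Tesla's), of how a bend-dependent effective lever arm lowers the force the actuator must produce compared with a fixed two-bar lever:

```python
def required_actuator_force(knee_torque_nm, lever_arm_m):
    """Force the actuator must produce to generate a knee torque
    through a linkage with the given effective lever arm."""
    return knee_torque_nm / lever_arm_m

# Invented numbers: torque demand grows as the knee bends, and the
# four-bar linkage's effective lever arm grows with bend as well.
for bend_deg in (10, 45, 90):
    torque = 50 + 2.0 * bend_deg  # N*m, toy demand curve
    two_bar = required_actuator_force(torque, 0.04)  # fixed lever arm
    four_bar = required_actuator_force(torque, 0.04 + 0.0006 * bend_deg)
    print(bend_deg, round(two_bar), round(four_bar))
```

At deep bend, the growing lever arm of the four-bar keeps the actuator force much lower than the fixed lever, which is exactly the "linearized" green curve in the talk.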

285 0:39:25,200 --> 0:39:32,200 I'll now pass you over to Konstantinos to tell you a lot more detail about how these actuators are made and designed and optimized. Thank you

286 0:39:32,200 --> 0:39:39,200 Thank you Malcolm

287 0:39:39,200 --> 0:39:50,200 So I would like to talk to you about the design process and the actuator portfolio in our robot

288 0:39:50,200 --> 0:39:55,200 So there are many similarities between a car and a robot when it comes to powertrain design

289 0:39:55,200 --> 0:40:06,200 The most important thing that matters here is energy, mass, and cost. We are carrying over most of our designing experience from the car to the robot

290 0:40:08,200 --> 0:40:22,200 So in this particular case, you see a car with two drive units, and the drive units are used to accelerate the car for the 0 to 60 miles per hour time or to drive a city drive cycle

291 0:40:22,200 --> 0:40:32,200 While for the robot, which has 28 actuators, it's not obvious what the tasks are at the actuator level

292 0:40:32,200 --> 0:40:44,200 So we have tasks that are higher level like walking or climbing stairs or carrying a heavy object which needs to be translated into joint specs

293 0:40:44,200 --> 0:40:59,200 Therefore we use a model that generates the torque-speed trajectories for our joints, which subsequently are fed into our optimization model to run through the optimization process

294 0:41:01,200 --> 0:41:07,200 This is one of the scenarios that the robot is capable of doing which is turning and walking

295 0:41:07,200 --> 0:41:25,200 So when we have this torque speed trajectory, we lay it over an efficiency map of an actuator and we are able along the trajectory to generate the power consumption and the cumulative energy for the task versus time
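The energy bookkeeping described here can be sketched as a simple integration along the torque-speed trajectory; the efficiency map is stood in for by a callable, and all numbers are illustrative only:

```python
def cumulative_energy(trajectory, efficiency):
    """Integrate electrical energy along a torque-speed trajectory.

    trajectory: iterable of (torque_Nm, speed_rad_s, dt_s) samples.
    efficiency: callable (torque, speed) -> efficiency in (0, 1],
                standing in for the actuator's efficiency map.
    Returns cumulative electrical energy in joules.
    """
    energy = 0.0
    for torque, speed, dt in trajectory:
        mech_power = abs(torque * speed)  # W at the joint
        energy += mech_power / efficiency(torque, speed) * dt
    return energy

# Toy example: 1 s of 60 W mechanical output at a flat 80% efficiency.
segment = [(30.0, 2.0, 0.01)] * 100
print(cumulative_energy(segment, lambda t, w: 0.8))  # 75.0 J
```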

296 0:41:25,200 --> 0:41:38,200 So this allows us to define the system cost for the particular actuator and put a simple point into the cloud. Then we do this for hundreds of thousands of actuators by solving in our cluster

297 0:41:38,200 --> 0:41:44,200 And the red line denotes the Pareto front which is the preferred area where we will look for our optimal

298 0:41:44,200 --> 0:41:50,200 So the X denotes the preferred actuator design we have picked for this particular joint
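Picking the Pareto front out of the cloud of candidate designs is conceptually simple; a minimal sketch with hypothetical designs, where lower cost and lower energy are both better:

```python
def pareto_front(designs):
    """Keep only non-dominated designs; lower cost and lower energy win."""
    def dominates(a, b):
        return (a["cost"] <= b["cost"] and a["energy"] <= b["energy"]
                and (a["cost"] < b["cost"] or a["energy"] < b["energy"]))
    return [d for d in designs if not any(dominates(o, d) for o in designs)]

# Hypothetical actuator candidates for one joint:
cloud = [
    {"name": "A", "cost": 1.0, "energy": 5.0},
    {"name": "B", "cost": 2.0, "energy": 3.0},
    {"name": "C", "cost": 3.0, "energy": 4.0},  # dominated by B
    {"name": "D", "cost": 4.0, "energy": 1.0},
]
print([d["name"] for d in pareto_front(cloud)])  # ['A', 'B', 'D']
```

The "X" in the talk is then one point chosen from this front for the joint in question.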

299 0:41:50,200 --> 0:41:57,200 So now we need to do this for every joint. We have 28 joints to optimize and we parse our cloud

300 0:41:57,200 --> 0:42:07,200 We parse our cloud again for every joint spec, and the red X's this time denote the bespoke actuator designs for every joint

301 0:42:07,200 --> 0:42:15,200 The problem here is that we have too many unique actuator designs and even if we take advantage of the symmetry, still there are too many

302 0:42:15,200 --> 0:42:23,200 In order to make something mass manufacturable, we need to be able to reduce the amount of unique actuator designs

303 0:42:23,200 --> 0:42:36,200 Therefore, we run something called a commonality study, in which we parse our cloud again, looking this time for actuators that simultaneously meet the joint performance requirements for more than one joint at the same time
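A commonality study of this kind can be approximated with a greedy set-cover pass; this is a toy sketch with hypothetical actuator names and joint sets, not Tesla's actual method:

```python
def commonality_study(candidates, joints):
    """Greedy pass: pick few actuator designs that together meet the
    requirements of every joint (classic set cover, not guaranteed optimal)."""
    uncovered, chosen = set(joints), []
    while uncovered:
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        if not candidates[best] & uncovered:
            raise ValueError("some joints cannot be covered by any candidate")
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

# Hypothetical coverage sets: which joints each candidate design can serve.
coverage = {
    "A": {"hip", "knee"},
    "B": {"elbow"},
    "C": {"ankle", "elbow"},
}
print(commonality_study(coverage, {"hip", "knee", "ankle", "elbow"}))  # ['A', 'C']
```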

304 0:42:36,200 --> 0:42:48,200 So the resulting portfolio is six actuators; they are shown in a color map in the middle figure, and the actuators can also be viewed in this slide

305 0:42:48,200 --> 0:42:57,200 We have three rotary and three linear actuators, all of which have a great output force or torque per mass

306 0:42:57,200 --> 0:43:15,200 The rotary actuator in particular has a mechanical clutch integrated on the high speed side, an angular contact ball bearing on the high speed side, a cross roller bearing on the low speed side, and the gear train is a strain wave gear

307 0:43:15,200 --> 0:43:23,200 There are three integrated sensors here and a bespoke permanent magnet machine

308 0:43:23,200 --> 0:43:31,200 The linear actuator

309 0:43:31,200 --> 0:43:33,200 I'm sorry

310 0:43:33,200 --> 0:43:44,200 The linear actuator has planetary rollers and an inverted planetary screw as a gear train, which provides efficiency, compactness, and durability

311 0:43:44,200 --> 0:43:58,200 So in order to demonstrate the force capability of our linear actuators, we have set up an experiment in order to test it under its limits

312 0:43:58,200 --> 0:44:07,200 And I will let you enjoy the video

313 0:44:07,200 --> 0:44:19,200 So our actuator is able to lift

314 0:44:19,200 --> 0:44:25,200 A half ton, nine foot concert grand piano

315 0:44:25,200 --> 0:44:31,200 And

316 0:44:31,200 --> 0:44:56,200 This is a requirement, not something nice to have, because our muscles can do the same thing when they are directly driven; our quadricep muscles can do the same thing. It's just that the knee is a non-linear linkage system that converts force into velocity at the end effector of our heels, for the purpose of giving the human body agility

317 0:44:56,200 --> 0:45:10,200 So this is one of the main things that are amazing about the human body and I'm concluding my part at this point and I would like to welcome my colleague Mike who's going to talk to you about hand design. Thank you very much.

318 0:45:10,200 --> 0:45:13,200 Thanks, Constantine

319 0:45:13,200 --> 0:45:18,200 So we just saw how powerful a human and a humanoid actuator can be.

320 0:45:18,200 --> 0:45:23,200 However, humans are also incredibly dexterous.

321 0:45:23,200 --> 0:45:27,200 The human hand has the ability to move at 300 degrees per second.

322 0:45:27,200 --> 0:45:30,200 There's tens of thousands of tactile sensors.

323 0:45:30,200 --> 0:45:36,200 It has the ability to grasp and manipulate almost every object in our daily lives.

324 0:45:36,200 --> 0:45:40,200 For our robotic hand design, we are inspired by biology.

325 0:45:40,200 --> 0:45:43,200 We have five fingers and opposable thumb.

326 0:45:43,200 --> 0:45:48,200 Our fingers are driven by metallic tendons that are both flexible and strong.

327 0:45:48,200 --> 0:45:57,200 We have the ability to complete wide aperture power grasps, while also being optimized for precision gripping of small, thin and delicate objects.

328 0:45:57,200 --> 0:46:00,200 So why a human like robotic hand?

329 0:46:00,200 --> 0:46:05,200 Well, the main reason is that our factories and the world around us are designed to be ergonomic.

330 0:46:05,200 --> 0:46:09,200 So what that means is that it ensures that objects in our factory are graspable.

331 0:46:09,200 --> 0:46:17,200 But it also ensures that new objects that we may have never seen before can be grasped by the human hand and by our robotic hand as well.

332 0:46:17,200 --> 0:46:27,200 The converse there is pretty interesting, because it's saying that these objects are designed for our hand, instead of having to make changes to our hand to accommodate a new object.

333 0:46:27,200 --> 0:46:31,200 Some basic stats about our hand is that it has six actuators and 11 degrees of freedom.

334 0:46:31,200 --> 0:46:37,200 It has an in-hand controller, which drives the fingers and receives sensor feedback.

335 0:46:37,200 --> 0:46:43,200 Sensor feedback is really important to learn a little bit more about the objects that we're grasping and also for proprioception.

336 0:46:43,200 --> 0:46:48,200 And that's the ability for us to recognize where our hand is in space.

337 0:46:48,200 --> 0:46:51,200 One of the important aspects of our hand is that it's adaptive.

338 0:46:51,200 --> 0:46:58,200 This adaptability essentially comes from complex mechanisms that allow the hand to adapt to the object that's being grasped.

339 0:46:58,200 --> 0:47:01,200 Another important part is that we have a non back drivable finger drive.

340 0:47:01,200 --> 0:47:07,200 This clutching mechanism allows us to hold and transport objects without having to turn on the hand motors.

341 0:47:07,200 --> 0:47:12,200 You just heard how we went about designing the TeslaBot hardware.

342 0:47:12,200 --> 0:47:16,200 Now I'll hand it off to Milan and our autonomy team to bring this robot to life.

343 0:47:16,200 --> 0:47:24,200 Thanks, Michael.

344 0:47:24,200 --> 0:47:26,200 All right.

345 0:47:26,200 --> 0:47:36,200 So all those cool things we've shown earlier in the video were possible just in a matter of a few months thanks to the amazing work that we've done on autopilot over the past few years.

346 0:47:36,200 --> 0:47:40,200 Most of those components ported quite easily over to the bot's environment.

347 0:47:40,200 --> 0:47:45,200 If you think about it, we're just moving from a robot on wheels to a robot on legs.

348 0:47:45,200 --> 0:47:51,200 So some of the components are pretty similar and some of them require more heavy lifting.

349 0:47:51,200 --> 0:47:59,200 So for example, our computer vision neural networks were ported directly from Autopilot to the bot.

350 0:47:59,200 --> 0:48:07,200 It's exactly the same occupancy network, which we'll talk about in a little more detail later with the Autopilot team, that is now running on the bot here in this video.

351 0:48:07,200 --> 0:48:14,200 The only thing that changed really is the training data that we had to recollect.

352 0:48:14,200 --> 0:48:25,200 We're also trying to find ways to improve those occupancy networks using work done on neural radiance fields, to get really great volumetric renderings of the bot's environment.

353 0:48:25,200 --> 0:48:32,200 For example, here some machinery that the bot might have to interact with.

354 0:48:32,200 --> 0:48:42,200 Another interesting problem to think about: in indoor environments, which mostly lack a GPS signal, how do you get the bot to navigate to its destination?

355 0:48:42,200 --> 0:48:45,200 Say for instance, to find its nearest charging station.

356 0:48:45,200 --> 0:48:59,200 So we've been training more neural networks to identify high frequency features, key points within the bot's camera streams, and track them across frames over time as the bot navigates through its environment.

357 0:48:59,200 --> 0:49:09,200 And we're using those points to get a better estimate of the bot's pose and trajectory within its environment as it's walking.

358 0:49:09,200 --> 0:49:18,200 We also did quite some work on the simulation side, and this is literally the autopilot simulator to which we've integrated the robot locomotion code.

359 0:49:18,200 --> 0:49:27,200 And this is a video of the motion control code running in the Autopilot simulator, showing the evolution of the robot's walk over time.

360 0:49:27,200 --> 0:49:37,200 So as you can see, we started quite slowly in April and started accelerating as we unlock more joints and deploy more advanced techniques like arms balancing over the past few months.

361 0:49:37,200 --> 0:49:44,200 And so locomotion specifically is one component that's very different as we're moving from the car to the bot's environment.

362 0:49:44,200 --> 0:49:57,200 So I think it warrants a little bit more depth and I'd like my colleagues to start talking about this now.

363 0:49:57,200 --> 0:50:04,200 Thank you Milan. Hi everyone, I'm Felix, I'm a robotics engineer on the project, and I'm going to talk about walking.

364 0:50:04,200 --> 0:50:10,200 Walking seems easy, right? People do it every day. You don't even have to think about it.

365 0:50:10,200 --> 0:50:15,200 But there are some aspects of walking which are challenging from an engineering perspective.

366 0:50:15,200 --> 0:50:22,200 For example, physical self-awareness. That means having a good representation of yourself.

367 0:50:22,200 --> 0:50:28,200 What is the length of your limbs? What is the mass of your limbs? What is the size of your feet? All that matters.

368 0:50:28,200 --> 0:50:37,200 Also, having an energy efficient gait. You can imagine there are different styles of walking, and not all of them are equally efficient.

369 0:50:37,200 --> 0:50:45,200 Most important, keep balance, don't fall. And of course, also coordinate the motion of all of your limbs together.

370 0:50:45,200 --> 0:50:52,200 So now humans do all of this naturally, but as engineers or roboticists, we have to think about these problems.

371 0:50:52,200 --> 0:50:57,200 And the following I'm going to show you how we address them in our locomotion planning and control stack.

372 0:50:57,200 --> 0:51:01,200 So we start with locomotion planning and our representation of the bot.

373 0:51:01,200 --> 0:51:07,200 That means a model of the robot's kinematics, dynamics, and the contact properties.

374 0:51:07,200 --> 0:51:16,200 And using that model and the desired path for the bot, our locomotion planner generates reference trajectories for the entire system.

375 0:51:16,200 --> 0:51:22,200 This means feasible trajectories with respect to the assumptions of our model.

376 0:51:22,200 --> 0:51:29,200 The planner currently works in three stages. It starts planning footsteps and ends with the entire motion for the system.

377 0:51:29,200 --> 0:51:32,200 And let's dive a little bit deeper in how this works.

378 0:51:32,200 --> 0:51:39,200 So in this video, we see footsteps being planned over a planning horizon following the desired path.

379 0:51:39,200 --> 0:51:49,200 And we start from this and add limb trajectories that connect these footsteps, using toe-off and heel strike just as humans do.

380 0:51:49,200 --> 0:51:55,200 And this gives us a larger stride and less knee bend for high efficiency of the system.

381 0:51:55,200 --> 0:52:04,200 The last stage is then finding a center of mass trajectory, which gives us a dynamically feasible motion of the entire system to keep balance.
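The first planning stage, footstep placement along a desired path, might be sketched like this (a straight path, fixed stride and stance width, all numbers invented):

```python
def plan_footsteps(path_length_m, stride_m=0.5, stance_width_m=0.2):
    """Stage 1 of a locomotion planner: place alternating left/right
    footsteps along a straight desired path (a toy stand-in for a
    general path-following footstep planner)."""
    steps, x, side = [], 0.0, 1  # side: +1 left foot, -1 right foot
    while x < path_length_m:
        steps.append((round(x, 3), side * stance_width_m / 2))
        x += stride_m
        side = -side
    return steps

print(plan_footsteps(2.0))  # [(0.0, 0.1), (0.5, -0.1), (1.0, 0.1), (1.5, -0.1)]
```

The later stages, limb trajectories connecting the footsteps and the center of mass trajectory, would then be planned on top of these placements.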

382 0:52:04,200 --> 0:52:09,200 As we all know, plans are good, but we also have to realize them in reality.

383 0:52:09,200 --> 0:52:20,200 Let's see how we can do this.

384 0:52:20,200 --> 0:52:23,200 Thank you, Felix. Hello, everyone. My name is Anand.

385 0:52:23,200 --> 0:52:26,200 And I'm going to talk to you about controls.

386 0:52:26,200 --> 0:52:33,200 So let's take the motion plan that Felix just talked about and put it in the real world on a real robot.

387 0:52:33,200 --> 0:52:37,200 Let's see what happens.

388 0:52:37,200 --> 0:52:40,200 It takes a couple of steps and falls down.

389 0:52:40,200 --> 0:52:48,200 Well, that's a little disappointing, but we are missing a few key pieces here which will make it walk.

390 0:52:48,200 --> 0:52:57,200 Now, as Felix mentioned, the motion planner is using an idealized version of itself and a version of reality around it.

391 0:52:57,200 --> 0:52:59,200 This is not exactly correct.

392 0:52:59,200 --> 0:53:12,200 It also expresses its intention through trajectories and wrenches, that is, forces and torques that it wants to exert on the world to locomote.

393 0:53:12,200 --> 0:53:16,200 Reality is way more complex than any simple model.

394 0:53:16,200 --> 0:53:18,200 Also, the real robot is not the simplified model.

395 0:53:18,200 --> 0:53:25,200 It's got vibrations and modes, compliance, sensor noise, and on and on and on.

396 0:53:25,200 --> 0:53:30,200 So what does that do to the real world when you put the bot in the real world?

397 0:53:30,200 --> 0:53:36,200 Well, the unexpected forces cause unmodeled dynamics, which essentially the planner doesn't know about.

398 0:53:36,200 --> 0:53:44,200 And that causes destabilization, especially for a system that is dynamically stabilized, like biped locomotion.

399 0:53:44,200 --> 0:53:46,200 So what can we do about it?

400 0:53:46,200 --> 0:53:48,200 Well, we measure reality.

401 0:53:48,200 --> 0:53:53,200 We use sensors and our understanding of the world to do state estimation.

402 0:53:53,200 --> 0:54:00,200 And here you can see the attitude and pelvis pose, which is essentially the vestibular system in a human,

403 0:54:00,200 --> 0:54:07,200 along with the center of mass trajectory being tracked when the robot is walking in the office environment.

404 0:54:07,200 --> 0:54:11,200 Now we have all the pieces we need in order to close the loop.

405 0:54:11,200 --> 0:54:14,200 So we use our better bot model.

406 0:54:14,200 --> 0:54:18,200 We use the understanding of reality that we've gained through state estimation.

407 0:54:18,200 --> 0:54:24,200 And we compare what we want versus what we expect the reality is doing to us

408 0:54:24,200 --> 0:54:30,200 in order to add corrections to the behavior of the robot.
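Closing the loop as described, comparing the plan against state-estimated reality and adding corrections, is in its simplest form a feedback law; here is a toy PD sketch with invented gains, not the actual controller:

```python
def control_correction(desired, estimated, kp=200.0, kd=20.0):
    """Closed-loop correction: compare the planned state against the
    state-estimated one and output a corrective force (simple PD sketch).

    desired/estimated: (position_m, velocity_m_s) of e.g. the center of mass.
    kp/kd: invented proportional/derivative gains.
    """
    pos_err = desired[0] - estimated[0]
    vel_err = desired[1] - estimated[1]
    return kp * pos_err + kd * vel_err

# Robot pushed 2 cm behind plan while moving slightly too slow:
print(control_correction((0.10, 0.30), (0.08, 0.25)))  # about 5.0 N of correction
```

A push like the poke in the video shows up as exactly this kind of state error, which the feedback turns into a restoring action.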

409 0:54:30,200 --> 0:54:38,200 Here, the robot certainly doesn't appreciate being poked, but it does an admirable job of staying upright.

410 0:54:38,200 --> 0:54:43,200 The final point here is a robot that walks is not enough.

411 0:54:43,200 --> 0:54:48,200 We need it to use its hands and arms to be useful.

412 0:54:48,200 --> 0:54:50,200 Let's talk about manipulation.

413 0:55:00,200 --> 0:55:04,200 Hi, everyone. My name is Eric, robotics engineer on Teslabot.

414 0:55:04,200 --> 0:55:09,200 And I want to talk about how we've made the robot manipulate things in the real world.

415 0:55:09,200 --> 0:55:16,200 We wanted to manipulate objects while looking as natural as possible and also get there quickly.

416 0:55:16,200 --> 0:55:20,200 So what we've done is we've broken this process down into two steps.

417 0:55:20,200 --> 0:55:26,200 First is generating a library of natural motion references, or we could call them demonstrations.

418 0:55:26,200 --> 0:55:32,200 And then we've adapted these motion references online to the current real world situation.

419 0:55:32,200 --> 0:55:36,200 So let's say we have a human demonstration of picking up an object.

420 0:55:36,200 --> 0:55:42,200 We can get a motion capture of that demonstration, which is visualized right here as a bunch of key frames

421 0:55:42,200 --> 0:55:46,200 representing the location of the hands, the elbows, the torso.

422 0:55:46,200 --> 0:55:49,200 We can map that to the robot using inverse kinematics.

423 0:55:49,200 --> 0:55:55,200 And if we collect a lot of these, now we have a library that we can work with.

424 0:55:55,200 --> 0:56:01,200 But a single demonstration is not generalizable to the variation in the real world.

425 0:56:01,200 --> 0:56:06,200 For instance, this would only work for a box in a very particular location.

426 0:56:06,200 --> 0:56:12,200 So what we've also done is run these reference trajectories through a trajectory optimization program,

427 0:56:12,200 --> 0:56:17,200 which solves for where the hand should be, how the robot should balance,

428 0:56:17,200 --> 0:56:21,200 when it needs to adapt the motion to the real world.

429 0:56:21,200 --> 0:56:31,200 So for instance, if the box is in this location, then our optimizer will create this trajectory instead.
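Adapting a demonstrated reference to a new object location can be illustrated with a simple warp that progressively blends in the offset to the new target; the real system uses trajectory optimization with balance constraints, so this is only a stand-in:

```python
def adapt_reference(keyframes, new_target):
    """Warp a demonstrated hand trajectory so it ends at a new object pose.

    keyframes: list of (x, y, z) hand positions from motion capture.
    The offset between the demonstrated end point and the new target is
    blended in linearly, keeping the start of the motion unchanged.
    """
    end = keyframes[-1]
    offset = tuple(t - e for t, e in zip(new_target, end))
    n = len(keyframes) - 1
    return [tuple(k[i] + offset[i] * idx / n for i in range(3))
            for idx, k in enumerate(keyframes)]

# Hypothetical demonstration, retargeted to a box at a new location:
demo = [(0.0, 0.0, 0.0), (0.2, 0.1, 0.3), (0.4, 0.2, 0.6)]
print(adapt_reference(demo, (0.5, 0.2, 0.5)))
```

The start keyframe is preserved and the final keyframe lands on the new target, which is the essential behavior the optimizer provides (plus feasibility and balance, which this sketch ignores).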

430 0:56:31,200 --> 0:56:38,200 Next, Milan is going to talk about what's next for Optimus. Thanks.

431 0:56:38,200 --> 0:56:45,200 Thanks, Eric.

432 0:56:45,200 --> 0:56:50,200 Right. So hopefully by now you guys got a good idea of what we've been up to over the past few months.

433 0:56:50,200 --> 0:56:54,200 We started doing something that's usable, but it's far from being useful.

434 0:56:54,200 --> 0:56:58,200 There's still a long and exciting road ahead of us.

435 0:56:58,200 --> 0:57:03,200 I think the first thing within the next few weeks is to get Optimus at least on par with Bumble C,

436 0:57:03,200 --> 0:57:07,200 the other bot prototype you saw earlier, and probably beyond.

437 0:57:07,200 --> 0:57:12,200 We are also going to start focusing on the real use case at one of our factories

438 0:57:12,200 --> 0:57:18,200 and really going to try to nail this down and iron out all the elements needed

439 0:57:18,200 --> 0:57:20,200 to deploy this product in the real world.

440 0:57:20,200 --> 0:57:27,200 I was mentioning earlier: indoor navigation, graceful fall management, or even servicing,

441 0:57:27,200 --> 0:57:31,200 all components needed to scale this product up.

442 0:57:31,200 --> 0:57:35,200 I don't know about you, but after seeing what we've shown tonight,

443 0:57:35,200 --> 0:57:38,200 I'm pretty sure we can get this done within the next few months or years

444 0:57:38,200 --> 0:57:43,200 and make this product a reality and change the entire economy.

445 0:57:43,200 --> 0:57:47,200 So I would like to thank the entire optimist team for all their hard work over the past few months.

446 0:57:47,200 --> 0:57:51,200 I think it's pretty amazing. All of this was done in barely six or eight months.

447 0:57:51,200 --> 0:57:53,200 Thank you very much.

448 0:57:53,200 --> 0:58:01,200 Applause

449 0:58:07,200 --> 0:58:14,200 Hey, everyone. Hi, I'm Ashok. I lead the Autopilot team alongside Milan.

450 0:58:14,200 --> 0:58:18,200 God, it's going to be so hard to top that Optimus section.

451 0:58:18,200 --> 0:58:21,200 We'll try nonetheless.

452 0:58:21,200 --> 0:58:26,200 Anyway, every Tesla that has been built over the last several years

453 0:58:26,200 --> 0:58:30,200 has the hardware needed to make the car drive itself.

454 0:58:30,200 --> 0:58:36,200 We have been working on the software to add higher and higher levels of autonomy.

455 0:58:36,200 --> 0:58:42,200 This time around last year, we had roughly 2,000 cars driving our FSD beta software.

456 0:58:42,200 --> 0:58:47,200 Since then, we have so significantly improved the software's robustness and capability

457 0:58:47,200 --> 0:58:53,200 that we have now shipped it to 160,000 customers as of today.

458 0:58:53,200 --> 0:58:59,200 Applause

459 0:58:59,200 --> 0:59:06,200 This has not come for free. It came from the sweat and blood of the engineering team over the last year.

460 0:59:06,200 --> 0:59:11,200 For example, we trained 75,000 neural network models in just the last year.

461 0:59:11,200 --> 0:59:16,200 That's roughly a model every eight minutes that's coming out of the team.

462 0:59:16,200 --> 0:59:19,200 And then we evaluate them on our large clusters.

463 0:59:19,200 --> 0:59:24,200 And then we ship 281 of those models that actually improve the performance of the car.

464 0:59:24,200 --> 0:59:28,200 And this space of innovation is happening throughout the stack.

465 0:59:28,200 --> 0:59:37,200 The planning software, the infrastructure, the tools, even hiring, everything is progressing to the next level.

466 0:59:37,200 --> 0:59:41,200 The FSD beta software is quite capable of driving the car.

467 0:59:41,200 --> 0:59:46,200 It should be able to navigate from parking lot to parking lot, handling city street driving,

468 0:59:46,200 --> 0:59:56,200 stopping for traffic lights and stop signs, negotiating with objects at intersections, making turns and so on.

469 0:59:56,200 --> 1:00:02,200 All of this comes from the camera streams that go through our neural networks that run on the car itself.

470 1:00:02,200 --> 1:00:04,200 It's not coming back to the server or anything.

471 1:00:04,200 --> 1:00:09,200 It's running on the car and produces all the outputs to form the world model around the car.

472 1:00:09,200 --> 1:00:13,200 And the planning software drives the car based on that.

473 1:00:13,200 --> 1:00:17,200 Today we'll go into a lot of the components that make up the system.

474 1:00:17,200 --> 1:00:23,200 The occupancy network acts as the base geometry layer of the system.

475 1:00:23,200 --> 1:00:28,200 This is a multi-camera video neural network that from the images

476 1:00:28,200 --> 1:00:34,200 predicts the full physical occupancy of the world around the robot.

477 1:00:34,200 --> 1:00:39,200 So anything that's physically present, whether trees, walls, buildings, or cars,

478 1:00:39,200 --> 1:00:46,200 if it's physically present, it predicts it, along with its future motion.
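The output representation, as opposed to the network itself, is easy to picture: a volumetric grid of occupied cells around the robot. A toy sketch that just quantizes hypothetical 3D points into voxels (the real network predicts this kind of volume directly from multi-camera video):

```python
GRID, RES = 20, 0.5  # 20^3 voxels at 0.5 m -> a 10 m cube around the robot

def occupancy_grid(points):
    """Quantize hypothetical 3D points (meters, robot-centered) into a set
    of occupied voxel indices; only illustrates the output representation."""
    occupied = set()
    for x, y, z in points:
        idx = tuple(int(v / RES) + GRID // 2 for v in (x, y, z))
        if all(0 <= i < GRID for i in idx):
            occupied.add(idx)
    return occupied

g = occupancy_grid([(1.0, 0.0, 0.0), (1.2, 0.0, 0.0), (-3.0, 2.0, 1.0)])
print(len(g))  # 2: the first two points fall into the same voxel
```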

479 1:00:46,200 --> 1:00:51,200 On top of this base level of geometry, we have more semantic layers.

480 1:00:51,200 --> 1:00:56,200 In order to navigate the roadways, we need the lanes, of course.

481 1:00:56,200 --> 1:00:59,200 The roadways have lots of different lanes and they connect in all kinds of ways.

482 1:00:59,200 --> 1:01:03,200 So it's actually a really difficult problem for typical computer vision techniques

483 1:01:03,200 --> 1:01:06,200 to predict the set of lanes and their connectivities.

484 1:01:06,200 --> 1:01:11,200 So we reached all the way into language technologies and then pulled the state of the art from other domains

485 1:01:11,200 --> 1:01:16,200 and not just computer vision to make this task possible.

486 1:01:16,200 --> 1:01:21,200 For vehicles, we need their full kinematic state to control for them.

487 1:01:21,200 --> 1:01:24,200 All of this directly comes from neural networks.

488 1:01:24,200 --> 1:01:28,200 Video streams, raw video streams, come into the networks,

489 1:01:28,200 --> 1:01:31,200 go through a lot of processing, and then the networks output the full kinematic state.

490 1:01:31,200 --> 1:01:37,200 The positions, velocities, acceleration, jerk, all of that directly comes out of networks

491 1:01:37,200 --> 1:01:39,200 with minimal post-processing.

492 1:01:39,200 --> 1:01:42,200 That's really fascinating to me because how is this even possible?

493 1:01:42,200 --> 1:01:45,200 What world do we live in that this magic is possible,

494 1:01:45,200 --> 1:01:48,200 that these networks predict third derivatives of these positions

495 1:01:48,200 --> 1:01:53,200 when people thought they couldn't even detect these objects?

496 1:01:53,200 --> 1:01:55,200 My opinion is that it did not come for free.

497 1:01:55,200 --> 1:02:00,200 It required tons of data, so we had to build sophisticated auto-labeling systems

498 1:02:00,200 --> 1:02:05,200 that churn through raw sensor data, run a ton of offline compute on the servers.

499 1:02:05,200 --> 1:02:09,200 It can take a few hours, run expensive neural networks,

500 1:02:09,200 --> 1:02:15,200 distill the information into labels that train our in-car neural networks.

501 1:02:15,200 --> 1:02:20,200 On top of this, we also use our simulation system to synthetically create images,

502 1:02:20,200 --> 1:02:25,200 and since it's a simulation, we trivially have all the labels.

503 1:02:25,200 --> 1:02:29,200 All of this goes through a well-oiled data engine pipeline

504 1:02:29,200 --> 1:02:33,200 where we first train a baseline model with some data,

505 1:02:33,200 --> 1:02:36,200 ship it to the car, see what the failures are,

506 1:02:36,200 --> 1:02:41,200 and once we know the failures, we mine the fleet for the cases where it fails,

507 1:02:41,200 --> 1:02:45,200 provide the correct labels, and add the data to the training set.

508 1:02:45,200 --> 1:02:48,200 This process systematically fixes the issues,

509 1:02:48,200 --> 1:02:51,200 and we do this for every task that runs in the car.
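One turn of that data-engine loop can be written schematically; all the stand-in functions here are toys for illustration, not Tesla's pipeline:

```python
def data_engine_step(train, evaluate, mine_fleet, label, dataset):
    """One schematic turn of the data engine: train a model, find where
    it fails, mine the fleet for those cases, label them, and fold them
    back into the training set."""
    model = train(dataset)
    failures = evaluate(model)
    new_clips = mine_fleet(failures)
    dataset = dataset + [label(clip) for clip in new_clips]
    return model, dataset

# Toy stand-ins: the "model" is just the set of cases it has seen.
train = lambda data: set(data)
evaluate = lambda model: [c for c in ("rain", "glare") if c not in model]
mine_fleet = lambda failures: failures  # pretend the fleet returns clips
label = lambda clip: clip

model, data = data_engine_step(train, evaluate, mine_fleet, label, ["day", "night"])
print(sorted(data))  # ['day', 'glare', 'night', 'rain']
```

Each iteration grows the dataset exactly where the current model is weakest, which is the systematic fixing of issues the talk describes.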

510 1:02:51,200 --> 1:02:54,200 Yeah, and to train these new massive neural networks,

511 1:02:54,200 --> 1:02:59,200 this year we expanded our training infrastructure by roughly 40 to 50 percent,

512 1:02:59,200 --> 1:03:06,200 so that sits us at about 14,000 GPUs today across multiple training clusters in the United States.

513 1:03:06,200 --> 1:03:09,200 We also worked on our AI compiler,

514 1:03:09,200 --> 1:03:13,200 which now supports new operations needed by those neural networks

515 1:03:13,200 --> 1:03:17,200 and maps them to the best of our underlying hardware resources.

516 1:03:17,200 --> 1:03:23,200 And our inference engine today is capable of distributing the execution of a single neural network

517 1:03:23,200 --> 1:03:26,200 across two independent system on chips,

518 1:03:26,200 --> 1:03:32,200 essentially two independent computers interconnected within the same full self-driving computer.

519 1:03:32,200 --> 1:03:37,200 And to make this possible, we had to keep a tight control on the end-to-end latency of this new system,

520 1:03:37,200 --> 1:03:43,200 so we deployed more advanced scheduling code across the full FSD platform.

521 1:03:43,200 --> 1:03:47,200 All of these neural networks running in the car together produce the vector space,

522 1:03:47,200 --> 1:03:50,200 which is again the model of the world around the robot or the car.

523 1:03:50,200 --> 1:03:56,200 And then the planning system operates on top of this, coming up with trajectories that avoid collisions, are smooth,

524 1:03:56,200 --> 1:04:00,200 and make progress towards the destination, using a combination of model-based optimization

525 1:04:00,200 --> 1:04:06,200 plus neural network that helps optimize it to be really fast.

526 1:04:06,200 --> 1:04:11,200 Today, we are really excited to present progress on all of these areas.

527 1:04:11,200 --> 1:04:15,200 We have the engineering leads standing by to come in and explain these various blocks,

528 1:04:15,200 --> 1:04:22,200 and these power not just the car, but the same components also run on the Optimus robot that Milan showed earlier.

529 1:04:22,200 --> 1:04:26,200 With that, I welcome Paril to start talking about the planning section.

530 1:04:26,200 --> 1:04:36,200 Hi, all. I'm Paril Jain.

531 1:04:36,200 --> 1:04:43,200 Let's use this intersection scenario to dive straight into how we do the planning and decision-making in Autopilot.

532 1:04:43,200 --> 1:04:49,200 So we are approaching this intersection from a side street, and we have to yield to all the crossing vehicles.

533 1:04:49,200 --> 1:04:57,200 Right as we are about to enter the intersection, the pedestrian on the other side of the intersection decides to cross the road without a crosswalk.

534 1:04:57,200 --> 1:05:02,200 Now, we need to yield to this pedestrian, yield to the vehicles from the right,

535 1:05:02,200 --> 1:05:08,200 and also understand the relation between the pedestrian and the vehicle on the other side of the intersection.

536 1:05:08,200 --> 1:05:15,200 So there are a lot of inter-object dependencies that we need to resolve in a quick glance.

537 1:05:15,200 --> 1:05:17,200 And humans are really good at this.

538 1:05:17,200 --> 1:05:27,200 We look at a scene, understand all the possible interactions, evaluate the most promising ones, and generally end up choosing a reasonable one.

539 1:05:27,200 --> 1:05:31,200 So let's look at a few of these interactions that the Autopilot system evaluated.

540 1:05:31,200 --> 1:05:36,200 We could have gone in front of this pedestrian with a very aggressive launch and lateral profile.

541 1:05:36,200 --> 1:05:41,200 Now, obviously, we are being a jerk to the pedestrian, and we would spook the pedestrian and his cute pet.

542 1:05:41,200 --> 1:05:48,200 We could have moved forward slowly, shot for a gap between the pedestrian and the vehicle from the right.

543 1:05:48,200 --> 1:05:51,200 Again, we are being a jerk to the vehicle coming from the right.

544 1:05:51,200 --> 1:05:58,200 But you should not outright reject this interaction, in case it is the only safe interaction available.

545 1:05:58,200 --> 1:06:01,200 Lastly, the interaction we ended up choosing.

546 1:06:01,200 --> 1:06:09,200 Stay slow initially, find the reasonable gap, and then finish the maneuver after all the agents pass.

547 1:06:09,200 --> 1:06:18,200 Now, evaluation of all of these interactions is not trivial, especially when you care about modeling the higher-order derivatives for other agents.

548 1:06:18,200 --> 1:06:25,200 For example, what is the longitudinal jerk required by the vehicle coming from the right when you assert in front of it?

549 1:06:25,200 --> 1:06:33,200 Relying purely on collision checks with modular predictions will only get you so far because you will miss out on a lot of valid interactions.

550 1:06:33,200 --> 1:06:42,200 This basically boils down to solving the multi-agent joint trajectory planning problem over the trajectories of ego and all the other agents.

551 1:06:42,200 --> 1:06:47,200 Now, however much you optimize, there's going to be a limit to how fast you can run this optimization problem.

552 1:06:47,200 --> 1:06:53,200 It will be close to order of 10 milliseconds, even after a lot of incremental approximations.

553 1:06:53,200 --> 1:07:07,200 Now, for a typical crowded unprotected left, say you have more than 20 objects, each object having multiple different future modes, the number of relevant interaction combinations will blow up.

554 1:07:07,200 --> 1:07:11,200 The planner needs to make a decision every 50 milliseconds.

555 1:07:11,200 --> 1:07:14,200 So how do we solve this in real time?

556 1:07:14,200 --> 1:07:23,200 We rely on a framework we call interaction search, which is basically a parallelized tree search over a bunch of maneuver trajectories.

557 1:07:23,200 --> 1:07:36,200 The state space here corresponds to the kinematic state of ego, the kinematic state of other agents, their nominal future multimodal predictions, and all the static entities in the scene.

558 1:07:36,200 --> 1:07:40,200 The action space is where things get interesting.

559 1:07:40,200 --> 1:07:50,200 We use a set of maneuver trajectory candidates to branch over a bunch of interaction decisions and also incremental goals for a longer horizon maneuver.

560 1:07:50,200 --> 1:07:55,200 Let's walk through this tree search very quickly to get a sense of how it works.

561 1:07:55,200 --> 1:08:00,200 We start with a set of vision measurements, namely lanes, occupancy, moving objects.

562 1:08:00,200 --> 1:08:05,200 These get represented as sparse abstractions as well as latent features.

563 1:08:05,200 --> 1:08:17,200 We use this to create a set of goal candidates, lanes again from the lanes network, or unstructured regions which correspond to a probability mask derived from human demonstration.

564 1:08:17,200 --> 1:08:28,200 Once we have a bunch of these goal candidates, we create seed trajectories using a combination of classical optimization approaches, as well as our network planner, again trained on data from the customer fleet.

565 1:08:28,200 --> 1:08:35,200 Now once we get a bunch of these seed trajectories, we use them to start branching on the interactions.

566 1:08:35,200 --> 1:08:37,200 We find the most critical interaction.

567 1:08:37,200 --> 1:08:43,200 In our case, this would be the interaction with respect to the pedestrian, whether we assert in front of it or yield to it.

568 1:08:43,200 --> 1:08:47,200 Obviously, the option on the left is a high penalty option.

569 1:08:47,200 --> 1:08:49,200 It likely won't get prioritized.

570 1:08:49,200 --> 1:08:57,200 So we branch further onto the option on the right, and that's where we bring in more and more complex interactions, building this optimization problem incrementally with more and more constraints.

571 1:08:57,200 --> 1:09:03,200 And the tree search keeps growing, branching on more interactions, branching on more goals.

572 1:09:03,200 --> 1:09:09,200 Now a lot of the tricks here lie in the evaluation of each node of this tree search.

573 1:09:09,200 --> 1:09:19,200 Inside each node, initially we started with creating trajectories using classical optimization approaches, where the constraints, like I described, would be added incrementally.

574 1:09:19,200 --> 1:09:24,200 And this would take close to one to five milliseconds per action.

575 1:09:24,200 --> 1:09:31,200 Now even though this is a fairly good number, when you want to evaluate more than 100 interactions, this does not scale.

576 1:09:31,200 --> 1:09:37,200 So we ended up building lightweight, queryable networks that you can run in the loop of the planner.

577 1:09:37,200 --> 1:09:44,200 These networks are trained on human demonstrations from the fleet, as well as offline solvers with relaxed time limits.

578 1:09:44,200 --> 1:09:51,200 With this, we were able to bring the run time down to close to 100 microseconds per action.

579 1:09:51,200 --> 1:10:06,200 Now doing this alone is not enough, because you still have this massive tree search that you need to go through, and you need to efficiently prune the search space.

580 1:10:06,200 --> 1:10:11,200 So you need to do scoring on each of these trajectories.

581 1:10:11,200 --> 1:10:18,200 A few of these are fairly standard. You do a bunch of collision checks, you do a bunch of comfort analysis: what is the jerk and accel required for a given maneuver.

582 1:10:18,200 --> 1:10:23,200 The customer fleet data plays an important role here again.

583 1:10:23,200 --> 1:10:32,200 We run two sets of, again, lightweight, queryable networks, both really augmenting each other, one of them trained from interventions from the FSD beta fleet,

584 1:10:32,200 --> 1:10:38,200 which gives a score on how likely is a given maneuver to result in interventions over the next few seconds.

585 1:10:38,200 --> 1:10:47,200 And second, which is purely on human demonstrations, human driven data, giving a score on how close is your given selected action to a human driven trajectory.

586 1:10:47,200 --> 1:10:56,200 The scoring helps us prune the search space, keep branching further on the interactions, and focus the compute on the most promising outcomes.
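The branch-score-prune loop described above can be sketched as a beam search over interaction decisions: branch on one interaction at a time, score each candidate with cheap stand-in scorers plus a physics-style penalty, and keep only the most promising nodes. Everything below (the agents, the choices, the cost weights) is an illustrative toy, not Tesla's implementation:

```python
import heapq

# Branch on one interaction decision per agent; lower cost = more promising.
AGENTS = ["pedestrian", "car_right", "car_left"]
CHOICES = ["yield", "assert"]

def score(decisions):
    """Toy stand-in for the scorers in the talk: a jerk/comfort proxy plus
    learned intervention-likelihood and human-likeness scores."""
    comfort = sum(2.0 for _, c in decisions if c == "assert")          # jerk proxy
    intervention = 5.0 if ("pedestrian", "assert") in decisions else 0.0
    human_likeness = 0.5 * sum(1 for _, c in decisions if c == "assert")
    return comfort + intervention + human_likeness

def interaction_search(beam_width=3):
    frontier = [(0.0, ())]                      # start with the empty plan
    for agent in AGENTS:                        # branch on the next interaction
        children = []
        for _, decisions in frontier:
            for choice in CHOICES:
                d = decisions + ((agent, choice),)
                children.append((score(d), d))
        frontier = heapq.nsmallest(beam_width, children)  # prune: keep the best nodes
    return frontier[0]

best_cost, best_plan = interaction_search()
print(best_plan)   # yields to every agent under this toy cost model
```

The pruning keeps the frontier small, so the number of trajectory evaluations grows linearly with the number of agents instead of exponentially.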

587 1:10:56,200 --> 1:11:06,200 The cool part about this architecture is that it allows us to create a cool blend between data driven approaches,

588 1:11:06,200 --> 1:11:12,200 where you don't have to rely on a lot of hand engineered costs, but also ground it in reality with physics based checks.

589 1:11:12,200 --> 1:11:22,200 Now a lot of what I described was with respect to the agents we could observe in the scene, but the same framework extends to objects behind occlusions.

590 1:11:22,200 --> 1:11:29,200 We use the video feed from eight cameras to generate the 3D occupancy of the world.

591 1:11:29,200 --> 1:11:34,200 The blue mask here corresponds to the visibility region we call it.

592 1:11:34,200 --> 1:11:38,200 It basically gets blocked at the first occlusion you see in the scene.

593 1:11:38,200 --> 1:11:44,200 We consume this visibility mask to generate what we call ghost objects, which you can see on the top left.

594 1:11:44,200 --> 1:11:50,200 Now if you model the spawn regions and the state transitions of these ghost objects correctly,

595 1:11:50,200 --> 1:11:59,200 if you tune your control response as a function of their existence likelihood, you can extract some really nice human like behaviors.
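The ghost-object idea above can be sketched as follows; the update rule, decay factor, and speed blend are invented for illustration only:

```python
# Illustrative sketch (not Tesla's code): spawn a "ghost" object behind an
# occlusion with some existence likelihood, shrink that belief as the region
# is observed clear, and soften the control response as a function of it.

def update_existence(p: float, observed_clear: bool, decay: float = 0.5) -> float:
    """Shrink belief when the region is seen empty, grow it slightly otherwise."""
    return p * decay if observed_clear else min(1.0, p * 1.1)

def speed_limit(base_mps: float, ghost_p: float, cautious_mps: float = 2.0) -> float:
    """Blend between a cautious creep and normal speed by existence likelihood."""
    return cautious_mps * ghost_p + base_mps * (1.0 - ghost_p)

p = 0.8                        # ghost spawned behind a parked truck
for _ in range(3):             # three frames of seeing the gap is clear
    p = update_existence(p, observed_clear=True)
print(round(p, 2))             # 0.1
print(round(speed_limit(10.0, p), 2))   # 9.2 m/s: back to near-normal speed
```

Tuning the control response continuously in the likelihood, rather than thresholding it, is what produces the smooth creep-then-go behavior described.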

596 1:11:59,200 --> 1:12:04,200 Now I'll pass it on to Phil to describe more on how we generate these occupancy networks.

597 1:12:04,200 --> 1:12:11,200 Thank you.

598 1:12:11,200 --> 1:12:18,200 Hey guys, my name is Phil. I will share the details of the occupancy network we built over the past year.

599 1:12:18,200 --> 1:12:23,200 This network is our solution to model the physical world in 3D around our cars.

600 1:12:23,200 --> 1:12:27,200 And it is currently not shown in our customer facing visualization.

601 1:12:27,200 --> 1:12:35,200 What you see here is the raw network output from our internal lab tool.

602 1:12:35,200 --> 1:12:46,200 The occupancy network takes video streams of all our eight cameras as input, produces a single unified volumetric occupancy in vector space directly.

603 1:12:46,200 --> 1:12:54,200 For every 3D location around our car, it predicts the probability of that location being occupied or not.

604 1:12:54,200 --> 1:13:02,200 Since it has video context, it is capable of predicting obstacles that are occluded instantaneously.

605 1:13:02,200 --> 1:13:16,200 For each location, it also produces a set of semantics such as a curb, car, pedestrian, and road debris as color coded here.

606 1:13:16,200 --> 1:13:19,200 Occupancy flow is also predicted for motion.

607 1:13:19,200 --> 1:13:26,200 Since the model is a generalized network, it does not distinguish static and dynamic objects explicitly.

608 1:13:26,200 --> 1:13:33,200 It is able to produce and model random motion, such as the swerving trailer here.

609 1:13:33,200 --> 1:13:40,200 This network is currently running in all Teslas with FSD computers, and it is incredibly efficient.

610 1:13:40,200 --> 1:13:45,200 Runs about every 10 milliseconds with our neural net accelerator.

611 1:13:45,200 --> 1:13:48,200 So how does this work? Let's take a look at architecture.

612 1:13:48,200 --> 1:13:53,200 First, we rectify each camera's images with the camera calibration.

613 1:13:53,200 --> 1:13:59,200 And the images we're giving to the network here are actually not the typical 8-bit RGB images.

614 1:13:59,200 --> 1:14:06,200 As you can see from the first image on top, we're giving the 12-bit raw photon count image to the network.

615 1:14:06,200 --> 1:14:17,200 Since it has four bits more information, it has 16 times better dynamic range as well as reduced latency since we don't have to run ISP in the loop anymore.

616 1:14:17,200 --> 1:14:25,200 We use a set of RegNets and BiFPNs as a backbone to extract image space features.

617 1:14:25,200 --> 1:14:34,200 Next, we construct a set of 3D positional queries, which, along with the image space features as keys and values, are fed into an attention module.

618 1:14:34,200 --> 1:14:39,200 The output of the attention module is high dimensional spatial features.

619 1:14:39,200 --> 1:14:48,200 These spatial features are aligned temporally using vehicle odometry to derive motion.

620 1:14:48,200 --> 1:14:57,200 Next, these spatial temporal features go through a set of deconvolution to produce the final occupancy and occupancy flow output.

621 1:14:57,200 --> 1:15:04,200 They're formed as fixed-size voxel grid, which might not be precise enough for planning and control.

622 1:15:04,200 --> 1:15:19,200 In order to get a higher resolution, we also produce per-voxel feature maps, which we feed into an MLP with 3D spatial point queries to get occupancy and semantics at any arbitrary location.
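A minimal sketch of the per-voxel feature map plus point-query MLP just described. The sizes and layers are invented, and a nearest-voxel lookup stands in for whatever interpolation the real network uses:

```python
# Sketch: a fixed-size voxel grid stores per-voxel feature vectors; a small
# MLP queried with an arbitrary continuous 3D point returns occupancy there.
import numpy as np

rng = np.random.default_rng(0)
GRID = rng.standard_normal((8, 8, 8, 16))      # 8^3 voxels, 16-dim features
W1, b1 = rng.standard_normal((16, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 1)), np.zeros(1)

def voxel_feature(xyz):
    """Nearest-voxel lookup (a real system would interpolate smoothly)."""
    i, j, k = np.clip(np.floor(np.asarray(xyz) * 8).astype(int), 0, 7)
    return GRID[i, j, k]

def occupancy(xyz) -> float:
    """Query occupancy probability at a continuous point in [0, 1)^3."""
    h = np.maximum(voxel_feature(xyz) @ W1 + b1, 0.0)   # ReLU hidden layer
    logit = (h @ W2 + b2)[0]
    return 1.0 / (1.0 + np.exp(-logit))                  # sigmoid -> probability

p = occupancy((0.37, 0.52, 0.11))
print(0.0 <= p <= 1.0)   # True: a probability at any arbitrary location
```

The point is that resolution is no longer tied to the voxel grid: the same feature map answers queries at any continuous position.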

623 1:15:19,200 --> 1:15:23,200 After knowing the model better, let's take a look at another example.

624 1:15:23,200 --> 1:15:29,200 Here we have an articulated bus parked on the right side of the road, highlighted as an L-shaped voxel here.

625 1:15:29,200 --> 1:15:42,200 As we approach, the bus starts to move. The front of the bus turns blue first, indicating the model predicts the front of the bus has a non-zero occupancy flow.

626 1:15:42,200 --> 1:15:52,200 And as the bus keeps moving, the entire bus turns blue, and you can also see that the network predicts the precise curvature of the bus.

627 1:15:52,200 --> 1:16:03,200 Well, this is a very complicated problem for a traditional object detection network, as you have to decide whether to use one cuboid or perhaps two to fit the curvature.

628 1:16:03,200 --> 1:16:13,200 But for the occupancy network, since all we care about is the occupancy in the visible space, we're able to model the curvature precisely.

629 1:16:13,200 --> 1:16:18,200 Besides the voxel grid, the occupancy network also produces a drivable surface.

630 1:16:18,200 --> 1:16:27,200 The drivable surface has both 3D geometry and semantics. They are very useful for control, especially on hilly and curvy roads.

631 1:16:27,200 --> 1:16:37,200 The surface and the voxel grid are not predicted independently. Instead, the voxel grid actually aligns with the surface implicitly.

632 1:16:37,200 --> 1:16:46,200 Here we are at a hill crest, where you can see the 3D geometry of the surface being predicted nicely.

633 1:16:46,200 --> 1:16:51,200 The planner can use this information to decide perhaps we need to slow down more for the hill crest.

634 1:16:51,200 --> 1:16:58,200 And as you can also see, the voxel grid aligns with the surface consistently.

635 1:16:58,200 --> 1:17:07,200 Besides the voxels and the surface, we're also very excited about the recent breakthrough in neural radiance fields, or NeRF.

636 1:17:07,200 --> 1:17:19,200 We're looking into both incorporating some of the latest NeRF features into occupancy network training, as well as using our network output as the input state for NeRFs.

637 1:17:19,200 --> 1:17:28,200 As a matter of fact, Ashok is very excited about this. This has been his personal weekend project for a while.

638 1:17:28,200 --> 1:17:38,200 I think academia is building a lot of these foundation models for language using tons of large data sets for language.

639 1:17:38,200 --> 1:17:45,200 I think for vision, NeRFs are going to provide the foundation models for computer vision because they are grounded in geometry.

640 1:17:45,200 --> 1:17:52,200 Geometry gives us a nice way to supervise these networks and frees us of the requirement to define an ontology.

641 1:17:52,200 --> 1:17:56,200 And the supervision is essentially free because you just have to differentiably render these images.

642 1:17:56,200 --> 1:18:11,200 So I think in the future, this occupancy network idea where images come in and then the network produces a consistent volumetric representation of the scene that can then be differentiably rendered into any image that was observed,

643 1:18:11,200 --> 1:18:14,200 I personally think is a future of computer vision.

644 1:18:14,200 --> 1:18:29,200 And we do some initial work on it right now, but I think in the future, both at Tesla and in academia, we will see that this combination of one-shot prediction of volumetric occupancy will be the future.

645 1:18:29,200 --> 1:18:32,200 That's my personal bet.

646 1:18:32,200 --> 1:18:34,200 Thanks, Ashok.

647 1:18:34,200 --> 1:18:39,200 So here's an example early result of a 3D reconstruction from our fleet data.

648 1:18:39,200 --> 1:18:49,200 Instead of focusing on getting a perfect RGB reprojection in image space, our primary goal here is to accurately represent the world in 3D space for driving.

649 1:18:49,200 --> 1:18:54,200 And we want to do this for all our fleet data all over the world, in all weather and lighting conditions.

650 1:18:54,200 --> 1:19:00,200 And obviously, this is a very challenging problem, and we're looking for you guys to help.

651 1:19:00,200 --> 1:19:07,200 Finally, the occupancy network is trained with large auto-labeled data set without any human in the loop.

652 1:19:07,200 --> 1:19:12,200 And with that, I'll pass to Tim to talk about what it takes to train this network.

653 1:19:12,200 --> 1:19:18,200 Thanks, Phil.

654 1:19:18,200 --> 1:19:20,200 All right. Hey, everyone.

655 1:19:20,200 --> 1:19:23,200 Let's talk about some training infrastructure.

656 1:19:23,200 --> 1:19:32,200 So we've seen a couple of videos so far, you know, four or five, I think, and we care and worry about a lot more clips than that.

657 1:19:32,200 --> 1:19:38,200 So let's look at the occupancy network you just saw in Phil's videos.

658 1:19:38,200 --> 1:19:43,200 It takes 1.4 billion frames to train that network, which you just saw.

659 1:19:43,200 --> 1:19:47,200 And if you have 100,000 GPUs, it would take one hour.

660 1:19:47,200 --> 1:19:52,200 But if you have one GPU, it would take 100,000 hours.
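The GPU-hour arithmetic above can be checked directly; the implied per-GPU throughput below is derived from the quoted figures, not stated in the talk:

```python
# Back-of-the-envelope training time, using the figures from the talk.
frames = 1_400_000_000          # frames needed to train the occupancy network
gpu_hours_total = 100_000       # quoted total compute: 1 GPU -> 100,000 hours

# Implied throughput: roughly 14,000 frames per GPU-hour (an assumption
# derived from the two numbers above, not a figure Tesla gave).
frames_per_gpu_hour = frames / gpu_hours_total

def wall_clock_hours(num_gpus: int) -> float:
    """Ideal (perfectly parallel) wall-clock time for the training job."""
    return gpu_hours_total / num_gpus

print(wall_clock_hours(1))        # 100000.0 hours (over 11 years)
print(wall_clock_hours(100_000))  # 1.0 hour
print(wall_clock_hours(10_000))   # 10.0 hours, roughly Tesla's training pool
```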

661 1:19:52,200 --> 1:19:56,200 So that is not a humane time period that you can wait for your training job to run, right?

662 1:19:56,200 --> 1:19:58,200 We want to ship faster than that.

663 1:19:58,200 --> 1:20:00,200 So that means you're going to need to go parallel.

664 1:20:00,200 --> 1:20:03,200 So you need more compute for that.

665 1:20:03,200 --> 1:20:06,200 That means you're going to need a supercomputer.

666 1:20:06,200 --> 1:20:18,200 So this is why we've built in-house three supercomputers comprising 14,000 GPUs, where we use 10,000 GPUs for training and 4,000 GPUs for auto-labeling.

667 1:20:18,200 --> 1:20:24,200 All these videos are stored in 30 petabytes of a distributed, managed video cache.

668 1:20:24,200 --> 1:20:31,200 You shouldn't think of our data sets as fixed, let's say, as you think of your ImageNet or something, you know, with like a million frames.

669 1:20:31,200 --> 1:20:34,200 You should think of it as a very fluid thing.

670 1:20:34,200 --> 1:20:42,200 So we've got half a million of these videos flowing in and out of these clusters every single day.

671 1:20:42,200 --> 1:20:49,200 And we track 400,000 of these Python video instantiations every second.

672 1:20:49,200 --> 1:20:51,200 So that's a lot of calls.

673 1:20:51,200 --> 1:20:57,200 We're going to need to capture that in order to govern the retention policies of this distributed video cache.

674 1:20:57,200 --> 1:21:04,200 So underlying all of this is a huge amount of infra, all of which we build and manage in-house.

675 1:21:04,200 --> 1:21:13,200 So you cannot just buy, you know, 14,000 GPUs and then 30 petabytes of flash NVMe and just put it together and let's go train.

676 1:21:13,200 --> 1:21:17,200 It actually takes a lot of work, and I'm going to go into a little bit of that.

677 1:21:17,200 --> 1:21:25,200 What you actually typically want to do is you want to take your accelerator, so that could be the GPU or Dojo, which we'll talk about later.

678 1:21:25,200 --> 1:21:31,200 And because that's the most expensive component, that's where you want to put your bottleneck.

679 1:21:31,200 --> 1:21:37,200 And so that means that every single part of your system is going to need to outperform this accelerator.

680 1:21:37,200 --> 1:21:39,200 And so that is really complicated.

681 1:21:39,200 --> 1:21:46,200 That means that your storage is going to need to have the size and the bandwidth to deliver all the data down into the nodes.

682 1:21:46,200 --> 1:21:53,200 These nodes need to have the right amount of CPU and memory capabilities to feed into your machine learning framework.

683 1:21:53,200 --> 1:21:58,200 This machine learning framework then needs to hand it off to your GPU, and then you can start training.

684 1:21:58,200 --> 1:22:06,200 But then you need to do so across hundreds or thousands of GPU in a reliable way, in lockstep, and in a way that's also fast.

685 1:22:06,200 --> 1:22:10,200 So you're also going to need an interconnect. Extremely complicated.

686 1:22:10,200 --> 1:22:13,200 We'll talk more about Dojo in a second.

687 1:22:13,200 --> 1:22:18,200 So first I want to take you through some optimizations that we've done on our cluster.

688 1:22:18,200 --> 1:22:27,200 So we're getting in a lot of videos, and video is very much unlike, let's say, training on images or text, which I think is very well established.

689 1:22:27,200 --> 1:22:31,200 Video is quite literally a dimension more complicated.

690 1:22:31,200 --> 1:22:39,200 And so that's why we needed to go end to end from the storage layer down to the accelerator and optimize every single piece of that.

691 1:22:39,200 --> 1:22:45,200 Because we train on the photon count videos that come directly from our fleet, we train on those directly.

692 1:22:45,200 --> 1:22:48,200 We do not post-process those at all.

693 1:22:48,200 --> 1:22:53,200 The way it's just done is we seek exactly to the frames we select for our batch.

694 1:22:53,200 --> 1:22:56,200 We load those in, including the frames that they depend on.

695 1:22:56,200 --> 1:22:58,200 So these are your I-frames or your key frames.

696 1:22:58,200 --> 1:23:03,200 We package those up, move them into shared memory, move them into a double buffer on the GPU,

697 1:23:03,200 --> 1:23:09,200 and then use the hardware decoder that's on the accelerator to actually decode the video.

698 1:23:09,200 --> 1:23:11,200 So we do that on the GPU natively.

699 1:23:11,200 --> 1:23:15,200 And it's all in a very nice PyTorch extension.

700 1:23:15,200 --> 1:23:26,200 Doing so unlocks more than 30% training speed increase for the occupancy networks and frees up basically the whole CPU to do any other thing.

701 1:23:26,200 --> 1:23:29,200 You cannot just do training with just videos.

702 1:23:29,200 --> 1:23:31,200 Of course, you need some kind of a ground truth.

703 1:23:31,200 --> 1:23:34,200 And that is actually an interesting problem as well.

704 1:23:34,200 --> 1:23:43,200 The objective for storing your ground truth is that you want to make sure you get to your ground truth that you need in the minimal amount of file system operations

705 1:23:43,200 --> 1:23:49,200 and load in the minimal size of what you need in order to optimize for aggregate cross-cluster throughput.

706 1:23:49,200 --> 1:23:56,200 Because you should see a compute cluster as one big device which has internally fixed constraints and thresholds.

707 1:23:56,200 --> 1:24:02,200 So for this, we rolled out a format that is native to us that's called Small.

708 1:24:02,200 --> 1:24:06,200 We use this for our ground truth, our feature cache, and any inference outputs.

709 1:24:06,200 --> 1:24:08,200 So a lot of tensors that are in there.

710 1:24:08,200 --> 1:24:10,200 And so just a cartoon here.

711 1:24:10,200 --> 1:24:13,200 Let's say this is your table that you want to store.

712 1:24:13,200 --> 1:24:16,200 Then that's how it would look if you rolled it out on disk.

713 1:24:16,200 --> 1:24:22,200 So what you do is you take anything you'd want to index on, so for example, video timestamps, you put those all in the header

714 1:24:22,200 --> 1:24:26,200 so that in your initial header read, you know exactly where to go on disk.

715 1:24:26,200 --> 1:24:34,200 Then if you have any tensors, you're going to try to transpose the dimensions to put a different dimension last as the contiguous dimension.

716 1:24:34,200 --> 1:24:37,200 And then also try different types of compression.

717 1:24:37,200 --> 1:24:41,200 Then you check out which one was most optimal and then store that one.

718 1:24:41,200 --> 1:24:46,200 This is actually a huge step if you do feature caching: take the output from the machine learning network,

719 1:24:46,200 --> 1:24:52,200 rotate the dimensions around a little bit, and you can get up to a 20% increase in storage efficiency.

720 1:24:52,200 --> 1:25:01,200 Then when you store that, we also order the columns by size so that all your small columns and small values are together

721 1:25:01,200 --> 1:25:06,200 so that when you seek for a single value, you're likely to overlap with the read on more values,

722 1:25:06,200 --> 1:25:11,200 which you'll use later so that you don't need to do another file system operation.
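A hypothetical sketch of the storage ideas just described; the format details here are invented, not the actual Small format. Indexable fields live in the header so a single read tells you where to seek, each column keeps whichever compression came out smallest, and columns are ordered by size so small values cluster together:

```python
import json, struct, zlib

def pack(columns: dict) -> bytes:
    """Serialize named byte columns: header first, then the column payloads."""
    blobs = {}
    for name, raw in columns.items():
        candidates = [raw, zlib.compress(raw)]        # try compressions, keep best
        best = min(candidates, key=len)
        blobs[name] = (best, best is not raw)
    # Order columns by size so small values end up adjacent on disk:
    ordered = sorted(blobs.items(), key=lambda kv: len(kv[1][0]))
    header, offset, payload = {}, 0, b""
    for name, (blob, compressed) in ordered:
        header[name] = {"offset": offset, "len": len(blob), "zlib": compressed}
        offset += len(blob)
        payload += blob
    h = json.dumps(header).encode()
    return struct.pack("<I", len(h)) + h + payload     # header length, header, data

def read_column(buf: bytes, name: str) -> bytes:
    """One header read tells us exactly where to seek for the column."""
    hlen = struct.unpack_from("<I", buf)[0]
    meta = json.loads(buf[4:4 + hlen])[name]
    start = 4 + hlen + meta["offset"]
    blob = buf[start:start + meta["len"]]
    return zlib.decompress(blob) if meta["zlib"] else blob

f = pack({"timestamps": b"\x01\x02\x03", "features": b"\x00" * 4096})
print(read_column(f, "features") == b"\x00" * 4096)    # True
```

Reading any column costs one header read plus one seek, which is the "minimal file system operations" objective stated above.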

723 1:25:11,200 --> 1:25:13,200 So I could go on and on.

724 1:25:13,200 --> 1:25:17,200 I just touched on two projects that we have internally.

725 1:25:17,200 --> 1:25:23,200 But this is actually part of a huge continuous effort to optimize the compute that we have in-house.

726 1:25:23,200 --> 1:25:27,200 So accumulating and aggregating through all these optimizations,

727 1:25:27,200 --> 1:25:32,200 we now train our occupancy networks twice as fast just because it's twice as efficient.

728 1:25:32,200 --> 1:25:38,200 And now if we add in a bunch more compute and go parallel, we can now train this in hours instead of days.

729 1:25:38,200 --> 1:25:43,200 And with that, I'd like to hand it off to the biggest user of compute, John.

730 1:25:43,200 --> 1:25:52,200 Hi, everybody. My name is John Emmons.

731 1:25:52,200 --> 1:25:54,200 I lead the Autopilot vision team.

732 1:25:54,200 --> 1:25:57,200 I'm going to cover two topics with you today.

733 1:25:57,200 --> 1:26:04,200 The first is how we predict lanes, and the second is how we predict the future behavior of other agents on the road.

734 1:26:04,200 --> 1:26:11,200 In the early days of Autopilot, we modeled the lane detection problem as an image space instance segmentation task.

735 1:26:11,200 --> 1:26:13,200 Our network was super simple, though.

736 1:26:13,200 --> 1:26:18,200 In fact, it was only capable of predicting lanes of a few different kinds of geometries.

737 1:26:18,200 --> 1:26:26,200 Specifically, it would segment the ego lane, it could segment adjacent lanes, and then it had some special casing for forks and merges.

738 1:26:26,200 --> 1:26:31,200 This simplistic modeling of the problem worked for highly structured roads like highways.

739 1:26:31,200 --> 1:26:35,200 But today we're trying to build a system that's capable of much more complex maneuvers.

740 1:26:35,200 --> 1:26:41,200 Specifically, we want to make left and right turns at intersections where the road topology can be quite a bit more complex and diverse.

741 1:26:41,200 --> 1:26:47,200 When we try to apply this simplistic modeling of the problem here, it just totally breaks down.

742 1:26:47,200 --> 1:26:54,200 Taking a step back for a moment, what we're trying to do here is to predict the sparse set of lane instances and their connectivity.

743 1:26:54,200 --> 1:27:00,200 And what we want to do is to have a neural network that basically predicts this graph where the nodes are the lane segments

744 1:27:00,200 --> 1:27:04,200 and the edges encode the connectivities between these lanes.

745 1:27:04,200 --> 1:27:08,200 So what we have is our lane detection neural network.

746 1:27:08,200 --> 1:27:11,200 It's made up of three components.

747 1:27:11,200 --> 1:27:16,200 In the first component, we have a set of convolutional layers, attention layers, and other neural network layers

748 1:27:16,200 --> 1:27:23,200 that encode the video streams from our eight cameras on the vehicle and produce a rich visual representation.

749 1:27:23,200 --> 1:27:32,200 We then enhance this visual representation with a coarse road level map data, which we encode with a set of additional neural network layers

750 1:27:32,200 --> 1:27:35,200 that we call the lane guidance module.

751 1:27:35,200 --> 1:27:40,200 This map is not an HD map, but it provides a lot of useful hints about the topology of lanes inside of intersections,

752 1:27:40,200 --> 1:27:46,200 the lane counts on various roads, and a set of other attributes that help us.

753 1:27:46,200 --> 1:27:51,200 The first two components here produce a dense tensor that sort of encodes the world.

754 1:27:51,200 --> 1:27:57,200 But what we really want to do is to convert this dense tensor into a sparse set of lanes and their connectivities.

755 1:27:57,200 --> 1:28:02,200 We approach this problem like an image captioning task, where the input is this dense tensor,

756 1:28:02,200 --> 1:28:09,200 and the output text is predicted in a special language that we developed at Tesla for encoding lanes and their connectivities.

757 1:28:09,200 --> 1:28:14,200 In this language of lanes, the words and tokens are the lane positions in 3D space.

758 1:28:14,200 --> 1:28:21,200 The ordering of the tokens and the predicted modifiers on the tokens encode the connectivity relationships between these lanes.

759 1:28:21,200 --> 1:28:26,200 By modeling the task as a language problem, we can capitalize on recent autoregressive architectures

760 1:28:26,200 --> 1:28:30,200 and techniques from the language community for handling the multiplicity of the problem.

761 1:28:30,200 --> 1:28:33,200 We're not just solving the computer vision problem at Autopilot.

762 1:28:33,200 --> 1:28:38,200 We're also applying the state-of-the-art in language modeling and machine learning more generally.

763 1:28:38,200 --> 1:28:42,200 I'm now going to dive into a little bit more detail of this language component.

764 1:28:42,200 --> 1:28:48,200 What I have depicted on the screen here is a satellite image which sort of represents the local area around the vehicle.

765 1:28:48,200 --> 1:28:56,200 The set of nodes and edges is what we refer to as the lane graph, and it's ultimately what we want to come out of this neural network.

766 1:28:56,200 --> 1:28:59,200 We start with a blank slate.

767 1:28:59,200 --> 1:29:03,200 We're going to want to make our first prediction here at this green dot.

768 1:29:03,200 --> 1:29:08,200 This green dot's position is encoded as an index into a coarse grid which discretizes the 3D world.

769 1:29:08,200 --> 1:29:13,200 Now, we don't predict this index directly because it would be too computationally expensive to do so.

770 1:29:13,200 --> 1:29:20,200 There's just too many grid points, and predicting a categorical distribution over this has both implications at training time and test time.

771 1:29:20,200 --> 1:29:23,200 So instead what we do is we discretize the world coarsely first.

772 1:29:23,200 --> 1:29:28,200 We predict a heat map over the possible locations, and then we latch in the most probable location.

773 1:29:28,200 --> 1:29:34,200 Conditioned on this, we then refine the prediction and get the precise point.
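The coarse-to-fine trick above, sketched with invented numbers: predict a heatmap over a coarse grid, latch the most probable cell, then refine with a continuous offset inside that cell, instead of classifying over every fine-grained grid point at once:

```python
import numpy as np

COARSE = 16                               # 16x16 coarse cells over the local region
rng = np.random.default_rng(3)
heatmap = rng.random((COARSE, COARSE))    # stand-in for the network's coarse heatmap

cell = np.unravel_index(np.argmax(heatmap), heatmap.shape)   # latch the argmax cell
offset = np.array([0.25, 0.70])           # stand-in for the refinement head, in [0,1)^2

point = (np.array(cell) + offset) / COARSE   # precise position in [0,1)^2
print(point.shape)                        # (2,)
```

A categorical head over 16x16 = 256 cells plus a 2D regression is far cheaper than a softmax over every fine grid point, which is the training- and test-time cost the talk alludes to.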

774 1:29:34,200 --> 1:29:38,200 Now, we know where the position of this token is, but we don't know its type.

775 1:29:38,200 --> 1:29:41,200 In this case, though, it's the beginning of a new lane.

776 1:29:41,200 --> 1:29:44,200 So we predict it as a start token.

777 1:29:44,200 --> 1:29:48,200 And because it's a start token, there's no additional attributes in our language.

778 1:29:48,200 --> 1:29:54,200 We then take the predictions from this first forward pass, and we encode them using a learned positional embedding

779 1:29:54,200 --> 1:30:00,200 which produces a set of tensors that we combine together, which is actually the first word in our language of lanes.

780 1:30:00,200 --> 1:30:04,200 We add this to the first position in our sentence here.

781 1:30:04,200 --> 1:30:09,200 We then continue this process by predicting the next lane point in a similar fashion.

782 1:30:09,200 --> 1:30:12,200 Now, this lane point is not the beginning of a new lane.

783 1:30:12,200 --> 1:30:15,200 It's actually a continuation of the previous lane.

784 1:30:15,200 --> 1:30:18,200 So it's a continuation token type.

785 1:30:18,200 --> 1:30:23,200 Now, it's not enough just to know that this lane is connected to the previously predicted lane.

786 1:30:23,200 --> 1:30:29,200 We want to encode its precise geometry, which we do by regressing a set of spline coefficients.

787 1:30:29,200 --> 1:30:34,200 We then take this lane, we encode it again, and add it as the next word in the sentence.

788 1:30:34,200 --> 1:30:39,200 We continue predicting these continuation lanes until we get to the end of the prediction grid.

789 1:30:39,200 --> 1:30:42,200 We then move on to a different lane segment.

790 1:30:42,200 --> 1:30:44,200 So you can see that cyan dot there.

791 1:30:44,200 --> 1:30:47,200 Now, it's not topologically connected to that pink point.

792 1:30:47,200 --> 1:30:52,200 It's actually forking off of that blue, sorry, that green point there.

793 1:30:52,200 --> 1:30:54,200 So it's got a fork type.

794 1:30:54,200 --> 1:31:00,200 And fork tokens actually point back to previous tokens from which the fork originates.

795 1:31:00,200 --> 1:31:03,200 So you can see here the fork point prediction is actually index zero.

796 1:31:03,200 --> 1:31:09,200 So it's actually referencing back to tokens that it's already predicted, like you would in language.

797 1:31:09,200 --> 1:31:14,200 We continue this process over and over again until we've enumerated all of the tokens in the lane graph.

798 1:31:14,200 --> 1:31:18,200 And then the network predicts the end of sentence token.
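
The "language of lanes" decoding loop described above can be sketched as a toy autoregressive decoder. The token fields, the scripted predictions, and the grid coordinates are invented to mirror the talk's example; the real model conditions a transformer on the token history:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

START, CONTINUATION, FORK, END = "start", "continuation", "fork", "end"

@dataclass
class LaneToken:
    kind: str
    position: Optional[Tuple[int, int]] = None       # cell on the prediction grid
    spline: List[float] = field(default_factory=list)  # geometry (continuation only)
    fork_from: Optional[int] = None                  # index the fork branches off

def decode(predict_next):
    """Autoregressively grow the 'sentence' until the end-of-sentence token."""
    sentence = []
    while True:
        token = predict_next(sentence)   # forward pass conditioned on history
        if token.kind == END:
            return sentence
        sentence.append(token)

# A scripted stand-in for the network, mirroring the example in the talk:
script = iter([
    LaneToken(START, (3, 0)),
    LaneToken(CONTINUATION, (3, 5), spline=[0.1, -0.2, 0.05]),
    LaneToken(FORK, (5, 2), fork_from=0),  # points back at token index 0
    LaneToken(END),
])
graph = decode(lambda sentence: next(script))
```

Note how the fork token references an earlier index in the sentence, exactly like a pointer back into previously generated text.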

799 1:31:18,200 --> 1:31:24,200 Yeah, I just wanted to note that the reason we do this is not just because we want to build something complicated.

800 1:31:24,200 --> 1:31:29,200 It almost feels like a Turing-complete machine here with neural networks. The thing is, we tried simpler approaches.

801 1:31:29,200 --> 1:31:34,200 For example, trying to just segment the lanes along the road or something like that.

802 1:31:34,200 --> 1:31:40,200 But then the problem is when there's uncertainty, say you cannot see the road clearly and there could be two lanes or three lanes,

803 1:31:40,200 --> 1:31:45,200 and you can't tell, a simple segmentation-based approach would just draw both of them.

804 1:31:45,200 --> 1:31:51,200 It's kind of a 2.5-lane situation, and the post-processing algorithm would hilariously fail when the predictions are like that.

805 1:31:51,200 --> 1:31:53,200 Yeah, and the problems don't end there.

806 1:31:53,200 --> 1:32:00,200 I mean, you need to predict these connective lanes inside of intersections, which is just not possible with the approach that Ashok's mentioning,

807 1:32:00,200 --> 1:32:02,200 which is why we had to upgrade to this sort of approach.

808 1:32:02,200 --> 1:32:05,200 Yeah, when it overlaps like this, segmentation would just go haywire.

809 1:32:05,200 --> 1:32:09,200 But even if you try very hard to put them on separate layers, it's just a really hard problem.

810 1:32:09,200 --> 1:32:19,200 But language just offers a really nice framework for getting a sample from a posterior, as opposed to trying to do all of this in post-processing.

811 1:32:19,200 --> 1:32:21,200 But this doesn't actually stop at just Autopilot, right, John?

812 1:32:21,200 --> 1:32:24,200 This can be used for Optimus.

813 1:32:24,200 --> 1:32:34,200 Yeah, I guess they wouldn't be called lanes, but you could imagine in this stage here that you might have paths that encode the possible places that people could walk.

814 1:32:34,200 --> 1:32:41,200 Yeah, basically if you're in a factory or in a home setting, you can just ask the robot, okay, please route me to the kitchen,

815 1:32:41,200 --> 1:32:48,200 or please route to some location in the factory, and then we predict a set of pathways that would go through the aisles, take the robot,

816 1:32:48,200 --> 1:32:50,200 and say, okay, this is how you get to the kitchen.

817 1:32:50,200 --> 1:33:00,200 It just really gives us a nice framework to model these different paths that simplify the navigation problem for the downstream planner.

818 1:33:00,200 --> 1:33:07,200 All right, so ultimately what we get from this lane detection network is a set of lanes and their connectivities, which comes directly from the network.

819 1:33:07,200 --> 1:33:13,200 There's no additional step here for converting these dense predictions into sparse ones.

820 1:33:13,200 --> 1:33:18,200 This is just the direct unfiltered output of the network.

821 1:33:18,200 --> 1:33:20,200 Okay, so I talked a little bit about lanes.

822 1:33:20,200 --> 1:33:26,200 I'm going to briefly touch on how we model and predict the future paths and other semantics on objects.

823 1:33:26,200 --> 1:33:29,200 So I'm just going to go really quickly through two examples.

824 1:33:29,200 --> 1:33:34,200 The video on the right here, we've got a car that's actually running a red light and turning in front of us.

825 1:33:34,200 --> 1:33:40,200 What we do to handle situations like this is we predict a set of short time horizon future trajectories on all objects.

826 1:33:40,200 --> 1:33:48,200 We can use these to anticipate the dangerous situation here and apply whatever braking and steering action is required to avoid a collision.

827 1:33:48,200 --> 1:33:51,200 In the video on the right, there's two vehicles in front of us.

828 1:33:51,200 --> 1:33:53,200 The one on the left lane is parked.

829 1:33:53,200 --> 1:33:55,200 Apparently it's being loaded or unloaded.

830 1:33:55,200 --> 1:33:57,200 I don't know why the driver decided to park there.

831 1:33:57,200 --> 1:34:02,200 But the important thing is that our neural network predicted that it was stopped, which is the red color there.

832 1:34:02,200 --> 1:34:06,200 The vehicle in the other lane, as you notice, also is stationary.

833 1:34:06,200 --> 1:34:08,200 But that one's obviously just waiting for that red light to turn green.

834 1:34:08,200 --> 1:34:19,200 So even though both objects are stationary and have zero velocity, it's the semantics that is really important here so that we don't get stuck behind that awkwardly parked car.

835 1:34:19,200 --> 1:34:24,200 Predicting all of these agent attributes presents some practical problems when trying to build a real time system.

836 1:34:24,200 --> 1:34:30,200 We need to maximize the frame rate of our object detection stack so that Autopilot can quickly react to the changing environment.

837 1:34:30,200 --> 1:34:32,200 Every millisecond really matters here.

838 1:34:32,200 --> 1:34:37,200 To minimize the inference latency, our neural network is split into two phases.

839 1:34:37,200 --> 1:34:42,200 In the first phase, we identify the locations in 3D space where agents exist.

840 1:34:42,200 --> 1:34:52,200 In the second stage, we then pull out tensors at those 3D locations, append them with additional data that's on the vehicle, and then we do the rest of the processing.

841 1:34:52,200 --> 1:35:01,200 This sparsification step allows the neural network to focus compute on the areas that matter most, which gives us superior performance for a fraction of the latency cost.
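
The two-phase sparsification idea can be illustrated with a NumPy toy: a cheap dense pass flags candidate cells, and the expensive per-agent work only touches those. Grid size, threshold, and feature width are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Phase 1: a cheap dense head scores every cell of the grid for
# "an agent is here" (all sizes here are illustrative).
grid_scores = rng.random((20, 20))
agent_cells = np.argwhere(grid_scores > 0.95)   # sparse set of candidate locations

# Phase 2: pull out feature tensors only at those locations and run the
# heavy per-agent processing there instead of over the whole grid.
features = rng.normal(size=(20, 20, 64))
per_agent = features[agent_cells[:, 0], agent_cells[:, 1]]  # (num_agents, 64)
```

The heavy compute now scales with the handful of agents present, not with the 400 grid cells, which is where the latency savings come from.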

842 1:35:01,200 --> 1:35:06,200 So putting it all together, the Autopilot Vision Stack predicts more than just the geometry and kinematics of the world.

843 1:35:06,200 --> 1:35:11,200 It also predicts a rich set of semantics, which enables safe and human-like driving.

844 1:35:11,200 --> 1:35:15,200 I'm now going to hand things off to Sri, who will tell us how we run all these cool neural networks on our FSD computer.

845 1:35:15,200 --> 1:35:16,200 Thank you.

846 1:35:16,200 --> 1:35:26,200 Hi, everyone. I'm Sri.

847 1:35:26,200 --> 1:35:34,200 Today I'm going to give a glimpse of what it takes to run these FSD networks in the car, and how we optimize for inference latency.

848 1:35:34,200 --> 1:35:41,200 Today I'm going to focus just on the FSD lanes network that John just talked about.

849 1:35:41,200 --> 1:35:53,200 So when we started this track, we wanted to know if we could run this FSD lanes network natively on the TRIP engine, which is our in-house neural network accelerator that we built into the FSD computer.

850 1:35:53,200 --> 1:36:02,200 When we built this hardware, we kept it simple and made sure it can do one thing ridiculously fast: dense dot products.

851 1:36:02,200 --> 1:36:14,200 But this architecture is autoregressive and iterative, where it crunches through multiple attention blocks in the inner loop, producing sparse points directly at every step.

852 1:36:14,200 --> 1:36:21,200 So the challenge here was, how can we do this sparse point prediction and sparse computation on a dense dot product engine?

853 1:36:21,200 --> 1:36:25,200 Let's see how we did this on the TRIP engine.

854 1:36:25,200 --> 1:36:32,200 So the network predicts a heat map of the most probable spatial locations of the point.

855 1:36:32,200 --> 1:36:41,200 Now we do an argmax and a one-hot operation, which gives the one-hot encoding of the index of the spatial location.

856 1:36:41,200 --> 1:36:49,200 Now we need to select the embedding associated with this index from an embedding table that is learned during training.

857 1:36:49,200 --> 1:37:02,200 To do this on TRIP, we actually built a lookup table in SRAM, and we engineered the dimensions of this embedding such that we could achieve all of these things with just matrix multiplication.

858 1:37:02,200 --> 1:37:12,200 Not just that, we also wanted to store this embedding into a token cache so that we don't recompute this for every iteration, rather reuse it for future point prediction.

859 1:37:12,200 --> 1:37:19,200 Again, we put some tricks here where we did all these operations just on the dot product engine.
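
The core trick, an embedding lookup expressed as a dense matrix multiply so it runs on a dot-product engine, is easy to show in NumPy. The vocabulary size, embedding width, and heat-map values are invented:

```python
import numpy as np

vocab, dim = 8, 4
# A learned embedding table; here just distinguishable dummy values.
table = np.arange(vocab * dim, dtype=np.float32).reshape(vocab, dim)

# Heat map over 8 possible locations; argmax picks index 2.
heatmap = np.array([0.1, 0.05, 0.7, 0.15, 0.0, 0.0, 0.0, 0.0])
one_hot = np.eye(vocab)[np.argmax(heatmap)]   # argmax -> one-hot encoding

# On hardware that only does dense dot products, "table[index]"
# becomes a (1 x vocab) @ (vocab x dim) matrix multiplication:
embedding = one_hot @ table
```

The result is identical to a direct indexed lookup, but every operation along the way is a dense multiply-accumulate, which is exactly what the accelerator is built for.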

860 1:37:19,200 --> 1:37:31,200 It's actually cool that our team found creative ways to map all these operations on the trip engine in ways that were not even imagined when this hardware was designed.

861 1:37:31,200 --> 1:37:34,200 But that's not the only thing we had to do to make this work.

862 1:37:34,200 --> 1:37:45,200 We actually implemented a whole lot of operations and features to make this model compilable, to improve the int8 accuracy, as well as to optimize performance.

863 1:37:45,200 --> 1:37:56,200 All of these things helped us run this 75 million parameter model just under 10 milliseconds of latency, consuming just 8 watts of power.
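
A quick back-of-envelope on the figures just quoted (75 million parameters, under 10 ms, 8 W), purely to put them in perspective:

```python
# Figures as quoted in the talk.
latency_s = 10e-3    # under 10 ms per inference
power_w = 8.0        # 8 watts

energy_per_inference_j = power_w * latency_s   # roughly 0.08 J per forward pass
max_frame_rate_hz = 1.0 / latency_s            # ~100 Hz if run back-to-back
```

So each pass of this network costs on the order of a tenth of a joule, and latency alone would allow roughly 100 frames per second if it ran back-to-back.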

864 1:37:56,200 --> 1:38:04,200 But this is not the only architecture running in the car. There are so many other architectures, modules, and networks we need to run in the car.

865 1:38:04,200 --> 1:38:13,200 To give a sense of scale, there are about a billion parameters of all the networks combined, producing around 1,000 neural network signals.

866 1:38:13,200 --> 1:38:24,200 So we need to make sure we optimize them jointly, such that we maximize compute utilization and throughput and minimize latency.

867 1:38:24,200 --> 1:38:32,200 So we built a compiler just for neural networks that shares its structure with traditional compilers.

868 1:38:32,200 --> 1:38:49,200 As you can see, it takes the massive graph of neural nets, with 150k nodes and 375k connections, partitions it into independent subgraphs, and compiles each of those subgraphs natively for the inference devices.

869 1:38:49,200 --> 1:38:57,200 Then we have a neural network linker, which shares its structure with a traditional linker, where we perform link-time optimization.

870 1:38:57,200 --> 1:39:10,200 There, we solve an offline optimization problem with compute, memory, and memory bandwidth constraints, so that it comes up with an optimized schedule that gets executed in the car.

871 1:39:10,200 --> 1:39:24,200 On the runtime, we designed a hybrid scheduling system, which basically does heterogeneous scheduling on one SoC and distributed scheduling across both the SoCs to run these networks in a model parallel fashion.

872 1:39:24,200 --> 1:39:49,200 To get 100 TOPS of compute utilization, we need to optimize across all the layers of software, right from tuning the network architecture and the compiler, all the way to implementing a low latency, high bandwidth RDMA link across both the SoCs, and in fact going even deeper, to understanding and optimizing the cache-coherent and non-coherent data paths of the accelerator in the SoC.

873 1:39:49,200 --> 1:39:59,200 This is a lot of optimization at every level in order to make sure we get the highest frame rate, as every millisecond counts here.

874 1:39:59,200 --> 1:40:08,200 And this is just the visualization of the neural networks that are running in the car. This is our digital brain, essentially.

875 1:40:08,200 --> 1:40:17,200 As you can see, these operations are nothing but matrix multiplications and convolutions, to name a few of the real operations running in the car.

876 1:40:17,200 --> 1:40:36,200 To train this network with a billion parameters, you need a lot of labeled data. So Egan is going to talk about how we achieve this with the auto labeling pipeline.

877 1:40:36,200 --> 1:40:38,200 Thank you, Sri.

878 1:40:38,200 --> 1:40:43,200 Hi, everyone. I'm Egan Zhang, and I'm leading Geometric Vision at Autopilot.

879 1:40:43,200 --> 1:40:48,200 So, yeah, let's talk about auto labeling.

880 1:40:48,200 --> 1:40:54,200 So we have several kinds of auto labeling frameworks to support various types of networks.

881 1:40:54,200 --> 1:40:59,200 But today, I'd like to focus on the awesome LanesNet here.

882 1:40:59,200 --> 1:41:12,200 So to successfully train and generalize this network to everywhere, we think we need tens of millions of trips, from probably one million intersections or even more.

883 1:41:12,200 --> 1:41:15,200 So then how to do that?

884 1:41:15,200 --> 1:41:28,200 So it is certainly achievable to source a sufficient amount of trips, because as Tim explained earlier, we already have a cache rate of around 500,000 trips per day.

885 1:41:28,200 --> 1:41:36,200 However, converting all of those data into a training form is a very challenging technical problem.

886 1:41:36,200 --> 1:41:50,200 To solve this challenge, we've tried various ways of manual and auto labeling. So from the first column to the second, from the second to the third, each advance provided us nearly 100x improvement in throughput.

887 1:41:50,200 --> 1:42:02,200 But still, we want an even better auto labeling machine that can provide us good quality, diversity, and scalability.

888 1:42:02,200 --> 1:42:14,200 To meet all these requirements, despite the huge amount of engineering effort required here, we've developed a new auto labeling machine powered by multi-trip reconstruction.

889 1:42:14,200 --> 1:42:24,200 So this can replace five million hours of manual labeling with just 12 hours on the cluster for labeling 10,000 trips.
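
It's worth doing the arithmetic on those numbers (all three are as quoted in the talk):

```python
# Figures as quoted: 5M manual hours replaced by 12 cluster hours for 10k trips.
manual_hours = 5_000_000
cluster_hours = 12
trips = 10_000

manual_hours_per_trip = manual_hours / trips       # 500 hours of human work per trip
wall_clock_speedup = manual_hours / cluster_hours  # ~417,000x in wall-clock terms
```

That implies roughly 500 hours of human labeling per trip replaced, and a wall-clock speedup on the order of four hundred thousand times, with the caveat that cluster hours and human hours are not directly comparable units of effort.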

890 1:42:24,200 --> 1:42:27,200 So how did we solve it? There are three big steps.

891 1:42:27,200 --> 1:42:34,200 The first step is high precision trajectory and structure recovery by multi-camera visual-inertial odometry.

892 1:42:34,200 --> 1:42:43,200 So here all the features, including ground surface, are inferred from videos by neural networks, then tracked and reconstructed in the vector space.

893 1:42:43,200 --> 1:42:56,200 So the typical drift rate of this in-car trajectory is about 1.3 centimeters per meter and 0.45 milliradians per meter, which is pretty decent considering its compact compute requirement.

894 1:42:56,200 --> 1:43:04,200 The recovered surface and road details are also used as strong guidance for the later manual verification step.

895 1:43:04,200 --> 1:43:13,200 This is also enabled in every FSD vehicle, so we get pre-processed trajectories and structures along with the trip data.

896 1:43:13,200 --> 1:43:21,200 The second step is multi-trip reconstruction, which is the big and core piece of this machine.

897 1:43:21,200 --> 1:43:31,200 So the video shows how the previously shown trip is reconstructed and aligned with other trips, basically other trips from different vehicles, not the same vehicle.

898 1:43:31,200 --> 1:43:40,200 So this is done by multiple internal steps like coarse alignment, pairwise matching, joint optimization, then further surface refinement.

899 1:43:40,200 --> 1:43:45,200 In the end, the human analyst comes in and finalizes the label.

900 1:43:45,200 --> 1:43:55,200 Each of these steps is already fully parallelized on the cluster, so the entire process usually takes just a couple of hours.

901 1:43:55,200 --> 1:44:01,200 The last step is actually auto labeling the new trips.

902 1:44:01,200 --> 1:44:10,200 So here we use the same multi-trip alignment engine, but only between pre-built reconstruction and each new trip.

903 1:44:10,200 --> 1:44:15,200 So it's much, much simpler than fully reconstructing all the clips altogether.

904 1:44:15,200 --> 1:44:24,200 That's why it only takes 30 minutes per trip to auto label instead of several hours of manual labeling.

905 1:44:24,200 --> 1:44:31,200 And this is also the key of scalability of this machine.

906 1:44:31,200 --> 1:44:38,200 This machine easily scales as long as we have available compute and trip data.

907 1:44:38,200 --> 1:44:43,200 So about 50 trips were newly auto labeled from this scene, and some of them are shown here.

908 1:44:43,200 --> 1:44:47,200 53 trips from different vehicles, to be exact.

909 1:44:47,200 --> 1:44:54,200 So this is how we capture and transform the space-time slices of the world into the network supervision.

910 1:44:54,200 --> 1:45:00,200 Yeah, one thing I'd like to note is that Egan just talked about how we auto label our lanes.

911 1:45:00,200 --> 1:45:06,200 We have auto labelers for almost every task that we do, including our planner, and many of these are fully automatic.

912 1:45:06,200 --> 1:45:13,200 There are no humans involved. For example, for objects, all the kinematics, the shapes, the future trajectories, everything just comes from auto labeling.

913 1:45:13,200 --> 1:45:17,200 And the same is true for occupancy, too. And we have really just built a machine around this.

914 1:45:17,200 --> 1:45:22,200 Yeah, so if you can go back one slide. One more.

915 1:45:22,200 --> 1:45:29,200 It says parallelized on cluster. So that sounds pretty straightforward, but it really wasn't.

916 1:45:29,200 --> 1:45:33,200 Maybe it's fun to share how something like this comes about.

917 1:45:33,200 --> 1:45:39,200 So a while ago, we didn't have any auto labeling at all. And then someone makes a script.

918 1:45:39,200 --> 1:45:45,200 It starts to work. It starts working better until we reach a volume that's pretty high, and we clearly need a solution.

919 1:45:45,200 --> 1:45:51,200 And so there were two other engineers in our team who were like, you know, that's an interesting thing.

920 1:45:51,200 --> 1:45:57,200 What we needed to do was build a whole graph of essentially Python functions that we need to run one after the other.

921 1:45:57,200 --> 1:46:01,200 First, you pull the clip, then you do some cleaning, then you do some network inference,

922 1:46:01,200 --> 1:46:06,200 then another network inference until you finally get this. But so you need to do this at a large scale.

923 1:46:06,200 --> 1:46:14,200 So I tell them, we probably need to shoot for, you know, 100,000 clips per day or like 100,000 items. That seems good.

924 1:46:14,200 --> 1:46:21,200 And so the engineers said, well, with a bit of Postgres and a bit of elbow grease, we can do it.

925 1:46:21,200 --> 1:46:28,200 Fast forward a bit, and we're now doing 20 million of these functions every single day.

926 1:46:28,200 --> 1:46:34,200 Again, we pull in around half a million clips and on those we run a ton of functions, each of these in a streaming fashion.

927 1:46:34,200 --> 1:46:40,200 And so that's kind of the back-end infra that's also needed to not just run training, but also auto labeling.
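
The pipeline Ashok describes, a graph of Python functions run one after another on each clip in streaming fashion, can be sketched as a toy. All stage names here are invented for illustration; the real system runs millions of such function invocations per day on a cluster:

```python
# A toy version of the label-factory pipeline: a fixed chain of Python
# functions applied to each clip, one clip at a time (streaming).
def pull_clip(clip_id):
    return {"id": clip_id, "frames": []}   # stand-in for fetching fleet data

def clean(clip):
    clip["cleaned"] = True                 # stand-in for data cleaning
    return clip

def run_inference(clip):
    clip["detections"] = []                # stand-in for a network inference pass
    return clip

PIPELINE = [clean, run_inference]          # the real graph has many more stages

def process(clip_ids):
    for cid in clip_ids:                   # streaming: one clip at a time
        clip = pull_clip(cid)
        for stage in PIPELINE:
            clip = stage(clip)
        yield clip

results = list(process(["clip-001", "clip-002"]))
```

Scaling this from a script to 20 million function runs a day is then a distributed-systems problem (queues, retries, yield tracking) rather than a change to the per-clip logic.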

928 1:46:40,200 --> 1:46:46,200 It really is like a factory that produces labels: production lines, yield, quality, inventory,

929 1:46:46,200 --> 1:46:52,200 all of the same concepts that apply to the factory for our cars apply to this label factory.

930 1:46:52,200 --> 1:46:55,200 That's right.

931 1:46:55,200 --> 1:46:58,200 OK, thanks, Tim and Ashok.

932 1:46:58,200 --> 1:47:06,200 So, yeah, concluding this section, I'd like to share a few more examples that are challenging and interesting for the network, for sure,

933 1:47:06,200 --> 1:47:15,200 and even for humans, probably. From the top, there are examples like a lack-of-light case, a foggy night, a roundabout,

934 1:47:15,200 --> 1:47:22,200 heavy occlusions by parked cars, and even a rainy night with raindrops on the camera lenses.

935 1:47:22,200 --> 1:47:27,200 These are challenging, but once their original scenes are fully reconstructed by other clips,

936 1:47:27,200 --> 1:47:34,200 all of them can be auto labeled so that our cars can drive even better through these challenging scenarios.

937 1:47:34,200 --> 1:47:47,200 So now let me pass the mic to David to learn more about how Sim is creating the new world on top of these labels. Thank you.

938 1:47:47,200 --> 1:47:51,200 Thank you, Yegan. My name is David and I'm going to talk about simulation.

939 1:47:51,200 --> 1:47:58,200 So simulation plays a critical role in providing data that is difficult to source and or hard to label.

940 1:47:58,200 --> 1:48:02,200 However, 3D scenes are notoriously slow to produce.

941 1:48:02,200 --> 1:48:10,200 Take, for example, the simulated scene playing behind me, a complex intersection from Market Street in San Francisco.

942 1:48:10,200 --> 1:48:13,200 It would take two weeks for artists to complete.

943 1:48:13,200 --> 1:48:16,200 And for us, that is painfully slow.

944 1:48:16,200 --> 1:48:22,200 However, I'm going to talk about using Yegan's automated ground truth labels along with some brand new tooling

945 1:48:22,200 --> 1:48:27,200 that allows us to procedurally generate this scene and many like it in just five minutes.

946 1:48:27,200 --> 1:48:31,200 That's an amazing one thousand times faster than before.

947 1:48:31,200 --> 1:48:36,200 So let's dive in to how a scene like this is created.

948 1:48:36,200 --> 1:48:43,200 We start by piping the automated ground truth labels into our simulated world creator tooling inside the software Houdini.

949 1:48:43,200 --> 1:48:50,200 Starting with road boundary labels, we can generate a solid road mesh and re-topologize it with the lane graph labels.

950 1:48:50,200 --> 1:48:57,200 This helps inform important road details like crossroad slope and detailed material blending.

951 1:48:57,200 --> 1:49:07,200 Next, we can use the line data and sweep geometry across its surface and project it to the road, creating lane paint decals.

952 1:49:07,200 --> 1:49:13,200 Next, using median edges, we can spawn median island geometry and populate it with randomized foliage.

953 1:49:13,200 --> 1:49:16,200 This drastically changes the visibility of the scene.

954 1:49:16,200 --> 1:49:21,200 Now, the outside world can be generated through a series of randomized heuristics.

955 1:49:21,200 --> 1:49:28,200 Modular building generators create visual obstructions while randomly placed objects like hydrants can change the color of the curbs,

956 1:49:28,200 --> 1:49:33,200 while trees can drop leaves below them, obscuring lines or edges.

957 1:49:33,200 --> 1:49:39,200 Next, we can bring in map data to inform positions of things like traffic lights or stop signs.

958 1:49:39,200 --> 1:49:48,200 We can trace along its normal to collect important information like number of lanes and even get accurate street names on the signs themselves.

959 1:49:48,200 --> 1:49:57,200 Next, using lane graph, we can determine lane connectivity and spawn directional road markings on the road and their accompanying road signs.

960 1:49:57,200 --> 1:50:06,200 And finally, with lane graph itself, we can determine lane adjacency and other useful metrics to spawn randomized traffic permutations inside our simulator.

961 1:50:06,200 --> 1:50:11,200 And again, this is all automatic, no artists in the loop, and happens within minutes.

962 1:50:11,200 --> 1:50:15,200 And now this sets us up to do some pretty cool things.

963 1:50:15,200 --> 1:50:23,200 Since everything is based on data and heuristics, we can start to fuzz parameters to create visual variations of the single ground truth.

964 1:50:23,200 --> 1:50:34,200 It can be as subtle as object placement and random material swapping to more drastic changes like entirely new biomes or locations of environment like urban, suburban, or rural.

965 1:50:34,200 --> 1:50:43,200 This allows us to create infinite targeted permutations for specific scenarios that we need more ground truth for.

966 1:50:43,200 --> 1:50:47,200 And all this happens within a click of a button.
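
The "fuzz parameters to create visual variations" idea can be sketched as a seeded randomizer over scene parameters. The parameter names, biome list, and ranges are invented; the point is that labels stay fixed while appearance varies:

```python
import random

# Illustrative scene parameters; the real generator exposes far more.
BIOMES = ["urban", "suburban", "rural"]
MATERIALS = ["asphalt_new", "asphalt_worn", "concrete"]

def fuzz_scene(ground_truth, seed):
    """One visual permutation of the same ground-truth lane graph."""
    rng = random.Random(seed)            # seeded, so permutations are reproducible
    return {
        **ground_truth,                  # geometry and labels stay fixed
        "biome": rng.choice(BIOMES),
        "road_material": rng.choice(MATERIALS),
        "object_jitter_m": rng.uniform(0.0, 0.5),
    }

gt = {"lane_graph": "market_street_tile"}   # hypothetical tile name
variants = [fuzz_scene(gt, seed=s) for s in range(100)]
```

Every variant carries identical supervision, so the network sees one hard ground truth under a hundred different appearances.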

967 1:50:47,200 --> 1:50:52,200 And we can even take this one step further by altering our ground truth itself.

968 1:50:52,200 --> 1:50:59,200 Say John wants his network to pay more attention to directional road markings to better detect an upcoming captive left turn lane.

969 1:50:59,200 --> 1:51:12,200 We can start to procedurally alter our lane graph inside the simulator, creating entirely new flows through this intersection to help focus the network's attention on the road markings and produce more accurate predictions.

970 1:51:12,200 --> 1:51:20,200 And this is a great example of how this tooling allows us to create new data that can never be collected from the real world.

971 1:51:20,200 --> 1:51:28,200 And the true power of this tool is in its architecture and how we can run all tasks in parallel to infinitely scale.

972 1:51:28,200 --> 1:51:35,200 So you saw the tile creator tool in action converting the ground truth labels into their counterparts.

973 1:51:35,200 --> 1:51:43,200 Next, we can use our tile extractor tool to divide this data into geohash tiles, about 150 meters square in size.

974 1:51:43,200 --> 1:51:47,200 We then save out that data into separate geometry and instance files.

975 1:51:47,200 --> 1:51:56,200 This gives us a clean source of data that's easy to load and allows us to be rendering engine agnostic for the future.

976 1:51:56,200 --> 1:52:02,200 Then, using a tile loader tool, we can summon any number of those cached tiles using a geohash ID.

977 1:52:02,200 --> 1:52:11,200 Currently we load about five-by-five or three-by-three tile sets, usually centered around fleet hotspots or interesting lane graph locations.

978 1:52:11,200 --> 1:52:23,200 And the tile loader also converts these tile sets into UAssets for consumption by the Unreal Engine, giving you the finished product from what you saw on the first slide.
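
The tile-addressing scheme can be sketched as follows. This is not a real geohash implementation; it fakes a tile key from a 150 m grid just to make the load-an-n-by-n-block-by-ID flow concrete, and the meters-per-degree constant is a rough latitude approximation:

```python
TILE_M = 150.0                 # ~150 m square tiles, as described in the talk
M_PER_DEG = 111_320.0          # rough meters per degree of latitude

def tile_id(lat, lon):
    """Fake tile key: which 150 m grid cell this point falls in."""
    row = int(lat * M_PER_DEG // TILE_M)
    col = int(lon * M_PER_DEG // TILE_M)
    return f"tile_{row}_{col}"

def load_grid(center_lat, center_lon, n=5):
    """Summon an n-by-n block of cached tiles around a hotspot."""
    half = n // 2
    step = TILE_M / M_PER_DEG  # one tile's width in degrees
    return [
        tile_id(center_lat + dr * step, center_lon + dc * step)
        for dr in range(-half, half + 1)
        for dc in range(-half, half + 1)
    ]

# Hypothetical fleet hotspot in San Francisco:
tiles = load_grid(37.7749, -122.4194, n=5)
```

Each key then maps to pre-saved geometry and instance files, which is what makes the loader rendering-engine agnostic.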

979 1:52:23,200 --> 1:52:26,200 And this really sets us up for size and scale.

980 1:52:26,200 --> 1:52:32,200 And as you can see on the map behind us, we can easily generate most of San Francisco's city streets.

981 1:52:32,200 --> 1:52:38,200 And this didn't take years or even months of work, but rather two weeks by one person.

982 1:52:38,200 --> 1:52:44,200 We can continue to manage and grow all this data using our PDG network inside of the tooling.

983 1:52:44,200 --> 1:52:50,200 This allows us to throw compute at it and regenerate all these tile sets overnight.

984 1:52:50,200 --> 1:53:04,200 This ensures all environments are consistent in quality and features, which is super important for training, since new ontologies and signals are constantly released.

985 1:53:04,200 --> 1:53:13,200 And this comes full circle, because we generated all these tile sets from ground truth data that contains all the weird intricacies of the real world.

986 1:53:13,200 --> 1:53:21,200 And we can combine that with the procedural visual and traffic variety to create limitless targeted data for the network to learn from.

987 1:53:21,200 --> 1:53:22,200 And that concludes the sim section.

988 1:53:22,200 --> 1:53:27,200 I'll pass it to Kate to talk about how we can use all this data to improve autopilot.

989 1:53:27,200 --> 1:53:37,200 Thanks, David.

990 1:53:37,200 --> 1:53:38,200 Hi, everyone.

991 1:53:38,200 --> 1:53:46,200 My name is Kate Park, and I'm here to talk about the data engine, which is the process by which we improve our neural networks via data.

992 1:53:46,200 --> 1:53:54,200 We're going to show you how we deterministically solve interventions via data and walk you through the life of this particular clip.

993 1:53:54,200 --> 1:54:04,200 In this scenario, autopilot is approaching a turn and incorrectly predicts that crossing vehicle as stopped for traffic and thus a vehicle that we would slow down for.

994 1:54:04,200 --> 1:54:07,200 In reality, there's nobody in the car.

995 1:54:07,200 --> 1:54:09,200 It's just awkwardly parked.

996 1:54:09,200 --> 1:54:17,200 We built this tooling to identify the mispredictions, correct the label and categorize this clip into an evaluation set.

997 1:54:17,200 --> 1:54:24,200 This particular clip happens to be one of 126 that we've diagnosed as challenging parked cars at turns.

998 1:54:24,200 --> 1:54:34,200 Because of this infra, we can curate this evaluation set without any engineering resources customized to this particular challenge case.

999 1:54:34,200 --> 1:54:39,200 To actually solve that challenge case requires mining thousands of examples like it.

1000 1:54:39,200 --> 1:54:42,200 And it's something Tesla can trivially do.

1001 1:54:42,200 --> 1:54:53,200 We simply use our data sourcing infra to request data, and use the tooling shown previously to correct the labels, surgically targeting the mispredictions of the current model.

1002 1:54:53,200 --> 1:54:58,200 We're only adding the most valuable examples to our training set.

1003 1:54:58,200 --> 1:55:02,200 We surgically fix 13,900 clips.

1004 1:55:02,200 --> 1:55:09,200 And because those were examples where the current model struggles, we don't even need to change the model architecture.

1005 1:55:09,200 --> 1:55:14,200 A simple weight update with this new valuable data is enough to solve the challenge case.
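
The data engine loop Kate describes (mine mispredictions, correct labels, update weights, repeat) can be shown with a deliberately tiny stand-in. The single feature, the threshold "model," and the update rule are all invented; this is not Tesla's actual vehicle-movement signal, just the shape of the loop:

```python
import random

rng = random.Random(0)

# Toy challenge case: classify "parked" from one invented feature x,
# with ground truth parked iff x > 0.7. The "model" is a threshold.
def make_clips(n):
    xs = [rng.uniform(0, 1) for _ in range(n)]
    return [(x, x > 0.7) for x in xs]

def mispredicted(clip, thr):
    x, label = clip
    return (x > thr) != label

threshold = 0.5                            # initial, imperfect "weights"

for _ in range(5):                         # repeated data-engine rounds
    clips = make_clips(1000)
    # 1. Mine the clips the current model gets wrong.
    hard = [c for c in clips if mispredicted(c, threshold)]
    if not hard:
        break
    # 2-3. Corrected labels are already in the toy data; the "weight
    # update" nudges the threshold toward the hard examples only.
    threshold += 0.5 * (sum(x for x, _ in hard) / len(hard) - threshold)

errors = sum(mispredicted(c, threshold) for c in make_clips(1000))
```

Note that the model architecture never changes; training only on freshly mined hard examples is what drives the error rate down, which mirrors the claim that data, not architecture, is the lever.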

1006 1:55:14,200 --> 1:55:21,200 So you see we no longer predict that crossing vehicle as stopped as shown in orange, but parked as shown in red.

1007 1:55:21,200 --> 1:55:25,200 In academia, we often see that people keep data constant.

1008 1:55:25,200 --> 1:55:28,200 But at Tesla, it's very much the opposite.

1009 1:55:28,200 --> 1:55:37,200 We see time and again that data is one of the best, if not the most deterministic, levers for solving these interventions.

1010 1:55:37,200 --> 1:55:42,200 We just showed you the data engine loop for one challenge case, namely these parked cars at turns.

1011 1:55:42,200 --> 1:55:47,200 But there are many challenge cases even for one signal of vehicle movement.

1012 1:55:47,200 --> 1:55:55,200 We apply this data engine loop to every single challenge case we've diagnosed, whether it's buses, curvy roads, stopped vehicles, parking lots.

1013 1:55:55,200 --> 1:55:57,200 And we don't just add data once.

1014 1:55:57,200 --> 1:56:01,200 We do this again and again to perfect the semantic.

1015 1:56:01,200 --> 1:56:13,200 In fact, this year we updated our vehicle movement signal five times and with every weight update trained on the new data, we push our vehicle movement accuracy up and up.

1016 1:56:13,200 --> 1:56:23,200 This data engine framework applies to all our signals, whether they're 3D, multicam video, whether the data is human labeled, auto labeled or simulated,

1017 1:56:23,200 --> 1:56:27,200 whether it's an offline model or an online model.

1018 1:56:27,200 --> 1:56:36,200 Tesla is able to do this at scale because of the fleet advantage, the infra that our engineering team has built, and the labeling resources that feed our networks.

1019 1:56:36,200 --> 1:56:40,200 To train on all this data, we need a massive amount of compute.

1020 1:56:40,200 --> 1:56:45,200 So I'll hand it off to Pete and Ganesh to talk about the Dojo supercomputing platform.

1021 1:56:45,200 --> 1:56:55,200 Thank you.

1022 1:56:55,200 --> 1:56:56,200 Thanks, everybody.

1023 1:56:56,200 --> 1:56:57,200 Thanks for hanging in there.

1024 1:56:57,200 --> 1:56:59,200 We're almost there.

1025 1:56:59,200 --> 1:57:00,200 My name is Pete Bannon.

1026 1:57:00,200 --> 1:57:05,200 I run the custom silicon and low voltage teams at Tesla.

1027 1:57:05,200 --> 1:57:07,200 And my name is Ganesh Venkataramanan.

1028 1:57:07,200 --> 1:57:14,200 I run the Dojo program.

1029 1:57:14,200 --> 1:57:16,200 Thank you.

1030 1:57:16,200 --> 1:57:21,200 I'm frequently asked, why is a car company building a supercomputer for training?

1031 1:57:21,200 --> 1:57:27,200 This question fundamentally misunderstands the nature of Tesla.

1032 1:57:27,200 --> 1:57:31,200 At its heart, Tesla is a hardcore technology company.

1033 1:57:31,200 --> 1:57:43,200 All across the company, people are working hard in science and engineering to advance the fundamental understanding and methods that we have available to build cars,

1034 1:57:43,200 --> 1:57:50,200 energy solutions, robots and anything else that we can do to improve the human condition around the world.

1035 1:57:50,200 --> 1:57:53,200 It's a super exciting thing to be a part of.

1036 1:57:53,200 --> 1:57:57,200 And it's a privilege to run a very small piece of it in the semiconductor group.

1037 1:57:57,200 --> 1:58:04,200 Tonight we're going to talk a little bit about Dojo and give you an update on what we've been able to do over the last year.

1038 1:58:04,200 --> 1:58:10,200 But before we do that, I wanted to give a little bit of background on the initial design that we started a few years ago.

1039 1:58:10,200 --> 1:58:17,200 When we got started, the goal was to provide a substantial improvement to the training latency for our autopilot team.

1040 1:58:17,200 --> 1:58:27,200 Some of the largest neural networks they train today run for over a month, which inhibits their ability to rapidly explore alternatives and evaluate them.

1041 1:58:27,200 --> 1:58:35,200 So, you know, a 30X speed up would be really nice if we could provide it at a cost competitive and energy competitive way.

1042 1:58:35,200 --> 1:58:43,200 To do that, we wanted to build a chip with a lot of arithmetic units that we could utilize at a very high efficiency.

1043 1:58:43,200 --> 1:58:51,200 And we spent a lot of time studying whether we could do that using DRAM, various packaging ideas, all of which failed.

1044 1:58:51,200 --> 1:59:02,200 And in the end, even though it felt like an unnatural act, we decided to reject DRAM as the primary storage medium for this system and instead focus on SRAM embedded in the chip.

1045 1:59:02,200 --> 1:59:09,200 SRAM provides, unfortunately, a modest amount of capacity, but extremely high bandwidth and very low latency.

1046 1:59:09,200 --> 1:59:13,200 And that enables us to achieve high utilization with the arithmetic units.
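The utilization argument here is essentially a roofline model: attainable throughput is capped by the lesser of the compute roof and memory bandwidth times the kernel's arithmetic intensity. A minimal sketch, with purely illustrative numbers (none of these are Dojo's actual figures):

```python
def roofline_flops(peak_flops, mem_bandwidth_bytes, arithmetic_intensity):
    """Attainable throughput = min(compute roof, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth_bytes * arithmetic_intensity)

# The same hypothetical 10 TFLOP/s engine, fed from slow vs. fast memory,
# on a kernel doing 20 FLOPs per byte moved:
slow = roofline_flops(10e12, 0.1e12, 20)   # bandwidth-bound: 2 TFLOP/s (20% util)
fast = roofline_flops(10e12, 1.0e12, 20)   # compute-bound: full 10 TFLOP/s
```

SRAM's bandwidth advantage over DRAM moves real workloads from the first regime toward the second, which is the "high utilization" being claimed here.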

1047 1:59:13,200 --> 1:59:20,200 Those choices, that particular choice led to a whole bunch of other choices.

1048 1:59:20,200 --> 1:59:24,200 For example, if you want to have virtual memory, you need page tables. They take up a lot of space.

1049 1:59:24,200 --> 1:59:28,200 We didn't have space, so no virtual memory.

1050 1:59:28,200 --> 1:59:30,200 We also don't have interrupts.

1051 1:59:30,200 --> 1:59:41,200 The accelerator is a bare-bones raw piece of hardware that's presented to a compiler, and the compiler is responsible for scheduling everything that happens in a deterministic way.

1052 1:59:41,200 --> 1:59:45,200 So there's no need or even desire for interrupts in the system.

1053 1:59:45,200 --> 1:59:53,200 We also chose to pursue model parallelism as a training methodology, which is not the typical situation.

1054 1:59:53,200 --> 2:00:01,200 Most machines today use data parallelism, which consumes additional memory capacity, which we obviously don't have.

1055 2:00:01,200 --> 2:00:10,200 So all of those choices led us to build a machine that is pretty radically different from what's available today.
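The memory argument for model parallelism can be made concrete with toy accounting: data parallelism keeps a full replica of the parameters on every accelerator, while model parallelism shards them. A hedged sketch with hypothetical sizes, not Dojo's real configuration:

```python
def per_device_param_memory(total_params, bytes_per_param, n_devices, scheme):
    """Toy memory accounting: 'data' parallelism replicates all parameters
    on each device; 'model' parallelism stores one shard per device."""
    if scheme == "data":
        return total_params * bytes_per_param              # full replica
    if scheme == "model":
        return total_params * bytes_per_param / n_devices  # one shard
    raise ValueError(f"unknown scheme: {scheme}")

# A hypothetical 1B-parameter model in 2-byte precision across 25 dies:
data_bytes  = per_device_param_memory(1e9, 2, 25, "data")   # 2 GB per die
model_bytes = per_device_param_memory(1e9, 2, 25, "model")  # 80 MB per die
```

With only a modest SRAM capacity per die, the 25x reduction on the model-parallel side is what makes the SRAM-first design workable.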

1056 2:00:10,200 --> 2:00:14,200 We also had a whole bunch of other goals. One of the most important ones was no limits.

1057 2:00:14,200 --> 2:00:20,200 So we wanted to build a compute fabric that would scale in an unbounded way, for the most part.

1058 2:00:20,200 --> 2:00:23,200 I mean, obviously, there's physical limits now and then.

1059 2:00:23,200 --> 2:00:29,200 But pretty much, if your model was too big for the computer, you just had to go buy a bigger computer.

1060 2:00:29,200 --> 2:00:31,200 That's what we were looking for.

1061 2:00:31,200 --> 2:00:41,200 Today, the way machines are packaged, there's a pretty fixed ratio of, for example, GPUs, CPUs, and DRAM capacity and network capacity.

1062 2:00:41,200 --> 2:00:54,200 We really wanted to disaggregate all that so that as models evolved, we could vary the ratios of those various elements and make the system more flexible to meet the needs of the autopilot team.

1063 2:00:54,200 --> 2:01:01,200 Yeah, and it's so true, Pete, like no limits philosophy was our guiding star all the way.

1064 2:01:01,200 --> 2:01:15,200 All of our choices were centered around that, and to the point that we didn't want traditional data center infrastructure to limit our capacity to execute these programs at speed.

1065 2:01:15,200 --> 2:01:31,200 That's why we vertically integrated the entire data center.

1066 2:01:31,200 --> 2:01:34,200 We could extract new levels of efficiency.

1067 2:01:34,200 --> 2:01:49,200 We could optimize power delivery, cooling, and as well as system management across the whole data center stack rather than doing box by box and integrating those boxes into data centers.

1068 2:01:49,200 --> 2:02:06,200 And to do this, we also wanted to integrate early to figure out limits of scale for our software workloads, so we integrated Dojo environment into our autopilot software very early, and we learned a lot of lessons.

1069 2:02:06,200 --> 2:02:25,200 And today, Bill Chang will go over our hardware update as well as some of the challenges that we faced along the way, and Rajiv Kurian will give you a glimpse of our compiler technology as well as go over some of our cool results.

1070 2:02:25,200 --> 2:02:31,200 Great.

1071 2:02:31,200 --> 2:02:34,200 Thanks, Pete. Thanks, Ganesh.

1072 2:02:34,200 --> 2:02:48,200 I'll start tonight with a high level vision of our system that will help set the stage for the challenges and the problems we're solving, and then also how software will then leverage this for performance.

1073 2:02:48,200 --> 2:03:08,200 Now, our vision for Dojo is to build a single unified accelerator, a very large one. Software would see a seamless compute plane with globally addressable, very fast memory, and all connected together with uniform high bandwidth and low latency.

1074 2:03:08,200 --> 2:03:23,200 Now, to realize this, we need to use density to achieve performance. Now, we leverage technology to get this density in order to break levels of hierarchy all the way from the chip to the scale out systems.

1075 2:03:23,200 --> 2:03:36,200 Now, silicon technology has done this for decades: chips have followed Moore's law, using density integration to get performance scaling.

1076 2:03:36,200 --> 2:03:53,200 Now, a key step in realizing that vision was our training tile. Not only can we integrate 25 dies at extremely high bandwidth, but we can scale that to any number of additional tiles by just connecting them together.

1077 2:03:53,200 --> 2:04:02,200 Now, last year, we showcased our first functional training tile, and at that time we already had workloads running on it.

1078 2:04:02,200 --> 2:04:10,200 And since then, the team here has been working hard and diligently to deploy this at scale.

1079 2:04:10,200 --> 2:04:19,200 Now, we've made amazing progress and had a lot of milestones along the way, and of course, we've had a lot of unexpected challenges.

1080 2:04:19,200 --> 2:04:27,200 But this is where our fail fast philosophy has allowed us to push our boundaries.

1081 2:04:27,200 --> 2:04:35,200 Now, pushing density for performance presents all new challenges. One area is power delivery.

1082 2:04:35,200 --> 2:04:43,200 Here, we need to deliver the power to our compute die, and this directly impacts our top line compute performance.

1083 2:04:43,200 --> 2:04:54,200 But we need to do this at unprecedented density. We need to be able to match our die pitch with a power density of almost one amp per millimeter squared.

1084 2:04:54,200 --> 2:05:01,200 And because of the extreme integration, this needs to be a multi-tiered vertical power solution.
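To get a feel for the scale of "one amp per millimeter squared": for a die in the several-hundred-mm² class, the vertical power solution has to carry hundreds of amps. A back-of-envelope sketch; the 645 mm² die area (D1's figure from the 2021 presentation) and the 0.8 V core rail are assumptions used only for illustration:

```python
def vertical_power_current(die_area_mm2, amps_per_mm2=1.0):
    """Current the vertical power stack must deliver under the die,
    given the power density quoted in the talk (~1 A/mm^2)."""
    return die_area_mm2 * amps_per_mm2

amps = vertical_power_current(645)   # 645 A for an assumed 645 mm^2 die
watts = amps * 0.8                   # ~516 W at an assumed 0.8 V core rail
```

Delivering that current through the XY plane would consume the routing reserved for inter-die bandwidth, which is why the solution has to come up vertically through the stack.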

1085 2:05:01,200 --> 2:05:12,200 And because there's a complex heterogeneous material stack up, we have to carefully manage the material transition, especially CTE.

1086 2:05:12,200 --> 2:05:16,200 Now, why does the coefficient of thermal expansion matter in this case?

1087 2:05:16,200 --> 2:05:27,200 CTE is a fundamental material property, and if it's not carefully managed, that stack up would literally rip itself apart.

1088 2:05:27,200 --> 2:05:38,200 So we started this effort by working with vendors to develop this power solution, but we realized that we actually had to develop this in-house.

1089 2:05:38,200 --> 2:05:47,200 Now, to balance schedule and risk, we built quick iterations to support both our system bring up and software development,

1090 2:05:47,200 --> 2:05:53,200 and also to find the optimal design and stack up that would meet our final production goals.

1091 2:05:53,200 --> 2:06:03,200 And in the end, we were able to reduce CTE by over 50 percent and improve our performance 3x over our initial version.

1092 2:06:03,200 --> 2:06:14,200 Now, needless to say, finding this optimal material stack up while maximizing performance at density is extremely difficult.

1093 2:06:14,200 --> 2:06:18,200 Now, we did have unexpected challenges along the way.

1094 2:06:18,200 --> 2:06:25,200 Here's an example where we pushed the boundaries of integration that led to component failures.

1095 2:06:25,200 --> 2:06:35,200 This started when we scaled up to larger and longer workloads, and then intermittently a single site on a tile would fail.

1096 2:06:35,200 --> 2:06:45,200 Now, they started out as recoverable failures, but as we pushed to much higher and higher power, these would become permanent failures.

1097 2:06:45,200 --> 2:06:52,200 Now, to understand this failure, you have to understand why and how we build our power modules.

1098 2:06:52,200 --> 2:06:59,200 Solving density at every level is the cornerstone of actually achieving our system performance.

1099 2:06:59,200 --> 2:07:08,200 Now, because our XY plane is used for high bandwidth communication, everything else must be stacked vertically.

1100 2:07:08,200 --> 2:07:14,200 This means all other components other than our die must be integrated into our power modules.

1101 2:07:14,200 --> 2:07:21,200 Now, that includes our clock and our power supplies and also our system controllers.

1102 2:07:21,200 --> 2:07:27,200 Now, in this case, the failures were due to losing clock output from our oscillators.

1103 2:07:27,200 --> 2:07:39,200 And after an extensive debug, we found that the root cause was due to vibrations on the module from piezoelectric effects on nearby capacitors.

1104 2:07:39,200 --> 2:07:45,200 Now, singing caps are not a new phenomenon and, in fact, very common in power design.

1105 2:07:45,200 --> 2:07:52,200 But normally clock chips are placed in a very quiet area of the board and often not affected by power circuits.

1106 2:07:52,200 --> 2:08:00,200 But because we needed to achieve this level of integration, these oscillators need to be placed in very close proximity.

1107 2:08:00,200 --> 2:08:13,200 Now, due to our switching frequency and the vibration resonance it created, there was out-of-plane vibration on our MEMS oscillator that caused it to crack.

1108 2:08:13,200 --> 2:08:16,200 Now, the solution to this problem is a multiprong approach.

1109 2:08:16,200 --> 2:08:22,200 We can reduce the vibration by using soft terminal caps.

1110 2:08:22,200 --> 2:08:30,200 We can update our MEMS part with a lower Q factor for the out of plane direction.

1111 2:08:30,200 --> 2:08:40,200 And we can also update our switching frequency to push the resonance further away from these sensitive bands.

1112 2:08:40,200 --> 2:08:49,200 Now, in addition to the density at the system level, we've been making a lot of progress at the infrastructure level.

1113 2:08:49,200 --> 2:08:59,200 We knew that we had to re-examine every aspect of the data center infrastructure in order to support our unprecedented power and cooling density.

1114 2:08:59,200 --> 2:09:06,200 We brought in a fully custom designed CDU to support DOJO's dense cooling requirements.

1115 2:09:06,200 --> 2:09:13,200 And the amazing part is we're able to do this at a fraction of the cost versus buying off the shelf and modifying it.

1116 2:09:13,200 --> 2:09:21,200 And since our DOJO cabinet integrates enough power and cooling to match an entire row of standard IT racks,

1117 2:09:21,200 --> 2:09:26,200 we need to carefully design our cabinet and infrastructure together.

1118 2:09:26,200 --> 2:09:31,200 And we've already gone through several iterations of this cabinet to optimize this.

1119 2:09:31,200 --> 2:09:36,200 And earlier this year, we started load testing our power and cooling infrastructure.

1120 2:09:36,200 --> 2:09:46,200 And we were able to push it over two megawatts before we tripped our substation and got a call from the city.

1121 2:09:46,200 --> 2:09:53,200 Now, last year, we introduced only a couple of components of our system, the custom D1 die and the training tile.

1122 2:09:53,200 --> 2:09:57,200 But we teased the ExaPOD as our end goal.

1123 2:09:57,200 --> 2:10:04,200 We'll walk through the remaining parts of our system that are required to build out this ExaPOD.

1124 2:10:04,200 --> 2:10:09,200 Now, the system tray is a key part of realizing our vision of a single accelerator.

1125 2:10:09,200 --> 2:10:17,200 It enables us to seamlessly connect tiles together, not only within the cabinet, but between cabinets.

1126 2:10:17,200 --> 2:10:23,200 We can connect these tiles at very tight spacing across the entire accelerator.

1127 2:10:23,200 --> 2:10:27,200 And this is how we achieve our uniform communication.

1128 2:10:27,200 --> 2:10:36,200 This is a laminate bus bar that allows us to integrate very high power, mechanical and thermal support, and an extremely dense integration.

1129 2:10:36,200 --> 2:10:43,200 It's 75 millimeters in height and supports six tiles at 135 kilograms.

1130 2:10:43,200 --> 2:10:52,200 This is the equivalent of three to four fully loaded high performance racks.

1131 2:10:52,200 --> 2:10:55,200 Next, we need to feed data to the training tiles.

1132 2:10:55,200 --> 2:10:59,200 This is where we've developed the Dojo interface processor.

1133 2:10:59,200 --> 2:11:05,200 It provides our system with high bandwidth DRAM to stage our training data.

1134 2:11:05,200 --> 2:11:15,200 And it provides full memory bandwidth to our training tiles using TTP, our custom protocol that we use to communicate across our entire accelerator.

1135 2:11:15,200 --> 2:11:21,200 It also has high speed Ethernet that helps us extend this custom protocol over standard Ethernet.

1136 2:11:21,200 --> 2:11:27,200 And we provide native hardware support for this with little to no software overhead.

1137 2:11:27,200 --> 2:11:36,200 And lastly, we can connect to it through a standard Gen4 PCIe interface.

1138 2:11:36,200 --> 2:11:43,200 Now, we pair 20 of these cards per tray, and that gives us 640 gigabytes of high bandwidth DRAM.

1139 2:11:43,200 --> 2:11:48,200 And this provides our disaggregated memory layer for our training tiles.

1140 2:11:48,200 --> 2:11:54,200 These cards are a high bandwidth ingest path, both through PCIe and Ethernet.

1141 2:11:54,200 --> 2:12:04,200 They also provide a high-rate XZ connectivity path that allows shortcuts across our large Dojo accelerator.

1142 2:12:04,200 --> 2:12:09,200 Now, we actually integrate the host directly underneath our system tray.

1143 2:12:09,200 --> 2:12:16,200 These hosts provide our ingest processing and connect to our interface processors through PCIe.

1144 2:12:16,200 --> 2:12:23,200 These hosts can provide hardware video decoder support for video-based training.

1145 2:12:23,200 --> 2:12:35,200 And our user applications land on these hosts, so we can provide them with the standard x86 Linux environment.

1146 2:12:35,200 --> 2:12:52,200 Now, we can put two of these assemblies into one cabinet and pair it with redundant power supplies that do direct conversion of 3-phase 480 volt AC power to 52 volt DC power.

1147 2:12:52,200 --> 2:13:00,200 Now, by focusing on density at every level, we can realize the vision of a single accelerator.

1148 2:13:00,200 --> 2:13:09,200 Now, starting with the uniform nodes on our custom D1 die, we can connect them together in our fully integrated training tile,

1149 2:13:09,200 --> 2:13:17,200 and then finally seamlessly connecting them across cabinet boundaries to form our Dojo accelerator.

1150 2:13:17,200 --> 2:13:26,200 And altogether, we can house two full accelerators in our ExaPOD for a combined one exaflop of ML compute.
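The figures given across the talk can be sanity-checked: 6 tiles per tray, a tray later described as 54 petaflops, and two tray assemblies per cabinet. The cabinet count per ExaPOD is an inference from those numbers, not something stated explicitly:

```python
# Arithmetic check on the numbers quoted in the presentation.
pflops_per_tile = 54 / 6              # 9 PFLOPS per training tile
pflops_per_cabinet = 54 * 2           # two trays per cabinet -> 108 PFLOPS
cabinets_for_exaflop = 1000 / pflops_per_cabinet   # ~9.3 -> roughly 10 cabinets
```

So "one exaflop" is consistent with an ExaPOD on the order of ten cabinets, with each cabinet matching the power and cooling of an entire row of standard IT racks as described earlier.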

1151 2:13:26,200 --> 2:13:35,200 Now, altogether, this amount of technology and integration has only ever been done a couple of times in the history of compute.

1152 2:13:35,200 --> 2:13:48,200 Next, we'll see how software can leverage this to accelerate their performance.

1153 2:13:48,200 --> 2:13:53,200 Thanks, Bill. My name is Rajiv, and I'm going to talk some numbers.

1154 2:13:53,200 --> 2:14:01,200 The software stack begins with the PyTorch extension that speaks to our commitment to run standard PyTorch models out of the box.

1155 2:14:01,200 --> 2:14:08,200 We're going to talk more about our JIT compiler and the ingest pipeline that feeds the hardware with data.

1156 2:14:08,200 --> 2:14:14,200 Abstractly, performance is tops times utilization times accelerator occupancy.

1157 2:14:14,200 --> 2:14:22,200 We've seen how the hardware provides peak performance; it's the job of the compiler to extract utilization from the hardware while code is running on it.

1158 2:14:22,200 --> 2:14:30,200 It's the job of the ingest pipeline to make sure that data can be fed at a throughput high enough for the hardware to never starve.
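The factorization "performance = TOPS × utilization × occupancy" is simple enough to write down directly. A minimal sketch with illustrative numbers (a hypothetical 100-TOPS accelerator; the 4% and 97% occupancy figures are the ones quoted later for the data-loader fix):

```python
def effective_tops(peak_tops, utilization, occupancy):
    """Delivered throughput as the product of the three factors:
    peak hardware TOPS, compiler-extracted utilization, and the fraction
    of time the ingest pipeline actually keeps the accelerator busy."""
    return peak_tops * utilization * occupancy

before = effective_tops(100, 0.5, 0.04)   # starved: ~2 effective TOPS
after  = effective_tops(100, 0.5, 0.97)   # fed:   ~48.5 effective TOPS
```

The point of the factorization is that any one factor near zero wrecks delivered performance, which is why ingest gets its own section below.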

1159 2:14:30,200 --> 2:14:34,200 So let's talk about why communication-bound models are difficult to scale.

1160 2:14:34,200 --> 2:14:39,200 But before that, let's look at why ResNet-50-like models are easier to scale.

1161 2:14:39,200 --> 2:14:44,200 You start off with a single accelerator, run the forward and backward passes, followed by the optimizer.

1162 2:14:44,200 --> 2:14:49,200 Then to scale this up, you run multiple copies of this on multiple accelerators.

1163 2:14:49,200 --> 2:14:54,200 The gradients produced by the backward pass do need to be reduced, and this introduces some communication.

1164 2:14:54,200 --> 2:15:00,200 This can be pipelined with the backward pass.

1165 2:15:00,200 --> 2:15:05,200 This setup scales fairly well, almost linearly.
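The data-parallel recipe just described (independent forward/backward passes, then a gradient all-reduce before the optimizer step) can be sketched with a scalar "model" per replica. This is a toy, framework-free illustration; real systems pipeline the reduction with the backward pass as noted above:

```python
def data_parallel_step(replicas, batches, grad_fn, lr=0.1):
    """Each replica computes gradients on its own batch; the gradients are
    all-reduced (here: averaged), and every replica applies the identical
    update, so the copies stay in sync."""
    grads = [grad_fn(w, b) for w, b in zip(replicas, batches)]
    avg = sum(grads) / len(grads)            # the all-reduce
    return [w - lr * avg for w in replicas]  # same update on every replica

# Toy quadratic loss (w - b)^2 per replica, so grad = 2 * (w - b):
new = data_parallel_step([1.0, 1.0], [0.0, 2.0], lambda w, b: 2 * (w - b))
```

Because the only cross-replica traffic is the gradient average, throughput scales almost linearly with replica count, which is exactly why ResNet-50-class models are the easy case.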

1166 2:15:05,200 --> 2:15:11,200 For models with much larger activations, we run into a problem as soon as we want to run the forward pass.

1167 2:15:11,200 --> 2:15:16,200 The batch size that fits in a single accelerator is often smaller than the batch norm surface.

1168 2:15:16,200 --> 2:15:22,200 To get around this, researchers typically run this setup on multiple accelerators in sync batch norm mode.

1169 2:15:22,200 --> 2:15:30,200 This introduces latency-bound communication to the critical path of the forward pass, and we already have a communication bottleneck.

1170 2:15:30,200 --> 2:15:36,200 And while there are ways to get around this, they usually involve tedious manual work best suited for a compiler.

1171 2:15:36,200 --> 2:15:45,200 And ultimately, there's no skirting around the fact that if your state does not fit in a single accelerator, you can be communication-bound.

1172 2:15:45,200 --> 2:15:52,200 Even with significant efforts from our ML engineers, we see such models don't scale linearly.

1173 2:15:52,200 --> 2:15:57,200 The Dojo system was built to make such models work at high utilization.

1174 2:15:57,200 --> 2:16:02,200 The high density integration was built to not only accelerate the compute-bound portions of a model,

1175 2:16:02,200 --> 2:16:13,200 but also the latency-bound portions like a batch norm or the bandwidth-bound portions like a gradient all reduced or a parameter all gathered.

1176 2:16:13,200 --> 2:16:18,200 A slice of the Dojo mesh can be carved out to run any model.

1177 2:16:18,200 --> 2:16:25,200 The only thing users need to do is to make the slice large enough to fit a batch norm surface for their particular model.

1178 2:16:25,200 --> 2:16:30,200 After that, the partition presents itself as one large accelerator,

1179 2:16:30,200 --> 2:16:39,200 freeing the users from having to worry about the internal details of execution; it's the job of the compiler to maintain this abstraction.

1180 2:16:39,200 --> 2:16:47,200 Fine-grained synchronization primitives and uniform low latency make it easy to accelerate all forms of parallelism across integration boundaries.

1181 2:16:47,200 --> 2:16:53,200 Tensors are usually stored sharded in SRAM and replicated just in time for a layer's execution.

1182 2:16:53,200 --> 2:16:57,200 We depend on the high Dojo bandwidth to hide this replication time.

1183 2:16:57,200 --> 2:17:07,200 Tensor replication and other data transfers are overlapped with compute, and the compiler can also recompute layers when it's profitable to do so.
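The shard-then-replicate pattern can be shown with plain lists standing in for tensors. A hedged sketch of the idea only; on the real system the gather runs over the mesh fabric and is overlapped with compute, which this sequential toy cannot show:

```python
def shard(tensor, n_dies):
    """Store a flat tensor sharded: each die's SRAM holds one slice."""
    k = len(tensor) // n_dies
    return [tensor[i * k:(i + 1) * k] for i in range(n_dies)]

def replicate_just_in_time(shards):
    """Gather the full tensor right before the layer that needs it runs;
    high mesh bandwidth is what hides this replication time."""
    return [x for s in shards for x in s]

weights = list(range(100))
shards = shard(weights, 25)            # 4 elements per die across 25 dies
full = replicate_just_in_time(shards)  # transient full copy for the layer
```

Sharding keeps the steady-state SRAM footprint at 1/25th of the tensor, while the just-in-time gather gives each layer the full view it needs.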

1184 2:17:07,200 --> 2:17:10,200 We expect most models to work out of the box.

1185 2:17:10,200 --> 2:17:16,200 As an example, we took the recently released Stable Diffusion model and got it running on Dojo in minutes.

1186 2:17:16,200 --> 2:17:22,200 Out of the box, the compiler was able to map it in a model parallel manner on 25 Dojo dies.

1187 2:17:22,200 --> 2:17:29,200 Here are some pictures of a Cybertruck on Mars generated by Stable Diffusion running on Dojo.

1188 2:17:29,200 --> 2:17:42,200 Looks like it still has some ways to go before matching the Tesla Design Studio team.

1189 2:17:42,200 --> 2:17:46,200 So we've talked about how communication bottlenecks can hamper scalability.

1190 2:17:46,200 --> 2:17:52,200 Perhaps an acid test of a compiler and the underlying hardware is executing a cross-die batch norm layer.

1191 2:17:52,200 --> 2:17:55,200 Like mentioned before, this can be a serial bottleneck.

1192 2:17:55,200 --> 2:18:01,200 The communication phase of a batch norm begins with nodes computing their local mean and standard deviations,

1193 2:18:01,200 --> 2:18:09,200 then coordinating to reduce these values, then broadcasting these values back, and then they resume their work in parallel.
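The reduce-then-broadcast sequence just described amounts to combining per-node partial sums into global statistics. A minimal sketch of the math, using lists of lists in place of per-node activations (the real coordination across 8,750 nodes is of course done by the hardware, not a loop):

```python
def sync_batchnorm_stats(local_batches):
    """Each node computes local sums and sums of squares; an all-reduce
    combines them; the global mean/std is then broadcast back so every
    node can resume normalizing in parallel."""
    n = sum(len(b) for b in local_batches)                       # reduce: counts
    s = sum(sum(b) for b in local_batches)                       # reduce: sums
    ss = sum(sum(x * x for x in b) for b in local_batches)       # reduce: sq sums
    mean = s / n
    var = ss / n - mean * mean
    return mean, var ** 0.5                                      # broadcast

mean, std = sync_batchnorm_stats([[1.0, 2.0], [3.0, 4.0]])       # mean = 2.5
```

Because this reduce/broadcast sits on the forward pass's critical path, it is latency that matters here, which is what the 5-microsecond result below is measuring.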

1194 2:18:09,200 --> 2:18:13,200 So what would an ideal batch norm look like on 25 Dojo dies?

1195 2:18:13,200 --> 2:18:19,200 Let's say the previous layer's activations are already split across dies.

1196 2:18:19,200 --> 2:18:26,200 We would expect the 350 nodes on each die to coordinate and produce die local mean and standard deviation values.

1197 2:18:26,200 --> 2:18:33,200 Ideally, these would get further reduced with the final value ending somewhere towards the middle of the tile.

1198 2:18:33,200 --> 2:18:38,200 We would then hope to see a broadcast of this value radiating from the center.

1199 2:18:38,200 --> 2:18:43,200 Let's see how the compiler actually executes a real batch norm operation across 25 dies.

1200 2:18:43,200 --> 2:18:49,200 The communication trees were extracted from the compiler, and the timing is from a real hardware run.

1201 2:18:49,200 --> 2:18:59,200 We're about to see 8,750 nodes on 25 dies coordinating to reduce and then broadcast the batch norm mean and standard deviation values.

1202 2:18:59,200 --> 2:19:05,200 Die local reduction followed by global reduction towards the middle of the tile,

1203 2:19:05,200 --> 2:19:14,200 then the reduced value broadcast radiating from the middle accelerated by the hardware's broadcast facility.

1204 2:19:14,200 --> 2:19:19,200 This operation takes only 5 microseconds on 25 Dojo dies.

1205 2:19:19,200 --> 2:19:24,200 The same operation takes 150 microseconds on 24 GPUs.

1206 2:19:24,200 --> 2:19:28,200 This is an order of magnitude improvement over GPUs.

1207 2:19:28,200 --> 2:19:32,200 And while we talked about an all-reduce operation in the context of a batch norm,

1208 2:19:32,200 --> 2:19:38,200 it's important to reiterate that the same advantages apply to all other communication primitives,

1209 2:19:38,200 --> 2:19:42,200 and these primitives are essential for large-scale training.

1210 2:19:42,200 --> 2:19:45,200 So how about full model performance?

1211 2:19:45,200 --> 2:19:50,200 So while we think that ResNet-50 is not a good representation of real-world Tesla workloads,

1212 2:19:50,200 --> 2:19:53,200 it is a standard benchmark, so let's start there.

1213 2:19:53,200 --> 2:19:57,200 We are already able to match the A100 die for die.

1214 2:19:57,200 --> 2:20:04,200 However, perhaps a hint of Dojo's capabilities is that we're able to hit this number with just a batch of 8 per die.

1215 2:20:04,200 --> 2:20:08,200 But Dojo was really built to tackle larger complex models.

1216 2:20:08,200 --> 2:20:14,200 So when we set out to tackle real-world workloads, we looked at the usage patterns of our current GPU cluster,

1217 2:20:14,200 --> 2:20:17,200 and two models stood out, the autolabeling networks,

1218 2:20:17,200 --> 2:20:20,200 a class of offline models that are used to generate ground truth,

1219 2:20:20,200 --> 2:20:23,200 and the occupancy networks that you heard about.

1220 2:20:23,200 --> 2:20:28,200 The autolabeling networks are large models that have high arithmetic intensity,

1221 2:20:28,200 --> 2:20:31,200 while the occupancy networks can be ingest-bound.

1222 2:20:31,200 --> 2:20:36,200 We chose these models because together they account for a large chunk of our current GPU cluster usage,

1223 2:20:36,200 --> 2:20:41,200 and they would challenge the system in different ways.

1224 2:20:41,200 --> 2:20:44,200 So how do we do on these two networks?

1225 2:20:44,200 --> 2:20:49,200 The results we're about to see were measured on multi-die systems for both the GPU and Dojo,

1226 2:20:49,200 --> 2:20:52,200 but normalized to per die numbers.

1227 2:20:52,200 --> 2:20:57,200 On our autolabeling network, we're already able to surpass the performance of an A100

1228 2:20:57,200 --> 2:21:01,200 with our current hardware running on our older generation VRMs.

1229 2:21:01,200 --> 2:21:07,200 On our production hardware with our newer VRMs, that translates to doubling the throughput of an A100.

1230 2:21:07,200 --> 2:21:10,200 And our models show that with some key compiler optimizations,

1231 2:21:10,200 --> 2:21:15,200 we could get to more than 3x the performance of an A100.

1232 2:21:15,200 --> 2:21:19,200 We see even bigger leaps on the occupancy network.

1233 2:21:19,200 --> 2:21:24,200 Almost 3x with our production hardware with room for more.

1234 2:21:24,200 --> 2:21:34,200 So what does that mean for Tesla?

1235 2:21:34,200 --> 2:21:37,200 With our current level of compiler performance,

1236 2:21:37,200 --> 2:21:47,200 we could replace the ML compute of 1, 2, 3, 4, 5, 6 GPU boxes with just a single Dojo tile.

1237 2:21:47,200 --> 2:21:58,200 And this Dojo tile costs less than one of these GPU boxes.

1238 2:21:58,200 --> 2:22:09,200 What it really means is that networks that took more than a month to train now take less than a week.

1239 2:22:09,200 --> 2:22:16,200 Alas, when we measure things, it did not turn out so well.

1240 2:22:16,200 --> 2:22:20,200 At the PyTorch level, we did not see our expected performance out of the gate.

1241 2:22:20,200 --> 2:22:23,200 And this timeline chart shows our problem.

1242 2:22:23,200 --> 2:22:28,200 The teeny tiny little green bars, that's the compile code running on the accelerator.

1243 2:22:28,200 --> 2:22:35,200 The row is mostly white space where the hardware is just waiting for data.

1244 2:22:35,200 --> 2:22:41,200 With our dense ML compute, Dojo hosts effectively have 10x more ML compute than the GPU hosts.

1245 2:22:41,200 --> 2:22:48,200 The data loaders running on this one host simply can't keep up with all that ML hardware.

1246 2:22:48,200 --> 2:22:54,200 So to solve our data loader scalability issues, we knew we had to get over the limit of this single host.

1247 2:22:54,200 --> 2:23:00,200 The Tesla transport protocol moves data seamlessly across host, tiles, and ingest processors.

1248 2:23:00,200 --> 2:23:04,200 So we extended the Tesla transport protocol to work over Ethernet.

1249 2:23:04,200 --> 2:23:09,200 We then built the Dojo network interface card, the DNIC, to leverage TTP over Ethernet.

1250 2:23:09,200 --> 2:23:16,200 This allows any host with a DNIC card to DMA to and from other TTP endpoints.

1251 2:23:16,200 --> 2:23:19,200 So we started with the Dojo mesh.

1252 2:23:19,200 --> 2:23:25,200 Then we added a tier of data loading hosts equipped with the DNIC card.

1253 2:23:25,200 --> 2:23:29,200 We connected these hosts to the mesh via an Ethernet switch.

1254 2:23:29,200 --> 2:23:38,200 Now every host in this data loading tier is capable of reaching all TTP endpoints in the Dojo mesh via hardware-accelerated DMA.

1255 2:23:38,200 --> 2:23:45,200 After these optimizations went in, our occupancy went from 4% to 97%.

1256 2:23:45,200 --> 2:23:57,200 So the data loading stalls have reduced drastically, and the ML hardware is kept busy.

1257 2:23:57,200 --> 2:24:01,200 We actually expect this number to go to 100% pretty soon.

1258 2:24:01,200 --> 2:24:09,200 After these changes went in, we saw the full expected speed up from the PyTorch layer, and we were back in business.
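The occupancy jump can be captured in a toy supply-and-demand model: the accelerator is busy only as often as the ingest tier can supply samples. The rates below are illustrative assumptions, chosen only to reproduce the 4% figure quoted above; they are not measured Dojo numbers:

```python
def accelerator_occupancy(loader_hosts, samples_per_host_per_sec, consume_rate):
    """Fraction of time the accelerator has data, capped at 100%."""
    supply = loader_hosts * samples_per_host_per_sec
    return min(1.0, supply / consume_rate)

# One host feeding compute that consumes 25x what a single host delivers,
# vs. a DNIC-connected tier of 25 data-loading hosts:
single = accelerator_occupancy(1, 1000, 25000)    # 0.04 -> the 4% above
tier   = accelerator_occupancy(25, 1000, 25000)   # 1.0  -> fully fed
```

The model also shows why the fix had to be horizontal scaling over Ethernet rather than a faster single host: occupancy is linear in the number of loader hosts until the cap.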

1259 2:24:09,200 --> 2:24:17,200 So we started with hardware design that breaks through traditional integration boundaries in service of our vision of a single giant accelerator.

1260 2:24:17,200 --> 2:24:21,200 We've seen how the compiler and ingest layers build on top of that hardware.

1261 2:24:21,200 --> 2:24:28,200 So after proving our performance on these complex real-world networks, we knew what our first large-scale deployment would target.

1262 2:24:28,200 --> 2:24:32,200 Our high arithmetic intensity auto labeling networks.

1263 2:24:32,200 --> 2:24:37,200 Today that occupies 4,000 GPUs over 72 GPU racks.

1264 2:24:37,200 --> 2:24:53,200 With our dense compute and our high performance, we expect to provide the same throughput with just four Dojo cabinets.

1265 2:24:53,200 --> 2:25:00,200 And these four Dojo cabinets will be part of our first ExaPOD that we plan to build by quarter one of 2023.

1266 2:25:00,200 --> 2:25:10,200 This will more than double Tesla's auto labeling capacity.

1267 2:25:10,200 --> 2:25:20,200 The first ExaPOD is part of a total of seven ExaPODs that we plan to build in Palo Alto right here across the wall.

1268 2:25:20,200 --> 2:25:27,200 And we have a display cabinet from one of these ExaPODs for everyone to look at.

1269 2:25:27,200 --> 2:25:39,200 Six tiles densely packed on a tray, 54 petaflops of compute, 640 gigabytes of high bandwidth memory, with power and hosts beneath it.

1270 2:25:39,200 --> 2:25:44,200 A lot of compute.

1271 2:25:44,200 --> 2:25:51,200 And we're building out new versions of all our cluster components and constantly improving our software to hit new limits of scale.

1272 2:25:51,200 --> 2:25:58,200 We believe that we can get another 10x improvement with our next generation hardware.

1273 2:25:58,200 --> 2:26:02,200 And to realize our ambitious goals, we need the best software and hardware engineers.

1274 2:26:02,200 --> 2:26:05,200 So please come talk to us or visit tesla.com.

1275 2:26:05,200 --> 2:26:27,200 Thank you.

1276 2:26:27,200 --> 2:26:35,200 All right. So hopefully that was enough detail.

1277 2:26:35,200 --> 2:26:38,200 And now we can move to questions.

1278 2:26:38,200 --> 2:26:46,200 And guys, I think the team can come out on stage.

1279 2:26:46,200 --> 2:26:58,200 We really wanted to show the depth and breadth of Tesla in artificial intelligence, compute hardware, robotics actuators,

1280 2:26:58,200 --> 2:27:09,200 and try to really shift the perception of the company away from, you know, a lot of people think we're like just a car company or we make cool cars, whatever.

1281 2:27:09,200 --> 2:27:19,200 Most people have no idea that Tesla is arguably the leader in real-world AI hardware and software.

1282 2:27:19,200 --> 2:27:32,200 And that we're building what is arguably the first, the most radical computer architecture since the Cray-1 supercomputer.

1283 2:27:32,200 --> 2:27:43,200 And I think if you're interested in developing some of the most advanced technology in the world that's going to really affect the world in a positive way, Tesla is the place to be.

1284 2:27:43,200 --> 2:27:48,200 So, yeah, let's fire away with some questions.

1285 2:27:48,200 --> 2:27:55,200 I think there's a mic at the front and a mic at the back.

1286 2:27:55,200 --> 2:28:03,200 Just throw mics at people. Jump off on the mic.

1287 2:28:03,200 --> 2:28:08,200 Hi, thank you very much. I was impressed here.

1288 2:28:08,200 --> 2:28:15,200 Yeah, I was impressed very much by Optimus, but I wonder about the hand.

1289 2:28:15,200 --> 2:28:18,200 Why did you choose a tendon-driven approach for the hand?

1290 2:28:18,200 --> 2:28:26,200 Because tendons are not very durable. And why spring-loaded?

1291 2:28:26,200 --> 2:28:29,200 Hello, is this working? Cool. Awesome. Yes, that's a great question.

1292 2:28:29,200 --> 2:28:38,200 You know, when it comes to any type of actuation scheme, there's tradeoffs between, you know, whether or not it's a tendon driven system or some type of linkage based system.

1293 2:28:38,200 --> 2:28:39,200 Keep the mic close to your mouth.

1294 2:28:39,200 --> 2:28:40,200 A little closer.

1295 2:28:40,200 --> 2:28:41,200 Yeah.

1296 2:28:41,200 --> 2:28:43,200 Hear me? Cool.

1297 2:28:43,200 --> 2:28:55,200 So, yeah, the main reason why we went for a tendon-based system is that, you know, first we actually investigated some synthetic tendons, but we found that metallic Bowden cables are, you know, a lot stronger.

1298 2:28:55,200 --> 2:29:01,200 One of the advantages of these cables is that it's very good for part reduction.

1299 2:29:01,200 --> 2:29:09,200 We do want to make a lot of these hands. So having a bunch of parts, a bunch of small linkages ends up being, you know, a problem when you're making a lot of something.

1300 2:29:09,200 --> 2:29:17,200 One of the big reasons that, you know, tendons are better than linkages, in a sense, is that they can be anti-backlash.

1301 2:29:17,200 --> 2:29:25,200 Anti-backlash essentially, you know, allows you to not have any gaps or, you know, stuttering motion in your fingers.

1302 2:29:25,200 --> 2:29:32,200 Spring-loaded: mainly, what spring loading allows us to do is avoid needing active opening.

1303 2:29:32,200 --> 2:29:43,200 So instead of having to have two actuators to drive the fingers closed and then open, we have the ability to, you know, have the tendon drive them closed and then the springs passively extend.

1304 2:29:43,200 --> 2:29:50,200 And this is something that's seen in our hands as well, right? We have the ability to actively flex and then we also have the ability to extend.
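The single-actuator, spring-return scheme described here can be sketched as a static torque balance: the tendon closes the finger when its torque exceeds the spring's, and releasing tension lets the spring extend it. Everything below (function name, constants) is illustrative, not Tesla's actual design:

```python
# Minimal sketch of a single-actuator, spring-return finger joint.
def finger_angle(tendon_tension, moment_arm=0.01, spring_rate=0.5,
                 preload=0.05, max_angle=1.6):
    """Static equilibrium: tendon torque vs. torsion-spring return torque.

    tendon_tension : N       (pulling closes the finger)
    moment_arm     : m       (tendon routing radius at the joint)
    spring_rate    : N*m/rad (passive extension spring)
    preload        : N*m     (spring preload holds the finger open at rest)
    """
    closing_torque = tendon_tension * moment_arm
    # Angle where closing torque balances spring torque; clamp to joint limits.
    angle = (closing_torque - preload) / spring_rate
    return max(0.0, min(max_angle, angle))

print(finger_angle(0.0))    # no tension: the spring holds the finger open (0.0)
print(finger_angle(100.0))  # full tension: the finger hits its flexion limit (1.6)
```

The design trade is visible even in this toy: one actuator per finger, with the spring providing the return stroke for free.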

1305 2:29:50,200 --> 2:29:52,200 Yeah.

1306 2:29:52,200 --> 2:30:04,200 I mean, our goal with Optimus is to have a robot that is maximally useful as quickly as possible. So there's a lot of ways to solve the various problems of a humanoid robot.

1307 2:30:04,200 --> 2:30:15,200 And we're probably not barking up the right tree on all the technical solutions. And I should say that we're open to evolving the technical solutions that you see here over time.

1308 2:30:15,200 --> 2:30:29,200 They're not locked in stone. But we have to pick something, and we want to pick something that's going to allow us to produce the robot as quickly as possible and have it, like I said, be useful as quickly as possible.

1309 2:30:29,200 --> 2:30:36,200 We're trying to follow the goal of fastest path to a useful robot that can be made at volume.

1310 2:30:36,200 --> 2:30:52,200 And we're going to test the robot internally at Tesla in our factory and just see, like, how useful is it? Because you have to have a, you've got to close the loop on reality to confirm that the robot is in fact useful.

1311 2:30:52,200 --> 2:31:12,200 And yeah, so we're just going to use it to build things. And we're confident we can do that with the hand that we have currently designed. But for sure there'll be hand version two, version three, and we may change the architecture quite significantly over time.

1312 2:31:12,200 --> 2:31:33,200 Hi. Your Optimus robot is really impressive; you did a great job. Bipedal robots are really difficult. But what I noticed might be missing from your plan is to acknowledge the utility of the human spirit, and I'm wondering if Optimus will ever get a personality

1313 2:31:33,200 --> 2:31:46,200 and be able to laugh at our jokes while it folds our clothes. Yeah, absolutely. I think we want to have really fun versions of Optimus.

1314 2:31:46,200 --> 2:32:05,200 So that Optimus can both do utilitarian tasks and also be kind of like a friend and a buddy and hang out with you. And I'm sure people will think of all sorts of creative uses for this robot.

1315 2:32:05,200 --> 2:32:25,200 And, you know, once you have the core intelligence and actuators figured out, then you can actually put all sorts of costumes, I guess, on the robot.

1316 2:32:25,200 --> 2:32:40,200 You can skin the robot in many different ways. And I'm sure people will find very interesting ways to make different versions of Optimus.

1317 2:32:40,200 --> 2:32:55,200 Thanks for the great presentation. I wanted to know if there is an equivalent to interventions in Optimus. It seems like labeling through moments where humans disagree with what's going on is important, and in a humanoid robot

1318 2:32:55,200 --> 2:33:02,200 that might also be a desirable source of information.

1319 2:33:02,200 --> 2:33:15,200 Yeah, I think we will have ways to remote operate the robot and intervene when it does something bad, especially when we are training the robot and bringing it up.

1320 2:33:15,200 --> 2:33:28,200 And hopefully we, you know, design it in a way that, if it's going to hit something, we can just hold it and it will stop; it won't, you know, crush your hand or something. And those are all intervention data.

1321 2:33:28,200 --> 2:33:35,200 And we can learn a lot from our simulation systems to where we can check for collisions and supervise that those are bad actions.
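Mining simulation rollouts for "bad action" labels, as described, might look something like this sketch; the rollout format and the toy collision model are purely hypothetical:

```python
# Hypothetical sketch of turning sim rollouts into intervention-style labels:
# actions that lead to a collision get marked as negative examples, mirroring
# the human "grab the robot and stop it" interventions described above.
def label_rollout(states, actions, collides):
    """collides(state, action) -> bool; returns (action, is_bad) pairs."""
    labels = []
    for state, action in zip(states, actions):
        labels.append((action, collides(state, action)))
    return labels

# Toy collision model: the commanded step collides if it crosses a wall at x=1.0.
collides = lambda x, dx: x + dx > 1.0
print(label_rollout([0.2, 0.8, 0.95], [0.1, 0.1, 0.1], collides))
# -> [(0.1, False), (0.1, False), (0.1, True)]
```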

1322 2:33:35,200 --> 2:33:53,200 Yeah, I mean, with Optimus, we want it over time to be, you know, kind of like the androids you've seen in sci-fi movies, like Data in Star Trek: The Next Generation. But obviously we could program the robot to be less robot-like and more friendly,

1323 2:33:53,200 --> 2:34:11,200 and it can obviously learn to emulate humans and feel very natural. So as AI in general improves, we can add that to the robot, and it should obviously be able to do simple instructions or even intuit

1324 2:34:11,200 --> 2:34:13,200 what it is that you want.

1325 2:34:13,200 --> 2:34:25,200 So you could give it a high-level instruction, and then it can break that down into a series of actions and take those actions.

1326 2:34:25,200 --> 2:34:26,200 Hi.

1327 2:34:26,200 --> 2:34:36,200 Yeah, it's exciting to think that with Optimus you can achieve orders of magnitude of improvement in economic output.

1328 2:34:36,200 --> 2:34:45,200 That's really exciting. And when Tesla started the mission was to accelerate the advent of renewable energy or sustainable transport.

1329 2:34:45,200 --> 2:35:03,200 So with Optimus, do you still see that mission being the mission statement of Tesla, or is it going to be updated with, you know, a mission to accelerate the advent of infinite abundance, or a limitless economy?

1330 2:35:03,200 --> 2:35:11,200 Yeah, Optimus is not, strictly speaking,

1331 2:35:11,200 --> 2:35:15,200 directly in line with accelerating sustainable energy.

1332 2:35:15,200 --> 2:35:21,200 But, you know, to the degree that it is more efficient at getting things done than a person,

1333 2:35:21,200 --> 2:35:35,200 it does, I guess, help with sustainable energy. But I think the mission effectively does somewhat broaden with the advent of Optimus to, you know, I don't know, making the future awesome.

1334 2:35:35,200 --> 2:35:43,200 So, you know, you look at Optimus, and I don't know about you, but I'm excited to see what Optimus will become.

1335 2:35:43,200 --> 2:35:51,200 And, you know, for any given technology, you can ask:

1336 2:35:51,200 --> 2:35:58,200 do you want to see what it's like in a year, two years, three years, four years, five years, ten?

1337 2:35:58,200 --> 2:36:03,200 I'd say for sure you definitely want to see what's happened with Optimus.

1338 2:36:03,200 --> 2:36:09,200 Whereas, you know, a bunch of other technologies have sort of plateaued.

1339 2:36:09,200 --> 2:36:16,200 I won't name names here, but

1340 2:36:16,200 --> 2:36:19,200 you know, so

1341 2:36:19,200 --> 2:36:24,200 I think Optimus is going to be incredible in like five years, ten years. Like, mind-blowing.

1342 2:36:24,200 --> 2:36:29,200 And I'm really interested to see that happen and I hope you are too.

1343 2:36:29,200 --> 2:36:41,200 Thank you. I have a quick question here; this is Justin. I was wondering, are you planning to extend conversational capabilities for the robot?

1344 2:36:41,200 --> 2:36:49,200 And my follow-up question to that is: what's the end goal? What's the end goal of Optimus?

1345 2:36:49,200 --> 2:36:56,200 Yeah, Optimus will definitely have conversational capabilities. So

1346 2:36:56,200 --> 2:37:00,200 you'd be able to talk to it and have a conversation and it would feel quite natural.

1347 2:37:00,200 --> 2:37:09,200 So from an end-goal standpoint, I don't know. I think it's going to keep evolving, and

1348 2:37:09,200 --> 2:37:16,200 I'm not sure where it ends up, but someplace interesting for sure.

1349 2:37:16,200 --> 2:37:21,200 You know, we always have to be careful about the, you know, don't go down the terminator path.

1350 2:37:21,200 --> 2:37:29,200 I thought maybe we should start off with a video of, like, The Terminator, you know, with the skull-crushing opening.

1351 2:37:29,200 --> 2:37:32,200 But people might take that too seriously.

1352 2:37:32,200 --> 2:37:36,200 So, you know, we do want optimist to be safe.

1353 2:37:36,200 --> 2:37:44,200 So we are designing in safeguards where you can locally stop the robot

1354 2:37:44,200 --> 2:37:52,200 and, you know, with like basically a localized control ROM that you can't update over the Internet,

1355 2:37:52,200 --> 2:37:56,200 which I think that's quite important.

1356 2:37:56,200 --> 2:38:12,200 Essential, frankly. So, like, a localized stop button or remote control, something like that, that cannot be changed.

1357 2:38:12,200 --> 2:38:22,200 But it's definitely going to be interesting. It won't be boring.

1358 2:38:22,200 --> 2:38:27,200 OK, yeah, I see today you have a very attractive product with Dojo and its applications.

1359 2:38:27,200 --> 2:38:30,200 So I'm wondering, what's the future for the Dojo platform?

1360 2:38:30,200 --> 2:38:39,200 Will you provide infrastructure as a service, like AWS, or even sell the chip, like Nvidia?

1361 2:38:39,200 --> 2:38:43,200 So basically, what's the future? Because I see you use a seven-nanometer process.

1362 2:38:43,200 --> 2:38:46,200 So the development cost is easily over 10 million dollars.

1363 2:38:46,200 --> 2:38:51,200 How do you make it work business-wise?

1364 2:38:51,200 --> 2:39:00,200 Yeah, I mean, Dojo is a very big computer that actually uses a lot of power and needs a lot of cooling.

1365 2:39:00,200 --> 2:39:08,200 So I think it's probably going to make more sense to have Dojo operate in an Amazon Web Services manner than to try to sell it to someone else.

1366 2:39:08,200 --> 2:39:20,200 The most efficient way to operate Dojo is just to have it be a service that you can use, that's available online,

1367 2:39:20,200 --> 2:39:25,200 and where you can train your models way faster and for less money.

1368 2:39:25,200 --> 2:39:34,200 And as the world transitions to Software 2.0...

1369 2:39:34,200 --> 2:39:36,200 And that's on the bingo card.

1370 2:39:36,200 --> 2:39:41,200 Someone, I know, now has to drink five tequilas.

1371 2:39:41,200 --> 2:39:45,200 So let's see.

1372 2:39:45,200 --> 2:39:49,200 Software 2.0.

1373 2:39:49,200 --> 2:39:53,200 Yeah, we'll use a lot of neural net training.

1374 2:39:53,200 --> 2:40:07,200 So, you know, it kind of makes sense that over time, as there's more and more neural net stuff, people will want to use the fastest, lowest-cost neural net training system.

1375 2:40:07,200 --> 2:40:14,200 So I think there's a lot of opportunity in that direction.

1376 2:40:14,200 --> 2:40:21,200 Hi, my name is Ali Jahanian. Thank you for this event. It's very inspirational.

1377 2:40:21,200 --> 2:40:40,200 My question is, I'm wondering, what is your vision for humanoid robots that understand our emotions and art and can contribute to our creativity?

1378 2:40:40,200 --> 2:40:53,200 Well, I think you're already seeing AI that is at least able to generate very interesting art, like DALL·E and DALL·E 2.

1379 2:40:53,200 --> 2:41:03,200 And I think we'll start seeing AI that can actually generate even movies that have coherence, like interesting movies, and tell jokes.

1380 2:41:03,200 --> 2:41:14,200 So it's quite remarkable how fast AI is advancing at many companies besides Tesla.

1381 2:41:14,200 --> 2:41:17,200 We're headed for a very interesting future.

1382 2:41:17,200 --> 2:41:21,200 And, yeah, so you guys want to comment on that?

1383 2:41:21,200 --> 2:41:27,200 Yeah, I guess the Optimus robot can come up with physical art, not just digital art.

1384 2:41:27,200 --> 2:41:33,200 You can, you know, ask for some dance moves in text or voice, and it can produce those in the future.

1385 2:41:33,200 --> 2:41:37,200 So there's a lot of physical art, not just digital art.

1386 2:41:37,200 --> 2:41:41,200 Oh, yeah, yeah. Computers can absolutely make physical art. Yeah. Yeah.

1387 2:41:41,200 --> 2:41:45,200 Like dance, play soccer or whatever.

1388 2:41:45,200 --> 2:41:50,200 It needs to get more agile, but over time, for sure.

1389 2:41:50,200 --> 2:41:52,200 Thanks so much for the presentation.

1390 2:41:52,200 --> 2:42:00,200 For the Tesla autopilot slides, I noticed that the models that you were using were heavily motivated by language models.

1391 2:42:00,200 --> 2:42:05,200 And I was wondering what the history of that was and how much of an improvement it gave.

1392 2:42:05,200 --> 2:42:10,200 I thought that was a really interesting, curious choice to use language models for the lane transitioning.

1393 2:42:10,200 --> 2:42:14,200 So there are sort of two aspects for why we transition to language modeling.

1394 2:42:14,200 --> 2:42:17,200 So the first... Talk loud and close. OK.

1395 2:42:17,200 --> 2:42:20,200 It's not coming through very clearly. OK, got it.

1396 2:42:20,200 --> 2:42:23,200 Yeah, so the language models help us in two ways.

1397 2:42:23,200 --> 2:42:26,200 The first way is that it lets us predict lanes that we couldn't have otherwise.

1398 2:42:26,200 --> 2:42:33,200 As Ashok mentioned earlier, basically when we predicted lanes in sort of a dense 3D fashion,

1399 2:42:33,200 --> 2:42:38,200 you can only model certain kinds of lanes, but we want to get those crisscrossing connections inside of intersections.

1400 2:42:38,200 --> 2:42:41,200 It's just not possible to do that without making it a graph prediction.

1401 2:42:41,200 --> 2:42:45,200 If you try to do this with dense segmentation, it just doesn't work.

1402 2:42:45,200 --> 2:42:48,200 Also, the lane prediction is a multimodal problem.

1403 2:42:48,200 --> 2:42:54,200 Sometimes you just don't have sufficient visual information to know precisely how things look on the other side of the intersection.

1404 2:42:54,200 --> 2:42:59,200 So you need a method that can generalize and produce coherent predictions.

1405 2:42:59,200 --> 2:43:02,200 You don't want to be predicting two lanes and three lanes at the same time.

1406 2:43:02,200 --> 2:43:06,200 You want to commit to one. And a general model like these language models provides that.
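Why an autoregressive decoder "commits" to one mode while dense regression averages them can be seen in a toy example. The lane counts and logits below are made up for illustration:

```python
# Toy example: the network is unsure whether the far side of the intersection
# has 2 or 3 lanes. A dense regressor blends the modes; a language-model-style
# decoder picks one hypothesis and conditions everything after it on that choice.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# The net is slightly more confident in "3 lanes" than "2 lanes":
mode_probs = softmax([2.0, 2.2])

# Dense regression blends the modes into a nonsensical ~2.55 lanes:
regressed = 2 * mode_probs[0] + 3 * mode_probs[1]

# Token-by-token decoding commits to the most likely hypothesis instead,
# so the output stays internally coherent:
committed = max(zip(mode_probs, [2, 3]))[1]
print(round(regressed, 2), committed)  # -> 2.55 3
```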

1407 2:43:10,200 --> 2:43:11,200 Hi.

1408 2:43:11,200 --> 2:43:14,200 Hi. My name is Giovanni.

1409 2:43:14,200 --> 2:43:18,200 Thanks for the presentation. It's really nice.

1410 2:43:18,200 --> 2:43:21,200 I have a question for FSD team.

1411 2:43:21,200 --> 2:43:30,200 For the neural networks, how do you do unit tests, software unit tests on that?

1412 2:43:30,200 --> 2:43:40,200 Do you have a bunch, I don't know, thousands of cases that the neural network,

1413 2:43:40,200 --> 2:43:45,200 after you train it, has to pass before you release it as a product?

1414 2:43:45,200 --> 2:43:50,200 What's your software unit testing strategies for this?

1415 2:43:50,200 --> 2:43:51,200 Glad you asked.

1416 2:43:51,200 --> 2:43:56,200 There's a series of tests that we have defined, starting from unit tests for software itself.

1417 2:43:56,200 --> 2:44:00,200 But then for the neural network models, we have VIP sets defined.

1418 2:44:00,200 --> 2:44:05,200 If you just have a large test set, that's not enough, we find.

1419 2:44:05,200 --> 2:44:09,200 We need sophisticated VIP sets for different failure modes.

1420 2:44:09,200 --> 2:44:12,200 And then we curate them and grow them over the lifetime of the product.

1421 2:44:12,200 --> 2:44:19,200 So over the years, we have hundreds of thousands of examples where we have been failing in the past

1422 2:44:19,200 --> 2:44:20,200 that we have curated.

1423 2:44:20,200 --> 2:44:25,200 And so for any new model, we test against the entire history of these failures

1424 2:44:25,200 --> 2:44:27,200 and then keep adding to this test set.

1425 2:44:27,200 --> 2:44:32,200 On top of this, we have shadow modes where we ship these models in silent to the car

1426 2:44:32,200 --> 2:44:35,200 and we get data back on where they are failing or succeeding.

1427 2:44:35,200 --> 2:44:39,200 And there's an extensive QA program.

1428 2:44:39,200 --> 2:44:41,200 It's very hard to ship a regression.

1429 2:44:41,200 --> 2:44:44,200 There's like nine levels of filters before it hits customers.

1430 2:44:44,200 --> 2:44:48,200 But then we have really good infra to make this all efficient.
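A curated-failure-set regression gate of the kind described might be sketched as follows; the function name and the toy clip classifiers are hypothetical, not Tesla's actual infra:

```python
# Hedged sketch of a curated-failure regression gate: a new model must not
# fail clips that the baseline handled, drawn from past fleet failures.
def regression_check(model, baseline, failure_set):
    """failure_set: list of (clip, expected) pairs curated from past failures.

    Returns clips where the new model is wrong but the baseline was right,
    i.e. genuine regressions that should block the release.
    """
    regressions = []
    for clip, expected in failure_set:
        if model(clip) != expected and baseline(clip) == expected:
            regressions.append(clip)
    return regressions

# Toy models: classify a clip as "stop" if it contains a stop-sign token.
baseline = lambda clip: "stop" if "stop_sign" in clip else "go"
new_model = lambda clip: "stop" if "stop" in clip else "go"  # sloppier match
failures = [("stop_sign_occluded", "stop"), ("billboard_stop_ad", "go")]
print(regression_check(new_model, baseline, failures))
# -> ['billboard_stop_ad']: the new model regresses on the billboard false positive
```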

1431 2:44:48,200 --> 2:44:50,200 I'm one of the QA testers.

1432 2:44:50,200 --> 2:44:52,200 So I QA the car.

1433 2:44:52,200 --> 2:44:54,200 Yeah, QA tester.

1434 2:44:54,200 --> 2:44:55,200 Yeah.

1435 2:44:55,200 --> 2:45:04,200 So I'm constantly in the car, just QA-ing whatever the latest alpha build is that doesn't totally crash.

1436 2:45:04,200 --> 2:45:06,200 Finds a lot of bugs.

1437 2:45:08,200 --> 2:45:10,200 Hi. Great event.

1438 2:45:10,200 --> 2:45:14,200 I have a question about foundational models for autonomous driving.

1439 2:45:14,200 --> 2:45:21,200 We have all seen that big models that really can, when you scale up with data and model parameter,

1440 2:45:21,200 --> 2:45:25,200 from GPT-3 to PaLM, it can actually now do reasoning.

1441 2:45:25,200 --> 2:45:32,200 Do you see it as essential to scale up foundational models with data and size?

1442 2:45:32,200 --> 2:45:38,200 And then at least you can get a teacher model that potentially can solve all the problems.

1443 2:45:38,200 --> 2:45:41,200 And then you distill to a student model.

1444 2:45:41,200 --> 2:45:46,200 Is that how you see foundational models relevant for autonomous driving?

1445 2:45:46,200 --> 2:45:48,200 That's quite similar to our auto labeling model.

1446 2:45:48,200 --> 2:45:51,200 So we don't just have models that run in the car.

1447 2:45:51,200 --> 2:45:57,200 We train models that are entirely offline, that are extremely large, that can't run in real time on the car.

1448 2:45:57,200 --> 2:46:05,200 So we just run those offline on our servers, producing really good labels that can then train the online networks.

1449 2:46:05,200 --> 2:46:10,200 So that's one form of distillation of these teacher-student models.
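The offline-teacher, in-car-student split can be sketched minimally; the stand-in teacher and update rule below are purely illustrative:

```python
# Minimal sketch of teacher-student distillation as described: a large model
# that can't run in real time labels raw clips offline, and those labels
# train the small network that ships in the car.
def auto_label(teacher, raw_clips):
    """Offline pass: the expensive teacher turns raw clips into labels."""
    return [(clip, teacher(clip)) for clip in raw_clips]

def distill(student_update, labeled):
    """Training loop for the online student, driven by the teacher's labels."""
    student = {}
    for clip, label in labeled:
        student = student_update(student, clip, label)
    return student

teacher = lambda clip: clip.upper()   # stand-in "perfect" offline labeler
update = lambda s, c, l: {**s, c: l}  # stand-in for a gradient step
dataset = auto_label(teacher, ["clip_a", "clip_b"])
print(distill(update, dataset))       # -> {'clip_a': 'CLIP_A', 'clip_b': 'CLIP_B'}
```

The point of the structure: the teacher's cost is paid once per clip offline, while the student stays cheap enough for the car.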

1450 2:46:10,200 --> 2:46:16,200 In terms of foundational models, we are building some really, really large datasets that are multiple petabytes.

1451 2:46:16,200 --> 2:46:20,200 And we are seeing that some of these tasks work really well when we have these large datasets.

1452 2:46:20,200 --> 2:46:25,200 Like the kinematics, like I mentioned: video goes in, and the kinematics of all the objects come out,

1453 2:46:25,200 --> 2:46:27,200 And up to the fourth derivative.

1454 2:46:27,200 --> 2:46:29,200 And people thought we couldn't do detection with cameras.

1455 2:46:29,200 --> 2:46:32,200 Detection, depth, velocity, acceleration.

1456 2:46:32,200 --> 2:46:37,200 And imagine how precise these have to be for these higher-order derivatives to be accurate.

1457 2:46:37,200 --> 2:46:41,200 And this all comes from these large datasets and large models.

1458 2:46:41,200 --> 2:46:49,200 So we are seeing the equivalent of foundation models in our own way for geometry and kinematics and things like those.

1459 2:46:49,200 --> 2:46:52,200 Do you want to add anything, John?

1460 2:46:52,200 --> 2:46:53,200 Yeah, I'll keep it brief.

1461 2:46:53,200 --> 2:47:03,200 Basically, whenever we train on a larger dataset, we see big improvements in our model performance.

1462 2:47:03,200 --> 2:47:10,200 And basically, whenever we initialize our networks with some pre-training step from some other auxiliary task, we basically see improvements.

1463 2:47:10,200 --> 2:47:17,200 Self-supervised or supervised, large datasets both help a lot.

1464 2:47:17,200 --> 2:47:25,200 Hi. So at the beginning, Elon said that Tesla was potentially interested in building artificial general intelligence systems.

1465 2:47:25,200 --> 2:47:34,200 Given the potentially transformative impact of technology like that, it seems prudent to invest in technical AGI safety expertise specifically.

1466 2:47:34,200 --> 2:47:38,200 I know Tesla does a lot of technical narrow AI safety research.

1467 2:47:38,200 --> 2:47:48,200 I was curious if Tesla was intending to try to build expertise in technical artificial general intelligence safety specifically.

1468 2:47:48,200 --> 2:47:59,200 Well, I mean, if we're looking like we're going to be making a significant contribution to artificial general intelligence, then we'll for sure invest in safety.

1469 2:47:59,200 --> 2:48:01,200 I'm a big believer in AI safety.

1470 2:48:01,200 --> 2:48:12,200 I think there should be an AI regulatory authority at the government level, just as there is a regulatory authority for anything that affects public safety.

1471 2:48:12,200 --> 2:48:21,200 So we have regulatory authority for aircraft and cars and food and drugs because they affect public safety.

1472 2:48:21,200 --> 2:48:23,200 And AI also affects public safety.

1473 2:48:23,200 --> 2:48:39,200 So I think this is not really something that government understands yet. But I think there should be a referee that is ensuring, or trying to ensure, public safety for AGI.

1474 2:48:39,200 --> 2:48:46,200 And you think of like, well, what are the elements that are necessary to create AGI?

1475 2:48:46,200 --> 2:49:09,200 The accessible data set is extremely important. And if you've got a large number of cars and humanoid robots processing petabytes of video data and audio data from the real world, just like humans, that might be the biggest data set.

1476 2:49:09,200 --> 2:49:12,200 It probably is the biggest data set.

1477 2:49:12,200 --> 2:49:17,200 Because in addition to that, you can obviously incrementally scan the Internet.

1478 2:49:17,200 --> 2:49:29,200 But what the Internet can't quite do is have millions or hundreds of millions of cameras in the real world, and like I said, with audio and other sensors as well.

1479 2:49:29,200 --> 2:49:39,200 So I think we probably will have the most amount of data and probably the most amount of training power.

1480 2:49:39,200 --> 2:49:48,200 Therefore, probably we will make a contribution to AGI.

1481 2:49:48,200 --> 2:49:53,200 Hey, I noticed the semi was back there, but we haven't talked about it too much.

1482 2:49:53,200 --> 2:49:59,200 I was just wondering for the semi truck, what are the changes you're thinking about from a sensing perspective?

1483 2:49:59,200 --> 2:50:03,200 I imagine there's very different requirements, obviously, than just a car.

1484 2:50:03,200 --> 2:50:06,200 And if you don't think that's true, why is that true?

1485 2:50:06,200 --> 2:50:10,200 No, I think basically you can drive a car.

1486 2:50:10,200 --> 2:50:12,200 I mean, think about it, what drives any vehicle?

1487 2:50:12,200 --> 2:50:18,200 It's a biological neural net with eyes, with cameras, essentially.

1488 2:50:18,200 --> 2:50:30,200 And really, your primary sensors are two cameras on a slow gimbal, a very slow gimbal.

1489 2:50:30,200 --> 2:50:32,200 That's your head.

1490 2:50:32,200 --> 2:50:39,200 So if a biological neural net with two cameras on a slow gimbal can drive a semi truck,

1491 2:50:39,200 --> 2:50:48,200 then if you've got like eight cameras with continuous 360-degree vision operating at a higher frame rate and much higher reaction rate,

1492 2:50:48,200 --> 2:50:56,200 then I think it is obvious that you should be able to drive a semi or any vehicle much better than a human.

1493 2:50:56,200 --> 2:51:00,200 Hi, my name is Akshay. Thank you for the event.

1494 2:51:00,200 --> 2:51:08,200 Assuming Optimus would be used for different use cases and would evolve at different pace for these use cases,

1495 2:51:08,200 --> 2:51:15,200 would it be possible to sort of develop and deploy different software and hardware components independently

1496 2:51:15,200 --> 2:51:27,200 and deploy them in Optimus so that the overall feature development is faster for Optimus?

1497 2:51:27,200 --> 2:51:30,200 I'm trying to parse the question.

1498 2:51:30,200 --> 2:51:33,200 Okay, all right. We did not comprehend.

1499 2:51:33,200 --> 2:51:38,200 Unfortunately, our neural net did not comprehend the question.

1500 2:51:38,200 --> 2:51:44,200 So next question.

1501 2:51:44,200 --> 2:51:46,200 Hi, I want to switch the gear to the autopilot.

1502 2:51:46,200 --> 2:51:53,200 So when you guys plan to roll out the FSD beta to countries other than U.S. and Canada,

1503 2:51:53,200 --> 2:52:00,200 and also my next question is what's the biggest bottleneck or the technological barrier you think in the current autopilot stack

1504 2:52:00,200 --> 2:52:06,200 and how you envision to solve that to make the autopilot is considerably better than human

1505 2:52:06,200 --> 2:52:11,200 in terms of performance matrix, like safety assurance and the human confidence?

1506 2:52:11,200 --> 2:52:18,200 I think you also mentioned for FSD V11, you are going to combine the highway and the city as a single stack

1507 2:52:18,200 --> 2:52:24,200 and some architectural big improvements. Can you maybe expand a bit on that? Thank you.

1508 2:52:24,200 --> 2:52:29,200 Well, that's a whole bunch of questions.

1509 2:52:29,200 --> 2:52:33,200 We're hopeful to be able to. I think, from a technical standpoint,

1510 2:52:33,200 --> 2:52:43,200 it should be possible to roll out FSD beta worldwide by the end of this year.

1511 2:52:43,200 --> 2:52:54,200 But for a lot of countries, we need regulatory approval, and so we are somewhat gated by the regulatory approval in other countries.

1512 2:52:54,200 --> 2:53:03,200 But I think from a technical standpoint, it will be ready to go to a worldwide beta by the end of this year.

1513 2:53:03,200 --> 2:53:07,200 And there's quite a big improvement that we're expecting to release next month.

1514 2:53:07,200 --> 2:53:16,200 That will be especially good at assessing the velocity of fast-moving cross traffic and a bunch of other things.

1515 2:53:16,200 --> 2:53:22,200 So, anyone want to elaborate?

1516 2:53:22,200 --> 2:53:27,200 Yeah, I guess so. There used to be a lot of differences between production autopilot and the full self-driving beta,

1517 2:53:27,200 --> 2:53:31,200 but those differences have been getting smaller and smaller over time.

1518 2:53:31,200 --> 2:53:40,200 As of just a few months ago, we now use the same vision-only object detection stack in both FSD and in production Autopilot on all vehicles.

1519 2:53:40,200 --> 2:53:45,200 There's still a few differences, the primary one being the way that we predict lanes right now.

1520 2:53:45,200 --> 2:53:50,200 So we upgraded the modeling of lanes so that it could handle these more complex geometries like I mentioned in the talk.

1521 2:53:50,200 --> 2:53:54,200 In production autopilot, we still use a simpler lane model,

1522 2:53:54,200 --> 2:54:01,200 but we're extending our current FSD beta models to work in all sorts of highway scenarios as well.

1523 2:54:01,200 --> 2:54:06,200 Yeah, and the version of FSD beta that I drive actually does have the integrated stack.

1524 2:54:06,200 --> 2:54:14,200 So it uses the FSD stack both in city streets and highway, and it works quite well for me.

1525 2:54:14,200 --> 2:54:20,200 But we need to validate it in all kinds of weather like heavy rain, snow, dust,

1526 2:54:20,200 --> 2:54:29,200 and just make sure it's working better than the production stack across a wide range of environments.

1527 2:54:29,200 --> 2:54:32,200 But we're pretty close to that.

1528 2:54:32,200 --> 2:54:40,200 I mean, I think it's, I don't know, maybe, it'll definitely be before the end of the year and maybe November.

1529 2:54:40,200 --> 2:54:46,200 Yeah, in our personal drives, the FSD stack on highway drives already way better than the production stack we have.

1530 2:54:46,200 --> 2:54:53,200 And we do expect to also include the parking lot stack as a part of the FSD stack before the end of this year.

1531 2:54:53,200 --> 2:55:02,200 So that will basically bring us to: you sit in the car in a parking lot, and it drives all the way to a parking spot at the destination, before the end of this year.

1532 2:55:02,200 --> 2:55:12,200 And the fundamental metric to optimize against is how many miles between necessary interventions.

1533 2:55:12,200 --> 2:55:25,200 So just massively improving how many miles the car can drive in full autonomy before an intervention is required that is safety critical.

1534 2:55:25,200 --> 2:55:36,200 So, yeah, that's the fundamental metric that we're measuring every week, and we're making radical improvements on that.
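The miles-between-interventions metric described here can be sketched as a simple fleet aggregation. This is a toy illustration only; the log fields (`miles`, `safety_critical_interventions`) are invented for the example and are not Tesla's actual telemetry schema.

```python
# Hypothetical sketch: aggregate weekly fleet logs into miles driven per
# safety-critical intervention. Field names are illustrative assumptions.

def miles_per_critical_intervention(drive_logs):
    """Total miles driven divided by total safety-critical interventions."""
    total_miles = 0.0
    critical_interventions = 0
    for log in drive_logs:
        total_miles += log["miles"]
        critical_interventions += log["safety_critical_interventions"]
    if critical_interventions == 0:
        return float("inf")  # no interventions observed in this window
    return total_miles / critical_interventions

# Example: one week of (made-up) fleet data.
week = [
    {"miles": 1200.0, "safety_critical_interventions": 1},
    {"miles": 800.0, "safety_critical_interventions": 0},
]
print(miles_per_critical_intervention(week))  # 2000.0 miles per intervention
```

Tracking this number week over week, as described, turns "radical improvements" into a single monotonic curve to watch.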

1535 2:55:36,200 --> 2:55:46,200 Hi, thank you. Thank you so much for the presentation. Very inspiring. My name is Daisy. I actually have a non-technical question for you.

1536 2:55:46,200 --> 2:56:07,200 I'm curious if you were back to your 20s, what are some of the things you wish you knew back then? What are some advice you would give to your younger self?

1537 2:56:07,200 --> 2:56:14,200 Well, I'm trying to figure out something useful to say.

1538 2:56:14,200 --> 2:56:20,200 Yeah, yeah, join Tesla would be one thing.

1539 2:56:20,200 --> 2:56:28,200 Yeah, I think just try to expose yourself to as many smart people as possible.

1540 2:56:28,200 --> 2:56:34,200 And read a lot of books.

1541 2:56:34,200 --> 2:56:37,200 You know, I did do that, though.

1542 2:56:37,200 --> 2:56:54,200 So I think there's some merit to just also not being necessarily too intense and enjoying the moment a bit more, I would say, to 20-something me.

1543 2:56:54,200 --> 2:57:02,200 Just to stop and smell the roses occasionally would probably be a good idea.

1544 2:57:02,200 --> 2:57:14,200 You know, it's like when we were developing the Falcon 1 rocket on the Kwajalein Atoll, and we had this beautiful little island that we're developing the rocket on,

1545 2:57:14,200 --> 2:57:26,200 and not once during that entire time did I even have a drink on the beach. I'm like, I should have had a drink on the beach. That would have been fine.

1546 2:57:26,200 --> 2:57:32,200 Thank you very much. I think you have excited all of the robotics people with Optimus.

1547 2:57:32,200 --> 2:57:40,200 This feels very much like 10 years ago in driving, but as driving has proved to be harder than it actually looked 10 years ago,

1548 2:57:40,200 --> 2:57:49,200 what do we know now that we didn't 10 years ago that would make, for example, AGI on a humanoid come faster?

1549 2:57:49,200 --> 2:58:00,200 Well, I mean, it seems to me that AI is advancing very quickly. Hardly a week goes by without some significant announcement.

1550 2:58:00,200 --> 2:58:11,200 And yeah, I mean, at this point, AI seems to be able to win at almost any rule-based game.

1551 2:58:11,200 --> 2:58:31,200 It's able to create extremely impressive art, engage in conversations that are very sophisticated, write essays, and these just keep improving.

1552 2:58:31,200 --> 2:58:45,200 And there's so many more talented people working on AI, and the hardware is getting better. I think AI is on a super, like a strong exponential curve of improvements,

1553 2:58:45,200 --> 2:58:57,200 independent of what we do at Tesla, and obviously we will benefit somewhat from that exponential curve of improvement in AI.

1554 2:58:57,200 --> 2:59:07,200 Tesla just also happens to be very good at actuators, at motors, gearboxes, controllers, power electronics, batteries, sensors.

1555 2:59:07,200 --> 2:59:19,200 And really, I'd say the biggest difference between the robot on four wheels and the robot with arms and legs is getting the actuators right.

1556 2:59:19,200 --> 2:59:33,200 It's an actuators and sensors problem, and obviously how you control those actuators and sensors.

1557 2:59:33,200 --> 2:59:42,200 I don't know, we happen to have the ingredients necessary to create a compelling robot, and we're doing it.

1558 2:59:42,200 --> 2:59:51,200 Hi, Elon. You are actually bringing humanity to the next level. Literally, Tesla and you are bringing humanity to the next level.

1559 2:59:51,200 --> 3:00:03,200 So you said Optimus Prime, Optimus, will be used in the next Tesla factory. My question is, will a new Tesla factory be fully run by the Optimus program?

1560 3:00:03,200 --> 3:00:10,200 And when can the general public order a humanoid?

1561 3:00:10,200 --> 3:00:16,200 Yeah, I think it'll, you know, we're going to start Optimus with very simple tasks in the factory.

1562 3:00:16,200 --> 3:00:34,200 You know, like maybe just loading a part, like you saw in the video, carrying a part from one place to another, or loading a part into one of our more conventional robot cells that, you know, weld the body together.

1563 3:00:34,200 --> 3:00:44,200 So we'll start, you know, just trying to, how do we make it useful at all? And then gradually expand the number of situations where it's useful.

1564 3:00:44,200 --> 3:00:55,200 And I think that the number of situations where Optimus is useful will grow exponentially, like really, really fast.

1565 3:00:55,200 --> 3:01:07,200 In terms of when people can order one, I don't know, I think it's not that far away. Well, I think you mean when can people receive one?

1566 3:01:07,200 --> 3:01:23,200 So, I don't know, I'm like, I'd say probably within three years, not more than five years, within three to five years, you could probably receive an Optimus.

1567 3:01:23,200 --> 3:01:29,200 I feel the best way to make progress on AGI is to involve as many smart people across the world as possible.

1568 3:01:29,200 --> 3:01:37,200 And given the size and resources of Tesla compared to robotics companies, and given the state of humanoid research at the moment,

1569 3:01:37,200 --> 3:01:44,200 would it make sense for Tesla to sort of open source some of the simulation and hardware parts?

1570 3:01:44,200 --> 3:01:53,200 I think Tesla can still be the dominant platform, something like Android or iOS, for the entire field of humanoid research.

1571 3:01:53,200 --> 3:02:00,200 Would that be something where, rather than keeping Optimus to just Tesla researchers or the factory itself,

1572 3:02:00,200 --> 3:02:10,200 you can open it up and let the whole world explore humanoid research?

1573 3:02:10,200 --> 3:02:19,200 I think we have to be careful about Optimus being potentially used in ways that are bad, because that is one of the possible things to do.

1574 3:02:19,200 --> 3:02:40,200 So, I think we'd provide Optimus where you can provide instructions to Optimus, but where those instructions are governed by some laws of robotics that you cannot overcome.

1575 3:02:40,200 --> 3:02:52,200 So, not doing harm to others, and I think probably quite a few safety related things with Optimus.

1576 3:02:52,200 --> 3:02:59,200 We'll just take maybe a few more questions, and then thank you all for coming.

1577 3:02:59,200 --> 3:03:09,200 Questions, one deep and one broad. On the deep, for Optimus, what's the current and what's the ideal controller bandwidth?

1578 3:03:09,200 --> 3:03:15,200 And then in the broader question, there's this big advertisement for the depth and breadth of the company.

1579 3:03:15,200 --> 3:03:21,200 What is it uniquely about Tesla that enables that?

1580 3:03:21,200 --> 3:03:25,200 Anyone want to tackle the bandwidth question?

1581 3:03:25,200 --> 3:03:28,200 So, the technical bandwidth of the...

1582 3:03:28,200 --> 3:03:29,200 Hold it close to your mouth and speak loudly.

1583 3:03:29,200 --> 3:03:41,200 Okay. For the bandwidth question, you have to understand or figure out what is the task that you want it to do, and if you took a frequency transform of that task, what is it that you want your limbs to do?

1584 3:03:41,200 --> 3:03:50,200 And that's where you get your bandwidth from. It's not a number that you can specifically just say. You need to understand your use case, and that's where the bandwidth comes from.

1585 3:03:50,200 --> 3:03:54,200 What was the broader question?

1586 3:03:54,200 --> 3:04:05,200 The breadth and depth thing. I can answer the breadth and depth.

1587 3:04:05,200 --> 3:04:31,200 On the bandwidth question, I think we'll probably just end up increasing the bandwidth, which translates to the effective dexterity and reaction time of the robot. It's safe to say it's not one hertz, and maybe you don't need to go all the way to 100 hertz, but maybe 10, 25, I don't know.

1588 3:04:31,200 --> 3:04:39,200 Over time, I think the bandwidth will increase quite a bit, or, translated, the dexterity and latency will improve.

1589 3:04:39,200 --> 3:04:44,200 You'd want to minimize that over time.

1590 3:04:44,200 --> 3:04:48,200 Minimize latency, maximize dexterity.
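The engineer's point that bandwidth "comes from the task" can be made concrete: take a sampled joint trajectory, look at its spectrum, and read off the frequency that captures most of the signal energy. This is a minimal sketch under assumptions I've chosen for illustration (the trajectory, the 100 Hz sample rate, and the 95% energy threshold are all made up).

```python
# Sketch: estimate required controller bandwidth from the frequency content
# of a desired motion, as described in the answer above. All parameters
# here are illustrative assumptions, not robot specs.
import numpy as np

def required_bandwidth_hz(trajectory, sample_rate_hz, energy_fraction=0.95):
    """Smallest frequency containing `energy_fraction` of the motion's energy."""
    spectrum = np.abs(np.fft.rfft(trajectory - np.mean(trajectory))) ** 2
    freqs = np.fft.rfftfreq(len(trajectory), d=1.0 / sample_rate_hz)
    cumulative = np.cumsum(spectrum) / np.sum(spectrum)
    return freqs[np.searchsorted(cumulative, energy_fraction)]

# Toy task: a 2 Hz reaching motion sampled at 100 Hz for 4 seconds.
t = np.arange(0, 4, 0.01)
motion = np.sin(2 * np.pi * 2.0 * t)
print(required_bandwidth_hz(motion, 100.0))  # ~2.0 Hz for this toy motion
```

A slow reaching motion needs only a few hertz of control bandwidth; a fast catch or balance correction pushes the number up, which matches the "maybe 10, 25 Hz" intuition above.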

1591 3:04:48,200 --> 3:05:07,200 In terms of breadth and depth, we're a pretty big company at this point, so we've got a lot of different areas of expertise that we necessarily had to develop in order to make electric cars, and then in order to make autonomous electric cars.

1592 3:05:07,200 --> 3:05:11,200 Tesla is like a whole series of startups, basically.

1593 3:05:11,200 --> 3:05:19,200 So far, they've almost all been quite successful. So we must be doing something right.

1594 3:05:19,200 --> 3:05:30,200 I consider one of my core responsibilities in running the company is to have an environment where great engineers can flourish.

1595 3:05:30,200 --> 3:05:42,200 And I think in a lot of companies, maybe most companies, if somebody is a really talented, driven engineer, they're unable to actually...

1596 3:05:42,200 --> 3:05:48,200 Their talents are suppressed at a lot of companies.

1597 3:05:48,200 --> 3:06:05,200 And at some of those companies, engineering talent is suppressed in a way that is maybe not obviously bad, but where it's just so comfortable, and you're paid so much money, and the output you actually have to produce is so low, that it's like a honey trap.

1598 3:06:05,200 --> 3:06:19,200 So there's a few honey traps in Silicon Valley, where they don't necessarily seem like bad places for engineers, but you have to say a good engineer went in, and what did they get out?

1599 3:06:19,200 --> 3:06:28,200 And the output of that engineering talent seems very low, even though they seem to be enjoying themselves.

1600 3:06:28,200 --> 3:06:32,200 That's why I say there are a few honey trap companies in Silicon Valley.

1601 3:06:32,200 --> 3:06:44,200 Tesla is not a honey trap. We're demanding, and it's like, you're going to get a lot of shit done, and it's going to be really cool, and it's not going to be easy.

1602 3:06:44,200 --> 3:07:04,200 But if you are a super talented engineer, your talents will be used, I think, to a greater degree than anywhere else. You know, SpaceX is also that way.

1603 3:07:04,200 --> 3:07:16,200 Hi, I have two questions. So both to the autopilot team. So the thing is, I have been following your progress for the past few years. So today you have made changes on the lane detection.

1604 3:07:16,200 --> 3:07:23,200 You said that previously you were doing instance segmentation. Now you guys have built transformer models for building the lanes.

1605 3:07:23,200 --> 3:07:34,200 So what are some other common challenges you guys are facing right now and will be solving in the future, so that we as researchers can start working on those?

1606 3:07:34,200 --> 3:07:42,200 And the second question is, I'm really curious about the data engine. You guys have talked about a case where a car is stopped.

1607 3:07:42,200 --> 3:07:50,200 So how are you finding cases very similar to that in the data you have? A little bit more on the data engine would be great.

1608 3:07:50,200 --> 3:08:01,200 I'll start with the first question using occupancy network as an example. So what you saw in the presentation did not exist a year ago.

1609 3:08:01,200 --> 3:08:06,200 So we only spent one year on it, and we actually shipped more than 12 versions of the occupancy network.

1610 3:08:06,200 --> 3:08:17,200 And to have one foundation model actually to represent the entire physical world around everywhere and in all weather conditions is actually really, really challenging.

1611 3:08:17,200 --> 3:08:30,200 So just over a year ago, we were kind of driving in a 2D world. If there's a wall and there's a curb, we kind of represented both with the same static edge, which is obviously not ideal.

1612 3:08:30,200 --> 3:08:34,200 There's a big difference between a curb and a wall. When you drive, you make different choices.

1613 3:08:34,200 --> 3:08:42,200 So after we realized that, we had to go to 3D. We had to basically rethink the entire problem and think about how we address that.

1614 3:08:42,200 --> 3:08:51,200 So this would be one example of the challenges we have conquered in the past year.
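The curb-versus-wall point is easy to see in miniature: in a top-down 2D map, a low curb and a tall wall project to the same occupied cells, but a 3D occupancy grid keeps the height information that lets a planner treat them differently. This toy grid (sizes, cell resolution, and heights are all arbitrary assumptions) illustrates the idea only, not Tesla's occupancy network.

```python
# Toy voxel occupancy grid: a 15 cm curb and a 2 m wall look identical in
# a 2D projection but differ in 3D. All dimensions are illustrative.
import numpy as np

Z_CELLS, CELL_M = 8, 0.25          # 8 vertical cells of 25 cm each

def occupy_column(grid, x, y, height_m):
    """Mark voxels occupied up to `height_m` at column (x, y)."""
    grid[x, y, : int(np.ceil(height_m / CELL_M))] = True

grid = np.zeros((4, 4, Z_CELLS), dtype=bool)
occupy_column(grid, 0, 0, 0.15)    # curb
occupy_column(grid, 3, 3, 2.00)    # wall

# 2D projection: both columns just read "occupied" -- the same static edge.
print(grid[0, 0].any(), grid[3, 3].any())   # True True
# 3D view: the occupied heights differ, so the planner can make different choices.
print(grid[0, 0].sum(), grid[3, 3].sum())   # 1 8
```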

1615 3:08:51,200 --> 3:08:58,200 Yeah, to answer the question about how we actually source examples of those tricky stopped cars, there's a few ways to go about this.

1616 3:08:58,200 --> 3:09:03,200 But two examples are one, we can trigger for disagreements within our signals.

1617 3:09:03,200 --> 3:09:08,200 So let's say the parked bit flickers between parked and driving. We'll trigger on that and get the clip back.

1618 3:09:08,200 --> 3:09:16,200 And the second is we can leverage more of the shadow mode logic. So if the customer ignores the car, but we think we should stop for it, we'll get that data back too.

1619 3:09:16,200 --> 3:09:25,200 So these are just various kinds of trigger logic that allow us to get that data back.
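The two triggers described, a flickering prediction and a shadow-mode disagreement, can be sketched in a few lines. Everything here (the clip representation, thresholds, function names) is a made-up illustration of the idea, not Tesla's real data engine.

```python
# Toy sketch of the two trigger ideas above: (1) flag clips where a predicted
# attribute flickers frame-to-frame, (2) flag shadow-mode disagreements where
# the planner wanted to stop but the driver did not. Illustrative only.

def flicker_trigger(parked_flags, max_flips=2):
    """Fire when the 'parked' bit changes value too often across frames."""
    flips = sum(1 for a, b in zip(parked_flags, parked_flags[1:]) if a != b)
    return flips > max_flips

def shadow_disagreement_trigger(planner_wants_stop, driver_stopped):
    """Fire when the shadow planner would stop but the human driver did not."""
    return planner_wants_stop and not driver_stopped

clip = [True, False, True, True, False, True]   # flickering parked bit
print(flicker_trigger(clip))                    # True -> upload this clip
print(shadow_disagreement_trigger(True, False)) # True -> upload this clip
```

Either trigger firing marks the clip for upload, which is how rare, tricky cases get pulled out of an otherwise unremarkable fleet data stream.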

1620 3:09:25,200 --> 3:09:30,200 Hi. Thank you for the amazing presentation. Thanks so much.

1621 3:09:30,200 --> 3:09:40,200 So there are a lot of companies that are focusing on the AGI problem. And one of the reasons why it's such a hard problem is because the problem itself is so hard to define.

1622 3:09:40,200 --> 3:09:44,200 Several companies have several different definitions. They focus on different things.

1623 3:09:44,200 --> 3:09:51,200 So what is Tesla, how is Tesla defining the AGI problem? And what are you focusing on specifically?

1624 3:09:51,200 --> 3:10:03,200 Well, we're not actually specifically focused on AGI. I'm simply saying that AGI seems likely to be an emergent property of what we're doing.

1625 3:10:03,200 --> 3:10:18,200 Because we're creating all these autonomous cars and autonomous humanoids, with a truly gigantic data stream that's coming in and being processed.

1626 3:10:18,200 --> 3:10:30,200 It's by far the most real-world data, and data you can't get by just searching the Internet, because you have to be out there in the world, interacting with people and interacting with the roads.

1627 3:10:30,200 --> 3:10:35,200 And just, you know, Earth is a big place and reality is messy and complicated.

1628 3:10:35,200 --> 3:10:51,200 So I think it just seems likely to be an emergent property if you've got, you know, tens or hundreds of millions of autonomous vehicles and maybe even a comparable number of humanoids, maybe more than that on the humanoid front.

1629 3:10:51,200 --> 3:11:16,200 Well, that's just the most data. And if that video is being processed, it just seems likely that, you know, the cars will definitely get way better than human drivers, and the humanoid robots will become increasingly indistinguishable from humans, perhaps.

1630 3:11:16,200 --> 3:11:27,200 And so then, like I said, you have this emergent property of AGI.

1631 3:11:27,200 --> 3:11:36,200 And arguably, you know, humans collectively are sort of a superintelligence as well, especially as we improve the data rate between humans.

1632 3:11:36,200 --> 3:11:53,200 I mean, way back in the early days, the Internet was like humanity acquiring a nervous system, where now all of a sudden any one element of humanity could know all of the knowledge of humans by connecting to the Internet.

1633 3:11:53,200 --> 3:11:56,200 Almost all knowledge, or certainly a huge part of it.

1634 3:11:56,200 --> 3:12:09,200 Previously, we would exchange information by osmosis: in order to transfer data, you would have to write a letter, and someone would have to carry that letter from one person to another.

1635 3:12:09,200 --> 3:12:19,200 And there were a whole bunch of steps in between. It was, yeah, I mean, insanely slow when you think about it.

1636 3:12:19,200 --> 3:12:26,200 And even if you were in the Library of Congress, you still didn't have access to all the world's information. You certainly couldn't search it.

1637 3:12:26,200 --> 3:12:30,200 And obviously, very few people are in the Library of Congress.

1638 3:12:30,200 --> 3:12:48,200 So I mean, in terms of great equalizing forces, the Internet has been the biggest equalizer in history in terms of access to information and knowledge.

1639 3:12:48,200 --> 3:12:58,200 And I think any student of history would agree with this, because if you go back a thousand years, there were very few books, and books were incredibly expensive.

1640 3:12:58,200 --> 3:13:03,200 Only a few people knew how to read, and an even smaller number of people even had a book.

1641 3:13:03,200 --> 3:13:11,200 Now look at it: you can access any book instantly. You can learn anything, basically for free.

1642 3:13:11,200 --> 3:13:13,200 It's pretty incredible.

1643 3:13:13,200 --> 3:13:25,200 So, you know, I was asked recently what period of history I would most prefer to live in.

1644 3:13:25,200 --> 3:13:33,200 And my answer was right now. This is the most interesting time in history. And I read a lot of history.

1645 3:13:33,200 --> 3:13:39,200 So, you know, let's do our best to keep that going. Yeah.

1646 3:13:39,200 --> 3:13:56,200 And to go back to one of the earlier questions: the thing that's happened over time with respect to Tesla Autopilot is that the neural nets have gradually absorbed more and more software.

1647 3:13:56,200 --> 3:14:09,200 And in the limit, of course, you could simply take the videos as seen by the car and compare those to the steering inputs from the steering wheel and pedals, which are very simple inputs.

1648 3:14:09,200 --> 3:14:30,200 And in principle, you could train with nothing in between, because that's what humans are doing with a biological neural net. You could train based on video, where what trains against the video is the movement of the steering wheel and the pedals, with no other software in between.

1649 3:14:30,200 --> 3:14:36,200 We're not there yet, but it's gradually going in that direction.
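The limit case Elon describes, video in, steering and pedal commands out, with nothing hand-written in between, is essentially behavior cloning. Here is a deliberately tiny numpy sketch: a linear model fit by least squares from flattened "frames" to [steering, throttle] targets. Real systems use deep networks and vastly more data; every detail here (shapes, the linear policy, the data itself) is an illustrative assumption.

```python
# Toy behavior cloning: learn a mapping from frames to driver controls with
# no intermediate hand-written software, as described above. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(256, 48))      # 256 "frames", 48 pixels each
true_policy = rng.normal(size=(48, 2))   # hidden mapping to [steer, throttle]
controls = frames @ true_policy          # the human driver's demonstrated actions

# Fit controls directly from frames: the training signal is just the
# steering wheel and pedal movements paired with the video.
learned_policy, *_ = np.linalg.lstsq(frames, controls, rcond=None)
error = np.abs(frames @ learned_policy - controls).max()
print(error < 1e-6)  # True: the clone reproduces the demonstrated actions
```

In this noiseless linear toy the clone recovers the policy exactly; the hard part in practice is that the real mapping is deep, the data is noisy, and rare situations dominate the remaining errors.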

1650 3:14:36,200 --> 3:14:42,200 All right. Last question.

1651 3:14:42,200 --> 3:14:44,200 I think we got a question at the front here.

1652 3:14:44,200 --> 3:14:46,200 Hello there right there.

1653 3:14:46,200 --> 3:14:50,200 I will do two questions. Fine.

1654 3:14:50,200 --> 3:14:52,200 Hi, thanks for such a great presentation.

1655 3:14:52,200 --> 3:14:54,200 We'll do your question last.

1656 3:14:54,200 --> 3:15:17,200 Okay, cool. With FSD being used by so many people, how do you evaluate the company's risk tolerance in terms of performance statistics? And do you think there needs to be more transparency or regulation from third parties as to what's good enough, defining thresholds for performance across so many miles?

1657 3:15:17,200 --> 3:15:22,200 Sure, well, the, you know,

1658 3:15:22,200 --> 3:15:26,200 number one design requirement at Tesla is safety.

1659 3:15:26,200 --> 3:15:32,200 And that goes across the board. So, in terms of the mechanical safety of the car:

1660 3:15:32,200 --> 3:15:45,200 We have the lowest probability of injury of any cars ever tested by the government, for just passive mechanical safety, essentially crash structure and airbags and whatnot.

1661 3:15:45,200 --> 3:15:52,200 We have the highest rating for active safety as well.

1662 3:15:52,200 --> 3:16:02,200 And we're going to get to the point where the active safety is so ridiculously good, it's just absurdly better than a human.

1663 3:16:02,200 --> 3:16:23,200 And with respect to Autopilot, we do publish, broadly speaking, the statistics on miles driven: Tesla cars with no autonomy, cars with Hardware 1, Hardware 2, Hardware 3, and then the ones that are in FSD Beta.

1664 3:16:23,200 --> 3:16:27,200 And we see steady improvements all along the way.

1665 3:16:27,200 --> 3:16:42,200 And, you know, sometimes there's this dichotomy of: should you wait until the car is, like, three times safer than a person before deploying any technology? But I think that is actually morally wrong.

1666 3:16:42,200 --> 3:16:51,200 At the point at which you believe that adding autonomy reduces injury and death,

1667 3:16:51,200 --> 3:17:03,200 I think you have a moral obligation to deploy it, even though you're going to get sued and blamed by a lot of people, because the people whose lives you saved don't know that their lives are saved.

1668 3:17:03,200 --> 3:17:14,200 And the people who do occasionally die or get injured, they definitely know, or their estate does, that there was, you know, a problem with Autopilot.

1669 3:17:14,200 --> 3:17:23,200 That's why you have to look at the numbers in total: miles driven, how many accidents occurred, how many accidents were serious, how many fatalities.

1670 3:17:23,200 --> 3:17:29,200 And, you know, we've got well over 3 million cars on the road, so that's a lot of miles driven every day.

1671 3:17:29,200 --> 3:17:38,200 It's not going to be perfect. But what matters is that it is very clearly safer than not deploying it.
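The cohort comparison described here, miles per accident across fleets with different levels of autonomy, reduces to simple arithmetic. The cohort names below mirror the ones mentioned in the answer, but every number is fabricated purely to show the calculation; these are not real Tesla statistics.

```python
# Toy sketch of the published safety comparison: miles per accident by
# cohort. All numbers are made up for illustration.

def miles_per_accident(miles, accidents):
    """Miles driven per recorded accident; infinite if none occurred."""
    return miles / accidents if accidents else float("inf")

cohorts = {                         # (miles driven, accidents) -- fabricated
    "no autonomy": (10_000_000, 25),
    "Hardware 3":  (10_000_000, 12),
    "FSD Beta":    (10_000_000, 6),
}
rates = {name: miles_per_accident(m, a) for name, (m, a) in cohorts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} miles per accident")
```

The "steady improvements" claim is then just that each successive cohort's miles-per-accident figure exceeds the previous one, which is directly checkable from data like this.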

1672 3:17:38,200 --> 3:17:45,200 Yeah. So, I think last question.

1673 3:17:45,200 --> 3:17:53,200 I think, yeah. Thanks. Last question here.

1674 3:17:53,200 --> 3:17:55,200 Okay.

1675 3:17:55,200 --> 3:18:12,200 Okay. Hi. So, I do not work on hardware, so maybe the hardware team can enlighten me. Why is it required that there be symmetry in the design of Optimus? Because humans,

1676 3:18:12,200 --> 3:18:31,200 we have handedness, right? We use some sets of muscles more than others, and over time there is wear and tear. So maybe you'll start to see some joint failures or actuator failures more over time. I understand that this is an extremely early stage.

1677 3:18:31,200 --> 3:18:51,200 Also, we as humans have based so much fantasy and fiction on superhuman capabilities. Like, all of us don't want to walk right over there; we want to extend our arms. We have all these, you know, fantastical designs. So considering everything

1678 3:18:51,200 --> 3:19:11,200 else that is going on in terms of batteries and intensity of compute, maybe you can leverage all those aspects into coming up with something, well, I don't know, more interesting in terms of the robot that you're building, and I'm hoping you're able to explore those

1679 3:19:11,200 --> 3:19:30,200 directions. Yeah, I mean, I think it would be cool to, you know, make Inspector Gadget real. That would be pretty sweet. So, yeah, right now we just want to make a basic humanoid that works well, and our goal is the fastest path to a useful humanoid robot.

1680 3:19:30,200 --> 3:19:49,200 I think this will ground us in reality, literally, and ensure that we are doing something useful. One of the hardest things to do is to actually be useful, and then to have high utility, the area under the curve of how many people

1681 3:19:49,200 --> 3:20:08,200 you helped, you know, and how much help you provided to each person on average, the total utility. Trying to actually ship a useful product that people like to a large number of people is so insanely hard,

1682 3:20:08,200 --> 3:20:28,200 it boggles the mind. You know, I can say, like, man, there's a hell of a difference between a company that has shipped product and one that has not shipped product. Again, it's night and day. And then even once you ship product, can you make the value of the output worth more than the cost of the input? Which is, again, insanely difficult, especially with hardware.

1683 3:20:28,200 --> 3:20:57,200 But I think over time it would be cool to do creative things and have, like, eight arms and whatever, and have different versions. And maybe, you know, there'll be some companies that are able to add hardware to an Optimus. Like maybe we, you know, add a power port or something like that, and you can add attachments to your Optimus like you can add them to your phone.

1684 3:20:57,200 --> 3:21:18,200 There could be a lot of cool things done over time, and there could maybe be an ecosystem of small companies, or big companies, that make add-ons for Optimus. So with that, I'd like to thank the team for their hard work. You guys are awesome.

1685 3:21:18,200 --> 3:21:33,200 And thank you all for coming, and for everyone online, thanks for tuning in. And I think this will be one of those great videos where you can, like, fast forward to the bits that you find most interesting.

1686 3:21:33,200 --> 3:21:48,200 But we try to give you a tremendous amount of detail, literally so that you can look at the video at your leisure and you can focus on the parts that you find interesting and skip the other parts. So thank you all. And we'll do this, try to do this every year.

1687 3:21:48,200 --> 3:22:04,200 And we might do a monthly podcast even. So, but I think it'd be, you know, great to sort of bring you along for the ride and like show you what cool things are happening. And yeah, thank you. All right. Thanks.

1688 3:22:18,200 --> 3:22:33,200 Thanks.
