Tesla AI Day 2022

science-fiction minus the fiction


01 Oct 2022

AI Day is a recruitment event aimed at engineers: Tesla shares its progress on the AI front to get people excited about joining.

Since 2020, Tesla has topped the list of the most attractive companies for US engineering students (ahead of SpaceX, Lockheed Martin, Google, Boeing, NASA, Apple, Microsoft and Amazon).

It also acts as PR for Tesla and provides a better understanding of the technical progress & roadmap for investors, clients, fans... and competitors!

It is a three-hour-long, quite technical presentation.

Even though a lot of it flies over my head, I find it fascinating to get a glimpse "under the hood" at how these innovative technologies get engineered and built.

What always amazes me with Tesla is how open they are with their engineering - from having open-sourced their patents years ago, to sharing their engineering work (typically one of the most closely guarded trade secrets at companies that build things) in granular detail.

Introducing: Optimus

This could be a defining moment in history.

"Tesla could make a meaningful contribution to AGI"

  • production planned at high volume (millions of units)
  • price target of <$20k
  • on "Elon time", Optimus will get to market in 3-5 years (so probably 5-8 years).

This would mean the same cost as one person's low-wage salary for a year, for a robot that will, over time, be much more productive.

An economy is roughly capita times production value per capita - what does an economy look like when there is no limitation on capita? 🤔 🤯
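Elon's "capita times productivity per capita" framing from the talk can be put into a toy calculation. All numbers here are invented purely for illustration - they are not Tesla figures:

```python
# Toy model of the "economy = capita x productivity per capita" framing
# from the talk. All numbers are illustrative assumptions, not Tesla figures.

def economic_output(capita: int, productivity_per_capita: float) -> float:
    """Total output = number of productive entities x output per entity."""
    return capita * productivity_per_capita

# A human-only economy: capita is bounded by the working population.
humans_only = economic_output(capita=1_000_000, productivity_per_capita=50_000)
print(humans_only)  # 50000000000

# Add robots: capita is no longer bounded by population, so output
# scales with however many robots can be manufactured.
with_robots = economic_output(capita=1_000_000 + 10_000_000,
                              productivity_per_capita=50_000)
print(with_robots / humans_only)  # 11.0
```

The point of the quote is that once capita is effectively unbounded, the left factor has no ceiling, so total output doesn't either.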

"The potential for Optimus is, I think, appreciated by very few people"

"Join Tesla and help make it a reality and bring it to fruition at scale, such that it can help millions of people.
The potential, like I said, really boggles the mind, because you have to ask: what is an economy?
An economy is sort of productive entities times their productivity - capita times productivity per capita. At the point at which there is not a limitation on capita, it's not clear what an economy even means. At that point an economy becomes quasi-infinite. This means a future of abundance, a future where there is no poverty, where you can have whatever you want in terms of products and services. It really is a fundamental transformation of civilization as we know it.
Obviously we want to make sure that transformation is a positive one, and safe. But that's also why I think Tesla as an entity doing this - being a single class of stock, publicly traded, owned by the public - is very important and should not be overlooked. I think this is essential, because then if the public doesn't like what Tesla is doing, the public can buy shares in Tesla and vote differently. This is a big deal. It's very important that I can't just do what I want - sometimes people think that, but it's not true. So it's very important that the corporate entity that makes this happen is something that the public can properly influence, and I think the Tesla structure is ideal for that."
Elon Musk

Self-driving cars have a potential for 10x economic output.

Optimus has potential for 100x economic output!

Test use cases showed:

  • moving boxes and objects on the factory floor ai-day-2022/221003-020227-tesla-ai-day-2022-0014- (same software as Tesla FSD)
    ai-day-2022/221003-020240-tesla-ai-day-2022-0015- (actual workstation in one of the Tesla factories)
  • bringing packages to office workers ai-day-2022/221003-020150-tesla-ai-day-2022-0013-
  • watering flowers ai-day-2022/221003-020113-tesla-ai-day-2022-0010-

Using semi-off-the-shelf actuators at the moment; a custom design is in the works.
Working on optimising the cost & scalability of the actuators.

Opposable thumbs: can operate tools.

"we've also designed it using the same discipline that we use in designing the car which is to design it for manufacturing such that it's possible to make the robot in high volume at low cost with high reliability"



  • what amazes me is the pace of innovation: from concept to working prototype in under a year (6-8 months, they said).


  • a weight similar to a human's means no extra weight restrictions/limitations in spaces built for people.

Latest generation


Orange are actuators, blue are electrical systems.

Cost and efficiency are the focus.

Part count and power consumption will be optimised/minimised.


  • the 2.3 kWh battery will be good for one full day of work
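As a back-of-envelope sanity check on that claim (the pack is 2.3 kWh per the transcript below; the 8-hour shift length is my own assumption, not Tesla's):

```python
# Rough average power budget implied by a 2.3 kWh pack lasting
# "about a full day's worth of work" (capacity figure from the presentation).
# The shift length is an assumption for illustration.

PACK_CAPACITY_WH = 2300  # 2.3 kWh
SHIFT_HOURS = 8          # assumed working-day length

average_power_w = PACK_CAPACITY_WH / SHIFT_HOURS
print(f"{average_power_w:.1f} W average draw")  # 287.5 W average draw
```

Under those assumptions the robot has to average under ~300 W, which is why the presentation stresses minimising idle power consumption.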


  • the bot's brain is in the torso, leveraging Tesla FSD hardware and software


The models used for crash-test simulations are extremely complex and accurate. The same models are used for Optimus.


"we're just bags of soggy jelly and bones thrown in" 😂







  • the red axis denotes the optimum
  • a "commonality study" to minimise the number of different actuators


  • a single actuator is able to lift a 500 kg piano.



  • biologically inspired design, because the world around us is built for human ergonomics. Adapt the robot to its environment, not vice versa, so the robot can interact with the world of humans, no matter what.

If you are interested in the technical details, I encourage you to watch the whole presentation - it's quite fascinating.


It was possible to get from last year's concept to a functioning version so quickly because of the years spent by the FSD team: a robot on legs vs a robot on wheels.
It uses the same "occupancy network" as the Tesla cars.



Full Self-Driving




FSD Beta could technically be made available worldwide by the end of the year.
The hurdle will be local regulatory approvals.

Metric to optimise against: how many miles in full autonomy between necessary interventions.
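That metric is straightforward to compute from drive logs. Here is a minimal sketch; the log structure below is hypothetical, invented purely for illustration (it is not Tesla's actual telemetry format):

```python
# Minimal sketch of the "miles of full autonomy between necessary
# interventions" metric. The Drive record is a hypothetical log format.

from dataclasses import dataclass

@dataclass
class Drive:
    miles: float        # miles driven in full autonomy
    interventions: int  # necessary driver interventions during the drive

def miles_per_intervention(drives: list[Drive]) -> float:
    """Fleet-level autonomous miles per necessary intervention."""
    total_miles = sum(d.miles for d in drives)
    total_interventions = sum(d.interventions for d in drives)
    if total_interventions == 0:
        return float("inf")  # no interventions observed yet
    return total_miles / total_interventions

drives = [Drive(120.0, 1), Drive(300.0, 2), Drive(80.0, 0)]
print(miles_per_intervention(drives))  # 500 miles / 3 interventions
```

Aggregating across the fleet before dividing (rather than averaging per-drive ratios) is the natural choice here, since drives with zero interventions would otherwise be undefined.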

DOJO: In-House Supercomputer

400,000 video instantiations per second!

Invented a "language of lanes" to handle the logic of complicated 3D representations of lanes and their interrelations.
The framework could be extrapolated to a "language of walking paths" for Optimus.





Full transcript generated with Whisper !ai/whisper:

Tesla AI Day 2022.m4a
transcribed: 2022-10-02 21:19 | english

1 0:14:00,000 --> 0:14:02,000 Oh

2 0:14:30,000 --> 0:14:37,200 All right, welcome everybody give everyone a moment to

3 0:14:39,840 --> 0:14:41,840 Get back in the audience and

4 0:14:43,760 --> 0:14:49,920 All right, great welcome to Tesla AI day 2022

5 0:14:49,920 --> 0:14:56,920 We've got some really exciting things to show you I think you'll be pretty impressed I do want to set some expectations with respect to our

6 0:15:08,680 --> 0:15:14,600 Optimist robot as as you know last year was just a person in a robot suit

7 0:15:14,600 --> 0:15:22,600 But we've knocked we've come a long way and it's I think we you know compared to that it's gonna be very impressive

8 0:15:23,720 --> 0:15:25,720 and

9 0:15:26,280 --> 0:15:28,040 We're gonna talk about

10 0:15:28,040 --> 0:15:32,720 The advancements in AI for full self-driving as well as how they apply to

11 0:15:33,240 --> 0:15:38,600 More generally to real-world AI problems like a humanoid robot and even going beyond that

12 0:15:38,600 --> 0:15:43,640 I think there's some potential that what we're doing here at Tesla could

13 0:15:44,480 --> 0:15:48,320 make a meaningful contribution to AGI and

14 0:15:49,400 --> 0:15:51,860 And I think actually Tesla's a good

15 0:15:52,680 --> 0:15:58,400 entity to do it from a governance standpoint because we're a publicly traded company with one class of

16 0:15:58,920 --> 0:16:04,580 Stock and that means that the public controls Tesla, and I think that's actually a good thing

17 0:16:04,580 --> 0:16:07,380 So if I go crazy you can fire me this is important

18 0:16:08,740 --> 0:16:10,740 Maybe I'm not crazy. All right

19 0:16:11,380 --> 0:16:13,380 so

20 0:16:14,340 --> 0:16:19,620 Yeah, so we're going to talk a lot about our progress in AI autopilot as well as progress in

21 0:16:20,340 --> 0:16:27,380 with Dojo and then we're gonna bring the team out and do a long Q&A so you can ask tough questions

22 0:16:29,140 --> 0:16:32,260 Whatever you'd like existential questions technical questions

23 0:16:32,260 --> 0:16:37,060 But we want to have as much time for Q&A as possible

24 0:16:37,540 --> 0:16:40,100 So, let's see with that

25 0:16:41,380 --> 0:16:42,340 Because

26 0:16:42,340 --> 0:16:47,700 Hey guys, I'm Milana work on autopilot and it is my book and I'm Lizzie

27 0:16:48,580 --> 0:16:51,780 Mechanical engineer on the project as well. Okay

28 0:16:53,060 --> 0:16:57,620 So should we should we bring out the bot before we do that we have one?

29 0:16:57,620 --> 0:17:02,660 One little bonus tip for the day. This is actually the first time we try this robot without any

30 0:17:03,300 --> 0:17:05,300 backup support cranes

31 0:17:05,540 --> 0:17:08,740 Mechanical mechanisms no cables nothing. Yeah

32 0:17:08,740 --> 0:17:28,020 I want to do it with you guys tonight. That is the first time. Let's see. You ready? Let's go

33 0:17:38,740 --> 0:17:40,980 So

34 0:18:08,820 --> 0:18:12,820 I think the bot got some moves here

35 0:18:24,740 --> 0:18:29,700 So this is essentially the same full self-driving computer that runs in your tesla cars by the way

36 0:18:29,700 --> 0:18:37,400 So this is literally the first time the robot has operated without a tether was on stage tonight

37 0:18:59,700 --> 0:19:01,700 So

38 0:19:14,580 --> 0:19:19,220 So the robot can actually do a lot more than we just showed you we just didn't want it to fall on its face

39 0:19:20,500 --> 0:19:25,860 So we'll we'll show you some videos now of the robot doing a bunch of other things

40 0:19:25,860 --> 0:19:32,420 Um, yeah, which are less risky. Yeah, we should close the screen guys

41 0:19:34,420 --> 0:19:36,420 Yeah

42 0:19:40,900 --> 0:19:46,660 Yeah, we wanted to show a little bit more what we've done over the past few months with the bot and just walking around and dancing on stage

43 0:19:49,700 --> 0:19:50,900 Just humble beginnings

44 0:19:50,900 --> 0:19:56,260 But you can see the autopilot neural networks running as it's just retrained for the bot

45 0:19:56,900 --> 0:19:58,900 Directly on that on that new platform

46 0:19:59,620 --> 0:20:03,540 That's my watering can yeah when you when you see a rendered view, that's that's the robot

47 0:20:03,780 --> 0:20:08,740 What's the that's the world the robot sees so it's it's very clearly identifying objects

48 0:20:09,300 --> 0:20:11,860 Like this is the object it should pick up picking it up

49 0:20:12,500 --> 0:20:13,700 um

50 0:20:13,700 --> 0:20:15,700 Yeah

51 0:20:15,700 --> 0:20:21,300 So we use the same process as we did for the pilot to connect data and train neural networks that we then deploy on the robot

52 0:20:22,020 --> 0:20:25,460 That's an example that illustrates the upper body a little bit more

53 0:20:28,660 --> 0:20:32,740 Something that will like try to nail down in a few months over the next few months, I would say

54 0:20:33,460 --> 0:20:35,060 to perfection

55 0:20:35,060 --> 0:20:39,060 This is really an actual station in the fremont factory as well that it's working at

56 0:20:39,060 --> 0:20:45,060 Yep, so

57 0:20:54,180 --> 0:20:57,700 And that's not the only thing we have to show today, right? Yeah, absolutely. So

58 0:20:58,180 --> 0:20:59,140 um

59 0:20:59,140 --> 0:21:02,500 that what you saw was what we call bumble see that's our

60 0:21:03,620 --> 0:21:06,420 uh sort of rough development robot using

61 0:21:06,420 --> 0:21:08,420 Semi off-the-shelf actuators

62 0:21:08,980 --> 0:21:14,420 Um, but we actually uh have gone a step further than that already the team's done an incredible job

63 0:21:14,980 --> 0:21:20,660 Um, and we actually have an optimist bot with uh fully tesla designed and built actuators

64 0:21:21,460 --> 0:21:25,220 um battery pack uh control system everything um

65 0:21:25,780 --> 0:21:30,420 It it wasn't quite ready to walk, but I think it will walk in a few weeks

66 0:21:30,420 --> 0:21:37,700 Um, but we wanted to show you the robot, uh, the the something that's actually fairly close to what will go into production

67 0:21:38,420 --> 0:21:42,180 And um and show you all the things it can do so let's bring it out

68 0:21:42,180 --> 0:21:58,180 Do it

69 0:22:12,180 --> 0:22:14,180 So

70 0:22:33,380 --> 0:22:37,620 So here you're seeing optimists with uh, these are the

71 0:22:37,620 --> 0:22:43,780 The with the degrees of freedom that we expect to have in optimist production unit one

72 0:22:44,340 --> 0:22:47,860 Which is the ability to move all the fingers independently move the

73 0:22:48,900 --> 0:22:51,060 To have the thumb have two degrees of freedom

74 0:22:51,700 --> 0:22:53,620 So it has opposable thumbs

75 0:22:53,620 --> 0:22:59,380 And uh both left and right hand so it's able to operate tools and do useful things our goal is to make

76 0:23:00,660 --> 0:23:04,580 a useful humanoid robot as quickly as possible and

77 0:23:04,580 --> 0:23:10,100 Uh, we've also designed it using the same discipline that we use in designing the car

78 0:23:10,180 --> 0:23:13,080 Which is to say to to design it for manufacturing

79 0:23:14,020 --> 0:23:17,620 Such that it's possible to make the robot at in high volume

80 0:23:18,340 --> 0:23:20,580 At low cost with high reliability

81 0:23:21,300 --> 0:23:27,000 So that that's incredibly important. I mean you've all seen very impressive humanoid robot demonstrations

82 0:23:28,020 --> 0:23:30,100 And that that's great. But what are they missing?

83 0:23:30,100 --> 0:23:37,300 Um, they're missing a brain that they don't have the intelligence to navigate the world by themselves

84 0:23:37,700 --> 0:23:39,700 And they're they're also very expensive

85 0:23:40,340 --> 0:23:42,340 and made in low volume

86 0:23:42,340 --> 0:23:43,460 whereas

87 0:23:43,460 --> 0:23:49,860 This is the optimist is designed to be an extremely capable robot but made in very high volume probably

88 0:23:50,420 --> 0:23:52,260 ultimately millions of units

89 0:23:52,260 --> 0:23:55,940 Um, and it is expected to cost much less than a car

90 0:23:55,940 --> 0:24:00,740 So uh, I would say probably less than twenty thousand dollars would be my guess

91 0:24:06,980 --> 0:24:12,740 The potential for optimists is I think appreciated by very few people

92 0:24:16,980 --> 0:24:19,380 As usual tesla demos are coming in hot

93 0:24:20,740 --> 0:24:22,740 So

94 0:24:22,740 --> 0:24:25,380 So, okay, that's good. That's good. Um

95 0:24:26,180 --> 0:24:27,380 Yeah

96 0:24:27,380 --> 0:24:32,100 Uh, the i'm the team's put in put in and the team has put in an incredible amount of work

97 0:24:32,580 --> 0:24:37,540 Uh, it's uh working days, you know, seven days a week running the 3am oil

98 0:24:38,100 --> 0:24:43,780 That to to get to the demonstration today. Um, super proud of what they've done is they've really done done a great job

99 0:24:43,780 --> 0:24:52,980 I just like to give a hand to the whole optimist team

100 0:24:56,900 --> 0:25:02,980 So, you know that now there's still a lot of work to be done to refine optimists and

101 0:25:03,620 --> 0:25:06,580 Improve it obviously this is just optimist version one

102 0:25:06,580 --> 0:25:14,660 Um, and that's really why we're holding this event which is to convince some of the most talented people in the world like you guys

103 0:25:15,140 --> 0:25:16,340 um

104 0:25:16,340 --> 0:25:17,380 to

105 0:25:17,380 --> 0:25:22,820 Join tesla and help make it a reality and bring it to fruition at scale

106 0:25:23,620 --> 0:25:25,300 Such that it can help

107 0:25:25,300 --> 0:25:26,980 millions of people

108 0:25:26,980 --> 0:25:30,340 um, and the the and the potential like I said is is really

109 0:25:30,340 --> 0:25:35,860 Boggles the mind because you have to say like what what is an economy an economy is?

110 0:25:36,580 --> 0:25:37,700 uh

111 0:25:37,700 --> 0:25:39,700 sort of productive

112 0:25:39,700 --> 0:25:42,820 entities times the productivity uh capita times

113 0:25:43,380 --> 0:25:44,420 output

114 0:25:44,420 --> 0:25:48,500 Productivity per capita at the point at which there is not a limitation on capita

115 0:25:49,220 --> 0:25:54,100 The it's not clear what an economy even means at that point. It an economy becomes quasi infinite

116 0:25:54,980 --> 0:25:56,100 um

117 0:25:56,100 --> 0:25:58,100 so

118 0:25:58,100 --> 0:26:02,740 What what you know take into fruition in the hopefully benign scenario?

119 0:26:04,420 --> 0:26:05,940 the

120 0:26:05,940 --> 0:26:10,260 this means a future of abundance a future where

121 0:26:12,260 --> 0:26:18,760 There is no poverty where people you can have whatever you want in terms of products and services

122 0:26:18,760 --> 0:26:27,320 Um it really is a a fundamental transformation of civilization as we know it

123 0:26:28,680 --> 0:26:30,040 um

124 0:26:30,040 --> 0:26:33,800 Obviously, we want to make sure that transformation is a positive one and um

125 0:26:35,000 --> 0:26:36,600 safe

126 0:26:36,600 --> 0:26:38,600 And but but that's also why I think

127 0:26:39,320 --> 0:26:45,400 tesla as an entity doing this being a single class of stock publicly traded owned by the public

128 0:26:46,200 --> 0:26:48,200 Um is very important

129 0:26:48,200 --> 0:26:50,200 Um and should not be overlooked

130 0:26:50,360 --> 0:26:57,960 I think this is essential because then if the public doesn't like what tesla is doing the public can buy shares in tesla and vote

131 0:26:58,500 --> 0:27:00,200 differently

132 0:27:00,200 --> 0:27:02,200 This is a big deal. Um

133 0:27:03,000 --> 0:27:05,720 Like it's very important that that I can't just do what I want

134 0:27:06,360 --> 0:27:08,920 You know sometimes people think that but it's not true

135 0:27:09,480 --> 0:27:10,680 um

136 0:27:10,680 --> 0:27:12,680 so um

137 0:27:13,720 --> 0:27:15,720 You know that it's very important that the

138 0:27:15,720 --> 0:27:21,400 the corporate entity that has that makes this happen is something that the public can

139 0:27:22,120 --> 0:27:24,120 properly influence

140 0:27:24,120 --> 0:27:25,240 um

141 0:27:25,240 --> 0:27:28,200 And so I think the tesla structure is is is ideal for that

142 0:27:29,240 --> 0:27:31,240 um

143 0:27:32,760 --> 0:27:39,080 And like I said that you know self-driving cars will certainly have a tremendous impact on the world

144 0:27:39,720 --> 0:27:41,800 um, I think they will improve

145 0:27:41,800 --> 0:27:45,000 the productivity of transport by at least

146 0:27:46,120 --> 0:27:49,880 A half order of magnitude perhaps an order of magnitude perhaps more

147 0:27:51,000 --> 0:27:52,680 um

148 0:27:52,680 --> 0:27:54,680 Optimist I think

149 0:27:54,920 --> 0:27:56,920 has

150 0:27:57,400 --> 0:28:03,880 Maybe a two order of magnitude uh potential improvement in uh economic output

151 0:28:05,160 --> 0:28:09,240 Like like it's not clear. It's not clear what the limit actually even is

152 0:28:09,240 --> 0:28:11,240 um

153 0:28:11,800 --> 0:28:13,800 So

154 0:28:14,040 --> 0:28:17,320 But we need to do this in the right way we need to do it carefully and safely

155 0:28:17,960 --> 0:28:21,800 and ensure that the outcome is one that is beneficial to

156 0:28:22,580 --> 0:28:26,040 uh civilization and and one that humanity wants

157 0:28:27,240 --> 0:28:30,040 Uh can't this is extremely important obviously

158 0:28:30,920 --> 0:28:32,920 so um

159 0:28:34,440 --> 0:28:36,440 And I hope you will consider

160 0:28:36,680 --> 0:28:38,360 uh joining

161 0:28:38,360 --> 0:28:40,360 tesla to uh

162 0:28:40,920 --> 0:28:42,920 achieve those goals

163 0:28:43,160 --> 0:28:44,120 um

164 0:28:44,120 --> 0:28:49,880 It tells us we're we're we really care about doing the right thing here or aspire to do the right thing and and really not

165 0:28:51,000 --> 0:28:53,000 Pave the road to hell with with good intentions

166 0:28:53,240 --> 0:28:55,800 And I think the road is road to hell is mostly paved with bad intentions

167 0:28:55,800 --> 0:28:57,880 But every now and again, there's a good intention in there

168 0:28:58,440 --> 0:29:03,400 So we want to do the right thing. Um, so, you know consider joining us and helping make it happen

169 0:29:04,760 --> 0:29:07,480 With that let's uh, we want to the next phase

170 0:29:07,480 --> 0:29:09,480 Please right on. Thank you

171 0:29:15,960 --> 0:29:19,640 All right, so you've seen a couple robots today, let's do a quick timeline recap

172 0:29:20,200 --> 0:29:24,760 So last year we unveiled the tesla bot concept, but a concept doesn't get us very far

173 0:29:25,160 --> 0:29:30,680 We knew we needed a real development and integration platform to get real life learnings as quickly as possible

174 0:29:31,240 --> 0:29:36,280 So that robot that came out and did the little routine for you guys. We had that within six months built

175 0:29:36,280 --> 0:29:40,760 working on software integration hardware upgrades over the months since then

176 0:29:41,240 --> 0:29:45,160 But in parallel, we've also been designing the next generation this one over here

177 0:29:46,520 --> 0:29:51,720 So this guy is rooted in the the foundation of sort of the vehicle design process

178 0:29:51,720 --> 0:29:54,840 You know, we're leveraging all of those learnings that we already have

179 0:29:55,960 --> 0:29:58,200 Obviously, there's a lot that's changed since last year

180 0:29:58,200 --> 0:30:00,440 But there's a few things that are still the same you'll notice

181 0:30:00,440 --> 0:30:04,040 We still have this really detailed focus on the true human form

182 0:30:04,040 --> 0:30:07,800 We think that matters for a few reasons, but it's fun

183 0:30:07,800 --> 0:30:11,000 We spend a lot of time thinking about how amazing the human body is

184 0:30:11,720 --> 0:30:13,720 We have this incredible range of motion

185 0:30:14,280 --> 0:30:16,280 Typically really amazing strength

186 0:30:17,080 --> 0:30:22,680 A fun exercise is if you put your fingertip on the chair in front of you, you'll notice that there's a huge

187 0:30:23,480 --> 0:30:28,200 Range of motion that you have in your shoulder and your elbow, for example without moving your fingertip

188 0:30:28,200 --> 0:30:30,200 You can move those joints all over the place

189 0:30:30,200 --> 0:30:34,200 But the robot, you know, its main function is to do real useful work

190 0:30:34,200 --> 0:30:38,200 And it maybe doesn't necessarily need all of those degrees of freedom right away

191 0:30:38,200 --> 0:30:42,200 So we've stripped it down to a minimum sort of 28 fundamental degrees of freedom

192 0:30:42,200 --> 0:30:44,200 And then of course our hands in addition to that

193 0:30:46,200 --> 0:30:50,200 Humans are also pretty efficient at some things and not so efficient in other times

194 0:30:50,200 --> 0:30:56,200 So for example, we can eat a small amount of food to sustain ourselves for several hours. That's great

195 0:30:56,200 --> 0:31:02,200 But when we're just kind of sitting around, no offense, but we're kind of inefficient. We're just sort of burning energy

196 0:31:02,200 --> 0:31:06,200 So on the robot platform what we're going to do is we're going to minimize that idle power consumption

197 0:31:06,200 --> 0:31:08,200 Drop it as low as possible

198 0:31:08,200 --> 0:31:14,200 And that way we can just flip a switch and immediately the robot turns into something that does useful work

199 0:31:16,200 --> 0:31:20,200 So let's talk about this latest generation in some detail, shall we?

200 0:31:20,200 --> 0:31:24,200 So on the screen here, you'll see in orange our actuators, which we'll get to in a little bit

201 0:31:24,200 --> 0:31:26,200 And in blue our electrical system

202 0:31:28,200 --> 0:31:33,200 So now that we have our sort of human-based research and we have our first development platform

203 0:31:33,200 --> 0:31:37,200 We have both research and execution to draw from for this design

204 0:31:37,200 --> 0:31:40,200 Again, we're using that vehicle design foundation

205 0:31:40,200 --> 0:31:46,200 So we're taking it from concept through design and analysis and then build and validation

206 0:31:46,200 --> 0:31:50,200 Along the way, we're going to optimize for things like cost and efficiency

207 0:31:50,200 --> 0:31:54,200 Because those are critical metrics to take this product to scale eventually

208 0:31:54,200 --> 0:31:56,200 How are we going to do that?

209 0:31:56,200 --> 0:32:01,200 Well, we're going to reduce our part count and our power consumption of every element possible

210 0:32:01,200 --> 0:32:05,200 We're going to do things like reduce the sensing and the wiring at our extremities

211 0:32:05,200 --> 0:32:11,200 You can imagine a lot of mass in your hands and feet is going to be quite difficult and power consumptive to move around

212 0:32:11,200 --> 0:32:18,200 And we're going to centralize both our power distribution and our compute to the physical center of the platform

213 0:32:18,200 --> 0:32:23,200 So in the middle of our torso, actually it is the torso, we have our battery pack

214 0:32:23,200 --> 0:32:28,200 This is sized at 2.3 kilowatt hours, which is perfect for about a full day's worth of work

215 0:32:28,200 --> 0:32:36,200 What's really unique about this battery pack is it has all of the battery electronics integrated into a single PCB within the pack

216 0:32:36,200 --> 0:32:45,200 So that means everything from sensing to fusing, charge management and power distribution is all in one place

217 0:32:45,200 --> 0:32:54,200 We're also leveraging both our vehicle products and our energy products to roll all of those key features into this battery

218 0:32:54,200 --> 0:33:02,200 So that's streamlined manufacturing, really efficient and simple cooling methods, battery management and also safety

219 0:33:02,200 --> 0:33:08,200 And of course we can leverage Tesla's existing infrastructure and supply chain to make it

220 0:33:08,200 --> 0:33:15,200 So going on to sort of our brain, it's not in the head, but it's pretty close

221 0:33:15,200 --> 0:33:19,200 Also in our torso we have our central computer

222 0:33:19,200 --> 0:33:24,200 So as you know, Tesla already ships full self-driving computers in every vehicle we produce

223 0:33:24,200 --> 0:33:30,200 We want to leverage both the autopilot hardware and the software for the humanoid platform

224 0:33:30,200 --> 0:33:35,200 But because it's different in requirements and in form factor, we're going to change a few things first

225 0:33:35,200 --> 0:33:45,200 So we still are going to do everything that a human brain does, processing vision data, making split-second decisions based on multiple sensory inputs

226 0:33:45,200 --> 0:33:53,200 And also communications, so to support communications it's equipped with wireless connectivity as well as audio support

227 0:33:53,200 --> 0:34:00,200 And then it also has hardware level security features, which are important to protect both the robot and the people around the robot

228 0:34:00,200 --> 0:34:07,200 So now that we have our sort of core, we're going to need some limbs on this guy

229 0:34:07,200 --> 0:34:12,200 And we'd love to show you a little bit about our actuators and our fully functional hands as well

230 0:34:12,200 --> 0:34:18,200 But before we do that, I'd like to introduce Malcolm, who's going to speak a little bit about our structural foundation for the robot

231 0:34:18,200 --> 0:34:26,200 Thank you, Jiji

232 0:34:26,200 --> 0:34:33,200 Tesla have the capabilities to analyze highly complex systems

233 0:34:33,200 --> 0:34:36,200 They don't get much more complex than a crash

234 0:34:36,200 --> 0:34:41,200 You can see here a simulated crash from model 3 superimposed on top of the actual physical crash

235 0:34:41,200 --> 0:34:44,200 It's actually incredible how accurate it is

236 0:34:44,200 --> 0:34:47,200 Just to give you an idea of the complexity of this model

237 0:34:47,200 --> 0:34:53,200 It includes every nut, bolt and washer, every spot weld, and it has 35 million degrees of freedom

238 0:34:53,200 --> 0:34:55,200 Quite amazing

239 0:34:55,200 --> 0:35:01,200 And it's true to say that if we didn't have models like this, we wouldn't be able to make the safest cars in the world

240 0:35:01,200 --> 0:35:09,200 So can we utilize our capabilities and our methods from the automotive side to influence a robot?

241 0:35:09,200 --> 0:35:16,200 Well, we can make a model, and since we have crash software, we're using the same software here, we can make it fall down

242 0:35:16,200 --> 0:35:23,200 The purpose of this is to make sure that if it falls down, ideally it doesn't, but it's superficial damage

243 0:35:23,200 --> 0:35:26,200 We don't want it to, for example, break its gearbox and its arms

244 0:35:26,200 --> 0:35:31,200 That's equivalent of a dislocated shoulder of a robot, difficult and expensive to fix

245 0:35:31,200 --> 0:35:38,200 So we want it to dust itself off, get on with the job it's being given

246 0:35:38,200 --> 0:35:47,200 We can also take the same model, and we can drive the actuators using the inputs from a previously solved model, bringing it to life

247 0:35:47,200 --> 0:35:51,200 So this is producing the motions for the tasks we want the robot to do

248 0:35:51,200 --> 0:35:55,200 These tasks are picking up boxes, turning, squatting, walking upstairs

249 0:35:55,200 --> 0:35:58,200 Whatever the set of tasks are, we can place the model

250 0:35:58,200 --> 0:36:00,200 This is showing just simple walking

251 0:36:00,200 --> 0:36:08,200 We can create the stresses in all the components that helps us to optimize the components

252 0:36:08,200 --> 0:36:10,200 These are not dancing robots

253 0:36:10,200 --> 0:36:14,200 These are actually the modal behavior, the first five modes of the robot

254 0:36:14,200 --> 0:36:22,200 Typically, when people make robots, they make sure the first mode is up around the top single figure, up towards 10 hertz

255 0:36:22,200 --> 0:36:26,200 The reason we do this is to make the controls of walking easier

256 0:36:26,200 --> 0:36:30,200 It's very difficult to walk if you can't guarantee where your foot is wobbling around

257 0:36:30,200 --> 0:36:34,200 That's okay to make one robot, we want to make thousands, maybe millions

258 0:36:34,200 --> 0:36:37,200 We haven't got the luxury of making them from carbon fiber, titanium

259 0:36:37,200 --> 0:36:41,200 We want to make them from plastic, things are not quite as stiff

260 0:36:41,200 --> 0:36:46,200 So we can't have these high targets, I call them dumb targets

261 0:36:46,200 --> 0:36:49,200 We've got to make them work at lower targets

262 0:36:49,200 --> 0:36:51,200 So is that going to work?

263 0:36:51,200 --> 0:36:57,200 Well, if you think about it, sorry about this, but we're just bags of soggy, jelly and bones thrown in

264 0:36:57,200 --> 0:37:02,200 We're not high frequency, if I stand on my leg, I don't vibrate at 10 hertz

265 0:37:02,200 --> 0:37:08,200 People operate at low frequency, so we know the robot actually can, it just makes controls harder

266 0:37:08,200 --> 0:37:14,200 So we take the information from this, the modal data and the stiffness and feed that into the control system

267 0:37:14,200 --> 0:37:16,200 That allows it to walk

268 0:37:18,200 --> 0:37:21,200 Just changing tack slightly, looking at the knee

269 0:37:21,200 --> 0:37:27,200 We can take some inspiration from biology and we can look to see what the mechanical advantage of the knee is

270 0:37:27,200 --> 0:37:33,200 It turns out it actually represents quite similar to four-bar link, and that's quite non-linear

271 0:37:33,200 --> 0:37:41,200 That's not surprising really, because if you think when you bend your leg down, the torque on your knee is much more when it's bent than it is when it's straight

272 0:37:41,200 --> 0:37:48,200 So you'd expect a non-linear function, and in fact the biology is non-linear, this matches it quite accurately

273 0:37:50,200 --> 0:37:56,200 So that's the representation, the four-bar link is obviously not physically four-bar link, as I said the characteristics are similar

274 0:37:56,200 --> 0:38:00,200 But me bending down, that's not very scientific, let's be a bit more scientific

275 0:38:00,200 --> 0:38:09,200 We've played all the tasks through this graph, and this is showing picking things up, walking, squatting, the tasks I said we did for the stress analysis

276 0:38:09,200 --> 0:38:16,200 And that's the torque seen at the knee against the knee bend on the horizontal axis

277 0:38:16,200 --> 0:38:20,200 This is showing the requirement for the knee to do all these tasks

278 0:38:20,200 --> 0:38:31,200 And then put a curve through it, surfing over the top of the peaks, and that's saying this is what's required to make the robot do these tasks

279 0:38:31,200 --> 0:38:42,200 So if we look at the four-bar link, that's actually the green curve, and it's saying that the non-linearity of the four-bar link has actually linearized the characteristic of the force

280 0:38:42,200 --> 0:38:50,200 What that really says is that's lowered the force, that's what makes the actuator have the lowest possible force, which is the most efficient, we want to burn energy up slowly

281 0:38:50,200 --> 0:39:00,200 What's the blue curve? Well the blue curve is actually if we didn't have a four-bar link, we just had an arm sticking out of my leg here with an actuator on it, a simple two-bar link

282 0:39:00,200 --> 0:39:08,200 That's the best we could do with a simple two-bar link, and it shows that that would create much more force in the actuator, which would not be efficient

283 0:39:08,200 --> 0:39:21,200 So what does that look like in practice? Well, as you'll see, it's very tightly packaged in the knee, you'll see it go transparent in a second, you'll see the four-bar link there, it's operating on the actuator

284 0:39:21,200 --> 0:39:25,200 This determines the force and the displacements on the actuator
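To get an intuition for why the four-bar linkage helps, here is a minimal sketch, with invented lever-arm and torque numbers (not Tesla's), of how a bend-dependent effective lever arm lowers the force the actuator must produce compared with a fixed two-bar lever:

```python
def required_actuator_force(knee_torque_nm, lever_arm_m):
    """Force the actuator must produce to generate a knee torque
    through a linkage with the given effective lever arm."""
    return knee_torque_nm / lever_arm_m

# Invented numbers: torque demand grows as the knee bends, and the
# four-bar linkage's effective lever arm grows with bend as well.
for bend_deg in (10, 45, 90):
    torque = 50 + 2.0 * bend_deg  # N*m, toy demand curve
    two_bar = required_actuator_force(torque, 0.04)  # fixed lever arm
    four_bar = required_actuator_force(torque, 0.04 + 0.0006 * bend_deg)
    print(bend_deg, round(two_bar), round(four_bar))
```

At deep bend, the growing lever arm of the four-bar keeps the actuator force much lower than the fixed lever, which is exactly the "linearized" green curve in the talk.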

285 0:39:25,200 --> 0:39:32,200 I'll now pass you over to Konstantinos to tell you a lot more detail about how these actuators are made and designed and optimized. Thank you

286 0:39:32,200 --> 0:39:39,200 Thank you Malcolm

287 0:39:39,200 --> 0:39:50,200 So I would like to talk to you about the design process and the actuator portfolio in our robot

288 0:39:50,200 --> 0:39:55,200 So there are many similarities between a car and a robot when it comes to powertrain design

289 0:39:55,200 --> 0:40:06,200 The most important thing that matters here is energy, mass, and cost. We are carrying over most of our designing experience from the car to the robot

290 0:40:08,200 --> 0:40:22,200 So in this particular case, you see a car with two drive units, and the drive units are used to accelerate the car for the 0 to 60 miles per hour time or to drive a city drive cycle

291 0:40:22,200 --> 0:40:32,200 While for the robot, which has 28 actuators, it's not obvious what the tasks are at the actuator level

292 0:40:32,200 --> 0:40:44,200 So we have tasks that are higher level like walking or climbing stairs or carrying a heavy object which needs to be translated into joint specs

293 0:40:44,200 --> 0:40:59,200 Therefore we use a model that generates the torque-speed trajectories for our joints, which subsequently are fed into our optimization model to run through the optimization process

294 0:41:01,200 --> 0:41:07,200 This is one of the scenarios that the robot is capable of doing which is turning and walking

295 0:41:07,200 --> 0:41:25,200 So when we have this torque speed trajectory, we lay it over an efficiency map of an actuator and we are able along the trajectory to generate the power consumption and the cumulative energy for the task versus time
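The energy bookkeeping described here can be sketched as a simple integration along the torque-speed trajectory; the efficiency map is stood in for by a callable, and all numbers are illustrative only:

```python
def cumulative_energy(trajectory, efficiency):
    """Integrate electrical energy along a torque-speed trajectory.

    trajectory: iterable of (torque_Nm, speed_rad_s, dt_s) samples.
    efficiency: callable (torque, speed) -> efficiency in (0, 1],
                standing in for the actuator's efficiency map.
    Returns cumulative electrical energy in joules.
    """
    energy = 0.0
    for torque, speed, dt in trajectory:
        mech_power = abs(torque * speed)  # W at the joint
        energy += mech_power / efficiency(torque, speed) * dt
    return energy

# Toy example: 1 s of 60 W mechanical output at a flat 80% efficiency.
segment = [(30.0, 2.0, 0.01)] * 100
print(cumulative_energy(segment, lambda t, w: 0.8))  # 75.0 J
```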

296 0:41:25,200 --> 0:41:38,200 So this allows us to define the system cost for the particular actuator and put a simple point into the cloud. Then we do this for hundreds of thousands of actuators by solving in our cluster

297 0:41:38,200 --> 0:41:44,200 And the red line denotes the Pareto front which is the preferred area where we will look for our optimal

298 0:41:44,200 --> 0:41:50,200 So the X denotes the preferred actuator design we have picked for this particular joint
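Picking the Pareto front out of the cloud of candidate designs is conceptually simple; a minimal sketch with hypothetical designs, where lower cost and lower energy are both better:

```python
def pareto_front(designs):
    """Keep only non-dominated designs; lower cost and lower energy win."""
    def dominates(a, b):
        return (a["cost"] <= b["cost"] and a["energy"] <= b["energy"]
                and (a["cost"] < b["cost"] or a["energy"] < b["energy"]))
    return [d for d in designs if not any(dominates(o, d) for o in designs)]

# Hypothetical actuator candidates for one joint:
cloud = [
    {"name": "A", "cost": 1.0, "energy": 5.0},
    {"name": "B", "cost": 2.0, "energy": 3.0},
    {"name": "C", "cost": 3.0, "energy": 4.0},  # dominated by B
    {"name": "D", "cost": 4.0, "energy": 1.0},
]
print([d["name"] for d in pareto_front(cloud)])  # ['A', 'B', 'D']
```

The "X" in the talk is then one point chosen from this front for the joint in question.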

299 0:41:50,200 --> 0:41:57,200 So now we need to do this for every joint. We have 28 joints to optimize and we parse our cloud

300 0:41:57,200 --> 0:42:07,200 We parse our cloud again for every joint spec, and the red X's this time denote the bespoke actuator designs for every joint

301 0:42:07,200 --> 0:42:15,200 The problem here is that we have too many unique actuator designs and even if we take advantage of the symmetry, still there are too many

302 0:42:15,200 --> 0:42:23,200 In order to make something mass manufacturable, we need to be able to reduce the amount of unique actuator designs

303 0:42:23,200 --> 0:42:36,200 Therefore, we run something called a commonality study, in which we parse our cloud again, looking this time for actuators that simultaneously meet the joint performance requirements for more than one joint at the same time
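A commonality study of this kind can be approximated with a greedy set-cover pass; this is a toy sketch with hypothetical actuator names and joint sets, not Tesla's actual method:

```python
def commonality_study(candidates, joints):
    """Greedy pass: pick few actuator designs that together meet the
    requirements of every joint (classic set cover, not guaranteed optimal)."""
    uncovered, chosen = set(joints), []
    while uncovered:
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        if not candidates[best] & uncovered:
            raise ValueError("some joints cannot be covered by any candidate")
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

# Hypothetical coverage sets: which joints each candidate design can serve.
coverage = {
    "A": {"hip", "knee"},
    "B": {"elbow"},
    "C": {"ankle", "elbow"},
}
print(commonality_study(coverage, {"hip", "knee", "ankle", "elbow"}))  # ['A', 'C']
```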

304 0:42:36,200 --> 0:42:48,200 So the resulting portfolio is six actuators; they are shown in a color map in the middle figure, and the actuators can also be viewed in this slide

305 0:42:48,200 --> 0:42:57,200 We have three rotary and three linear actuators, all of which have a great output force or torque per mass

306 0:42:57,200 --> 0:43:15,200 The rotary actuator in particular has a mechanical clutch integrated on the high speed side, an angular contact ball bearing on the high speed side, a cross roller bearing on the low speed side, and the gear train is a strain wave gear

307 0:43:15,200 --> 0:43:23,200 There are three integrated sensors here and a bespoke permanent magnet machine

308 0:43:23,200 --> 0:43:31,200 The linear actuator

309 0:43:31,200 --> 0:43:33,200 I'm sorry

310 0:43:33,200 --> 0:43:44,200 The linear actuator has planetary rollers and an inverted planetary screw as a gear train, which provides efficiency, compactness, and durability

311 0:43:44,200 --> 0:43:58,200 So in order to demonstrate the force capability of our linear actuators, we have set up an experiment in order to test it under its limits

312 0:43:58,200 --> 0:44:07,200 And I will let you enjoy the video

313 0:44:07,200 --> 0:44:19,200 So our actuator is able to lift

314 0:44:19,200 --> 0:44:25,200 A half ton, nine foot concert grand piano

315 0:44:25,200 --> 0:44:31,200 And

316 0:44:31,200 --> 0:44:56,200 This is a requirement, not something nice to have, because our muscles can do the same thing when they are directly driven; our quadricep muscles can do the same thing. It's just that the knee is a non-linear linkage system that converts force into velocity at the end effector of our heels, for the purpose of giving the human body agility

317 0:44:56,200 --> 0:45:10,200 So this is one of the main things that are amazing about the human body and I'm concluding my part at this point and I would like to welcome my colleague Mike who's going to talk to you about hand design. Thank you very much.

318 0:45:10,200 --> 0:45:13,200 Thanks, Constantine

319 0:45:13,200 --> 0:45:18,200 So we just saw how powerful a human and a humanoid actuator can be.

320 0:45:18,200 --> 0:45:23,200 However, humans are also incredibly dexterous.

321 0:45:23,200 --> 0:45:27,200 The human hand has the ability to move at 300 degrees per second.

322 0:45:27,200 --> 0:45:30,200 There's tens of thousands of tactile sensors.

323 0:45:30,200 --> 0:45:36,200 It has the ability to grasp and manipulate almost every object in our daily lives.

324 0:45:36,200 --> 0:45:40,200 For our robotic hand design, we are inspired by biology.

325 0:45:40,200 --> 0:45:43,200 We have five fingers and opposable thumb.

326 0:45:43,200 --> 0:45:48,200 Our fingers are driven by metallic tendons that are both flexible and strong.

327 0:45:48,200 --> 0:45:57,200 We have the ability to complete wide aperture power grasps, while also being optimized for precision gripping of small, thin and delicate objects.

328 0:45:57,200 --> 0:46:00,200 So why a human like robotic hand?

329 0:46:00,200 --> 0:46:05,200 Well, the main reason is that our factories and the world around us are designed to be ergonomic.

330 0:46:05,200 --> 0:46:09,200 So what that means is that it ensures that objects in our factory are graspable.

331 0:46:09,200 --> 0:46:17,200 But it also ensures that new objects that we may have never seen before can be grasped by the human hand and by our robotic hand as well.

332 0:46:17,200 --> 0:46:27,200 The converse there is pretty interesting, because it's saying that these objects are designed for our hand, instead of having to make changes to our hand to accommodate a new object.

333 0:46:27,200 --> 0:46:31,200 Some basic stats about our hand is that it has six actuators and 11 degrees of freedom.

334 0:46:31,200 --> 0:46:37,200 It has an in-hand controller, which drives the fingers and receives sensor feedback.

335 0:46:37,200 --> 0:46:43,200 Sensor feedback is really important to learn a little bit more about the objects that we're grasping and also for proprioception.

336 0:46:43,200 --> 0:46:48,200 And that's the ability for us to recognize where our hand is in space.

337 0:46:48,200 --> 0:46:51,200 One of the important aspects of our hand is that it's adaptive.

338 0:46:51,200 --> 0:46:58,200 This adaptability essentially comes from complex mechanisms that allow the hand to adapt to the object that's being grasped.

339 0:46:58,200 --> 0:47:01,200 Another important part is that we have a non back drivable finger drive.

340 0:47:01,200 --> 0:47:07,200 This clutching mechanism allows us to hold and transport objects without having to turn on the hand motors.

341 0:47:07,200 --> 0:47:12,200 You just heard how we went about designing the TeslaBot hardware.

342 0:47:12,200 --> 0:47:16,200 Now I'll hand it off to Milan and our autonomy team to bring this robot to life.

343 0:47:16,200 --> 0:47:24,200 Thanks, Michael.

344 0:47:24,200 --> 0:47:26,200 All right.

345 0:47:26,200 --> 0:47:36,200 So all those cool things we've shown earlier in the video were possible just in a matter of a few months thanks to the amazing work that we've done on autopilot over the past few years.

346 0:47:36,200 --> 0:47:40,200 Most of those components ported quite easily over to the bot's environment.

347 0:47:40,200 --> 0:47:45,200 If you think about it, we're just moving from a robot on wheels to a robot on legs.

348 0:47:45,200 --> 0:47:51,200 So some of the components are pretty similar and some of them require more heavy lifting.

349 0:47:51,200 --> 0:47:59,200 So for example, our computer vision neural networks were ported directly from Autopilot to the bot.

350 0:47:59,200 --> 0:48:07,200 It's exactly the same occupancy network, which we'll talk about in a little more detail later with the Autopilot team, that is now running on the bot here in this video.

351 0:48:07,200 --> 0:48:14,200 The only thing that changed really is the training data that we had to recollect.

352 0:48:14,200 --> 0:48:25,200 We're also trying to find ways to improve those occupancy networks using work done on neural radiance fields, to get really great volumetric renderings of the bot's environment.

353 0:48:25,200 --> 0:48:32,200 For example, here some machinery that the bot might have to interact with.

354 0:48:32,200 --> 0:48:42,200 Another interesting problem to think about: in indoor environments, which mostly lack a GPS signal, how do you get the bot to navigate to its destination?

355 0:48:42,200 --> 0:48:45,200 Say for instance, to find its nearest charging station.

356 0:48:45,200 --> 0:48:59,200 So we've been training more neural networks to identify high frequency features, key points within the bot's camera streams, and track them across frames over time as the bot navigates through its environment.

357 0:48:59,200 --> 0:49:09,200 And we're using those points to get a better estimate of the bot's pose and trajectory within its environment as it's walking.

358 0:49:09,200 --> 0:49:18,200 We also did quite some work on the simulation side, and this is literally the autopilot simulator to which we've integrated the robot locomotion code.

359 0:49:18,200 --> 0:49:27,200 And this is a video of the motion control code running in the Autopilot simulator, showing the evolution of the robot's walk over time.

360 0:49:27,200 --> 0:49:37,200 So as you can see, we started quite slowly in April and started accelerating as we unlock more joints and deploy more advanced techniques like arms balancing over the past few months.

361 0:49:37,200 --> 0:49:44,200 And so locomotion specifically is one component that's very different as we're moving from the car to the bot's environment.

362 0:49:44,200 --> 0:49:57,200 So I think it warrants a little bit more depth and I'd like my colleagues to start talking about this now.

363 0:49:57,200 --> 0:50:04,200 Thank you Milan. Hi everyone, I'm Felix, I'm a robotics engineer on the project, and I'm going to talk about walking.

364 0:50:04,200 --> 0:50:10,200 Walking seems easy, right? People do it every day. You don't even have to think about it.

365 0:50:10,200 --> 0:50:15,200 But there are some aspects of walking which are challenging from an engineering perspective.

366 0:50:15,200 --> 0:50:22,200 For example, physical self-awareness. That means having a good representation of yourself.

367 0:50:22,200 --> 0:50:28,200 What is the length of your limbs? What is the mass of your limbs? What is the size of your feet? All that matters.

368 0:50:28,200 --> 0:50:37,200 Also, having an energy efficient gait. You can imagine there are different styles of walking, and not all of them are equally efficient.

369 0:50:37,200 --> 0:50:45,200 Most important, keep balance, don't fall. And of course, also coordinate the motion of all of your limbs together.

370 0:50:45,200 --> 0:50:52,200 So now humans do all of this naturally, but as engineers or roboticists, we have to think about these problems.

371 0:50:52,200 --> 0:50:57,200 And the following I'm going to show you how we address them in our locomotion planning and control stack.

372 0:50:57,200 --> 0:51:01,200 So we start with locomotion planning and our representation of the bot.

373 0:51:01,200 --> 0:51:07,200 That means a model of the robot's kinematics, dynamics, and the contact properties.

374 0:51:07,200 --> 0:51:16,200 And using that model and the desired path for the bot, our locomotion planner generates reference trajectories for the entire system.

375 0:51:16,200 --> 0:51:22,200 This means feasible trajectories with respect to the assumptions of our model.

376 0:51:22,200 --> 0:51:29,200 The planner currently works in three stages. It starts planning footsteps and ends with the entire motion for the system.

377 0:51:29,200 --> 0:51:32,200 And let's dive a little bit deeper in how this works.

378 0:51:32,200 --> 0:51:39,200 So in this video, we see footsteps being planned over a planning horizon following the desired path.

379 0:51:39,200 --> 0:51:49,200 And we start from this and add limb trajectories that connect these footsteps, using toe-off and heel strike just as humans do.

380 0:51:49,200 --> 0:51:55,200 And this gives us a larger stride and less knee bend for high efficiency of the system.

381 0:51:55,200 --> 0:52:04,200 The last stage is then finding a center of mass trajectory, which gives us a dynamically feasible motion of the entire system to keep balance.
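The first planning stage, footstep placement along a desired path, might be sketched like this (a straight path, fixed stride and stance width, all numbers invented):

```python
def plan_footsteps(path_length_m, stride_m=0.5, stance_width_m=0.2):
    """Stage 1 of a locomotion planner: place alternating left/right
    footsteps along a straight desired path (a toy stand-in for a
    general path-following footstep planner)."""
    steps, x, side = [], 0.0, 1  # side: +1 left foot, -1 right foot
    while x < path_length_m:
        steps.append((round(x, 3), side * stance_width_m / 2))
        x += stride_m
        side = -side
    return steps

print(plan_footsteps(2.0))  # [(0.0, 0.1), (0.5, -0.1), (1.0, 0.1), (1.5, -0.1)]
```

The later stages, limb trajectories connecting the footsteps and the center of mass trajectory, would then be planned on top of these placements.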

382 0:52:04,200 --> 0:52:09,200 As we all know, plans are good, but we also have to realize them in reality.

383 0:52:09,200 --> 0:52:20,200 Let's see how we can do this.

384 0:52:20,200 --> 0:52:23,200 Thank you, Felix. Hello, everyone. My name is Anand.

385 0:52:23,200 --> 0:52:26,200 And I'm going to talk to you about controls.

386 0:52:26,200 --> 0:52:33,200 So let's take the motion plan that Felix just talked about and put it in the real world on a real robot.

387 0:52:33,200 --> 0:52:37,200 Let's see what happens.

388 0:52:37,200 --> 0:52:40,200 It takes a couple of steps and falls down.

389 0:52:40,200 --> 0:52:48,200 Well, that's a little disappointing, but we are missing a few key pieces here which will make it walk.

390 0:52:48,200 --> 0:52:57,200 Now, as Felix mentioned, the motion planner is using an idealized version of itself and a version of reality around it.

391 0:52:57,200 --> 0:52:59,200 This is not exactly correct.

392 0:52:59,200 --> 0:53:12,200 It also expresses its intention through trajectories and wrenches, that is, forces and torques that it wants to exert on the world to locomote.

393 0:53:12,200 --> 0:53:16,200 Reality is way more complex than any simple model.

394 0:53:16,200 --> 0:53:18,200 Also, the real robot is not the simplified model.

395 0:53:18,200 --> 0:53:25,200 It's got vibrations and modes, compliance, sensor noise, and on and on and on.

396 0:53:25,200 --> 0:53:30,200 So what does that do to the real world when you put the bot in the real world?

397 0:53:30,200 --> 0:53:36,200 Well, the unexpected forces cause unmodeled dynamics, which essentially the planner doesn't know about.

398 0:53:36,200 --> 0:53:44,200 And that causes destabilization, especially for a system that is dynamically stabilized, like biped locomotion.

399 0:53:44,200 --> 0:53:46,200 So what can we do about it?

400 0:53:46,200 --> 0:53:48,200 Well, we measure reality.

401 0:53:48,200 --> 0:53:53,200 We use sensors and our understanding of the world to do state estimation.

402 0:53:53,200 --> 0:54:00,200 And here you can see the attitude and pelvis pose, which is essentially the vestibular system in a human,

403 0:54:00,200 --> 0:54:07,200 along with the center of mass trajectory being tracked when the robot is walking in the office environment.

404 0:54:07,200 --> 0:54:11,200 Now we have all the pieces we need in order to close the loop.

405 0:54:11,200 --> 0:54:14,200 So we use our better bot model.

406 0:54:14,200 --> 0:54:18,200 We use the understanding of reality that we've gained through state estimation.

407 0:54:18,200 --> 0:54:24,200 And we compare what we want versus what we expect the reality is doing to us

408 0:54:24,200 --> 0:54:30,200 in order to add corrections to the behavior of the robot.
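Closing the loop as described, comparing the plan against state-estimated reality and adding corrections, is in its simplest form a feedback law; here is a toy PD sketch with invented gains, not the actual controller:

```python
def control_correction(desired, estimated, kp=200.0, kd=20.0):
    """Closed-loop correction: compare the planned state against the
    state-estimated one and output a corrective force (simple PD sketch).

    desired/estimated: (position_m, velocity_m_s) of e.g. the center of mass.
    kp/kd: invented proportional/derivative gains.
    """
    pos_err = desired[0] - estimated[0]
    vel_err = desired[1] - estimated[1]
    return kp * pos_err + kd * vel_err

# Robot pushed 2 cm behind plan while moving slightly too slow:
print(control_correction((0.10, 0.30), (0.08, 0.25)))  # about 5.0 N of correction
```

A push like the poke in the video shows up as exactly this kind of state error, which the feedback turns into a restoring action.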

409 0:54:30,200 --> 0:54:38,200 Here, the robot certainly doesn't appreciate being poked, but it does an admirable job of staying upright.

410 0:54:38,200 --> 0:54:43,200 The final point here is a robot that walks is not enough.

411 0:54:43,200 --> 0:54:48,200 We need it to use its hands and arms to be useful.

412 0:54:48,200 --> 0:54:50,200 Let's talk about manipulation.

413 0:55:00,200 --> 0:55:04,200 Hi, everyone. My name is Eric, robotics engineer on Teslabot.

414 0:55:04,200 --> 0:55:09,200 And I want to talk about how we've made the robot manipulate things in the real world.

415 0:55:09,200 --> 0:55:16,200 We wanted to manipulate objects while looking as natural as possible and also get there quickly.

416 0:55:16,200 --> 0:55:20,200 So what we've done is we've broken this process down into two steps.

417 0:55:20,200 --> 0:55:26,200 First is generating a library of natural motion references, or we could call them demonstrations.

418 0:55:26,200 --> 0:55:32,200 And then we've adapted these motion references online to the current real world situation.

419 0:55:32,200 --> 0:55:36,200 So let's say we have a human demonstration of picking up an object.

420 0:55:36,200 --> 0:55:42,200 We can get a motion capture of that demonstration, which is visualized right here as a bunch of key frames

421 0:55:42,200 --> 0:55:46,200 representing the location of the hands, the elbows, the torso.

422 0:55:46,200 --> 0:55:49,200 We can map that to the robot using inverse kinematics.

423 0:55:49,200 --> 0:55:55,200 And if we collect a lot of these, now we have a library that we can work with.

424 0:55:55,200 --> 0:56:01,200 But a single demonstration is not generalizable to the variation in the real world.

425 0:56:01,200 --> 0:56:06,200 For instance, this would only work for a box in a very particular location.

426 0:56:06,200 --> 0:56:12,200 So what we've also done is run these reference trajectories through a trajectory optimization program,

427 0:56:12,200 --> 0:56:17,200 which solves for where the hand should be, how the robot should balance,

428 0:56:17,200 --> 0:56:21,200 when it needs to adapt the motion to the real world.

429 0:56:21,200 --> 0:56:31,200 So for instance, if the box is in this location, then our optimizer will create this trajectory instead.
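Adapting a demonstrated reference to a new object location can be illustrated with a simple warp that progressively blends in the offset to the new target; the real system uses trajectory optimization with balance constraints, so this is only a stand-in:

```python
def adapt_reference(keyframes, new_target):
    """Warp a demonstrated hand trajectory so it ends at a new object pose.

    keyframes: list of (x, y, z) hand positions from motion capture.
    The offset between the demonstrated end point and the new target is
    blended in linearly, keeping the start of the motion unchanged.
    """
    end = keyframes[-1]
    offset = tuple(t - e for t, e in zip(new_target, end))
    n = len(keyframes) - 1
    return [tuple(k[i] + offset[i] * idx / n for i in range(3))
            for idx, k in enumerate(keyframes)]

# Hypothetical demonstration, retargeted to a box at a new location:
demo = [(0.0, 0.0, 0.0), (0.2, 0.1, 0.3), (0.4, 0.2, 0.6)]
print(adapt_reference(demo, (0.5, 0.2, 0.5)))
```

The start keyframe is preserved and the final keyframe lands on the new target, which is the essential behavior the optimizer provides (plus feasibility and balance, which this sketch ignores).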

430 0:56:31,200 --> 0:56:38,200 Next, Milan is going to talk about what's next for Optimus. Thanks.

431 0:56:38,200 --> 0:56:45,200 Thanks, Eric.

432 0:56:45,200 --> 0:56:50,200 Right. So hopefully by now you guys got a good idea of what we've been up to over the past few months.

433 0:56:50,200 --> 0:56:54,200 We started doing something that's usable, but it's far from being useful.

434 0:56:54,200 --> 0:56:58,200 There's still a long and exciting road ahead of us.

435 0:56:58,200 --> 0:57:03,200 I think the first thing within the next few weeks is to get Optimus at least on par with Bumble C,

436 0:57:03,200 --> 0:57:07,200 the other bot prototype you saw earlier, and probably beyond.

437 0:57:07,200 --> 0:57:12,200 We are also going to start focusing on the real use case at one of our factories

438 0:57:12,200 --> 0:57:18,200 and really going to try to nail this down and iron out all the elements needed

439 0:57:18,200 --> 0:57:20,200 to deploy this product in the real world.

440 0:57:20,200 --> 0:57:27,200 I was mentioning earlier: indoor navigation, graceful fall management, or even servicing,

441 0:57:27,200 --> 0:57:31,200 all components needed to scale this product up.

442 0:57:31,200 --> 0:57:35,200 I don't know about you, but after seeing what we've shown tonight,

443 0:57:35,200 --> 0:57:38,200 I'm pretty sure we can get this done within the next few months or years

444 0:57:38,200 --> 0:57:43,200 and make this product a reality and change the entire economy.

445 0:57:43,200 --> 0:57:47,200 So I would like to thank the entire optimist team for all their hard work over the past few months.

446 0:57:47,200 --> 0:57:51,200 I think it's pretty amazing. All of this was done in barely six or eight months.

447 0:57:51,200 --> 0:57:53,200 Thank you very much.

448 0:57:53,200 --> 0:58:01,200 Applause

449 0:58:07,200 --> 0:58:14,200 Hey, everyone. Hi, I'm Ashok. I lead the Autopilot team alongside Milan.

450 0:58:14,200 --> 0:58:18,200 God, it's going to be so hard to top that Optimus section.

451 0:58:18,200 --> 0:58:21,200 We'll try nonetheless.

452 0:58:21,200 --> 0:58:26,200 Anyway, every Tesla that has been built over the last several years

453 0:58:26,200 --> 0:58:30,200 has the hardware needed to make the car drive itself.

454 0:58:30,200 --> 0:58:36,200 We have been working on the software to add higher and higher levels of autonomy.

455 0:58:36,200 --> 0:58:42,200 This time around last year, we had roughly 2,000 cars driving our FSD beta software.

456 0:58:42,200 --> 0:58:47,200 Since then, we have so significantly improved the software's robustness and capability

457 0:58:47,200 --> 0:58:53,200 that we have now shipped it to 160,000 customers as of today.

458 0:58:53,200 --> 0:58:59,200 Applause

459 0:58:59,200 --> 0:59:06,200 This has not come for free. It came from the sweat and blood of the engineering team over the last year.

460 0:59:06,200 --> 0:59:11,200 For example, we trained 75,000 neural network models in just the last year.

461 0:59:11,200 --> 0:59:16,200 That's roughly a model every eight minutes that's coming out of the team.

462 0:59:16,200 --> 0:59:19,200 And then we evaluate them on our large clusters.

463 0:59:19,200 --> 0:59:24,200 And then we ship 281 of those models that actually improve the performance of the car.

464 0:59:24,200 --> 0:59:28,200 And this space of innovation is happening throughout the stack.

465 0:59:28,200 --> 0:59:37,200 The planning software, the infrastructure, the tools, even hiring, everything is progressing to the next level.

466 0:59:37,200 --> 0:59:41,200 The FSD beta software is quite capable of driving the car.

467 0:59:41,200 --> 0:59:46,200 It should be able to navigate from parking lot to parking lot, handling city street driving,

468 0:59:46,200 --> 0:59:56,200 stopping for traffic lights and stop signs, negotiating with objects at intersections, making turns and so on.

469 0:59:56,200 --> 1:00:02,200 All of this comes from the camera streams that go through our neural networks that run on the car itself.

470 1:00:02,200 --> 1:00:04,200 It's not coming back to the server or anything.

471 1:00:04,200 --> 1:00:09,200 It's running on the car and produces all the outputs to form the world model around the car.

472 1:00:09,200 --> 1:00:13,200 And the planning software drives the car based on that.

473 1:00:13,200 --> 1:00:17,200 Today we'll go into a lot of the components that make up the system.

474 1:00:17,200 --> 1:00:23,200 The occupancy network acts as the base geometry layer of the system.

475 1:00:23,200 --> 1:00:28,200 This is a multi-camera video neural network that from the images

476 1:00:28,200 --> 1:00:34,200 predicts the full physical occupancy of the world around the robot.

477 1:00:34,200 --> 1:00:39,200 So anything that's physically present, whether trees, walls, buildings, or cars,

478 1:00:39,200 --> 1:00:46,200 if it's physically present, it predicts it, along with its future motion.
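The output representation, as opposed to the network itself, is easy to picture: a volumetric grid of occupied cells around the robot. A toy sketch that just quantizes hypothetical 3D points into voxels (the real network predicts this kind of volume directly from multi-camera video):

```python
GRID, RES = 20, 0.5  # 20^3 voxels at 0.5 m -> a 10 m cube around the robot

def occupancy_grid(points):
    """Quantize hypothetical 3D points (meters, robot-centered) into a set
    of occupied voxel indices; only illustrates the output representation."""
    occupied = set()
    for x, y, z in points:
        idx = tuple(int(v / RES) + GRID // 2 for v in (x, y, z))
        if all(0 <= i < GRID for i in idx):
            occupied.add(idx)
    return occupied

g = occupancy_grid([(1.0, 0.0, 0.0), (1.2, 0.0, 0.0), (-3.0, 2.0, 1.0)])
print(len(g))  # 2: the first two points fall into the same voxel
```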

479 1:00:46,200 --> 1:00:51,200 On top of this base level of geometry, we have more semantic layers.

480 1:00:51,200 --> 1:00:56,200 In order to navigate the roadways, we need the lanes, of course.

481 1:00:56,200 --> 1:00:59,200 The roadways have lots of different lanes and they connect in all kinds of ways.

482 1:00:59,200 --> 1:01:03,200 So it's actually a really difficult problem for typical computer vision techniques

483 1:01:03,200 --> 1:01:06,200 to predict the set of lanes and their connectivities.

484 1:01:06,200 --> 1:01:11,200 So we reached all the way into language technologies and then pulled the state of the art from other domains

485 1:01:11,200 --> 1:01:16,200 and not just computer vision to make this task possible.

486 1:01:16,200 --> 1:01:21,200 For vehicles, we need their full kinematic state to control for them.

487 1:01:21,200 --> 1:01:24,200 All of this directly comes from neural networks.

488 1:01:24,200 --> 1:01:28,200 Video streams, raw video streams, come into the networks,

489 1:01:28,200 --> 1:01:31,200 go through a lot of processing, and then the networks output the full kinematic state.

490 1:01:31,200 --> 1:01:37,200 The positions, velocities, acceleration, jerk, all of that directly comes out of networks

491 1:01:37,200 --> 1:01:39,200 with minimal post-processing.

492 1:01:39,200 --> 1:01:42,200 That's really fascinating to me because how is this even possible?

493 1:01:42,200 --> 1:01:45,200 What world do we live in that this magic is possible,

494 1:01:45,200 --> 1:01:48,200 that these networks predict third derivatives of these positions

495 1:01:48,200 --> 1:01:53,200 when people thought they couldn't even detect these objects?

496 1:01:53,200 --> 1:01:55,200 My opinion is that it did not come for free.

497 1:01:55,200 --> 1:02:00,200 It required tons of data, so we had to build sophisticated auto-labeling systems

498 1:02:00,200 --> 1:02:05,200 that churn through raw sensor data, run a ton of offline compute on the servers.

499 1:02:05,200 --> 1:02:09,200 It can take a few hours, run expensive neural networks,

500 1:02:09,200 --> 1:02:15,200 distill the information into labels that train our in-car neural networks.

501 1:02:15,200 --> 1:02:20,200 On top of this, we also use our simulation system to synthetically create images,

502 1:02:20,200 --> 1:02:25,200 and since it's a simulation, we trivially have all the labels.

503 1:02:25,200 --> 1:02:29,200 All of this goes through a well-oiled data engine pipeline

504 1:02:29,200 --> 1:02:33,200 where we first train a baseline model with some data,

505 1:02:33,200 --> 1:02:36,200 ship it to the car, see what the failures are,

506 1:02:36,200 --> 1:02:41,200 and once we know the failures, we mine the fleet for the cases where it fails,

507 1:02:41,200 --> 1:02:45,200 provide the correct labels, and add the data to the training set.

508 1:02:45,200 --> 1:02:48,200 This process systematically fixes the issues,

509 1:02:48,200 --> 1:02:51,200 and we do this for every task that runs in the car.
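One turn of that data-engine loop can be written schematically; all the stand-in functions here are toys for illustration, not Tesla's pipeline:

```python
def data_engine_step(train, evaluate, mine_fleet, label, dataset):
    """One schematic turn of the data engine: train a model, find where
    it fails, mine the fleet for those cases, label them, and fold them
    back into the training set."""
    model = train(dataset)
    failures = evaluate(model)
    new_clips = mine_fleet(failures)
    dataset = dataset + [label(clip) for clip in new_clips]
    return model, dataset

# Toy stand-ins: the "model" is just the set of cases it has seen.
train = lambda data: set(data)
evaluate = lambda model: [c for c in ("rain", "glare") if c not in model]
mine_fleet = lambda failures: failures  # pretend the fleet returns clips
label = lambda clip: clip

model, data = data_engine_step(train, evaluate, mine_fleet, label, ["day", "night"])
print(sorted(data))  # ['day', 'glare', 'night', 'rain']
```

Each iteration grows the dataset exactly where the current model is weakest, which is the systematic fixing of issues the talk describes.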

510 1:02:51,200 --> 1:02:54,200 Yeah, and to train these new massive neural networks,

511 1:02:54,200 --> 1:02:59,200 this year we expanded our training infrastructure by roughly 40 to 50 percent,

512 1:02:59,200 --> 1:03:06,200 so that sits us at about 14,000 GPUs today across multiple training clusters in the United States.

513 1:03:06,200 --> 1:03:09,200 We also worked on our AI compiler,

514 1:03:09,200 --> 1:03:13,200 which now supports new operations needed by those neural networks

515 1:03:13,200 --> 1:03:17,200 and maps them to the best of our underlying hardware resources.

516 1:03:17,200 --> 1:03:23,200 And our inference engine today is capable of distributing the execution of a single neural network

517 1:03:23,200 --> 1:03:26,200 across two independent system on chips,

518 1:03:26,200 --> 1:03:32,200 essentially two independent computers interconnected within the same full self-driving computer.

519 1:03:32,200 --> 1:03:37,200 And to make this possible, we had to keep a tight control on the end-to-end latency of this new system,

520 1:03:37,200 --> 1:03:43,200 so we deployed more advanced scheduling code across the full FSD platform.

521 1:03:43,200 --> 1:03:47,200 All of these neural networks running in the car together produce the vector space,

522 1:03:47,200 --> 1:03:50,200 which is again the model of the world around the robot or the car.

523 1:03:50,200 --> 1:03:56,200 And then the planning system operates on top of this, coming up with trajectories that avoid collisions, are smooth,

524 1:03:56,200 --> 1:04:00,200 and make progress towards the destination, using a combination of model-based optimization

525 1:04:00,200 --> 1:04:06,200 plus neural network that helps optimize it to be really fast.

526 1:04:06,200 --> 1:04:11,200 Today, we are really excited to present progress on all of these areas.

527 1:04:11,200 --> 1:04:15,200 We have the engineering leads standing by to come in and explain these various blocks,

528 1:04:15,200 --> 1:04:22,200 and these power not just the car, but the same components also run on the Optimus robot that Milan showed earlier.

529 1:04:22,200 --> 1:04:26,200 With that, I welcome Paril to start talking about the planning section.

530 1:04:26,200 --> 1:04:36,200 Hi, all. I'm Paril Jain.

531 1:04:36,200 --> 1:04:43,200 Let's use this intersection scenario to dive straight into how we do the planning and decision-making in Autopilot.

532 1:04:43,200 --> 1:04:49,200 So we are approaching this intersection from a side street, and we have to yield to all the crossing vehicles.

533 1:04:49,200 --> 1:04:57,200 Right as we are about to enter the intersection, the pedestrian on the other side of the intersection decides to cross the road without a crosswalk.

534 1:04:57,200 --> 1:05:02,200 Now, we need to yield to this pedestrian, yield to the vehicles from the right,

535 1:05:02,200 --> 1:05:08,200 and also understand the relation between the pedestrian and the vehicle on the other side of the intersection.

536 1:05:08,200 --> 1:05:15,200 So there are a lot of inter-object dependencies that we need to resolve in a quick glance.

537 1:05:15,200 --> 1:05:17,200 And humans are really good at this.

538 1:05:17,200 --> 1:05:27,200 We look at a scene, understand all the possible interactions, evaluate the most promising ones, and generally end up choosing a reasonable one.

539 1:05:27,200 --> 1:05:31,200 So let's look at a few of these interactions that the Autopilot system evaluated.

540 1:05:31,200 --> 1:05:36,200 We could have gone in front of this pedestrian with a very aggressive launch and lateral profile.

541 1:05:36,200 --> 1:05:41,200 Now, obviously, we are being a jerk to the pedestrian, and we would spook the pedestrian and his cute pet.

542 1:05:41,200 --> 1:05:48,200 We could have moved forward slowly, shot for a gap between the pedestrian and the vehicle from the right.

543 1:05:48,200 --> 1:05:51,200 Again, we are being a jerk to the vehicle coming from the right.

544 1:05:51,200 --> 1:05:58,200 But you should not outright reject this interaction, in case it is the only safe interaction available.

545 1:05:58,200 --> 1:06:01,200 Lastly, the interaction we ended up choosing.

546 1:06:01,200 --> 1:06:09,200 Stay slow initially, find the reasonable gap, and then finish the maneuver after all the agents pass.

547 1:06:09,200 --> 1:06:18,200 Now, evaluation of all of these interactions is not trivial, especially when you care about modeling the higher-order derivatives for other agents.

548 1:06:18,200 --> 1:06:25,200 For example, what is the longitudinal jerk required by the vehicle coming from the right when you assert in front of it?

549 1:06:25,200 --> 1:06:33,200 Relying purely on collision checks with modular predictions will only get you so far because you will miss out on a lot of valid interactions.

550 1:06:33,200 --> 1:06:42,200 This basically boils down to solving the multi-agent joint trajectory planning problem over the trajectories of ego and all the other agents.

551 1:06:42,200 --> 1:06:47,200 Now, however much you optimize, there's going to be a limit to how fast you can run this optimization problem.

552 1:06:47,200 --> 1:06:53,200 It will be close to order of 10 milliseconds, even after a lot of incremental approximations.

553 1:06:53,200 --> 1:07:07,200 Now, for a typical crowded unprotected left, say you have more than 20 objects, each object having multiple different future modes, the number of relevant interaction combinations will blow up.

554 1:07:07,200 --> 1:07:11,200 The planner needs to make a decision every 50 milliseconds.

555 1:07:11,200 --> 1:07:14,200 So how do we solve this in real time?

556 1:07:14,200 --> 1:07:23,200 We rely on a framework we call interaction search, which is basically a parallelized tree search over a bunch of maneuver trajectories.

557 1:07:23,200 --> 1:07:36,200 The state space here corresponds to the kinematic state of ego, the kinematic state of other agents, their nominal future multimodal predictions, and all the static entities in the scene.

558 1:07:36,200 --> 1:07:40,200 The action space is where things get interesting.

559 1:07:40,200 --> 1:07:50,200 We use a set of maneuver trajectory candidates to branch over a bunch of interaction decisions and also incremental goals for a longer horizon maneuver.

560 1:07:50,200 --> 1:07:55,200 Let's walk through this tree search very quickly to get a sense of how it works.

561 1:07:55,200 --> 1:08:00,200 We start with a set of vision measurements, namely lanes, occupancy, moving objects.

562 1:08:00,200 --> 1:08:05,200 These get represented as sparse abstractions as well as latent features.

563 1:08:05,200 --> 1:08:17,200 We use this to create a set of goal candidates, lanes again from the lanes network, or unstructured regions which correspond to a probability mask derived from human demonstration.

564 1:08:17,200 --> 1:08:28,200 Once we have a bunch of these goal candidates, we create seed trajectories using a combination of classical optimization approaches, as well as our network planner, again trained on data from the customer fleet.

565 1:08:28,200 --> 1:08:35,200 Now once we get a bunch of these seed trajectories, we use them to start branching on the interactions.

566 1:08:35,200 --> 1:08:37,200 We find the most critical interaction.

567 1:08:37,200 --> 1:08:43,200 In our case, this would be the interaction with respect to the pedestrian, whether we assert in front of it or yield to it.

568 1:08:43,200 --> 1:08:47,200 Obviously, the option on the left is a high penalty option.

569 1:08:47,200 --> 1:08:49,200 It likely won't get prioritized.

570 1:08:49,200 --> 1:08:57,200 So we branch further onto the option on the right, and that's where we bring in more and more complex interactions, building this optimization problem incrementally with more and more constraints.

571 1:08:57,200 --> 1:09:03,200 And the tree search keeps growing, branching on more interactions, branching on more goals.

572 1:09:03,200 --> 1:09:09,200 Now a lot of the tricks here lie in the evaluation of each node of this tree search.

573 1:09:09,200 --> 1:09:19,200 Inside each node, initially we started with creating trajectories using classical optimization approaches, where the constraints, like I described, would be added incrementally.

574 1:09:19,200 --> 1:09:24,200 And this would take close to one to five milliseconds per action.

575 1:09:24,200 --> 1:09:31,200 Now even though this is a fairly good number, when you want to evaluate more than 100 interactions, this does not scale.

576 1:09:31,200 --> 1:09:37,200 So we ended up building lightweight, queryable networks that you can run in the loop of the planner.

577 1:09:37,200 --> 1:09:44,200 These networks are trained on human demonstrations from the fleet, as well as offline solvers with relaxed time limits.

578 1:09:44,200 --> 1:09:51,200 With this, we were able to bring the run time down to close to 100 microseconds per action.

579 1:09:51,200 --> 1:10:06,200 Now doing this alone is not enough, because you still have this massive tree search that you need to go through, and you need to efficiently prune the search space.

580 1:10:06,200 --> 1:10:11,200 So you need to do scoring on each of these trajectories.

581 1:10:11,200 --> 1:10:18,200 A few of these are fairly standard. You do a bunch of collision checks, you do a bunch of comfort analysis: what is the jerk and accel required for a given maneuver.

582 1:10:18,200 --> 1:10:23,200 The customer fleet data plays an important role here again.

583 1:10:23,200 --> 1:10:32,200 We run two sets of, again, lightweight, queryable networks, both really augmenting each other, one of them trained from interventions from the FSD beta fleet,

584 1:10:32,200 --> 1:10:38,200 which gives a score on how likely is a given maneuver to result in interventions over the next few seconds.

585 1:10:38,200 --> 1:10:47,200 And second, which is purely on human demonstrations, human driven data, giving a score on how close is your given selected action to a human driven trajectory.

586 1:10:47,200 --> 1:10:56,200 The scoring helps us prune the search space, keep branching further on the interactions, and focus the compute on the most promising outcomes.
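The branch-score-prune loop described above can be sketched as a beam search over interaction decisions: branch on one interaction at a time, score each candidate with cheap stand-in scorers plus a physics-style penalty, and keep only the most promising nodes. Everything below (the agents, the choices, the cost weights) is an illustrative toy, not Tesla's implementation:

```python
import heapq

# Branch on one interaction decision per agent; lower cost = more promising.
AGENTS = ["pedestrian", "car_right", "car_left"]
CHOICES = ["yield", "assert"]

def score(decisions):
    """Toy stand-in for the scorers in the talk: a jerk/comfort proxy plus
    learned intervention-likelihood and human-likeness scores."""
    comfort = sum(2.0 for _, c in decisions if c == "assert")          # jerk proxy
    intervention = 5.0 if ("pedestrian", "assert") in decisions else 0.0
    human_likeness = 0.5 * sum(1 for _, c in decisions if c == "assert")
    return comfort + intervention + human_likeness

def interaction_search(beam_width=3):
    frontier = [(0.0, ())]                      # start with the empty plan
    for agent in AGENTS:                        # branch on the next interaction
        children = []
        for _, decisions in frontier:
            for choice in CHOICES:
                d = decisions + ((agent, choice),)
                children.append((score(d), d))
        frontier = heapq.nsmallest(beam_width, children)  # prune: keep the best nodes
    return frontier[0]

best_cost, best_plan = interaction_search()
print(best_plan)   # yields to every agent under this toy cost model
```

The pruning keeps the frontier small, so the number of trajectory evaluations grows linearly with the number of agents instead of exponentially.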

587 1:10:56,200 --> 1:11:06,200 The cool part about this architecture is that it allows us to create a cool blend between data driven approaches,

588 1:11:06,200 --> 1:11:12,200 where you don't have to rely on a lot of hand engineered costs, but also ground it in reality with physics based checks.

589 1:11:12,200 --> 1:11:22,200 Now a lot of what I described was with respect to the agents we could observe in the scene, but the same framework extends to objects behind occlusions.

590 1:11:22,200 --> 1:11:29,200 We use the video feed from eight cameras to generate the 3D occupancy of the world.

591 1:11:29,200 --> 1:11:34,200 The blue mask here corresponds to the visibility region we call it.

592 1:11:34,200 --> 1:11:38,200 It basically gets blocked at the first occlusion you see in the scene.

593 1:11:38,200 --> 1:11:44,200 We consume this visibility mask to generate what we call ghost objects, which you can see on the top left.

594 1:11:44,200 --> 1:11:50,200 Now if you model the spawn regions and the state transitions of these ghost objects correctly,

595 1:11:50,200 --> 1:11:59,200 if you tune your control response as a function of their existence likelihood, you can extract some really nice human like behaviors.
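The ghost-object idea above can be sketched as follows; the update rule, decay factor, and speed blend are invented for illustration only:

```python
# Illustrative sketch (not Tesla's code): spawn a "ghost" object behind an
# occlusion with some existence likelihood, shrink that belief as the region
# is observed clear, and soften the control response as a function of it.

def update_existence(p: float, observed_clear: bool, decay: float = 0.5) -> float:
    """Shrink belief when the region is seen empty, grow it slightly otherwise."""
    return p * decay if observed_clear else min(1.0, p * 1.1)

def speed_limit(base_mps: float, ghost_p: float, cautious_mps: float = 2.0) -> float:
    """Blend between a cautious creep and normal speed by existence likelihood."""
    return cautious_mps * ghost_p + base_mps * (1.0 - ghost_p)

p = 0.8                        # ghost spawned behind a parked truck
for _ in range(3):             # three frames of seeing the gap is clear
    p = update_existence(p, observed_clear=True)
print(round(p, 2))             # 0.1
print(round(speed_limit(10.0, p), 2))   # 9.2 m/s: back to near-normal speed
```

Tuning the control response continuously in the likelihood, rather than thresholding it, is what produces the smooth creep-then-go behavior described.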

596 1:11:59,200 --> 1:12:04,200 Now I'll pass it on to Phil to describe more on how we generate these occupancy networks.

597 1:12:04,200 --> 1:12:11,200 Thank you.

598 1:12:11,200 --> 1:12:18,200 Hey guys, my name is Phil. I will share the details of the occupancy network we built over the past year.

599 1:12:18,200 --> 1:12:23,200 This network is our solution to model the physical world in 3D around our cars.

600 1:12:23,200 --> 1:12:27,200 And it is currently not shown in our customer facing visualization.

601 1:12:27,200 --> 1:12:35,200 What you see here is the raw network output from our internal lab tool.

602 1:12:35,200 --> 1:12:46,200 The occupancy network takes video streams of all our eight cameras as input, produces a single unified volumetric occupancy in vector space directly.

603 1:12:46,200 --> 1:12:54,200 For every 3D location around our car, it predicts the probability of that location being occupied or not.

604 1:12:54,200 --> 1:13:02,200 Since it has video context, it is capable of predicting obstacles that are occluded instantaneously.

605 1:13:02,200 --> 1:13:16,200 For each location, it also produces a set of semantics such as a curb, car, pedestrian, and road debris as color coded here.

606 1:13:16,200 --> 1:13:19,200 Occupancy flow is also predicted for motion.

607 1:13:19,200 --> 1:13:26,200 Since the model is a generalized network, it does not distinguish static and dynamic objects explicitly.

608 1:13:26,200 --> 1:13:33,200 It is able to produce and model random motion, such as the swerving trailer here.

609 1:13:33,200 --> 1:13:40,200 This network is currently running in all Teslas with FSD computers, and it is incredibly efficient.

610 1:13:40,200 --> 1:13:45,200 Runs about every 10 milliseconds with our neural net accelerator.

611 1:13:45,200 --> 1:13:48,200 So how does this work? Let's take a look at architecture.

612 1:13:48,200 --> 1:13:53,200 First, we rectify each camera's images with the camera calibration.

613 1:13:53,200 --> 1:13:59,200 And the images we're giving to the network here are actually not the typical 8-bit RGB images.

614 1:13:59,200 --> 1:14:06,200 As you can see from the first image on top, we're giving the 12-bit raw photon count image to the network.

615 1:14:06,200 --> 1:14:17,200 Since it has four bits more information, it has 16 times better dynamic range as well as reduced latency since we don't have to run ISP in the loop anymore.

616 1:14:17,200 --> 1:14:25,200 We use a set of RegNets and BiFPNs as a backbone to extract image space features.

617 1:14:25,200 --> 1:14:34,200 Next, we construct a set of 3D positional queries, which, along with the image space features as keys and values, are fed into an attention module.

618 1:14:34,200 --> 1:14:39,200 The output of the attention module is high dimensional spatial features.

619 1:14:39,200 --> 1:14:48,200 These spatial features are aligned temporally using vehicle odometry to derive motion.

620 1:14:48,200 --> 1:14:57,200 Next, these spatial temporal features go through a set of deconvolution to produce the final occupancy and occupancy flow output.

621 1:14:57,200 --> 1:15:04,200 They're formed as fixed-size voxel grid, which might not be precise enough for planning and control.

622 1:15:04,200 --> 1:15:19,200 In order to get a higher resolution, we also produce per-voxel feature maps, which we feed into an MLP with 3D spatial point queries to get occupancy and semantics at any arbitrary location.
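A minimal sketch of the per-voxel feature map plus point-query MLP just described. The sizes and layers are invented, and a nearest-voxel lookup stands in for whatever interpolation the real network uses:

```python
# Sketch: a fixed-size voxel grid stores per-voxel feature vectors; a small
# MLP queried with an arbitrary continuous 3D point returns occupancy there.
import numpy as np

rng = np.random.default_rng(0)
GRID = rng.standard_normal((8, 8, 8, 16))      # 8^3 voxels, 16-dim features
W1, b1 = rng.standard_normal((16, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 1)), np.zeros(1)

def voxel_feature(xyz):
    """Nearest-voxel lookup (a real system would interpolate smoothly)."""
    i, j, k = np.clip(np.floor(np.asarray(xyz) * 8).astype(int), 0, 7)
    return GRID[i, j, k]

def occupancy(xyz) -> float:
    """Query occupancy probability at a continuous point in [0, 1)^3."""
    h = np.maximum(voxel_feature(xyz) @ W1 + b1, 0.0)   # ReLU hidden layer
    logit = (h @ W2 + b2)[0]
    return 1.0 / (1.0 + np.exp(-logit))                  # sigmoid -> probability

p = occupancy((0.37, 0.52, 0.11))
print(0.0 <= p <= 1.0)   # True: a probability at any arbitrary location
```

The point is that resolution is no longer tied to the voxel grid: the same feature map answers queries at any continuous position.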

623 1:15:19,200 --> 1:15:23,200 After knowing the model better, let's take a look at another example.

624 1:15:23,200 --> 1:15:29,200 Here we have an articulated bus parked on the right side of the road, highlighted as an L-shaped voxel here.

625 1:15:29,200 --> 1:15:42,200 As we approach, the bus starts to move. The front of the bus turns blue first, indicating the model predicts the front of the bus has a non-zero occupancy flow.

626 1:15:42,200 --> 1:15:52,200 And as the bus keeps moving, the entire bus turns blue, and you can also see that the network predicts the precise curvature of the bus.

627 1:15:52,200 --> 1:16:03,200 Well, this is a very complicated problem for a traditional object detection network, as you have to decide whether to use one cuboid or perhaps two to fit the curvature.

628 1:16:03,200 --> 1:16:13,200 But for the occupancy network, since all we care about is the occupancy in the visible space, we're able to model the curvature precisely.

629 1:16:13,200 --> 1:16:18,200 Besides the voxel grid, the occupancy network also produces a drivable surface.

630 1:16:18,200 --> 1:16:27,200 The drivable surface has both 3D geometry and semantics. They are very useful for control, especially on hilly and curvy roads.

631 1:16:27,200 --> 1:16:37,200 The surface and the voxel grid are not predicted independently. Instead, the voxel grid actually aligns with the surface implicitly.

632 1:16:37,200 --> 1:16:46,200 Here we are at a hill crest, where you can see the 3D geometry of the surface being predicted nicely.

633 1:16:46,200 --> 1:16:51,200 The planner can use this information to decide perhaps we need to slow down more for the hill crest.

634 1:16:51,200 --> 1:16:58,200 And as you can also see, the voxel grid aligns with the surface consistently.

635 1:16:58,200 --> 1:17:07,200 Besides the voxels and the surface, we're also very excited about the recent breakthrough in neural radiance fields, or NeRF.

636 1:17:07,200 --> 1:17:19,200 We're looking into both incorporating some of the latest NeRF features into occupancy network training, as well as using our network output as the input state for NeRFs.

637 1:17:19,200 --> 1:17:28,200 As a matter of fact, Ashok is very excited about this. This has been his personal weekend project for a while.

638 1:17:28,200 --> 1:17:38,200 I think academia is building a lot of these foundation models for language using tons of large data sets for language.

639 1:17:38,200 --> 1:17:45,200 I think for vision, NeRFs are going to provide the foundation models for computer vision because they are grounded in geometry.

640 1:17:45,200 --> 1:17:52,200 Geometry gives us a nice way to supervise these networks and frees us of the requirement to define an ontology.

641 1:17:52,200 --> 1:17:56,200 And the supervision is essentially free because you just have to differentiably render these images.

642 1:17:56,200 --> 1:18:11,200 So I think in the future, this occupancy network idea where images come in and then the network produces a consistent volumetric representation of the scene that can then be differentiably rendered into any image that was observed,

643 1:18:11,200 --> 1:18:14,200 I personally think is a future of computer vision.

644 1:18:14,200 --> 1:18:29,200 And we do some initial work on it right now, but I think in the future, both at Tesla and in academia, we will see that this combination of one-shot prediction of volumetric occupancy will be the future.

645 1:18:29,200 --> 1:18:32,200 That's my personal bet.

646 1:18:32,200 --> 1:18:34,200 Thanks, Ashok.

647 1:18:34,200 --> 1:18:39,200 So here's an example early result of a 3D reconstruction from our fleet data.

648 1:18:39,200 --> 1:18:49,200 Instead of focusing on getting a perfect RGB reprojection in image space, our primary goal here is to accurately represent the world in 3D space for driving.

649 1:18:49,200 --> 1:18:54,200 And we want to do this for all our fleet data all over the world, in all weather and lighting conditions.

650 1:18:54,200 --> 1:19:00,200 And obviously, this is a very challenging problem, and we're looking for you guys to help.

651 1:19:00,200 --> 1:19:07,200 Finally, the occupancy network is trained with large auto-labeled data set without any human in the loop.

652 1:19:07,200 --> 1:19:12,200 And with that, I'll pass to Tim to talk about what it takes to train this network.

653 1:19:12,200 --> 1:19:18,200 Thanks, Phil.

654 1:19:18,200 --> 1:19:20,200 All right. Hey, everyone.

655 1:19:20,200 --> 1:19:23,200 Let's talk about some training infrastructure.

656 1:19:23,200 --> 1:19:32,200 So we've seen a couple of videos so far, you know, four or five, I think, and we care and worry about a lot more clips than that.

657 1:19:32,200 --> 1:19:38,200 So let's look at the occupancy network you just saw in Phil's videos.

658 1:19:38,200 --> 1:19:43,200 It takes 1.4 billion frames to train that network, which you just saw.

659 1:19:43,200 --> 1:19:47,200 And if you have 100,000 GPUs, it would take one hour.

660 1:19:47,200 --> 1:19:52,200 But if you have one GPU, it would take 100,000 hours.
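The GPU-hour arithmetic above can be checked directly; the implied per-GPU throughput below is derived from the quoted figures, not stated in the talk:

```python
# Back-of-the-envelope training time, using the figures from the talk.
frames = 1_400_000_000          # frames needed to train the occupancy network
gpu_hours_total = 100_000       # quoted total compute: 1 GPU -> 100,000 hours

# Implied throughput: roughly 14,000 frames per GPU-hour (an assumption
# derived from the two numbers above, not a figure Tesla gave).
frames_per_gpu_hour = frames / gpu_hours_total

def wall_clock_hours(num_gpus: int) -> float:
    """Ideal (perfectly parallel) wall-clock time for the training job."""
    return gpu_hours_total / num_gpus

print(wall_clock_hours(1))        # 100000.0 hours (over 11 years)
print(wall_clock_hours(100_000))  # 1.0 hour
print(wall_clock_hours(10_000))   # 10.0 hours, roughly Tesla's training pool
```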

661 1:19:52,200 --> 1:19:56,200 So that is not a humane time period that you can wait for your training job to run, right?

662 1:19:56,200 --> 1:19:58,200 We want to ship faster than that.

663 1:19:58,200 --> 1:20:00,200 So that means you're going to need to go parallel.

664 1:20:00,200 --> 1:20:03,200 So you need more compute for that.

665 1:20:03,200 --> 1:20:06,200 That means you're going to need a supercomputer.

666 1:20:06,200 --> 1:20:18,200 So this is why we've built in-house three supercomputers comprising 14,000 GPUs, where we use 10,000 GPUs for training and 4,000 GPUs for auto-labeling.

667 1:20:18,200 --> 1:20:24,200 All these videos are stored in 30 petabytes of a distributed, managed video cache.

668 1:20:24,200 --> 1:20:31,200 You shouldn't think of our data sets as fixed, let's say, as you think of your ImageNet or something, you know, with like a million frames.

669 1:20:31,200 --> 1:20:34,200 You should think of it as a very fluid thing.

670 1:20:34,200 --> 1:20:42,200 So we've got half a million of these videos flowing in and out of these clusters every single day.

671 1:20:42,200 --> 1:20:49,200 And we track 400,000 of these Python video instantiations every second.

672 1:20:49,200 --> 1:20:51,200 So that's a lot of calls.

673 1:20:51,200 --> 1:20:57,200 We're going to need to capture that in order to govern the retention policies of this distributed video cache.

674 1:20:57,200 --> 1:21:04,200 So underlying all of this is a huge amount of infra, all of which we build and manage in-house.

675 1:21:04,200 --> 1:21:13,200 So you cannot just buy, you know, 14,000 GPUs and then 30 petabytes of flash NVMe and just put it together and let's go train.

676 1:21:13,200 --> 1:21:17,200 It actually takes a lot of work, and I'm going to go into a little bit of that.

677 1:21:17,200 --> 1:21:25,200 What you actually typically want to do is you want to take your accelerator, so that could be the GPU or Dojo, which we'll talk about later.

678 1:21:25,200 --> 1:21:31,200 And because that's the most expensive component, that's where you want to put your bottleneck.

679 1:21:31,200 --> 1:21:37,200 And so that means that every single part of your system is going to need to outperform this accelerator.

680 1:21:37,200 --> 1:21:39,200 And so that is really complicated.

681 1:21:39,200 --> 1:21:46,200 That means that your storage is going to need to have the size and the bandwidth to deliver all the data down into the nodes.

682 1:21:46,200 --> 1:21:53,200 These nodes need to have the right amount of CPU and memory capabilities to feed into your machine learning framework.

683 1:21:53,200 --> 1:21:58,200 This machine learning framework then needs to hand it off to your GPU, and then you can start training.

684 1:21:58,200 --> 1:22:06,200 But then you need to do so across hundreds or thousands of GPU in a reliable way, in lockstep, and in a way that's also fast.

685 1:22:06,200 --> 1:22:10,200 So you're also going to need an interconnect. Extremely complicated.

686 1:22:10,200 --> 1:22:13,200 We'll talk more about Dojo in a second.

687 1:22:13,200 --> 1:22:18,200 So first I want to take you through some optimizations that we've done on our cluster.

688 1:22:18,200 --> 1:22:27,200 So we're getting in a lot of videos, and video is very much unlike, let's say, training on images or text, which I think is very well established.

689 1:22:27,200 --> 1:22:31,200 Video is quite literally a dimension more complicated.

690 1:22:31,200 --> 1:22:39,200 And so that's why we needed to go end to end from the storage layer down to the accelerator and optimize every single piece of that.

691 1:22:39,200 --> 1:22:45,200 Because we train on the photon count videos that come directly from our fleet, we train on those directly.

692 1:22:45,200 --> 1:22:48,200 We do not post-process those at all.

693 1:22:48,200 --> 1:22:53,200 The way it's just done is we seek exactly to the frames we select for our batch.

694 1:22:53,200 --> 1:22:56,200 We load those in, including the frames that they depend on.

695 1:22:56,200 --> 1:22:58,200 So these are your I-frames or your key frames.

696 1:22:58,200 --> 1:23:03,200 We package those up, move them into shared memory, move them into a double buffer on the GPU,

697 1:23:03,200 --> 1:23:09,200 and then use the hardware decoder that's on the accelerator to actually decode the video.

698 1:23:09,200 --> 1:23:11,200 So we do that on the GPU natively.

699 1:23:11,200 --> 1:23:15,200 And it's all in a very nice PyTorch extension.

700 1:23:15,200 --> 1:23:26,200 Doing so unlocks more than 30% training speed increase for the occupancy networks and frees up basically the whole CPU to do any other thing.

701 1:23:26,200 --> 1:23:29,200 You cannot just do training with just videos.

702 1:23:29,200 --> 1:23:31,200 Of course, you need some kind of a ground truth.

703 1:23:31,200 --> 1:23:34,200 And that is actually an interesting problem as well.

704 1:23:34,200 --> 1:23:43,200 The objective for storing your ground truth is that you want to make sure you get to your ground truth that you need in the minimal amount of file system operations

705 1:23:43,200 --> 1:23:49,200 and load in the minimal size of what you need in order to optimize for aggregate cross-cluster throughput.

706 1:23:49,200 --> 1:23:56,200 Because you should see a compute cluster as one big device which has internally fixed constraints and thresholds.

707 1:23:56,200 --> 1:24:02,200 So for this, we rolled out a format that is native to us that's called Small.

708 1:24:02,200 --> 1:24:06,200 We use this for our ground truth, our feature cache, and any inference outputs.

709 1:24:06,200 --> 1:24:08,200 So a lot of tensors that are in there.

710 1:24:08,200 --> 1:24:10,200 And so just a cartoon here.

711 1:24:10,200 --> 1:24:13,200 Let's say this is your table that you want to store.

712 1:24:13,200 --> 1:24:16,200 Then that's how it would look if you rolled it out on disk.

713 1:24:16,200 --> 1:24:22,200 So what you do is you take anything you'd want to index on, so for example, video timestamps, you put those all in the header

714 1:24:22,200 --> 1:24:26,200 so that in your initial header read, you know exactly where to go on disk.

715 1:24:26,200 --> 1:24:34,200 Then if you have any tensors, you're going to try to transpose the dimensions to put a different dimension last as the contiguous dimension.

716 1:24:34,200 --> 1:24:37,200 And then also try different types of compression.

717 1:24:37,200 --> 1:24:41,200 Then you check out which one was most optimal and then store that one.

718 1:24:41,200 --> 1:24:46,200 This is actually a huge step if you do feature caching: take the output from the machine learning network,

719 1:24:46,200 --> 1:24:52,200 rotate the dimensions around a little bit, and you can get up to a 20% increase in storage efficiency.

720 1:24:52,200 --> 1:25:01,200 Then when you store that, we also order the columns by size so that all your small columns and small values are together

721 1:25:01,200 --> 1:25:06,200 so that when you seek for a single value, you're likely to overlap with the read on more values,

722 1:25:06,200 --> 1:25:11,200 which you'll use later so that you don't need to do another file system operation.
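A hypothetical sketch of the storage ideas just described; the format details here are invented, not the actual Small format. Indexable fields live in the header so a single read tells you where to seek, each column keeps whichever compression came out smallest, and columns are ordered by size so small values cluster together:

```python
import json, struct, zlib

def pack(columns: dict) -> bytes:
    """Serialize named byte columns: header first, then the column payloads."""
    blobs = {}
    for name, raw in columns.items():
        candidates = [raw, zlib.compress(raw)]        # try compressions, keep best
        best = min(candidates, key=len)
        blobs[name] = (best, best is not raw)
    # Order columns by size so small values end up adjacent on disk:
    ordered = sorted(blobs.items(), key=lambda kv: len(kv[1][0]))
    header, offset, payload = {}, 0, b""
    for name, (blob, compressed) in ordered:
        header[name] = {"offset": offset, "len": len(blob), "zlib": compressed}
        offset += len(blob)
        payload += blob
    h = json.dumps(header).encode()
    return struct.pack("<I", len(h)) + h + payload     # header length, header, data

def read_column(buf: bytes, name: str) -> bytes:
    """One header read tells us exactly where to seek for the column."""
    hlen = struct.unpack_from("<I", buf)[0]
    meta = json.loads(buf[4:4 + hlen])[name]
    start = 4 + hlen + meta["offset"]
    blob = buf[start:start + meta["len"]]
    return zlib.decompress(blob) if meta["zlib"] else blob

f = pack({"timestamps": b"\x01\x02\x03", "features": b"\x00" * 4096})
print(read_column(f, "features") == b"\x00" * 4096)    # True
```

Reading any column costs one header read plus one seek, which is the "minimal file system operations" objective stated above.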

723 1:25:11,200 --> 1:25:13,200 So I could go on and on.

724 1:25:13,200 --> 1:25:17,200 I just touched on two projects that we have internally.

725 1:25:17,200 --> 1:25:23,200 But this is actually part of a huge continuous effort to optimize the compute that we have in-house.

726 1:25:23,200 --> 1:25:27,200 So accumulating and aggregating through all these optimizations,

727 1:25:27,200 --> 1:25:32,200 we now train our occupancy networks twice as fast just because it's twice as efficient.

728 1:25:32,200 --> 1:25:38,200 And now if we add in a bunch more compute and go parallel, we can now train this in hours instead of days.

729 1:25:38,200 --> 1:25:43,200 And with that, I'd like to hand it off to the biggest user of compute, John.

730 1:25:43,200 --> 1:25:52,200 Hi, everybody. My name is John Emmons.

731 1:25:52,200 --> 1:25:54,200 I lead the Autopilot vision team.

732 1:25:54,200 --> 1:25:57,200 I'm going to cover two topics with you today.

733 1:25:57,200 --> 1:26:04,200 The first is how we predict lanes, and the second is how we predict the future behavior of other agents on the road.

734 1:26:04,200 --> 1:26:11,200 In the early days of Autopilot, we modeled the lane detection problem as an image space instance segmentation task.

735 1:26:11,200 --> 1:26:13,200 Our network was super simple, though.

736 1:26:13,200 --> 1:26:18,200 In fact, it was only capable of predicting lanes of a few different kinds of geometries.

737 1:26:18,200 --> 1:26:26,200 Specifically, it would segment the ego lane, it could segment adjacent lanes, and then it had some special casing for forks and merges.

738 1:26:26,200 --> 1:26:31,200 This simplistic modeling of the problem worked for highly structured roads like highways.

739 1:26:31,200 --> 1:26:35,200 But today we're trying to build a system that's capable of much more complex maneuvers.

740 1:26:35,200 --> 1:26:41,200 Specifically, we want to make left and right turns at intersections where the road topology can be quite a bit more complex and diverse.

741 1:26:41,200 --> 1:26:47,200 When we try to apply this simplistic modeling of the problem here, it just totally breaks down.

742 1:26:47,200 --> 1:26:54,200 Taking a step back for a moment, what we're trying to do here is to predict the sparse set of lane instances and their connectivity.

743 1:26:54,200 --> 1:27:00,200 And what we want to do is to have a neural network that basically predicts this graph where the nodes are the lane segments

744 1:27:00,200 --> 1:27:04,200 and the edges encode the connectivities between these lanes.

745 1:27:04,200 --> 1:27:08,200 So what we have is our lane detection neural network.

746 1:27:08,200 --> 1:27:11,200 It's made up of three components.

747 1:27:11,200 --> 1:27:16,200 In the first component, we have a set of convolutional layers, attention layers, and other neural network layers

748 1:27:16,200 --> 1:27:23,200 that encode the video streams from our eight cameras on the vehicle and produce a rich visual representation.

749 1:27:23,200 --> 1:27:32,200 We then enhance this visual representation with a coarse road level map data, which we encode with a set of additional neural network layers

750 1:27:32,200 --> 1:27:35,200 that we call the lane guidance module.

751 1:27:35,200 --> 1:27:40,200 This map is not an HD map, but it provides a lot of useful hints about the topology of lanes inside of intersections,

752 1:27:40,200 --> 1:27:46,200 the lane counts on various roads, and a set of other attributes that help us.

753 1:27:46,200 --> 1:27:51,200 The first two components here produce a dense tensor that sort of encodes the world.

754 1:27:51,200 --> 1:27:57,200 But what we really want to do is to convert this dense tensor into a sparse set of lanes and their connectivities.

755 1:27:57,200 --> 1:28:02,200 We approach this problem like an image captioning task, where the input is this dense tensor,

756 1:28:02,200 --> 1:28:09,200 and the output text is predicted in a special language that we developed at Tesla for encoding lanes and their connectivities.

757 1:28:09,200 --> 1:28:14,200 In this language of lanes, the words and tokens are the lane positions in 3D space.

758 1:28:14,200 --> 1:28:21,200 The ordering of the tokens and the predicted modifiers on the tokens encode the connectivity relationships between these lanes.

759 1:28:21,200 --> 1:28:26,200 By modeling the task as a language problem, we can capitalize on recent autoregressive architectures

760 1:28:26,200 --> 1:28:30,200 and techniques from the language community for handling the multiplicity of the problem.

761 1:28:30,200 --> 1:28:33,200 We're not just solving the computer vision problem at Autopilot.

762 1:28:33,200 --> 1:28:38,200 We're also applying the state-of-the-art in language modeling and machine learning more generally.

763 1:28:38,200 --> 1:28:42,200 I'm now going to dive into a little bit more detail of this language component.

764 1:28:42,200 --> 1:28:48,200 What I have depicted on the screen here is a satellite image which sort of represents the local area around the vehicle.

765 1:28:48,200 --> 1:28:56,200 The set of nodes and edges is what we refer to as the lane graph, and it's ultimately what we want to come out of this neural network.

766 1:28:56,200 --> 1:28:59,200 We start with a blank slate.

767 1:28:59,200 --> 1:29:03,200 We're going to want to make our first prediction here at this green dot.

768 1:29:03,200 --> 1:29:08,200 This green dot's position is encoded as an index into a coarse grid which discretizes the 3D world.

769 1:29:08,200 --> 1:29:13,200 Now, we don't predict this index directly because it would be too computationally expensive to do so.

770 1:29:13,200 --> 1:29:20,200 There's just too many grid points, and predicting a categorical distribution over this has both implications at training time and test time.

771 1:29:20,200 --> 1:29:23,200 So instead what we do is we discretize the world coarsely first.

772 1:29:23,200 --> 1:29:28,200 We predict a heat map over the possible locations, and then we latch in the most probable location.

773 1:29:28,200 --> 1:29:34,200 Conditioned on this, we then refine the prediction and get the precise point.
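The coarse-to-fine trick above, sketched with invented numbers: predict a heatmap over a coarse grid, latch the most probable cell, then refine with a continuous offset inside that cell, instead of classifying over every fine-grained grid point at once:

```python
import numpy as np

COARSE = 16                               # 16x16 coarse cells over the local region
rng = np.random.default_rng(3)
heatmap = rng.random((COARSE, COARSE))    # stand-in for the network's coarse heatmap

cell = np.unravel_index(np.argmax(heatmap), heatmap.shape)   # latch the argmax cell
offset = np.array([0.25, 0.70])           # stand-in for the refinement head, in [0,1)^2

point = (np.array(cell) + offset) / COARSE   # precise position in [0,1)^2
print(point.shape)                        # (2,)
```

A categorical head over 16x16 = 256 cells plus a 2D regression is far cheaper than a softmax over every fine grid point, which is the training- and test-time cost the talk alludes to.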

774 1:29:34,200 --> 1:29:38,200 Now, we know where the position of this token is, but we don't know its type.

775 1:29:38,200 --> 1:29:41,200 In this case, though, it's the beginning of a new lane.

776 1:29:41,200 --> 1:29:44,200 So we predict it as a start token.

777 1:29:44,200 --> 1:29:48,200 And because it's a start token, there's no additional attributes in our language.

778 1:29:48,200 --> 1:29:54,200 We then take the predictions from this first forward pass, and we encode them using a learned positional embedding

779 1:29:54,200 --> 1:30:00,200 which produces a set of tensors that we combine together, which is actually the first word in our language of lanes.

780 1:30:00,200 --> 1:30:04,200 We add this to the first position in our sentence here.

781 1:30:04,200 --> 1:30:09,200 We then continue this process by predicting the next lane point in a similar fashion.

782 1:30:09,200 --> 1:30:12,200 Now, this lane point is not the beginning of a new lane.

783 1:30:12,200 --> 1:30:15,200 It's actually a continuation of the previous lane.

784 1:30:15,200 --> 1:30:18,200 So it's a continuation token type.

785 1:30:18,200 --> 1:30:23,200 Now, it's not enough just to know that this lane is connected to the previously predicted lane.

786 1:30:23,200 --> 1:30:29,200 We want to encode its precise geometry, which we do by regressing a set of spline coefficients.

787 1:30:29,200 --> 1:30:34,200 We then take this lane, we encode it again, and add it as the next word in the sentence.

788 1:30:34,200 --> 1:30:39,200 We continue predicting these continuation lanes until we get to the end of the prediction grid.

789 1:30:39,200 --> 1:30:42,200 We then move on to a different lane segment.

790 1:30:42,200 --> 1:30:44,200 So you can see that cyan dot there.

791 1:30:44,200 --> 1:30:47,200 Now, it's not topologically connected to that pink point.

792 1:30:47,200 --> 1:30:52,200 It's actually forking off of that blue, sorry, that green point there.

793 1:30:52,200 --> 1:30:54,200 So it's got a fork type.

794 1:30:54,200 --> 1:31:00,200 And fork tokens actually point back to previous tokens from which the fork originates.

795 1:31:00,200 --> 1:31:03,200 So you can see here the fork point prediction is actually index zero.

796 1:31:03,200 --> 1:31:09,200 So it's actually referencing back to tokens that it's already predicted, like you would in language.

797 1:31:09,200 --> 1:31:14,200 We continue this process over and over again until we've enumerated all of the tokens in the lane graph.

798 1:31:14,200 --> 1:31:18,200 And then the network predicts the end of sentence token.
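
The "language of lanes" decoding loop described above can be sketched as a toy autoregressive decoder. The token fields, the scripted predictions, and the grid coordinates are invented to mirror the talk's example; the real model conditions a transformer on the token history:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

START, CONTINUATION, FORK, END = "start", "continuation", "fork", "end"

@dataclass
class LaneToken:
    kind: str
    position: Optional[Tuple[int, int]] = None       # cell on the prediction grid
    spline: List[float] = field(default_factory=list)  # geometry (continuation only)
    fork_from: Optional[int] = None                  # index the fork branches off

def decode(predict_next):
    """Autoregressively grow the 'sentence' until the end-of-sentence token."""
    sentence = []
    while True:
        token = predict_next(sentence)   # forward pass conditioned on history
        if token.kind == END:
            return sentence
        sentence.append(token)

# A scripted stand-in for the network, mirroring the example in the talk:
script = iter([
    LaneToken(START, (3, 0)),
    LaneToken(CONTINUATION, (3, 5), spline=[0.1, -0.2, 0.05]),
    LaneToken(FORK, (5, 2), fork_from=0),  # points back at token index 0
    LaneToken(END),
])
graph = decode(lambda sentence: next(script))
```

Note how the fork token references an earlier index in the sentence, exactly like a pointer back into previously generated text.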

799 1:31:18,200 --> 1:31:24,200 Yeah, I just wanted to note that the reason we do this is not just because we want to build something complicated.

800 1:31:24,200 --> 1:31:29,200 It almost feels like a Turing-complete machine here with neural networks. The thing is, we tried simpler approaches.

801 1:31:29,200 --> 1:31:34,200 For example, trying to just segment the lanes along the road or something like that.

802 1:31:34,200 --> 1:31:40,200 But then the problem is when there's uncertainty, say you cannot see the road clearly and there could be two lanes or three lanes,

803 1:31:40,200 --> 1:31:45,200 and you can't tell, a simple segmentation-based approach would just draw both of them.

804 1:31:45,200 --> 1:31:51,200 It's kind of a 2.5-lane situation, and the post-processing algorithm would hilariously fail when the predictions are like that.

805 1:31:51,200 --> 1:31:53,200 Yeah, and the problems don't end there.

806 1:31:53,200 --> 1:32:00,200 I mean, you need to predict these connective lanes inside of intersections, which is just not possible with the approach that Ashok's mentioning,

807 1:32:00,200 --> 1:32:02,200 which is why we had to upgrade to this sort of approach.

808 1:32:02,200 --> 1:32:05,200 Yeah, when it overlaps like this, segmentation would just go haywire.

809 1:32:05,200 --> 1:32:09,200 But even if you try very hard to put them on separate layers, it's just a really hard problem.

810 1:32:09,200 --> 1:32:19,200 But language just offers a really nice framework for getting a sample from a posterior, as opposed to trying to do all of this in post-processing.

811 1:32:19,200 --> 1:32:21,200 But this doesn't actually stop at just Autopilot, right, John?

812 1:32:21,200 --> 1:32:24,200 This can be used for Optimus.

813 1:32:24,200 --> 1:32:34,200 Yeah, I guess they wouldn't be called lanes, but you could imagine in this stage here that you might have paths that encode the possible places that people could walk.

814 1:32:34,200 --> 1:32:41,200 Yeah, basically if you're in a factory or in a home setting, you can just ask the robot, okay, please route me to the kitchen,

815 1:32:41,200 --> 1:32:48,200 or please route to some location in the factory, and then we predict a set of pathways that would go through the aisles, take the robot,

816 1:32:48,200 --> 1:32:50,200 and say, okay, this is how you get to the kitchen.

817 1:32:50,200 --> 1:33:00,200 It just really gives us a nice framework to model these different paths that simplify the navigation problem for the downstream planner.

818 1:33:00,200 --> 1:33:07,200 All right, so ultimately what we get from this lane detection network is a set of lanes and their connectivities, which comes directly from the network.

819 1:33:07,200 --> 1:33:13,200 There's no additional step here for converting these dense predictions into sparse ones.

820 1:33:13,200 --> 1:33:18,200 This is just the direct unfiltered output of the network.

821 1:33:18,200 --> 1:33:20,200 Okay, so I talked a little bit about lanes.

822 1:33:20,200 --> 1:33:26,200 I'm going to briefly touch on how we model and predict the future paths and other semantics on objects.

823 1:33:26,200 --> 1:33:29,200 So I'm just going to go really quickly through two examples.

824 1:33:29,200 --> 1:33:34,200 The video on the right here, we've got a car that's actually running a red light and turning in front of us.

825 1:33:34,200 --> 1:33:40,200 What we do to handle situations like this is we predict a set of short time horizon future trajectories on all objects.

826 1:33:40,200 --> 1:33:48,200 We can use these to anticipate the dangerous situation here and apply whatever braking and steering action is required to avoid a collision.

827 1:33:48,200 --> 1:33:51,200 In the video on the right, there's two vehicles in front of us.

828 1:33:51,200 --> 1:33:53,200 The one on the left lane is parked.

829 1:33:53,200 --> 1:33:55,200 Apparently it's being loaded or unloaded.

830 1:33:55,200 --> 1:33:57,200 I don't know why the driver decided to park there.

831 1:33:57,200 --> 1:34:02,200 But the important thing is that our neural network predicted that it was stopped, which is the red color there.

832 1:34:02,200 --> 1:34:06,200 The vehicle in the other lane, as you notice, also is stationary.

833 1:34:06,200 --> 1:34:08,200 But that one's obviously just waiting for that red light to turn green.

834 1:34:08,200 --> 1:34:19,200 So even though both objects are stationary and have zero velocity, it's the semantics that is really important here so that we don't get stuck behind that awkwardly parked car.

835 1:34:19,200 --> 1:34:24,200 Predicting all of these agent attributes presents some practical problems when trying to build a real time system.

836 1:34:24,200 --> 1:34:30,200 We need to maximize the frame rate of our object detection stack so that Autopilot can quickly react to the changing environment.

837 1:34:30,200 --> 1:34:32,200 Every millisecond really matters here.

838 1:34:32,200 --> 1:34:37,200 To minimize the inference latency, our neural network is split into two phases.

839 1:34:37,200 --> 1:34:42,200 In the first phase, we identify the locations in 3D space where agents exist.

840 1:34:42,200 --> 1:34:52,200 In the second stage, we then pull out tensors at those 3D locations, append them with additional data that's on the vehicle, and then we do the rest of the processing.

841 1:34:52,200 --> 1:35:01,200 This sparsification step allows the neural network to focus compute on the areas that matter most, which gives us superior performance for a fraction of the latency cost.
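
The two-phase sparsification idea can be illustrated with a NumPy toy: a cheap dense pass flags candidate cells, and the expensive per-agent work only touches those. Grid size, threshold, and feature width are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Phase 1: a cheap dense head scores every cell of the grid for
# "an agent is here" (all sizes here are illustrative).
grid_scores = rng.random((20, 20))
agent_cells = np.argwhere(grid_scores > 0.95)   # sparse set of candidate locations

# Phase 2: pull out feature tensors only at those locations and run the
# heavy per-agent processing there instead of over the whole grid.
features = rng.normal(size=(20, 20, 64))
per_agent = features[agent_cells[:, 0], agent_cells[:, 1]]  # (num_agents, 64)
```

The heavy compute now scales with the handful of agents present, not with the 400 grid cells, which is where the latency savings come from.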

842 1:35:01,200 --> 1:35:06,200 So putting it all together, the Autopilot Vision Stack predicts more than just the geometry and kinematics of the world.

843 1:35:06,200 --> 1:35:11,200 It also predicts a rich set of semantics, which enables safe and human-like driving.

844 1:35:11,200 --> 1:35:15,200 I'm now going to hand things off to Sri, who will tell us how we run all these cool neural networks on our FSD computer.

845 1:35:15,200 --> 1:35:16,200 Thank you.

846 1:35:16,200 --> 1:35:26,200 Hi, everyone. I'm Sri.

847 1:35:26,200 --> 1:35:34,200 Today I'm going to give a glimpse of what it takes to run these FSD networks in the car, and how we optimize for inference latency.

848 1:35:34,200 --> 1:35:41,200 Today I'm going to focus just on the FSD lanes network that John just talked about.

849 1:35:41,200 --> 1:35:53,200 So when we started this track, we wanted to know if we could run this FSD lanes network natively on the TRIP engine, which is our in-house neural network accelerator that we built into the FSD computer.

850 1:35:53,200 --> 1:36:02,200 When we built this hardware, we kept it simple and made sure it can do one thing ridiculously fast: dense dot products.

851 1:36:02,200 --> 1:36:14,200 But this architecture is autoregressive and iterative, where it crunches through multiple attention blocks in the inner loop, producing sparse points directly at every step.

852 1:36:14,200 --> 1:36:21,200 So the challenge here was, how can we do this sparse point prediction and sparse computation on a dense dot product engine?

853 1:36:21,200 --> 1:36:25,200 Let's see how we did this on the TRIP engine.

854 1:36:25,200 --> 1:36:32,200 So the network predicts a heat map of the most probable spatial locations of the point.

855 1:36:32,200 --> 1:36:41,200 Now we do an argmax and a one-hot operation, which gives the one-hot encoding of the index of the spatial location.

856 1:36:41,200 --> 1:36:49,200 Now we need to select the embedding associated with this index from an embedding table that is learned during training.

857 1:36:49,200 --> 1:37:02,200 To do this on TRIP, we actually built a lookup table in SRAM, and we engineered the dimensions of this embedding such that we could achieve all of these things with just matrix multiplication.

858 1:37:02,200 --> 1:37:12,200 Not just that, we also wanted to store this embedding into a token cache so that we don't recompute this for every iteration, rather reuse it for future point prediction.

859 1:37:12,200 --> 1:37:19,200 Again, we put some tricks here where we did all these operations just on the dot product engine.
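
The core trick, an embedding lookup expressed as a dense matrix multiply so it runs on a dot-product engine, is easy to show in NumPy. The vocabulary size, embedding width, and heat-map values are invented:

```python
import numpy as np

vocab, dim = 8, 4
# A learned embedding table; here just distinguishable dummy values.
table = np.arange(vocab * dim, dtype=np.float32).reshape(vocab, dim)

# Heat map over 8 possible locations; argmax picks index 2.
heatmap = np.array([0.1, 0.05, 0.7, 0.15, 0.0, 0.0, 0.0, 0.0])
one_hot = np.eye(vocab)[np.argmax(heatmap)]   # argmax -> one-hot encoding

# On hardware that only does dense dot products, "table[index]"
# becomes a (1 x vocab) @ (vocab x dim) matrix multiplication:
embedding = one_hot @ table
```

The result is identical to a direct indexed lookup, but every operation along the way is a dense multiply-accumulate, which is exactly what the accelerator is built for.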

860 1:37:19,200 --> 1:37:31,200 It's actually cool that our team found creative ways to map all these operations on the trip engine in ways that were not even imagined when this hardware was designed.

861 1:37:31,200 --> 1:37:34,200 But that's not the only thing we had to do to make this work.

862 1:37:34,200 --> 1:37:45,200 We actually implemented a whole lot of operations and features to make this model compilable, to improve the int8 accuracy, as well as to optimize performance.

863 1:37:45,200 --> 1:37:56,200 All of these things helped us run this 75 million parameter model just under 10 milliseconds of latency, consuming just 8 watts of power.
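
A quick back-of-envelope on the figures just quoted (75 million parameters, under 10 ms, 8 W), purely to put them in perspective:

```python
# Figures as quoted in the talk.
latency_s = 10e-3    # under 10 ms per inference
power_w = 8.0        # 8 watts

energy_per_inference_j = power_w * latency_s   # roughly 0.08 J per forward pass
max_frame_rate_hz = 1.0 / latency_s            # ~100 Hz if run back-to-back
```

So each pass of this network costs on the order of a tenth of a joule, and latency alone would allow roughly 100 frames per second if it ran back-to-back.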

864 1:37:56,200 --> 1:38:04,200 But this is not the only architecture running in the car. There are so many other architectures, modules, and networks we need to run in the car.

865 1:38:04,200 --> 1:38:13,200 To give a sense of scale, there are about a billion parameters of all the networks combined, producing around 1,000 neural network signals.

866 1:38:13,200 --> 1:38:24,200 So we need to make sure we optimize them jointly, such that we maximize compute utilization and throughput and minimize latency.

867 1:38:24,200 --> 1:38:32,200 So we built a compiler just for neural networks that shares its structure with traditional compilers.

868 1:38:32,200 --> 1:38:49,200 As you can see, it takes the massive graph of neural nets, with 150k nodes and 375k connections, partitions it into independent subgraphs, and compiles each of those subgraphs natively for the inference devices.

869 1:38:49,200 --> 1:38:57,200 Then we have a neural network linker, which shares its structure with a traditional linker, where we perform link-time optimization.

870 1:38:57,200 --> 1:39:10,200 There, we solve an offline optimization problem with compute, memory, and memory bandwidth constraints, so that it comes up with an optimized schedule that gets executed in the car.

871 1:39:10,200 --> 1:39:24,200 On the runtime, we designed a hybrid scheduling system, which basically does heterogeneous scheduling on one SoC and distributed scheduling across both the SoCs to run these networks in a model parallel fashion.

872 1:39:24,200 --> 1:39:49,200 To get 100 TOPS of compute utilization, we need to optimize across all the layers of software, right from tuning the network architecture and the compiler, all the way to implementing a low latency, high bandwidth RDMA link across both the SoCs, and in fact going even deeper, to understanding and optimizing the cache-coherent and non-coherent data paths of the accelerator in the SoC.

873 1:39:49,200 --> 1:39:59,200 This is a lot of optimization at every level in order to make sure we get the highest frame rate, as every millisecond counts here.

874 1:39:59,200 --> 1:40:08,200 And this is just the visualization of the neural networks that are running in the car. This is our digital brain, essentially.

875 1:40:08,200 --> 1:40:17,200 As you can see, these operations are nothing but matrix multiplications and convolutions, to name a few of the real operations running in the car.

876 1:40:17,200 --> 1:40:36,200 To train this network with a billion parameters, you need a lot of labeled data. So Egan is going to talk about how we achieve this with the auto labeling pipeline.

877 1:40:36,200 --> 1:40:38,200 Thank you, Sri.

878 1:40:38,200 --> 1:40:43,200 Hi, everyone. I'm Egan Zhang, and I'm leading Geometric Vision at Autopilot.

879 1:40:43,200 --> 1:40:48,200 So, yeah, let's talk about auto labeling.

880 1:40:48,200 --> 1:40:54,200 So we have several kinds of auto labeling frameworks to support various types of networks.

881 1:40:54,200 --> 1:40:59,200 But today, I'd like to focus on the awesome LanesNet here.

882 1:40:59,200 --> 1:41:12,200 So to successfully train and generalize this network to everywhere, we think we need tens of millions of trips, from probably one million intersections or even more.

883 1:41:12,200 --> 1:41:15,200 So then how to do that?

884 1:41:15,200 --> 1:41:28,200 So it is certainly achievable to source a sufficient amount of trips, because as Tim explained earlier, we already have a cache rate of around 500,000 trips per day.

885 1:41:28,200 --> 1:41:36,200 However, converting all of those data into a training form is a very challenging technical problem.

886 1:41:36,200 --> 1:41:50,200 To solve this challenge, we've tried various ways of manual and auto labeling. So from the first column to the second, from the second to the third, each advance provided us nearly 100x improvement in throughput.

887 1:41:50,200 --> 1:42:02,200 But still, we want an even better auto labeling machine that can provide us good quality, diversity, and scalability.

888 1:42:02,200 --> 1:42:14,200 To meet all these requirements, despite the huge amount of engineering effort required here, we've developed a new auto labeling machine powered by multi-trip reconstruction.

889 1:42:14,200 --> 1:42:24,200 So this can replace five million hours of manual labeling with just 12 hours on the cluster for labeling 10,000 trips.
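
It's worth doing the arithmetic on those numbers (all three are as quoted in the talk):

```python
# Figures as quoted: 5M manual hours replaced by 12 cluster hours for 10k trips.
manual_hours = 5_000_000
cluster_hours = 12
trips = 10_000

manual_hours_per_trip = manual_hours / trips       # 500 hours of human work per trip
wall_clock_speedup = manual_hours / cluster_hours  # ~417,000x in wall-clock terms
```

That implies roughly 500 hours of human labeling per trip replaced, and a wall-clock speedup on the order of four hundred thousand times, with the caveat that cluster hours and human hours are not directly comparable units of effort.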

890 1:42:24,200 --> 1:42:27,200 So how did we solve it? There are three big steps.

891 1:42:27,200 --> 1:42:34,200 The first step is high precision trajectory and structure recovery by multi-camera visual-inertial odometry.

892 1:42:34,200 --> 1:42:43,200 So here all the features, including ground surface, are inferred from videos by neural networks, then tracked and reconstructed in the vector space.

893 1:42:43,200 --> 1:42:56,200 So the typical drift rate of this in-car trajectory is about 1.3 centimeters per meter and 0.45 milliradians per meter, which is pretty decent considering its compact compute requirement.

894 1:42:56,200 --> 1:43:04,200 The recovered surface and road details are also used as strong guidance for the later manual verification step.

895 1:43:04,200 --> 1:43:13,200 This is also enabled in every FSD vehicle, so we get pre-processed trajectories and structures along with the trip data.

896 1:43:13,200 --> 1:43:21,200 The second step is multi-trip reconstruction, which is the big and core piece of this machine.

897 1:43:21,200 --> 1:43:31,200 So the video shows how the previously shown trip is reconstructed and aligned with other trips, basically other trips from different vehicles, not the same vehicle.

898 1:43:31,200 --> 1:43:40,200 So this is done by multiple internal steps like coarse alignment, pairwise matching, joint optimization, then further surface refinement.

899 1:43:40,200 --> 1:43:45,200 In the end, the human analyst comes in and finalizes the label.

900 1:43:45,200 --> 1:43:55,200 Each of these steps is already fully parallelized on the cluster, so the entire process usually takes just a couple of hours.

901 1:43:55,200 --> 1:44:01,200 The last step is actually auto labeling the new trips.

902 1:44:01,200 --> 1:44:10,200 So here we use the same multi-trip alignment engine, but only between pre-built reconstruction and each new trip.

903 1:44:10,200 --> 1:44:15,200 So it's much, much simpler than fully reconstructing all the clips altogether.

904 1:44:15,200 --> 1:44:24,200 That's why it only takes 30 minutes per trip to auto label instead of several hours of manual labeling.

905 1:44:24,200 --> 1:44:31,200 And this is also the key of scalability of this machine.

906 1:44:31,200 --> 1:44:38,200 This machine easily scales as long as we have available compute and trip data.

907 1:44:38,200 --> 1:44:43,200 So about 50 trips were newly auto labeled from this scene, and some of them are shown here.

908 1:44:43,200 --> 1:44:47,200 53 trips from different vehicles, to be exact.

909 1:44:47,200 --> 1:44:54,200 So this is how we capture and transform the space-time slices of the world into the network supervision.

910 1:44:54,200 --> 1:45:00,200 Yeah, one thing I'd like to note is that Egan just talked about how we auto label our lanes.

911 1:45:00,200 --> 1:45:06,200 We have auto labelers for almost every task that we do, including our planner, and many of these are fully automatic.

912 1:45:06,200 --> 1:45:13,200 There are no humans involved. For example, for objects, all the kinematics, the shapes, the future trajectories, everything just comes from auto labeling.

913 1:45:13,200 --> 1:45:17,200 And the same is true for occupancy, too. And we have really just built a machine around this.

914 1:45:17,200 --> 1:45:22,200 Yeah, so if you can go back one slide. One more.

915 1:45:22,200 --> 1:45:29,200 It says parallelized on cluster. So that sounds pretty straightforward, but it really wasn't.

916 1:45:29,200 --> 1:45:33,200 Maybe it's fun to share how something like this comes about.

917 1:45:33,200 --> 1:45:39,200 So a while ago, we didn't have any auto labeling at all. And then someone makes a script.

918 1:45:39,200 --> 1:45:45,200 It starts to work. It starts working better until we reach a volume that's pretty high, and we clearly need a solution.

919 1:45:45,200 --> 1:45:51,200 And so there were two other engineers in our team who were like, you know, that's an interesting thing.

920 1:45:51,200 --> 1:45:57,200 What we needed to do was build a whole graph of essentially Python functions that we need to run one after the other.

921 1:45:57,200 --> 1:46:01,200 First, you pull the clip, then you do some cleaning, then you do some network inference,

922 1:46:01,200 --> 1:46:06,200 then another network inference until you finally get this. But so you need to do this at a large scale.

923 1:46:06,200 --> 1:46:14,200 So I tell them, we probably need to shoot for, you know, 100,000 clips per day or like 100,000 items. That seems good.

924 1:46:14,200 --> 1:46:21,200 And so the engineers said, well, with a bit of Postgres and a bit of elbow grease, we can do it.

925 1:46:21,200 --> 1:46:28,200 Fast forward a bit, and we're now doing 20 million of these functions every single day.

926 1:46:28,200 --> 1:46:34,200 Again, we pull in around half a million clips and on those we run a ton of functions, each of these in a streaming fashion.

927 1:46:34,200 --> 1:46:40,200 And so that's kind of the back-end infra that's also needed to not just run training, but also auto labeling.
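
The pipeline Ashok describes, a graph of Python functions run one after another on each clip in streaming fashion, can be sketched as a toy. All stage names here are invented for illustration; the real system runs millions of such function invocations per day on a cluster:

```python
# A toy version of the label-factory pipeline: a fixed chain of Python
# functions applied to each clip, one clip at a time (streaming).
def pull_clip(clip_id):
    return {"id": clip_id, "frames": []}   # stand-in for fetching fleet data

def clean(clip):
    clip["cleaned"] = True                 # stand-in for data cleaning
    return clip

def run_inference(clip):
    clip["detections"] = []                # stand-in for a network inference pass
    return clip

PIPELINE = [clean, run_inference]          # the real graph has many more stages

def process(clip_ids):
    for cid in clip_ids:                   # streaming: one clip at a time
        clip = pull_clip(cid)
        for stage in PIPELINE:
            clip = stage(clip)
        yield clip

results = list(process(["clip-001", "clip-002"]))
```

Scaling this from a script to 20 million function runs a day is then a distributed-systems problem (queues, retries, yield tracking) rather than a change to the per-clip logic.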

928 1:46:40,200 --> 1:46:46,200 It really is like a factory that produces labels: production lines, yield, quality, inventory,

929 1:46:46,200 --> 1:46:52,200 all of the same concepts that apply to the factory for our cars apply to this label factory.

930 1:46:52,200 --> 1:46:55,200 That's right.

931 1:46:55,200 --> 1:46:58,200 OK, thanks, Tim and Ashok.

932 1:46:58,200 --> 1:47:06,200 So, yeah, concluding this section, I'd like to share a few more examples that are challenging and interesting for the network, for sure,

933 1:47:06,200 --> 1:47:15,200 and even for humans, probably. From the top, there are examples like a lack-of-light case, a foggy night, a roundabout,

934 1:47:15,200 --> 1:47:22,200 heavy occlusions by parked cars, and even a rainy night with raindrops on the camera lenses.

935 1:47:22,200 --> 1:47:27,200 These are challenging, but once their original scenes are fully reconstructed by other clips,

936 1:47:27,200 --> 1:47:34,200 all of them can be auto labeled so that our cars can drive even better through these challenging scenarios.

937 1:47:34,200 --> 1:47:47,200 So now let me pass the mic to David to learn more about how Sim is creating the new world on top of these labels. Thank you.

938 1:47:47,200 --> 1:47:51,200 Thank you, Yegan. My name is David and I'm going to talk about simulation.

939 1:47:51,200 --> 1:47:58,200 So simulation plays a critical role in providing data that is difficult to source and or hard to label.

940 1:47:58,200 --> 1:48:02,200 However, 3D scenes are notoriously slow to produce.

941 1:48:02,200 --> 1:48:10,200 Take, for example, the simulated scene playing behind me, a complex intersection from Market Street in San Francisco.

942 1:48:10,200 --> 1:48:13,200 It would take two weeks for artists to complete.

943 1:48:13,200 --> 1:48:16,200 And for us, that is painfully slow.

944 1:48:16,200 --> 1:48:22,200 However, I'm going to talk about using Yegan's automated ground truth labels along with some brand new tooling

945 1:48:22,200 --> 1:48:27,200 that allows us to procedurally generate this scene and many like it in just five minutes.

946 1:48:27,200 --> 1:48:31,200 That's an amazing one thousand times faster than before.

947 1:48:31,200 --> 1:48:36,200 So let's dive in to how a scene like this is created.

948 1:48:36,200 --> 1:48:43,200 We start by piping the automated ground truth labels into our simulated world creator tooling inside the software Houdini.

949 1:48:43,200 --> 1:48:50,200 Starting with road boundary labels, we can generate a solid road mesh and re-topologize it with the lane graph labels.

950 1:48:50,200 --> 1:48:57,200 This helps inform important road details like crossroad slope and detailed material blending.

951 1:48:57,200 --> 1:49:07,200 Next, we can use the line data and sweep geometry across its surface and project it to the road, creating lane paint decals.

952 1:49:07,200 --> 1:49:13,200 Next, using median edges, we can spawn median island geometry and populate it with randomized foliage.

953 1:49:13,200 --> 1:49:16,200 This drastically changes the visibility of the scene.

954 1:49:16,200 --> 1:49:21,200 Now, the outside world can be generated through a series of randomized heuristics.

955 1:49:21,200 --> 1:49:28,200 Modular building generators create visual obstructions while randomly placed objects like hydrants can change the color of the curbs,

956 1:49:28,200 --> 1:49:33,200 while trees can drop leaves below them, obscuring lines or edges.

957 1:49:33,200 --> 1:49:39,200 Next, we can bring in map data to inform positions of things like traffic lights or stop signs.

958 1:49:39,200 --> 1:49:48,200 We can trace along its normal to collect important information like number of lanes and even get accurate street names on the signs themselves.

959 1:49:48,200 --> 1:49:57,200 Next, using lane graph, we can determine lane connectivity and spawn directional road markings on the road and their accompanying road signs.

960 1:49:57,200 --> 1:50:06,200 And finally, with lane graph itself, we can determine lane adjacency and other useful metrics to spawn randomized traffic permutations inside our simulator.

961 1:50:06,200 --> 1:50:11,200 And again, this is all automatic, no artists in the loop, and happens within minutes.

962 1:50:11,200 --> 1:50:15,200 And now this sets us up to do some pretty cool things.

963 1:50:15,200 --> 1:50:23,200 Since everything is based on data and heuristics, we can start to fuzz parameters to create visual variations of the single ground truth.

964 1:50:23,200 --> 1:50:34,200 It can be as subtle as object placement and random material swapping to more drastic changes like entirely new biomes or locations of environment like urban, suburban, or rural.

965 1:50:34,200 --> 1:50:43,200 This allows us to create infinite targeted permutations for specific scenarios that we need more ground truth for.

966 1:50:43,200 --> 1:50:47,200 And all this happens within a click of a button.
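
The "fuzz parameters to create visual variations" idea can be sketched as a seeded randomizer over scene parameters. The parameter names, biome list, and ranges are invented; the point is that labels stay fixed while appearance varies:

```python
import random

# Illustrative scene parameters; the real generator exposes far more.
BIOMES = ["urban", "suburban", "rural"]
MATERIALS = ["asphalt_new", "asphalt_worn", "concrete"]

def fuzz_scene(ground_truth, seed):
    """One visual permutation of the same ground-truth lane graph."""
    rng = random.Random(seed)            # seeded, so permutations are reproducible
    return {
        **ground_truth,                  # geometry and labels stay fixed
        "biome": rng.choice(BIOMES),
        "road_material": rng.choice(MATERIALS),
        "object_jitter_m": rng.uniform(0.0, 0.5),
    }

gt = {"lane_graph": "market_street_tile"}   # hypothetical tile name
variants = [fuzz_scene(gt, seed=s) for s in range(100)]
```

Every variant carries identical supervision, so the network sees one hard ground truth under a hundred different appearances.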

967 1:50:47,200 --> 1:50:52,200 And we can even take this one step further by altering our ground truth itself.

968 1:50:52,200 --> 1:50:59,200 Say John wants his network to pay more attention to directional road markings to better detect an upcoming captive left turn lane.

969 1:50:59,200 --> 1:51:12,200 We can start to procedurally alter our lane graph inside the simulator, creating entirely new flows through this intersection to help focus the network's attention on the road markings and produce more accurate predictions.

970 1:51:12,200 --> 1:51:20,200 And this is a great example of how this tooling allows us to create new data that can never be collected from the real world.

971 1:51:20,200 --> 1:51:28,200 And the true power of this tool is in its architecture and how we can run all tasks in parallel to infinitely scale.

972 1:51:28,200 --> 1:51:35,200 So you saw the tile creator tool in action converting the ground truth labels into their counterparts.

973 1:51:35,200 --> 1:51:43,200 Next, we can use our tile extractor tool to divide this data into geohash tiles, about 150 meters square in size.

974 1:51:43,200 --> 1:51:47,200 We then save out that data into separate geometry and instance files.

975 1:51:47,200 --> 1:51:56,200 This gives us a clean source of data that's easy to load and allows us to be rendering engine agnostic for the future.

976 1:51:56,200 --> 1:52:02,200 Then, using a tile loader tool, we can summon any number of those cached tiles using a geohash ID.

977 1:52:02,200 --> 1:52:11,200 Currently we load about five-by-five or three-by-three tile sets, usually centered around fleet hotspots or interesting lane graph locations.

978 1:52:11,200 --> 1:52:23,200 And the tile loader also converts these tile sets into UAssets for consumption by the Unreal Engine, giving you the finished product from what you saw on the first slide.
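
The tile-addressing scheme can be sketched as follows. This is not a real geohash implementation; it fakes a tile key from a 150 m grid just to make the load-an-n-by-n-block-by-ID flow concrete, and the meters-per-degree constant is a rough latitude approximation:

```python
TILE_M = 150.0                 # ~150 m square tiles, as described in the talk
M_PER_DEG = 111_320.0          # rough meters per degree of latitude

def tile_id(lat, lon):
    """Fake tile key: which 150 m grid cell this point falls in."""
    row = int(lat * M_PER_DEG // TILE_M)
    col = int(lon * M_PER_DEG // TILE_M)
    return f"tile_{row}_{col}"

def load_grid(center_lat, center_lon, n=5):
    """Summon an n-by-n block of cached tiles around a hotspot."""
    half = n // 2
    step = TILE_M / M_PER_DEG  # one tile's width in degrees
    return [
        tile_id(center_lat + dr * step, center_lon + dc * step)
        for dr in range(-half, half + 1)
        for dc in range(-half, half + 1)
    ]

# Hypothetical fleet hotspot in San Francisco:
tiles = load_grid(37.7749, -122.4194, n=5)
```

Each key then maps to pre-saved geometry and instance files, which is what makes the loader rendering-engine agnostic.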

979 1:52:23,200 --> 1:52:26,200 And this really sets us up for size and scale.

980 1:52:26,200 --> 1:52:32,200 And as you can see on the map behind us, we can easily generate most of San Francisco's city streets.

981 1:52:32,200 --> 1:52:38,200 And this didn't take years or even months of work, but rather two weeks by one person.

982 1:52:38,200 --> 1:52:44,200 We can continue to manage and grow all this data using our PDG network inside of the tooling.

983 1:52:44,200 --> 1:52:50,200 This allows us to throw compute at it and regenerate all these tile sets overnight.

984 1:52:50,200 --> 1:53:04,200 This ensures all environments are consistent in quality and features, which is super important for training, since new ontologies and signals are constantly released.

985 1:53:04,200 --> 1:53:13,200 And this comes full circle, because we generated all these tile sets from ground truth data that contains all the weird intricacies of the real world.

986 1:53:13,200 --> 1:53:21,200 And we can combine that with the procedural visual and traffic variety to create limitless targeted data for the network to learn from.

987 1:53:21,200 --> 1:53:22,200 And that concludes the sim section.

988 1:53:22,200 --> 1:53:27,200 I'll pass it to Kate to talk about how we can use all this data to improve autopilot.

989 1:53:27,200 --> 1:53:37,200 Thanks, David.

990 1:53:37,200 --> 1:53:38,200 Hi, everyone.

991 1:53:38,200 --> 1:53:46,200 My name is Kate Park, and I'm here to talk about the data engine, which is the process by which we improve our neural networks via data.

992 1:53:46,200 --> 1:53:54,200 We're going to show you how we deterministically solve interventions via data and walk you through the life of this particular clip.

993 1:53:54,200 --> 1:54:04,200 In this scenario, autopilot is approaching a turn and incorrectly predicts that crossing vehicle as stopped for traffic and thus a vehicle that we would slow down for.

994 1:54:04,200 --> 1:54:07,200 In reality, there's nobody in the car.

995 1:54:07,200 --> 1:54:09,200 It's just awkwardly parked.

996 1:54:09,200 --> 1:54:17,200 We built this tooling to identify the mispredictions, correct the label and categorize this clip into an evaluation set.

997 1:54:17,200 --> 1:54:24,200 This particular clip happens to be one of 126 that we've diagnosed as challenging parked cars at turns.

998 1:54:24,200 --> 1:54:34,200 Because of this infra, we can curate this evaluation set without any engineering resources customized to this particular challenge case.

999 1:54:34,200 --> 1:54:39,200 To actually solve that challenge case requires mining thousands of examples like it.

1000 1:54:39,200 --> 1:54:42,200 And it's something Tesla can trivially do.

1001 1:54:42,200 --> 1:54:53,200 We simply use our data sourcing infra to request data, and use the tooling shown previously to correct the labels, surgically targeting the mispredictions of the current model.

1002 1:54:53,200 --> 1:54:58,200 We're only adding the most valuable examples to our training set.

1003 1:54:58,200 --> 1:55:02,200 We surgically fix 13,900 clips.

1004 1:55:02,200 --> 1:55:09,200 And because those were examples where the current model struggles, we don't even need to change the model architecture.

1005 1:55:09,200 --> 1:55:14,200 A simple weight update with this new valuable data is enough to solve the challenge case.
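
The data engine loop Kate describes (mine mispredictions, correct labels, update weights, repeat) can be shown with a deliberately tiny stand-in. The single feature, the threshold "model," and the update rule are all invented; this is not Tesla's actual vehicle-movement signal, just the shape of the loop:

```python
import random

rng = random.Random(0)

# Toy challenge case: classify "parked" from one invented feature x,
# with ground truth parked iff x > 0.7. The "model" is a threshold.
def make_clips(n):
    xs = [rng.uniform(0, 1) for _ in range(n)]
    return [(x, x > 0.7) for x in xs]

def mispredicted(clip, thr):
    x, label = clip
    return (x > thr) != label

threshold = 0.5                            # initial, imperfect "weights"

for _ in range(5):                         # repeated data-engine rounds
    clips = make_clips(1000)
    # 1. Mine the clips the current model gets wrong.
    hard = [c for c in clips if mispredicted(c, threshold)]
    if not hard:
        break
    # 2-3. Corrected labels are already in the toy data; the "weight
    # update" nudges the threshold toward the hard examples only.
    threshold += 0.5 * (sum(x for x, _ in hard) / len(hard) - threshold)

errors = sum(mispredicted(c, threshold) for c in make_clips(1000))
```

Note that the model architecture never changes; training only on freshly mined hard examples is what drives the error rate down, which mirrors the claim that data, not architecture, is the lever.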

1006 1:55:14,200 --> 1:55:21,200 So you see we no longer predict that crossing vehicle as stopped as shown in orange, but parked as shown in red.

1007 1:55:21,200 --> 1:55:25,200 In academia, we often see that people keep data constant.

1008 1:55:25,200 --> 1:55:28,200 But at Tesla, it's very much the opposite.

1009 1:55:28,200 --> 1:55:37,200 We see time and again that data is one of the best, if not the most deterministic, levers for solving these interventions.

1010 1:55:37,200 --> 1:55:42,200 We just showed you the data engine loop for one challenge case, namely these parked cars at turns.

1011 1:55:42,200 --> 1:55:47,200 But there are many challenge cases even for one signal of vehicle movement.

1012 1:55:47,200 --> 1:55:55,200 We apply this data engine loop to every single challenge case we've diagnosed, whether it's buses, curvy roads, stopped vehicles, parking lots.

1013 1:55:55,200 --> 1:55:57,200 And we don't just add data once.

1014 1:55:57,200 --> 1:56:01,200 We do this again and again to perfect the semantic.

1015 1:56:01,200 --> 1:56:13,200 In fact, this year we updated our vehicle movement signal five times and with every weight update trained on the new data, we push our vehicle movement accuracy up and up.

1016 1:56:13,200 --> 1:56:23,200 This data engine framework applies to all our signals, whether they're 3D, multicam video, whether the data is human labeled, auto labeled or simulated,

1017 1:56:23,200 --> 1:56:27,200 whether it's an offline model or an online model.

1018 1:56:27,200 --> 1:56:36,200 Tesla is able to do this at scale because of the fleet advantage, the infra that our engineering team has built, and the labeling resources that feed our networks.

1019 1:56:36,200 --> 1:56:40,200 To train on all this data, we need a massive amount of compute.

1020 1:56:40,200 --> 1:56:45,200 So I'll hand it off to Pete and Ganesh to talk about the Dojo supercomputing platform.

1021 1:56:45,200 --> 1:56:55,200 Thank you.

1022 1:56:55,200 --> 1:56:56,200 Thanks, everybody.

1023 1:56:56,200 --> 1:56:57,200 Thanks for hanging in there.

1024 1:56:57,200 --> 1:56:59,200 We're almost there.

1025 1:56:59,200 --> 1:57:00,200 My name is Pete Bannon.

1026 1:57:00,200 --> 1:57:05,200 I run the custom silicon and low voltage teams at Tesla.

1027 1:57:05,200 --> 1:57:07,200 And my name is Ganesh Venkataramanan.

1028 1:57:07,200 --> 1:57:14,200 I run the Dojo program.

1029 1:57:14,200 --> 1:57:16,200 Thank you.

1030 1:57:16,200 --> 1:57:21,200 I'm frequently asked, why is a car company building a supercomputer for training?

1031 1:57:21,200 --> 1:57:27,200 This question fundamentally misunderstands the nature of Tesla.

1032 1:57:27,200 --> 1:57:31,200 At its heart, Tesla is a hardcore technology company.

1033 1:57:31,200 --> 1:57:43,200 All across the company, people are working hard in science and engineering to advance the fundamental understanding and methods that we have available to build cars,

1034 1:57:43,200 --> 1:57:50,200 energy solutions, robots and anything else that we can do to improve the human condition around the world.

1035 1:57:50,200 --> 1:57:53,200 It's a super exciting thing to be a part of.

1036 1:57:53,200 --> 1:57:57,200 And it's a privilege to run a very small piece of it in the semiconductor group.

1037 1:57:57,200 --> 1:58:04,200 Tonight we're going to talk a little bit about Dojo and give you an update on what we've been able to do over the last year.

1038 1:58:04,200 --> 1:58:10,200 But before we do that, I wanted to give a little bit of background on the initial design that we started a few years ago.

1039 1:58:10,200 --> 1:58:17,200 When we got started, the goal was to provide a substantial improvement to the training latency for our autopilot team.

1040 1:58:17,200 --> 1:58:27,200 Some of the largest neural networks they train today run for over a month, which inhibits their ability to rapidly explore alternatives and evaluate them.

1041 1:58:27,200 --> 1:58:35,200 So, you know, a 30X speed up would be really nice if we could provide it at a cost competitive and energy competitive way.

1042 1:58:35,200 --> 1:58:43,200 To do that, we wanted to build a chip with a lot of arithmetic units that we could utilize at a very high efficiency.

1043 1:58:43,200 --> 1:58:51,200 And we spent a lot of time studying whether we could do that using DRAM, various packaging ideas, all of which failed.

1044 1:58:51,200 --> 1:59:02,200 And in the end, even though it felt like an unnatural act, we decided to reject DRAM as the primary storage medium for this system and instead focus on SRAM embedded in the chip.

1045 1:59:02,200 --> 1:59:09,200 SRAM provides, unfortunately, a modest amount of capacity, but extremely high bandwidth and very low latency.

1046 1:59:09,200 --> 1:59:13,200 And that enables us to achieve high utilization with the arithmetic units.
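The utilization argument here is essentially a roofline model: attainable throughput is capped by the lesser of the compute roof and memory bandwidth times the kernel's arithmetic intensity. A minimal sketch, with purely illustrative numbers (none of these are Dojo's actual figures):

```python
def roofline_flops(peak_flops, mem_bandwidth_bytes, arithmetic_intensity):
    """Attainable throughput = min(compute roof, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth_bytes * arithmetic_intensity)

# The same hypothetical 10 TFLOP/s engine, fed from slow vs. fast memory,
# on a kernel doing 20 FLOPs per byte moved:
slow = roofline_flops(10e12, 0.1e12, 20)   # bandwidth-bound: 2 TFLOP/s (20% util)
fast = roofline_flops(10e12, 1.0e12, 20)   # compute-bound: full 10 TFLOP/s
```

SRAM's bandwidth advantage over DRAM moves real workloads from the first regime toward the second, which is the "high utilization" being claimed here.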

1047 1:59:13,200 --> 1:59:20,200 Those choices, that particular choice led to a whole bunch of other choices.

1048 1:59:20,200 --> 1:59:24,200 For example, if you want to have virtual memory, you need page tables. They take up a lot of space.

1049 1:59:24,200 --> 1:59:28,200 We didn't have space, so no virtual memory.

1050 1:59:28,200 --> 1:59:30,200 We also don't have interrupts.

1051 1:59:30,200 --> 1:59:41,200 The accelerator is a bare-bones raw piece of hardware that's presented to a compiler, and the compiler is responsible for scheduling everything that happens in a deterministic way.

1052 1:59:41,200 --> 1:59:45,200 So there's no need or even desire for interrupts in the system.

1053 1:59:45,200 --> 1:59:53,200 We also chose to pursue model parallelism as a training methodology, which is not the typical situation.

1054 1:59:53,200 --> 2:00:01,200 Most machines today use data parallelism, which consumes additional memory capacity, which we obviously don't have.

1055 2:00:01,200 --> 2:00:10,200 So all of those choices led us to build a machine that is pretty radically different from what's available today.
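The memory argument for model parallelism can be made concrete with toy accounting: data parallelism keeps a full replica of the parameters on every accelerator, while model parallelism shards them. A hedged sketch with hypothetical sizes, not Dojo's real configuration:

```python
def per_device_param_memory(total_params, bytes_per_param, n_devices, scheme):
    """Toy memory accounting: 'data' parallelism replicates all parameters
    on each device; 'model' parallelism stores one shard per device."""
    if scheme == "data":
        return total_params * bytes_per_param              # full replica
    if scheme == "model":
        return total_params * bytes_per_param / n_devices  # one shard
    raise ValueError(f"unknown scheme: {scheme}")

# A hypothetical 1B-parameter model in 2-byte precision across 25 dies:
data_bytes  = per_device_param_memory(1e9, 2, 25, "data")   # 2 GB per die
model_bytes = per_device_param_memory(1e9, 2, 25, "model")  # 80 MB per die
```

With only a modest SRAM capacity per die, the 25x reduction on the model-parallel side is what makes the SRAM-first design workable.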

1056 2:00:10,200 --> 2:00:14,200 We also had a whole bunch of other goals. One of the most important ones was no limits.

1057 2:00:14,200 --> 2:00:20,200 So we wanted to build a compute fabric that would scale in an unbounded way, for the most part.

1058 2:00:20,200 --> 2:00:23,200 I mean, obviously, there's physical limits now and then.

1059 2:00:23,200 --> 2:00:29,200 But pretty much, if your model was too big for the computer, you just had to go buy a bigger computer.

1060 2:00:29,200 --> 2:00:31,200 That's what we were looking for.

1061 2:00:31,200 --> 2:00:41,200 Today, the way machines are packaged, there's a pretty fixed ratio of, for example, GPUs, CPUs, and DRAM capacity and network capacity.

1062 2:00:41,200 --> 2:00:54,200 We really wanted to disaggregate all that so that as models evolved, we could vary the ratios of those various elements and make the system more flexible to meet the needs of the autopilot team.

1063 2:00:54,200 --> 2:01:01,200 Yeah, and it's so true, Pete, like no limits philosophy was our guiding star all the way.

1064 2:01:01,200 --> 2:01:15,200 All of our choices were centered around that, and to the point that we didn't want traditional data center infrastructure to limit our capacity to execute these programs at speed.

1065 2:01:15,200 --> 2:01:31,200 That's why we vertically integrated the entire data center.

1066 2:01:31,200 --> 2:01:34,200 We could extract new levels of efficiency.

1067 2:01:34,200 --> 2:01:49,200 We could optimize power delivery, cooling, and as well as system management across the whole data center stack rather than doing box by box and integrating those boxes into data centers.

1068 2:01:49,200 --> 2:02:06,200 And to do this, we also wanted to integrate early to figure out limits of scale for our software workloads, so we integrated Dojo environment into our autopilot software very early, and we learned a lot of lessons.

1069 2:02:06,200 --> 2:02:25,200 And today, Bill Chang will go over our hardware update as well as some of the challenges that we faced along the way, and Rajiv Kurian will give you a glimpse of our compiler technology as well as go over some of our cool results.

1070 2:02:25,200 --> 2:02:31,200 Great.

1071 2:02:31,200 --> 2:02:34,200 Thanks, Pete. Thanks, Ganesh.

1072 2:02:34,200 --> 2:02:48,200 I'll start tonight with a high level vision of our system that will help set the stage for the challenges and the problems we're solving, and then also how software will then leverage this for performance.

1073 2:02:48,200 --> 2:03:08,200 Now, our vision for Dojo is to build a single unified accelerator, a very large one. Software would see a seamless compute plane with globally addressable, very fast memory, and all connected together with uniform high bandwidth and low latency.

1074 2:03:08,200 --> 2:03:23,200 Now, to realize this, we need to use density to achieve performance. Now, we leverage technology to get this density in order to break levels of hierarchy all the way from the chip to the scale out systems.

1075 2:03:23,200 --> 2:03:36,200 Now, silicon technology has done this for decades: chips have followed Moore's law, using density integration to get performance scaling.

1076 2:03:36,200 --> 2:03:53,200 Now, a key step in realizing that vision was our training tile. Not only can we integrate 25 dies at extremely high bandwidth, but we can scale that to any number of additional tiles by just connecting them together.

1077 2:03:53,200 --> 2:04:02,200 Now, last year, we showcased our first functional training tile, and at that time we already had workloads running on it.

1078 2:04:02,200 --> 2:04:10,200 And since then, the team here has been working hard and diligently to deploy this at scale.

1079 2:04:10,200 --> 2:04:19,200 Now, we've made amazing progress and had a lot of milestones along the way, and of course, we've had a lot of unexpected challenges.

1080 2:04:19,200 --> 2:04:27,200 But this is where our fail fast philosophy has allowed us to push our boundaries.

1081 2:04:27,200 --> 2:04:35,200 Now, pushing density for performance presents all new challenges. One area is power delivery.

1082 2:04:35,200 --> 2:04:43,200 Here, we need to deliver the power to our compute die, and this directly impacts our top line compute performance.

1083 2:04:43,200 --> 2:04:54,200 But we need to do this at unprecedented density. We need to be able to match our die pitch with a power density of almost one amp per millimeter squared.

1084 2:04:54,200 --> 2:05:01,200 And because of the extreme integration, this needs to be a multi-tiered vertical power solution.
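To get a feel for the scale of "one amp per millimeter squared": for a die in the several-hundred-mm² class, the vertical power solution has to carry hundreds of amps. A back-of-envelope sketch; the 645 mm² die area (D1's figure from the 2021 presentation) and the 0.8 V core rail are assumptions used only for illustration:

```python
def vertical_power_current(die_area_mm2, amps_per_mm2=1.0):
    """Current the vertical power stack must deliver under the die,
    given the power density quoted in the talk (~1 A/mm^2)."""
    return die_area_mm2 * amps_per_mm2

amps = vertical_power_current(645)   # 645 A for an assumed 645 mm^2 die
watts = amps * 0.8                   # ~516 W at an assumed 0.8 V core rail
```

Delivering that current through the XY plane would consume the routing reserved for inter-die bandwidth, which is why the solution has to come up vertically through the stack.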

1085 2:05:01,200 --> 2:05:12,200 And because there's a complex heterogeneous material stack up, we have to carefully manage the material transition, especially CTE.

1086 2:05:12,200 --> 2:05:16,200 Now, why does the coefficient of thermal expansion matter in this case?

1087 2:05:16,200 --> 2:05:27,200 CTE is a fundamental material property, and if it's not carefully managed, that stack up would literally rip itself apart.

1088 2:05:27,200 --> 2:05:38,200 So we started this effort by working with vendors to develop this power solution, but we realized that we actually had to develop this in-house.

1089 2:05:38,200 --> 2:05:47,200 Now, to balance schedule and risk, we built quick iterations to support both our system bring up and software development,

1090 2:05:47,200 --> 2:05:53,200 and also to find the optimal design and stack up that would meet our final production goals.

1091 2:05:53,200 --> 2:06:03,200 And in the end, we were able to reduce CTE by over 50 percent and improve our performance 3x over our initial version.

1092 2:06:03,200 --> 2:06:14,200 Now, needless to say, finding this optimal material stack up while maximizing performance at density is extremely difficult.

1093 2:06:14,200 --> 2:06:18,200 Now, we did have unexpected challenges along the way.

1094 2:06:18,200 --> 2:06:25,200 Here's an example where we pushed the boundaries of integration that led to component failures.

1095 2:06:25,200 --> 2:06:35,200 This started when we scaled up to larger and longer workloads, and then intermittently a single site on a tile would fail.

1096 2:06:35,200 --> 2:06:45,200 Now, they started out as recoverable failures, but as we pushed to much higher and higher power, these would become permanent failures.

1097 2:06:45,200 --> 2:06:52,200 Now, to understand this failure, you have to understand why and how we build our power modules.

1098 2:06:52,200 --> 2:06:59,200 Solving density at every level is the cornerstone of actually achieving our system performance.

1099 2:06:59,200 --> 2:07:08,200 Now, because our XY plane is used for high bandwidth communication, everything else must be stacked vertically.

1100 2:07:08,200 --> 2:07:14,200 This means all other components other than our die must be integrated into our power modules.

1101 2:07:14,200 --> 2:07:21,200 Now, that includes our clock and our power supplies and also our system controllers.

1102 2:07:21,200 --> 2:07:27,200 Now, in this case, the failures were due to losing clock output from our oscillators.

1103 2:07:27,200 --> 2:07:39,200 And after an extensive debug, we found that the root cause was due to vibrations on the module from piezoelectric effects on nearby capacitors.

1104 2:07:39,200 --> 2:07:45,200 Now, singing caps are not a new phenomenon and, in fact, very common in power design.

1105 2:07:45,200 --> 2:07:52,200 But normally clock chips are placed in a very quiet area of the board and often not affected by power circuits.

1106 2:07:52,200 --> 2:08:00,200 But because we needed to achieve this level of integration, these oscillators need to be placed in very close proximity.

1107 2:08:00,200 --> 2:08:13,200 Now, due to our switching frequency and the vibration resonance it created, there was out-of-plane vibration on our MEMS oscillator that caused it to crack.

1108 2:08:13,200 --> 2:08:16,200 Now, the solution to this problem is a multiprong approach.

1109 2:08:16,200 --> 2:08:22,200 We can reduce the vibration by using soft terminal caps.

1110 2:08:22,200 --> 2:08:30,200 We can update our MEMS part with a lower Q factor for the out of plane direction.

1111 2:08:30,200 --> 2:08:40,200 And we can also update our switching frequency to push the resonance further away from these sensitive bands.

1112 2:08:40,200 --> 2:08:49,200 Now, in addition to the density at the system level, we've been making a lot of progress at the infrastructure level.

1113 2:08:49,200 --> 2:08:59,200 We knew that we had to re-examine every aspect of the data center infrastructure in order to support our unprecedented power and cooling density.

1114 2:08:59,200 --> 2:09:06,200 We brought in a fully custom designed CDU to support DOJO's dense cooling requirements.

1115 2:09:06,200 --> 2:09:13,200 And the amazing part is we're able to do this at a fraction of the cost versus buying off the shelf and modifying it.

1116 2:09:13,200 --> 2:09:21,200 And since our DOJO cabinet integrates enough power and cooling to match an entire row of standard IT racks,

1117 2:09:21,200 --> 2:09:26,200 we need to carefully design our cabinet and infrastructure together.

1118 2:09:26,200 --> 2:09:31,200 And we've already gone through several iterations of this cabinet to optimize this.

1119 2:09:31,200 --> 2:09:36,200 And earlier this year, we started load testing our power and cooling infrastructure.

1120 2:09:36,200 --> 2:09:46,200 And we were able to push it over two megawatts before we tripped our substation and got a call from the city.

1121 2:09:46,200 --> 2:09:53,200 Now, last year, we introduced only a couple of components of our system, the custom D1 die and the training tile.

1122 2:09:53,200 --> 2:09:57,200 But we teased the ExaPOD as our end goal.

1123 2:09:57,200 --> 2:10:04,200 We'll walk through the remaining parts of our system that are required to build out this ExaPOD.

1124 2:10:04,200 --> 2:10:09,200 Now, the system tray is a key part of realizing our vision of a single accelerator.

1125 2:10:09,200 --> 2:10:17,200 It enables us to seamlessly connect tiles together, not only within the cabinet, but between cabinets.

1126 2:10:17,200 --> 2:10:23,200 We can connect these tiles at very tight spacing across the entire accelerator.

1127 2:10:23,200 --> 2:10:27,200 And this is how we achieve our uniform communication.

1128 2:10:27,200 --> 2:10:36,200 This is a laminate bus bar that allows us to integrate very high power, mechanical and thermal support, and an extremely dense integration.

1129 2:10:36,200 --> 2:10:43,200 It's 75 millimeters in height and supports six tiles at 135 kilograms.

1130 2:10:43,200 --> 2:10:52,200 This is the equivalent of three to four fully loaded high performance racks.

1131 2:10:52,200 --> 2:10:55,200 Next, we need to feed data to the training tiles.

1132 2:10:55,200 --> 2:10:59,200 This is where we've developed the Dojo interface processor.

1133 2:10:59,200 --> 2:11:05,200 It provides our system with high bandwidth DRAM to stage our training data.

1134 2:11:05,200 --> 2:11:15,200 And it provides full memory bandwidth to our training tiles using TTP, our custom protocol that we use to communicate across our entire accelerator.

1135 2:11:15,200 --> 2:11:21,200 It also has high speed Ethernet that helps us extend this custom protocol over standard Ethernet.

1136 2:11:21,200 --> 2:11:27,200 And we provide native hardware support for this with little to no software overhead.

1137 2:11:27,200 --> 2:11:36,200 And lastly, we can connect to it through a standard Gen4 PCIe interface.

1138 2:11:36,200 --> 2:11:43,200 Now, we pair 20 of these cards per tray, and that gives us 640 gigabytes of high bandwidth DRAM.

1139 2:11:43,200 --> 2:11:48,200 And this provides our disaggregated memory layer for our training tiles.

1140 2:11:48,200 --> 2:11:54,200 These cards are a high bandwidth ingest path, both through PCIe and Ethernet.

1141 2:11:54,200 --> 2:12:04,200 They also provide a high-rate XZ connectivity path that allows shortcuts across our large Dojo accelerator.

1142 2:12:04,200 --> 2:12:09,200 Now, we actually integrate the host directly underneath our system tray.

1143 2:12:09,200 --> 2:12:16,200 These hosts provide our ingest processing and connect to our interface processors through PCIe.

1144 2:12:16,200 --> 2:12:23,200 These hosts can provide hardware video decoder support for video-based training.

1145 2:12:23,200 --> 2:12:35,200 And our user applications land on these hosts, so we can provide them with the standard x86 Linux environment.

1146 2:12:35,200 --> 2:12:52,200 Now, we can put two of these assemblies into one cabinet and pair it with redundant power supplies that do direct conversion of 3-phase 480 volt AC power to 52 volt DC power.

1147 2:12:52,200 --> 2:13:00,200 Now, by focusing on density at every level, we can realize the vision of a single accelerator.

1148 2:13:00,200 --> 2:13:09,200 Now, starting with the uniform nodes on our custom D1 die, we can connect them together in our fully integrated training tile,

1149 2:13:09,200 --> 2:13:17,200 and then finally seamlessly connecting them across cabinet boundaries to form our Dojo accelerator.

1150 2:13:17,200 --> 2:13:26,200 And altogether, we can house two full accelerators in our ExaPOD for a combined one exaflop of ML compute.
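The figures given across the talk can be sanity-checked: 6 tiles per tray, a tray later described as 54 petaflops, and two tray assemblies per cabinet. The cabinet count per ExaPOD is an inference from those numbers, not something stated explicitly:

```python
# Arithmetic check on the numbers quoted in the presentation.
pflops_per_tile = 54 / 6              # 9 PFLOPS per training tile
pflops_per_cabinet = 54 * 2           # two trays per cabinet -> 108 PFLOPS
cabinets_for_exaflop = 1000 / pflops_per_cabinet   # ~9.3 -> roughly 10 cabinets
```

So "one exaflop" is consistent with an ExaPOD on the order of ten cabinets, with each cabinet matching the power and cooling of an entire row of standard IT racks as described earlier.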

1151 2:13:26,200 --> 2:13:35,200 Now, altogether, this amount of technology and integration has only ever been done a couple of times in the history of compute.

1152 2:13:35,200 --> 2:13:48,200 Next, we'll see how software can leverage this to accelerate their performance.

1153 2:13:48,200 --> 2:13:53,200 Thanks, Bill. My name is Rajiv, and I'm going to talk some numbers.

1154 2:13:53,200 --> 2:14:01,200 The software stack begins with the PyTorch extension that speaks to our commitment to run standard PyTorch models out of the box.

1155 2:14:01,200 --> 2:14:08,200 We're going to talk more about our JIT compiler and the ingest pipeline that feeds the hardware with data.

1156 2:14:08,200 --> 2:14:14,200 Abstractly, performance is tops times utilization times accelerator occupancy.

1157 2:14:14,200 --> 2:14:22,200 We've seen how the hardware provides peak performance; it's the job of the compiler to extract utilization from the hardware while code is running on it.

1158 2:14:22,200 --> 2:14:30,200 It's the job of the ingest pipeline to make sure that data can be fed at a throughput high enough for the hardware to never starve.
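The factorization "performance = TOPS × utilization × occupancy" is simple enough to write down directly. A minimal sketch with illustrative numbers (a hypothetical 100-TOPS accelerator; the 4% and 97% occupancy figures are the ones quoted later for the data-loader fix):

```python
def effective_tops(peak_tops, utilization, occupancy):
    """Delivered throughput as the product of the three factors:
    peak hardware TOPS, compiler-extracted utilization, and the fraction
    of time the ingest pipeline actually keeps the accelerator busy."""
    return peak_tops * utilization * occupancy

before = effective_tops(100, 0.5, 0.04)   # starved: ~2 effective TOPS
after  = effective_tops(100, 0.5, 0.97)   # fed:   ~48.5 effective TOPS
```

The point of the factorization is that any one factor near zero wrecks delivered performance, which is why ingest gets its own section below.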

1159 2:14:30,200 --> 2:14:34,200 So let's talk about why communication-bound models are difficult to scale.

1160 2:14:34,200 --> 2:14:39,200 But before that, let's look at why ResNet-50-like models are easier to scale.

1161 2:14:39,200 --> 2:14:44,200 You start off with a single accelerator, run the forward and backward passes, followed by the optimizer.

1162 2:14:44,200 --> 2:14:49,200 Then to scale this up, you run multiple copies of this on multiple accelerators.

1163 2:14:49,200 --> 2:14:54,200 The gradients produced by the backward pass do need to be reduced, and this introduces some communication.

1164 2:14:54,200 --> 2:15:00,200 This can be pipelined with the backward pass.

1165 2:15:00,200 --> 2:15:05,200 This setup scales fairly well, almost linearly.
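The data-parallel recipe just described (independent forward/backward passes, then a gradient all-reduce before the optimizer step) can be sketched with a scalar "model" per replica. This is a toy, framework-free illustration; real systems pipeline the reduction with the backward pass as noted above:

```python
def data_parallel_step(replicas, batches, grad_fn, lr=0.1):
    """Each replica computes gradients on its own batch; the gradients are
    all-reduced (here: averaged), and every replica applies the identical
    update, so the copies stay in sync."""
    grads = [grad_fn(w, b) for w, b in zip(replicas, batches)]
    avg = sum(grads) / len(grads)            # the all-reduce
    return [w - lr * avg for w in replicas]  # same update on every replica

# Toy quadratic loss (w - b)^2 per replica, so grad = 2 * (w - b):
new = data_parallel_step([1.0, 1.0], [0.0, 2.0], lambda w, b: 2 * (w - b))
```

Because the only cross-replica traffic is the gradient average, throughput scales almost linearly with replica count, which is exactly why ResNet-50-class models are the easy case.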

1166 2:15:05,200 --> 2:15:11,200 For models with much larger activations, we run into a problem as soon as we want to run the forward pass.

1167 2:15:11,200 --> 2:15:16,200 The batch size that fits in a single accelerator is often smaller than the batch norm surface.

1168 2:15:16,200 --> 2:15:22,200 To get around this, researchers typically run this setup on multiple accelerators in sync batch norm mode.

1169 2:15:22,200 --> 2:15:30,200 This introduces latency-bound communication to the critical path of the forward pass, and we already have a communication bottleneck.

1170 2:15:30,200 --> 2:15:36,200 And while there are ways to get around this, they usually involve tedious manual work best suited for a compiler.

1171 2:15:36,200 --> 2:15:45,200 And ultimately, there's no skirting around the fact that if your state does not fit in a single accelerator, you can be communication-bound.

1172 2:15:45,200 --> 2:15:52,200 Even with significant efforts from our ML engineers, we see such models don't scale linearly.

1173 2:15:52,200 --> 2:15:57,200 The Dojo system was built to make such models work at high utilization.

1174 2:15:57,200 --> 2:16:02,200 The high density integration was built to not only accelerate the compute-bound portions of a model,

1175 2:16:02,200 --> 2:16:13,200 but also the latency-bound portions like a batch norm or the bandwidth-bound portions like a gradient all reduced or a parameter all gathered.

1176 2:16:13,200 --> 2:16:18,200 A slice of the Dojo mesh can be carved out to run any model.

1177 2:16:18,200 --> 2:16:25,200 The only thing users need to do is to make the slice large enough to fit a batch norm surface for their particular model.

1178 2:16:25,200 --> 2:16:30,200 After that, the partition presents itself as one large accelerator,

1179 2:16:30,200 --> 2:16:39,200 freeing the users from having to worry about the internal details of execution; it's the job of the compiler to maintain this abstraction.

1180 2:16:39,200 --> 2:16:47,200 Fine-grained synchronization primitives and uniform low latency make it easy to accelerate all forms of parallelism across integration boundaries.

1181 2:16:47,200 --> 2:16:53,200 Tensors are usually stored sharded in SRAM and replicated just in time for a layer's execution.

1182 2:16:53,200 --> 2:16:57,200 We depend on the high Dojo bandwidth to hide this replication time.

1183 2:16:57,200 --> 2:17:07,200 Tensor replication and other data transfers are overlapped with compute, and the compiler can also recompute layers when it's profitable to do so.
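The shard-then-replicate pattern can be shown with plain lists standing in for tensors. A hedged sketch of the idea only; on the real system the gather runs over the mesh fabric and is overlapped with compute, which this sequential toy cannot show:

```python
def shard(tensor, n_dies):
    """Store a flat tensor sharded: each die's SRAM holds one slice."""
    k = len(tensor) // n_dies
    return [tensor[i * k:(i + 1) * k] for i in range(n_dies)]

def replicate_just_in_time(shards):
    """Gather the full tensor right before the layer that needs it runs;
    high mesh bandwidth is what hides this replication time."""
    return [x for s in shards for x in s]

weights = list(range(100))
shards = shard(weights, 25)            # 4 elements per die across 25 dies
full = replicate_just_in_time(shards)  # transient full copy for the layer
```

Sharding keeps the steady-state SRAM footprint at 1/25th of the tensor, while the just-in-time gather gives each layer the full view it needs.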

1184 2:17:07,200 --> 2:17:10,200 We expect most models to work out of the box.

1185 2:17:10,200 --> 2:17:16,200 As an example, we took the recently released Stable Diffusion model and got it running on Dojo in minutes.

1186 2:17:16,200 --> 2:17:22,200 Out of the box, the compiler was able to map it in a model parallel manner on 25 Dojo dies.

1187 2:17:22,200 --> 2:17:29,200 Here are some pictures of a Cybertruck on Mars generated by Stable Diffusion running on Dojo.

1188 2:17:29,200 --> 2:17:42,200 Looks like it still has some ways to go before matching the Tesla Design Studio team.

1189 2:17:42,200 --> 2:17:46,200 So we've talked about how communication bottlenecks can hamper scalability.

1190 2:17:46,200 --> 2:17:52,200 Perhaps an acid test of a compiler and the underlying hardware is executing a cross-die batch norm layer.

1191 2:17:52,200 --> 2:17:55,200 Like mentioned before, this can be a serial bottleneck.

1192 2:17:55,200 --> 2:18:01,200 The communication phase of a batch norm begins with nodes computing their local mean and standard deviations,

1193 2:18:01,200 --> 2:18:09,200 then coordinating to reduce these values, then broadcasting these values back, and then they resume their work in parallel.
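The reduce-then-broadcast sequence just described amounts to combining per-node partial sums into global statistics. A minimal sketch of the math, using lists of lists in place of per-node activations (the real coordination across 8,750 nodes is of course done by the hardware, not a loop):

```python
def sync_batchnorm_stats(local_batches):
    """Each node computes local sums and sums of squares; an all-reduce
    combines them; the global mean/std is then broadcast back so every
    node can resume normalizing in parallel."""
    n = sum(len(b) for b in local_batches)                       # reduce: counts
    s = sum(sum(b) for b in local_batches)                       # reduce: sums
    ss = sum(sum(x * x for x in b) for b in local_batches)       # reduce: sq sums
    mean = s / n
    var = ss / n - mean * mean
    return mean, var ** 0.5                                      # broadcast

mean, std = sync_batchnorm_stats([[1.0, 2.0], [3.0, 4.0]])       # mean = 2.5
```

Because this reduce/broadcast sits on the forward pass's critical path, it is latency that matters here, which is what the 5-microsecond result below is measuring.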

1194 2:18:09,200 --> 2:18:13,200 So what would an ideal batch norm look like on 25 Dojo dies?

1195 2:18:13,200 --> 2:18:19,200 Let's say the previous layer's activations are already split across dies.

1196 2:18:19,200 --> 2:18:26,200 We would expect the 350 nodes on each die to coordinate and produce die local mean and standard deviation values.

1197 2:18:26,200 --> 2:18:33,200 Ideally, these would get further reduced with the final value ending somewhere towards the middle of the tile.

1198 2:18:33,200 --> 2:18:38,200 We would then hope to see a broadcast of this value radiating from the center.

1199 2:18:38,200 --> 2:18:43,200 Let's see how the compiler actually executes a real batch norm operation across 25 dies.

1200 2:18:43,200 --> 2:18:49,200 The communication trees were extracted from the compiler, and the timing is from a real hardware run.

1201 2:18:49,200 --> 2:18:59,200 We're about to see 8,750 nodes on 25 dies coordinating to reduce and then broadcast the batch norm mean and standard deviation values.

1202 2:18:59,200 --> 2:19:05,200 Die local reduction followed by global reduction towards the middle of the tile,

1203 2:19:05,200 --> 2:19:14,200 then the reduced value broadcast radiating from the middle accelerated by the hardware's broadcast facility.

1204 2:19:14,200 --> 2:19:19,200 This operation takes only 5 microseconds on 25 Dojo dies.

1205 2:19:19,200 --> 2:19:24,200 The same operation takes 150 microseconds on 24 GPUs.

1206 2:19:24,200 --> 2:19:28,200 This is an order of magnitude improvement over GPUs.

1207 2:19:28,200 --> 2:19:32,200 And while we talked about an all-reduce operation in the context of a batch norm,

1208 2:19:32,200 --> 2:19:38,200 it's important to reiterate that the same advantages apply to all other communication primitives,

1209 2:19:38,200 --> 2:19:42,200 and these primitives are essential for large-scale training.

1210 2:19:42,200 --> 2:19:45,200 So how about full model performance?

1211 2:19:45,200 --> 2:19:50,200 So while we think that ResNet-50 is not a good representation of real-world Tesla workloads,

1212 2:19:50,200 --> 2:19:53,200 it is a standard benchmark, so let's start there.

1213 2:19:53,200 --> 2:19:57,200 We are already able to match the A100 die for die.

1214 2:19:57,200 --> 2:20:04,200 However, perhaps a hint of Dojo's capabilities is that we're able to hit this number with just a batch of 8 per die.

1215 2:20:04,200 --> 2:20:08,200 But Dojo was really built to tackle larger complex models.

1216 2:20:08,200 --> 2:20:14,200 So when we set out to tackle real-world workloads, we looked at the usage patterns of our current GPU cluster,

1217 2:20:14,200 --> 2:20:17,200 and two models stood out, the autolabeling networks,

1218 2:20:17,200 --> 2:20:20,200 a class of offline models that are used to generate ground truth,

1219 2:20:20,200 --> 2:20:23,200 and the occupancy networks that you heard about.

1220 2:20:23,200 --> 2:20:28,200 The autolabeling networks are large models that have high arithmetic intensity,

1221 2:20:28,200 --> 2:20:31,200 while the occupancy networks can be ingest-bound.

1222 2:20:31,200 --> 2:20:36,200 We chose these models because together they account for a large chunk of our current GPU cluster usage,

1223 2:20:36,200 --> 2:20:41,200 and they would challenge the system in different ways.

1224 2:20:41,200 --> 2:20:44,200 So how do we do on these two networks?

1225 2:20:44,200 --> 2:20:49,200 The results we're about to see were measured on multi-die systems for both the GPU and Dojo,

1226 2:20:49,200 --> 2:20:52,200 but normalized to per die numbers.

1227 2:20:52,200 --> 2:20:57,200 On our autolabeling network, we're already able to surpass the performance of an A100

1228 2:20:57,200 --> 2:21:01,200 with our current hardware running on our older generation VRMs.

1229 2:21:01,200 --> 2:21:07,200 On our production hardware with our newer VRMs, that translates to doubling the throughput of an A100.

1230 2:21:07,200 --> 2:21:10,200 And our models show that with some key compiler optimizations,

1231 2:21:10,200 --> 2:21:15,200 we could get to more than 3x the performance of an A100.

1232 2:21:15,200 --> 2:21:19,200 We see even bigger leaps on the occupancy network.

1233 2:21:19,200 --> 2:21:24,200 Almost 3x with our production hardware with room for more.

1234 2:21:24,200 --> 2:21:34,200 So what does that mean for Tesla?

1235 2:21:34,200 --> 2:21:37,200 With our current level of compiler performance,

1236 2:21:37,200 --> 2:21:47,200 we could replace the ML compute of 1, 2, 3, 4, 5, 6 GPU boxes with just a single Dojo tile.

1237 2:21:47,200 --> 2:21:58,200 And this Dojo tile costs less than one of these GPU boxes.

1238 2:21:58,200 --> 2:22:09,200 What it really means is that networks that took more than a month to train now take less than a week.

1239 2:22:09,200 --> 2:22:16,200 Alas, when we measure things, it did not turn out so well.

1240 2:22:16,200 --> 2:22:20,200 At the PyTorch level, we did not see our expected performance out of the gate.

1241 2:22:20,200 --> 2:22:23,200 And this timeline chart shows our problem.

1242 2:22:23,200 --> 2:22:28,200 The teeny tiny little green bars, that's the compile code running on the accelerator.

1243 2:22:28,200 --> 2:22:35,200 The row is mostly white space where the hardware is just waiting for data.

1244 2:22:35,200 --> 2:22:41,200 With our dense ML compute, Dojo hosts effectively have 10x more ML compute than the GPU hosts.

1245 2:22:41,200 --> 2:22:48,200 The data loaders running on this one host simply can't keep up with all that ML hardware.

1246 2:22:48,200 --> 2:22:54,200 So to solve our data loader scalability issues, we knew we had to get over the limit of this single host.

1247 2:22:54,200 --> 2:23:00,200 The Tesla transport protocol moves data seamlessly across host, tiles, and ingest processors.

1248 2:23:00,200 --> 2:23:04,200 So we extended the Tesla transport protocol to work over Ethernet.

1249 2:23:04,200 --> 2:23:09,200 We then built the Dojo network interface card, the DNIC, to leverage TTP over Ethernet.

1250 2:23:09,200 --> 2:23:16,200 This allows any host with a DNIC card to DMA to and from other TTP endpoints.

1251 2:23:16,200 --> 2:23:19,200 So we started with the Dojo mesh.

1252 2:23:19,200 --> 2:23:25,200 Then we added a tier of data loading hosts equipped with the DNIC card.

1253 2:23:25,200 --> 2:23:29,200 We connected these hosts to the mesh via an Ethernet switch.

1254 2:23:29,200 --> 2:23:38,200 Now every host in this data loading tier is capable of reaching all TTP endpoints in the Dojo mesh via hardware-accelerated DMA.

1255 2:23:38,200 --> 2:23:45,200 After these optimizations went in, our occupancy went from 4% to 97%.

1256 2:23:45,200 --> 2:23:57,200 So the data loading stalls have reduced drastically, and the ML hardware is kept busy.

1257 2:23:57,200 --> 2:24:01,200 We actually expect this number to go to 100% pretty soon.

1258 2:24:01,200 --> 2:24:09,200 After these changes went in, we saw the full expected speed up from the PyTorch layer, and we were back in business.
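The occupancy jump can be captured in a toy supply-and-demand model: the accelerator is busy only as often as the ingest tier can supply samples. The rates below are illustrative assumptions, chosen only to reproduce the 4% figure quoted above; they are not measured Dojo numbers:

```python
def accelerator_occupancy(loader_hosts, samples_per_host_per_sec, consume_rate):
    """Fraction of time the accelerator has data, capped at 100%."""
    supply = loader_hosts * samples_per_host_per_sec
    return min(1.0, supply / consume_rate)

# One host feeding compute that consumes 25x what a single host delivers,
# vs. a DNIC-connected tier of 25 data-loading hosts:
single = accelerator_occupancy(1, 1000, 25000)    # 0.04 -> the 4% above
tier   = accelerator_occupancy(25, 1000, 25000)   # 1.0  -> fully fed
```

The model also shows why the fix had to be horizontal scaling over Ethernet rather than a faster single host: occupancy is linear in the number of loader hosts until the cap.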

1259 2:24:09,200 --> 2:24:17,200 So we started with hardware design that breaks through traditional integration boundaries in service of our vision of a single giant accelerator.

1260 2:24:17,200 --> 2:24:21,200 We've seen how the compiler and ingest layers build on top of that hardware.

1261 2:24:21,200 --> 2:24:28,200 So after proving our performance on these complex real-world networks, we knew what our first large-scale deployment would target.

1262 2:24:28,200 --> 2:24:32,200 Our high arithmetic intensity auto labeling networks.

1263 2:24:32,200 --> 2:24:37,200 Today that occupies 4,000 GPUs over 72 GPU racks.

1264 2:24:37,200 --> 2:24:53,200 With our dense compute and our high performance, we expect to provide the same throughput with just four Dojo cabinets.

1265 2:24:53,200 --> 2:25:00,200 And these four Dojo cabinets will be part of our first ExaPOD that we plan to build by quarter one of 2023.

1266 2:25:00,200 --> 2:25:10,200 This will more than double Tesla's auto labeling capacity.

1267 2:25:10,200 --> 2:25:20,200 The first ExaPOD is part of a total of seven ExaPODs that we plan to build in Palo Alto right here across the wall.

1268 2:25:20,200 --> 2:25:27,200 And we have a display cabinet from one of these ExaPODs for everyone to look at.

1269 2:25:27,200 --> 2:25:39,200 Six tiles densely packed on a tray, 54 petaflops of compute, 640 gigabytes of high bandwidth memory, with power and hosts beneath it.

1270 2:25:39,200 --> 2:25:44,200 A lot of compute.

1271 2:25:44,200 --> 2:25:51,200 And we're building out new versions of all our cluster components and constantly improving our software to hit new limits of scale.

1272 2:25:51,200 --> 2:25:58,200 We believe that we can get another 10x improvement with our next generation hardware.

1273 2:25:58,200 --> 2:26:02,200 And to realize our ambitious goals, we need the best software and hardware engineers.

1274 2:26:02,200 --> 2:26:05,200 So please come talk to us or visit tesla.com.

1275 2:26:05,200 --> 2:26:27,200 Thank you.

1276 2:26:27,200 --> 2:26:35,200 All right. So hopefully that was enough detail.

1277 2:26:35,200 --> 2:26:38,200 And now we can move to questions.

1278 2:26:38,200 --> 2:26:46,200 And guys, I think the team can come out on stage.

1279 2:26:46,200 --> 2:26:58,200 We really wanted to show the depth and breadth of Tesla in artificial intelligence, compute hardware, robotics actuators,

1280 2:26:58,200 --> 2:27:09,200 and try to really shift the perception of the company away from, you know, a lot of people think we're like just a car company or we make cool cars, whatever.

1281 2:27:09,200 --> 2:27:19,200 Most people have no idea that Tesla is arguably the leader in real-world AI hardware and software.

1282 2:27:19,200 --> 2:27:32,200 And that we're building what is arguably the first, the most radical computer architecture since the Cray-1 supercomputer.

1283 2:27:32,200 --> 2:27:43,200 And I think if you're interested in developing some of the most advanced technology in the world that's going to really affect the world in a positive way, Tesla is the place to be.

1284 2:27:43,200 --> 2:27:48,200 So, yeah, let's fire away with some questions.

1285 2:27:48,200 --> 2:27:55,200 I think there's a mic at the front and a mic at the back.

1286 2:27:55,200 --> 2:28:03,200 Just throw mics at people. Jump off on the mic.

1287 2:28:03,200 --> 2:28:08,200 Hi, thank you very much. I was impressed here.

1288 2:28:08,200 --> 2:28:15,200 Yeah, I was impressed very much by Optimus, but I wonder about the hand.

1289 2:28:15,200 --> 2:28:18,200 Why did you choose a tendon-driven approach for the hand?

1290 2:28:18,200 --> 2:28:26,200 Because tendons are not very durable. And why spring-loaded?

1291 2:28:26,200 --> 2:28:29,200 Hello, is this working? Cool. Awesome. Yes, that's a great question.

1292 2:28:29,200 --> 2:28:38,200 You know, when it comes to any type of actuation scheme, there's tradeoffs between, you know, whether or not it's a tendon driven system or some type of linkage based system.

1293 2:28:38,200 --> 2:28:39,200 Keep the mic close to your mouth.

1294 2:28:39,200 --> 2:28:40,200 A little closer.

1295 2:28:40,200 --> 2:28:41,200 Yeah.

1296 2:28:41,200 --> 2:28:43,200 Hear me? Cool.

1297 2:28:43,200 --> 2:28:55,200 So, yeah, the main reason why we went for a tendon-based system is that, you know, first we actually investigated some synthetic tendons, but we found that metallic Bowden cables are, you know, a lot stronger.

1298 2:28:55,200 --> 2:29:01,200 One of the advantages of these cables is that it's very good for part reduction.

1299 2:29:01,200 --> 2:29:09,200 We do want to make a lot of these hands. So having a bunch of parts, a bunch of small linkages ends up being, you know, a problem when you're making a lot of something.

1300 2:29:09,200 --> 2:29:17,200 One of the big reasons that, you know, tendons are better than linkages, in a sense, is that they can be anti-backlash.

1301 2:29:17,200 --> 2:29:25,200 Anti-backlash essentially, you know, allows you to not have any gaps or, you know, stuttering motion in your fingers.

1302 2:29:25,200 --> 2:29:32,200 Spring-loaded: mainly, what spring loading allows us to do is avoid needing active opening.

1303 2:29:32,200 --> 2:29:43,200 So instead of having to have two actuators to drive the fingers closed and then open, we have the ability to, you know, have the tendon drive them closed and then the springs passively extend.

1304 2:29:43,200 --> 2:29:50,200 And this is something that's seen in our hands as well, right? We have the ability to actively flex and then we also have the ability to extend.
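The single-actuator, spring-return scheme described here can be sketched as a static torque balance: the tendon closes the finger when its torque exceeds the spring's, and releasing tension lets the spring extend it. Everything below (function name, constants) is illustrative, not Tesla's actual design:

```python
# Minimal sketch of a single-actuator, spring-return finger joint.
def finger_angle(tendon_tension, moment_arm=0.01, spring_rate=0.5,
                 preload=0.05, max_angle=1.6):
    """Static equilibrium: tendon torque vs. torsion-spring return torque.

    tendon_tension : N       (pulling closes the finger)
    moment_arm     : m       (tendon routing radius at the joint)
    spring_rate    : N*m/rad (passive extension spring)
    preload        : N*m     (spring preload holds the finger open at rest)
    """
    closing_torque = tendon_tension * moment_arm
    # Angle where closing torque balances spring torque; clamp to joint limits.
    angle = (closing_torque - preload) / spring_rate
    return max(0.0, min(max_angle, angle))

print(finger_angle(0.0))    # no tension: the spring holds the finger open (0.0)
print(finger_angle(100.0))  # full tension: the finger hits its flexion limit (1.6)
```

The design trade is visible even in this toy: one actuator per finger, with the spring providing the return stroke for free.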

1305 2:29:50,200 --> 2:29:52,200 Yeah.

1306 2:29:52,200 --> 2:30:04,200 I mean, our goal with Optimus is to have a robot that is maximally useful as quickly as possible. So there's a lot of ways to solve the various problems of a humanoid robot.

1307 2:30:04,200 --> 2:30:15,200 And we're probably not barking up the right tree on all the technical solutions. And I should say that we're open to evolving the technical solutions that you see here over time.

1308 2:30:15,200 --> 2:30:29,200 They're not locked in stone. But we have to pick something, and we want to pick something that's going to allow us to produce the robot as quickly as possible and have it, like I said, be useful as quickly as possible.

1309 2:30:29,200 --> 2:30:36,200 We're trying to follow the goal of fastest path to a useful robot that can be made at volume.

1310 2:30:36,200 --> 2:30:52,200 And we're going to test the robot internally at Tesla in our factory and just see, like, how useful is it? Because you have to have a, you've got to close the loop on reality to confirm that the robot is in fact useful.

1311 2:30:52,200 --> 2:31:12,200 And yeah, so we're just going to use it to build things. And we're confident we can do that with the hand that we have currently designed. But for sure there'll be hand version two, version three, and we may change the architecture quite significantly over time.

1312 2:31:12,200 --> 2:31:33,200 Hi. Your Optimus robot is really impressive; you did a great job. Bipedal robots are really difficult. But what I noticed might be missing from your plan is to acknowledge the utility of the human spirit, and I'm wondering if Optimus will ever get a personality

1313 2:31:33,200 --> 2:31:46,200 and be able to laugh at our jokes while it folds our clothes. Yeah, absolutely. I think we want to have really fun versions of Optimus.

1314 2:31:46,200 --> 2:32:05,200 So that Optimus can both do utilitarian tasks and also be kind of like a friend and a buddy and hang out with you. And I'm sure people will think of all sorts of creative uses for this robot.

1315 2:32:05,200 --> 2:32:25,200 And, you know, once you have the core intelligence and actuators figured out, then you can actually put all sorts of costumes, I guess, on the robot.

1316 2:32:25,200 --> 2:32:40,200 You can skin the robot in many different ways. And I'm sure people will find very interesting ways to make different versions of Optimus.

1317 2:32:40,200 --> 2:32:55,200 Thanks for the great presentation. I wanted to know if there is an equivalent to interventions in Optimus. It seems like labeling through moments where humans disagree with what's going on is important, and in a humanoid robot

1318 2:32:55,200 --> 2:33:02,200 that might also be a desirable source of information.

1319 2:33:02,200 --> 2:33:15,200 Yeah, I think we will have ways to remote operate the robot and intervene when it does something bad, especially when we are training the robot and bringing it up.

1320 2:33:15,200 --> 2:33:28,200 And hopefully we, you know, design it in a way that, if it's going to hit something, we can just hold it and it will stop; it won't, you know, crush your hand or something. And those are all intervention data.

1321 2:33:28,200 --> 2:33:35,200 And we can learn a lot from our simulation systems to where we can check for collisions and supervise that those are bad actions.
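Mining simulation rollouts for "bad action" labels, as described, might look something like this sketch; the rollout format and the toy collision model are purely hypothetical:

```python
# Hypothetical sketch of turning sim rollouts into intervention-style labels:
# actions that lead to a collision get marked as negative examples, mirroring
# the human "grab the robot and stop it" interventions described above.
def label_rollout(states, actions, collides):
    """collides(state, action) -> bool; returns (action, is_bad) pairs."""
    labels = []
    for state, action in zip(states, actions):
        labels.append((action, collides(state, action)))
    return labels

# Toy collision model: the commanded step collides if it crosses a wall at x=1.0.
collides = lambda x, dx: x + dx > 1.0
print(label_rollout([0.2, 0.8, 0.95], [0.1, 0.1, 0.1], collides))
# -> [(0.1, False), (0.1, False), (0.1, True)]
```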

1322 2:33:35,200 --> 2:33:53,200 Yeah, I mean, with Optimus, we want it over time to be, you know, kind of like the androids you've seen in sci-fi movies, like Data in Star Trek: The Next Generation. But obviously we could program the robot to be less robot-like and more friendly,

1323 2:33:53,200 --> 2:34:11,200 and it can obviously learn to emulate humans and feel very natural. So as AI in general improves, we can add that to the robot, and it should obviously be able to do simple instructions or even intuit

1324 2:34:11,200 --> 2:34:13,200 what it is that you want.

1325 2:34:13,200 --> 2:34:25,200 So you could give it a high-level instruction, and then it can break that down into a series of actions and take those actions.

1326 2:34:25,200 --> 2:34:26,200 Hi.

1327 2:34:26,200 --> 2:34:36,200 Yeah, it's exciting to think that with Optimus you can achieve orders of magnitude of improvement in economic output.

1328 2:34:36,200 --> 2:34:45,200 That's really exciting. And when Tesla started the mission was to accelerate the advent of renewable energy or sustainable transport.

1329 2:34:45,200 --> 2:35:03,200 So with Optimus, do you still see that mission being the mission statement of Tesla, or is it going to be updated with, you know, a mission to accelerate the advent of infinite abundance, or a limitless economy?

1330 2:35:03,200 --> 2:35:11,200 Yeah, Optimus is not, strictly speaking,

1331 2:35:11,200 --> 2:35:15,200 directly in line with accelerating sustainable energy.

1332 2:35:15,200 --> 2:35:21,200 But, you know, to the degree that it is more efficient at getting things done than a person,

1333 2:35:21,200 --> 2:35:35,200 it does, I guess, help with sustainable energy. But I think the mission effectively does somewhat broaden with the advent of Optimus to, you know, I don't know, making the future awesome.

1334 2:35:35,200 --> 2:35:43,200 So, you know, you look at Optimus, and I don't know about you, but I'm excited to see what Optimus will become.

1335 2:35:43,200 --> 2:35:51,200 And, you know, for any given technology, you can ask:

1336 2:35:51,200 --> 2:35:58,200 do you want to see what it's like in a year, two years, three years, four years, five years, ten?

1337 2:35:58,200 --> 2:36:03,200 I'd say for sure you definitely want to see what's happened with Optimus.

1338 2:36:03,200 --> 2:36:09,200 Whereas, you know, a bunch of other technologies have sort of plateaued.

1339 2:36:09,200 --> 2:36:16,200 I won't name names here, but

1340 2:36:16,200 --> 2:36:19,200 you know, so

1341 2:36:19,200 --> 2:36:24,200 I think Optimus is going to be incredible in like five years, ten years. Like, mind-blowing.

1342 2:36:24,200 --> 2:36:29,200 And I'm really interested to see that happen and I hope you are too.

1343 2:36:29,200 --> 2:36:41,200 Thank you. I have a quick question here; this is Justin. I was wondering, are you planning to extend conversational capabilities for the robot?

1344 2:36:41,200 --> 2:36:49,200 And my follow-up question to that is: what's the end goal? What's the end goal of Optimus?

1345 2:36:49,200 --> 2:36:56,200 Yeah, Optimus will definitely have conversational capabilities. So

1346 2:36:56,200 --> 2:37:00,200 you'd be able to talk to it and have a conversation and it would feel quite natural.

1347 2:37:00,200 --> 2:37:09,200 So from an end-goal standpoint, I don't know. I think it's going to keep evolving, and

1348 2:37:09,200 --> 2:37:16,200 I'm not sure where it ends up, but someplace interesting for sure.

1349 2:37:16,200 --> 2:37:21,200 You know, we always have to be careful about the, you know, don't go down the terminator path.

1350 2:37:21,200 --> 2:37:29,200 I thought maybe we should start off with a video of, like, The Terminator, you know, with the skull-crushing opening.

1351 2:37:29,200 --> 2:37:32,200 But people might take that too seriously.

1352 2:37:32,200 --> 2:37:36,200 So, you know, we do want optimist to be safe.

1353 2:37:36,200 --> 2:37:44,200 So we are designing in safeguards where you can locally stop the robot

1354 2:37:44,200 --> 2:37:52,200 and, you know, with like basically a localized control ROM that you can't update over the Internet,

1355 2:37:52,200 --> 2:37:56,200 which I think that's quite important.

1356 2:37:56,200 --> 2:38:12,200 Essential, frankly. So, like, a localized stop button or remote control, something like that, that cannot be changed.

1357 2:38:12,200 --> 2:38:22,200 But it's definitely going to be interesting. It won't be boring.

1358 2:38:22,200 --> 2:38:27,200 OK, yeah, I see today you have a very attractive product with Dojo and its applications.

1359 2:38:27,200 --> 2:38:30,200 So I'm wondering, what's the future for the Dojo platform?

1360 2:38:30,200 --> 2:38:39,200 Will you provide infrastructure as a service, like AWS, or even sell the chip, like Nvidia?

1361 2:38:39,200 --> 2:38:43,200 So basically, what's the future? Because I see you use a seven-nanometer process.

1362 2:38:43,200 --> 2:38:46,200 So the development cost is easily over 10 million dollars.

1363 2:38:46,200 --> 2:38:51,200 How do you make it work business-wise?

1364 2:38:51,200 --> 2:39:00,200 Yeah, I mean, Dojo is a very big computer that actually uses a lot of power and needs a lot of cooling.

1365 2:39:00,200 --> 2:39:08,200 So I think it's probably going to make more sense to have Dojo operate in an Amazon Web Services manner than to try to sell it to someone else.

1366 2:39:08,200 --> 2:39:20,200 The most efficient way to operate Dojo is just to have it be a service that you can use, that's available online,

1367 2:39:20,200 --> 2:39:25,200 and where you can train your models way faster and for less money.

1368 2:39:25,200 --> 2:39:34,200 And as the world transitions to Software 2.0...

1369 2:39:34,200 --> 2:39:36,200 And that's on the bingo card.

1370 2:39:36,200 --> 2:39:41,200 Someone, I know, now has to drink five tequilas.

1371 2:39:41,200 --> 2:39:45,200 So let's see.

1372 2:39:45,200 --> 2:39:49,200 Software 2.0.

1373 2:39:49,200 --> 2:39:53,200 Yeah, we'll use a lot of neural net training.

1374 2:39:53,200 --> 2:40:07,200 So, you know, it kind of makes sense that over time, as there's more and more neural net stuff, people will want to use the fastest, lowest-cost neural net training system.

1375 2:40:07,200 --> 2:40:14,200 So I think there's a lot of opportunity in that direction.

1376 2:40:14,200 --> 2:40:21,200 Hi, my name is Ali Jahanian. Thank you for this event. It's very inspirational.

1377 2:40:21,200 --> 2:40:40,200 My question is, I'm wondering, what is your vision for humanoid robots that understand our emotions and art and can contribute to our creativity?

1378 2:40:40,200 --> 2:40:53,200 Well, I think you're already seeing AI that is at least able to generate very interesting art, like DALL·E and DALL·E 2.

1379 2:40:53,200 --> 2:41:03,200 And I think we'll start seeing AI that can actually generate even movies that have coherence, like interesting movies, and tell jokes.

1380 2:41:03,200 --> 2:41:14,200 So it's quite remarkable how fast AI is advancing at many companies besides Tesla.

1381 2:41:14,200 --> 2:41:17,200 We're headed for a very interesting future.

1382 2:41:17,200 --> 2:41:21,200 And, yeah, so you guys want to comment on that?

1383 2:41:21,200 --> 2:41:27,200 Yeah, I guess the Optimus robot can come up with physical art, not just digital art.

1384 2:41:27,200 --> 2:41:33,200 You can, you know, ask for some dance moves in text or voice, and it can produce those in the future.

1385 2:41:33,200 --> 2:41:37,200 So there's a lot of physical art, not just digital art.

1386 2:41:37,200 --> 2:41:41,200 Oh, yeah, yeah. Computers can absolutely make physical art. Yeah. Yeah.

1387 2:41:41,200 --> 2:41:45,200 Like dance, play soccer or whatever.

1388 2:41:45,200 --> 2:41:50,200 It needs to get more agile, but over time, for sure.

1389 2:41:50,200 --> 2:41:52,200 Thanks so much for the presentation.

1390 2:41:52,200 --> 2:42:00,200 For the Tesla autopilot slides, I noticed that the models that you were using were heavily motivated by language models.

1391 2:42:00,200 --> 2:42:05,200 And I was wondering what the history of that was and how much of an improvement it gave.

1392 2:42:05,200 --> 2:42:10,200 I thought that was a really interesting, curious choice to use language models for the lane transitioning.

1393 2:42:10,200 --> 2:42:14,200 So there are sort of two aspects for why we transition to language modeling.

1394 2:42:14,200 --> 2:42:17,200 So the first... Talk loud and close. OK.

1395 2:42:17,200 --> 2:42:20,200 It's not coming through very clearly. OK, got it.

1396 2:42:20,200 --> 2:42:23,200 Yeah, so the language models help us in two ways.

1397 2:42:23,200 --> 2:42:26,200 The first way is that it lets us predict lanes that we couldn't have otherwise.

1398 2:42:26,200 --> 2:42:33,200 As Ashok mentioned earlier, basically when we predicted lanes in sort of a dense 3D fashion,

1399 2:42:33,200 --> 2:42:38,200 you can only model certain kinds of lanes, but we want to get those crisscrossing connections inside of intersections.

1400 2:42:38,200 --> 2:42:41,200 It's just not possible to do that without making it a graph prediction.

1401 2:42:41,200 --> 2:42:45,200 If you try to do this with dense segmentation, it just doesn't work.

1402 2:42:45,200 --> 2:42:48,200 Also, the lane prediction is a multimodal problem.

1403 2:42:48,200 --> 2:42:54,200 Sometimes you just don't have sufficient visual information to know precisely how things look on the other side of the intersection.

1404 2:42:54,200 --> 2:42:59,200 So you need a method that can generalize and produce coherent predictions.

1405 2:42:59,200 --> 2:43:02,200 You don't want to be predicting two lanes and three lanes at the same time.

1406 2:43:02,200 --> 2:43:06,200 You want to commit to one. And a general model like these language models provides that.
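Why an autoregressive decoder "commits" to one mode while dense regression averages them can be seen in a toy example. The lane counts and logits below are made up for illustration:

```python
# Toy example: the network is unsure whether the far side of the intersection
# has 2 or 3 lanes. A dense regressor blends the modes; a language-model-style
# decoder picks one hypothesis and conditions everything after it on that choice.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# The net is slightly more confident in "3 lanes" than "2 lanes":
mode_probs = softmax([2.0, 2.2])

# Dense regression blends the modes into a nonsensical ~2.55 lanes:
regressed = 2 * mode_probs[0] + 3 * mode_probs[1]

# Token-by-token decoding commits to the most likely hypothesis instead,
# so the output stays internally coherent:
committed = max(zip(mode_probs, [2, 3]))[1]
print(round(regressed, 2), committed)  # -> 2.55 3
```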

1407 2:43:10,200 --> 2:43:11,200 Hi.

1408 2:43:11,200 --> 2:43:14,200 Hi. My name is Giovanni.

1409 2:43:14,200 --> 2:43:18,200 Thanks for the presentation. It's really nice.

1410 2:43:18,200 --> 2:43:21,200 I have a question for FSD team.

1411 2:43:21,200 --> 2:43:30,200 For the neural networks, how do you do unit tests, software unit tests on that?

1412 2:43:30,200 --> 2:43:40,200 Do you have a bunch, I don't know, thousands of cases that the neural network,

1413 2:43:40,200 --> 2:43:45,200 after you train it, has to pass before you release it as a product?

1414 2:43:45,200 --> 2:43:50,200 What's your software unit testing strategies for this?

1415 2:43:50,200 --> 2:43:51,200 Glad you asked.

1416 2:43:51,200 --> 2:43:56,200 There's a series of tests that we have defined, starting from unit tests for software itself.

1417 2:43:56,200 --> 2:44:00,200 But then for the neural network models, we have VIP sets defined.

1418 2:44:00,200 --> 2:44:05,200 If you just have a large test set, that's not enough, we find.

1419 2:44:05,200 --> 2:44:09,200 We need sophisticated VIP sets for different failure modes.

1420 2:44:09,200 --> 2:44:12,200 And then we curate them and grow them over the lifetime of the product.

1421 2:44:12,200 --> 2:44:19,200 So over the years, we have hundreds of thousands of examples where we have been failing in the past

1422 2:44:19,200 --> 2:44:20,200 that we have curated.

1423 2:44:20,200 --> 2:44:25,200 And so for any new model, we test against the entire history of these failures

1424 2:44:25,200 --> 2:44:27,200 and then keep adding to this test set.

1425 2:44:27,200 --> 2:44:32,200 On top of this, we have shadow modes where we ship these models in silent to the car

1426 2:44:32,200 --> 2:44:35,200 and we get data back on where they are failing or succeeding.

1427 2:44:35,200 --> 2:44:39,200 And there's an extensive QA program.

1428 2:44:39,200 --> 2:44:41,200 It's very hard to ship a regression.

1429 2:44:41,200 --> 2:44:44,200 There's like nine levels of filters before it hits customers.

1430 2:44:44,200 --> 2:44:48,200 But then we have really good infra to make this all efficient.
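A curated-failure-set regression gate of the kind described might be sketched as follows; the function name and the toy clip classifiers are hypothetical, not Tesla's actual infra:

```python
# Hedged sketch of a curated-failure regression gate: a new model must not
# fail clips that the baseline handled, drawn from past fleet failures.
def regression_check(model, baseline, failure_set):
    """failure_set: list of (clip, expected) pairs curated from past failures.

    Returns clips where the new model is wrong but the baseline was right,
    i.e. genuine regressions that should block the release.
    """
    regressions = []
    for clip, expected in failure_set:
        if model(clip) != expected and baseline(clip) == expected:
            regressions.append(clip)
    return regressions

# Toy models: classify a clip as "stop" if it contains a stop-sign token.
baseline = lambda clip: "stop" if "stop_sign" in clip else "go"
new_model = lambda clip: "stop" if "stop" in clip else "go"  # sloppier match
failures = [("stop_sign_occluded", "stop"), ("billboard_stop_ad", "go")]
print(regression_check(new_model, baseline, failures))
# -> ['billboard_stop_ad']: the new model regresses on the billboard false positive
```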

1431 2:44:48,200 --> 2:44:50,200 I'm one of the QA testers.

1432 2:44:50,200 --> 2:44:52,200 So I QA the car.

1433 2:44:52,200 --> 2:44:54,200 Yeah, QA tester.

1434 2:44:54,200 --> 2:44:55,200 Yeah.

1435 2:44:55,200 --> 2:45:04,200 So I'm constantly in the car, just QA-ing whatever the latest alpha build is that doesn't totally crash.

1436 2:45:04,200 --> 2:45:06,200 Finds a lot of bugs.

1437 2:45:08,200 --> 2:45:10,200 Hi. Great event.

1438 2:45:10,200 --> 2:45:14,200 I have a question about foundational models for autonomous driving.

1439 2:45:14,200 --> 2:45:21,200 We have all seen that big models that really can, when you scale up with data and model parameter,

1440 2:45:21,200 --> 2:45:25,200 from GPT-3 to PaLM, it can actually now do reasoning.

1441 2:45:25,200 --> 2:45:32,200 Do you see it as essential to scale up foundational models with data and size?

1442 2:45:32,200 --> 2:45:38,200 And then at least you can get a teacher model that potentially can solve all the problems.

1443 2:45:38,200 --> 2:45:41,200 And then you distill to a student model.

1444 2:45:41,200 --> 2:45:46,200 Is that how you see foundational models relevant for autonomous driving?

1445 2:45:46,200 --> 2:45:48,200 That's quite similar to our auto labeling model.

1446 2:45:48,200 --> 2:45:51,200 So we don't just have models that run in the car.

1447 2:45:51,200 --> 2:45:57,200 We train models that are entirely offline, that are extremely large, that can't run in real time on the car.

1448 2:45:57,200 --> 2:46:05,200 So we just run those offline on our servers, producing really good labels that can then train the online networks.

1449 2:46:05,200 --> 2:46:10,200 So that's one form of distillation of these teacher-student models.
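The offline-teacher, in-car-student split can be sketched minimally; the stand-in teacher and update rule below are purely illustrative:

```python
# Minimal sketch of teacher-student distillation as described: a large model
# that can't run in real time labels raw clips offline, and those labels
# train the small network that ships in the car.
def auto_label(teacher, raw_clips):
    """Offline pass: the expensive teacher turns raw clips into labels."""
    return [(clip, teacher(clip)) for clip in raw_clips]

def distill(student_update, labeled):
    """Training loop for the online student, driven by the teacher's labels."""
    student = {}
    for clip, label in labeled:
        student = student_update(student, clip, label)
    return student

teacher = lambda clip: clip.upper()   # stand-in "perfect" offline labeler
update = lambda s, c, l: {**s, c: l}  # stand-in for a gradient step
dataset = auto_label(teacher, ["clip_a", "clip_b"])
print(distill(update, dataset))       # -> {'clip_a': 'CLIP_A', 'clip_b': 'CLIP_B'}
```

The point of the structure: the teacher's cost is paid once per clip offline, while the student stays cheap enough for the car.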

1450 2:46:10,200 --> 2:46:16,200 In terms of foundational models, we are building some really, really large datasets that are multiple petabytes.

1451 2:46:16,200 --> 2:46:20,200 And we are seeing that some of these tasks work really well when we have these large datasets.

1452 2:46:20,200 --> 2:46:25,200 Like the kinematics, like I mentioned: video goes in, and the kinematics of all the objects come out,

1453 2:46:25,200 --> 2:46:27,200 And up to the fourth derivative.

1454 2:46:27,200 --> 2:46:29,200 And people thought we couldn't do detection with cameras.

1455 2:46:29,200 --> 2:46:32,200 Detection, depth, velocity, acceleration.

1456 2:46:32,200 --> 2:46:37,200 And imagine how precise these have to be for these higher-order derivatives to be accurate.

1457 2:46:37,200 --> 2:46:41,200 And this all comes from these large datasets and large models.

1458 2:46:41,200 --> 2:46:49,200 So we are seeing the equivalent of foundation models in our own way for geometry and kinematics and things like those.

1459 2:46:49,200 --> 2:46:52,200 Do you want to add anything, John?

1460 2:46:52,200 --> 2:46:53,200 Yeah, I'll keep it brief.

1461 2:46:53,200 --> 2:47:03,200 Basically, whenever we train on a larger dataset, we see big improvements in our model performance.

1462 2:47:03,200 --> 2:47:10,200 And basically, whenever we initialize our networks with some pre-training step from some other auxiliary task, we basically see improvements.

1463 2:47:10,200 --> 2:47:17,200 Self-supervised or supervised, large datasets both help a lot.

1464 2:47:17,200 --> 2:47:25,200 Hi. So at the beginning, Elon said that Tesla was potentially interested in building artificial general intelligence systems.

1465 2:47:25,200 --> 2:47:34,200 Given the potentially transformative impact of technology like that, it seems prudent to invest in technical AGI safety expertise specifically.

1466 2:47:34,200 --> 2:47:38,200 I know Tesla does a lot of technical narrow AI safety research.

1467 2:47:38,200 --> 2:47:48,200 I was curious if Tesla was intending to try to build expertise in technical artificial general intelligence safety specifically.

1468 2:47:48,200 --> 2:47:59,200 Well, I mean, if we're looking like we're going to be making a significant contribution to artificial general intelligence, then we'll for sure invest in safety.

1469 2:47:59,200 --> 2:48:01,200 I'm a big believer in AI safety.

1470 2:48:01,200 --> 2:48:12,200 I think there should be an AI regulatory authority at the government level, just as there is a regulatory authority for anything that affects public safety.

1471 2:48:12,200 --> 2:48:21,200 So we have regulatory authority for aircraft and cars and food and drugs because they affect public safety.

1472 2:48:21,200 --> 2:48:23,200 And AI also affects public safety.

1473 2:48:23,200 --> 2:48:39,200 So I think this is not really something that government understands yet. But I think there should be a referee that is ensuring, or trying to ensure, public safety for AGI.

1474 2:48:39,200 --> 2:48:46,200 And you think of like, well, what are the elements that are necessary to create AGI?

1475 2:48:46,200 --> 2:49:09,200 The accessible data set is extremely important. And if you've got a large number of cars and humanoid robots processing petabytes of video data and audio data from the real world, just like humans, that might be the biggest data set.

1476 2:49:09,200 --> 2:49:12,200 It probably is the biggest data set.

1477 2:49:12,200 --> 2:49:17,200 Because in addition to that, you can obviously incrementally scan the Internet.

1478 2:49:17,200 --> 2:49:29,200 But what the Internet can't quite do is have millions or hundreds of millions of cameras in the real world, and like I said, with audio and other sensors as well.

1479 2:49:29,200 --> 2:49:39,200 So I think we probably will have the most amount of data and probably the most amount of training power.

1480 2:49:39,200 --> 2:49:48,200 Therefore, probably we will make a contribution to AGI.

1481 2:49:48,200 --> 2:49:53,200 Hey, I noticed the semi was back there, but we haven't talked about it too much.

1482 2:49:53,200 --> 2:49:59,200 I was just wondering for the semi truck, what are the changes you're thinking about from a sensing perspective?

1483 2:49:59,200 --> 2:50:03,200 I imagine there's very different requirements, obviously, than just a car.

1484 2:50:03,200 --> 2:50:06,200 And if you don't think that's true, why is that true?

1485 2:50:06,200 --> 2:50:10,200 No, I think basically you can drive a car.

1486 2:50:10,200 --> 2:50:12,200 I mean, think about it, what drives any vehicle?

1487 2:50:12,200 --> 2:50:18,200 It's a biological neural net with eyes, with cameras, essentially.

1488 2:50:18,200 --> 2:50:30,200 And really, your primary sensors are two cameras on a slow gimbal, a very slow gimbal.

1489 2:50:30,200 --> 2:50:32,200 That's your head.

1490 2:50:32,200 --> 2:50:39,200 So if a biological neural net with two cameras on a slow gimbal can drive a semi truck,

1491 2:50:39,200 --> 2:50:48,200 then if you've got like eight cameras with continuous 360-degree vision operating at a higher frame rate and much higher reaction rate,

1492 2:50:48,200 --> 2:50:56,200 then I think it is obvious that you should be able to drive a semi or any vehicle much better than a human.

1493 2:50:56,200 --> 2:51:00,200 Hi, my name is Akshay. Thank you for the event.

1494 2:51:00,200 --> 2:51:08,200 Assuming Optimus would be used for different use cases and would evolve at different pace for these use cases,

1495 2:51:08,200 --> 2:51:15,200 would it be possible to sort of develop and deploy different software and hardware components independently

1496 2:51:15,200 --> 2:51:27,200 and deploy them in Optimus so that the overall feature development is faster for Optimus?

1497 2:51:27,200 --> 2:51:30,200 I'm trying to parse the question.

1498 2:51:30,200 --> 2:51:33,200 Okay, all right. We did not comprehend.

1499 2:51:33,200 --> 2:51:38,200 Unfortunately, our neural net did not comprehend the question.

1500 2:51:38,200 --> 2:51:44,200 So next question.

1501 2:51:44,200 --> 2:51:46,200 Hi, I want to switch the gear to the autopilot.

1502 2:51:46,200 --> 2:51:53,200 So when you guys plan to roll out the FSD beta to countries other than U.S. and Canada,

1503 2:51:53,200 --> 2:52:00,200 and also my next question is what's the biggest bottleneck or the technological barrier you think in the current autopilot stack

1504 2:52:00,200 --> 2:52:06,200 and how you envision to solve that to make the autopilot is considerably better than human

1505 2:52:06,200 --> 2:52:11,200 in terms of performance matrix, like safety assurance and the human confidence?

1506 2:52:11,200 --> 2:52:18,200 I think you also mentioned for FSD V11, you are going to combine the highway and the city as a single stack

1507 2:52:18,200 --> 2:52:24,200 and some architectural big improvements. Can you maybe expand a bit on that? Thank you.

1508 2:52:24,200 --> 2:52:29,200 Well, that's a whole bunch of questions.

1509 2:52:29,200 --> 2:52:33,200 We're hopeful to be able to. I think, from a technical standpoint,

1510 2:52:33,200 --> 2:52:43,200 it should be possible to roll out FSD beta worldwide by the end of this year.

1511 2:52:43,200 --> 2:52:54,200 But for a lot of countries, we need regulatory approval, and so we are somewhat gated by the regulatory approval in other countries.

1512 2:52:54,200 --> 2:53:03,200 But I think from a technical standpoint, it will be ready to go to a worldwide beta by the end of this year.

1513 2:53:03,200 --> 2:53:07,200 And there's quite a big improvement that we're expecting to release next month.

1514 2:53:07,200 --> 2:53:16,200 That will be especially good at assessing the velocity of fast-moving cross traffic and a bunch of other things.

1515 2:53:16,200 --> 2:53:22,200 So, anyone want to elaborate?

1516 2:53:22,200 --> 2:53:27,200 Yeah, I guess so. There used to be a lot of differences between production autopilot and the full self-driving beta,

1517 2:53:27,200 --> 2:53:31,200 but those differences have been getting smaller and smaller over time.

1518 2:53:31,200 --> 2:53:40,200 As of just a few months ago, we now use the same vision-only object detection stack in both FSD and in production Autopilot on all vehicles.

1519 2:53:40,200 --> 2:53:45,200 There's still a few differences, the primary one being the way that we predict lanes right now.

1520 2:53:45,200 --> 2:53:50,200 So we upgraded the modeling of lanes so that it could handle these more complex geometries like I mentioned in the talk.

1521 2:53:50,200 --> 2:53:54,200 In production autopilot, we still use a simpler lane model,

1522 2:53:54,200 --> 2:54:01,200 but we're extending our current FSD beta models to work in all sorts of highway scenarios as well.

1523 2:54:01,200 --> 2:54:06,200 Yeah, and the version of FSD beta that I drive actually does have the integrated stack.

1524 2:54:06,200 --> 2:54:14,200 So it uses the FSD stack both in city streets and highway, and it works quite well for me.

1525 2:54:14,200 --> 2:54:20,200 But we need to validate it in all kinds of weather like heavy rain, snow, dust,

1526 2:54:20,200 --> 2:54:29,200 and just make sure it's working better than the production stack across a wide range of environments.

1527 2:54:29,200 --> 2:54:32,200 But we're pretty close to that.

1528 2:54:32,200 --> 2:54:40,200 I mean, I think it's, I don't know, maybe, it'll definitely be before the end of the year and maybe November.

1529 2:54:40,200 --> 2:54:46,200 Yeah, in our personal drives, the FSD stack on highway drives already way better than the production stack we have.

1530 2:54:46,200 --> 2:54:53,200 And we do expect to also include the parking lot stack as a part of the FSD stack before the end of this year.

1531 2:54:53,200 --> 2:55:02,200 So that will basically bring us to: you sit in the car in a parking lot, and it drives all the way to a parking spot at the destination, before the end of this year.

1532 2:55:02,200 --> 2:55:12,200 And the fundamental metric to optimize against is how many miles between necessary interventions.

1533 2:55:12,200 --> 2:55:25,200 So just massively improving how many miles the car can drive in full autonomy before an intervention is required that is safety critical.

1534 2:55:25,200 --> 2:55:36,200 So, yeah, that's the fundamental metric that we're measuring every week, and we're making radical improvements on that.
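The miles-between-interventions metric described here can be sketched as a simple fleet aggregation. This is a toy illustration only; the log fields (`miles`, `safety_critical_interventions`) are invented for the example and are not Tesla's actual telemetry schema.

```python
# Hypothetical sketch: aggregate weekly fleet logs into miles driven per
# safety-critical intervention. Field names are illustrative assumptions.

def miles_per_critical_intervention(drive_logs):
    """Total miles driven divided by total safety-critical interventions."""
    total_miles = 0.0
    critical_interventions = 0
    for log in drive_logs:
        total_miles += log["miles"]
        critical_interventions += log["safety_critical_interventions"]
    if critical_interventions == 0:
        return float("inf")  # no interventions observed in this window
    return total_miles / critical_interventions

# Example: one week of (made-up) fleet data.
week = [
    {"miles": 1200.0, "safety_critical_interventions": 1},
    {"miles": 800.0, "safety_critical_interventions": 0},
]
print(miles_per_critical_intervention(week))  # 2000.0 miles per intervention
```

Tracking this number week over week, as described, turns "radical improvements" into a single monotonic curve to watch.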

1535 2:55:36,200 --> 2:55:46,200 Hi, thank you. Thank you so much for the presentation. Very inspiring. My name is Daisy. I actually have a non-technical question for you.

1536 2:55:46,200 --> 2:56:07,200 I'm curious if you were back to your 20s, what are some of the things you wish you knew back then? What are some advice you would give to your younger self?

1537 2:56:07,200 --> 2:56:14,200 Well, I'm trying to figure out something useful to say.

1538 2:56:14,200 --> 2:56:20,200 Yeah, yeah, join Tesla would be one thing.

1539 2:56:20,200 --> 2:56:28,200 Yeah, I think just try to expose yourself to as many smart people as possible.

1540 2:56:28,200 --> 2:56:34,200 And read a lot of books.

1541 2:56:34,200 --> 2:56:37,200 You know, I did do that, though.

1542 2:56:37,200 --> 2:56:54,200 So I think there's some merit to just also not being necessarily too intense and enjoying the moment a bit more, I would say, to 20-something me.

1543 2:56:54,200 --> 2:57:02,200 Just to stop and smell the roses occasionally would probably be a good idea.

1544 2:57:02,200 --> 2:57:14,200 You know, it's like when we were developing the Falcon 1 rocket on the Kwajalein Atoll, and we had this beautiful little island that we're developing the rocket on,

1545 2:57:14,200 --> 2:57:26,200 and not once during that entire time did I even have a drink on the beach. I'm like, I should have had a drink on the beach. That would have been fine.

1546 2:57:26,200 --> 2:57:32,200 Thank you very much. I think you have excited all of the robotics people with Optimus.

1547 2:57:32,200 --> 2:57:40,200 This feels very much like 10 years ago in driving, but as driving has proved to be harder than it actually looked 10 years ago,

1548 2:57:40,200 --> 2:57:49,200 what do we know now that we didn't 10 years ago that would make, for example, AGI on a humanoid come faster?

1549 2:57:49,200 --> 2:58:00,200 Well, I mean, it seems to me that AI is advancing very quickly. Hardly a week goes by without some significant announcement.

1550 2:58:00,200 --> 2:58:11,200 And yeah, I mean, at this point, AI seems to be able to win at almost any rule-based game.

1551 2:58:11,200 --> 2:58:31,200 It's able to create extremely impressive art, engage in conversations that are very sophisticated, write essays, and these just keep improving.

1552 2:58:31,200 --> 2:58:45,200 And there's so many more talented people working on AI, and the hardware is getting better. I think AI is on a super, like a strong exponential curve of improvements,

1553 2:58:45,200 --> 2:58:57,200 independent of what we do at Tesla, and obviously we will benefit somewhat from that exponential curve of improvement in AI.

1554 2:58:57,200 --> 2:59:07,200 Tesla just also happens to be very good at actuators, at motors, gearboxes, controllers, power electronics, batteries, sensors.

1555 2:59:07,200 --> 2:59:19,200 And really, I'd say the biggest difference between the robot on four wheels and the robot with arms and legs is getting the actuators right.

1556 2:59:19,200 --> 2:59:33,200 It's an actuators and sensors problem, and obviously how you control those actuators and sensors.

1557 2:59:33,200 --> 2:59:42,200 I don't know, we happen to have the ingredients necessary to create a compelling robot, and we're doing it.

1558 2:59:42,200 --> 2:59:51,200 Hi, Elon. You are actually bringing humanity to the next level. Literally, Tesla and you are bringing humanity to the next level.

1559 2:59:51,200 --> 3:00:03,200 So you said Optimus Prime, Optimus, will be used in the next Tesla factory. My question is, will a new Tesla factory be fully run by the Optimus program?

1560 3:00:03,200 --> 3:00:10,200 And when can the general public order a humanoid?

1561 3:00:10,200 --> 3:00:16,200 Yeah, I think it'll, you know, we're going to start Optimus with very simple tasks in the factory.

1562 3:00:16,200 --> 3:00:34,200 You know, like maybe just loading a part, like you saw in the video, carrying a part from one place to another, or loading a part into one of our more conventional robot cells that, you know, weld the body together.

1563 3:00:34,200 --> 3:00:44,200 So we'll start, you know, just trying to, how do we make it useful at all? And then gradually expand the number of situations where it's useful.

1564 3:00:44,200 --> 3:00:55,200 And I think that the number of situations where Optimus is useful will grow exponentially, like really, really fast.

1565 3:00:55,200 --> 3:01:07,200 In terms of when people can order one, I don't know, I think it's not that far away. Well, I think you mean when can people receive one?

1566 3:01:07,200 --> 3:01:23,200 So, I don't know, I'm like, I'd say probably within three years, not more than five years, within three to five years, you could probably receive an Optimus.

1567 3:01:23,200 --> 3:01:29,200 I feel the best way to make progress on AGI is to involve as many smart people across the world as possible.

1568 3:01:29,200 --> 3:01:37,200 And given the size and resources of Tesla compared to robotics companies, and given the state of humanoid research at the moment,

1569 3:01:37,200 --> 3:01:44,200 would it make sense for Tesla to sort of open source some of the simulation and hardware parts?

1570 3:01:44,200 --> 3:01:53,200 I think Tesla can still be the dominant platform, something like Android or iOS, for the entire field of humanoid research.

1571 3:01:53,200 --> 3:02:00,200 Would that be something where, rather than keeping Optimus to just Tesla researchers or the factory itself,

1572 3:02:00,200 --> 3:02:10,200 you can open it up and let the whole world explore humanoid research?

1573 3:02:10,200 --> 3:02:19,200 I think we have to be careful about Optimus being potentially used in ways that are bad, because that is one of the possible things to do.

1574 3:02:19,200 --> 3:02:40,200 So, I think we'd provide Optimus where you can provide instructions to Optimus, but where those instructions are governed by some laws of robotics that you cannot overcome.

1575 3:02:40,200 --> 3:02:52,200 So, not doing harm to others, and I think probably quite a few safety related things with Optimus.

1576 3:02:52,200 --> 3:02:59,200 We'll just take maybe a few more questions, and then thank you all for coming.

1577 3:02:59,200 --> 3:03:09,200 Questions, one deep and one broad. On the deep, for Optimus, what's the current and what's the ideal controller bandwidth?

1578 3:03:09,200 --> 3:03:15,200 And then in the broader question, there's this big advertisement for the depth and breadth of the company.

1579 3:03:15,200 --> 3:03:21,200 What is it uniquely about Tesla that enables that?

1580 3:03:21,200 --> 3:03:25,200 Anyone want to tackle the bandwidth question?

1581 3:03:25,200 --> 3:03:28,200 So, the technical bandwidth of the...

1582 3:03:28,200 --> 3:03:29,200 Hold it close to your mouth and speak loudly.

1583 3:03:29,200 --> 3:03:41,200 Okay. For the bandwidth question, you have to understand or figure out what is the task that you want it to do, and if you took a frequency transform of that task, what is it that you want your limbs to do?

1584 3:03:41,200 --> 3:03:50,200 And that's where you get your bandwidth from. It's not a number that you can specifically just say. You need to understand your use case, and that's where the bandwidth comes from.

1585 3:03:50,200 --> 3:03:54,200 What was the broader question?

1586 3:03:54,200 --> 3:04:05,200 The breadth and depth thing. I can answer the breadth and depth.

1587 3:04:05,200 --> 3:04:31,200 On the bandwidth question, I think we'll probably just end up increasing the bandwidth, which translates to the effective dexterity and reaction time of the robot. It's safe to say it's not one hertz, and maybe you don't need to go all the way to 100 hertz, but maybe 10, 25, I don't know.

1588 3:04:31,200 --> 3:04:39,200 Over time, I think the bandwidth will increase quite a bit, or, translated, the dexterity and latency will improve.

1589 3:04:39,200 --> 3:04:44,200 You'd want to minimize that over time.

1590 3:04:44,200 --> 3:04:48,200 Minimize latency, maximize dexterity.
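The engineer's point that bandwidth "comes from the task" can be made concrete: take a sampled joint trajectory, look at its spectrum, and read off the frequency that captures most of the signal energy. This is a minimal sketch under assumptions I've chosen for illustration (the trajectory, the 100 Hz sample rate, and the 95% energy threshold are all made up).

```python
# Sketch: estimate required controller bandwidth from the frequency content
# of a desired motion, as described in the answer above. All parameters
# here are illustrative assumptions, not robot specs.
import numpy as np

def required_bandwidth_hz(trajectory, sample_rate_hz, energy_fraction=0.95):
    """Smallest frequency containing `energy_fraction` of the motion's energy."""
    spectrum = np.abs(np.fft.rfft(trajectory - np.mean(trajectory))) ** 2
    freqs = np.fft.rfftfreq(len(trajectory), d=1.0 / sample_rate_hz)
    cumulative = np.cumsum(spectrum) / np.sum(spectrum)
    return freqs[np.searchsorted(cumulative, energy_fraction)]

# Toy task: a 2 Hz reaching motion sampled at 100 Hz for 4 seconds.
t = np.arange(0, 4, 0.01)
motion = np.sin(2 * np.pi * 2.0 * t)
print(required_bandwidth_hz(motion, 100.0))  # ~2.0 Hz for this toy motion
```

A slow reaching motion needs only a few hertz of control bandwidth; a fast catch or balance correction pushes the number up, which matches the "maybe 10, 25 Hz" intuition above.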

1591 3:04:48,200 --> 3:05:07,200 In terms of breadth and depth, we're a pretty big company at this point, so we've got a lot of different areas of expertise that we necessarily had to develop in order to make electric cars, and then in order to make autonomous electric cars.

1592 3:05:07,200 --> 3:05:11,200 Tesla is like a whole series of startups, basically.

1593 3:05:11,200 --> 3:05:19,200 So far, they've almost all been quite successful. So we must be doing something right.

1594 3:05:19,200 --> 3:05:30,200 I consider one of my core responsibilities in running the company is to have an environment where great engineers can flourish.

1595 3:05:30,200 --> 3:05:42,200 And I think in a lot of companies, maybe most companies, if somebody is a really talented, driven engineer, they're unable to actually...

1596 3:05:42,200 --> 3:05:48,200 Their talents are suppressed at a lot of companies.

1597 3:05:48,200 --> 3:06:05,200 And at some of those companies, engineering talent is suppressed in a way that is maybe not obviously bad, but where it's just so comfortable, and you're paid so much money, and the output you actually have to produce is so low, that it's like a honey trap.

1598 3:06:05,200 --> 3:06:19,200 So there's a few honey traps in Silicon Valley, where they don't necessarily seem like bad places for engineers, but you have to say a good engineer went in, and what did they get out?

1599 3:06:19,200 --> 3:06:28,200 And the output of that engineering talent seems very low, even though they seem to be enjoying themselves.

1600 3:06:28,200 --> 3:06:32,200 That's why I say there are a few honey trap companies in Silicon Valley.

1601 3:06:32,200 --> 3:06:44,200 Tesla is not a honey trap. We're demanding, and it's like, you're going to get a lot of shit done, and it's going to be really cool, and it's not going to be easy.

1602 3:06:44,200 --> 3:07:04,200 But if you are a super talented engineer, your talents will be used, I think, to a greater degree than anywhere else. You know, SpaceX is also that way.

1603 3:07:04,200 --> 3:07:16,200 Hi, I have two questions. So both to the autopilot team. So the thing is, I have been following your progress for the past few years. So today you have made changes on the lane detection.

1604 3:07:16,200 --> 3:07:23,200 You said that previously you were doing instance segmentation. Now you guys have built transformer models for building the lanes.

1605 3:07:23,200 --> 3:07:34,200 So what are some other common challenges you guys are facing right now and will be solving in the future, so that we as researchers can start working on those?

1606 3:07:34,200 --> 3:07:42,200 And the second question is, I'm really curious about the data engine. You guys have talked about a case where a car is stopped.

1607 3:07:42,200 --> 3:07:50,200 So how are you finding cases very similar to that in the data you have? A little bit more on the data engine would be great.

1608 3:07:50,200 --> 3:08:01,200 I'll start with the first question using occupancy network as an example. So what you saw in the presentation did not exist a year ago.

1609 3:08:01,200 --> 3:08:06,200 So we only spent one year on it, and we actually shipped more than 12 versions of the occupancy network.

1610 3:08:06,200 --> 3:08:17,200 And to have one foundation model actually to represent the entire physical world around everywhere and in all weather conditions is actually really, really challenging.

1611 3:08:17,200 --> 3:08:30,200 So just over a year ago, we were kind of driving in a 2D world. If there's a wall and there's a curb, we kind of represented both with the same static edge, which is obviously not ideal.

1612 3:08:30,200 --> 3:08:34,200 There's a big difference between a curb and a wall. When you drive, you make different choices.

1613 3:08:34,200 --> 3:08:42,200 So after we realized that, we had to go to 3D. We had to basically rethink the entire problem and think about how we address that.

1614 3:08:42,200 --> 3:08:51,200 So this would be one example of the challenges we have conquered in the past year.
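The curb-versus-wall point is easy to see in miniature: in a top-down 2D map, a low curb and a tall wall project to the same occupied cells, but a 3D occupancy grid keeps the height information that lets a planner treat them differently. This toy grid (sizes, cell resolution, and heights are all arbitrary assumptions) illustrates the idea only, not Tesla's occupancy network.

```python
# Toy voxel occupancy grid: a 15 cm curb and a 2 m wall look identical in
# a 2D projection but differ in 3D. All dimensions are illustrative.
import numpy as np

Z_CELLS, CELL_M = 8, 0.25          # 8 vertical cells of 25 cm each

def occupy_column(grid, x, y, height_m):
    """Mark voxels occupied up to `height_m` at column (x, y)."""
    grid[x, y, : int(np.ceil(height_m / CELL_M))] = True

grid = np.zeros((4, 4, Z_CELLS), dtype=bool)
occupy_column(grid, 0, 0, 0.15)    # curb
occupy_column(grid, 3, 3, 2.00)    # wall

# 2D projection: both columns just read "occupied" -- the same static edge.
print(grid[0, 0].any(), grid[3, 3].any())   # True True
# 3D view: the occupied heights differ, so the planner can make different choices.
print(grid[0, 0].sum(), grid[3, 3].sum())   # 1 8
```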

1615 3:08:51,200 --> 3:08:58,200 Yeah, to answer the question about how we actually source examples of those tricky stopped cars, there's a few ways to go about this.

1616 3:08:58,200 --> 3:09:03,200 But two examples are one, we can trigger for disagreements within our signals.

1617 3:09:03,200 --> 3:09:08,200 So let's say the parked bit flickers between parked and driving. We'll trigger on that and get the clip back.

1618 3:09:08,200 --> 3:09:16,200 And the second is we can leverage more of the shadow mode logic. So if the customer ignores the car, but we think we should stop for it, we'll get that data back too.

1619 3:09:16,200 --> 3:09:25,200 So these are just various kinds of trigger logic that allow us to get that data back.
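The two triggers described, a flickering prediction and a shadow-mode disagreement, can be sketched in a few lines. Everything here (the clip representation, thresholds, function names) is a made-up illustration of the idea, not Tesla's real data engine.

```python
# Toy sketch of the two trigger ideas above: (1) flag clips where a predicted
# attribute flickers frame-to-frame, (2) flag shadow-mode disagreements where
# the planner wanted to stop but the driver did not. Illustrative only.

def flicker_trigger(parked_flags, max_flips=2):
    """Fire when the 'parked' bit changes value too often across frames."""
    flips = sum(1 for a, b in zip(parked_flags, parked_flags[1:]) if a != b)
    return flips > max_flips

def shadow_disagreement_trigger(planner_wants_stop, driver_stopped):
    """Fire when the shadow planner would stop but the human driver did not."""
    return planner_wants_stop and not driver_stopped

clip = [True, False, True, True, False, True]   # flickering parked bit
print(flicker_trigger(clip))                    # True -> upload this clip
print(shadow_disagreement_trigger(True, False)) # True -> upload this clip
```

Either trigger firing marks the clip for upload, which is how rare, tricky cases get pulled out of an otherwise unremarkable fleet data stream.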

1620 3:09:25,200 --> 3:09:30,200 Hi. Thank you for the amazing presentation. Thanks so much.

1621 3:09:30,200 --> 3:09:40,200 So there are a lot of companies that are focusing on the AGI problem. And one of the reasons why it's such a hard problem is because the problem itself is so hard to define.

1622 3:09:40,200 --> 3:09:44,200 Several companies have several different definitions. They focus on different things.

1623 3:09:44,200 --> 3:09:51,200 So what is Tesla, how is Tesla defining the AGI problem? And what are you focusing on specifically?

1624 3:09:51,200 --> 3:10:03,200 Well, we're not actually specifically focused on AGI. I'm simply saying that AGI seems likely to be an emergent property of what we're doing.

1625 3:10:03,200 --> 3:10:18,200 Because we're creating all these autonomous cars and autonomous humanoids, with a truly gigantic data stream that's coming in and being processed.

1626 3:10:18,200 --> 3:10:30,200 It's by far the most real-world data, and data you can't get by just searching the Internet, because you have to be out there in the world, interacting with people and interacting with the roads.

1627 3:10:30,200 --> 3:10:35,200 And just, you know, Earth is a big place and reality is messy and complicated.

1628 3:10:35,200 --> 3:10:51,200 So I think it just seems likely to be an emergent property if you've got, you know, tens or hundreds of millions of autonomous vehicles and maybe even a comparable number of humanoids, maybe more than that on the humanoid front.

1629 3:10:51,200 --> 3:11:16,200 Well, that's just the most data. And if that video is being processed, it just seems likely that, you know, the cars will definitely get way better than human drivers, and the humanoid robots will become increasingly indistinguishable from humans, perhaps.

1630 3:11:16,200 --> 3:11:27,200 And so then, like I said, you have this emergent property of AGI.

1631 3:11:27,200 --> 3:11:36,200 And arguably, you know, humans collectively are sort of a superintelligence as well, especially as we improve the data rate between humans.

1632 3:11:36,200 --> 3:11:53,200 I mean, way back in the early days, the Internet was like humanity acquiring a nervous system, where now all of a sudden any one element of humanity could know all of the knowledge of humans by connecting to the Internet.

1633 3:11:53,200 --> 3:11:56,200 Almost all knowledge, or certainly a huge part of it.

1634 3:11:56,200 --> 3:12:09,200 Previously, we would exchange information by osmosis: in order to transfer data, you would have to write a letter, and someone would have to carry that letter from one person to another.

1635 3:12:09,200 --> 3:12:19,200 And there were a whole bunch of steps in between. It was, yeah, I mean, insanely slow when you think about it.

1636 3:12:19,200 --> 3:12:26,200 And even if you were in the Library of Congress, you still didn't have access to all the world's information. You certainly couldn't search it.

1637 3:12:26,200 --> 3:12:30,200 And obviously, very few people are in the Library of Congress.

1638 3:12:30,200 --> 3:12:48,200 So I mean, in terms of great equalizing forces, the Internet has been the biggest equalizer in history in terms of access to information and knowledge.

1639 3:12:48,200 --> 3:12:58,200 And I think any student of history would agree with this, because if you go back a thousand years, there were very few books, and books were incredibly expensive.

1640 3:12:58,200 --> 3:13:03,200 Only a few people knew how to read, and an even smaller number of people even had a book.

1641 3:13:03,200 --> 3:13:11,200 Now look at it: you can access any book instantly. You can learn anything, basically for free.

1642 3:13:11,200 --> 3:13:13,200 It's pretty incredible.

1643 3:13:13,200 --> 3:13:25,200 So, you know, I was asked recently what period of history I would most prefer to live in.

1644 3:13:25,200 --> 3:13:33,200 And my answer was right now. This is the most interesting time in history. And I read a lot of history.

1645 3:13:33,200 --> 3:13:39,200 So, you know, let's do our best to keep that going. Yeah.

1646 3:13:39,200 --> 3:13:56,200 And to go back to one of the earlier questions: the thing that's happened over time with respect to Tesla Autopilot is that the neural nets have gradually absorbed more and more software.

1647 3:13:56,200 --> 3:14:09,200 And in the limit, of course, you could simply take the videos as seen by the car and compare those to the steering inputs from the steering wheel and pedals, which are very simple inputs.

1648 3:14:09,200 --> 3:14:30,200 And in principle, you could train with nothing in between, because that's what humans are doing with a biological neural net. You could train based on video, where what trains against the video is the movement of the steering wheel and the pedals, with no other software in between.

1649 3:14:30,200 --> 3:14:36,200 We're not there yet, but it's gradually going in that direction.
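The limit case Elon describes, video in, steering and pedal commands out, with nothing hand-written in between, is essentially behavior cloning. Here is a deliberately tiny numpy sketch: a linear model fit by least squares from flattened "frames" to [steering, throttle] targets. Real systems use deep networks and vastly more data; every detail here (shapes, the linear policy, the data itself) is an illustrative assumption.

```python
# Toy behavior cloning: learn a mapping from frames to driver controls with
# no intermediate hand-written software, as described above. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(256, 48))      # 256 "frames", 48 pixels each
true_policy = rng.normal(size=(48, 2))   # hidden mapping to [steer, throttle]
controls = frames @ true_policy          # the human driver's demonstrated actions

# Fit controls directly from frames: the training signal is just the
# steering wheel and pedal movements paired with the video.
learned_policy, *_ = np.linalg.lstsq(frames, controls, rcond=None)
error = np.abs(frames @ learned_policy - controls).max()
print(error < 1e-6)  # True: the clone reproduces the demonstrated actions
```

In this noiseless linear toy the clone recovers the policy exactly; the hard part in practice is that the real mapping is deep, the data is noisy, and rare situations dominate the remaining errors.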

1650 3:14:36,200 --> 3:14:42,200 All right. Last question.

1651 3:14:42,200 --> 3:14:44,200 I think we got a question at the front here.

1652 3:14:44,200 --> 3:14:46,200 Hello there right there.

1653 3:14:46,200 --> 3:14:50,200 I will do two questions. Fine.

1654 3:14:50,200 --> 3:14:52,200 Hi, thanks for such a great presentation.

1655 3:14:52,200 --> 3:14:54,200 We'll do your question last.

1656 3:14:54,200 --> 3:15:17,200 Okay, cool. With FSD being used by so many people, how do you evaluate the company's risk tolerance in terms of performance statistics? And do you think there needs to be more transparency or regulation from third parties as to what's good enough, defining thresholds for performance across so many miles?

1657 3:15:17,200 --> 3:15:22,200 Sure, well, the, you know,

1658 3:15:22,200 --> 3:15:26,200 number one design requirement at Tesla is safety.

1659 3:15:26,200 --> 3:15:32,200 And that goes across the board. So, in terms of the mechanical safety of the car:

1660 3:15:32,200 --> 3:15:45,200 We have the lowest probability of injury of any cars ever tested by the government, for just passive mechanical safety, essentially crash structure and airbags and whatnot.

1661 3:15:45,200 --> 3:15:52,200 We have the highest rating for active safety as well.

1662 3:15:52,200 --> 3:16:02,200 And we're going to get to the point where the active safety is so ridiculously good, it's just absurdly better than a human.

1663 3:16:02,200 --> 3:16:23,200 And with respect to Autopilot, we do publish, broadly speaking, the statistics on miles driven: Tesla cars with no autonomy, cars with Hardware 1, Hardware 2, Hardware 3, and then the ones that are in FSD Beta.

1664 3:16:23,200 --> 3:16:27,200 And we see steady improvements all along the way.

1665 3:16:27,200 --> 3:16:42,200 And, you know, sometimes there's this dichotomy of: should you wait until the car is, like, three times safer than a person before deploying any technology? But I think that is actually morally wrong.

1666 3:16:42,200 --> 3:16:51,200 At the point at which you believe that adding autonomy reduces injury and death,

1667 3:16:51,200 --> 3:17:03,200 I think you have a moral obligation to deploy it, even though you're going to get sued and blamed by a lot of people, because the people whose lives you saved don't know that their lives are saved.

1668 3:17:03,200 --> 3:17:14,200 And the people who do occasionally die or get injured, they definitely know, or their estate does, that there was, you know, a problem with Autopilot.

1669 3:17:14,200 --> 3:17:23,200 That's why you have to look at the numbers in total: miles driven, how many accidents occurred, how many accidents were serious, how many fatalities.

1670 3:17:23,200 --> 3:17:29,200 And, you know, we've got well over 3 million cars on the road, so that's a lot of miles driven every day.

1671 3:17:29,200 --> 3:17:38,200 It's not going to be perfect. But what matters is that it is very clearly safer than not deploying it.
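The cohort comparison described here, miles per accident across fleets with different levels of autonomy, reduces to simple arithmetic. The cohort names below mirror the ones mentioned in the answer, but every number is fabricated purely to show the calculation; these are not real Tesla statistics.

```python
# Toy sketch of the published safety comparison: miles per accident by
# cohort. All numbers are made up for illustration.

def miles_per_accident(miles, accidents):
    """Miles driven per recorded accident; infinite if none occurred."""
    return miles / accidents if accidents else float("inf")

cohorts = {                         # (miles driven, accidents) -- fabricated
    "no autonomy": (10_000_000, 25),
    "Hardware 3":  (10_000_000, 12),
    "FSD Beta":    (10_000_000, 6),
}
rates = {name: miles_per_accident(m, a) for name, (m, a) in cohorts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} miles per accident")
```

The "steady improvements" claim is then just that each successive cohort's miles-per-accident figure exceeds the previous one, which is directly checkable from data like this.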

1672 3:17:38,200 --> 3:17:45,200 Yeah. So, I think last question.

1673 3:17:45,200 --> 3:17:53,200 I think, yeah. Thanks. Last question here.

1674 3:17:53,200 --> 3:17:55,200 Okay.

1675 3:17:55,200 --> 3:18:12,200 Okay. Hi. So, I do not work on hardware, so maybe the hardware team can enlighten me. Why is it required that there be symmetry in the design of Optimus? Because humans,

1676 3:18:12,200 --> 3:18:31,200 we have handedness, right? We use some sets of muscles more than others, and over time there is wear and tear. So maybe you'll start to see some joint failures or actuator failures more over time. I understand that this is an extremely early stage.

1677 3:18:31,200 --> 3:18:51,200 Also, we as humans have based so much fantasy and fiction on superhuman capabilities. Like, all of us don't want to walk right over there; we want to extend our arms. We have all these, you know, fantastical designs. So considering everything

1678 3:18:51,200 --> 3:19:11,200 else that is going on in terms of batteries and intensity of compute, maybe you can leverage all those aspects into coming up with something, well, I don't know, more interesting in terms of the robot that you're building, and I'm hoping you're able to explore those

1679 3:19:11,200 --> 3:19:30,200 directions. Yeah, I mean, I think it would be cool to, you know, make Inspector Gadget real. That would be pretty sweet. So, yeah, right now we just want to make a basic humanoid that works well, and our goal is the fastest path to a useful humanoid robot.

1680 3:19:30,200 --> 3:19:49,200 I think this will ground us in reality, literally, and ensure that we are doing something useful. One of the hardest things to do is to actually be useful, and then to have high utility, the area under the curve of how many people

1681 3:19:49,200 --> 3:20:08,200 you helped, you know, and how much help you provided to each person on average, the total utility. Trying to actually ship a useful product that people like to a large number of people is so insanely hard,

1682 3:20:08,200 --> 3:20:28,200 it boggles the mind. You know, I can say, like, man, there's a hell of a difference between a company that has shipped product and one that has not shipped product. Again, it's night and day. And then even once you ship product, can you make the value of the output worth more than the cost of the input? Which is, again, insanely difficult, especially with hardware.

1683 3:20:28,200 --> 3:20:57,200 But I think over time it would be cool to do creative things and have, like, eight arms and whatever, and have different versions. And maybe, you know, there'll be some companies that are able to add hardware to an Optimus. Like maybe we, you know, add a power port or something like that, and you can add attachments to your Optimus like you can add them to your phone.

1684 3:20:57,200 --> 3:21:18,200 There could be a lot of cool things done over time, and there could maybe be an ecosystem of small companies, or big companies, that make add-ons for Optimus. So with that, I'd like to thank the team for their hard work. You guys are awesome.

1685 3:21:18,200 --> 3:21:33,200 And thank you all for coming, and for everyone online, thanks for tuning in. And I think this will be one of those great videos where you can, like, fast forward to the bits that you find most interesting.

1686 3:21:33,200 --> 3:21:48,200 But we try to give you a tremendous amount of detail, literally so that you can look at the video at your leisure and you can focus on the parts that you find interesting and skip the other parts. So thank you all. And we'll do this, try to do this every year.

1687 3:21:48,200 --> 3:22:04,200 And we might do a monthly podcast even. So, but I think it'd be, you know, great to sort of bring you along for the ride and like show you what cool things are happening. And yeah, thank you. All right. Thanks.

1688 3:22:18,200 --> 3:22:33,200 Thanks.
