How we teach computers to understand pictures | Fei Fei Li

Let me show you something. (Video) Girl: Okay, that’s a cat
sitting in a bed. The boy is petting the elephant. Those are people
that are going on an airplane. That’s a big airplane. Fei-Fei Li: This is
a three-year-old child describing what she sees
in a series of photos. She might still have a lot
to learn about this world, but she’s already an expert
at one very important task: to make sense of what she sees. Our society is more
technologically advanced than ever. We send people to the moon,
we make phones that talk to us or customize radio stations
that can play only music we like. Yet, our most advanced
machines and computers still struggle at this task. So I’m here today
to give you a progress report on the latest advances
in our research in computer vision, one of the most cutting-edge and potentially revolutionary technologies in computer science. Yes, we have prototyped cars
that can drive by themselves, but without smart vision,
they cannot really tell the difference between a crumpled paper bag
on the road, which can be run over, and a rock that size,
which should be avoided. We have made fabulous megapixel cameras, but we have not delivered
sight to the blind. Drones can fly over vast stretches of land, but don’t have enough vision technology to help us to track
the changes of the rainforests. Security cameras are everywhere, but they do not alert us when a child
is drowning in a swimming pool. Photos and videos are becoming
an integral part of global life. They’re being generated at a pace
that’s far beyond what any human, or teams of humans, could hope to view, and you and I are contributing
to that at this TED. Yet our most advanced software
is still struggling at understanding and managing this enormous content. So in other words,
collectively as a society, we’re very much blind, because our smartest
machines are still blind. “Why is this so hard?” you may ask. Cameras can take pictures like this one by converting light into
a two-dimensional array of numbers known as pixels, but these are just lifeless numbers. They do not carry meaning in themselves. Just like to hear is not
the same as to listen, to take pictures is not
the same as to see, and by seeing,
we really mean understanding. In fact, it took Mother Nature
540 million years of hard work to do this task, and much of that effort went into developing the visual
processing apparatus of our brains, not the eyes themselves. So vision begins with the eyes, but it truly takes place in the brain. So for 15 years now, starting
from my Ph.D. at Caltech and then leading Stanford’s Vision Lab, I’ve been working with my mentors,
collaborators and students to teach computers to see. Our research field is called
computer vision and machine learning. It’s part of the general field
of artificial intelligence. So ultimately, we want to teach
the machines to see just like we do: naming objects, identifying people,
inferring 3D geometry of things, understanding relations, emotions,
actions and intentions. You and I weave together entire stories
of people, places and things the moment we lay our gaze on them. The first step towards this goal
is to teach a computer to see objects, the building block of the visual world. In its simplest terms,
imagine this teaching process as showing the computers
some training images of a particular object, let’s say cats, and designing a model that learns
from these training images. How hard can this be? After all, a cat is just
a collection of shapes and colors, and this is what we did
in the early days of object modeling. We’d tell the computer algorithm
in a mathematical language that a cat has a round face,
a chubby body, two pointy ears, and a long tail, and that looked all fine. But what about this cat? (Laughter) It’s all curled up. Now you have to add another shape
and viewpoint to the object model. But what if cats are hidden? What about these silly cats? Now you get my point. Even something as simple
as a household pet can present an infinite number
of variations to the object model, and that’s just one object. So about eight years ago, a very simple and profound observation
changed my thinking. No one tells a child how to see, especially in the early years. They learn this through
real-world experiences and examples. If you consider a child’s eyes as a pair of biological cameras, they take one picture
about every 200 milliseconds, the average time an eye movement is made. So by age three, a child would have seen
hundreds of millions of pictures of the real world. That’s a lot of training examples. So instead of focusing solely
on better and better algorithms, my insight was to give the algorithms
the kind of training data that a child was given through experiences in both quantity and quality. Once we knew this, we knew we needed to collect a data set that has far more images
than we have ever had before, perhaps thousands of times more, and together with Professor
Kai Li at Princeton University, we launched the ImageNet project in 2007. Luckily, we didn’t have to mount
a camera on our head and wait for many years. We went to the Internet, the biggest treasure trove of pictures
that humans have ever created. We downloaded nearly a billion images and used crowdsourcing technology
like the Amazon Mechanical Turk platform to help us to label these images. At its peak, ImageNet was one of
the biggest employers of the Amazon Mechanical Turk workers: together, almost 50,000 workers from 167 countries around the world helped us to clean, sort and label nearly a billion candidate images. That was how much effort it took to capture even a fraction
of the imagery a child’s mind takes in
in the early developmental years. In hindsight, this idea of using big data to train computer algorithms
may seem obvious now, but back in 2007, it was not so obvious. We were fairly alone on this journey
for quite a while. Some very friendly colleagues advised me
to do something more useful for my tenure, and we were constantly struggling
for research funding. Once, I even joked to my graduate students that I would just reopen
my dry cleaner’s shop to fund ImageNet. After all, that’s how I funded
my college years. So we carried on. In 2009, the ImageNet project delivered a database of 15 million images across 22,000 classes
of objects and things organized by everyday English words. In both quantity and quality, this was an unprecedented scale. As an example, in the case of cats, we have more than 62,000 cats of all kinds of looks and poses and across all species
of domestic and wild cats. We were thrilled
to have put together ImageNet, and we wanted the whole research world
to benefit from it, so in the TED fashion,
we opened up the entire data set to the worldwide
research community for free. (Applause) Now that we have the data
to nourish our computer brain, we’re ready to come back
to the algorithms themselves. As it turned out, the wealth
of information provided by ImageNet was a perfect match to a particular class
of machine learning algorithms called convolutional neural networks, pioneered by Kunihiko Fukushima,
Geoff Hinton, and Yann LeCun back in the 1970s and ’80s. Just like the brain consists
of billions of highly connected neurons, a basic operating unit in a neural network is a neuron-like node. It takes input from other nodes and sends output to others. Moreover, these hundreds of thousands
or even millions of nodes are organized in hierarchical layers, also similar to the brain. A typical neural network we use to train our object recognition model has 24 million nodes, 140 million parameters, and 15 billion connections. That’s an enormous model. Powered by the massive data from ImageNet and the modern CPUs and GPUs
to train such a humongous model, the convolutional neural network blossomed in a way that no one expected. It became the winning architecture to generate exciting new results
in object recognition. This is a computer telling us this picture contains a cat and where the cat is. Of course there are more things than cats, so here’s a computer algorithm telling us the picture contains
a boy and a teddy bear; a dog, a person, and a small kite
in the background; or a picture of very busy things like a man, a skateboard,
railings, a lamppost, and so on. Sometimes, when the computer
is not so confident about what it sees, we have taught it to be smart enough to give us a safe answer
instead of committing too much, just like we would do, but other times our computer algorithm
is remarkable at telling us what exactly the objects are, like the make, model, year of the cars. We applied this algorithm to millions
of Google Street View images across hundreds of American cities, and we have learned something
really interesting: first, it confirmed our common wisdom that car prices correlate very well with household incomes. But surprisingly, car prices
also correlate well with crime rates in cities, or voting patterns by zip codes. So wait a minute. Is that it? Has the computer already matched
or even surpassed human capabilities? Not so fast. So far, we have just taught
the computer to see objects. This is like a small child
learning to utter a few nouns. It’s an incredible accomplishment, but it’s only the first step. Soon, another developmental
milestone will be hit, and children begin
to communicate in sentences. So instead of saying
this is a cat in the picture, you already heard the little girl
telling us this is a cat lying on a bed. So to teach a computer
to see a picture and generate sentences, the marriage between big data
and machine learning algorithms has to take another step. Now, the computer has to learn
from both pictures as well as natural language sentences generated by humans. Just like the brain integrates
vision and language, we developed a model
that connects parts of visual things like visual snippets with words and phrases in sentences. About four months ago, we finally tied all this together and produced one of the first
computer vision models that is capable of generating
a human-like sentence when it sees a picture for the first time. Now, I’m ready to show you
what the computer says when it sees the picture that the little girl saw
at the beginning of this talk. (Video) Computer: A man is standing
next to an elephant. A large airplane sitting on top
of an airport runway. FFL: Of course, we’re still working hard
to improve our algorithms, and they still have a lot to learn. (Applause) And the computer still makes mistakes. (Video) Computer: A cat lying
on a bed in a blanket. FFL: So of course, when it sees
too many cats, it thinks everything
might look like a cat. (Video) Computer: A young boy
is holding a baseball bat. (Laughter) FFL: Or, if it hasn’t seen a toothbrush,
it confuses it with a baseball bat. (Video) Computer: A man riding a horse
down a street next to a building. (Laughter) FFL: We haven’t taught Art 101
to the computers. (Video) Computer: A zebra standing
in a field of grass. FFL: And it hasn’t learned to appreciate
the stunning beauty of nature like you and I do. So it has been a long journey. To get from age zero to three was hard. The real challenge is to go
from three to 13 and far beyond. Let me remind you with this picture
of the boy and the cake again. So far, we have taught
the computer to see objects or even tell us a simple story
when seeing a picture. (Video) Computer: A person sitting
at a table with a cake. FFL: But there’s so much more
to this picture than just a person and a cake. What the computer doesn’t see
is that this is a special Italian cake that’s only served during Easter time. The boy is wearing his favorite t-shirt given to him as a gift by his father
after a trip to Sydney, and you and I can all tell how happy he is and what’s exactly on his mind
at that moment. This is my son Leo. On my quest for visual intelligence, I think of Leo constantly and the future world he will live in. When machines can see, doctors and nurses will have
extra pairs of tireless eyes to help them to diagnose
and take care of patients. Cars will run smarter
and safer on the road. Robots, not just humans, will help us to brave the disaster zones
to save the trapped and wounded. We will discover new species,
better materials, and explore unseen frontiers
with the help of the machines. Little by little, we’re giving sight
to the machines. First, we teach them to see. Then, they help us to see better. For the first time, human eyes
won’t be the only ones pondering and exploring our world. We will not only use the machines
for their intelligence, we will also collaborate with them
in ways that we cannot even imagine. This is my quest: to give computers visual intelligence and to create a better future
for Leo and for the world. Thank you. (Applause)
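
The technical steps the talk sketches can be illustrated in miniature: a picture, to a computer, is a two-dimensional array of pixel numbers; a convolutional neural network repeats one basic operation, sliding a small filter over that array; and when the model is not confident, it gives a "safe answer" rather than committing too much. The sketch below is a toy illustration in plain Python, not the ImageNet-scale model from the talk; the image values, edge filter, labels, and confidence threshold are all invented for the example.

```python
import math

# A picture, to a computer, is just a 2-D array of numbers (pixels).
image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 9, 0],
    [0, 9, 9, 9, 0],
    [0, 9, 9, 9, 0],
    [0, 0, 0, 0, 0],
]

# One 3x3 filter: the basic operation a convolutional layer repeats
# many times over, here responding to vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def convolve(img, k):
    """Valid convolution of a 2-D image with a 3x3 kernel."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - 2):
        row = []
        for j in range(w - 2):
            row.append(sum(img[i + di][j + dj] * k[di][dj]
                           for di in range(3) for dj in range(3)))
        out.append(row)
    return out

def softmax(scores):
    """Turn raw class scores into probabilities."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def safe_label(scores, labels, threshold=0.8):
    """Return a label only when the model is confident; otherwise
    give the 'safe answer' the talk mentions."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else "not sure"

feature_map = convolve(image, kernel)
print(feature_map[0])  # -> [18, 0, -18]: strong responses at the edges

labels = ["cat", "dog", "teddy bear"]
print(safe_label([4.0, 0.5, 0.2], labels))  # confident -> "cat"
print(safe_label([1.0, 0.9, 0.8], labels))  # uncertain -> "not sure"
```

A real system replaces the hand-written filter with millions of learned parameters stacked in hierarchical layers, but the pixel array, the sliding filter, and the thresholded answer are the same ingredients.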


  • Everything Is Just Beginning

    I like her Chinese style English, haha

  • Yibrah Fisseha

    I love this TED talk

  • Cezzar Ahmet Paşa

    It was a good talk.

  • L Neve

    step 1 : recognize objects, step 2 : recognize emotions, step 3 : get rid of every human being. SKYNET

  • metal garurumon

    That's unique dress

  • 김정현

    This is pretty cool

  • Zijian Xi

    We didn't learn from pictures. We learned from the real world. And computers need to learn from video.

  • Yuzhen Wang

    Her mouth is crooked, but why has she been so successful? Is it from investing everything in education?

  • Mahdi Safari

    very nice

  • gurdyal singh

    So nice

  • Aditya Gupta

    Our computers are still like a kid which learns from machine learning..

  • Sons of Seven

    Way of expressing ….nxt level

  • 三块手表

    Fei-Fei Li? Isn't she that Chinese tech spy at Google?

  • Mo Kiani

    Great talk. Forgot to mention how scary this can be too though. I wonder what Orwell would have to say.

  • Hari Shankar

    Awesome awesome awesome awesome awesome!

  • Liu Bo

    why can I know Miss Li English? who can tell me >:

  • Timur Lord's Beloved

    They should take YouTube videos and record them with a screen-capture camera. Videos usually have a frame rate of 50 frames per second. The frames are correlated to each other in time and meaning. People actually have not only spatial but temporal memory – that is why they can recognize objects by how they move or behave. But videos on YouTube would also have to be captured with a 3D camera, because a human has 2 eyes for this reason, though he could see with one.


    Nowadays I rarely watch the full video but ones like this put perspectives in my mind

  • Hello People

    It can be useful for learning a foreign language
    You just show tons of pictures to the computer
    Then it tells you what it is
    And step by step your brain will start to understand the language.
    It will help to spread the English language

  • Lakes J

    Doesn't Fei Fei mean No No…

  • jxixi Kiki

    she is smart.

  • Han Shen

    I am very interested in technology, but I really would like to know how to create something like this.. and what should I learn? I have so many questions in my mind

  • Ritik Sinha

    Did someone notice the weird patterns on her dress XD

  • Sunny D

    She is no doubt a brilliant scientist. What she and her team have done is absolutely wonderful. But in her presentation, she barely showed her excitement about her work or achievements. She said that she was thrilled, but she surely didn’t give me the impression of being thrilled. Maybe she is not as brilliant a presenter as a scientist. But that’s totally understandable. Her scientific work still inspires people.

  • Sumit Dev

    Science is made for cat only

  • Bo Wang

    hehe “computer science”

  • mazahir ali

    nice work keep it up because it's development is on trending in future

  • Miguellina Bonhomme

    Beautiful machinery … but … would we be reduced to our visual capacity and our intellect?!!

  • Expect

    Computers are what humans call the anti christ. Just saying

  • David X

    spy of China

  • Ashraf Osman

    Respect .. love sharing the valuable info in an honest manner proving that although the road is long, humankind is making the best out of the accumulated knowledge

  • Justin Feng

    I'm afraid we humans only see the positive side right now. Danger is hiding.

  • Hüseyin Dama

    I think nothing can reach our perception level. We humans not only see, but also feel many things, as if we are fed by an unknown source

  • Hopi Ng

    Teach it to “want” to learn. Then it will. But there you open a box that should be closed.

    Sad, what we have done.

  • Harry B

    You don't need images, you need movies, in multimedia, not only with pixels but with voxels and lots of other external or internal sensors.
    Then you need to abstract them, by creating the intersection of their property sets, but not only in pixels or voxels but in spacetime, to include the notion of movements and ongoing changes.
    If you miss any guiding thread, you can find it along the values for the observer.

  • Yi Zhang

    For those scientists or engineers whose mother tongue is not English, while they are trying their best to improve in their profession, they have to spend time to polish their English. So far Feifei Li has done both pretty well. She's really brilliant!

  • eazon

    I can NOT stop staring at her teeth.

  • Yi Zhang

    The first round of applause comes just after the phrase "for free". This might have a lot of implications.

  • Nirmal V

    Cat is the hardest part try with some other animal.

  • Uslu Smart

    Image processing and artificial intelligence will be among the most important sectors of the future..

  • Umar Ibn Ali

    Give her a standing ovation you peasants! 😂

    Those of us working with AI, be it Machine Learning, Data Science, Computer Vision or NLP know that her work is unprecedented.

  • rajkumar m

    i am extremely happy for having presented myself with all these world-class scholars, of course not personally. I strongly believe that knowledge is to share, not to store. Joining this group will certainly improve one's intelligence in cyberspace. I WISH THAT 2019 FCT WILL BE ANOTHER LANDMARK IN TECHNOLOGY dear sirs…

  • CS Mania

    thanks a lot, but also think about the hereafter and what will happen in the hereafter.

  • Narendra Parmar

    Good doing 😆

  • Sunny shah

    Well let's hope these advanced machines never fall into the wrong hands,
    a machine doesn't know whether it is being used for good or evil, it just does what it is told to do.
    If the algorithms are open source, we already know a lot of bad people are going to get crazy ideas about what to do with this kind of technology.

  • fabts4

    So, here's to summing-up 18 minutes into 2 seconds:
    Computers use a neural network

  • Ay Bee

    Too many people hide behind technology in order to escape the real world, while believing they are the only ones living in it. Some even believe they were chosen to be at the forefront of some kind of universal, evolutionary change or shift in human consciousness, and that the technological progress they are making is going to help humanity in the longer term. Unfortunately for the non-technologists, who are just going 'along for the ride', those working on technologies like AI rarely ever divulge their ultimate goal or true intent for what they are working toward. AI technologists and those wanting to 'humanise' AI are walking a very thin line between their own selfish desire to experiment, play-with and shape the future of AI and the potential for the final AI embodiment to enslave us all. Do any of them truly reflect on the longer term ramifications of their actions?

  • Chien Tran

    What a great mom!

  • Javier Ramirez

    I thought this was Fei Fei the DJ!🤔🤔🤔🤔

  • Grease quala

    man she gives me a headache, why is she shouting, she has a mic for god's sake

  • Chao Yu

    I wish I could have some computer knowledge:)

  • Hao Hu

    Nah, such a failed experiment: you aren't trying to teach them to learn, instead of just feeding them data and taking guesses. As part of AI, theoretically, the machines are supposed to learn based on whatever the information is. I mean, this is not what we call learning, just copying and pasting. Not even close to the entry level of human AI.

  • MegaBezymec

    Thanks, you damn Americans, for creating SkyNet 🙂 This world is deeply screwed anyway, so you're on the right track)))

  • Diego Castañeda

    computers are not the same as humans. You are right that this can help us in the short term, but it will definitely take over from humans and end humanity in the long term, because they read so much information in a split second, never get tired nor sleep, and are not like humans. They will definitely be better than us; they just do not want you to believe it or worry about it :/

  • Adolfo Usier

    love the channel 📸

  • Felix C

    TED: Become Human

  • Mihar Bess

    I get her point but her speech was full of stupidly used words. We don't have to give machines lives or feelings, we have to program them to serve our needs!!! Next thing she'll be talking about Machine Rights haha

  • GeneralTerzX

    The prequel to the Matrix trilogy 🙈

  • happios

    Like Andrew Yang said AI is going to take over jobs. Better secure your freedom dividend

  • Darrell Gunter

    Excellent overview of how AI can help us see the depth and context of a picture.

  • boshihou100

    There is something wrong in her statement. My 2-year-old daughter can recognize a cat even if she has just seen one. Why? Because there is logic built up in the human brain.


    Why is she crying…🤪

  • JFS

    Good video, thanks. However 4 years later, same recognition problems, same mistakes. I guess the big data hype should be over soon. Yet another winter is coming :).

  • Jin Sun

    You have billions of pictures, you feed them to the brain of a block of wood, you get nothing out of it. The current AI is only slightly better than a block of wood in terms of human intelligence. A 3-year-old only needs to see a handful of pictures of a cat to be able to identify nearly all of them in all possible poses. You see the difference here. The current naively simple AI models can never match the abstraction abilities of human intelligence. It is a dead end. It is going to get harder and harder until it goes out of fashion.

  • 宁远

    She is one of the most influential researchers in the area of AI. I would do anything to be her PhD student

  • Gustavo Schroeder Sapelli

    Wonderful!!! Great talk!!!

  • Sunday Seance

    poor speaker. she has a blocked nose, probably needs surgery

  • Isuru Nanayakkara

    Man, this is amazing. Outstanding work! Props to her and everyone involved for their incredible efforts.

  • Chad X

    AI : making efforts in the space that is trapped in the dimensions that is totally fictional against the real world and its mechanisms

  • Pavan Kumar

    NOTE: Please do not teach the computers…..
    If we make computers think, walk, fight, etc… then afterwards we can't control them… because they're machines.
    Example: We are humans, right, we can learn, teach, eat, walk, run, fight, love, etc… In the universe we know what to do, how to think, how to act, but whenever a machine acts like a human and thinks like a human, then they will kill us…….because a machine is 1000 times better than a human and more powerful too….
    "We are humans, we can learn anything, we can do and invent so many things; whenever machines think, we can't stay alive in this world."


    Correct, many don't realize

  • Stewie Griffin

    good speech.
    But I think we will never achieve the projected goal.

  • Jade Z

    Are you in China now? 😄😄😂!

  • Be Br

    That was 2015. A slow start… and 3 years later we had YOLO (look here for the TED talk about the next generation of the technology: )

  • Dennis Allard

    Very boring presentation, more of an advertisement for TED. The title of this video is not accurate since the presentation does not explain how the recognition occurs.

  • parth patel

    I wish to join you and your teams in this machine learning process

  • Sanam Pudasaini

    51,890 citations.. OMG…

  • jamal nawaz

    Really amazing….

  • MrNouraiz

    so she basically used the maximum time for a TED talk (18 mins), incredible; pioneer in image classification and mentor of Karpathy

  • RPG STRIKE رب ج سترايك

    God says we have created you from mud.
    We humans can never create something like a child's brain, for example the other senses like hearing, smelling, feeling, tasting and interacting, exploring….. a long list of miracles around us. We can only wonder how gracious God is to us, and I assure you now this woman sees a miracle in every surrounding object or logic. Peace Salam alhamdulillah 🙂

  • peng peng

    This is impressive. She is a towering figure in the field of AI.

  • S Y

    Or this is the trailer that the Matrix prepares us for before the reboot

  • Ni Sun

    This is a great contribution! We can see how much effort Li Fei Fei and her lab put in!

  • Hu Go

    I know it's off topic.. This is a very complex process. But we can understand pictures easily.
    We can not yet code an ai to understand pictures but "evolution" did it billions of years ago.
    As we try to mimic human behaviors on computers we'll understand more and more that creation can not be replaced by evolution. Thanks for reading.

  • Alexandr Sheludko

    Thanks to all the people who make our world better. It is owing to persons like Fei-Fei Li that we have all the technical advantages and knowledge we have now. Without those people we might still be living in caves and hunting mammoths. Thanks a lot again.

  • ANU_G channel

    Your speech is sufficiently clear to listen to and understand, which enables better learning. Thanks, congrats and all good wishes to you too.

  • Md. Hasibur Rahman

    It was a great idea to compare a computer with a child who is learning the world.

