Tench: When data is messy

Category: IT Technology · Published: 4 years ago

There’s a story I tell in my book because it’s a great illustration of how AI gets the wrong idea about what problem we’re asking it to solve:

Researchers at the University of Tuebingen trained a neural net to recognize images, and then had it point out which parts of the images were the most important for its decision. When they asked it to highlight the most important pixels for the category “tench” (a kind of fish), this is what it highlighted:

Human fingers against a green background!

Why was it looking for human fingers when it was supposed to be looking for a fish? It turns out that most of the tench pictures the neural net had seen were of people holding the fish as a trophy. It doesn’t have any context for what a tench actually is, so it assumes the fingers are part of the fish.
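To make "point out which parts of the image mattered" a bit more concrete, here is a minimal sketch of one simple way to do it, called occlusion sensitivity: blank out one patch at a time and see how much the score drops. This is not necessarily the method the Tuebingen researchers used, and `toy_tench_score` is a made-up stand-in for a trained neural net that has (wrongly) learned to key on the fingers.

```python
import numpy as np

def occlusion_map(score_fn, image, patch=4, fill=0.0):
    """Return a grid of score drops, one per patch that gets blanked out."""
    h, w = image.shape[:2]
    base = score_fn(image)
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            # A big drop means the model was relying on this patch.
            heat[i // patch, j // patch] = base - score_fn(occluded)
    return heat

def toy_tench_score(image):
    """Hypothetical 'classifier': only looks at the lower-left corner."""
    return float(image[12:16, 0:4].mean())

# Toy 16x16 grayscale image (made-up data, not from ImageNet).
image = np.zeros((16, 16))
image[4:8, 6:12] = 0.5    # the "fish"
image[12:16, 0:4] = 1.0   # the "fingers"

heat = occlusion_map(toy_tench_score, image)
hottest = np.unravel_index(np.argmax(heat), heat.shape)
print("most important patch (row, col):", hottest)
```

In this toy setup the hottest patch is the one covering the "fingers", and covering the "fish" changes nothing, which is the same kind of giveaway as the highlighted tench image.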

The image-generating neural net in ArtBreeder (called BigGAN) was also trained on the same dataset, called ImageNet, and when you ask it to generate tenches, this is what it does:

The humans are much more distinct than the fish, and I’m fascinated by the highly exaggerated human fingers.

There are other categories in ImageNet that have similar problems. Here’s “microphone”.

It’s figured out about dramatic stage lighting and human forms, but many of its images don’t contain anything that remotely resembles a microphone. In so many of its training pictures the microphone is a tiny part of the image, easy to overlook. There are similar problems with small instruments like “flute” and “oboe”.
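One way to spot categories like this before training is to check how much of each picture the labeled object actually takes up. The sketch below assumes you have bounding boxes for the labeled object, which not every dataset provides; the annotations, the 5% threshold, and the function names are all made up for illustration.

```python
from collections import defaultdict
from statistics import median

def area_fraction(box, image_size):
    """Fraction of the image covered by an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    width, height = image_size
    return max(0, x1 - x0) * max(0, y1 - y0) / (width * height)

def tiny_object_labels(annotations, threshold=0.05):
    """Labels whose median object covers less than `threshold` of the image."""
    fractions = defaultdict(list)
    for label, box, image_size in annotations:
        fractions[label].append(area_fraction(box, image_size))
    return sorted(label for label, values in fractions.items()
                  if median(values) < threshold)

# Hypothetical annotations: (label, bounding box, image size in pixels).
annotations = [
    ("microphone", (300, 120, 320, 160), (640, 480)),
    ("microphone", (100, 200, 130, 230), (640, 480)),
    ("flute", (50, 240, 400, 260), (640, 480)),
    ("tench", (80, 100, 560, 380), (640, 480)),
    ("tench", (150, 150, 500, 400), (640, 480)),
]

flagged = tiny_object_labels(annotations)
print("labels where the object is a tiny part of the image:", flagged)
```

A flagged label doesn't mean the category is unusable, only that the thing being named is easy for a neural net to overlook in favor of everything around it.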

In other cases, there might be evidence of pictures being mislabeled. In these generated images of “football helmet”, some of them are clearly of people NOT wearing helmets, and a few even look suspiciously like baseball helmets.
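A common first pass at finding possibly mislabeled pictures is to look for cases where a trained model confidently disagrees with the given label, and then have a human check them. Here is a minimal sketch of that idea; the image IDs, the probabilities, and the "baseball helmet" class name are hypothetical, and the 0.9 cutoff is an arbitrary choice.

```python
def suspicious_labels(examples, min_confidence=0.9):
    """Flag examples where the model confidently predicts a different class."""
    flagged = []
    for image_id, given_label, probs in examples:
        predicted = max(probs, key=probs.get)
        if predicted != given_label and probs[predicted] >= min_confidence:
            flagged.append((image_id, given_label, predicted))
    return flagged

# Hypothetical model outputs: (image id, dataset label, class probabilities).
examples = [
    ("img_001", "football helmet",
     {"football helmet": 0.97, "baseball helmet": 0.03}),
    ("img_002", "football helmet",
     {"football helmet": 0.04, "baseball helmet": 0.96}),
    ("img_003", "football helmet",
     {"football helmet": 0.55, "baseball helmet": 0.45}),
]

for image_id, given, predicted in suspicious_labels(examples):
    print(f"{image_id}: labeled {given!r}, model says {predicted!r}")
```

The catch is that the model doing the flagging learned from the same messy data, so this only produces a list for people to review, not a verdict.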

ImageNet is a really messy dataset. It has a category for agama, but none for giraffe. Rather than horse as a category, it has sorrel (a specific color of horse). “Bicycle built for two” is a category, but not skateboard.

A huge reason for ImageNet’s messiness is that it was automatically scraped from images on the internet. The images were supposed to have been filtered by the crowdsourced workers who labeled them, but plenty of weirdness slipped through. And horribleness - many images and labels that definitely shouldn’t have appeared in a general-purpose research dataset, and images that looked like they had gotten there without the consent of the people pictured. After several years of widespread use by the AI community, the ImageNet team has reportedly been removing some of that content. Other problematic datasets - like those scraped from online images without permission, or from surveillance footage - have been removed recently. (Others, like Clearview AI’s, are still in use.)

This week Vinay Prabhu and Abeba Birhane pointed out major problems with another dataset, 80 Million Tiny Images, which scraped images and automatically assigned tags to them with the help of another neural net trained on internet text. The internet text, you may be shocked to hear, had some pretty offensive stuff in it. MIT CSAIL removed that dataset permanently rather than manually filter all 80 million images.

This is not just a problem with bad data, but with a system where major research groups can release datasets with such huge issues with offensive language and lack of consent. As tech ethicist Shannon Vallor put it, “For any institution that does machine learning today, ‘we didn’t know’ isn’t an excuse, it’s a confession.” Like the algorithm that upscaled Obama into a white man, ImageNet is the product of a machine learning community where there’s a huge lack of diversity. (Did you notice that most of the generated humans in this blog post are white? If you didn’t notice, that might be because so much of Western culture treats white as default.)

It takes a lot of work to create a better dataset - and to be more aware of which datasets should never be created. But it’s work worth doing.

Bonus material this week: a few of my favorite BigGAN image categories. Enter your email here for a gallery!

My book on AI is out, and you can now get it in any of these several ways! Amazon - Barnes & Noble - Indiebound - Tattered Cover - Powell’s - Boulder Bookstore

