I was expecting to be pessimistic, but Google actually releases the datasets und...

commoner · on May 15, 2022

It's nice that Google is releasing something, but the 3 datasets (https://crowdsource.google.com/about/open-source/) cover only a fraction of the Crowdsource tasks:

  Food compare: Compare the characteristics of two food images.
  Response rating: Evaluate the natural-ness of a bot response.
  Audio donation: Record your voice to improve speech technology.
  Food facts: Tell us if a food dish has particular characteristics.
  Food labeller: Tell us what food an image contains.
  Semantic similarity: Judge whether two phrases have the same meaning.
  Chart understanding: Judge whether charts are understandable and trustworthy.
  Glide type: Glide your fingers on the keyboard to type the text that you see.
  Audio validation: Listen to a short audio clip and determine if the pronunciation sounds natural in your language.
  Image label verification: Tell us if images are tagged correctly.
  Image capture: Collect and share photos of your part of the world.
  Translation: Translate phrases and words into different languages.
  Translation validation: Select which phrases are translated correctly.
  Handwriting recognition: Look at handwriting and type the text that you see.
  Sentiment evaluation: Decide if a sentence in your language is positive, negative or neutral.
  Smart camera (Android Lollipop 5.0+ required): Point at an object and see if the camera can guess what it is.

https://play.google.com/store/apps/details?id=com.google.and...

Even in those 3 datasets, Google does not disclose the proportion/percentage of the crowdsourced contributions that are released publicly. I would not contribute to Crowdsource with the expectation that my contributions would help build a freely licensed dataset.

fxtentacle · on May 15, 2022

"Google shares some Crowdsource data via open-sourcing"

Looks like the releases will only be partial. This kind of data should be collected by a nonprofit for the benefit of everyone.

prox · on May 15, 2022

Also, what the crowd gets is “cool badges”

Who even comes up with these things?

fxtentacle · on May 15, 2022

Marketing people who understand how the average citizen works? I'm pretty sure they'll find thousands to low millions of volunteers.

nurettin · on May 15, 2022

I was expecting to be pessimistic, and I am, because I think the datasets that actually matter probably won't be released, and there is no guarantee that this trend will continue. Please don't trust the giant.

natly · on May 15, 2022

Yeah, most of the 'Google Research Datasets' GitHub account is super boring datasets. There's no way they'll help out with the actually interesting datasets (this is just PR).

j_barbossa · on May 15, 2022

That blog post doesn't state that _all_ data collected through Crowdsource will get published.

bastawhiz · on May 15, 2022

Do you really want raw, unmoderated user generated content?

cute_boi · on May 15, 2022

Yes! They can provide both filtered & unfiltered data.

Rebelgecko · on May 15, 2022

I imagine there would be some serious legal concerns with releasing the raw data (I work for Google, but have no special insight into this project).

natly · on May 16, 2022

If you're not gonna release the data, then don't do the project in the first place (especially not under the implication that it'll be shared with everyone). Saying in hindsight "oh we can't release all the data due to sensitivity issues" is just weaseling your way to keeping the most valuable data to yourselves as if the stated issues couldn't have been predicted.

IshKebab · on May 15, 2022

Why? Users volunteer the data. Just ask them if they're ok with it being public.

Rebelgecko · on May 15, 2022

I think that approach is akin to the honor system, and based on my experiences on the internet, I fear that it won't scale well. For some types of images, just because the uploader is ok with the file being shared doesn't mean it's a good idea to redistribute it. For a bland example, think of a photo where the uploader doesn't hold the copyright. I'm sure you can imagine what would happen if someone on the seedier parts of the internet says "hey, if you upload your images to this website, Google will host them for free forever!"

thinkingemote · on May 15, 2022

One negative, I guess, would be uncovering the moderation algorithm so a malicious user could circumvent it.

Another negative would be release of Bad Words or illegal content submitted by malicious users. Depends on the task.

But the actual raw data would be of more use to researchers than data that has already been cleaned by algorithms. Perhaps there could be a program for educational researchers?

toomuchtodo · on May 15, 2022

Would this be compatible with importing into OpenStreetMap and similar open data projects?

habi · on May 15, 2022

OpenStreetMap does not host images; other projects (close to OSM) do.

donalhunt · on May 15, 2022

In addition, CC BY 4.0 isn't compatible with OSM without a waiver for some of the terms. :(

habi · on May 15, 2022

Sure, that too.

natly · on May 15, 2022

Btw what do those 'data cards' actually do? Can you get sued for going against them? Do they conflict with the permissive license, or does that take precedence?

learndeeply · on May 15, 2022

It's a strange ML term to describe the data's metadata.