
I was expecting to be pessimistic, but Google actually releases the datasets under a permissive license (CC BY 4.0). Awesome!

https://research.google/tools/datasets/open-images-extended-...

https://github.com/google-research-datasets/hiertext



It's nice that Google is releasing something, but the 3 datasets (https://crowdsource.google.com/about/open-source/) cover only a fraction of the Crowdsource tasks:

  Food compare: Compare the characteristics of two food images.
  Response rating: Evaluate the natural-ness of a bot response.
  Audio donation: Record your voice to improve speech technology.
  Food facts: Tell us if a food dish has particular characteristics.
  Food labeller: Tell us what food an image contains.
  Semantic similarity: Judge whether two phrases have the same meaning.
  Chart understanding: Judge whether charts are understandable and trustworthy.
  Glide type: Glide your fingers on the keyboard to type the text that you see.
  Audio validation: Listen to a short audio clip and determine if the pronunciation sounds natural in your language.
  Image label verification: Tell us if images are tagged correctly.
  Image capture: Collect and share photos of your part of the world.
  Translation: Translate phrases and words into different languages.
  Translation validation: Select which phrases are translated correctly.
  Handwriting recognition: Look at handwriting and type the text that you see.
  Sentiment evaluation: Decide if a sentence in your language is positive, negative or neutral.
  Smart camera (Android Lollipop 5.0+ required): Point at an object and see if the camera can guess what it is.
https://play.google.com/store/apps/details?id=com.google.and...

Even in those 3 datasets, Google does not disclose the proportion/percentage of the crowdsourced contributions that are released publicly. I would not contribute to Crowdsource with the expectation that my contributions would help build a freely licensed dataset.


"Google shares some Crowdsource data via open-sourcing"

Looks like the releases will only be partial. This kind of data should be collected by a nonprofit for the benefit of everyone.


Also, what the crowd gets is “cool badges”

Who even comes up with these things?


Marketing people who understand how the average citizen works? I'm pretty sure they'll find thousands to low millions of volunteers.


I was expecting to be pessimistic, and I am, because I thought that datasets that actually matter probably won't be released and there is no guarantee that this trend will continue. Please don't trust the giant.


Yeah, most of the 'Google Research Datasets' github account has super boring datasets. There's no way they'll help out with the actually interesting datasets (this is just PR).


That blog post doesn't state that _all_ collected data through Crowdsource will get published.


Do you really want raw, unmoderated user generated content?


Yes! They can provide both filtered & unfiltered data.


I imagine there would be some serious legal concerns with releasing the raw data. (I work for Google, but I have no special insight into this project.)


If you're not gonna release the data, then don't do the project in the first place (especially not under the implication that it'll be shared with everyone). Saying in hindsight "oh we can't release all the data due to sensitivity issues" is just weaseling your way to keeping the most valuable data to yourselves as if the stated issues couldn't have been predicted.


Why? Users volunteer the data. Just ask them if they're ok with it being public.


I think that approach is akin to the honor system, and based on my experiences on the internet, I fear that it won't scale well. For some types of images, just because the uploader is ok with the file being shared doesn't mean it's a good idea to redistribute it. For a bland example, think of a photo where the uploader doesn't hold the copyright. I'm sure you can imagine what would happen if someone on the seedier parts of the internet says "hey, if you upload your images to this website, Google will host them for free forever!"


One negative, I guess, would be uncovering the moderation algorithm so a malicious user could circumvent it.

Another negative would be release of Bad Words or illegal content submitted by malicious users. Depends on the task.

But the actual raw data would be of more use to researchers than data that has already been cleaned by algorithms. Perhaps there could be a program for educational researchers?


Would this be compatible with importing into OpenStreetMap and similar open data projects?


OpenStreetMap does not host images, other projects (close to OSM) do.


In addition, CC BY 4.0 isn't compatible with OSM without a waiver for some of the terms. :(


Sure, that too.


Btw what do those 'data cards' actually do? Can you get sued for going against them? Do they conflict with the permissive license, or does the license take precedence?


It's a strange ML term to describe the data's metadata.



