
Just thinking out loud here...

It seems to me that if you wanted to root out sentiment bias in this type of algorithm, you would need to adjust your baseline word-embedding dataset until the sentiment scores for words like "Italian", "British", "Chinese", "Mexican", "African", etc. are roughly equal, without changing the sentiment scores for all other words. That being said, I have no idea how you'd approach such a task...
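
At its crudest, one option might be to post-process the scores rather than the embeddings: pin every term on a chosen identity list to the group's mean sentiment and leave everything else alone. A toy sketch (the scorer and the numbers here are made up, just to show the shape of the adjustment):

    # Toy sketch: pin a list of identity terms to their mean sentiment score.
    # "base_sentiment" stands in for whatever embedding-based scorer you
    # already have; the numbers are invented for illustration.
    base_sentiment = {
        "italian": 0.8, "british": 0.9, "chinese": -0.2,
        "mexican": -0.5, "african": -0.4, "delicious": 1.2,
    }
    identity_terms = {"italian", "british", "chinese", "mexican", "african"}

    # Every identity term gets the group's mean score; other words are untouched.
    target = sum(base_sentiment[w] for w in identity_terms) / len(identity_terms)

    def adjusted_sentiment(word):
        return target if word in identity_terms else base_sentiment.get(word, 0.0)

    print(adjusted_sentiment("mexican") == adjusted_sentiment("italian"))  # True
    print(adjusted_sentiment("delicious"))  # unchanged: 1.2

Of course, this only patches the terms you thought to list, which is exactly the "fixed the metric, not the underlying issue" worry raised in the reply below.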

I don't think you could ever get equal sentiment scores for "black" and "white" without biasing the dataset in a way that renders it invalid for other scenarios (e.g., giving "a dark black alley" a higher sentiment than it would otherwise have). "Black" and "white" are a harder case because those words have so many meanings outside of race/ethnicity.



I think I would agree. You otherwise run the risk of having fixed the metric ("Italian" vs. "Mexican", "Chad" vs. "Shaniqua", etc.) without actually fixing the underlying issue.

Also, regarding black/white etc., there might legitimately be words which have so many different meanings (whether race-related or not) that you should just exclude them from sentiment analysis. "Right" can mean a human right, the right thing to do, or the opposite of left. There are probably plenty of other words like that. You might do better to have a list of 100-200 words that are simply excluded because of issues like that.
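
Mechanically, bolting an exclusion list onto a word-level scorer is trivial; a toy sketch (the lexicon scores and the list contents are placeholders):

    # Toy sketch: skip ambiguous words when averaging word-level sentiment.
    # The lexicon scores and the exclusion list are placeholders.
    lexicon = {"dark": -0.6, "black": -0.3, "alley": -0.4, "right": 0.5, "lovely": 1.0}
    excluded = {"black", "white", "right", "left"}  # too many unrelated senses

    def sentence_sentiment(tokens):
        scored = [lexicon[t] for t in tokens if t in lexicon and t not in excluded]
        return sum(scored) / len(scored) if scored else 0.0

    print(sentence_sentiment("a dark black alley".split()))  # "black" is ignored
    print(sentence_sentiment("the right to vote".split()))   # "right" is ignored

The hard part is deciding what goes on the list, not applying it.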


> there might legitimately be words which have so many different meanings

I haven't studied word embeddings past the pop-sci level but wouldn't such words form multiple clusters in the embedding space? I would have thought it would be relatively easy to get different 'words' for 'right (entitlement)', 'right (direction)', etc?

Edit: Nibling post answers this question.


Would it be worth trying to think of words with different meanings as entirely new words? So, "white" in one sentence may be a different word than "white" in another?


There's a long list of papers on that: 'multi-sense word embeddings'. But more recently we've found that passing character-derived token representations through a two-layer BiLSTM resolves the ambiguity of meaning from context: ELMo.

https://arxiv.org/abs/1802.05365 (state of the art)
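
The multi-sense recipe is roughly: represent each occurrence of a word by an embedding of its surrounding context, cluster those context vectors, and treat each cluster as a separate sense (which also answers the "entirely new words" question above). A toy sketch with made-up 2-d vectors standing in for real pretrained embeddings:

    # Toy sketch of multi-sense embeddings: cluster the contexts in which
    # "right" occurs and tag each occurrence with its cluster id.
    # The 2-d vectors are invented; a real system would average pretrained
    # embeddings of the surrounding words.
    import numpy as np
    from sklearn.cluster import KMeans

    context_vecs = {
        "human ___ to vote":     np.array([0.9, 0.1]),
        "the ___ thing to do":   np.array([0.8, 0.2]),
        "turn ___ at the light": np.array([0.1, 0.9]),
        "on your ___ hand side": np.array([0.2, 0.8]),
    }

    X = np.stack(list(context_vecs.values()))
    senses = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    for ctx, sense in zip(context_vecs, senses):
        print(f"right_{sense}: {ctx}")  # each occurrence becomes a sense-tagged token

ELMo skips the explicit clustering step: the BiLSTM produces a different vector for every occurrence directly from its context.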


Does “a dark black alley” have a sentiment at all?

I would argue that it’s pragmatically associated with bad things (e.g., being mugged, overcrowded areas) but it’s not intrinsically bad (or good) itself.


> associated with bad things

Is that not what's meant by sentiment?


My intuition is that word-level sentiment is rather pointless. “The Disaster Artist was not bad” has a positive sentiment overall, but each of the individual words, except possibly ‘artist’, is usually thought to be negative. Moreover, you can totally flip the overall sentiment by adding another neutral-ish word: “The Disaster Artist was not even bad.”

Similarly, my guess is that alley is rarely found in a positive context, but the actual sentiment comes from elsewhere in the utterance.
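
To make that concrete, here's what a naive word-averaging scorer does with those two sentences (the lexicon values are made up):

    # Toy sketch: averaging word-level sentiment ignores negation entirely.
    # Lexicon values are invented for illustration; unknown words score 0.
    lexicon = {"disaster": -1.0, "artist": 0.2, "not": -0.3, "bad": -0.8}

    def word_average_sentiment(text):
        scores = [lexicon.get(t.lower(), 0.0) for t in text.split()]
        return sum(scores) / len(scores)

    print(word_average_sentiment("The Disaster Artist was not bad"))
    # about -0.32: negative, though a human reads the sentence as mildly positive
    print(word_average_sentiment("The Disaster Artist was not even bad"))
    # about -0.27: adding "even" barely moves the score but flips the meaning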


Word-level sentiment is like the spherical cow in a vacuum from physics. Everyone knows it's an extremely flawed model, but it produces good results in a lot of scenarios and has the enormous benefit of simplicity, so it will inevitably get used.


This article is about a simple model. Within that model, it absolutely makes sense for “dark black alley” to get a negative score.


It certainly gets a sentiment score, but whether that score is in any way meaningful, i.e. corresponds to actual human sentiment, is the important question. Otherwise you’re just playing stupid games and winning stupid prizes... though I suppose merely stupid is a step up from stupid and racist.



