Gene name errors are widespread in the scientific literature (biomedcentral.com)
64 points by campuscodi on Aug 24, 2016 | 59 comments


I did some small tasks for people working in bioinformatics and what I've seen is both amazing (the science part) and terrifying (the tools). Apparently I saved hours of retyping for one person who was doing a manual JOIN on gene names between two CSV files. As in ctrl+f name of gene from file A, copy, paste into another window. For thousands of rows. I was trying to explain that you can import CSV files without autoformatting as well, but they didn't believe me... sigh... (at least they were aware that this is a bad idea with default settings, because the formatting changes)

I don't know if this can be fixed, or how. Lots of people seem to have their own process, and they neither understand that the tools can be used more efficiently / correctly, nor question the long, manual process they currently follow. They basically need an "intermediate Excel" course which ends with "if you're doing copy/paste 3 times in a row, you should look for a better solution". I'm not even questioning the use of Excel at this point...
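For what it's worth, the manual gene-name join described above is one pandas call. A minimal sketch with two toy in-memory CSVs (the filenames and columns are invented; `dtype=str` keeps the gene symbols as text rather than letting anything coerce them):

```python
import io
import pandas as pd

# Two toy CSVs standing in for the real files (contents made up for the demo)
csv_a = io.StringIO("gene,expr\nSEPT2,1.2\nMARCH1,3.4\n")
csv_b = io.StringIO("gene,pval\nSEPT2,0.01\nDEC1,0.05\n")

a = pd.read_csv(csv_a, dtype={"gene": str})
b = pd.read_csv(csv_b, dtype={"gene": str})

# One merge call replaces thousands of manual ctrl+F / copy / paste rounds
merged = a.merge(b, on="gene", how="inner")
```

An inner join keeps only the genes present in both files; `how="outer"` would keep everything and leave gaps where a gene is missing from one side.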


This kind of problem is rife in academia.

Part of the problem, as a sibling mentioned, is that many people get into these kinds of fields from non-programming or even computer-illiterate backgrounds. At an undergraduate level, even in heavily numerical disciplines like physics, there is relatively little coding. Even then, it's tedious crap like F90 and bulletproof C. So people get into a Masters or a PhD in a field they love and are suddenly confronted by data analysis, and they have no idea what to do.

I've spoken a lot to my girlfriend (an astrophysicist) about this, as she's in this position herself. It's not that she isn't smart, she just has little to no experience with data wrangling and she's (in my opinion) been inadequately trained. I've solved things with Python one-liners that she would have spent literally days doing manually. We've had conversations along the lines of:

"But why does this file have 700 lines of input data filenames hardcoded? You know you could write some code to grab them for you?"

"Yes, but in the time it would take me to write the code, I may as well do it by hand." [In the end I wrote a 5 line regex to do it]
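Gathering input filenames like that is typically a couple of lines with `glob`. A sketch under invented assumptions (the temp directory and the `.dat` extension are made up here just so the example is self-contained):

```python
import glob
import os
import tempfile

# Create a throwaway directory with a few fake data files for the demo
datadir = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(datadir, f"obs_{i}.dat"), "w").close()

# The actual point: one glob call instead of 700 hardcoded filenames
filenames = sorted(glob.glob(os.path.join(datadir, "*.dat")))
```

In the real script only the last line matters, pointed at the actual data directory.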

So I can assure you, people question the manual process and they think you're magical when you show them a quicker way of doing things. However, often there isn't the motivation or confidence to try and be magical themselves.


My wife works at the virology lab of a local hospital. They do a lot of validation and quality controls. However, when she started there, all their statistics were calculated by hand using approximate equations which were supposed to make it doable using a simple calculator. Whoever had set this up lacked a strong understanding of statistics. Still, this procedure had remained unchanged since the eighties and no one knew how or why it was supposed to work. Going back in old records, my wife found that there were a lot of errors in the recorded results.

She spent a while setting up a simple Excel sheet into which they enter all test results and now all the statistics are calculated automatically using proper methods.


> This kind of problem is rife in academia.

This kind of problem is rife everywhere. People don't know and are not interested in how to use computers to help them because "it's too complicated".

Excel doesn't help.


Is this necessarily a bad thing, though? Being the house wizard can be a fairly comfortable position.


This is where I take the time to explain to everyone that "Yeah, while it may save you time to do it manually, if you do it THIS way instead, you never have to write it from scratch again, and your output goes waaay up".

Every job I've ever had, I always look at how my predecessors do the job, then automate the crap that I don't want to deal with. :/


I'm in the middle of writing a retrospective on my PhD - the stuff I wish I'd done at the beginning. Perhaps unsurprisingly, it mostly boils down to automate and pipeline everything. It almost always saves time because inevitably you have to repeat work with parameters changed very slightly.


The prevailing attitude is "I want to do biology, not computer science". I find it pretty frustrating. This is probably one of the last fields of science to be seriously computerized.


I used to work as a bioinformatics programmer, and in some cases I did in an afternoon, including writing a Python script, processing that a postdoc had spent a month doing manually in Excel.

While the tools often are crude, a problem is that biochemistry and molecular biology researchers lack the bioinformatics and other IT skills that are often badly needed in modern genetic research. The core competence of these researchers is in the lab, and bioinformatics skills are outside of that. That you see effects like conversion of gene names in Excel does not surprise me at all.

You also see the same problem in other areas where researchers need skills outside of their usual comfort zone. The classical example is statistics.


If only someone had warned them! Like, say, 12 years ago:

https://bmcbioinformatics.biomedcentral.com/articles/10.1186...

Bonus: the number of errors is positively correlated with Impact Factor (a tool used by statistically illiterate administrative types to judge the quality of research).


I almost lost it at "We hope the date and floating-point conversions will be made non-default options – in deference to the large bioinformatics and biotechnology communities if not for other users."

Sure, the vast biotech community outnumbers those few people who work in some other jobs (or use Excel for private stuff) and enter dates or floating point numbers.

Excel is trying to limit the effect of sloppiness on the part of laypeople. They usually don't set cell formatting to date or number. Excel really does the right thing there.

And if you're a scientist, presumably with a respectable degree and professional responsibilities, learn to use your tool correctly.

Nice how the correct ways to use Excel are listed as cumbersome workarounds.


Turning SEPT2 into a date in the current year doesn't seem like really doing the right thing to me.


In the vast majority of real use cases, it certainly is. Microsoft has one of the greatest usability labs in the world, but they also have the broadest user base.


Why not?

1. Out of a thousand people entering "SEPT2", how many mean "September 2nd"?

2. How many of them use that shortcut intentionally?

3. How many people mean some gene?

My answers are:

1. almost all of them

2. A respectable part of those in 1.

3. A dozen?


I would expect almost everyone meaning the date to not type the T, but maybe I'm just weird.

And far more than a dozen people mean the gene, come on.

And even if you want to type september second, trying to guess a year to attach is not the best idea.


> a tool used by statistically illiterate administrative types to judge the quality of research

Pfft, I bet you only say that because you have a low h-index.


Not exactly -- 13,000 citations and counting, mostly due to a few papers in Nature, Cell, and NEJM. But that doesn't change the fact that JIF is a terrible metric. Frankly, papers like the above are why impact factors are so misleading. Most research in the above journals will not be cited as heavily as ours has been.

Even counting citations is better, although if admins would actually read the papers, that would be better-er. Even NIH intramural people talk about impact factor like it means something. Pretty fucked considering that the publishers of glamour rags have vested interests that are very nearly the opposite of "careful scholarship".


When researchers stop using Excel as their main "database", this problem might be solved.


The one thing SQL can't do that Excel can: a calculation cell, as in, one that starts with =.

Yes, I can probably code something entirely in SQL to do it, but it will not be portable across SQL implementations; and yes, I can code that as a variable in my program... but neither of them seems as fluent as the way Excel does it.

I wouldn't use Excel for prod, but it comes in handy for a lot of small dumb shit purely because of =.


Many databases support virtual columns (https://en.m.wikipedia.org/wiki/Virtual_column)

If portability is really important, I think your best bet would be a view that adds the calculated columns.

Unfortunately, that's still not as easy as using Excel, by a long stretch.
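To make the view idea concrete: a derived column in a view is the closest SQL analogue of an Excel "=" cell. A self-contained sketch with SQLite (the table, columns, and formula are invented for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (sample TEXT, a REAL, b REAL)")
con.execute("INSERT INTO results VALUES ('s1', 2.0, 3.0)")

# The view recomputes "product" from the base columns on every query,
# just like an Excel formula cell recalculates from its inputs
con.execute(
    "CREATE VIEW results_calc AS "
    "SELECT sample, a, b, a * b AS product FROM results"
)
product = con.execute("SELECT product FROM results_calc").fetchone()[0]
```

Plain `CREATE VIEW` like this is portable across essentially every SQL implementation, unlike vendor-specific virtual/generated columns.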


What's the alternative?

Remember, these are people who don't understand data types, or if they do, they are too lazy to declare them.

I don't see them formulating a correct SQL query, or using any kind of programming language that has strict typing.

It would have to be a custom-tailored system that knows about the nomenclature in the field. That doesn't sound very efficient.


> I don't see them formulating a correct SQL query.

Well, neither can programmers, since SQL injection is still the most common security vulnerability in software, so I agree, SQL is probably a bad tool for the job :)

> It would have to be a custom-tailored system that knows about nomenclature in the field. Sounds not very efficient.

Is Excel a custom-tailored system that knows about nomenclature in the field? The article seems to explicitly argue against that.

I don't think it's a failure to understand data types. It's a mismatch between what you expect the software to do and what it does by default. Unfortunately, Microsoft has steadfastly refused to allow any way to change the auto-formatting options (check out some people really pissed off at being treated like children here - http://answers.microsoft.com/en-us/office/forum/office_2007-...).

There are two useful prongs of attack here. One is to somehow force Excel to conform to the expectations of researchers - perhaps an extension that prevents the most egregious cases of auto-formatting gone wrong? The alternative would be what you suggest - creating and marketing a custom solution. The problem there is that you'd either need buy-in from a significant number of researchers to spread it, or you'd need to replicate a lot of Excel features to make the transition smooth for others.


> Well, neither can programmers, since SQL injection is still the most common security vulnerability in software, so I agree, SQL is probably a bad tool for the job :)

The problem isn't with SQL, it's with the idiotic idea of gluing SQL queries from strings. See also: template languages for web pages, i.e. gluing strings to build what is really a tree.

--

I think the hate against Excel is unfounded. Sure, it's not perfectly suited software for the domain - in the same way that a dedicated fruit slicer is better than a knife at slicing fruit, and a dedicated fruit peeler is better at peeling it - but one knife can do all those jobs pretty well on its own, while also being able to do countless other things. Excel is a very versatile and powerful tool, and this power comes from its flexibility. The solutions often proposed, like a "proper" database-backed system, usually involve a fixed workflow and having to call in IT support every time something doesn't fit that workflow (which is always).

So IMO: yes to people dealing with data using Excel. Yes to them learning a programming language, SQL and a generic database system. But a strong no to forcing them to use domain-specific dedicated "tools" that impose a particular workflow on them.


> prevent the most egregious cases of auto-formatting gone wrong

Sure, but converting "DEC1" to "December 1st" is not egregiously wrong; it's a valuable feature and in most cases the expected thing.


Not to a person who specifically does not want that feature - and since Excel does not provide a way to turn it off or customize it, my idea was that an extension might be able to. Of course, I've never written an Excel extension (macro?), so I have no idea.


Of course you can turn it off. By formatting the cells as the correct type.


That's a ridiculous argument and you should know it. Forcing users to do manual work every time, instead of having an option to disable or configure a feature, is just a UX fail. Doubly so because I would bet those formatted Excel files don't survive transition - the data is actually transmitted as CSV or whatever - so you'd have to reformat the data over and over every time you open it to edit, and hope that no one along the way makes a mistake. This is a problem software is meant to solve, not create.

You could argue that all you can do is mitigation, since CSV files don't offer much ability to influence how Excel will load them, and all it would take is one improperly configured Excel along the pipeline to break the data. However, this is a significant mitigation - Excel installs would be configured once, you would deal with the situation 1% of the time, and the solution would be trivial (just configure it!). Instead, now you're dealing with the problem every time, and the solution (just mark the cells!) takes a lot more effort.

I don't think Microsoft necessarily has incentive to add this configuration (the science community as a whole is probably a tiny blip on its radar), but this is why we create modular and extensible software - so others can tweak it to their liking.


No, it's the correct argument.

Putting an option in to disable something that only a minuscule number of users would ever want is not sensible.

Nota bene: I'm only talking about these DEC1-like conversions. I agree that there are lots of conversions that are annoying.


If you have a column of text labels and only a few format as dates, they are very unlikely to be dates. Excel should be smart enough to know this.


SQL injection isn't an SQL vulnerability, but one of code that naïvely constructs queries. Not a parallel case here.


> It would have to be a custom-tailored system that knows about nomenclature in the field.

And it's available for free - Excel came with the university / corporate license on all desktops at no extra cost (in most places relevant to this discussion). And it's popular and easily available - with a custom tool, you'd need to convince IT to allow it on the network / preinstall it on the provided systems.


FWIW, that's reference #1 in this paper.


Indeed. If it made it through review without that happening... that would suggest that peer review is fallible. ;-)


For a long time, I knew I was dropping this gene [1] from my analyses because pandas automatically converted its name to a floating point. With the right combination of flags I was able to get it to work right, but even real programming languages are not immune.

[1]http://flybase.org/reports/FBgn0036414.html - short symbol is "nan"
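For others hitting this: the flag combination that matters in pandas is `keep_default_na` (optionally paired with an explicit `na_values` list). A small sketch with a made-up two-column file containing the fly gene symbol "nan":

```python
import io
import pandas as pd

csv_text = "symbol,score\nnan,0.9\nDcr-1,0.5\n"

# Default NA handling silently turns the gene symbol "nan" into a float NaN...
default = pd.read_csv(io.StringIO(csv_text))

# ...while keep_default_na=False preserves it as the literal string "nan"
fixed = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
```

With `keep_default_na=False` you can still pass `na_values=[""]` (or whatever sentinel your data really uses) so genuinely missing cells are not lost.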


"And I hope you've learned to sanitize your database inputs."


Science is probably the least scary thing that an Excel bug can affect. JPMorgan Chase's "London Whale" venture lost $2 billion in part because spreadsheet modelers divided by a sum instead of an average to get a value at risk.

https://baselinescenario.com/2013/02/09/the-importance-of-ex...

Then there was the infamous Reinhart-Rogoff paper in economics, used to justify harmful austerity policies worldwide post-2008, that came to false conclusions using a row formula that wasn't updated.

https://www.washingtonpost.com/news/wonk/wp/2013/04/16/is-th...


That's not an Excel bug.

I've seen co-workers get wrong results due to a careless copy-and-paste where no attention was paid. This is no bug: it's misuse of the program.

Knowing your tools is always good advice.


It is a bug. There is no way to stop Excel from automatically turning certain strings in files that it reads in, into dates.


At university I worked on a project to parse gene relationships from the literature, and yeah, I remember the inconsistencies. Genes also have funny names: there is SHH (Sonic Hedgehog), DICER1 (which cuts RNA or DNA, I forget), and a bunch of other silly ones.

Ultimately though coming from the world of algorithms and nicely organized data, it was frustrating how disorganized the nomenclature seemed.


Don't forget that the genomics folks also renamed a whole bunch of genes a few years ago so now there are two different names for the same thing floating around!


For every gene, there are typically at least 3-4 names that reference the gene. In some cases, two genes have the same name- for example, OCT1 and Oct-1. The first is "organic cation transporter 1" and the second is "Octamer binding protein 1". The second was "renamed" to (IIRC) POU2F1 but there are still plenty of references to the old name even in new literature.

This is just one example. The entire gene naming area is a pile of bollocks.
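One common mitigation is to normalize symbols against an alias table before joining or counting anything. A toy sketch (the table below is illustrative only, not a real HGNC export, using the renamings mentioned in this thread):

```python
# Toy alias table: legacy symbol -> current symbol (illustrative mappings)
ALIASES = {
    "Oct-1": "POU2F1",  # Octamer binding protein 1, renamed
    "MLL": "KMT2A",     # Mixed-lineage leukemia, renamed
}

def canonical(symbol: str) -> str:
    """Return the current symbol, or the input unchanged if unknown."""
    return ALIASES.get(symbol, symbol)
```

In practice you'd load the alias table from a current nomenclature database dump rather than hardcoding it, and apply `canonical` to every symbol column before any merge.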


And we haven't even touched on the glory of Drosophila gene names.


Oh this happens constantly. Apparently, in an ideal world, every gene will have multiple counterintuitive symbols that appear to give insight as to what the protein does, but are in fact misleading.

HGNC seems to delight in nonsensical renaming. MLL (Mixed-lineage leukemia, a gene rearranged in many leukemias, particularly infants) is now KMT2A ("Lysine Methyl-Transferase 2A"), yet gobs of its fusion partners are canonically named, say, MLLT3 ("Myeloid/Lymphoid Or Mixed-Lineage Leukemia; Translocated To, 3").

But wait, what's its translocation partner? Oh, right, the Gene Formerly Known As MLL. Who thinks this is a good idea?

Journals typically insist on the latest HGNC (HUGO Gene Nomenclature Committee, a subset of the HUman Genome Organization) symbols, whatever they might be (see above). Not atypically, in review, someone will ask to use the old name because that's what they're used to. Best of all is when reviewers each suggest using a different name.

Science!


I'm interested in this. I have biologist friends at the university and we've discussed the possibility of implementing such functionality; I'd like to give it a try and probably re-discover what you found already. Some kind of report of your experience would be very useful.

Do you have some references to that project? Reports, comments, maybe software?


> [...] coming from the world of algorithms and nicely organized data, it was frustrating how disorganized [...]

Yeah. Rigor is not really a concept with which even molecular biologists are at home, which is understandable and probably explains why their parties are much better than those in a lot of other disciplines, but it's maybe a little hard to get used to in some other ways.


Pathology resident working with big data and an undergrad in physics checking in. I learned SQL. Met Stonebraker before the Turing Award. I taught myself Python in part by working on Rosalind problems. It's not that I don't understand. I don't have chunks of time large enough to quiet my mind, frame-shift, and work on my research at the code level. I've got IRBs and all sorts of oversight to deal with, budgets, etc. and the biology. A central success of my project has been recruiting professional CS people early.

That's my single biggest suggestion for biologists: know that professional computer scientists and programmers are desperately hungry for interesting data and would love nothing more than to help you design the project up front so they don't get sucked into the vortex of technical debt that will swallow your project if you don't set it up right early.


Excel is a fantastic tool, and widely used in bioinformatics - but to use any tool properly, you have to learn it. I'm really thankful that it was taught first in my bioinformatics class, before any specific tools or programming languages, so we wouldn't commit stupid errors like this one.


> Linear-regression estimates show gene name errors in supplementary files have increased at an annual rate of 15 % over the past five years, outpacing the increase in published papers (3.8 % per year).

This still doesn't actually tell us whether the problem is getting worse - or if it does, it is badly worded. Even assuming the 3.8 % is derived from their own data, you need the number of published papers that contain gene lists (which has probably risen faster than the number of papers overall).

In other words, the authors should have plotted the error rate over time rather than the number of errors over time.
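A back-of-envelope check of the quoted growth rates, assuming (as the comment above questions) that the share of papers with gene lists is roughly constant:

```python
# Errors in supplementary files grow ~15 %/yr, papers ~3.8 %/yr (from the
# quoted abstract). The implied per-paper error rate then grows by the ratio.
error_growth = 1.15
paper_growth = 1.038
rate_growth = error_growth / paper_growth - 1  # roughly 0.11, i.e. ~11 %/yr
```

So even under that charitable assumption the per-paper rate would be rising at roughly 11 % per year; if gene-list papers are themselves growing faster than papers overall, the true rate growth is lower, which is exactly why the authors should have plotted the rate directly.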


Even worse, there isn't even a proper definition of what a gene is.

The meaning changes depending on who you are talking to, what book you are reading, and even what part of the book you are reading.


Biology (the discipline) is messy because biology (the reality) is really really messy.


It would be interesting to a certain type of person to look back on this in ten years and see if any of those papers were corrected, etc. Some links are in this thread that would allow this to be done now as well. Also check out the fMRI and microarray scandals that contaminated decades worth of publications.

Did using correct data and analysis pipelines actually matter to the conclusions these authors came to?



Fucking Excel! They haven't fixed this utterly egregious bug in over a decade.

I'm going to make a sign and picket Microsoft when I'm next in the area.


What bug?


I have long fantasized about writing a virus that forces Excel to quit if it detects the user is attempting to use Excel to do scientific research.


What's your proposed alternative that doesn't require hours to get started? Serious question - I believe no programming language can be applied here (at least not yet). The barrier to entry must be minimal.


You seem to be asking why people shouldn't do data analysis without knowing anything about data analysis.


I'm saying: we've got lots of people who know the science and do not know, or want to know, anything about programming. If we want good results, we need to enable them to work better somehow. Adding proper programming to university curricula now is a good idea, so that in ~10 years we can have this talk again and point out that people know better tools and shouldn't be using Excel, or at least shouldn't be using it badly.


At the very least using the t-test in Excel should require a unique key that you only get after doing a few hours of classwork.



