Gene name errors are widespread in the scientific literature (biomedcentral.com)
64 points by campuscodi on Aug 24, 2016 | 59 comments


I did some small tasks for people working in bioinformatics and what I've seen is both amazing (the science part) and terrifying (the tools). Apparently I saved hours of retyping for one person who was doing a manual JOIN on gene names between two CSV files. As in ctrl+f name of gene from file A, copy, paste into another window. For thousands of rows. I was trying to explain that you can import CSV files without autoformatting as well, but they didn't believe me... sigh... (at least they were aware that this is a bad idea with default settings, because the formatting changes)

I don't know if this can be fixed, or how. Lots of people seem to have their own process, and they neither understand that the tools can be used more efficiently / correctly, nor question the long, manual process they currently follow. They basically need an "intermediate Excel" course which ends with "if you're doing copy/paste 3 times in a row, you should look for a better solution". I'm not even questioning the use of Excel at this point...
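For what it's worth, the manual gene-name join described above is one pandas call. A minimal sketch with two toy in-memory CSVs (the filenames and columns are invented; `dtype=str` keeps the gene symbols as text rather than letting anything coerce them):

```python
import io
import pandas as pd

# Two toy CSVs standing in for the real files (contents made up for the demo)
csv_a = io.StringIO("gene,expr\nSEPT2,1.2\nMARCH1,3.4\n")
csv_b = io.StringIO("gene,pval\nSEPT2,0.01\nDEC1,0.05\n")

a = pd.read_csv(csv_a, dtype={"gene": str})
b = pd.read_csv(csv_b, dtype={"gene": str})

# One merge call replaces thousands of manual ctrl+F / copy / paste rounds
merged = a.merge(b, on="gene", how="inner")
```

An inner join keeps only the genes present in both files; `how="outer"` would keep everything and leave gaps where a gene is missing from one side.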


This kind of problem is rife in academia.

Part of the problem, as a sibling mentioned, is that many people get into these kinds of fields from non-programming or even computer-illiterate backgrounds. At an undergraduate level, even in heavily numerical disciplines like physics, there is relatively little coding. Even then, it's tedious crap like F90 and bulletproof C. So people get into a Masters or a PhD in a field they love and are suddenly confronted by data analysis, and they have no idea what to do.

I've spoken a lot to my girlfriend (an astrophysicist) about this, as she's in this position herself. It's not that she isn't smart, she just has little to no experience with data wrangling and she's (in my opinion) been inadequately trained. I've solved things with Python one-liners that she would have spent literally days doing manually. We've had conversations along the lines of:

"But why does this file have 700 lines of input data filenames hardcoded? You know you could write some code to grab them for you?"

"Yes, but in the time it would take me to write the code, I may as well do it by hand." [In the end I wrote a 5 line regex to do it]
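Gathering input filenames like that is typically a couple of lines with `glob`. A sketch under invented assumptions (the temp directory and the `.dat` extension are made up here just so the example is self-contained):

```python
import glob
import os
import tempfile

# Create a throwaway directory with a few fake data files for the demo
datadir = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(datadir, f"obs_{i}.dat"), "w").close()

# The actual point: one glob call instead of 700 hardcoded filenames
filenames = sorted(glob.glob(os.path.join(datadir, "*.dat")))
```

In the real script only the last line matters, pointed at the actual data directory.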

So I can assure you, people question the manual process and they think you're magical when you show them a quicker way of doing things. However, often there isn't the motivation or confidence to try and be magical themselves.


My wife works at the virology lab of a local hospital. They do a lot of validation and quality controls. However, when she started there, all their statistics were calculated by hand using approximate equations which were supposed to make it doable using a simple calculator. Whoever had set this up lacked a strong understanding of statistics. Still, this procedure had remained unchanged since the eighties and no one knew how or why it was supposed to work. Going back in old records, my wife found that there were a lot of errors in the recorded results.

She spent a while setting up a simple Excel sheet into which they enter all test results and now all the statistics are calculated automatically using proper methods.


> This kind of problem is rife in academia.

This kind of problem is rife everywhere. People don't know and are not interested in how to use computers to help them because "it's too complicated".

Excel doesn't help.


Is this necessarily a bad thing, though? Being the house wizard can be a fairly comfortable position.


This is where I take the time to explain to everyone that "Yeah, while it may save you time to do it manually, if you do it THIS way instead, you never have to write it from scratch again, and your output goes waaay up".

Every job I've ever had, I always look at how my predecessors do the job, then automate the crap that I don't want to deal with. :/


I'm in the middle of writing a retrospective on my PhD - the stuff I wish I'd done at the beginning. Perhaps unsurprisingly, it mostly boils down to automate and pipeline everything. It almost always saves time because inevitably you have to repeat work with parameters changed very slightly.


The prevailing attitude is "I want to do biology, not computer science". I find it pretty frustrating. This is probably one of the last fields of science to be seriously computerized.


I used to work as a bioinformatics programmer, and in some cases I did in an afternoon, including writing a Python script, processing that a postdoc had spent a month doing manually in Excel.

While the tools often are crude, a problem is that biochemistry and molecular biology researchers lack the bioinformatics and other IT skills that are often badly needed in modern genetic research. The core competence of these researchers is in the lab, and bioinformatics skills are outside of that. That you see effects like conversion of gene names in Excel does not surprise me at all.

You also see the same problem in other areas where researchers need skills outside of their usual comfort zone. The classical example is statistics.


If only someone had warned them! Like, say, 12 years ago:

https://bmcbioinformatics.biomedcentral.com/articles/10.1186...

Bonus: the number of errors is positively correlated with Impact Factor (a tool used by statistically illiterate administrative types to judge the quality of research).


I almost lost it at "We hope the date and floating-point conversions will be made non-default options – in deference to the large bioinformatics and biotechnology communities if not for other users."

Sure, the vast biotech community outnumbers those few people who work in some other jobs (or use Excel for private stuff) and enter dates or floating point numbers.

Excel is trying to limit the effect of sloppiness on the part of laypeople. They usually don't set cell formatting to date or number. Excel really does the right thing there.

And if you're a scientist, presumably with a respectable degree and professional responsibilities, learn to use your tool correctly.

Nice how the correct ways to use Excel are listed as cumbersome workarounds.


Turning SEPT2 into a date in the current year doesn't seem like really doing the right thing to me.


In the vast majority of real use cases, it certainly is. Microsoft has one of the greatest usability labs in the world, but they also have the broadest user base.


Why not?

1. Out of a thousand people entering "SEPT2", how many mean "September 2nd"?

2. How many of them use that shortcut intentionally?

3. How many people mean some gene?

My answers are:

1. almost all of them

2. A respectable part of those in 1.

3. A dozen?


I would expect almost everyone meaning the date to not type the T, but maybe I'm just weird.

And far more than a dozen people mean the gene, come on.

And even if you want to type september second, trying to guess a year to attach is not the best idea.


> a tool used by statistically illiterate administrative types to judge the quality of research

Pfft, I bet you only say that because you have a low h-index.


Not exactly -- 13,000 citations and counting, mostly due to a few papers in Nature, Cell, and NEJM. But that doesn't change the fact that JIF is a terrible metric. Frankly, papers like the above are why impact factors are so misleading. Most research in the above journals will not be cited as heavily as ours has been.

Even counting citations is better, although if admins would actually read the papers, that would be better-er. Even NIH intramural people talk about impact factor like it means something. Pretty fucked considering that the publishers of glamour rags have vested interests that are very nearly the opposite of "careful scholarship".


When researchers stop using Excel as their main "database", this problem might be solved.


The one thing SQL can't do that Excel can: a calculation cell, as in, one that starts with =.

Yes, I can probably code something entirely in SQL to do it, but it will not be portable across SQL implementations; and yes, I can code that as a variable in my program... but neither of them seems as fluent as the way Excel does it.

I wouldn't use Excel for prod, but it comes in handy for a lot of small dumb shit purely because of =.


Many databases support virtual columns (https://en.m.wikipedia.org/wiki/Virtual_column)

If portability is really important, I think your best bet would be a view that adds the calculated columns.

Unfortunately, that's still not as easy as using Excel, by a long stretch.
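To make the view idea concrete: a derived column in a view is the closest SQL analogue of an Excel "=" cell. A self-contained sketch with SQLite (the table, columns, and formula are invented for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (sample TEXT, a REAL, b REAL)")
con.execute("INSERT INTO results VALUES ('s1', 2.0, 3.0)")

# The view recomputes "product" from the base columns on every query,
# just like an Excel formula cell recalculates from its inputs
con.execute(
    "CREATE VIEW results_calc AS "
    "SELECT sample, a, b, a * b AS product FROM results"
)
product = con.execute("SELECT product FROM results_calc").fetchone()[0]
```

Plain `CREATE VIEW` like this is portable across essentially every SQL implementation, unlike vendor-specific virtual/generated columns.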


What's the alternative?

Remember, these are people who don't understand data types, or if they do, they are too lazy to declare them.

I don't see them formulating a correct SQL query, or using any kind of programming language that has strict typing.

It would have to be a custom-tailored system that knows about the nomenclature in the field. That doesn't sound very efficient.


> I don't see them formulating a correct SQL query.

Well, neither can programmers, since SQL injection is still the most common security vulnerability in software, so I agree, SQL is probably a bad tool for the job :)

> It would have to be a custom-tailored system that knows about nomenclature in the field. Sounds not very efficient.

Is Excel a custom-tailored system that knows about nomenclature in the field? The article seems to explicitly argue against that.

I don't think it's a failure to understand data types. It's a mismatch between what you expect the software to do and what it does by default. Unfortunately, Microsoft has steadfastly refused to allow any way to change the auto-formatting options (check out some people really pissed off at being treated like children here - http://answers.microsoft.com/en-us/office/forum/office_2007-...).

There are two useful prongs of attack here. One is to somehow force Excel to conform to the expectations of researchers - perhaps an extension that prevents the most egregious cases of auto-formatting gone wrong? The alternative would be what you suggest - creating and marketing a custom solution. The problem there is that you'd either need buy-in from a significant number of researchers to spread it, or you'd need to replicate a lot of Excel features to make the transition smooth for others.


> Well, neither can programmers, since SQL injection is still the most common security vulnerability in software, so I agree, SQL is probably a bad tool for the job :)

The problem isn't with SQL, it's with the idiotic idea of gluing SQL queries from strings. See also: template languages for web pages, i.e. gluing strings to build what is really a tree.

--

I think the hate against Excel is unfounded. Sure, it's not perfectly suited software for the domain - in the same way that a dedicated fruit slicer is better than a knife at slicing fruit, and a dedicated fruit peeler is better at peeling it - but one knife can do all those jobs pretty well on its own, while also being able to do countless other things. Excel is a very versatile and powerful tool, and this power comes from its flexibility. The solutions often proposed, like a "proper" database-backed system, usually involve a fixed workflow and having to call in IT support every time something doesn't fit that workflow (which is always).

So IMO: yes to people dealing with data using Excel. Yes to them learning a programming language, SQL and a generic database system. But a strong no to forcing them to use domain-specific dedicated "tools" that impose a particular workflow on them.


> prevent the most egregious cases of auto-formatting gone wrong

Sure, but converting "DEC1" to "December 1st" is not egregiously wrong; it's a valuable feature and in most cases the expected thing.


Not to a person who specifically does not want that feature - and since Excel does not provide a way to turn it off or customize it, my idea was that an extension might be able to. Of course, I've never written an Excel extension (macro?), so I have no idea.


Of course you can turn it off. By formatting the cells as the correct type.


That's a ridiculous argument and you should know it. Forcing users to do manual work every time, instead of having an option to disable or configure a feature, is just a UX fail. Doubly so because I would bet those formatted Excel files don't survive transition - the data is actually transmitted as CSV or whatever - so you'd have to reformat the data over and over every time you open it to edit, and hope that no one along the way makes a mistake. This is a problem software is meant to solve, not create.

You could argue that all you can do is mitigation, since CSV files don't offer much ability to influence how Excel will load them, and all it would take is one improperly configured Excel along the pipeline to break the data. However, this is a significant mitigation - Excel installs would be configured once, you would deal with the situation 1% of the time, and the solution would be trivial (just configure it!). Instead, now you're dealing with the problem every time, and the solution (just mark the cells!) takes a lot more effort.

I don't think Microsoft necessarily has incentive to add this configuration (the science community as a whole is probably a tiny blip on its radar), but this is why we create modular and extensible software - so others can tweak it to their liking.


No, it's the correct argument.

Putting an option in to disable something that only a minuscule number of users would ever want is not sensible.

Nota bene: I'm only talking about these DEC1-like conversions. I agree that there are lots of conversions that are annoying.


If you have a column of text labels and only a few format as dates, they are very unlikely to be dates. Excel should be smart enough to know this.


SQL injection isn't an SQL vulnerability, but one of code that naïvely constructs queries. Not a parallel case here.


> It would have to be a custom-tailored system that knows about nomenclature in the field.

And it's available for free - Excel came with the university / corporate license on all desktops at no extra cost (in most places relevant to this discussion). And it's popular and easily available - with a custom tool, you'd need to convince IT to allow it on the network / preinstall it on the provided systems.


FWIW, that's reference #1 in this paper.


Indeed. If it made it through review without that happening... that would suggest that peer review is fallible. ;-)


For a long time, I knew I was dropping this gene [1] from my analyses because pandas automatically converted its name to a floating point. With the right combination of flags I was able to get it to work right, but even real programming languages are not immune.

[1]http://flybase.org/reports/FBgn0036414.html - short symbol is "nan"
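For others hitting this: the flag combination that matters in pandas is `keep_default_na` (optionally paired with an explicit `na_values` list). A small sketch with a made-up two-column file containing the fly gene symbol "nan":

```python
import io
import pandas as pd

csv_text = "symbol,score\nnan,0.9\nDcr-1,0.5\n"

# Default NA handling silently turns the gene symbol "nan" into a float NaN...
default = pd.read_csv(io.StringIO(csv_text))

# ...while keep_default_na=False preserves it as the literal string "nan"
fixed = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
```

With `keep_default_na=False` you can still pass `na_values=[""]` (or whatever sentinel your data really uses) so genuinely missing cells are not lost.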


"And I hope you've learned to sanitize your database inputs."


Science is probably the least scary thing that an Excel bug can affect. JPMorgan Chase's "London Whale" venture lost $2 billion in part because spreadsheet modelers divided by a sum instead of an average to get a value at risk.

https://baselinescenario.com/2013/02/09/the-importance-of-ex...

Then there was the infamous Reinhart-Rogoff paper in economics, used to justify harmful austerity policies worldwide post-2008, that came to false conclusions using a row formula that wasn't updated.

https://www.washingtonpost.com/news/wonk/wp/2013/04/16/is-th...


That's not an Excel bug.

I've seen co-workers get wrong results due to a careless copy-and-paste where no attention was paid. This is no bug: it's misuse of the program.

Knowing your tools is always good advice.


It is a bug. There is no way to stop Excel from automatically turning certain strings in files that it reads in, into dates.


At university I worked on a project to parse gene relationships from the literature, and yeah, I remember the inconsistencies. Genes also have funny names: there is SHH (Sonic Hedgehog), DICER1 (which cuts RNA or DNA, I forget), and a bunch of other silly ones.

Ultimately though coming from the world of algorithms and nicely organized data, it was frustrating how disorganized the nomenclature seemed.


Don't forget that the genomics folks also renamed a whole bunch of genes a few years ago so now there are two different names for the same thing floating around!


For every gene, there are typically at least 3-4 names that reference the gene. In some cases, two genes have the same name- for example, OCT1 and Oct-1. The first is "organic cation transporter 1" and the second is "Octamer binding protein 1". The second was "renamed" to (IIRC) POU2F1 but there are still plenty of references to the old name even in new literature.

This is just one example. The entire gene naming area is a pile of bollocks.
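One common mitigation is to normalize symbols against an alias table before joining or counting anything. A toy sketch (the table below is illustrative only, not a real HGNC export, using the renamings mentioned in this thread):

```python
# Toy alias table: legacy symbol -> current symbol (illustrative mappings)
ALIASES = {
    "Oct-1": "POU2F1",  # Octamer binding protein 1, renamed
    "MLL": "KMT2A",     # Mixed-lineage leukemia, renamed
}

def canonical(symbol: str) -> str:
    """Return the current symbol, or the input unchanged if unknown."""
    return ALIASES.get(symbol, symbol)
```

In practice you'd load the alias table from a current nomenclature database dump rather than hardcoding it, and apply `canonical` to every symbol column before any merge.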


And we haven't even touched on the glory of Drosophila gene names.


Oh this happens constantly. Apparently, in an ideal world, every gene will have multiple counterintuitive symbols that appear to give insight as to what the protein does, but are in fact misleading.

HGNC seems to delight in nonsensical renaming. MLL (Mixed-lineage leukemia, a gene rearranged in many leukemias, particularly infants) is now KMT2A ("Lysine Methyl-Transferase 2A"), yet gobs of its fusion partners are canonically named, say, MLLT3 ("Myeloid/Lymphoid Or Mixed-Lineage Leukemia; Translocated To, 3").

But wait, what's its translocation partner? Oh, right, the Gene Formerly Known As MLL. Who thinks this is a good idea?

Journals typically insist on the latest HGNC (HUGO Gene Nomenclature Committee, a subset of the HUman Genome Organization) symbols, whatever they might be (see above). Not atypically, in review, someone will ask to use the old name because that's what they're used to. Best of all is when reviewers each suggest using a different name.

Science!


I'm interested in this. I have biologist friends at the university and we've discussed the possibility of implementing such functionality; I'd like to give it a try and probably re-discover what you found already. Some kind of report of your experience would be very useful.

Do you have some references to that project? Reports, comments, maybe software?


> [...] coming from the world of algorithms and nicely organized data, it was frustrating how disorganized [...]

Yeah. Rigor is not really a concept with which even molecular biologists are at home, which is understandable and probably explains why their parties are much better than those in a lot of other disciplines, but it's maybe a little hard to get used to in some other ways.


Pathology resident working with big data and an undergrad in physics checking in. I learned SQL. Met Stonebraker before the Turing Award. I taught myself Python in part by working on Rosalind problems. It's not that I don't understand. I don't have chunks of time large enough to quiet my mind, frame-shift, and work on my research at the code level. I've got IRBs and all sorts of oversight to deal with, budgets, etc. and the biology. A central success of my project has been recruiting professional CS people early.

That's my single biggest suggestion for biologists: know that professional computer scientists and programmers are desperately hungry for interesting data and would love nothing more than to help you design the project up front so they don't get sucked into the vortex of technical debt that will swallow your project if you don't set it up right early.


Excel is a fantastic tool, and widely used in bioinformatics - but to use any tool properly, you have to learn it. I'm really thankful that it was taught first in my bioinformatics class, before any specific tools or programming languages, so we wouldn't commit stupid errors like this one.


> Linear-regression estimates show gene name errors in supplementary files have increased at an annual rate of 15 % over the past five years, outpacing the increase in published papers (3.8 % per year).

This still doesn't actually tell us whether the problem is getting worse - or if it does, it is badly worded. Even assuming the 3.8 % is derived from their own data, you need the number of published papers that contain gene lists (which has probably risen faster than the number of papers overall).

In other words, the authors should have plotted the error rate over time rather than the number of errors over time.
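A back-of-envelope check of the quoted growth rates, assuming (as the comment above questions) that the share of papers with gene lists is roughly constant:

```python
# Errors in supplementary files grow ~15 %/yr, papers ~3.8 %/yr (from the
# quoted abstract). The implied per-paper error rate then grows by the ratio.
error_growth = 1.15
paper_growth = 1.038
rate_growth = error_growth / paper_growth - 1  # roughly 0.11, i.e. ~11 %/yr
```

So even under that charitable assumption the per-paper rate would be rising at roughly 11 % per year; if gene-list papers are themselves growing faster than papers overall, the true rate growth is lower, which is exactly why the authors should have plotted the rate directly.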


Even worse, there isn't even a proper definition of what a gene is.

The meaning changes depending on who you are talking to, what book you are reading, and even what part of the book you are reading.


Biology (the discipline) is messy because biology (the reality) is really really messy.


It would be interesting to a certain type of person to look back on this in ten years and see if any of those papers were corrected, etc. Some links are in this thread that would allow this to be done now as well. Also check out the fMRI and microarray scandals that contaminated decades worth of publications.

Did using correct data and analysis pipelines actually matter to the conclusions these authors came to?



Fucking Excel! They haven't fixed this utterly egregious bug in over a decade.

I'm going to make a sign and picket Microsoft when I'm next in the area.


What bug?


I have long fantasized about writing a virus that forces Excel to quit if it detects the user is attempting to use Excel to do scientific research.


What's your proposed alternative that doesn't require hours to get started? Serious question - I believe no programming language can be applied here (at least not yet). The barrier to entry must be minimal.


You seem to be asking why people shouldn't do data analysis without knowing anything about data analysis.


I'm saying: we've got lots of people who know the science and do not know, or want to know, anything about programming. If we want good results, we need to enable them to work better somehow. Adding proper programming to university curricula now is a good idea, so that in ~10 years we can have this talk again and point out that people know better tools and shouldn't be using Excel, or at least shouldn't be using it badly.


At the very least using the t-test in Excel should require a unique key that you only get after doing a few hours of classwork.



