I did some small tasks for people working in bioinformatics and what I've seen is both amazing (the science part) and terrifying (the tools). Apparently I saved hours of retyping for one person who was doing a manual JOIN on gene names between two CSV files. As in ctrl+f name of gene from file A, copy, paste into another window. For thousands of rows. I was trying to explain that you can import CSV files without autoformatting as well, but they didn't believe me... sigh... (at least they were aware that this is a bad idea with default settings, because the formatting changes)
I don't know if this can be fixed, or how. Lots of people seem to have their own process, and they neither understand that the tools can be used more efficiently / correctly, nor question the long, manual process they currently follow. They basically need an "intermediate Excel" course which ends with "if you're doing copy/paste 3 times in a row, you should look for a better solution". I'm not even questioning the use of Excel at this point...
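For what it's worth, the manual JOIN above is a one-liner with pandas. A minimal sketch (the file contents and column names here are hypothetical stand-ins, adjust to the real headers):

```python
import io
import pandas as pd

# Two toy CSVs standing in for the real files.
file_a = io.StringIO("gene,expr\nSEPT2,1.5\nMARCH1,2.0\nTP53,0.3\n")
file_b = io.StringIO("gene,pathway\nSEPT2,cytokinesis\nTP53,apoptosis\n")

# dtype=str disables type inference, so every gene name stays as
# plain text rather than being coerced into a number.
a = pd.read_csv(file_a, dtype=str)
b = pd.read_csv(file_b, dtype=str)

# One merge on the shared column replaces thousands of ctrl+F lookups.
merged = a.merge(b, on="gene", how="inner")
```

With real files you'd pass paths to `pd.read_csv` instead of `StringIO`, and `how="left"` or `how="outer"` if you want to keep unmatched rows.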
Part of the problem, as a sibling comment mentioned, is that many people get into these fields from non-programming or even computer-illiterate backgrounds. At the undergraduate level, even in heavily numerical disciplines like physics, there is relatively little coding. Even then, it's tedious crap like F90 and bulletproof C. So people get into a Master's or a PhD in a field they love, are suddenly confronted with data analysis, and have no idea what to do.
I've spoken a lot to my girlfriend (an astrophysicist) about this, as she's in this position herself. It's not that she isn't smart, she just has little to no experience with data wrangling and she's (in my opinion) been inadequately trained. I've solved things with Python one-liners that she would have spent literally days doing manually. We've had conversations along the lines of:
"But why does this file have 700 lines of input data filenames hardcoded? You know you could write some code to grab them for you?"
"Yes, but in the time it would take me to write the code, I may as well do it by hand." [In the end I wrote a 5 line regex to do it]
So I can assure you, people question the manual process and they think you're magical when you show them a quicker way of doing things. However, often there isn't the motivation or confidence to try and be magical themselves.
My wife works at the virology lab of a local hospital. They do a lot of validation and quality controls. However, when she started there, all their statistics were calculated by hand using approximate equations which were supposed to make it doable using a simple calculator. Whoever had set this up lacked a strong understanding of statistics. Still, this procedure had remained unchanged since the eighties and no one knew how or why it was supposed to work. Going back in old records, my wife found that there were a lot of errors in the recorded results.
She spent a while setting up a simple Excel sheet into which they enter all test results and now all the statistics are calculated automatically using proper methods.
This is where I take the time to explain to everyone that "Yeah, while it may save you time to do it manually, if you do it THIS way instead, you never have to write it from scratch again, and your output goes waaay up".
Every job I've ever had, I always look at how my predecessors do the job, then automate the crap that I don't want to deal with. :/
I'm in the middle of writing a retrospective on my PhD - the stuff I wish I'd done at the beginning. Perhaps unsurprisingly, it mostly boils down to automate and pipeline everything. It almost always saves time because inevitably you have to repeat work with parameters changed very slightly.
The prevailing attitude is "I want to do biology, not computer science". I find it pretty frustrating. This is probably one of the last fields of science to be seriously computerized.
I used to work as a bioinformatics programmer, and in some cases an afternoon of work, including writing the Python script, replaced what a postdoc had spent a month doing manually in Excel.
While the tools often are crude, part of the problem is that biochemistry and molecular biology researchers lack the bioinformatics and other IT skills that are often badly needed in modern genetic research. The core competence of these researchers is in the lab, and bioinformatics sits outside of it. That you see effects like the conversion of gene names in Excel does not surprise me at all.
You also see the same problem in other areas where researchers need skills outside of their usual comfort zone. The classical example is statistics.
Bonus: the number of errors is positively correlated with Impact Factor (a tool used by statistically illiterate administrative types to judge the quality of research).
I almost lost it at "We hope the date and floating-point conversions will be made non-default options – in deference to the large bioinformatics and biotechnology communities if not for other users."
Sure, the vast biotech community outnumbers those few people who work in some other jobs (or use Excel for private stuff) and enter dates or floating point numbers.
Excel is trying to limit the effect of sloppiness on the part of laypeople. They usually don't set cell formatting to date or number. Excel really does the right thing there.
And if you're a scientist, presumably with a respectable degree and professional responsibilities, learn to use your tool correctly.
Nice how the correct ways to use Excel are listed as cumbersome workarounds.
In the vast majority of real use cases, it certainly is. Microsoft has one of the greatest usability labs in the world, but they also have the broadest user base.
Not exactly -- 13,000 citations and counting, mostly due to a few papers in Nature, Cell, and NEJM. But that doesn't change the fact that JIF is a terrible metric. Frankly, papers like the above are why impact factors are so misleading. Most research in the above journals will not be cited as heavily as ours has been.
Even counting citations is better, although if admins would actually read the papers, that would be better-er. Even NIH intramural people talk about impact factor like it means something. Pretty fucked considering that the publishers of glamour rags have vested interests that are very nearly the opposite of "careful scholarship".
The one thing SQL can't do that Excel can: a calculation cell, as in, one that starts with =.
Yes, I could probably code something entirely in SQL to do it, but it would not be portable across SQL implementations; and yes, I can code that as a variable in my program... but neither of them seems as fluent as the way Excel does it.
I wouldn't use Excel for prod, but it comes in handy for a lot of small dumb shit purely because of =.
> I don't see them formulating a correct SQL query.
Well, neither can programmers, since SQL injection is still the most common security vulnerability in software, so I agree, SQL is probably a bad tool for the job :)
> It would have to be a custom-tailored system that knows about nomenclature in the field. Sounds not very efficient.
Is Excel a custom-tailored system that knows about nomenclature in the field? The article seems to explicitly argue against that.
I don't think it's a failure to understand data types. It's a mismatch between what you expect the software to do and what it does by default. Unfortunately, Microsoft has steadfastly refused to allow any way to change the auto-formatting options (check out some people really pissed off at being treated like children here - http://answers.microsoft.com/en-us/office/forum/office_2007-...).
There are two useful prongs of attack here. One is to somehow force Excel to conform to the expectations of researchers - perhaps an extension that prevents the most egregious cases of auto-formatting gone wrong? The alternative would be what you suggest - creating and marketing a custom solution. The problem there is that you'd either need buy-in from a significant number of researchers to spread it, or you'd need to replicate a lot of Excel features to make the transition smooth for others.
> Well, neither can programmers, since SQL injection is still the most common security vulnerability in software, so I agree, SQL is probably a bad tool for the job :)
The problem isn't with SQL, it's with the idiotic idea of gluing SQL queries from strings. See also: template languages for web pages, i.e. gluing strings to build what is really a tree.
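To make the "don't glue strings" point concrete, here's a minimal sketch using Python's built-in sqlite3 driver (the table and values are invented for illustration). With a placeholder, the value never becomes part of the query text, so even a hostile string stays inert:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (symbol TEXT, description TEXT)")
conn.execute("INSERT INTO genes VALUES (?, ?)", ("SEPT2", "septin 2"))

# The ? placeholder passes the value separately from the SQL, so an
# injection attempt arrives as plain data, not executable SQL.
hostile = "x'; DROP TABLE genes; --"
row = conn.execute(
    "SELECT description FROM genes WHERE symbol = ?", (hostile,)
).fetchone()
# No match is found, and the table is untouched.
```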
--
I think the hate against Excel is unfounded. Sure, it's not perfectly suited software for the domain - in the same way that a dedicated fruit slicer is better than a knife at slicing fruit, a dedicated fruit peeler is better at peeling it, etc., but one knife can do all those jobs pretty well on its own while also being able to do countless other things. Excel is a very versatile and powerful tool, and this power comes from its flexibility. The solutions often proposed, like a "proper" database-backed system, usually involve a fixed workflow and having to call in IT support every time something doesn't fit that workflow (which is always).
So IMO - yes for people dealing with data using Excel. Yes for them learning a programming language, SQL and a generic database system. But strong no for forcing them to use domain-specific dedicated "tools" that impose a particular workflow on them.
Not to a person who specifically does not want that feature - and since Excel does not provide a way to turn it off or customize it, my idea was that an extension might be able to. Of course, I've never written an Excel extension (macro?), so I have no idea.
That's a ridiculous argument and you should know it. Forcing users to do manual work every time instead of having an option to disable or configure a feature is just a UX fail. Doubly so because I would bet those formatted Excel files don't survive the transition, and the data is actually transmitted as CSV or whatever, so you'd have to reformat the data over and over every time you open it for editing, and hope that no one along the way makes a mistake. This is a problem software is meant to solve, not create.
You could argue that all you can do is mitigation, since CSV files don't offer much ability to influence how Excel will load them, and a single improperly-configured Excel along the pipeline would break the data. However, it would be a significant mitigation: Excel installs would be configured once, you would only deal with the situation 1% of the time, and the fix would be trivial (just configure it!). Instead, people are now dealing with the problem every time, and the solution (just mark the cells!) takes a lot more effort.
I don't think Microsoft necessarily has incentive to add this configuration (the science community as a whole is probably a tiny blip on its radar), but this is why we create modular and extensible software - so others can tweak it to their liking.
> It would have to be a custom-tailored system that knows about nomenclature in the field.
And is available for free - Excel came with the university / corporate license on all desktops at no extra cost (in most places relevant to this discussion). And popular + easily available - for any replacement you'd need to convince IT to allow it on the network / preinstall it on the provided systems.
For a long time, I knew I was dropping this gene [1] from my analyses because pandas automatically converted its name to a floating-point number. With the right combination of flags I was able to get it to work right, but even real programming languages are not immune.
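A sketch of the failure mode (the identifier below is a hypothetical stand-in, not necessarily the gene in question): an identifier that happens to look like scientific notation gets inferred as a float unless you disable type inference.

```python
import io
import pandas as pd

# A gene-like identifier that also parses as scientific notation.
csv_text = "gene\n2310009E13\n"

# Default inference turns it into a float; dtype=str keeps it as text.
mangled = pd.read_csv(io.StringIO(csv_text))
fixed = pd.read_csv(io.StringIO(csv_text), dtype=str)
```

Passing `dtype=str` (or a per-column dtype mapping) to `read_csv` is the usual fix.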
Science is probably the least scary thing that an Excel bug can affect. JPMorgan Chase's "London Whale" venture lost $2 billion in part because spreadsheet modelers divided by a sum instead of an average to get a value at risk.
Then there was the infamous Reinhart-Rogoff paper in economics, used to justify harmful austerity policies worldwide post-2008, that came to false conclusions using a row formula that wasn't updated.
At university I worked on a project parsing gene relationships from the literature, and yeah, I remember the inconsistencies. Genes also have funny names: there's SHH (Sonic Hedgehog), DICER1 (which cuts something, RNA or DNA, I forget), and a bunch of other silly ones.
Ultimately though, coming from the world of algorithms and nicely organized data, it was frustrating how disorganized the nomenclature seemed.
Don't forget that the genomics folks also renamed a whole bunch of genes a few years ago so now there are two different names for the same thing floating around!
For every gene, there are typically at least 3-4 names that reference the gene. In some cases, two genes have the same name- for example, OCT1 and Oct-1. The first is "organic cation transporter 1" and the second is "Octamer binding protein 1". The second was "renamed" to (IIRC) POU2F1 but there are still plenty of references to the old name even in new literature.
This is just one example. The entire gene naming area is a pile of bollocks.
Oh this happens constantly. Apparently, in an ideal world, every gene will have multiple counterintuitive symbols that appear to give insight as to what the protein does, but are in fact misleading.
HGNC seems to delight in nonsensical renaming. MLL (Mixed-lineage leukemia, a gene rearranged in many leukemias, particularly infants) is now KMT2A ("Lysine Methyl-Transferase 2A"), yet gobs of its fusion partners are canonically named, say, MLLT3 ("Myeloid/Lymphoid Or Mixed-Lineage Leukemia; Translocated To, 3").
But wait, what's its translocation partner? Oh, right, the Gene Formerly Known As MLL. Who thinks this is a good idea?
Journals typically insist on the latest HGNC (HUGO Gene Nomenclature Committee, a subset of the HUman Genome Organization) symbols, whatever they might be (see above). Not atypically, in review, someone will ask to use the old name because that's what they're used to. Best of all is when reviewers each suggest using a different name.
I'm interested in this. I have biologist friends at the university and we've discussed the possibility of implementing such functionality; I'd like to give it a try and probably re-discover what you found already. Some kind of report of your experience would be very useful.
Do you have some references to that project? Reports, comments, maybe software?
> [...] coming from the world of algorithms and nicely organized data, it was frustrating how disorganized [...]
Yeah. Rigor is not really a concept that even molecular biologists are at home with, which is understandable and probably explains why their parties are much better than those in a lot of other disciplines, but it's a little hard to get used to in some other ways.
Pathology resident working with big data and an undergrad in physics checking in. I learned SQL. Met Stonebraker before the Turing Award. I taught myself Python in part by working on Rosalind problems. It's not that I don't understand. I don't have chunks of time large enough to quiet my mind, frame-shift, and work on my research at the code level. I've got IRBs and all sorts of oversight to deal with, budgets, etc. and the biology. A central success of my project has been recruiting professional CS people early.
That's my single biggest suggestion for biologists: know that professional computer scientists and programmers are desperately hungry for interesting data and would love nothing more than to help you design the project up front so they don't get sucked into the vortex of technical debt that will swallow your project if you don't set it up right early.
Excel is a fantastic tool, and widely used in bioinformatics, but to use any tool properly you have to learn it. I'm really thankful that it was taught first in my bioinformatics class, before any specific tools or programming languages, so we wouldn't commit stupid errors like this one.
> Linear-regression estimates show gene name errors in supplementary files have increased at an annual rate of 15% over the past five years, outpacing the increase in published papers (3.8% per year).
This still doesn't actually tell us whether the problem is getting worse - or if it does, it's badly worded. Even assuming the 3.8% is derived from their own data, you need the number of published papers that contain gene lists (which has probably risen faster than the number of papers overall).
In other words, the authors should have plotted the error rate over time rather than the number of errors over time.
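To make the count-versus-rate distinction concrete, a toy calculation (all growth figures here are hypothetical): if papers containing gene lists grow faster than errors do, the error rate falls even while the error count rises every year.

```python
# Hypothetical growth: errors up 15%/yr, gene-list papers up 20%/yr.
errors = [100 * 1.15 ** y for y in range(6)]
papers = [1000 * 1.20 ** y for y in range(6)]

# The rate is what actually tells you whether things are getting worse.
rates = [e / p for e, p in zip(errors, papers)]
```

Here the raw error count climbs each year while the per-paper rate declines, which is why plotting the count alone is misleading.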
It would be interesting to a certain type of person to look back on this in ten years and see if any of those papers were corrected, etc. Some links are in this thread that would allow this to be done now as well. Also check out the fMRI and microarray scandals that contaminated decades worth of publications.
Did using correct data and analysis pipelines actually matter to the conclusions these authors came to?
What's your proposed alternative that doesn't require hours to start with? Serious question - I believe no programming languages can be applied here. (at least not yet) The barrier to entry must be minimal.
I'm saying: we've got lots of people who know the science and do not know, or want to know, anything about programming. If we want good results, we need to enable them to work better somehow. Proper programming is a good thing to add to university programs now, so that in ~10 years we can have this talk again and point out that people know better tools and shouldn't be using Excel, or at least shouldn't be using it badly.