
I do this manually by appending `__t_tag`, where `t` is a category and `tag` the value.

E.g. `__o_car`, where `o` means object; `__p_supercode`, where `p` = project; `__t_ml`, where `t` = topic and `ml` = machine learning; etc.

No dependencies, hardcoded into the files forever, and search is reasonably fast too (don't need it that often anyway).


That still looks pretty hierarchical. I think the point of tags vs something like categories is that a file can have multiple independent tags. Your system can do it with symlinks too of course.


Forgot to add that those can be concatenated, e.g. `__o_car__t_finances`.


I simply rinse my wooden boards with boiling water after cutting meat on them (and then wash them of course).


That sounds reasonable. But what about the case where the DB migration of version 2 would be incompatible with code version 1, e.g. a column was dropped?


You NEVER do that in one go; you need to split it into several deployments. Dropping a column is relatively straightforward, in two steps: first deploy a version of the code that doesn't use the column, then release the migration dropping the column.

The typical example is renaming a column, which needs to be done in several steps (a SQL sketch follows below):

1. Create the new column, copying the data from the old column (DB migration). Both columns exist, but only the old one is used.

2. Deploy new code that works with both columns, reading from the old one and writing to both.

3. Deploy a data migration (DB migration) that ensures the old and new columns have the same values (for data consistency). At this point, the code deployed in the previous step no longer does any "old column only" writes.

4. Deploy new code that uses only the new column. The old column is deprecated.

5. Delete the old column.

At any given point, code versions N (current) and N-1 (previous) are compatible with the DB. Any change to the DB is made in advance, in a backwards-compatible way.
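
For concreteness, a minimal SQL sketch of the DB-side steps could look like this, assuming PostgreSQL and a made-up `users` table where `name` is being renamed to `full_name` (steps 2 and 4 are code deploys, not migrations):

    -- Step 1 (DB migration): add the new column and copy existing data.
    ALTER TABLE users ADD COLUMN full_name text;
    UPDATE users SET full_name = name;

    -- Step 2 (code deploy): the application reads the old column and writes both. No SQL here.

    -- Step 3 (DB migration): backfill rows written before step 2 went live,
    -- so both columns are guaranteed to match.
    UPDATE users SET full_name = name WHERE full_name IS DISTINCT FROM name;

    -- Step 4 (code deploy): the application reads and writes only full_name.

    -- Step 5 (DB migration): drop the old column once nothing uses it.
    ALTER TABLE users DROP COLUMN name;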


I see. Thanks for the clarifications.

And these DB migrations, did your team keep a history of them? If so, did you manage them yourselves, or did you use some tool like Flyway?

I'm asking because I'm starting a project where we will manage the persistence SQL layer without any ORM (so far I've always done it with Django's migrations), but we might consider some third-party tool for DB migrations.


The way I've seen it work is hand-written SQL for the migrations, numbered and tracked in Git.

There shouldn't be any reason you can't do it with Flyway, though I would be concerned about fighting it a bit. I use Django a fair amount, and I honestly don't see a good way to make this approach work with Django; not suggesting that you can't, but you would be fighting Django quite a bit, since it's not really how it's designed to work.

If you don't have an ORM, then this is actually much, much easier to do right. I'd design the initial schema, either by hand or using pgAdmin, TOAD or whatever your database has. From there on, everything is just hand-written migrations.
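
In practice this can be as simple as a directory of numbered .sql files applied in order, plus a bookkeeping table recording what has been applied. A hypothetical sketch (file names and schema are made up; tools like Flyway or dbmate maintain the bookkeeping table for you):

    -- migrations/0001_create_users.sql
    CREATE TABLE users (
        id         bigserial PRIMARY KEY,
        email      text NOT NULL UNIQUE,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- migrations/0002_add_users_full_name.sql
    ALTER TABLE users ADD COLUMN full_name text;

    -- If you track migrations yourself, record applied versions somewhere:
    CREATE TABLE IF NOT EXISTS schema_migrations (version text PRIMARY KEY);
    INSERT INTO schema_migrations (version) VALUES ('0002');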


I didn't mean to use Django _and_ a separate migration tool. It's just that I've worked with Django so far, but I'm now switching to a new codebase without it. Hence my question about experiences with DB migrations.


At work we use Django migrations. It helps. But the core of schema evolution is backward compatibility. To reach this goal, code review is necessary.

Here we never drop columns, only add new columns, and we never change column types.

When it comes to constraints (NOT NULL or anything else), we double-check backwards compatibility.

Changing data is not a robust rollback strategy.
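
As a rough sketch of what "only add, never drop or change" can look like in SQL (hypothetical table and column names; the new column stays nullable or defaulted so code that predates it keeps working):

    -- Backwards compatible: old code simply ignores the new column.
    ALTER TABLE orders ADD COLUMN shipping_notes text;

    -- If a NOT NULL constraint is really wanted, add it only after a default
    -- and a backfill, once every running code version writes the column.
    ALTER TABLE orders ALTER COLUMN shipping_notes SET DEFAULT '';
    UPDATE orders SET shipping_notes = '' WHERE shipping_notes IS NULL;
    ALTER TABLE orders ALTER COLUMN shipping_notes SET NOT NULL;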


Django has migrations; why would this be harder with Django?


The migrations are fairly tightly coupled to the code. You can apply the migrations without deploying new code if you extract them, but now you have at least two branches in your version control, both of which are technically in production: the version that's actually running, and the version with the model changes and the migrations, from which you extracted the migrations and applied them to the database.

I'd argue that because you're generating the migrations from the model, it's also easier to accidentally create migrations that are not independent of the code version.


Hm, but isn't that right? You make your change in code, which doesn't touch your models (except you're no longer using the column you'd like to deprecate), and deploy that; show it's working. Then you make another change to actually remove the column from the models and generate a migration. Then you deploy that version, which migrates the db and runs your new model code?

(You could in theory remove the column but not merge the migration if you wanted to show your code worked fully without that column in your ORM model before removing it from the DB as well?)


Django migrations have painful edge cases when you deal with data migrations.


We've used dbmate[1] outside of the Django/alembic ecosystem.

[1] https://github.com/amacneil/dbmate


You can also check out Bytebase; it's like GitLab for database schema migrations. (Disclaimer: I am one of the authors.)


Btw, it's also bad to drop a column when you have multiple people on a team switching between branches. It's always a headache, so the best thing is to delay dropping/deleting.

Renaming things that way gets a little tricky, but you can work around it with database triggers if you really need to rename something.
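
A rough sketch of the trigger workaround, assuming PostgreSQL (11+ for EXECUTE FUNCTION) and made-up `users`/`name`/`full_name` names: both columns exist for a while, and a trigger keeps them in sync so code on either side of the rename keeps working:

    -- Keep old and new column in sync while both code versions are live.
    CREATE OR REPLACE FUNCTION sync_users_name() RETURNS trigger AS $$
    BEGIN
      IF TG_OP = 'UPDATE' AND NEW.name IS DISTINCT FROM OLD.name THEN
        NEW.full_name := NEW.name;   -- old code wrote the old column
      ELSIF TG_OP = 'UPDATE' AND NEW.full_name IS DISTINCT FROM OLD.full_name THEN
        NEW.name := NEW.full_name;   -- new code wrote the new column
      ELSIF TG_OP = 'INSERT' THEN
        NEW.full_name := COALESCE(NEW.full_name, NEW.name);
        NEW.name := COALESCE(NEW.name, NEW.full_name);
      END IF;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_name_sync
      BEFORE INSERT OR UPDATE ON users
      FOR EACH ROW EXECUTE FUNCTION sync_users_name();

Once nothing writes the old column anymore, the trigger and the old column can both be dropped.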


The problem I've seen a lot, particularly with Rails, is when migrations generate a schema dump after running them, which can get really messy if people blindly commit the diff or if you rely on running several years of migrations in your local environment from scratch (many of which may have been edited or removed if they directly referenced application code). Given the migrations are executed through a DSL and the dump is just a structural snapshot of the DB at the end of the run, they're not quite as reproducible as plain SQL migrations.

You just end up with weird, unpredictable DB states that don't reflect production at all. Especially when you're dealing with old versions of MySQL and the character encoding and collation are all over the place.


What about new rows added during steps 0 - 2?


You do it in two stages. Add the new column, then deploy code that uses it and no longer uses the old column. Later, drop the old column once nothing is using it anymore.


I'd say both.

Some of it is declarative, e.g. `FROM`, `ENV`, `EXPOSE`, while on the other hand `RUN`, `CMD`, etc. are fully imperative.


You don't really get credit for being "both". There are maintainability and comprehensibility benefits to keeping anything imperative out of a language (you don't have to reason causally from one statement to the next), which is out the window when you introduce imperative elements. Also: in a Dockerfile, those imperative elements are the heart of the system.


`ENV` is a bad example because its effect differs greatly depending on where it's placed in the Dockerfile, e.g. before or after a RUN statement consuming its value.

`FROM` also has more use cases when using multi-stage builds.


Oh, you're right. I forgot that a key characteristic of anything "declarative" is that the order of statements should not matter.

Actually, come to think of it, since `RUN` may depend on any other Dockerfile statement (even `EXPOSE` might make a difference to the code), does this mean that even a single imperative statement introduced into a language makes the language imperative?


Anyone know what the advantage of this is over a big composite repo with several git submodules?

I think submodules are better suited for separation of concerns and performance, while still achieving the same composite structure as an equivalent monorepo?


The advantage is simple: Git submodules suck and are a chore to manage for any dependency that sees remotely high traffic or requires frequent synchronization. As the number of developers, submodules, and synchronization requirements increases, this pain increases dramatically. Basic Git features like cherry-picking and bisecting to find errors become dramatically worse. You cannot even run `git checkout` without potentially introducing an error, because you might need to update the submodule! All your most basic commands become worse. I have worked on and helped maintain projects with 10+ submodules, and they were one of the most annoying, constantly problematic pain points of the entire project, ones that every single developer screwed up repeatedly, whether they were established contributors or new ones. We finally had to give in and start using pre-push hooks to ban people from touching submodules without specific commit message patterns. And every single time we eliminated a submodule -- mostly by merging it and its history into the base project, where it belonged anyway -- people were happier, development speed increased, and people made fewer errors.

The reasons for those things being separate projects had a history (dating to a time before Git was popular, even) and can be explained, but ultimately it doesn't matter; by the time I was around, all of those reasons ceased to exist or were simply not important.

I will personally never, ever, ever, ever allow Git submodules in any project I manage unless they are both A) extremely low traffic, so updating them constantly doesn't suck, and B) a completely external dependency that is mostly outside of my control, that cannot be managed any other way.

Save yourself hair and time and at least use worktrees instead.


A monorepo allows a single commit to update across components, e.g. an API change.


For each submodule affected by some change you would need an additional commit, yes. But those commits are bundled together in the commit of the parent repo, where they act as one.

So atomicity of changes can be guaranteed, but you need to write a few more commits. However, this small increase in commits is far outweighed by the modularity, imo.


> this small increase in commits is far outweighed by the modularity

Not remotely. As the scale of the codebase increases, the benefit of modularity goes to zero and the benefit of atomic changes increases.

Also: it's not always feasible to break up a change into smaller commits. Sometimes atomic change is the only way to do it.


With --recurse-submodules the atomicity doesn't seem to suffer. It used to be the case that you couldn't ensure all changes in the source tree were pushed atomically; now you can, but I'm not sure it's the default behavior.


Is it? I'm slightly struggling to understand what benefit you gain from having the "parent" repo but also having individual submodules. Sure, working in each individual project's module makes cloning faster, until you need to work on a module that references another module (at which point you need to check out the parent repo or risk using the wrong version), and now every change you make needs two commits (one to the sub-repo, and one to the base to bump the submodule reference).


In our case, we have a codebase that involves two submodules: one for persistence and one for Python-based management of internal git repos. Both of these are standalone applications and can run on their own. They are then used in a parent repo that represents the overarching architecture and calls into the submodules.

The advantage of this is that devs can work on the individual modules without much knowledge of the overarching architecture, and without strong code ties into it.

Right now our persistence is done with SQL, but we could swap it for anything else, e.g. Mongo, and the parent codebase wouldn't notice a thing, since the submodule only returns well-defined Python objects.

Of course, this comes at the cost of a higher number of commits, as you mentioned. But in my opinion these are still cheap, because they only add trivial quantity, not brain-demanding quality.


But what do you do as soon as one of the submodules has a dependency on another? I imagine you might not hit it in your simple case, but I feel like scenarios like that are where the advantages of monorepos lie.

To take a concrete example, I'm working on a codebase that houses both a Node.js server-side application and an Electron app that communicates with it (using tRPC [0]). The Electron app can directly import the API router types from the Node app, thus gaining full type safety, and whenever the backend API is changed the Electron app can be updated at the same time (or type checks in CI will fail).

If this weren't in a monorepo, you would need to first update the Node app, then pick up those changes in the Electron app. This becomes risky in the presence of automated deployment, because, if the Node app's changes accidentally introduced a breaking API change, the Electron app is now broken until the changes are picked up. In a monorepo you'd spot this scenario right away. (Mind you, there is still the issue of updating the built Electron app on the users' machines, but the point remains - you can easily imagine a JS SPA or some other downstream dependency in its place.)

[0]: https://trpc.io/


Yes, if one submodule depended on another, this would indeed cause problems.

So far we've been able to avoid it, though, through strict encapsulation.

But I definitely see the point in your example and probably wouldn't go with submodules there either.

It's just that in OP's link, I'm quite sceptical, as the monorepo approach requires some heavy tweaking.


I had missed the git push --recurse-submodules flag, even though it seems like it's been there for a long time. Yeah, it seems like it would work, except you need to configure it to always be "check", or always on, when you push.

