Ah yes, I too have accidentally committed node_modules.
Jokes aside, and coming from a place of ignorance, it's interesting to me that a file count that size is still a real performance issue for git. I'd have expected that something so ubiquitous and core to most of the software world would have seen improvements there.
Genuine, non snarky question:
Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made? Or is this a case of it being a large effort and no one has particularly cared enough yet to take it on?
> Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made?
It’s hard to look at a million files on disk and figure out which ones have changed. Git, by default, examines the filesystem metadata. It takes a long time to examine the metadata for a million files.
The main alternative approaches are:
- Locking: Git makes all the files read-only, so you have to unlock them first before editing. This way, you only have to look at the unlocked files.
- Watching: Keep a process running in the background and listen to notifications that the files have changed.
- Virtual filesystem: Present a virtual filesystem to the user, so all file modifications go through some kind of Git daemon running in the background.
All three approaches have been used by various version control systems. They’re not easy approaches by any means, and they all have major impacts on the way you have to set up your Git repository.
People also want things like sparse checkouts when working with such large repos.
It's notable that git does support "watching", but it requires some setup on Linux to install and integrate with Watchman. On Windows and Mac, core.fsmonitor has been built in since version 2.37.
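For anyone who wants to try it, turning on the builtin monitor (plus the untracked cache) is just configuration; a minimal sketch, assuming git >= 2.37 on a supported platform:

    # start/use the builtin filesystem monitor daemon for this repo
    git config core.fsmonitor true
    # cache untracked-file scan results between status runs
    git config core.untrackedcache true
    # later status calls only re-examine paths the daemon reports as changed
    git status

On Linux you would instead point core.fsmonitor at a Watchman hook script.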
Are there any solutions that use libgit2's ability to define a custom ODB backend? There are even example backends already written [1] that use RDBMSs as the underlying data store.
There are repos with many files and there are repos with lots of history data. Those are problems with different solutions—adding millions of files to the repo will make 'git status' take ages, but it won’t necessarily put the same level of pressure on the object database.
There are various versions of Git that use alternative object storage, like Microsoft’s VFS, if I remember correctly.
Has anyone made a system like option 3 that successfully merges git with a filesystem? It could present both git and fs interfaces, but share events internally. I'd be interested to see how that would work.
What about asking the OS for the list of changes like Everything on Windows does, instantly, for millions, at a RAM cost of ~1-2 browser tabs (though that might be limited to NTFS, but still)?
> What about asking the OS for the list of changes like Everything on Windows does
That's not, the last time I checked, how Everything on Windows works.
Windows provides the ability to hook into FS system calls, so that things like virus scanners work.
Everything uses the hook to get notified of all changes, and uses those mods simply to update its index (which is faster than scanning a file for viruses, so it's imperceptible to users).
It's a great idea, and I don't think there is anything similar in Linux or BSD (inotify isn't the same thing, AFAIK, it uses up file descriptors).
Other users have made good comments about performance limitations on the underlying filesystems themselves. Adding to this, I recently encountered the findlargedir tool, which aims to detect potentially problematic directories such as this: https://github.com/dkorunic/findlargedir/
>Findlargedir is a tool specifically written to help quickly identify "black hole" directories on any filesystem having more than 100k entries in a single flat structure. When a directory has many entries (directories or files), getting a directory listing gets slower and slower, impacting performance of all processes attempting to get a directory listing (for instance to delete some files and/or to find some specific files). Processes reading large directory inodes get frozen while doing so and end up in the uninterruptible sleep ("D" state) for longer and longer periods of time. Depending on the filesystem, this might start to become visible with 100k entries and starts being a very noticeable performance impact with 1M+ entries.
>Such directories mostly cannot shrink back even if content gets cleaned up due to the fact that most Linux and Un*x filesystems do not support directory inode shrinking (for instance very common ext3/ext4). This often happens with forgotten Web sessions directory (PHP sessions folder where GC interval was configured to several days), various cache folders (CMS compiled templates and caches), POSIX filesystem emulating object storage, etc.
IME, on basically all filesystems, just walking a directory tree of lots of files is expensive. Half a million files on modern systems should not be a terribly huge issue but once you get into the millions, just figuring out how to back them all up correctly and in a reasonable time frame starts to become a major admin headache.
Since git is essentially a filesystem with extensive version control features, it doesn't surprise me that it would have problems handling large numbers of files.
Not related to git (I hope), but a lot of scientific data/imaging folks seem to think file abstractions are free. I've seen more than one stack explode a _single_ microscope image into 100k files, so you'd hit 1M after trying to store just 10 microscope slides. Then, a realistic archive with thousands of images can hit a billion files before you know it.
It's hard to get people past the "works for me" demo phase, where they have only played with one image, to realize they really need a reasonable container format to play nice with the systems world outside their one task.
I was referring to general-purpose filesystems in common use today. Yes, there are a lot of special-purpose and experimental filesystems which are optimized for certain use cases, and a competent systems programmer could write one optimized specifically for small files, but these all have to make significant trade-offs.
It used to be much rarer. With 20 TB drives available today, it is much more common to be able to handle many more files. When I designed my file system replacement (www.Didgets.com), I didn't just put 'a million files' in the requirement; I put 100x more in it.
Now I have a system that will find subsets in just a second or two (even when the whole set contains hundreds of millions and any given subset might contain hundreds of thousands of matches). Here is a short video of a demo: https://www.youtube.com/watch?v=dWIo6sia_hw
In my experience, the standard Linux filesystem can get very slow even on super powerful machines when you have too many files in a directory. I recently generated ~550,000 files in a directory on a 64-core machine with 256 GB of RAM and an SSD, and it took around 10 seconds to do `ls` on it. So that could be a part of it too.
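(In my experience a chunk of that time is often `ls` itself rather than the filesystem: by default it sorts the whole listing and, when classifying or colorizing, stats every entry. A quick sanity check, assuming GNU coreutils:

    # -f disables sorting (and implies -a), so this mostly measures raw readdir speed
    time ls -f | wc -l
    # compare against the default behaviour
    time ls | wc -l

If the second run is much slower, the directory itself isn't the whole story.)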
I always marvel at it and think: "wow so git goes through its history, pulls out many small files and chunks and patches, updates the whole file tree and all of this after hitting enter and being done like immediately."
> Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made?
I can't speak to improving git, but I think some light on this area can be shed by Linus' tech talk at Google in 2007.
1. Linus says there's a specific focus on full history and content, not files ... so it's a deliberate, different axis of focus than file count.
2. As Linus tells it, Git appears to be designed specifically for project maintenance while not getting in the way of individual commits and collaboration. But the global history and more expensive operations on things like "who touched this line" are deliberate so lines of a function are tracked across all moves of the content itself.
Practically, I've used lazy filesystems both for Windows-on-Git via GVFS [1][2] and Google's monorepo jacked into a mercurial client (I think that's what it is?). Both companies have made this work, but as Linus says, a lot of the stuff just doesn't work well with either system.
Windows-on-Git still takes a lot of time overall, and stacking > 10 patches of an exploratory refactor with the monorepo on hg starts slowing WAY WAY down to the point where any source control operations just get in the way.
Something I learned about writing robust code is that scalability needs to be tested up-front. Test with 0, 1, and many where the latter is tens of millions, not just ten.
I've seen production databases that had 40,000 tables for valid reasons.
I've personally deployed an app that needed 80,000 security groups in a single LDAP domain, just for it. I can't remember what the total number of groups across everything was, but it was a decent chunk of a million.
Making something like Git, or a file system, or a package manager? Test what happens with millions of objects! Try billions and see where your app breaks. Fix the issues even if you never think anyone will trigger them.
It's not about scaling to some arbitrary number, it's about scaling, period.
While this is true in some cases, more frequently I saw apps designed and able to handle millions of users and billions of transactions that ended up being used by tens of users and hundreds of transactions.
All the effort spent on testing and optimizations for scaling purposes was a waste of time and resources that could have been better spent elsewhere.
I'm not saying one should not care, or write sloppy code, but there is a balance where code is just good enough for the purpose. There's a lot of truth in "don't do premature optimization".
It's a difficult one; if you don't know yet if you will have billions of transactions, you should focus on clarity and flexibility - that is, you should be able to rewrite and re-architect your application and its runtime IF it turns out to be successful.
A parent comment mentioned SQL databases for example; those are great because they can scale both horizontally and vertically these days, sometimes with the click of a button in AWS.
Other good practices are things like stateless back-end services so they can scale horizontally, thoroughly documenting (and maintaining documentation) on business processes handled by the software, monitoring, etc.
Disclaimer: I'm an armchair expert, I've never had to deal with back-end scaling.
Something has gone horribly wrong if you don’t know if the requirements are 10 request per second or a billion per second.
We build some services from the ground up for very high traffic and the hoops you have to jump through and the tradeoffs just don’t make sense for a basic CRUD thing which can run on a boring ole machine and a little SQL instance
The question becomes: when your app is on the growth curve, do you start to test this?
I work in enterprise software and one of the big problems I see is companies suck at software growth when it's obvious the software is in the upward curve.
Large companies will throw huge amounts of data at your app once you sell it to them.
You still need to do enough to buy time if you do need more scalability, since scalability tends to be architectural. Waiting until you hit a wall in production is usually months too late to start working on it. As a moving target, I often try to test at 10x the current workload, which is usually enough to deal with load spikes and surfaces scalability issues early enough that customers don't see them.
I think the GP was saying that it should scale without breaking. It can get slow, fine, that's a different challenge. But it shouldn't segfault (as an example).
The gist of it is this: many load tests don't even consider the actual potential volume of traffic. But that's fine, if you're using a load tester, you don't have to estimate the traffic well - even to within an order of magnitude. You can just add a couple extra zeroes, and see where it breaks. Failing to do this simple thing will usually lead to objectively worse software, and there's a chance that some day you'll need to handle that much traffic.
But that chance isn't the sole reason why you're doing that load test. The reason is to improve the software. You're identifying defects by stressing the limits.
When you're doing a load test (or any test really) the possible outcomes are basically three: (1) it works! (2) it broke. (3) huh, that's interesting. If your tests are always coming up (1) then you're not obtaining any benefit from them. Don't you want to know where the limiting factors are in your app? If you're able to remove those limits, but not for production (at least not right now), wouldn't it be great to know what will break next month (or next year) when you do?
Think of the person who writes unit tests for every piece of code, but not as TDD. There's a school of thought that you should write the test first, then write the simplest code that passes, and that's fine but not what I'm talking about. Imagine a person who writes perfect code and perfect tests. Every code works, every test passes confirming that it worked. What is even the value of writing the test?
That's what load testing under only the expected conditions is like. We already know the software works under those conditions (likely) because it's already in production, handling that amount of load. So while there is value in a load test that runs prior to deployment, in order to check that nothing of the change is likely to induce a break under the expected/existing load, it's a different kind of testing and produces different value than a stress test that is designed to hopefully induce a failure and show you where there is a defect in code. Where it segfaults, for example.
And just because you've identified a limiting factor outside the bounds of what expected activity is likely to go through the system soon, doesn't mean you need to fix it now. Having one less "known unknown" on the table is a thing of value. Now that stress won't be able to surprise you later, when that parameter has drifted into the danger zone because of organic development, and now it's becoming a thing in the way.
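To make the "couple extra zeroes" point above concrete: if normal load testing runs at roughly the expected concurrency, a stress run just multiplies it until something gives. A sketch using wrk (the endpoint is hypothetical):

    # roughly expected traffic
    wrk -t8 -c100 -d60s https://staging.example.com/api/health
    # two extra zeroes on concurrency: find out what breaks first
    wrk -t8 -c10000 -d60s https://staging.example.com/api/health

Whatever falls over first - connection limits, pool exhaustion, memory - becomes your next "known unknown".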
"GP was saying that it should scale without breaking" and the responder was saying that making that a priority means that your wasting time that probably won't be needed.
The time you spend making it work for millions of users that won't materialize is time not spent delivering value to the customers that do need it.
Someone experienced will know how much work a certain approach will take and its capacity.
Sometimes there are quick wins to give like 100x capacity to a system just by doing things slightly differently, but only with experience will you know that.
I think you can write good code that intentionally omits performance optimizations you know could be made, but don’t want to make right now because it trades off complexity for performance. I usually leave myself a note of how to improve it if it does in fact become the bottleneck or starts to hurt latency
Yeah, and these apps probably would never have worked with that many users in practice, because they missed a few things here and there, and the only way to fix them would have been to have that amount of traffic in the first place.
also all projects, git or anything else, have limited resourcing. I'd rather it's spent on the prioritized features & needs than exhaustive testing for edge cases.
Scaling code isn't always as simple as rewriting your search function to be faster.
What if scaling to millions of objects forces real tradeoffs for the hundreds of objects case?
It feels like you're asking people to only create Postgres, but SQLite has a perfectly valid use case as well.
In this case, git checking the metadata of 500k files is fundamentally slow. The only way around this is to change how git tracks files, and the alternatives all come with other usability tradeoffs. Git itself supports an fsmonitor that makes handling more files faster, but very few people use it because the tradeoffs aren't worth it.
> It feels like you're asking people to only create Postgres, but SQLite has a perfectly valid use case as well.
I don't mean this to be a "well actually" comment, but because I found it interesting when I learnt this a few weeks ago - some limits for SQLite [1] are actually higher than the limits for Postgres [2] (specifically the number of columns in a table and the maximum size of a single field).
I've been working with some SQLite databases that are >100GB lately, and wondering if this is a bad idea. The theoretical max size is 140TB, but there's a big gap between can and should.
Among mainline Linux filesystems, xfs started doing this first. The test suite is still named xfstests, although many more filesystems rely on it now. They regularly test xfs on enormous filesystems which 99.9% of us will never see, both with hundreds of billions of tiny files, and relatively small numbers of very large ones, plus various mixes of the two. Pushing it into edge cases like billions of files in one directory without any nesting. I really like that strong engineering culture and that's why I prefer xfs for most stuff.
Sometimes edge cases can quickly surface bugs that only happen rarely under normal circumstances and are therefore difficult to reproduce/debug.
E.g. when programming in C for little endian computers it can be a good idea to test code on big endian CPUs as the difference in endianess can reveal "out of bounds" writes for pointers.
These sorts of exercises help with performance tuning in the small.
One thing you should learn, and many don't, about perf analysis is that you start getting serious artifacting in the data for tiny functions that get called an awful lot. I've found a lot of tangible improvements from removing 50% of the calls to a function that the profiler claims takes barely any time. Profilers lie. You have to know what they lie about.
When I'm trying to optimize leaf- or near-leaf-node functions I've been known to wrap the call with a for loop that runs the same operation 10, 100, 1000 times in a row, just so I can see if some change has a barely-double-digit effect on performance. These predictions usually hold up in production.
Just be very, very sure not to commit that for loop.
Or use representative data that is ridiculously large compared to the average case.
This is really not unique to XFS... anyone who has worked on a filesystem in at least the last decade would tell you that tests like the one OP inadvertently created are commonplace.
Unlike with many user-space applications, filesystems have a very well-defined range of conditions they have to work in. E.g. every filesystem worth its salt will come with a limit on the number of everything in it, i.e. number of files, groups, links and so on. And these limits are tested, they aren't conjectures. Ask any filesystem developer how many metadata operations per second their program can do, and they will likely be able to answer you in their sleep. This might be surprising on the consumer end of the deal, but to the developers there's nothing new here.
I would agree with testing at scale orders of magnitude higher than you can possibly imagine. But once you know what your scaling limits are (and there always are limits) and what the (pre)failure behaviour looks like… well, you don't have to fix them…
I mean that's how you get k8s for projects that in reality will never need it.
Now you have a developer that is only doing k8s.
Managing overhead and minimizing it is really something to keep in mind.
So your app can't handle 100,000 concurrent users?
As long as there is a plan for how you could enable that in case of emergency, there is really no incentive to have all that premature optimization for 90% of companies, imo.
People who make filesystems test this stuff and will be able to tell you the ballpark figure for performance of this kind of operation even w/o testing. Testing here isn't the problem...
The problem here is that we need a reasonably small interface for filesystem to enable competing implementations, so, for example, we don't have a filesystem interface for bulk metadata operations, because this is an unusual request (most user-space applications which consume filesystem services don't need it). So, we can only query individual files for metadata changes through "legal" means (i.e. through the documented interface). And now you end up in a situation where instead of fetching all the necessary information in a single query, the performance impact of your query scales linearly with the number of items queried.
Even if Git developers anticipated this performance bottleneck, there's not much they can do w/o doing some other undesirable stuff. Any solution created outside of the filesystem would risk de-synchronization with the filesystem (i.e. something that watches the state of the filesystem dies and needs to be restarted, either losing old changes or changes done between the restarts). Another solution could try going behind the documented filesystem interface, and try to salvage this information directly from the known filesystems... which would be a lot of work compounded with the potential to screw up your filesystem.
Maybe if we'd have Git integrated with the kernel and be able to thus integrate better with at least the in-kernel filesystems. But this would still put people on anything but Linux at a disadvantage, and even on Linux, if you wanted some filesystem that's not in the kernel, you'd also have the same problem...
When Git was first released, the Linux kernel sources had less than 20,000 files. It currently has around 70,000 files. It’s not nothing, but it also isn’t millions.
The kernel is big, but it isn't _that_ big in the grand scheme of things. The project from the original article here is bigger, and many companies have projects bigger than that.
I worked at an educational institution where we ran an academic-focused Enterprise Resource Planning (ERP) system that was fairly large. Not quite 40k tables, but it had over 4k. To give you an idea of how this was organized:
* Most simple things like a "Person" were multiple tables because you had to include audits and historical changes for each field
* A "Person" wasn't even all that useful because it included guests or other fairly transient entities like vendor contacts so you had an explosion of more tables as you classified roles into "Student", "Faculty", "Employee", etc... (many with histories as above).
* Addresses and other non-core demographic information were usually sharded into all sorts of categories like "primary", "parent's", "last known good", "good for mailing", etc... (more histories, etc...)
* All coded information like label types such as "STUDENT", or "MAILING" were always handled as separate validation tables with strict FK constraints and usually included extra meta information like descriptions and usage notes within parts of the system.
* Each functional sub-system (HR, Payroll, AR, AP, etc.) had its own dedicated schema.
* All external jobs, processes, and external integrations were configured separately.
* All enterprise integrations usually had a whole dedicated schema for configuration.
* Most parts of the interactive web UI were database driven (Oracle's mod_plsql on Apache) with many templates and other components stored in large collections of tables.
I'll stop there, but basically just imagine a very large application that tries to be 100% database-driven. That's how you get a lot of tables.
And honestly, I kinda get it. Until you run into a case where your volume is such that you physically can't run it on the db, run it on the db. I run all my job processing off the DB and couldn't be happier. I have to hit "can't run alongside the real data" and "can't run in its own db" before I'll need to consider something else.
It probably feels weird for devs to drive the UI off the db but it's just Wordpress by another name.
I've worked with / on an application like that, it had all form fields awkwardly configured in a database, plus a complicated database migration script to add, remove and update those fields.
When I rewrote the application I just hardcoded the form fields, nobody should need to do a database migration to change an otherwise mostly static form.
Would it be "better" if they had one table with json/xml/whatever and handled schema in code?
They made a trade-off they found right. When they hit the limit with their approach, they even implemented their own DB (S4/Hana) to support their system.
Why not? I don't think there are many graph databases that are set up to handle multiple petabytes of data, so RDBMs make a good storage layer at that scale
I've got a db that hosts postgresql versions of CSVs/XLSs that are uploaded/harvested to an open data portal (as part of the portal). There are ~10k of them in there (+-), and could certainly see more (O(5k)) if some of the CSVs were parsed better.
Security. You have access to stef25_ tables and I don't.
The alternative would be that we both have access to the same tables, with a permission layer to grant access to rows.
Both choices have trade-offs, but what if the company makes a mistake and I now have access to your rows? It seems easier to control access at the table level than at the row level.
Or user group... or active directory group... or admin group. super user. etc.
Or... you can just split things by tables. Or even shard by databases where I don't have access to your database and vice versa.
doing stuff in the application and leaving everything in one database/schema is an option... but don't think you aren't making trade offs and leaving open possible issues by not taking the more comprehensive option like sharding.
And that's just one question to ask. Another is what about upgrading the database and segregating customers. can't do that if everyone is on the same database/schema. What if a customer doesn't want to be updated or upgraded? Much like companies paying for Windows XP support because stuff they have relies on the older version of software?
"where user.id = 123" is a simple solution that quickly becomes more complicate to put it mildly.
Wouldn't having separate databases (with separate users (per organization)) make more sense from a security point of view? I have no knowledge of these things, I've never actually worked with more than one database in a mysql instance.
edit: I tell a lie, I separated the forums and wordpress databases on a website I run.
One DB server can have multiple DBs. In this case we are talking about a single DB (not server) containing multiple thousands of tables. And I'm curious what the use case is for such designs.
Thankfully, our product has customer specific use patterns that we've been able to manage/plan/predict peak load for and what not. Of those 300, a random subset of 20-30 would be 'busy/critical' at any given time, and the others can easily tolerate a delay as migrations + schema changes lock and manipulate things.
How does a "monorepo" differ from, say, using a master project containing many git submodules[1], perhaps recursively? You would probably need a bit of tooling. But the gain is that git commands in the submodules are speedy, and there is only O(logN) commit multiplication to commit the updated commit SHAs up the chain. Think Merkle tree, not single head commit SHA.
Eventually, you may get a monstrosity like Android Repo [2] though. And an Android checkout and build is pushing 1TB these days.
But there, perhaps, the submodule idea wins again. Replace most of the submodules with prebuilt variants, and have full source + building only for the module of interest.
> How does a "monorepo" differ from, say, using a master project containing many git submodules[1], perhaps recursively
One fundamental way it differs is atomic commits. You can't change something in repo A and subsubrepo XYZ in a single pull.
A monorepo allows you to do things like atomic commits to arbitrary pairs of files in the repo, which among other things opens up the possibility of enforcing single-version of libraries, which in turn removes a whole class of diamond dependency issues.
There's other benefits, but imo it's probably not worth it for most companies because of the staggering number of things it breaks in the developer toolspace once it gets large enough. Eventually you need teams of people that do nothing but make tooling to support monorepo scaling, because everything off the shelf explodes (what do you do when even perforce can't handle your repo?)
For example, at Google we have a team of people who do nothing but, effectively, recreate the cross-referencing and jump-to-def everyone else gets for "free" from IntelliJ / VS IntelliSense, etc. (We do other stuff too, but that's a fair paraphrase). And on top of that, the team really only exists because Steve Yegge is a Force to be Reckoned With; otherwise we might still be flailing around without jump-to-def, idk.
One Major Point for monorepos is the ability to eschew packages.
In most languages creating, publishing and consuming a package is a lot of work, while in a monorepo you just add a reference to the code and are ready to go (except in react native... gaah, that was pure horror). That's especially valuable if you need to refactor something and need to adjust its dependencies. Doing that via packages is slow and painful. Via project references it's much easier and has a tight feedback loop: change+build+fix instead of change+build+publish+consume+fix.
Yup, agree completely. That's a natural extension of the same thing that enables atomic commits - suddenly just having direct library dependencies instead of packages isn't that big of a problem if you push everything into the monorepo.
> That's especially valuable if you need to refactor something and need to adjust its dependencies.
And yes exactly, being able to change a library and all of its callers at the same time is pretty handy.
The ability of Google's internal code search to jump between declaration, definition, override, and call site is miles ahead of what Intellisense can do.
Things like moving a file from one git submodule to another are more cumbersome than just `mv foo dir/bar`. That means your directory structure is in practice tightly coupled to the tree of git projects.
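To make "more cumbersome" concrete, a move across submodules turns one commit into a copy plus three commits (the paths here are hypothetical):

    # inside a single repo: one commit
    git mv libfoo/foo.c app/foo.c
    git commit -m "Move foo.c into app"

    # across submodules: commit in each submodule, then in the superproject
    cp libfoo/foo.c app/foo.c
    (cd app && git add foo.c && git commit -m "Add foo.c")
    (cd libfoo && git rm foo.c && git commit -m "Remove foo.c")
    git add app libfoo
    git commit -m "Bump submodule pointers"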
Also, since any of the git sub-repos can be branched, the chaos of merging development branches seems like it gets even more complicated in a submodule architecture.
It may be possible to put a user interface that abstracts away the submodule architecture and forces everything to live on HEAD. But at that point it might be easier to just provide a git-like UI to a centralized VCS.
Most large monorepos simply are not on git. Google has Piper, Yandex has arc, Facebook has eden (which is actually semi-open-source, btw!), some companies use Perforce and so on.
"Monorepo" is a culture around having a single branch with a single lineage, and not developing anything in any isolation greater than a single developer's workstation.
I agree that Git is not very adequate for large monorepos, but I'd say that most open source projects are on Git, and most of them are trivial monorepos.
Not off-the-shelf git though, they have their own file system virtualisation stuff on top. Some of that used to be open-source (Windows only, I think?).
Microsoft's blog posts have indicated a move to use something as close to off-the-shelf git as possible, though. They say they've stopped using VFS much and are instead more often relying on sparse checkouts. They've upstreamed a lot of patches into git itself, and maintain their own git fork but the fork distance is generally shrinking as those patches upstream.
The monorepo is where you end up when you have failed to enforce encapsulation and your "modules" do not have stable APIs (or are not actually modular). Then, with sub-modules, each change will often involve multiple commits to different modules, plus commits to update references, so O(N) commit multiplication.
They upstreamed almost everything. The last version of "scalar" was mostly just a configuration tool for sparse checkout "cones" which needed a bit of hand-holding, and that is easier to configure in git itself now, or so I hear.
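For reference, the sparse checkout "cone" workflow they moved to is plain git now; a minimal sketch (the repo URL and directory names are made up):

    # partial clone: skip blobs until needed, and start with a sparse worktree
    git clone --filter=blob:none --sparse https://example.com/big-monorepo.git
    cd big-monorepo
    # cone mode: only materialize the directories you actually work on
    git sparse-checkout init --cone
    git sparse-checkout set services/payments libs/shared

`git status` and checkouts then only have to walk the directories you opted into.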
Bold move to enable the "ours" merge strategy by default! I presume this is a typo for the "-Xours" merge option to `ort` or `recursive`, but that still seems pretty brave.
It is still only "ours" per hunk. But yes, it could obliterate changes. On the other hand, the default merge strategy is a huge waste of developer time. There is rarely a genuine conflict. It's usually just that we want to keep both sides.
Author here, "generated" was the wrong choice of word here. These translations are manually translated by humans in another system and consolidated into these .xlf files, so they're dependent on what the original strings are in code at a specific commit. They cannot be generated on the fly.
Since 70% of the files were xlf files used for translation/localization, couldn't they instead just store all of those in a single SQLite file and solve their problem much more easily? Any of the nuances of the directory structure could be captured in SQLite tables and relationships, and it would be easy to access them for edits by non-coders using a tool like DB Browser.
I feel like often people make problems much harder than they need to be by imposing arbitrary constraints on themselves that could be avoided if they approached the problem differently.
The translations are dependent on the original strings in code. For example, if we change "design anything" to "design everything", the translations also need to be updated to reflect that, and by keeping it within vcs, we have an atomic change including code, copy and translation. Moving it to a database would make updates easier, but would now require a separate process to "sync" copy changes in code with translation changes in the database.
That just sounds like adding another problem though. A filesystem (and git) already is a database, and plain files can be read and managed more easily than a possibly corruptible binary file. Plus, you'd lose history, unless you add more complexity to add history.
I mean I don't know if they ever needed history but, just saying. You get certain things for free by using a filesystem / git.
You commit the sqlite dump file(?) to git and have the history...
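If you went that route, the usual trick is to commit the textual dump rather than the binary file so diffs stay readable (a sketch; filenames are made up):

    # export to a stable text form before committing
    sqlite3 translations.db .dump > translations.sql
    git add translations.sql && git commit -m "Update translations"
    # rebuild the binary database from the dump during the build
    sqlite3 build/translations.db < translations.sql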
I dunno, but there are folks who would put anything in git. I work with someone who manages to exceed the disk space of the company's GitLab instance by git adding everything. The disk is full again once a month.
Our monorepo is at ~500 megs right now. This is 7 years worth of changes. No signs of distress anywhere, other than a periodic git gc operation that now takes long enough to barely notice.
I can't imagine using anything else for my current project. In fact, the only domain within which I would even consider something different would be game development. Even then, only if the total asset set is ever expected to exceed a gigabyte or so. Git is awful with large blobs. LFS is an option, but I've always felt like it was a bandaid and not a fundamental solve.
However, since then we've migrated our engineering blog from medium to a self-hosted stack, so HN doesn't link it to the previous discussion automatically.
Anyone know, what's the advantage of this over a big composite repo with several git submodules?
I think that submodules are better suited for separation of concerns and performance, even while achieving the same composite structure as an equivalent monorepo?
The advantage is simple: Git submodules suck and are a chore to manage for any dependency that sees remotely high traffic or requires frequent synchronization. As the number of developers, submodules, and synchronization requirements increase, this pain increases dramatically. Basic git features, like cherry picking and bisecting to find errors become dramatically worse. You cannot even run `git checkout` without potentially introducing an error, because you might need to update the submodule! All your most basic commands become worse. I have worked on and helped maintain projects with 10+ submodules, and they were one of the most annoying, constantly problematic pain points of the entire project, that every single developer screwed up repeatedly, whether they were established contributors or new ones. We had to finally give in and start using pre-push hooks to ban people from touching submodules without specific commit message patterns. And every single time we eliminated a submodule -- mostly by merging them and their history into the base project, where they belonged anyway -- people were happier, development speed increased, and people made less errors.
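(For the curious, a pre-push hook like the one described can be a short shell script. This is only a sketch; the submodule path and the [submodule-bump] commit-message marker are made up for illustration:

    #!/bin/sh
    # .git/hooks/pre-push: reject pushes that touch the submodule gitlink
    # unless the commit message carries an explicit marker.
    zero=0000000000000000000000000000000000000000
    while read local_ref local_sha remote_ref remote_sha; do
        [ "$local_sha" = "$zero" ] && continue   # branch deletion, nothing to check
        if [ "$remote_sha" = "$zero" ]; then range="$local_sha"; else range="$remote_sha..$local_sha"; fi
        for c in $(git rev-list "$range"); do
            if git diff-tree --no-commit-id --name-only -r "$c" | grep -qx "vendor/libfoo"; then
                git log -1 --format=%B "$c" | grep -q "\[submodule-bump\]" || {
                    echo "Commit $c changes vendor/libfoo without a [submodule-bump] marker" >&2
                    exit 1
                }
            fi
        done
    done
    exit 0

Not pretty, which is rather the point.)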
The reasons for those things being separate projects had a history (dating to a time before Git was popular, even) and can be explained, but ultimately it doesn't matter; by the time I was around, all of those reasons ceased to exist or were simply not important.
I will personally never, ever, ever, ever allow Git submodules in any project I manage unless they are both A) extremely low traffic, so updating them constantly doesn't suck, and B) a completely external dependency that is mostly outside of my control, that cannot be managed any other way.
Save yourself hair and time and at least use worktrees instead.
for each submodule affected by some change you would need an additional commit, yes. But those commits are bundled together in the commit of the parent repo, where they act as one.
So, atomicity of changes can be guaranteed, but you need to write a few more commits. However this effort of small increases of commits is far outweighed by the modularity imo.
With --recurse-submodules the atomicity doesn't seem to suffer. It used to be the case that you couldn't ensure all changes in the source tree were pushed atomically; now you can, but I'm not sure it's the default behavior.
Is it? I'm slightly struggling to understand what benefit you gain from having the "parent" repo but also having individual submodules. Sure, working in each individual project's module makes cloning faster, until you need to work on a module that references another module (at which point you need to check out the parent repo or risk using the wrong version), and now every change you make needs two commits (one to the sub-repo, and one to the base to bump the submodule reference).
In our case, we have a codebase that involves two submodules: one for persistence and one for python based management of internal git repos. Both of these are standalone applications and can run on their own. They are then used in a parent repo which represents the overarching architecture, which calls into the submodules.
The advantage of this is, that work can be done by devs on the individual modules without much knowledge of the overarching architecture, nor strong code ties into it.
Right now our persistence is done with SQL, but we could swap it with anything else, e.g. mongo, and the parent codebase wouldn't notice a thing since the submodule only returns well defined python objects.
Of course, this comes at the cost of higher number of commits as you mentioned. But in my opinion these are still cheap because they only affect trivial quantity and not brain-demanding quality.
But what do you do as soon as one of the submodules has a dependency on another? I imagine you might not hit it in your simple case, but I feel like scenarios like that are where the advantages of monorepos lie.
To take a concrete example, I'm working on a codebase that houses both a Node.js server-side application and an Electron app that communicates with it (using tRPC [0]). The Electron app can directly import the API router types from the Node app, thus gaining full type safety, and whenever the backend API is changed the Electron app can be updated at the same time (or type checks in CI will fail).
If this weren't in a monorepo, you would need to first update the Node app, then pick up those changes in the Electron app. This becomes risky in the presence of automated deployment, because, if the Node app's changes accidentally introduced a breaking API change, the Electron app is now broken until the changes are picked up. In a monorepo you'd spot this scenario right away. (Mind you, there is still the issue of updating the built Electron app on the users' machines, but the point remains - you can easily imagine a JS SPA or some other downstream dependency in its place.)
I missed the git push --recurse-submodules flag, even though it seems like it's been there for a long time. Yeah, it seems like it would work, except you need to configure it to be always "check" and be always on when you push.
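For what it's worth, the "always on" part is just configuration:

    # refuse the push if referenced submodule commits haven't been pushed yet
    git config push.recurseSubmodules check
    # or push the missing submodule commits automatically
    git config push.recurseSubmodules on-demand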
this is one of those multipurpose PR articles (not all bad) to generate awareness of the company, their product, use case, and developers.
>At Canva, we made the conscious decision to adopt the monorepo pattern with its benefits and drawbacks. Since the first commit in 2012, the repository has rapidly grown alongside the product in both size and traffic
while reading it i was having trouble keeping track of where I was in the recursion, it's sort of "Xzibit A" for "yo dawg, we know you use source repositories, so check out our source repository (we keep it in our source repository) while you check out your source repository!"
We learned they were 70% autogenerated so probably shouldn't have been in git at all, but our build process relied on that, and we didn't want to fix it, so we bodged it.
Something being autogenerated, or binary, doesn't mean it shouldn't be in version control. If step one of your instructions to build something from version control involves downloading a specific version of something else, then your VCS isn't doing its job, and you're likely skirting around it to avoid limitations in the tool itself. People still use tools like P4 because they want versioned binary content that belongs in version control, or because they want to handle half a million files, and git chokes.
In my last org, we vendored our entire toolchain, including SDKs. The project setup instructions were:
- Install p4
- Sync, get coffee
- Run build, get more coffee.
A disruptive thing like a compiler upgrade just works out of the box in this scenario.
It's a shame that the mantra of "do one thing well" devolves into "only support a few hundred text files on linux" with git.
This is precisely why every ASIC (HW) company I'm familiar with uses P4. ASIC design flows rely critically on 3rd party tooling, that must be version/release specific. You can't rely on those objects being available whenever. They get squirreled away and kept, forever.
> Something being autogenerated, or binary, doesn't mean it shouldn't be in version control.
I think the SHA should be in version control. The file should be reproducibly built [1], then cached on a central server.
This means that a build target like a system image could be satisfied by downloading the complete image and no intermediate files. And a change to one file in one binary will result in only a small number of intermediate files being downloaded or reproducibly built to chain up to the new system image.
This is something that's really lacking in, for example, Git.
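A minimal sketch of that idea, with a hypothetical cache URL and build script: the repo tracks only the expected hash, and the artifact is either fetched from the cache or rebuilt locally and verified.

    # artifact.sha256 is the only thing committed to version control
    want=$(cat artifact.sha256)
    # try the content-addressed cache first; fall back to a local reproducible build
    curl -fsS -o artifact.bin "https://cache.example.com/objects/$want" \
        || ./build-artifact.sh artifact.bin
    # verify that whatever we ended up with matches the committed hash
    echo "$want  artifact.bin" | sha256sum -c -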
> I think the SHA should be in version control. The file should be reproducibly built [1], then cached on a central server.
Requiring reproducible builds to handle translations or images is a bit much. Also, if it's cached on a central server, that now means you need to be connected to that central server. If you require a connection to said central server, why not just have your source code on said server in the first place, a la p4?
I do agree that NixOS is a great idea, but personally 99% of my problems would be solved if git scaled properly.
You can always build from source in this scenario. The cache server lets you skip two things. First, you can prune the leaves of the tree of intermediate files you might need. Second, where you do need to compile/build/link/package, etc., you can do only those steps that are altered by your changes. So you save CPU time and storage space.
> why not just have your source code on said server in the first place, a la p4?
That would be great. A version of git where cloning is almost a no-op, and building is downloading the package assuming you haven't changed anything.
I'm not aware of how p4 allowing this. My recollection of perforce is that I still had most source files locally.
Wouldn't Git LFS be the tool for this job? Have the automated tool build a .zip file for example of the translations (possibly with compression level set to 0), then have your build toolchain unzip the archive before it runs. Then check that big .zip file into GitLFS, et voila you now have this large file versioned in Git.
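The LFS side of that is only a few commands (a sketch; the zip name is made up):

    git lfs install
    # store the packed translations as an LFS object instead of a normal blob
    git lfs track "translations.zip"
    git add .gitattributes translations.zip
    git commit -m "Track packed translations via LFS"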
Git LFS isn't the same as git, though. It's better than putting everything in a separate store, but for one it disables offline work, and breaks the concept of D in the DVCS of git.
> then have your build toolchain unzip the archive before it runs
My build toolchain shouldn't have to work around the shortcomings of my environment, IMO.
> et voila you now have this large file versioned in Git.
No, it's on a separate http server that is fetched via git lfs. Subtle, but important difference.
It's good enough for the small usecases, but way behind tools that have first class support for binary files (binary deltas, common compression, ...). Even SVN shines here.
Separately, we also found that git lfs is not very optimised for large repositories, notably its locking feature, which lists every file tracked by git for every checkout and commit command.
It just does a pretty good job of dealing with binary files in general. The check in/check out model is perfect for unmergeable files; you can purge old revisions; all the metadata is server side, so you only pay for the files you get; partial gets are well supported. And so, if you're going to maintain a set of tools that everybody is going to use to build your project, the Perforce depot is the obvious place to put them. Your project's source code is already there!
(There are various good reasons why you might not! But "because binary files shouldn't go in version control" is not one of them)
> You vendored all your compilers/language runtimes in the source control repo of each project? Including, like, gcc or clang? WTF?
Yep. Along with platform SDKs, third party dependencies, precompiled binaries, non-redistributable runtimes, you name it.
Giant PSD or FBX files? 4K Textures? all of it.
Client mappings are the bread and butter of P4 (or Stream views more recently which are not as nice to work with) - you say "I don't want the path containing MacOS" if you don't want it.
> Because the Linux kernel source tree and its history can accurately be described as "a few hundred text files".
I was off by a little bit, it's ~60k. But it's still "only" 60k text files, no matter how important those text files are.
I've worked on a couple of game projects that did this. Build on Windows PC, build for Windows/Switch/Xbox One/Xbox Serieses/PS4/PS5/Linux. I was never responsible for setting this up, and that side of things did sound a bit annoying, but it seemed to work well enough once up and running. No need to worry about which precise version of Visual Studio 2019 you have, or whether you've got the exact same minor revision of the SDK as everybody else. You always build with exactly the right toolchain and SDK for each target platform.
It's not that unusual, we vendor entire VM images which contain the development environment. (Codebase existed since before docker). And it works well; need to fix something in a project that was last updated 20 years ago? Just boot up the VM and you are ready.
Well we don't put them in git, we put them in perforce because git keels over if you try and stuff 10GB of binaries into it once every few months.
I think the real question is the other way around though, why _not_ use git for versioning when that's what it's supposed to be for? Why do I have to verison some things with git, and others with npm/go build/pip/vcpkg/cargo/whatever?
To clarify here, "generated" and "autogenerated" were bad choices of words. They're translations created by humans and are dependent on the strings in code.
It's not an unbreakable rule that generated or binary files should not be in Git. It's a rough guideline. Partly because Git is bad at dealing with binary files.
There are plenty of cases when including generated files is appropriate. It has many advantages over not doing that - probably the biggest are
* Code review is much easier because you can see the effect on the output.
* It's easier to find the generated files because they're next to the rest of your code. IDEs like it much more too.
In fact the upsides are so great and the downsides so minimal I would say it should be the default option as long as:
* The generated files are not huge.
* The generated files are always the same.
Even when they are huge it might still be a good idea, but you can put the files in a submodule or LFS. I do that for a project that has a really difficult to install generator so users don't need to install it.
I'm on the fence with this one. My previous project was Go & Typescript with a range of generated files; I committed the generated files, so that they would flag up in code reviews if they were changed, avoiding hidden or magic changes. I also didn't automatically regenerate, avoiding churn.
That said, if the autogenerated output is stable, it's fine. After all, in a sense, compiling your code is also a kind of autogenerating and few people will advocate for keeping compiled code in git.
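One common guard when you do commit generated output is a CI step that regenerates and fails on drift (a sketch; the generator script and output directory are hypothetical):

    # regenerate, then fail the build if the committed copies are stale
    ./scripts/generate.sh
    if ! git diff --exit-code -- generated/; then
        echo "Generated files are out of date; run ./scripts/generate.sh and commit." >&2
        exit 1
    fi

That way the committed files can never silently drift from what the generator would produce.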
- Makes it easy to develop applications and libraries together in a single branch
- Similarly, makes it easy to make a breaking change to a library, then change all clients of said library, in a single branch
- And because of the above, makes it easy to keep all dependencies on internal libs at the latest version, which can greatly reduce all sorts of “dependency hell” issues
- Generally makes integration testing a bit easier
The downside is you have to invest a lot more time in tooling, keeping both local and CI builds fast. And even with that tooling, builds won’t be as fast as they trivially are with multi-repo. But if you do invest that time in tooling, you can generally get them fast enough, and then reap the other benefits for a very productive dev experience.
Have done both monorepo and multi-repo at different, decent sized companies. Both have their pros/cons.
A couple of years now, but whether it's a good idea depends on your use case and organization. Seems to work for some. It works for my current assignment too - two and possibly more React Native apps that reuse a lot of components, translations, have the same APIs, etc.
I am not sure what I'm looking at here. Surely those half million files are for dozens if not hundreds of different apps, libraries and tools and surely those do not all depend on each other, no?
Because if so, why not just use one repo per app/library/tool? Sure, if you have a cluster of things that all depend on each other, or a cluster of things that typically is needed in bulk, by all means, put those in a single repo.
But putting literally all your code in a single repo is not a very sane technical choice, is it?
Depends on the test tooling. If you want a single commit to pass integration tests then they need to be in a single commit. Otherwise you're tracking every version of every tool.
But I like to look at the problem from another perspective. Why _not_ use a single repo. The only real reason would be to work around technical challenges with your source control of choice, not because having everything tracked together is inherently bad.
Google runs a single monorepo for 95% of its projects across the company. Google isn't perfect, but it's hard to argue that it isn't technically very good.
One of the biggest advantages is that there is no version chasing and dependency questions. At commit X, everything works consistently. No debating about whether this or that dependency is out of sync.