In the previous article, we looked at open source from a strategic point of view. Here the focus shifts to execution.
Author: Hirad Emamialagha
Most internal repositories are not ready to be public.
This is something we ran into quickly once we started thinking about open sourcing our work.
These repositories work, but only in the environment they were built in. Paths point to internal systems, dependencies are loosely defined, and setup depends on things that were never written down. Within the team, this is manageable. Outside, the same repository usually breaks in the first few minutes.
Open sourcing a repository is not about making it visible. It starts when those assumptions are removed and the system can run on its own.
From Repository to Something Usable
A simple way to look at it:
An internal repository starts to behave like an open source project when it can be used without you.
That sounds obvious, but in practice it is a high bar.
Someone who has never seen your code, has no access to your infrastructure, and cannot ask you questions should still be able to install it, understand what it does, and run a basic example.
Many repositories do not meet that bar.
They depend on things that only exist in that environment. Environment variables that are never documented. Private package registries. Directory structures that only make sense if you have seen them before.
At some point you stop fixing individual issues and realize the real problem is dependency on context. The goal becomes removing that dependency altogether.
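One way to make hidden context visible is to read every environment variable in a single place and fail with a clear message when one is missing. The sketch below is only an illustration: the variable names `MYPROJECT_DATA_DIR` and `MYPROJECT_LOG_LEVEL` and the `Settings` class are hypothetical, not taken from any real project.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names, used only for illustration.
REQUIRED = ("MYPROJECT_DATA_DIR",)
DEFAULTS = {"MYPROJECT_LOG_LEVEL": "INFO"}


@dataclass(frozen=True)
class Settings:
    data_dir: str
    log_level: str


def load_settings(env=None) -> Settings:
    """Build settings from environment variables, naming anything missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        # Tell the user exactly which variables to set instead of failing later.
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        data_dir=env["MYPROJECT_DATA_DIR"],
        log_level=env.get("MYPROJECT_LOG_LEVEL", DEFAULTS["MYPROJECT_LOG_LEVEL"]),
    )
```

With something like this, the list of required variables lives in the code itself, so the README can point to one place instead of trying to keep a separate list in sync.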
Removing Internal Assumptions
The first step is not documentation. It is not licensing either.
It is cleanup.
Anything that only works in your current setup needs to be questioned. If something cannot be explained in a few lines, ask whether that complexity is necessary. If it is, document it clearly.
That usually includes credentials, internal services, hard-coded paths, and a fair amount of experimental code that never got removed.
But cleanup is not just about removing sensitive things.
It is about predictability.
A repository should behave the same way on a clean machine as it does internally.
If it does not, users will try to fix it. Some will succeed, most will not. Either way, the system becomes harder to trust.
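Part of this cleanup can be supported by a small script. The following is a rough sketch, not a replacement for a dedicated secret scanner: the patterns (home-directory paths, hostnames ending in `.internal`, string literals assigned to names like `api_key`) are assumptions about what internal leftovers might look like and will need adjusting for a real codebase.

```python
import re
from pathlib import Path

# Illustrative patterns only; real repositories will need their own list.
PATTERNS = {
    "absolute home path": re.compile(r"/(?:home|Users)/[\w.-]+/"),
    "internal hostname": re.compile(r"\b[\w.-]+\.internal\b"),
    "possible secret": re.compile(
        r"\b(?:password|secret|api_key|token)\s*=\s*['\"][^'\"]+['\"]",
        re.IGNORECASE,
    ),
}


def scan_text(text):
    """Return (line_number, label) pairs for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


def scan_repo(root, suffixes=(".py", ".toml", ".yaml", ".yml", ".cfg")):
    """Scan text-like files under root and map each path to its findings."""
    results = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = scan_text(path.read_text(encoding="utf-8", errors="ignore"))
            if hits:
                results[str(path)] = hits
    return results
```

A scan like this produces false positives and misses things, so it works best as a prompt for manual review rather than as a gate.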
Structure Helps More Than You Think
Once the obvious issues are gone, structure starts to matter.
A messy repository slows people down immediately. You open it, look around, and you are not sure where to start. After a minute or two, many people leave.
There is no single correct layout, but most Python projects end up looking somewhat similar:
project/
├── src/
├── tests/
├── examples/
├── docs/
├── pyproject.toml
├── README.md
├── LICENSE
└── CHANGELOG.md
This is less about rules and more about familiarity.
If users recognize the structure, they do not need to think about it. They can focus on the code instead.
The Environment Problem
A lot of projects fail here.
Environments tend to evolve over time. Dependencies get installed one by one. Something breaks, someone fixes it, and the fix is never written down. After a while, the system works, but nobody is exactly sure why.
Then someone clones the repository on a fresh machine and nothing works.
If your project depends on specific versions, system libraries, or hardware assumptions, those need to be made explicit. Not in a separate document, but in the repository itself.
That usually means a proper dependency file, clear installation steps, and sometimes a container or environment specification.
If setting up the environment feels unpredictable or frustrating, people often stop there.
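A quick way to notice drift between what is written down and what is actually installed is to compare the two. The sketch below assumes dependencies are pinned with exact `name==version` lines, as in a typical requirements file. It ignores extras, environment markers, and version ranges, and the helper names are made up for this example.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Collect exact 'name==version' pins, skipping comments and other specifiers."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Compare installed package versions against the pinned ones."""
    problems = []
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {expected})")
            continue
        if installed != expected:
            problems.append(f"{name}: installed {installed}, expected {expected}")
    return problems
```

Running `check_pins(parse_pins(text))` on the environment that "just works" internally is often the fastest way to find out which versions the project really depends on.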
Documentation Is the Entry Point
People do not start with your code. They start with your README.
In many cases, that is the only thing they read.
So it has to answer a few simple questions quickly. What the project does. Why it exists. How to install it. How to run something that works.
That last part matters more than it seems.
A small example that runs in a few minutes does more than a long explanation. It shows that the system actually works, and it gives users something to build on.
If the first interaction fails, the rest of the documentation does not matter much.
Collaboration Starts With a Few Files
Once a repository is public, people interact with it in ways that are hard to predict.
Some will open issues. Some will suggest changes. Some will try to extend it in directions you did not expect.
A few standard files make this easier to manage.
The usual set looks like this:
LICENSE: defines how the project can be used and distributed.
CONTRIBUTING.md: explains how to contribute and run the project locally.
CODE_OF_CONDUCT.md: sets expectations for behavior.
SECURITY.md: explains how to report critical security or privacy issues.
Issue and PR templates: structure communication.
These are small things, but without them, communication becomes messy very quickly. They also signal that the project is intentional, not incidental.
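Checking for these files is easy to script. The sketch below assumes the file names listed above sit at the repository root. Some projects keep them elsewhere (for example in a `.github/` directory), so the list is an assumption to adapt, not a rule.

```python
from pathlib import Path

# File names taken from the list above; the location at the root is an assumption.
EXPECTED_FILES = (
    "README.md",
    "LICENSE",
    "CONTRIBUTING.md",
    "CODE_OF_CONDUCT.md",
    "SECURITY.md",
)


def missing_files(root, expected=EXPECTED_FILES):
    """Return the expected file names that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_files(".")` from the repository root returns an empty list once everything is in place.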
Automation Changes the Dynamic
In internal projects, a lot of validation is informal. Someone runs tests locally, maybe checks formatting, and that is enough.
In open source, this does not scale well.
Automated checks become the baseline. Tests run on every change. Formatting is enforced. Basic guarantees are always verified.
This is not only about catching errors.
It changes how people contribute. They do not need to understand everything upfront. They make a change, run the pipeline, and get feedback from the system.
It removes a layer of friction.
Automation makes trust less dependent on people and more on consistent checks.
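Most projects hand this to a CI service, but the idea can be sketched locally in a few lines: run every check the same way each time and report which ones pass. The tools named in `DEFAULT_CHECKS` (pytest and ruff) are assumptions for illustration. Substitute whatever the project actually uses.

```python
import subprocess
import sys

# Assumed tooling; replace with the project's own test and formatting commands.
DEFAULT_CHECKS = [
    ("tests", [sys.executable, "-m", "pytest", "-q"]),
    ("format", [sys.executable, "-m", "ruff", "format", "--check", "."]),
]


def run_checks(checks):
    """Run each (name, command) pair and record whether it exited with code 0."""
    results = {}
    for name, command in checks:
        completed = subprocess.run(command, capture_output=True, text=True)
        results[name] = completed.returncode == 0
    return results


def all_passed(results):
    """True only if every check succeeded."""
    return all(results.values())
```

A contributor can call `run_checks(DEFAULT_CHECKS)` before opening a pull request, and the CI pipeline can run the same commands, so both see the same result.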
Versioning Matters More Than Expected
Inside a team, the main branch is often enough.
Outside, it is not.
Without versioning, users do not know what they are depending on. Something works today, breaks tomorrow, and there is no clear explanation why.
Versioning creates stability. It gives users something fixed to rely on, even if the project keeps evolving.
This becomes even more important in machine learning, where small changes can affect results in ways that are not immediately obvious.
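What a version number tells users depends on the scheme behind it. As a small illustration, the sketch below assumes plain `MAJOR.MINOR.PATCH` numbers in the style of semantic versioning, where a major bump signals a breaking change. That scheme is an assumption for this example, not something this article prescribes.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_breaking_upgrade(current, candidate):
    """Under the assumed scheme, a higher major number signals a breaking change."""
    return parse_version(candidate)[0] > parse_version(current)[0]


def is_newer(current, candidate):
    """Compare versions numerically, so 1.10.0 counts as newer than 1.9.0."""
    return parse_version(candidate) > parse_version(current)
```

Comparing the parsed tuples rather than the raw strings matters: as strings, `"1.10.0"` sorts before `"1.9.0"`.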
A Simple Check
There is a practical way to test all of this.
Clone the repository on a clean machine. Follow the instructions exactly. Do not fix things manually. Do not assume anything.
If you can get to a working example without guessing, the repository is probably in a good place.
If not, the gaps are usually easy to spot.
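When a spare machine is not available, a fresh virtual environment is a rough stand-in. The sketch below creates one in a temporary directory, installs the repository into it with pip, and runs a smoke command. It assumes the project is pip-installable, and the function names are made up for this example. A virtual environment isolates Python packages only, so missing system libraries or hardware assumptions will not show up here. A container gets closer to a truly clean machine.

```python
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def venv_python(env_dir):
    """Locate the interpreter inside a virtual environment directory."""
    subdir = "Scripts" if sys.platform == "win32" else "bin"
    exe = "python.exe" if sys.platform == "win32" else "python"
    return Path(env_dir) / subdir / exe


def clean_install_check(repo_dir, smoke_command):
    """Install repo_dir into a throwaway venv and run smoke_command with it.

    smoke_command is a list of arguments passed to the venv's interpreter,
    for example ["-c", "import mypackage"] (a hypothetical package name).
    """
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "venv"
        venv.create(env_dir, with_pip=True)
        python = str(venv_python(env_dir))
        steps = [
            [python, "-m", "pip", "install", str(repo_dir)],
            [python, *smoke_command],
        ]
        for step in steps:
            # Stop at the first failing step; nothing is fixed by hand.
            if subprocess.run(step).returncode != 0:
                return False
    return True
```

If `clean_install_check` returns `False`, the failing step usually points straight at the undocumented assumption.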
Final Thoughts
Turning a repository into something others can use takes more effort than expected.
You remove assumptions, fix the environment, restructure parts of the code, rewrite documentation, and automate things that used to be manual.
At some point, the repository starts behaving differently. It becomes predictable. Easier to trust.
What is easy to miss is that this does not only help once the project is public.
The same changes make the repository easier to work with for your own team. Onboarding becomes simpler. Debugging becomes clearer. Fewer things depend on hidden context.
As before, the biggest benefit often shows up within your own team, long before anything is shared publicly.
What Comes Next
So far, this has been about code.
In the next article, the focus shifts to models and datasets. That is where things start to behave differently. Releasing a model is not the same as releasing code, and the expectations change quite a bit.