JDWK | Centralized vs Localized Flows

The argument for centralized flows is that failures and unexpected results are difficult to debug if you don’t know what version of the code was run. If the flow is centralized and read-only, then you know everyone is running the same code, or at least a version of the code that has been tested and released, since every run points to a read-only release area.

There are several flaws in this plan, though. It is impossible to fully test an rtl2gates flow; the combinations of variables and inputs are effectively infinite. Factoring in long run times and limited licenses and resources, even moderate software or RTL-level code coverage is not feasible, and bugs are unavoidable.

When bugs are caught, how do you fix them, validate the fixes, and release a new version quickly enough that the schedule is not impacted?

The answer is that you cannot. Centralized flows often use staging or prerelease areas to give users quick fixes before an official release is made, but this can be dangerous: the staging area may contain other unwanted and untested code and, worse, it changes from run to run as new fixes or features are added.

Therefore, the flow must allow local overrides so that people can move past their particular bug, issue, or missing feature. This code is written quickly, without review, and is rarely reusable. Over time, the overrides grow, making debugging worse than it would be with local modifications to the flow and making it impossible to move the code back into the flow and test it. Additionally, these hacks and overrides often get copied and shared between users without revision control, proliferating bad coding practices and new bugs. When a new version of the flow is released, the local overrides cannot be tested against it, and the flow fails in mysterious ways or runs to signoff with suboptimal or even non-manufacturable results.

Centralized flows tend to work for one project, struggle to keep pace when a second one begins, and are then quickly localized.

Localized flows are cloned to each work area or run area. This allows the engineers to make the fix where the problem exists in the code and avoid local overrides. The changes are then submitted for review, tested and integrated into the flow to benefit all designs. The downside is that the user can change anything and break the flow in very obscure ways. Code reviews often miss these bugs, which are then distributed quickly if users are working from the head of the repository.

The argument that localized flows are harder to debug is unfounded because, with any revision control, you can easily see which commit was used and any local changes the user has made. A centralized release can also be imitated by using labels, tags or bookmarks and making formal releases with release notes. Work areas then default to the latest released version for the project. If a release version is found to have a bug that testing missed, then the test suite is updated, the bug is fixed, and a new release is made. This process is identical to that of centralized flows, but with a localized flow, a user doesn’t have to wait for the release. They can make the changes themselves, submit them for review, and move on.
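As a rough illustration of the point about seeing the commit and the local changes, here is a minimal Python sketch. It assumes the flow is cloned with Git, which the text does not specify; the function names `run_git` and `describe_work_area` are made up for this example. It relies only on `git rev-parse HEAD` and `git status --porcelain`.

```python
import subprocess
from typing import Callable, Dict, List


def run_git(args: List[str], cwd: str) -> str:
    """Run a git command inside cwd and return its stdout as text."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def describe_work_area(
    work_area: str, git: Callable[[List[str], str], str] = run_git
) -> Dict[str, object]:
    """Report which commit a cloned flow is on and which files differ from it.

    Hypothetical helper: `git` is injectable so the logic can be exercised
    without a real repository.
    """
    commit = git(["rev-parse", "HEAD"], work_area).strip()
    status = git(["status", "--porcelain"], work_area)
    # Each porcelain line is a two-character status code, a space, then the path.
    changed = [line[3:] for line in status.splitlines() if line.strip()]
    return {"commit": commit, "locally_changed": changed, "clean": not changed}
```

Calling `describe_work_area("/path/to/work_area")` on a real clone would return the checked-out commit and the list of modified or untracked files, which is the information needed to start debugging someone else's run.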

The downside with localized flows is that a separate work area has to be cloned for each project to test the flow. Any change has to be committed and distributed to all clones or clients. This forces untested code to be added to the stack or to a branch that must be merged with the main branch to test. To spread the burden of testing, a central CAD team typically releases a generic flow, and each project is responsible for testing the release with its project-level code and configs before releasing to the team. This is an effective methodology, but it divorces the core of the flow from the people actually using it. Over time, the flow will become bottom-heavy, meaning the development of the project-specific code will outpace the generic code. Code that should be written for all projects gets copied and modified, and the work needed to ramp up a new project increases.

A good test suite and release process are needed with either centralized or localized flows. If releases do not happen quickly, users will resort to running on the repository head or coding local overrides. Arguably, the companies that have the most prolific and successful physical design teams use Perforce because any collection of changes can be labeled and quickly distributed without waiting on a full regression suite and release process. These companies still run regressions and make releases, but the user doesn’t need to take any changes that are not required to fix the issue, and comparing one run to the next is much easier.

What neither method nor revision control prevents is switching flow release versions or making local changes mid-flow. Both approaches require the flow to log its own state at the time of the run. Preferably, this would happen at the exact time of execution, as long run times allow for manipulation even after a run is launched. This is a very difficult problem to solve. Runs often get repeated in the same area multiple times, and pairing the results with the code is a major challenge without creating a mess of time stamps or unique hashes that are difficult for a human to read and consume large amounts of disk space. Another option is to clone the flow for each run. The downside is that the user has to distribute changes to all runs, and this still doesn’t prevent in-flight edits that make debugging impossible.
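One way to picture logging the state of the flow at execution time is a per-file content hash taken when a step starts and compared again when it finishes. The Python sketch below is only an illustration under that assumption; `snapshot_flow` and `changed_since` are hypothetical names, not part of any existing flow. It detects in-flight edits but does not prevent them.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot_flow(flow_dir: str) -> Dict[str, str]:
    """Map each file under flow_dir (by relative path) to the SHA-256 of its contents."""
    root = Path(flow_dir)
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def changed_since(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """List files that were added, removed, or modified between two snapshots."""
    names = set(before) | set(after)
    return sorted(name for name in names if before.get(name) != after.get(name))
```

Taking one snapshot at launch and another at completion, then reporting `changed_since(start, end)` in the run log, gives a short human-readable list of edited files instead of a pile of raw hashes. The snapshot itself stores only hashes, so it stays small, but it cannot restore the old file contents on its own.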

Giant company-wide monorepos pose a significant problem for localized flows, as the repository can become too large to clone for every work area and impossible to clone for every run. There are solutions that make handling large repositories feasible, but at the cost of other users losing visibility into the local clone and its local changes. This makes debugging another user’s run arduous but not impossible. By storing the commit hash and the local changes for each run, the state of the run area can be quickly duplicated. A utility can be written to help with this process, but it is certainly more complicated than just changing directories and looking directly at the run area. The choice of revision control is a widely discussed and heavily debated topic.
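The utility mentioned above might look something like the following Python sketch. It assumes Git, a saved `git diff` patch for the local changes, and a small JSON record per run; the `RunRecord` fields and function names are hypothetical. It only builds the commands rather than running them, and a plain `git diff` would not capture untracked files, so a real tool would need to handle those separately.

```python
import json
import shlex
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RunRecord:
    """Hypothetical per-run record; field names are assumptions for this sketch."""
    repo_url: str    # where the flow repository can be cloned from
    commit: str      # commit hash the run area was on
    patch_file: str  # absolute path to a saved `git diff` ("" if no local changes)


def save_record(record: RunRecord, path: str) -> None:
    """Write the record next to the run results as JSON."""
    with open(path, "w") as fh:
        json.dump(asdict(record), fh, indent=2)


def load_record(path: str) -> RunRecord:
    """Read a record previously written by save_record."""
    with open(path) as fh:
        return RunRecord(**json.load(fh))


def reproduce_commands(record: RunRecord, dest: str) -> List[str]:
    """Build shell commands that recreate the run's flow state in dest."""
    cmds = [
        f"git clone {shlex.quote(record.repo_url)} {shlex.quote(dest)}",
        f"git -C {shlex.quote(dest)} checkout {shlex.quote(record.commit)}",
    ]
    if record.patch_file:
        cmds.append(
            f"git -C {shlex.quote(dest)} apply {shlex.quote(record.patch_file)}"
        )
    return cmds
```

Returning the commands as strings keeps the sketch safe to inspect before anything is executed. For a very large repository, the clone step would likely need to be replaced by whatever partial-checkout mechanism the site already uses.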

In summary, centralizing the flow will cause frustration for end users when they are unable to fix the problem at the source. This frustration and schedule pressure will cause users to create lower-level overrides (project, design, local, etc.) that actually hurt debuggability. This code will get copied and modified instead of being generalized and integrated into the flow, increasing the ramp-up effort for new nodes, projects and designs, and defeating the purpose of having a centralized flow in the first place.

2024/08/09

Justin Kernen