This paper presents a static transformation algorithm for C++-based hardware models such as SystemC models. The transformation changes the threading structure of a model and generates C++ code with pre-emptive threading for efficient simulation on multi-processor systems. In C++-based modeling paradigms, modeling for efficient simulation and modeling for synthesis seem to be competing goals. For synthesis, the design should be partitioned according to the targeted hardware units and their interconnections, so concurrency is aligned along unit boundaries. From the standpoint of simulation performance, however, such a model may contain many concurrent modules implemented by many user-level threads, an artifact of the C++-based concurrency modeling mechanism. Previous work has shown that multi-threading is not necessarily a hindrance to simulation performance. In fact, a proper choice of multi-threading is necessary for simulation efficiency, especially in very large simulation models with I/O-bound computations. Because the existing C++-based modeling frameworks employ user-level thread packages, however, even this necessary threading cannot take advantage of multiple processors: user-level threads are not visible to the operating system kernel, so the kernel cannot schedule them on different processors. We solve this problem by accepting synthesis-targeted SystemC models and compiling them into multi-threaded simulation models that use kernel-level threads, resulting in faster simulation on multi-processor as well as single-processor systems. In the resulting code, concurrency is usually aligned along the dataflow through the model. In our past work we showed that simulation is faster after such concurrency re-assignment; here we give a foundation for implementing the algorithm that performs it. This work is similar to the Quasi-Static Scheduling work in the literature, but its aim and context are different, and so is the basis for the algorithm design.