This is a preliminary draft Idea ---- Use the information of one emulation run to let gcc optimize the most common code bits. Remark ------ This document is meant to explain what I get from reading http://lists.gnu.org/archive/html/qemu-devel/2003-05/msg00061.html Details ------- It is a good idea to read my explanation of Dynamic Translation first. The process consists of three steps: profiling, analyzing and then creating new code. Profiling --------- The idea of profiling is to write every Translated Block ("TB") as sequence of oplets, and note each subsequent execution of that TB. After the run this information can be used to summarize the most common sequences of oplets. 1) In op.c, a new oplet is introduced: op_profile(), which takes one PARAM, which is the id of that TB. It writes out this id to /tmp/qemu.profile. 2) When a new TB is created in gen_intermediate_code(), the static TB_id counter is incremented, op_profile is inserted, and TB_id as parameter added. 3) At the end of gen_intermediate_code(), the oplet sequence of the current TB along with its id is written to /tmp/qemu.profile 4) All this happens only if "-profile" is specified on the command line and "--enable-profile" was specified for configure. The "-profile" also opens /tmp/qemu.profile and writes the current target to it. This is in vl.c. Analyzing --------- The idea: build a tree of oplet sequences, each node having the frequency of its execution, and an index of offsets in /tmp/qemu.profile. This is of course an own program. A node could look like this: struct profile_node { // contains pointers to oplet sequences, which add exactly one // oplet longer to this sequence profile_node* next_level[MAX_OP]; uint64_t frequency; off_t* offsets // this is a growable list }; The tree does not need to be computed completely, but can be capped (after all, we don't want to optimize all sequences, but only the most frequent). Creating new oplets ------------------- The idea: take the most frequent oplet sequences, and create new oplets for them. The same program which analyzes /tmp/qemu.profile can do this. In order to create the new oplet, we just cut the respective code of the oplets, and paste them together. A lot of those oplets are templated Fabrice Style, i.e. with _template.h files which are #include'd after #define'ing some constants. Therefore we need to push op.c through the preprocessor first. In order to do that, we have to add an op.E target in Makefile.target, which behaves almost like the op.o target, just substitutes the gcc flag "-E" for "-c" (call just the preprocessor). The analyzer then reads op.E, and creates an index to the code parts for the different oplets. We cannot optimize any oplet sequence containing an EXIT_TB(). For detailed explanations see below. Therefore, in exec-all.h we need an undef of EXIT_TB if compiling op.E, so that the analyzer can stop extending sequences after seeing such an oplet. After having all that information, the analyzer can write out the new oplets into a file op_seq.h, which resides in the respective target directory. This file must be included by op.c. And we have to add an empty file op_seq.h before profiling, so that QEmu compiles. The analyzer has to write out another file, translate-seq.h, which contains a list of the most frequent sequences and their respective oplet. this file resides in the target directory, and like for op_seq.h, a dummy must exist initially. A function in translate-all.c must be added, gen_collapse_seqs(), which uses the data from translate-seq.h, does a binary search for the longest matching sequence(s) and collapses them to the respective oplet. This function has to be called from cpu_gen_code() (which is in translate-all.c), just after the call to gen_intermediate_code(). Why we cannot optimize EXIT_TB() -------------------------------- In practice, EXIT_TB() is just the return instruction which gets stripped from all oplets, so that the current TB actually returns. However, there is a problem there: if the current oplet uses stack space (like calling a function, or needing more variables than available registers), the stack space is corrupted, if EXIT_TB() is called, because the return instruction is just that, a return, without the cleaning up of stack space and volatile registers. If that's not understandable, please read the Quick&Dirty Introduction to Assembly in the Dynamic Translation document.