Source Code Analysis of Target Code Generation in the HotSpot Template Interpreter

2019-11-14 15:17:22

Source: reposted

Contributed by: a reader

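Interpreted execution is often described as translating bytecode for the target platform word by word as the program runs, but doing this literally would be far too slow. If what each bytecode "says" is written on a slip of paper ahead of time, then at run time the matching slip only has to be handed to the target platform, and execution becomes markedly faster. The template interpreter in the JVM's HotSpot virtual machine interprets bytecode in exactly this way. Before starting the analysis, here is a brief look at the ways a JVM can execute code.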

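(1). Interpreting while running: each bytecode is interpreted and the native code for it is run, one bytecode at a time. This kind of execution engine is comparatively slow.
(2). JIT (just-in-time compilation): faster, but it needs more memory. When a method is called for the first time, the native code compiled from its bytecode is cached, so that the next call to the method fetches the cached native code and runs it directly.
(3). Adaptive optimization: native code is compiled and cached only for frequently called methods; code that is called less often is still interpreted.
(4). On-chip virtual machine: the execution engine of the virtual machine is embedded directly in the chip.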

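The HotSpot virtual machine can be configured to run in the following modes:
-Xint: interpreted mode
-Xcomp: compiled mode
-Xmixed: mixed mode
(java -version shows the mode the virtual machine is running in.)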

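At startup, HotSpot generates, for every bytecode, the machine code that interprets it on the target platform and stores that code in the CodeCache. While bytecode is being interpreted, this native machine code is fetched from the CodeCache and executed.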

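The implementation details of the HotSpot virtual machine are worth learning from. If you find source code, let alone assembly, rather dry, you can still get a rough picture of the components involved and their workflow, and gain some understanding of how they are implemented.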

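The analysis of how HotSpot generates its interpreter code starts with the initialization of the template interpreter.
When the virtual machine is created, interpreter_init() is called while the global modules are being initialized, and it initializes the template interpreter. This covers the initialization of the abstract interpreter AbstractInterpreter, the template table TemplateTable, the StubQueue (the stub queue in the CodeCache), and the interpreter generator InterpreterGenerator.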

void TemplateInterpreter::initialize() {
  if (_code != NULL) return;
  // assertions
  //...

  AbstractInterpreter::initialize();

  TemplateTable::initialize();

  // generate interpreter
  { ResourceMark rm;
    TraceTime timer("Interpreter generation", TraceStartupTime);
    int code_size = InterpreterCodeSize;
    NOT_PRODUCT(code_size *= 4;)  // debug uses extra interpreter code space
    _code = new StubQueue(new InterpreterCodeletInterface, code_size, NULL,
                          "Interpreter");
    InterpreterGenerator g(_code);
    if (PrintInterpreter) print();
  }

  // initialize dispatch table
  _active_table = _normal_table;
}

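1. AbstractInterpreter is the common base class of the interpreters built on the assembler model; it defines the abstract interface of the interpreter and the interpreter generator.
2. The template table TemplateTable holds the template (the target code generator function and its argument) for each bytecode.
The initialization of TemplateTable calls def() to store the generator function and argument of every bytecode in the template array _template_table, or in _template_table_wide for wide instructions.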

  //                              interpr. templates
  // Java spec bytecodes          ubcp|disp|clvm|iswd  in    out   generator      argument
  def(Bytecodes::_nop           , ____|____|____|____, vtos, vtos, nop           ,  _      );
  def(Bytecodes::_aconst_null   , ____|____|____|____, vtos, atos, aconst_null   ,  _      );
  def(Bytecodes::_iconst_m1     , ____|____|____|____, vtos, itos, iconst        , -1      );
  def(Bytecodes::_iconst_0      , ____|____|____|____, vtos, itos, iconst        ,  0      );
  def(Bytecodes::_iconst_1      , ____|____|____|____, vtos, itos, iconst        ,  1      );
  def(Bytecodes::_iconst_2      , ____|____|____|____, vtos, itos, iconst        ,  2      );
  //... template definitions for the other bytecodes

Here def() checks whether the corresponding array entry is empty and, if it is, initializes that entry.

  Template* t = is_wide ? template_for_wide(code) : template_for(code);
  // setup entry
  t->initialize(flags, in, out, gen, arg);

Each entry of _template_table or _template_table_wide is a Template object, that is, the template of one bytecode. Template is structured as follows:

class Template VALUE_OBJ_CLASS_SPEC {
 private:
  enum Flags {
    uses_bcp_bit,                                // set if template needs the bcp pointing to bytecode
    does_dispatch_bit,                           // set if template dispatches on its own
    calls_vm_bit,                                // set if template calls the vm
    wide_bit                                     // set if template belongs to a wide instruction
  };

  typedef void (*generator)(int arg);

  int       _flags;                  // describes interpreter template properties (bcp unknown)
  TosState  _tos_in;                 // tos cache state before template execution
  TosState  _tos_out;                // tos cache state after  template execution
  generator _gen;                    // template code generator
  int       _arg;
  // ... (the rest of the class is not shown in this excerpt)
};

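_flags holds the flags; its low four bits indicate:

uses_bcp_bit: the template needs the bytecode pointer (bcp, whose value is the bytecode base address plus the bytecode offset)
does_dispatch_bit: the template performs the dispatch itself, inside the template; jump instructions, for example, set this bit
calls_vm_bit: the template needs to call JVM functions
wide_bit: the template belongs to a wide instruction (which uses additional bytes to extend the local variable index)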

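_tos_in is the TosState before the template executes. TosState (top of stack) is the data type of the element on top of the operand stack; it is used to check that the input and output types a template declares agree with those of its generator function, so that the top-of-stack element is used correctly.
_tos_out is the TosState after the template executes.
_gen is the template generator (a function pointer).
_arg is the argument passed to the template generator.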
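The way the flag bits and the two TosState fields fit together can be sketched in a few lines of standalone C++. This is only an illustration: TemplateSketch, the mask constants ubcp/disp/clvm/iswd/none and the three-entry TosState enum are stand-ins invented here, not the HotSpot definitions.

```cpp
#include <cassert>

// Illustrative subset of top-of-stack states; the real enum has more entries.
enum TosState { itos, atos, vtos };

// Hypothetical stand-in for Template; it mirrors the fields described above.
class TemplateSketch {
 public:
  enum Flags { uses_bcp_bit, does_dispatch_bit, calls_vm_bit, wide_bit };
  typedef void (*generator)(int arg);

  void initialize(int flags, TosState tos_in, TosState tos_out,
                  generator gen, int arg) {
    _flags = flags;
    _tos_in = tos_in;
    _tos_out = tos_out;
    _gen = gen;
    _arg = arg;
  }

  // Each property is one bit of _flags.
  bool uses_bcp() const      { return (_flags & (1 << uses_bcp_bit)) != 0; }
  bool does_dispatch() const { return (_flags & (1 << does_dispatch_bit)) != 0; }
  bool calls_vm() const      { return (_flags & (1 << calls_vm_bit)) != 0; }
  bool is_wide() const       { return (_flags & (1 << wide_bit)) != 0; }

  TosState tos_in() const  { return _tos_in; }
  TosState tos_out() const { return _tos_out; }

  // Runs the generator with its stored argument, if one was registered.
  void generate() const { if (_gen != nullptr) _gen(_arg); }

 private:
  int       _flags = 0;
  TosState  _tos_in = vtos;
  TosState  _tos_out = vtos;
  generator _gen = nullptr;
  int       _arg = 0;
};

// Masks playing the role of the ubcp|disp|clvm|iswd columns in the def() table.
const int none = 0;
const int ubcp = 1 << TemplateSketch::uses_bcp_bit;
const int disp = 1 << TemplateSketch::does_dispatch_bit;
const int clvm = 1 << TemplateSketch::calls_vm_bit;
const int iswd = 1 << TemplateSketch::wide_bit;
```

A template registered with `ubcp | clvm` would then report uses_bcp() and calls_vm() as true and the other two properties as false.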

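3. StubQueue is the stub queue that holds the generated native code. Each element of the queue is an InterpreterCodelet object, which derives from the abstract base class Stub and contains the native code for a bytecode together with some debugging and printing information.
Its memory layout is as follows: the generated target code comes right after the InterpreterCodelet, once the address has been aligned to CodeEntryAlignment.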
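The alignment step can be illustrated with a small address calculation. This is a sketch under assumptions: align_up() and code_begin() are helper names made up for this example, and the header size and alignment values are passed in as parameters because the real values depend on the build and platform.

```cpp
#include <cassert>
#include <cstdint>

// Rounds x up to the next multiple of alignment (which must be a power of two).
inline std::uintptr_t align_up(std::uintptr_t x, std::uintptr_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (x + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: address of the first code byte for a codelet whose
// header starts at stub_begin and occupies header_size bytes.
inline std::uintptr_t code_begin(std::uintptr_t stub_begin,
                                 std::uintptr_t header_size,
                                 std::uintptr_t code_entry_alignment) {
  return align_up(stub_begin + header_size, code_entry_alignment);
}
```

With an assumed 40-byte header at 0x1000 and an assumed alignment of 32, the code would start at 0x1040; the bytes between the end of the header and that address are padding.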

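4. Depending on the interpreter model the virtual machine uses, InterpreterGenerator is either a CppInterpreterGenerator or a TemplateInterpreterGenerator.
The implementation is platform-specific. Taking the x86_64 platform as the example, TemplateInterpreterGenerator is defined in /hotspot/src/cpu/x86/vm/templateInterpreter_x86_64.cpp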

InterpreterGenerator::InterpreterGenerator(StubQueue* code)
  : TemplateInterpreterGenerator(code) {
   generate_all(); // down here so it can be "virtual"
}

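(1). generate_all() in TemplateInterpreterGenerator generates a set of common code sequences that are executed while the JVM runs, plus the InterpreterCodelet of every bytecode:

error exits: entry for exiting on an error
bytecode tracing entries (when -XX:+TraceBytecodes is set)
method return entries
JVMTI EarlyReturn entries
deoptimization return entries
entries for the handlers that process the results of native calls
continuation entries
safepoint entries
exception handling entries
exception throwing entries
method entries (native and non-native methods)
bytecode entries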

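(2). Among these, set_entry_points_for_all_bytes() generates target code for every defined bytecode and sets the corresponding entry points (only the is_defined case is considered here).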

void TemplateInterpreterGenerator::set_entry_points_for_all_bytes() {
  for (int i = 0; i < DispatchTable::length; i++) {
    Bytecodes::Code code = (Bytecodes::Code)i;
    if (Bytecodes::is_defined(code)) {
      set_entry_points(code);
    } else {
      // bytecode (opcode) that is not implemented
      set_unimplemented(i);
    }
  }
}

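(3). set_entry_points() fetches the Template for the bytecode and calls set_short_entry_points() to process it, then stores the entry addresses in the dispatch table (DispatchTable) _normal_table, or in _wentry_point for the wide form of the instruction.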

void TemplateInterpreterGenerator::set_entry_points(Bytecodes::Code code) {
  CodeletMark cm(_masm, Bytecodes::name(code), code);
  // initialize entry points
  // ... asserts
  address bep = _illegal_bytecode_sequence;
  address cep = _illegal_bytecode_sequence;
  address sep = _illegal_bytecode_sequence;
  address aep = _illegal_bytecode_sequence;
  address iep = _illegal_bytecode_sequence;
  address lep = _illegal_bytecode_sequence;
  address fep = _illegal_bytecode_sequence;
  address dep = _illegal_bytecode_sequence;
  address vep = _unimplemented_bytecode;
  address wep = _unimplemented_bytecode;
  // code for short & wide version of bytecode
  if (Bytecodes::is_defined(code)) {
    Template* t = TemplateTable::template_for(code);
    assert(t->is_valid(), "just checking");
    set_short_entry_points(t, bep, cep, sep, aep, iep, lep, fep, dep, vep);
  }
  if (Bytecodes::wide_is_defined(code)) {
    Template* t = TemplateTable::template_for_wide(code);
    assert(t->is_valid(), "just checking");
    set_wide_entry_point(t, wep);
  }
  // set entry points
  EntryPoint entry(bep, cep, sep, aep, iep, lep, fep, dep, vep);
  Interpreter::_normal_table.set_entry(code, entry);
  Interpreter::_wentry_point[code] = wep;
}

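set_short_entry_points() is analyzed here using a non-wide instruction. bep (byte entry point), cep, sep, aep, iep, lep, fep, dep and vep are the entry addresses used when the top-of-stack state before the instruction executes is byte/boolean, char, short, array/reference (object reference), int, long, float, double and void, respectively.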
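One way to picture how an EntryPoint feeds the dispatch table is a two-dimensional array indexed by top-of-stack state and bytecode. The sketch below is an assumption-laden illustration: EntryPointSketch and DispatchTableSketch are names invented here, the order of the TosState enumerators is chosen for the example, and the real classes may differ in detail.

```cpp
#include <cassert>
#include <cstdint>

typedef const std::uint8_t* address;

// Enumerator order is an assumption made for this sketch.
enum TosState { btos, ctos, stos, itos, ltos, ftos, dtos, atos, vtos,
                number_of_states };

// One entry address per incoming top-of-stack state, for a single bytecode.
class EntryPointSketch {
 public:
  EntryPointSketch(address bep, address cep, address sep, address aep,
                   address iep, address lep, address fep, address dep,
                   address vep) {
    _entry[btos] = bep; _entry[ctos] = cep; _entry[stos] = sep;
    _entry[atos] = aep; _entry[itos] = iep; _entry[ltos] = lep;
    _entry[ftos] = fep; _entry[dtos] = dep; _entry[vtos] = vep;
  }
  address entry(TosState state) const { return _entry[state]; }

 private:
  address _entry[number_of_states];
};

// One row of 256 addresses per state, so a dispatch only needs
// "row for the current state" plus "index by bytecode".
class DispatchTableSketch {
 public:
  static const int length = 256;

  DispatchTableSketch() : _table() {}  // all entries start out null

  void set_entry(int i, const EntryPointSketch& entry) {
    assert(0 <= i && i < length);
    for (int s = 0; s < number_of_states; s++) {
      _table[s][i] = entry.entry(static_cast<TosState>(s));
    }
  }

  const address* table_for(TosState state) const { return _table[state]; }

 private:
  address _table[number_of_states][length];
};
```

Storing the addresses per state rather than per bytecode is what lets the generated dispatch code load one table base for the current state and then index it with the next bytecode.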

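(4). set_short_entry_points() switches on the top-of-stack type the template expects. byte, char and short must all be treated as int; for the other non-void types it calls generate_and_dispatch() to produce the target code. iconst_0 serves as the example for how TOS is handled:
For iconst, the expected _tos_in (top-of-stack type before execution) is void (vtos) and the expected _tos_out (top-of-stack type after execution) is int (itos).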

void TemplateInterpreterGenerator::set_short_entry_points(Template* t,
                                                          address& bep, address& cep, address& sep,
                                                          address& aep, address& iep, address& lep,
                                                          address& fep, address& dep, address& vep) {
  assert(t->is_valid(), "template must exist");
  switch (t->tos_in()) {
    case btos:
    case ctos:
    case stos:
      ShouldNotReachHere();  // btos/ctos/stos should use itos.
      break;
    case atos: vep = __ pc(); __ pop(atos); aep = __ pc(); generate_and_dispatch(t); break;
    case itos: vep = __ pc(); __ pop(itos); iep = __ pc(); generate_and_dispatch(t); break;
    case ltos: vep = __ pc(); __ pop(ltos); lep = __ pc(); generate_and_dispatch(t); break;
    case ftos: vep = __ pc(); __ pop(ftos); fep = __ pc(); generate_and_dispatch(t); break;
    case dtos: vep = __ pc(); __ pop(dtos); dep = __ pc(); generate_and_dispatch(t); break;
    case vtos: set_vtos_entry_points(t, bep, cep, sep, aep, iep, lep, fep, dep, vep);     break;
    default  : ShouldNotReachHere();                                                 break;
  }
}

Here __ is defined as follows:

# define __ _masm->

that is, it refers to the macro assembler of the template interpreter.

(5). set_vtos_entry_points() is analyzed for the case where the expected top-of-stack state is vtos:

void TemplateInterpreterGenerator::set_vtos_entry_points(Template* t,
                                                         address& bep,
                                                         address& cep,
                                                         address& sep,
                                                         address& aep,
                                                         address& iep,
                                                         address& lep,
                                                         address& fep,
                                                         address& dep,
                                                         address& vep) {
  assert(t->is_valid() && t->tos_in() == vtos, "illegal template");
  Label L;
  aep = __ pc();  __ push_ptr();  __ jmp(L);
  fep = __ pc();  __ push_f();    __ jmp(L);
  dep = __ pc();  __ push_d();    __ jmp(L);
  lep = __ pc();  __ push_l();    __ jmp(L);
  bep = cep = sep =
  iep = __ pc();  __ push_i();
  vep = __ pc();
  __ bind(L);
  generate_and_dispatch(t);
}
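The essential idea in set_vtos_entry_points() is that all entry points are positions in one code stream that end at a shared body. The toy model below records those positions. It is only a sketch: ToyAssembler, EntryOffsets and layout_vtos_entries() are names invented for this example, and every pseudo-instruction is counted as one slot rather than as real machine-code bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct EntryOffsets {
  std::size_t aep, fep, dep, lep, iep, vep;
};

// Toy "assembler": each pseudo-instruction takes one slot, and pc() is the
// index of the slot the next instruction will occupy.
class ToyAssembler {
 public:
  std::size_t pc() const { return _code.size(); }
  void emit(const std::string& insn) { _code.push_back(insn); }
  const std::vector<std::string>& code() const { return _code; }

 private:
  std::vector<std::string> _code;
};

// Lays out the entries in the same order as the function above: every typed
// entry spills the cached top-of-stack value and then reaches the body.
EntryOffsets layout_vtos_entries(ToyAssembler& masm, const std::string& body) {
  EntryOffsets e;
  e.aep = masm.pc(); masm.emit("push_ptr"); masm.emit("jmp L");
  e.fep = masm.pc(); masm.emit("push_f");   masm.emit("jmp L");
  e.dep = masm.pc(); masm.emit("push_d");   masm.emit("jmp L");
  e.lep = masm.pc(); masm.emit("push_l");   masm.emit("jmp L");
  e.iep = masm.pc(); masm.emit("push_i");   // no jump: falls through to L
  e.vep = masm.pc();                        // label L is bound here
  masm.emit(body);                          // shared template body
  return e;
}
```

The int entry needs no jump because it sits directly in front of the body, and the vtos entry is simply the address of the body itself.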

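Taking the ftos entry as the example (vtos means that the implementation of the current bytecode does not care about the state of the top-of-stack element), here are the instructions that handle this entry:
push_f():
Defined in /hotspot/src/cpu/x86/vm/interp_masm_x86_64.cpp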

void InterpreterMacroAssembler::push_f(XMMRegister r) {
  subptr(rsp, wordSize);
  movflt(Address(rsp, 0), r);
}

Here r defaults to xmm0, and wordSize is the machine word size (8 bytes on a 64-bit machine).

subptr() actually calls subq():

void MacroAssembler::subptr(Register dst, int32_t imm32) {
  LP64_ONLY(subq(dst, imm32)) NOT_LP64(subl(dst, imm32));
}

subq() is implemented as follows:

void Assembler::subq(Register dst, int32_t imm32) {
  (void) prefixq_and_encode(dst->encoding());
  emit_arith(0x81, 0xE8, dst, imm32);
}

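emit_arith() in turn calls emit_byte()/emit_long() to write the instruction bytes "83 EC 08". Because 8 fits in an 8-bit signed value, the first byte is 0x81 | 0x02 = 0x83; rsp has register number 4, so the second byte is 0xE8 | 0x04 = 0xEC; the third byte is 0x08 & 0xFF = 0x08. On x86_64 these three bytes follow the REX.W prefix 0x48 written by prefixq_and_encode(), and the complete instruction is sub $0x8,%rsp in AT&T syntax.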

void Assembler::emit_arith(int op1, int op2, Register dst, int32_t imm32) {
  assert(isByte(op1) && isByte(op2), "wrong opcode");
  assert((op1 & 0x01) == 1, "should be 32bit Operation");
  assert((op1 & 0x02) == 0, "sign-extension bit should not be set");
  if (is8bit(imm32)) { // the immediate here is 8 (wordSize), which fits in 8 bits
    emit_byte(op1 | 0x02); // set sign bit
    emit_byte(op2 | encode(dst));
    emit_byte(imm32 & 0xFF);
  } else {
    emit_byte(op1);
    emit_byte(op2 | encode(dst));
    emit_long(imm32);
  }
}
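The byte selection described above can be checked with a standalone sketch that follows the same two branches. emit_arith_sketch() is a name invented for this example; it returns the bytes in a vector instead of writing to a code buffer, takes the register number directly, and does not model the REX prefix that prefixq_and_encode() writes beforehand.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// True if x can be represented as an 8-bit signed immediate.
inline bool is8bit(std::int32_t x) { return -128 <= x && x < 128; }

// Produces the opcode byte, the ModRM byte and the immediate for an
// "op reg, imm" arithmetic instruction, choosing the short form when possible.
std::vector<std::uint8_t> emit_arith_sketch(int op1, int op2,
                                            int reg_encoding,
                                            std::int32_t imm32) {
  assert((op1 & 0x01) == 1);  // 32-bit operation expected
  assert((op1 & 0x02) == 0);  // sign-extension bit must not be set yet
  assert(0 <= reg_encoding && reg_encoding < 8);

  std::vector<std::uint8_t> out;
  if (is8bit(imm32)) {
    out.push_back(static_cast<std::uint8_t>(op1 | 0x02));          // imm8 form
    out.push_back(static_cast<std::uint8_t>(op2 | reg_encoding));
    out.push_back(static_cast<std::uint8_t>(imm32 & 0xFF));
  } else {
    out.push_back(static_cast<std::uint8_t>(op1));                 // imm32 form
    out.push_back(static_cast<std::uint8_t>(op2 | reg_encoding));
    const std::uint32_t bits = static_cast<std::uint32_t>(imm32);
    for (int i = 0; i < 4; i++) {                                   // little-endian
      out.push_back(static_cast<std::uint8_t>((bits >> (8 * i)) & 0xFF));
    }
  }
  return out;
}
```

Calling it with op1 = 0x81, op2 = 0xE8, register number 4 and immediate 8 yields the three bytes 83 EC 08, while an immediate of 0x100 takes the long form 81 EC 00 01 00 00.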

emit_byte() is defined in /hotspot/src/share/vm/asm/assembler.inline.hpp.
It copies the given byte to the location _code_pos:

inline void AbstractAssembler::emit_byte(int x) {
  assert(isByte(x), "not a byte");
  *(unsigned char*)_code_pos = (unsigned char)x;
  _code_pos += sizeof(unsigned char);
  sync();
}

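So subq() wrote the instruction sub $0x8,%rsp into the code buffer.
Similarly, movflt() wrote the instruction movss %xmm0,(%rsp) into the code buffer,
and jmp() wrote the instruction jmpq <addr> (where addr is the entry of the native code for the bytecode).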

The entry code produced by set_vtos_entry_points() looks like this:

push %rax             .....(atos entry)
jmpq <addr>
sub $0x8,%rsp         .....(ftos entry)
movss %xmm0,(%rsp)
jmpq <addr>           (addr is the entry of the native code for the bytecode)
sub $0x10,%rsp        .....(dtos entry)
movsd %xmm0,(%rsp)
jmpq <addr>
sub $0x10,%rsp        .....(ltos entry)
mov %rax,(%rsp)
jmpq <addr>
push %rax             .....(itos entry)

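Finally, set_vtos_entry_points() calls generate_and_dispatch(), which writes the interpreter code for the current bytecode and the logic that jumps to the next bytecode to continue execution.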

The main part of generate_and_dispatch() is as follows:

void TemplateInterpreterGenerator::generate_and_dispatch(Template* t, TosState tos_out) {
  // ...
  // generate template
  t->generate(_masm);
  // advance
  if (t->does_dispatch()) {
    //asserts
  } else {
    // dispatch to next bytecode
    __ dispatch_epilog(tos_out, step);
  }
}

generate() is analyzed here with iconst() as the target code generator:

void Template::generate(InterpreterMacroAssembler* masm) {
  // parameter passing
  TemplateTable::_desc = this;
  TemplateTable::_masm = masm;
  // code generation
  _gen(_arg);
  masm->flush();
}

generate() calls the generator function _gen(_arg). That function is platform-specific; for x86_64 it is defined in /hotspot/src/cpu/x86/vm/templateTable_x86_64.cpp

void TemplateTable::iconst(int value) {
  transition(vtos, itos);
  if (value == 0) {
    __ xorl(rax, rax);
  } else {
    __ movl(rax, value);
  }
}

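The iconst_i instruction pushes i onto the operand stack; with top-of-stack caching, the generated code leaves the value in rax (tos_out is itos). When i is 0, the generator function iconst() does not load 0 into the register directly but clears it with an exclusive OR: xorl() writes "xor %eax,%eax" into the code buffer, and on x86_64 a write to eax also zeroes the upper half of rax. When i is not 0, movl() writes "mov $i,%eax".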
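A plausible reason for special-casing zero is code size, which a short encoding sketch makes visible. The function below is an illustration only: encode_iconst() is a hypothetical name, it uses 31 C0 as one valid encoding of the 32-bit register-clearing XOR (the exact opcode an assembler picks may differ), and B8 followed by a 32-bit little-endian immediate for the move.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns machine-code bytes that load `value` into eax, picking the
// shorter xor form for zero, in the spirit of the generator shown above.
std::vector<std::uint8_t> encode_iconst(std::int32_t value) {
  std::vector<std::uint8_t> out;
  if (value == 0) {
    out.push_back(0x31);  // xor r/m32, r32
    out.push_back(0xC0);  // ModRM: both operands are eax
  } else {
    out.push_back(0xB8);  // mov eax, imm32
    const std::uint32_t bits = static_cast<std::uint32_t>(value);
    for (int i = 0; i < 4; i++) {  // immediate is stored little-endian
      out.push_back(static_cast<std::uint8_t>((bits >> (8 * i)) & 0xFF));
    }
  }
  return out;
}
```

Under these encodings the zero case takes 2 bytes and every other constant takes 5, so the template for iconst_0 is smaller than the templates for iconst_1 through iconst_5.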

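When the template does not dispatch on its own, dispatch_epilog() is called to generate the target code that fetches the next instruction and dispatches to it: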

void InterpreterMacroAssembler::dispatch_epilog(TosState state, int step) {
  dispatch_next(state, step);
}

dispatch_next() is implemented as follows:

void InterpreterMacroAssembler::dispatch_next(TosState state, int step) {
  // load next bytecode (load before advancing r13 to prevent AGI)
  load_unsigned_byte(rbx, Address(r13, step));
  // advance r13
  increment(r13, step);
  dispatch_base(state, Interpreter::dispatch_table(state));
}

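dispatch_next() first calls load_unsigned_byte() to write the instruction "movzbl step(%r13),%ebx", then calls increment() to write an inc or add instruction on %r13, and finally calls dispatch_base() to write "jmp *(%r10,%rbx,8)". This resembles a program counter advancing by the width of one instruction and then fetching and executing the next one.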
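The load, advance and indexed-jump sequence can be mimicked in portable C++ with a table of function pointers indexed by top-of-stack state and bytecode. Everything in this sketch is invented for illustration (the three toy bytecodes, Frame, the handler names and run()); a loop stands in for the indirect jump, and each handler plays the role of one generated template entry.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum TosState { itos, vtos, number_of_states };
enum Bytecode { op_iconst_1, op_iadd, op_halt, number_of_codes };

struct Frame {
  const std::uint8_t* bcp;          // bytecode pointer (r13 in the text above)
  std::int64_t rax;                 // top-of-stack cache register
  std::vector<std::int64_t> stack;  // expression stack
  bool running;
};

typedef TosState (*Handler)(Frame&);  // returns the template's tos_out

// vtos entry of iconst_1: nothing is cached, so just load the constant.
TosState iconst_1_vtos(Frame& f) { f.rax = 1; return itos; }
// itos entry of iconst_1: spill the cached int first, then reuse the body.
TosState iconst_1_itos(Frame& f) { f.stack.push_back(f.rax); return iconst_1_vtos(f); }
// itos entry of iadd: one operand is cached in rax, the other is on the stack.
TosState iadd_itos(Frame& f) {
  assert(!f.stack.empty());
  f.rax += f.stack.back();
  f.stack.pop_back();
  return itos;
}
// vtos entry of iadd: pop the first operand into the cache, then reuse the body.
TosState iadd_vtos(Frame& f) {
  assert(!f.stack.empty());
  f.rax = f.stack.back();
  f.stack.pop_back();
  return iadd_itos(f);
}
TosState halt(Frame& f) { f.running = false; return vtos; }

const Handler dispatch_table[number_of_states][number_of_codes] = {
  /* itos */ { iconst_1_itos, iadd_itos, halt },
  /* vtos */ { iconst_1_vtos, iadd_vtos, halt },
};

// Runs a toy program that must end with op_halt; returns the cached TOS value.
std::int64_t run(const std::vector<std::uint8_t>& code) {
  assert(!code.empty() && code.back() == op_halt);
  Frame f = { code.data(), 0, std::vector<std::int64_t>(), true };
  TosState state = vtos;
  Handler h = dispatch_table[state][*f.bcp];
  while (true) {
    state = h(f);                         // execute the template body
    if (!f.running) break;
    const int step = 1;                   // every toy bytecode is one byte long
    const std::uint8_t next = f.bcp[step];  // load next bytecode
    f.bcp += step;                          // advance bcp
    h = dispatch_table[state][next];        // pick the entry for (state, bytecode)
  }
  return f.rax;
}
```

Running the program iconst_1, iconst_1, iadd, halt enters the first iconst_1 through its vtos entry and the second through its itos entry, and ends with 2 in the cache register.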

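At this point a question arises: where is _code_pos? As noted earlier, StubQueue is the stub queue holding the generated native code, each element of the queue is an InterpreterCodelet object, and an InterpreterCodelet contains the native code for a bytecode plus some debugging and printing information. So how does _code_pos relate to an InterpreterCodelet?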

Notice that a CodeletMark object is constructed both when generating the JVM's various entry routines and when generating native code for a bytecode:

CodeletMark cm(_masm, Bytecodes::name(code), code);

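The CodeletMark constructor is shown below. Its initializer list calls request() on the StubQueue to create an InterpreterCodelet object, then constructs a CodeBuffer from that InterpreterCodelet's code address and size to hold the generated target code.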

 public:
  CodeletMark(
    InterpreterMacroAssembler*& masm,
    const char* description,
    Bytecodes::Code bytecode = Bytecodes::_illegal):
    _clet((InterpreterCodelet*)AbstractInterpreter::code()->request(codelet_size())),
    _cb(_clet->code_begin(), _clet->code_size())

  { // request all space (add some slack for Codelet data)
    assert (_clet != NULL, "we checked not enough space already");

    // initialize Codelet attributes
    _clet->initialize(description, bytecode);
    // create assembler for code generation
    masm  = new InterpreterMacroAssembler(&_cb);
    _masm = &masm;
  }

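At this point no target code has been generated yet, so its size is unknown. The constructor therefore requests all of the free space in the StubQueue, leaving only a little slack for alignment (note that StubQueue is really one contiguous block of memory in which every Stub is allocated).
It then initializes the description and the bytecode of the InterpreterCodelet and constructs an assembler object, InterpreterMacroAssembler, on top of the CodeBuffer.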

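By now it should be clear that the assembler's _code_pos is the current write position of the generated code within the CodeBuffer.
The CodeletMark destructor also deserves a mention: once it has made sure that the code produced by the assembler has been fully written to the CodeBuffer, it calls commit() on the StubQueue to assign the space actually used to the current Stub (InterpreterCodelet).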

  ~CodeletMark() {
    // align so printing shows nop's instead of random code at the end (Codelets are aligned)
    (*_masm)->align(wordSize);
    // make sure all code is in code buffer
    (*_masm)->flush();
    // commit Codelet
    AbstractInterpreter::code()->commit((*_masm)->code()->pure_insts_size());
    // make sure nobody can use _masm outside a CodeletMark lifespan
    *_masm = NULL;
  }
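The request-everything-then-commit pattern maps naturally onto a scope-bound C++ object. The sketch below is a simplified model, not the HotSpot code: StubQueueSketch and CodeletMarkSketch are invented names, the queue is a plain byte vector, and the header, alignment and slack handling of the real classes are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One contiguous buffer; committed stubs occupy [0, used()).
class StubQueueSketch {
 public:
  explicit StubQueueSketch(std::size_t capacity) : _buffer(capacity), _used(0) {}

  std::size_t free_space() const { return _buffer.size() - _used; }
  std::size_t used() const { return _used; }

  // Hands out the start of the free region; nothing is consumed yet.
  std::uint8_t* request(std::size_t size) {
    assert(size <= free_space());
    return _buffer.data() + _used;
  }

  // Makes the first `size` bytes of the requested region permanent.
  void commit(std::size_t size) {
    assert(size <= free_space());
    _used += size;
  }

 private:
  std::vector<std::uint8_t> _buffer;
  std::size_t _used;
};

// Requests all free space on construction and commits only what was
// actually written on destruction.
class CodeletMarkSketch {
 public:
  explicit CodeletMarkSketch(StubQueueSketch& queue)
      : _queue(queue),
        _begin(queue.request(queue.free_space())),
        _limit(queue.free_space()),
        _pos(0) {}

  ~CodeletMarkSketch() { _queue.commit(_pos); }

  CodeletMarkSketch(const CodeletMarkSketch&) = delete;
  CodeletMarkSketch& operator=(const CodeletMarkSketch&) = delete;

  // Plays the role of emit_byte(): _pos is the current write position.
  void emit_byte(std::uint8_t b) {
    assert(_pos < _limit);
    _begin[_pos++] = b;
  }

 private:
  StubQueueSketch& _queue;
  std::uint8_t*    _begin;
  std::size_t      _limit;
  std::size_t      _pos;
};
```

Because the commit happens in the destructor, the queue only learns the final code size when the mark goes out of scope, which is why a new mark must not be created while another one is still alive.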