Thursday, December 15, 2011

Compilation Process & Program Memory

Describe in detail the steps involved in making an executable from a source file, and the output of each step. Also explain the different PROGRAM MEMORY SEGMENTS / Memory Layout of C Programs.

Overall Process----------
Source -> Preprocessing -> Compile = Object files (machine instructions) -> Link (relocation + linking) = Load module/executable module -> Loader = Memory address space

1.Preprocessing:
It is the first pass of any C compilation. It processes include files, conditional compilation directives and macros, and hands the expanded source text to the compiler.
2.Compilation
Compilation involves two steps:
i)Compiler: the second pass. It takes the output of the preprocessor (the expanded source code) and generates assembler source code.
ii)Assembler: the third pass. It takes the assembly source code and translates it into machine instructions, optionally producing an assembly listing with offsets. The assembler output is stored in an object file.
So, together the compiler and assembler translate each statement of the source program into a set of machine instructions, replacing each symbolic reference by an address reference. The result is a BINARY FILE/OBJECT FILE.

Object Module Structure:
1.Header section: Contains the sizes of all the other sections, including the size of the uninitialized data section, which is not created until load time. The sizes are needed in order to parse the object module, because it is a binary file and no binary value can be reserved to mark the beginning or end of a section.
2.Machine code section (TEXT section): Code for main, the for loop, the call to printf etc.
3.Initialized data section: Initialized variables and arrays (their sizes and initial values).
4.Symbol table section: It maps symbolic names to addresses, expressed as offsets (distance in bytes) from the beginning of the object module. Thus "start executing function x()" means "start executing instructions at address y".

3.Linking
The object module is then linked together with at least two library object modules: one for standard functions like printf() and the other containing the code for program termination.
i)Relocation: The object modules are merged together, and the internal address references within each object module must be updated to reflect the offset changes brought on by merging all object modules into one.
ii)Linking: External address references in each object module must be resolved. The linker must resolve each explicit reference from the object module to standard functions like printf(), i.e. replace it by the appropriate address reference, expressed as an offset from the beginning of the LOAD MODULE. Such addresses are also called RELATIVE or logical addresses. The linker also assigns final addresses to procedures/functions and variables, and revises code and data to reflect the new addresses (the relocation step described above).

4.Loading & Memory Mapping--------------------

1.The loader takes the load module and creates the LOGICAL ADDRESS SPACE for the program: code, data (initialized), bss (uninitialized), heap (dynamic data), unused logical address space, and stack.
2.The loader must map the logical addresses to physical addresses in main memory and then copy the binary information or data to these memory locations.
A logical address is mapped onto a physical address by simply adding the logical address (offset) to the base register, which holds the starting address of the load module.


[Figures: Compiler, assembler, linker and loader; Process memory address space; Process memory layout]

OBJECT FILES and EXECUTABLE

  • After the source code has been assembled, it produces an object file (e.g. .o, .obj), which is then linked to produce an executable file.
  • Object and executable files come in several formats, such as ELF (Executable and Linking Format) and COFF (Common Object-File Format).  For example, ELF is used on Linux systems, while Windows uses the COFF-based PE (Portable Executable) format.
  • The main object file formats are described below:
    a.out--
    The a.out format is the original file format for Unix.  It consists of three sections: text, data, and bss, which are for program code, initialized data, and uninitialized data, respectively.  This format is so simple that it doesn't have any reserved place for debugging information.  The only debugging format for a.out is stabs, which is encoded as a set of normal symbols with distinctive attributes.
    COFF--The COFF (Common Object File Format) was introduced with System V Release 3 (SVR3) Unix. COFF files may have multiple sections, each prefixed by a header. The number of sections is limited.  The COFF specification includes support for debugging, but the debugging information is limited.  There is no file extension for this format.
    ELF--The ELF (Executable and Linking Format) came with System V Release 4 (SVR4) Unix.  ELF is similar to COFF in being organized into a number of sections, but it removes many of COFF's limitations.  ELF is used on most modern Unix systems, including GNU/Linux, Solaris and Irix. It is also used on many embedded systems.

    • When we examine the content of object files there are areas called sections.  Sections can hold executable code, data, dynamic linking information, debugging data, symbol tables, relocation information, comments, string tables, and notes.
    • Some sections are loaded into the process image and some provide information needed in the building of a process image while still others are used only in linking object files.
    • There are several sections that are common to all executable formats (may be named differently, depending on the compiler/linker) as listed below:
       .text: 
      This section contains the executable instruction codes and is shared among every process running the same binary. This section usually has READ and EXECUTE permissions only. This section is the one most affected by optimization.
      .bss:
      BSS stands for ‘Block Started by Symbol’. It holds uninitialized global and static variables. Since the BSS only holds variables that don't have any values yet, it doesn't need to store an image of these variables; they are zero-filled when the program is loaded. The size that BSS will require at runtime is recorded in the object file, but the BSS (unlike the data section) doesn't take up any actual space in the object file.
      .data:
      Contains the initialized global and static variables and their values. Because the initial values must be stored, it takes up space in the file and can be a large part of the executable. It usually has READ/WRITE permissions.
      .rdata:
      Called .rdata in PE/COFF files and .rodata (read-only data) in ELF files. This section contains constants and string literals.
      .reloc:
      Stores the information required for relocating the image while loading.
      Symbol Table:
      A symbol is basically a name and an address.  The symbol table holds the information needed to locate and relocate a program’s symbolic definitions and references. It is organized as an array of entries, and a symbol table index is a subscript into this array. Index 0 both designates the first entry in the table and serves as the undefined symbol index.
      • Since assembling to machine code removes all traces of labels from the code, the object file format has to keep these around in different places.
      • It is accomplished by the symbol table that contains a list of names and their corresponding offsets in the text and data segments.
      • A disassembler uses this information when translating an object file or executable back into assembly code.
      Relocation Records:
      Relocation is the process of connecting symbolic references with symbolic definitions. For example, when a program calls a function, the associated call instruction must transfer control to the proper destination address at execution. Relocatable files must have relocation entries, which describe how to modify their section contents, thus allowing executable and shared object files to hold the right information for a process's program image.  Simply put, relocation records are information used by the linker to adjust section contents.
