PHP is simple, but it’s not easy to master. In addition to knowing how to use it, we also need to know how it works under the hood.

Why study the underlying implementation of PHP? First, to use a dynamic language well we need to understand it. Second, its memory management and framework model are worth learning from. Third, once we know the internals we can optimize our programs and write extensions that add more powerful features.

PHP is a dynamic language for web development. More specifically, it is a software framework made up of a large number of component modules implemented in C.

In short, PHP executes a script as follows: after receiving a piece of code, it translates the source into individual instructions (opcodes) through lexical and syntactic parsing, and the Zend virtual machine then executes these instructions in order. PHP itself is implemented in C, so every operation ultimately ends up calling C functions. In effect, we can think of PHP as a piece of software written in C.

PHP directory structure

The PHP source code also includes several files generated during development, and several sections maintained in their respective locations upstream. (Note: PHP version 7.4.13).

<php-src>/
 ├─ .git/                           # Git configuration and source directory
 ├─ TSRM/                           # Thread Safe Resource Manager
 └─ Zend/                           # Zend Engine
    ├─ zend_vm_execute.h            # Generated by `Zend/zend_vm_gen.php`
    ├─ zend_vm_opcodes.c            # Generated by `Zend/zend_vm_gen.php`
    ├─ zend_vm_opcodes.h            # Generated by `Zend/zend_vm_gen.php`
    └─ ...
 ├─ appveyor/                       # Appveyor CI service files
 └─ build/                          # *nix build system files
    ├─ ax_*.m4                      # https://github.com/autoconf-archive/autoconf-archive
    ├─ config.guess                 # https://git.savannah.gnu.org/cgit/config.git
    ├─ config.sub                   # https://git.savannah.gnu.org/cgit/config.git
    ├─ libtool.m4                   # https://git.savannah.gnu.org/cgit/libtool.git
    ├─ ltmain.sh                    # https://git.savannah.gnu.org/cgit/libtool.git
    ├─ shtool                       # https://www.gnu.org/software/shtool/
    └─ ...
 ├─ docs/                           # PHP internals and repository documentation
 └─ ext/                            # PHP core extensions
    └─ bcmath/
       ├─ libbcmath/                # Forked and maintained in php-src
       └─ ...
    └─ curl/
       ├─ sync-constants.php        # The curl symbols checker
       └─ ...
    └─ date/
       └─ lib/                      # Bundled datetime library https://github.com/derickr/timelib
          ├─ parse_date.c           # Generated by re2c 0.15.3
          ├─ parse_iso_intervals.c  # Generated by re2c 0.15.3
          └─ ...
       └─ ...
    └─ ffi/
       ├─ ffi_parser.c              # Generated by https://github.com/dstogov/llk
       └─ ...
    └─ fileinfo/
       ├─ libmagic/                 # Modified libmagic https://github.com/file/file
       ├─ data_file.c               # Generated by `ext/fileinfo/create_data_file.php`
       ├─ libmagic.patch            # Modifications patch from upstream libmagic
       ├─ magicdata.patch           # Modifications patch from upstream libmagic
       └─ ...
    └─ gd/
       ├─ libgd/                    # Bundled and modified GD library https://github.com/libgd/libgd
       └─ ...
    └─ mbstring/
       ├─ libmbfl/                  # Forked and maintained in php-src
       ├─ unicode_data.h            # Generated by `ext/mbstring/ucgendat/ucgendat.php`
       └─ ...
    └─ pcre/
       ├─ pcre2lib/                 # https://www.pcre.org/
       └─ ...
    └─ pdo_mysql/
       ├─ php_pdo_mysql_sqlstate.h  # Generated by `ext/pdo_mysql/get_error_codes.php`
       └─ ...
    └─ skeleton/                    # Skeleton for developing new extensions with `ext/ext_skel.php`
       └─ ...
    └─ standard/
       └─ html_tables/
          ├─ mappings/              # https://www.unicode.org/Public/MAPPINGS/
          └─ ...
       ├─ credits_ext.h             # Generated by `scripts/dev/credits`
       ├─ credits_sapi.h            # Generated by `scripts/dev/credits`
       ├─ html_tables.h             # Generated by `ext/standard/html_tables/html_table_gen.php`
       └─ ...
    └─ tokenizer/
       ├─ tokenizer_data.c          # Generated by `ext/tokenizer/tokenizer_data_gen.sh`
       └─ ...
    └─ xmlrpc/
       ├─ libxmlrpc/                # Forked and maintained in php-src
       └─ ...
    └─ zend_test                    # For testing internal APIs. Not needed for regular builds.
       └─ ...
    └─ zip/                         # Bundled https://github.com/pierrejoye/php_zip
       └─ ...
    └─ ...
 └─ main/                           # Binding that ties extensions, SAPIs, and engine together
    ├─ streams/                     # Streams layer subsystem
    ├─ php_version.h                # Generated by release managers using `configure`
    └─ ...
 ├─ pear/                           # PEAR installation
 └─ sapi/                           # PHP SAPI modules
    └─ cli/
       ├─ mime_type_map.h           # Generated by `sapi/cli/generate_mime_type_map.php`
       └─ ...
    └─ ...
 ├─ scripts/                        # php-config, phpize and internal development scripts
 ├─ tests/                          # Core features tests
 ├─ travis/                         # Travis CI service files
 └─ win32/                          # Windows build system files
    ├─ cp_enc_map.c                 # Generated by `win32/cp_enc_map_gen.exe`
    └─ ...
 └─ ...
Directory   Description
TSRM        Thread safety implementation. PHP’s thread safety is built on the TSRM (Thread Safe Resource Manager) library, and the common *G macros in the PHP source are usually wrappers around TSRM.
Zend        Core implementation of the PHP language: lexical and syntactic parsing of scripts, execution of opcodes, the extension mechanism, and so on.
build       Build system files for *nix platforms.
ext         PHP extensions. Most PHP functions are defined and implemented here, such as the array, pdo, and spl families. Extensions you write yourself can also be placed here while testing and debugging.
main        PHP’s main code, which implements PHP’s basic infrastructure. It differs from the Zend engine, which implements the core language runtime.
netware     Support files for the Novell NetWare platform (found in older PHP versions).
pear        PEAR is short for PHP Extension and Application Repository, a code repository for PHP extensions and applications. Simply put, PEAR is to PHP what CPAN (Comprehensive Perl Archive Network) is to Perl.
sapi        PHP’s server interface layer. It contains the code for the various server abstraction layers, such as Apache’s mod_php, cli, cgi, embed, and fpm.
scripts     php-config, phpize, and internal development scripts.
tests       Test scripts for core PHP features.
travis      Travis CI service files; not PHP-specific.
win32       Files for building PHP on Windows, plus Windows-specific implementations (for example, sockets are implemented differently on Windows than on *nix platforms).

Although there are many source directories, the core ones are sapi, main, Zend, ext, and TSRM.

SAPI

The input to a PHP program can be standard input on the command line or a network request over the CGI/FastCGI protocol. PHP can even be embedded in C or C++ programs. These correspond to the cli, fpm/cgi, and embed modes; in addition there are the apache2handler and litespeed modes.

  1. apache2handler: the way PHP runs when Apache is the web server, using mod_php. It has historically been one of the most widely used modes.

  2. cgi: another way for a web server to talk to PHP, which also covers the well-known FastCGI protocol. FastCGI with PHP has become increasingly common and is the usual choice for asynchronous web servers, nginx being the typical example. FastCGI itself is a protocol; on the PHP side it is implemented by the cgi and fpm SAPIs.

  3. cli: invocation from the command line.

The sapi directory is an abstraction of the input and output layers, and is the specification for PHP to provide external services.

Similarly, the output can be written to the standard output of the command line or returned to the client as a network response based on the cgi/fastcgi protocol.

SAPI stands for Server API. It is the specification for how PHP provides services to the outside world. It defines the structure sapi_module_struct, which holds many hook function pointers for mode startup, shutdown, activation, deactivation, and so on. Each mode points these hooks at its own functions, which makes it easy to add new ways for PHP to serve requests. The modes listed above are all implementations of sapi_module_struct, and together they cover PHP’s many deployment scenarios.
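The hook idea can be sketched in a few lines of C. This is only an illustration under stated assumptions: `sapi_demo_module` and every `sapi_demo_*` name are hypothetical and far smaller than the real sapi_module_struct; the hooks just append to a log so the call order is visible.

```c
#include <string.h>

/* Hypothetical, heavily reduced stand-in for sapi_module_struct. */
typedef struct sapi_demo_module {
    const char *name;
    int (*startup)(void);     /* once, when the mode starts */
    int (*activate)(void);    /* at the start of each request */
    int (*deactivate)(void);  /* at the end of each request */
    int (*shutdown)(void);    /* once, when the mode stops */
} sapi_demo_module;

char sapi_demo_log[128];      /* records which hooks ran */

int sapi_demo_note(const char *what) {
    strncat(sapi_demo_log, what, sizeof(sapi_demo_log) - strlen(sapi_demo_log) - 1);
    return 0;
}

/* One "mode": it fills the hooks with its own functions. */
int cli_demo_startup(void)    { return sapi_demo_note("cli:startup;"); }
int cli_demo_activate(void)   { return sapi_demo_note("cli:activate;"); }
int cli_demo_deactivate(void) { return sapi_demo_note("cli:deactivate;"); }
int cli_demo_shutdown(void)   { return sapi_demo_note("cli:shutdown;"); }

const sapi_demo_module sapi_demo_cli = {
    "cli", cli_demo_startup, cli_demo_activate,
    cli_demo_deactivate, cli_demo_shutdown
};

/* The core only calls the hooks; it never knows which mode it is serving. */
void sapi_demo_serve(const sapi_demo_module *m, int requests) {
    m->startup();
    for (int i = 0; i < requests; i++) {
        m->activate();
        /* ... the script would be compiled and executed here ... */
        m->deactivate();
    }
    m->shutdown();
}
```

Adding another mode in this sketch means defining four more functions and one more `sapi_demo_module` value; `sapi_demo_serve` stays unchanged.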

fastcgi process

  • Web Server loads the FastCGI process manager (IIS ISAPI or Apache Module) at startup
  • The FastCGI process manager initializes itself, starts multiple CGI interpreter processes (visible as multiple php-cgi) and waits for a connection from the Web Server.
  • When a client request reaches the Web Server, the FastCGI process manager selects and connects to a CGI interpreter. The Web server sends the CGI environment variables and standard input to the FastCGI subprocess php-cgi.
  • The FastCGI subprocess finishes processing and returns the standard output and error messages to the Web Server from the same connection. When the FastCGI subprocess closes the connection, the request is processed. The FastCGI subprocess then waits for and processes the next connection from the FastCGI process manager (running in the Web Server). In CGI mode, php-cgi exits at this point.
  • From the above you can see why plain CGI is usually slow: for every web request, PHP has to re-parse php.ini, reload all extensions, and re-initialize all its data structures. With FastCGI, all of this happens only once, when the process starts. An additional benefit is that persistent database connections work.
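The cost difference described in the last point can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not PHP code: `cgi_demo_init_interpreter` is a hypothetical stand-in for parsing php.ini and loading extensions, and the two functions only count how often that step runs.

```c
typedef struct cgi_demo_stats {
    int inits;    /* how many times the interpreter was initialized */
    int handled;  /* how many requests were served */
} cgi_demo_stats;

/* Hypothetical stand-in for parsing php.ini, loading extensions, etc. */
void cgi_demo_init_interpreter(cgi_demo_stats *s) { s->inits++; }
void cgi_demo_handle_request(cgi_demo_stats *s)   { s->handled++; }

/* CGI: a fresh process per request, so initialization repeats every time. */
cgi_demo_stats cgi_demo_run_cgi(int requests) {
    cgi_demo_stats s = {0, 0};
    for (int i = 0; i < requests; i++) {
        cgi_demo_init_interpreter(&s);   /* process starts */
        cgi_demo_handle_request(&s);
        /* process exits here */
    }
    return s;
}

/* FastCGI: a long-lived worker initializes once, then loops over requests. */
cgi_demo_stats cgi_demo_run_fastcgi(int requests) {
    cgi_demo_stats s = {0, 0};
    cgi_demo_init_interpreter(&s);       /* worker starts */
    for (int i = 0; i < requests; i++) {
        cgi_demo_handle_request(&s);
    }
    return s;
}
```

For 100 requests the CGI model initializes 100 times and the FastCGI model once, which is the whole argument in miniature.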

main

The main directory is the glue between the SAPI layer and the Zend layer.

The role of the main directory is to take requests from SAPI, parse out the script files and parameters to be executed, and initialize the environment and configuration, such as initializing variables and constants, registering functions, parsing configuration files, loading extensions, etc.

Zend

The Zend engine is the kernel of PHP. It compiles PHP code into opcodes and executes them, and it implements the basic data structures, memory allocation and management, and so on. It consists of two parts: the compiler and the executor.

The compiler performs lexical and syntactic analysis of the PHP code and generates an abstract syntax tree, which is then compiled into opcodes. Opcodes are the instructions understood by the Zend virtual machine; PHP 7 has roughly 173 of them (the exact count varies between minor versions), and all PHP syntax is expressed as combinations of these opcodes. The executor is responsible for executing the opcodes output by the compiler.

Extensions

ext (extension) is the way to extend the functionality of the PHP kernel. Extensions are divided into PHP extensions and Zend extensions; both support user-defined development and both are common. Typical PHP extensions are gd, json, date, and standard (which provides the array functions), while the familiar opcache is a Zend extension.

TSRM

TSRM (Thread Safe Resource Manager) is a thread-safe resource manager.

A global variable is a variable defined outside any function. It is a shared resource, and in a multi-threaded environment concurrent access to shared resources can cause conflicts. TSRM was created to solve this problem.

The main purpose of TSRM is to ensure the safety of shared resources, and PHP’s thread safety mechanism is simple and intuitive - in a multi-threaded environment, each thread is provided with a separate copy of the global variable. This is implemented by allocating (locking before allocation) an independent ID (self-incrementing) to each thread via TSRM as an index to the current thread’s global variable memory area, enabling complete independence between threads for subsequent global variable access.
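A minimal sketch of this "one copy of the globals per thread id" scheme follows. It is single-threaded on purpose and all `tsrm_demo_*` names are hypothetical; thread ids are passed explicitly, and the lock that a real implementation takes around id allocation is only marked by a comment.

```c
#include <stdlib.h>

/* Hypothetical set of "global" variables that every thread needs. */
typedef struct tsrm_demo_globals {
    int error_count;
} tsrm_demo_globals;

#define TSRM_DEMO_MAX_THREADS 8

tsrm_demo_globals *tsrm_demo_table[TSRM_DEMO_MAX_THREADS];
int tsrm_demo_next_id = 0;   /* self-incrementing thread id */

/* Give a new thread its id and its own zeroed copy of the globals.
 * A real implementation would hold a lock while doing this.
 * Returns the id, or -1 if no slot or memory is available. */
int tsrm_demo_thread_register(void) {
    if (tsrm_demo_next_id >= TSRM_DEMO_MAX_THREADS) return -1;
    tsrm_demo_globals *g = calloc(1, sizeof *g);
    if (!g) return -1;
    tsrm_demo_table[tsrm_demo_next_id] = g;
    return tsrm_demo_next_id++;
}

/* The id indexes the table, so a thread only ever reaches its own copy. */
tsrm_demo_globals *tsrm_demo_globals_for(int thread_id) {
    return tsrm_demo_table[thread_id];
}
```

After registration no locking is needed for ordinary access, because two threads with different ids never touch the same `tsrm_demo_globals`.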

Most PHP SAPIs are single-threaded, so thread safety usually needs little attention. It does matter when PHP runs inside a multi-threaded host, such as Apache with a threaded MPM or a PHP environment that users have embedded themselves.

PHP design philosophy and features

  1. Multi-process model: because PHP uses a multi-process model, requests do not interfere with each other, so one hanging request does not bring down the whole service. PHP has also long supported a multi-threaded model.
  2. Weakly typed language: unlike C/C++, Java, C#, and similar languages, PHP is weakly typed. The type of a variable is not fixed when it is first declared; it is determined at run time, and implicit or explicit type conversions may occur. This flexibility is very convenient and efficient in web development; the details are covered in the PHP variables section later.
  3. Engine (Zend) + component (ext) model, which reduces internal coupling.
  4. Middle layer (sapi): SAPI, short for Server Application Programming Interface, isolates the web server from PHP.
  5. The syntax is simple and flexible, without many rules. The downside is that coding styles end up mixed.

php execution flow & opcode

To recap the execution process: after receiving a piece of code, PHP translates the source into individual instructions (opcodes) through lexical and syntactic parsing, and the Zend virtual machine executes these instructions sequentially. Because PHP itself is implemented in C, every opcode ultimately ends up calling C functions.

The core of PHP execution is the translated instructions (opcodes), which are the basic unit of PHP program execution.

Some common opcode handlers are:

ZEND_ASSIGN_SPEC_CV_CV_HANDLER: variable assignment ($a = $b)
ZEND_DO_FCALL_BY_NAME_SPEC_HANDLER: function call
ZEND_CONCAT_SPEC_CV_CV_HANDLER: string concatenation ($a . $b)
ZEND_ADD_SPEC_CV_CONST_HANDLER: addition ($a + 2)
ZEND_IS_EQUAL_SPEC_CV_CONST: equality check ($a == 1)
ZEND_IS_IDENTICAL_SPEC_CV_CONST: identity (strict equality) check ($a === 1)
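To make "one handler per opcode" concrete, here is a toy virtual machine in C. It is a sketch under stated assumptions, not the Zend VM: the three opcodes, the `vm_demo_*` names, and the fixed array of integer slots are all invented for illustration. What it does share with the description above is the shape: an array of instructions executed in order, each dispatched to its own C handler function.

```c
typedef enum { VM_DEMO_ASSIGN_CONST, VM_DEMO_ADD, VM_DEMO_RETURN } vm_demo_opcode;

typedef struct vm_demo_op {
    vm_demo_opcode opcode;
    int result;  /* slot written by the instruction */
    int op1;     /* slot index, or the literal for ASSIGN_CONST */
    int op2;     /* slot index (unused by some opcodes) */
} vm_demo_op;

typedef struct vm_demo_state {
    long slots[8];  /* stand-in for compiled variables */
    long retval;
    int running;
} vm_demo_state;

typedef void (*vm_demo_handler)(vm_demo_state *vm, const vm_demo_op *op);

void vm_demo_assign_const(vm_demo_state *vm, const vm_demo_op *op) {
    vm->slots[op->result] = op->op1;
}
void vm_demo_add(vm_demo_state *vm, const vm_demo_op *op) {
    vm->slots[op->result] = vm->slots[op->op1] + vm->slots[op->op2];
}
void vm_demo_return(vm_demo_state *vm, const vm_demo_op *op) {
    vm->retval = vm->slots[op->op1];
    vm->running = 0;
}

/* Indexed by opcode: one C function per instruction. */
const vm_demo_handler vm_demo_handlers[] = {
    vm_demo_assign_const, vm_demo_add, vm_demo_return
};

/* Execute instructions in order; the program must end with VM_DEMO_RETURN. */
long vm_demo_execute(const vm_demo_op *ops) {
    vm_demo_state vm = { {0}, 0, 1 };
    for (const vm_demo_op *op = ops; vm.running; op++) {
        vm_demo_handlers[op->opcode](&vm, op);
    }
    return vm.retval;
}
```

A program equivalent to `$a = 40; $b = 2; return $a + $b;` would be four `vm_demo_op` entries: two assignments, one add into a third slot, and a return of that slot.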

HashTable - the core data structure

HashTable is the core data structure of Zend and is used to implement almost all common functionality in PHP. The PHP array is its typical application, and inside Zend the function symbol table, global variables, and so on are also built on it. It has the following features.

  1. supports typical key->value queries
  2. can be used as an array
  3. O(1) average complexity for adding and deleting nodes
  4. keys support mixed types: associative (string) keys and index (integer) keys can coexist
  5. values support mixed types: array("string", 2332)
  6. linear traversal support, such as foreach

The Zend hash table implements a typical hash structure and additionally provides forward and reverse traversal of arrays by attaching a doubly linked list. The structure is shown in the following figure. (The description here follows the classic PHP 5 layout; PHP 7 stores buckets in a contiguous array instead of linking them, but still preserves insertion order.)

As you can see, the hash table contains both a key->value hash structure and a doubly linked list, which makes it convenient to support both fast lookup and linear traversal.

[Figure: Zend HashTable structure]

  1. Hash structure: Zend’s hash structure is a typical hash table model that resolves collisions with linked lists. Note that it is a self-growing data structure: when the table is full, it doubles in size and repositions the elements. Zend also makes some optimizations that trade space for time to speed up key->value lookup. For example, each element stores the key length in nKeyLength so that mismatches can be ruled out quickly.

  2. Doubly linked list: the Zend hash table implements linear traversal of its elements through a linked list. In theory a singly linked list is enough for traversal; the main reason for using a doubly linked list is to delete an element quickly without having to traverse the list.

The Zend hash table is a composite structure that can be used as an array: it supports the usual associative arrays, can be used with sequential integer indexes, and even allows a mixture of the two.

PHP associative arrays: associative arrays are the typical application of the hash table. A lookup goes through the following steps. As the code shows, it is an ordinary hash lookup with a few quick checks added to speed it up.

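/* Pseudocode: look up `key` (of length nKeyLength) in the hash table. */
h = hash(key, nKeyLength);        /* compute the hash value of the key */
index = h & nTableMask;           /* map the hash to a slot */
Bucket *p = arBucket[index];
while (p) {                       /* walk the collision chain */
    /* cheap checks first (hash, length), full key comparison last */
    if ((p->h == h) && (p->nKeyLength == nKeyLength)
            && memcmp(p->arKey, key, nKeyLength) == 0) {
        return p->data;
    }
    p = p->next;
}
return FAILURE;
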
  • PHP Indexed Arrays

Indexed arrays are the ordinary arrays we access by subscript, for example $arr[0]. The Zend hash table normalizes them internally: an index key is also given a hash value (the index itself), and its nKeyLength is set to 0. The member variable nNextFreeElement holds the next free integer index (one more than the largest index used so far) and is incremented after each push. This normalization is what lets PHP mix associative and indexed keys in one array. Because of how push works, the order of index keys in a PHP array is determined by insertion order, not by the size of the subscript. For example, after $arr[2] = 3; $arr[1] = 2; a foreach visits key 2 before key 1. A key of type double is converted to an integer and treated as an index key.
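The pieces described in this section (collision chains, an insertion-order list, and a next-free index for push) fit together as in the sketch below. It is a simplified illustration, not Zend’s code: it handles integer keys only, uses a fixed number of slots with no growth, and every `ht_demo_*` name is hypothetical.

```c
#include <stdlib.h>

#define HT_DEMO_SLOTS 8   /* power of two, so `h & mask` picks a slot */

typedef struct ht_demo_bucket {
    unsigned long h;                   /* integer key */
    long value;
    struct ht_demo_bucket *next;       /* collision chain within one slot */
    struct ht_demo_bucket *list_prev;  /* global insertion-order list */
    struct ht_demo_bucket *list_next;
} ht_demo_bucket;

typedef struct ht_demo_table {
    ht_demo_bucket *slots[HT_DEMO_SLOTS];
    ht_demo_bucket *head, *tail;       /* first and last inserted element */
    unsigned long next_free;           /* index the next push will use */
} ht_demo_table;

ht_demo_bucket *ht_demo_find(const ht_demo_table *t, unsigned long h) {
    ht_demo_bucket *p = t->slots[h & (HT_DEMO_SLOTS - 1)];
    while (p && p->h != h) p = p->next;
    return p;
}

/* $arr[h] = value. Returns 0 on success, -1 if allocation fails. */
int ht_demo_set(ht_demo_table *t, unsigned long h, long value) {
    ht_demo_bucket *p = ht_demo_find(t, h);
    if (p) { p->value = value; return 0; }   /* overwrite keeps position */
    p = calloc(1, sizeof *p);
    if (!p) return -1;
    p->h = h;
    p->value = value;
    unsigned long i = h & (HT_DEMO_SLOTS - 1);
    p->next = t->slots[i];                   /* link into the hash chain */
    t->slots[i] = p;
    p->list_prev = t->tail;                  /* append to the order list */
    if (t->tail) t->tail->list_next = p; else t->head = p;
    t->tail = p;
    if (h >= t->next_free) t->next_free = h + 1;
    return 0;
}

/* $arr[] = value: use the next free index. */
int ht_demo_push(ht_demo_table *t, long value) {
    return ht_demo_set(t, t->next_free, value);
}
```

Walking from `head` through `list_next` yields elements in insertion order regardless of key size, which is the behaviour foreach relies on.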

PHP variables

PHP is a weakly typed language: it does not strictly distinguish between variable types, and no type has to be specified when a variable is declared.

PHP may convert variable types implicitly at run time. As in strongly typed languages, programs can also perform explicit type conversions.

PHP variables can be classified as simple types (int, string, bool), compound types (array, resource, object), and constants (const). All of them are represented by the same underlying structure, the zval.

Zval consists of three main parts.

  • type: specifies the type of variable stated (integer, string, array, etc.)
  • refcount&is_ref: used to implement reference counting (described later in detail)
  • value: the core part, which stores the actual data of the variable

The value part (zvalue_value in PHP 5, zend_value in PHP 7) stores the actual data of a variable. Because it has to hold several different types, it is a union, and this is how weak typing is implemented.
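The "type tag plus union" layout can be sketched as follows. This is a reduced illustration with hypothetical `zv_demo_*` names; the real zval has more types and fields. The point is that one structure holds any kind of value, and code inspects the tag to decide how to read the union.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { ZV_DEMO_LONG, ZV_DEMO_DOUBLE, ZV_DEMO_STRING } zv_demo_type;

/* Only one member is meaningful at a time; the tag says which. */
typedef union zv_demo_value {
    long lval;
    double dval;
    struct { const char *val; size_t len; } str;
} zv_demo_value;

typedef struct zv_demo_zval {
    zv_demo_value value;
    zv_demo_type type;
} zv_demo_zval;

zv_demo_zval zv_demo_from_long(long n) {
    zv_demo_zval z;
    z.type = ZV_DEMO_LONG;
    z.value.lval = n;
    return z;
}

zv_demo_zval zv_demo_from_double(double d) {
    zv_demo_zval z;
    z.type = ZV_DEMO_DOUBLE;
    z.value.dval = d;
    return z;
}

/* `s` must be NUL-terminated and outlive the returned value. */
zv_demo_zval zv_demo_from_string(const char *s) {
    zv_demo_zval z;
    z.type = ZV_DEMO_STRING;
    z.value.str.val = s;
    z.value.str.len = strlen(s);
    return z;
}

/* Implicit conversion to a number, as a weakly typed `+` would need. */
double zv_demo_to_number(const zv_demo_zval *z) {
    switch (z->type) {
    case ZV_DEMO_LONG:   return (double)z->value.lval;
    case ZV_DEMO_DOUBLE: return z->value.dval;
    case ZV_DEMO_STRING: return strtod(z->value.str.val, NULL);
    }
    return 0.0;
}
```

With this, adding the integer 40 and the string "2" means calling `zv_demo_to_number` on both operands and adding the results, which loosely mirrors what happens for `40 + "2"` in PHP.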

The correspondence between the php variable type and its actual storage is as follows.

[Figure: mapping from PHP variable types to their storage in the zval]

  1. Reference counting is widely used in memory reclamation, string manipulation, and so on, and PHP variables are a typical application of it. The zval’s reference counting is implemented by the members is_ref and ref_count, which let multiple variables share a single copy of the data and avoid frequent copying. On assignment, Zend points the variable at the same zval and increments ref_count; on unset it decrements ref_count. Only when ref_count drops to 0 is the value actually destroyed. For a reference assignment, Zend sets is_ref to 1. (In PHP 7 the counter lives in the refcounted value, such as a zend_string or zend_array, rather than in the zval itself, and scalars are not reference counted.)

  2. PHP variables share data through reference counting, so what happens when one of the variables is changed? When a write is attempted, if Zend finds that the zval the variable points to is shared by multiple variables, it makes a copy of the zval with a ref_count of 1 and decrements the ref_count of the original. This process is called "zval separation". Because Zend only copies when a write occurs, the technique is also called copy-on-write.

  3. Integers and floating point numbers are basic, simple types in PHP, and their values are stored directly in the value union as long and double. Unlike strongly typed languages such as C, PHP does not distinguish between int, unsigned int, long, and so on. There is only one integer type, long, so the range of integers in PHP depends on the platform and compiler rather than being fixed.

  4. Floating point numbers are similar: PHP does not distinguish between float and double and uses double throughout. What if an integer goes out of range? It is automatically converted to double, so be careful, because many subtle tricks and bugs come from this.

  5. Like integers, strings are basic, simple types in PHP. In the value union a string consists of a pointer to the actual data and a length, which is closer to string in C++ than to C strings. Because the length is stored explicitly, PHP strings can hold binary data (including \0), and getting the length with strlen is an O(1) operation. When a string is modified or appended to, PHP reallocates memory to produce the new string. Finally, for safety, PHP still appends a \0 to the end of every string it creates.
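Items 1 and 2 (shared values, reference counts, and separation on write) can be sketched in C as below. The layout follows the description above, with the counter stored next to the data; all `rc_demo_*` names are hypothetical, and only an integer payload is modelled.

```c
#include <stdlib.h>

typedef struct rc_demo_val {
    long data;
    int refcount;   /* how many variables point at this value */
} rc_demo_val;

typedef struct rc_demo_var {
    rc_demo_val *val;
} rc_demo_var;

/* $v = <literal>: a fresh value with refcount 1. Returns -1 if out of memory. */
int rc_demo_init(rc_demo_var *v, long data) {
    v->val = malloc(sizeof *v->val);
    if (!v->val) return -1;
    v->val->data = data;
    v->val->refcount = 1;
    return 0;
}

/* $dst = $src, with dst not yet holding a value: share, do not copy. */
void rc_demo_assign(rc_demo_var *dst, const rc_demo_var *src) {
    dst->val = src->val;
    dst->val->refcount++;
}

/* unset($v): drop one reference, destroy the value when none remain. */
void rc_demo_unset(rc_demo_var *v) {
    if (v->val && --v->val->refcount == 0) free(v->val);
    v->val = NULL;
}

/* Write through $v: separate first if the value is shared (copy-on-write). */
int rc_demo_write(rc_demo_var *v, long data) {
    if (v->val->refcount > 1) {
        rc_demo_val *copy = malloc(sizeof *copy);
        if (!copy) return -1;
        copy->refcount = 1;
        v->val->refcount--;   /* the original loses this variable */
        v->val = copy;
    }
    v->val->data = data;
    return 0;
}
```

After `rc_demo_assign(&b, &a)` both variables point at one value with a count of 2; the first `rc_demo_write` through either of them is the only moment a copy is made.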

Next, a comparison of common string concatenation methods and their speed. Suppose we have the following four variables.

$strA = '123';
$strB = '456';

$intA = 123;
$intB = 456;

Several ways of concatenating them are compared and explained below.

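// Here Zend mallocs a new block of memory and builds the string in it; speed is average.
$res = "{$strA}{$strB}";

// This is the fastest: Zend reallocs directly on top of the current $strA, avoiding a repeated copy.
$res = $strA . $strB;

// This is slower, because implicit type conversion is needed.
$res = $intA . $intB;

// This is the slowest: sprintf is not a language construct in PHP, so recognizing and processing the format takes extra time, and it also mallocs memory. It is the most readable form, though, so choose according to the situation.
$res = sprintf("%s%s", $strA, $strB);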
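The malloc versus realloc distinction in those comments can be illustrated at the C level. The sketch below uses a hypothetical length-prefixed string type (`str_demo`) and is not Zend’s string code; it only shows why appending in place can avoid re-copying the left operand, since realloc copies the old bytes only when it has to move the block.

```c
#include <stdlib.h>
#include <string.h>

typedef struct str_demo {
    char *val;    /* heap buffer, always NUL-terminated */
    size_t len;   /* stored length, so strlen is O(1) */
} str_demo;

/* Make a heap copy of a C string; val is NULL on allocation failure. */
str_demo str_demo_from(const char *s) {
    str_demo r = { NULL, 0 };
    size_t n = strlen(s);
    r.val = malloc(n + 1);
    if (r.val) { memcpy(r.val, s, n + 1); r.len = n; }
    return r;
}

/* New buffer, both operands copied into it. */
str_demo str_demo_concat_new(const str_demo *a, const str_demo *b) {
    str_demo r = { NULL, 0 };
    r.val = malloc(a->len + b->len + 1);
    if (!r.val) return r;
    memcpy(r.val, a->val, a->len);
    memcpy(r.val + a->len, b->val, b->len);
    r.len = a->len + b->len;
    r.val[r.len] = '\0';
    return r;
}

/* Grow a in place and append b. Returns 0 on success, -1 on failure
 * (in which case a is left unchanged). */
int str_demo_append(str_demo *a, const str_demo *b) {
    char *p = realloc(a->val, a->len + b->len + 1);
    if (!p) return -1;
    memcpy(p + a->len, b->val, b->len);
    a->val = p;
    a->len += b->len;
    a->val[a->len] = '\0';
    return 0;
}
```

Both functions use memcpy with explicit lengths rather than strcat, so embedded \0 bytes would be preserved, matching the binary-safe strings described earlier.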

PHP arrays are implemented directly on top of the Zend hash table. How, then, is foreach implemented?

  • For arrays, foreach works by traversing the doubly linked list in the hash table. For indexed arrays, foreach is more efficient than a for loop because it avoids a key->value lookup on each step. count() simply reads HashTable->nNumOfElements, which is O(1). Numeric string keys such as '123' are converted to integers by Zend, so $arr[123] and $arr['123'] are equivalent.

  • Resource types are the most complex variables in PHP and are also a compound structure. PHP’s zval can represent a wide range of data types, but it cannot adequately describe custom data types. Since there is no efficient way to depict these composite structures, there is also no way to use traditional operators on them. To solve this problem, it is enough to refer to a pointer through an essentially arbitrary identifier (label); this is what is called a resource.

In the zval, a resource uses lval to identify where the resource is located. A resource can be any composite structure; the familiar mysqli, fsock, memcached, and so on are all resources.

How to use resources

  • Registration: a custom data type that is to be used as a resource must be registered first, and Zend assigns it a globally unique label.
  • Getting a resource variable: Zend maintains a hash table of id->actual data for resources. Only the id is recorded in the zval; fetch looks up the actual value in the hash table by id and returns it.
  • Resource destruction: resources come in many different data types, and Zend itself has no way of knowing how to destroy them, so a destructor function must be supplied when the resource type is registered.
  • When a resource is unset, Zend calls the appropriate function to destroy it and also removes it from the global resource table.
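The register, fetch, and destroy steps above can be sketched with a small id->pointer table. This is an illustration under simplifying assumptions: a fixed-size array stands in for the hash table, the destructor is attached per resource rather than per registered type, and all `res_demo_*` names are hypothetical.

```c
#include <stddef.h>

#define RES_DEMO_MAX 16

typedef void (*res_demo_dtor)(void *ptr);

typedef struct res_demo_entry {
    void *ptr;            /* the actual data behind the resource */
    res_demo_dtor dtor;   /* how to destroy it */
} res_demo_entry;

res_demo_entry res_demo_list[RES_DEMO_MAX];   /* id -> actual data */
int res_demo_next_id = 1;                     /* 0 means "no resource" */

/* Register data; the returned id is all a variable needs to store. */
int res_demo_register(void *ptr, res_demo_dtor dtor) {
    if (res_demo_next_id >= RES_DEMO_MAX) return 0;
    res_demo_list[res_demo_next_id].ptr = ptr;
    res_demo_list[res_demo_next_id].dtor = dtor;
    return res_demo_next_id++;
}

/* Fetch: resolve the id back to the data, or NULL if it is gone. */
void *res_demo_fetch(int id) {
    if (id <= 0 || id >= RES_DEMO_MAX) return NULL;
    return res_demo_list[id].ptr;
}

/* Destroy: run the supplied destructor and clear the table entry. */
void res_demo_destroy(int id) {
    if (id <= 0 || id >= RES_DEMO_MAX || !res_demo_list[id].ptr) return;
    if (res_demo_list[id].dtor) res_demo_list[id].dtor(res_demo_list[id].ptr);
    res_demo_list[id].ptr = NULL;
    res_demo_list[id].dtor = NULL;
}

/* Sample destructor for the demo: it only counts how often it was called. */
int res_demo_destroyed = 0;
void res_demo_count_dtor(void *ptr) { (void)ptr; res_demo_destroyed++; }
```

A real destructor would close a socket or free a handle; the table itself never needs to know which, and that is the reason the destructor has to be supplied at registration time.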

A resource can persist for a long time: not only after all the variables that reference it have gone out of scope, but even after a request has ended and a new one has begun. Such resources are called persistent resources because they live through the entire lifecycle of the SAPI unless they are intentionally destroyed. In many cases persistent resources improve performance to some extent. A common example is mysql_pconnect. Persistent resources are allocated via pemalloc so that they are not freed at the end of the request. As far as Zend is concerned, there is no inherent difference between the two kinds of resource.

How local and global variables are implemented in PHP

  • For a request, at any given moment PHP can see two symbol tables, symbol_table and active_symbol_table. The former maintains the global variables; the latter is a pointer to the symbol table that is currently active. When the program enters a function, Zend allocates a new symbol table for it and points active_symbol_table at that table. This is how global and local variables are kept apart.
  • Getting variable values: PHP’s symbol tables are implemented with the hash table. Each variable has a unique identifier, and the corresponding zval is looked up in the table when the variable is fetched.
  • Using global variables in functions: inside a function we can use a global variable by explicitly declaring it with global. This creates, in active_symbol_table, a reference to the variable of the same name in symbol_table; if symbol_table has no variable of that name, it is created there first.
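The relationship between the two tables and the global keyword can be sketched as below. This is a simplified model with hypothetical `sym_demo_*` names: tables are small arrays searched linearly instead of hash tables, values are plain long slots, and a missing global is reported as an error instead of being created.

```c
#include <string.h>

#define SYM_DEMO_MAX 8

typedef struct sym_demo_entry {
    const char *name;
    long *slot;            /* where the variable's value lives */
} sym_demo_entry;

typedef struct sym_demo_table {
    sym_demo_entry entries[SYM_DEMO_MAX];
    int count;
} sym_demo_table;

/* Look a name up in one table only; NULL if it is not visible there. */
long *sym_demo_lookup(const sym_demo_table *t, const char *name) {
    for (int i = 0; i < t->count; i++) {
        if (strcmp(t->entries[i].name, name) == 0) return t->entries[i].slot;
    }
    return NULL;
}

/* Bind a name to a value slot. Returns 0 on success, -1 if the table is full. */
int sym_demo_bind(sym_demo_table *t, const char *name, long *slot) {
    if (t->count >= SYM_DEMO_MAX) return -1;
    t->entries[t->count].name = name;
    t->entries[t->count].slot = slot;
    t->count++;
    return 0;
}

/* `global $name;` inside a function: make the name in the active (local)
 * table refer to the very same slot as the global of that name. */
int sym_demo_import_global(const sym_demo_table *globals,
                           sym_demo_table *active, const char *name) {
    long *slot = sym_demo_lookup(globals, name);
    if (!slot) return -1;   /* the engine would create the global first */
    return sym_demo_bind(active, name, slot);
}
```

Because the local entry points at the same slot, a write through the local name is immediately visible through the global one; without the import, a lookup in the active table simply does not find the name.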