C history

Even if we count from K&R C in 1978, C will be 44 years old in 2022.

I don’t know what C looks like in the reader’s mind, but my impression was that C has poor expressiveness, only two high-level data structures (arrays and pointers), a small standard library with many historical problems, and no package management mechanism. Worst of all, memory has to be managed manually. It is hard for younger programmers today to get interested in C: there are so many high-level languages to choose from, such as Java or the rising stars Go and Rust, so why choose C?

Background introduction

To get a feel for developing in C, I first read the book 21st Century C. The book is short and can be finished quickly, but it is also fairly shallow: it helps if you have never seen the C toolchain before, but it is of limited help in actually writing C code. So I went back to C Programming Language (2nd Edition - New Edition). I have to say that, even after all these years, it is still the best textbook for learning C. The content is concise, the examples are typical, and there is not a wasted word in the whole book. It is also fairly short, so you can read it in a week if you skip the exercises. If the book has a drawback, it is that it declares variables at the beginning of each function, which is no longer required now that C compilers are far more capable than they were.

With K&R C as a foundation, you can move straight on to real-world code. I recently developed a 2K-line project in C, jiacai2050/oh-my-github. It is not too trivial, and my main goal was to experience the following through it.

  • The C development process and the relevant toolchain
  • C99/C11 language features
  • C coding style essentials, and how to design APIs that keep users out of pitfalls

The following sections summarize what I have learned in the past few months, focusing on these three points. Since my time with C has been limited, there are inevitably shortcomings in the text, and readers are welcome to point them out.

Toolchain

Let’s talk about the toolchain first. Before starting a real project there are some rather tedious things to do, such as configuring the development and debugging environment and installing dependencies. The language server I use is clangd, which supports go-to-definition, find-references, auto-completion, and so on. clangd is configured through a compile_commands.json file, which some build tools can generate directly. For simple personal projects you can also use compile_flags.txt, for example:

-I
/opt/homebrew/Cellar/jansson/2.14/include
-I
/opt/homebrew/Cellar/pcre2/10.40/include

Veteran C programmers may be more familiar with universal-ctags/ctags, which is worth trying when the language server is not up to the task.

Package management

After the development tools are configured, it is often necessary to install dependencies before formally writing code, which brings us to an important topic: package management.

The most important job of package management is to pin the dependencies of each project, i.e. to make builds reproducible. This is not simple: when a project depends, directly or indirectly, on several versions of the same library, the right version has to be chosen. npm2’s approach of downloading each library’s dependencies separately is one solution, and Go’s minimal version selection is another.

npm2

In NPM, there may be multiple versions of the same library.

[Figure: the structure of NPM dependencies on disk.]

go mod

In Go, within the same major version, the highest version that any module in the build explicitly requires is selected, rather than the newest release available.

Before describing how C does package management, let’s review how C code is organized. A C project has two main types of files.

  • .h header files, which are mainly used for declarations, including function signatures, types, etc.
  • .c source files, which are mainly used to provide implementation of the declarations in the header files

These two types of files are used at different stages of the build, as the following figure shows (source).

[Figure: C Build, Execute Process]

As you can see, the header files will only be used in the second phase (i.e. preprocessing), the source files will only be used in the third phase (i.e. compilation), and the fourth and fifth phases will link the user’s own code together with the code of third-party dependencies to form the final executable.

When a project is distributed as a library, the source code is generally not provided directly. Instead, the compiled form is provided: a .so shared library, or a .a static library (created with the ar archiver).

This protects the source code from leaking on the one hand and speeds up the build on the other: users only need to compile their own code, and third-party dependencies do not need to be compiled again and again.

For C, there is no strict package layout; it is enough that the corresponding files can be found during compilation and linking. The community has package managers such as vcpkg and conan-io/conan, and build systems such as CMake. For personal projects, you can also take the following approach: install the dependencies with the tools provided by the operating system (brew, apt, etc.) and then write a Makefile by hand to configure the compilation and linking parameters.

This approach may seem rudimentary, but it solves the problem effectively, and combined with container technology it also handles version isolation well. The only drawback is that the exact version of a dependency cannot be pinned.

Makefile

The following is an introduction to the basic use of Makefile, a fully functional example can be found at: Makefile.

P = program_name
OBJECTS = main.o
CFLAGS = -g -Wall -O3
LDFLAGS =
LDLIBS =
CC = gcc

# Link step: libraries (LDLIBS) go after the object files that use them.
$(P): $(OBJECTS)
	$(CC) $(LDFLAGS) $(OBJECTS) $(LDLIBS) -o $(P)

This is a fairly basic Makefile template. make has built-in implicit rules for turning .c files into .o files; the rule is roughly the following.

%.o: %.c
    $(CC) $(CFLAGS) -c $< -o $@

Therefore, compile-time parameters can be defined via CFLAGS. make also provides several commonly used automatic variables.

  • $@ for the target name
  • $* for the stem that % matched in a pattern rule (main when main.o is built by %.o: %.c)
  • $< for the first prerequisite, i.e., the first item after the colon ($^ gives all of them)

You can use pkg-config to avoid configuring CFLAGS by hand. For example, if the dependency libcurl is installed, its compilation and linking parameters can be found as follows.

# pkg-config --cflags --libs libcurl
-I/usr/include/aarch64-linux-gnu -lcurl

  • -I sets a search directory for header files
  • -l names a library for the linker to link against

Generally C libraries are distributed with the corresponding header files and compiled shared or static libraries. On Debian systems, you can find the files installed by libcurl4-openssl-dev with the following command.

dpkg -L libcurl4-openssl-dev

/usr/include/aarch64-linux-gnu/curl/curl.h
/usr/include/aarch64-linux-gnu/curl/easy.h
/usr/lib/aarch64-linux-gnu/libcurl.a
/usr/lib/aarch64-linux-gnu/libcurl.so

As for whether the linker chooses static or shared libraries, each linker does it differently, so refer to the documentation for the corresponding platform.

With GNU ld, a static library can be selected explicitly with -l:, for example: -l:libXYZ.a links only against libXYZ.a, while -lXYZ searches for libXYZ.so first and then libXYZ.a. Reference.

Language features

Interpreting pointer declarations

Pointers, the most important type in C, often cause beginners a lot of trouble, not only in how they are used but also in how pointer declarations are read. For example:

int *ptr;
int *ptr2[10];

ptr is easy to understand: it is a pointer to int. But what about ptr2? Is it a pointer to an array, or an array whose elements are pointers?

In fact, the K&R C book answers this question:

The syntax of the declaration for a variable mimics the syntax of expressions in which the variable might appear.

That is, a declaration is written the same way the variable is used in an expression, and the type on the left is the type of that expression. This is a bit abstract, so let’s look at a few examples.

int *ptr;

*ptr is an expression of type int, so ptr must be a pointer to int.

int arr[100];

arr[i] is an expression of type int, so arr must be an array and the elements of the array are int.

int *arr[100];

*arr[i] is an expression of type int, so arr[i] must be a pointer, so arr must be an array with elements that are pointers to int.

int (*ptr)[100];

(*ptr)[i] is an expression of type int, so *ptr must be an array of int, and ptr must be a pointer to an array of 100 ints.

int *comp();

*comp() is an expression of type int, so comp() must return an int pointer, so comp is a function that returns a pointer to an int.

int (*comp)();

(*comp)() is an expression of type int, so *comp must be a function, so comp is a function pointer.

If the explanation above does not click yet, that is fine; it becomes clearer as you write code. For complex declarations, using typedef is generally recommended. For example:

typedef int *int_ptr;
typedef int_ptr array_of_ten[10];


array_of_ten a1 = ...;

The a1 defined this way is much easier to read: it is an array whose elements are pointers to int. K&R C has a program that converts complex declarations into textual descriptions: K&R - Recursive descent parser.
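The readings above can be checked mechanically. The following is a minimal sketch rather than anything from the project: the names check_declarations, forty_two, row and row_ptr are made up for illustration. Each declaration is used in the same form in which it was declared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, used only as a target for the function pointer. */
static int forty_two(void) { return 42; }

/* Returns 1 when every declaration behaves the way its expression form reads. */
int check_declarations(void) {
    int value = 7;
    int row[100] = {0};

    int *ptr = &value;              /* *ptr is an int */
    int *arr[100] = {NULL};         /* *arr[i] is an int: array of pointers */
    int (*row_ptr)[100] = &row;     /* (*row_ptr)[i] is an int: pointer to an array */
    int (*comp)(void) = forty_two;  /* (*comp)() is an int: function pointer */

    arr[3] = &value;
    (*row_ptr)[5] = 9;              /* writes row[5] through the pointer */

    /* arr holds 100 pointers, while *row_ptr is one array of 100 ints. */
    assert(sizeof arr == 100 * sizeof(int *));
    assert(sizeof *row_ptr == 100 * sizeof(int));

    return *ptr == 7 && *arr[3] == 7 && row[5] == 9 && (*comp)() == 42;
}
```

The two sizeof checks are the quickest way to tell an array of pointers from a pointer to an array when a declaration is hard to read.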

Memory management

C, being a system-level language, has no runtime to manage memory, and memory obtained through functions like malloc must be freed manually by the programmer. Modern high-level languages try to avoid this, because programmers are far less reliable than computers. Fortunately, C has evolved too: the GNU99 extension (also supported by clang) provides a cleanup attribute that simplifies releasing memory.

Let’s look at how memory was released before cleanup was used (full code).

static char *request(const char *url) {
    CURL *curl = NULL;
    CURLcode status;
    struct curl_slist *headers = NULL;
    char *data = NULL;

    curl_global_init(CURL_GLOBAL_ALL);
    curl = curl_easy_init();
    if (!curl)
        goto error;

    data = malloc(BUFFER_SIZE);
    if (!data)
        goto error;

    ...... // intermediate logic omitted
error:
    if (data)
        free(data);
    if (curl)
        curl_easy_cleanup(curl);
    if (headers)
        curl_slist_free_all(headers);
    curl_global_cleanup();
    return NULL;
}

The request function shows a fairly common practice in C: when an error occurs, the function uses goto to jump to a single place where all resources are cleaned up. Personally, I find this approach rather ugly, and it makes it easy to forget to release a variable. Here is how it looks with cleanup.

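#include <stdio.h>
#include <stdlib.h>

// Cleanup handler: receives a pointer to the variable that goes out of scope.
void free_char(char **str) {
  if (*str) {
    printf("free %s\n", *str);
    free(*str);
  }
}

int main(void) {
  char *str __attribute__((cleanup(free_char))) = malloc(10);
  if (!str)
    return 1;
  snprintf(str, 10, "hello");
  printf("%s\n", str);
  return 0;
}

// Output, in order:
// hello
// free hello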

As you can see, free_char is called on str to release the resource before main returns. For ease of use, the following macro can be defined with #define.

#define auto_char_t char* __attribute((cleanup(free_char)))

// Usage:
auto_char_t str = malloc(10);
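The main benefit of cleanup shows up in functions with several exit points. The sketch below assumes GCC or clang; the names free_buf, auto_buf, copy_len and cleanup_call_count are hypothetical, and the counter exists only to make the handler calls observable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Counts handler invocations; present only to make the effect visible. */
static int cleanup_calls = 0;

static void free_buf(char **p) {
    free(*p);            /* free(NULL) is a no-op, so no NULL check is needed */
    cleanup_calls++;
}

#define auto_buf __attribute__((cleanup(free_buf)))

/* Returns the length of s when it fits into cap bytes, -1 otherwise. */
int copy_len(const char *s, size_t cap) {
    assert(s != NULL);
    auto_buf char *buf = malloc(cap);
    if (!buf)
        return -1;                /* the handler runs on this path */
    if (strlen(s) >= cap)
        return -1;                /* and on this one */
    strcpy(buf, s);
    return (int)strlen(buf);      /* and here, after the value is computed */
}

int cleanup_call_count(void) { return cleanup_calls; }
```

Calling copy_len once with a string that fits and once with one that does not leaves cleanup_call_count() at 2, without a single explicit free at the return sites.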

One thing needs to be clear: cleanup can only be applied to local variables, not to function parameters or return values. A complete C project therefore needs other means of ensuring memory safety as well. The main tools are ASAN and valgrind, and the reader can choose between them according to the situation. Here is an example of using valgrind.

valgrind --tool=memcheck --leak-check=full --show-leak-kinds=all ./$(CLI) /tmp/test.db

In the case of a memory leak, something along the following lines will be reported.

==609== HEAP SUMMARY:
==609==     in use at exit: 276 bytes in 37 blocks
==609==   total heap usage: 35,019 allocs, 34,982 frees, 279,075,706 bytes allocated
==609==
==609== 48 bytes in 6 blocks are still reachable in loss record 1 of 3
==609==    at 0x4849E4C: malloc (vg_replace_malloc.c:307)
==609==    by 0x57DB83F: ??? (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x57DCE2F: ??? (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x5843C5B: ??? (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x57DB73F: ??? (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x57DC8DB: ??? (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x57D8443: gcry_control (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x4CE9BC3: libssh2_init (in /usr/lib/aarch64-linux-gnu/libssh2.so.1.0.1)
==609==    by 0x48D58CF: ??? (in /usr/lib/aarch64-linux-gnu/libcurl.so.4.7.0)
==609==    by 0x48831FB: curl_global_init (in /usr/lib/aarch64-linux-gnu/libcurl.so.4.7.0)
==609==    by 0x10A0AB: omg_setup_context (omg.c:184)
==609==    by 0x10DD0B: main (cli.c:19)

==609== 84 bytes in 25 blocks are definitely lost in loss record 2 of 3
==609==    at 0x4849E4C: malloc (vg_replace_malloc.c:307)
==609==    by 0x10D86B: omg_parse_trending (omg.c:1116)
==609==    by 0x10DBEF: omg_query_trending (omg.c:1176)
==609==    by 0x10DD6B: main (cli.c:38)
==609==
==609== 144 bytes in 6 blocks are still reachable in loss record 3 of 3
==609==    at 0x4849E4C: malloc (vg_replace_malloc.c:307)
==609==    by 0x57DB83F: ??? (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x57DCE2F: ??? (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x5843C4F: ??? (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x57DB73F: ??? (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x57DC8DB: ??? (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x57D8443: gcry_control (in /usr/lib/aarch64-linux-gnu/libgcrypt.so.20.2.8)
==609==    by 0x4CE9BC3: libssh2_init (in /usr/lib/aarch64-linux-gnu/libssh2.so.1.0.1)
==609==    by 0x48D58CF: ??? (in /usr/lib/aarch64-linux-gnu/libcurl.so.4.7.0)
==609==    by 0x48831FB: curl_global_init (in /usr/lib/aarch64-linux-gnu/libcurl.so.4.7.0)
==609==    by 0x10A0AB: omg_setup_context (omg.c:184)
==609==    by 0x10DD0B: main (cli.c:19)
==609==
==609== LEAK SUMMARY:
==609==    definitely lost: 84 bytes in 25 blocks
==609==    indirectly lost: 0 bytes in 0 blocks
==609==      possibly lost: 0 bytes in 0 blocks
==609==    still reachable: 192 bytes in 12 blocks
==609==         suppressed: 0 bytes in 0 blocks

You can see the leaks very clearly; then just follow the stack traces and fix the corresponding logic. The report after the fix:

==481== LEAK SUMMARY:
==481==    definitely lost: 0 bytes in 0 blocks
==481==    indirectly lost: 0 bytes in 0 blocks
==481==      possibly lost: 0 bytes in 0 blocks
==481==    still reachable: 192 bytes in 12 blocks
==481==         suppressed: 0 bytes in 0 blocks

Besides using tools to catch memory problems, a more elegant way is to design the API so that it allocates as little as possible and makes ownership boundaries clear. This is discussed later under API design and is not repeated here. The following links have more discussion on cleanup.

Strings

There is no string type in C. The language only specifies that a character array whose last element is the null character '\0' can be used as a string; such strings are called null-terminated byte strings. Since the length is not stored, many operations take O(n) time. A famous recent example is GTA Online, where working around sscanf calls that re-measured the whole input on every use was part of a fix that reportedly cut loading times by about 70%.

Moreover, C has no mutable string type like StringBuilder, so memory has to be requested manually for operations such as replace or split. This is not only awkward but, more importantly, prone to memory leaks and security problems. Here are two ways I recommend to simplify string handling in C.

  • Try to use fixed-size local variables

    When doing string operations you often do not need to allocate memory dynamically; a fixed-size buffer on the stack is enough, which also saves you the trouble of calling free. For example:

    char url[128];
    snprintf(url, sizeof(url), "%s/user/repos?type=all&per_page=%zu&page=%zu&sort=created",
             API_ROOT, PER_PAGE, page_num);
    
  • Try to use mature string libraries from the community instead of string.h

    such as Simple Dynamic Strings by the author of Redis; for more, see: Good C string library - Stack Overflow

  • Similarly, there are hash table implementations for C.
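To show what such string libraries do under the hood, here is a minimal sketch of a growable string that stores its length. It is not the API of Simple Dynamic Strings or of any other library; strbuf_t, strbuf_append and strbuf_free are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string: the length is stored, so appending
   never has to rescan the existing contents. */
typedef struct {
    char  *data;  /* NUL-terminated once non-NULL */
    size_t len;   /* bytes used, excluding the terminator */
    size_t cap;   /* bytes allocated */
} strbuf_t;

/* Appends s; returns 0 on success, -1 if memory could not be allocated. */
int strbuf_append(strbuf_t *sb, const char *s) {
    assert(sb != NULL && s != NULL);
    size_t n = strlen(s);
    if (sb->len + n + 1 > sb->cap) {
        size_t new_cap = sb->cap ? sb->cap : 16;
        while (new_cap < sb->len + n + 1)
            new_cap *= 2;              /* grow geometrically */
        char *p = realloc(sb->data, new_cap);
        if (!p)
            return -1;                 /* the old buffer is still valid */
        sb->data = p;
        sb->cap = new_cap;
    }
    memcpy(sb->data + sb->len, s, n + 1);  /* copies the terminator too */
    sb->len += n;
    return 0;
}

void strbuf_free(strbuf_t *sb) {
    free(sb->data);
    sb->data = NULL;
    sb->len = sb->cap = 0;
}
```

A zero-initialized strbuf_t is a valid empty string, so usage is `strbuf_t sb = {0};` followed by any number of strbuf_append calls and one strbuf_free.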

Raw String

Real-world code inevitably has to deal with more complex strings, and other programming languages provide raw strings to simplify this. The current C standard does not support them, but the GNU99 extension provides this functionality:

printf(R"(hello\nworld\n)");

In addition to this use of the GNU99 extension, the xxd command can also be used to embed the contents of a file into C code in the following manner.

echo hello > hello.txt
xxd -i hello.txt

# The -i option prints a variable definition in C include-file format
unsigned char hello_txt[] = {
  0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a
};
unsigned int hello_txt_len = 6;

Note that hello_txt does not end with a null byte, which is inconvenient in some cases. One can be appended with the following command.

# Inside a Makefile recipe, write $$ instead of $ in the sed expression.
xxd -i hello.txt | tac | sed '3s/$/, 0x00/' | tac > hello.h
cat hello.h
# Output
unsigned char hello_txt[] = {
  0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a, 0x00
};
unsigned int hello_txt_len = 6;
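Once the terminator is in place, the embedded array can be handed to ordinary string functions. The sketch below pastes in the array by hand instead of including hello.h, and assumes the extra 0x00 byte has been appended; embedded_length is a hypothetical helper.

```c
#include <assert.h>
#include <string.h>

/* Contents as xxd -i would generate them for hello.txt, with 0x00 appended. */
unsigned char hello_txt[] = {
  0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a, 0x00
};
unsigned int hello_txt_len = 6;

/* Returns the string length of the embedded data. */
size_t embedded_length(void) {
    /* xxd computed hello_txt_len before the terminator was added,
       so the array is exactly one byte longer than it says. */
    assert(sizeof hello_txt == hello_txt_len + 1);
    return strlen((const char *)hello_txt);
}
```

Note that hello_txt_len still describes the file contents only, which is usually what callers want.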

Designated Initializers

It is no exaggeration to say that designated initializers are the most exciting feature of ISO C99, making C feel more like a modern language and more useful at the same time.

int a[6] = { [4] = 29, [2] = 15 };
// equivalent to
int a[6] = { 0, 0, 15, 0, 29, 0 };


// nested struct
struct a {
  struct b {
    int c;
    int d;
  } e;
  float f;
} g = {.e.c = 3 };

With this form of initialization, fields that are not mentioned are automatically initialized to zero. This matters a great deal for pointers, which end up as NULL rather than pointing at an arbitrary address.
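The zero-initialization rule can be verified directly. In this sketch request_opts_t and omitted_members_are_zero are made-up names, and the URL is only a placeholder value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical options struct, made up for this example. */
typedef struct {
    const char *url;
    const char *proxy;      /* optional */
    int         timeout_s;
    int         retries;
} request_opts_t;

/* Returns 1 if every member left out of the initializers is zero or NULL. */
int omitted_members_are_zero(void) {
    request_opts_t opts = { .url = "https://example.com", .timeout_s = 30 };
    int a[6] = { [4] = 29, [2] = 15 };

    /* The members that were named keep the values given to them. */
    assert(opts.timeout_s == 30 && a[2] == 15 && a[4] == 29);

    return opts.proxy == NULL && opts.retries == 0 &&
           a[0] == 0 && a[1] == 0 && a[3] == 0 && a[5] == 0;
}
```

This is what makes option structs practical in C: callers name only the fields they care about, and new fields can be added later with zero as their default.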

static_assert

This is a feature added to C11 to check the truth of an assertion at compile time.

#include <assert.h>
int main(void)
{
  static_assert(2 + 2 == 4, "2+2 isn't 4");      // well-formed
  static_assert(sizeof(int) < sizeof(char),
                "this program requires that int is less than char"); // compile-time error
}
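A more practical use of static_assert is to pin down assumptions about data layout. The following sketch uses a hypothetical packet_header_t; the expected size of 8 bytes holds on mainstream ABIs but is an assumption, which is exactly what the assertion is there to check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire-format header, made up for this example. */
typedef struct {
    uint8_t  kind;
    uint8_t  flags;
    uint16_t length;
    uint32_t checksum;
} packet_header_t;

/* The build fails if the compiler lays the struct out differently. */
static_assert(sizeof(packet_header_t) == 8,
              "packet_header_t must be exactly 8 bytes");
static_assert(offsetof(packet_header_t, checksum) == 4,
              "checksum must start at byte 4");

size_t header_size(void) { return sizeof(packet_header_t); }
```

Unlike a runtime assert, these checks cost nothing in the compiled program and cannot be skipped by an untested code path.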

Generic selection

Generic selection is a C11 feature that supports generic programming to a certain extent.

#include <stdio.h>
#include <math.h>

// Possible implementation of the tgmath.h macro cbrt
#define cbrt(X) _Generic((X),                   \
                         long double: cbrtl,    \
                         default: cbrt,         \
                         float: cbrtf           \
                         )(X)

int main(void)
{
  double x = 8.0;
  const float y = 3.375;
  printf("cbrt(8.0) = %f\n", cbrt(x)); // selects the default cbrt
  printf("cbrtf(3.375) = %f\n", cbrt(y)); // converts const float to float,
  // then selects cbrtf
}
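_Generic is not limited to picking functions; any expression can be selected by type. The sketch below defines a hypothetical type_name macro that maps an expression to a description of its type.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical macro: picks a string based on the type of X.
   X itself is never evaluated. */
#define type_name(X) _Generic((X),          \
        int: "int",                         \
        double: "double",                   \
        const char *: "const char *",       \
        default: "something else")

/* Returns 1 if every selection matches the static type of its argument. */
int type_name_demo(void) {
    int i = 0;
    double d = 0.0;
    float f = 0.0f;
    const char *s = "text";

    /* In C a character constant such as 'a' has type int, not char. */
    assert(strcmp(type_name('a'), "int") == 0);

    return strcmp(type_name(i), "int") == 0 &&
           strcmp(type_name(d), "double") == 0 &&
           strcmp(type_name(s), "const char *") == 0 &&
           strcmp(type_name(f), "something else") == 0;
}
```

The selection happens entirely at compile time, so a macro like this adds no runtime dispatch cost.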

Multi-threading

C11 added two header files for multi-threading support: <threads.h> and <stdatomic.h>.

Coding style

Error handling

In traditional C, it is common for functions to return an integer error code, with the error details obtained by reading a global variable (errno, in the case of libc). The processing logic is roughly as follows.

error_code do_something(some_arg_t args, out_t *out) {
  error_code err = do_task1();
  if(err) {
    goto error;
  }

  err = do_task2();
  if(err) {
    goto error;
  }

  err = do_task3();
  if(err) {
    goto error;
  }

 error:
  ...
  return err;
}

The real return value is passed back through the last pointer parameter. This practice has a long history but has two drawbacks:

  • Verbose handling. Every function call needs its own error check, and calls cannot be chained
  • A forced memory allocation. Because the result has to be written through a pointer argument, a memory allocation is hard to avoid

A recommended approach, modeled after the Result type in Rust, is to simply return a struct containing both the real data and the error message.

typedef struct {
  char *data;
  isize_t size;
  valid_t valid;
} file_contents_t;

file_contents_t read_file_contents(const char* filename);

Functions that consume the result check its valid field. This solves both problems above, and usage looks like this.

file_contents_t fc = read_file_contents("milo.cat");
image_t img = load_image_from_file_contents(fc);
texture_t texture = load_texture_from_image(img);

if(texture.valid) {
  ...
}

Since a struct is returned by value, this indirectly eases the burden of managing memory manually. And since no pointers to pointers are involved, the program may in theory run faster as well.
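To make the chaining concrete, here is a small self-contained sketch of the same idea. The types text_t and word_count_t and the functions text_from and count_words are hypothetical stand-ins, not the functions shown above; each stage passes an invalid input straight through, so only the final result needs to be checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *data;
    size_t      size;
    bool        valid;
} text_t;

typedef struct {
    size_t count;
    bool   valid;
} word_count_t;

/* Wraps a C string; a NULL input yields an invalid result, not an error code. */
text_t text_from(const char *s) {
    text_t t = { .data = s, .size = s ? strlen(s) : 0, .valid = (s != NULL) };
    return t;
}

/* Counts space-separated words; an invalid input is propagated unchanged. */
word_count_t count_words(text_t t) {
    word_count_t wc = { .count = 0, .valid = false };
    if (!t.valid)
        return wc;
    assert(t.data != NULL);    /* valid implies data is set */

    bool in_word = false;
    for (size_t i = 0; i < t.size; i++) {
        bool is_space = (t.data[i] == ' ');
        if (!is_space && !in_word)
            wc.count++;        /* first character of a new word */
        in_word = !is_space;
    }
    wc.valid = true;
    return wc;
}
```

A call such as `count_words(text_from(input))` needs no intermediate error check, and no heap allocation is involved because both structs travel by value.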

API Packaging

In the introduction to C package management above, it was mentioned that a C program using a library only needs to care about the header file, which defines the library's public interface; the implementation and the interface are separate.

Usually a header file is thought of as hiding only functions, by giving their declarations without their implementations, but it can hide structs as well. For example:

/* Opaque pointer representing an Emacs Lisp value.
   BEWARE: Do not assume NULL is a valid value!  */
typedef struct emacs_value_tag *emacs_value;

emacs_value init_value();

Here emacs_value is an opaque handle to the structure emacs_value_tag. The real definition of the structure lives in the source file; the library only has to provide a constructor for it, and the user does not need to know the structure's size or layout.

For more information about C API design, see this document (PDF).
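The same technique can be sketched in a single file. Everything here is hypothetical (counter_t and its functions are invented for illustration); the two comment banners mark what would normally be split between a header and a source file.

```c
#include <assert.h>
#include <stdlib.h>

/* ---- what would live in counter.h (hypothetical name) ---- */
typedef struct counter counter_t;   /* incomplete type: callers cannot see inside */
counter_t *counter_new(int start);
void       counter_incr(counter_t *c);
int        counter_value(const counter_t *c);
void       counter_free(counter_t *c);

/* ---- what would live in counter.c ---- */
struct counter {
    int value;                      /* can change without breaking callers */
};

counter_t *counter_new(int start) {
    counter_t *c = malloc(sizeof *c);
    if (c)
        c->value = start;
    return c;                       /* NULL if allocation failed */
}

void counter_incr(counter_t *c) {
    assert(c != NULL);
    c->value++;
}

int counter_value(const counter_t *c) {
    assert(c != NULL);
    return c->value;
}

void counter_free(counter_t *c) {
    free(c);
}
```

With only the header visible, code that tries to write `sizeof(counter_t)` or access `c->value` fails to compile, which is what keeps the layout private.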

Summary

As a language with a long history, C is not obsolete; on the contrary, it has kept evolving with the times. Programmers who have used high-level languages for a long time may at first find C too rudimentary and inefficient to develop in, but this is only a surface impression that fades as they become familiar with the whole ecosystem. And with less black magic, programmers gain a greater sense of control over the entire code base.

In the beginning you always want results.

In the end all you want is CONTROL.