Anyone familiar with Lua knows that Lua allows you to modify and delete elements in a for ... pairs loop. pairs` loops to modify and delete elements in a table. There is nothing wrong with the following code:

1
2
3
4
5
6
local t = {a = 1, b = 2, c = 3}
for k, v in pairs(t) do
    if v == 1 then
        t[k] = nil
    end
end

However, if we delete elements and add new elements while traversing, there will be a problem. For example:

1
2
3
4
5
6
7
local t = {a = 1, b = 2, c = 3}
for k, v in pairs(t) do
    if v == 1 then
        t[k] = nil
        t[k .. 1] = v + 1
    end
end

Running the above code will give you this error:

1
2
3
4
5
invalid key to 'next'
stack traceback:
        [C]: in function 'next'
        stdin:1: in main chunk
        [C]: in ?

Anyone familiar with Lua will be familiar with this error. The solution is to avoid adding new elements to the table during the traversal. The official Lua documentation also makes it clear: if any value is assigned to a field in the table that does not exist during the traversal, the behavior of next is undefined. But the error itself is interesting, as it involves many aspects of Lua’s mechanics, and I think it’s a good opportunity to understand Lua’s internal implementation.

Puzzling behavior

Why is it interesting? First of all this error doesn’t always occur (the documentation also says “behavior undefined”), for example the following code doesn’t have this problem:

1
2
3
4
5
6
7
local t = {a = 1, b = 2, c = 3, d = 4, e = 5, f = 6}
for k, v in pairs(t) do
    if v == 1 then
        t[k] = nil
        t[k .. 1] = v + 1
    end
end

On the other hand, we know that the for ... pairs loop essentially calls next at each iteration, passing in the table and the key of the previous iteration to get the key-value pair for the current loop.pairs loop is essentially calling next at each iteration, passing in the table and the key of the previous iteration, and getting the key-value pair of the current loop. The meaning of the next(t, k) function is that, for a given table t, it returns the next key-value pair adjacent to the given key k. If k is the last element of t, then nil is returned; if k is nil, then the first element of t is returned. What if k is not in t? Naturally, an error should be reported:

1
2
3
4
5
6
> next({a = 1, b = 2}, 'c')
invalid key to 'next'
stack traceback:
        [C]: in function 'next'
        stdin:1: in main chunk
        [C]: in ?

The familiar error. In that case, the code at the beginning of this article should also report an error, because when v equals 1, the corresponding key is deleted, causing the next iteration to call next to try to find the next element for a key that does not exist in the table. But not only does it work, the following code also works:

1
2
3
4
local t = {a = 1, b = 2}
print(next(t, 'a')) -- b    2
t.a = nil
print(next(t, 'a')) -- b    2

The second next call returns normally, as if a were still in t. The error occurs when we delete and then insert a new element:

1
2
3
4
5
local t = {a = 1, b = 2}
print(next(t, 'a')) -- b    2
t.a = nil
t.a1 = 2
print(next(t, 'a')) -- error: invalid key to 'next'

Of course there’s more fun, to get this error, you don’t have to insert a new element, sometimes it comes out with a GC:

1
2
3
4
5
6
7
8
9
local k = 'a'..'b'
local t = {
    a = 1,
    [k] = 2,
}
t[k] = nil
k = nil
collectgarbage("collect")
next(t, 'a'..'b') -- error: invalid key to 'next'

It’s written in a strange way to accommodate certain Lua mechanisms. If you change the above code a bit, for example by changing next(t, 'a'...' b') to next(t, 'ab'), the error will not appear.

Implementation of next function

When calling next(t, k), k should exist in t; but it seems to work if k has just been deleted. How can this be done? Let’s look at the implementation of the next function:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
/* ltable.c */

static unsigned int findindex (lua_State *L, Table *t, StkId key) {
  unsigned int i;
  if (ttisnil(key)) return 0;  /* first iteration */
  i = arrayindex(key);
  if (i != 0 && i <= t->sizearray)  /* is 'key' inside array part? */
    return i;  /* yes; that's the index */
  else {
    int nx;
    Node *n = mainposition(t, key);
    for (;;) {  /* check whether 'key' is somewhere in the chain */
      /* key may be dead already, but it is ok to use it in 'next' */
      if (luaV_rawequalobj(gkey(n), key) ||
            (ttisdeadkey(gkey(n)) && iscollectable(key) &&
             deadvalue(gkey(n)) == gcvalue(key))) {
        i = cast_int(n - gnode(t, 0));  /* key index in hash table */
        /* hash elements are numbered after array ones */
        return (i + 1) + t->sizearray;
      }
      nx = gnext(n);
      if (nx == 0)
        luaG_runerror(L, "invalid key to 'next'");  /* key not found */
      else n += nx;
    }
  }
}


int luaH_next (lua_State *L, Table *t, StkId key) {
  unsigned int i = findindex(L, t, key);  /* find original element */
  for (; i < t->sizearray; i++) {  /* try first array part */
    if (!ttisnil(&t->array[i])) {  /* a non-nil value? */
      setivalue(key, i + 1);
      setobj2s(L, key+1, &t->array[i]);
      return 1;
    }
  }
  for (i -= t->sizearray; cast_int(i) < sizenode(t); i++) {  /* hash part */
    if (!ttisnil(gval(gnode(t, i)))) {  /* a non-nil value? */
      setobj2s(L, key, gkey(gnode(t, i)));
      setobj2s(L, key+1, gval(gnode(t, i)));
      return 1;
    }
  }
  return 0;  /* no more elements */
}

You can see that luaH_next first calls findindex to find the position of a given key in table t, and then looks backwards from that position to the next element. In findindex, several things are done:

  • Line 5 determines that if key is nil, it returns 0;
  • Lines 6 to 8, if key is a positive integer and is in the array field, return the array index;
  • Otherwise, look in the hash field. First line 11 gets the primary location of key;
  • In the loop from line 12 to 25, lines 14 to 16 check if the value of the current position is equal to key, if so, the lookup succeeds, line 19 returns directly; otherwise, line 21 checks the next position;
  • If the node equal to key is not found until the end of the chain, the search fails and an error invalid key to 'next' is thrown on line 23.

As we saw earlier, calling next immediately after an element is set to nil, returns it normally. This means that findindex has successfully found the position of the element. Notice the comment on line 13: “key may be dead already, but it is ok to use it in ’next’”, which means that a deleted key is fine for next? With this in mind, let’s first see what happens when we execute t[k] = nil.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
/* lvm.h */

#define luaV_fastset(L,t,k,slot,f,v) \
  (!ttistable(t) \
   ? (slot = NULL, 0) \
   : (slot = f(hvalue(t), k), \
     ttisnil(slot) ? 0 \
     : (luaC_barrierback(L, hvalue(t), v), \
        setobj2t(L, cast(TValue *,slot), v), \
        1)))

/* lvm.c */

#define settableProtected(L,t,k,v) { const TValue *slot; \
  if (!luaV_fastset(L,t,k,slot,luaH_get,v)) \
    Protect(luaV_finishset(L,t,k,v,slot)); }

void luaV_execute (lua_State *L) {
  ...
  /* main loop of interpreter */
  for (;;) {
    Instruction i;
    StkId ra;
    vmfetch();
    vmdispatch (GET_OPCODE(i)) {
      ...
      vmcase(OP_SETTABLE) {
        TValue *rb = RKB(i);
        TValue *rc = RKC(i);
        settableProtected(L, ra, rb, rc);
        vmbreak;
      }
      ...
    }
  }
}

We know that t[k] = nil executes the OP_SETTABLE instruction, so we find the lvm.c where the luaV_execute function executes OP_SETTABLE, which is lines 27 to 32 above, and see that it does several things:

  • Get the operands, ra , rb , and rc for table, key, and value, respectively;
  • Line 30 calls the macro settableProtected ;
  • line 15 settableProtected first tries to call luaV_fastset ;
  • luaV_fastset determines that luaV_fastset fails when t is not a table (line 4), or k is not in t (line 7), and returns 0. Otherwise, the value is assigned directly in line 9;
  • If luaV_fastset fails, then luaV_finishset is called instead at line 16.

When we delete an element using t[k] = nil, k is obviously in t, so it will execute to line 9, assigning a value directly to slot. Notice that slot is actually the return value of luaH_get, which will be equal to the value of k. So when t[k] = nil is executed, it will only assign the value of k to nil, and will not change the key. This also gives the next function an opportunity to do so.

Let’s go back to luaH_next . It does this in findindex

1
2
3
4
if (luaV_rawequalobj(gkey(n), key) ||
      (ttisdeadkey(gkey(n)) && iscollectable(key) &&
       deadvalue(gkey(n)) == gcvalue(key))) {
    ...

to determine if the current position is equal to key. If our key is not a GCObject, such as a number or a boolean, it can be accessed even if it is deleted, and luaV_rawequalobj always gives the correct result; even if the key is a GCObject, it can be compared properly as long as it is not GC’d. However, if the position of the key is occupied by an element inserted later, the lookup will fail. This is why it is possible to get an error like “invalid key to ’next’” when traversing both deleted and new elements.

When using the for ... pairs loop, you don’t have to worry about the key being GC’d, because the iterator always holds a reference to the current key. So in a for ... pairs loop, you can safely perform the delete operation.

Lua’s GC mechanism

(ttisdeadkey(gkey(n)) && iscollectable(key) && deadvalue(gkey(n)) == gcvalue(key)) is used to handle the case when the key is GC’d. If the current location is a deadkey, and key is a GCObject, compare deadvalue(gkey(n)) == gcvalue(key) , i.e. compare the address of the key at the current location (i.e. gkey(n) ) and the incoming key to see if they are equal. So what is deadkey? This brings us to Lua’s GC mechanism.

Lua GC works by stringing all the GCObjects into a chain; then periodically scanning and marking all the objects in the VM, and the ones that are not marked are the ones that need to be recycled. Let’s look at the function that scans the table:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
/* lgc.c */

static void traversestrongtable (global_State *g, Table *h) {
  Node *n, *limit = gnodelast(h);
  unsigned int i;
  for (i = 0; i < h->sizearray; i++)  /* traverse array part */
    markvalue(g, &h->array[i]);
  for (n = gnode(h, 0); n < limit; n++) {  /* traverse hash part */
    checkdeadkey(n);
    if (ttisnil(gval(n)))  /* entry is empty? */
      removeentry(n);  /* remove it */
    else {
      lua_assert(!ttisnil(gkey(n)));
      markvalue(g, gkey(n));  /* mark key */
      markvalue(g, gval(n));  /* mark value */
    }
  }
}

To be precise, this is a function that scans non-weak tables. This function is relatively straightforward, and it does the following:

  • Lines 6 to 7 scan the array fields, all marked;
  • Lines 8 to 17 scan the hash field, where:
    • lines 10 to 11 check if the value of the current position is nil, and if so, call removeentry ;
    • Otherwise lines 13 to 15 mark the key and value of the current location.

Objects that are not marked are reclaimed, so if the value of an element is nil, it will be reclaimed in this GC. But before it is reclaimed, removeentry is also called, and what does it do?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
/* lobject.h */

#define settt_(o,t)	((o)->tt_=(t))

#define setdeadvalue(obj)	settt_(obj, LUA_TDEADKEY)


/* lgc.c */

static void removeentry (Node *n) {
  lua_assert(ttisnil(gval(n)));
  if (valiswhite(gkey(n)))
    setdeadvalue(wgkey(n));  /* unused and unmarked key; remove it */
}

It will mark the key of the element as dead. Note that this marker actually sets the object’s type tt_ field to LUA_TDEADKEY . This way when calling luaV_rawequalobj to compare a deadkey with a regular object it always returns false. This way when deadvalue(gkey(n)) == gcvalue(key) is executed, the key of element n must have been GC’d.

For table, function, thread, userdata, two objects are equal if and only if their addresses are equal. When an object is GC’d, it doesn’t have any references, so it can’t be passed into next.

This is not the case with strings, where the equality of two strings depends on the content of the string, not on its address. So when a string is GC’d, we can still construct a string that is equal to it, but with a different address. Passing the newly constructed string into next, will cause an error. In the previous example, the reason for using 'a'...' b'' is because we want him to construct a different object. Just change any of the ''a'...' b' anywhere in it, then the string ab' will be in the constant table, so it won’t be GC’d; Lua has a short string reuse mechanism, so 'a'...' b' and 'ab' are actually the same object, so this error will not occur.

Conclusion

I was surprised when I first discovered that next could pass in a key that had just been deleted, because a deleted key should be something like a wild pointer and should not be used anymore. But after a closer look, everything is fine. This made me feel the subtlety of Lua design once again. This little problem involves the Lua table implementation, the GC mechanism, the constant table mechanism, and short string reuse. It is rather boring to gnaw on Lua source code, so we can explore Lua implementation from some interesting phenomena.