Handling control characters in text with Python

Some time ago, while writing a crawler in Python, I hit an error when reading the fetched data. After some analysis, I found that the returned HTML contained control characters. (It turns out this can also serve as an anti-crawler measure: control characters easily cause errors in a crawler, but they make no visible difference when the page is rendered for a user in the browser.)

What is a control character?

Control characters, also called non-printing characters, appear in text to signal a control function rather than to represent a printable symbol. Examples include LF (line feed), CR (carriage return), FF (form feed), DEL (delete), BS (backspace), and BEL (bell), as well as communication control characters such as SOH (start of heading), EOT (end of transmission), and ACK (acknowledge).

There are two specific sets of control characters:

  • Seven-bit ASCII defines 33 codes as control characters: 0 to 31 and 127 (0x00-0x1F and 0x7F).
  • Eight-bit encodings compatible with ISO/IEC 8859-1 add another 32 control codes from ISO/IEC 6429: 128 to 159 (0x80-0x9F).
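The two ranges above can be checked against Python's own Unicode tables. The following is a minimal sketch, assuming Python 3; the variable name `control_code_points` is only illustrative. It collects every code point whose Unicode general category is "Cc" (control), which should be exactly the ASCII set plus the ISO/IEC 6429 set.

```python
import unicodedata

# Collect every code point whose Unicode general category is "Cc" (control).
control_code_points = [
    cp for cp in range(0x110000)
    if unicodedata.category(chr(cp)) == "Cc"
]

# 33 ASCII control codes plus 32 codes in the 0x80-0x9F block
print(len(control_code_points))  # 65
print(hex(control_code_points[0]), hex(control_code_points[-1]))
```

Checking the category instead of hard-coding ranges is one way to decide whether a given character counts as a control character.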

Control character list: http://ascii-table.com/control-chars.php

Python solutions for control characters

Option 1.

# Keep only printable ASCII (codes 32 to 126); this also drops every non-ASCII character.
strip_control_characters = lambda s: "".join(i for i in s if 31 < ord(i) < 127)
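Because Option 1 keeps only codes 32 to 126, it also discards accented letters, CJK text, tabs and newlines. The sketch below contrasts it with a variant that drops only characters in Unicode category "Cc". It assumes Python 3; the function names, the `keep` parameter and the sample string are hypothetical and not part of the original snippet.

```python
import unicodedata

def strip_ascii_only(s):
    # Option 1: keep printable ASCII only.
    return "".join(ch for ch in s if 31 < ord(ch) < 127)

def strip_control_keep_unicode(s, keep="\t\n\r"):
    # Drop control characters (category "Cc") except the whitespace in `keep`.
    return "".join(
        ch for ch in s
        if ch in keep or unicodedata.category(ch) != "Cc"
    )

# Sample text with BEL (\x07) and ESC (\x1b) mixed into non-ASCII content.
sample = "caf\u00e9\x07 \u4e2d\u6587\x1b\n"

print(repr(strip_ascii_only(sample)))
print(repr(strip_control_keep_unicode(sample)))
```

Which variant fits depends on the data: the ASCII-only filter is fine for pure English pages, while the category check is safer for multilingual content.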

Option 2.

import re

# Python 3 version: a str holds code points, so any surrogate found in it is unpaired.
# Characters that are illegal in XML: C0 controls other than tab, LF and CR,
# surrogates, and the noncharacters U+FFFE and U+FFFF.
RE_XML_ILLEGAL = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]')
# Remaining ASCII control characters (including tab, LF and CR) plus DEL.
RE_ASCII_CONTROL = re.compile(r'[\x01-\x1f\x7f]')

def strip_control_characters(str_input):
    if str_input:
        # unicode invalid characters
        str_input = RE_XML_ILLEGAL.sub("", str_input)
        # ascii control characters
        str_input = RE_ASCII_CONTROL.sub("", str_input)
    return str_input
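The first pattern in Option 2 targets characters that XML does not allow, which matters when the cleaned text is handed to an XML parser. A minimal sketch of that situation, assuming Python 3 and the standard-library `xml.etree.ElementTree` parser; the helper name `strip_xml_illegal` and the sample markup are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Code points XML 1.0 does not allow: C0 controls other than tab, LF and CR,
# surrogates, and U+FFFE/U+FFFF.
XML_ILLEGAL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")

def strip_xml_illegal(text):
    return XML_ILLEGAL_RE.sub("", text)

# Hypothetical fragment containing SOH (\x01) and vertical tab (\x0b).
raw = "<item>price\x01: 42\x0b</item>"

try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False  # the parser rejects the raw control characters

element = ET.fromstring(strip_xml_illegal(raw))
print(parsed_raw, repr(element.text))
```

Unlike the second pattern in Option 2, this helper leaves tab, LF and CR in place, since those are legal in XML.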

Option 3.

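import re

import simplejson  # third-party; the standard-library json module also provides loads()

# Python 3 version. C0 controls (0-31), DEL (127) and C1 controls (128-159).
control_chars = ''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

# original_json is the raw JSON text returned by the crawler
cleaned_json = remove_control_chars(original_json)
obj = simplejson.loads(cleaned_json)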
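To see why the cleaning step matters for JSON, here is a minimal sketch assuming Python 3 and the standard-library `json` module in place of `simplejson`. The payload is a hypothetical example, and the compiled pattern is a compact equivalent of the character set built in Option 3.

```python
import json
import re

# Same character set as Option 3: C0 controls, DEL and C1 controls.
CONTROL_CHAR_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")

def remove_control_chars(s):
    return CONTROL_CHAR_RE.sub("", s)

# Hypothetical payload with a backspace and a NUL inside a string value.
original_json = '{"title": "Hello\x08 World\x00"}'

try:
    json.loads(original_json)
    raw_error = None
except json.JSONDecodeError as exc:
    # By default the parser rejects raw control characters inside strings.
    raw_error = exc

obj = json.loads(remove_control_chars(original_json))
print(obj)
```

Note that the cleaning also removes newlines and tabs between JSON tokens, which is harmless for parsing, and it silently drops any control characters that were inside string values.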

Reference links.