When crawling data, we often run into exceptions caused by network problems. The naive approach is simply to log the errors and reprocess the failed items afterwards. This post collects some better ways to retry on exceptions.

Initial version:

def crawl_page(url):
    pass  # fetch and parse the page

def log_error(url):
    pass  # record the failing URL for later reprocessing

url = ""
try:
    crawl_page(url)
except Exception:
    # on any failure, just log the URL and move on
    log_error(url)

Improved version (retry up to a fixed number of times):

attempts = 0
success = False
while attempts < 3 and not success:
    try:
        crawl_page(url)
        success = True
    except Exception:
        # count the failure; the loop condition caps us at 3 attempts
        attempts += 1
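
A common refinement (not shown in the original snippet) is to pause between attempts so that a transient network problem has time to clear. A minimal sketch with a fixed delay:

import time

attempts = 0
success = False
while attempts < 3 and not success:
    try:
        crawl_page(url)
        success = True
    except Exception:
        attempts += 1
        time.sleep(2)  # wait 2 seconds before the next attempt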

New solution: retrying

retrying is a Python package for automatically retrying code that may fail. It provides a decorator, retry; a decorated function is re-executed whenever it raises an exception, and by default it keeps retrying for as long as errors keep occurring.

import random
from retrying import retry

@retry
def do_something_unreliable():
    if random.randint(0, 10) > 1:
        # fails roughly 9 times out of 11
        raise IOError("Broken sauce, everything is hosed!!!111one")
    else:
        return "Awesome sauce!"

print(do_something_unreliable())

If we call do_something_unreliable, it will not finish until random.randint(0, 10) returns 0 or 1; until then it keeps re-executing.

retry also accepts a number of parameters; they correspond to the optional arguments of the Retrying class's initializer in the source code:

  • stop_max_attempt_number: sets the maximum number of attempts, after which retrying stops
  • stop_max_delay: if set to 10000, for example, then once 10 seconds have elapsed between the moment the decorated function first starts executing and the moment it succeeds or aborts, the function will not be executed again
  • wait_fixed: sets a fixed wait time between retries
  • wait_random_min and wait_random_max: draw the wait time between retries at random from this range
  • wait_exponential_multiplier and wait_exponential_max: grow the wait time between retries exponentially, as 2^previous_attempt_number * wait_exponential_multiplier, where previous_attempt_number is the number of retries so far; if this value exceeds wait_exponential_max, the wait between two retries is capped at wait_exponential_max. This design implements the exponential backoff algorithm and helps mitigate blocking
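
Here is a minimal sketch combining these options; flaky_call and its failure logic are placeholders for illustration, not part of the retrying API. When several stop conditions are given, retrying stops as soon as any one of them is met.

import random
from retrying import retry

@retry(
    stop_max_attempt_number=5,         # give up after 5 attempts...
    stop_max_delay=10000,              # ...or once 10 seconds have elapsed
    wait_exponential_multiplier=1000,  # wait 2^n * 1000 ms between attempts
    wait_exponential_max=8000,         # capped at 8 seconds
)
def flaky_call():
    # placeholder for any unreliable operation, e.g. a network request
    if random.random() < 0.8:
        raise IOError("transient network error")
    return "ok"

print(flaky_call())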
We can specify which exceptions should trigger a retry, by passing a function object via retry_on_exception:
def retry_if_io_error(exception):
    # retry only when the raised exception is an IOError
    return isinstance(exception, IOError)

@retry(retry_on_exception=retry_if_io_error)
def read_a_file():
    with open("file", "r") as f:
        return f.read()

While read_a_file is executing, if an exception is raised it is passed to retry_if_io_error via the exception parameter. If the exception is an IOError, a retry is performed; if not, execution stops and the exception is raised.
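
To make the filtering concrete, here is a small hypothetical demo (raise_value_error is made up for illustration): a ValueError does not match the predicate, so it propagates on the first attempt instead of being retried.

from retrying import retry

def retry_if_io_error(exception):
    return isinstance(exception, IOError)

@retry(retry_on_exception=retry_if_io_error)
def raise_value_error():
    # not an IOError, so the predicate returns False and no retry happens
    raise ValueError("no retry for this one")

try:
    raise_value_error()
except ValueError as e:
    print("failed on the first attempt:", e)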

We can also specify which return values should trigger a retry, by passing a function object via retry_on_result.

def retry_if_result_none(result):
    return result is None

@retry(retry_on_result=retry_if_result_none)
def get_result():
    # always returns None here, so this example would retry forever
    return None

After get_result finishes executing, its return value is passed to retry_if_result_none via the result parameter. If the return value is None, a retry is performed; otherwise the call ends and the value is returned.
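
In practice you would combine retry_on_result with a stop condition so that a persistently bad result does not retry forever; once the stop condition is reached, retrying raises a RetryError. A minimal sketch (the unconditional None is just for demonstration):

from retrying import retry, RetryError

def retry_if_result_none(result):
    return result is None

@retry(retry_on_result=retry_if_result_none, stop_max_attempt_number=3)
def get_result():
    return None  # always a "bad" result, so all 3 attempts are used

try:
    get_result()
except RetryError as e:
    print("gave up after 3 attempts:", e)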
