RetryRunner plugin

The RetryRunner plugin implements retry logic to improve task execution reliability.

The primary use case for RetryRunner is to make Nornir task execution as reliable as possible by using queuing, retries, connection splaying and exponential backoff.

RetryRunner Architecture

RetryRunner controls the rate of connection establishment by limiting the number of connector workers.

For example, if num_connectors is 5, only 5 workers establish connections to devices at any point in time; even if there are 100 devices, RetryRunner connects to only 5 of them at a time. This is helpful when the connection rate needs to be limited due to operational restrictions such as load on AAA (TACACS, RADIUS) servers.
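The effect of a small connector pool can be illustrated with the standard library alone. The sketch below is not RetryRunner code: it uses a ThreadPoolExecutor as a stand-in for the connector workers, and fake_connect, hosts and stats are hypothetical names introduced only for this illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

NUM_CONNECTORS = 5  # mirrors the num_connectors value used in the text
hosts = [f"device-{i}" for i in range(100)]

lock = threading.Lock()
stats = {"active": 0, "peak": 0}  # connections in flight now / highest seen


def fake_connect(host: str) -> str:
    """Hypothetical stand-in for opening a connection to a device."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.001)  # pretend the login handshake takes some time
    with lock:
        stats["active"] -= 1
    return host


# A pool of NUM_CONNECTORS threads caps how many connections are in flight.
with ThreadPoolExecutor(max_workers=NUM_CONNECTORS) as pool:
    connected = list(pool.map(fake_connect, hosts))

print(f"connected to {len(connected)} hosts, peak concurrency {stats['peak']}")
```

All 100 hosts are eventually processed, but the recorded peak never exceeds NUM_CONNECTORS, which is the property that protects AAA servers from a burst of logins.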

When a new task starts and no connection exists to the device that the task uses, RetryRunner attempts to connect to the device, retrying up to connect_retry times.

Once the connection is established, the task is handed over to worker threads for execution. Workers retry the task up to task_retry times if it fails.
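The retry behavior described above can be sketched as a generic retry loop with exponential backoff and splay. This is an illustration under stated assumptions, not the plugin's implementation: the delay formula (backoff doubling on every retry plus a random splay) and the reading of the retry count as "one initial attempt plus up to N retries" are assumptions, and run_with_retry, backoff_delay_ms and flaky_connect are hypothetical names.

```python
import random
import time


def backoff_delay_ms(attempt: int, backoff: int, splay: int) -> float:
    """Delay before retry number `attempt` (1-based), in milliseconds.

    Assumed formula: backoff doubles per retry, plus a random 0..splay jitter.
    """
    return backoff * 2 ** (attempt - 1) + random.uniform(0, splay)


def run_with_retry(func, retries: int, backoff: int, splay: int, sleep=time.sleep):
    """Call func() once, then retry up to `retries` more times on exception."""
    last_error = None
    for attempt in range(retries + 1):
        if attempt:  # wait only before retries, not before the first try
            sleep(backoff_delay_ms(attempt, backoff, splay) / 1000)
        try:
            return func()
        except Exception as error:
            last_error = error
    raise last_error


calls = []


def flaky_connect() -> str:
    """Hypothetical connection that fails twice before succeeding."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("device not reachable yet")
    return "connected"


delays = []  # record the delays instead of really sleeping
outcome = run_with_retry(
    flaky_connect, retries=3, backoff=1000, splay=100, sleep=delays.append
)
print(outcome, len(calls), [round(delay, 2) for delay in delays])
```

Passing a custom sleep callable keeps the example fast; with the default time.sleep the same loop would actually pause between attempts.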

Connection parameters such as timeouts or the use of SSH keys are handled by Nornir connection plugins. RetryRunner calls Nornir to start the connection; further connection establishment details are controlled by the connection plugin itself.

[Figure: RetryRunner architecture diagram (RetryRunner_v0.png)]

Sample Usage

Instruct Nornir to use RetryRunner on instantiation and run your tasks:

from nornir import InitNornir

NornirObj = InitNornir(
    runner={
        "plugin": "RetryRunner",
        "options": {
            "num_workers": 100,
            "num_connectors": 10,
            "connect_retry": 3,
            "connect_backoff": 1000,
            "connect_splay": 100,
            "task_retry": 3,
            "task_backoff": 1000,
            "task_splay": 100
        }
    }
)

Sample code to demonstrate usage of RetryRunner, DictInventory and ResultSerializer plugins:

import yaml
import pprint
from nornir import InitNornir
from nornir.core.task import Result, Task
from nornir_netmiko import netmiko_send_command, netmiko_send_config
from nornir_salt.plugins.functions import ResultSerializer

inventory_data = '''
hosts:
  R1:
    hostname: 192.168.1.151
    platform: ios
    groups: [lab]
  R2:
    hostname: 192.168.1.153
    platform: ios
    groups: [lab]
  R3:
    hostname: 192.168.1.154
    platform: ios
    groups: [lab]

groups:
  lab:
    username: cisco
    password: cisco
'''

inventory_dict = yaml.safe_load(inventory_data)

NornirObj = InitNornir(
    runner={
        "plugin": "RetryRunner",
        "options": {
            "num_workers": 100,
            "num_connectors": 10,
            "connect_retry": 3,
            "connect_backoff": 1000,
            "connect_splay": 100,
            "task_retry": 3,
            "task_backoff": 1000,
            "task_splay": 100
        }
    },
    inventory={
        "plugin": "DictInventory",
        "options": {
            "hosts": inventory_dict["hosts"],
            "groups": inventory_dict["groups"],
            "defaults": inventory_dict.get("defaults", {})
        }
    },
)

def _task_group_netmiko_send_commands(task, commands):
    # run commands
    for command in commands:
        task.run(
            task=netmiko_send_command,
            command_string=command,
            name=command
        )
    return Result(host=task.host)

# run single task
result1 = NornirObj.run(
    task=netmiko_send_command,
    command_string="show clock"
)

# run grouped tasks
result2 = NornirObj.run(
    task=_task_group_netmiko_send_commands,
    commands=["show clock", "show run | inc hostname"],
    connection_name="netmiko"
)

# run another single task
result3 = NornirObj.run(
    task=netmiko_send_command,
    command_string="show run | inc hostname"
)

NornirObj.close_connections()

# Print results
formed_result1 = ResultSerializer(result1, add_details=True)
pprint.pprint(formed_result1, width=100)

formed_result2 = ResultSerializer(result2, add_details=True)
pprint.pprint(formed_result2, width=100)

formed_result3 = ResultSerializer(result3, add_details=True)
pprint.pprint(formed_result3, width=100)

Connections handling

Warning

Parent or grouped tasks must explicitly provide the connection_name task parameter with the name of the connection plugin, such as netmiko, napalm, scrapli, scrapli_netconf, etc. Specifying connection_name for a parent or grouped task is not required if that task has the CONNECTION_NAME global variable defined within it. If connection_name is missing, the connection retry logic, jumphost connection logic and credentials retry logic are skipped, and connections to all hosts are initiated simultaneously, up to the number set by the num_workers option.

This restriction stems from the fact that Nornir tasks have no built-in way to communicate the set of connection plugins that a task will use. By convention, a task may contain the CONNECTION_NAME global parameter to identify the name(s) of the connection plugin(s) the task uses.

The CONNECTION_NAME global parameter can be a single connection name or a comma-separated list of connection plugin names that the task and its subtasks use. RetryRunner honors this parameter and tries to establish all specified connections before starting the task.
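How a comma-separated CONNECTION_NAME value turns into a list of connections to establish can be sketched as follows. The helper resolve_connection_names is hypothetical and only illustrates the parsing; letting an inline connection_name value take precedence over the module-level global is an assumption made for this sketch.

```python
from typing import List, Optional

# Module-level convention described in the text
CONNECTION_NAME = "scrapli_netconf, netmiko, scrapli"


def resolve_connection_names(
    inline: Optional[str] = None, module_global: Optional[str] = None
) -> List[str]:
    """Hypothetical helper: split a comma-separated value into plugin names.

    Assumption: an inline connection_name wins over the module-level global.
    """
    raw = inline or module_global or ""
    return [name.strip() for name in raw.split(",") if name.strip()]


names = resolve_connection_names(module_global=CONNECTION_NAME)
print(names)

# An inline parameter replaces the module-level value in this sketch
print(resolve_connection_names(inline="napalm", module_global=CONNECTION_NAME))
```

Stripping whitespace around each item means that "netmiko,scrapli" and "netmiko, scrapli" resolve to the same list of names.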

Alternatively, the inline task parameter connection_name can be provided on task run.

However, only the parent/main/grouped task supports task parameters; subtasks do not. As a result, if a subtask uses a connection plugin different from the one specified in the parent task's connection_name parameter or CONNECTION_NAME variable, the subtask's connection is not handled by the RetryRunner connection establishment logic. Instead, the connection is established when the subtask starts, simultaneously and in parallel, up to the number set by the num_workers option.

Sample task that uses different connection plugins for subtasks:

from nornir.core.task import Result, Task
from nornir_scrapli.tasks import netconf_get_config
from nornir_scrapli.tasks import send_command as scrapli_send_command
from nornir_netmiko.tasks import netmiko_send_command

# inform RetryRunner to establish these connections
CONNECTION_NAME = "scrapli_netconf, netmiko, scrapli"

def task(task: Task) -> Result:

    task.run(
        name="Pull Configuration Using Scrapli Netconf",
        task=netconf_get_config,
        source="running"
    )

    task.run(
        name="Pull Configuration using Netmiko",
        task=netmiko_send_command,
        command_string="show run",
        enable=True
    )

    task.run(
        name="Pull Configuration using Scrapli",
        task=scrapli_send_command,
        command="show run"
    )

    return Result(host=task.host)

RetryRunner task parameters

RetryRunner supports a number of task parameters that influence its behavior on a per-task basis. These parameters can be supplied to the task as key/value arguments to override the RetryRunner options supplied on Nornir object instantiation.

RetryRunner task parameters description:

  • run_connect_retry - number of connection attempts

  • run_task_retry - number of attempts to run task

  • run_creds_retry - list of connection credentials and parameters to retry while connecting to device

  • run_num_workers - number of threads for tasks execution

  • run_num_connectors - number of threads for device connections

  • run_reconnect_on_fail - if True, re-establish connection on task failure

  • run_task_stop_errors - list of glob patterns to stop retrying if seen in task exception string

  • connection_name - name of connection plugin to use to initiate connection to device

Note

The task retry count is the smaller of the run_connect_retry and run_task_retry counters, i.e. task_retry is set to min(run_connect_retry, run_task_retry).

Warning

Only main/parent tasks support RetryRunner task parameters; subtasks do not support them.
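The override mechanism can be pictured as separating the run_* keys from the task's own arguments. The sketch below is an assumption about how such a split could work, not the plugin's code: split_task_params and RUNNER_OPTIONS are hypothetical names, and only four of the options are modeled.

```python
from typing import Any, Dict, Tuple

# Options as they might be given on Nornir object instantiation
RUNNER_OPTIONS = {
    "connect_retry": 3,
    "task_retry": 3,
    "num_workers": 100,
    "num_connectors": 10,
}


def split_task_params(
    params: Dict[str, Any], options: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: pull run_* overrides out of the task arguments."""
    task_kwargs = dict(params)  # copy, so the caller's dict is untouched
    effective = {
        name: task_kwargs.pop(f"run_{name}", default)
        for name, default in options.items()
    }
    return effective, task_kwargs


effective, task_kwargs = split_task_params(
    {"command_string": "show clock", "run_connect_retry": 0, "run_num_workers": 1},
    RUNNER_OPTIONS,
)
print(effective)
print(task_kwargs)
```

Options without a matching run_* key keep their instantiation-time value, and the remaining arguments are what the task itself would receive.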

Sample code to use RetryRunner task parameters:

import yaml
from nornir import InitNornir
from nornir.core.task import Result, Task
from nornir_netmiko import netmiko_send_command

inventory_data = '''
hosts:
  R1:
    hostname: 192.168.1.151
    platform: ios
    groups: [lab]

groups:
  lab:
    username: foo
    password: bar

defaults:
  data:
    credentials:
      local_creds:
        username: nornir
        password: nornir
      dev_creds:
        username: devops
        password: foobar
'''

inventory_dict = yaml.safe_load(inventory_data)

NornirObj = InitNornir(
    runner={
        "plugin": "RetryRunner"
    },
    inventory={
        "plugin": "DictInventory",
        "options": {
            "hosts": inventory_dict["hosts"],
            "groups": inventory_dict["groups"],
            "defaults": inventory_dict.get("defaults", {})
        }
    },
)

# run task without retrying - simulate QueueRunner behavior
result1 = NornirObj.run(
    task=netmiko_send_command,
    command_string="show clock",
    run_connect_retry=0,
    run_task_retry=0,
)

# run task one by one - simulate SerialRunner behavior but with retrying
result2 = NornirObj.run(
    task=netmiko_send_command,
    command_string="show clock",
    run_num_workers=1,
    run_num_connectors=1,
)

# retry credentials if login fails, but without retrying connection establishment
result3 = NornirObj.run(
    task=netmiko_send_command,
    command_string="show clock",
    run_creds_retry=["local_creds", "dev_creds"],
    run_connect_retry=0,
)

Connecting to hosts behind jumphost

RetryRunner implements logic to connect to hosts behind bastion hosts/jumphosts.

To connect to devices behind a jumphost, define the jumphost parameters in the host’s inventory data:

hosts:
  R1:
    hostname: 192.168.1.151
    platform: ios
    username: test
    password: test
    data:
      jumphost:
        hostname: 10.1.1.1
        port: 22
        password: jump_host_password
        username: jump_host_user

Note

Only Netmiko (connection_name="netmiko") and Ncclient (connection_name="ncclient") tasks support connecting to hosts behind jumphosts using the above inventory data.

Retrying different credentials

RetryRunner is capable of trying several credentials while connecting to a device. Credentials are tried in sequence, starting with the host’s inventory username and password parameters and moving on to the connection parameters supplied in the creds_retry RetryRunner option.

Credentials retry logic is implemented using the conn_open task plugin: the content of the creds_retry list is passed as the reconnect argument to the conn_open task.

Items of the creds_retry list are tried sequentially until a connection is successfully established or the list runs out of items. If no connection is established after all creds_retry items have been tried, the connection attempt is considered unsuccessful, the host is queued back to the connectors queue and the process repeats on the next try.
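The sequence described above can be sketched with plain Python. This is an illustration only: connect_with_creds_retry, fake_connect and VALID_LOGINS are hypothetical, credentials are passed as ready-made dictionaries rather than inventory names, and a PermissionError stands in for whatever authentication error a real connection plugin raises.

```python
from typing import Callable, Dict, List, Optional

attempted = []  # usernames in the order they were tried
VALID_LOGINS = {("devops", "foobar")}  # made-up accounts the device accepts


def fake_connect(username: str, password: str) -> str:
    """Hypothetical stand-in for opening a connection with given credentials."""
    attempted.append(username)
    if (username, password) not in VALID_LOGINS:
        raise PermissionError(f"login failed for {username}")
    return f"session-as-{username}"


def connect_with_creds_retry(
    connect: Callable[..., str],
    host_creds: Dict[str, str],
    creds_retry: List[Dict[str, str]],
) -> Optional[str]:
    """Try the host's own credentials first, then each creds_retry item."""
    for params in [host_creds, *creds_retry]:
        try:
            return connect(**params)
        except PermissionError:
            continue  # move on to the next set of credentials
    return None  # all items failed; a runner would re-queue the host


session = connect_with_creds_retry(
    fake_connect,
    {"username": "foo", "password": "bar"},
    [
        {"username": "nornir", "password": "nornir"},
        {"username": "devops", "password": "foobar"},
    ],
)
print(session, attempted)
```

Returning None when the list is exhausted corresponds to the unsuccessful connection attempt in the text, after which the host would go back to the connectors queue.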

Sample inventory with retry credentials:

hosts:
  R1:
    hostname: 192.168.1.151
    platform: ios
    groups: [lab]
    data:
      credentials:
        local_creds:
          username: admin
          password: admin

groups:
  lab:
    username: foo
    password: bar

defaults:
  data:
    credentials:
      local_creds:
        username: nornir
        password: nornir
        extras:
          optional_args:
            key_file: False
      dev_creds:
        username: devops
        password: foobar

Credentials can be defined within the defaults data section, as well as inside host or group data. Credentials definitions are not merged across different data sections; they are searched in host -> groups -> defaults order and the first one encountered is used.
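The lookup order can be sketched as a first-match search over the data sections. The helper find_credentials is hypothetical, and it encodes one reading of the rule as an assumption: the first section that defines a credentials key is used as a whole, so names that exist only in a later section are not visible. The host R2 below is also made up for the illustration.

```python
from typing import Any, Dict, List, Optional


def find_credentials(
    name: str,
    host_data: Dict[str, Any],
    groups_data: List[Dict[str, Any]],
    defaults_data: Dict[str, Any],
) -> Optional[Dict[str, Any]]:
    """Hypothetical lookup: host -> groups -> defaults, no merging."""
    for section in [host_data, *groups_data, defaults_data]:
        if "credentials" in section:
            # Assumption: the first credentials section found wins as a whole
            return section["credentials"].get(name)
    return None


# Data shaped like the sample inventory above
r1_data = {"credentials": {"local_creds": {"username": "admin", "password": "admin"}}}
r2_data = {}  # made-up host without its own credentials section
lab_group_data = {}
defaults_data = {
    "credentials": {
        "local_creds": {"username": "nornir", "password": "nornir"},
        "dev_creds": {"username": "devops", "password": "foobar"},
    }
}

r1_creds = find_credentials("local_creds", r1_data, [lab_group_data], defaults_data)
r2_creds = find_credentials("local_creds", r2_data, [lab_group_data], defaults_data)
print(r1_creds["username"], r2_creds["username"])
```

For R1 the host-level local_creds entry is found first, while R2 falls through to the defaults section.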

Sample code to use creds_retry:

from nornir import InitNornir

NornirObj = InitNornir(
    runner={
        "plugin": "RetryRunner",
        "options": {
            "creds_retry": ["local_creds", "dev_creds"]
        }
    }
)

Parameters of creds_retry items are used as Nornir host.open_connection kwargs; as a result, all arguments of the open_connection method are supported, such as username, password, port, extras, etc.

API Reference

class nornir_salt.plugins.runners.RetryRunner.RetryRunner(num_workers: int = 100, num_connectors: int = 20, connect_retry: int = 3, connect_backoff: int = 5000, connect_splay: int = 100, task_retry: int = 1, task_backoff: int = 5000, task_splay: int = 100, reconnect_on_fail: bool = True, task_timeout: int = 600, creds_retry: Optional[list] = None, task_stop_errors: Optional[list] = None)

RetryRunner is a Nornir runner plugin that strives to make task execution as reliable as possible.

Parameters
  • num_workers – number of threads for tasks execution

  • num_connectors – number of threads for device connections

  • connect_retry – number of connection attempts

  • connect_backoff – exponential backoff timer in milliseconds

  • connect_splay – random interval between 0 and splay for each connection in milliseconds

  • task_retry – number of attempts to run task

  • task_backoff – exponential backoff timer in milliseconds

  • task_splay – random interval between 0 and splay before task start in milliseconds

  • reconnect_on_fail – boolean, default True, perform reconnect to host on task failure

  • task_timeout – int, seconds to wait for the task to complete before closing all queues and stopping connector and worker threads, default 600

  • creds_retry – list of connection credentials and parameters to retry while connecting to device

  • task_stop_errors – list of glob patterns to stop retrying if seen in the task exception string; these patterns are not applicable to errors encountered during connection establishment. The *validation error* pattern is always included in this list.
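The task_stop_errors matching can be sketched with the standard fnmatch module. This is an assumption about the matching style, not the plugin's code: should_stop_retrying and ALWAYS_STOP are hypothetical names, and case-sensitive matching is chosen here only for predictability.

```python
from fnmatch import fnmatchcase
from typing import List, Optional

# Per the parameter description, this pattern is always part of the list
ALWAYS_STOP = ["*validation error*"]


def should_stop_retrying(
    error_text: str, task_stop_errors: Optional[List[str]] = None
) -> bool:
    """Hypothetical check: does the exception text match any stop pattern?"""
    patterns = ALWAYS_STOP + list(task_stop_errors or [])
    return any(fnmatchcase(error_text, pattern) for pattern in patterns)


print(should_stop_retrying("1 validation error for model"))
print(should_stop_retrying("socket timed out"))
print(should_stop_retrying("Pattern not detected: 'R1#'", ["*Pattern not detected*"]))
```

The first and third checks match a stop pattern, so retrying would end early; the second does not match, so the normal task retry logic would continue.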