----------------------------------
----------------------------------
Details on using the Mob* Monitor:
----------------------------------
----------------------------------
Overview:
---------
The Mob* Monitor provides a way to monitor the health state of a particular
service. Service health is defined by a set of satisfiable checks, called
health checks.
The Mob* Monitor executes health checks that are written for a particular
service and collects information on the health state. Users can query
the health state of a service via an RPC/RESTful interface.
When a service is unhealthy, the Mob* Monitor can be requested to execute
repair actions that are defined in the service's check file package.
Check Files and Check File Packages:
------------------------------------
Check file packages are located in the check file directory. Each 'package'
is a Python package.
The layout of the check file directory is as follows:
  checkfile_directory:
      service1:
          __init__.py
          service_actions.py
          more_service_actions.py
          easy_check.py
          harder_check.py
          ...
      service2:
          __init__.py
          service2_actions.py
          service_check.py
          ....
      .
      .
      .
      serviceN:
          ...
Each service check file package should be flat: subdirectories are not walked
when collecting health checks.
Check files define health checks and must end in '_check.py'. The Mob* Monitor
does not enforce how or where in the package you define repair actions.
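The flat-package and '_check.py' naming rules above can be illustrated with a
short sketch. The function name FindCheckFiles and the directory-listing
approach are assumptions made for illustration; this is not the Mob* Monitor's
actual collection code.

```python
import os


def FindCheckFiles(service_dir):
  """Returns sorted names of the check files directly inside |service_dir|.

  Hypothetical helper: only the top level of the package is listed, so
  files in subdirectories are never picked up.
  """
  names = []
  for name in os.listdir(service_dir):
    path = os.path.join(service_dir, name)
    # Only regular files whose names end in '_check.py' count as check files.
    if os.path.isfile(path) and name.endswith('_check.py'):
      names.append(name)
  return sorted(names)
```

For the 'service1' package shown in the layout, such a helper would be expected
to pick up 'easy_check.py' and 'harder_check.py' and skip the action files.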
Health Checks:
--------------
Health checks are the basic conditions that together define whether or not a
service is healthy from the perspective of the Mob* Monitor.
A health check is a Python object that implements the following interface:
  - Check()
      Tests the health condition.
      -> Returns 0 if the health check was completely satisfied.
      -> Returns a positive integer if the check was successful, but could
         have been better.
      -> Returns a negative integer if the check was unsuccessful.
  - Diagnose(errcode)
      Maps an error code to a description and a set of actions that can be
      used to repair or improve the condition.
      -> Returns a tuple of (description, actions) where:
           description is a string describing the state.
           actions is a list of repair functions.
Health checks can (optionally) also define the following attributes:
  - CHECK_INTERVAL: Defines the interval (in seconds) between health check
    executions. This defaults to 30 seconds if not defined.
A check file may contain as many health checks as the writer needs, and there
is no restriction on what else it may include. A package may also contain any
number of check files.
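To illustrate the three kinds of return code, here is a minimal sketch of a
health check that can report a quasi-healthy state. The class name, the
thresholds, the error codes, the GetFreePercent helper and the CleanTempFiles
action are all hypothetical. The constructor argument exists only so the
example can be exercised without touching the real disk.

```python
import shutil


def GetFreePercent(path='/'):
  """Hypothetical helper: percentage of free space on |path|'s filesystem."""
  usage = shutil.disk_usage(path)
  return 100.0 * usage.free / usage.total


def CleanTempFiles(**kwargs):
  """Hypothetical repair action; a real one would free up disk space."""
  return 'cleaned'


class DiskSpaceCheck(object):
  """Hypothetical check: healthy at 20% free or more, failing below 5%."""

  CHECK_INTERVAL = 60  # Seconds; 30 is used when this attribute is absent.

  def __init__(self, get_free_percent=GetFreePercent):
    # |get_free_percent| is a callable returning the free space percentage.
    self._get_free_percent = get_free_percent

  def Check(self):
    free = self._get_free_percent()
    if free >= 20:
      return 0    # Completely satisfied.
    if free >= 5:
      return 1    # Successful, but could be better.
    return -1     # Unsuccessful.

  def Diagnose(self, errcode):
    if errcode == 1:
      return ('Disk space is getting low.', [CleanTempFiles])
    if errcode == -1:
      return ('Disk is almost full.', [CleanTempFiles])
    return ('Unknown failure.', [])
```

Diagnose handles both the positive and the negative code, so the same repair
action can be offered whether the condition is failing or merely improvable.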
Repair Actions:
---------------
Repair actions are used to repair or improve the health state of a service. The
appropriate repair actions to take are returned by a health check's Diagnose
method.
Repair actions are functions and can be defined anywhere in the service check
package.
It is suggested that repair actions be defined in files ending in 'actions.py'
which are imported by health check files.
Health Check and Action Example:
--------------------------------
Suppose we have a service named 'myservice'. The check file package should have
the following layout:
  checkdir:
      myservice:
          __init__.py
          myservice_check.py
          repair_actions.py
The 'myservice_check.py' file should look like the following:
  from myservice import repair_actions

  def IsKeyFileInstalled():
    """Checks if the key file is installed.

    Returns:
      True if the key file is installed, False otherwise.
    """
    ....
    return result

  class MyHealthCheck(object):

    CHECK_INTERVAL = 10

    def Check(self):
      if IsKeyFileInstalled():
        return 0
      return -1

    def Diagnose(self, errcode):
      if errcode == -1:
        return ('Key file is missing.', [repair_actions.InstallKeyFile])
      return ('Unknown failure.', [])
And the 'repair_actions.py' file should look like:
  def InstallKeyFile(**kwargs):
    """Installs the key file."""
    ...
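The way Check, Diagnose and a repair action fit together can be tried by hand
with a small driver. This is only a sketch: the Mob* Monitor runs repair
actions when it is requested to, whereas the hypothetical CheckAndRepair
function below applies the first suggested action immediately. The
KEY_FILE_STATE dict and the stubbed check and action are stand-ins for real
system state.

```python
KEY_FILE_STATE = {'installed': False}  # Stand-in for real system state.


def InstallKeyFile(**kwargs):
  """Stub repair action: pretends to install the key file."""
  KEY_FILE_STATE['installed'] = True


class MyHealthCheck(object):

  def Check(self):
    return 0 if KEY_FILE_STATE['installed'] else -1

  def Diagnose(self, errcode):
    if errcode == -1:
      return ('Key file is missing.', [InstallKeyFile])
    return ('Unknown failure.', [])


def CheckAndRepair(check):
  """Runs |check|; on a nonzero result, applies the first suggested action.

  Returns a (before, after) tuple of Check() return codes.
  """
  before = check.Check()
  if before == 0:
    return (before, before)
  description, actions = check.Diagnose(before)
  if actions:
    actions[0]()  # Hypothetical: repairs immediately instead of on request.
  return (before, check.Check())
```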
Communicating with the Mob* Monitor:
------------------------------------
A small RPC library is provided for communicating with the Mob* Monitor
which can be found in the module 'chromite.mobmonitor.rpc.rpc'.
Communication is done via the RpcExecutor class defined in the above module.
The RPC interface provided by RpcExecutor is as follows:
  - GetServiceList()
      Returns a list of the names of the services that are being monitored.
      There will be one name for each recognized service check directory.

  - GetStatus(service)
      Returns the health status of a service with name |service|. The |service|
      name may be omitted; in that case, the status of every service is
      retrieved.

      A service's health status is a named tuple with the following fields:
        - service: The name of the service.
        - health: A boolean as to whether or not the service is healthy.
        - healthchecks: A list of healthchecks that did not succeed. Referring
            back to the 'Health Checks' section above, a check writer can
            specify return codes for health checks that tell the monitor that
            the health check result was satisfactory, but not optimal. These
            quasi-healthy checks will also be listed here.

      A healthcheck returned in a service's health status is a named tuple with
      the following fields:
        - name: The name of the health check.
        - health: A boolean as to whether or not the health check succeeded.
        - description: A description of the health check's state.
        - actions: A list of the names of actions that may be taken to repair or
            improve this health condition.

      A service is unhealthy if at least one health check failed. A failed health
      check will have its health field marked as False.

      A healthy service will display its health field as True and will not list
      any health checks.

      A service may also be quasi-healthy. In this case, the health field will
      be True, but health conditions that could be improved are listed.

  - RepairService(service, action, args, kwargs)
      Request the Mob* Monitor to execute a repair action for the specified
      service. |args| is a list of positional arguments and |kwargs| is a
      dict of keyword arguments.
      The monitor will return the status of the service post repair execution.
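The healthy, quasi-healthy and unhealthy states described for GetStatus can be
told apart from the two fields 'health' and 'healthchecks'. The sketch below
assumes stand-in named tuples with the field names listed above; the type names
ServiceStatus and HealthCheckStatus and the ClassifyStatus function are
hypothetical and are not part of the RPC library.

```python
import collections

# Hypothetical stand-ins carrying the fields described for GetStatus.
ServiceStatus = collections.namedtuple(
    'ServiceStatus', ['service', 'health', 'healthchecks'])
HealthCheckStatus = collections.namedtuple(
    'HealthCheckStatus', ['name', 'health', 'description', 'actions'])


def ClassifyStatus(status):
  """Returns 'unhealthy', 'quasi-healthy' or 'healthy' for |status|."""
  if not status.health:
    return 'unhealthy'
  if status.healthchecks:
    # Healthy overall, but some checks report room for improvement.
    return 'quasi-healthy'
  return 'healthy'


improvable = HealthCheckStatus(
    'SomeCheck', True, 'Could be better.', ['SomeAction'])
failed = HealthCheckStatus(
    'OtherCheck', False, 'Something is broken.', ['OtherAction'])

print(ClassifyStatus(ServiceStatus('myservice', True, [])))
print(ClassifyStatus(ServiceStatus('myservice', True, [improvable])))
print(ClassifyStatus(ServiceStatus('myservice', False, [failed])))
```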
Using the RPC library:
  from chromite.mobmonitor.rpc import rpc

  def testStatus():
    # RpcExecutor takes optional keyword args for |host| and |port|.
    # They default to 'localhost' and 9991 respectively.
    rpcexec = rpc.RpcExecutor()

    service_list = rpcexec.GetServiceList()
    for service in service_list:
      print(rpcexec.GetStatus(service))

    rpcexec.RepairService('someservice', 'someaction', [1, 2], {'z': 3})
Using the mobmoncli:
A command line interface is provided for communicating with the Mob* Monitor.
The mobmoncli script is installed and part of the PATH on moblabs.
It provides the same interface discussed above for the RpcExecutor.
See chromite.mobmonitor.scripts.mobmoncli for a list of options that can be
passed.
Usage examples:
  Getting a list of every service:
    $ mobmoncli GetServiceList

  Getting every service status:
    $ mobmoncli GetStatus

  Getting a particular service status:
    $ mobmoncli GetStatus -s myservice

  Repairing a service:
    $ mobmoncli RepairService -s myservice -a myaction

  Passing arguments to a repair action:
    $ mobmoncli RepairService -s myservice -a myotheraction -i 1,2,a=3

    The inputs are a comma-separated list. Each item in the list may or
    may not be equal-sign separated. If an item is equal-sign separated,
    it is treated as a keyword argument; otherwise it is treated as a
    positional argument to the repair function.
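The rule for splitting the inputs into positional and keyword arguments can be
sketched as follows. The ParseInputs function is hypothetical and is not taken
from mobmoncli. It keeps every value as a string, which is an assumption, since
any type conversion done by the real script is not described here.

```python
def ParseInputs(inputs):
  """Splits a string such as '1,2,a=3' into (args, kwargs).

  Hypothetical helper: values are kept as strings.
  """
  args = []
  kwargs = {}
  for item in inputs.split(','):
    if not item:
      continue  # Ignore empty items, e.g. from an empty input string.
    if '=' in item:
      # Equal-sign separated items become keyword arguments.
      key, value = item.split('=', 1)
      kwargs[key] = value
    else:
      args.append(item)
  return args, kwargs


args, kwargs = ParseInputs('1,2,a=3')
print(args, kwargs)
```

With the input from the example above, the two positional items end up in
|args| and the 'a=3' item ends up in |kwargs|.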