User:Sunfinite/sandbox

The load monitor tool aims to be a simple interface for all load-related activities on edge machines, namely starting load with specified objects, keeping it at a specified flit level, and stopping load. It also provides consolidated MapNOCC graphs for the machine under load in one place and sends email notifications at user-specified intervals. The monitor framework ensures that the machines running the load are not overloaded, and can split a single load across machines if needed. The monitoring itself is implemented using the load-balancing features of monkey, the load generation tool in use, along with some external features in the framework that work around monkey's limitations. The basic features of the monitoring framework have also been used to implement a roll monitor and can easily be extended to run automated tests.

Problem

 * In the FreeFlow SQA network, the machines designated for running load on ghosts would often become overloaded and unresponsive. Besides affecting the load started from these machines, this impacted other tests too, because the same machines doubled up as our origin servers. This required frequent reboots and manual restarts of all the tasks.
 * Load started using monkey would not scale up to the required level, or would overshoot it, necessitating much tweaking of the load parameters and/or starting new monkey processes.

Phase 1 Solution

 * 1) A simple program, run as a cron job from a local machine, that read its configuration from a text file and could start, increase, decrease, restart and stop load as needed, and send emails.
 * 2) The program was written in Perl using library functions from akatest.

Problems with phase 1 solution

 * Only one load could be started and monitored.
 * A load could have only one runner and one receiver.
 * Very crude logic for increasing and decreasing load:
 * Increase: Redo what was done to start the load initially.
 * Decrease: Kill 1/3 of the load processes (arrived at using the complex mathematical process of random guessing).

Start Load
Load is started via the start load form on the home page: Test Server Form Fields. (Note that even for IPv6 loads, the receiver IP has to be IPv4; the IPv6 address is fetched before starting the load.)
 * Receiver IP: IPv4 address of the machine on which load should be pumped.
 * Type: Address type (IPv4 or IPv6) to be used as the target for the load processes.
 * Monkey profile file: To stay backward compatible with how load tasks are currently handled in akatest, the framework takes XML files as input. Users can either upload new files or select an existing file from the dropdown.


 * A sample profile file:
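The sample file itself did not survive on this page. A minimal illustrative profile might look like the following; the hosts, paths and the wrapper element names are placeholders, not real test objects, while the action, url and percentage tags are the ones described below:

```xml
<!-- Hypothetical profile: two objects mixed 70/30.
     Hosts/paths and wrapper tag names are illustrative only. -->
<profile>
  <load>
    <action>GET</action>
    <url>http://origin.example.com/objects/10k.txt</url>
    <percentage>70</percentage>
  </load>
  <load>
    <action>GET</action>
    <url>http://origin.example.com/objects/1m.bin</url>
    <percentage>30</percentage>
  </load>
</profile>
```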


 * The action tag is for the monkey request specifiers listed here, and the url tag is for the corresponding load object.
 * Any other monkey option can be specified as a tag in the file but will be ignored by the framework.
 * A new optional tag called percentage (not a monkey option) can be used to specify how much of the eventual load should be constituted by the respective object. Internally, the framework produces a file compatible with monkey's composite option.

 * Target flit percentage: The framework will try to keep the target machine at the specified flit percentage.
 * Monitor interval: The framework will monitor the load after every interval and perform the necessary tasks (increase, split, etc.).
 * Email ids: Comma-separated list of email ids to which notifications/summaries will be sent.
 * Email interval: The framework will send a load summary after every interval. The summary includes the current runner machines, errors since the last summary and MapNOCC graphs.

Emails are also sent to notify the start and stop of a load.

After the form is submitted, an email will be sent once the load is successfully started, stating the IP of the initial runner.

See: Start load implementation

Monitor load

 * This can be done either via the graph page or the log page for the load.
 * The graph page lists all the graphs for the target machine found in the profiler page in MapNOCC, along with a new graph for percentage flit load.
 * Each graph can be modified and re-fetched individually (i.e. period, filter by component and timezone), which cannot be done in the profiler page in MapNOCC as of now. The profiler page also does not allow filtering by component, so the graph page can be considered a consolidation of multiple ip-by-process pages.

See: MapNOCC implementation

The log page lists all the tasks that were performed during each monitoring run for the load, including the commands that were executed and the errors faced. Socket usage and CPU usage on the runner, as well as the average requests per second, the maximum number of open connections and the current flit load for the load on a given runner in the interval between the last run and now, are also written to the log.

Stop Load
The stop load button can be accessed via the graph page. It can also be accessed via the intermediate page displayed immediately after the load form is submitted; this way a load can be stopped even before it has actually been started.

See: Stop load implementation

Create a roll monitor object

 * Roll monitor homepage

Form fields
 * Machines: Comma-separated list of ips which should be checked for rolls. The default value all performs a check on all machines in the network.
 * Exclude Machines: Comma-separated list of ips which should be excluded from checks if all is specified in Machines.
 * Services: Comma-separated list of components which should be checked for rolls on Machines. The default value all performs a check for all components on Machines.
 * Exclude Services: Comma-separated list of components to be excluded from checks.
 * Email ids: Comma-separated list of ids to which roll emails will be sent.
 * Interval: The framework will check for rolls after every interval and send an email only if there were rolls in the preceding interval.

See: Roll monitor implementation

Current Implementation

 * Flowcharts would be a better way of presenting this section.

Choosing a runner

 * 1) Fetch the list of load machines by performing a dig against origin-ff.qa.akamai.com and excluding machines already in the runner list.
 * 2) Choose a random ip from the list and check its health.
 * 3) If socket usage is less than 70% and CPU usage is less than 50%, return the runner as stable for starting load.
 * 4) If not healthy, remove the ip from the list and redo 2.
 * 5) Repeat 2, 3 and 4 for a specified number of retries and return an error after that.
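The steps above can be sketched as follows. This is a minimal illustration, not the actual implementation; the function and parameter names are hypothetical, and `get_stats` stands in for whatever the framework uses to fetch socket/CPU usage:

```python
import random

# Thresholds from the runner health check described above.
SOCKET_LIMIT = 70  # percent
CPU_LIMIT = 50     # percent


def is_healthy(stats):
    """A runner is considered stable if both usages are under the limits."""
    return stats["socket"] < SOCKET_LIMIT and stats["cpu"] < CPU_LIMIT


def choose_runner(candidates, get_stats, retries=5):
    """Pick a random healthy runner; raise after the retry budget is spent.

    candidates: candidate IPs (machines already in the runner list excluded).
    get_stats:  callable returning {"socket": pct, "cpu": pct} for an IP.
    """
    pool = list(candidates)
    for _ in range(retries):
        if not pool:
            break
        ip = random.choice(pool)
        if is_healthy(get_stats(ip)):
            return ip
        pool.remove(ip)  # unhealthy: drop it and try another
    raise RuntimeError("no stable runner found")
```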

See: Runner balancing TODO

Start Load

 * 1) Check the health of the runner. This can be disabled in cases where choosing a runner has just performed the check.
 * 2) If the check is disabled, assume a new runner and move monkey's composite file and output parse script to the runner. The composite file is generated and written to disk when an xml profile file is first uploaded. The output parse script is used to fetch the average values from monkey's output during check load.
 * 3) If the check is enabled, check the health of each runner in the runner list, breaking when a runner with less than 70% socket usage and 50% CPU usage is found.
 * 4) If no runner was found in the previous step, choose a new runner and start load on it with the check disabled.
 * 5) If a runner was found, start load and redirect output to a unique file marked by the current timestamp and load id.
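The runner-selection part of this flow (scanning the existing list, falling back to a fresh runner whose health was just checked) can be sketched as below. Names are hypothetical; `get_stats` and `choose_new_runner` stand in for the framework's own helpers:

```python
# Thresholds from the health check in steps 1 and 3.
SOCKET_LIMIT = 70  # percent
CPU_LIMIT = 50     # percent


def pick_start_runner(runner_list, get_stats, choose_new_runner):
    """Scan the existing runner list for a healthy runner; if none
    qualifies, fall back to choosing a brand-new runner (whose health
    was already verified during selection, so no re-check is needed).

    Returns (ip, is_new_runner).
    """
    for ip in runner_list:
        stats = get_stats(ip)
        if stats["socket"] < SOCKET_LIMIT and stats["cpu"] < CPU_LIMIT:
            return ip, False          # healthy existing runner found
    return choose_new_runner(), True  # fall back to a new runner
```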

See: Load balancing TODO

Check Load

 * 1) Get the process list for load processes on a runner.
 * 2) If the process list is not found, stop load on that runner, choose a new runner and start load. If the process list was not found due to a connection error, the runner is still assumed to be running load and hence is not removed from the runner list.
 * 3) If socket usage is more than 90% or CPU usage is more than 70%, split the load.
 * 4) Get monkey output for the load on the runner. This involves running the python script which reads all temporary output files for this load, parses them and truncates them. It returns the average number of requests per second (per), the average number of simultaneous open connections (max) and the average flit percentage (flit) during the interval between the previous run and the current run.
 * 5) If the parse was not successful, stop load on the runner.
 * 6) If the current average flit percentage is less than the target (with tolerance, see TODO), compute the new per, max and number of monkey processes to be started as stated below.
 * 7) Start num_monkey new loads with the runner health check enabled.

(Note: All this computation is required for increasing load because monkey's load-balancing feature computes new_per based on completed requests/second (i.e. per); however, that new value will never be reached if per is already the maximum rate being completed. This requires new monkey processes to be started. Also, the load-balancing feature does not change the value of max, which can render any per value ineffectual if too low, and can lead to excessive socket usage and crashes if too high. The setup should, however, be enough for decreasing load in case of an overload.)
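The exact scale-up formulas are not recorded on this page, so the sketch below is purely an assumption: it scales proportionally to the flit shortfall, spreads the required rate across extra monkey processes (since, per the note above, a single process cannot exceed its current completed rate), and splits per evenly:

```python
import math


def plan_increase(flit_now, flit_target, per_now):
    """Hypothetical proportional scale-up (formulas are an assumption).

    Since monkey's own balancing cannot push an existing process above
    its current completed rate, extra processes carry the increase.
    Returns (per_per_process, num_new_processes).
    """
    scale = flit_target / flit_now          # e.g. 30% -> 60% gives 2.0
    total_per = per_now * scale             # overall requests/sec needed
    num_new = max(0, math.ceil(scale) - 1)  # extra monkey processes
    per_each = total_per / (num_new + 1)    # split evenly across all
    return per_each, num_new
```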

See: Load balancing TODO

Split Load

 * 1) Compute the number of processes to be killed based on the excess and per_process_socket_usage (assuming usage to be equally divided across processes, which is not strictly true).
 * 2) For each process killed, start load with the per and max of the killed process, with the check enabled.

See: Split load TODO

Stop Load

 * On a specific runner:
 * 1) Kill all load processes.
 * 2) Remove temporary output files and composite files.
 * 3) If 1 and 2 were successful, remove the runner from the runner list.
 * For an entire load:
 * 1) When the stop button is pressed, the id for the load is written to a separate file. (This seemed the easiest way out of race conditions.)
 * 2) The monitor program reads this file and truncates it every time it runs.
 * 3) Repeat the per-runner steps for each runner in the runner list.
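The stop-file handoff between the stop button and the monitor can be sketched as follows; the file name and one-id-per-line format are placeholders, not the framework's actual conventions:

```python
import os


def request_stop(load_id, stop_file="stop_requests.txt"):
    """Web-side: append the load id to the stop file. Appending instead
    of sharing state avoids racing with the monitor cron."""
    with open(stop_file, "a") as f:
        f.write(f"{load_id}\n")


def drain_stop_requests(stop_file="stop_requests.txt"):
    """Monitor-side: read every pending id and truncate the file, once
    per monitor run."""
    if not os.path.exists(stop_file):
        return []
    with open(stop_file) as f:
        ids = [line.strip() for line in f if line.strip()]
    open(stop_file, "w").close()  # truncate for the next run
    return ids
```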

Load Monitor

 * 1) Fetch all currently running load objects.
 * 2) For each object, see if the current time exceeds the last run time plus the monitor interval.
 * 3) If yes, perform the check.
 * 4) If this is the first run for an object, set the IPv6 address if necessary and fetch the MapNOCC graphs. If the load was started successfully, send an email.
 * 5) If the current time exceeds the email interval, build and send a summary.
 * (The monitor is currently run every 30 seconds.)
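One pass of that loop can be sketched like this. The dict keys and the two callables are hypothetical stand-ins for the real per-load state and the check/summary steps:

```python
def monitor_tick(loads, now, check_load, send_summary):
    """One pass of the monitor cron (the page notes it runs every 30s).

    loads:        dicts tracking last_run/last_email times per load
    check_load:   stand-in for the Check Load procedure above
    send_summary: stand-in for the Email Summary procedure below
    """
    for load in loads:
        if now >= load["last_run"] + load["monitor_interval"]:
            check_load(load)
            load["last_run"] = now
        if now >= load["last_email"] + load["email_interval"]:
            send_summary(load)
            load["last_email"] = now
```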

See: Monitor TODO

MapNOCC

 * 1) Fetch all graph images from the profiler page for a machine and write them to a file after setting height and width and re-arranging them.
 * 2) When rendering the load page, build a json object out of every image string. This enables easy modification via javascript.
 * 3) Build and set the image source url with the new values via javascript when each re-fetch is requested.

See: MapNOCC TODO

Email Summary

 * 1) Parse the log since the last email summary.
 * 2) For each error, count the number of occurrences and note the times.
 * 3) Read the image source strings from the graph file.
 * 4) Add the runner list and send the email.
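The error-counting in step 2 amounts to a frequency count over the parsed log. The actual log format is not documented here, so matching lines on an "ERROR" marker is purely an assumption for illustration:

```python
from collections import Counter


def summarize_errors(log_lines):
    """Count occurrences of each error line seen since the last summary.
    The 'ERROR' marker is an assumed convention, not the real format."""
    return Counter(line for line in log_lines if "ERROR" in line)
```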

See: Email TODO

Check Rolls

 * 1) Build an sql2 query, run it against the agg and return the output.

Roll Monitor

 * Similar to load monitor

General

 * Move to dedicated server
 * Pubcookie integration

Balancing

 * How on earth were the threshold values arrived at?
 * Move away from monkey by adapting its features to the existing phase 1 framework.