
= Google Cloud AI =

Google has developed a range of artificial intelligence (AI) products on its Google Cloud Platform (GCP). The two main application programming interfaces (APIs) found on GCP are Cloud Vision and Cloud Natural Language. These APIs offer pre-trained machine learning models; the former was released in May 2017 and the latter in November 2016. In 2018, Google released AutoML, giving users the ability to create custom models.

Cloud Vision
Cloud Vision serves as an image recognition tool. It relies on deep learning, specifically a trained convolutional neural network (CNN), an architecture built to replicate the way humans perceive their surroundings. A CNN is composed of three types of layers: convolutional layers, pooling layers and fully connected layers. The main layers, which are also the first few layers, are the convolutional layers. They are made up of filters (small grids of weights applied to small groups of pixels) and feature maps (the outputs the filters produce), and they detect low-level patterns such as rough edges and curves in an input image. As the network performs more convolutions, it can start to identify specific objects. Finally, in the last few layers, the fully connected layers act as classifiers, converting the feature maps into probabilities. A newly created CNN has only random filter values to work with, but by using an error (loss) function it can measure how close its prediction was to the actual label. With each training iteration it updates its filters' values and becomes progressively more accurate.
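
For illustration, the following is a minimal sketch of such a layered CNN using TensorFlow's Keras API; the layer sizes, image dimensions and class count are arbitrary example choices, not details of Cloud Vision's own models.

<syntaxhighlight lang="python">
import tensorflow as tf

# A small CNN: convolutional layers detect local patterns (edges, curves),
# pooling layers shrink the feature maps, and fully connected layers turn
# the extracted features into class probabilities.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # classifier over 10 example classes
])

# The loss function measures how far predictions are from the true labels;
# each training iteration updates the filter values to reduce that error.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
</syntaxhighlight>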

Face
The API can detect facial expressions such as joy, sorrow, anger and surprise; whether the face is blurred or underexposed; and whether the person is wearing headwear. Other attributes it takes into consideration are the roll, pan and tilt angles, which represent clockwise/anti-clockwise rotation, leftward/rightward pointing and upward/downward pointing of the face respectively. This detection method also outputs four vertices that form a rectangle around the face, as well as the positions of facial landmarks (the eyes, nose, lips, ears and more) using x, y and z (depth) coordinates.
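
A minimal sketch of calling this method with the official Python client library (google-cloud-vision) might look as follows; the file name is a placeholder and the exact fields available depend on the client library version.

<syntaxhighlight lang="python">
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("portrait.jpg", "rb") as f:        # placeholder file name
    image = vision.Image(content=f.read())

response = client.face_detection(image=image)
for face in response.face_annotations:
    # expression and image-quality likelihoods
    print(face.joy_likelihood, face.sorrow_likelihood,
          face.anger_likelihood, face.surprise_likelihood)
    print(face.blurred_likelihood, face.under_exposed_likelihood,
          face.headwear_likelihood)
    # roll, pan and tilt angles of the face
    print(face.roll_angle, face.pan_angle, face.tilt_angle)
    # rectangle around the face
    print([(v.x, v.y) for v in face.bounding_poly.vertices])
    # facial landmarks (eyes, nose, lips, ears, ...) with depth
    for landmark in face.landmarks:
        print(landmark.type_, landmark.position.x,
              landmark.position.y, landmark.position.z)
</syntaxhighlight>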

Label
The label detection method provides information on the content of the image and how prominently each detected item is featured, condensed into confidence and topicality scores that range from 0 to 1.
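
A brief, self-contained sketch of label detection with the same Python client (the file name is a placeholder):

<syntaxhighlight lang="python">
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("photo.jpg", "rb") as f:           # placeholder file name
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)
for label in response.label_annotations:
    # description of the content plus scores between 0 and 1
    print(label.description, label.score, label.topicality)
</syntaxhighlight>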

Text
The text detection method, also known as optical character recognition (OCR), identifies text and frames it within four vertices. It can detect a string of text and its individual components in an image or document, and it also works on handwriting.
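
A sketch of text detection with the Python client (placeholder file name); for dense or handwritten text, the related document_text_detection method is generally suggested instead.

<syntaxhighlight lang="python">
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("receipt.jpg", "rb") as f:         # placeholder file name
    image = vision.Image(content=f.read())

response = client.text_detection(image=image)
annotations = response.text_annotations
if annotations:
    print(annotations[0].description)        # the full detected string
    for word in annotations[1:]:             # individual components
        print(word.description,
              [(v.x, v.y) for v in word.bounding_poly.vertices])  # four framing vertices
</syntaxhighlight>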

Web
Web detection compares the image against content found on other websites. The results include descriptions of the image's content as well as URLs of fully matching, partially matching and visually similar images.
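
A sketch of web detection with the Python client (placeholder file name):

<syntaxhighlight lang="python">
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("photo.jpg", "rb") as f:           # placeholder file name
    image = vision.Image(content=f.read())

web = client.web_detection(image=image).web_detection
for entity in web.web_entities:              # descriptions of the image's content
    print(entity.description, entity.score)
for match in web.full_matching_images:       # URLs of fully matching images
    print(match.url)
for match in web.partial_matching_images:    # URLs of partially matching images
    print(match.url)
for similar in web.visually_similar_images:  # URLs of visually similar images
    print(similar.url)
</syntaxhighlight>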

Landmark
For landmark detection, the response contains a box formed by four vertices (with x and y values), the latitude and longitude of the structure, and a description accompanied by a confidence score from 0 to 1. It can be used to detect both man-made and natural landmarks.
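
A sketch of landmark detection with the Python client (placeholder file name):

<syntaxhighlight lang="python">
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("monument.jpg", "rb") as f:        # placeholder file name
    image = vision.Image(content=f.read())

response = client.landmark_detection(image=image)
for landmark in response.landmark_annotations:
    print(landmark.description, landmark.score)                 # name and 0-1 confidence
    print([(v.x, v.y) for v in landmark.bounding_poly.vertices])  # bounding box
    for location in landmark.locations:                         # geographic coordinates
        print(location.lat_lng.latitude, location.lat_lng.longitude)
</syntaxhighlight>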

Safe Search
The SafeSearch feature allows the user to have control over the display of explicit content. It is made up of five categories: adult, spoof, medical, violence, and racy. Adult and racy usually refer to content of a sexual nature, with spoof involving instances of humour or causing offence. Each category is coupled with a likelihood rating on a spectrum from unknown through very unlikely, unlikely, possible and likely to very likely.
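
A sketch of SafeSearch detection with the Python client (placeholder file name):

<syntaxhighlight lang="python">
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("photo.jpg", "rb") as f:           # placeholder file name
    image = vision.Image(content=f.read())

safe = client.safe_search_detection(image=image).safe_search_annotation
# Each field is a likelihood value: UNKNOWN, VERY_UNLIKELY, UNLIKELY,
# POSSIBLE, LIKELY or VERY_LIKELY.
print(safe.adult, safe.spoof, safe.medical, safe.violence, safe.racy)
</syntaxhighlight>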

Cost
When deciding whether to use the Cloud Vision API, one must take cost into consideration. There are many providers in this market, the big six being Algorithmia, Amazon Rekognition, Clarifai, IBM Watson Visual Recognition, Google Cloud Vision and Microsoft Cognitive Services. As of August 2019, Google was the second most expensive at $3,886 when judged by the "cost of 30 days of API calls at 1 query per second" (about 2.6 million requests per month); the most expensive was IBM at $5,169 and the least expensive were Microsoft and Amazon, tied at $2,090. This amount of usage resembles the requirements of a large corporation.

Accuracy
In order to establish the most accurate provider, Google Cloud Vision's performance on detecting text in a 63,686-image set of daily life scenes, called COCO-Text (part of Microsoft Common Objects in Context, MS COCO), was compared with Amazon Rekognition and Microsoft Cognitive Services in August 2019. It placed second with 49.55%, ahead of Amazon with 48.91% but behind Microsoft with 59.43%.

In a study by Hossein Hosseini, Baicen Xiao and Radha Poovendran in July 2017, an average impulse noise density of 14.25% was needed to deceive the Cloud Vision API and cause it to output a label that differed from the correct label (i.e. the label for the original image), with a 100% success rate occurring when 35% impulse noise was added. This was applied to 100 natural images obtained from the ImageNet dataset. When impulse noise was applied to sample images from the Face94 dataset, an average density of 23.8% was enough to prevent the API from detecting a face.
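
The perturbation used in the study can be illustrated with a short, generic function that adds impulse (salt-and-pepper) noise at a chosen density; this is a sketch of the technique, not the authors' code.

<syntaxhighlight lang="python">
import numpy as np

def add_impulse_noise(image, density, seed=None):
    """Set a fraction `density` of pixels to pure black or white (impulse noise)."""
    rng = np.random.default_rng(seed)
    noisy = image.copy()
    corrupted = rng.random(image.shape[:2]) < density  # which pixels to corrupt
    salt = rng.random(image.shape[:2]) < 0.5            # half become white, half black
    noisy[corrupted & salt] = 255
    noisy[corrupted & ~salt] = 0
    return noisy

# e.g. noisy = add_impulse_noise(original, density=0.1425) for ~14.25% noise
</syntaxhighlight>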

Biases
Several experiments have exposed biases in Cloud Vision. One, done by AlgorithmWatch, involved feeding the API an image of a dark-skinned person and an image of a light-skinned person, both holding a temperature gun. For the former, the label returned for the object was "gun", while for the latter, the object was labelled an "electronic device". This issue of racial discrimination is also evident in a study done in April and May 2017 by Joy Buolamwini of the MIT Media Lab, who discovered that, industry-wide, the worst-detected group was darker-skinned females and the best-detected group was lighter-skinned males.

Another study, conducted by Facebook AI Research in February 2019, ran image-tagging APIs (one being Google Cloud Vision) on the Dollar Street dataset, which contains images of household items from 264 homes in over 50 countries, and found both geographical/cultural and income-related disparities. In the case of geography, non-Western countries fared worse than Western countries (e.g. the accuracy of detection of items from the United States was 15% higher than for items from Somalia or Burkina Faso). As for income, when comparing the lowest income bracket of US$50 per month to the highest income bracket of US$3,500 per month, there was a difference of 10% in accuracy in favour of the richer households. The underlying causes of these two biases are that the geographical sampling of image datasets (e.g. ImageNet, COCO, and OpenImages) does not reflect the distribution of the world population, especially with regard to Africa, India, China and Southeast Asia, and that English, being the base language used for image collection, does not take into consideration other languages and the nuanced meanings of some of their words.

Uses
A real-world application for this API is as assistive technology (AT) for people with disabilities, in particular the visually impaired and blind. In a 2018 study, Arsenio Reis, Dennis Paulino, Vitor Filipe, and Joao Barroso tested Google Cloud Vision to see how it could be used in the daily lives of these users. The first part of the experiment was performed by non-visually impaired researchers, who conducted two tests: text recognition and object recognition. The former used pages of the Bible as test cases, while the latter was done on daily life items such as a chair, a table, an apple, a tree, bread, a car, ... and a dog. Google scored a relative effectiveness of 86.5% on the first test, as opposed to Microsoft Cognitive Services, which scored 77.4%. The ranking was flipped for the second test, which was judged on relative effectiveness and confidence degree: Microsoft scored 92.5% and 74%, and Google 66% and 69%, respectively. The second part was carried out by four blind participants. The tests were designed in collaboration with the participants and covered actions they found useful. Among the tasks assessed were (1) facing a door 2 m away and taking a photo, and (5) locating a computer inside a large meeting room. Participants performed poorly on these two tasks with both systems; they did better on (2) looking into a meeting room, (3) looking at a person, and (4) looking for a computer on a desk. In conclusion, the study rated both systems as promising and limited only by high latency and the user interface.

Cloud Natural Language
This API performs analysis on text, breaking it down by sentiment, entities, content categories and syntax.

Sentiment
Sentiment analysis judges the prevailing emotion the text is written in and how strongly it is expressed. The former is summarised into a score that ranges from 1.0, standing for ‘positive’, to -1.0, representing ‘negative’, with the middle ground, 0, indicating a neutral or mixed tone. The latter is known as ‘magnitude’ and is also given a score, with 0 being the lowest; it is used to differentiate between neutral text, which has a magnitude near 0, and mixed text, which has a magnitude of approximately 3.0 to 4.0.
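
A minimal sketch of sentiment analysis with the official Python client library (google-cloud-language); the sample sentence is invented for illustration.

<syntaxhighlight lang="python">
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The food was wonderful but the service was painfully slow.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(sentiment.score)      # -1.0 (negative) to 1.0 (positive), ~0 for neutral or mixed
print(sentiment.magnitude)  # total strength of emotion, from 0 upwards
</syntaxhighlight>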

Entity
Entities refer to people, places (e.g. restaurants, landmarks), things and events. Entity analysis has four parameters: type, metadata, salience and mentions. Types include person and location, while mentions are classified as proper or common nouns. Under the umbrella term metadata exist a Wikipedia URL, if available, and a machine-generated identifier (MID), which is unique to every object. With regards to salience, a score of 1 corresponds to a main theme, while a score closer to 0 corresponds to a minor theme. Finally, mentions lists the places in the text where the entity appears.
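
A sketch of entity analysis with the same Python client; the sample sentence is invented.

<syntaxhighlight lang="python">
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The Eiffel Tower was visited by Emmanuel Macron.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

response = client.analyze_entities(request={"document": document})
for entity in response.entities:
    print(entity.name, entity.type_)                 # e.g. LOCATION, PERSON
    print(entity.salience)                           # 1 = main theme, near 0 = minor
    print(entity.metadata.get("wikipedia_url", ""),  # Wikipedia URL, if available
          entity.metadata.get("mid", ""))            # machine-generated identifier
    for mention in entity.mentions:                  # occurrences in the text
        print(mention.text.content, mention.type_)   # PROPER or COMMON
</syntaxhighlight>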

Categories
The API classifies text against a database of over 700 categories, which is invoked when more than 20 words are inputted. It may return more than one category, with only the most specific ones chosen. Each result is accompanied by a confidence score, from 0 to 1, measuring how sure the model was.
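
A sketch of content classification with the Python client; the sample text is invented and merely long enough to pass the 20-word minimum, and the category shown in the comment is only indicative.

<syntaxhighlight lang="python">
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
text = ("The midfielder curled a free kick into the top corner in stoppage time, "
        "sealing a dramatic win in the cup final.")  # more than 20 words
document = language_v1.Document(content=text,
                                type_=language_v1.Document.Type.PLAIN_TEXT)

response = client.classify_text(request={"document": document})
for category in response.categories:
    print(category.name, category.confidence)  # e.g. a "/Sports/..." category, score 0-1
</syntaxhighlight>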

Syntax
The syntactic analysis is split into five parts: dependency, parse label, part of speech, lemma and morphology. The dependency of a word (token) describes where it fits relative to the other tokens in the sentence, and a tree can be drawn of these dependencies. The dependency information also includes a “headTokenIndex”, identifying the word with which it has the strongest association, and the parse “label”, which is the function of the word in the sentence (e.g. ‘root’ marks the main verb). Part of speech includes sub-categories such as ‘tag’ (noun, verb, conjunction), ‘tense’ (present, past, future) and ‘person’ (first, second, third), and acts as the source of morphological information. Finally, the lemma gives the root of a word; for example, the lemma of ‘done’ and ‘does’ is ‘do’.
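
A sketch of syntactic analysis with the Python client, using an invented sentence:

<syntaxhighlight lang="python">
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="She has done the work.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

response = client.analyze_syntax(request={"document": document})
for token in response.tokens:
    print(token.text.content,
          token.part_of_speech.tag,              # e.g. NOUN, VERB, CONJ
          token.part_of_speech.tense,            # present, past, future
          token.part_of_speech.person,           # first, second, third
          token.lemma,                           # e.g. "done" -> "do"
          token.dependency_edge.head_token_index,
          token.dependency_edge.label)           # e.g. ROOT for the main verb
</syntaxhighlight>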

AutoML Vision
The first step in creating a custom model is to enable AutoML Vision on GCP. The next step is to prepare the training dataset; it is recommended to upload around 100 images for each label. The model sets aside approximately 10% of the examples to be used as a test sample. AutoML Vision also gives the user the option to select how much computation time is spent on training.
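
One common way to prepare such a dataset is a CSV manifest that pairs Cloud Storage image paths with labels; the bucket, file names and labels below are hypothetical, and the exact manifest format should be checked against the AutoML Vision documentation.

<syntaxhighlight lang="python">
import csv

# Hypothetical training examples: (Cloud Storage path, label)
examples = [
    ("gs://my-bucket/cats/cat_001.jpg", "cat"),
    ("gs://my-bucket/dogs/dog_001.jpg", "dog"),
    # ... ideally around 100 images per label
]

# Write a simple CSV manifest that can be imported as a training dataset.
with open("training_manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for path, label in examples:
        writer.writerow([path, label])
</syntaxhighlight>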

AutoML Natural Language (NL)
AutoML NL is likewise found on GCP. Its preparation requires uploading a dataset in either .txt or .csv format. As a reference point, a dataset containing 55,327 text items, categorised as either ‘toxic’ or ‘clean’, takes approximately 4 hours and 45 minutes to train.