Classification of Traffic Vehicle Density Using Deep Learning

The volume density of vehicles is a problem that occurs in every city, and its impact is congestion. Classification of vehicle density levels on particular roads is required because there are at least 7 vehicle density level conditions. Monitoring by the police, the Department of Transportation, and road operators currently uses video-based surveillance such as CCTV, which is still monitored manually by people. Deep learning is an approach to artificial-neural-network-based machine learning that has been actively developed and researched recently because it has delivered good results in solving various soft-computing problems. This research uses the convolutional neural network architecture and varies its supporting parameters to calibrate for maximum accuracy. After the parameter experiments, the classification model was tested using K-fold cross-validation, a confusion matrix, and model testing with test data. K-fold cross-validation with K (fold) = 5 gave an average accuracy of 92.83%; in model testing with 100 test samples, the model predicted or classified 81 correctly.

Keywords—Complexity, Vehicle density, Deep learning, Classification, Convolutional neural network.

ISSN (print): 1978-1520, ISSN (online): 2460-7258. IJCCS Vol. 14, No. 1, January 2020: 69–80


INTRODUCTION
The level of vehicle density is an issue that occurs in every city, and its impact is congestion. The losses from congestion, when quantified in monetary units, are very large: travel times become longer, vehicle operating costs grow, and vehicle pollution increases.
Deep learning introduces the Convolutional Neural Network (CNN) method, which has excellent performance in pattern recognition and image classification. A convolutional neural network is designed to resemble the function of the human brain: the computer is given image data to learn from and is trained to recognize every visual element in the imagery and understand its patterns until it can identify the imagery. The convolutional neural network is an image-recognition method that has contributed greatly to the development of computer technology.
One deep learning implementation used the Convolutional Neural Network (CNN) architecture to classify fingerprint images from 80 fingerprint image samples. The training process used 24 × 24 pixel data and compared different numbers of epochs and learning rates, showing that the larger the number of epochs and the smaller the learning rate, the better the training accuracy obtained. In that study, training accuracy reached 100% [1]. In the research just mentioned, with accuracy results as high as 100%, the authors see a deficiency, namely overfitting. Therefore, the authors propose a classification of vehicle density levels using deep learning, specifically the convolutional neural network architecture; to avoid the overfitting that occurred in that research, this research uses cross-validation and also dropout techniques.

Traffic Behavior
Traffic behavior states a quantitative measure describing the conditions assessed by the road builder. Road traffic behaviors include capacity, travel time, and average travel speed. Capacity is defined as the maximum flow through a point on the road that can be maintained per hour under certain conditions. For two-way two-lane roads, capacity is determined for the bidirectional flow (both directions combined), but for roads with multiple lanes, flows are separated per direction and capacity is determined per lane [2]. The basic equation for determining capacity is equation (1):

C = C0 × FCW × FCSP × FCSF × FCCS (1)

Where:
C = capacity (SMP/hour, passenger car units per hour)
C0 = basic capacity (SMP/hour)
FCW = road width adjustment factor
FCSP = directional separation adjustment factor (for undivided roads only)
FCSF = adjustment factor for side friction and road shoulders
FCCS = city size adjustment factor

Service level is a measure of the performance of roads or intersections, calculated based on the level of road usage, speed, density, and obstacles occurring. Road service level calculations are based on the comparison of traffic volume with road capacity, known as the V/C ratio. The analysis of traffic density is based on the road service level calculation: the better the level of road service on a road, the lower the traffic density.
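Equation (1) is a straight product of the basic capacity and its adjustment factors, so it can be sketched in a few lines. The factor values below are illustrative only, not taken from the paper:

```python
def road_capacity(c0, fc_w, fc_sp, fc_sf, fc_cs):
    """Equation (1): C = C0 * FCw * FCsp * FCsf * FCcs, in SMP/hour."""
    return c0 * fc_w * fc_sp * fc_sf * fc_cs

# Illustrative values (not from the paper): base capacity 2900 SMP/hour
# with side-friction and city-size adjustments applied.
c = road_capacity(2900, fc_w=1.0, fc_sp=1.0, fc_sf=0.94, fc_cs=0.90)
```

The resulting capacity C is the denominator of the V/C ratio used later to grade the service level.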

1.1 CCTV
Closed Circuit Television (CCTV) is a recording tool that uses one or more video cameras and generates video or audio data. CCTV has the benefit of recording all activities remotely without distance limitations; it can monitor and record all forms of activity occurring at the observation site using a laptop or PC in real time from anywhere, and it can record every incident 24 hours a day, or record only when movement occurs in the monitored area [3].

1.2 Image
An image is a spatial representation of an actual object in a two-dimensional field, usually written in x–y Cartesian coordinates, where each coordinate represents one smallest signal of the object [4]. A digital image is a two-dimensional function, f(x, y), of light intensity, where x and y are spatial coordinates and the value of the function at each point (x, y) is the grey level of the image at that point. A digital image is represented by a matrix in which the rows and columns identify a point in the image and each matrix element (referred to as an image element or pixel) states the grey level at that point, as in equation (2):

f(x, y) = | f(0, 0)     f(0, 1)     …  f(0, N−1)   |
          | f(1, 0)     f(1, 1)     …  f(1, N−1)   |
          | …                                      |
          | f(M−1, 0)   f(M−1, 1)   …  f(M−1, N−1) | (2)

2 Image Preprocessing
Image preprocessing is done to prepare a better image for feature extraction needs or classification needs. Image preprocessing has several commonly used techniques such as eliminating noise with Gaussian blur, segmentation and resizing.

2.1 RGB Color Conversion to Grayscale
Converting the image color from red, green, and blue (RGB) to grayscale makes digital image processing easier. Common methods for converting an image from RGB to grayscale are the average-value method and the luminance-value method. The average method is the simplest way to change the image color space from RGB to grayscale, and can be computed with equation (3) [5]:

Grayscale = (R + G + B) / 3 (3)
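The average method of equation (3) can be sketched directly with NumPy, averaging the three color channels per pixel:

```python
import numpy as np

def rgb_to_gray_average(img):
    """Average method from equation (3): gray = (R + G + B) / 3."""
    img = np.asarray(img, dtype=np.float64)
    return img.mean(axis=-1)

pixel = np.array([[[90, 120, 150]]])   # a single RGB pixel
gray = rgb_to_gray_average(pixel)      # (90 + 120 + 150) / 3 = 120
```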

2.2 Cropping
Cropping is a way of selecting a certain region of interest in an image, which aims to make the image easier to analyze and to reduce the amount of irrelevant image data. In image processing, usually not the whole scene of the image is used; to obtain the desired area, the image can be cut [6]. Image cropping can be applied to spatial data as well as spectral data, and can be done based on coordinate points, numbers of pixels, or zoomed views of certain areas.
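Coordinate-based cropping amounts to array slicing when the image is held as a matrix, as in this minimal sketch:

```python
import numpy as np

def crop(img, top, left, height, width):
    """Cut a region of interest out of an image matrix by slicing."""
    return img[top:top + height, left:left + width]

frame = np.arange(100).reshape(10, 10)   # stand-in for a grayscale image
roi = crop(frame, top=2, left=3, height=4, width=5)   # 4x5 region
```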

3 Deep Learning
Deep learning is one of the techniques in machine learning that utilizes many nonlinear information-processing layers to perform feature extraction, pattern recognition, and classification. Deep learning is a problem-solving approach for computer learning systems that uses a hierarchy of concepts. The concept hierarchy allows computers to learn complex concepts by combining them from simpler concepts. If how each concept is built from other concepts is depicted as a graph, this graph will be deep, with many layers; that is the reason it is referred to as deep learning [7].

3.1 Convolutional Neural Network
The Convolutional Neural Network (CNN) is one of the deep learning algorithms; it is a development of the Multilayer Perceptron (MLP) designed to process data in two-dimensional forms, such as images or sound. A CNN is used to classify labeled data using a supervised learning method. CNNs are often used to recognize objects or scenes and to perform object detection and segmentation. A CNN learns directly from its image data, thereby eliminating manual feature extraction [8]. An overview of the CNN architecture is shown in Figure 1.

3.2 Convolution Operation
The basic operation in a CNN is the convolution operation, h(x). Convolution combines two functions: f(x), the function of the original object, and g(x), the convolution kernel function, as defined in equation (4):

h(x) = (f ∗ g)(x) = ∫ f(a) g(x − a) da (4)

The convolution operation applied to the function S(t) over a multi-dimensional array of values can be formulated as in equation (5):

S(t) = (x ∗ w)(t) = Σa x(a) w(t − a) (5)

Where:
S(t) = result of the convolution operation
x = multi-dimensional array containing the data
w = weight, commonly referred to as the filter
t = variable of the function
a = dummy variable of summation

In machine learning applications, the weights (w) are multi-dimensional arrays that are the parameters to be learned.

3.2 Activation Function
The activation function is calculated after the convolution operation. Activation functions often used in convolutional neural networks include tanh(), ReLU (Rectified Linear Unit), sigmoid, and softmax [8]. This research uses the ReLU and softmax activation functions.
1. The ReLU function outputs 0 if the input value of a neuron is negative; if the input value is positive, the output of the neuron is the input value itself. The equation of this function is shown in equation (6):

f(x) = max(0, x) (6)

2. Softmax activation is applied in the last layer of the neural network, where it is used rather than ReLU, sigmoid, or tanh(). Softmax converts the output of the last layer of the network into a probability distribution. The softmax equation is shown in equation (7):

softmax(z)i = exp(zi) / Σj exp(zj) (7)
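Both activation functions above are one-liners in NumPy. The max-subtraction in the softmax is a standard numerical-stability trick, not part of equation (7) itself:

```python
import numpy as np

def relu(x):
    """Equation (6): f(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def softmax(z):
    """Equation (7): exp(z_i) / sum_j exp(z_j), shifted for stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([1.0, 2.0, 3.0]))   # a valid probability distribution
```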

3.3 Pooling Operation
After calculating the activation function, a pooling operation is carried out to reduce the size of the matrix by means of max-pooling or average-pooling. The output of the pooling operation is a matrix with smaller dimensions than the input. The convolution and pooling processes are carried out to obtain the desired feature map, which becomes the input of the fully connected layer [8]. The pooling illustration in Figure 2 shows max-pooling.
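Max-pooling keeps only the largest value in each non-overlapping window; a minimal NumPy sketch for square windows:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Max-pooling: keep the largest value in each size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size              # drop ragged edges
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 0],
               [7, 2, 9, 1],
               [0, 8, 3, 2]])
pooled = max_pool(fm)   # each 2x2 block reduced to its maximum
```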

3.4 Stride
Stride is a parameter that determines the number of pixels the filter shifts across an image. If the stride value is 1, the convolution filter shifts by 1 pixel horizontally and vertically. The smaller the stride value, the more detailed the information the model captures from an input image, but the more computation it requires compared to a large stride [8]. A small stride value does not always produce better pixel-level detail, but it prevents pixel information from being skipped and left unused.

3.5 Padding
Padding, or zero padding, is a parameter that determines the number of pixels (containing the value 0) added to each side of the input. It is used to manipulate the output dimensions of the convolution layer (the feature map) [9]. Padding is used because the output dimensions of the convolution layer are otherwise always smaller than the input (except when using a 1×1 filter with stride 1), so information is lost as the convolution process runs. Zero padding keeps the output layer's dimensions the same as the input dimensions, or at least not drastically reduced. If the input is 5×5 and convolution is done with a 3×3 filter and a stride of 2, a 2×2 feature map is obtained; but if zero padding of 1 is added, the resulting feature map is 3×3 (more information is retained). The dimensions of a feature map can be calculated with equation (8):

Output = (V − F + 2P) / S + 1 (8)

Where:
V = input volume size
F = filter size
P = zero padding
S = stride
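Equation (8) reproduces the 5×5 example in the text directly:

```python
def feature_map_size(v, f, p, s):
    """Equation (8): output dimension = (V - F + 2P) / S + 1."""
    return (v - f + 2 * p) // s + 1

# The worked example from the text: a 5x5 input with a 3x3 filter, stride 2.
no_pad = feature_map_size(v=5, f=3, p=0, s=2)   # 2x2 feature map
padded = feature_map_size(v=5, f=3, p=1, s=2)   # 3x3 with zero padding of 1
```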

3.6 Adam Optimizer
Adam optimization was introduced by Diederik Kingma of OpenAI and Jimmy Ba of the University of Toronto in their 2015 ICLR paper entitled "Adam: A Method for Stochastic Optimization". Adam stands for Adaptive Moment Estimation [10]. The Adam optimizer is an optimization algorithm used as a replacement for the classical stochastic gradient descent procedure; it updates network weights iteratively, maintaining adaptive per-parameter step sizes rather than a single fixed learning rate. The Adam algorithm is shown in Figure 3.
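A single Adam update, following the rule from the cited paper, can be sketched in NumPy. The toy loop minimizing f(θ) = θ² is illustrative only:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update on parameters theta at step t (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

Because the step size is scaled by the moment estimates, each update moves roughly lr per step regardless of the raw gradient magnitude.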

1 Data and Pre-processing
This research uses CCTV video data obtained from DISHUB Sukoharjo; the existing video data is converted into image data. In this research, the video is cut into 5-second clips. The authors add one category to make the classification process easier. Categories or labels are assigned to individual data using the V/C ratio. From the capacity calculation, the capacity value used in the V/C computation is obtained.

As an example of a V/C calculation on a video clip cut to 5 seconds: if 3 cars and 1 motorcycle pass, the resulting V/C ratio is 0.92, so the density level, referring to table 4, is level E.
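The mapping from V/C ratio to density level comes from table 4, which is not reproduced in this text; the thresholds below are common level-of-service cut-offs used only for illustration, and they happen to agree with the worked example (0.92 → E):

```python
def density_level(vc_ratio):
    """Map a V/C ratio to a service-level letter.
    Thresholds are illustrative level-of-service cut-offs,
    NOT the values from the paper's table 4."""
    for limit, level in [(0.60, "A"), (0.70, "B"), (0.80, "C"),
                         (0.90, "D"), (1.00, "E")]:
        if vc_ratio <= limit:
            return level
    return "F"

level = density_level(0.92)   # agrees with the worked example: level E
```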

2 Pre-processing Data
Data pre-processing is done to produce better, more computable data. The CCTV video is transformed into an image for every frame in the video. Images are cut according to the region of interest and then converted to grayscale. An illustration of this process can be seen in Figure 4.
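The crop-then-grayscale steps of the pipeline can be sketched as below. The frame-extraction step (e.g. with OpenCV) is omitted to keep the sketch self-contained, and the ROI coordinates are placeholders, not the values used in the paper:

```python
import numpy as np

def preprocess_frame(frame, roi):
    """Crop the region of interest, then convert RGB to grayscale (average method)."""
    top, left, height, width = roi
    cropped = frame[top:top + height, left:left + width]
    return cropped.astype(np.float64).mean(axis=-1)

# A synthetic 480x640 RGB frame standing in for one extracted video frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
gray = preprocess_frame(frame, roi=(100, 200, 128, 128))
```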

2 CNN Architecture
The CNN architecture model was implemented and the parameters of the CNN model were initialized. The CNN model in question is detailed in the architecture shown in Figure 5.

3 Parameter effect on CNN
Implementing a CNN model with a good accuracy value requires parameters that help produce the best accuracy. The parameters in question are the number of convolution layers, the number and size of the filters in each layer, the number of neurons in the hidden layer, and the learning rate.

3.1 Effect of Epoch number
An epoch is one complete pass of the entire dataset through the neural network's training process. In this experiment, the numbers of epochs tested are given in table 6. With almost the same accuracy between 50 and 60 epochs, and considering the shorter training time, this study selects 50 epochs as the best choice.

3.2 Effect of Learning rate
The learning rate is tuned to produce a model with a stable and minimal training loss. The learning rate is set when the CNN model is compiled with the Adam optimizer. The learning rates tested are 0.0001, 0.001, and 0.00001; their effect is shown in Figure 6. A learning rate of 0.001 behaves almost as stably as 0.00001, but its loss is larger, at 54.39%. Based on this experiment, this research uses the learning rate that produces the best accuracy, 0.0001.

3.3 Effect without or using Dropout
Dropout is one way to prevent overfitting and also to accelerate the learning process. Overfitting is a condition in which almost all data that passes through the training process reaches a good percentage, but predictions on new data are inconsistent. In its operation, dropout temporarily removes a neuron, whether in a hidden layer or a visible layer, from the network. The dropout experiments show the best accuracy when using a dropout of 0.1, with an accuracy of 92.73%; the results with and without dropout are shown in table 7.
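Mechanically, dropout zeroes a random fraction of activations during training. The sketch below uses inverted dropout (the variant most frameworks implement, which rescales survivors so the expected activation is unchanged) with the paper's rate of 0.1:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units during training
    and rescale the rest so the expected activation is unchanged."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(1000)
d = dropout(a, rate=0.1, rng=rng)   # roughly 10% of entries become 0
```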

3.4 Effect Filter Size
With a convolution kernel (filter) of a certain size, the computer obtains new representative information from multiplying parts of the image by the filter. From the filter-size experiments, the best filter size is 4 × 4: the larger the filter used, the better the accuracy, reaching 94.67%.

3.5 Effect Pooling
The best max-pooling size is 2 × 2, which produces 94.26% accuracy, but taking the training loss results into consideration, the size used in this research is 3 × 3 max-pooling.

4 Effect of Edge Image Data Input
The training accuracy on edge images is shown in the table; examples of the grayscale and edge images can be seen in Figure 1. The edge image gives better results than the grayscale image because it is unaffected by blur or differences in lighting levels that alter the pixel values. The training accuracy on edge images is 95.82%.
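The paper does not state which edge operator produced the edge images; as one common choice, a Sobel gradient-magnitude edge map can be sketched as follows:

```python
import numpy as np

def sobel_edges(gray):
    """Gradient-magnitude edge map using 3x3 Sobel kernels (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = gray[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()   # horizontal gradient
            gy[i, j] = (patch * ky).sum()   # vertical gradient
    return np.hypot(gx, gy)

# A vertical step edge produces a strong response along the boundary.
img = np.zeros((5, 6))
img[:, 3:] = 255.0
edges = sobel_edges(img)
```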

5 Evaluation using Confusion Matrix
This evaluation aims to test the performance of the classification model built by the system, using 3364 validation samples. The precision, recall, and F1-score results can be viewed in the table.
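A minimal NumPy sketch of how the confusion matrix and the per-class precision and recall in that table are computed (libraries such as scikit-learn provide equivalents):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def precision_recall(cm):
    """Per-class precision (column-wise) and recall (row-wise) from the matrix."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)
    recall = tp / cm.sum(axis=1)
    return precision, recall

cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
p, r = precision_recall(cm)
```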

6 Evaluation using K-fold Cross-Validation
This study tests the model using stratified K-fold cross-validation with K = 5. Each partition yields its own training accuracy, and the average of these accuracies is 92.83%. Since the average is not far from the accuracy of each partition, it can be concluded that the training and testing datasets are distributed correctly, with the greatest accuracy obtained on partition 1.
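Stratified splitting keeps each class's proportion roughly equal across folds. The paper presumably used a library implementation; a minimal sketch of the fold assignment:

```python
import numpy as np

def stratified_kfold_indices(labels, k=5, seed=0):
    """Split sample indices into k folds, preserving class proportions."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)                     # shuffle within the class
        for i, sample in enumerate(idx):
            folds[i % k].append(sample)      # deal samples round-robin
    return [np.array(f) for f in folds]

labels = np.array([0] * 50 + [1] * 50)       # a balanced toy label set
folds = stratified_kfold_indices(labels, k=5)
```

Each fold then serves once as the validation set while the other four are used for training.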

7 Evaluation of the CNN Model with Testing Data
Testing experiments use 50 and 100 samples with a balanced category spread. Test results with grayscale images tend to be lower than with edge images: with 50 grayscale input images, the CNN model categorized only 29 correctly, while with 100 edge-image inputs, the CNN model correctly predicted 81 samples.

CONCLUSIONS
Using edge-image input data shows better results than grayscale input, with 94.64% accuracy for the grayscale image and 95.82% for the edge image. In this study, a learning rate of 0.0001 proved to be a suitable parameter, producing the smallest training loss.
Using dropout with a value of 0.1 resulted in better training accuracy, 92.73%, compared with 87.32% without dropout. With edge-image input, 100 test samples were processed through the CNN model that was created, and the model correctly classified 81 of them into density-level categories.