I Introduction
As smartphones become increasingly integrated into our daily lives and the numbers of both cyber-physical systems and smart devices continue to grow, there has been a noticeable evolution in the way many large data sets are being generated. In fact, Cisco [13] predicted that in 2021, whilst 20.6 ZB of data (e.g. large e-commerce site records) would be handled by cloud-based approaches in large data centres, this amount would be dwarfed by the 850 ZB generated by local devices [40]. In response to the data sources becoming more device-centric, there has been a shift in focus for many machine learning algorithms to be implemented, and even trained, locally on the (potentially hardware-limited) devices. Running the algorithms on the devices represents a radical shift away from traditional centralised learning, where the data and algorithms are stored and processed in the cloud, but brings the benefits of [40]: (1) increased user privacy, as the data is not transmitted to a centralised server; (2) reduced latency, since the algorithms can react immediately to newly generated data from the device; and (3) improved energy efficiency, mostly because the data and algorithm outputs do not have to be constantly transferred to and from the cloud. However, running the algorithms locally on the devices brings its own issues, most notably in dealing with the devices' limited computational power, memory and energy storage. Overcoming these hardware constraints has motivated substantial efforts to improve algorithm design, particularly towards developing leaner, more efficient neural networks [36].
Two popular approaches to making neural network algorithms leaner and more hardware-conscious are quantised neural networks [34, 7, 35], where fixed-point arithmetic is used to accelerate computation and reduce the memory footprint, and pruned neural networks [27, 4, 18, 19, 32, 31, 22, 17, 12, 23, 26, 15, 30, 24], where, typically, the weights contributing least to the function mapping are removed, promoting sparsity in the weights. Both of these approaches have achieved impressive results. For instance, by quantising, [25] was able to reduce model size without any noticeable loss in accuracy when evaluated on the CIFAR-10 benchmark, and [16] demonstrated that between 50% and 80% of its model weights could be pruned with little impact on performance [36]. However, our understanding of neural network reduction methods such as these remains lacking, and reliably predicting their performance remains a challenge. Illustrating this point, [27] stated that for pruned neural networks "our results suggest the need for more careful baseline evaluations in future research on structured pruning methods", with a similar sentiment raised in [4]: "our clearest finding is that the community suffers from a lack of standardized benchmarks and metrics". These quotes indicate a need for robust evaluation methods for lean neural network designs, a perspective explored in this work.

Contribution
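To make the pruning baseline concrete, the following is a minimal sketch of magnitude pruning, one common heuristic in the works cited above: the smallest-magnitude weights are zeroed out. The function name and setup are illustrative, not taken from any of the cited papers.

```python
import numpy as np

def magnitude_prune(W, fraction):
    """Zero out the given fraction of smallest-magnitude entries of W.
    A common baseline pruning heuristic (illustrative sketch)."""
    k = int(round(fraction * W.size))
    if k == 0:
        return W.copy()
    # threshold at the k-th smallest absolute value
    thresh = np.sort(np.abs(W), axis=None)[k - 1]
    pruned = W.copy()
    pruned[np.abs(pruned) <= thresh] = 0.0
    return pruned

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))     # 16 random weights
Wp = magnitude_prune(W, 0.5)
print(np.count_nonzero(Wp))     # -> 8 (half of the 16 entries zeroed)
```

Note that this heuristic carries no guarantee on the induced approximation error, which is precisely the gap the present paper addresses.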
This paper introduces a method to automatically synthesize neural networks of reduced dimensions (meaning fewer neurons) from a trained larger one, as illustrated in Figure 1. These smaller networks are termed reduced-order neural networks, since the approach was inspired by reduced-order modelling in control theory [14]. The weights and biases of the reduced-order network are generated from the solution of a semidefinite program (SDP), a class of well-studied convex problems [5] combining a linear cost function with a linear matrix inequality (LMI) constraint, which minimises the worst-case approximation error with respect to the larger network. Bounds are also obtained for this worst-case approximation error, and so the performance of the network reduction is guaranteed.
What separates the proposed synthesis approach from existing methods for generating efficient neural networks, e.g. pruning, is the inclusion of the worst-case approximation error of the reduced-order neural network directly within the cost function used to compute the weights and biases. Whilst the presented results are still preliminary, their focus on robust neural network synthesis introduces a new set of tools to generate lean neural networks which should have more reliable out-of-sample performance and which are equipped with approximation error bounds. The broader goal of this work is to translate recent results on the verification of NN robustness using an SDP [11, 33] into a synthesis problem, mimicking the progression from absolute stability theory [39] to robust control synthesis [9] witnessed in control theory during the 1980s. In this way, this work carries on the tradition of control theorists exploring the connections between robust control theory and neural networks, as witnessed since the 1990s with Glover [6], Barabanov [3], Angeli [2] and Narendra [21].
I-A Notation
Non-negative real vectors of dimension $n$ are denoted $\mathbb{R}^n_{\geq 0}$. A positive (negative) definite matrix $P$ is denoted $P \succ 0$ ($P \prec 0$). Non-negative diagonal matrices of dimension $n$ are $\mathbb{D}^n_{\geq 0}$. The matrix of zeros of dimension $n \times m$ is $0_{n \times m}$ and the vector of zeros of dimension $n$ is $0_n$. The identity matrix of size $n$ is $I_n$. The vector of 1s of dimension $n$ is $1_n$ and the matrix of 1s of dimension $n \times m$ is $1_{n \times m}$. The $i$-th element of a vector $v$ is denoted $v_i$ unless otherwise defined in the text. The notation $\star$ is adopted to represent symmetric matrices in a compact form, e.g.

$$\begin{bmatrix} A & B \\ \star & C \end{bmatrix} := \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}. \qquad (1)$$
I-B Neural networks
The neural networks considered will be treated as functions mapping input vectors of size $n_x$ to output vectors of dimension $n_y$. In a slight abuse of notation, the activation function $\phi(\cdot)$ will refer to mappings of both scalars and vectors, with the vector operation applied elementwise. The full-order neural network will be composed of $\ell$ hidden layers, with the $k$-th layer being composed of $n_k$ neurons. The total number of neurons in the full-order neural network is $N = \sum_{k=1}^{\ell} n_k$. Similarly, the reduced-order neural network will be composed of $\hat{\ell}$ hidden layers, with the $k$-th layer being composed of $\hat{n}_k$ neurons. The total number of neurons in the reduced-order network is $\hat{N} = \sum_{k=1}^{\hat{\ell}} \hat{n}_k$. The dimension of the domain of the activation functions is $N$ (full-order network) and $\hat{N}$ (reduced-order network).

II Problem statement
In this section, the general problem of synthesizing reduced-order NNs is posed. Consider a nonlinear function $f(x)$ mapping input data $x \in \mathcal{X}$ to an output set. The goal of this work is to generate a "simpler" function $\hat{f}(x)$ that is as "close" as possible to $f(x)$ for all $x \in \mathcal{X}$. Here, "simpler" refers to the dimension of the approximating neural network (measured by the total number of neurons) and "closeness" relates to the approximation error between the two functions, measured by the induced 2-norm. The goal is to automatically synthesize the simpler functions from the solution of a convex problem and to obtain worst-case bounds for the approximation error with respect to the larger neural network for all $x \in \mathcal{X}$.
To ensure that the function approximation problem remains feasible, structure is added to the class of functions considered. It is assumed that the function being approximated, $f(x)$, is generated by a feedforward neural network

$$x^0 = x, \qquad (2a)$$
$$x^{k+1} = \phi(W_k x^k + b_k), \quad k = 0, \dots, \ell - 1, \qquad (2b)$$
$$f(x) = W_\ell x^\ell + b_\ell. \qquad (2c)$$
Here, the input data is mapped elementwise through the nonlinear activation functions $\phi(\cdot)$ (which could be the standard choices of ReLU, sigmoid, tanh or any function that satisfies a quadratic constraint as given in Section III-B), with the weight matrices $W_k$ and biases $b_k$, $k = 0, \dots, \ell$. Whilst the results are described for feedforward neural networks, the method can be generalised to other network architectures, such as recurrent and even implicit neural networks [10]. As an aside, verifying the well-posedness of implicit neural networks has a strong connection to that of Lurie systems with feedthrough terms [37]. The neural network $f(x)$ (which will be referred to as the full-order neural network) is to be approximated by another neural network $\hat{f}(x)$ (referred to as the reduced-order neural network) of a smaller dimension
$$\hat{z} = \hat{\phi}(\hat{v}), \qquad (3a)$$
$$\hat{v} = \hat{W} \hat{z} + \hat{W}_x x + \hat{b}, \qquad (3b)$$
$$\hat{f}(x) = \hat{W}_y \hat{z} + \hat{b}_y. \qquad (3c)$$
The weights and biases in this neural network are denoted with hats to distinguish them from those of the full-order network. The network structure in (3b) is general, and even allows for implicitly defined networks [10]. This generality follows from the lack of structure imposed on the matrices used in the synthesis procedure. However, by adding structure, the search can be limited to, for example, feedforward networks, which are simpler to implement.
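A feedforward network of the form described above is straightforward to evaluate; the following is a minimal numpy sketch of the forward pass (hidden layers apply the activation, the output layer is affine). The function and variable names are illustrative, not the paper's notation.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def feedforward(x, weights, biases):
    """Forward pass of a feedforward ReLU network: each hidden layer
    applies the activation elementwise; the output layer is affine."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = relu(W @ z + b)
    return weights[-1] @ z + biases[-1]

rng = np.random.default_rng(1)
# example full-order network: 1 input, one hidden layer of 10 neurons, 1 output
Wf = [rng.normal(size=(10, 1)), rng.normal(size=(1, 10))]
bf = [rng.normal(size=10), rng.normal(size=1)]
y = feedforward(np.array([0.3]), Wf, bf)
print(y.shape)  # -> (1,)
```

The reduced-order network has the same interface but with fewer hidden neurons, which is what makes the worst-case comparison between the two maps well defined.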
In this work, the dimension of the reduced-order network is fixed and the problem is to find the reduced-order NN's parameters, namely its weights and biases, that minimise the worst-case approximation error between the full- and reduced-order neural networks for all $x \in \mathcal{X}$. The main tool used for this reduced-order NN synthesis problem is the outer approximation of the NN's input set, the nonlinear activation functions' gains and the output error by quadratic constraints. These outer approximations enable the robust weight synthesis problem to be stated as a convex SDP, albeit at the expense of introducing conservatism into the analysis.
III Quadratic Constraints
In this section, the quadratic constraints for the convex outer approximations of the various sets of interest of the reduced NN synthesis problem are defined. These characterisations are posed in the framework of [11], which in turn was inspired by the integral quadratic constraint framework of [29] and the classical absolute stability results for Lurie systems [20].
III-A Quadratic constraint: Input set
The input data is restricted to the hyperrectangle $\mathcal{X} = \{x : \underline{x} \le x \le \bar{x}\}$.

Definition 1

Define the hyperrectangle $\mathcal{X} = \{x : \underline{x} \le x \le \bar{x}\}$. If $x \in \mathcal{X}$ then $\begin{bmatrix} x \\ 1 \end{bmatrix}^\top P \begin{bmatrix} x \\ 1 \end{bmatrix} \ge 0$, where

$$P = \begin{bmatrix} -2\Gamma & \Gamma(\underline{x} + \bar{x}) \\ \star & -2\,\underline{x}^\top \Gamma \bar{x} \end{bmatrix}, \quad \Gamma \in \mathbb{D}^{n_x}_{\geq 0}. \qquad (4)$$
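This hyperrectangle multiplier is one standard choice from the verification literature (e.g. [11]), and it is easy to check numerically: the quadratic form reduces to $2\sum_i \gamma_i (x_i - \underline{x}_i)(\bar{x}_i - x_i) \ge 0$ inside the box. The sketch below assumes that standard form; the function name is illustrative.

```python
import numpy as np

def input_qc_matrix(x_lo, x_hi, gamma):
    """Build P such that [x;1]^T P [x;1] >= 0 whenever x_lo <= x <= x_hi,
    for any nonnegative diagonal scaling gamma (standard hyperrectangle
    multiplier; assumed form, see the verification literature)."""
    G = np.diag(gamma)
    n = len(x_lo)
    P = np.zeros((n + 1, n + 1))
    P[:n, :n] = -2.0 * G
    P[:n, n] = G @ (x_lo + x_hi)
    P[n, :n] = G @ (x_lo + x_hi)
    P[n, n] = -2.0 * x_lo @ G @ x_hi
    return P

rng = np.random.default_rng(2)
x_lo, x_hi = np.array([-1.0, 0.0]), np.array([1.0, 2.0])
P = input_qc_matrix(x_lo, x_hi, np.array([0.5, 1.5]))
for _ in range(100):
    x = rng.uniform(x_lo, x_hi)
    xe = np.append(x, 1.0)
    assert xe @ P @ xe >= -1e-12   # constraint holds everywhere inside the box
print("ok")
```

Outside the box the form goes negative, which is what lets the constraint encode membership of $\mathcal{X}$ in the SDP.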
III-B Quadratic constraint: Activation functions
The main obstacle to any robustness-type result for neural networks is in accounting for the nonlinear activation functions. To address this issue, the following function properties are introduced.
Definition 2
The activation function $\phi(\cdot)$ satisfying $\phi(0) = 0$ is said to be sector bounded in $[\alpha, \beta]$ if

$$(\phi(x) - \alpha x)(\beta x - \phi(x)) \ge 0, \quad \forall x, \qquad (5a)$$

and slope restricted in $[\alpha, \beta]$ if

$$\big(\phi(x) - \phi(y) - \alpha(x - y)\big)\big(\beta(x - y) - (\phi(x) - \phi(y))\big) \ge 0, \quad \forall x \neq y. \qquad (5b)$$

If $\alpha \ge 0$ then the nonlinearity is monotonic, and if $\phi(\cdot)$ is slope restricted then it is also sector bounded. The activation function is bounded if

$$|\phi(x)| \le \bar{\phi}, \quad \forall x, \qquad (5c)$$

it is positive if

$$\phi(x) \ge 0, \quad \forall x, \qquad (5d)$$

its complement is positive if

$$\phi(x) - x \ge 0, \quad \forall x, \qquad (5e)$$

and it satisfies the complementarity condition if

$$\phi(x)\,(\phi(x) - x) = 0, \quad \forall x. \qquad (5f)$$
Most popular activation functions, including the ReLU, (shifted) sigmoid and tanh, satisfy some of these conditions, as illustrated in Table I. As the number of properties satisfied by the activation function increases, its characterisation within the robustness analysis improves, often resulting in less conservative results. It is also noted that to satisfy some of these properties, activation functions may require a shift, e.g. the sigmoid, or a transformation, as demonstrated by the representation of the LeakyReLU as a ReLU plus a linear term.
Property                   Shifted sigmoid   tanh   ReLU   ELU
Sector bounded             ✓                 ✓      ✓      ✓
Slope restricted           ✓                 ✓      ✓      ✓
Bounded                    ✓                 ✓      ×      ×
Positive                   ×                 ×      ✓      ×
Positive complement        ×                 ×      ✓      ×
Complementarity condition  ×                 ×      ✓      ×

Table I: Properties of commonly used activation functions, including the sigmoid, tanh, rectified linear unit (ReLU) and exponential linear unit (ELU). The properties of other functions, such as the LeakyReLU, can also be inferred.
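The ReLU column of the table can be checked numerically, along with the LeakyReLU decomposition mentioned above (LeakyReLU with leak slope $a$ equals $(1-a)\,\mathrm{ReLU}(x) + a x$). The grid-based check below is an illustrative sketch, not a proof.

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)

x = np.linspace(-3.0, 3.0, 601)
y = relu(x)

# Properties from Table I, checked on the grid for the ReLU:
assert np.all(y >= 0.0)                # positive
assert np.all(y - x >= 0.0)            # positive complement
assert np.allclose(y * (y - x), 0.0)   # complementarity condition
assert np.all(y * (x - y) >= 0.0)      # sector bounded in [0, 1]

# slope restricted in [0, 1]: finite differences lie between 0 and 1
dx, dy = np.diff(x), np.diff(y)
assert np.all(dy >= -1e-12) and np.all(dy <= dx + 1e-12)

# LeakyReLU as "ReLU + linear term" (a = leak slope, here 0.1):
a = 0.1
leaky = np.where(x > 0, x, a * x)
assert np.allclose(leaky, (1 - a) * relu(x) + a * x)
print("ReLU properties hold on the sample grid")
```

The bounded property fails for the ReLU (it grows without bound for $x > 0$), matching the × in the table.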
As is well-known from control theory [20], functions with these specific properties are important for robustness analysis problems because they can be characterised by quadratic constraints.
Lemma 1
Consider the vectors $v, w \in \mathbb{R}^n$ that are mapped componentwise through the activation function $\phi(\cdot)$. If $\phi(\cdot)$ is a sector-bounded nonlinearity, then for any $T_1 \in \mathbb{D}^n_{\geq 0}$

$$(\phi(v) - \alpha v)^\top T_1 (\beta v - \phi(v)) \ge 0, \qquad (6a)$$

and if it is slope-restricted then, for any $T_2 \in \mathbb{D}^n_{\geq 0}$,

$$\big(\phi(v) - \phi(w) - \alpha(v - w)\big)^\top T_2 \big(\beta(v - w) - (\phi(v) - \phi(w))\big) \ge 0. \qquad (6b)$$

If $\phi(\cdot)$ is bounded then, for any $T_3 \in \mathbb{D}^n_{\geq 0}$,

$$(\bar{\phi} 1_n - \phi(v))^\top T_3 (\bar{\phi} 1_n + \phi(v)) \ge 0, \qquad (6c)$$

if $\phi(\cdot)$ is positive then, for any $\lambda_1 \in \mathbb{R}^n_{\geq 0}$,

$$\lambda_1^\top \phi(v) \ge 0, \qquad (6d)$$

if the complement of $\phi(\cdot)$ is positive then, for any $\lambda_2 \in \mathbb{R}^n_{\geq 0}$,

$$\lambda_2^\top (\phi(v) - v) \ge 0, \qquad (6e)$$

and if $\phi(\cdot)$ satisfies the complementarity condition then, for any $T_4 \in \mathbb{D}^n$,

$$\phi(v)^\top T_4 (\phi(v) - v) = 0. \qquad (6f)$$

Additionally, if both $\phi(\cdot)$ and its complement are positive then so are the cross terms

$$\phi(v)^\top T_5 (\phi(w) - w) \ge 0, \qquad (6g)$$
$$\phi(w)^\top T_6 (\phi(v) - v) \ge 0, \qquad (6h)$$

for any $T_5, T_6 \in \mathbb{D}^n_{\geq 0}$.
Inequalities (6a)–(6f) are well-known; however, the cross terms (6g)–(6h), acting jointly on pairs of activation functions, are less so.
The characterisation of the nonlinear activation functions via quadratic constraints allows the neural network robustness analysis to be posed as an SDP, with the various multipliers in Lemma 1 being decision variables. Such an approach has been used in [11, 14, 3], and elsewhere, for neural network robustness problems, with the conservatism of this approach coming from the obtained worst-case bounds holding for all nonlinearities satisfying the quadratic constraints. In this work, the aim is to extend this quadratic constraint framework from neural network robustness analysis to a synthesis problem.
A quadratic constraint characterisation of both the reduced- and full-order neural networks can then be written, with the following lemma being the application of Lemma 1 to both the reduced- and full-order neural networks.
Lemma 2
Appendix 2 details this characterisation for the specific case of ReLU activation functions.
III-C Quadratic constraint: Approximation error of the reduced-order neural network
An upper bound for the approximation error between the full- and reduced-order networks can also be expressed as a quadratic constraint. This error bound will be used as a performance metric to gauge how well the reduced-order neural network approximates the full-order one.
Definition 3 (Approximation error)
For some $\gamma \ge 0$ and $\mu \ge 0$, the reduced-order NN's approximation error is defined as the quadratic bound

$$\|f(x) - \hat{f}(x)\|_2^2 \le \gamma \|x\|_2^2 + \mu, \quad \forall x \in \mathcal{X}. \qquad (10)$$

In practice, this bound is computed by minimising over some combination of $\gamma$ and $\mu$.
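A certified bound of this kind must dominate any sampled estimate of the error. The sketch below builds two hypothetical single-hidden-layer ReLU networks (standing in for the full- and reduced-order networks; all names and sizes are illustrative) and checks a trivially valid quadratic bound obtained from sampling. A bound produced by the synthesis SDP would be certified for all inputs, not just the samples.

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda v: np.maximum(v, 0.0)

# hypothetical full-order (10 hidden neurons) and reduced-order (3) networks
W1, b1 = rng.normal(size=(10, 1)), rng.normal(size=10)
W2, b2 = rng.normal(size=(1, 10)), rng.normal(size=1)
V1, c1 = rng.normal(size=(3, 1)), rng.normal(size=3)
V2, c2 = rng.normal(size=(1, 3)), rng.normal(size=1)

f  = lambda x: W2 @ relu(W1 @ x + b1) + b2   # full-order map
fr = lambda x: V2 @ relu(V1 @ x + c1) + c2   # reduced-order map

# sample the squared error over the input set X = [-1, 1]
xs = np.linspace(-1.0, 1.0, 201)
errs = np.array([np.sum((f(np.array([x])) - fr(np.array([x]))) ** 2) for x in xs])

# the loosest sample-consistent bound: gamma = 0, mu = max sampled error
gamma, mu = 0.0, float(errs.max())
assert np.all(errs <= gamma * xs**2 + mu + 1e-9)
print("sampled quadratic error bound holds")
```

This also illustrates why sampling alone is insufficient: the maximum over a finite grid can underestimate the true worst case, which is what the SDP bound guards against.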
IV Reduced-order neural network synthesis problem
This section contains the main result of the paper: an SDP formulation of the reduced-order NN synthesis problem (Proposition 1). To arrive at this formulation, a general statement of the synthesis problem is first given in Theorem 1. This theorem characterises the search for the reduced-order neural network's parameters as minimising the worst-case approximation error for all inputs $x \in \mathcal{X}$.
Theorem 1
Assume the activation functions satisfy some of the properties from Definition 2. Then, with the full-order weights fixed, if there exists a solution to

(11a)
(11b)

then the worst-case approximation error is bounded for all $x \in \mathcal{X}$.
Proof. See Appendix 3.
The main issue with Theorem 1 is verifying inequality (11b), since it includes a non-convex bilinear matrix inequality (BMI) between the matrix variables of the reduced-order network's weights, its biases and the scaling variables of the quadratic constraints. The following proposition details how this constraint can be written (after the application of a convex relaxation of the underlying BMI) as an LMI. The search over the reduced NN variables can then be translated into an SDP, a class of well-understood convex optimisation problems with many standard solvers, such as MOSEK [1], implemented through the YALMIP [28] interface in MATLAB or even the Robust Control Toolbox.
Proposition 1
Consider the full-order neural network of (2) mapping $f(x)$ and the reduced-order neural network of (3) mapping $\hat{f}(x)$. For fixed full-order weights, if there exist matrix variables that solve

(12a)
(12b)

with the matrix defined in (33) of Appendix 4, then the reduced-order network with weights and affine terms recovered from

(13)

ensures that the worst-case approximation error bound of the reduced-order neural network holds for all $x \in \mathcal{X}$.
Proof. See Appendix 4.
Appendix 5 details the matrix characterising how the activation functions are included within the robustness condition for the special case of the ReLU. Some remarks about the proposition are given in Appendix 6.
V Numerical example
The proposed reduced-order neural network synthesis method was evaluated in two numerical examples. In both cases, the performance of the synthesized neural networks was evaluated graphically (see Figures 2–3) to give a better representation of the robustness of the approximations (the focus of this work). Only academic examples were considered due to the well-known scalability issues of SDP solvers (which are becoming less of an issue [8]) and because performance was measured graphically. The code for the numerical examples can be obtained on request from the authors.
The first example explores the impact of the reduced-order neural network's dimension on its accuracy. In this case, the full-order neural network considered was a single-hidden-layer network of dimension 10, with all weights and biases obtained by sampling a zero-mean normal distribution with variance 1, and which mapped a single input to a single output, with the input constrained to a bounded interval. The ReLU was taken as the activation function of both the full- and reduced-order neural networks. Reduced-order neural networks with single hidden layers of various dimensions were then synthesized using Proposition 1. Figure 2(a) shows the various approximations obtained and Figure 2(b) shows how the error bounds and approximation errors changed as the dimension of the reduced-order network increased. The error bound was satisfied in all cases (albeit conservatively) and the approximation error dropped as the dimension of the reduced network increased, as expected.

The second example considers a more complex function to approximate and illustrates some potential pitfalls of pruning too hard. In this case, the full-order network's weights and biases were fixed to chosen values. Figure 3 shows the output generated from a reduced-order neural network, as well as the network generated by setting the multiplier matrices in Lemma 2 to be diagonal (this reduced the compute time but, as shown, can alter the obtained function). Also shown is the case where the full-order neural network has been pruned to have a similar number of connections as the reduced-order one by removing the 32 smallest weights. In this case, the pruned network was cut so far that it simply generated a constant function, though further fine-tuning of the pruned network may recover performance. Likewise, fine-tuning of the reduced-order neural network may improve the approximation of the synthesized reduced-order neural networks.
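The setup of the first example is easy to reproduce in spirit. The sketch below builds the random 10-neuron full-order network and compares it, on a grid, against a pruning baseline that keeps only the largest-magnitude hidden units; it does not perform the SDP synthesis itself (which requires an SDP solver), so it only illustrates the error that a synthesized reduced-order network would have to beat. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda v: np.maximum(v, 0.0)

# full-order net: 1 input -> 10 hidden (ReLU) -> 1 output, N(0,1) parameters,
# matching the setup of the first example
W1, b1 = rng.normal(size=(10, 1)), rng.normal(size=10)
W2, b2 = rng.normal(size=(1, 10)), rng.normal(size=1)
f = lambda x: (W2 @ relu(W1 * x + b1[:, None]) + b2[:, None]).ravel()

# pruning baseline: keep only the 3 hidden units with the largest output weights
keep = np.argsort(np.abs(W2.ravel()))[-3:]
fp = lambda x: (W2[:, keep] @ relu(W1[keep] * x + b1[keep, None])
                + b2[:, None]).ravel()

# worst-case error of the pruned net over a grid on the input interval [-1, 1]
xs = np.linspace(-1.0, 1.0, 401)
err = float(np.max(np.abs(f(xs) - fp(xs))))
print("pruning baseline computed; grid worst-case error is positive:", err > 0)
```

The synthesis approach of Proposition 1 instead chooses all 3-neuron parameters jointly to minimise (a certified bound on) exactly this worst-case error.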
Conclusions
A method to synthesize the weights and biases of reduced-order neural networks (having fewer neurons) approximating the input/output mapping of a larger one was introduced. A semidefinite program was defined for this synthesis problem that directly minimised the worst-case approximation error of the reduced-order network with respect to the larger one, with this error being bounded. By including the worst-case approximation error directly within the training cost function, it is hoped that the ideas explored in this paper will lead to more robust and reliable reduced-order neural network approximations. Several open problems remain to be explored, most notably reducing the conservatism of the bounds, scaling up the method to large neural networks and further exploring the convexification of the bilinear matrix inequality of the synthesis problem.
Acknowledgements
The authors were funded for this work through the Nextrode Project of the Faraday Institution (EPSRC Grant EP/M009521/1) and a UK Intelligence community fellowship from the Royal Academy of Engineering.
References

[1] (2000) The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm. In High Performance Optimization, pp. 197–232.
[2] (2009) Convergence in networks with counterclockwise neural dynamics. IEEE Transactions on Neural Networks 20 (5), pp. 794–804.
[3] (2002) Stability analysis of discrete-time recurrent neural networks. IEEE Transactions on Neural Networks 13 (2), pp. 292–303.
[4] (2020) What is the state of neural network pruning?. arXiv preprint arXiv:2003.03033.
[5] (2004) Convex optimization. Cambridge University Press.
[6] (1999) Bounds of the induced norm and model reduction errors for systems with repeated scalar nonlinearities. IEEE Transactions on Automatic Control 44 (3), pp. 471–483.
[7] (2014) Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024.
[8] (2020) Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming. Advances in Neural Information Processing Systems 33.
[9] (1988) State-space solutions to standard $H_2$ and $H_\infty$ control problems. In Procs. of the American Control Conference, pp. 1691–1696.
[10] (2019) Implicit deep learning. arXiv preprint arXiv:1908.06315.
[11] (2019) Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming. arXiv preprint arXiv:1903.01287.
[12] (2019) The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574.
[13] Cloud index: forecast and methodology, 2016–2021, white paper. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.htm
[14] (1984) All optimal Hankel-norm approximations of linear multivariable systems and their $L^\infty$-error bounds. International Journal of Control 39 (6), pp. 1115–1193.
[15] (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
[16] (2015) Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems 28, pp. 1135–1143.
[17] (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 164–171.
[18] (1989) Pruning versus clipping in neural networks. Physical Review A 39 (12), pp. 6600.
[19] (1990) A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks 1 (2), pp. 239–242.
[20] (2002) Nonlinear systems. 3rd edition, Prentice Hall, Upper Saddle River, NJ.
[21] (1990) Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks 1 (1), pp. 4–27.
[22] (1990) Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
[23] (2019) A signal propagation perspective for pruning neural networks at initialization. arXiv preprint arXiv:1906.06307.
[24] (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
[25] (2016) Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858.
[26] (2017) Runtime neural pruning. In Advances in Neural Information Processing Systems, pp. 2181–2191.
[27] (2018) Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
[28] (2004) YALMIP: a toolbox for modeling and optimization in MATLAB. In International Conference on Robotics and Automation, pp. 284–289.
[29] (1997) System analysis via integral quadratic constraints. IEEE Transactions on Automatic Control 42 (6), pp. 819–830.
[30] (2016) Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440.
[31] (1989) Skeletonization: a technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, pp. 107–115.
[32] (1989) Using relevance to reduce network size automatically. Connection Science 1 (1), pp. 3–16.
[33] (2018) Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pp. 10877–10887.
[34] (2016) Fixed-point performance analysis of recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 976–980.
[35] (2015) Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488.
[36] (2017) Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329.
[37] (2018) Regional analysis of slope-restricted Lurie systems. IEEE Transactions on Automatic Control 64 (3), pp. 1201–1208.
[38] (2000) A tutorial on linear and bilinear matrix inequalities. Journal of Process Control 10 (4), pp. 363–385.
[39] (1968) Stability conditions for systems with monotone and slope-restricted nonlinearities. SIAM Journal on Control 6 (1), pp. 89–108.
[40] (2019) Edge intelligence: paving the last mile of artificial intelligence with edge computing. Proceedings of the IEEE 107 (8), pp. 1738–1762.
Appendices
Appendix 1: Definition of various vectors
Definition 4
For the full-order neural network, the vector mapped through the nonlinear activation function is

(14)

and for the reduced-order neural network, it is

(15)

with

(16)

The following two vectors are used to define the quadratic constraints of Proposition 1:

(17)

Also, define

(18a)

such that

(18b)
Appendix 2: Quadratic Constraint for the ReLU
Consider the generalised quadratic constraint of Lemma 2 with the ReLU activation function. For this particular case, the matrix in (9) is structured as in (19).
(19) 
Appendix 3: Proof of Theorem 1
Appendix 4: Proof of Proposition 1
In order to obtain the SDP formulation of the NN synthesis problem, the three quadratic constraints for the input set (Definition 1), the nonlinear activation functions (Lemma 2) and the output error set (Definition 3) need to be expressed as matrix inequalities. The vectors defined in Definition 4 of Appendix 1 are used throughout.
(21) 
The bound on the approximation error given by inequality (10) in Definition 3 can be written in quadratic form as

(22)

using

(23)

and the quantity defined in (18a).
The most challenging aspect of the proposition is obtaining an LMI for the synthesis of the neural network's weights and biases. From Lemma 2, if the activation functions satisfy the quadratic constraints, and hence the inequality (9), then a matrix inequality holds with the matrix defined in (24).
(24) 
The product of matrix variables in (24) leads to a non-convex BMI constraint in the problem, with both the scaling terms of the quadratic constraint inequality (9) and the reduced-order NN's weights and biases being matrix variables. A convex relaxation is proposed for this BMI constraint (see (25)), but it is highlighted that this relaxation is perhaps the biggest source of conservatism in the proposition, as it restricts the space of solutions that can be searched over.
To achieve the desired linear representation of (24), the bilinear terms in (24) are expressed as scaled versions of new matrix variables, with the scaling done by user-chosen matrices. In other words, it is proposed to express the bilinear terms as

(25a)

and

(25b)
The scaling matrices are free variables that can be picked by the user, under the stipulation that they preserve the properties of the multipliers of Lemma 2. In this work, the choice was
(26) 
but more refined choices may also exist. The substitutions (25) convert the bilinear matrix variables in (24) into the linear matrix variables in (27).
(27) 
In this way, the non-convexity of the bilinear matrix inequality of the problem has been relaxed into a convex linear one. However, the substitution (25) limits the space of solutions that can be searched over by the synthesis SDP, resulting in only local optima being achieved and increased conservatism in the approximation error bounds.
The zeros in the top left corner of the matrix are problematic for numerically verifying positive-definiteness of the matrix used in the proposition. To alleviate this problem, the structural equality constraints of (29) are introduced to fill this zero block. These constraints are defined by drawing connections between the various elements of the vector, namely

(28)

The following quadratic equalities can then be written
(29a)  
(29b)  
(29c)  
(29d) 
for any admissible multipliers. There then exists a matrix (given in (31)) built from these terms such that
(30) 
(31) 
With these matrices defined, the robustness condition (11b) of Theorem 1 can be expressed as a matrix inequality in the relaxed variables. However, because the relaxed form is a special case of the original one, if the inequality can instead be shown to hold with the relaxed form, then a solution to the original condition has also been found.
Satisfying this inequality is equivalent to verifying the negative definiteness of the corresponding matrix.
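This final numerical check, verifying that a symmetric matrix returned by the solver is negative definite, can be sketched via its eigenvalues. The function below is an illustrative helper, not part of the paper's toolchain.

```python
import numpy as np

def is_negative_definite(M, tol=0.0):
    """Check M = M^T < 0 via its eigenvalues; a symmetric matrix is
    negative definite iff its largest eigenvalue is negative."""
    M = 0.5 * (M + M.T)  # symmetrise against numerical round-off
    return bool(np.max(np.linalg.eigvalsh(M)) < -tol)

A = np.array([[-2.0, 0.5], [0.5, -1.0]])   # trace -3, det 1.75 -> definite
B = np.array([[1.0, 0.0], [0.0, -1.0]])    # indefinite
print(is_negative_definite(A), is_negative_definite(B))  # -> True False
```

In practice, a small negative tolerance is often required, since SDP solvers return solutions that satisfy strict inequalities only up to numerical precision.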