

# Technology and Design Tools for Multicore Embedded Systems Software Development

Yuriy Sheynin, Alexey Syschikov, Boris Sedov

Saint Petersburg State University of Aerospace Instrumentation

# Why do we need such technology?

1. A "two-in-one" developer is required: a skilled domain expert and a skilled programmer in one person
2. Requirements for hardware platforms are contradictory
3. Development of an algorithm and program should start before a specific platform is selected
4. Hardware platforms are becoming more and more complex: they include many cores and are heterogeneous in all aspects (cores, memory, interconnect)
5. Meeting the requirements calls for adapting the algorithms to the platform and the platform to the algorithms































- Flexibility and ease-of-change at any design stage
- Explicit parallel program scheme control and management
- No direct coder influence on a parallel program scheme
- Lower likelihood of errors without sacrificing the visibility of the parallel program
- Efficient program maintenance during the whole lifecycle



# VPL and Domain Specific Programming



# An example: DSL creation for image processing (OpenCV)

1. Analysis of the domain area



2. Creation of the functional elements (FE) library



FE functionality is developed in C++ with OpenCV

```
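#include <cstring>                       // memcpy
#include <opencv2/imgproc/imgproc_c.h>   // legacy OpenCV C API
// DataLink, DecodeImage and EncodeImage come from the FE library (not shown).

// Functional element: convert an RGB image to grayscale.
int CVBWFilter(DataLink *in11, DataLink *out21)
{
    IplImage *src = 0;
    IplImage *im_bw = 0;

    src = DecodeImage(in11, src);
    im_bw = cvCreateImage(cvGetSize(src), IPL_DEPTH_8U, 1);
    cvCvtColor(src, im_bw, CV_RGB2GRAY);
    EncodeImage(im_bw, out21);

    cvReleaseImage(&src);
    cvReleaseImage(&im_bw);
    return 0;
}

// Functional element: smooth an image with the selected filter type.
int CVSmooth(DataLink *in11, DataLink *out21, DataLink *in31,
             int radius, int filter_type)
{
    IplImage *src = 0;
    IplImage *smooth = 0;
    int r;

    // r is read from the in31 link; the cvSmooth call uses the radius parameter.
    memcpy(&r, in31->Data, sizeof(int));
    src = DecodeImage(in11, src);
    smooth = cvCloneImage(src);
    cvSmooth(src, smooth, filter_type, radius, radius, 0.0, 0.0);
    EncodeImage(smooth, out21);

    cvReleaseImage(&src);
    cvReleaseImage(&smooth);
    return 0;
}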
```
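The functional elements above share one pattern: read inputs from `DataLink` objects, process them, write the result to an output link, and return 0. The sketch below shows that pattern without OpenCV. It rests on assumptions: the real `DataLink` type is not shown on the slides (only its `Data` member is visible), so the struct here, its `Size` field, and the `ThresholdFE` element are hypothetical stand-ins for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for DataLink. Only the `Data` member appears on the
// slide; `Size` is an assumed field added for this sketch.
struct DataLink {
    void *Data;        // pointer to the payload
    std::size_t Size;  // payload size in bytes (assumed)
};

// Hypothetical FE: binarize 8-bit pixels against a threshold.
// in11 carries the pixels, in31 carries an int threshold,
// out21 points to a preallocated output buffer.
int ThresholdFE(DataLink *in11, DataLink *out21, DataLink *in31)
{
    // Refuse to write past the output buffer.
    if (out21->Size < in11->Size)
        return -1;

    // Read the scalar parameter from its link, as CVSmooth does with in31.
    int threshold = 0;
    std::memcpy(&threshold, in31->Data, sizeof(int));

    const unsigned char *src = static_cast<const unsigned char *>(in11->Data);
    unsigned char *dst = static_cast<unsigned char *>(out21->Data);
    for (std::size_t i = 0; i < in11->Size; ++i)
        dst[i] = src[i] > threshold ? 255 : 0;

    out21->Size = in11->Size;
    return 0;  // same return convention as the FEs on the slide
}
```

Keeping every input and output behind a link is what lets the same function body be placed in a visual scheme as a node with typed ports.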

# An example: DSL usage for image processing. Image recognition



4. The scheme is built from DSL elements and basic VPL language elements



# An example: DSL usage for image processing. Face/eyes recognition



## About OpenVX

#### **OpenVX**

- C-based programming approach with mixed C/non-C computing model
- Includes functions and data types of video processing domain area
- Functions library can be expanded, but it is inconvenient (non-portable)

#### **OpenVX support in VIPE**

- Full implementation for spec. v.1.0.1 (working on v.1.1)
- VPL:
  - DSL for OpenVX functions
  - OpenVX-specific data objects
- Code generation:
  - Plain C mode with OpenVX functions (vxu)
  - OpenVX graph mode
  - Mixed mode with OpenVX graphs and other DSLs

An OpenVX graph is a limited subset of VPL program schemes. Combining a VPL scheme with OpenVX functions brings together the benefits of both.

### Asynchronous Growing Processes (AGP) formal computational model

#### **AGP** defines:

- VPL language syntax
- semantics of VPL language objects
- control units

#### **AGP** provides:

- formal verification
- identical results in different run-time environments
- dynamics of parallel computations
- combination of working in shared and distributed memory models

AGP is a single model for all types of parallel computation and of kernel-data interaction (shared memory / message passing)



# Visual programming environment: VIPE









#### Development process support tools:

- parallel program scheme validation
- verification (in progress)
- interactive debugging:
  - step-by-step debugging
  - breakpoints
  - watches
  - data transfers
  - computation traces
  - operator executions
  - functional debugging by serial execution
  - etc.



# Performance evaluation. Visual Profiler. Hot-spots detection



#### Modes:

- Absolute execution time of each node
- Relative execution time of each node
- Hot-spots

# Performance evaluation. Static Analysis

Fast, early estimation of program performance on a many-core platform



Time reduction

$$R = \frac{T_n}{T_1} \cdot 100\%$$

Speedup

$$S = \frac{T_1}{T_n}$$

Efficiency

$$E = \frac{T_1}{N \cdot T_n}$$

$T_1$ – program execution time on one processor

$T_n$ – program execution time on $N$ processors

$N$ – number of processors
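As a worked example, the three metrics can be written as small C++ helpers. The function names are illustrative and not part of VIPE. The sample values in the note after the code are the measured execution times for 1 and 2 OpenMP threads from the traffic radar comparison table in these slides.

```cpp
#include <cassert>
#include <cmath>

// Time reduction R in percent: the share of the single-processor time
// that the run on N processors still needs.
double timeReductionPercent(double t1, double tn) {
    return tn / t1 * 100.0;
}

// Speedup S = T_1 / T_n.
double speedup(double t1, double tn) {
    return t1 / tn;
}

// Efficiency E = T_1 / (N * T_n), i.e. speedup per processor.
double efficiency(double t1, double tn, int n) {
    return t1 / (n * tn);
}
```

With $T_1 = 1.65$ s, $T_n = 1.00$ s and $N = 2$, this gives $R \approx 60.6\%$, $S = 1.65$ and $E \approx 0.83$.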

#### Performance evaluation. VPL Simulator



#### Allows estimating:

1. Performance requirements for cores of the embedded system
2. Memory requirements
3. Load balance of various allocations
4. Volume and intensity of data exchange
5. Efficiency of hardware occupation
6. Bottlenecks of the hardware platform, program and task distribution

# Support of heterogeneous platforms programming





- Mapping operators to one or several core types (CPU, GPU, DSP, DMA)
  - Operators on various core types
  - Data on various data types
  - Data exchanges on various connection types
- Selecting the implementation for data processing operators
- Preparing the input data and results of the program's operators, taking into account the specifics of the different communication mechanisms

# Heterogeneous allocation



# Deployment to target platforms

Visual development environment: validation, verification, interactive debugging

#### Working prototypes

- ANSI C
- C++
- RT-run-time on a multicore platform (in progress)
- Parallel OpenMP

#### Proof of concept

- Parallel threads
- MPI
- Assembly (MIPS, DSP)





## Use cases and demonstrators

# Use case: face identification. Task description



#### Use case: face identification





- Software part: low quality of face recognition
- Hardware part: Ci20 (prospectively ELISE by ELVEES)
- Computations only on the CPU
- Works slowly

Terminator Vision System. Student project

An autonomous cyber-physical system combining multicore computations, control and mechanical parts. The Vision System identifies people from the database and tracks them with a rotating camera.

The project was presented at the **hackster.io** + **Imagination** challenge:

https://www.hackster.io/contests/CI20

- Project is developed with VIPE
- Face recognition uses a trained neural network
- A database of faces was created for face classification
- Tracking uses a servo controlled by an Arduino, which receives commands from the Ci20 board



## Use case: number plate recognition








- Software part: low quality of number plate recognition
- Hardware part: Ci20 (prospectively ELISE by ELVEES)
- Computations only on the CPU
- Works slowly






## Use case: number plate recognition. Scheme of working with the Imagination Creator Ci20 board



#### VIPE one-button deployment





# Feature tracking (OpenVX). DSL and design



## Feature tracking (OpenVX). Results



The feature tracking program runs on the x86 platform using the Khronos sample implementation



### Feature tracking. Static analysis





Performance estimation of the feature tracking program with **sequential** frame processing

Performance estimation of the feature tracking program with **parallel** frame processing

## Feature tracking. Visual Profiler

A large share of the time is taken by the image format conversion function (from the OpenVX format and back)



Profiling of the feature tracking program

## Traffic radar object detection. Development




## Traffic radar object detection. Static analysis



Static analysis shows an acceptable time reduction on 2-3 cores.

However, the results were worse than expected. Static analysis of the "Data processing units" subprogram shows a nearly linear time reduction for 8 cores.



## Traffic radar object detection. Visual Profiler



Visual Profiler shows that a large share of the time is taken by the function that reads the input file (the prototype takes its data from an input file).

The file reading function is in the sequential part, so the achievable parallel speedup is limited by Amdahl's law. The actual acquisition of input data should be optimized to take less time. Evaluating the program with a reduced running time for the reading function shows satisfactory results.
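The effect of a sequential file-reading stage can be illustrated with Amdahl's law, $S(N) = \frac{1}{s + (1 - s)/N}$, where $s$ is the sequential fraction of the single-core run time. The sketch below is generic; the serial fraction of 0.25 used in the note is an illustrative assumption, not a value measured for the radar program, and the function names are not part of VIPE.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: upper bound on speedup with n cores when a fraction
// serialFraction (between 0 and 1) of the single-core run time
// cannot be parallelized.
double amdahlSpeedup(double serialFraction, int n) {
    return 1.0 / (serialFraction + (1.0 - serialFraction) / n);
}

// Limit of the bound as the number of cores grows without bound.
// Requires serialFraction > 0.
double amdahlLimit(double serialFraction) {
    return 1.0 / serialFraction;
}
```

With an assumed serial fraction of 0.25, four cores give a speedup of at most about 2.29, and no number of cores can exceed 4. Shrinking the sequential reading stage raises both bounds, which is why reducing the reading time improves the evaluation.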



# Traffic radar object detection. Comparison of the results of analysis, simulation and execution

| Cores (simulator) or threads (OpenMP) | Static analysis, s / % | VPL simulation, s / % | Execution, s / % |
|--------------------------------------|--------------------------|-----------------------------|--------------------|
| Without OpenMP                       |                          |                             | 1.60               |
| 1                                    | 1.26 / 100               | 1.29 / 100                  | 1.65 / 100         |
| 2                                    | 0.64 / 50.8              | 0.88 / 68.2                 | 1.00 / 60.6        |
| 3                                    | 0.60 / 47.6              | 0.72 / 55.8                 | 0.81 / 49.1        |
| 4                                    | 0.34 / 30.0              | 0.55 / 42.6                 | 1.25 / 76.7        |
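
One way to read the table is to compare each estimate with the measured execution time. The sketch below copies rows 1 to 4 of the table as-is and computes the signed deviation of an estimate from the measurement; the `Row` type and the function name are illustrative and not part of VIPE.

```cpp
#include <cassert>
#include <cmath>

// One row of the comparison table, times in seconds.
struct Row {
    int cores;
    double staticAnalysis;
    double simulation;
    double execution;
};

// Rows 1-4 of the table, copied as-is.
const Row kRows[] = {
    {1, 1.26, 1.29, 1.65},
    {2, 0.64, 0.88, 1.00},
    {3, 0.60, 0.72, 0.81},
    {4, 0.34, 0.55, 1.25},
};

// Signed deviation of an estimate from the measured time, in percent.
// Negative values mean the estimate predicted a shorter time than measured.
double deviationPercent(double estimate, double measured) {
    return (estimate - measured) / measured * 100.0;
}
```

For two cores, static analysis comes out at -36 % and the simulator at -12 % relative to the measured 1.00 s.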

#### Hardware platform

- Core i7, 8 cores -> VirtualBox VM, 4 cores
- Ubuntu 14.04, GCC 4.9.2

#### Input data

- 12 MB of signal samples

### Summary

- The technology covers the varied requirements of embedded software development
  - Design, programming, evaluation, porting, etc.
- DSLs involve domain specialists in the development process
- Rapid software prototyping for early customer presentations
- A formal model as the basis for proven, predictable results, including debugging
- Fast adaptation of tools to new cores and platforms
  - A supporting development tool for heterogeneous cores, platforms, and system software infrastructure