Wednesday, July 5, 2017

Parallel processors and specialized processor v1.15

A.     Version History

 

Date         Revision   Description                                  Author
2017-07-05   A          Document created; features added in          Vinh Nguyen (canvinh@gmail.com)
                        subsequent revisions
2022-04-09   PB1        Added section 15 for multiple monitors       Vinh Nguyen


B.     System Descriptions




Figure 1. Layout of data flow from a program to a processor.
I will base my notes on the syntax of the PLEX-C programming language to make the flow of data and processing easier to illustrate. I can’t recall the exact syntax after 20 years, so the examples are approximate.


1. Generic Processors 
Assume that a program starts executing code, or that a program is triggered by receiving a signal (or a function/procedure call) with data. The whole block of code before the statement Exit would be sent to a generic processor, i.e. either Processor 1 or Processor 2 in the diagram above. For example, here is the code of a program receiving a signal ReportData from another program.
Receive ReportData data1, data2, data3, data4;
Goto Label_ProcessingData;
---------

Label_ProcessingData;
! Program statements to process the data included in the signal ReportData.
! The section of code between Label_ProcessingData and Exit is placed in the
! Processing Queue to be executed by a generic processor.

SEND PrintData WITH data5, data6, data7, data8;
SP_PROCESS SP_Display3dReport WITH data5, data6, data7, data8, data9, data10;

Exit;  ! Releases the processor so other programs can execute their code.
----------

The OS would place the entire section of code between Label_ProcessingData and Exit in the Processing Queue to wait for an idle generic processor. The idle generic processor would set a busy flag to prevent other processors from accessing the queue(s), fetch the available data from the queue for execution, and then clear the busy flag.
Depending on the variable type or address in the section of code, e.g. permanent or temporary, the processor would save the data in the database for other programs to use.

In this case, the signal PrintData would be placed in the Delivery Queue to be dispatched to its destination program.

The signal SP_Display3dReport (prefixed with the characters SP_) and its data would be placed in the Special Processing Queue by the generic processor, to be executed by the specialized processor discussed later. The system identifies the signal SP_Display3dReport by the syntax SP_PROCESS. Using the prefix SP_ would also help in debugging.
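
As a rough illustration only (not PLEX or APZ code), the busy-flag handling could look like the Python sketch below, where the lock stands in for the busy flag and the names ProcessingQueue and run_generic_processor are made up for this note.
-----------------
import threading
from collections import deque

class ProcessingQueue:
    """Shared queue of code sections; the lock plays the role of the busy flag."""
    def __init__(self):
        self._sections = deque()
        self._busy = threading.Lock()   # "busy flag" preventing concurrent access

    def deposit(self, section):
        with self._busy:                # set busy flag; cleared when the block ends
            self._sections.append(section)

    def fetch(self):
        with self._busy:                # set busy flag, fetch one section, clear flag
            return self._sections.popleft() if self._sections else None

def run_generic_processor(name, queue):
    """An idle generic processor repeatedly fetches one code section and executes it."""
    while True:
        section = queue.fetch()
        if section is None:
            break                       # nothing left to execute
        section()                       # run the code between the label and Exit

# Example: two generic processors draining the same queue in parallel.
if __name__ == "__main__":
    q = ProcessingQueue()
    for i in range(4):
        q.deposit(lambda i=i: print(f"processing ReportData section {i}"))
    workers = [threading.Thread(target=run_generic_processor, args=(n, q))
               for n in ("Processor 1", "Processor 2")]
    for w in workers: w.start()
    for w in workers: w.join()
-----------------
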
2. Specialized Processor

This specialized processor executes special tasks or signals for faster results. To keep the system simple, the sections of code and the signals sent to this processor would be prefixed with the reserved characters “SPLabel_” and “SP_” respectively. The compiler would parse the code and send the correct requests to the Special Processing Queue. For example, here is the code of a program receiving a signal DisplayRotatingAirplane from another program.
Receive DisplayRotatingAirplane vertex1, vertex2, vertex3, vertex4;

Goto SPLabel_DisplayAirplane;
---------

SPLabel_DisplayAirplane;
! Specialized program statements to process the data included in the signal
! DisplayRotatingAirplane. The section of code between SPLabel_DisplayAirplane
! and SP_Exit is placed in the Special Processing Queue to be executed by the
! specialized processor.

EXEC SP_RotateAirplane WITH vertex5, vertex6, vertex7, vertex8;

SEND RecordData with data1, data2, data3, data4;
SP_Exit;  ! Releases the processor so other programs can execute their code.

----------
The OS would place the entire section of code between SPLabel_DisplayAirplane and SP_Exit in the Special Processing Queue to wait for execution.

Depending on the variable type or address in the section of code, e.g. permanent or temporary, the processor would save the data in the database for other programs to use.

In this case, the signal SP_RotateAirplane would also be executed by the specialized processor.

The specialized processor would place the signal RecordData with its data in the Delivery Queue for delivery.

This specialized processor may send data back to other programs in ordinary signals, or in predetermined signals with data, for further processing. When sending an ordinary signal, the signal would be placed in the Delivery Queue or the Special Delivery Queue for dispatching; the Delivery Queue and the Special Delivery Queue could be the same, i.e. the system might require only one Delivery Queue.
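
A minimal sketch of how a processor might route outgoing signals, assuming the convention above: signals prefixed with SP_ go to the Special Processing Queue and ordinary signals go to the Delivery Queue. The queue names and the route_signal helper are illustrative only, not actual APZ interfaces.
-----------------
from collections import deque

delivery_queue = deque()            # ordinary signals waiting to be dispatched
special_processing_queue = deque()  # SP_-prefixed work for the specialized processor

def route_signal(signal_name, *data):
    """Route a signal produced by SEND / SP_PROCESS based on its prefix."""
    if signal_name.startswith("SP_"):
        special_processing_queue.append((signal_name, data))
    else:
        delivery_queue.append((signal_name, data))

# Mirroring the example in section 1:
route_signal("PrintData", "data5", "data6", "data7", "data8")
route_signal("SP_Display3dReport", "data5", "data6", "data7", "data8", "data9", "data10")

print("Delivery Queue:          ", list(delivery_queue))
print("Special Processing Queue:", list(special_processing_queue))
-----------------
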
3. Difference between generic processor and specialized processor

The parallel generic processors would be used to execute common tasks. Any of the generic processors would be able to handle the same code specified by the requesting application.

The specialized processor is a customized processor with firmware/software and hardware, including a microprocessor, dedicated to special tasks, e.g. hidden-surface removal for an object. This would be useful in robotic simulation or 3D games.

The application could send a list of vertices of an object and a viewpoint to the specialized processor, and the object would be displayed properly with hidden surfaces removed. For example, the software provider could provide an interface to declare an object as below.
Object DisplayAirplane;
Attribute 1: Linked list of surfaceCharacteristics;
Attribute 2: ViewPoint;
End;
-----

Object surfaceCharacteristics;
Attribute 1: Vertices for a surface;
Attribute 2: Image or color to be displayed on the surface above;
End;

If the specialized processor could handle many similar special tasks, the cost of the system would go down, because a specialized microprocessor is expensive.
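The object declaration above might map onto ordinary data structures roughly as in the sketch below; the field names come from the declaration, while Vertex and the example values are made up for illustration.
-----------------
from dataclasses import dataclass
from typing import List, Tuple

Vertex = Tuple[float, float, float]          # an (x, y, z) point, assumed for illustration

@dataclass
class SurfaceCharacteristics:
    vertices: List[Vertex]                   # Attribute 1: vertices for a surface
    image_or_color: str                      # Attribute 2: image or color shown on the surface

@dataclass
class DisplayAirplane:
    surfaces: List[SurfaceCharacteristics]   # Attribute 1: linked list of surfaceCharacteristics
    view_point: Vertex                       # Attribute 2: view point

# The application would build an object like this and hand it to the specialized
# processor, which displays it with hidden surfaces removed.
airplane = DisplayAirplane(
    surfaces=[SurfaceCharacteristics(vertices=[(0, 0, 0), (1, 0, 0), (0, 1, 0)],
                                     image_or_color="grey")],
    view_point=(10.0, 5.0, 3.0),
)
-----------------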

4. Ericsson’s PLEX and other programming languages

Generic processor 1 and Generic processor 2 are running in parallel. The specialized processor is customized for specific tasks.

The PLEX programming language provides "Exit" to release the processor so other programs can execute their tasks, which makes it easier to support parallel generic processors. Programs (blocks) developed at Ericsson followed a similar pattern, i.e. a few tens of lines of code between receiving a signal (which triggers execution of code in that block) and an Exit. This would help distribute code evenly across all generic processors.

Other programming languages such as Java, C++, C#, etc. support a continuous chain (stack) of function calls, i.e. a processor is seized for a long period of time. That makes it harder to spread or distribute tasks evenly across many generic processors.

In the diagram, only 2 identical generic processors were described. However, a system could be equipped with as many generic processors as needed, i.e. just plug in a new circuit board carrying a generic processor with the proper bus and communication signals.

5. Several specialized processors

The goal was to keep the product cost low while developing a good system. Having more specialized processors would make the system expensive, i.e. harder to sell to many end users. However, developers could add more specialized processors as needed.

The modification would be:


·        Specialized processor 1 would be specified with “SP1_” and “SPLabel1_” in the programming language, adding Special Processing Queue 1 and Delivery Queue 1.
·        Specialized processor 2 would be specified with “SP2_” and “SPLabel2_” in the programming language, adding Special Processing Queue 2 and Delivery Queue 2.
The entire process would be the same as for a single specialized processor. The system would correctly dispatch signals and sections of code to a specialized processor based on the prefix characters, as sketched below.
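With several specialized processors, the routing becomes a small table from prefix to queue. This is a sketch only, with hypothetical queue names.
-----------------
from collections import deque

# One special processing queue per specialized processor, keyed by prefix.
special_queues = {"SP1_": deque(), "SP2_": deque()}
generic_processing_queue = deque()

def dispatch(signal_name, *data):
    """Send SP1_/SP2_ signals to their specialized processor; everything else is generic."""
    for prefix, queue in special_queues.items():
        if signal_name.startswith(prefix):
            queue.append((signal_name, data))
            return
    generic_processing_queue.append((signal_name, data))

dispatch("SP1_RotateAirplane", "vertex5", "vertex6")
dispatch("SP2_LocateServer", "zone1", "zone2")
dispatch("ReportData", "data1")
-----------------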

If the compiler had already converted the high-level language code into assembly code before loading it into a queue or a processor, the statements/signals with prefixes would still be treated differently, so the code would be properly dispatched and executed.

Personally, I don’t think a system would need parallel specialized processors. If it did, developers could add identical specialized processor(s) in the same way as generic processors.

In the diagram, only 2 generic parallel processors were described, but more generic processors could be added as needed. All discussions cover 2 or more generic processors. The only issue is the higher cost of a computer with more processors. A typical laptop would only need 2 generic parallel processors plus a specialized graphics processor for highly demanding graphics games.

6. Processing queue and Delivery queue
In the case above, a processing queue stores many sections of code deposited by many programs during execution. Each section of code would be marked with its beginning and end in the queue, so a processor could fetch one section of code at a time when it was ready for processing.
A delivery queue in this case stores many signals with data to be dispatched to their destination programs. Each signal would be delivered whenever a processor was scheduled or available to process this queue. All processors could be kept busy with high-priority tasks by the operating system.
Delivering a signal is an important task, because a signal triggers a process in an application. One method was to implement this in the processor(s): an idle or free processor would check the delivery queue after a certain number of hardware clock cycles in order to deliver signals to their destination programs. However, having a processor stop and check the delivery queue over a very short period, e.g. every nanosecond, would be overkill and could slow the system down unnecessarily. This process could be logged and fine-tuned by the operating system to find an optimal period.
All queues could instead be implemented in the operating system, in which case the operating system would load sections of code onto the motherboard for processing, i.e. a processor would not need to fetch the code from a queue itself.
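
As a rough sketch of the clock-cycle-driven check: a free processor looks at the Delivery Queue every N cycles and dispatches whatever is waiting. The cycle count, queue layout, and deliver function below are assumptions for illustration only.
-----------------
from collections import deque

CHECK_INTERVAL_CYCLES = 1000          # assumed tuning value; see section 10.2

delivery_queue = deque()              # (destination_program, signal, data) tuples

def deliver(destination, signal, data):
    print(f"delivering {signal}{data} to {destination}")

def on_clock_tick(cycle_count):
    """Called once per hardware clock cycle; drains the queue every N cycles."""
    if cycle_count % CHECK_INTERVAL_CYCLES != 0:
        return
    while delivery_queue:
        destination, signal, data = delivery_queue.popleft()
        deliver(destination, signal, data)

# Example: a signal deposited by a generic processor is picked up at the next check.
delivery_queue.append(("PrintProgram", "PrintData", ("data5", "data6")))
for cycle in range(1, 2001):
    on_clock_tick(cycle)
-----------------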

7. Difference between Microsoft Windows and APZ in design

Microsoft develops the Windows OS while other suppliers deliver the hardware components, including the motherboard. The motherboard and other components establish and exchange data controlled by the OS via a protocol. However, Microsoft would not grant other components access to its core OS. For example, it’s likely that the OS would load a section of code or many tasks onto the motherboard. This adds extra work for the system to figure out which processor(s) are idle, either by polling or by signals (sent by the idle processor or triggered by another processor), i.e. it slows the system down a little.

APZ was developed by Ericsson for both the hardware and the OS. This gives the APZ designers complete freedom in system design to optimize code/hardware for better overall performance. For example, the motherboard (processor) could be equipped with a flash disk to store special code used by the microprocessor(s). An idle processor could access the processing queue(s) to download the necessary section of code for execution. Of course, the motherboard or processor would be more expensive.

8. Explanation

The process of checking the delivery queue to deliver signals with data, based on the hardware (crystal) clock cycle, could be tied to one processor or to all parallel generic processors in the system. Microprocessors these days are reliable, so having a single generic processor check the delivery queue would simplify the circuitry and its implementation, including the processing power required. However, a critical system would tie this process to more than one processor, or to all generic processors, to ensure the system remains functional if a fault happens, e.g. a faulty processor assigned to this process would not stop the entire system.

An OS supporting this design would work for a motherboard with a single generic processor or more.

The specialized processor is an optional component in the system. If it is added to the motherboard and supported by a special programming language with a library, the supplier must coordinate with the OS and the generic processor providers in order to use the generic processors in its applications.

The programming language used in the examples is PLEX. This language supports protocol-style application development, which reduces the stacking of function calls in the system and releases processors for other tasks in an efficient way. Personally, I found PLEX easy to understand and to debug with the Test System tool.

If the Delivery Queue is designed to deliver signals with data, the system would probably need only one such queue for both the generic processors and the specialized processor(s). However, a specialized processor could communicate with itself and with other components using signals plus unique data/information, in which case a separate delivery queue would be required.
9. Clarification

Some of these notes may be covered by patents, so you should check with patent offices before implementing them. I have used some of my knowledge of Ericsson’s APZ architecture, based on my memory, to discuss parallel processors. Many of our PCs and laptops use multiple processor cores instead of parallel processors for faster task execution.

I intended to illustrate a system with parallel processors and a specialized processor. However, the entire system would have to be implemented across many different components of a real computer system, e.g. the programming language, compiler, operating system, assembly language, motherboard hardware, and microprocessors.

I haven’t worked in any development of an operating system or hardware design in my career.

10. Examples

10.1 Current laptop configuration

Currently, a laptop with an Intel i5 and 8 GB of RAM is around $650 CAD. A laptop with an Intel i7 and more RAM (16 GB) could be around $3000 CAD.

I found that the i5 laptop with Windows 10 performed well for daily tasks such as web chat and reading news. My kids could play many web games without any issues.

Thus, a laptop equipped with 2 Intel i5 microprocessors as generic parallel processors would perform very well. A specialized (graphics) microprocessor would be needed for users playing 3D games or virtual reality games, or with other intensive processing demands. Such a laptop would cost less than $3000 CAD but outperform the $3000 CAD laptop above.

10.2 Tuning the number of clock cycles required to deliver signals in the Delivery Queue


There is no good way to guess the best number of clock cycles between dispatches of signals in the Delivery Queue for the system’s best performance. There are many unknown variables to consider in this case, e.g.


·        Each application loaded on the system could send a different number of signals to communicate with other programs.

·        The operating system could perform many tasks at a time, which also requires many or only a few signals to be dispatched.
The system could propose guidelines for fine-tuning, such as:
·        There shouldn’t be many signals waiting to be dispatched in the queue at the preset number of clock cycles. If many signals are waiting in the queue, the number of clock cycles that triggers the delivery process should be lowered;

·        The system designer should recommend or suggest a formula for users to adjust this “number of clock cycles”;

·        They could also provide a support phone number in this case, because this task is important for system performance.
Dispatching signals based on crystal clock cycles is better than using a timer, because a crystal clock is already included in the hardware design. The clock is also used in the control logic, so the designer only needs to keep the optimized cycle count on a flash disk, which can easily be overwritten with another (tuning) value.

A timer would require logic and calculation, i.e. some processing power.
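
One possible tuning rule, as a sketch only: lower the cycle count when a backlog builds up and raise it slowly when the queue is nearly empty. The thresholds and step sizes below are arbitrary assumptions, exactly the kind of values a designer would log and tune.
-----------------
MIN_CYCLES, MAX_CYCLES = 100, 100_000     # assumed bounds for the tuning value
BACKLOG_HIGH, BACKLOG_LOW = 50, 5         # assumed thresholds on queue length

def tune_check_interval(current_cycles, signals_waiting):
    """Adjust the number of clock cycles between Delivery Queue checks."""
    if signals_waiting > BACKLOG_HIGH:
        # Too many signals waiting: check the queue more often.
        current_cycles = max(MIN_CYCLES, current_cycles // 2)
    elif signals_waiting < BACKLOG_LOW:
        # Queue almost empty: checking less often wastes less processing power.
        current_cycles = min(MAX_CYCLES, current_cycles * 2)
    return current_cycles                 # value to be stored on the flash disk

# Example run: a backlog of 80 signals halves the interval, an empty queue doubles it.
interval = 1000
interval = tune_check_interval(interval, signals_waiting=80)   # -> 500
interval = tune_check_interval(interval, signals_waiting=0)    # -> 1000
-----------------
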
10.3 Example of a block (program) in PLEX


An example of a program in PLEX could be as follows.

! Section reserved for signal entry
RECEIVE SIGNAL ABC with data1, data2, data3;
GOTO label1;
RECEIVE SIGNAL DEF with data4, data5, data6, data7;
GOTO label2;
RECEIVE SIGNAL GHI with data8, data9;
GOTO label3;

! Processing of data received by signals
label1;

    Programming codes for processing data related to Signal ABC. A
    signal with data could be sent out for further processing.

Exit;

label2;

   Programming codes for processing data related to Signal DEF. A
   signal with data could be sent out for further processing.
Exit;
label3;
   Programming codes for processing data related to Signal GHI. A
   signal with data could be sent out for further processing.
Exit;
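
In a conventional language, the same block could be approximated with a table that maps each incoming signal to its handler. This is a sketch only; the signal and handler names simply mirror the PLEX example above.
-----------------
def handle_abc(data1, data2, data3):
    # Programming code for processing data related to Signal ABC;
    # a signal with data could be sent out here for further processing.
    print("ABC processed", data1, data2, data3)

def handle_def(data4, data5, data6, data7):
    print("DEF processed", data4, data5, data6, data7)

def handle_ghi(data8, data9):
    print("GHI processed", data8, data9)

# "Section reserved for signal entry": RECEIVE SIGNAL ... / GOTO label...
signal_handlers = {"ABC": handle_abc, "DEF": handle_def, "GHI": handle_ghi}

def receive(signal_name, *data):
    """Dispatch an incoming signal to its label; returning plays the role of Exit."""
    signal_handlers[signal_name](*data)

receive("ABC", 1, 2, 3)
receive("GHI", 8, 9)
-----------------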

11. Parallel processing

11.1 Prerequisite data not ready in a processor
Parallel processing may result in a section of code arriving at a processor before its required (prerequisite) data are ready. It may have to wait for another part of the code, which should have executed earlier, to produce the necessary data.
For generic processors, it is very likely that sections of code arrive at a processor in the correct sequential order. If the data are not ready, that section of code must be reinserted into the processing queue. Should it be inserted at the beginning, the middle, or the end of the queue for best performance?
For any system equipped with a specialized processor, coordination between code arriving at a generic processor and code arriving at the specialized processor would be more complicated.
Logging data about this type of problem could help system designers understand and fine-tune their application(s).

11.2 Supplementary Processing Queue


One method to resolve the issue above is to implement a Supplementary Processing Queue that stores the early-arriving section of code while it waits for its required data to be produced by another part of the application.
The programming language could provide a special statement to instruct the processor to move the entire section of code currently on the processor to the Supplementary Processing Queue. For example,
-----------------
RECEIVE SIGNAL ABC with data1, data2, data3;
GOTO label1;
-----------------
label1;
     IF (database_field_XYZ <> 1) THEN
            Programming codes for processing data related to Signal ABC. A
            signal with data could be sent out for further processing.
    ELSE  ! data in the database was not set properly by another process
           DELAY label1;  ! The processor loads the entire section of code
                          ! between label1 and Exit into the supplementary
                          ! processing queue.
    ENDIF;
Exit;
-----------------
The only question is when the delayed section of code should be loaded back onto a generic processor for best performance.
Usually an application is designed so that the data flow of the program is sequential. The code that produces the required data may be about to execute, or may already be executing, on an adjacent processor.
By adding a counter that counts the number of code sections loaded from the processing queue onto the generic processors, we could estimate when to load the delayed code back onto an idle processor. For example, after 3 loads the delayed code would be loaded back onto a processor. This counter could be dynamically adjusted if “3” turns out not to be the best guess.
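
A rough sketch of the DELAY mechanism under the assumptions above: a section whose prerequisite data are missing goes to a supplementary queue and is reinserted after a fixed number of later loads (3 here, as in the text). The class and method names are hypothetical.
-----------------
from collections import deque

RELOAD_AFTER_LOADS = 3          # "after 3 loads of codes", adjustable dynamically

class Scheduler:
    def __init__(self):
        self.processing_queue = deque()       # ordinary sections of code
        self.supplementary_queue = deque()    # (section, load_count_when_delayed)
        self.load_counter = 0                 # sections loaded onto generic processors

    def next_section(self):
        """Pick the next section of code for an idle generic processor."""
        # Reinsert a delayed section once enough later loads have happened.
        if (self.supplementary_queue and
                self.load_counter - self.supplementary_queue[0][1] >= RELOAD_AFTER_LOADS):
            section, _ = self.supplementary_queue.popleft()
            return section
        if self.processing_queue:
            self.load_counter += 1
            return self.processing_queue.popleft()
        return None

    def delay(self, section):
        """DELAY label1; -> park the section until its prerequisite data exist."""
        self.supplementary_queue.append((section, self.load_counter))
-----------------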

12. Preloading data from memory into each processor


12.1 Concurrent main memory access 


Going down to the microprocessor level, concurrent memory “write” access is not possible, i.e. only one microprocessor can write to memory at a time. However, many processors can read data from memory at the same time.
If we could equip each generic processor (and the specialized processor), each of which includes a microprocessor with a cache, to preload the data (variables) required by its section of code, processing could be faster for the entire system. In brief, each microprocessor could read and write the preloaded data in its own cache memory in parallel.
There is a case where preloading data does not work as stated above: software developers could write a section of code that requires reading from and writing to the database in main memory within that section.
To permit many processors to preload data for many sections of code simultaneously, we could implement a few flags in the hardware or the operating system, for the memory and for each processor, such as
-         The Memory flag would be RedW (red-write) when a write operation is in progress, and RedR (red-read) when one or more read operations are in progress.

-         A green Processor Write flag means the processor is writing data into main memory; the Memory flag is set to RedW in this case. When it finishes writing, it sets its Processor Write flag to red.

-         A yellow Processor Write flag means the processor is waiting to write into main memory. This process must stay in the “waiting queue”, or simply wait, while the Memory flag is still RedW or RedR. If the Memory flag is green, it can start to write and set the Memory flag to RedW.

-         A yellow Processor Read flag means the generic processor is ready to read data from memory.

-         A green Processor Read flag means the generic processor is reading data from memory. After reading from main memory, it sets the Processor Read flag to red. While reading from main memory, the Memory flag is set to RedR. Because parallel reading is permitted and conflict-free, all processors can read data at the same time.

-         The operating system would coordinate the whole system. If the Memory flag is RedW or RedR, one or more Processor Write flags are green, and one or more Processor Read flags are yellow, it would wait for the current writing process to finish and the Memory flag to return to green, then:

o   Change the yellow Processor Read flags to green;

o   Allow all reading processes to preload data into each processor’s cache, and set the Memory flag to RedR;

o   In this case, all pending yellow Processor Write flags must wait until the reading processes have completed and the Memory flag is back to green again.

-         If all Processor Write flags are yellow and the Memory flag is green, the system would allow all pending read processes to start.

-         If the Memory flag is RedR and a processor’s Read flag is yellow, the operating system should let that processor read data from main memory.

-         If a read or write operation is in progress, the Memory flag should be RedR or RedW. If no read/write operation is occurring, the Memory flag is green.
The above process would be better illustrated in a flow chart; a rough code sketch is given below.
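
The flag scheme above behaves much like a readers-writer lock: many concurrent readers, one exclusive writer. Below is a minimal sketch using that analogy; RedR/RedW are represented by a reader count and a writer flag, and the class name MemoryFlags is made up.
-----------------
import threading

class MemoryFlags:
    """Readers-writer coordination approximating the RedR / RedW / green Memory flag."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0          # > 0  -> Memory flag is RedR
        self._writing = False      # True -> Memory flag is RedW; otherwise green

    def start_read(self):          # Processor Read flag: yellow -> green
        with self._cond:
            while self._writing:                 # wait while the flag is RedW
                self._cond.wait()
            self._readers += 1                   # many processors may read in parallel

    def end_read(self):            # Processor Read flag: green -> red
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()          # Memory flag back to green

    def start_write(self):         # Processor Write flag: yellow -> green
        with self._cond:
            while self._writing or self._readers > 0:   # wait while RedW or RedR
                self._cond.wait()
            self._writing = True                 # Memory flag becomes RedW

    def end_write(self):           # Processor Write flag: green -> red
        with self._cond:
            self._writing = False
            self._cond.notify_all()              # Memory flag back to green
-----------------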

12.2 Handling of preloaded variables in a processor


As described in the section Concurrent Main Memory Access, the process of reading from and writing to main memory is complicated. The system would slow down considerably if many read and write operations were required for the same section of code.
a. Typical case where each variable is processed once
During preloading, each variable would be stored in a placeholder in the processor’s cache. A placeholder is associated with a memory address in main memory or on a hard disk.
At the end of code execution, the “changed or written” permanent data in the placeholders would be transferred to main memory.
b. A variable processed several times in a section of code
There is a case where a program reads a value from main memory into a variable, writes a new value into that variable, and then uses the same variable again in the same section of code. This would normally require a write access to main memory for the new value, and then another read of that value in the later statements.
To save a write access to main memory during execution of this section, some flags could be set while compiling the application to note the above scenario, so that only one write access is required at the end of the section’s execution, i.e. a write operation in the middle of the code is not required.
During the first compilation pass, the compiler would notice the same variable being used several times in the section. In the second pass, the compiler would mark the first use of this variable as requiring a read of the value from main memory into a placeholder in cache memory. The second update of this variable would write the new value into this placeholder. The third use of this variable would read from this placeholder in the cache instead of reading from main memory.
At the end of code execution, only the “changed or written” permanent data in the placeholders would be transferred to main memory.
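
A sketch of the placeholder idea, assuming a simple per-processor dictionary keyed by the main-memory address; only placeholders that were written are flushed back at the end of the section, matching the compiler behaviour described above. The names are illustrative.
-----------------
class PlaceholderCache:
    """Per-processor cache of placeholders, written back to main memory at Exit."""
    def __init__(self, main_memory):
        self.main_memory = main_memory   # dict: address -> value (stands in for RAM/disk)
        self.placeholders = {}           # address -> value preloaded into this cache
        self.dirty = set()               # addresses whose placeholders were written

    def read(self, address):
        # First use: read from main memory into a placeholder; later uses hit the cache.
        if address not in self.placeholders:
            self.placeholders[address] = self.main_memory[address]
        return self.placeholders[address]

    def write(self, address, value):
        # Updates go to the placeholder only; no main-memory write in mid-section.
        self.placeholders[address] = value
        self.dirty.add(address)

    def flush(self):
        # At the end of the code section, transfer only changed permanent data.
        for address in self.dirty:
            self.main_memory[address] = self.placeholders[address]
        self.dirty.clear()

# Example: read, update, and re-use a variable with a single write-back at the end.
ram = {"counter": 7}
cache = PlaceholderCache(ram)
cache.write("counter", cache.read("counter") + 1)
print(cache.read("counter"), ram["counter"])   # 8 7  (main memory not yet updated)
cache.flush()
print(ram["counter"])                          # 8
-----------------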

13. Parallel memory 


Main memory cannot handle concurrent writes by many microprocessors at a time, because that could create a “corrupted data” scenario in the application.
The system could create several memory spaces and let application developers use variables in each memory space separately. This would help the concurrent writing process. In terms of hardware, there would be a separate data bus to each memory space. Variables without any of the prefixes below would use the original memory space.
This could be done by using a different prefix for each set of variables. For example,
-         Variables starting with “S1_” are stored in memory space 1.

-         Variables starting with “S2_” are stored in memory space 2.
A system could be upgraded with parallel memory even after it has been deployed in the field. New applications only need to use the appropriate prefix recognized by the parallel memory space. The hardware and operating system must be upgraded along the way, but existing applications could be kept intact.
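
A sketch of the prefix convention, assuming each memory space is just a separate store behind its own bus; the variable name decides where the value lives, and un-prefixed names stay in the original space. The store names are made up.
-----------------
# One dictionary per memory space, standing in for physically separate memories/buses.
memory_spaces = {
    "S1_": {},      # memory space 1
    "S2_": {},      # memory space 2
    "":    {},      # original memory space for un-prefixed variables
}

def space_for(variable_name):
    for prefix in ("S1_", "S2_"):
        if variable_name.startswith(prefix):
            return memory_spaces[prefix]
    return memory_spaces[""]

def store(variable_name, value):
    space_for(variable_name)[variable_name] = value

def load(variable_name):
    return space_for(variable_name)[variable_name]

# Writes to S1_ and S2_ variables could proceed concurrently on separate buses.
store("S1_position", 42)
store("S2_velocity", 7)
store("total", 49)           # no prefix: original memory space
print(load("S1_position"), load("S2_velocity"), load("total"))
-----------------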


Parallel memory would be supported by separate buses from the motherboard to its hard disk (RAM is similar, with separate buses). The hard disk is like a cylinder with two different platters attached to an axle. There would be read heads, e.g. 2 read heads for 2 platters, rotating around the axle and accessing data at different locations on the 2 platters in parallel.

14. Miscellaneous notes
14.1 Special Syntax

The Special Delivery Queue was removed from the diagram because the syntax and examples didn't use it.

If it is needed, syntax in addition to EXEC and SP_PROCESS must be used. For example, in the statement SP_QUEUE SP_LocateServer (zone1, zone2, zone3), the signal SP_LocateServer would be placed in the Special Delivery Queue, waiting to be delivered to a program section. Depending on the operating system's allocation, the signals loaded into the Special Delivery Queue could be delivered to their destinations faster.

If the SP_QUEUE command were not implemented, the special code could send a signal using the regular Delivery Queue.


The syntax SEND could be replaced by the syntax QUEUE, i.e. the same functionality. These statements place a signal with data in the processing queue.

Figure 2. Specialized processor with delivery queue

14.2 Special Function prefixed with SP_

As described in section 1, the function SP_Display3dReport has the prefix “SP_”, which could be used in a special implementation. For example, a system provider could come up with a special library and a specialized processor:
·        Register the specialized processor with the OS, including a special prefix;
·        When the library is loaded into the IDE for the programming language, the IDE would allow users to apply the special prefix to all the functions added by the library;
·        Software developers would use the library functions as usual;
·        The OS would load all functions with that special prefix onto the specialized processor(s) for processing.


This implementation is slower than using the syntax SP_PROCESS. However, this option is probably the best choice for a Windows OS shared by many users and application providers, because they don’t need to pre-configure a specialized processor or use the syntax SP_PROCESS or SP_QUEUE.
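
A sketch of the library approach: the provider's functions carry a registered prefix, and a loader hands everything with that prefix to the specialized processor while plain functions stay on the generic processors. The decorator and registry names are hypothetical.
-----------------
SPECIAL_PREFIX = "SP_"            # prefix registered with the OS by the provider

special_functions = {}            # functions the OS loads onto the specialized processor
generic_functions = {}            # everything else runs on the generic processors

def register(func):
    """Library/IDE step: sort each function by its prefix when the library is loaded."""
    target = special_functions if func.__name__.startswith(SPECIAL_PREFIX) else generic_functions
    target[func.__name__] = func
    return func

@register
def SP_Display3dReport(*data):    # provider's specialized function
    return f"3D report rendered from {data}"

@register
def ReportData(*data):            # ordinary function, handled by a generic processor
    return f"report built from {data}"

# The developer calls the functions as usual; placement is decided by the prefix.
print(sorted(special_functions), sorted(generic_functions))
-----------------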

15. Multiple Monitors

To help stock traders or software developers, who often use multiple monitors in their work, we could design a 4-screen or 4-monitor laptop. The additional top (ID = 01), right (ID = 11), and left (ID = 10) monitors could be pulled out from the regular/main laptop monitor (ID = 00) to give the laptop 4 screens. Manufacturers could add 3 more (touch) keyboards underneath the regular keyboard, each controlling a specific monitor.

Figure 3. Layout of multiple monitor system

The communication between a software application and each monitor is described in figure 3. Each monitor is uniquely identified by the processor motherboard with ID = 00, 01, 10, or 11.

As shown in figure 3, each monitor is connected to the Monitor Dispatching Manager. When a user starts an application, the main monitor shows the intended data/UI of that application. The user would normally drag the screen/UI to the top, left, or right monitor for display.

There are 2 ways for a software application to communicate correctly with its assigned monitor:

·        When the user drags the UI to a screen, the system relays the ID of that monitor back to the application layer and stores it for subsequent communications.

·        When the user starts an application, the application prompts the user for a destination monitor, e.g. main, top, left, or right, and stores the monitor ID for subsequent communications.

In figure 3, only 4 monitors are used, so the system reserves 2 bits as the ID. These 2 bits would be embedded in each message processed by the OS and the processor motherboard, and then sent to the Monitor Dispatching Manager. Based on the value of these 2 bits, messages would be dispatched to the correct monitor for display.

If each application is assigned a unique ID, which is passed through the OS and the processor motherboard toward the monitor, the system does not need to pass the monitor ID (00, 01, 10, or 11) back to the application for future processing. The Monitor Dispatching Manager would simply associate the application ID with a monitor ID, and store that mapping locally for future communications once the user drags a UI to a specific monitor.
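
A sketch of the Monitor Dispatching Manager using the 2-bit IDs above; the message format, the class name, and the application IDs are assumptions for illustration.
-----------------
MONITOR_IDS = {0b00: "main", 0b01: "top", 0b10: "left", 0b11: "right"}

class MonitorDispatchingManager:
    def __init__(self):
        self.app_to_monitor = {}                 # application ID -> 2-bit monitor ID

    def assign(self, app_id, monitor_id):
        """Record the association after the user drags a UI to a monitor."""
        self.app_to_monitor[app_id] = monitor_id

    def dispatch(self, app_id, payload):
        """Route a display message to the correct monitor based on the stored 2-bit ID."""
        monitor_id = self.app_to_monitor.get(app_id, 0b00)   # default to the main monitor
        print(f"monitor {MONITOR_IDS[monitor_id]} ({monitor_id:02b}): {payload}")

manager = MonitorDispatchingManager()
manager.assign(app_id="trading-app", monitor_id=0b11)        # user dragged it to the right screen
manager.dispatch("trading-app", "quote window update")
manager.dispatch("browser", "news page")                     # never assigned -> main monitor
-----------------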