Sunday, March 10, 2019

Pointer Hack in Packing & Unpacking the Frame



The pointer is one of the most powerful tools in C; with it we can do almost anything we want. There are many hacks we can do with pointers. I thought of writing about one such hack, and about the ways the same hack can backfire. Let's get started.


Consider the following requirement: your project has more than one node, and the nodes are interconnected and communicate with one another. Whatever the communication line is (LAN or serial), the data is transmitted as a stream of bytes. Each node holds data of various datatypes and wants to share it with the others. Because the link carries data byte by byte, every node has to pack its data into bytes before sending and unpack it after receiving.

              
Suppose one node wants to share an array of uint32_t data. It has to split each element into bytes, least significant byte first (little-endian), as shown below:
                             
uint32_t u32DataBuffer[10];
uint8_t u8LanTxBuffer[100];
                                            
u8LanTxBuffer[ 0 ] = u32DataBuffer [ 0 ] & 0xFF;
u8LanTxBuffer[ 1 ] = ( u32DataBuffer [ 0 ] >> 8 ) & 0xFF;
u8LanTxBuffer[ 2 ] = ( u32DataBuffer [ 0 ] >> 16 ) & 0xFF;
u8LanTxBuffer[ 3 ] = ( u32DataBuffer [ 0 ] >> 24 ) & 0xFF;

                                            
Each node that receives the data has to reverse the conversion, as shown below:
              
uint32_t u32DataBuffer[10];
uint8_t u8LanRxBuffer[100];
                                            
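/* Cast before shifting: a uint8_t is promoted to int, and 0x80 << 24 overflows a 32-bit int. */
u32DataBuffer[ 0 ]  = u8LanRxBuffer[ 0 ];
u32DataBuffer[ 0 ] |= ( uint32_t ) u8LanRxBuffer[ 1 ] << 8;
u32DataBuffer[ 0 ] |= ( uint32_t ) u8LanRxBuffer[ 2 ] << 16;
u32DataBuffer[ 0 ] |= ( uint32_t ) u8LanRxBuffer[ 3 ] << 24;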
                                            
                                            
Now suppose you want to transfer 50 elements of u32DataBuffer. You will face two issues. The first is readability: copying one element takes four assignment statements, so the packing alone could run to 200 lines. The second is performance.


The readability issue can be mitigated by iterating with a for loop, or by using memcpy. But consider the worse case, where the data you want to transmit in a single LAN packet is an assortment of different datatypes. Look at the sequence below:
              
LanDataTransmitBuffer <- uint32_t Data1
LanDataTransmitBuffer <- uint16_t Data2
LanDataTransmitBuffer <- uint8_t Data3
LanDataTransmitBuffer <- uint32_t Data4
   
                          
In this case a for loop can't be used to iterate through the data, and writing four statements for every uint32_t copy really hurts readability. This can be fixed with a macro, as shown below:
              
#define UINT32_TO_UINT8_IN_LE( destination, source )                    \
do                                                                      \
{                                                                       \
    ( destination )[ 0 ] = ( uint8_t )( ( source ) & 0xFF );            \
    ( destination )[ 1 ] = ( uint8_t )( ( ( source ) >> 8 ) & 0xFF );   \
    ( destination )[ 2 ] = ( uint8_t )( ( ( source ) >> 16 ) & 0xFF );  \
    ( destination )[ 3 ] = ( uint8_t )( ( ( source ) >> 24 ) & 0xFF );  \
} while( 0 )

The macro can be used as shown below; it takes the address of the first byte to be written:

UINT32_TO_UINT8_IN_LE( &LanDataTransmitBuffer[ 0 ], Data1 );


Readability is fixed, but what about performance? Packing and unpacking in every node surely consumes a considerable share of the total transmission time, since each copy costs four load/store instructions in addition to the shifts and masks. What can be done about this?


This is where a pointer hack can improve performance. Instead of the four (or more) statements above, the copy can be done in a single statement using a pointer cast. The statement below copies uint32_t data into the byte buffer, just like the method above:
                             
*(( uint32_t * ) ( &destination[ 0 ] ) ) = source[ 0 ];
                                            
                                            
Fair and simple, right? Still, you don't want to type all of this for every conversion; repeating it everywhere makes the code harder to read and prone to mistakes. A simple macro definition fixes that:
                             
/* Stores the four bytes in host order: little-endian only on a little-endian CPU. */
#define UINT32_TO_UINT8_IN_LE( destination, source )                \
( *(( uint32_t * ) ( & ( destination ) ) ) = ( source ) )
                                        
    
The macro can be used as shown below:
                                            
UINT32_TO_UINT8_IN_LE( LanDataTransmitBuffer[ 0 ] , Data1 );
                                
                           
Hurrah, we've achieved what we wanted in a single statement instead of four (a similar macro can be implemented for unpacking in the receiving node). Performance-wise this can be a considerable improvement, and readability-wise it's okay too.

                             
Is there anything wrong with this method? Can it be used blindly on any system, without any other consideration?

                                            
As is usual with pointers, there's a loophole here too, and it can create havoc if you don't take the necessary precautions.
                                 
           
Any guess what it is? Yes, the issue is unaligned memory access: reading or writing a uint32_t at an address that is not a multiple of four. On higher-end processors that support unaligned memory access, this pointer dereference method can be used without any fuss. But most embedded systems are built around low- or medium-end processors, which may or may not support unaligned memory access, so this is definitely a worrying issue.

                                            
One way of tackling this problem is to take care of the alignment of the uint8_t data buffer when declaring it, which can be done with a pragma (or a compiler alignment attribute). By placing the first byte of the buffer at an address that is a multiple of four, as our controllers require, the issue is sorted out for the start of the buffer. The problem is the requirement mentioned earlier, where a single frame mixes different datatypes: there this fails, because we may copy a uint32_t to or from a buffer element whose address is not a multiple of four (Data4 in the sequence above starts at byte 7). So on processors that don't support unaligned memory access, this method can't be used.
                     
        
Even on processors that support unaligned memory access, there are certain things you need to ensure before using this method. The first is that the processor (its MMU and system control configuration) is properly set up to allow unaligned memory access.


For example, on ARMv7 architectures you need to disable the alignment fault check (the A bit of the CP15 System Control Register) before using the above method. Likewise, look for the equivalent configuration on other architectures.
                             

You also need to ensure one more thing: the uint8_t data buffer must be placed in a memory region that is not marked as strongly ordered in the MMU translation tables. If that region is configured as strongly ordered, an unaligned access will again end in a trap.
                             

So there are important things to consider even on processors that support unaligned memory access. With this much configuration involved, the method will definitely be painful in software that may need to be ported to other platforms and architectures in the future. Another basic thing to consider with this method is endianness: the pointer store writes the bytes in the processor's own order, so the frame comes out little-endian only when the processor is little-endian.
                             

With this many things to consider, do you think this can be used in production software, especially in safety-critical systems? Well, I've seen this method used in the production code of a Class III medical device (yes, a software failure there can result in the death of the patient). That device used an Intel Atom processor on the VxWorks platform. Since such higher-end processor platforms support unaligned memory access, we used this pointer dereference method in the project. If you know what you're doing, pointers let your software run at its utmost efficiency, but if you miss even a tiny detail, you'll have to face the wrath.
              

Okay, what's your take? Will you go for this pointer dereference method, the traditional method, or memcpy? If you have any other alternative, please let me know in the comments or by mail.