Tuesday, December 31, 2019

Setting up TLS Communication between Embedded Devices


It's time to write again, this time about setting up TLS communication between embedded devices. When I set up TLS between two devices, I had to implement my own provisioning setup. I searched the internet for help, but unfortunately not much information on the topic is available. A few good articles helped me finish the work. So I decided to write down what I did: it will help me if I ever have to do the same task again, and it may be useful to other newbies like me.


My target ran an embedded RTOS, so I had to port the mbedTLS library to my platform, which was embOS + HCC TCP/IP stack. Porting mbedTLS is not very difficult: implementing a few wrapper functions around the socket connectivity functions is enough to get basic TLS operation. If you want to use your target's on-chip peripherals for cryptographic functions instead of the mbedTLS software implementations, you may need to implement a few more interface functions.


To learn how to port the mbedTLS library, go through the link below; it has enough information on the subject.



In short, for basic TLS functionality it is enough to implement your own net_sockets.c and replace the existing implementation. I'm not going to cover that here, as it varies with the TCP/IP stack being used and is quite easy compared to porting the TCP/IP stack itself.


Let's jump directly to our focus: setting up TLS communication. I hope you already know the basics of TLS; if not, please go through the link provided at the end of the article first. Before we can establish TLS communication we need to set up provisioning, because one of the fundamental concepts of TLS is verifying the identity of the peer we are talking to. That verification happens during the initial handshake itself. In our example, for simplicity, only the server's certificate is verified. In real-world embedded devices it is best to verify both the server and the client.


As the first step in provisioning, we need to create a Certificate Authority (CA) that issues certificates to the end nodes, attesting to the legitimacy of the node each certificate belongs to. OpenSSL is used for our demonstration. Instead of typing individual openssl commands, we can use the helper script that ships with OpenSSL, which makes life easier.


To make life even easier (and to keep the host system's own files untouched), copy the CA.pl script (usually found in /usr/lib/ssl/misc) and the openssl.cnf file (usually found in /usr/lib/ssl) to the directory where you want to generate the certificates. Edit both files as required. For example, you might want to change the certificate validity period or the number of bits used for the RSA key.


Once you are done, enter the command below in a terminal, from the directory that CA.pl and openssl.cnf were copied to.

./CA.pl -newca


Enter all the requested information about the certificate authority. If you don't want to type it in during certificate generation, you can put the values into CA.pl and openssl.cnf beforehand; then you can simply press Enter and the defaults from the script and the configuration file will be used. The command creates the root certificate authority, which we can use to sign our certificates.



Now the certificate authority is ready. Next we need to create the certificate signing request. To do that, enter the following command,

./CA.pl -newreq


Enter all the requested information about the server. Note that this information describes the server and will appear in the Subject field of the server certificate, while the information provided earlier (for the -newca command) will appear in the Issuer field of the server certificate.



Finally, we use the CA to generate the signed server certificate. To do that, enter the following command,

./CA.pl -sign



Use the credentials (pass phrase) of the certificate authority to sign. Check that the information presented matches what you entered for the server certificate and that the validity period is what you expect. Once everything is okay, proceed with signing.


At the end, the server certificate is generated in the directory where CA.pl is located, in the file newcert.pem; the server's private key is in the file newkey.pem.


Now we can use these on our embedded target for TLS communication. I assume you're already familiar with the example programs provided with mbedTLS (ssl_server and ssl_client). I'm going to mention only the functions that need to be modified for our task. Please adjust the other configuration as required.


We don't need a file system to store the certificate and private key; we can store them as character arrays and feed those to the mbedTLS library.


For the server certificate and for the root certificate, the character string from -----BEGIN CERTIFICATE----- to -----END CERTIFICATE----- should be stored in an array. Likewise, for the private key, the string from -----BEGIN ENCRYPTED PRIVATE KEY----- to -----END ENCRYPTED PRIVATE KEY----- should be stored in an array.
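
A minimal sketch of how such an array could be laid out in C. The certificate body here is a placeholder, not a real certificate, and the helper function ServerCertificateLength is hypothetical; only the array name follows the one used later in this article. Adjacent string literals are joined by the compiler, and the terminating NUL is part of the array. That matters because mbedTLS expects the length passed for a PEM buffer to include the terminating NUL byte.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder PEM text: replace the middle line with the Base64 body
 * copied from newcert.pem. Each line keeps its trailing "\n". */
static const char u8ServerCertificate[] =
    "-----BEGIN CERTIFICATE-----\n"
    "PLACEHOLDER+BASE64+BODY+NOT+A+REAL+CERTIFICATE==\n"
    "-----END CERTIFICATE-----\n";

/* Length to hand to the parser (hypothetical helper). sizeof counts the
 * terminating NUL, so the result is strlen() + 1. */
static size_t ServerCertificateLength( void )
{
    return sizeof( u8ServerCertificate );
}
```

Passing sizeof( u8ServerCertificate ), as the calls in this article do, therefore already covers the NUL byte; passing strlen() of the array would be one byte short.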


On the server side we need to provide the server certificate followed by the root certificate. If there are intermediate certificates between the root and server certificates (not present in our example, but good to have in real deployments), those must also be provided, in order. On the server side, we pass the certificates to the mbedtls_x509_crt_parse function as shown below.

mbedtls_x509_crt_parse( &sX509ServerCertificate, (const unsigned char *) u8ServerCertificate, sizeof( u8ServerCertificate ) );

Here u8ServerCertificate is the array that is initialized with the server certificate string.


In the same way, the certificate of the root authority is passed to the mbedtls_x509_crt_parse function, which appends it to the same chain, as shown below,

mbedtls_x509_crt_parse( &sX509ServerCertificate, (const unsigned char *) u8CaCertificate, sizeof( u8CaCertificate ) );

Here u8CaCertificate is the array that is initialized with the certificate string of the root Certificate Authority.


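To provide the server's private key to the SSL configuration, pass the array that holds the key to the mbedtls_pk_parse_key function. The last two arguments are the key's password and its length; NULL and 0 work only for a key that is not encrypted, so for an encrypted key pass the pass phrase that was used when the key was generated. The call is shown below for an unencrypted key,

mbedtls_pk_parse_key( &pkey, (const unsigned char *) u8ServerPrivateKey, sizeof( u8ServerPrivateKey ), NULL, 0 );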


That's all we need to configure on the server side. On the client side we just need to configure the trusted root certificate, which is done as above with the same mbedtls_x509_crt_parse function. That's all you need for basic TLS communication: run the server and client applications, and they will talk to each other securely over TLS.
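
The client-side step can be sketched roughly as follows. This is an illustration under assumptions, not the exact code from my project: the helper name ConfigureClientTrust, its parameter names and the CaCertificateLength variable are hypothetical, and the SSL configuration is assumed to have been initialized already (for example with mbedtls_ssl_config_defaults, as in the ssl_client example). The mbedtls_* calls themselves are the library's own.

```c
#include <stddef.h>
#include "mbedtls/ssl.h"
#include "mbedtls/x509_crt.h"

/* Assumed to be defined elsewhere: the PEM array holding the root CA
 * certificate and its size (sizeof the array, terminating NUL included). */
extern const char   u8CaCertificate[];
extern const size_t CaCertificateLength;

/* Hypothetical helper: load the trusted root and require that the server
 * certificate verifies against it. Returns 0 on success, otherwise the
 * value returned by mbedtls_x509_crt_parse. */
int ConfigureClientTrust( mbedtls_ssl_config *psConfig, mbedtls_x509_crt *psCaChain )
{
    int ret;

    mbedtls_x509_crt_init( psCaChain );

    ret = mbedtls_x509_crt_parse( psCaChain,
                                  (const unsigned char *) u8CaCertificate,
                                  CaCertificateLength );
    if( ret != 0 )
    {
        return ret;   /* parsing failed for at least one certificate */
    }

    /* Trust this chain and make a failed verification abort the handshake. */
    mbedtls_ssl_conf_ca_chain( psConfig, psCaChain, NULL );
    mbedtls_ssl_conf_authmode( psConfig, MBEDTLS_SSL_VERIFY_REQUIRED );
    return 0;
}
```

The psCaChain object has to stay alive for as long as the configuration is in use, because the configuration keeps a pointer to it rather than a copy.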



Please refer to the link below for more details on TLS.



Saturday, May 4, 2019

Typecasting Pointers to different Datatypes - A Reflection


Several people asked questions about the earlier article, all of them about pointer typecasting and dereferencing, so I thought of writing about it. Let’s get started. If you haven't read those articles yet, the link is at the end of this article. If you're not familiar with pointer typecasting, it's better to read this article first before following that link.


The most frequently asked question is how the code snippet below (and similar ones) works.

uint32_t u32Destination;
uint8_t u8Source[ 12 ];
u8Source[ 0 ] = 0x01;
u8Source[ 1 ] = 0x23;
u8Source[ 2 ] = 0x45;
u8Source[ 3 ] = 0x67;
u32Destination = *( ( uint32_t * ) ( &u8Source[ 0 ] ) );


As you may know, I used this kind of statement to copy four contiguous bytes between a byte array and a long integer in one go. To see how it works, let's look at pointers that point to variables of different datatypes.


Let me start from the basics first. As you probably know, a pointer is just a variable (holding an integer-like value) that stores the address of another variable. To be precise, it holds the address of the variable it points to.


Let’s see what that means with an example. Consider the code snippet below,

uint32_t u32Integer = 0xAABBCC;
uint32_t * p32Pointer = &u32Integer;


Let’s assume the long integer variable u32Integer is located at 0x20000; the pointer p32Pointer is then assigned the value 0x20000 (the address of u32Integer). If you dereference p32Pointer, the controller reads four consecutive bytes starting at location 0x20000, combines those four one-byte values according to the endianness of the architecture, and gives you the four-byte long integer value. So the expression ( * p32Pointer ) gives you the value 0xAABBCC.


Let's look at another example, this time with a pointer to a short integer, as shown below,

uint16_t u16Integer = 0xDDEE;
uint16_t * p16Pointer = &u16Integer;


For the sake of simplicity, assume the u16Integer variable is also located at 0x20000; the pointer p16Pointer then has the value 0x20000 as well. In this case, if you dereference p16Pointer, the controller reads two consecutive bytes starting at location 0x20000, combines those two one-byte values according to the endianness of the architecture, and gives you the two-byte short integer value.


Note that there is no difference between the value (address) stored in p16Pointer (a pointer to a short integer) and the one stored in p32Pointer (a pointer to a long integer); both hold 0x20000, the address of the variables u16Integer and u32Integer.


So during dereferencing, how does the controller give you a four-byte value for p32Pointer and a two-byte value for p16Pointer? The difference is in how the compiler interprets the pointer when it is dereferenced. If the pointer points to a short integer, the compiler generates code that reads two bytes from the address the pointer holds. If the pointer points to a long integer, the compiler generates code that reads four bytes from that address.
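
The point can be illustrated with a small sketch. The union type and the function names are made up for this example; a union is used only because its members are guaranteed to start at the same address, which mimics the "both variables at 0x20000" assumption above. Nothing is read through the pointers here; the code only compares the address values and asks the compiler how many bytes each pointer type refers to.

```c
#include <stddef.h>
#include <stdint.h>

/* One storage location that can be viewed as two different types. */
typedef union
{
    uint32_t u32Integer;
    uint16_t u16Integer;
} SharedStorage;

/* Returns 1 when the two differently typed pointers hold the same address. */
static int PointersHoldSameAddress( SharedStorage *psStorage )
{
    uint32_t *p32Pointer = &psStorage->u32Integer;
    uint16_t *p16Pointer = &psStorage->u16Integer;

    return ( ( void * ) p32Pointer == ( void * ) p16Pointer );
}

/* The read width comes from the pointer's type, not from its value.
 * sizeof does not evaluate its operand, so nothing is dereferenced. */
static size_t WidthOfUint32Read( const uint32_t *p32Pointer )
{
    return sizeof( *p32Pointer );
}

static size_t WidthOfUint16Read( const uint16_t *p16Pointer )
{
    return sizeof( *p16Pointer );
}
```

The address comparison is true while the two widths are 4 and 2: same pointer value, different number of bytes read.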


So if you want to extract the lower two bytes of u32Integer and store them in a uint16_t variable, you can do that with the conventional AND method or with a value typecast, but you can also achieve the same thing using the pointer, as shown below,

uint16_t u16Integer = *( ( uint16_t * ) p32Pointer );


Do you see how this works? As we saw earlier, the value stored in the pointer variable does not change: p32Pointer holds 0x20000. By default, the compiler treats p32Pointer as a pointer to uint32_t, because that is the datatype used in its declaration. So if you want to read just two bytes from 0x20000, you have to turn that pointer into a pointer to uint16_t, and this is done through typecasting. Prefixing p32Pointer with ( uint16_t * ) tells the compiler to interpret p32Pointer as a pointer to uint16_t.


Now if we dereference it, the controller reads two one-byte values starting at location 0x20000 and combines them into a two-byte short integer value. As a result, u16Integer gets the value 0xBBCC. There is one more subtlety to consider here: you get this result only on a little endian system; on a big endian system the result is different.


As I mentioned earlier, the value will be stored in memory based on the endianness of the system. For the example above, for the little endian based system, the value will be stored in memory as shown below,

0x20000 = 0xCC
0x20001 = 0xBB
0x20002 = 0xAA
0x20003 = 0x00


So in this case, if we read two bytes from 0x20000, we get 0xCC and 0xBB. Combining these two bytes in little endian order gives the two-byte value 0xBBCC.

So that's the value of the expression *( ( uint16_t * ) p32Pointer ).


Now let’s assume the endianness of the system architecture is big endian, the value will be stored in memory as shown below,

0x20000 = 0x00
0x20001 = 0xAA
0x20002 = 0xBB
0x20003 = 0xCC


So in this case, if we read two bytes from 0x20000, we get 0x00 and 0xAA. Combining these two bytes in big endian order gives the two-byte value 0x00AA.

So that's the value of the expression *( ( uint16_t * ) p32Pointer ) on a big endian system.
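
If you want to check which of the two results your own target produces, something like the following sketch can help. The function names are hypothetical. Note that it uses memcpy to read the bytes instead of the pointer cast discussed here; it reads exactly the same two bytes at the lowest address, but stays within what the C standard defines regardless of compiler settings.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little endian host, 0 on a big endian host. */
static int HostIsLittleEndian( void )
{
    const uint32_t u32Probe = 1u;
    uint8_t u8FirstByte;

    memcpy( &u8FirstByte, &u32Probe, 1 );   /* byte stored at the lowest address */
    return ( u8FirstByte == 1u );
}

/* Reads the two bytes at the lowest address of a uint32_t, i.e. the same
 * bytes that *( ( uint16_t * ) p32Pointer ) would read. */
static uint16_t ReadFirstTwoBytes( const uint32_t *p32Pointer )
{
    uint16_t u16Result;

    memcpy( &u16Result, p32Pointer, sizeof( u16Result ) );
    return u16Result;
}
```

With u32Integer = 0xAABBCC, ReadFirstTwoBytes( &u32Integer ) gives 0xBBCC when HostIsLittleEndian() returns 1 and 0x00AA when it returns 0, matching the two memory layouts shown above.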


I hope you understood how it works. Now let's go back to the main question,

u32Destination  = *(( uint32_t * ) ( & u8Source [ 0 ] ) );


Here ( &u8Source[ 0 ] ) gives the address of the first element of the u8Source array. Let's suppose it is located at 0x30000. Prefixing ( &u8Source[ 0 ] ) with ( uint32_t * ) makes it a pointer to an unsigned long integer. So if you dereference it, the controller reads four one-byte values starting at location 0x30000 and combines them into one four-byte value, which is stored in u32Destination.


So if you are working on a little endian system and the data in the buffer is also laid out in little endian order, you can use the dereference method discussed in my previous article. On big endian systems, the buffer data has to be laid out in big endian order for this dereference method to work.


Refer to the link below to learn about the usage of the dereferencing method,



Sunday, March 10, 2019

Pointer Hack in Packing & Unpacking the Frame



The pointer is one of the tools in C with which we can do almost anything we want, and there are many hacks we can do with pointers. I thought of writing about one such hack and about the ways the same hack can backfire. Let’s get started.


Assume the following requirement: your project has more than one node, and the nodes are interconnected and communicate with one another. Whatever the communication line is (LAN or serial), the data is transmitted as a sequence of bytes. Each node holds data of various datatypes and wants to share that data with the others. Because the communication link transmits bytes, each node has to do packing and unpacking.

              
Suppose one node wants to share the array of data with datatype of uint32_t, then it should handle the conversion of data as shown below,
                             
uint32_t u32DataBuffer[10];
uint8_t u8LanTxBuffer[100];
                                            
u8LanTxBuffer[ 0 ] = u32DataBuffer [ 0 ] & 0xFF;
u8LanTxBuffer[ 1 ] = ( u32DataBuffer [ 0 ] >> 8 ) & 0xFF;
u8LanTxBuffer[ 2 ] = ( u32DataBuffer [ 0 ] >> 16 ) & 0xFF;
u8LanTxBuffer[ 3 ] = ( u32DataBuffer [ 0 ] >> 24 ) & 0xFF;

                                            
And each node that receives the data should handle the conversion as shown below,
              
uint32_t u32DataBuffer[10];
uint8_t u8LanRxBuffer[100];
                                            
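u32DataBuffer[ 0 ]  = ( uint32_t ) u8LanRxBuffer[ 0 ];
u32DataBuffer[ 0 ] |= ( uint32_t ) u8LanRxBuffer[ 1 ] << 8;
u32DataBuffer[ 0 ] |= ( uint32_t ) u8LanRxBuffer[ 2 ] << 16;
u32DataBuffer[ 0 ] |= ( uint32_t ) u8LanRxBuffer[ 3 ] << 24;   /* cast first: shifting a promoted int into the sign bit is undefined */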
                                            
                                            
Let's suppose you want to transfer 50 elements of u32DataBuffer; then you will face two issues. One is code readability: copying one element needs four assignment statements, so the packing alone could take 200 lines. The second issue is performance.


The readability issue can be mitigated by using a for loop to iterate over the elements, or memcpy can be used. But assume the worst case: the data you want to transmit in a single LAN packet is an assortment of different datatypes. Look at the sequence below,
              
LanDataTransmitBuffer <- uint32_t Data1
LanDataTransmitBuffer <- uint16_t Data2
LanDataTransmitBuffer <- uint8_t Data3
LanDataTransmitBuffer <- uint32_t Data4
   
                          
In the above case a for loop can't be used to iterate over the data, and you need four statements for every copy of a uint32_t value, which really hurts code readability. This issue can be fixed by using a macro, as shown below,
              
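/* destination is the first byte to write, e.g. LanDataTransmitBuffer[ 0 ] */
#define UINT32_TO_UINT8_IN_LE( destination, source )                          \
do                                                                            \
{                                                                             \
    ( &( destination ) )[ 0 ] = ( uint8_t )( ( source ) & 0xFF );             \
    ( &( destination ) )[ 1 ] = ( uint8_t )( ( ( source ) >> 8 ) & 0xFF );    \
    ( &( destination ) )[ 2 ] = ( uint8_t )( ( ( source ) >> 16 ) & 0xFF );   \
    ( &( destination ) )[ 3 ] = ( uint8_t )( ( ( source ) >> 24 ) & 0xFF );   \
} while( 0 )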
                                            
The macro can be used as shown below,
                                            
UINT32_TO_UINT8_IN_LE( LanDataTransmitBuffer[ 0 ] , Data1 );
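
For the mixed sequence shown earlier (uint32_t, uint16_t, uint8_t, uint32_t), the same shift-and-mask packing could be sketched with small helper functions and a running offset. The helper names PutUint16LE, PutUint32LE and PackFrame are made up for this illustration, and functions are used here instead of the macro purely so that each helper can report how many bytes it wrote.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: store a value in little endian byte order and
 * return the number of bytes written. */
static size_t PutUint16LE( uint8_t *pu8Destination, uint16_t u16Source )
{
    pu8Destination[ 0 ] = ( uint8_t )( u16Source & 0xFFu );
    pu8Destination[ 1 ] = ( uint8_t )( ( u16Source >> 8 ) & 0xFFu );
    return 2u;
}

static size_t PutUint32LE( uint8_t *pu8Destination, uint32_t u32Source )
{
    pu8Destination[ 0 ] = ( uint8_t )( u32Source & 0xFFu );
    pu8Destination[ 1 ] = ( uint8_t )( ( u32Source >> 8 ) & 0xFFu );
    pu8Destination[ 2 ] = ( uint8_t )( ( u32Source >> 16 ) & 0xFFu );
    pu8Destination[ 3 ] = ( uint8_t )( ( u32Source >> 24 ) & 0xFFu );
    return 4u;
}

/* Packs Data1..Data4 back to back and returns the frame length in bytes
 * (4 + 2 + 1 + 4 = 11). The caller must provide at least 11 bytes. */
static size_t PackFrame( uint8_t *pu8Frame, uint32_t Data1, uint16_t Data2,
                         uint8_t Data3, uint32_t Data4 )
{
    size_t offset = 0u;

    offset += PutUint32LE( &pu8Frame[ offset ], Data1 );
    offset += PutUint16LE( &pu8Frame[ offset ], Data2 );
    pu8Frame[ offset++ ] = Data3;
    offset += PutUint32LE( &pu8Frame[ offset ], Data4 );
    return offset;
}
```

Note that Data4 ends up at offset 7 of the frame, which is not a multiple of four.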


The code readability issue is fixed, but what about performance? This packing and unpacking in each node takes a considerable share of the total transmission time, because each copy needs four load/store instructions in addition to the shifts and other instructions. What can be done about this?


This is where one of the pointer hacks can be used to improve performance. Instead of the above method, which needs at least four statements, the assignment can be done in a single statement using pointers. The single statement below does the same copy of uint32_t data into the byte buffer array as the above method,
                             
*(( uint32_t * ) ( &destination[ 0 ] ) ) = source[ 0 ];
                                            
                                            
Fair and simple, right? Of course you don't want to type all of this for each conversion, and repeating it everywhere would make the code less readable and prone to mistakes. This can be fixed with a simple macro definition, as shown below,
                             
#define UINT32_TO_UINT8_IN_LE( destination, source )                \
( *(( uint32_t * ) ( & ( destination ) ) ) = ( source ) )
                                        
    
The macro can be used as shown below,
                                            
UINT32_TO_UINT8_IN_LE( LanDataTransmitBuffer[ 0 ] , Data1 );
                                
                           
Hurrah, we've achieved what we wanted in a single statement instead of four (a similar macro can be implemented for unpacking in the receiving node). Performance-wise this is a considerable improvement, and code readability is okay too.

                             
Is there anything wrong with this method? Can this be used on any system blindly without any other consideration?

                                            
As usual with pointers, there is a loophole here too, and it can create havoc if you don't take the necessary precautions.
                                 
           
Any guess what it is? Yes, the issue is unaligned memory access. On higher-end processors that support unaligned memory access, this pointer dereference method can be used without any fuss. But most embedded systems use low-end or mid-range processors, which may or may not support unaligned memory access, so this is definitely a concern.

                                            
One way of tackling this problem is to take care of the alignment of the uint8_t data buffer when creating it, which can be done using a pragma. By placing the first byte of the uint8_t data buffer at an address that is a multiple of four, as our controller requires, we can sort out the issue. The problem with this approach is that it fails when a frame mixes data of different datatypes, as mentioned previously, because we may then copy uint32_t data to or from a buffer element located at an address that is not a multiple of four. So on processors that don't support unaligned memory access, this method can't be used.
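
A rough sketch of both points: aligning the buffer itself, and checking whether a given element is suitably aligned before doing a four-byte access through it. It assumes a C11 compiler and uses the standard _Alignas specifier in place of a compiler-specific pragma; the IsAligned helper is hypothetical, and it relies on the usual (implementation-defined) conversion of a pointer to uintptr_t yielding the plain address.

```c
#include <stddef.h>
#include <stdint.h>

/* C11 alignment specifier: the buffer starts on a 4 byte boundary. */
static _Alignas( 4 ) uint8_t u8LanTxBuffer[ 100 ];

/* Hypothetical helper: returns 1 when pAddress is a multiple of
 * alignment (which must be nonzero), 0 otherwise. */
static int IsAligned( const void *pAddress, size_t alignment )
{
    return ( ( ( uintptr_t ) pAddress % alignment ) == 0u );
}
```

With this buffer, IsAligned( &u8LanTxBuffer[ 0 ], 4 ) and IsAligned( &u8LanTxBuffer[ 4 ], 4 ) are true, but IsAligned( &u8LanTxBuffer[ 7 ], 4 ) is false. Offset 7 is exactly where the second uint32_t lands in a frame laid out as uint32_t, uint16_t, uint8_t, uint32_t, which is why aligning only the start of the buffer is not enough for mixed frames.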
                     
        
Even on most processors that support unaligned memory access, there are certain things you need to ensure before using this method. The first is that the processor (its MMU and system control settings) is properly configured to allow unaligned memory access.


For example, on ARMv7 architectures that have the CP15 system control coprocessor, you need to disable the unaligned memory access trap (the alignment check bit in the CP15 System Control Register) before using the above method. Likewise, you need to look for the appropriate configuration on other architectures as well.
                             

You also need to ensure another thing: the uint8_t data buffer must be defined in a memory region that is not configured as strongly ordered in the memory attribute (cache/MMU) tables. If you've configured that memory region as strongly ordered, you'll again end up in a trap.
                             

So there are important things to consider even on processors that support unaligned memory access. With this much configuration involved, using this method will definitely be painful in software that may need to be ported to different platforms and architectures in the future. Another basic thing you need to consider with this method is endianness.
                             

With so many things to consider, do you think this can be used in production software, especially in safety critical systems? Well, I've seen this method used in the production code of a Class III medical device (yes, a software failure can result in the death of a person). That device uses an Intel Atom processor and the VxWorks platform. As these higher-end processor platforms support unaligned memory access, we used this pointer dereference method in the project. If you know what you're doing, pointers let you make your software run at its utmost efficiency, but if you miss even a tiny detail you'll have to face the wrath.
              

Okay, what's your take? Will you go for this pointer dereference method, the traditional method, or memcpy? If you have any other alternative, please let me know in the comments or by mail.
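
For comparison, the memcpy option mentioned above could look roughly like this. The function names are hypothetical. Like the pointer dereference method, it copies the bytes in the host's own byte order, so the endianness consideration is the same, but it makes no assumption about the alignment of the buffer address. Many compilers turn a fixed-size memcpy like this into a single load or store where the target allows it, though that is an optimization you would have to confirm for your own toolchain.

```c
#include <stdint.h>
#include <string.h>

/* Copies the four bytes of u32Source into the buffer, host byte order. */
static void PackUint32( uint8_t *pu8Destination, uint32_t u32Source )
{
    memcpy( pu8Destination, &u32Source, sizeof( u32Source ) );
}

/* Reads four bytes from the buffer back into a uint32_t, host byte order. */
static uint32_t UnpackUint32( const uint8_t *pu8Source )
{
    uint32_t u32Destination;

    memcpy( &u32Destination, pu8Source, sizeof( u32Destination ) );
    return u32Destination;
}
```

For example, PackUint32( &u8LanTxBuffer[ 7 ], Data4 ) works even though offset 7 is not a multiple of four, and UnpackUint32 on the same offset returns the original value.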