Computer Engineering Concepts

2.7 Real Number Representation

Real numbers can be represented in a computer in one of two ways: the fixed point method and the floating point method. A real number has two parts: the integral (whole) part and the fractional part. The two parts are separated by the radix point. The digits to the left of the radix point represent the integral part and the digits to the right represent the fractional part.


The Fixed point representation

In the fixed point method, the radix point that divides the whole part of the number from the fractional part is assumed to sit at a fixed position between two bits of the number. For example, in an 8 bit number the radix point can be set between the 4th and the 5th bit, as shown below.

10010110 = 1001.0110

00100101 = 0010.0101

In this method the radix point does not have to be in the middle of the representation. If it is in the middle, the integral and fractional parts are given equal emphasis. The fixed point representation is limited in its application because it limits the scale of the numbers that can be represented: with a fixed number of bits on each side of the radix point, very small and very large real numbers cannot be represented. One way to address the problem is to shift the location of the radix point, to the left for small numbers and to the right for large numbers. This unfortunately is not a practical solution, because the computer has no way of determining where the radix point is if it is allowed to move around. Due to this lack of flexibility and limited range, the fixed point representation is not used for general purpose real number arithmetic in computers.
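The fixed point interpretation described above can be sketched in a few lines of Python. This is only an illustration: the function name fixed_point_value and its frac_bits parameter are hypothetical, and the bit pattern is assumed to be unsigned.

```python
def fixed_point_value(bits: str, frac_bits: int) -> float:
    """Interpret an unsigned bit string whose radix point sits
    frac_bits positions from the right-hand end."""
    # Read the pattern as an integer, then scale by 2**frac_bits
    # to put the radix point back in its assumed position.
    return int(bits, 2) / (1 << frac_bits)


# Radix point between the 4th and 5th bit (4 fractional bits).
print(fixed_point_value("10010110", 4))  # 1001.0110 -> 9.375
print(fixed_point_value("00100101", 4))  # 0010.0101 -> 2.3125

# Same bits, radix point moved two places to the left: a different value.
print(fixed_point_value("10010110", 6))  # 10.010110 -> 2.34375
```

The last call shows why a movable radix point is a problem: the same eight bits stand for different numbers unless the position of the radix point is known in advance.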


The Floating Point Representation

The floating point method overcomes the challenges and limitations of the fixed point method. In this method the number is represented in scientific notation, and each piece of that notation is stored in its own group of bits. The method is called the floating point method because the radix point no longer separates the ones and the tenths; instead its position floats, or changes, as shown below for decimal numbers.

-672.992 = -6.72992 x 10^2

7895000 = 7.895 x 10^6

0.00017 = 1.7 x 10^-4

In the first example, the radix separates the hundreds and the tens, and in the second case the radix separates the millions and the hundred thousands. In this method of expressing real numbers there are three essential components: the sign, the exponent, and the significant digits or the mantissa. To represent a real number in binary, all three parts of the floating point method must be included in the representation. There are several ways in which this can be done, but the focus here will be on the method outlined by the IEEE 754 standard. In this method the available bits are broken into three groups, with each group representing one piece of the floating point representation as shown in figure 2.1. In a 32 bit representation of the real number the sign information is stored using a single bit, 0 for positive and 1 for negative.

Fig 2.1. IEEE 754 floating point standard bit assignment.

In the decimal floating point representation the base of the exponent is 10, but to facilitate the computational process the base of the exponent is set to 2 in the binary floating point representation. The exponent is stored using 8 bits. The exponent field is not a simple conversion of the exponent value to binary; instead, a fixed offset (bias) of 127 is added to the exponent before it is stored. This is done to accommodate both positive and negative exponents. Since 8 bits are available, a total of 256 different values can be stored in the field, so roughly half of them represent positive exponents and the other half represent negative exponents. A value of 127 in the exponent field represents an exponent value of 0. Values above 127 represent positive exponents and values below 127 represent negative exponents. For example, a value of 150 in the exponent field represents an exponent value of 23 (150 - 127).
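The offset of 127 can be illustrated with a minimal Python sketch. The names BIAS, encode_exponent and decode_exponent are hypothetical helpers introduced for this illustration only.

```python
BIAS = 127  # offset added to the exponent in the 32 bit format


def encode_exponent(exponent: int) -> str:
    """Return the 8 bit exponent field that stores the given exponent."""
    field = exponent + BIAS
    # The field has only 8 bits, so it must lie in 0..255.
    # (The standard also gives the two extreme field values special
    # meanings, which this section does not cover.)
    if not 0 <= field <= 255:
        raise ValueError("exponent does not fit in 8 bits")
    return format(field, "08b")


def decode_exponent(field_bits: str) -> int:
    """Recover the exponent from an 8 bit exponent field."""
    return int(field_bits, 2) - BIAS


print(encode_exponent(23))          # 150 in binary: 10010110
print(encode_exponent(0))           # 127 in binary: 01111111
print(decode_exponent("10010110"))  # 150 - 127 = 23
print(decode_exponent("01111100"))  # 124 - 127 = -3
```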

The mantissa or significant digit information is stored in the remaining 23 bits. In standard floating point or scientific form the first digit of the mantissa cannot be 0, and the radix point is always after the first digit. If this standard form is carried over to the binary representation, then the first digit in a binary representation will always remain 1 as there are no other symbols available. Since the first digit is always guaranteed to be 1, it does not have to be included in the bits of the mantissa.

The range and precision of the real numbers that can be represented using the floating point method are limited by the number of bits available. The 32 bit standard is called single precision. For a larger range and greater precision, 64 bits are used for the representation. The 64 bit standard is called double precision. In double precision there are more bits available for both the exponent and the mantissa, allowing a larger range and greater precision. An even larger range and precision could be obtained by using an even greater number of bits.
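The difference between the two precisions can be observed with Python's standard struct module, which can store a value in 32 or 64 bits and read it back. This is a rough sketch: the helper names are hypothetical, and it assumes that a Python float is itself a 64 bit double, as it is in the standard CPython interpreter.

```python
import struct


def round_trip_single(x: float) -> float:
    """Store x in the 32 bit format and read it back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def round_trip_double(x: float) -> float:
    """Store x in the 64 bit format and read it back."""
    return struct.unpack(">d", struct.pack(">d", x))[0]


x = 0.1  # has no exact binary representation
print(round_trip_single(x) == x)  # False: bits were lost in 32 bit storage
print(round_trip_double(x) == x)  # True: 64 bits preserve the Python float

y = 35.625  # 100011.101 in binary, fits easily in 23 mantissa bits
print(round_trip_single(y) == y)  # True
```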

Example I:        Express 100011.101, a binary real number, using the 32 bit IEEE 754 standard.

Solution:         

100011.101 = 1.00011101 x 100000 = 1.00011101 x 2^5

0  10000100  00011101000000000000000

The three parts of the number are separated by spaces to show their individual binary representations.

The sign bit is 0 because it is a positive number.

The exponent field is 10000100, which is 127 + 5 = 132 written in binary.

The mantissa is 00011101000000000000000, without the leading 1.
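The steps of Example I can be sketched in Python as follows. The function encode_single is a hypothetical helper written for this illustration: it assumes a positive, non-zero binary real given as a string, and it simply cuts off any mantissa bits beyond 23 rather than rounding them. The result is cross-checked against Python's standard struct module.

```python
import struct


def encode_single(binary: str) -> str:
    """Encode a positive binary real such as '100011.101' as
    'sign exponent mantissa' fields of the 32 bit format."""
    whole, _, frac = binary.partition(".")
    digits = whole + frac
    first_one = digits.index("1")            # position of the leading 1
    exponent = len(whole) - 1 - first_one    # places the radix point moves
    mantissa = digits[first_one + 1:]        # drop the implied leading 1
    mantissa = mantissa[:23].ljust(23, "0")  # pad (or cut) to 23 bits
    return f"0 {exponent + 127:08b} {mantissa}"


def struct_fields(x: float) -> str:
    """Split the 32 bit pattern that struct produces into its three fields."""
    bits = format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"


# 100011.101 in binary is 35.625 in decimal.
print(encode_single("100011.101"))
print(struct_fields(35.625))  # same three fields as the line above
```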


Example II:       Express 11000100110101100000000000000000 as a real binary number and a decimal number. Assume the number is expressed using the 32 bit IEEE 754 standard.

Solution:         

The number expressed in its three parts is

1   10001001   10101100000000000000000

Since the sign bit is 1, the number is negative.

10001001 in decimal is 137.

The adjusted exponent is 137 - 127 = 10

The mantissa with the leading 1 added is 1.101011

Therefore the real binary number is -1.101011 x 2^1010 (with the exponent written in binary)

or -1.101011 x 2^10

In decimal the number is

-(2^0 + 2^-1 + 2^-3 + 2^-5 + 2^-6) x 2^10

-(1 + 0.5 + 0.125 + 0.03125 + 0.015625) x 1024

-1.671875 x 1024

-1712 or -1.712 x 10^3
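The decoding steps of Example II can be checked with a short Python sketch. The function decode_single is a hypothetical helper for this illustration; it assumes a 32 character bit string holding an ordinary number with an implied leading 1, and it compares its result with Python's standard struct module. Because the sign bit is 1, the decoded value carries a negative sign.

```python
import struct


def decode_single(bits: str) -> float:
    """Decode a 32 bit pattern by splitting it into its three fields."""
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:9], 2) - 127          # remove the offset of 127
    significand = 1 + int(bits[9:], 2) / 2**23  # restore the implied leading 1
    return sign * significand * 2**exponent


pattern = "11000100110101100000000000000000"
print(decode_single(pattern))  # -1712.0

# Cross-check: let struct interpret the same four bytes.
raw = int(pattern, 2).to_bytes(4, "big")
print(struct.unpack(">f", raw)[0])  # -1712.0
```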


2.7 Practice Questions 

1.     Express the following real binary numbers using the 32 bit IEEE 754 floating point standard.

        a. 10111000      b. 1010010101.1            c. 0.000101011             

2.     Express the following decimal numbers in 32 bit IEEE 754 format.

        a. 12.375            b. 531.75                         c. 893.5                 

3.    Convert the following 32 bit IEEE 754 floating point representations to real binary numbers.

        a. 11001100110101100000000000000000

        b. 01000101101101110110000000000000

        c. 11000111011011010110000000000000

        d. 01000111001101100000000000000000

4.     Convert the results of question 3 to decimal.



GlobalEduTech Solutions
