Java floating point numbers review

Nasser M. Abbasi

Nov 15, 2000   Compiled on January 29, 2024 at 3:01am

Contents

1 Java primitive types sizes
2 Maximum value in signed and unsigned integers
3 Some bits table
4 Power of 2 table
5 Float and Double in Java
5.1 How to read a floating point?
6 References

1 Java primitive types sizes

| type | size in bytes |
|---|---|
| byte | 1 |
| short | 2 |
| int | 4 |
| long | 8 |
| float | 4 (IEEE 754) |
| double | 8 (IEEE 754) |

2 Maximum value in signed and unsigned integers

Signed integer table

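| number of bits | Java type | range | range in base 10 |
|---|---|---|---|
| 8 | byte | $-2^{7} \dots 2^{7}-1$ | $-128 \dots 127$ |
| 16 | short | $-2^{15} \dots 2^{15}-1$ | $-32{,}768 \dots 32{,}767$ |
| 32 | int | $-2^{31} \dots 2^{31}-1$ | $-2{,}147{,}483{,}648 \dots 2{,}147{,}483{,}647$ |
| 64 | long | $-2^{63} \dots 2^{63}-1$ | $-9{,}223{,}372{,}036{,}854{,}775{,}808 \dots 9{,}223{,}372{,}036{,}854{,}775{,}807$ |

| number of bits | Java type | range | range in HEX (min … max) |
|---|---|---|---|
| 8 | byte | $-2^{7} \dots 2^{7}-1$ | 80 … 7F |
| 16 | short | $-2^{15} \dots 2^{15}-1$ | 80 00 … 7F FF |
| 32 | int | $-2^{31} \dots 2^{31}-1$ | 80 00 00 00 … 7F FF FF FF |
| 64 | long | $-2^{63} \dots 2^{63}-1$ | 80 00 00 00 00 00 00 00 … 7F FF FF FF FF FF FF FF |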
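
The limits in the signed table are available as constants on Java's wrapper classes. The following is a minimal sketch that prints them in base 10 and in hex; the class name `SignedRanges` and the helper `hex` are illustrative names, not part of the original notes.

```java
class SignedRanges {
    // Hex digits of the low `bits` bits of value (two's complement pattern).
    static String hex(long value, int bits) {
        long mask = (bits == 64) ? -1L : (1L << bits) - 1;
        return Long.toHexString(value & mask).toUpperCase();
    }

    static void show(String type, long min, long max, int bits) {
        System.out.println(type + ": " + min + " .. " + max
                + "  hex " + hex(min, bits) + " .. " + hex(max, bits));
    }

    public static void main(String[] args) {
        show("byte", Byte.MIN_VALUE, Byte.MAX_VALUE, 8);
        show("short", Short.MIN_VALUE, Short.MAX_VALUE, 16);
        show("int", Integer.MIN_VALUE, Integer.MAX_VALUE, 32);
        show("long", Long.MIN_VALUE, Long.MAX_VALUE, 64);
    }
}
```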

Unsigned integer table

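Java's byte, short, int and long are signed types. The tables below show the range the same number of bits would hold if read as unsigned.

| number of bits | Java type | range | range in base 10 |
|---|---|---|---|
| 8 | byte | $0 \dots 2^{8}-1$ | $0 \dots 255$ |
| 16 | short | $0 \dots 2^{16}-1$ | $0 \dots 65{,}535$ |
| 32 | int | $0 \dots 2^{32}-1$ | $0 \dots 4{,}294{,}967{,}295$ |
| 64 | long | $0 \dots 2^{64}-1$ | $0 \dots 18{,}446{,}744{,}073{,}709{,}551{,}615$ |

| number of bits | Java type | range | range in HEX (min … max) |
|---|---|---|---|
| 8 | byte | $0 \dots 2^{8}-1$ | 00 … FF |
| 16 | short | $0 \dots 2^{16}-1$ | 00 00 … FF FF |
| 32 | int | $0 \dots 2^{32}-1$ | 00 00 00 00 … FF FF FF FF |
| 64 | long | $0 \dots 2^{64}-1$ | 00 00 00 00 00 00 00 00 … FF FF FF FF FF FF FF FF |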
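
Java has no unsigned byte, short, int or long types, but the wrapper classes can reinterpret a bit pattern as unsigned. The sketch below assumes Java 8 or later; the class name `UnsignedView` and the helper `maxUnsigned` are illustrative names. It compares the reinterpreted all-ones pattern against $2^n-1$.

```java
import java.math.BigInteger;

class UnsignedView {
    // Largest unsigned value that fits in the given number of bits: 2^bits - 1.
    static BigInteger maxUnsigned(int bits) {
        return BigInteger.ONE.shiftLeft(bits).subtract(BigInteger.ONE);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xFF;          // all 8 bits set; -1 as a signed byte
        short s = (short) 0xFFFF;      // all 16 bits set
        int i = 0xFFFFFFFF;            // all 32 bits set
        long l = 0xFFFFFFFFFFFFFFFFL;  // all 64 bits set

        System.out.println(Byte.toUnsignedInt(b) + " vs " + maxUnsigned(8));
        System.out.println(Short.toUnsignedInt(s) + " vs " + maxUnsigned(16));
        System.out.println(Integer.toUnsignedLong(i) + " vs " + maxUnsigned(32));
        System.out.println(Long.toUnsignedString(l) + " vs " + maxUnsigned(64));
    }
}
```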

3 Some bits table

The maximum value that can be represented using $n$ bits is given by the formula $2^n-1$, assuming unsigned values.

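| bit pattern | base 10 | Hex |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 10 | 2 | 2 |
| 11 | 3 | 3 |
| 100 | 4 | 4 |
| 101 | 5 | 5 |
| 110 | 6 | 6 |
| 111 | 7 | 7 |
| 1000 | 8 | 8 |
| 1001 | 9 | 9 |
| 1010 | 10 | A |
| 1011 | 11 | B |
| 1100 | 12 | C |
| 1101 | 13 | D |
| 1110 | 14 | E |
| 1111 | 15 | F |
| 1 0000 | 16 | 10 |
| 1 0001 | 17 | 11 |
| 1 0010 | 18 | 12 |
| 1 0011 | 19 | 13 |
| 1 0100 | 20 | 14 |
| 1 0101 | 21 | 15 |
| 1 0110 | 22 | 16 |
| 1 0111 | 23 | 17 |
| 1 1000 | 24 | 18 |
| 1 1001 | 25 | 19 |
| 1 1010 | 26 | 1A |
| 1 1011 | 27 | 1B |
| 1 1100 | 28 | 1C |
| 1 1101 | 29 | 1D |
| 1 1110 | 30 | 1E |
| 1 1111 | 31 | 1F |
| 10 0000 | 32 | 20 |
| 0111 1111 | 127 | 7F |
| 1000 0000 | 128 | 80 |
| 1111 1111 | 255 | FF |
| 1 00000000 | 256 | 1 00 |
| 1111 11111111 | 4,095 | F FF |
| 11111111 11111111 | 65,535 | FF FF |
| 1111 11111111 11111111 | 1,048,575 | F FF FF |
| 11111111 11111111 11111111 | 16,777,215 | FF FF FF |
| 1111 11111111 11111111 11111111 | 268,435,455 | F FF FF FF |
| 11111111 11111111 11111111 11111111 | 4,294,967,295 | FF FF FF FF |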

So, 16 bits need 5 digits in base 10 to represent them.
32 bits need 10 digits in base 10.
64 bits need 20 digits in base 10.

So the number of digits in base 10 needed to represent a bit pattern of length $n$ is $\lceil n \log_{10} 2 \rceil \approx 0.3n$, slightly less than $(1/3)n$.
So 128 bits will require 39 digits in base 10 to represent externally.
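
The digit counts can be checked directly. The sketch below counts the decimal digits of $2^n-1$ using `BigInteger` and compares the count with the estimate $\lceil n \log_{10} 2 \rceil$; the class and method names are illustrative.

```java
import java.math.BigInteger;

class DecimalDigits {
    // Number of base-10 digits in 2^n - 1, the largest unsigned n-bit value.
    static int digits(int n) {
        BigInteger max = BigInteger.ONE.shiftLeft(n).subtract(BigInteger.ONE);
        return max.toString().length();
    }

    // Closed-form estimate: ceil(n * log10(2)), valid for n >= 1.
    static int estimate(int n) {
        return (int) Math.ceil(n * Math.log10(2));
    }

    public static void main(String[] args) {
        for (int n : new int[] {8, 16, 32, 64, 128}) {
            System.out.println(n + " bits: " + digits(n)
                    + " digits (estimate " + estimate(n) + ")");
        }
    }
}
```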

4 Power of 2 table

| power of two | base 2 | base 10 | Hex |
|---|---|---|---|
| $2^{0}$ | 1 | 1 | 1 |
| $2^{1}$ | 10 | 2 | 2 |
| $2^{2}$ | 100 | 4 | 4 |
| $2^{3}$ | 1000 | 8 | 8 |
| $2^{4}$ | 1 0000 | 16 | 10 |
| $2^{5}$ | 10 0000 | 32 | 20 |
| $2^{6}$ | 100 0000 | 64 | 40 |
| $2^{7}$ | 1000 0000 | 128 | 80 |
| $2^{8}$ | 1 0000 0000 | 256 | 1 00 |
| $2^{9}$ | 10 0000 0000 | 512 | 2 00 |
| $2^{10}$ (1K) | | 1,024 | 4 00 |
| $2^{11}$ | | 2,048 | 8 00 |
| $2^{12}$ | | 4,096 | 10 00 |
| $2^{13}$ | | 8,192 | 20 00 |
| $2^{14}$ | | 16,384 | 40 00 |
| $2^{15}$ | | 32,768 | 80 00 |
| $2^{16}$ | | 65,536 | 1 00 00 |
| $2^{17}$ | | 131,072 | 2 00 00 |
| $2^{18}$ | | 262,144 | 4 00 00 |
| $2^{19}$ | | 524,288 | 8 00 00 |
| $2^{20}$ (1 MB) | | 1,048,576 | 10 00 00 |
| $2^{21}$ | | 2,097,152 | 20 00 00 |
| $2^{22}$ | | 4,194,304 | 40 00 00 |
| $2^{23}$ | | 8,388,608 | 80 00 00 |
| $2^{24}$ | | 16,777,216 | 1 00 00 00 |
| $2^{25}$ | | 33,554,432 | 2 00 00 00 |
| $2^{26}$ | | 67,108,864 | 4 00 00 00 |
| $2^{27}$ | | 134,217,728 | 8 00 00 00 |
| $2^{28}$ | | 268,435,456 | 10 00 00 00 |
| $2^{29}$ | | 536,870,912 | 20 00 00 00 |
| $2^{30}$ (1 GB) | | 1,073,741,824 | 40 00 00 00 |
| $2^{31}$ | | 2,147,483,648 | 80 00 00 00 |
| $2^{32}$ | | 4,294,967,296 | 1 00 00 00 00 |
| $2^{33}$ | | 8,589,934,592 | 2 00 00 00 00 |
| $2^{34}$ | | 17,179,869,184 | 4 00 00 00 00 |
| $2^{35}$ | | 34,359,738,368 | 8 00 00 00 00 |
| $2^{36}$ | | 68,719,476,736 | 10 00 00 00 00 |
| $2^{37}$ | | 137,438,953,472 | 20 00 00 00 00 |
| $2^{38}$ | | 274,877,906,944 | 40 00 00 00 00 |
| $2^{39}$ | | 549,755,813,888 | 80 00 00 00 00 |
| $2^{40}$ (1 tera) | | 1,099,511,627,776 | 1 00 00 00 00 00 |
| $2^{41}$ | | 2,199,023,255,552 | 2 00 00 00 00 00 |
| $2^{42}$ | | 4,398,046,511,104 | 4 00 00 00 00 00 |
| $2^{43}$ | | 8,796,093,022,208 | 8 00 00 00 00 00 |
| $2^{44}$ | | 17,592,186,044,416 | 10 00 00 00 00 00 |
| $2^{45}$ | | 35,184,372,088,832 | 20 00 00 00 00 00 |
| $2^{46}$ | | 70,368,744,177,664 | 40 00 00 00 00 00 |
| $2^{47}$ | 100000… | 140,737,488,355,328 | 80 00 00 00 00 00 |
| $2^{48}$ | | 281,474,976,710,656 | 1 00 00 00 00 00 00 |
| $2^{49}$ | | 562,949,953,421,312 | 2 00 00 00 00 00 00 |
| $2^{50}$ | | 1,125,899,906,842,624 | 4 00 00 00 00 00 00 |
| $2^{51}$ | | 2,251,799,813,685,248 | 8 00 00 00 00 00 00 |
| $2^{52}$ | | 4,503,599,627,370,496 | 10 00 00 00 00 00 00 |
| $2^{53}$ | | 9,007,199,254,740,992 | 20 00 00 00 00 00 00 |
| $2^{54}$ | | 18,014,398,509,481,984 | 40 00 00 00 00 00 00 |
| $2^{55}$ | | 36,028,797,018,963,968 | 80 00 00 00 00 00 00 |
| $2^{56}$ | | 72,057,594,037,927,936 | 1 00 00 00 00 00 00 00 |
| $2^{57}$ | | 144,115,188,075,855,872 | 2 00 00 00 00 00 00 00 |
| $2^{58}$ | | 288,230,376,151,711,744 | 4 00 00 00 00 00 00 00 |
| $2^{59}$ | | 576,460,752,303,423,488 | 8 00 00 00 00 00 00 00 |
| $2^{60}$ | | 1,152,921,504,606,846,976 | 10 00 00 00 00 00 00 00 |
| $2^{61}$ | | 2,305,843,009,213,693,952 | 20 00 00 00 00 00 00 00 |
| $2^{62}$ | | 4,611,686,018,427,387,904 | 40 00 00 00 00 00 00 00 |
| $2^{63}$ | | 9,223,372,036,854,775,808 | 80 00 00 00 00 00 00 00 |
| $2^{64}$ | | 18,446,744,073,709,551,616 | 1 00 00 00 00 00 00 00 00 |
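
Rows of this table can be regenerated rather than typed by hand. The following is a small sketch using `BigInteger`, so exponents above 63 also work; the class name `PowersOfTwo` and the row layout are illustrative choices, and the hex digits are printed without the byte grouping used in the table.

```java
import java.math.BigInteger;
import java.util.Locale;

class PowersOfTwo {
    // One row: exponent, then 2^k in base 2, base 10 (with commas) and hex.
    static String row(int k) {
        BigInteger p = BigInteger.ONE.shiftLeft(k);
        return String.format(Locale.US, "2^%d  %s  %,d  %s",
                k, p.toString(2), p, p.toString(16).toUpperCase());
    }

    public static void main(String[] args) {
        for (int k = 0; k <= 64; k++) {
            System.out.println(row(k));
        }
    }
}
```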

5 Float and Double in Java

Java uses IEEE 754.

A number such as 0.125 is expressed as $1.25\times 10^{-1}$ or $1\times 2^{-3}$.

In floating point, the second form above is used, i.e. base 2 is used for the exponent.

In single precision, the sign uses 1 bit: 0 for positive and 1 for negative. The exponent uses the next 8 bits (biased by 127), and the mantissa (fraction) uses the remaining 23 bits.

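In Java, a float uses IEEE 754. The following explains how float and double are represented in Java.

A float is expressed as

$s \cdot m \cdot 2^{E-N+1}$

where $s$ is the sign, and can be $-1$ or $+1$, $1 \le m \le 2^{24}-1 = 16{,}777{,}215$, $-126 \le E \le +127$, and $N=24$.

So, from the above, the magnitude of a nonzero float $f$ in IEEE 754 is in the range

$1 \cdot 2^{-126-24+1} \le |f| \le 16777215 \cdot 2^{127-24+1}$

$2^{-149} \le |f| \le 16777215 \cdot 2^{104}$

$1.4\times 10^{-45} \le |f| \le 3.4\times 10^{38}$

The lower bound is the smallest subnormal value. The smallest normalized magnitude is $2^{-126} \approx 1.18\times 10^{-38}$.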

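In Java a double is expressed as

$s \cdot m \cdot 2^{E-N+1}$

where $s$ is the sign, and can be $-1$ or $+1$, $1 \le m \le 2^{53}-1 = 9{,}007{,}199{,}254{,}740{,}991$, $-1022 \le E \le +1023$, and $N=53$.

So, from the above, the magnitude of a nonzero double $f$ in IEEE 754 is in the range

$1 \cdot 2^{-1022-53+1} \le |f| \le 9007199254740991 \cdot 2^{1023-53+1}$

$2^{-1074} \le |f| \le 9007199254740991 \cdot 2^{971}$

$4.9\times 10^{-324} \le |f| \le 1.8\times 10^{308}$

The lower bound is the smallest subnormal value. The smallest normalized magnitude is $2^{-1022} \approx 2.2\times 10^{-308}$.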
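
The extreme values of the form $s \cdot m \cdot 2^{E-N+1}$ can be compared with the constants Java provides, taking $N=24$ for float and $N=53$ for double. This is a sketch only; the class name `FloatRange` and its method names are illustrative. `Math.scalb(x, k)` computes $x \cdot 2^k$.

```java
class FloatRange {
    // Largest finite float: m = 2^24 - 1, exponent 127 - 24 + 1 = 104.
    static float maxFloat() { return Math.scalb(16777215.0f, 104); }

    // Smallest positive float: m = 1, exponent -126 - 24 + 1 = -149.
    static float minFloat() { return Math.scalb(1.0f, -149); }

    // Largest finite double: m = 2^53 - 1, exponent 1023 - 53 + 1 = 971.
    static double maxDouble() { return Math.scalb(9007199254740991.0, 971); }

    // Smallest positive double: m = 1, exponent -1022 - 53 + 1 = -1074.
    static double minDouble() { return Math.scalb(1.0, -1074); }

    public static void main(String[] args) {
        System.out.println(maxFloat() + " " + (maxFloat() == Float.MAX_VALUE));
        System.out.println(minFloat() + " " + (minFloat() == Float.MIN_VALUE));
        System.out.println(maxDouble() + " " + (maxDouble() == Double.MAX_VALUE));
        System.out.println(minDouble() + " " + (minDouble() == Double.MIN_VALUE));
    }
}
```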

5.1 How to read a floating point?

Given this example:

11000011100101100000000000000000

The above is the binary representation of a single-precision floating point number (32 bits).

Read from the leftmost bit (bit 31) to the rightmost bit (bit 0).

Bit 31 is 1, so this is a negative number. Bits 30…23 are the exponent, which is 10000111 or 135. Since the exponent is biased by 127, the actual exponent is $135-127=8$, so the exponent part is $2^8$. Next are bits 22…0, which are 00101100000000000000000. Since there is an implied leading 1, this can be rewritten as 1.00101100000000000000000, which is read as follows:

$1+0(1/2)+0(1/4)+1(1/8)+0(1/16)+1(1/32)+1(1/64)+0(1/128)+0(1/256)+\text{all zeros}$

which is $1+(1/8)+(1/32)+(1/64)=1+(11/64)=75/64$

Hence, with the negative sign, the final number is $-(75/64)\times 2^8=-(75/64)\times 256=-300$.
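
The same reading can be sketched in Java. The helper `decode` below is an illustrative name and handles only normalized values (it ignores zero, subnormals, infinities and NaN); its result is compared with the library call `Float.intBitsToFloat`.

```java
class DecodeFloat {
    // Decode a normalized single-precision bit pattern by hand.
    // Exponent fields 0 and 255 (zero, subnormal, infinity, NaN) are not handled.
    static double decode(String bits) {
        int raw = Integer.parseUnsignedInt(bits, 2);
        int sign = (raw >>> 31) == 0 ? 1 : -1;       // bit 31
        int exponent = ((raw >>> 23) & 0xFF) - 127;  // bits 30..23, minus the bias
        int fraction = raw & 0x7FFFFF;               // bits 22..0
        double mantissa = 1.0 + fraction / (double) (1 << 23); // implied leading 1
        return sign * mantissa * Math.pow(2, exponent);
    }

    public static void main(String[] args) {
        String bits = "11000011100101100000000000000000";
        System.out.println(decode(bits)); // prints -300.0
        System.out.println(Float.intBitsToFloat(Integer.parseUnsignedInt(bits, 2)));
    }
}
```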

The above implies that a number that can’t be expressed as a finite sum of powers of 2 can’t be represented exactly in floating point. Since a float is represented as $m \times 2^e$, assume $e=0$; then the values a float can take go like this: $1$, $1+(1/2)$, $1+(1/2)+(1/4)$, $1+(1/2)+(1/4)+(1/8)$, $1+(1/2)+(1/4)+(1/8)+(1/16)$, …, or $1, 1.5, 1.75, 1.875, 1.9375, \dots$

So a number such as 1.4 can’t be expressed exactly in floating point, because the .4 part can’t be expressed as a finite sum of powers of 2.
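
To see the value that is actually stored, a float can be converted to `BigDecimal`, which gives its exact decimal expansion. This is a sketch; the class name `ExactValue` and the helper `exact` are illustrative.

```java
import java.math.BigDecimal;

class ExactValue {
    // Exact decimal expansion of the binary value stored in f.
    static BigDecimal exact(float f) {
        return new BigDecimal(f); // f is widened to double without loss
    }

    public static void main(String[] args) {
        // 1.4 is rounded to the nearest float, so this is close to, but not, 1.4.
        System.out.println(exact(1.4f));
        // 1.375 = 1 + 1/4 + 1/8 is a finite sum of powers of 2, so it is exact.
        System.out.println(exact(1.375f));
    }
}
```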

The greatest finite number that has an exact IEEE single-precision representation is 340282346638528859811704183484516925440.0 ($2^{128}-2^{104}$). This is a 39-digit number, which is represented by 01111111011111111111111111111111
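
This value is available in Java as `Float.MAX_VALUE`. The sketch below prints it as an exact integer, compares it with $2^{128}-2^{104}$, and prints its digit count and bit pattern; the class and method names are illustrative.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

class MaxFloat {
    // Float.MAX_VALUE as an exact integer.
    static BigInteger maxAsInteger() {
        return new BigDecimal(Float.MAX_VALUE).toBigIntegerExact();
    }

    // 32-character bit pattern of a float, with leading zeros kept.
    static String bitPattern(float f) {
        String s = Integer.toBinaryString(Float.floatToIntBits(f));
        return String.format("%32s", s).replace(' ', '0');
    }

    public static void main(String[] args) {
        BigInteger expected =
                BigInteger.ONE.shiftLeft(128).subtract(BigInteger.ONE.shiftLeft(104));
        System.out.println(maxAsInteger());
        System.out.println(maxAsInteger().equals(expected));
        System.out.println(maxAsInteger().toString().length() + " digits");
        System.out.println(bitPattern(Float.MAX_VALUE));
    }
}
```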

6 References

The Java Language Specification.

http://www.math.grin.edu/~stone/courses/fundamentals/IEEE-reals.html