Java floating point numbers review


number of bits	Java type	signed range	signed range in base 10

8	byte	$2^{7} - 1 \dots - 2^{7}$	$127 \dots - 128$
16	short	$2^{15} - 1 \dots - 2^{15}$	$32, 767 \dots - 32, 768$
32	int	$2^{31} - 1 \dots - 2^{31}$	$2, 147, 483, 647 \dots - 2, 147, 483, 648$
64	long	$2^{63} - 1 \dots - 2^{63}$	$9, 223, 372, 036, 854, 775, 807 \dots - 9, 223, 372, 036, 854, 775, 808$


number of bits	Java type	signed range	signed range in HEX

8	byte	$2^{7} - 1 \dots - 2^{7}$	7F $\dots - 80$
16	short	$2^{15} - 1 \dots - 2^{15}$	7F FF $\dots - 8000$
32	int	$2^{31} - 1 \dots - 2^{31}$	7F FF FF FF $\dots - 80000000$
64	long	$2^{63} - 1 \dots - 2^{63}$	7F FF FF FF FF FF FF FF $\dots - 8000000000000000$


number of bits	Java type	unsigned range	unsigned range in base 10

8	byte	$2^{8} - 1 \dots 0$	$255 \dots 0$
16	short	$2^{16} - 1 \dots 0$	$65, 535 \dots 0$
32	int	$2^{32} - 1 \dots 0$	$4, 294, 967, 295 \dots 0$
64	long	$2^{64} - 1 \dots 0$	$18, 446, 744, 073, 709, 551, 615 \dots 0$


number of bits	Java type	unsigned range	unsigned range in HEX

8	byte	$2^{8} - 1 \dots 0$	FF $\dots 00$
16	short	$2^{16} - 1 \dots 0$	FF FF $\dots 0000$
32	int	$2^{32} - 1 \dots 0$	FF FF FF FF $\dots 00000000$
64	long	$2^{64} - 1 \dots 0$	FF FF FF FF FF FF FF FF $\dots 0000000000000000$
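The ranges in the tables above can be checked from Java itself. The following is a minimal sketch, not part of the original notes: the class name `IntegerRanges` and its helper methods are hypothetical names chosen for illustration. It assumes Java 8 or later, where the `toUnsigned...` helpers are available. Java's `byte`, `short`, `int` and `long` are always signed, so the unsigned ranges are obtained by reinterpreting the all-ones bit pattern.

```java
// Sketch: signed limits of Java's integer types, plus the unsigned reading
// of the all-ones bit pattern for each width. Names are illustrative only.
final class IntegerRanges {
    // (byte) -1 is the bit pattern FF; read as unsigned it is 2^8 - 1.
    static int unsignedByteMax() { return Byte.toUnsignedInt((byte) -1); }

    // (short) -1 is the bit pattern FF FF; read as unsigned it is 2^16 - 1.
    static int unsignedShortMax() { return Short.toUnsignedInt((short) -1); }

    // -1 as an int is FF FF FF FF; read as unsigned it is 2^32 - 1.
    static long unsignedIntMax() { return Integer.toUnsignedLong(-1); }

    // -1L is sixteen F digits; 2^64 - 1 does not fit in a long, so return text.
    static String unsignedLongMax() { return Long.toUnsignedString(-1L); }

    public static void main(String[] args) {
        // Signed ranges, as in the first two tables.
        System.out.println("byte  " + Byte.MAX_VALUE + " ... " + Byte.MIN_VALUE);
        System.out.println("short " + Short.MAX_VALUE + " ... " + Short.MIN_VALUE);
        System.out.println("int   " + Integer.MAX_VALUE + " ... " + Integer.MIN_VALUE);
        System.out.println("long  " + Long.MAX_VALUE + " ... " + Long.MIN_VALUE);

        // Unsigned maxima, as in the last two tables.
        System.out.println("8 bits  " + unsignedByteMax());
        System.out.println("16 bits " + unsignedShortMax());
        System.out.println("32 bits " + unsignedIntMax());
        System.out.println("64 bits " + unsignedLongMax());
    }
}
```

The 64-bit case returns a `String` because no primitive type in Java can hold $2^{64} - 1$ as a positive value.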

3 Some bits table

The maximum value that can be represented using $n$ bits is given by the formula $2^{n} - 1$. This assumes unsigned values.


bit pattern	base 10	Hex

0	0	0
1	1	1
10	2	2
11	3	3
100	4	4
101	5	5
110	6	6
111	7	7
1000	8	8
1001	9	9
1010	10	A
1011	11	B
1100	12	C
1101	13	D
1110	14	E
1111	15	F

1 0000	16	10
1 0001	17	11
1 0010	18	12
1 0011	19	13
1 0100	20	14
1 0101	21	15
1 0110	22	16
1 0111	23	17
1 1000	24	18
1 1001	25	19
1 1010	26	1A
1 1011	27	1B
1 1100	28	1C
1 1101	29	1D
1 1110	30	1E
1 1111	31	1F
10 0000	32	20

0111 1111	127	7F
10000000	128	80
11111111	255	FF
1 00000000	256	1 00
1111 11111111	$4, 095$	F FF
11111111 11111111	$65, 535$	FF FF
1111 11111111 11111111	$1, 048, 575$	F FF FF
11111111 11111111 11111111	$16, 777, 215$	FF FF FF
1111 11111111 11111111 11111111	$268, 435, 455$	F FF FF FF
11111111 11111111 11111111 11111111	$4, 294, 967, 295$	FF FF FF FF

So 16 bits need 5 digits in base 10.
32 bits need 10 digits in base 10.
64 bits need 20 digits in base 10.

So the number of base-10 digits needed to represent a bit pattern of length $n$ is about $n \log_{10} 2 \approx 0.301 n$, or roughly $(1 / 3) n$.

The $(1 / 3) n$ rule slightly overestimates for large $n$: it gives about 42 digits for 128 bits, while $2^{128} - 1 \approx 3.4 \cdot 10^{38}$ actually needs 39 digits.
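The digit counts for a given bit length can be checked directly. The sketch below is an illustration only; the class name `DecimalDigits` and its two methods are hypothetical. It counts the decimal digits of $2^{n} - 1$ exactly with `BigInteger` and compares that with the estimate $\lfloor n \log_{10} 2 \rfloor + 1$. For 128 bits the exact count is 39, so the $(1 / 3) n$ rule of thumb is an upper estimate rather than an exact figure.

```java
import java.math.BigInteger;

// Sketch: how many base-10 digits the largest unsigned n-bit value needs.
final class DecimalDigits {
    /** Exact digit count of 2^n - 1, computed with arbitrary precision. */
    static int exactDigits(int n) {
        BigInteger max = BigInteger.ONE.shiftLeft(n).subtract(BigInteger.ONE);
        return max.toString().length();
    }

    /** Estimate floor(n * log10(2)) + 1, using double arithmetic. */
    static int estimatedDigits(int n) {
        return (int) Math.floor(n * Math.log10(2.0)) + 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {16, 32, 64, 128}) {
            System.out.println(n + " bits: exact " + exactDigits(n)
                    + ", estimate " + estimatedDigits(n)
                    + ", n/3 = " + (n / 3));
        }
    }
}
```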

4 Power of 2 table


power of two	base 2	base 10	Hex

$2^{0}$	1	1	1
$2^{1}$	10	2	2
$2^{2}$	100	4	4
$2^{3}$	1000	8	8
$2^{4}$	1 0000	16	10
$2^{5}$	10 0000	32	20
$2^{6}$	100 0000	64	40
$2^{7}$	1000 0000	128	80
$2^{8}$	1 0000 0000	256	1 00
$2^{9}$	10 0000 0000	512	2 00
$2^{10}$	…	(1K) $1, 024$	4 00
$2^{11}$		$2, 048$	8 00
$2^{12}$		$4, 096$	10 00
$2^{13}$		$8, 192$	20 00
$2^{14}$		$16, 384$	40 00
$2^{15}$		$32, 768$	80 00
$2^{16}$		$65, 536$	1 00 00
$2^{17}$		$131, 072$	2 00 00
$2^{18}$		$262, 144$	4 00 00
$2^{19}$		$524, 288$	8 00 00
$2^{20}$		(1 MB) $1, 048, 576$	10 00 00
$2^{21}$		$2, 097, 152$	20 00 00
$2^{22}$		$4, 194, 304$	40 00 00
$2^{23}$		$8, 388, 608$	80 00 00
$2^{24}$		$16, 777, 216$	1 00 00 00
$2^{25}$		$33, 554, 432$	2 00 00 00
$2^{26}$		$67, 108, 864$	4 00 00 00
$2^{27}$		$134, 217, 728$	8 00 00 00
$2^{28}$		$268, 435, 456$	10 00 00 00
$2^{29}$		$536, 870, 912$	20 00 00 00
$2^{30}$		(1 GB) $1, 073, 741, 824$	40 00 00 00
$2^{31}$		$2, 147, 483, 648$	80 00 00 00
$2^{32}$		$4, 294, 967, 296$	1 00 00 00 00
$2^{33}$		$8, 589, 934, 592$	2 00 00 00 00
$2^{34}$		$17, 179, 869, 184$	4 00 00 00 00
$2^{35}$		$34, 359, 738, 368$	8 00 00 00 00
$2^{36}$		$68, 719, 476, 736$	10 00 00 00 00
$2^{37}$		$137, 438, 953, 472$	20 00 00 00 00
$2^{38}$		$274, 877, 906, 944$	40 00 00 00 00
$2^{39}$		$549, 755, 813, 888$	80 00 00 00 00
$2^{40}$		(1 tera) $1, 099, 511, 627, 776$	1 00 00 00 00 00
$2^{41}$		$2, 199, 023, 255, 552$	2 00 00 00 00 00
$2^{42}$		$4, 398, 046, 511, 104$	4 00 00 00 00 00
$2^{43}$		$8, 796, 093, 022, 208$	8 00 00 00 00 00
$2^{44}$		$17, 592, 186, 044, 416$	10 00 00 00 00 00
$2^{45}$		$35, 184, 372, 088, 832$	20 00 00 00 00 00
$2^{46}$		$70, 368, 744, 177, 664$	40 00 00 00 00 00
$2^{47}$	100000…	$140, 737, 488, 355, 328$	80 00 00 00 00 00
$2^{48}$		$281, 474, 976, 710, 656$	1 00 00 00 00 00 00
$2^{49}$		$562, 949, 953, 421, 312$	2 00 00 00 00 00 00
$2^{50}$		$1, 125, 899, 906, 842, 624$	4 00 00 00 00 00 00
$2^{51}$		$2, 251, 799, 813, 685, 248$	8 00 00 00 00 00 00
$2^{52}$		$4, 503, 599, 627, 370, 496$	10 00 00 00 00 00 00
$2^{53}$		$9, 007, 199, 254, 740, 992$	20 00 00 00 00 00 00
$2^{54}$		$18, 014, 398, 509, 481, 984$	40 00 00 00 00 00 00
$2^{55}$		$36, 028, 797, 018, 963, 968$	80 00 00 00 00 00 00
$2^{56}$		$72, 057, 594, 037, 927, 936$	1 00 00 00 00 00 00 00
$2^{57}$		$144, 115, 188, 075, 855, 872$	2 00 00 00 00 00 00 00
$2^{58}$		$288, 230, 376, 151, 711, 744$	4 00 00 00 00 00 00 00
$2^{59}$		$576, 460, 752, 303, 423, 488$	8 00 00 00 00 00 00 00
$2^{60}$		$1, 152, 921, 504, 606, 846, 976$	10 00 00 00 00 00 00 00
$2^{61}$		$2, 305, 843, 009, 213, 693, 952$	20 00 00 00 00 00 00 00
$2^{62}$		$4, 611, 686, 018, 427, 387, 904$	40 00 00 00 00 00 00 00
$2^{63}$		$9, 223, 372, 036, 854, 775, 808$	80 00 00 00 00 00 00 00
$2^{64}$		$18, 446, 744, 073, 709, 551, 616$	1 00 00 00 00 00 00 00 00
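A table like the one above can be regenerated rather than typed by hand. The following sketch assumes nothing beyond the standard library; the class name `PowerOfTwoTable` and the method `row` are hypothetical. It prints each power of two in base 2, base 10 (with thousands separators) and hex, using `BigInteger` so that $2^{63}$ and $2^{64}$ do not overflow. The hex digits are printed without the byte-wise spacing used in the table.

```java
import java.math.BigInteger;
import java.util.Locale;

// Sketch: one tab-separated table row per power of two.
final class PowerOfTwoTable {
    /** Builds "2^k <tab> base 2 <tab> base 10 <tab> hex" for the given k >= 0. */
    static String row(int k) {
        BigInteger p = BigInteger.ONE.shiftLeft(k); // 2^k
        String hex = p.toString(16).toUpperCase(Locale.ROOT);
        // %,d inserts grouping separators; Locale.US makes them commas.
        return String.format(Locale.US, "2^%d\t%s\t%,d\t%s", k, p.toString(2), p, hex);
    }

    public static void main(String[] args) {
        for (int k = 0; k <= 64; k++) {
            System.out.println(row(k));
        }
    }
}
```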

5 Float and Double in Java

A number such as $0.125$ can be written in base 10 as

1.25 \cdot 10^{- 1}

or in base 2 as

1 \cdot 2^{- 3}

In floating point, the second form above is used, i.e. base 2 is used for the exponent.

In a single-precision float, the sign uses 1 bit: 0 for positive and 1 for negative. The exponent uses the next 8 bits (biased by 127), and the mantissa (fraction) uses the remaining 23 bits.
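The three fields of a 32-bit float can be pulled apart with shifts and masks. The following is a minimal sketch; the class name `FloatFields` and its methods are hypothetical, while `Float.floatToIntBits` is the standard library call that exposes the raw bit pattern. For $0.125 = 1 \cdot 2^{- 3}$ the stored exponent is $- 3 + 127 = 124$ and the 23 stored fraction bits are all zero.

```java
// Sketch: splitting a float into sign, biased exponent and 23-bit fraction.
final class FloatFields {
    /** Bit 31: 0 for positive, 1 for negative. */
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    /** Bits 30..23: the exponent as stored, i.e. true exponent plus 127. */
    static int biasedExponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    /** Bits 22..0: the stored fraction, without the implied leading 1. */
    static int fraction(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float f = 0.125f;
        System.out.println("sign            = " + sign(f));
        System.out.println("biased exponent = " + biasedExponent(f));
        System.out.println("true exponent   = " + (biasedExponent(f) - 127));
        System.out.println("fraction bits   = " + Integer.toBinaryString(fraction(f)));
    }
}
```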

In Java, float and double follow IEEE 754. The following explains how float and double are represented in Java.

5.1 How to read a floating point number?

Consider the following binary representation of a single precision floating point number (32 bit), written as sign, exponent and mantissa:

1 10000111 00101100000000000000000

Bit 31 is 1, so this is a negative number. Bits 30…23 are the exponent, which is 10000111 or 135. Since the exponent is biased by 127, the actual exponent is $135 - 127 = 8$, so the exponent part is $2^{8}$. Next are bits 22…0, which are 00101100000000000000000. Since there is an implied leading 1, the mantissa can be rewritten as 1.00101100000000000000000, which is read as follows:

1 + 0 (1 / 2) + 0 (1 / 4) + 1 (1 / 8) + 0 (1 / 16) + 1 (1 / 32) + 1 (1 / 64) + 0 (1 / 128) + 0 (1 / 256) + \dots \text{(all zeros)} = 75 / 64

Hence the final number is

- (75 / 64) \cdot 2^{8} = - (75 / 64) \cdot 256 = - 300
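The decoding steps above can be sketched in Java as follows. This is an illustration under stated assumptions: the class name `DecodeFloat` and the method `decode` are hypothetical, and the method only handles normalized numbers (exponent field neither all zeros nor all ones). The bit string is the sign bit, the exponent 10000111 and the 23 mantissa bits from the example, written without spaces. The result is compared with the standard `Float.intBitsToFloat`.

```java
// Sketch: rebuilding a float's value from sign, exponent and fraction fields.
final class DecodeFloat {
    /** Decodes a normalized single-precision bit pattern by hand. */
    static double decode(int bits) {
        int sign = bits >>> 31;
        int exponent = ((bits >>> 23) & 0xFF) - 127;      // remove the bias
        int fraction = bits & 0x7FFFFF;                   // 23 stored bits
        // Implied leading 1, then the fraction scaled by 2^-23.
        double significand = 1.0 + fraction / (double) (1 << 23);
        double magnitude = significand * Math.pow(2.0, exponent);
        return sign == 0 ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        // sign 1, exponent 10000111, mantissa 00101100000000000000000
        int bits = Integer.parseUnsignedInt("11000011100101100000000000000000", 2);
        System.out.println("hex pattern    = " + Integer.toHexString(bits));
        System.out.println("manual decode  = " + decode(bits));
        System.out.println("intBitsToFloat = " + Float.intBitsToFloat(bits));
    }
}
```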

The above implies that a number that can't be expressed as a finite sum of powers of 2 can't be represented exactly in floating point. Since a float is represented as $m \cdot 2^{e}$, assume $e = 0$; then the values the mantissa can take build up like this:

1, 1 + (1 / 2), 1 + (1 / 2) + (1 / 4), 1 + (1 / 2) + (1 / 4) + (1 / 8), 1 + (1 / 2) + (1 / 4) + (1 / 8) + (1 / 16), \dots

1, 1.5, 1.75, 1.875, 1.9375, \dots
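The sequence of partial sums above can be reproduced in float arithmetic. The following sketch is illustrative only; the class name `SignificandSteps` and the method `partialSum` are hypothetical. With 23 fraction bits, the sum can take 23 terms after the leading 1 before the mantissa is full, at which point it reaches the largest float below 2.

```java
// Sketch: partial sums 1 + 1/2 + 1/4 + ... as they fill the 23 fraction bits.
final class SignificandSteps {
    /** Returns 1 + 1/2 + ... + 1/2^k in float; exact for 0 <= k <= 23. */
    static float partialSum(int k) {
        float sum = 0.0f;
        float term = 1.0f;
        for (int i = 0; i <= k; i++) {
            sum += term;
            term /= 2.0f; // next power of two, exactly representable
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int k = 0; k <= 4; k++) {
            System.out.println("k = " + k + ": " + partialSum(k));
        }
        // With all 23 fraction bits set, the sum is the float just below 2.
        System.out.println(partialSum(23) == Math.nextDown(2.0f));
        // Gap between 1.0f and the next float, i.e. 2^-23.
        System.out.println(Math.ulp(1.0f));
    }
}
```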

So a number such as $1.4$ can't be expressed exactly in floating point, because the $.4$ part can't be expressed as a finite sum of powers of 2.
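One way to see this in Java is to ask for the exact decimal value that a float actually stores. The sketch below is an illustration, with the hypothetical class name `ExactFloatValue`. It relies on the fact that widening a float to double is exact and that the `BigDecimal(double)` constructor gives the exact binary value rather than a rounded one.

```java
import java.math.BigDecimal;

// Sketch: the exact value stored for a float literal, compared with the
// decimal text it was written as.
final class ExactFloatValue {
    /** Exact decimal expansion of the value held in f. */
    static BigDecimal exact(float f) {
        return new BigDecimal((double) f); // float -> double loses nothing
    }

    /** True when the float holds exactly the given decimal number. */
    static boolean isExact(float f, String decimal) {
        return exact(f).compareTo(new BigDecimal(decimal)) == 0;
    }

    public static void main(String[] args) {
        System.out.println("1.4f is stored as " + exact(1.4f));
        System.out.println("1.5f is stored as " + exact(1.5f));
        System.out.println("1.4f exact? " + isExact(1.4f, "1.4")); // false
        System.out.println("1.5f exact? " + isExact(1.5f, "1.5")); // true
    }
}
```

The value 1.5 is stored exactly because it is $1 + (1 / 2)$, while 1.4 is rounded to the nearest value that 24 significant bits can express.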

The greatest finite number that has an exact IEEE single-precision representation is 340282346638528859811704183484516925440.0, which equals $2^{128} - 2^{104}$. This is a 39-digit number, which is represented by

01111111011111111111111111111111
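In Java this value is available as `Float.MAX_VALUE`. The following sketch checks it against $2^{128} - 2^{104}$ and prints its bit pattern; the class name `LargestFloat` and its methods are hypothetical names used for illustration only.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Sketch: Float.MAX_VALUE as an exact integer and as a 32-bit pattern.
final class LargestFloat {
    /** Float.MAX_VALUE converted to an exact integer. */
    static BigInteger maxFloatExact() {
        return new BigDecimal((double) Float.MAX_VALUE).toBigIntegerExact();
    }

    /** 2^128 - 2^104, computed with arbitrary precision. */
    static BigInteger formula() {
        return BigInteger.ONE.shiftLeft(128).subtract(BigInteger.ONE.shiftLeft(104));
    }

    /** The 32-bit pattern of Float.MAX_VALUE, padded with leading zeros. */
    static String bits() {
        String s = Integer.toBinaryString(Float.floatToIntBits(Float.MAX_VALUE));
        return String.format("%32s", s).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println(maxFloatExact());
        System.out.println("matches 2^128 - 2^104: " + maxFloatExact().equals(formula()));
        System.out.println("decimal digits: " + maxFloatExact().toString().length());
        System.out.println(bits());
    }
}
```

The exponent field is 11111110 rather than 11111111 because the all-ones exponent is reserved for infinity and NaN.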