Q&A 2021

1. R or Python?

Are dataframes in R and Python the same? Which one is better? Which one is faster?

There is no clear-cut answer. Pandas dataframes and R dataframes have many similarities, but they are not identical.

As programming languages, Python is more object-oriented than R. For data analysis tasks, Python relies on a few large packages, while R relies on many small packages.

Interesting articles here:

Are there modules that are similar to tidyverse and dplyr in Python?

Interesting articles here:

2. What is the meaning of the "b" character in front of a string?

The "b" character in front of a string literal means that what follows is a bytes object and not a string. Bytes objects are sequences of bytes (integers from 0 to 255), while strings are sequences of Unicode characters. Bytes objects are in machine-readable form, while strings are in human-readable form. For this reason, bytes objects can be stored on disk directly, while strings must be encoded (for example as UTF-8) before they can be stored on disk.
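A minimal sketch of the distinction, using only built-in types. The sample text "café" and the choice of UTF-8 are illustrative assumptions, not part of the original question:

```python
text = "café"                  # str: a sequence of Unicode characters
data = text.encode("utf-8")    # bytes: a sequence of integers in 0..255

print(data)                    # b'caf\xc3\xa9'
print(len(text), len(data))    # 4 5  (the character é takes two bytes in UTF-8)
print(data[0])                 # 99   (indexing a bytes object gives an integer)

# Decoding with the same encoding recovers the original string.
print(data.decode("utf-8") == text)  # True
```

The bytes object is longer than the string because the encoding decides how many bytes each character occupies.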

3. What is the difference between `reshape` and `resize`?

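reshape does not change the shape of the original array; it returns a new array object with the requested shape (a view of the same data whenever possible).

The resize method of an array changes the original array in place and returns None. Note that the function np.resize behaves differently: it returns a new array and leaves its argument unchanged.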
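A short sketch of the difference, assuming small integer arrays chosen for illustration. `refcheck=False` is passed so that the in-place resize does not fail in interactive sessions that hold extra references to the array:

```python
import numpy as np

a = np.arange(6)
b = a.reshape(2, 3)            # returns a new array object
print(a.shape, b.shape)        # (6,) (2, 3)

c = np.arange(6)
result = c.resize((3, 2), refcheck=False)  # modifies c in place
print(result, c.shape)         # None (3, 2)
```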

4. What is the difference between `flatten` and `ravel`?

flatten returns a copy of the original array.

ravel returns a view of the original array whenever possible, so modifying the array returned by ravel may also modify the original array.
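The behaviour can be checked with `np.shares_memory`. This is a sketch on a small example array; the transposed case is included to show a situation where ravel cannot return a view:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

f = a.flatten()   # always a copy
r = a.ravel()     # a view here, because a is C-contiguous

f[0] = 100
print(a[0, 0])    # 1    (the original is untouched)
r[0] = 200
print(a[0, 0])    # 200  (the original changed through the view)

print(np.shares_memory(a, f), np.shares_memory(a, r))  # False True

# The transpose is not C-contiguous, so ravel has to copy.
t = a.T.ravel()
print(np.shares_memory(a, t))  # False
```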

5. How does fancy indexing work?

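Fancy (advanced) indexing selects elements with an array or list of integers or booleans instead of a scalar or a slice, and it always returns a copy rather than a view. There are additional details here: https://numpy.org/doc/stable/user/basics.indexing.html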
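A few illustrative cases, sketched on made-up arrays (the names `idx`, `mask`, `picked` and `m` are hypothetical):

```python
import numpy as np

a = np.arange(10, 20)          # values 10, 11, ..., 19

idx = [0, 3, 3, 7]             # integer indices; repeats are allowed
picked = a[idx]
print(picked)                  # [10 13 13 17]

mask = a % 2 == 0              # boolean mask with the same shape as a
print(a[mask])                 # [10 12 14 16 18]

picked[0] = -1                 # the result is a copy, not a view
print(a[0])                    # 10

# With one index list per axis, the lists are paired element by element:
# this selects m[0, 1] and m[2, 3].
m = np.arange(12).reshape(3, 4)
print(m[[0, 2], [1, 3]])       # [ 1 11]
```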

6. How does `tobytes` work?

import numpy as np

x = np.array([[1, 2], [3, 4]], order='C', dtype=np.uint8)
print(x)
print(x.tobytes('A'))

[[1 2]
 [3 4]]
b'\x01\x02\x03\x04'

In the case above, we create an array of 4 elements, each represented by 8 bits (1 byte). Calling tobytes returns the raw bytes of the array: the first byte \x01 corresponds to the first element, the second byte \x02 to the second element, and so on.

x = np.array([[2**8-1, 2], [3, 4]], order='C', dtype=np.uint8)
print(x)
print(x.tobytes('A'))

[[255   2]
 [  3   4]]
b'\xff\x02\x03\x04'

In the case above, the first element is 2 to the power of 8 minus 1, which is 255. This is the largest value that an unsigned 8-bit integer can hold. In hexadecimal, 255 corresponds to ff, which is what we see in the first byte.

x = np.array([[10, 2], [3, 4]], order='C', dtype=np.uint8)
print(x)
print(x.tobytes('A'))

[[10  2]
 [ 3  4]]
b'\n\x02\x03\x04'

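In the case above, we see \n rather than \x0a. The byte is the same: when Python prints a bytes object, it shows printable ASCII characters and common escapes such as \n in that form, and 10 (0x0a) is the code of the newline character. To see the numerical value of every byte, we can convert each element of the result of tobytes to hexadecimal: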

[hex(el) for el in x.tobytes('A')]

['0xa', '0x2', '0x3', '0x4']
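As an alternative to the list comprehension above, the built-in `bytes.hex` method gives a compact, zero-padded view of the same bytes. This sketch assumes Python 3.8 or later, where `hex` accepts a separator argument:

```python
import numpy as np

x = np.array([[10, 2], [3, 4]], order='C', dtype=np.uint8)
raw = x.tobytes('A')

print(raw.hex(" "))                  # 0a 02 03 04
print(raw == b"\x0a\x02\x03\x04")    # True: \n and \x0a are the same byte
```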

x = np.array([[10, 2], [3, 4]], order='C', dtype=np.uint32)
print(x)
print(x.tobytes('A'))

[[10  2]
 [ 3  4]]
b'\n\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00'

[hex(el) for el in x.tobytes('A')]

['0xa',
 '0x0',
 '0x0',
 '0x0',
 '0x2',
 '0x0',
 '0x0',
 '0x0',
 '0x3',
 '0x0',
 '0x0',
 '0x0',
 '0x4',
 '0x0',
 '0x0',
 '0x0']

In the case above, each element of the array is represented by 32 bits (4 bytes). Because the values are small, only the least significant byte of each element is nonzero, and it appears first because the machine that produced this output stores integers in little-endian order. If we increase the numerical values, we reach the point where more than one byte is needed.
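The point where a second or third byte becomes necessary can be sketched as follows. The byte order is fixed explicitly with the dtype strings "<u4" (little-endian) and ">u4" (big-endian) so that the result does not depend on the machine; the sample values are chosen for illustration:

```python
import numpy as np

# Little-endian unsigned 32-bit integers: least significant byte first.
for v in [255, 256, 65535, 65536]:
    raw = np.array([v], dtype="<u4").tobytes()
    print(v, raw.hex(" "))
# 255 ff 00 00 00
# 256 00 01 00 00
# 65535 ff ff 00 00
# 65536 00 00 01 00

# The same value 256 in big-endian order: most significant byte first.
big = np.array([256], dtype=">u4").tobytes()
print(big.hex(" "))   # 00 00 01 00
```

With np.uint32 the byte order is the native one of the machine, which is little-endian on most current hardware.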