Hashing refers to the process of generating a fixed-size output from an input of variable size. This is done through the use of mathematical formulas known as hash functions (implemented as hashing algorithms).
Although not all hash functions involve the use of cryptography, the so-called cryptographic hash functions are at the core of cryptocurrencies. Thanks to them, blockchains and other distributed systems are able to achieve significant levels of data integrity and security.
Both conventional and cryptographic hash functions are deterministic. Being deterministic means that as long as the input doesn’t change, the hashing algorithm will always produce the same output (also known as digest or hash).
Typically, the hashing algorithms of cryptocurrencies are designed as one-way functions, meaning they cannot be easily reverted without large amounts of computing time and resources. In other words, it is quite easy to create the output from the input, but relatively difficult to go in the opposite direction (to generate the input from the output alone). Generally speaking, the more difficult it is to find the input, the more secure the hashing algorithm is considered to be.
How does a hash function work?
Different hash functions will produce outputs of differing sizes, but the possible output sizes for each hashing algorithm is always constant. For instance, the SHA-256 algorithm can only produce outputs of 256 bits, while the SHA-1 will always generate a 160-bits digest.
To illustrate, let’s run the words “Binance” and “binance” through the SHA-256 hashing algorithm (the one used in Bitcoin).
Note that a minor change (the casing of the first letter) resulted in a totally different hash value. But since we are using SHA-256, the outputs will always have a fixed size of 256-bits (or 64 characters) - regardless of the input size. Also, it doesn’t matter how many times we run the two words through the algorithm, the two outputs will remain constant.
Conversely, if we run the same inputs through the SHA-1 hashing algorithm, we would have the following results:
Notably, the acronym SHA stands for Secure Hash Algorithms. It refers to a set of cryptographic hash functions that include the SHA-0 and SHA-1 algorithms along with the SHA-2 and SHA-3 groups. The SHA-256 is part of the SHA-2 group, along with SHA-512 and other variants. Currently, only the SHA-2 and SHA-3 groups are considered secure.
Why do they matter?
Conventional hash functions have a wide range of use cases, including database lookups, large files analyses, and data management. On the other hand, cryptographic hash functions are extensively used in information-security applications, such as message authentication and digital fingerprinting. When it comes to Bitcoin, cryptographic hash functions are an essential part of the mining process and also play a role in the generation of new addresses and keys.
The real power of hashing comes when dealing with enormous amounts of information. For instance, one can run a big file or dataset through a hash function and then use its output to quickly verify the accuracy and integrity of the data. This is possible because of the deterministic nature of hash functions: the input will always result in a simplified, condensed output (hash). Such a technique removes the need to store and “remember” large amounts of data.
Hashing is particularly useful within the context of blockchain technology. The Bitcoin blockchain has several operations that involve hashing, most of them within the process of mining. In fact, nearly all cryptocurrency protocols rely on hashing to link and condense groups of transactions into blocks, and also to produce cryptographic links between each block, effectively creating a blockchain.
Cryptographic hash functions
Again, a hash function that deploys cryptographic techniques may be defined as a cryptographic hash function. In general, breaking a cryptographic hash function requires a myriad of brute-force attempts. For a person to “revert” a cryptographic hash function, they would need to guess what the input was by trial and error until the corresponding output is produced. However, there is also the possibility of different inputs producing the exact same output, in which case a “collision” occurs.
Technically, a cryptographic hash function needs to follow three properties to be considered effectively secure. We may describe those as collision resistance, preimage resistance, and second preimage resistance.
Before discussing each property, let’s summarize their logic in three short sentences.
Collision resistance: infeasible to find any two distinct inputs that produce the same hash as output.
Preimage resistance: infeasible to “revert” the hash function (find the input from a given output).
Second-preimage resistance: infeasible to find any second input that collides with a specified input.
Collision resistance
As mentioned, a collision happens when different inputs produce the exact same hash. Thus, a hash function is considered collision-resistant until the moment someone finds a collision. Note that collisions will always exist for any hash function because the possible inputs are infinite, while the possible outputs are finite.
Put in another way, a hash function is collision-resistant when the possibility of finding a collision is so low that it would require millions of years of computations. So despite the fact that there are no collision-free hash functions, some of them are strong enough to be considered resistant (e.g., SHA-256).
Among the various SHA algorithms, the SHA-0 and SHA-1 groups are no longer secure because collisions have been found. Currently, the SHA-2 and SHA-3 groups are considered resistant to collisions.
Preimage resistance
The property of preimage resistance is related to the concept of one-way functions. A hash function is considered preimage-resistant when there is a very low probability of someone finding the input that generated a particular output.
Note that this property is different from the previous one because an attacker would be trying to guess what was the input by looking at a given output. A collision, on the other hand, occurs when someone finds two different inputs that generate the same output, but it doesn’t matter which inputs were used.
The property of preimage resistance is valuable for protecting data because a simple hash of a message can prove its authenticity, without the need to disclose the information. In practice, many service providers and web applications store and use hashes generated from passwords rather than the passwords in plaintext.
Second-preimage resistance
To simplify, we may say that the second-preimage resistance is somewhere in between the other two properties. A second-preimage attack occurs when someone is able to find a specific input that generates the same output of another input that they already know.
In other words, a second-preimage attack involves finding a collision, but instead of searching for two random inputs that generate the same hash, they search for an input that generates the same hash that was generated by another specific input.
Therefore, any hash function that is resistant to collisions is also resistant to second-preimage attacks, as the latter will always imply a collision. However, one can still perform a preimage attack on a collision-resistant function as it implies finding a single input from a single output.
Mining
There are many steps in Bitcoin mining that involves hash functions, such as checking balances, linking transactions inputs and outputs, and hashing transactions within a block to form a Merkle Tree. But one of the main reasons Bitcoin blockchain is secure is the fact that miners need to perform a myriad of hashing operations in order to eventually find a valid solution for the next block.
Specifically, a miner has to try several different inputs when creating a hash value for their candidate block. In essence, they will only be able to validate their block if they generate an output hash that starts with a certain number of zeros. The number of zeros is what determines the mining difficulty, and it varies according to the hash rate devoted to the network.
In this case, the hash rate represents how much computer power is being invested in Bitcoin mining. If the network's hash rate increases, the Bitcoin protocol will automatically adjust the mining difficulty so that the average time needed to mine a block remains close to 10 minutes. In contrast, if several miners decide to stop mining, causing the hash rate to drop significantly, the mining difficulty will be adjusted, making it easier to mine (until the average block time comes back to 10 minutes).
Note that miners don't have to find collisions because there are multiple hashes they can generate as a valid output (starting with a certain number of zeros). So there are several possible solutions for a certain block, and miners only have to find one of them - according to the threshold determined by the mining difficulty.
Because Bitcoin mining is a cost-intensive task, miners have no reason to cheat the system as it would lead to significant financial losses. The more miners join a blockchain, the bigger and stronger it gets.
Closing thoughts
There is no doubt that hash functions are essential tools in computer science, especially when dealing with huge amounts of data. When combined with cryptography, hashing algorithms can be quite versatile, offering security and authentication in many different ways. As such, cryptographic hash functions are vital to nearly all cryptocurrencies networks, so understanding their properties and working mechanisms is certainly helpful for anyone interested in blockchain technology.