Data integrity

Innovative Technologies for Computer Security

Theory

Author: Andrei Biziuk

Affiliation: VSTU

Published: December 3, 2024

Data Integrity Control

Data Integrity Control refers to the processes, measures, and mechanisms put in place to ensure the accuracy, consistency, and reliability of data in a computer system or database. The primary goal of data integrity control is to prevent and detect unauthorized changes, corruption, or errors in data, which can occur due to various factors, including human errors, hardware or software failures, and malicious activities.

Key aspects of data integrity control include:

Data Validation: This involves checking data to ensure that it adheres to predefined standards and rules. It verifies that data is in the correct format, within allowable ranges, and meets specific criteria.
Data Verification: Data is checked for correctness and accuracy. Verification may involve comparing data against known values or using redundancy checks.
Error Detection and Correction: Techniques like checksums or error-correcting codes are used to identify and fix errors in data, ensuring its integrity.
Access Controls: Implementing strict access controls and permissions to limit who can modify data, reducing the risk of unauthorized changes.
Audit Trails: Maintaining logs of all data-related activities, making it possible to trace any changes or access to data, which is crucial for accountability and security.
Encryption: Protecting data through encryption, which ensures that even if unauthorized access occurs, the data remains confidential. Encryption alone does not guarantee integrity; authenticated encryption or a separate message authentication code is needed to detect tampering.
Hash Functions: Using hash functions to generate fixed-length strings (hashes) from data, which can be compared to detect any changes in the data. If the hash values don’t match, the data has been modified or corrupted.
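
Two of the aspects above, data validation and error detection with a checksum, can be sketched with the Python standard library. This is a minimal illustration, not a complete control: the record fields (name, age), the allowed age range, and the sample payload are assumptions made up for the example. CRC32 catches accidental corruption only; it offers no protection against deliberate tampering.

```python
import zlib

def validate_record(record):
    """Return a list of validation errors for a hypothetical user record."""
    errors = []
    name = record.get("name")
    if not isinstance(name, str) or not name.strip():
        errors.append("name must be a non-empty string")
    age = record.get("age")
    if not isinstance(age, int) or not 0 <= age <= 150:
        errors.append("age must be an integer between 0 and 150")
    return errors

def crc32_of(data: bytes) -> int:
    """Compute a CRC32 checksum (detects accidental errors only)."""
    return zlib.crc32(data)

# Store a checksum next to the data when it is written.
payload = b"account=42;balance=100.00"
stored_checksum = crc32_of(payload)

# Later, a corrupted copy no longer matches the stored checksum.
corrupted = b"account=42;balance=900.00"

print(validate_record({"name": "Alice", "age": 30}))  # no errors
print(validate_record({"name": "", "age": 200}))      # two errors
print(crc32_of(payload) == stored_checksum)           # True
print(crc32_of(corrupted) == stored_checksum)         # False
```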

Data integrity control is crucial in various domains, such as finance, healthcare, e-commerce, and any situation where the accuracy and reliability of data are paramount. It helps in maintaining the trustworthiness of data, preventing data loss, and safeguarding against data breaches and tampering.

Why Data Integrity Control is Needed:

Protecting data from unauthorized alterations: Data integrity control helps detect attempts at hacking or malicious actions.
Ensuring data reliability: This is crucial in fields such as finance, medicine, and critical systems, where even minor data changes can have significant consequences.

Hash Functions

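A hash function is a mathematical algorithm that takes an input (or “message”) of any size and produces a fixed-size output, usually displayed as a hexadecimal string. The output, often referred to as the hash value, hash code, or digest, acts as a compact fingerprint of the input data. It is not strictly unique, because infinitely many inputs map to a finite set of outputs, but for a good hash function finding two inputs with the same value is impractical. Hash functions have several important characteristics: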

Fixed Output Length: Hash functions always produce an output of a fixed length, regardless of the size or length of the input. For example, the SHA-256 hash function produces a 256-bit (32-byte) hash value.
Deterministic: The same input will always produce the same hash value. This property is essential for the reliability and consistency of hash functions.
Efficiency: Hash functions are designed to be computed quickly, making them suitable for real-time applications and large datasets.
Pre-Image Resistance: It should be computationally infeasible to reverse the hash value to obtain the original input. In other words, you shouldn’t be able to determine the input data from the hash value.
Avalanche Effect: A small change in the input should result in a significantly different hash value. Changing a single bit of the input should flip about half of the output bits, so similar inputs produce unrelated hash values.
Collision Resistance: It should be computationally infeasible to find two different inputs that produce the same hash value. Collisions must exist, because the output space is finite, but finding one should be impractical.
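
Several of these properties can be observed directly with hashlib. The sketch below checks fixed output length, determinism, and the avalanche effect for SHA-256; the helper names and the sample strings are illustrative choices, not part of any standard API.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a string as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

# Fixed output length: 1 byte and 10,000 bytes both give 64 hex characters.
short_digest = sha256_hex("a")
long_digest = sha256_hex("a" * 10_000)
print(len(short_digest), len(long_digest))

# Deterministic: the same input always gives the same digest.
print(sha256_hex("hello") == sha256_hex("hello"))

# Avalanche effect: changing one character changes roughly half of the 256 bits.
changed = differing_bits(sha256_hex("hello"), sha256_hex("hellp"))
print(f"{changed} of 256 bits differ")
```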

Hash functions have various applications, including:

Data Integrity: Hash values can be used to verify the integrity of data. By comparing the hash value of received data with a previously generated hash value, you can determine whether the data has been altered during transmission or storage.
Password Storage: Storing passwords securely by hashing them with a unique salt and a deliberately slow algorithm (such as bcrypt, scrypt, or Argon2) and storing the hash values rather than the plain text passwords. When a user logs in, the system hashes the entered password with the same salt and compares it to the stored hash.
Digital Signatures: Hash functions are a fundamental component of digital signatures. A digital signature is created by hashing a message and then signing the hash value with a private key; anyone holding the matching public key can verify it.
Cryptographic Applications: Hash functions are used in various cryptographic protocols and algorithms for ensuring data security, including in blockchain technology.
Data Deduplication: Hashing data can identify duplicate or redundant data, reducing storage requirements and optimizing data management.
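
As one illustration of these applications, deduplication can be sketched by grouping items by their digest. The function name find_duplicates and the in-memory dictionary of file names and contents are assumptions for the example; a real system would read the data from storage.

```python
import hashlib
from collections import defaultdict

def find_duplicates(blobs):
    """Group names whose contents share the same SHA-256 digest."""
    groups = defaultdict(list)
    for name, content in blobs.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(name)
    # Keep only digests that were seen more than once.
    return [sorted(names) for names in groups.values() if len(names) > 1]

# Hypothetical file names and contents.
files = {
    "report_v1.txt": b"quarterly report",
    "report_copy.txt": b"quarterly report",
    "notes.txt": b"meeting notes",
}

duplicates = find_duplicates(files)
print(duplicates)  # [['report_copy.txt', 'report_v1.txt']]
```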

Commonly used hash functions include MD5, SHA-1, SHA-256, and SHA-3, with varying levels of security and performance. MD5 and SHA-1 are no longer collision resistant and should not be used where security matters. The choice of hash function depends on the specific use case and security requirements.

Hashing algorithms

Hashing algorithms are fundamental in computer science and information security. They take an input (or “message”) and produce a fixed-size string of characters, called a hash value or hash code. These hash values are used for various purposes, including data integrity verification, password storage, digital signatures, and more. Here are some commonly used hashing algorithms:

MD5 (Message Digest Algorithm 5):
- Produces a 128-bit (16-byte) hash value.
- Widely used in the past but considered weak for security due to vulnerabilities, including collision attacks.
SHA-1 (Secure Hash Algorithm 1):
- Produces a 160-bit (20-byte) hash value.
- Previously used for security but is now considered weak due to vulnerabilities. SHA-1 collisions have been demonstrated.
SHA-256 (Secure Hash Algorithm 256):
- A part of the SHA-2 family, which includes SHA-224, SHA-256, SHA-384, and SHA-512, producing hash values of varying lengths.
- SHA-256 produces a 256-bit (32-byte) hash value.
- Widely used for data integrity and security.
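SHA-3 (Secure Hash Algorithm 3):
- The latest member of the Secure Hash Algorithm family, standardized by NIST in 2015.
- Provides hash values of different lengths, with SHA3-256 producing a 256-bit hash value.
- Based on the Keccak sponge construction, a different design from SHA-2. It was developed as an alternative in case weaknesses are found in SHA-2, not because SHA-2 is broken.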
RIPEMD (RACE Integrity Primitives Evaluation Message Digest):
- A family of cryptographic hash functions, including RIPEMD-128, RIPEMD-160, and RIPEMD-256.
- Developed in Europe as an alternative to existing hash functions.
Whirlpool:
- A cryptographic hash function that produces a fixed-size 512-bit (64-byte) hash value.
- Designed to be secure and resistant to known attacks.
BLAKE2:
- A cryptographic hash function that is faster than many alternatives, making it popular for checksumming and keyed hashing. It is not a password hashing function by itself, although Argon2 uses it internally.
Bcrypt and Scrypt:
- While not general-purpose hash functions, these are used for secure password hashing.
- They are salted and deliberately slow (key stretching), which slows down brute-force attacks; scrypt is also memory-hard.
Argon2:
- A state-of-the-art memory-hard function designed for password hashing.
- Winner of the Password Hashing Competition (PHC) and widely considered a strong choice for password security.
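
The output lengths listed above can be checked against Python’s hashlib. This sketch covers only algorithms that hashlib normally guarantees; RIPEMD and Whirlpool are left out because their availability depends on the underlying OpenSSL build, and MD5 may be disabled on some hardened systems.

```python
import hashlib

# Algorithms chosen for illustration from the list above.
ALGORITHMS = ["md5", "sha1", "sha256", "sha512", "sha3_256", "blake2b"]

def digest_bits(name: str) -> int:
    """Return the digest length of a hashlib algorithm in bits."""
    return hashlib.new(name).digest_size * 8

sizes = {name: digest_bits(name) for name in ALGORITHMS}
for name, bits in sizes.items():
    print(f"{name:10s} {bits} bits")
```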

It’s important to note that not all hash functions are suitable for the same purposes. Cryptographic hash functions are specifically designed to withstand attacks and are suitable for security-related applications. Non-cryptographic hash functions, like xxHash, are used for tasks where high-speed hashing is required, but security is not a primary concern.

When selecting a hash function, it’s crucial to consider your specific use case and the level of security required. For cryptographic purposes, it’s advisable to use one of the well-established and secure cryptographic hash functions.

Examples of hashing algorithms in Python

Using the hashlib library to compute hashes:

Python’s hashlib library provides a straightforward way to use various hashing algorithms. Here’s an example using the SHA-256 algorithm:

import hashlib

data = "Hello, World!".encode('utf-8')  # Convert the string to bytes
hash_object = hashlib.sha256(data)
hash_hex = hash_object.hexdigest()

print("SHA-256 Hash:", hash_hex)

This code calculates the SHA-256 hash value for the input string “Hello, World!” and prints the hash in hexadecimal format.

Using bcrypt for password hashing:

The bcrypt library is commonly used for secure password hashing. You can install it using pip:

pip install bcrypt

Here’s an example of using bcrypt to hash a password:

import bcrypt

# User's password (should be bytes, not a string)
password = b'my_secure_password'

# Generate a salt (bcrypt embeds it in the resulting hash,
# so it does not need to be stored separately)
salt = bcrypt.gensalt()

# Hash the password
hashed_password = bcrypt.hashpw(password, salt)

# Check a password (e.g., during login)
input_password = b'my_secure_password'
if bcrypt.checkpw(input_password, hashed_password):
    print("Password is correct.")
else:
    print("Password is incorrect.")

This code hashes a password using bcrypt and later verifies a user’s input against the stored hash. The salt is embedded in the hash that bcrypt.hashpw returns, so storing that single value is enough to verify passwords later with bcrypt.checkpw.
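
If installing bcrypt is not an option, a similar salted, slow password hash can be sketched with the standard library. Note that this swaps in PBKDF2-HMAC-SHA256, not bcrypt, and that here the salt must be stored next to the digest by the application. The iteration count is an illustrative assumption and should be tuned for the target hardware.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative work factor, not a recommendation

def hash_password(password, salt=None):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for every new password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest

def verify_password(password, salt, expected_digest):
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)

# Store both values; unlike bcrypt, the salt is not embedded in the digest.
salt, stored = hash_password("my_secure_password")

print(verify_password("my_secure_password", salt, stored))  # True
print(verify_password("wrong_password", salt, stored))      # False
```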

Using the hashlib library to verify data integrity:

You can also use hashing algorithms to verify the integrity of data, such as files. Here’s an example:

import hashlib

# Calculate the SHA-256 hash of a file
def calculate_file_hash(file_path):
    sha256_hash = hashlib.sha256()
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(65536)  # Read in 64KB blocks
            if not data:
                break
            sha256_hash.update(data)
    return sha256_hash.hexdigest()

# Example: Verify a file's integrity
file_path = 'example.txt'
expected_hash = 'your_expected_hash_here'

file_hash = calculate_file_hash(file_path)
if file_hash == expected_hash:
    print("File integrity verified.")
else:
    print("File integrity compromised.")

In this example, the code calculates the SHA-256 hash of a file and compares it to an expected hash value to verify the file’s integrity.
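
To see the check fail as well as succeed, the same idea can be exercised end to end on a temporary file. This is a self-contained sketch: the file name, its contents, and the helper names are made up for the demonstration, and the comparison uses hmac.compare_digest as a cautious default.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_of_file(path, block_size=65536):
    """Hash a file in blocks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def file_matches(path, expected_hex):
    """Compare the file's digest with an expected hex string."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "wb") as f:
        f.write(b"original contents")

    expected = sha256_of_file(path)      # recorded while the file is trusted
    intact = file_matches(path, expected)

    with open(path, "ab") as f:          # simulate tampering: append one byte
        f.write(b"!")
    tampered_detected = not file_matches(path, expected)

print(intact, tampered_detected)  # True True
```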

Electronic signatures

Hash functions play a critical role in electronic signatures, ensuring the integrity and security of digitally signed data. Here are some key applications of hash functions in electronic signatures:

Data Integrity Verification: When a document or message is signed electronically, a hash of the content is computed. This hash is then signed, rather than the entire document. When the document is later verified, the recipient computes the hash of the received content and compares it to the hash in the signature. If they match, it indicates that the document has not been altered during transmission.
Reducing Computational Overhead: Electronic signatures involve complex mathematical operations, and signing the entire document can be computationally expensive, especially for large files. Hashing reduces the document’s size to a fixed-length hash value, making the signing process more efficient.
Storage Efficiency: Storing the hash of a document in the signature is more efficient in terms of storage space compared to storing the entire document. This is especially beneficial when dealing with a large number of signed documents.
Collision Resistance: A good hashing function should be resistant to collisions, where two different documents produce the same hash. Ensuring collision resistance is vital to prevent an attacker from substituting one document for another without detection.
Authentication: The signature authenticates the signer. A recipient who knows the signer’s public key can recompute the hash of the document and check it against the signature, confirming that the document was signed by the holder of the matching private key.
Non-repudiation: Using a secure hashing function in conjunction with digital signatures provides non-repudiation. This means that the signer cannot later deny signing the document, as the signature is unique to the signer and the document’s content.
Timestamping: Electronic signatures can also include a timestamp to prove that the signature was applied at a specific point in time. Hashing functions are used to compute a hash of the document and the timestamp, which is then signed. This proves that the document existed in its current state at the time of signing.
Chaining Signatures: In cases where multiple parties need to sign a document in a specific order, each signer can sign the hash of the document and the previous signer’s signature. This creates a chain of trust, ensuring that the document has not been tampered with and that each signer has endorsed the previous signatures.
Secure Key Storage: Hashing is used in the generation and storage of cryptographic keys. Key derivation functions built on hash functions (such as PBKDF2, scrypt, or Argon2) derive keys from passwords, so that even if the derived value is exposed, recovering the original password requires an expensive brute-force search.
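
The timestamping and chaining items can be illustrated by computing the digest that each signer would sign. The sketch below stops before the signing step: the first signature is a placeholder byte string, and the function names, timestamps, and document text are assumptions for the example. Each field is hashed separately so that the combined input has an unambiguous layout.

```python
import hashlib

def field_digest(data: bytes) -> bytes:
    """Hash one field to a fixed 32 bytes so fields cannot run into each other."""
    return hashlib.sha256(data).digest()

def digest_to_sign(document: bytes, timestamp: str,
                   previous_signature: bytes = b"") -> bytes:
    """Combine document, timestamp, and the previous signature into one digest."""
    combined = hashlib.sha256()
    combined.update(field_digest(document))
    combined.update(field_digest(timestamp.encode("utf-8")))
    combined.update(field_digest(previous_signature))
    return combined.digest()

document = b"Contract text agreed by both parties."
timestamp = "2024-12-03T10:00:00Z"  # hypothetical signing time

# First signer: digest over the document and the timestamp.
first_digest = digest_to_sign(document, timestamp)

# Placeholder standing in for the first signer's real signature bytes.
first_signature = b"placeholder-for-first-signature"

# Second signer: the digest also covers the first signature, forming a chain.
second_digest = digest_to_sign(document, "2024-12-03T11:30:00Z", first_signature)

print(first_digest.hex())
print(second_digest.hex())
# Any change to the document produces a different digest to sign.
print(digest_to_sign(document + b" ", timestamp) == first_digest)  # False
```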

Overall, hashing functions are a foundational component of electronic signatures, providing data integrity, authentication, and security. They help ensure the reliability and trustworthiness of electronically signed documents and messages.

Example of electronic signature

To create an example of an electronic signature using hashing functions in Python, we’ll use the hashlib library to calculate the hash of a message and the cryptography library to generate a digital signature using a private key. Here’s a step-by-step example:

First, install the required libraries if you haven’t already:

pip install cryptography

Now, let’s create an electronic signature example:

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding, rsa, utils
import hashlib

# Step 1: Generate a private key
private_key = rsa.generate_private_key(
    public_exponent=65537,
    key_size=2048,
)

# Step 2: Serialize the private key
private_pem = private_key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.TraditionalOpenSSL,
    encryption_algorithm=serialization.NoEncryption()
)

# Step 3: Generate a public key from the private key
public_key = private_key.public_key()

# Step 4: Serialize the public key
public_pem = public_key.public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo
)

# Step 5: Prepare a message
message = "This is the message to be signed.".encode('utf-8')

# Step 6: Calculate the hash of the message
hash_object = hashlib.sha256()
hash_object.update(message)
message_hash = hash_object.digest()

# Step 7: Sign the hash with the private key.
# utils.Prehashed tells the library that message_hash is already a
# SHA-256 digest, so it is not hashed a second time.
signature = private_key.sign(
    message_hash,
    padding.PSS(
        mgf=padding.MGF1(hashes.SHA256()),
        salt_length=padding.PSS.MAX_LENGTH
    ),
    utils.Prehashed(hashes.SHA256())
)

# Step 8: Verify the signature using the public key
try:
    public_key.verify(
        signature,
        message_hash,
        padding.PSS(
            mgf=padding.MGF1(hashes.SHA256()),
            salt_length=padding.PSS.MAX_LENGTH
        ),
        utils.Prehashed(hashes.SHA256())
    )
    print("Signature is valid.")
except InvalidSignature:
    print("Signature is invalid.")

# Optionally, you can print the public and private keys
# for demonstration purposes.
print("Private Key (Keep this secret):\n", private_pem.decode('utf-8'))
print("Public Key (Share this):\n", public_pem.decode('utf-8'))

This code generates an RSA private key, derives a public key from it, prepares a message, calculates a SHA-256 hash of the message, signs the hash with the private key, and then verifies the signature using the public key.

Please note that in a real-world scenario, you should securely store the private key and share the public key for signature verification. Electronic signatures are used to ensure the integrity and authenticity of data in various applications, such as secure communication and document verification.