Read Binary Files¶
In this tutorial file we show to how to read a binary file in parallel with PIGO.
The Setup¶
Suppose we are reading a dense array of integers from a binary file. The file format is simple: first the number of integers are stored (as an unsigned integer) and then, one after another, the integers.
We have an existing program that reads such arrays and the goal is to modify it to read with PIGO to take advantage of parallel improvements.
The full format is as follows:
<size:uint64_t><data:int32>...
On Linux with bash, we can generate a random large file with the following commands (Warning: this will create a 3.8 GB file. Adjust the numbers if necessary.):
perl -e 'print(pack('Q',999817214))' > ints.bin
dd if=/dev/urandom of=ints.bin seek=1 bs=8 count=131071
dd if=/dev/urandom of=ints.bin seek=1 bs=1048576 count=3813
Loading the Vector¶
We want to build a program that loads the vector to then perform some action on it.
Following is the existing program that we start with (readbin.cpp)
#include <iostream>
#include <fstream>
#include <cstdint>
#include <omp.h>
using namespace std;
void do_something(uint64_t size, int32_t* vec) { }
int main(int argc, char** argv) {
// Open and load the file
if (argc != 2) {
cerr << "Usage: " << argv[0] << " file" << endl;
return 1;
}
double t = omp_get_wtime();
uint64_t size;
// Open the input file
auto fin = fstream(argv[1], ios::in | ios::binary);
// Read the expected size of the file
fin.read((char*)&size, sizeof(size));
// Allocate the vector
int32_t* vec = new int32_t[size];
// Copy the integers
fin.read((char*)vec, sizeof(int32_t)*size);
cerr << "[read] " << omp_get_wtime()-t << endl;
do_something(size, vec);
// Cleanup
delete [] vec;
fin.close();
return 0;
}
This program can be compiled on Linux with
g++ -std=c++11 -O3 -o readbin readbin.cpp -fopenmp
and on Mac, after brew install libomp
, with
clang++ -std=c++11 -O3 -o readbin readbin.cpp -Xclang -fopenmp /usr/local/lib/libomp.dylib
.
Loading Faster¶
We can use PIGO to load in parallel instead.
Ensure that pigo.hpp
is in the same folder.
PIGO has support for reading and writing binary files through the
FileReader
. This class ROFile
can be used for read-only
files. First, create an ROFile
with the filename to open using
ROFile fin { argv[1] };
. Next, the uint64_t size
can be read
out by using the read
method of the ROFile
, size
= fin.read<uint64_t>();
Finally, the vector can be populated by replacing
the read with a parallel_read
, which takes the destination and the
number of bytes to read.
The modified program is as follows:
#include <iostream>
#include <fstream>
#include <cstdint>
#include <omp.h>
using namespace std;
#include "pigo.hpp"
using namespace pigo;
void do_something(uint64_t size, int32_t* vec) { }
int main(int argc, char** argv) {
// Open and load the file
if (argc != 2) {
cerr << "Usage: " << argv[0] << " file" << endl;
return 1;
}
// Read the input file
double t = omp_get_wtime();
ROFile fin { argv[1] };
uint64_t size = fin.read<uint64_t>();
int32_t* vec = new int32_t[size];
fin.parallel_read((char*)vec, sizeof(int32_t)*size);
cerr << "[read] " << omp_get_wtime()-t << endl;
do_something(size, vec);
delete [] vec;
return 0;
}
Performance Comparison¶
Following are the results in seconds between the original, fread
version and the PIGO version. (C) means the cache is cold and (W) means
warm.
System |
|
PIGO |
Speedup |
---|---|---|---|
Laptop (C) |
6.44 |
6.39 |
1.01 x |
Laptop (W) |
3.25 |
1.48 |
2.20 x |
Workstation (C) |
4.22 |
1.50 |
2.81 x |
Workstation (W) |
2.32 |
0.26 |
8.92 x |
Server (C) |
6.53 |
1.53 |
4.27 x |
Server (W) |
1.73 |
0.18 |
9.61 x |