You will learn how to construct a Raptor layout of large collections of nucleotide sequences.
| Difficulty | Easy |
|---|---|
| Duration | 30 min |
| Prerequisite tutorials | First steps with Raptor |
| Recommended reading | Interleaved Bloom Filter (IBF), Hierarchical Interleaved Bloom Filter (HIBF) |
Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter (HIBF) (raptor::hierarchical_interleaved_bloom_filter). It stores the bins in a way that almost always saves space (except if the input samples are all of the same size). It distinguishes between user bins, which reflect the individual input samples, and technical bins, which are the physical storage units within the HIBF. A technical bin may store a single user bin, a split part of a user bin, or several (merged) user bins. This is especially useful when samples vary dramatically in size.
To use the HIBF, a layout must be created.
To realise this distinction between user bins and technical bins, the layout must be computed before the index is created. For this purpose we have developed our own tool, Chopper, and integrated it into Raptor, so you can simply invoke it with raptor layout without having to install Chopper separately.
The figure above shows how the user bins are stored in the technical bins. The resulting tree represents the layout. The first step is to estimate the number of (representative) k-mers per user bin by computing HyperLogLog (HLL) sketches of the input data. These HLL sketches are stored in a directory and are used when computing an HIBF layout. We will go into more detail later, in the section on the HyperLogLog sketch. The HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated using a k-mer count per user bin, which represents the potential density of a technical bin in an Interleaved Bloom Filter.
Using all default values a first call will look like:
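For example, assuming your list of database files is called all_bin_paths.txt (the file name used in the tasks below), the only required option is the input file:

```bash
raptor layout --input-file all_bin_paths.txt
```

All other parameters, for example --tmax, then keep their default values.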
The input-file looks exactly as in our previous calls of raptor index; it contains all the paths of our database files.
The parameter --tmax limits the number of technical bins on each level of the HIBF. Choosing a good tmax is not trivial: a small tmax leads to more levels in the layout, while a large tmax makes each individual IBF larger.
We then get the resulting layout (default: binning.out) as an output file, which we later pass to Raptor when creating the index. You can change this default with --output-filename.
Use the data of the 1024 folder. First determine the best tmax value, then calculate a layout with the default values and this tmax.
\note Sometimes it is better to use absolute paths instead.
You should have run:
With the output:
And afterwards:
Your directory should look like this:
To create an index, and thus a layout, the individual samples of the data set are chopped up into k-mers, which are passed through hash functions to set specific bits in the sample's so-called bin of the Bloom Filter. This means that a k-mer from sample i, processed with j hash functions, sets j bits to 1 in bin i.
Example: query ACGT with k-mers ACG and CGT.
Sample 1: AATGT, Sample 2: ACCGT, Sample 3: ACGTA; 2 hash functions.
Bins 1 to 3 for ACG could look like |0000|0000|0101|
Bins 1 to 3 for CGT could look like |0000|0110|1100|
-> Only in bin 3 do both k-mers hit set bits, so the query seems to match Sample 3.
If a query is then searched, its k-mers are passed through the hash functions, and we check in which bins they point only to ones. This can also result in false positives, so the result only indicates that the query is probably part of a sample.
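The membership test described above can be sketched in a few lines of Python. This is a toy illustration of the example: the bin size, the hash functions, and all names are made up, and Raptor's real implementation uses optimized hashing and bit-packed vectors.

```python
import hashlib

BITS_PER_BIN = 64   # length of each bin's toy Bloom filter (made up)
NUM_HASHES = 2      # j hash functions per k-mer, as in the example

def kmers(seq, k=3):
    """All k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def positions(kmer):
    """NUM_HASHES deterministic bit positions for a k-mer."""
    return [int(hashlib.sha1(f"{seed}:{kmer}".encode()).hexdigest(), 16)
            % BITS_PER_BIN for seed in range(NUM_HASHES)]

samples = ["AATGT", "ACCGT", "ACGTA"]  # samples 1..3 from the example

# Build: a k-mer from sample i sets its bits only in bin i.
bins = [[False] * BITS_PER_BIN for _ in samples]
for i, sample in enumerate(samples):
    for kmer in kmers(sample):
        for pos in positions(kmer):
            bins[i][pos] = True

def query(q):
    """1-based numbers of the bins in which *every* query k-mer
    points only to ones (false positives are possible)."""
    hits = set(range(len(samples)))
    for kmer in kmers(q):
        hits &= {i for i in hits if all(bins[i][p] for p in positions(kmer))}
    return sorted(i + 1 for i in hits)

print(query("ACGT"))  # sample 3 (ACGTA) is guaranteed to be reported
```

Because sample 3 really contains both ACG and CGT, bin 3 is always reported; any other reported bin would be a false positive caused by hash collisions.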
This principle also applies to the Hierarchical Interleaved Bloom Filter, except that the bins are stored even more efficiently, as described above, and this storage is described by the layout. This means that some parameters that you would otherwise specify for the index already have to be known for the layout:
With --kmer-size you can specify the length of the k-mers, which should be long enough to avoid random hits. By using multiple hash functions, you can sometimes further reduce the probability of false positives (--num-hash-functions). We found a useful Bloom Filter Calculator that can help with choosing these values; as it is not ours, we do not guarantee its accuracy. When using this calculator, the number of inserted elements is the number of k-mers in a single bin, and you should use the biggest bin to be on the safe side.
Each Bloom Filter has a bit vector length that, over all Bloom Filters, determines the size of the Interleaved Bloom Filter; in the IBF case we can specify this size directly. Since the HIBF calculates the size of the index itself, it is no longer possible to specify a size here. Instead, we offer the option to specify the desired false positive rate with --false-positive-rate.
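If you prefer to do this calculation yourself rather than rely on an online calculator, the classic Bloom filter approximations fit in a few lines of Python. The function names are ours; the formulas are the standard textbook approximations, not Raptor's internal sizing code.

```python
import math

def bloom_fp_rate(m_bits, n_elems, k_hashes):
    """Standard approximation of a Bloom filter's false positive rate."""
    return (1.0 - math.exp(-k_hashes * n_elems / m_bits)) ** k_hashes

def bits_for_fp_rate(n_elems, fp_rate):
    """Bit vector length (and optimal hash count) for a target rate."""
    m = -n_elems * math.log(fp_rate) / (math.log(2) ** 2)
    k = max(1, round(m / n_elems * math.log(2)))
    return math.ceil(m), k

# Size a single bin for its k-mer count; use the biggest bin to be safe.
m, k = bits_for_fp_rate(n_elems=1_000_000, fp_rate=0.05)
print(f"{m} bits ({m / 8 / 1024:.0f} KiB), {k} hash functions")
```

As in the calculator, the number of inserted elements is the k-mer count of a single bin.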
A call could then look like this:
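For example, combining the options just discussed (the concrete values here are only placeholders, not recommendations):

```bash
raptor layout --input-file all_bin_paths.txt \
              --kmer-size 20 \
              --num-hash-functions 2 \
              --false-positive-rate 0.05 \
              --output-filename binning.out
```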
Raptor supports parallelization. By specifying --threads, for example, the k-mer hashes are processed simultaneously.
Use the same all_bin_paths.txt and create a binning2.out. Use a k-mer size of 16, 3 hash functions, a false positive rate of 0.1, and 2 threads.
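Such a call could, for example, look like this (using only the options introduced in this section):

```bash
raptor layout --input-file all_bin_paths.txt \
              --kmer-size 16 \
              --num-hash-functions 3 \
              --false-positive-rate 0.1 \
              --threads 2 \
              --output-filename binning2.out
```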
Your directory should look like this:
We will also give a short explanation of the HLL sketches here in order to explain the possible parameters; note that each bin is sketched individually.
So the question is: how many elements of our multiset are distinct? An exact calculation needs more storage space and runtime than the HLL estimate. To find this out, we compute (binary) 64 bit hash values of the data, which are distributed uniformly over all possible hash values. Going through this hashed data, you can estimate how many distinct elements you have seen so far just by reading leading zeros: for an element whose hash value has p leading zeros, it is estimated that you have seen about 2^(p+1) distinct elements.
However, if we are unlucky and come across a hash value that consists of only 0's, this estimate, 2^(64+1), is far too large. To dampen such outliers, the hashed data is divided into m substreams and the individual estimates are combined with a (harmonic) mean.
We can influence this m with --sketch-bits. m must be a power of two so that we can divide the 64 bits evenly, so we use --sketch-bits to set a b with m = 2^b: the first b bits of a hash value select the substream, and the remaining 64 - b bits are used for the leading-zero estimate.
If we choose our b (and thus m) very large, we need more memory but get a higher accuracy (storage consumption grows exponentially in b). In addition, calculating the layout can take longer with a high b. If we have many user bins and observe a long layout computation time, it is worth choosing a somewhat smaller b. On the other hand, the relative error of the HLL estimate increases with decreasing b. Based on our benchmarks, we believe that anything above m = 512 should be fine.
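To make the estimation procedure above concrete, here is a toy HyperLogLog estimator in Python. This is our own illustrative code using the standard bias-correction constant for large m; Raptor computes its sketches with its own optimized implementation.

```python
import hashlib

def hll_estimate(items, b=9):
    """Estimate the number of distinct items with m = 2^b substreams."""
    m = 2 ** b
    registers = [0] * m
    for item in items:
        # Deterministic 64-bit hash of the item.
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & (2**64 - 1)
        idx = h >> (64 - b)                      # first b bits: substream index
        rest = h & ((1 << (64 - b)) - 1)         # remaining 64 - b bits
        rank = (64 - b) - rest.bit_length() + 1  # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    # Harmonic mean over the substreams, with the usual correction alpha_m.
    alpha = 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -r for r in registers)

# With b = 9 (m = 512) the typical relative error is about 1.04/sqrt(512), ~5%.
print(round(hll_estimate(range(100_000))))
```

Doubling b quadruples the number of registers (and thus memory) but only halves the relative error, which is why very large b mainly costs resources.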
The following options should only be touched if the calculation takes a long time.
We have implemented another preprocessing step that groups the technical bins even better with regard to similarities in the input data. It can be switched off with the flag --skip-similarity-preprocessing if it costs too much runtime.
With --max-rearrangement-ratio you can further influence a part of this preprocessing (a value between 0 and 1). Setting it to 1 switches it off. Setting it to a very small value increases runtime and memory usage, while a value close to 1 allows only little rearranging, which could result in a less memory-efficient layout. This parameter should only be changed if the layouting takes too much memory or time, because that is where it can have a large influence.
One last observation about these advanced options: If you expect hardly any similarity in the data set, then the similarity preprocessing makes very little difference.
You should only touch the parameter --alpha if you have understood very well how the layout works and you are dissatisfied with the resulting index, e.g. there is still a lot of space in RAM but the index is very slow.
The layout algorithm optimizes the space consumption of the resulting HIBF but currently has no means of optimizing the runtime for querying such an HIBF. In general, the ratio of merged bins and split bins influences the query time because a merged bin always triggers another search on a lower level. To influence this ratio, alpha can be used.
Alpha is a parameter for weighting the storage space calculation of the lower-level IBFs. It functions as a lower-level penalty, i.e. if alpha is large, the DP algorithm tries to avoid lower levels, which at the same time leads to the top-level IBF becoming somewhat larger. This improves query times but leads to a bigger index.