1. DataFinder Indexing Performance
DataFinder Server Edition is built to index files contained in defined search areas. Before you can start mining the content of a DataFinder server, these files must be indexed. The indexing speed is determined by:
- Network performance to access these files
- DataPlugin(s) used to index files
- Number of items per file
- Number of “optimized” custom properties
- Indexing strategy
- DataFinder index performance
Cause 1.1: Network performance
A DataFinder server indexes the files defined by one or more search areas, where each search area specifies a network share or a part of one. The network performance between that file server and the DataFinder server has a direct influence on the indexing speed. The faster DataFinder can access a file, the faster it can index it.
Please note that even if the network speed between these two servers has a significant influence on indexing, it is not advisable to directly install DataFinder on a file server. The traffic caused by other users, processes, and applications will significantly decrease the overall DataFinder performance.
Cause 1.2: DataPlugin
DataFinder uses DataPlugins to index files. The time a DataPlugin takes to read a file directly influences the indexing speed. Since DataFinder only indexes the metadata (properties) of a file, the DataPlugin should be written so that the bulk or channel data section of the file can be skipped during indexing. The following rules apply to the DataPlugin VBS API:
- Use GetBinaryBlock, GetFixedWidthBlock, GetStringBlock, or GetCellBlock to define a DirectAccessChannel of the bulk or channel data section of the file
- If the methods above cannot be used because of the nature of the file, create the DataPlugin using the “DataFinder parameter” and exclude the bulk or channel data section of the file with the statement “if not DataFinderParameter then”
Please note that every improvement of the DataPlugin performance will have a significant influence on the overall system performance, including indexing files and loading data.
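The second rule can be illustrated outside the DataPlugin VBS API. The following is a minimal Python sketch, not DataPlugin code: the file layout (a 4-byte header length, a "key=value" text header, then 8-byte samples) and the names read_file, build_example, and for_indexing are all assumptions made for this example. The for_indexing flag plays the role of the DataFinder parameter: when it is set, the reader returns after the header and never touches the bulk data.

```python
import io
import struct


def build_example(properties, samples):
    """Create a file in the hypothetical layout used by this sketch."""
    header = "\n".join(f"{k}={v}" for k, v in properties.items()).encode("utf-8")
    body = struct.pack(f"<{len(samples)}d", *samples)
    return struct.pack("<I", len(header)) + header + body


def read_file(stream, for_indexing):
    """Return (properties, samples); samples is None when only indexing."""
    (header_len,) = struct.unpack("<I", stream.read(4))
    header = stream.read(header_len).decode("utf-8")
    # The header carries the metadata that an indexer is interested in.
    properties = dict(
        line.split("=", 1) for line in header.splitlines() if "=" in line
    )
    samples = None
    if not for_indexing:
        # Only a full load pays the cost of reading the bulk data section.
        raw = stream.read()
        samples = list(struct.unpack(f"<{len(raw) // 8}d", raw))
    return properties, samples
```

Called with for_indexing=True, the reader consumes only the 4-byte length field and the header, regardless of how many samples follow.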
Cause 1.3: Number of items per file
Since DataFinder only indexes the metadata of a file, the amount of metadata has a significant influence on the time it takes to index a file. The overall performance depends on the number of elements (channel groups, channels) and their associated custom properties contained in a particular file. This means that a file with many channels and a lot of bulk data, but only a few properties on the group level, will index faster than a file with only a few channels where each channel contains hundreds of custom properties.
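The comparison above can be made concrete by counting what an indexer has to store. The Python sketch below is only a rough model: it assumes that indexing effort grows with the number of elements plus the number of custom properties, which simplifies whatever DataFinder does internally. The function name metadata_counts and the example figures are hypothetical.

```python
def metadata_counts(file_props, groups):
    """Count elements and custom properties in one file.

    groups is a list of (group_property_count, channel_property_counts),
    where channel_property_counts holds one property count per channel.
    """
    # One element for the file, one per group, one per channel.
    elements = 1 + len(groups) + sum(len(channels) for _, channels in groups)
    properties = file_props + sum(
        group_props + sum(channels) for group_props, channels in groups
    )
    return elements, properties


# File A: 1000 channels without custom properties, 5 properties on the group.
file_a = metadata_counts(0, [(5, [0] * 1000)])
# File B: 10 channels with 300 custom properties each.
file_b = metadata_counts(0, [(0, [300] * 10)])
```

In this model, file A yields 1002 elements and 5 properties, while file B yields 12 elements and 3000 properties. File B carries roughly three times as many items in total despite having a hundredth of the channels.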
Cause 1.4: Number of “optimized” custom properties
Index performance is influenced not only by the number of elements and custom properties per file, but also by whether specific custom properties are “optimized”. Optimizing custom properties improves the query performance of conditions that contain these properties.
The flipside of this query performance improvement is slower indexing. Therefore, optimizing custom properties should only be considered when the query performance is low (or when it is necessary to enumerate possible values of these specific custom properties).
A good approach would be to optimize custom properties after the first complete indexing of the DataFinder search areas.
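How DataFinder implements optimized properties is not described here. The trade-off itself is the familiar one of a secondary index, and the toy Python model below shows it under that assumption. The class ToyIndex and its fields are hypothetical and do not mirror DataFinder internals. Each optimized property gets a lookup table that speeds up equality queries and makes its values enumerable, at the price of one extra write per record during indexing.

```python
from collections import defaultdict


class ToyIndex:
    def __init__(self, optimized=()):
        self.records = []  # every indexed record, as a dict of properties
        # One value -> positions table per optimized property.
        self.optimized = {name: defaultdict(list) for name in optimized}
        self.index_writes = 0  # extra work caused by optimization

    def add(self, record):
        position = len(self.records)
        self.records.append(record)
        for name, table in self.optimized.items():
            if name in record:
                table[record[name]].append(position)
                self.index_writes += 1

    def find(self, name, value):
        """Return (matching records, number of records examined)."""
        if name in self.optimized:
            positions = self.optimized[name].get(value, [])
            return [self.records[p] for p in positions], len(positions)
        # Without a lookup table, every record has to be inspected.
        hits = [r for r in self.records if r.get(name) == value]
        return hits, len(self.records)

    def values(self, name):
        """Enumerate the distinct values of an optimized property."""
        return sorted(self.optimized[name])
```

With 100 records and one optimized property, a query on that property examines only the matching records instead of all 100, while indexing performs 100 additional table writes. This matches the advice above to optimize only the properties that queries actually need.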
Cause 1.5: Indexing strategy
The appropriate indexing strategy should be chosen to match the number of files being indexed at a given time and according to whether those indexing changes will be queried immediately (see Indexer tab in DataFinder configuration dialog):
- Index on data file changes
Choose this option to let DataFinder automatically index files as soon as they appear in a search area.
This method is based on a Windows notification service which may not reliably trigger the indexer if, for example, many changes have occurred in the observed search areas or if other listeners have subscribed to this service.
- Index files/folders defined in a job file
Choose this option in case the files contained in a search area are generated by an automated process that can also create the required job file.
This method reliably triggers the indexing of files or whole folders, even when DataFinder is not running while the files and their job files are created. It also makes it possible to trigger the indexing process at the very moment all files have been successfully copied into the search area.
- No automatic indexing
Turn off automatic indexing if the indexing process is triggered by a DIAdem script or a LabVIEW DataFinder Toolkit-based application running on that server.
- Scheduled indexing
Run scheduled indexing in the background to guarantee that the index is updated from time to time regardless of the chosen indexing algorithm.
It is also possible to schedule the indexing process individually per search area.
Cause 1.6: DataFinder index performance
See Chapter 3: DataFinder index performance
2. DataFinder Query Performance
As is the case with the indexing speed, the query performance is also influenced by several factors, such as:
- Complexity of query
- Optimized custom properties
- DataFinder index performance
Cause 2.1: Complexity of query
The complexity of a query is mainly determined by:
- Number of objects to return
A query that returns a large number of results executes more slowly because time is required to send the query results and build the corresponding result objects (files, channel groups, or channels) on the client side.
- Included hierarchy levels
Using conditions with different hierarchy levels within a single query may decrease the query speed. This means that a query for channels with a specific channel name and channel unit might perform better than a query for channels belonging to a group with a specific name and the description of a specific file with a given extension.
- Wildcards at the beginning of a string
Using wildcards at the beginning of a condition comparison value, for instance channel.name = *temp, will defeat all index optimization methods and decrease the query speed.
- Order By
Applying "Order by" to a condition forces the whole result set to be determined before it is reduced to the requested number of results. The sorting algorithm must then also be executed. Both cause a noticeable decrease in query speed.
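The effect of a leading wildcard can be seen in any sorted index. The Python sketch below is a general illustration and makes no claim about how DataFinder stores its index. The function names and sample channel names are hypothetical. A condition such as temp* can jump to the first candidate by binary search, because all names sharing a prefix sit next to each other in sort order. A condition such as *temp has no such anchor and must inspect every name.

```python
import bisect


def prefix_matches(sorted_names, prefix):
    """Names starting with prefix: binary search, then a short walk."""
    start = bisect.bisect_left(sorted_names, prefix)
    end = start
    # Matches are contiguous in a sorted list, so stop at the first miss.
    while end < len(sorted_names) and sorted_names[end].startswith(prefix):
        end += 1
    return sorted_names[start:end]


def suffix_matches(sorted_names, suffix):
    """Names ending with suffix: the sort order does not help, scan all."""
    return [name for name in sorted_names if name.endswith(suffix)]


names = sorted(
    ["pressure", "temp", "temp_in", "temp_out", "oil_temp", "voltage"]
)
with_prefix = prefix_matches(names, "temp")  # temp*
with_suffix = suffix_matches(names, "temp")  # *temp
```

The prefix search costs a logarithmic lookup plus the number of matches, while the suffix search grows linearly with the total number of names.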
Cause 2.2: Optimized custom properties
If custom properties are used in a query condition, it is a good idea to optimize these specific custom properties to increase query performance (and allow enumerating possible values of these specific custom properties).
On the flip side, optimized custom properties decrease indexing performance and increase the overall index size, so custom properties should be optimized sparingly.
Cause 2.3: DataFinder index performance
See Chapter 3: DataFinder index performance
3. DataFinder Index Performance
The overall DataFinder index performance is determined by the physical and process boundary conditions of the DataFinder environment, such as:
- File server volatility
- Index optimization
- Hard disk
- Multicore
- RAM
Cause 3.1: File server volatility
Moving and deleting files is the worst kind of operation to be performed on a file server because it will cause massive reorganization of the DataFinder index and will result in poor indexing and query speed for the overall system. This reorganization will start immediately if indexing is defined to take place on file changes, and it will probably go unnoticed by the person or process moving or deleting files.
Cause 3.2: Index optimization
Depending on how much indexing takes place, query performance degrades over time as the index gradually fragments.
Optimizing the index from time to time will bring back the original query performance. This is why DataFinder has a default function to schedule this optimization process.
Cause 3.3: Hard disk
The access speed to the DataFinder index is directly linked to the hard disk performance on which the index is stored. Consider a dedicated hard disk for the DataFinder index with a very good access speed, such as an SSD.
Cause 3.4: Multicore
DataFinder is a massively parallel program that uses several parallel threads and processes. Therefore, a multicore system is recommended to increase the overall performance.
Cause 3.5: RAM
Because DataFinder uses several parallel processes and also tries to cache as much of its index in memory as possible, a generous amount of memory (RAM) should be provided as well.
4. Performance Summary
As discussed within this document, the overall DataFinder performance is determined by several parameters. A reasonable server computer and a well-designed file storing process, especially for huge file servers, will help to ensure that the self-contained and optimized DataFinder server will fulfill the desired performance requirements.