Code192 Data Integration Tool System Administrator’s Guide
System Requirements
The Code192 Data Integration Tool can run on something as simple as a laptop, but it can also be clustered across many enterprise-class servers. The amount of hardware and memory needed therefore depends on the size and nature of the dataflow involved. Data is stored on disk while Data Integration Tool is processing it, so Data Integration Tool needs sufficient disk space allocated for its various repositories, particularly the content repository, flowfile repository, and provenance repository (see the System Properties section for more information about these repositories). Data Integration Tool has the following minimum system requirements:
- Supported Operating Systems:
  - Windows
- Supported Web Browsers:
  - Internet Explorer 9+ (see note below)
  - Mozilla Firefox 24+
  - Google Chrome 36+
  - Safari 8
There is a known issue in Internet Explorer (IE) 10 and 11 that can cause problems when moving items on the Data Integration Tool graph. If you encounter this problem, we suggest using a browser other than IE. This known issue is described here: https://connect.microsoft.com/IE/Feedback/Details/1050422.
Java 7 default perm gen sizing can result in out-of-memory errors due to the number of classes loaded by Data Integration Tool. See the Bootstrap Properties section for more information.
Under sustained and extremely high throughput, the CodeCache settings may need to be tuned to avoid a sudden performance loss. See the Bootstrap Properties section for more information.
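Both of these JVM settings live in the conf/bootstrap.conf file. As a rough sketch only (the argument numbers and sizes below are illustrative, not shipped defaults; see the Bootstrap Properties section for the authoritative entries), the relevant lines might look like:

java.arg.11=-XX:PermSize=128M
java.arg.12=-XX:MaxPermSize=256M
java.arg.13=-XX:ReservedCodeCacheSize=256M
java.arg.14=-XX:+UseCodeCacheFlushing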
Configuration Files
When Data Integration Tool first starts up, the following files and directories are created:
- content_repository
- database_repository
- flowfile_repository
- provenance_repository
- work directory
- logs directory
- Within the conf directory, the flow.xml.gz file and the templates directory are created
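As a rough sketch of the resulting layout (assuming a default installation rooted at $NIFI_HOME; the exact locations of the repositories are configurable via the properties described in the System Properties section):

$NIFI_HOME/
  conf/
    flow.xml.gz
    templates/
  content_repository/
  database_repository/
  flowfile_repository/
  provenance_repository/
  work/
  logs/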
See the System Properties section of this guide for more information about configuring Data Integration Tool repositories and configuration files.
Security Configuration
Data Integration Tool provides several different configuration options for security purposes. The most important properties are those under the "security properties" heading in the nifi.properties file. In order to run securely, the following properties must be set:
Property Name | Description |
---|---|
nifi.security.anonymous.authorities | Specifies the roles that should be granted to users that connect over HTTPS anonymously. All users can make use of anonymous access; however, if they have been granted a particular level of access by an administrator, it will take precedence if they access Data Integration Tool using a client certificate or once they have logged in. |
nifi.security.keystore | Filename of the Keystore that contains the server's private key. |
nifi.security.keystoreType | The type of Keystore. Must be either JKS or PKCS12. |
nifi.security.keystorePasswd | The password for the Keystore. |
nifi.security.keyPasswd | The password for the certificate in the Keystore. If not set, the value of nifi.security.keystorePasswd will be used. |
nifi.security.truststore | Filename of the Truststore that will be used to authorize those connecting to Data Integration Tool. If not set, all who attempt to connect will be provided access as the Anonymous user. |
nifi.security.truststoreType | The type of the Truststore. Must be either JKS or PKCS12. |
nifi.security.truststorePasswd | The password for the Truststore. |
nifi.security.needClientAuth | Specifies whether or not connecting clients must authenticate themselves. Specifically, this property is used by the Data Integration Tool cluster protocol. If the Truststore properties are not set, this must be false. |
Once the above properties have been configured, we can enable the User Interface to be accessed over HTTPS instead of HTTP. This is accomplished by setting the nifi.web.https.host and nifi.web.https.port properties. The nifi.web.https.host property indicates which hostname the server should run on. This allows admins to configure the application to run only on specific network interfaces. If it is desired that the HTTPS interface be accessible from all network interfaces, a value of 0.0.0.0 should be used.

It is important when enabling HTTPS that the nifi.web.http.port property be unset.
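For illustration, a minimal set of HTTPS-related entries in nifi.properties might look like the following (the keystore/truststore filenames, passwords, and port 8443 are placeholder values to be replaced with your own):

nifi.web.http.port=
nifi.web.https.host=0.0.0.0
nifi.web.https.port=8443
nifi.security.keystore=./conf/keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=changeit
nifi.security.keyPasswd=changeit
nifi.security.truststore=./conf/truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=changeit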
Similar to nifi.security.needClientAuth, the web server can be configured to require certificate-based client authentication for users accessing the User Interface. In order to do this, it must be configured to not support username/password authentication (see below) and to not grant access to anonymous users (see nifi.security.anonymous.authorities above). Either of these options will configure the web server to WANT certificate-based client authentication. This allows it to support users with certificates as well as users without certificates who may be logging in with their credentials or accessing anonymously. If username/password authentication and anonymous access are not configured, the web server will REQUIRE certificate-based client authentication.
Now that the User Interface has been secured, we can easily secure Site-to-Site connections and inner-cluster communications as well. This is accomplished by setting the nifi.remote.input.secure and nifi.cluster.protocol.is.secure properties, respectively, to true.
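In nifi.properties, those two entries are simply:

nifi.remote.input.secure=true
nifi.cluster.protocol.is.secure=true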
User Authentication
Data Integration Tool supports user authentication via client certificates or via username/password. Username/password authentication is performed by a Login Identity Provider. The Login Identity Provider is a pluggable mechanism for authenticating users via their username/password. Which Login Identity Provider to use is configured in two properties in the nifi.properties file.
The nifi.login.identity.provider.configuration.file property specifies the configuration file for Login Identity Providers. The nifi.security.user.login.identity.provider property indicates which of the configured Login Identity Providers should be used. If this property is not configured, Data Integration Tool will not support username/password authentication and will require client certificates for authenticating users over HTTPS. By default, this property is not configured, meaning that username/password authentication must be explicitly enabled.

Data Integration Tool does not perform user authentication over HTTP. When HTTP is used, all users are granted all roles.
Lightweight Directory Access Protocol (LDAP)
Below is an example and description of configuring a Login Identity Provider that integrates with a Directory Server to authenticate users.
<provider>
    <identifier>ldap-provider</identifier>
    <class>org.apache.nifi.ldap.LdapProvider</class>
    <property name="Authentication Strategy">START_TLS</property>
    <property name="Manager DN"></property>
    <property name="Manager Password"></property>
    <property name="TLS - Keystore"></property>
    <property name="TLS - Keystore Password"></property>
    <property name="TLS - Keystore Type"></property>
    <property name="TLS - Truststore"></property>
    <property name="TLS - Truststore Password"></property>
    <property name="TLS - Truststore Type"></property>
    <property name="TLS - Client Auth"></property>
    <property name="TLS - Protocol"></property>
    <property name="TLS - Shutdown Gracefully"></property>
    <property name="Referral Strategy">FOLLOW</property>
    <property name="Connect Timeout">10 secs</property>
    <property name="Read Timeout">10 secs</property>
    <property name="Url"></property>
    <property name="User Search Base"></property>
    <property name="User Search Filter"></property>
    <property name="Authentication Expiration">12 hours</property>
</provider>
With this configuration, username/password authentication can be enabled by referencing this provider in nifi.properties.
nifi.security.user.login.identity.provider=ldap-provider
Property Name | Description |
---|---|
Authentication Expiration | The duration for which the user authentication is valid. If the user never logs out, they will be required to log back in following this duration. |
Authentication Strategy | How the connection to the LDAP server is authenticated. Possible values are ANONYMOUS, SIMPLE, or START_TLS. |
Manager DN | The DN of the manager that is used to bind to the LDAP server to search for users. |
Manager Password | The password of the manager that is used to bind to the LDAP server to search for users. |
TLS - Keystore | Path to the Keystore that is used when connecting to LDAP using START_TLS. |
TLS - Keystore Password | Password for the Keystore that is used when connecting to LDAP using START_TLS. |
TLS - Keystore Type | Type of the Keystore that is used when connecting to LDAP using START_TLS (i.e. JKS or PKCS12). |
TLS - Truststore | Path to the Truststore that is used when connecting to LDAP using START_TLS. |
TLS - Truststore Password | Password for the Truststore that is used when connecting to LDAP using START_TLS. |
TLS - Truststore Type | Type of the Truststore that is used when connecting to LDAP using START_TLS (i.e. JKS or PKCS12). |
TLS - Client Auth | Client authentication policy when connecting to LDAP using START_TLS. Possible values are REQUIRED, WANT, NONE. |
TLS - Protocol | Protocol to use when connecting to LDAP using START_TLS (i.e. TLS, TLSv1.1, TLSv1.2, etc.). |
TLS - Shutdown Gracefully | Specifies whether the TLS should be shut down gracefully before the target context is closed. Defaults to false. |
Referral Strategy | Strategy for handling referrals. Possible values are FOLLOW, IGNORE, THROW. |
Connect Timeout | Duration of connect timeout (i.e. 10 secs). |
Read Timeout | Duration of read timeout (i.e. 10 secs). |
Url | Url of the LDAP server (i.e. ldap://<hostname>:<port>). |
User Search Base | Base DN for searching for users (i.e. CN=Users,DC=example,DC=com). |
User Search Filter | Filter for searching for users against the User Search Base (i.e. sAMAccountName={0}). The user-specified name is inserted into {0}. |
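As an illustration only, a provider configured against a hypothetical directory server might look like the following (the Url, Manager DN, password, and search base are placeholder values; SIMPLE authentication is assumed, so the TLS properties are left empty):

<provider>
    <identifier>ldap-provider</identifier>
    <class>org.apache.nifi.ldap.LdapProvider</class>
    <property name="Authentication Strategy">SIMPLE</property>
    <property name="Manager DN">cn=manager,dc=example,dc=com</property>
    <property name="Manager Password">password</property>
    <property name="TLS - Keystore"></property>
    <property name="TLS - Keystore Password"></property>
    <property name="TLS - Keystore Type"></property>
    <property name="TLS - Truststore"></property>
    <property name="TLS - Truststore Password"></property>
    <property name="TLS - Truststore Type"></property>
    <property name="TLS - Client Auth"></property>
    <property name="TLS - Protocol"></property>
    <property name="TLS - Shutdown Gracefully"></property>
    <property name="Referral Strategy">FOLLOW</property>
    <property name="Connect Timeout">10 secs</property>
    <property name="Read Timeout">10 secs</property>
    <property name="Url">ldap://ldap.example.com:389</property>
    <property name="User Search Base">ou=people,dc=example,dc=com</property>
    <property name="User Search Filter">sAMAccountName={0}</property>
    <property name="Authentication Expiration">12 hours</property>
</provider>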
Kerberos
Below is an example and description of configuring a Login Identity Provider that integrates with a Kerberos Key Distribution Center (KDC) to authenticate users.
<provider>
    <identifier>kerberos-provider</identifier>
    <class>org.apache.nifi.kerberos.KerberosProvider</class>
    <property name="Default Realm">NIFI.APACHE.ORG</property>
    <property name="Kerberos Config File">/etc/krb5.conf</property>
    <property name="Authentication Expiration">12 hours</property>
</provider>
With this configuration, username/password authentication can be enabled by referencing this provider in nifi.properties.
nifi.security.user.login.identity.provider=kerberos-provider
Property Name | Description |
---|---|
Authentication Expiration | The duration for which the user authentication is valid. If the user never logs out, they will be required to log back in following this duration. |
Default Realm | Default realm to provide when the user enters an incomplete user principal (i.e. NIFI.APACHE.ORG). |
Kerberos Config File | Absolute path to the Kerberos client configuration file. |
See also Kerberos Service to allow single sign-on access via client Kerberos tickets.
Controlling Levels of Access
Once Data Integration Tool is configured to run securely and an authentication mechanism is configured, it is necessary to configure who will have access to the system and what types of access those people will have. Data Integration Tool controls this through the use of an Authority Provider. The Authority Provider is a pluggable mechanism for providing authorizations to different users. Which Authority Provider to use is configured using two properties in the nifi.properties file.
The nifi.authority.provider.configuration.file property specifies the configuration file for Authority Providers. The nifi.security.user.authority.provider property indicates which of the configured Authority Providers should be used.

By default, the file-provider Authority Provider is selected and is configured to use the permissions granted in the authorized-users.xml file. This is typically sufficient for instances of Data Integration Tool that are run in "standalone" mode.

If the Data Integration Tool instance is configured to run in a cluster, the node will typically use the cluster-node-provider Provider and the Cluster Manager will typically use the cluster-ncm-provider Provider. Both of these Providers have a default configuration in the authority-providers.xml file but are commented out.

When using the cluster-node-provider Provider, all of the authorization is provided by the Cluster Manager. In this way, the configuration only has to be maintained in one place and will be consistent across the entire cluster.
When configuring the Cluster Manager or a standalone node, it is necessary to manually designate an ADMIN user in the authorized-users.xml file, which is located in the root installation’s conf directory. After this ADMIN user has been added, s/he may grant access to other users, systems, and other instances of Data Integration Tool, through the User Interface (UI) without having to manually edit the authorized-users.xml file. If you are the administrator, you would add yourself as the ADMIN user in this file.
Open the authorized-users.xml file in a text editor. You will notice that it includes a template to guide you, with example entries that are commented out.
It is only necessary to manually add one user, the ADMIN user, to this file. So, at a minimum, the following example entry should be included and contain the user Distinguished Name (DN) in place of "user dn - read only and admin":
<users>
    <user dn="[user dn - read only and admin]">
        <role name="ROLE_ADMIN"/>
    </user>
</users>
Here is an LDAP example entry using the name John Smith:
<users>
    <user dn="cn=John Smith,ou=people,dc=example,dc=com">
        <role name="ROLE_ADMIN"/>
    </user>
</users>
Here is a Kerberos example entry using the name John Smith and realm NIFI.APACHE.ORG:
<users>
    <user dn="johnsmith@NIFI.APACHE.ORG">
        <role name="ROLE_ADMIN"/>
    </user>
</users>
After the authorized-users.xml file has been edited and saved, restart Data Integration Tool. Once the application starts, the ADMIN user is able to access the UI at the HTTPS URL that is configured in the nifi.properties file.
From the UI, click the Users icon in the Management Toolbar (upper-right corner of the UI), and the User Management Page opens.
The ADMIN user should be listed. Click on the pencil icon to see this user’s role(s). You may edit the roles by selecting the appropriate checkboxes.
The following roles are available in Data Integration Tool:
Role Name | Description |
---|---|
Proxy | The Proxy Role is assigned to a system in order to grant that system permission to make requests on behalf of a user. For instance, if an HTTP proxy service is used to gain access to the system, the certificate being used by that service can be given the Proxy Role. |
Administrator | Administrator is able to configure thread pool sizes and user accounts as well as purge the dataflow change history. |
Data Flow Manager | Data Flow Manager is given the ability to manipulate the dataflow. S/he is able to add, remove, and manipulate components on the graph; add, remove, and manipulate Controller Services and Reporting Tasks; create and manage templates; view statistics; and view the bulletin board. |
Read Only | Users with Read Only access are able to view the dataflow but are unable to change anything. |
Provenance | Users with Provenance access are able to query the Data Provenance repository and view the lineage of data. Additionally, this role provides the ability to view or download the content of a FlowFile from a Provenance event (assuming that the content is still available in the Content Repository and that the Authority Provider also grants access). This access is not provided to users with Read Only (unless the user has both Read Only and Provenance roles) because the information provided to users with this role can potentially be very sensitive in nature, as all FlowFile attributes and data are exposed. In order to Replay a Provenance event, a user is required to have both the Provenance role and the Data Flow Manager role. |
Data Integration Tool | The Data Integration Tool Role is intended to be assigned to machines that will interact with an instance of Data Integration Tool via Site-to-Site. This role provides the ability to send data to or retrieve data from Root Group Ports (but only those that they are given permission to interact with - see the User Guide for more information on providing access to specific Ports) as well as obtain information about which Ports exist. Note that this role allows the client to know only about the Ports that it has permission to interact with. |
When users want access to the Data Integration Tool UI, they navigate to the configured URL and are prompted to request access. When someone has requested access, the ADMIN user sees a star on the Users icon in the Management Toolbar, alerting the ADMIN to the fact that a request is pending. Upon opening the User Management Page, the pending request is visible, and the ADMIN can grant access and click on the pencil icon to set the user’s roles appropriately.
The ADMIN may also select multiple users and add them to a "Group". Hold down the Shift key and select multiple users, then click the Group button in the upper-right corner of the User Management Page. Then, provide a name for the group.
The group feature is especially useful when a remote cluster is connecting to this Data Integration Tool using a Remote Process Group. In that scenario, all the nodes in the remote cluster can be included in the same group. When the ADMIN wants to grant port access to the remote cluster, s/he can grant it to the group and avoid having to grant it individually to each node in the cluster.
Encryption Configuration
This section provides an overview of the capabilities of Data Integration Tool to encrypt and decrypt data.

The EncryptContent processor allows for the encryption and decryption of data, both internal to Data Integration Tool and integrated with external systems, such as openssl and other data sources and consumers.
Key Derivation Functions
Key Derivation Functions (KDF) are mechanisms by which human-readable information, usually a password or other secret information, is translated into a cryptographic key suitable for data protection. For further information, read the Wikipedia entry on Key Derivation Functions.
Currently, KDFs are ingested by CipherProvider implementations and return a fully-initialized Cipher object to be used for encryption or decryption. Due to the use of a CipherProviderFactory, the KDFs are not customizable at this time. Future enhancements will include the ability to provide custom cost parameters to the KDF at initialization time. As a work-around, CipherProvider instances can be initialized with custom cost parameters in the constructor, but this is not currently supported by the CipherProviderFactory.

Here are the KDFs currently supported by Data Integration Tool (primarily in the EncryptContent processor for password-based encryption (PBE)) and relevant notes:
- Data Integration Tool Legacy KDF
  - The original KDF used by Data Integration Tool for internal key derivation for PBE. It is 1000 iterations of the MD5 digest over the concatenation of the password and 8 or 16 bytes of random salt (the salt length depends on the selected cipher block size).
  - This KDF is deprecated as of Data Integration Tool 0.5.0 and should only be used for backwards compatibility to decrypt data that was previously encrypted by a legacy version of Data Integration Tool.
- OpenSSL PKCS#5 v1.5 EVP_BytesToKey
  - This KDF was added in v0.4.0.
  - This KDF is provided for compatibility with data encrypted using OpenSSL's default PBE, known as EVP_BytesToKey. This is a single iteration of MD5 over the concatenation of the password and 8 bytes of random ASCII salt. OpenSSL recommends using PBKDF2 for key derivation but does not expose the library method necessary to the command-line tool, so this KDF is still the de facto default for command-line encryption.
- Bcrypt
  - This KDF was added in v0.5.0.
  - Bcrypt is an adaptive function based on the Blowfish cipher. This KDF is strongly recommended as it automatically incorporates a random 16 byte salt and a configurable cost parameter (or "work factor"), and it is hardened against brute-force attacks using GPGPU (which share memory between cores) by requiring access to "large" blocks of memory during the key derivation. It is less resistant to FPGA brute-force attacks where the gate arrays have access to individual embedded RAM blocks.
  - Because the length of a Bcrypt-derived key is always 184 bits, the complete output is then fed to a SHA-512 digest and truncated to the desired key length. This provides the benefit of the avalanche effect on the formatted input.
  - The recommended minimum work factor is 12 (2^12 key derivation rounds) (as of 2/1/2016 on commodity hardware) and should be increased to the threshold at which legitimate systems will encounter detrimental delays (see schedule below or use BcryptCipherProviderGroovyTest#testDefaultConstructorShouldProvideStrongWorkFactor() to calculate safe minimums).
  - The salt format is $2a$10$ABCDEFGHIJKLMNOPQRSTUV. The salt is delimited by $ and the three sections are as follows:
    - 2a - the version of the format. An extensive explanation can be found here. Data Integration Tool currently uses 2a for all salts generated internally.
    - 10 - the work factor. This is actually the log2 value, so the total iteration count would be 2^10 in this case.
    - ABCDEFGHIJKLMNOPQRSTUV - the 22 character, Base64-encoded, unpadded, raw salt value. This decodes to a 16 byte salt used in the key derivation.
- Scrypt
  - This KDF was added in v0.5.0.
  - Scrypt is an adaptive function designed in response to bcrypt. This KDF is recommended as it requires relatively large amounts of memory for each derivation, making it resistant to hardware brute-force attacks.
  - The recommended minimum cost is N=2^14, r=8, p=1 (as of 2/1/2016 on commodity hardware) and should be increased to the threshold at which legitimate systems will encounter detrimental delays (see schedule below or use ScryptCipherProviderGroovyTest#testDefaultConstructorShouldProvideStrongParameters() to calculate safe minimums).
  - The salt format is $s0$e0101$ABCDEFGHIJKLMNOPQRSTUV. The salt is delimited by $ and the three sections are as follows:
    - s0 - the version of the format. Data Integration Tool currently uses s0 for all salts generated internally.
    - e0101 - the cost parameters. This is actually a hexadecimal encoding of N, r, p using shifts. This can be formed/parsed using Scrypt#encodeParams() and Scrypt#parseParameters().
      - Some external libraries encode N, r, and p separately in the form $400$1$1$. A utility method is available at ScryptCipherProvider#translateSalt() which will convert the external form to the internal form.
    - ABCDEFGHIJKLMNOPQRSTUV - the 12-44 character, Base64-encoded, unpadded, raw salt value. This decodes to an 8-32 byte salt used in the key derivation.
- PBKDF2
  - This KDF was added in v0.5.0.
  - Password-Based Key Derivation Function 2 is an adaptive derivation function which uses an internal pseudorandom function (PRF) and iterates it many times over a password and salt (at least 16 bytes).
  - The PRF is recommended to be HMAC/SHA-256 or HMAC/SHA-512. The use of an HMAC cryptographic hash function mitigates a length extension attack.
  - The recommended minimum number of iterations is 160,000 (as of 2/1/2016 on commodity hardware). This number should be doubled every two years (see schedule below or use PBKDF2CipherProviderGroovyTest#testDefaultConstructorShouldProvideStrongIterationCount() to calculate safe minimums).
  - This KDF is not memory-hard (it can be parallelized massively with commodity hardware) but is still recommended as sufficient by NIST SP 800-132 (PDF) and many cryptographers (when used with a proper iteration count and HMAC cryptographic hash function).
- None
  - This KDF was added in v0.5.0.
  - This KDF performs no operation on the input and is a marker to indicate the raw key is provided to the cipher. The key must be provided in hexadecimal encoding and be of a valid length for the associated cipher/algorithm.
Salt and IV Encoding
Initially, the EncryptContent processor had a single method of deriving the encryption key from a user-provided password. This is now referred to as DataIntegrationToolLegacy mode, effectively MD5 digest, 1000 iterations. In v0.4.0, another method of deriving the key, OpenSSL PKCS#5 v1.5 EVP_BytesToKey, was added for compatibility with content encrypted outside of Data Integration Tool using the openssl command-line tool. Both of these Key Derivation Functions (KDF) had hard-coded digest functions and iteration counts, and the salt format was also hard-coded. With v0.5.0, additional KDFs are introduced with variable iteration counts, work factors, and salt formats. In addition, raw keyed encryption was also introduced. This required the capacity to encode arbitrary salts and Initialization Vectors (IV) into the cipher stream in order to be recovered by Data Integration Tool or a follow-on system to decrypt these messages.
For the existing KDFs, the salt format has not changed.
Data Integration Tool Legacy
The first 8 or 16 bytes of the input are the salt. The salt length is determined based on the selected algorithm's cipher block length. If the cipher block size cannot be determined (such as with a stream cipher like RC4), the default value of 8 bytes is used. On decryption, the salt is read in and combined with the password to derive the encryption key and IV.
OpenSSL PKCS#5 v1.5 EVP_BytesToKey
OpenSSL allows for salted or unsalted key derivation. *Unsalted key derivation is a security risk and is not recommended.* If a salt is present, the first 8 bytes of the input are the ASCII string "Salted__" (0x53 61 6C 74 65 64 5F 5F) and the next 8 bytes are the ASCII-encoded salt. On decryption, the salt is read in and combined with the password to derive the encryption key and IV. If there is no salt header, the entire input is considered to be the cipher text.
For new KDFs, each of which allows for non-deterministic IVs, the IV must be stored alongside the cipher text. This is not a vulnerability, as the IV is not required to be secret, but simply to be unique for messages encrypted using the same key in order to reduce the success of cryptographic attacks. For these KDFs, the output consists of the salt, followed by the salt delimiter, the UTF-8 string "DataIntegrationToolSALT" (0x4E 69 46 69 53 41 4C 54), then the IV, followed by the IV delimiter, the UTF-8 string "DataIntegrationToolIV" (0x4E 69 46 69 49 56), followed by the cipher text.
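Laid out in order, the encrypted output for these KDFs is therefore:

[salt] DataIntegrationToolSALT [IV] DataIntegrationToolIV [cipher text]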
Java Cryptography Extension (JCE) Limited Strength Jurisdiction Policies
Because of US export regulations, default JVMs have limits imposed on the strength of cryptographic operations available to them. For example, AES operations are limited to 128 bit keys by default. While AES-128 is cryptographically safe, this can have unintended consequences, specifically on Password-based Encryption (PBE).
PBE is the process of deriving a cryptographic key for encryption or decryption from user-provided secret material, usually a password. Rather than a human remembering a (random-appearing) 32 or 64 character hexadecimal string, a password or passphrase is used.
A number of PBE algorithms provided by Data Integration Tool impose strict limits on the length of the password due to the underlying key length checks. Below is a table listing the maximum password length on a JVM with limited cryptographic strength.
Algorithm | Max Password Length |
---|---|
 | 16 |
 | 16 |
 | 16 |
 | 16 |
 | 16 |
 | 16 |
 | 16 |
 | 7 |
 | 7 |
 | 7 |
 | 7 |
 | 7 |
 | 7 |
 | 7 |
 | 7 |
 | 7 |
 | 7 |
 | 7 |
 | 7 |
 | 7 |
Allow Insecure Cryptographic Modes
By default, the Allow Insecure Cryptographic Modes property in the EncryptContent processor settings is set to not-allowed. This means that if a password of fewer than 10 characters is provided, a validation error will occur. 10 characters is a conservative estimate and does not take into consideration full entropy calculations, patterns, etc.
On a JVM with limited strength cryptography, some PBE algorithms limit the maximum password length to 7, and in this case it will not be possible to provide a "safe" password. It is recommended to install the JCE Unlimited Strength Jurisdiction Policy files for the JVM to mitigate this issue.
If on a system where the unlimited strength policies cannot be installed, it is recommended to switch to an algorithm that supports longer passwords (see table above).
Allowing Weak Crypto
If it is not possible to install the unlimited strength jurisdiction policies, the Allow Insecure Cryptographic Modes property can be changed to allowed, but this is not recommended.
It is preferable to request upstream/downstream systems to switch to keyed encryption or use a "strong" Key Derivation Function (KDF) supported by Data Integration Tool.
Clustering Configuration
This section provides a quick overview of Data Integration Tool Clustering and instructions on how to set up a basic cluster. In the future, we hope to provide supplemental documentation that covers the Data Integration Tool Cluster Architecture in depth.
The design of Data Integration Tool clustering is a simple master/slave model where there is a master and one or more slaves. While the model is that of master and slave, if the master dies, the slaves are all instructed to continue operating as they were to ensure the dataflow remains live. The absence of the master simply means new slaves cannot join the cluster and cluster flow changes cannot occur until the master is restored. In Data Integration Tool clustering, we call the master the Data Integration Tool Cluster Manager (NCM), and the slaves are called Nodes. See a full description of each in the Terminology section below.
Why Cluster?
Data Integration Tool Administrators or Dataflow Managers (DFMs) may find that using one instance of Data Integration Tool on a single server is not enough to process the amount of data they have. So, one solution is to run the same dataflow on multiple Data Integration Tool servers. However, this creates a management problem, because each time DFMs want to change or update the dataflow, they must make those changes on each server and then monitor each server individually. By clustering the Data Integration Tool servers, it’s possible to have that increased processing capability along with a single interface through which to make dataflow changes and monitor the dataflow. Clustering allows the DFM to make each change only once, and that change is then replicated to all the nodes of the cluster. Through the single interface, the DFM may also monitor the health and status of all the nodes.
Data Integration Tool Clustering is unique and has its own terminology. It’s important to understand the following terms before setting up a cluster.
Terminology
Data Integration Tool Cluster Manager: A Data Integration Tool Cluster Manager (NCM) is an instance of Data Integration Tool that provides the sole management point for the cluster. It communicates dataflow changes to the nodes and receives health and status information from the nodes. It also ensures that a uniform dataflow is maintained across the cluster. When DFMs manage a dataflow in a cluster, they do so through the User Interface of the NCM (i.e., via the URL of the NCM’s User Interface). Fundamentally, the NCM keeps the state of the cluster consistent.
Nodes: Each cluster is made up of the NCM and one or more nodes. The nodes do the actual data processing. (The NCM does not process any data; all data runs through the nodes.) While nodes are connected to a cluster, the DFM may not access the User Interface for any of the individual nodes. The User Interface of a node may only be accessed if the node is manually removed from the cluster.
Primary Node: Every cluster has one Primary Node. On this node, it is possible to run "Isolated Processors" (see below). By default, the NCM will elect the first node that connects to the cluster as the Primary Node; however, the DFM may select a new node as the Primary Node in the Cluster Management page of the User Interface if desired. If the cluster restarts, the NCM will "remember" which node was the Primary Node and wait for that node to re-connect before allowing the DFM to make any changes to the dataflow. The ADMIN may adjust how long the NCM waits for the Primary Node to reconnect by adjusting the property nifi.cluster.manager.safemode.duration in the nifi.properties file, which is discussed in the System Properties section of this document.
Isolated Processors: In a Data Integration Tool cluster, the same dataflow runs on all the nodes. As a result, every component in the flow runs on every node. However, there may be cases when the DFM would not want every processor to run on every node. The most common case is when using a processor that communicates with an external service using a protocol that does not scale well. For example, the GetSFTP processor pulls from a remote directory, and if the GetSFTP on every node in the cluster tries simultaneously to pull from the same remote directory, there could be race conditions. Therefore, the DFM could configure the GetSFTP on the Primary Node to run in isolation, meaning that it only runs on that node. It could pull in data and, with the proper dataflow configuration, load-balance it across the rest of the nodes in the cluster. Note that while this feature exists, it is also very common to simply use a standalone Data Integration Tool instance to pull data and feed it to the cluster. It just depends on the resources available and how the Administrator decides to configure the cluster.
Heartbeats: The nodes communicate their health and status to the NCM via "heartbeats", which let the NCM know they are still connected to the cluster and working properly. By default, the nodes emit heartbeats to the NCM every 5 seconds, and if the NCM does not receive a heartbeat from a node within 45 seconds, it disconnects the node due to "lack of heartbeat". (The 5-second and 45-second settings are configurable in the nifi.properties file. See the System Properties section of this document for more information.) The reason that the NCM disconnects the node is because the NCM needs to ensure that every node in the cluster is in sync, and if a node is not heard from regularly, the NCM cannot be sure it is still in sync with the rest of the cluster. If, after 45 seconds, the node does send a new heartbeat, the NCM will automatically reconnect the node to the cluster. Both the disconnection due to lack of heartbeat and the reconnection once a heartbeat is received are reported to the DFM in the NCM’s User Interface.
Communication within the Cluster
As noted, the nodes communicate with the NCM via heartbeats. The communication that allows the nodes to find the NCM may be set up as multicast or unicast; this is configured in the nifi.properties file (See System Properties ). By default, unicast is used. It is important to note that the nodes in a Data Integration Tool cluster are not aware of each other. They only communicate with the NCM. Therefore, if one of the nodes goes down, the other nodes in the cluster will not automatically pick up the load of the missing node. It is possible for the DFM to configure the dataflow for failover contingencies; however, this is dependent on the dataflow design and does not happen automatically.
When the DFM makes changes to the dataflow, the NCM communicates those changes to the nodes and waits for each node to respond, indicating that it has made the change on its local flow. If the DFM wants to make another change, the NCM will only allow this to happen once all the nodes have acknowledged that they’ve implemented the last change. This is a safeguard to ensure that all the nodes in the cluster have the correct and up-to-date flow.
Dealing with Disconnected Nodes
A DFM may manually disconnect a node from the cluster. But if a node becomes disconnected for any other reason (such as due to lack of heartbeat), the NCM will show a bulletin on the User Interface, and the DFM will not be able to make any changes to the dataflow until the issue of the disconnected node is resolved. The DFM or the Administrator will need to troubleshoot the issue with the node and resolve it before any new changes may be made to the dataflow. However, it is worth noting that just because a node is disconnected does not mean that it is not working; it just means that the NCM cannot communicate with the node.
Basic Cluster Setup
This section describes the setup for a simple two-node, non-secure, unicast cluster made up of three instances of Data Integration Tool:

- The NCM
- Node 1
- Node 2

Administrators may install each instance on a separate server; however, it is also perfectly fine to install the NCM and one of the nodes on the same server, as the NCM is very lightweight. Just keep in mind that the ports assigned to each instance must not collide if the NCM and one of the nodes share the same server.
For each instance, certain properties in the nifi.properties file will need to be updated. In particular, the Web and Clustering properties should be evaluated for your situation and adjusted accordingly. All the properties are described in the System Properties section of this guide; however, in this section, we will focus on the minimum properties that must be set for a simple cluster.
For all three instances, the Cluster Common Properties can be left with the default settings. Note, however, that if you change these settings, they must be set the same on every instance in the cluster (NCM and nodes).
For the NCM, the minimum properties to configure are as follows (see the example snippet after this list):

- Under the Web Properties, set either the http or https port that you want the NCM to run on. If the NCM and one of the nodes are on the same server, make sure this port is different from the web port used by the node.
- Under the Cluster Manager Properties, set the following:
  - nifi.cluster.is.manager - Set this to true.
  - nifi.cluster.manager.protocol.port - Set this to an open port that is higher than 1024 (anything lower requires root). Take note of this setting, as you will need to reference it when you set up the nodes.
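For illustration, assuming the NCM's UI runs on HTTP port 8080 and the cluster protocol uses port 9001 (both hypothetical values), the relevant nifi.properties entries on the NCM would be:

nifi.web.http.port=8080
nifi.cluster.is.manager=true
nifi.cluster.manager.protocol.port=9001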
For Node 1, the minimum properties to configure are as follows (see the example snippet after this list):

- Under the Web Properties, set either the http or https port that you want Node 1 to run on. If the NCM is running on the same server, choose a different web port for Node 1. Also, consider whether you need to set the http or https host property.
- Under the State Management section, set the nifi.state.management.provider.cluster property to the identifier of the Cluster State Provider. Ensure that the Cluster State Provider has been configured in the state-management.xml file. See Configuring State Providers for more information.
- Under the Cluster Node Properties, set the following:
  - nifi.cluster.is.node - Set this to true.
  - nifi.cluster.node.address - Set this to the fully qualified hostname of the node. If left blank, it defaults to "localhost".
  - nifi.cluster.node.protocol.port - Set this to an open port that is higher than 1024 (anything lower requires root). If Node 1 and the NCM are on the same server, make sure this port is different from the nifi.cluster.manager.protocol.port.
  - nifi.cluster.node.unicast.manager.address - Set this to the NCM's fully qualified hostname.
  - nifi.cluster.node.unicast.manager.protocol.port - Set this to exactly the same port that was set on the NCM for the property nifi.cluster.manager.protocol.port.
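Again for illustration only (the ports, hostnames, and the zk-provider identifier are placeholders), Node 1's nifi.properties might contain:

nifi.web.http.port=8081
nifi.state.management.provider.cluster=zk-provider
nifi.cluster.is.node=true
nifi.cluster.node.address=node1.example.com
nifi.cluster.node.protocol.port=9002
nifi.cluster.node.unicast.manager.address=ncm.example.com
nifi.cluster.node.unicast.manager.protocol.port=9001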
For Node 2, the minimum properties to configure are as follows:

- Under the Web Properties, set either the http or https port that you want Node 2 to run on. Also, consider whether you need to set the http or https host property.
- Under the State Management section, set the nifi.state.management.provider.cluster property to the identifier of the Cluster State Provider. Ensure that the Cluster State Provider has been configured in the state-management.xml file. See Configuring State Providers for more information.
- Under the Cluster Node Properties, set the following:
  - nifi.cluster.is.node - Set this to true.
  - nifi.cluster.node.address - Set this to the fully qualified hostname of the node. If left blank, it defaults to "localhost".
  - nifi.cluster.node.protocol.port - Set this to an open port that is higher than 1024 (anything lower requires root).
  - nifi.cluster.node.unicast.manager.address - Set this to the NCM's fully qualified hostname.
  - nifi.cluster.node.unicast.manager.protocol.port - Set this to exactly the same port that was set on the NCM for the property nifi.cluster.manager.protocol.port.
Now, it is possible to start up the cluster. Technically, it does not matter which instance starts up first. However, you could start the NCM first, then Node 1 and then Node 2. Since the first node that connects is automatically elected as the Primary Node, this sequence should create a cluster where Node 1 is the Primary Node.
Troubleshooting
If you encounter issues and your cluster does not work as described, investigate the nifi-app.log and nifi-user.log on both the NCM and the nodes. If needed, you can change the logging level to DEBUG by editing the conf/logback.xml file. Specifically, set level="DEBUG" in the following line (instead of "INFO"):
<logger name="org.apache.nifi.web.api.config" level="INFO" additivity="false">
    <appender-ref ref="USER_FILE"/>
</logger>
State Management
Data Integration Tool provides a mechanism for Processors, Reporting Tasks, Controller Services, and the framework itself to persist state. This allows a Processor, for example, to resume from the place where it left off after Data Integration Tool is restarted. Additionally, it allows for a Processor to store some piece of information so that the Processor can access that information from all of the different nodes in the cluster. This allows one node to pick up where another node left off, or to coordinate across all of the nodes in a cluster.
Configuring State Providers
When a component decides to store or retrieve state, it does so by providing a "Scope" - either Node-local or Cluster-wide. The mechanism that is used to store and retrieve this state is then determined based on this Scope, as well as the configured State Providers. The nifi.properties file contains three different properties that are relevant to configuring these State Providers.
Property | Description |
---|---|
nifi.state.management.configuration.file | Specifies an external XML file that is used for configuring the local and/or cluster-wide State Providers. This XML file may contain configurations for multiple providers. |
nifi.state.management.provider.local | Provides the identifier of the local State Provider configured in this XML file. |
nifi.state.management.provider.cluster | Provides the identifier of the cluster-wide State Provider configured in this XML file. |
This XML file consists of a top-level state-management element, which has one or more local-provider and zero or more cluster-provider elements. Each of these elements contains an id element that is used to specify the identifier that can be referenced in the nifi.properties file, as well as a class element that specifies the fully-qualified class name to use in order to instantiate the State Provider. Finally, each of these elements may have zero or more property elements. Each property element has a name attribute that gives the name of the property that the State Provider supports. The textual content of the property element is the value of the property.
Once these State Providers have been configured in the state-management.xml file (or whatever file is configured), those Providers may be referenced by their identifiers.
By default, the Local State Provider is configured to be a WriteAheadLocalStateProvider that persists the data to the $NIFI_HOME/state/local directory. The default Cluster State Provider is configured to be a ZooKeeperStateProvider. The default ZooKeeper-based provider must have its Connect String property populated before it can be used. It is also advisable, if multiple Data Integration Tool instances will use the same ZooKeeper instance, that the value of the Root Node property be changed. For instance, one might set the value to /nifi/<team name>/production. A Connect String takes the form of comma-separated <host>:<port> tuples, such as my-zk-server1:2181,my-zk-server2:2181,my-zk-server3:2181. In the event a port is not specified for any of the hosts, the ZooKeeper default of 2181 is assumed.
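As a minimal sketch (the identifiers, fully-qualified class names, and property values shown here are illustrative; check the state-management.xml file shipped with your installation for the exact defaults), a configuration with one local provider and one cluster provider might look like this:

<state-management>
    <local-provider>
        <id>local-provider</id>
        <class>org.apache.nifi.controller.state.providers.local.WriteAheadLocalStateProvider</class>
        <property name="Directory">./state/local</property>
    </local-provider>
    <cluster-provider>
        <id>zk-provider</id>
        <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
        <property name="Connect String">my-zk-server1:2181,my-zk-server2:2181,my-zk-server3:2181</property>
        <property name="Root Node">/nifi/teamA/production</property>
        <property name="Access Control">Open</property>
    </cluster-provider>
</state-management>

These identifiers (local-provider and zk-provider) are what the nifi.state.management.provider.local and nifi.state.management.provider.cluster properties would reference in nifi.properties.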
When adding data to ZooKeeper, there are two options for Access Control: Open and CreatorOnly. If the Access Control property is set to Open, then anyone is allowed to log into ZooKeeper and have full permissions to see, change, delete, or administer the data. If CreatorOnly is specified, then only the user that created the data is allowed to read, change, delete, or administer the data. In order to use the CreatorOnly option, Data Integration Tool must provide some form of authentication. See the ZooKeeper Access Control section below for more information on how to configure authentication.
If Data Integration Tool is configured to run in standalone mode, the cluster-provider element need not be populated in the state-management.xml file and will actually be ignored if it is populated. However, the local-provider element must always be present and populated. Additionally, if Data Integration Tool is run in a cluster, each node must also have the cluster-provider element present and properly configured. Otherwise, Data Integration Tool will fail to start up.
While there are not many properties that need to be configured for these providers, they were externalized into a separate state-management.xml file, rather than being configured via the nifi.properties file, simply because different implementations may require different properties, and it is easier to maintain and understand the configuration in an XML-based file such as this, than to mix the properties of the Provider in with all of the other Data Integration Tool framework-specific properties.
It should be noted that if Processors and other components save state using the Clustered scope, the Local State Provider will be used if the instance is a standalone instance (not in a cluster) or is disconnected from the cluster. This also means that if a standalone instance is migrated to become a cluster, then that state will no longer be available, as the component will begin using the Clustered State Provider instead of the Local State Provider.
Embedded ZooKeeper Server
As mentioned above, the default State Provider for cluster-wide state is the ZooKeeperStateProvider. At the time of this writing, this is the only State Provider that exists for handling cluster-wide state. What this means is that Data Integration Tool has dependencies on ZooKeeper in order to behave as a cluster. However, there are many environments in which Data Integration Tool is deployed where there is no existing ZooKeeper ensemble being maintained. In order to avoid the burden of forcing administrators to also maintain a separate ZooKeeper instance, Data Integration Tool provides the option of starting an embedded ZooKeeper server.
Property | Description |
---|---|
nifi.state.management.embedded.zookeeper.start | Specifies whether or not this instance of Data Integration Tool should run an embedded ZooKeeper server. |
nifi.state.management.embedded.zookeeper.properties | Properties file that provides the ZooKeeper properties to use if nifi.state.management.embedded.zookeeper.start is set to true. |
This can be accomplished by setting the nifi.state.management.embedded.zookeeper.start property in nifi.properties to true on those nodes that should run the embedded ZooKeeper server. Generally, it is advisable to run ZooKeeper on either 3 or 5 nodes. Running on fewer than 3 nodes provides less durability in the face of failure, while running on more than 5 nodes generally produces more network traffic than is necessary. Additionally, running ZooKeeper on 4 nodes provides no more benefit than running on 3 nodes, because ZooKeeper requires a majority of nodes to be active in order to function. However, it is up to the administrator to determine the number of nodes most appropriate to the particular deployment of Data Integration Tool. An embedded ZooKeeper server cannot be run on the NCM.
If the nifi.state.management.embedded.zookeeper.start property is set to true, the nifi.state.management.embedded.zookeeper.properties property in nifi.properties also becomes relevant. This specifies the ZooKeeper properties file to use. At a minimum, this properties file needs to be populated with the list of ZooKeeper servers. The servers are specified as properties in the form of server.1, server.2, to server.n. Each of these servers is configured as <hostname>:<quorum port>[:<leader election port>]. For example, myhost:2888:3888. This list of nodes should be the same nodes in the Data Integration Tool cluster that have the nifi.state.management.embedded.zookeeper.start property set to true. Also note that because ZooKeeper will be listening on these ports, the firewall may need to be configured to open these ports for incoming traffic, at least between nodes in the cluster. Additionally, the port to listen on for client connections must be opened in the firewall. The default value for this is 2181 but can be configured via the clientPort property in the zookeeper.properties file.
When using an embedded ZooKeeper, the ./conf/zookeeper.properties file has a property named dataDir. By default, this value is set to ./state/zookeeper.
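Putting these pieces together, a zookeeper.properties file for a hypothetical three-node ensemble (the hostnames are placeholders) might contain:

dataDir=./state/zookeeper
clientPort=2181
server.1=node1.example.com:2888:3888
server.2=node2.example.com:2888:3888
server.3=node3.example.com:2888:3888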
If more than one Data Integration Tool node is running an embedded ZooKeeper, it is important to tell the server which one it is. This is accomplished by creating a file named myid and placing it in ZooKeeper's data directory. The contents of this file should be the index of the server as specified by the server.<number> property. So for one of the ZooKeeper servers, we will accomplish this by performing the following commands:
cd $NIFI_HOME
mkdir state
mkdir state/zookeeper
echo 1 > state/zookeeper/myid
For the next Data Integration Tool Node that will run ZooKeeper, we can accomplish this by performing the following commands:
cd $NIFI_HOME
mkdir state
mkdir state/zookeeper
echo 2 > state/zookeeper/myid
And so on.
For more information on the properties used to administer ZooKeeper, see the ZooKeeper Admin Guide.
For information on securing the embedded ZooKeeper Server, see the Securing ZooKeeper section below.
ZooKeeper Access Control
ZooKeeper provides Access Control to its data via an Access Control List (ACL) mechanism. When data is written to ZooKeeper, Data Integration Tool will provide an ACL that indicates that any user is allowed to have full permissions to the data, or an ACL that indicates that only the user that created the data is allowed to access the data. Which ACL is used depends on the value of the Access Control property for the ZooKeeperStateProvider (see the Configuring State Providers section for more information).
In order to use an ACL that indicates that only the Creator is allowed to access the data, we need to tell ZooKeeper who the Creator is. There are two mechanisms for accomplishing this. The first mechanism is to provide authentication using Kerberos. See Kerberizing Data Integration Tool’s ZooKeeper Client for more information.
The second option is to use a username and password. This is configured by specifying a value for the Username and a value for the Password properties for the ZooKeeperStateProvider (see the Configuring State Providers section for more information). The important thing to keep in mind here, though, is that ZooKeeper will pass around the password in plain text. This means that a username and password should not be used unless ZooKeeper is running on localhost as a one-instance cluster, or unless communications with ZooKeeper occur only over encrypted channels, such as a VPN or an SSL connection. ZooKeeper will provide support for SSL connections in version 3.5.0.
Securing ZooKeeper
When Data Integration Tool communicates with ZooKeeper, all communications, by default, are non-secure, and anyone who logs into ZooKeeper is able to view and manipulate all of the Data Integration Tool state that is stored in ZooKeeper. To prevent this, we can use Kerberos to manage the authentication. At this time, ZooKeeper does not provide support for encryption via SSL. Support for SSL in ZooKeeper is being actively developed and is expected to be available in the 3.5.x release version.
In order to secure the communications, we need to ensure that both the client and the server support the same configuration. Instructions for configuring the Data Integration Tool ZooKeeper client and embedded ZooKeeper server to use Kerberos are provided below.
Kerberizing Embedded ZooKeeper Server
The krb5.conf file on the systems with the embedded ZooKeeper servers should be identical to the one on the system where the krb5kdc service is running. When using the embedded ZooKeeper server, we may choose to secure the server by using Kerberos. All nodes configured to launch an embedded ZooKeeper and using Kerberos should follow these steps.
In order to use Kerberos, we first need to generate a Kerberos Principal for our ZooKeeper servers. The following command is run on the server where the krb5kdc service is running. This is accomplished via the kadmin tool:
kadmin: addprinc "zookeeper/myHost.example.com@EXAMPLE.COM"
Here, we are creating a Principal with the primary zookeeper/myHost.example.com, using the realm EXAMPLE.COM. We need to use a Principal whose name is <service name>/<instance name>. In this case, the service is zookeeper and the instance name is myHost.example.com (the fully qualified name of our host).
Next, we will need to create a KeyTab for this Principal. This command is run on the server hosting the Data Integration Tool instance with an embedded ZooKeeper server:
kadmin: xst -k zookeeper-server.keytab zookeeper/myHost.example.com@EXAMPLE.COM
This will create a file in the current directory named zookeeper-server.keytab. We can now copy that file into the $NIFI_HOME/conf/ directory. We should ensure that only the user that will be running Data Integration Tool is allowed to read this file.
We will need to repeat the above steps for each of the instances of Data Integration Tool that will be running the embedded ZooKeeper server, being sure to replace myHost.example.com with myHost2.example.com, or whatever fully qualified hostname the ZooKeeper server will be run on.
Now that we have our KeyTab for each of the servers that will be running Data Integration Tool, we will need to configure Data Integration Tool’s embedded ZooKeeper server to use this configuration.
ZooKeeper uses the Java Authentication and Authorization Service (JAAS), so we need to create a JAAS-compatible file. In the $NIFI_HOME/conf/ directory, create a file named zookeeper-jaas.conf (this file will already exist if the Client has already been configured to authenticate via Kerberos; that's okay, just add to the file). We will add the following snippet to this file:
Server {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="./conf/zookeeper-server.keytab"
storeKey=true
useTicketCache=false
principal="zookeeper/myHost.example.com@EXAMPLE.COM";
};
Be sure to replace the value of principal above with the appropriate Principal, including the fully qualified domain name of the server.
Next, we need to tell Data Integration Tool to use this as our JAAS configuration. This is done by setting a JVM System Property, so we will edit the conf/bootstrap.conf file. If the Client has already been configured to use Kerberos, this is not necessary, as it was done above. Otherwise, we will add the following line to our bootstrap.conf file:
java.arg.15=-Djava.security.auth.login.config=./conf/zookeeper-jaas.conf
Note: this additional line in the file doesn't have to be number 15; it just has to be added to the bootstrap.conf file. Use whatever argument number is appropriate for your configuration.
We will want to initialize our Kerberos ticket by running the following command:
kinit -kt zookeeper-server.keytab "zookeeper/myHost.example.com@EXAMPLE.COM"
Again, be sure to replace the Principal with the appropriate value, including your realm and your fully qualified hostname.
Finally, we need to tell the ZooKeeper server to use the SASL Authentication Provider. To do this, we edit the $NIFI_HOME/conf/zookeeper.properties file and add the following lines:
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
jaasLoginRenew=3600000
requireClientAuthScheme=sasl
The last line is optional but specifies that clients MUST use Kerberos to communicate with our ZooKeeper instance.
Now, we can start Data Integration Tool, and the embedded ZooKeeper server will use Kerberos as the authentication mechanism.
Kerberizing Data Integration Tool’s ZooKeeper Client
Note: The Data Integration Tool nodes running the embedded ZooKeeper server will also need to follow the procedure below, since they will also be acting as a client at the same time.
The preferred mechanism for authenticating users with ZooKeeper is to use Kerberos. In order to use Kerberos to authenticate, we must configure a few system properties, so that the ZooKeeper client knows who the user is and where the KeyTab file is. All nodes configured to store cluster-wide state using ZooKeeperStateProvider and using Kerberos should follow these steps.
First, we must create the Principal that we will use when communicating with ZooKeeper. This is generally done via the kadmin tool:
kadmin: addprinc "nifi@EXAMPLE.COM"
A Kerberos Principal is made up of three parts: the primary, the instance, and the realm. Here, we are creating a Principal with the primary nifi, no instance, and the realm EXAMPLE.COM. The primary (nifi, in this case) is the identifier that will be used to identify the user when authenticating via Kerberos.
After we have created our Principal, we will need to create a KeyTab for the Principal:
kadmin: xst -k nifi.keytab nifi@EXAMPLE.COM
This will create a file in the current directory named nifi.keytab. This keytab file can be copied to the other Data Integration Tool nodes with embedded ZooKeeper servers. We can now copy that file into the $NIFI_HOME/conf/ directory. We should ensure that only the user that will be running Data Integration Tool is allowed to read this file.
Next, we need to configure Data Integration Tool to use this KeyTab for authentication. Since ZooKeeper uses the Java Authentication and Authorization Service (JAAS), we need to create a JAAS-compatible file. In the $NIFI_HOME/conf/ directory, create a file named zookeeper-jaas.conf and add to it the following snippet:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="./conf/nifi.keytab"
storeKey=true
useTicketCache=false
principal="nifi@EXAMPLE.COM";
};
Finally, we need to tell Data Integration Tool to use this as our JAAS configuration. This is done by setting a JVM System Property, so we will edit the conf/bootstrap.conf file. We add the following line anywhere in this file in order to tell the Data Integration Tool JVM to use this configuration:
java.arg.15=-Djava.security.auth.login.config=./conf/zookeeper-jaas.conf
We can initialize our Kerberos ticket by running the following command:
kinit -kt nifi.keytab nifi@EXAMPLE.COM
Now, when we start Data Integration Tool, it will use Kerberos to authenticate as the nifi user when communicating with ZooKeeper.
Troubleshooting Kerberos Configuration
When using Kerberos, it is important to use fully-qualified domain names and not localhost. Please ensure that the fully qualified hostname of each server is used in the following locations:
-
The conf/zookeeper.properties file should use the FQDN for the server.1, server.2, …, server.N values.
-
The Connect String property of the ZooKeeperStateProvider
-
The /etc/hosts file should also resolve the FQDN to an IP address that is not 127.0.0.1.
Failure to do so may result in errors similar to the following:
2016-01-08 16:08:57,888 ERROR [pool-26-thread-1-SendThread(localhost:2181)] o.a.zookeeper.client.ZooKeeperSaslClient An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - LOOKING_UP_SERVER)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
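As a hedged illustration of a configuration that satisfies the FQDN requirements above (the hostnames, ports, and IP addresses shown are hypothetical examples, not values from your environment), the relevant entries might look like:
# conf/zookeeper.properties - use fully qualified hostnames, never localhost
server.1=myHost.example.com:2888:3888
server.2=myHost2.example.com:2888:3888
# ZooKeeperStateProvider Connect String - again using FQDNs
myHost.example.com:2181,myHost2.example.com:2181
# /etc/hosts - resolve each FQDN to a routable address, not 127.0.0.1
192.168.1.10    myHost.example.com
192.168.1.11    myHost2.example.com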
If there are problems communicating or authenticating with Kerberos, this Troubleshooting Guide may be of value.
One of the most important notes in the above Troubleshooting Guide is the mechanism for turning on debug output for Kerberos. This is done by setting the sun.security.krb5.debug JVM system property. In Data Integration Tool, this is accomplished by adding the following line to the $NIFI_HOME/conf/bootstrap.conf file:
java.arg.16=-Dsun.security.krb5.debug=true
This will cause the debug output to be written to the Data Integration Tool Bootstrap log file. By default, this is located at $NIFI_HOME/logs/nifi-bootstrap.log. This output can be rather verbose but provides extremely valuable information for troubleshooting Kerberos failures.
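If you want to watch this output while reproducing a failure, the bootstrap log can simply be followed from the command line, for example:
tail -f $NIFI_HOME/logs/nifi-bootstrap.log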
Bootstrap Properties
The bootstrap.conf file in the conf directory allows users to configure settings for how Data Integration Tool should be started. This includes parameters, such as the size of the Java Heap, what Java command to run, and Java System Properties.
Here, we will address the different properties that are made available in the file. Any changes to this file will take effect only after Data Integration Tool has been stopped and restarted.
Property |
Description |
java |
Specifies the fully qualified java command to run. By default, it is simply java. |
lib.dir |
The lib directory to use for Data Integration Tool. By default, this is set to ./lib. |
conf.dir |
The conf directory to use for Data Integration Tool. By default, this is set to ./conf. |
graceful.shutdown.seconds |
When Data Integration Tool is instructed to shutdown, the Bootstrap will wait this number of seconds for the process to shutdown cleanly. If the process is still running after this amount of time, the Bootstrap will "kill" the process, or terminate it abruptly. |
java.arg.N |
Any number of JVM arguments can be passed to the Data Integration Tool JVM when the process is started. These arguments are defined by adding properties to bootstrap.conf that
begin with java.arg., followed by a number (for example, java.arg.1). |
notification.services.file |
When Data Integration Tool is started, or stopped, or when the Bootstrap detects that Data Integration Tool has died, the Bootstrap is able to send notifications of these events to interested parties. This is configured by specifying an XML file that defines which notification services can be used. More about this file can be found in the Notification Services section. |
notification.max.attempts |
If a notification service is configured but is unable to perform its function, it will try again up to a maximum number of attempts. This property
configures what that maximum number of attempts is. The default is |
nifi.start.notification.services |
This property is a comma-separated list of Notification Service identifiers that correspond to the Notification Services
defined in the notification.services.file property; these services will be notified when Data Integration Tool is started. |
nifi.stop.notification.services |
This property is a comma-separated list of Notification Service identifiers that correspond to the Notification Services
defined in the notification.services.file property; these services will be notified when Data Integration Tool is stopped. |
nifi.died.notification.services |
This property is a comma-separated list of Notification Service identifiers that correspond to the Notification Services
defined in the notification.services.file property; these services will be notified when the Bootstrap detects that Data Integration Tool has died. |
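To illustrate how these properties fit together, a minimal bootstrap.conf might contain entries along the following lines (the values shown are illustrative examples, not mandated defaults):
java=java
lib.dir=./lib
conf.dir=./conf
graceful.shutdown.seconds=20
java.arg.2=-Xms512m
java.arg.3=-Xmx512m
notification.services.file=./conf/bootstrap-notification-services.xml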
Java 7 PermGen Sizing The provided bootstrap.conf file may include lines such as
#java.arg.11=-XX:PermSize=128M
#java.arg.12=-XX:MaxPermSize=128M
If running on Java 7, it is recommended to uncomment those lines to ensure the PermGen size and maximum can be larger than is available by default. This is important because Data Integration Tool can load a significant number of classes, which can result in an OutOfMemoryError if PermGen fills up. You might also choose a value larger than 128MB.
Java 7 and 8 handling of CodeCache It has been observed in both Java 7 and Java 8 runtime environments that performance can suddenly drop by more than an order of magnitude after days or weeks of otherwise ideal behavior. This has only been observed under extremely high load and in cases where considerable Just-in-Time (JIT) compilation occurs. The core problem is that the CodeCache becomes full and is seemingly not properly garbage collected or grown. When this occurs, JIT compilation either stops or incurs considerable delays, and performance drops. This is easily overcome by ensuring the following lines are available in bootstrap.conf. By default they are present but commented out; uncomment them for maximum sustained throughput.
#java.arg.7=-XX:ReservedCodeCacheSize=256m
#java.arg.8=-XX:CodeCacheFlushingMinimumFreeSpace=10m
#java.arg.9=-XX:+UseCodeCacheFlushing
Notification Services
When the Data Integration Tool bootstrap starts or stops Data Integration Tool, or detects that it has died unexpectedly, it is able to notify configured recipients. At this point (version 0.3.0 of code192 Data Integration Tool), the only mechanism supplied is to send an e-mail notification. The notification services configuration file, however, is a configurable XML file so that as new notification capabilities are developed, they will be configured similarly.
The default location of the XML file is conf/bootstrap-notification-services.xml, but this value can be changed in the conf/bootstrap.conf file.
The syntax of the XML file is as follows:
<services>
    <!-- any number of service elements can be defined. -->
    <service>
        <id>some-identifier</id>
        <!-- The fully-qualified class name of the Notification Service. -->
        <class>org.apache.nifi.bootstrap.notification.email.EmailNotificationService</class>
        <!-- Any number of properties can be set using this syntax.
             The properties available depend on the Notification Service. -->
        <property name="Property Name 1">Property Value</property>
        <property name="Another Property Name">Property Value 2</property>
    </service>
</services>
Once the desired services have been configured, they can then be referenced in the bootstrap.conf file.
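For example, to have the Bootstrap use the service whose <id> is some-identifier (from the syntax example above) for all three event types, the bootstrap.conf entries would look like the following sketch:
nifi.start.notification.services=some-identifier
nifi.stop.notification.services=some-identifier
nifi.died.notification.services=some-identifier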
Currently, the only implementation is the org.apache.nifi.bootstrap.notification.email.EmailNotificationService implementation. It has the following properties available:
Property |
Required |
Description |
SMTP Hostname |
true |
The hostname of the SMTP Server that is used to send Email Notifications |
SMTP Port |
true |
The Port used for SMTP communications |
SMTP Username |
true |
Username for the SMTP account |
SMTP Password |
Password for the SMTP account |
|
SMTP Auth |
Flag indicating whether authentication should be used |
|
SMTP TLS |
Flag indicating whether TLS should be enabled |
|
SMTP Socket Factory |
javax.net.ssl.SSLSocketFactory |
|
SMTP X-Mailer Header |
X-Mailer used in the header of the outgoing email |
|
Content Type |
Mime Type used to interpret the contents of the email, such as text/plain or text/html |
|
From |
true |
Specifies the Email address to use as the sender. Otherwise, a "friendly name" can be used as the From address, but the value must be enclosed in double-quotes. |
To |
The recipients to include in the To-Line of the email |
|
CC |
The recipients to include in the CC-Line of the email |
|
BCC |
The recipients to include in the BCC-Line of the email |
In addition to the properties above that are marked as required, at least one of the To, CC, or BCC properties must be set.
A complete example of configuring the Email service would look like the following:
<service>
    <id>email-notification</id>
    <class>org.apache.nifi.bootstrap.notification.email.EmailNotificationService</class>
    <property name="SMTP Hostname">smtp.gmail.com</property>
    <property name="SMTP Port">587</property>
    <property name="SMTP Username">username@gmail.com</property>
    <property name="SMTP Password">super-secret-password</property>
    <property name="SMTP TLS">true</property>
    <property name="From">"Data Integration Tool Service Notifier"</property>
    <property name="To">username@gmail.com</property>
</service>
Kerberos Service
Data Integration Tool can be configured to use Kerberos SPNEGO (or "Kerberos Service") for authentication. In this scenario, users will hit the REST endpoint /access/kerberos and the server will respond with a 401 status code and the challenge response header WWW-Authenticate: Negotiate. This communicates to the browser that it should use the GSS-API to load the user's Kerberos ticket and provide it as a Base64-encoded header value in the subsequent request. It will be of the form Authorization: Negotiate YII.... Data Integration Tool will attempt to validate this ticket with the KDC. If it is successful, the user's principal will be returned as the identity, and the flow will follow login/credential authentication, in that a JWT will be issued in the response to prevent the unnecessary overhead of Kerberos authentication on every subsequent request. If the ticket cannot be validated, it will return with the appropriate error response code. The user will then be able to provide their Kerberos credentials to the login form if the KerberosLoginIdentityProvider has been configured. See Kerberos login identity provider for more details.
Data Integration Tool will only respond to Kerberos SPNEGO negotiation over an HTTPS connection, as unsecured requests are never authenticated.
The following properties must be set in nifi.properties to enable Kerberos service authentication.
Property |
Required |
Description |
Service Principal |
true |
The service principal used by Data Integration Tool to communicate with the KDC |
Keytab Location |
true |
The file path to the keytab containing the service principal |
See the Kerberos Properties section for complete documentation.
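As a sketch, the corresponding nifi.properties entries might look like the following (the principal and keytab path are placeholders that must match your KDC configuration and exported keytab):
nifi.kerberos.service.principal=HTTP/nifi.example.com@EXAMPLE.COM
nifi.kerberos.keytab.location=./conf/http-nifi.keytab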
Notes
-
Kerberos is case-sensitive in many places and the error messages (or lack thereof) may not be sufficiently explanatory. Check the case sensitivity of the service principal in your configuration files. Convention is HTTP/fully.qualified.domain@REALM.
-
Browsers have varying levels of restriction when dealing with SPNEGO negotiations. Some will provide the local Kerberos ticket to any domain that requests it, while others whitelist the trusted domains. See Spring Security Kerberos - Reference Documentation: Appendix E. Configure browsers for SPNEGO Negotiation for common browsers.
-
Some browsers (legacy IE) do not support recent encryption algorithms such as AES, and are restricted to legacy algorithms (DES). This should be noted when generating keytabs.
-
The KDC must be configured and a service principal defined for Data Integration Tool and a keytab exported. Comprehensive instructions for Kerberos server configuration and administration are beyond the scope of this document (see the MIT Kerberos Admin Guide), but an example is below:
Adding a service principal for a server at nifi.nifi.apache.org and exporting the keytab from the KDC:
root@kdc:/etc/krb5kdc# kadmin.local
Authenticating as principal admin/admin@NIFI.APACHE.ORG with password.
kadmin.local:  listprincs
K/M@NIFI.APACHE.ORG
admin/admin@NIFI.APACHE.ORG
...
kadmin.local:  addprinc -randkey HTTP/nifi.nifi.apache.org
WARNING: no policy specified for HTTP/nifi.nifi.apache.org@NIFI.APACHE.ORG; defaulting to no policy
Principal "HTTP/nifi.nifi.apache.org@NIFI.APACHE.ORG" created.
kadmin.local:  ktadd -k /http-nifi.keytab HTTP/nifi.nifi.apache.org
Entry for principal HTTP/nifi.nifi.apache.org with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:/http-nifi.keytab.
Entry for principal HTTP/nifi.nifi.apache.org with kvno 2, encryption type des-cbc-crc added to keytab WRFILE:/http-nifi.keytab.
kadmin.local:  listprincs
HTTP/nifi.nifi.apache.org@NIFI.APACHE.ORG
K/M@NIFI.APACHE.ORG
admin/admin@NIFI.APACHE.ORG
...
kadmin.local:  q
root@kdc:~# ll /http*
-rw------- 1 root root 162 Mar 14 21:43 /http-nifi.keytab
root@kdc:~#
System Properties
The nifi.properties file in the conf directory is the main configuration file for controlling how Data Integration Tool runs. This section provides an overview of the properties in this file and includes some notes on how to configure it in a way that will make upgrading easier. After making changes to this file, restart Data Integration Tool in order for the changes to take effect.
The contents of this file are relatively stable but do change from time to time. It is always a good idea to review this file when upgrading and pay attention for any changes. Consider configuring items below marked with an asterisk (*) in such a way that upgrading will be easier. For details, see a full discussion on upgrading at the end of this section. Note that values for periods of time and data sizes must include the unit of measure, for example "10 sec" or "10 MB", not simply "10". |
Core Properties
The first section of the nifi.properties file is for the Core Properties. These properties apply to the core framework as a whole.
Property |
Description |
nifi.version |
The version number of the current release. If upgrading but reusing this file, be sure to update this value. |
nifi.flow.configuration.file* |
The location of the flow configuration file (i.e., the file that contains what is currently displayed on the Data Integration Tool graph). The default value is ./conf/flow.xml.gz. |
nifi.flow.configuration.archive.dir* |
The location of the archive directory where backup copies of the flow.xml are saved. The default value is ./conf/archive. |
nifi.flowcontroller.autoResumeState |
Indicates whether, upon restart, the components on the Data Integration Tool graph should return to their last state. The default value is true. |
nifi.flowcontroller.graceful.shutdown.period |
Indicates the shutdown period. The default value is 10 sec. |
nifi.flowservice.writedelay.interval |
When many changes are made to the flow.xml, this property specifies how long to wait before writing out the changes, so as to batch the changes into a single write. The default value is 500 ms. |
nifi.administrative.yield.duration |
If a component allows an unexpected exception to escape, it is considered a bug. As a result, the framework will pause (or administratively yield) the component for this amount of time. This is done so that the component does not use up massive amounts of system resources, since it is known to have problems in the existing state. The default value is 30 sec. |
nifi.bored.yield.duration |
When a component has no work to do (i.e., is "bored"), this is the amount of time it will wait before checking to see if it has new data to work on. This way, it does not use up CPU resources by checking for new work too often. When setting this property, be aware that it could add extra latency for components that do not constantly have work to do, as once they go into this "bored" state, they will wait this amount of time before checking for more work. The default value is 10 millis. |
nifi.authority.provider.configuration.file* |
This is the location of the file that specifies how user access is authorized. The default value is ./conf/authority-providers.xml. |
nifi.login.identity.provider.configuration.file* |
This is the location of the file that specifies how username/password authentication is performed. This file is only considered if nifi.security.user.login.identity.provider is configured with a provider identifier. |
nifi.templates.directory* |
This is the location of the directory where flow templates are saved. The default value is ./conf/templates. |
nifi.ui.banner.text |
This is banner text that may be configured to display at the top of the User Interface. It is blank by default. |
nifi.ui.autorefresh.interval |
The interval at which the User Interface auto-refreshes. The default value is 30 sec. |
nifi.nar.library.directory |
The location of the nar library. The default value is ./lib and probably should be left as is. |
nifi.nar.working.directory |
The location of the nar working directory. The default value is ./work/nar and probably should be left as is. |
nifi.documentation.working.directory |
The documentation working directory. The default value is ./work/docs/components and probably should be left as is. |
nifi.processor.scheduling.timeout |
Time to wait for a Processor's life-cycle operation (@OnScheduled and @OnUnscheduled) to finish before another life-cycle operation (e.g., stop) can be invoked. The default is infinite. |
State Management
The State Management section of the Properties file provides a mechanism for configuring local and cluster-wide mechanisms for components to persist state. See the State Management section for more information on how this is used.
Property |
Description |
nifi.state.management.configuration.file |
The XML file that contains configuration for the local and cluster-wide State Providers. The default value is ./conf/state-management.xml |
nifi.state.management.provider.local |
The ID of the Local State Provider to use. This value must match the value of the id element of one of the local-provider elements in the state-management.xml file. |
nifi.state.management.provider.cluster |
The ID of the Cluster State Provider to use. This value must match the value of the id element of one of the cluster-provider elements in the state-management.xml file. |
nifi.state.management.embedded.zookeeper.start |
Specifies whether or not this instance of Data Integration Tool should start an embedded ZooKeeper Server. This is used in conjunction with the ZooKeeperStateProvider. |
nifi.state.management.embedded.zookeeper.properties |
Specifies a properties file that contains the configuration for the embedded ZooKeeper Server that is started (if the nifi.state.management.embedded.zookeeper.start property is set to true). |
H2 Settings
The H2 Settings section defines the settings for the H2 database, which keeps track of user access and flow controller history.
Property |
Description |
nifi.database.directory |
The location of the H2 database directory. The default value is ./database_repository. |
nifi.h2.url.append |
This property specifies additional arguments to add to the connection string for the H2 database. The default value should be used and should not be changed. It is: ;LOCK_TIMEOUT=25000;WRITE_DELAY=0;AUTO_SERVER=FALSE. |
FlowFile Repository
The FlowFile repository keeps track of the attributes and current state of each FlowFile in the system. By default, this repository is installed in the same root installation directory as all the other repositories; however, it is advisable to configure it on a separate drive if available.
Property |
Description |
nifi.flowfile.repository.implementation |
The FlowFile Repository implementation. The default value is org.apache.nifi.controller.repository.WriteAheadFlowFileRepository and should not be changed. |
nifi.flowfile.repository.directory* |
The location of the FlowFile Repository. The default value is ./flowfile_repository. |
nifi.flowfile.repository.partitions |
The number of partitions. The default value is 256. |
nifi.flowfile.repository.checkpoint.interval |
The FlowFile Repository checkpoint interval. The default value is 2 mins. |
nifi.flowfile.repository.always.sync |
If set to true, any change to the repository will be synchronized to the disk, meaning that Data Integration Tool will ask the operating system not to cache the information. This is very expensive and can significantly reduce Data Integration Tool performance. However, if it is false, there could be the potential for data loss if either there is a sudden power loss or the operating system crashes. The default value is false. |
Swap Management
Data Integration Tool keeps FlowFile information in memory (the JVM) but during surges of incoming data, the FlowFile information can start to take up so much of the JVM that system performance suffers. To counteract this effect, Data Integration Tool "swaps" the FlowFile information to disk temporarily until more JVM space becomes available again. These properties govern how that process occurs.
Property |
Description |
nifi.swap.manager.implementation |
The Swap Manager implementation. The default value is org.apache.nifi.controller.FileSystemSwapManager and should not be changed. |
nifi.queue.swap.threshold |
The queue threshold at which Data Integration Tool starts to swap FlowFile information to disk. The default value is 20000. |
nifi.swap.in.period |
The swap in period. The default value is 5 sec. |
nifi.swap.in.threads |
The number of threads to use for swapping in. The default value is 1. |
nifi.swap.out.period |
The swap out period. The default value is 5 sec. |
nifi.swap.out.threads |
The number of threads to use for swapping out. The default value is 4. |
Content Repository
The Content Repository holds the content for all the FlowFiles in the system. By default, it is installed in the same root installation directory as all the other repositories; however, administrators will likely want to configure it on a separate drive if available. If nothing else, it is best if the Content Repository is not on the same drive as the FlowFile Repository. In dataflows that handle a large amount of data, the Content Repository could fill up a disk and the FlowFile Repository, if also on that disk, could become corrupt. To avoid this situation, configure these repositories on different drives.
Property |
Description |
nifi.content.repository.implementation |
The Content Repository implementation. The default value is org.apache.nifi.controller.repository.FileSystemRepository and should not be changed. |
nifi.content.claim.max.appendable.size |
The maximum size for a content claim. The default value is 10 MB. |
nifi.content.claim.max.flow.files |
The maximum number of FlowFiles to assign to one content claim. The default value is 100. |
nifi.content.repository.directory.default* |
The location of the Content Repository. The default value is ./content_repository. |
nifi.content.repository.archive.max.retention.period |
If archiving is enabled (see nifi.content.repository.archive.enabled below), then this property specifies the maximum amount of time to keep the archived data. It is 12 hours by default. |
nifi.content.repository.archive.max.usage.percentage |
If archiving is enabled (see nifi.content.repository.archive.enabled below), then this property must also have a value to indicate the maximum percentage of disk space that may be used before archived data is removed. If this limit is already reached even before archiving begins, then archival will not be of much use. It is 50% by default. |
nifi.content.repository.archive.enabled |
To enable archiving, set this to true and specify a value for the nifi.content.repository.archive.max.usage.percentage property above. By default, archiving is enabled. |
nifi.content.repository.always.sync |
If set to true, any change to the repository will be synchronized to the disk, meaning that Data Integration Tool will ask the operating system not to cache the information. This is very expensive and can significantly reduce Data Integration Tool performance. However, if it is false, there could be the potential for data loss if either there is a sudden power loss or the operating system crashes. The default value is false. |
nifi.content.viewer.url |
The URL for a web-based content viewer if one is available. It is blank by default. |
Provenance Repository
The Provenance Repository contains the information related to Data Provenance. The next three sections are for Provenance Repository properties.
Property |
Description |
nifi.provenance.repository.implementation |
The Provenance Repository implementation. The default value is org.apache.nifi.provenance.PersistentProvenanceRepository and should not be changed. |
Persistent Provenance Repository Properties
Property |
Description |
nifi.provenance.repository.directory.default* |
The location of the Provenance Repository. The default value is ./provenance_repository. |
nifi.provenance.repository.max.storage.time |
The maximum amount of time to keep data provenance information. The default value is 24 hours. |
nifi.provenance.repository.max.storage.size |
The maximum amount of data provenance information to store at a time. The default is 1 GB. |
nifi.provenance.repository.rollover.time |
The amount of time to wait before rolling over the latest data provenance information so that it is available in the User Interface. The default value is 30 secs. |
nifi.provenance.repository.rollover.size |
The amount of information to roll over at a time. The default value is 100 MB. |
nifi.provenance.repository.query.threads |
The number of threads to use for Provenance Repository queries. The default value is 2. |
nifi.provenance.repository.index.threads |
The number of threads to use for indexing Provenance events so that they are searchable. The default value is 1. For flows that operate on a very high number of FlowFiles, the indexing of Provenance events could become a bottleneck. If this is the case, a bulletin will appear, indicating that "The rate of the dataflow is exceeding the provenance recording rate. Slowing down flow to accommodate." If this happens, increasing the value of this property may increase the rate at which the Provenance Repository is able to process these records, resulting in better overall throughput. |
nifi.provenance.repository.compress.on.rollover |
Indicates whether to compress the provenance information when rolling it over. The default value is true. |
nifi.provenance.repository.always.sync |
If set to true, any change to the repository will be synchronized to the disk, meaning that Data Integration Tool will ask the operating system not to cache the information. This is very expensive and can significantly reduce Data Integration Tool performance. However, if it is false, there could be the potential for data loss if either there is a sudden power loss or the operating system crashes. The default value is false. |
nifi.provenance.repository.journal.count |
The number of journal files that should be used to serialize Provenance Event data. Increasing this value will allow more tasks to simultaneously update the repository but will result in more expensive merging of the journal files later. This value should ideally be equal to the number of threads that are expected to update the repository simultaneously, but 16 tends to work well in most environments. The default value is 16. |
nifi.provenance.repository.indexed.fields |
This is a comma-separated list of the fields that should be indexed and made searchable. Fields that are not indexed will not be searchable. Valid fields are: EventType, FlowFileUUID, Filename, TransitURI, ProcessorID, AlternateIdentifierURI, Relationship, Details. The default value is: EventType, FlowFileUUID, Filename, ProcessorID. |
nifi.provenance.repository.indexed.attributes |
This is a comma-separated list of FlowFile Attributes that should be indexed and made searchable. It is blank by default. Some good examples to consider are filename, uuid, and mime.type, as well as any custom attributes you might use that are valuable for your use case. |
nifi.provenance.repository.index.shard.size |
Large values for the shard size will result in more Java heap usage when searching the Provenance Repository but should provide better performance. The default value is 500 MB. |
nifi.provenance.repository.max.attribute.length |
Indicates the maximum length that a FlowFile attribute can be when retrieving a Provenance Event from the repository. If the length of any attribute exceeds this value, it will be truncated when the event is retrieved. The default is 65536. |
Volatile Provenance Repository Properties
Property |
Description |
nifi.provenance.repository.buffer.size |
The Provenance Repository buffer size. The default value is 100000. |
Component Status Repository
The Component Status Repository contains the information for the Component Status History tool in the User Interface. These properties govern how that tool works.
The buffer.size and snapshot.frequency work together to determine the amount of historical data to retain. For example, to keep two days' worth of historical data with a data point snapshot occurring every 5 minutes, you would configure snapshot.frequency to be "5 mins" and buffer.size to be "576". To further explain this example: for every 60 minutes there are 12 (60 / 5) snapshot windows. To keep that data for 48 hours (12 * 48), you end up with a buffer size of 576.
Property |
Description |
nifi.components.status.repository.implementation |
The Component Status Repository implementation. The default value is org.apache.nifi.controller.status.history.VolatileComponentStatusRepository and should not be changed. |
nifi.components.status.repository.buffer.size |
Specifies the buffer size for the Component Status Repository. The default value is 1440. |
nifi.components.status.snapshot.frequency |
This value indicates how often to present a snapshot of the components' status history. The default value is 1 min. |
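Following the two-day example described above, the corresponding nifi.properties entries would be a sketch such as:
nifi.components.status.snapshot.frequency=5 mins
nifi.components.status.repository.buffer.size=576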
Site to Site Properties
These properties govern how this instance of Data Integration Tool communicates with remote instances of Data Integration Tool when Remote Process Groups are configured in the dataflow.
Property |
Description |
nifi.remote.input.socket.host |
The host name that will be given out to clients to connect to this Data Integration Tool instance for Site-to-Site communication. By default, it is the value from InetAddress.getLocalHost().getHostName(). |
nifi.remote.input.socket.port |
The remote input socket port for Site-to-Site communication. By default, it is blank, but it must have a value in order to use Remote Process Groups. |
nifi.remote.input.secure |
This indicates whether communication between this instance of Data Integration Tool and remote Data Integration Tool instances should be secure. By default, it is set to true. In order for secure site-to-site to work, many Security Properties (below) must also be configured. |
Web Properties
These properties pertain to the web-based User Interface.
Property |
Description |
nifi.web.war.directory |
This is the location of the web war directory. The default value is ./lib. |
nifi.web.http.host |
The HTTP host. It is blank by default. |
nifi.web.http.port |
The HTTP port. The default value is 8080. |
nifi.web.https.host |
The HTTPS host. It is blank by default. |
nifi.web.https.port |
The HTTPS port. It is blank by default. When configuring Data Integration Tool to run securely, this port should be configured. |
nifi.web.jetty.working.directory |
The location of the Jetty working directory. The default value is ./work/jetty. |
nifi.web.jetty.threads |
The number of Jetty threads. The default value is 200. |
Security Properties
These properties pertain to various security features in Data Integration Tool. Many of these properties are covered in more detail in the Security Configuration section of this Administrator’s Guide.
Property |
Description |
nifi.sensitive.props.key |
This is the password used to encrypt any sensitive property values that are configured in processors. By default, it is blank, but the system administrator should provide a value for it. It can be a string of any length, although the recommended minimum length is 10 characters. Be aware that once this password is set and one or more sensitive processor properties have been configured, this password should not be changed. |
nifi.sensitive.props.algorithm |
The algorithm used to encrypt sensitive properties. The default value is |
nifi.sensitive.props.provider |
The sensitive property provider. The default value is BC. |
nifi.security.keystore* |
The full path and name of the keystore. It is blank by default. |
nifi.security.keystoreType |
The keystore type. It is blank by default. |
nifi.security.keystorePasswd |
The keystore password. It is blank by default. |
nifi.security.keyPasswd |
The key password. It is blank by default. |
nifi.security.truststore* |
The full path and name of the truststore. It is blank by default. |
nifi.security.truststoreType |
The truststore type. It is blank by default. |
nifi.security.truststorePasswd |
The truststore password. It is blank by default. |
nifi.security.needClientAuth |
This indicates whether client authentication is required in the cluster protocol. It is blank by default. |
nifi.security.user.credential.cache.duration |
The length of time to cache user credentials. The default value is 24 hours. |
nifi.security.user.authority.provider |
This indicates what type of authority provider to use. The default value is file-provider, which refers to the file
configured in the core property nifi.authority.provider.configuration.file. |
nifi.security.user.login.identity.provider |
This indicates what type of login identity provider to use. The default value is blank, but it can be set to the identifier of a provider defined in the file specified in nifi.login.identity.provider.configuration.file. |
nifi.security.support.new.account.requests |
This indicates whether a secure Data Integration Tool is configured to allow users to request access. It is blank by default. |
nifi.security.anonymous.authorities |
This indicates what roles to grant to anonymous users accessing Data Integration Tool over HTTPS. It is blank by default, but it could be set to any combination of ROLE_MONITOR, ROLE_DFM, ROLE_ADMIN, ROLE_PROVENANCE, ROLE_NIFI. Leaving this property blank will require that users accessing Data Integration Tool over HTTPS be authenticated either using a client certificate or their credentials against the configured login identity provider. |
nifi.security.ocsp.responder.url |
This is the URL for the Online Certificate Status Protocol (OCSP) responder if one is being used. It is blank by default. |
nifi.security.ocsp.responder.certificate |
This is the location of the OCSP responder certificate if one is being used. It is blank by default. |
Cluster Common Properties
When setting up a Data Integration Tool cluster, these properties should be configured the same way on both the cluster manager and the nodes.
Property |
Description |
nifi.cluster.protocol.heartbeat.interval |
The interval at which nodes should emit heartbeats to the cluster manager. The default value is 5 sec. |
nifi.cluster.protocol.is.secure |
This indicates whether cluster communications are secure. The default value is false. |
nifi.cluster.protocol.socket.timeout |
The amount of time to wait for a cluster protocol socket to be established before trying again. The default value is 30 sec. |
nifi.cluster.protocol.connection.handshake.timeout |
The amount of time to wait for a node to connect to the cluster. The default value is 45 sec. |
Multicast Cluster Common Properties
If multicast is used, the following nifi.cluster.protocol.multicast.xxx properties must be configured. By default, unicast is used.
Property |
Description |
nifi.cluster.protocol.use.multicast |
Indicates whether multicast is being used. The default value is false. |
nifi.cluster.protocol.multicast.address |
The cluster multicast address. It is blank by default. |
nifi.cluster.protocol.multicast.port |
The cluster multicast port. It is blank by default. |
nifi.cluster.protocol.multicast.service.broadcast.delay |
The multicast service broadcast delay. The default value is 500 ms. |
nifi.cluster.protocol.multicast.service.locator.attempts |
The number of multicast service locator attempts to make. The default value is 3. |
nifi.cluster.protocol.multicast.service.locator.attempts.delay |
The multicast service locator attempts delay. The default value is 1 sec. |
Cluster Node Properties
Only configure these properties for cluster nodes.
Property |
Description |
nifi.cluster.is.node |
Set this to true if the instance is a node in a cluster. The default value is false. |
nifi.cluster.node.address |
The fully qualified address of the node. It is blank by default. |
nifi.cluster.node.protocol.port |
The node’s protocol port. It is blank by default. |
nifi.cluster.node.protocol.threads |
The number of threads used for the node protocol. The default value is 2. |
nifi.cluster.node.unicast.manager.address |
If multicast is not used, the value for this property should be the same as the value configured on the cluster manager for manager address. |
nifi.cluster.node.unicast.manager.protocol.port |
If multicast is not used, the value for this property should be the same as the value configured on the cluster manager for manager protocol port. |
Cluster Manager Properties
Only configure these properties for the cluster manager.
Property |
Description |
nifi.cluster.is.manager |
Set this to true if the instance is a cluster manager. The default value is false. |
nifi.cluster.manager.address |
The fully qualified address of the cluster manager. It is blank by default. |
nifi.cluster.manager.protocol.port |
The cluster manager’s protocol port. It is blank by default. |
nifi.cluster.manager.node.firewall.file |
The location of the node firewall file. This is a file that may be used to list all the nodes that are allowed to connect to the cluster. It provides an additional layer of security. This value is blank by default. |
nifi.cluster.manager.node.event.history.size |
The size of the cluster manager’s event history. The default value is 10. |
nifi.cluster.manager.node.api.connection.timeout |
The amount of time to wait for an API connection to be made. The default value is 30 sec. |
nifi.cluster.manager.node.api.read.timeout |
The API read timeout. The default value is 30 sec. |
nifi.cluster.manager.node.api.request.threads |
The number of threads to use for API requests. The default value is 10. |
nifi.cluster.manager.flow.retrieval.delay |
The delay before the cluster manager retrieves the latest flow configuration. The default value is 5 sec. |
nifi.cluster.manager.protocol.threads |
The number of threads used for the cluster manager protocol. The default value is 10. |
nifi.cluster.manager.safemode.duration |
Upon restart of an already existing cluster, this is the amount of time that the cluster manager waits for the primary node to connect before giving up and selecting another node to be the primary node. The default value is 0 sec, which means to wait forever. If the administrator does not care which node is the primary node, this value can be changed to some amount of time other than 0 sec. |
Kerberos Properties
Property |
Description |
nifi.kerberos.krb5.file* |
The location of the krb5 file, if used. It is blank by default. Note that this property is not used to authenticate Data Integration Tool users.
Rather, it is made available for extension points, such as Hadoop-based Processors, to use. At this time, only a single krb5 file is allowed to
be specified per Data Integration Tool instance, so this property is configured here rather than in individual Processors.
Example: |
nifi.kerberos.service.principal* |
The name of the Data Integration Tool Kerberos service principal, if used. It is blank by default. Note that this property is used to authenticate Data Integration Tool users.
Example: |
nifi.kerberos.keytab.location* |
The file path of the Data Integration Tool Kerberos keytab, if used. It is blank by default. Note that this property is used to authenticate Data Integration Tool users.
Example: |
nifi.kerberos.authentication.expiration* |
The expiration duration of a successful Kerberos user authentication, if used. It is 12 hours by default.
Example: |
For Upgrading - Take care when configuring the properties above that are marked with an asterisk (*). To make the upgrade process easier, it is advisable to change the default configurations to locations outside the main root installation directory. In this way, these items can remain in their configured location through an upgrade, and Data Integration Tool can find all the repositories and configuration files and pick up where it left off as soon as the old version is stopped and the new version is started. Furthermore, the administrator may reuse this nifi.properties file and any other configuration files without having to re-configure them each time an upgrade takes place. As previously noted, it is important to check for any changes in the nifi.properties file of the new version when upgrading and make sure they are reflected in the nifi.properties file you use. |
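For example, an upgrade-friendly layout might point the starred items at directories outside the installation root. The /opt/... paths below are hypothetical; use whatever locations suit your environment:
nifi.flow.configuration.file=/opt/dit-config/flow.xml.gz
nifi.flow.configuration.archive.dir=/opt/dit-config/archive
nifi.templates.directory=/opt/dit-config/templates
nifi.flowfile.repository.directory=/opt/dit-repos/flowfile_repository
nifi.content.repository.directory.default=/opt/dit-repos/content_repository
nifi.provenance.repository.directory.default=/opt/dit-repos/provenance_repository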