Description
When using the Apache Arrow Java Dataset API (FileSystemDatasetFactory) to read ORC files directly from HDFS, the JVM fails to exit gracefully after the reading process is complete. The application hangs indefinitely because two non-daemon native threads (started by the underlying libhdfs JNI layer) remain active.
This issue does not occur when reading local files; the JVM exits normally in that scenario. It specifically affects HDFS interactions where the C++ libhdfs client is loaded via JNI.
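The leak can be confirmed with a small diagnostic sketch (a hypothetical helper, not part of the original report): after the read completes, dump the non-daemon threads that keep the JVM alive. The two leaked libhdfs JNI threads should appear in the output; their names are platform-dependent.

```java
// Diagnostic sketch: enumerate non-daemon threads that would block JVM exit.
// Run this after the Arrow/HDFS read finishes to see the leaked JNI threads.
public class NonDaemonThreadDump {
    public static void main(String[] args) {
        // ... perform the Arrow Dataset read over HDFS here ...
        Thread.getAllStackTraces().keySet().stream()
                .filter(t -> !t.isDaemon() && t != Thread.currentThread())
                .forEach(t -> System.out.println(
                        "non-daemon thread still alive: " + t.getName()));
    }
}
```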
I have tested this on Apache Arrow 9.0.0 (the last version with built-in HDFS support) and confirmed the behavior. I also tried 17.0.0 with the HDFS dependency added, and the thread leak persists there as well.
Currently, the only workaround is to force-terminate the JVM with System.exit(0), which is not ideal for applications that rely on shutdown hooks or run inside complex container environments.
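For reference, a minimal sketch of the workaround: unlike Runtime.getRuntime().halt(), System.exit(0) still runs registered shutdown hooks, so per-application cleanup can be moved into a hook before forcing termination. The class and hook body below are illustrative, not from the original report.

```java
// Workaround sketch: force JVM exit despite the leaked non-daemon
// libhdfs JNI threads, while still letting shutdown hooks run.
public class ForceExitAfterRead {
    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() ->
                System.out.println("cleanup: shutdown hook ran")));
        // ... perform the Arrow Dataset read over HDFS here ...
        // System.exit triggers shutdown hooks, then terminates the JVM
        // even though the leaked JNI threads are still alive.
        System.exit(0);
    }
}
```

This keeps hook-based cleanup working, but it cannot help code that expects the JVM to wind down naturally once main() returns.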
Here is the Java code that reproduces the issue:
```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;

String hdfsUri = "hdfs://node01.cdh5:8020/user/hive/warehouse/perf_test_200col_500w_nopk/ds=2/perf_test_200col_500w_nopk";
ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
try (
        BufferAllocator allocator = new RootAllocator();
        NativeMemoryPool pool = NativeMemoryPool.getDefault();
        DatasetFactory datasetFactory =
                new FileSystemDatasetFactory(allocator, pool, FileFormat.ORC, hdfsUri);
        Dataset dataset = datasetFactory.finish();
        Scanner scanner = dataset.newScan(options);
        ArrowReader reader = scanner.scanBatches()
) {
    int totalBatchSize = 0;
    while (reader.loadNextBatch()) {
        try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
            totalBatchSize += root.getRowCount();
        }
    }
    System.out.println("Total batch size: " + totalBatchSize);
} catch (Exception e) {
    e.printStackTrace();
}
```