[ARROW Java][HDFS] JVM hangs after reading HDFS files via Arrow Dataset API due to non-daemon native threads #49464

@10183974

Description

When using the Apache Arrow Java Dataset API (FileSystemDatasetFactory) to read ORC files directly from HDFS, the JVM fails to exit gracefully after the reading process is complete. The application hangs indefinitely because two non-daemon native threads (started by the underlying libhdfs JNI layer) remain active.

This issue does not occur when reading local files; the JVM exits normally in that scenario. It specifically affects HDFS interactions where the C++ libhdfs client is loaded via JNI.
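To confirm which threads keep the JVM alive, a small diagnostic like the one below can be called right after the read finishes. This is a minimal sketch that uses only the JDK. The class name `NonDaemonThreadReport` and the method `lingeringNonDaemonThreads` are hypothetical, and it assumes the libhdfs threads are attached to the JVM and are therefore visible through `Thread.getAllStackTraces()`.

```java
import java.util.ArrayList;
import java.util.List;

public class NonDaemonThreadReport {

    /** Returns live non-daemon threads, excluding the calling thread. */
    public static List<Thread> lingeringNonDaemonThreads() {
        List<Thread> result = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            // Only non-daemon threads prevent the JVM from exiting.
            if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // In the real application this would run after the HDFS read completes.
        for (Thread t : lingeringNonDaemonThreads()) {
            System.out.println(t.getName() + " state=" + t.getState());
        }
    }
}
```

A `jstack <pid>` dump of the hung process shows the same information, including native frames, without changing the application.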

I tested this on Apache Arrow 9.0.0 (the last version with built-in HDFS support) and confirmed the behavior. I also tried 17.0.0, and the thread leak is the same once the dependency is added.

Currently, the only workaround is to force-terminate the JVM with System.exit(0), which is not ideal for applications that rely on shutdown hooks or run inside complex containers.
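One way to limit the blast radius of this workaround is to call the exit only when non-daemon threads are actually still alive. The sketch below illustrates that idea with the JDK alone. `GuardedExit` and `exitIfBlocked` are hypothetical names, and the exit action is passed in so it can be replaced in tests. It does not fix the underlying leak.

```java
import java.util.function.IntConsumer;

public class GuardedExit {

    /** Counts live non-daemon threads other than the calling thread. */
    static long countOtherNonDaemonThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && !t.isDaemon() && t != Thread.currentThread())
                .count();
    }

    /**
     * Invokes the exit action only if other non-daemon threads are still alive.
     * Returns true when the exit action was invoked.
     */
    static boolean exitIfBlocked(int status, IntConsumer exit) {
        if (countOtherNonDaemonThreads() > 0) {
            exit.accept(status);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // ... read from HDFS here, then close all Arrow resources ...
        exitIfBlocked(0, System::exit);
    }
}
```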

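Here is the Java code that reproduces the issue:

import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;

String hdfsUri = "hdfs://node01.cdh5:8020/user/hive/warehouse/perf_test_200col_500w_nopk/ds=2/perf_test_200col_500w_nopk";
ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
try (
    BufferAllocator allocator = new RootAllocator();
    NativeMemoryPool pool = NativeMemoryPool.getDefault();
    DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, pool, FileFormat.ORC, hdfsUri);
    Dataset dataset = datasetFactory.finish();
    Scanner scanner = dataset.newScan(options);
    ArrowReader reader = scanner.scanBatches()
) {
    // Accumulates the row count across all batches.
    int totalBatchSize = 0;
    while (reader.loadNextBatch()) {
        try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
            totalBatchSize += root.getRowCount();
        }
    }
    System.out.println("Total batch size: " + totalBatchSize);
} catch (Exception e) {
    e.printStackTrace();
}
// All resources are closed at this point, yet the JVM does not exit.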