Description
When using the Apache Arrow Java Dataset API (FileSystemDatasetFactory) to read ORC files directly from HDFS, the JVM fails to exit gracefully after the reading process is complete. The application hangs indefinitely because two non-daemon native threads (started by the underlying libhdfs JNI layer) remain active.
This issue does not occur when reading local files; the JVM exits normally in that scenario. It specifically affects HDFS interactions where the C++ libhdfs client is loaded via JNI.
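The leak can be confirmed with a small diagnostic sketch (a hypothetical helper, not part of the original report): after the read completes, dump the non-daemon threads that keep the JVM alive. The two leaked libhdfs JNI threads should appear in the output; their names are platform-dependent.

```java
// Diagnostic sketch: enumerate non-daemon threads that would block JVM exit.
// Run this after the Arrow/HDFS read finishes to see the leaked JNI threads.
public class NonDaemonThreadDump {
    public static void main(String[] args) {
        // ... perform the Arrow Dataset read over HDFS here ...
        Thread.getAllStackTraces().keySet().stream()
                .filter(t -> !t.isDaemon() && t != Thread.currentThread())
                .forEach(t -> System.out.println(
                        "non-daemon thread still alive: " + t.getName()));
    }
}
```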
I have tested this on Apache Arrow 9.0.0 (the last version with built-in HDFS support) and confirmed the behavior. I also tried 17.0.0 with the HDFS dependency added, and the thread leak persists there as well.
Currently, the only workaround is to force-terminate the JVM with System.exit(0), which is not ideal for applications that rely on shutdown hooks or run inside complex container environments.
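For reference, a minimal sketch of the workaround: unlike Runtime.getRuntime().halt(), System.exit(0) still runs registered shutdown hooks, so per-application cleanup can be moved into a hook before forcing termination. The class and hook body below are illustrative, not from the original report.

```java
// Workaround sketch: force JVM exit despite the leaked non-daemon
// libhdfs JNI threads, while still letting shutdown hooks run.
public class ForceExitAfterRead {
    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() ->
                System.out.println("cleanup: shutdown hook ran")));
        // ... perform the Arrow Dataset read over HDFS here ...
        // System.exit triggers shutdown hooks, then terminates the JVM
        // even though the leaked JNI threads are still alive.
        System.exit(0);
    }
}
```

This keeps hook-based cleanup working, but it cannot help code that expects the JVM to wind down naturally once main() returns.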
Here is the Java code that reproduces the issue:
```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;

String hdfsUri = "hdfs://node01.cdh5:8020/user/hive/warehouse/perf_test_200col_500w_nopk/ds=2/perf_test_200col_500w_nopk";
ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
try (
        BufferAllocator allocator = new RootAllocator();
        NativeMemoryPool pool = NativeMemoryPool.getDefault();
        DatasetFactory datasetFactory =
                new FileSystemDatasetFactory(allocator, pool, FileFormat.ORC, hdfsUri);
        Dataset dataset = datasetFactory.finish();
        Scanner scanner = dataset.newScan(options);
        ArrowReader reader = scanner.scanBatches()
) {
    int totalBatchSize = 0;
    while (reader.loadNextBatch()) {
        try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
            totalBatchSize += root.getRowCount();
        }
    }
    System.out.println("Total batch size: " + totalBatchSize);
} catch (Exception e) {
    e.printStackTrace();
}
```