Add Nutanix Integration #22086
Co-authored-by: Sarah Witt <sarah.witt@datadoghq.com>
    ]
    check = NutanixCheck('nutanix', {}, [mock_instance])
    dd_run_check(check)
    aggregator.assert_metric("nutanix.host.count", at_least=1)
can this test be more specific?
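One way to read this request is to pin the value, the tags, and the submission count instead of only `at_least=1`. A minimal sketch, assuming a hypothetical `FakeAggregator` stand-in (it is not the real test aggregator, it only mirrors the keyword names used in the diff); the value `3` and the `ntnx_cluster_name:demo` tag are made up for illustration:

```python
from collections import defaultdict


class FakeAggregator:
    """Hypothetical stand-in for the test aggregator; not the real stub."""

    def __init__(self):
        self._metrics = defaultdict(list)

    def gauge(self, name, value, tags=None):
        # Store tags sorted so comparisons ignore ordering.
        self._metrics[name].append((value, sorted(tags or [])))

    def assert_metric(self, name, value=None, tags=None, count=None, at_least=1):
        matches = [
            (v, t)
            for v, t in self._metrics[name]
            if (value is None or v == value) and (tags is None or t == sorted(tags))
        ]
        if count is not None:
            assert len(matches) == count, f"{name}: expected {count}, got {len(matches)}"
        else:
            assert len(matches) >= at_least, f"{name}: expected >= {at_least}, got {len(matches)}"


aggregator = FakeAggregator()
# Illustrative submission; the value and tags are assumptions for this sketch.
aggregator.gauge("nutanix.host.count", 3, tags=["nutanix", "ntnx_cluster_name:demo"])

# Loose: passes for any value and any tags.
aggregator.assert_metric("nutanix.host.count", at_least=1)

# Specific: pins the value, the tag set, and the number of submissions.
aggregator.assert_metric(
    "nutanix.host.count",
    value=3,
    tags=["nutanix", "ntnx_cluster_name:demo"],
    count=1,
)
```

The loose form would still pass if the check reported the wrong host count or dropped a tag; the specific form would not.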
    ]

    # VM "NTNX-10-0-0-165-PCVM-1767014640" has numCoresPerSocket=1
    aggregator.assert_metric("nutanix.vm.cpu.cores_per_socket", value=1, tags=expected_tags)
these tests could be combined and just assert metrics in one test
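A rough sketch of what combining could look like: one test that walks a table of expected metrics rather than one test function per metric. Everything here is hypothetical scaffolding so the example runs on its own: `SUBMITTED` stands in for what a single check run reported, `assert_metric` stands in for `aggregator.assert_metric`, and the `nutanix.vm.status` value is made up.

```python
# Hypothetical record of one check run: metric name -> (value, tags).
SUBMITTED = {
    "nutanix.vm.cpu.cores_per_socket": (1, ["ntnx_type:vm"]),
    "nutanix.vm.status": (1, ["ntnx_type:vm"]),
}


def assert_metric(name, value, tags):
    """Stand-in for aggregator.assert_metric, backed by SUBMITTED."""
    assert name in SUBMITTED, f"{name} was not submitted"
    assert SUBMITTED[name] == (value, tags), f"{name}: got {SUBMITTED[name]}"


def test_vm_metrics():
    expected_tags = ["ntnx_type:vm"]
    # One table of expectations replaces several near-identical test functions.
    expected = [
        ("nutanix.vm.cpu.cores_per_socket", 1),
        ("nutanix.vm.status", 1),
    ]
    for name, value in expected:
        assert_metric(name, value=value, tags=expected_tags)
    return len(expected)


checked = test_vm_metrics()
```

The check then runs once per test instead of once per metric, and adding a metric means adding one row to the table.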
    raise ConnectionError("Connection failed")

    mocker.patch('requests.Session.get', side_effect=mock_exception)
can we use the new way of mocking the http wrapper here?
Not sure about that, we can handle it after the migration of existing integrations.
Can we do it now (a future PR) so there's one less thing to migrate later?
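The thread does not show the newer HTTP-wrapper mocking, so the sketch below only illustrates the pattern the diff currently uses: patching the session's `get` with a `side_effect` that raises. To stay runnable without dependencies it uses the standard library's `unittest.mock` and a hypothetical `Session` class in place of `requests.Session`; `fetch_status` is likewise a made-up caller.

```python
from unittest import mock


class Session:
    """Hypothetical stand-in for requests.Session so the sketch has no dependencies."""

    def get(self, url, **kwargs):
        raise NotImplementedError("network access is not expected in unit tests")


def fetch_status(session, url):
    """Toy caller: reports 'unreachable' instead of letting the error escape."""
    try:
        session.get(url)
    except ConnectionError:
        return "unreachable"
    return "ok"


# Patch the method on the class, as the diff does with 'requests.Session.get'.
with mock.patch.object(
    Session, "get", side_effect=ConnectionError("Connection failed")
) as fake_get:
    result = fetch_status(Session(), "http://test.com")
    fake_get.assert_called_once_with("http://test.com")
```

Because the patch targets the transport class directly, a later switch to a different HTTP wrapper would need these tests rewritten, which is presumably the migration cost the reviewers are weighing.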
    )

    # Test POST
    check._make_request_with_retry("http://test.com", method='post', json={'key': 'value'})
nit- we shouldn't be testing private/helper methods directly
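As a sketch of the alternative: drive the retry logic through the public entry point and assert on observable results, with only the transport mocked. `TinyCheck`, its `http_get` parameter, and the `service_up` attribute are hypothetical and much simpler than the real check; the retry rule (retry only on 429) is an assumption for the example.

```python
from unittest import mock


class TinyCheck:
    """Hypothetical check; only `check` is meant to be called by tests."""

    def __init__(self, http_get, max_retries=2):
        self._http_get = http_get
        self._max_retries = max_retries
        self.service_up = None

    def _make_request_with_retry(self, url):
        # Retry only while the server answers 429 (rate limited).
        for _ in range(self._max_retries + 1):
            status, body = self._http_get(url)
            if status != 429:
                break
        return status, body

    def check(self):
        status, _ = self._make_request_with_retry("http://test.com")
        self.service_up = 1 if status == 200 else 0


# The fake transport is rate limited once, then succeeds.
http_get = mock.Mock(side_effect=[(429, None), (200, {"key": "value"})])
check = TinyCheck(http_get)
check.check()
```

The test would then assert that `http_get` was called twice and that `service_up` is `1`, so the helper can be renamed or inlined without touching the test.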
Review from sarah-witt is dismissed. Related teams and files:
- agent-integrations
- nutanix/assets/configuration/spec.yaml
- nutanix/datadog_checks/nutanix/check.py
- nutanix/datadog_checks/nutanix/config_models/defaults.py
- nutanix/datadog_checks/nutanix/config_models/instance.py
- nutanix/datadog_checks/nutanix/data/conf.yaml.example
- nutanix/datadog_checks/nutanix/infrastructure_monitor.py
- nutanix/tests/test_resource_filters.py
- nutanix/tests/test_vms.py
Review from dkirov-dd is dismissed. Related teams and files:
- agent-integrations
- nutanix/assets/configuration/spec.yaml
- nutanix/datadog_checks/nutanix/check.py
- nutanix/datadog_checks/nutanix/config_models/defaults.py
- nutanix/datadog_checks/nutanix/config_models/instance.py
- nutanix/datadog_checks/nutanix/data/conf.yaml.example
- nutanix/datadog_checks/nutanix/infrastructure_monitor.py
- nutanix/tests/test_resource_filters.py
- nutanix/tests/test_vms.py
Review from sarah-witt is dismissed. Related teams and files:
- agent-integrations
- nutanix/datadog_checks/nutanix/infrastructure_monitor.py
- nutanix/tests/test_vms.py
Will address the docs review and the tests/assets review in a separate PR!
The backport to
To backport manually, run these commands in your terminal:
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-7.77.x 7.77.x
# Navigate to the new working tree
cd .worktrees/backport-7.77.x
# Create a new branch
git switch --create backport-22086-to-7.77.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 ae8e33e6c688ef49aacd17ae461c0e9b19d3b04d
# Push it to GitHub
git push --set-upstream origin backport-22086-to-7.77.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-7.77.x
Then, create a pull request where the
* initial scaffolding * working nutanix.health.up metric * working nutanix.cluster.count metric * work in progress * work in progress: assert tags in unit tests * tests cleanup * cleanup * health check * collecting basic cluster metrics * collecting cluster stats and basic node metrics * remove upgrade status tag and lint * fix basic auth + add integration tests * refactor and cleanup * collecting node stats metrics * lint * add cluster namespace to cluster metrics * rename metrics, remove unit suffixes * use host instead of nodes as much as possible * collecting basic vm metrics * fix query param typo * collecting vm stats metric * add missing argument required for VmStats and passing integration tests * add metadata.csv * fix typo in test name * update integration tests to stop checking for values * add nutanix overview dashboard * update manifest description and classifier tags * update manifest metric to check for * little cleanup * remove unused dependency from pyproject * set default min_collection_interval to 120s * update dashboard with more units and improvements * update dashboard description * report host metrics and vm metrics with their correspondig hostname * report external host tags for hosts and vms * switch to list all vm stats endpoint for better rate limit - update metdata.csv with new metrics * add ntnx_type:host and ntnx_type:vm as tags * add cluster_name and host_name tags to all hosts and vm metrics + fix integration tests * improve metrics descriptions in metadata.csv * update dashboard * add compact legend to all cluster/host/vm widgets for better ux * fix stats sampling interval to match the min_collection_interval * add support for pagination * add page_limit parameter for pagination size limit * update fixtures and tests for the new paginated requests * rename paginated methods to start with list instead of get * add support for retry logic to handle PC rate limiting * add process signatures * update nutanix process signatures * fix 
error deleting page and limit params * fix manifest.json extra comma in process_signatures * collect events * add bash script to record fixtures * Fix log message for error collecting vm metrics * refactor pagination method and improve logging * ddev validate ci --sync * update dashboard and add new nutanix logos * add debug logs for HTTP requests and payloads * add support for port in pc_ip * swap nutanix.vm.hypervisor.memory_usage_ppm with nutanix.vm.memory.usage_ppm for more accurate VM memory usage * improve logging: reduce HTTP logging noise to only rate limits and error responses * fix validate dashboards * bump python version to 3.13 and min base check version * fix typo in min base check * fix Mock() has no len error in test_retry.py * wip * add collect_events property * change remaining references to nutanix.vm.hypervisor.memory_usage_ppm to nutanix.vm.memory.usage_ppm to fix VM memory usage widgets * add support for tasks collection, update fixtures * add ntnx_type tag to events and tasks * small cleanup * dashboard: change all bytes in binary to bytes in decimal * cleanup and small refactor * make events and tasks match implementation, fix handling of start_time, improve tests and small refactor * split check.py into modules, fix integration tests * improve error messages for non 2xx http error responses * add missing dd licence headers to some files * rename health_check_score metric * improve metric names batch 1 * improve cluster and host metric names * improve vm metric names * split unit tests into multiple files * add support for audits collection * cleanup and improving tests setup * improve duplication logic tests in events,audits and tasks * add support for alerts collection * use alerts v4.2 API that supports filtering by creationTime * sync all API calls to use the same time window (start, end) * add extra filtering to avoid events/tasks/audits/alerts duplicates * fallback to alerts v4.0 API if v4.2 is not available * fix 
self.last__x_collection_time fields to be the max timestamp: fixes duplicates * persist information about v4.2 API in the persistence cache * wip: host and vm stats not working? * improve vm stats collection by cluster, improve info logs and debug logs * improve type hints and method comments * add support for capacity metrics * add nutanix tag to all entities * report node status metric * ddev validate models and config * add collect_tasks and collect_audits properties for nutanix * add filter propreties for alerts * add filter by severity and type for alerts * add filter events by type * add filter tasks by status * add resource filters support for infra resources and activity resources * improve resource_filters * cleanup * fetch and cache categories * attach categories as tags with option to add ntnx_ suffix * improve categories collection/attachment, improve tests, update all fixtures * improve categories collection and testing * add owner to manifest.json * remove duplicate self.last_audit_collection_time assignment * fix alert messages parameter rendering * add more tests * reduce info logs, improve info log summary, and change rest of logs to debug * improve audits timpestamp tracking, improve logging, code cleanup * improve resource_filters logging, log error messages * fix integration tests + add support for fake docker server testing * fix nutanix wheel version * reset teleport change * ddev validate ci --sync * fix licence headers * fix more licence headers * fix one more licence header * ddev validate labeler --sync * Fix labeler config * reduce audits.json size * reduce audits.json to 50KB * reduce alerts.json to 20 items * replace bash script for recording fixtures with python implementation * update resource_filters description * add starting check info log and add comment about sampling interval * improve categories tests around default behavior, remove duplicate record_fixtures.py * udpate resource_filters to match default desired category 
behavior * add collect_subtasks properties, use persistent cache to track last collected items correctly before filtering * Apply suggestions from code review Co-authored-by: dkirov-dd <166512750+dkirov-dd@users.noreply.github.com> * cleanup rate limit retry implementation * add nutanix.api.rate_limited metric for visibility * fix port configuration log message test * fix rate limit tests and update with the new metric * improve retry_limit implementation * update README.md * add missing prefix_category_tags property * add tests for prefix_category_tags property * address david review: always add ntnx_is_agent_vm tag * document categories in the README * add example in the README for collecting categories * fix dashboard validation * fix readme validate * address review: raise ConfigurationError when its not set * fix ntnx_is_agent_vm tag in tests * add missing spec.yaml properties * update powerConsumptionInstantWatt name for consistency * set creates_events to true in manifest.json * fix retry_on_rate_limit behavior on non 429 responses * fix changelog file name * remove unnecessary file * fix retry_on_rate_limit behavior and add type hints * address sarah review: add log message when skipping a resource * address sarah review: add test all metrics + fix missing metric in metadata.csv * update fixtures with new prism_central url and new VM in OFF state * report nutanix.vm.status metric * address review: by default only collect VMs with powerState ON * collect only ON VM even if other VM resource_filters are set that are not powerState * fix imports and paths for record_fixtures.py * add note about duplicate hostname issue * sync config with the new VM collection comment * update README to explain that a single agent can monitor a prism central environment * fix license headers * refactor activity_monitor by reducing code duplication and extraching a sharing method * refactor infrastructure_monitor * refactor resource_filters * refactor check.py / simplyfing * add 
support for rendering audit messages * add support for rendering nutanix event messages * remove all X_id tags * add caches for activity entities * update test fixtures and adjust tests to work with the new data * add support for displaying affected alerts in tasks * refactor: reduce code duplication in activity_monitor * remove debugging code * improve readability of activity_monitor.py code related to filtering * code cleanup + reduce code duplication * split collect_cluster_metrics into isolated collection phases to enable partial data collection on failures * improve error isolation for host processing to allow non-blocking errors when single host fails * replace log.exception with log.error for more user friendly log message for known errors * make resource_filters proprety always required and remove its default values * switch super() call to python3 style * address some code smells * update page_limit default value from 50 to 100 to reduce API calls * Revert accidental version modification for dev Co-authored-by: Sarah Witt <sarah.witt@datadoghq.com> * add support for batch_vm_collection by default to avoid rate limits * fix batch collection mode to process all vms regardless of the batch mode * improve testing vms filtering in both vm collection modes --------- Co-authored-by: dkirov-dd <166512750+dkirov-dd@users.noreply.github.com> Co-authored-by: Sarah Witt <sarah.witt@datadoghq.com> (cherry picked from commit ae8e33e)
behavior * add collect_subtasks properties, use persistent cache to track last collected items correctly before filtering * Apply suggestions from code review Co-authored-by: dkirov-dd <166512750+dkirov-dd@users.noreply.github.com> * cleanup rate limit retry implementation * add nutanix.api.rate_limited metric for visibility * fix port configuration log message test * fix rate limit tests and update with the new metric * improve retry_limit implementation * update README.md * add missing prefix_category_tags property * add tests for prefix_category_tags property * address david review: always add ntnx_is_agent_vm tag * document categories in the README * add example in the README for collecting categories * fix dashboard validation * fix readme validate * address review: raise ConfigurationError when its not set * fix ntnx_is_agent_vm tag in tests * add missing spec.yaml properties * update powerConsumptionInstantWatt name for consistency * set creates_events to true in manifest.json * fix retry_on_rate_limit behavior on non 429 responses * fix changelog file name * remove unnecessary file * fix retry_on_rate_limit behavior and add type hints * address sarah review: add log message when skipping a resource * address sarah review: add test all metrics + fix missing metric in metadata.csv * update fixtures with new prism_central url and new VM in OFF state * report nutanix.vm.status metric * address review: by default only collect VMs with powerState ON * collect only ON VM even if other VM resource_filters are set that are not powerState * fix imports and paths for record_fixtures.py * add note about duplicate hostname issue * sync config with the new VM collection comment * update README to explain that a single agent can monitor a prism central environment * fix license headers * refactor activity_monitor by reducing code duplication and extraching a sharing method * refactor infrastructure_monitor * refactor resource_filters * refactor check.py / simplyfing * add 
support for rendering audit messages * add support for rendering nutanix event messages * remove all X_id tags * add caches for activity entities * update test fixtures and adjust tests to work with the new data * add support for displaying affected alerts in tasks * refactor: reduce code duplication in activity_monitor * remove debugging code * improve readability of activity_monitor.py code related to filtering * code cleanup + reduce code duplication * split collect_cluster_metrics into isolated collection phases to enable partial data collection on failures * improve error isolation for host processing to allow non-blocking errors when single host fails * replace log.exception with log.error for more user friendly log message for known errors * make resource_filters proprety always required and remove its default values * switch super() call to python3 style * address some code smells * update page_limit default value from 50 to 100 to reduce API calls * Revert accidental version modification for dev Co-authored-by: Sarah Witt <sarah.witt@datadoghq.com> * add support for batch_vm_collection by default to avoid rate limits * fix batch collection mode to process all vms regardless of the batch mode * improve testing vms filtering in both vm collection modes --------- Co-authored-by: dkirov-dd <166512750+dkirov-dd@users.noreply.github.com> Co-authored-by: Sarah Witt <sarah.witt@datadoghq.com> Signed-off-by: lukepatrick <lukephilips@gmail.com>
What does this PR do?
Adds a Nutanix integration for monitoring Prism Central v4 infrastructure via the Datadog Agent.
Features
ntnx_type tag
ntnx_ prefix option)
Technical Details
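The commit history for this PR mentions retry logic for Prism Central rate limiting, with retries applying only to HTTP 429 responses. The following is a minimal sketch of that idea, not the integration's actual code: the names `request_with_retry`, `send`, and `RateLimitedError`, as well as the exponential backoff schedule, are assumptions made for illustration.

```python
import time
from typing import Callable


class RateLimitedError(Exception):
    """Hypothetical error raised when retries are exhausted on HTTP 429."""


def request_with_retry(
    send: Callable[[], tuple[int, dict]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, dict]:
    """Call send() and retry with exponential backoff only on HTTP 429.

    `send` is a hypothetical callable returning (status_code, json_body).
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            # Any non-429 response (success or error) is returned unchanged,
            # so ordinary failures are never retried.
            return status, body
        if attempt < max_retries:
            # Assumed schedule: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2**attempt)
    raise RateLimitedError(f"still rate limited after {max_retries} retries")
```

Passing `sleep` as a parameter keeps the sketch easy to unit test without real delays; a production check would more likely honor a server-provided retry hint if one is available.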
Motivation
https://datadoghq.atlassian.net/browse/AI-5917
Review checklist (to be filled by reviewers)
Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
Add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged.