Job-Specific Essential Duties and Responsibilities:
- Serve as the onsite senior technical authority providing continuous 24x7x365 expert-level support.
- Perform hands-on fault isolation, troubleshooting, and restoration of critical infrastructure systems, including servers, storage, networking, and power/connectivity components.
- Execute rapid response actions during high-severity outages, including isolated troubleshooting in degraded or disconnected environments.
- Coordinate with Critical Incident Management, network, and remote engineering teams to restore enterprise services and maintain operational continuity.
- Support and execute disaster recovery (DR) procedures, including failover operations, restoration activities, and DR testing exercises.
- Monitor system health and infrastructure performance to proactively identify risks and prevent outages in mission-critical environments.
- Provide onsite support for high-risk changes, maintenance activities, and infrastructure upgrades, ensuring minimal service disruption.
- Coordinate with telecommunications carriers, vendors, and stakeholders to support infrastructure operations and service restoration activities.
- Maintain documentation, runbooks, and operational procedures to support auditability, compliance, and consistent service execution.
- Support SLA adherence by ensuring rapid response, resolution, and communication during operational incidents and outages.
Job-Specific Minimum Requirements:
- Bachelor’s degree in Information Technology, Computer Science, Engineering, or a related field (or equivalent experience).
- 10+ years of experience in data center operations, systems engineering, or infrastructure support, including significant experience in onsite, mission-critical environments.
- Demonstrated experience supporting enterprise data center operations and infrastructure environments.
- Proven ability to perform advanced troubleshooting and restoration of hardware, server, storage, and network systems.
- Experience supporting 24x7x365 mission-critical operations with strict SLA requirements.
- Strong understanding of data center infrastructure components including compute, storage, networking, and power systems.
- Experience coordinating with incident management and cross-functional engineering teams.
- Ability to perform hands-on troubleshooting in onsite, high-pressure environments, including during major outages.
- Experience supporting disaster recovery planning, testing, and execution.
- Ability to maintain technical documentation, runbooks, and operational procedures.
Preferred Skills and Qualifications:
- Experience supporting federal government environments.
- Familiarity with high-availability and geographically distributed infrastructure environments.
- Experience with server, storage, virtualization (e.g., VMware/Hyper-V), and enterprise networking technologies.
- Knowledge of ITIL-based incident and operations management processes.
- Experience supporting disaster recovery execution and continuity of operations (COOP).
- Strong ability to operate independently in isolated or degraded network conditions.
- Excellent communication skills for coordination with stakeholders and multi-team incident response efforts.