What this Job Entails:
Our IRC (Incident Response Center) is the first layer of defense responsible for quick detection and incident response using various monitoring and automation tools, conducting thorough investigation of alerts, classification and triage. The IRC Analyst is responsible for delivering operations within the IRC across all datacenter sites in the respective regions. IRC analysts are expected to respond to all alarms/alerts set in Data Center Infrastructure Management (DCIM), Server Automation Operations System (SAOS), CCTV, Access Control Systems (ACS), and other functions (EHS, Security, etc), providing deep understanding and intelligence of the criticality and impact of the incidents to the resolver groups.
Incident & Problem Management
- Investigate and respond to alerts, take part in incident response (war rooms, remote bridges) and reporting, and carry out ongoing maintenance, tuning, and improvement of the detection signals.
- Respond to incidents and critical situations in a calm, problem-solving manner, and conduct in-depth investigation of alerts
- Be the first layer of defense responsible for quick detection and incident response using various monitoring and automation tools, conduct thorough investigation of alerts, classification and triage.
- Provide deep understanding and intelligence of the criticality and impact of the incidents to the resolver groups.
- Keep detailed records of alarm handling activities, including actions taken and resolutions, in ticketing tools, and file incident reports.
- Be available to coordinate as an incident commander in the event of an issue.
- Support program managers, facilitate project deliverables, and improve overall operational and engineering initiatives.
- Conduct root cause analysis (RCA) to trace recurring problems to their source.
- Employ in-depth questioning and analysis techniques, such as the Five Whys, to determine the underlying cause of an incident or problem.
- Perform duties in compliance with SOP.
Server, DCIM, Network and Traffic Alarms Operations
- Continuously monitor alarm dashboards and systems.
- Investigate and respond to alarms such as but not limited to Network, DC Environment, Server Health, Facility Security and Safety.
- Identify and acknowledge incidents associated with alarms.
- Assess incidents to determine their criticality and impact on operations.
- Engage the resolver group responsible for the incident, and escalate to higher tiers or management when necessary, following established escalation paths.
- Maintain clear and concise communication with relevant teams, stakeholders, and incident responders/resolvers.
- Follow documented procedures to resolve incidents promptly and effectively.
- Keep detailed records of alarm handling activities, including actions taken and resolutions, in ticketing tools.
- Perform duties in compliance with SOP.
Threat Intelligence & Critical Event Management
- Monitor Everbridge's Visual Command Center (VCC), International SOS e-mails, and other open-source tools for real-time incidents impacting ByteDance assets and travelers.
- Monitor directed tools or queries for specific requests from stakeholders.
- Send notifications about violence, inclement weather, and threats to life, property, and assets.
- Coordinate emergency response efforts, including liaising with law enforcement if needed.
- Conduct research to verify the accuracy and relevance of the information through additional sources.
- Create a heatmap of the affected area to highlight locations impacted by a specific event or series of events.
- Collaborate with other security and operational teams for a coordinated response.
- Implement incident containment and mitigation strategies.
- Document incident details, response actions, and lessons learned.
- Perform duties in compliance with SOP.
Physical Security and Safety
- Perform basic monitoring of Closed-Circuit Television (CCTV) systems and Access Control Systems (ACS).
- Monitor safety alarms and communication channels for events that pose a risk to the safety of personnel or the data center infrastructure, such as but not limited to electrical incidents, fire and environmental hazards, equipment failure, chemical exposure, and water leaks.
- Conduct audits of camera footage to ensure proper functioning, video quality, and coverage of critical areas.
- Respond to access control incidents and anomalies.
- Report findings to the security and safety engineers, and relevant stakeholders promptly.
- Perform duties in compliance with SOP.
Badge Management
- Perform badge enrolment, ensure that all requests go through the proper approval process, and assess the accuracy and completeness of each request in compliance with SOP.
- Program access cards in response to access requests, such as but not limited to new or temporary access requests via email/ticket, and revoke badge access during off-boarding.
- Generate access log reports.
- Conduct access log audits.
Continuous Service Improvement
- Identify areas of improvement within current service delivery processes.
- Implement changes that lead to measurable enhancements in service quality, efficiency, and customer satisfaction.
- Establish a culture of continuous improvement within the organization.
- Establish mechanisms for ongoing feedback collection from customers and employees.
- Integrate feedback into future continuous improvement efforts.
Required Qualifications/Skills:
- 2+ years of experience in a command center, service center, or similar 24x7 operations center environment.
- Ability to quickly triage multiple incidents and assign the right priority based on risk and confidence levels
- Knowledge of technical elements associated with systems such as IP Networks, DC Environment and Server Health.
- Outstanding verbal and written communication skills, the ability to work with minimal direction and meet goals, attention to detail, and an eye for continuous improvement.
- Ability to interact successfully at all levels of the organization, including with clients, while functioning as a team player.
- Basic working knowledge of data protection policies such as GDPR and the need to keep sensitive information secure.
- The XOC Analyst is expected to work at a ByteDance datacenter site. This is an on-site role.
- Willingness to work flexible schedules/shifts/areas, including weekends, nights, and holidays.
- Excellent verbal and written communication skills in English
- Ability to use ticket management systems effectively
- Understanding of networking components and infrastructures
- Understanding of Data Center best practices (e.g. basic fault tolerance, cable routing, calculating power usage)
- Strong organizational skills
Preferred Qualifications:
- Diploma/Degree in Information Technology.
- Works well under pressure and within time/budget constraints to solve problems and complete deliverables.
- Experience with Ticketing, Grafana, Servers and Data Center Systems.
- Working knowledge of and/or certifications such as CompTIA Server+ or Schneider Electric Data Center Certified Associate (DCCA).
- Knowledge of Lenel and Avigilon systems is a plus.
- Hands-on experience in electrical, HVAC, and data center infrastructures
- Working knowledge of networking components and infrastructures
- Ability to adapt to changing priorities, conditions, and circumstances
What We Do
Astreya is the leading IT solutions provider for some of the world's most recognizable and innovative organizations. Our journey started in 2001 in the heart of Silicon Valley and now reaches thirty-three countries with over 2,200 IT professionals. We enable businesses to make better decisions, achieve operational efficiency, and gain a competitive edge. The Astreya advantage is centered around focus and clear vision, world-class talent, and innovative technology. Creativity is in our DNA. Our dedicated Software and Service Innovation teams bring best-in-class technology and tools to bear for our clients.