Candidates for this position should be willing to participate in on-call responsibilities and provide support outside of normal work hours as needed. They should also be willing to have their personal mobile phone enrolled to receive system alert notifications via SMS and company email.
Site Reliability Engineering Experience:
* Broad experience with the design and implementation of application and infrastructure monitoring through APMs, web synthetic monitors, ticketing, notification, reporting and dashboarding platforms.
* Excellent troubleshooting skills and knowledge of the infrastructure, middleware, and application layers.
* Very comfortable using diagnostic tools such as Fiddler, Chrome DevTools, Splunk, Dynatrace Synthetics, etc.
- NOTE that a strong background in Splunk is required.
* Strong experience supporting complex, large-scale, service-oriented, web-based transactional systems, along with their common integration patterns and associated protocols and interfaces such as REST, SOAP, message queues, custom services, etc.
* Solid foundation in the security aspects of authentication and authorization schemes: SSL certificates, cookies, integrated auth, basic auth, OAuth, and SAML.
* Hands-on experience in application load testing, analyzing load testing data, and providing recommendations on capacity management, availability, and performance.
* Deep experience in understanding and troubleshooting network layers, including DNS, CDN, firewalls, VPNs, MPLS, proxies/reverse proxies, and load balancers.
* Demonstrated ability to design, develop, test, and deploy automations that maintain and improve the health, stability, resiliency, and security of the application services and web sites.
* Experience managing automation code through repos such as Git, Bitbucket, TFS, etc.
* Experience defining relevant KPIs and producing reports and dashboards that provide the necessary insight into the health, stability, resiliency, and security of the application services, web sites, and SDLC activities.
* In addition to the Site Reliability discipline, we also want this individual to help augment the engineering and management of our Mulesoft API platform. Prior experience with Mulesoft is not required, but experience with similar API gateway technologies would be helpful, and the individual will be expected to come up to speed quickly on the technology and our implementation of the Mulesoft platform.
* Exceptional interpersonal and communication skills
* Ability to participate in 24/7 escalation on-call rotation and respond to mission-critical issues as needed
* Willingness and ability to attend early-morning or late-night meetings as required when interfacing with our Regional IT teams in Europe, LATAM, and APAC.
* Passion and drive to improve efficiencies in how we deliver IT services
SPECIFIC SKILL SETS PREFERRED/DESIRED, BUT NOT NECESSARILY REQUIRED:
* Mulesoft or other API Gateway platforms
* AzureDevOps or other CI/CD platforms
* Salesforce Service Cloud, Salesforce Marketing Cloud, Salesforce Community Cloud, Salesforce Commerce Cloud
* Heroku and AWS
* CA API Gateway
* Let's Encrypt
WHAT EXACTLY WILL THIS INDIVIDUAL BE WORKING ON?
As a member of the Site Reliability and Platform Engineering Team, this individual will act as a subject matter expert in the discipline of Site Reliability Engineering. They will work closely with the DevOps Product Owner to define and establish a roadmap of activities to mature the capabilities of the Mary Kay Site Reliability Engineering practice. They will work with the application, infrastructure, and security teams to establish a robust monitoring and notification scheme that gives IT staff visibility into and awareness of the health and availability of business-critical applications. They will perform hands-on design, development, testing, documentation, and deployment of various monitors, automation, reporting, and dashboards. They will collaborate closely with the Enterprise Monitoring and ServiceNow Group to evolve and develop the monitoring, alerting, and auto-remediation capabilities needed to support the Site Reliability Engineering discipline. They will also provide training and mentoring to other team members to develop the skills and competencies of our regional COE teams.
This role will be expected to help triage major incidents, assist in root cause analysis, and use that information to help drive remediation activities that increase system stability and reliability.
In addition to the Site Reliability discipline, this individual will also help manage, maintain, and support the Mulesoft API integration platform.