Resume Parsing, also known as CV Parsing, Resume Extraction or CV Extraction, is the conversion of a free-form resume document into structured data in searchable file formats such as XML or JSON. This makes the extracted data suitable for storage, reporting and manipulation by a computer.
Recruitment agencies, job board providers and internal hiring managers work with resume parsing tools to automate the storage and analysis of CV/Resume data. This saves recruiters time by eliminating manual processing of each job application and CV they receive and streamlines their hiring process.
The most common resume format is MS Word. Although a Word document is easy for humans to read and understand, it is quite difficult for a computer to interpret. Our brains derive meaning from context, taking into account the situation and the surrounding words. To a computer, by contrast, a resume is just a long sequence of letters, numbers and punctuation.
What is a resume parser?
A resume parser is a type of recruiting software that can analyze a document and extract from it the elements of what the writer actually meant to say. In the case of a resume, the information is all about various candidate details, including skills, work experience, education, contact details and achievements.
Extracting data and interpreting meaning is surprisingly difficult for a computer because:
- Language is infinitely varied. For example, there are hundreds of ways to write down a date and countless ways to write what you did in your last job. A resume parsing tool captures all these different ways of writing the same thing through complex rules and statistical algorithms.
- Language is ambiguous. The same word or phrase can mean different things in different contexts.
For example:
- "M.D." can mean a variety of things. It may stand for "Medical Doctor"; if you are in the UK, you may immediately think of "Managing Director"; or, if you are more familiar with the Mid-Atlantic region of the US, "Maryland" may spring to mind.
- The term "Project Manager" may indicate that the writer was a project manager, but it means something quite different in a sentence like "I used to report to the Project Manager".
The only way resume parsing technology can resolve these ambiguities is by understanding and analyzing the context in which they are used. A good resume parser uses complex rules, machine learning and statistical algorithms to be "intelligent."
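As a rough illustration of using context, the sketch below looks at the words immediately before a job title for cues that the title belongs to someone else. The cue list and the function name held_title are hypothetical, and the rule is far simpler than anything a production parser would rely on.

```python
import re
from typing import Optional

# Hypothetical cue phrases: if one of these sits directly before the title,
# the title probably describes another person, not the writer.
OTHER_PERSON_CUES = re.compile(
    r"\b(?:report(?:s|ed|ing)?\s+to|worked\s+(?:with|under|for))"
    r"\s+(?:the|a|an|our|my)?\s*$",
    re.IGNORECASE,
)


def held_title(sentence: str, title: str) -> Optional[bool]:
    """Guess whether the writer held `title`.

    Returns True if the title appears with no "other person" cue before it,
    False if such a cue is found, and None if the title is not mentioned.
    """
    match = re.search(re.escape(title), sentence, re.IGNORECASE)
    if match is None:
        return None
    preceding_text = sentence[:match.start()]
    return OTHER_PERSON_CUES.search(preceding_text) is None


print(held_title("Project Manager at Acme Ltd, 2018 to 2021", "Project Manager"))
print(held_title("I used to report to the Project Manager", "Project Manager"))
```

The first call treats the title as the writer's own role and the second does not, because "report to the" precedes it. A single hand-written rule like this breaks easily, which is why real parsers combine many rules with statistical models.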
What does a resume parser do?
A resume parser works by extracting information such as a candidate’s professional skills, work experience, education history, contact details and achievements from their CV/resume, so this information is ready for storage in a resume database or applicant tracking system.
Resume parsing software can also be used to extract data from job descriptions such as the job title, job type, location of the position, salary details, contact information, required qualifications and skills.
A high-performance resume parser will detect and provide data from more fields than a low-performance one. For example, Daxtra’s resume parser detects and extracts data from over 150 document fields and converts these extracted fields into structured XML or JSON data.
This resume parsing process helps ensure that important candidate information is converted into a format ready for loading and storage within a candidate database or applicant tracking system.
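To make "structured data" concrete, here is a minimal sketch of a parsed candidate record serialized to JSON. The class names, field names and sample values are invented for illustration and do not reflect the schema of Daxtra or any other parser.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Position:
    job_title: str
    employer: str
    start_date: str  # ISO-style year-month string, e.g. "2018-03"
    end_date: str


@dataclass
class Candidate:
    name: str
    email: str
    skills: List[str] = field(default_factory=list)
    work_history: List[Position] = field(default_factory=list)


def to_json(candidate: Candidate) -> str:
    """Serialize the record (including nested positions) to a JSON string."""
    return json.dumps(asdict(candidate), indent=2)


# Illustrative values only; a real parser would fill these from the resume.
candidate = Candidate(
    name="Jane Doe",
    email="jane.doe@example.com",
    skills=["Python", "SQL"],
    work_history=[Position("Project Manager", "Acme Ltd", "2018-03", "2021-06")],
)
print(to_json(candidate))
```

Once the data is in this shape, a database or applicant tracking system can index and query individual fields instead of searching raw text.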
What are the different types of resume parser?
There are three types of resume parser: keyword-based resume parsers, grammar-based resume parsers, and statistical resume parsers.
Keyword-based parsers are the simplest and the least accurate. They work by identifying words, phrases and simple patterns in the resume's text and then applying heuristic algorithms to the text around these words.
For example, they may look for something that looks like a postal code in the resume and try to interpret the words that surround it as an address.
These resume parsers are the least accurate because they cannot extract information that does not sit near one of their keywords, and when a keyword has multiple meanings (e.g., "Director"), a keyword-based parser will frequently guess the wrong interpretation. As a result, it is generally hard to get beyond 70% accuracy with a keyword-based parser.
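The postal code example can be sketched in a few lines. This is a toy heuristic under stated assumptions: the pattern only approximates UK postcodes, and the function name guess_address and the sample text are made up for illustration.

```python
import re
from typing import Optional

# Rough approximation of a UK postcode such as "EH1 1YZ" or "AB12 3CD".
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")


def guess_address(resume_text: str) -> Optional[str]:
    """Return the first line containing something postcode-like, or None."""
    for line in resume_text.splitlines():
        if POSTCODE.search(line):
            # Heuristic: assume the whole line around the postcode is the address.
            return line.strip()
    return None


resume = """Jane Doe
12 High Street, Edinburgh EH1 1YZ
Skills: Python, SQL"""

print(guess_address(resume))
print(guess_address("Invoice ref AB12 3CD"))
```

The second call shows the weakness described above: any text that happens to look like a postcode is treated as an address, because the heuristic has no notion of context.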
Grammar-based parsers, by contrast, contain an enormous number of grammatical rules that seek to understand the context of the occurrence of every word in the resume. These same grammars also combine words and phrases to make complex structures that capture the meaning of every sentence in the resume.
These parsers are more complex than keyword-based parsers, but they tend to capture more detail and can distinguish between the different meanings that a word or phrase might have in different contexts.
It is possible to build highly accurate grammar-based parsers, with accuracy rates above 90%. The downside is that they require a lot of manual encoding by skilled language engineers and frequent testing to ensure that improvements in one area do not negatively affect performance in another.
Statistical parsers attempt to apply numerical models of text to identify structure in resumes. Like grammar-based parsers, statistical parsers can distinguish between different contexts of the same word or phrase and can capture a wide variety of structures such as addresses and timelines.
For maximum accuracy, statistical parsers require a vast number of resumes to be manually marked up with all the information that needs to be extracted before those resumes are used for training. As a result, pure statistical parsers generally perform better than keyword-based parsers but not as well as grammar-based parsers on data on which they have not been trained.
Statistical parsers can, however, achieve very high accuracy on the data on which they were trained, but this is not usually very useful since this data is, by definition, old data that will not be seen again.
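The idea of learning from manually marked-up examples can be sketched with a tiny naive Bayes classifier that labels resume lines by section. This is a swapped-in textbook technique chosen for brevity, not a description of how any commercial parser works; the class name LineClassifier, the labels and the four training lines are all invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class LineClassifier:
    """Naive Bayes over word counts, with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of lines
        self.vocabulary = set()

    def train(self, labelled_lines):
        for text, label in labelled_lines:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocabulary.update(tokens)

    def predict(self, text):
        total_lines = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_lines in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_lines / total_lines)  # log prior
            for token in tokenize(text):
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hand-labelled lines stand in for the manually marked-up resumes.
classifier = LineClassifier()
classifier.train([
    ("Software Engineer at Acme Ltd", "experience"),
    ("Project Manager at Globex", "experience"),
    ("BSc Computer Science, University of Leeds", "education"),
    ("MSc Data Science, University of Edinburgh", "education"),
])
print(classifier.predict("Data Engineer at Initech"))
print(classifier.predict("PhD Physics, University of Oxford"))
```

With only four training lines the model generalizes from words like "at" and "University", which hints at why real statistical parsers need very large annotated collections to cope with unseen resumes.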
What are the measurements of a good resume parser?
Different resume parsers make different claims as to how well they parse resumes. The two key measurements you should look for in a resume parser are coverage and accuracy.
Coverage - describes what a resume parser actually tries to extract. All resume parsers try to extract contact information for the candidates, and most extract skills, work histories and qualifications. Some resume parsers (including Daxtra's) extract referees, hobbies, candidate summaries, desired salary, desired location, nationality, visa status, professional certifications, and other fields. All of this information is required to create a complete record for the candidate, and in general, the more information the resume parser extracts, the better.
Accuracy - describes how good a resume parser is at identifying information from a resume. Accuracy measures how often the resume parser is right. For example, an accuracy of 95% on identifying names means that the parser correctly extracts the candidate's name from 95% of all incoming resumes. This measure is important because the lower the accuracy, the more it costs you to correct the resume parser's errors.
Different parsers will perform with varying accuracies on different sets of data, so if accurate parsing is important to you, the only way to find the best parser is to test each candidate parser on a sample of your own data.
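One simple way to run such a test is to hand-check a sample of resumes and compare the parser's output field by field. The sketch below assumes each result is a plain dictionary; the function name field_accuracy and the sample records are hypothetical.

```python
def field_accuracy(parsed, expected, field_name):
    """Fraction of records where the parser's value matches the hand-checked one."""
    if not expected or len(parsed) != len(expected):
        raise ValueError("parsed and expected must be non-empty and the same length")
    correct = sum(
        1 for got, want in zip(parsed, expected)
        if got.get(field_name) == want.get(field_name)
    )
    return correct / len(expected)


# Hand-checked values for four sample resumes (illustrative data only).
expected = [
    {"name": "Jane Doe"},
    {"name": "John Smith"},
    {"name": "Priya Patel"},
    {"name": "Li Wei"},
]
# What a hypothetical parser returned for the same four resumes.
parsed = [
    {"name": "Jane Doe"},
    {"name": "John Smith"},
    {"name": "Curriculum Vitae"},  # a typical mistake: a heading read as a name
    {"name": "Li Wei"},
]

print(field_accuracy(parsed, expected, "name"))  # 3 of 4 correct -> 0.75
```

Running the same comparison for every field you care about, across each parser under evaluation, gives numbers that reflect your own data instead of a vendor's test set.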
Learn more about resume parsing
If you’d like to learn more about resume parsing software, our Ultimate Guide to CV/Resume Parsing takes a deeper look into some of the benefits that resume parsing software can bring to your business.
Our guide, How to Choose a CV/Resume Parser, walks you through the criteria you need to consider when choosing a resume parser to ensure you make the right choice for your business.