xlex
takes an Excel formula and separates it into tokens. The name is
a bad pun on 'Excel' and 'lexer'. It returns a dataframe, one row per token,
giving the token itself, its type (e.g. number
, or error
), and its
level.
The level is a number to show the depth of a token within nested function
calls. The token A2
in the formula IF(A1=1,A2,MAX(A3,A4))
is at level 1.
Tokens A3
and A4
are at level 2. The token IF
is at level 0, which is
the outermost level.
The output isn't enough to enable computation or validation of formulas, but it is enough to investigate the structure of formulas and spreadsheets. It has been tested on millions of formulas in the Enron corpus.
Value
A data frame (a tibble, if you use the tidyverse) one row per token,
giving the token itself, its type (e.g. number
, or error
), and its
level.
A class attribute xlex
is added, so that the base::print()
generic can be
specialised to print the tree prettily.
Details
The different types of tokens are:
ref
A cell reference/address e.g.A1
or$B2:C$14
.sheet
A sheet name, e.g.Sheet1!
or'My Sheet'!
. If the sheet is from a different file, then the file is included in this token -- usually it has been normalized to the form[0]
.name
A named range, or more properly a named formula.function
An Excel or user-defined function, e.g.MAX
or_xll.MY_CUSTOM_FUNCTION
. A complete list of official Excel functions is available in the vectorexcel_functions
.error
An error, e.g.#N/A
or#REF!
.bool
TRUE
orFALSE
-- note that there are also functionsTRUE()
andFALSE()
.number
All forms of numbers, e.g.1
,1.1
,-1
,1.2E3
.text
Strings inside double quotes, e.g."Hello, World!"
.operator
The usual infix operators,+
,-
,*
,/
,^
,<
,<=
,<>
, etc. and also the range operator:
when it is used with ranges that aren't cell addresses, e.g.INDEX(something):A1
. The union operator,
is the same symbol that is used to separate function arguments and array columns, so it is only taggedoperator
when it is inside parentheses that are not function parentheses or array curly braces (see the examples).paren_open
An open parenthesis(
indicating an increase in the level of nesting, but not directly enclosing function arguments.paren_close
Asopen
, but reducing the level of nesting.open_array
An open curly brace '{' indicating the start of an array of constants, and an increase in the level of nesting.close_array
Asopen_array
, but ending the array of constantsfun_open
An open parenthesis(
immediately after a function name, directly enclosing the function arguments.fun_close
Asfun_open
but immediately after the function arguments.separator
A comma,
separating function arguments or array columns, or a semicolon;
separating array rows.DDE
A call to a Dynamic Data Exchange server, usually normalized to the form[1]!'DDE_parameter=1'
, but the full form is'ABCD'|'EFGH'!'IJKL'
.space
Some old files haven't stripped formulas of meaningless spaces. They are returned asspace
tokens so that the original formula can always be reconstructed by concatenating all tokens.other
If you see this, then something has gone wrong -- please report it at https://github.com/nacnudus/tidyxl/issues with a reproducible example (e.g. using the reprex package).
Every part of the original formula is returned as a token, so the original formula can be reconstructed by concatenating the tokens. If that doesn't work, please report it at https://github.com/nacnudus/tidyxl/issues with a reproducible example (e.g. using the reprex package).
The XLParser project was a great help in creating the grammar. https://github.com/spreadsheetlab/XLParser.
Examples
# All explicit cell references/addresses are returned as a single 'ref' token.
xlex("A1")
#> root
#> °-- A1 ref
xlex("A$1")
#> root
#> °-- A$1 ref
xlex("$A1")
#> root
#> °-- $A1 ref
xlex("$A$1")
#> root
#> °-- $A$1 ref
xlex("A1:B2")
#> root
#> °-- A1:B2 ref
xlex("1:1") # Whole row
#> root
#> °-- 1:1 ref
xlex("A:B") # Whole column
#> root
#> °-- A:B ref
# If one part of an address is a name or a function, then the colon ':' is
# regarded as a 'range operator', so is tagged 'operator'.
xlex("A1:SOME.NAME")
#> root
#> ¦-- A1 ref
#> ¦-- : operator
#> °-- SOME.NAME name
xlex("SOME_FUNCTION():B2")
#> root
#> ¦-- SOME_FUNCTION function
#> ¦-- ( fun_open
#> ¦-- ) fun_close
#> ¦-- : operator
#> °-- B2 ref
xlex("SOME_FUNCTION():SOME.NAME")
#> root
#> ¦-- SOME_FUNCTION function
#> ¦-- ( fun_open
#> ¦-- ) fun_close
#> ¦-- : operator
#> °-- SOME.NAME name
# Sheet names are recognised by the terminal exclamation mark '!'.
xlex("Sheet1!A1")
#> root
#> ¦-- Sheet1! sheet
#> °-- A1 ref
xlex("'Sheet 1'!A1") # Quoted names may contain some punctuation
#> root
#> ¦-- 'Sheet 1'! sheet
#> °-- A1 ref
xlex("'It''s a sheet'!A1") # Quotes are escaped by doubling
#> root
#> ¦-- 'It''s a sheet'! sheet
#> °-- A1 ref
# Sheets can be ranged together in so-called 'three-dimensional formulas'.
# Both sheets are returned in a single 'sheet' token.
xlex("Sheet1:Sheet2!A1")
#> root
#> ¦-- Sheet1:Sheet2! sheet
#> °-- A1 ref
xlex("'Sheet 1:Sheet 2'!A1") # Quotes surround both (rather than each) sheet
#> root
#> ¦-- 'Sheet 1:Sheet 2'! sheet
#> °-- A1 ref
# Sheets from other files are prefixed by the filename, which Excel
# normalizes the filenames into indexes. Either way, xlex() includes the
# file/index in the 'sheet' token.
xlex("[1]Sheet1!A1")
#> root
#> ¦-- [1]Sheet1! sheet
#> °-- A1 ref
xlex("'[1]Sheet 1'!A1") # Quotes surround both the file index and the sheet
#> root
#> ¦-- '[1]Sheet 1'! sheet
#> °-- A1 ref
xlex("'C:\\My Documents\\[file.xlsx]Sheet1'!A1")
#> root
#> ¦-- 'C:\My Documents\[file.xlsx]Sheet1'! sheet
#> °-- A1 ref
# Function names are recognised by the terminal open-parenthesis '('. There
# is no distinction between custom functions and built-in Excel functions.
# The open-parenthesis is tagged 'fun_open', and the corresponding
# close-parenthesis at the end of the arguments is tagged 'fun_close'.
xlex("MAX(1,2)")
#> root
#> ¦-- MAX function
#> °-- ( fun_open
#> ¦-- 1 number
#> ¦-- , separator
#> °-- 2 number
#> °-- ) fun_close
xlex("_xll.MY_CUSTOM_FUNCTION()")
#> root
#> ¦-- _xll.MY_CUSTOM_FUNCTION function
#> ¦-- ( fun_open
#> °-- ) fun_close
# Named ranges (properly called 'named formulas') are a last resort after
# attempting to match a function (ending in an open parenthesis '(') or a
# sheet (ending in an exclamation mark '!')
xlex("MY_NAMED_RANGE")
#> root
#> °-- MY_NAMED_RANGE name
# Some cell addresses/references, functions and names can look alike, but
# xlex() should always make the right choice.
xlex("XFD1") # A cell in the maximum column in Excel
#> root
#> °-- XFD1 ref
xlex("XFE1") # Beyond the maximum column, must be a named range/formula
#> root
#> °-- XFE1 name
xlex("A1048576") # A cell in the maximum row in Excel
#> root
#> °-- A1048576 ref
xlex("A1048577") # Beyond the maximum row, must be a named range/formula
#> root
#> °-- A1048577 name
xlex("LOG10") # A cell address
#> root
#> °-- LOG10 ref
xlex("LOG10()") # A log function
#> root
#> ¦-- LOG10 function
#> ¦-- ( fun_open
#> °-- ) fun_close
xlex("LOG:LOG") # The whole column 'LOG'
#> root
#> °-- LOG:LOG ref
xlex("LOG") # Not a cell address, must be a named range/formula
#> root
#> °-- LOG name
xlex("LOG()") # Another log function
#> root
#> ¦-- LOG function
#> ¦-- ( fun_open
#> °-- ) fun_close
xlex("A1.2!A1") # A sheet called 'A1.2'
#> root
#> ¦-- A1.2! sheet
#> °-- A1 ref
# Text is surrounded by double-quotes.
xlex("\"Some text\"")
#> root
#> °-- "Some text" text
xlex("\"Some \"\"text\"\"\"") # Double-quotes within text are escaped by
#> root
#> °-- "Some ""text""" text
# Numbers are signed where it makes sense, and can be scientific
xlex("1")
#> root
#> °-- 1 number
xlex("1.2")
#> root
#> °-- 1.2 number
xlex("-1")
#> root
#> °-- -1 number
xlex("-1-1")
#> root
#> ¦-- -1 number
#> ¦-- - operator
#> °-- 1 number
xlex("-1+-1")
#> root
#> ¦-- -1 number
#> ¦-- + operator
#> °-- -1 number
xlex("MAX(-1-1)")
#> root
#> ¦-- MAX function
#> °-- ( fun_open
#> ¦-- - operator
#> ¦-- 1 number
#> ¦-- - operator
#> °-- 1 number
#> °-- ) fun_close
xlex("-1.2E-3")
#> root
#> °-- -1.2E-3 number
# Booleans can be constants or functions, and names can look like booleans,
# but xlex() should always make the right choice.
xlex("TRUE")
#> root
#> °-- TRUE bool
xlex("TRUEISH")
#> root
#> °-- TRUEISH name
xlex("TRUE!A1")
#> root
#> ¦-- TRUE! sheet
#> °-- A1 ref
xlex("TRUE()")
#> root
#> ¦-- TRUE function
#> ¦-- ( fun_open
#> °-- ) fun_close
# Errors are tagged 'error'
xlex("#DIV/0!")
#> root
#> °-- #DIV/0! error
xlex("#N/A")
#> root
#> °-- #N/A error
xlex("#NAME?")
#> root
#> °-- #NAME? error
xlex("#NULL!")
#> root
#> °-- #NULL! error
xlex("#NUM!")
#> root
#> °-- #NUM! error
xlex("#REF!")
#> root
#> °-- #REF! error
xlex("#VALUE!")
#> root
#> °-- #VALUE! error
# Operators with more than one character are treated as single tokens
xlex("1<>2")
#> root
#> ¦-- 1 number
#> ¦-- <> operator
#> °-- 2 number
xlex("1<=2")
#> root
#> ¦-- 1 number
#> ¦-- <= operator
#> °-- 2 number
xlex("1<2")
#> root
#> ¦-- 1 number
#> ¦-- < operator
#> °-- 2 number
xlex("1=2")
#> root
#> ¦-- 1 number
#> ¦-- = operator
#> °-- 2 number
xlex("1&2")
#> root
#> ¦-- 1 number
#> ¦-- & operator
#> °-- 2 number
xlex("1 2")
#> root
#> ¦-- 1 number
#> ¦-- operator
#> °-- 2 number
xlex("(1,2)")
#> root
#> °-- ( paren_open
#> ¦-- 1 number
#> ¦-- , operator
#> °-- 2 number
#> °-- ) paren_close
xlex("1%") # postfix operator
#> root
#> ¦-- 1 number
#> °-- % operator
# The union operator is a comma ',', which is the same symbol that is used
# to separate function arguments or array columns. It is tagged 'operator'
# only when it is inside parentheses that are not function parentheses or
# array curly braces. The curly braces are tagged 'array_open' and
# 'array_close'.
tidyxl::xlex("A1,B2") # invalid formula, defaults to 'union' to avoid a crash
#> root
#> ¦-- A1 ref
#> ¦-- , operator
#> °-- B2 ref
tidyxl::xlex("(A1,B2)")
#> root
#> °-- ( paren_open
#> ¦-- A1 ref
#> ¦-- , operator
#> °-- B2 ref
#> °-- ) paren_close
tidyxl::xlex("MAX(A1,B2)")
#> root
#> ¦-- MAX function
#> °-- ( fun_open
#> ¦-- A1 ref
#> ¦-- , separator
#> °-- B2 ref
#> °-- ) fun_close
tidyxl::xlex("SMALL((A1,B2),1)")
#> root
#> ¦-- SMALL function
#> °-- ( fun_open
#> °-- ( paren_open
#> ¦-- A1 ref
#> ¦-- , operator
#> °-- B2 ref
#> ¦-- ) paren_close
#> ¦-- , separator
#> °-- 1 number
#> °-- ) fun_close
# Function arguments are separated by commas ',', which are tagged
# 'separator'.
xlex("MAX(1,2)")
#> root
#> ¦-- MAX function
#> °-- ( fun_open
#> ¦-- 1 number
#> ¦-- , separator
#> °-- 2 number
#> °-- ) fun_close
# Nested functions are marked by an increase in the 'level'. The level
# increases inside parentheses, rather than at the parentheses. Curly
# braces, for arrays, have the same behaviour, as do subexpressions inside
# ordinary parenthesis, tagged 'paren_open' and 'paren_close'. To see the
# levels explicitly (rather than by the pretty printing), print as a normal
# data frame or tibble by specifying `pretty = FALSE`.
# class with as.data.frame.
xlex("MAX(MIN(1,2),3)")
#> root
#> ¦-- MAX function
#> °-- ( fun_open
#> ¦-- MIN function
#> °-- ( fun_open
#> ¦-- 1 number
#> ¦-- , separator
#> °-- 2 number
#> ¦-- ) fun_close
#> ¦-- , separator
#> °-- 3 number
#> °-- ) fun_close
xlex("{1,2;3,4}")
#> root
#> °-- { open_array
#> ¦-- 1 number
#> ¦-- , separator
#> ¦-- 2 number
#> ¦-- ; separator
#> ¦-- 3 number
#> ¦-- , separator
#> °-- 4 number
#> °-- } close_array
xlex("1*(2+3)")
#> root
#> ¦-- 1 number
#> ¦-- * operator
#> °-- ( paren_open
#> ¦-- 2 number
#> ¦-- + operator
#> °-- 3 number
#> °-- ) paren_close
print(xlex("1*(2+3)"), pretty = FALSE)
#> # A tibble: 7 × 3
#> level type token
#> <int> <chr> <chr>
#> 1 0 number 1
#> 2 0 operator *
#> 3 0 paren_open (
#> 4 1 number 2
#> 5 1 operator +
#> 6 1 number 3
#> 7 0 paren_close )
# Arrays are marked by opening and closing curly braces, with comma ','
# between columns, and semicolons ';' between rows Commas and semicolons are
# both tagged 'separator'. Arrays contain only constants, which are
# booleans, numbers, text, and errors.
xlex("MAX({1,2;3,4})")
#> root
#> ¦-- MAX function
#> °-- ( fun_open
#> °-- { open_array
#> ¦-- 1 number
#> ¦-- , separator
#> ¦-- 2 number
#> ¦-- ; separator
#> ¦-- 3 number
#> ¦-- , separator
#> °-- 4 number
#> °-- } close_array
#> °-- ) fun_close
xlex("=MAX({-1E-2,TRUE;#N/A,\"Hello, World!\"})")
#> root
#> ¦-- = operator
#> ¦-- MAX function
#> °-- ( fun_open
#> °-- { open_array
#> ¦-- -1E-2 number
#> ¦-- , separator
#> ¦-- TRUE bool
#> ¦-- ; separator
#> ¦-- #N/A error
#> ¦-- , separator
#> °-- "Hello, World!" text
#> °-- } close_array
#> °-- ) fun_close
# Structured references are surrounded by square brackets. Subexpressions
# may also be surrounded by square brackets, but xlex() returns the whole
# expression in a single 'structured_ref' token.
xlex("[@col2]")
#> root
#> °-- [@col2] structured_ref
xlex("SUM([col22])")
#> root
#> ¦-- SUM function
#> °-- ( fun_open
#> °-- [col22] structured_ref
#> °-- ) fun_close
xlex("Table1[col1]")
#> root
#> °-- Table1[col1] structured_ref
xlex("Table1[[col1]:[col2]]")
#> root
#> °-- Table1[[col1]:[col2]] structured_ref
xlex("Table1[#Headers]")
#> root
#> °-- Table1[#Headers] structured_ref
xlex("Table1[[#Headers],[col1]]")
#> root
#> °-- Table1[[#Headers],[col1]] structured_ref
xlex("Table1[[#Headers],[col1]:[col2]]")
#> root
#> °-- Table1[[#Headers],[col1]:[col2]] structured_ref
# DDE calls (Dynamic Data Exchange) are normalized by Excel into indexes.
# Either way, xlex() includes everything in one token.
xlex("[1]!'DDE_parameter=1'")
#> root
#> °-- [1]!'DDE_parameter=1' DDE
xlex("'Quote'|'NYSE'!ZAXX")
#> root
#> °-- 'Quote'|'NYSE'!ZAXX DDE
# Meaningless spaces that appear in some old files are returned as 'space'
# tokens, so that the original formula can still be recovered by
# concatenating all the tokens. Spaces between function names and their open
# parenthesis have not been observed, so are not permitted.
xlex(" MAX( A1 ) ")
#> root
#> ¦-- operator
#> ¦-- MAX function
#> °-- ( fun_open
#> ¦-- operator
#> ¦-- A1 ref
#> °-- operator
#> ¦-- ) fun_close
#> °-- operator
# print.xlex() invisibly returns the original argument, so that it can be
# used in magrittr pipelines.
str(print(xlex("ROUND(A1*2")))
#> root
#> ¦-- ROUND function
#> °-- ( fun_open
#> ¦-- A1 ref
#> ¦-- * operator
#> °-- 2 number
#> xlex [5 × 3] (S3: xlex/tbl_df/tbl/data.frame)
#> $ level: int [1:5] 0 0 1 1 1
#> $ type : chr [1:5] "function" "fun_open" "ref" "operator" ...
#> $ token: chr [1:5] "ROUND" "(" "A1" "*" ...