lex_xl
takes an Excel formula and separates it into tokens. It returns a
dataframe, one row per token, giving the token itself, its type (e.g.
number
, or error
), and its level.
The level is a number to show the depth of a token within nested function
calls. The token A2
in the formula IF(A1=1,A2,MAX(A3,A4))
is at level 1.
Tokens A3
and A4
are at level 2. The token IF
is at level 0, which is
the outermost level.
The output isn't enough to enable computation or validation of formulas, but it is enough to investigate the structure of formulas and spreadsheets. It has been tested on millions of formulas in the Enron corpus.
lex_xl(x)
x | Character vector of length 1, giving the formula. |
---|
A data frame (a tibble, if you use the tidyverse) one row per token,
giving the token itself, its type (e.g. number
, or error
), and its
level.
The different types of tokens are:
ref
A cell reference/address e.g. A1
or $B2:C$14
.
sheet
A sheet name, e.g. Sheet1!
or 'My Sheet'!
. If the sheet is
from a different file, then the file is included in this token -- usually
it has been normalized to the form [0]
.
name
A named range, or more properly a named formula.
function
An Excel or user-defined function, e.g. MAX
or
_xll.MY_CUSTOM_FUNCTION
. A complete list of official Excel functions is
available in the vector excel_functions
.
error
An error, e.g. #N/A
or #REF!
.
bool
TRUE
or FALSE
-- note that there are also functions TRUE()
and
FALSE()
.
number
All forms of numbers, e.g. 1
, 1.1
, -1
, 1.2E3
.
text
Strings inside double quotes, e.g. "Hello, World!"
.
operator
The usual infix operators, +
, -
, *
, /
, ^
, <
, <=
,
<>
, etc. and also the range operator :
when it is used with ranges
that aren't cell addresses, e.g. INDEX(something):A1
. The union operator
,
is the same symbol that is used to separate function arguments and
array columns, so it is only tagged operator
when it is inside
parentheses that are not function parentheses or array curly braces (see
the examples).
paren_open
An open parenthesis (
indicating an increase in the level
of nesting, but not directly enclosing function arguments.
paren_close
As open
, but reducing the level of nesting.
open_array
An open curly brace '{' indicating the start of an array
of constants, and an increase in the level of nesting.
close_array
As open_array
, but ending the array of constants
fun_open
An open parenthesis (
immediately after a function name,
directly enclosing the function arguments.
fun_close
As fun_open
but immediately after the function
arguments.
separator
A comma ,
separating function arguments or array
columns, or a semicolon ;
separating array rows.
DDE
A call to a Dynamic Data Exchange server, usually normalized to
the form [1]!'DDE_parameter=1'
, but the full form is
'ABCD'|'EFGH'!'IJKL'
.
space
Some old files haven't stripped formulas of meaningless
spaces. They are returned as space
tokens so that the original formula
can always be reconstructed by concatenating all tokens.
other
If you see this, then something has gone wrong -- please
report it at https://github.com/nacnudus/lexl/issues with a
reproducible example (e.g. using the reprex package).
Every part of the original formula is returned as a token, so the original formula can be reconstructed by concatenating the tokens. If that doesn't work, please report it at https://github.com/nacnudus/lexl/issues with a reproducible example (e.g. using the reprex package).
The XLParser project was a great help in creating the grammar. https://github.com/spreadsheetlab/XLParser.
plot.lexl()
, lexl::lexl_igraph()
, lexl::demo_lexl()
# All explicit cell references/addresses are returned as a single 'ref' # token. lex_xl("A1")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref A1lex_xl("A$1")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref A$1lex_xl("$A1")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref $A1lex_xl("$A$1")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref $A$1lex_xl("A1:B2")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref A1:B2lex_xl("1:1") # Whole row#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref 1:1lex_xl("A:B") # Whole column#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref A:B# If one part of an address is a name or a function, then the colon ':' is # regarded as a 'range operator', so is tagged 'operator'. lex_xl("A1:SOME.NAME")#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref A1 #> 2 0 operator : #> 3 0 name SOME.NAMElex_xl("SOME_FUNCTION():B2")#> # A tibble: 5 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function SOME_FUNCTION #> 2 0 fun_open ( #> 3 0 fun_close ) #> 4 0 operator : #> 5 0 ref B2lex_xl("SOME_FUNCTION():SOME.NAME")#> # A tibble: 5 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function SOME_FUNCTION #> 2 0 fun_open ( #> 3 0 fun_close ) #> 4 0 operator : #> 5 0 name SOME.NAME# Sheet names are recognised by the terminal exclamation mark '!'. lex_xl("Sheet1!A1")#> # A tibble: 2 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 sheet Sheet1! #> 2 0 ref A1lex_xl("'Sheet 1'!A1") # Quoted names may contain some punctuation#> # A tibble: 2 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 sheet 'Sheet 1'! #> 2 0 ref A1lex_xl("'It''s a sheet'!A1") # Quotes are escaped by doubling#> # A tibble: 2 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 sheet 'It''s a sheet'! #> 2 0 ref A1# Sheets can be ranged together in so-called 'three-dimensional formulas'. # Both sheets are returned in a single 'sheet' token. lex_xl("Sheet1:Sheet2!A1")#> # A tibble: 2 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 sheet Sheet1:Sheet2! #> 2 0 ref A1lex_xl("'Sheet 1:Sheet 2'!A1") # Quotes surround both sheets (not each)#> # A tibble: 2 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 sheet 'Sheet 1:Sheet 2'! #> 2 0 ref A1# Sheets from other files are prefixed by the filename, which Excel # normalizes the filenames into indexes. Either way, lex_xl() includes the # file/index in the 'sheet' token. lex_xl("[1]Sheet1!A1")#> # A tibble: 2 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 sheet [1]Sheet1! #> 2 0 ref A1lex_xl("'[1]Sheet 1'!A1") # Quotes surround both the file index and the sheet#> # A tibble: 2 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 sheet '[1]Sheet 1'! #> 2 0 ref A1lex_xl("'C:\\My Documents\\[file.xlsx]Sheet1'!A1")#> # A tibble: 2 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 sheet "'C:\\My Documents\\[file.xlsx]Sheet1'!" #> 2 0 ref A1# Function names are recognised by the terminal open-parenthesis '('. There # is no distinction between custom functions and built-in Excel functions. # The open-parenthesis is tagged 'fun_open', and the corresponding # close-parenthesis at the end of the arguments is tagged 'fun_close'. lex_xl("MAX(1,2)")#> # A tibble: 6 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function MAX #> 2 0 fun_open ( #> 3 1 number 1 #> 4 1 separator , #> 5 1 number 2 #> 6 0 fun_close )lex_xl("_xll.MY_CUSTOM_FUNCTION()")#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function _xll.MY_CUSTOM_FUNCTION #> 2 0 fun_open ( #> 3 0 fun_close )# Named ranges (properly called 'named formulas') are a last resort after # attempting to match a function (ending in an open parenthesis '(') or a # sheet (ending in an exclamation mark '!') lex_xl("MY_NAMED_RANGE")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 name MY_NAMED_RANGE# Some cell addresses/references, functions and names can look alike, but # lex_xl() should always make the right choice. lex_xl("XFD1") # A cell in the maximum column in Excel#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref XFD1lex_xl("XFE1") # Beyond the maximum column, must be a named range/formula#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 name XFE1lex_xl("A1048576") # A cell in the maximum row in Excel#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref A1048576lex_xl("A1048577") # Beyond the maximum row, must be a named range/formula#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 name A1048577lex_xl("LOG10") # A cell address#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref LOG10lex_xl("LOG10()") # A log function#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function LOG10 #> 2 0 fun_open ( #> 3 0 fun_close )lex_xl("LOG:LOG") # The whole column 'LOG'#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref LOG:LOGlex_xl("LOG") # Not a cell address, must be a named range/formula#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 name LOGlex_xl("LOG()") # Another log function#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function LOG #> 2 0 fun_open ( #> 3 0 fun_close )lex_xl("A1.2!A1") # A sheet called 'A1.2'#> # A tibble: 2 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 sheet A1.2! #> 2 0 ref A1# Text is surrounded by double-quotes. lex_xl("\"Some text\"")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 text "\"Some text\""lex_xl("\"Some \"\"text\"\"\"") # Double-quotes within text are escaped by#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 text "\"Some \"\"text\"\"\""# Numbers are signed where it makes sense, and can be scientific lex_xl("1")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number 1lex_xl("1.2")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number 1.2lex_xl("-1")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number -1lex_xl("-1-1")#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number -1 #> 2 0 operator - #> 3 0 number 1lex_xl("-1+-1")#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number -1 #> 2 0 operator + #> 3 0 number -1lex_xl("MAX(-1-1)")#> # A tibble: 7 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function MAX #> 2 0 fun_open ( #> 3 1 operator - #> 4 1 number 1 #> 5 1 operator - #> 6 1 number 1 #> 7 0 fun_close )lex_xl("-1.2E-3")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number -1.2E-3# Booleans can be constants or functions, and names can look like booleans, # but lex_xl() should always make the right choice. lex_xl("TRUE")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 bool TRUElex_xl("TRUEISH")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 name TRUEISHlex_xl("TRUE!A1")#> # A tibble: 2 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 sheet TRUE! #> 2 0 ref A1lex_xl("TRUE()")#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function TRUE #> 2 0 fun_open ( #> 3 0 fun_close )# Errors are tagged 'error' lex_xl("#DIV/0!")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 error #DIV/0!lex_xl("#N/A")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 error #N/Alex_xl("#NAME?")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 error #NAME?lex_xl("#NULL!")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 error #NULL!lex_xl("#NUM!")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 error #NUM!lex_xl("#REF!")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 error #REF!lex_xl("#VALUE!")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 error #VALUE!# Operators with more than one character are treated as single tokens lex_xl("1<>2")#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number 1 #> 2 0 operator <> #> 3 0 number 2lex_xl("1<=2")#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number 1 #> 2 0 operator <= #> 3 0 number 2lex_xl("1<2")#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number 1 #> 2 0 operator < #> 3 0 number 2lex_xl("1=2")#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number 1 #> 2 0 operator = #> 3 0 number 2lex_xl("1&2")#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number 1 #> 2 0 operator & #> 3 0 number 2lex_xl("1 2")#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number 1 #> 2 0 operator #> 3 0 number 2lex_xl("(1,2)")#> # A tibble: 5 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 paren_open ( #> 2 1 number 1 #> 3 1 operator , #> 4 1 number 2 #> 5 0 paren_close )lex_xl("1%") # postfix operator#> # A tibble: 2 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number 1 #> 2 0 operator %# The union operator is a comma ',', which is the same symbol that is used # to separate function arguments or array columns. It is tagged 'operator' # only when it is inside parentheses that are not function parentheses or # array curly braces. The curly braces are tagged 'array_open' and # 'array_close'. lex_xl("A1,B2") # invalid formula, defaults to 'union' to avoid a crash#> # A tibble: 3 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 ref A1 #> 2 0 operator , #> 3 0 ref B2lex_xl("(A1,B2)")#> # A tibble: 5 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 paren_open ( #> 2 1 ref A1 #> 3 1 operator , #> 4 1 ref B2 #> 5 0 paren_close )lex_xl("MAX(A1,B2)")#> # A tibble: 6 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function MAX #> 2 0 fun_open ( #> 3 1 ref A1 #> 4 1 separator , #> 5 1 ref B2 #> 6 0 fun_close )lex_xl("SMALL((A1,B2),1)")#> # A tibble: 10 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function SMALL #> 2 0 fun_open ( #> 3 1 paren_open ( #> 4 2 ref A1 #> 5 2 operator , #> 6 2 ref B2 #> 7 1 paren_close ) #> 8 1 separator , #> 9 1 number 1 #> 10 0 fun_close )# Function arguments are separated by commas ',', which are tagged # 'separator'. lex_xl("MAX(1,2)")#> # A tibble: 6 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function MAX #> 2 0 fun_open ( #> 3 1 number 1 #> 4 1 separator , #> 5 1 number 2 #> 6 0 fun_close )# Nested functions are marked by an increase in the 'level'. The level # increases inside parentheses, rather than at the parentheses. Curly # braces, for arrays, have the same behaviour, as do subexpressions inside # ordinary parenthesis, tagged 'paren_open' and 'paren_close'. lex_xl("MAX(MIN(1,2),3)")#> # A tibble: 11 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function MAX #> 2 0 fun_open ( #> 3 1 function MIN #> 4 1 fun_open ( #> 5 2 number 1 #> 6 2 separator , #> 7 2 number 2 #> 8 1 fun_close ) #> 9 1 separator , #> 10 1 number 3 #> 11 0 fun_close )lex_xl("{1,2;3,4}")#> # A tibble: 9 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 open_array { #> 2 1 number 1 #> 3 1 separator , #> 4 1 number 2 #> 5 1 separator ; #> 6 1 number 3 #> 7 1 separator , #> 8 1 number 4 #> 9 0 close_array }lex_xl("1*(2+3)")#> # A tibble: 7 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 number 1 #> 2 0 operator * #> 3 0 paren_open ( #> 4 1 number 2 #> 5 1 operator + #> 6 1 number 3 #> 7 0 paren_close )# Arrays are marked by opening and closing curly braces, with comma ',' # between columns, and semicolons ';' between rows Commas and semicolons are # both tagged 'separator'. Arrays contain only constants, which are # booleans, numbers, text, and errors. lex_xl("MAX({1,2;3,4})")#> # A tibble: 12 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function MAX #> 2 0 fun_open ( #> 3 1 open_array { #> 4 2 number 1 #> 5 2 separator , #> 6 2 number 2 #> 7 2 separator ; #> 8 2 number 3 #> 9 2 separator , #> 10 2 number 4 #> 11 1 close_array } #> 12 0 fun_close )lex_xl("=MAX({-1E-2,TRUE;#N/A,\"Hello, World!\"})")#> # A tibble: 13 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 operator = #> 2 0 function MAX #> 3 0 fun_open ( #> 4 1 open_array { #> 5 2 number -1E-2 #> 6 2 separator , #> 7 2 bool TRUE #> 8 2 separator ; #> 9 2 error #N/A #> 10 2 separator , #> 11 2 text "\"Hello, World!\"" #> 12 1 close_array } #> 13 0 fun_close )# Structured references are surrounded by square brackets. Subexpressions # may also be surrounded by square brackets, but lex_xl() returns the whole # expression in a single 'structured_ref' token. lex_xl("[@col2]")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 structured_ref [@col2]lex_xl("SUM([col22])")#> # A tibble: 4 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 function SUM #> 2 0 fun_open ( #> 3 1 structured_ref [col22] #> 4 0 fun_close )lex_xl("Table1[col1]")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 structured_ref Table1[col1]lex_xl("Table1[[col1]:[col2]]")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 structured_ref Table1[[col1]:[col2]]lex_xl("Table1[#Headers]")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 structured_ref Table1[#Headers]lex_xl("Table1[[#Headers],[col1]]")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 structured_ref Table1[[#Headers],[col1]]lex_xl("Table1[[#Headers],[col1]:[col2]]")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 structured_ref Table1[[#Headers],[col1]:[col2]]# DDE calls (Dynamic Data Exchange) are normalized by Excel into indexes. # Either way, lex_xl() includes everything in one token. lex_xl("[1]!'DDE_parameter=1'")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 DDE [1]!'DDE_parameter=1'lex_xl("'Quote'|'NYSE'!ZAXX")#> # A tibble: 1 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 DDE 'Quote'|'NYSE'!ZAXX# Meaningless spaces that appear in some old files are returned as 'space' # tokens, so that the original formula can still be recovered by # concatenating all the tokens. Spaces between function names and their open # parenthesis have not been observed, so are not permitted. lex_xl(" MAX( A1 ) ")#> # A tibble: 8 x 3 #> level type token #> <int> <chr> <chr> #> 1 0 operator #> 2 0 function MAX #> 3 0 fun_open ( #> 4 1 operator #> 5 1 ref A1 #> 6 1 operator #> 7 0 fun_close ) #> 8 0 operator